Keyboard Logs as Natural Annotations for Word Segmentation

Fumihiko Takahashi*
Yahoo Corporation
Midtown Tower, 9-7-1 Akasaka, Minato-ku, Tokyo, Japan
[email protected]

Shinsuke Mori
ACCMS, Kyoto University
Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
[email protected]

Abstract

In this paper we propose a framework to improve word segmentation accuracy using input method logs. An input method is software used to type sentences in languages which have far more characters than the number of keys on a keyboard. The main contributions of this paper are: 1) an input method server that proposes word candidates which are not included in the vocabulary, 2) a publicly usable input method that logs user behavior (like typing and selection of word candidates), and 3) a method for improving word segmentation by using these logs. We conducted word segmentation experiments on tweets from Twitter, and showed that our method improves accuracy in this domain. Our method itself is domain-independent and only needs logs from the target domain.

1 Introduction

The first step of almost all natural language processing (NLP) for languages with ambiguous word boundaries (such as Japanese and Chinese) is solving the problem of word identification ambiguity. This task is called word segmentation (WS), and the accuracy of state-of-the-art methods based on machine learning techniques is more than 98% for Japanese and 95% for Chinese (Neubig et al., 2011; Yang and Vozila, 2014). Compared to languages like English with clear word boundaries, this ambiguity poses an additional problem for NLP tasks in these languages. To make matters worse, the domains of the available training data often differ from domains where there is a high demand for NLP, which causes a severe degradation in WS performance. Examples include machine translation of patents, text mining of medical texts, and marketing on the micro-blog site Twitter[1]. Some papers have reported low accuracy on WS or the joint task of WS and part-of-speech (POS) tagging of Japanese or Chinese in these domains (Mori and Neubig, 2014; Kaji and Kitsuregawa, 2014; Liu et al., 2014).

To cope with this problem, we propose a way to collect information from people as they type Japanese or Chinese on computers. These languages use far more characters than the number of keys on a keyboard, so users use software called an input method (IM) to type text in these languages. Unlike written texts in these languages, which lack word boundary information, text entered with an IM can provide word boundary information that can be used by NLP systems. As we show in this paper, logs collected from IMs are a valuable source of word boundary information.

An IM consists of a client (front-end) and a server (back-end). The client receives a key sequence typed by the user and sends a phoneme sequence (kana in Japanese or pinyin in Chinese) or some predefined commands to the server. The server converts the phoneme sequence into normal written text as a word sequence, or proposes word candidates for the phoneme sequence in the region specified by the user. We noticed that the actions performed by people using the IM (such as typing and selecting word candidates) provide information about word boundaries, including context information.

In this paper, we first describe an IM for Japanese which allows us to collect this information. We then propose an automatic word segmenter that uses IM logs as a language resource to improve its performance. Finally, we report experimental results showing that our method increases the accuracy of a word segmenter on Twitter texts by using logs collected from a browser add-on version of our IM.

* This work was done when the first author was at Kyoto University.
[1] https://twitter.com/ (accessed in 2015 May).

The three main contributions of this paper are:

- an IM server that proposes word candidates which are not included in the vocabulary (Section 3),
- a publicly usable IM that logs user behavior (such as typing and selection of word candidates) (Section 4),
- a method for improving word segmentation by using these logs (Section 5).

To the best of our knowledge, this is the first paper proposing a method for using IM logs to successfully improve WS.

2 Related Work

The main focus of this paper is WS. Corpus-based, or empirical, methods were proposed in the early 90's (Nagata, 1994). Mori and Kurata (2005) then extended this line of work by lexicalizing the states, as many researchers did in that era, grouping the word-POS pairs into clusters inspired by the class-based n-gram model (Brown et al., 1992), and making the history length variable like a POS tagger in English (Ron et al., 1996). In parallel, there were attempts at solving Chinese WS in a similar way (Sproat and Chang, 1996). WS, or the joint task of WS and POS tagging, can be seen as a sequence labeling problem, so conditional random fields (CRFs) (Peng et al., 2004; Lafferty et al., 2001) have been applied to this task and have shown better performance than POS-based Markov models (Kudo et al., 2004). The training time of sequence-based methods tends to be long, especially when we use partially annotated data. Thus a simple method based on pointwise classification has been shown to be as accurate as sequence-based methods and fast enough to make active learning practically possible (Neubig et al., 2011). Since the pointwise method decides whether there is a word boundary between two characters without referring to the other word boundary decisions in the same sentence, it is straightforward to train the model from partially annotated sentences. We adopt this WS system for our experiments.

Along with the development of models, the NLP community has become increasingly aware of the importance of language resources (Neubig and Mori, 2010; Mori and Neubig, 2014). Mori and Oda (2009) proposed to incorporate dictionaries intended for humans into a WS system with a different word definition. CRFs were also extended to enable training from partially annotated sentences (Tsuboi et al., 2008). When using partially annotated sentences as WS training data, word boundary information exists only between some character pairs and is absent for others. This extension was adopted in Chinese WS to make use of so-called natural annotations (Yang and Vozila, 2014; Jiang et al., 2013). In that work, tags in hyper-texts were regarded as annotations and used to improve WS performance. The IM logs used in this paper are also classified as natural annotations, but contain much more noise. In addition, we need an IM that is specifically designed to collect logs as natural annotations.

Server design is the most important factor in capturing information from IM logs. The most popular IM servers are based on statistical language modeling (Mori et al., 1999; Chen and Lee, 2000; Maeta and Mori, 2012). Their parameters are trained from manually segmented sentences whose words are annotated with phoneme sequences, and from sentences automatically annotated with NLP tools, which are themselves based on machine learning models trained on the annotated sentences. Thus normal IM servers are not capable of presenting out-of-vocabulary (OOV) words (which provide large amounts of information on word boundaries) as conversion candidates. To make our IM server capable of presenting OOV words, we extend a statistical IM server based on (Mori et al., 2006), and ensure that it is computationally efficient enough for practical use by the public.

The target domain in our experiments is Twitter, a site where users post short messages called tweets. Since tweets are an immediate and powerful reflection of public attitudes and social trends, there have been numerous attempts at extracting information from them. Examples include information analysis of disasters (Sakai et al., 2010), estimation of depressive tendencies (Tsugawa et al., 2013), speech diarization (Higashinaka et al., 2011), and many others. These works require preprocessing of tweets with NLP tools, and WS is the first step. So it is clear that there is strong demand for improving WS accuracy. Another reason why we have chosen Twitter as the test domain is that the tweets typed using our server are open and we can avoid privacy problems. Our method does not utilize any other characteristics of tweets, so it also works in other domains such as blogs.

3 Input Method Suggesting OOV Words

In this section we propose a practical statistical IM server that suggests OOV word candidates in addition to words in its vocabulary.

3.1 Statistical Input Method

An input method (IM) is software which converts a phoneme sequence into a word sequence. This is useful for languages which contain far more characters than keys on a keyboard. Since there are some ambiguities in conversion, a conversion engine based on a word n-gram model has been proposed (Chen and Lee, 2000). Today, almost all IM engines are based on statistical methods.

For the LM unit, instead of words we propose to adopt word-pronunciation pairs u = <y, w>. Thus, given a phoneme sequence y_1^l = y_1 y_2 ... y_l as the input, the goal of our IM engine is to output the word sequence \hat{w}_1^m that maximizes the probability P(w, y_1^l):

    \hat{w}_1^m = \operatorname{argmax}_w P(w, y_1^l),

    P(w, y_1^l) = \prod_{i=1}^{m+1} P(u_i \mid u_{i-n+1}^{i-1}),

where the concatenation of the pronunciations y_i of u_1, u_2, ..., u_m is equal to the input y_1^l. In addition, the u_j (j <= 0) are special symbols introduced to simplify the notation, and u_{m+1} is a special symbol indicating a sentence boundary.

As in existing statistical IM engines, the parameters are estimated from a corpus whose sentences are segmented into words annotated with their pronunciations:

    P(u_i \mid u_{i-n+1}^{i-1}) = \frac{F(u_{i-n+1}^{i})}{F(u_{i-n+1}^{i-1})},    (1)

where F(.) denotes the frequency of a pair sequence in the corpus. In contrast to IM engines based on a word n-gram model, ours does not require an additional model describing the relationships between words and pronunciations, and thus it is much simpler and more practical.

Existing statistical IM engines only need an accurate automatic word segmenter to estimate the parameters of the word n-gram model. As the equation above shows, our pair-based engine also needs an accurate way of automatically estimating pronunciations (phoneme sequences). However, an automatic pronunciation estimator (Mori and Neubig, 2011) that is as accurate as state-of-the-art word segmenters has recently been proposed. As we explain in Section 6, in our experiments both our IM engine and existing ones delivered an accuracy of 91%.
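To make the pair-based model concrete, here is a minimal sketch of the conversion step. This is our own illustration under simplifying assumptions (a unigram version of Equation (1), no smoothing, no OOV handling), not the authors' implementation, and all function and variable names are ours.

```python
from collections import defaultdict

# Toy pair-based conversion engine. A unit u = (pronunciation, word); for
# brevity we use a unigram version of Eq. (1): P(u) = F(u) / F(.).
pair_freq = defaultdict(int)       # F(u)
units_by_pron = defaultdict(set)   # pronunciation -> candidate units

def train(segmented_corpus):
    """segmented_corpus: sentences given as lists of (pronunciation, word)."""
    for sentence in segmented_corpus:
        for u in sentence:
            pair_freq[u] += 1
            units_by_pron[u[0]].add(u)

def convert(phonemes, max_len=8):
    """Viterbi search for the word sequence maximizing the product of P(u).
    Assumes train() has been called on a non-empty corpus."""
    total = float(sum(pair_freq.values()))
    best = [0.0] * (len(phonemes) + 1)
    best[0] = 1.0
    back = [None] * (len(phonemes) + 1)
    for i in range(1, len(phonemes) + 1):
        for j in range(max(0, i - max_len), i):
            for u in units_by_pron.get(phonemes[j:i], ()):
                p = best[j] * pair_freq[u] / total
                if p > best[i]:
                    best[i], back[i] = p, (j, u)
    words, i = [], len(phonemes)
    while i > 0 and back[i] is not None:   # trace back the best path
        j, (y, w) = back[i]
        words.append(w)
        i = j
    return words[::-1]
```

A production server would additionally use the pSTC-derived counts described below, so that substrings never seen as dictionary words can still surface as conversion candidates.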

3.2 Enumerating Substrings as Candidate Words

Essentially, the IM engine which we have explained above does not have the ability to enumerate words which are unknown to the word segmenter and the pronunciation estimator used to build the training data. The aim of our research is to gather language information from user behavior as users type with an IM. So we extend the basic IM engine to enumerate all the substrings in a corpus with all possible pronunciations. For that purpose, we adopt the notion of a stochastically segmented corpus (SSC) (Mori and Takuma, 2004) and extend it with pronunciation annotations for words.

3.2.1 Stochastically Segmented Corpora

An SSC is defined as a combination of a raw corpus C_r (hereafter referred to as the character sequence x_1^{n_r}) and word boundary probabilities P_i, where P_i is the probability that a word boundary exists between the two characters x_i and x_{i+1}. These probabilities are estimated by a model based on logistic regression (LR) (Fan et al., 2008) trained on a manually segmented corpus, referring to the same features as those used in (Neubig et al., 2011). Since there are word boundaries before the first character and after the last character of the corpus, P_0 = P_{n_r} = 1. Word n-gram frequencies on an SSC are then calculated as follows.

Word 0-gram frequency: This is defined as the expected number of words in the SSC:

    f(\cdot) = 1 + \sum_{i=1}^{n_r - 1} P_i.

[Figure 1: Word n-gram frequency in a stochastically segmented corpus. The figure shows one occurrence of a word sequence w_1 w_2 ... w_n spanning the characters x_{b_1} ... x_{e_n}, together with the product of boundary and non-boundary probabilities assigned to it.]

Word n-gram frequency (n >= 1): Consider the situation in which a word sequence w_1^n occurs in the SSC as a subsequence beginning at the (i+1)-th character and ending at the k-th character, and each word w_m in the word sequence is equal to the character sequence beginning at the b_m-th character and ending at the e_m-th character (x_{b_m}^{e_m} = w_m, \forall 1 <= m <= n; e_m + 1 = b_{m+1}, \forall 1 <= m <= n-1; b_1 = i+1; e_n = k); see Figure 1 for an example. The word n-gram frequency f_r(w_1^n) in the SSC is defined as the summation of the stochastic frequency at each occurrence of the character sequence of the word sequence w_1^n over all of its occurrences in the SSC:

    f_r(w_1^n) = \sum_{(i, e_1^n) \in O_n} P_i \prod_{m=1}^{n} \left[ \left( \prod_{j=b_m}^{e_m - 1} (1 - P_j) \right) P_{e_m} \right],

where e_1^n = (e_1, e_2, ..., e_n) and O_n = \{ (i, e_1^n) \mid x_{b_m}^{e_m} = w_m, 1 <= m <= n \}. We calculate word n-gram probabilities by dividing word n-gram frequencies by word (n-1)-gram frequencies. For a detailed explanation and a mathematical proof of this method, please refer to (Mori and Takuma, 2004).
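As a small worked illustration of these definitions (our code, not the authors'), the expected frequencies can be computed directly from the boundary probabilities. Here P[k] is assumed to be the probability of a boundary immediately before the k-th character (0-indexed), with P[0] = P[n] = 1 as above.

```python
def zero_gram_freq(P):
    # f(.) = 1 + sum_{i=1}^{n_r - 1} P_i: the expected number of words.
    return 1.0 + sum(P[1:-1])

def ngram_freq(chars, P, words):
    # f_r(w_1^n): for every occurrence of the concatenated string, multiply
    # the boundary probability before it, (1 - P_j) inside each word, and
    # P_{e_m} after each word, then sum over all occurrences.
    s = "".join(words)
    total = 0.0
    for i in range(len(chars) - len(s) + 1):
        if chars[i:i + len(s)] != s:
            continue
        prob, pos = P[i], i
        for w in words:
            for j in range(pos + 1, pos + len(w)):
                prob *= 1.0 - P[j]      # no boundary inside the word
            pos += len(w)
            prob *= P[pos]              # boundary after the word
        total += prob
    return total

# Word 1-gram probability of a word w:
#   ngram_freq(chars, P, [w]) / zero_gram_freq(P)
```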
3.2.2 Pseudo-Stochastically Segmented Corpora

The computational costs (in terms of both time and space) of calculating an n-gram model from an SSC are very high[2], so it is not a practical technique for implementing an IM engine. In order to reduce the computational costs, we approximate an SSC with a deterministically segmented corpus, which is called a pseudo-stochastically segmented corpus (pSSC) (Kameko et al., 2015). The following is the method for producing a pSSC from an SSC.

For i = 1 to n_r - 1:
1. output the character x_i,
2. generate a random number 0 <= p < 1,
3. output a word boundary if p < P_i, and output nothing otherwise.

Now we have a corpus in the same format as a standard segmented corpus with variable (non-constant) segmentation.

[2] This is because an SSC has many words and word fragments. Additionally, word n-gram frequencies must be calculated using floating point numbers instead of integers.

3.2.3 Pseudo-Stochastically Tagged Corpora

We can annotate a word with all its possible pronunciations and their probabilities, as is done in an SSC. We call a corpus containing sequences of words (w_1 w_2 ... w_i ...) annotated with sequences of pronunciation-probability pairs (<y_{i,1}, p_{i,1}>, <y_{i,2}, p_{i,2}>, ..., where \sum_j p_{i,j} = 1 for all i) a stochastically tagged corpus (STC)[3]. We can estimate these probabilities using an LR model built from sentences annotated with pronunciations (Mori and Neubig, 2011). Similarly to the pSSC, we then produce a pseudo-stochastically tagged corpus (pSTC) from an STC as follows.

[3] Because the existence or non-existence of a word boundary can also be expressed as a tag, a stochastically tagged corpus includes stochastic segmentation.

For each w_i in the sentence:
1. generate a random number 0 <= p < 1,
2. annotate w_i with its j-th phoneme sequence y_{i,j}, where \sum_{j'=1}^{j-1} p_{i,j'} <= p < \sum_{j'=1}^{j} p_{i,j'}.

Now we have a corpus in the same format as a standard corpus annotated with variable pronunciations.

By estimating the parameters in Equation (1) from a pSTC derived from a pSSC, our IM engine can suggest OOV word candidates with various possible segmentations and pronunciations without incurring high computational costs.
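The two sampling procedures above are straightforward to implement. The following is a sketch of both (our code; the variable names and data layout are assumptions, e.g. pronunciation distributions represented as dictionaries mapping a pronunciation to its probability).

```python
import random

def make_pssc(chars, P):
    """pSSC: sample one hard segmentation from boundary probabilities P,
    where P[i] is the boundary probability between chars[i-1] and chars[i]."""
    out = []
    for i, c in enumerate(chars):
        out.append(c)
        if i < len(chars) - 1 and random.random() < P[i + 1]:
            out.append(" ")                    # sampled word boundary
    return "".join(out)

def make_pstc(words, pron_dists):
    """pSTC: annotate each word with one pronunciation sampled from its
    distribution {pronunciation: probability}; each distribution is assumed
    to be non-empty and to sum to 1."""
    tagged = []
    for w, dist in zip(words, pron_dists):
        p, acc = random.random(), 0.0
        for y, p_y in dist.items():
            acc += p_y
            if p < acc:
                break
        tagged.append((w, y))                  # falls back to the last y
    return tagged
```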

[Figure 2: Input method for collecting logs.]

3.2.4 Suggestion of OOV Words

Here we give an intuitive explanation of why our IM engine can suggest OOV words for a given phoneme sequence. Let us take an OOV word as an example: "横アリ/yo-ko-a-ri," an abbreviation of "横浜アリーナ" (Yokohama arena). A WS system tends to segment it into "横" (side) and "アリ" (ant) because they are frequent nouns. In a pSSC, however, some occurrences of the string "横アリ" remain concatenated as the correct word. For pronunciation, the first character has two possible pronunciations, "yo-ko" and "o-u." So deterministic pronunciation estimation of this word has the risk of outputting the erroneous result "o-u-a-ri." This would prevent our engine from presenting "横アリ" as a conversion candidate for the input "yo-ko-a-ri." The pSTC, however, contains both possible pronunciations for this word and allows our engine to present the OOV word "横アリ" for the input "yo-ko-a-ri." Thus when a user of our IM engine types "yo-ko-a-ri-ni-i-ku" and selects "横アリ に (to) 行く (go)," the engine can learn the OOV word "横アリ/yo-ko-a-ri" with the context "に/ni 行く/i-ku".

4 Input Method Logs

In this section we first propose an IM which allows us to collect user logs. We then examine the characteristics of these logs and some difficulties in using them as language resources.

4.1 Collecting Logs from an Input Method

As Figure 2 shows, the client of our IM, running on the user's PC, is used to input characters and to modify conversion results. The server logs both the input from the client and the results of conversions performed in response to requests from the client.

Our IM has two phases: phoneme typing and conversion result editing. In each phase, the client sends the typed keys to the server with a timestamp and its IP address.

Phoneme typing: First the user inputs ASCII characters for a phoneme sequence. If the phoneme sequence itself is what the user wants to write, the user may not go to the next phase. The server records the keys typed to enter the phoneme sequence, cursor movements, and the phoneme sequence if the user selects it as-is.

Conversion result editing: Then the user presses the space key to make the IM engine convert the phoneme sequence to the most likely word sequence based on Equation (1). Sometimes the user changes some word boundaries, makes the IM engine enumerate candidate words covering the region, and selects the intended one from the list of candidates. The server records the space key and the final word sequence.
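The exact log format is not specified beyond the fields mentioned above; as an assumption-laden sketch, one record per fragment might look like the following (all field names are ours, inferred from this description and from Table 1).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LogEntry:
    # Hypothetical record for one fragment; the paper only states that keys,
    # timestamps, the IP address, the phoneme sequence, and the final word
    # sequence are recorded.
    timestamp: float                  # client timestamp in seconds
    ip: str                           # groups fragments from one user
    keys: List[str] = field(default_factory=list)   # raw keystrokes
    phonemes: str = ""                # kana sequence sent to the server
    # None if the user selected the phoneme sequence as-is; otherwise the
    # final (word, pronunciation) sequence after conversion result editing.
    edit_result: Optional[List[Tuple[str, str]]] = None
    num_conversions: int = 0          # times the space key triggered conversion
```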

Table 1: Input method logs of a tweet "横アリに比べると安めかと" (It is cheap compared with Yokohama arena).

Timestamp   | Phoneme sequence        | Edit result              | Note
18:37:11.21 | よこありに/yo-ko-a-ri-ni  | 横アリ/yo-ko-a-ri に/ni    | (with Yokohama arena)
18:37:12.60 | くらっべる/ku-ra-b-be-ru  | くらっ/ku-ra-b ベル/be-ru  | Mistyping
18:37:14.94 | くらべる/ku-ra-be-ru     | 比べ/ku-ra-be る/ru       | Revised input (compare)
18:37:15.32 | と/to                   | N/A                      | (inflectional ending)
18:37:19.82 | ものの/mo-no-no          | N/A                      | Discarded in the tweet
18:37:22.42 | やすめかと/ya-su-me-ka-to | 安め/ya-su-me か/ka と/to  | (cheap)

4.2 Characteristics of Input Method Logs

Table 1 shows an example of interesting log messages from the same IP address[4]. In many cases, users type sentence fragments rather than a complete sentence. So in the example there are six fragments within the short period indicated by the timestamps. If the user selects the phoneme sequence as-is without going to the conversion result editing phase, we can expect that there are word boundaries on both sides of the phoneme sequence. Inside the phoneme sequence, however, there is no information. If the user goes to the conversion result editing phase, we can expect that the final word sequence has correct word boundary information.

[4] In reality, logs from different IPs are stored in the order that they were received.

There are two main problems that make it difficult to use IM logs directly as a training corpus for word segmentation. The first problem is fragmentation. IM users send the phoneme sequences of sentence fragments to the engine to avoid editing long conversion results that require many cursor movements. Thus the phoneme sequence and the final word sequence tend to be sentence fragments (as we noted above), and as a result they lose context information. The second problem is noise. Word boundary information is unreliable even when it is present, because of mistakenly selected conversions or words entered separately. From these observations, the IM logs are treated as partially segmented sentence fragments that include noise.

5 Word Segmentation Using Input Method Logs

In this section we first explain various ways to generate language resources for a word segmenter from IM logs. We then describe an automatic word segmenter which utilizes these resources. In the examples below we use the three-valued notation (Mori and Oda, 2009) to denote partial segmentation as follows:

|       : there is a word boundary,
-       : there is not a word boundary,
(space) : there is no information.

5.1 Input Method Logs as Language Resources

The phoneme sequences and the edit results in the final selection are themselves considered to be partially segmented sentences. We call the corpus generated directly from the logs "Log-as-is." The examples in Table 1 are converted as follows.

Example of Log-as-is (12 annotations)
横-ア-リ|に       と
く-ら-っ|ベ-ル     も の の
比-べ|る         安-め|か|と

Here the number of annotations is the sum of the "-" and "|" marks. In this example, one log entry corresponds to one entry of the training data for the word segmenter. As one can easily imagine, Log-as-is may contain mistaken results (noise) and short entries (fragmentation). Both are harmful for a word segmenter.

To cope with the fragmentation problem, we propose to connect some logs based on their timestamps. If the difference between the timestamps of two sequential logs is short, both logs are probably from the same sentence. So we connect two sequential logs if the time difference between the last key of the first log and the first key of the second log is smaller than a certain threshold s. In the experiment we set s = 500 [ms] based on observations of our own behavior[5]. This method is referred to as "Log-chunk." Using this method, we obtain the following from the examples in Table 1.

Example of Log-chunk (15 annotations)
横-ア-リ|に|く-ら-っ|ベ-ル
比-べ|る|と       も の の
安-め|か|と

We see that Log-chunk contains more context information than Log-as-is.

To prevent the noise problem, we propose to filter out logs with a small number of conversions. We expect that an edited sentence will have many OOV words and not much noise. Therefore we use logs which were converted more than n_c times; in the experiment we set n_c = 2 based on observations of our own behavior[6]. This method is referred to as "Log-mconv."

[5] The results were stable for s in preliminary experiments.
[6] The results were stable for n_c in the preliminary experiments.
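A minimal sketch of the chunking and filtering just described, reusing the hypothetical LogEntry record from Section 4 (again our code, and with a single timestamp per fragment for simplicity, whereas the paper measures the gap between the last key of one log and the first key of the next):

```python
S_MS = 500   # chunking threshold s from the experiments
N_C = 2      # conversion-count threshold n_c

def to_partial(entry):
    # One fragment in the three-valued notation: "-" inside a word, "|"
    # between words; a phoneme sequence selected as-is carries no internal
    # information (boundaries are implied only at its two ends).
    if entry.edit_result is None:
        return entry.phonemes
    return "|".join("-".join(word) for word, _ in entry.edit_result)

def log_chunk(entries):
    # Log-chunk: concatenate consecutive fragments whose time gap is below s.
    chunks, current, last_t = [], [], None
    for e in sorted(entries, key=lambda e: e.timestamp):
        if current and (e.timestamp - last_t) * 1000.0 >= S_MS:
            chunks.append(current)
            current = []
        current.append(e)
        last_t = e.timestamp
    if current:
        chunks.append(current)
    return chunks
    # A chunk's training entry: "|".join(to_partial(e) for e in chunk)

def log_chunk_mconv(entries):
    # Log-chunk-mconv (introduced below): keep only chunks converted more
    # than n_c times -- our reading of the filter applied after chunking.
    return [c for c in log_chunk(entries)
            if sum(e.num_conversions for e in c) > N_C]
```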

Using the Log-mconv method, the examples in Table 1 become the following.

Example of Log-mconv (3 annotations)
横-ア-リ|に

As this example shows, Log-mconv contains short entries (fragmentation) like Log-as-is. However, we expect that the annotated tweets do not include mistaken boundaries or conversions that were discarded.

Obviously we can combine Log-chunk and Log-mconv to avoid both the fragmentation and the noise problems. This combination is referred to as "Log-chunk-mconv."

5.2 Training a Word Segmenter on Logs

The IM logs give us partially segmented sentence fragments, so we need a word segmenter capable of learning from them. We can use a word segmenter based on a sequence classifier (Tsuboi et al., 2008; Yang and Vozila, 2014; Jiang et al., 2013) or one based on a pointwise classifier (Neubig et al., 2011). Although both types are viable, we adopt the latter in the experiments because it requires much less training time while delivering comparable accuracy.

Here is a brief explanation of the word segmenter based on the pointwise method; for more detail the reader may refer to (Neubig et al., 2011). The input is an unsegmented character sequence x = x_1 x_2 ... x_k. The word segmenter decides whether there is a word boundary (t_i = 1) or not (t_i = 0) between each pair of adjacent characters by using support vector machines (SVMs) (Fan et al., 2008)[7]. The features are character n-grams and character type n-grams (n = 1, 2, 3) around the decision point in a window with a width of 6 characters. Additional features are triggered if character n-grams in the window match character sequences in the dictionary. This approach is called pointwise because each word boundary decision is made without referring to the decisions at the other points j != i. As this explanation shows, we can also use partially segmented sentences from IM logs for training in the standard way.

[7] The reason why we use an SVM for word segmentation is that its accuracy is generally higher than that of LR, and it was so in the experiments of this paper: the F-measure of LR on TWI-test was 91.30 (Recall = 89.50, Precision = 93.17), which is lower than that of the SVM (see Table 4). To make an SSC, however, we use an LR model because we need word boundary probabilities.
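To illustrate the feature set, here is a sketch of the boundary-feature extraction in the spirit of (Neubig et al., 2011); the padding symbol, feature-string format, and character-type classes are our assumptions. The resulting features would be fed to a linear SVM such as LIBLINEAR, with labels t_i.

```python
W = 3   # window half-width: 6 characters around each decision point

def char_type(c):
    # Coarse character classes for the type n-grams (the classes are ours).
    if c.isdigit():
        return "D"
    if "\u3041" <= c <= "\u3096":
        return "H"      # hiragana
    if "\u30a1" <= c <= "\u30fa":
        return "K"      # katakana
    if "\u4e00" <= c <= "\u9fff":
        return "C"      # kanji
    if c.isalpha():
        return "R"      # roman
    return "O"          # other

def boundary_features(x, i, dictionary):
    """Features for the potential boundary between x[i-1] and x[i]."""
    padded = "#" * W + x + "#" * W
    window = padded[i:i + 2 * W]          # 6 characters around the point
    types = "".join(char_type(c) for c in window)
    feats = []
    for n in (1, 2, 3):                   # character and type n-grams
        for j in range(len(window) - n + 1):
            feats.append("c:%d:%d:%s" % (n, j, window[j:j + n]))
            feats.append("t:%d:%d:%s" % (n, j, types[j:j + n]))
    for n in range(1, 2 * W + 1):         # dictionary-match features
        for j in range(len(window) - n + 1):
            if window[j:j + n] in dictionary:
                feats.append("d:%d:%d" % (n, j))
    return feats
```

Because each decision point is classified independently, a partially segmented fragment simply contributes one training instance per annotated point and skips the points with no information.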
6 Evaluation

To evaluate our methods, we measured the accuracy of WS without using logs (the baseline) and with logs converted by the several methods above. There are two test corpora: one is from the general domain corpus from which we built the baseline WS, and the other is from the domain the IM logs were collected in, Twitter.

6.1 Corpora

The annotated corpus we used to build the baseline word segmenter is the manually annotated part (core data) of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008), plus newspaper articles and daily conversation sentences. We also used a 234,652-word dictionary (UniDic) provided with the BCCWJ. A small portion of the BCCWJ core data is reserved for testing. In addition, we manually segmented sentences randomly obtained from Twitter[8] during the same period as the log collection, for use as the test corpus. Table 2 shows the details of these corpora.

Table 2: Corpus specifications.

              | #sent. | #words    | #char.
Training      |        |           |
 BCCWJ        | 56,753 | 1,324,951 | 1,911,660
 Newspaper    |  8,164 |   240,097 |   361,843
 Conversation | 11,700 |   147,809 |   197,941
Test          |        |           |
 BCCWJ-test   |  6,025 |   148,929 |   212,261
 TWI-test     |  2,976 |    37,010 |    58,316

Table 3: Language resources derived from logs.

                | #sentence fragments | #annotations
Log-as-is       | 32,119              | 39,708
Log-chunk       |  8,685              | 63,144
Log-mconv       |  4,610              | 10,852
Log-chunk-mconv |  1,218              | 14,242

[8] We extracted body text from 1,592 tweets, excluding mentions, hash tags, URLs, and ticker symbols. We then divided the body text into sentences by separating on newline characters, resulting in 2,976 sentences.

Table 4: WS accuracy on the tweets.

                  | Recall [%] | Precision [%] | F-measure
Baseline          | 90.31      | 94.05         | 92.14
+ Log-as-is       | 90.33      | 93.77         | 92.02
+ Log-chunk       | 91.04      | 94.29         | 92.64
+ Log-mconv       | 90.62      | 94.09         | 92.32
+ Log-chunk-mconv | 91.40      | 94.45         | 92.90

Table 5: WS accuracy on BCCWJ.

                  | Recall [%] | Precision [%] | F-measure
Baseline          | 99.01      | 98.97         | 98.99
+ Log-as-is       | 99.02      | 98.87         | 98.94
+ Log-chunk       | 99.05      | 98.88         | 98.96
+ Log-mconv       | 98.98      | 98.91         | 98.95
+ Log-chunk-mconv | 98.98      | 98.92         | 98.95

6.2 Models using Input Method Logs

To make the training data for our IM server, we first chose randomly selected tweets (786,331 sentences) in addition to the unannotated part of the BCCWJ (358,078 sentences). We then trained the LR models which estimate word boundary probabilities and pronunciation probabilities for words (and word candidates) from the training data shown in Table 2 and UniDic. We made a pSTC for our IM engine from 1,207,182 sentences randomly obtained from Twitter by following the procedure explained in Subsection 3.2.3[9].

We launched our IM as a browser add-on for Twitter and collected 19,770 IM logs from 7 users between April 24 and December 31, 2014. Following the procedures in Section 5.1, we obtained the language resources shown in Table 3. We combined them with the training corpus and dictionaries to build four WSs, which we compared with the baseline.

[9] There is no overlap with the test data.

6.3 Results and Discussion

Following the standard in WS experiments, the evaluation criteria are recall, precision, and F-measure (their harmonic mean). Recall is the number of correctly segmented words divided by the number of words in the test corpus. Precision is the number of correctly segmented words divided by the number of words in the system output.

Tables 4 and 5 show WS accuracy on TWI-test and BCCWJ-test, respectively. The difference in accuracy of the baseline method between BCCWJ-test and TWI-test shows that WS of tweets is very difficult. The fact that the precision on TWI-test is much higher than the recall indicates that the baseline model suffers from over-segmentation. This over-segmentation problem is mainly caused by OOV words being divided into known words. For example, "横アリ" (Yokohama arena) is divided into the two known words "横" (side) and "アリ" (ant).

When we compare the F-measures on TWI-test, all the models referring to the IM logs outperform the baseline model trained only from the BCCWJ. The highest is the Log-chunk-mconv model, and its improvement over the baseline is statistically significant (significance level: 1%). In addition, the accuracies of the five methods on the BCCWJ (Table 5) are almost the same, and there is no statistically significant difference (significance level: 1%) between any two of them.

We analyzed the words misrecognized by the WSs, which we call error words. Table 6 shows the number of error words, the number of OOV words among them, and the ratio of OOV words to error words. Here the vocabulary is the set of the words appearing in the training data or in UniDic (see Table 2). Although the result of the WS trained on Log-as-is contains more error words than the baseline, its OOV ratio is lower than the baseline's. This means that the IM logs have the potential to reduce errors caused by OOV words.
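For reference, the word-level criteria defined at the beginning of this subsection can be computed by matching character spans; a small self-contained example (our sketch, not the authors' evaluation script):

```python
def word_spans(words):
    # Convert a word sequence to a set of (start, end) character spans.
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(ref_words, sys_words):
    # A word counts as correct only if both of its boundaries match the
    # reference; F-measure is the harmonic mean of recall and precision.
    ref, sys = word_spans(ref_words), word_spans(sys_words)
    correct = len(ref & sys)
    r = correct / len(ref)
    p = correct / len(sys)
    return r, p, 2 * p * r / (p + r)

# Example: reference "横アリ|に" vs. over-segmented output "横|アリ|に"
print(prf(["横アリ", "に"], ["横", "アリ", "に"]))  # (0.5, 0.333..., 0.4)
```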

Table 6: Ratio of OOV words in error words.

                  | #Error words | #OOV words (ratio [%])
Baseline          | 446          | 103 (23.09)
+ Log-as-is       | 467          |  89 (19.06)
+ Log-chunk       | 428          |  81 (18.93)
+ Log-mconv       | 443          |  88 (19.86)
+ Log-chunk-mconv | 413          |  74 (17.79)

Table 6 also indicates that the best method, Log-chunk-mconv, had the greatest success in reducing errors caused by OOV words. However, the majority of error words are in-vocabulary words. It can be said that our log chunking methods (Log-chunk and Log-chunk-mconv) enabled the WSs to eliminate many known-word errors by using context information.

To investigate the impact of the log size, we measured WS accuracy on TWI-test while varying the log size during training. Figure 3 shows the results. Table 4 shows that Log-chunk-mconv and Log-chunk increase the accuracy nicely. The graph, however, clarifies that Log-chunk-mconv achieves high accuracy with less training data converted from logs. In other words, the Log-chunk-mconv method is good at distilling the informative parts and filtering out the noisy parts. These characteristics are very important properties to have as we consider deploying our IM to a wider audience. An IM is needed to type Japanese, and the number of Japanese speakers is more than 100 million. If we can use the input logs of even 1% of them for the same or a longer period[10], the idea we propose in this paper can improve WS accuracy on various domains efficiently and automatically.

As a final remark, this paper describes a successful example of how to build a useful tool for the NLP community. This process has three steps: 1) design a useful NLP application that can collect user logs, 2) deploy it for public use, and 3) devise a method for mining data from the logs.

[Figure 3: Relationship between WS accuracy on the tweets and log size. The graph plots F-measure on TWI-test against the number of characters in the log-derived training data, with curves for Log-chunk and Log-chunk-mconv.]

[10] The number of users of our system in this paper is 7, over 8 months.

7 Conclusion

This paper described the design of a publicly usable IM which collects natural annotations for use as training data for another system. Specifically, we (1) described how to construct an IM server that suggests OOV word candidates, (2) designed a publicly usable IM that collects logs of user behavior, and (3) proposed a method for using this data to improve word segmenters. Tweets from Twitter are a promising source of data with great potential for NLP, which is one reason why we used them as the target domain for our experiments. The experimental results showed that our methods improve accuracy in this domain. Our method itself is domain-independent and only needs logs from the target domain, so it is worth testing on other domains and with much longer periods of data collection.

Acknowledgments

This work was supported by JSPS Grants-in-Aid for Scientific Research Grant Number 26280084 and the Microsoft CORE project. We thank Dr. Hisami Suzuki, Dr. Koichiro Yoshino, and Mr. Daniel Flannery for their valuable comments and suggestions on the manuscript. We are also grateful to the anonymous users of our input method.

References

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Zheng Chen and Kai-Fu Lee. 2000. A new statistical approach to Chinese pinyin input. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 241–247.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Ryuichiro Higashinaka, Noriaki Kawamae, Kugatsu Sadamitsu, Yasuhiro Minami, Toyomi Meguro, Kohji Dohsaka, and Hirohito Inagaki. 2011. Building a conversational model from two-tweets. IEEE Transactions on ASRU, pages 330–335.

Wenbin Jiang, Meng Sun, Yajuan Lu, Yating Yang, and Qun Liu. 2013. Discriminative learning with natural annotations: Word segmentation as a case study. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 761–769.

Nobuhiro Kaji and Masaru Kitsuregawa. 2014. Accurate word segmentation and POS tagging for Japanese microblogs: Corpus annotation and joint modeling with lexical normalization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 99–109.

Hirotaka Kameko, Shinsuke Mori, and Yoshimasa Tsuruoka. 2015. Can symbol grounding improve low-level NLP? Word segmentation as a case study. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 230–237.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

Yijia Liu, Yue Zhang, Wangxiang Che, Ting Liu, and Fan Wu. 2014. Domain adaptation for CRF-based Chinese word segmentation using free annotations. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 864–874.

Kikuo Maekawa. 2008. Balanced corpus of contemporary written Japanese. In Proceedings of the 6th Workshop on Asian Language Resources, pages 101–102.

Hirokuni Maeta and Shinsuke Mori. 2012. Statistical input method based on a phrase class n-gram model. In Workshop on Advances in Text Input Methods.

Shinsuke Mori and Gakuto Kurata. 2005. Class-based variable memory length Markov model. In Proceedings of InterSpeech 2005, pages 13–16.

Shinsuke Mori and Graham Neubig. 2011. A pointwise approach to pronunciation estimation for a TTS front-end. In Proceedings of InterSpeech 2011, pages 2181–2184.

Shinsuke Mori and Graham Neubig. 2014. Language resource addition: Dictionary or corpus? In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 1631–1636.

Shinsuke Mori and Hiroki Oda. 2009. Automatic word segmentation using three types of dictionaries. In Proceedings of the Eighth International Conference of the Pacific Association for Computational Linguistics, pages 1–6.

Shinsuke Mori and Daisuke Takuma. 2004. Word n-gram probability estimation from a Japanese raw corpus. In Proceedings of the Eighth International Conference on Speech and Language Processing, pages 1037–1040.

Shinsuke Mori, Masatoshi Tsuchiya, Osamu Yamaji, and Makoto Nagao. 1999. Kana-kanji conversion by a stochastic model. Transactions of the Information Processing Society of Japan, 40(7):2946–2953. (in Japanese).

Shinsuke Mori, Daisuke Takuma, and Gakuto Kurata. 2006. Phoneme-to-text transcription system with an infinite vocabulary. In Proceedings of the 21st International Conference on Computational Linguistics, pages 729–736.

Masaaki Nagata. 1994. A stochastic Japanese morphological analyzer using a forward-DP backward-A* n-best search algorithm. In Proceedings of the 15th International Conference on Computational Linguistics, pages 201–207.

Graham Neubig and Shinsuke Mori. 2010. Word-based partial annotation for efficient corpus construction. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, pages 2723–2727.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 529–533.

Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics, pages 562–568.

Dana Ron, Yoram Singer, and Naftali Tishby. 1996. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25:117–149.

Takeshi Sakai, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 851–860.

Richard Sproat, Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404.

Yuta Tsuboi, Hisashi Kashima, Shinsuke Mori, Hiroki Oda, and Yuji Matsumoto. 2008. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 897–904.

Sho Tsugawa, Yukiko Mogi, Yusuke Kikuchi, Fumio Kishino, Kazuyuki Fujita, Yuichi Itoh, and Hiroyuki Ohsaki. 2013. On estimating depressive tendencies of Twitter users utilizing their tweet data. In VR'13, pages 1–4.

Fan Yang and Paul Vozila. 2014. Semi-supervised Chinese word segmentation using partial-label learning with conditional random fields. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 90–98.
