Substring Frequency Features for Segmentation of Japanese Katakana Words with Unlabeled Corpora
Yoshinari Fujinuma[*]
Computer Science, University of Colorado, Boulder, CO
[email protected]

Alvin C. Grissom II
Mathematics and Computer Science, Ursinus College, Collegeville, PA
[email protected]

Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 74-79, Taipei, Taiwan, November 27 - December 1, 2017. © 2017 AFNLP

[*] Part of the work was done while the first author was at Amazon Japan.

Abstract

Word segmentation is crucial in natural language processing tasks for unsegmented languages. In Japanese, many out-of-vocabulary words appear in the phonetic syllabary katakana, making segmentation more difficult due to the lack of clues found in mixed-script settings. In this paper, we propose a straightforward approach based on a variant of tf-idf and apply it to the problem of word segmentation in Japanese. Even though our method uses only an unlabeled corpus, experimental results show that it achieves performance comparable to existing methods that use manually labeled corpora. Furthermore, it improves the performance of simple word segmentation models trained on a manually labeled corpus.

1 Introduction

In languages without explicit segmentation, word segmentation is a crucial step of natural language processing tasks. In Japanese, this problem is less severe than in Chinese because of the existence of three different scripts: hiragana, katakana, and kanji, which are Chinese characters.[1] However, katakana words are known to degrade word segmentation performance because of out-of-vocabulary (OOV) words which do not appear in manually segmented corpora (Nakazawa et al., 2005; Kaji and Kitsuregawa, 2011). Creation of new words is common in Japanese; around 20% of the katakana words in newspaper articles are OOV words (Breen, 2009). For example, some katakana compound loanwords are not transliterated but rather "Japanized" (e.g., ガソリンスタンド gasorinsutando "gasoline stand", meaning "gas station" in English) or abbreviated (e.g., スマートフォンケース sumātofonkēsu "smartphone case", which is abbreviated as スマホケース sumahokēsu). Abbreviations may also undergo phonetic and corresponding orthographic changes, as in the case of スマートフォン sumātofon ("smartphone"), which, in the abbreviated term, shortens the long vowel ā to a and replaces フォ fo with ホ ho. This change is then propagated to compound words, such as スマホケース sumahokēsu ("smartphone case"). Word segmentation of compound words is important for improving results in downstream tasks, such as information retrieval (Braschler and Ripplinger, 2004; Alfonseca et al., 2008), machine translation (Koehn and Knight, 2003), and information extraction from microblogs (Bansal et al., 2015).

Hagiwara and Sekine (2013) incorporated an English corpus by projecting Japanese transliterations to words from an English corpus; however, loanwords that are not transliterated (such as sumaho for "smartphone") cannot be segmented by the use of an English corpus alone. We investigate a more efficient use of a Japanese corpus by incorporating a variant of the well-known tf-idf weighting scheme (Salton and Buckley, 1988), which we refer to as term frequency-inverse substring frequency, or tf-issf. The core idea of our approach is to assign scores based on the likelihood that a given katakana substring is a word token, using only statistics from an unlabeled corpus.[2]

[1] Hiragana and katakana are the two distinct character sets representing the same Japanese sounds. These two character sets are used for different purposes, with katakana typically used for transliterations of loanwords. Kanji are typically used for nouns.
[2] Our code is available at https://www.github.com/akkikiki/katakana_segmentation.

Our contributions are as follows:

1. We show that a word segmentation model using tf-issf alone outperforms a previously proposed frequency-based method and that it produces comparable results to various Japanese tokenizers.

2. We show that tf-issf improves the F1-score of word segmentation models trained on manually labeled data by more than 5%.

2 Proposed Method

In this section, we describe the katakana word segmentation problem and our approach to it.

2.1 Term Frequency-Inverse Substring Frequency

Let S be a sequence of katakana characters, Y be the set of all possible segmentations, y ∈ Y be a possible segmentation, and y_i be a substring of y. Then, for instance, y = y_1 y_2 ... y_n is a possible word segmentation of S with n segments.

We now propose a method to segment katakana OOV words. Our approach, term frequency-inverse substring frequency (tf-issf), is a variant of the tf-idf weighting method, which computes a score for each candidate segmentation. We calculate the score of a katakana string y_i with

    \text{tf-issf}(y_i) = \frac{\text{tf}(y_i)}{\text{sf}(y_i)},    (1)

where tf(y_i) is the number of occurrences of y_i as a katakana term in a corpus and sf(y_i) is the number of unique katakana terms that have y_i as a substring. We regard consecutive katakana characters as a single katakana term when computing tf-issf.

We compute the product of tf-issf scores over a string to score the segmentation,

    \text{Score}(y) = \prod_{y_i \in y} \text{tf-issf}(y_i),    (2)

and choose the optimal segmentation y* with

    y^* = \arg\max_{y \in Y} \text{Score}(y \mid S).    (3)

Intuitively, if a string appears frequently as a word substring, we treat it as a meaningless sequence.[3] While a substring of consecutive katakana characters can, in principle, be a meaningful character n-gram, this rarely occurs, and tf-issf successfully penalizes the scores of such sequences.

[3] A typical example is a single-character substring. However, it is possible for single-character substrings to be word tokens.

Figure 1 shows an example of a word lattice for the loan compound word "smartphone case", with the desired segmentation path in bold. When building a lattice, we only create a node for a substring that appears as a term in the unlabeled corpus and does not start with a small katakana letter[4] or a prolonged sound mark "ー", as such characters are rarely the first character in a word. Including terms or consecutive katakana characters from an unlabeled corpus reduces the number of OOV words.

[4] Such letters are ァ a, ィ i, ゥ u, ェ e, ォ o, ヵ ka, ヶ ke, ッ tsu, ャ ya, ュ yu, ョ yo, and ヮ wa.

[Figure 1: lattice diagram omitted. The lattice runs from BOS to EOS with nodes such as ス su, マホ maho, スマホ sumaho, ケー kē, ケース kēsu, and スマホケース sumahokēsu.]

Figure 1: An example lattice for katakana word segmentation. We use the Viterbi algorithm to find the optimal segmentation from the beginning of the string (BOS) to the end (EOS), shown by the bold edges and nodes. Only those katakana substrings which exist in the training corpus as words are considered. This example produces the correct segmentation, スマホ / ケース sumaho / kēsu ("smartphone case").
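Equations (1)-(3) are compact enough to sketch in code. The following is a minimal illustration, not the authors' released implementation: it assumes the corpus has already been reduced to a list of maximal katakana runs (one entry per occurrence), searches the lattice in log space (log is monotone, so the argmax of Eq. (3) is unchanged), and omits the small-katakana/prolonged-sound-mark node filter for brevity. All function names here are ours.

```python
import math
from collections import Counter

def build_tfissf(terms):
    # tf-issf (Eq. 1): tf(s) / sf(s) for every katakana substring s.
    # `terms` holds one entry per occurrence of a maximal katakana run
    # in the unlabeled corpus.
    tf = Counter(terms)
    sf = Counter()
    for term in set(tf):
        # each unique term counts once toward sf of each of its substrings
        subs = {term[i:j] for i in range(len(term))
                          for j in range(i + 1, len(term) + 1)}
        for s in subs:
            sf[s] += 1
    return {s: tf[s] / sf[s] for s in sf}

def segment(string, tfissf):
    # Eqs. 2-3: maximize the product of tf-issf scores by dynamic
    # programming over the lattice. Only substrings that occur as corpus
    # terms (tf > 0, hence a positive score) become lattice nodes.
    n = len(string)
    best = [-math.inf] * (n + 1)   # best log-score of segmenting string[:j]
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            w = tfissf.get(string[i:j], 0.0)
            if w > 0 and best[i] + math.log(w) > best[j]:
                best[j] = best[i] + math.log(w)
                back[j] = i
    if best[n] == -math.inf:
        return [string]            # no lattice path: leave unsegmented
    pieces, j = [], n
    while j > 0:
        pieces.append(string[back[j]:j])
        j = back[j]
    return pieces[::-1]
```

For instance, segment('スマホケース', build_tfissf(terms)) returns ['スマホ', 'ケース'] whenever that path outscores both the unsegmented term and any competing path.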
2.2 Structured Perceptron with tf-issf

To incorporate manually labeled data and to compare with other supervised Japanese tokenizers, we use the structured perceptron (Collins, 2002). This model represents the score as

    \text{Score}(y) = \mathbf{w} \cdot \phi(y),    (4)

where φ(y) is a feature function and w is a weight vector. The features used in the structured perceptron are shown in Table 1. We use the surface-level features used by Kaji and Kitsuregawa (2011) and decode with the Viterbi algorithm.

Table 1: Features used for the structured perceptron.

    ID  Notation        Feature
    1   y_i             unigram
    2   y_{i-1}, y_i    bigram
    3   length(y_i)     num. of characters in y_i

We incorporate tf-issf into the structured perceptron as the initial feature weight for unigrams instead of initializing the weight vector to 0.[5] Specifically, we use log(tf-issf(y_i) + 1) for the initial weights to avoid overemphasizing the tf-issf value (Kaji and Kitsuregawa, 2011). In this way, we can directly adjust tf-issf values using a manually labeled corpus.

Unlike the approach of Xiao et al. (2002), which uses tf-idf to resolve segmentation ambiguities in Chinese, we regard each katakana term as one document to compute its inverse document (substring) frequency.

[5] We also attempt to incorporate tf-issf as an individual feature, but this does not improve the segmentation results.
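As a rough sketch of how the tf-issf statistics can seed the perceptron, the code below reuses the tfissf dictionary from the previous sketch; the feature encodings and function names are ours, not the paper's. Decoding still runs Viterbi over the word lattice, though with the bigram feature the DP state must also track the previous segment.

```python
import math
from collections import Counter, defaultdict

def init_weights(tfissf):
    # Initialize unigram feature weights to log(tf-issf + 1) rather than 0
    # (Section 2.2), so training starts from the unlabeled-corpus statistics.
    w = defaultdict(float)
    for piece, score in tfissf.items():
        w[('uni', piece)] = math.log(score + 1.0)
    return w

def features(segmentation):
    # Unigram, bigram, and length features of Table 1.
    feats = Counter()
    prev = '<BOS>'
    for piece in segmentation:
        feats[('uni', piece)] += 1
        feats[('bi', prev, piece)] += 1
        feats[('len', len(piece))] += 1
        prev = piece
    return feats

def score(w, segmentation):
    # Eq. 4: Score(y) = w · φ(y).
    return sum(w[f] * v for f, v in features(segmentation).items())

def perceptron_update(w, gold, predicted):
    # Standard structured perceptron update (Collins, 2002):
    # promote gold features, demote predicted ones.
    if predicted != gold:
        for f, v in features(gold).items():
            w[f] += v
        for f, v in features(predicted).items():
            w[f] -= v
```

Starting from the corpus statistics rather than zero means training on the labeled corpus only has to adjust, not relearn, the unigram weights.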
3 Experiments

We now describe our experiments. We run our proposed method under two different settings: 1) using only an unlabeled corpus (UNLABEL), and 2) using both an unlabeled corpus and a labeled corpus (BOTH). For the first experiment, we establish a baseline result using an approach proposed by Nakazawa et al. (2005) and compare this with using tf-issf alone. We conduct an experiment in the second setting to compare with other supervised approaches, including Japanese tokenizers.

3.1 Dataset

We compute the tf-issf value for each katakana substring using all of 2015 Japanese Wikipedia as an unlabeled training corpus. This consists of 1,937,006 unique katakana terms.

Following Hagiwara and Sekine (2013), we use katakana hashtags collected from Twitter as test data. We filter out duplicate hashtags because the Twitter Streaming API collects a set of sample tweets that are biased compared with the overall tweet stream (Morstatter et al., 2013). We follow the BCCWJ annotation guidelines (Ogura et al., 2011) to conduct the annotation.[7]

3.2 Baselines

We follow previous work and use a frequency-based method as the baseline (Nakazawa et al., 2005; Kaji and Kitsuregawa, 2011):

    y^* = \arg\max_{y \in Y} \frac{\left( \prod_{i=1}^{n} \text{tf}(y_i) \right)^{1/n}}{C^{N/(l+\alpha)}},    (5)

where l is the average length of all segmented substrings. Following Nakazawa et al. (2005) and Kaji and Kitsuregawa (2011), we set the hyperparameters to C = 2500, N = 4, and α = 0.7.
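Under the reconstruction of Eq. (5) above, the baseline is easy to sketch. This is an illustrative implementation, with the paper's hyperparameter values as defaults, assuming the tf Counter over katakana terms built earlier; exhaustive enumeration of candidate segmentations is practical because katakana words are short.

```python
import math

def baseline_score(pieces, tf, C=2500, N=4, alpha=0.7):
    # Eq. 5 (Nakazawa et al., 2005): geometric mean of segment term
    # frequencies, divided by a length penalty C ** (N / (l + alpha)),
    # where l is the average segment length.
    n = len(pieces)
    geo_mean = math.prod(tf.get(p, 0) for p in pieces) ** (1.0 / n)
    l = sum(len(p) for p in pieces) / n
    return geo_mean / C ** (N / (l + alpha))

def enumerate_segmentations(s):
    # All 2 ** (len(s) - 1) segmentations of a non-empty string.
    if len(s) == 1:
        yield [s]
        return
    for rest in enumerate_segmentations(s[1:]):
        yield [s[0]] + rest                  # cut after the first character
        yield [s[0] + rest[0]] + rest[1:]    # merge into the next segment

# Usage: best = max(enumerate_segmentations(word),
#                   key=lambda y: baseline_score(y, tf))
```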