Segmentation-free Compositional n-gram Embedding

Geewook Kim, Kazuki Fukui and Hidetoshi Shimodaira
Department of Systems Science, Graduate School of Informatics, Kyoto University
Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project
{geewook, k.fukui}@sys.i.kyoto-u.ac.jp, [email protected]

Abstract

We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation or any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free) and to construct representations for all character n-grams in a raw corpus with embeddings of compositional sub-n-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.

1 Introduction

Most existing word embedding models (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) take a sequence of words as their input. Therefore, the conventional models are dependent on word segmentation (Yang et al., 2017; Shao et al., 2018), which is a process of converting a raw corpus (i.e., a sequence of characters) into a sequence of segmented character n-grams. After the segmentation, the segmented character n-grams are assumed to be words, and each word's representation is constructed from the distribution of neighbouring words that co-occur together across the estimated word boundaries. In practice, however, this kind of approach has several problems. First, word segmentation is difficult, especially when texts in a corpus are noisy or unsegmented (Saito et al., 2014; Kim et al., 2018). For example, word segmentation on social network service (SNS) corpora, such as Twitter, is a challenging task since such corpora tend to include many misspellings, informal words, neologisms, and even emoticons. This problem becomes more severe in unsegmented languages, such as Chinese and Japanese, whose word boundaries are not explicitly indicated. Second, word segmentation has ambiguities (Luo et al., 2002; Li et al., 2003). For example, the word 線形代数学 (linear algebra) can be seen as a single word or as a sequence of words, such as 線形|代数学 (linear | algebra). Word segmentation errors negatively influence subsequent processes (Xu et al., 2004). For example, we may lose some words in training corpora, leading to a larger Out-Of-Vocabulary (OOV) rate (Sun et al., 2005). Moreover, segmentation errors, such as segmenting きのう (yesterday) as き|のう (tree | brain), produce false co-occurrence information. This problem is crucial for most existing methods as they are based on the distributional hypothesis (Harris, 1954), which can be summarized as: "a word is characterized by the company it keeps" (Firth, 1957).

To enhance word segmentation, some recent works (Junyi, 2013; Sato, 2015; Jeon, 2016) have made rich resources publicly available. However, maintaining them up to date is difficult, and it is infeasible for them to cover all types of words. To avoid the negative impact of word segmentation errors, Oshikiri (2017) proposed a word embedding method called segmentation-free word embedding (sembei). The key idea of sembei is to directly embed frequent character n-grams from a raw corpus without conducting word segmentation. However, most of the frequent n-grams are non-words (Kim et al., 2018), and hence sembei still suffers from the OOV problem. The same fundamental problem also lies in its extension (Kim et al., 2018), although it uses external resources to reduce the number of OOVs. To handle OOV problems, Bojanowski et al. (2017) proposed a compositional word embedding method with subword modeling, called subword-information skip-gram (sisg). The key idea of sisg is to extend the notion of vocabulary to include subwords, namely substrings of words, for enriching the representations of words with the embeddings of their subwords. In sisg, the embeddings of OOV (or unseen) words are computed from the embeddings of their subwords. However, sisg requires word segmentation as a preprocessing step, and the way of collecting co-occurrence information depends on the results of explicit word segmentation.

To solve the issues of word segmentation and OOV, we propose a simple but effective unsupervised representation learning method for words, phrases and sentences, called segmentation-free compositional n-gram embedding (scne). The key idea of scne is to train embeddings of compositional sub-n-grams to compose representations of all character n-grams in a raw corpus, and it enables treating all words, phrases and sentences seamlessly (see Figure 1 for an illustrative explanation). Our experimental results on a range of datasets suggest that scne can compute high-quality representations for words and sentences although it does not consider any word boundaries and is not dependent on any human-annotated resources.

Figure 1: A Japanese tweet with manual segmentation. (a) is the segmentation result of a widely-used word segmenter, on which conventional word embedding methods are dependent. (b) and (c) show the embedding targets and their co-occurrence information to be considered in our proposed method scne on the boundaries of 数|学 and 学|勉. Unlike conventional word embedding methods, scne considers all possible character n-grams on all boundaries (e.g., 線|形, 形|代, 代|数, ...) in the raw corpus without segmentation.

Figure 2: A graphical illustration of the proposed model trying to compute a representation for a character n-gram x_(i,j). The co-occurrence of x_(i,j) and its neighbouring context n-grams is used to train embeddings of compositional n-grams.

2 Segmentation-free Compositional n-gram Embedding (scne)

Our method scne successfully combines a subword model (Zhang et al., 2015; Wieting et al., 2016; Bojanowski et al., 2017; Zhao et al., 2018) with the idea of character n-gram embedding (Oshikiri, 2017; Kim et al., 2018). In scne, the vector representation of a target character n-gram is defined as follows. Let $x_1 x_2 \cdots x_N$ be a raw unsegmented corpus of $N$ characters. For a range $i, i+1, \ldots, j$ specified by an index $t = (i, j)$, $1 \le i \le j \le N$, we denote the substring $x_i x_{i+1} \cdots x_j$ as $x_{(i,j)}$ or $x_t$. In a training phase, scne first counts the frequency of character n-grams in the raw corpus to construct an n-gram set $V$ by collecting the $M$-most frequent n-grams with $n \le n_{\max}$, where $M$ and $n_{\max}$ are hyperparameters. For any target character n-gram $x_{(i,j)} = x_i x_{i+1} \cdots x_j$ in the corpus, scne constructs its representation $v_{x_{(i,j)}} \in \mathbb{R}^d$ by summing the embeddings of its sub-n-grams as follows:

$$ v_{x_{(i,j)}} = \sum_{s \in S(x_{(i,j)})} z_s, $$

where $S(x_{(i,j)}) = \{ x_{(i',j')} \in V \mid i \le i' \le j' \le j \}$ consists of all sub-n-grams of the target $x_{(i,j)}$, and the embeddings of the sub-n-grams $z_s \in \mathbb{R}^d$, $s \in V$, are model parameters to be learned. The objective of scne is similar to that of Mikolov et al. (2013):

$$ \sum_{t \in D} \sum_{c \in C(t)} \left[ \log \sigma\left(v_{x_t}^{\top} u_{x_c}\right) + \sum_{\tilde{s} \sim P_{\mathrm{neg}}} \log \sigma\left(-v_{x_t}^{\top} u_{\tilde{s}}\right) \right], $$

where $\sigma(x) = \frac{1}{1+\exp(-x)}$, $D = \{ (i, j) \mid 1 \le i \le j \le N,\; j - i + 1 \le n_{\mathrm{target}} \}$, and $C((i, j)) = \{ (i', j') \mid x_{(i',j')} \in V,\; j' = i - 1 \text{ or } i' = j + 1 \}$. $D$ is the set of indexes of all possible target n-grams in the raw corpus with $n \le n_{\mathrm{target}}$, where $n_{\mathrm{target}}$ is a hyperparameter. $C(t)$ is the set of indexes of the contexts of the target $x_t$, that is, all character n-grams in $V$ that are adjacent to the target (see Figures 1 and 2). The negative sampling distribution $P_{\mathrm{neg}}$ of $\tilde{s} \in V$ is proportional to its frequency in the corpus. The model parameters $z_s, u_{\tilde{s}} \in \mathbb{R}^d$, $s, \tilde{s} \in V$, are learned by maximizing the objective. We set $n_{\mathrm{target}} = n_{\max}$ in our experiments.

Although we examine frequent n-grams for simplicity, incorporating supervised word boundary information or byte pair encoding into the construction of the compositional n-gram set would be an interesting direction for future work (Kim et al., 2018; Sennrich et al., 2016; Heinzerling and Strube, 2018).
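To make the construction above concrete, the following is a minimal Python sketch of the vocabulary construction, the composition of a target representation from its sub-n-gram embeddings, and the extraction of target-context pairs. The function names and the toy usage are our own illustrative assumptions, not the authors' released implementation (linked in Section 3.1), and the negative-sampling training loop is omitted.

```python
# Minimal sketch of scne's composition and target-context pair extraction.
# Names (build_vocab, sub_ngrams, compose, target_context_pairs) are
# illustrative assumptions; training of the objective is omitted.
from collections import Counter
import numpy as np

def build_vocab(corpus, M, n_max):
    """V: the M most frequent character n-grams with n <= n_max."""
    counts = Counter(corpus[i:i + n]
                     for n in range(1, n_max + 1)
                     for i in range(len(corpus) - n + 1))
    return {g for g, _ in counts.most_common(M)}

def sub_ngrams(target, vocab):
    """S(x): all substrings of the target that belong to V."""
    return [target[a:b]
            for a in range(len(target))
            for b in range(a + 1, len(target) + 1)
            if target[a:b] in vocab]

def compose(target, vocab, z):
    """v_x = sum of the sub-n-gram embeddings z_s, s in S(x)."""
    return sum(z[s] for s in sub_ngrams(target, vocab))

def target_context_pairs(corpus, vocab, n_target, n_max):
    """Targets: all n-grams with n <= n_target; contexts: vocabulary n-grams
    immediately adjacent to the target on the left or right (the set C(t))."""
    pairs = []
    for i in range(len(corpus)):
        for j in range(i, min(i + n_target, len(corpus))):
            target = corpus[i:j + 1]
            for n in range(1, n_max + 1):
                left = corpus[max(i - n, 0):i]
                right = corpus[j + 1:j + 1 + n]
                if len(left) == n and left in vocab:
                    pairs.append((target, left))
                if len(right) == n and right in vocab:
                    pairs.append((target, right))
    return pairs

# Toy usage on a tiny raw corpus (no word segmentation anywhere).
corpus = "線形代数学を勉強中"
V = build_vocab(corpus, M=2 * 10**6, n_max=4)
rng = np.random.default_rng(0)
z = {s: rng.normal(size=200) for s in V}   # sub-n-gram embeddings (parameters)
v = compose("線形代数", V, z)               # representation of one target n-gram
```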

Figure 3: An example of a frequent n-gram lattice for a Japanese sentence meaning "We wrote a paper"; the marked nodes are (frequent) n-grams in the vocabulary.

2.1 Comparison to Oshikiri (2017)

To avoid the problems of word segmentation, Oshikiri (2017) proposed segmentation-free word embedding (sembei), which considers the M-most frequent character n-grams as individual words. A frequent n-gram lattice is then constructed, which is similar to a word lattice used in morphological analysis (see Figure 3). Finally, the pairs of adjacent n-grams in the lattice are considered as target-context pairs and fed to existing word embedding methods, e.g., skipgram (Mikolov et al., 2013). Although sembei is simple, the frequent n-gram vocabulary tends to include a vast amount of non-words (Kim et al., 2018). Furthermore, its vocabulary size is limited to M; hence, sembei cannot avoid the undesirable issue of OOV. The proposed scne avoids these problems by taking all possible character n-grams as embedding targets. Note that the target-context pairs of sembei are fully contained in those of scne (see Figure 1).

2.2 Comparison to Kim et al. (2018)

To overcome the problem of OOV in sembei, Kim et al. (2018) proposed an extension of sembei called word-like n-gram embedding (wne). In wne, the n-gram vocabulary is filtered to contain more valid words by taking advantage of a supervised probabilistic word segmenter. Although wne reduces the number of non-words, the problem of OOV remains since its vocabulary size is limited. In addition, wne depends on a word segmenter while scne does not.

2.3 Comparison to Bojanowski et al. (2017)

To deal with OOV words as well as rare words, Bojanowski et al. (2017) proposed subword-information skip-gram (sisg), which enriches word embeddings with the representations of their subwords, i.e., sub-character n-grams of words. In sisg, a vector representation of a target word is encoded as the sum of the embeddings of its subwords. For instance, the subwords of length n = 3 of the word where are extracted as <wh, whe, her, ere, re>, where "<" and ">" are special symbols added to the original word to represent its left and right word boundaries. Then, a vector representation of where is encoded as the sum of the embeddings of these subwords and that of the special sequence <where>, which corresponds to the original word itself. Although sisg is powerful, it requires the information of word boundaries as its input; that is, semantic units need to be specified when encoding targets. Therefore, it cannot be directly applied to unsegmented languages. Unlike sisg, scne does not require such information. The proposed scne is much simpler, but due to this simplicity, the embedding targets of scne contain many non-words, which may seem to be a problem (see Figure 1). However, our experimental results show that scne successfully captures the semantics of words and even sentences for unsegmented languages without using any knowledge of word boundaries (see Section 3).
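As a concrete illustration of the subword extraction just described, the short sketch below reproduces the "where" example; it is a simplified illustration of the idea, not fastText's actual code.

```python
# Sketch of sisg-style subword extraction for the word "where" with n = 3.
# "<" and ">" mark the word boundaries; the special sequence "<where>" is
# also kept as a feature corresponding to the original word itself.
def sisg_subwords(word, n=3):
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams + [w]

print(sisg_subwords("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```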
3 Experiments

In this section, we perform two intrinsic and two extrinsic tasks at both the word and sentence level, focusing on unsegmented languages. The implementation of our method is available on GitHub (www.github.com/kdrl/SCNE).

3.1 Common Settings

Baselines: We use skipgram (Mikolov et al., 2013), sisg (Bojanowski et al., 2017) and sembei (Oshikiri, 2017) as word embedding baselines. For sentence embedding, we first test simple baselines obtained by averaging the word vectors over a word-segmented sentence. In addition, we examine several recent successful sentence embedding methods, pv-dbow and pv-dm (Le and Mikolov, 2014) and sent2vec (Pagliardini et al., 2018), in an extrinsic task. Note that both scne and sembei have embeddings of frequent character n-grams as their model parameters, but the differences come from training strategies, such as embedding targets and the way of collecting co-occurrence information (see Section 2.1 for more details). To contrast scne with sembei, we also propose a variant of sembei (denoted by sembei-sum) as one of the baselines, which composes word and sentence embeddings by simply summing up the embeddings of their sub-n-grams, which are learned by sembei.

Hyperparameters Tuning: To see the effect of rich resources for the segmentation-dependent baselines, we employ a widely-used word segmenter with two settings: using only a basic dictionary (basic) or additionally using a rich dictionary (rich). The dimension of embeddings is 200, the number of epochs is 10 and the number of negative samples is 10 for all the methods. The n-gram vocabulary size M = 2 × 10^6 is used for sisg, sembei and scne. The other hyperparameters, such as the learning rate and n_max, are carefully adjusted via a grid search on the validation set. In the word similarity task, 2-fold cross validation is used for evaluation. In the sentence similarity task, we use the provided validation set. In the downstream tasks, vector representations are combined with a supervised logistic regression classifier. We repeat training and testing of the classifier 10 times, while the prepared dataset is randomly split into train (60%) and test (40%) sets each time, and the hyperparameters are tuned by 3-fold cross validation on the train set. We adopt mean accuracy as the evaluation metric. See Appendix A.1 for more experimental details.

3.2 Word and Sentence Similarity

We measure the ability of models to capture semantic similarity for words and sentences in Chinese; see Appendix A.2 for the experiment in Japanese. Given a set of word pairs, or sentence pairs, and their human-annotated similarity scores, we calculated Spearman's rank correlation between the cosine similarities of the embeddings and the scores. We use the datasets of Jin and Wu (2012) and Wang et al. (2017) for Chinese word and sentence similarity, respectively. Note that the conventional models, such as skipgram, cannot provide embeddings for OOV words, while the compositional models, such as sisg and scne, can compute them by using their subword modeling. In order to show comparable results, we use the null vector for these OOV words, following Bojanowski et al. (2017).

Table 1: Spearman rank correlations of the word similarity task on two different Chinese corpora. Best scores are boldface and 2nd best scores are underlined.

        skipgram_rich  sisg_rich  sembei  sembei-sum  scne
Wiki.   51.0           59.2       49.0    48.6        62.2
SNS     41.3           47.0       38.9    41.5        60.0
Diff.   -9.7           -12.2      -10.1   -7.1        -2.2

Figure 4: Word (left) and sentence (right) similarity tasks on portions of the Chinese Wikipedia corpus (Spearman rank correlation against corpus size in MB).

Results: To see the effect of the training corpus size, we train all models on portions of Wikipedia (10, 50, 100, 300MB from the head). The results are shown in Figure 4. As can be seen, the proposed scne is competitive with or outperforms the baselines for both the word and sentence similarity tasks. Moreover, it is worth noting that scne provides high-quality representations even when the size of the training corpus is small, which is crucial for practical real-world settings where rich data is not available. In the next experiment, to see the effect of the noisiness of the training corpus, we test both a noisy SNS corpus and a Wikipedia corpus of the same size (100MB of Sina Weibo posts for the Chinese SNS corpus and 100MB of the Chinese Wikipedia corpus). The results are reported in Table 1. As can be seen, the performance of the segmentation-dependent methods (skipgram, sisg) decreases greatly with the noisiness of the corpus, while scne degrades only marginally. The other two segmentation-free methods (sembei, sembei-sum) performed poorly. This shows the efficacy of our method on noisy texts. On the other hand, in preliminary experiments on English (not shown), scne did not obtain better results than our segmentation-dependent baselines, and it will be future work to incorporate easily obtainable word boundary information into scne for segmented languages.
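For reference, the intrinsic evaluation protocol of this subsection reduces to a few lines. The sketch below uses our own helper names, with `embed` standing for any of the trained models (returning the null vector for OOV items under the non-compositional baselines).

```python
# Word/sentence similarity evaluation: Spearman's rank correlation between
# cosine similarities of the embeddings and human-annotated scores.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_correlation(pairs, human_scores, embed):
    """pairs: list of (item1, item2); embed(item) -> vector."""
    sims = [cosine(embed(p), embed(q)) for p, q in pairs]
    return spearmanr(sims, human_scores).correlation
```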

Table 2: Noun category prediction accuracies (higher is better) and coverages [%] (in parentheses, higher is better).

                  Wikipedia corpora                                                        Noisy SNS corpora
                  Chinese            Japanese           Korean              Chinese            Japanese           Korean
                  All      Intersec. All      Intersec. All      Intersec.  All      Intersec. All      Intersec. All      Intersec.
skipgram_basic    8.9 (11)    81.0   7.8 (10)    75.7   11.5 (15)   77.1    2.9 (5)     58.2   2.9 (7)     41.4   2.4 (7)     35.4
skipgram_rich     9.5 (12)    81.0   16.7 (20)   75.8   11.9 (15)   76.9    3.0 (5)     58.2   4.1 (9)     40.9   2.5 (7)     34.2
sisg_basic        79.2 (100)  82.3   72.2 (100)  75.7   72.2 (100)  76.2    71.0 (100)  64.8   67.1 (100)  46.9   63.4 (100)  39.8
sisg_rich         79.5 (100)  82.2   73.3 (100)  74.7   72.4 (100)  76.6    70.8 (100)  64.9   67.5 (100)  46.0   63.3 (100)  37.7
sembei            21.8 (25)   79.0   18.2 (23)   70.1   14.2 (19)   41.8    4.5 (7)     59.6   4.9 (10)    41.9   5.0 (13)    33.7
sembei-sum        76.8 (100)  74.2   69.9 (100)  61.3   66.3 (100)  56.0    72.3 (100)  56.4   66.3 (100)  40.7   64.3 (100)  34.8
scne (Proposed)   79.8 (100)  81.5   73.9 (100)  74.0   73.2 (100)  73.9    74.9 (100)  65.0   68.1 (100)  47.6   65.3 (100)  38.2
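The supervised protocol behind Table 2 and Section 3.3 below can be sketched as follows. The helper names `embed` and `covered` are assumptions for illustration; the setup mirrors Section 3.1: a logistic regression classifier with C tuned by cross validation, OOV nouns skipped in training and counted as errors in testing.

```python
# Sketch of the noun category prediction evaluation: OOV nouns are skipped
# when training the classifier and counted as errors at test time, which is
# why the "All" accuracies depend on coverage, while "Intersec." restricts
# evaluation to nouns covered by every model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def category_accuracy(train, test, embed, covered):
    """train/test: lists of (noun, category); embed(noun) -> vector."""
    X = np.array([embed(w) for w, _ in train if covered(w)])
    y = [c for w, c in train if covered(w)]
    clf = GridSearchCV(LogisticRegression(max_iter=1000),
                       {"C": [0.1, 0.5, 1, 5, 10]}, cv=3).fit(X, y)
    hits = sum(covered(w) and clf.predict([embed(w)])[0] == c for w, c in test)
    return hits / len(test)   # uncovered test nouns count as errors
```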

3.3 Noun Category Prediction

As a word-level downstream task, we conduct noun category prediction on Chinese, Japanese and Korean (although Korean has spacing, word boundaries are not obviously determined by spaces). Most settings are the same as those of Oshikiri (2017). Noun words and their semantic categories are extracted from Wikidata (Vrandečić and Krötzsch, 2014) with a predetermined semantic category set (food, song, music band name, manga, fictional character name, television series, drama, chemical compound, disease, taxon, city, island, country, year, business enterprise, public company, profession, university, language, book), and the classifier is trained to predict the semantic category of words from the learned word representations, where unseen words are skipped in training and treated as errors in testing. To see the effect of the noisiness of corpora, both a noisy SNS corpus and a Wikipedia corpus of the same size are examined as training corpora (for each language, we use 100MB of Wikipedia and SNS data; for the SNS data, we use Sina Weibo for Chinese and Twitter for the rest).

Results: The results are reported in Table 2. Since the set of covered nouns (i.e., non-OOV words) depends on the methods, we calculate accuracies in two ways for a fair comparison: using all the nouns and using the intersection of the covered nouns. scne achieved the highest accuracies in all the settings when using all the nouns, and also performed well when using the intersection of the covered nouns, especially for the noisy corpora.

Table 3: Sentiment classification accuracies [%].

                 Segmentation-free   Chinese   Japanese   Korean
pv-dbow_basic                        82.84     85.24      84.16
pv-dbow_rich                         83.47     85.55      84.80
pv-dm_basic                          76.96     80.67      66.35
pv-dm_rich                           77.94     81.37      67.32
sent2vec_basic                       85.09     87.12      82.31
sent2vec_rich                        85.39     87.20      82.34
skipgram_basic                       85.79     86.76      84.06
skipgram_rich                        85.77     87.16      84.48
sisg_basic                           85.67     87.22      84.34
sisg_rich                            85.04     87.25      84.35
sembei-sum       ✓                   83.41     80.80      74.98
scne_nmax=8      ✓                   87.07     87.42      84.15
scne_nmax=16     ✓                   87.76     88.03      86.74

3.4 Sentiment Analysis

As a sentence-level evaluation, we perform sentiment analysis on movie review data. We use 101k, 56k and 200k movie reviews and their scores respectively from Chinese, Japanese and Korean movie review websites (see Appendix A.1.6 for more details). Each review is labeled as positive or negative by its rating score. Sentence embedding models are trained using the whole movie reviews as the training corpus. Among the reviews, 5k positive and 5k negative reviews are randomly selected, and the selected reviews are used to train and test the classifiers as explained in Section 3.1.

Results: The results are reported in Table 3. The accuracies show that scne is also very effective in the sentence-level application. In this experiment, we observe that a larger n_max contributes to the performance improvement in the sentence-level application by allowing our model to capture composed representations for longer phrases or sentences.

4 Conclusion

We proposed a simple yet effective unsupervised method to acquire general-purpose vector representations of words, phrases and sentences seamlessly, which is especially useful for languages whose word boundaries are not obvious, i.e., unsegmented languages. Although our method does not rely on any manually annotated resources or a word segmenter, our extensive experiments show that our method outperforms the conventional approaches that depend on such resources.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful advice. This work was partially supported by JSPS KAKENHI grants 16H02789 to HS and 18J15053 to KF.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5(1):135–146.
Daan van Esch. 2012. Leiden Weibo Corpus.
J.R. Firth. 1957. A synopsis of linguistic theory 1930-1955. Studies in linguistic analysis, pages 1–32.
Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 2989–2993. European Language Resources Association.
Heewon Jeon. 2016. NIA (National Information Society Agency) Dictionary.
Peng Jin and Yunfang Wu. 2012. SemEval-2012 Task 4: Evaluating Chinese Word Similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 374–377. Association for Computational Linguistics.
Sun Junyi. 2013. Jieba.
Geewook Kim, Kazuki Fukui, and Hidetoshi Shimodaira. 2018. Word-like character n-gram embedding. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 148–152. Association for Computational Linguistics.
Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196. PMLR.
Mu Li, Jianfeng Gao, Changning Huang, and Jianfeng Li. 2003. Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17, SIGHAN '03, pages 1–7. Association for Computational Linguistics.
Xiao Luo, Maosong Sun, and Benjamin K. Tsou. 2002. Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7. Association for Computational Linguistics.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Takamasa Oshikiri. 2017. Segmentation-Free Word Embedding for Unsegmented Languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 767–772. Association for Computational Linguistics.
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540. Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.
Itsumi Saito, Kugatsu Sadamitsu, Hisako Asano, and Yoshihiro Matsuo. 2014. Morphological Analysis for Japanese Noisy Text based on Character-level and Word-level Normalization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1773–1782. Dublin City University and Association for Computational Linguistics.
Yuya Sakaizawa and Mamoru Komachi. 2018. Construction of a Japanese Word Similarity Dataset. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 948–951. European Language Resources Association.
Toshinori Sato. 2015. Neologism dictionary based on the language resources on the Web for Mecab.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
Yan Shao, Christian Hardmeier, and Joakim Nivre. 2018. Universal Word Segmentation: Implementation and Interpretation. Transactions of the Association for Computational Linguistics, 6:421–435.
Chengjie Sun, Chang-Ning Huang, Xiaolong Wang, and Mu Li. 2005. Detecting Segmentation Errors in Chinese Annotated Corpus. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 1–8. Association for Computational Linguistics.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledge Base. Communications of the ACM, 57:78–85.
Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2017. Exploiting Word Internal Structures for Generic Chinese Sentence Representation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 298–303. Association for Computational Linguistics.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding Words and Sentences via Character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504–1515. Association for Computational Linguistics.
Jia Xu, Richard Zens, and Hermann Ney. 2004. Do We Need Chinese Word Segmentation for Statistical Machine Translation? In ACL SIGHAN Workshop 2004, pages 122–128. Association for Computational Linguistics.
Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural Word Segmentation with Rich Pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 839–849. Association for Computational Linguistics.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.
Jinman Zhao, Sidharth Mudgal, and Yingyu Liang. 2018. Generalizing Word Embeddings using Bag of Subwords. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 601–606. Association for Computational Linguistics.

A Appendices

A.1 Experimental Details

A.1.1 Hyperparameters Tuning

For skipgram, we performed a grid search over (h, γ) ∈ {1, 5, 10} × {0.01, 0.025}, where h is the size of the context window and γ is the initial learning rate. For sisg, we performed a grid search over (h, γ, n_min, n_max) ∈ {1, 5, 10} × {0.01, 0.025} × {1, 3} × {4, 8, 12}, where h is the size of the context window, γ is the initial learning rate, n_min is the minimum length of character n-gram and n_max is the maximum length of character n-gram. For pv-dbow, pv-dm and sent2vec, we performed a grid search over (h, γ) ∈ {5, 10} × {0.01, 0.05, 0.1, 0.2, 0.5}, where h is the size of the context window and γ is the initial learning rate. For sembei and scne, we used the initial learning rate 0.01 and n_min = 1. The maximum length of n-gram to consider, n_max, is grid searched over {4, 6, 8} in the word and sentence similarity tasks. In the noun category prediction task, we used n_max = 8 for sembei, and the n_max of scne is grid searched over {4, 6, 8}. For the sentiment analysis task, we tested both n_max = 8 and n_max = 16 for sembei and scne to see the effect of a large n_max. After carefully monitoring the loss curve and the performance in the word and sentence similarity tasks, we set the number of epochs to 10 for all methods. In preliminary experiments, we also tested 20 epochs for the word-segmentation-dependent baselines, but there were no significant differences. In the two supervised downstream tasks, the learned vector representations are combined with the logistic regression classifier. The parameter C, which is the inverse of the regularization strength of the classifier, is adjusted via a grid search over C ∈ {0.1, 0.5, 1, 5, 10}. Again, as explained in the main paper, the hyperparameters are grid searched on the determined validation set for all experiments.

A.1.2 Implementations

Here we provide the list of implementations of baselines which are used in our experiments. For skipgram (https://code.google.com/archive/p/word2vec/), sisg (https://github.com/facebookresearch/fastText), sembei (https://github.com/oshikiri/w2v-sembei), and sent2vec (https://github.com/epfml/sent2vec), we use the official implementations provided by the authors. Meanwhile, for pv-dbow and pv-dm, we employ a widely-used implementation from the Gensim library (https://radimrehurek.com/gensim/models/doc2vec.html).

A.1.3 Word Segmenters and Word Dictionaries for Unsegmented Languages

Below we list the word segmentation tools and word dictionaries which are used in our experiments. We employed a widely-used word segmentation tool for each language. For Chinese, we used jieba (https://github.com/fxsjy/jieba) with its default dictionary (https://github.com/fxsjy/jieba/blob/master/jieba/dict.txt) or with an extended dictionary (https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big), which fully supports both traditional and simplified Chinese characters. For Japanese, we used MeCab (http://taku910.github.io/mecab/) with its default dictionary called IPADIC, along with a specially designed neologism-extended dictionary called mecab-ipadic-NEologd (https://github.com/neologd/mecab-ipadic-neologd). Note that, because this extended dictionary is specially designed to include many neologisms, there is a significant word coverage improvement from using this dictionary, as can be seen in the Japanese noun category prediction task in the main paper. For Korean, we used mecab-ko (https://bitbucket.org/eunjeon/mecab-ko) with its default dictionary called mecab-ko-dic (https://bitbucket.org/eunjeon/mecab-ko-dic) along with another extended dictionary called NIADic (https://github.com/haven-jeon/NIADic).

A.1.4 Training Corpora

We prepared Wikipedia corpora and SNS corpora for Chinese, Japanese and Korean for our experiments. For the Wikipedia corpora, we used the first 10, 50, 100, 200 and 300MB of texts from the publicly available Wikipedia dumps (https://dumps.wikimedia.org/). The texts are extracted using the WikiExtractor tool (https://github.com/attardi/wikiextractor). For the Chinese SNS corpus, we used 100MB of the Leiden Weibo Corpus (van Esch, 2012) from the head. For the Japanese and Korean SNS corpora, we collected Japanese and Korean tweets using the Twitter Streaming API. We removed usernames and URLs from the SNS corpora. There were many informal words, emoticons and misspellings in the SNS corpora. We preserved them without preprocessing to see the effect of the noisiness of the training corpora in our experiments.

A.1.5 Preprocess of Wikidata

For the noun category prediction task, we extracted noun words and their semantic categories from Wikidata (Vrandečić and Krötzsch, 2014) following Oshikiri (2017). We determined the semantic category set used in our experiments as follows: First, we collected Wikidata objects that have Chinese, Japanese, Korean and English labels. Next, we sorted the categories by the number of noun words, and removed categories (e.g., Wikimedia category or Wikimedia template) that do not represent any semantic category. We also removed several categories that contain too many noun words (e.g., human) or too few noun words (e.g., academic discipline). Since there were several duplicated labels for different Wikidata objects, the number of nouns for each language is slightly different. Each category has at least 0.1k words and no more than 5k words. The numbers of extracted noun words used in our experiments were 22,468, 22,396 and 22,298 for Chinese, Japanese and Korean, respectively.

A.1.6 Movie Review Datasets

In the main paper, three movie review datasets are used to evaluate the quality of sentence embeddings. We used 101,114, 55,837 and 200,000 movie reviews and their rating scores from Yahoo奇摩電影 (https://github.com/fychao/ChineseMovieReviews), Yahoo!映画 (https://github.com/dennybritz/sentiment-analysis/tree/master/data) and Naver Movies (https://github.com/e9t/nsmc) for Chinese, Japanese and Korean, respectively.

A.2 Additional Experiment on Japanese

In this section, we show the results of Japanese word similarity experiments. We use the dataset of Sakaizawa and Komachi (2018), which contains 4,427 pairs of words with human similarity scores. We omit the sentence similarity task since there is no public, widely-used benchmark dataset for Japanese yet. Following the main paper, given a set of word pairs and their human-annotated similarity scores, we calculated Spearman's rank correlation between the cosine similarities of the embeddings and the human scores. We use 2-fold cross validation for hyperparameter tuning. The same grid search is performed as explained in Section A.1.1. To see the effect of the noisiness of the training corpora, we use two Japanese corpora, 100MB of a Wikipedia corpus and 100MB of a noisy SNS corpus (Twitter), which are also used in the Japanese noun category prediction task in the main paper. As seen in Table 4, the experimental results for Japanese are similar to those for Chinese in the main paper.

Table 4: Spearman rank correlations of the word similarity task on two different Japanese corpora.

        skipgram_rich  sisg_rich  sembei  sembei-sum  scne
Wiki.   8.3            15.4       4.0     9.3         24.1
SNS     5.3            12.7       2.8     9.3         23.0
Diff.   -3.0           -2.7       -1.2    -0.0        -1.1
