Word-like character n-gram embedding
Geewook Kim, Kazuki Fukui and Hidetoshi Shimodaira
Department of Systems Science, Graduate School of Informatics, Kyoto University
Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project

Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 148-152, Brussels, Belgium, November 1, 2018. © 2018 Association for Computational Linguistics.

Abstract

We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of the recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing the idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, including words that are not listed in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the text in the corpus is in an unsegmented language and contains many neologisms and informal words (e.g., a Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks.

Table 1: Top-10 2-grams in Sina Weibo and 4-grams in Japanese Twitter (Experiment 1). In the original table, words are indicated by boldface; space characters are marked by ␣.

          FNE                        WNE (proposed)
      Chinese   Japanese         Chinese   Japanese
  1   ][        wwww             自己      フォロー
  2   。␣       !!!!             。␣       ありがと
  3   !␣        ありがと          ][        wwww
  4   ..        りがとう          一个      !!!!
  5   ]␣        ございま          微博      めっちゃ
  6   。。       うござい          什么      んだけど
  7   ,我       とうござ          可以      うござい
  8   !!        ざいます          没有      LINE
  9   ␣我       がとうご          吗?       2018
 10   了,       ください          哈哈      じゃない

1 Introduction

Most existing word embedding methods require word segmentation as a preprocessing step (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). The raw corpus is first converted into a sequence of words, and word co-occurrence in the segmented corpus is used to compute word vectors. This conventional method is referred to as Segmented character N-gram Embedding (SNE) to make the distinction clear in the argument below. Word segmentation is almost obvious for segmented languages (e.g., English), whose words are delimited by spaces. On the other hand, when dealing with unsegmented languages (e.g., Chinese and Japanese), whose word boundaries are not obviously indicated, word segmentation tools are used to determine word boundaries in the raw corpus. However, these segmenters require rich dictionaries for accurate segmentation, which are expensive to prepare and not always available. Furthermore, when we deal with noisy texts (e.g., SNS data), which contain many neologisms and informal words, using a word segmenter with a poor word dictionary results in significant segmentation errors, leading to degradation of the quality of the learned word embeddings.

To avoid this difficulty, segmentation-free word embedding has been proposed (Oshikiri, 2017). It does not require word segmentation as a preprocessing step. Instead, it examines the frequencies of all possible character n-grams in a given corpus to build a frequent n-gram lattice. Subsequently, it composes distributed representations of n-grams by feeding their co-occurrence information to existing word embedding models. In this method, which we refer to as Frequent character N-gram Embedding (FNE), the top-K most frequent character n-grams are selected as the n-gram vocabulary for embedding. Although FNE does not require any word dictionaries, the n-gram vocabulary tends to include a vast amount of non-words. For example, only 1.5% of the n-gram vocabulary is estimated to be words at K = 2M in Experiment 1 (see Precision of FNE in Fig. 2b).

Since the vocabulary size K is limited, we would like to reduce the number of non-words in the vocabulary in order to embed more words. To this end, we propose another segmentation-free word embedding method, called Word-like character N-gram Embedding (WNE). While FNE only considers n-gram frequencies when constructing the n-gram vocabulary, WNE considers how likely each n-gram is to be a "word". Specifically, we introduce the idea of expected word frequency (ewf) in a stochastically segmented corpus (Mori and Takuma, 2004), and the top-K n-grams with the highest ewf are selected as the n-gram vocabulary for embedding. In WNE, ewf estimates the frequency of each n-gram appearing as a word in the corpus, whereas FNE uses the raw frequency of the n-gram. As seen in Table 1 and Fig. 1, WNE tends to include more dictionary words than FNE. WNE thus incorporates the advantage of dictionary-based SNE into FNE. In the calculation of ewf, we use a probabilistic predictor of word boundaries. We do not expect the predictor to be very accurate; if it were, SNE would be preferred in the first place. A naive predictor is sufficient to give low ewf scores to the vast majority of non-words, so that words, including neologisms, can more easily enter the vocabulary. Although our idea seems somewhat simple, our experiments show that WNE significantly improves word coverage while achieving better performance on downstream tasks.

[Figure 1: A Japanese tweet 「海鮮料理美味い!(゚∀゚)」 with manual segmentation. The output of a standard Japanese word segmenter is shown in the SNE row; the n-grams included in the vocabulary of each method are shown in the FNE and WNE rows (K = 2×10^6). Words are black and non-words are gray.]

2 Related work

The lack of word boundary information in unsegmented languages, such as Chinese and Japanese, raises the need for an additional step of word segmentation, which requires rich word dictionaries to deal with corpora containing many neologisms. However, in many cases, such dictionaries are costly to obtain or to keep up to date. Though recent studies have employed character-based methods to handle large vocabularies in NLP tasks ranging from machine translation (Costa-jussà and Fonollosa, 2016; Luong and Manning, 2016) to part-of-speech tagging (Dos Santos and Zadrozny, 2014), they still require a segmentation step. Some other studies employed character-level or n-gram embedding without word segmentation (Schütze, 2017; Dhingra et al., 2016), but most of them are task-specific and do not set their goal as obtaining word vectors. As for word embedding tasks, subword (or n-gram) embedding techniques have been proposed to deal with morphologically rich languages (Bojanowski et al., 2017) or to obtain fast and simple architectures for word and sentence representations (Wieting et al., 2016), but these methods do not consider a situation where word boundaries are missing. To obtain word vectors without word segmentation, Oshikiri (2017) proposed a new pipeline of word embedding which is effective for unsegmented languages.

3 Frequent n-gram embedding

A new pipeline of word embedding for unsegmented languages, referred to as FNE in this paper, was recently proposed by Oshikiri (2017). First, the frequencies of all character n-grams in a raw corpus are counted, and the K most frequent n-grams are selected as the n-gram vocabulary of FNE. This way of determining the n-gram vocabulary can also be found in Wieting et al. (2016).
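A minimal sketch of this counting-and-selection step (not the authors' code): the function name, the maximum n-gram length max_n, and the toy inputs are assumptions made here for illustration; the paper itself uses vocabularies as large as K = 2M (Experiment 1).

```python
from collections import Counter

def fne_vocabulary(corpus, max_n=8, top_k=2_000_000):
    """FNE vocabulary: count every character n-gram (lengths 1..max_n)
    in the raw, unsegmented corpus and keep the top_k most frequent ones."""
    counts = Counter()
    for line in corpus:                       # corpus: iterable of raw strings
        for n in range(1, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return [ngram for ngram, _ in counts.most_common(top_k)]

# Toy usage (illustrative only):
vocab = fne_vocabulary(["短い学術論文", "海鮮料理美味い"], max_n=4, top_k=100)
```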
Then, a frequent n-gram lattice is constructed by enumerating all possible segmentations of the corpus with the n-grams in the vocabulary, allowing partial overlapping of n-grams in the lattice. For example, suppose the string 短い学術論文 (short academic paper) appears in a corpus, and that 短い (short), 学術 (academic), 論文 (paper) and 学術論文 (academic paper) are included in the n-gram vocabulary; the resulting word and context pairs are (短い, 学術), (短い, 学術論文) and (学術, 論文). Co-occurrence frequencies over the frequent n-gram lattice are then fed into the word embedding model to obtain vectors for the n-grams in the vocabulary. Consequently, FNE succeeds in learning embeddings for many words while avoiding the negative impact of erroneous segmentations.
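The pair extraction in this example can be sketched as follows. The excerpt does not spell out the precise context definition over the lattice, so this sketch assumes the simplest reading consistent with the example above: a context is any vocabulary n-gram that begins exactly where the target n-gram ends. The function name and the max_n parameter are hypothetical.

```python
def lattice_pairs(text, vocab, max_n=8):
    """Enumerate occurrences of vocabulary n-grams in `text` and emit
    (target, context) pairs for n-grams that start right where the target
    ends (a simplified 'adjacent on the lattice' notion of context)."""
    vocab = set(vocab)
    # occurrences[start] = vocabulary n-grams beginning at position `start`
    occurrences = {}
    for i in range(len(text)):
        for n in range(1, min(max_n, len(text) - i) + 1):
            ngram = text[i:i + n]
            if ngram in vocab:
                occurrences.setdefault(i, []).append(ngram)
    pairs = []
    for start, ngrams in occurrences.items():
        for target in ngrams:
            for context in occurrences.get(start + len(target), []):
                pairs.append((target, context))
    return pairs

print(lattice_pairs("短い学術論文", ["短い", "学術", "論文", "学術論文"]))
# -> [('短い', '学術'), ('短い', '学術論文'), ('学術', '論文')]
```

Run on the example string, this reproduces exactly the three pairs listed above.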
Although FNE is effective for unsegmented languages, it tends to embed too many non-words. This is undesirable since the number of embedding targets is limited by time and memory constraints, and the non-words in the vocabulary could degrade the quality of the word embeddings.

4 Word-like n-gram embedding

To reduce the number of non-words in the n-gram vocabulary of FNE, we change the selection criterion of n-grams.

4.1 Expected word frequency

Instead of ranking n-grams by raw frequency, WNE selects the top-K n-grams by their expected word frequency (ewf) in a stochastically segmented corpus (Mori and Takuma, 2004), that is, the expected number of times each n-gram occurs as a word, while the raw frequency of w is expressed as

    freq(w) = \sum_{(i,j) \in I(w)} 1,

where I(w) denotes the set of spans (i, j) at which w occurs in the corpus.

4.2 Probabilistic predictor of word boundary

In this paper, a logistic regression is used to estimate the word boundary probability. For the explanatory variables, we employ the association strength (Sproat and Shih, 1990) of character n-grams; similar statistics of word n-grams are used ...
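As a rough illustration of the WNE criterion, the sketch below scores n-grams by expected word frequency given per-position boundary probabilities (in the paper these would come from the logistic-regression predictor of Section 4.2) and keeps the top K. The closed form used here, where an occurrence of w spanning text[s:e] contributes P[s] * prod_{s<k<e}(1 - P[k]) * P[e], follows the standard stochastically segmented corpus formulation of Mori and Takuma (2004); the paper's own Section 4.1 derivation is not included in this excerpt, so the formula, the function names, and the toy probabilities should all be read as assumptions.

```python
from collections import defaultdict

def expected_word_frequencies(text, boundary_prob, max_n=8):
    """ewf(w): expected number of times the character n-gram w occurs as a
    word in a stochastically segmented corpus (Mori and Takuma, 2004 style).

    boundary_prob[k] is the probability of a word boundary between text[k-1]
    and text[k]; boundary_prob[0] and boundary_prob[len(text)] are 1.0
    (corpus edges are treated as certain boundaries).  This exact formula is
    an assumption, not a quotation from the paper."""
    ewf = defaultdict(float)
    for s in range(len(text)):
        inside = 1.0        # probability that no boundary falls inside w so far
        for e in range(s + 1, min(s + max_n, len(text)) + 1):
            if e > s + 1:
                inside *= 1.0 - boundary_prob[e - 1]
            ewf[text[s:e]] += boundary_prob[s] * inside * boundary_prob[e]
    return ewf

def wne_vocabulary(text, boundary_prob, top_k, max_n=8):
    """WNE: rank n-grams by ewf instead of raw frequency and keep the top K."""
    ewf = expected_word_frequencies(text, boundary_prob, max_n)
    return sorted(ewf, key=ewf.get, reverse=True)[:top_k]

# Toy usage with hypothetical boundary probabilities
# (low inside words, high at word boundaries):
text = "海鮮料理美味い"
P = [1.0, 0.05, 0.7, 0.05, 0.9, 0.05, 0.9, 1.0]   # len(text) + 1 values
vocab = wne_vocabulary(text, P, top_k=10, max_n=4)
```

Replacing freq(w) with ewf(w) as the ranking score is the only change relative to the FNE sketch above; non-words receive low scores because at least one of the boundary factors in their occurrences is small.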