Segmentation-free Compositional n-gram Embedding

Geewook Kim, Kazuki Fukui and Hidetoshi Shimodaira
Department of Systems Science, Graduate School of Informatics, Kyoto University
Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project
{geewook, k.fukui}@sys.i.kyoto-u.ac.jp, [email protected]

Abstract

We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation or any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free) and to construct representations for all character n-grams in a raw corpus with embeddings of compositional sub-n-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.

1 Introduction

Most existing word embedding models (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) take a sequence of words as their input. Therefore, the conventional models are dependent on word segmentation (Yang et al., 2017; Shao et al., 2018), which is a process of converting a raw corpus (i.e., a sequence of characters) into a sequence of segmented character n-grams. After the segmentation, the segmented character n-grams are assumed to be words, and each word's representation is constructed from the distribution of neighbouring words that co-occur together across the estimated word boundaries. In practice, however, this kind of approach has several problems. First, word segmentation is difficult, especially when texts in a corpus are noisy or unsegmented (Saito et al., 2014; Kim et al., 2018). For example, word segmentation on social network service (SNS) corpora, such as Twitter, is a challenging task since such corpora tend to include many misspellings, informal words, neologisms, and even emoticons. This problem becomes more severe in unsegmented languages, such as Chinese and Japanese, whose word boundaries are not explicitly indicated. Second, word segmentation has ambiguities (Luo et al., 2002; Li et al., 2003). For example, the word 線形代数学 (linear algebra) can be seen as a single word or as a sequence of words, such as 線形|代数学 (linear | algebra). Word segmentation errors negatively influence subsequent processes (Xu et al., 2004). For example, we may lose some words in training corpora, leading to a larger Out-Of-Vocabulary (OOV) rate (Sun et al., 2005). Moreover, segmentation errors, such as segmenting きのう (yesterday) as き|のう (tree | brain), produce false co-occurrence information. This problem is crucial for most existing methods as they are based on the distributional hypothesis (Harris, 1954), which can be summarized as: "a word is characterized by the company it keeps" (Firth, 1957).

To enhance word segmentation, some recent works (Junyi, 2013; Sato, 2015; Jeon, 2016) have made rich resources publicly available. However, maintaining them up to date is difficult, and it is infeasible for them to cover all types of words. To avoid the negative impact of word segmentation errors, Oshikiri (2017) proposed a word embedding method called segmentation-free word embedding (sembei). The key idea of sembei is to directly embed frequent character n-grams from a raw corpus without conducting word segmentation. However, most of the frequent n-grams are non-words (Kim et al., 2018), and hence sembei still suffers from the OOV problem. The same fundamental problem also lies in its extension (Kim et al., 2018), although it uses external resources to reduce the number of OOVs. To handle OOV problems, Bojanowski et al. (2017) proposed a compositional word embedding method with subword modeling, called subword-information skip-gram (sisg). The key idea of sisg is to extend the notion of vocabulary to include subwords, namely substrings of words, for enriching the representations of words with the embeddings of their subwords. In sisg, the embeddings of OOV (or unseen) words are computed from the embeddings of their subwords. However, sisg requires word segmentation as a preprocessing step, and the way of collecting co-occurrence information depends on the results of explicit word segmentation.

To solve the issues of word segmentation and OOV, we propose a simple but effective unsupervised representation learning method for words, phrases and sentences, called segmentation-free compositional n-gram embedding (scne). The key idea of scne is to train embeddings of compositional sub-n-grams to compose representations of all character n-grams in a raw corpus, and it enables treating all words, phrases and sentences seamlessly (see Figure 1 for an illustrative explanation). Our experimental results on a range of datasets suggest that scne can compute high-quality representations for words and sentences although it does not consider any word boundaries and is not dependent on any human-annotated resources.

Figure 1: A Japanese tweet with manual segmentation. (a) is the segmentation result of a widely-used word segmenter, on which conventional word embedding methods are dependent. (b) and (c) show the embedding targets and their co-occurrence information to be considered in our proposed method scne on the boundaries of 数|学 and 学|勉. Unlike conventional word embedding methods, scne considers all possible character n-grams on all boundaries (e.g., 線|形, 形|代, 代|数, ...) in the raw corpus without segmentation.

Figure 2: A graphical illustration of the proposed model trying to compute a representation for a character n-gram x_(i,j). The co-occurrence of x_(i,j) and its neighbouring context n-grams is used to train embeddings of compositional n-grams.

2 Segmentation-free Compositional n-gram Embedding (scne)

Our method scne successfully combines a subword model (Zhang et al., 2015; Wieting et al., 2016; Bojanowski et al., 2017; Zhao et al., 2018) with the idea of character n-gram embedding (Oshikiri, 2017; Kim et al., 2018). In scne, the vector representation of a target character n-gram is defined as follows. Let $x_1 x_2 \cdots x_N$ be a raw unsegmented corpus of $N$ characters. For a range $i, i+1, \ldots, j$ specified by an index $t = (i, j)$, $1 \le i \le j \le N$, we denote the substring $x_i x_{i+1} \cdots x_j$ as $x_{(i,j)}$ or $x_t$. In a training phase, scne first counts the frequency of character n-grams in the raw corpus to construct an n-gram set $V$ by collecting the $M$-most frequent n-grams with $n \le n_{\max}$, where $M$ and $n_{\max}$ are hyperparameters. For any target character n-gram $x_{(i,j)} = x_i x_{i+1} \cdots x_j$ in the corpus, scne constructs its representation $v_{x_{(i,j)}} \in \mathbb{R}^d$ by summing the embeddings of its sub-n-grams as follows:

$$ v_{x_{(i,j)}} = \sum_{s \in S(x_{(i,j)})} z_s, $$

where $S(x_{(i,j)}) = \{ x_{(i',j')} \in V \mid i \le i' \le j' \le j \}$ consists of all sub-n-grams of the target $x_{(i,j)}$, and the embeddings of the sub-n-grams $z_s \in \mathbb{R}^d$, $s \in V$, are model parameters to be learned. The objective of scne is similar to that of Mikolov et al. (2013):

$$ \sum_{t \in D} \sum_{c \in C(t)} \left[ \log \sigma\left(v_{x_t}^{\top} u_{x_c}\right) + \sum_{\tilde{s} \sim P_{\mathrm{neg}}} \log \sigma\left(-v_{x_t}^{\top} u_{\tilde{s}}\right) \right], $$

where $\sigma(x) = \frac{1}{1+\exp(-x)}$, $D = \{ (i, j) \mid 1 \le i \le j \le N,\; j - i + 1 \le n_{\mathrm{target}} \}$, and $C((i, j)) = \{ (i', j') \mid x_{(i',j')} \in V,\; j' = i - 1 \text{ or } i' = j + 1 \}$. $D$ is the set of indexes of all possible target n-grams in the raw corpus with $n \le n_{\mathrm{target}}$, where $n_{\mathrm{target}}$ is a hyperparameter. $C(t)$ is the set of indexes of the contexts of the target $x_t$, that is, all character n-grams in $V$ that are adjacent to the target (see Figures 1 and 2). The negative sampling distribution $P_{\mathrm{neg}}$ of $\tilde{s} \in V$ is proportional to its frequency in the corpus. The model parameters $z_s, u_{\tilde{s}} \in \mathbb{R}^d$, $s, \tilde{s} \in V$, are learned by maximizing the objective. We set $n_{\mathrm{target}} = n_{\max}$ in our experiments.

Although we examine frequent n-grams for simplicity, incorporating supervised word boundary information or byte pair encoding into the construction of the compositional n-gram set would be an interesting direction for future work (Kim et al., 2018; Sennrich et al., 2016; Heinzerling and Strube, 2018).
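To make the construction above concrete, the following is a minimal Python sketch of the vocabulary construction, the composition of a target representation from its sub-n-gram embeddings, and the extraction of target-context pairs. The function names and the toy usage are our own illustrative assumptions, not the authors' released implementation (linked in Section 3.1), and the negative-sampling training loop is omitted.

```python
# Minimal sketch of scne's composition and target-context pair extraction.
# Names (build_vocab, sub_ngrams, compose, target_context_pairs) are
# illustrative assumptions; training of the objective is omitted.
from collections import Counter
import numpy as np

def build_vocab(corpus, M, n_max):
    """V: the M most frequent character n-grams with n <= n_max."""
    counts = Counter(corpus[i:i + n]
                     for n in range(1, n_max + 1)
                     for i in range(len(corpus) - n + 1))
    return {g for g, _ in counts.most_common(M)}

def sub_ngrams(target, vocab):
    """S(x): all substrings of the target that belong to V."""
    return [target[a:b]
            for a in range(len(target))
            for b in range(a + 1, len(target) + 1)
            if target[a:b] in vocab]

def compose(target, vocab, z):
    """v_x = sum of the sub-n-gram embeddings z_s, s in S(x)."""
    return sum(z[s] for s in sub_ngrams(target, vocab))

def target_context_pairs(corpus, vocab, n_target, n_max):
    """Targets: all n-grams with n <= n_target; contexts: vocabulary n-grams
    immediately adjacent to the target on the left or right (the set C(t))."""
    pairs = []
    for i in range(len(corpus)):
        for j in range(i, min(i + n_target, len(corpus))):
            target = corpus[i:j + 1]
            for n in range(1, n_max + 1):
                left = corpus[max(i - n, 0):i]
                right = corpus[j + 1:j + 1 + n]
                if len(left) == n and left in vocab:
                    pairs.append((target, left))
                if len(right) == n and right in vocab:
                    pairs.append((target, right))
    return pairs

# Toy usage on a tiny raw corpus (no word segmentation anywhere).
corpus = "線形代数学を勉強中"
V = build_vocab(corpus, M=2 * 10**6, n_max=4)
rng = np.random.default_rng(0)
z = {s: rng.normal(size=200) for s in V}   # sub-n-gram embeddings (parameters)
v = compose("線形代数", V, z)               # representation of one target n-gram
```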

Figure 3: An example of a frequent n-gram lattice for a Japanese sentence meaning "We wrote a paper"; the marked nodes are (frequent) n-grams in the vocabulary.

2.1 Comparison to Oshikiri (2017)

To avoid the problems of word segmentation, Oshikiri (2017) proposed segmentation-free word embedding (sembei), which considers the M-most frequent character n-grams as individual words. A frequent n-gram lattice is then constructed, which is similar to a word lattice used in morphological analysis (see Figure 3). Finally, the pairs of adjacent n-grams in the lattice are considered as target-context pairs and fed to existing word embedding methods, e.g., skipgram (Mikolov et al., 2013). Although sembei is simple, the frequent n-gram vocabulary tends to include a vast amount of non-words (Kim et al., 2018). Furthermore, its vocabulary size is limited to M; hence, sembei cannot avoid the undesirable issue of OOV. The proposed scne avoids these problems by taking all possible character n-grams as embedding targets. Note that the target-context pairs of sembei are fully contained in those of scne (see Figure 1).

2.2 Comparison to Kim et al. (2018)

To overcome the problem of OOV in sembei, Kim et al. (2018) proposed an extension of sembei called word-like n-gram embedding (wne). In wne, the n-gram vocabulary is filtered to contain more valid words by taking advantage of a supervised probabilistic word segmenter. Although wne reduces the number of non-words, the problem of OOV remains since its vocabulary size is limited. In addition, wne depends on a word segmenter while scne does not.

2.3 Comparison to Bojanowski et al. (2017)

To deal with OOV words as well as rare words, Bojanowski et al. (2017) proposed subword-information skip-gram (sisg), which enriches word embeddings with the representations of their subwords, i.e., sub-character n-grams of words. In sisg, a vector representation of a target word is encoded as the sum of the embeddings of its subwords. For instance, the subwords of length n = 3 of the word where are extracted as <wh, whe, her, ere, re>, where "<" and ">" are special symbols added to the original word to represent its left and right word boundaries. Then, a vector representation of where is encoded as the sum of the embeddings of these subwords and that of the special sequence <where>, which corresponds to the original word itself. Although sisg is powerful, it requires the information of word boundaries as its input; that is, semantic units need to be specified when encoding targets. Therefore, it cannot be directly applied to unsegmented languages. Unlike sisg, scne does not require such information. The proposed scne is much simpler, but due to this simplicity, the embedding targets of scne contain many non-words, which may seem to be a problem (see Figure 1). However, our experimental results show that scne successfully captures the semantics of words and even sentences for unsegmented languages without using any knowledge of word boundaries (see Section 3).
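As a concrete illustration of the subword extraction just described, the short sketch below reproduces the "where" example; it is a simplified illustration of the idea, not fastText's actual code.

```python
# Sketch of sisg-style subword extraction for the word "where" with n = 3.
# "<" and ">" mark the word boundaries; the special sequence "<where>" is
# also kept as a feature corresponding to the original word itself.
def sisg_subwords(word, n=3):
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams + [w]

print(sisg_subwords("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```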
3 Experiments

In this section, we perform two intrinsic and two extrinsic tasks at both the word and sentence level, focusing on unsegmented languages. The implementation of our method is available on GitHub (www.github.com/kdrl/SCNE).

3.1 Common Settings

Baselines: We use skipgram (Mikolov et al., 2013), sisg (Bojanowski et al., 2017) and sembei (Oshikiri, 2017) as word embedding baselines. For sentence embedding, we first test simple baselines obtained by averaging the word vectors over a word-segmented sentence. In addition, we examine several recent successful sentence embedding methods, pv-dbow and pv-dm (Le and Mikolov, 2014) and sent2vec (Pagliardini et al., 2018), in an extrinsic task. Note that both scne and sembei have embeddings of frequent character n-grams as their model parameters, but the differences come from training strategies, such as embedding targets and the way of collecting co-occurrence information (see Section 2.1 for more details). To contrast scne with sembei, we also propose a variant of sembei (denoted by sembei-sum) as one of the baselines, which composes word and sentence embeddings by simply summing up the embeddings of their sub-n-grams, which are learned by sembei.

Hyperparameters Tuning: To see the effect of rich resources for the segmentation-dependent baselines, we employ a widely-used word segmenter with two settings: using only a basic dictionary (basic) or additionally using a rich dictionary (rich). The dimension of embeddings is 200, the number of epochs is 10 and the number of negative samples is 10 for all the methods. The n-gram vocabulary size M = 2 × 10^6 is used for sisg, sembei and scne. The other hyperparameters, such as the learning rate and n_max, are carefully adjusted via a grid search on the validation set. In the word similarity task, 2-fold cross validation is used for evaluation. In the sentence similarity task, we use the provided validation set. In the downstream tasks, vector representations are combined with a supervised logistic regression classifier. We repeat training and testing of the classifier 10 times, while the prepared dataset is randomly split into train (60%) and test (40%) sets each time, and the hyperparameters are tuned by 3-fold cross validation on the train set. We adopt mean accuracy as the evaluation metric. See Appendix A.1 for more experimental details.

3.2 Word and Sentence Similarity

We measure the ability of models to capture semantic similarity for words and sentences in Chinese; see Appendix A.2 for the experiment in Japanese. Given a set of word pairs, or sentence pairs, and their human-annotated similarity scores, we calculated Spearman's rank correlation between the cosine similarities of the embeddings and the scores. We use the datasets of Jin and Wu (2012) and Wang et al. (2017) for Chinese word and sentence similarity, respectively. Note that the conventional models, such as skipgram, cannot provide embeddings for OOV words, while the compositional models, such as sisg and scne, can compute them by using their subword modeling. In order to show comparable results, we use the null vector for these OOV words, following Bojanowski et al. (2017).

Table 1: Spearman rank correlations of the word similarity task on two different Chinese corpora. Best scores are boldface and 2nd best scores are underlined.

        skipgram_rich  sisg_rich  sembei  sembei-sum  scne
Wiki.   51.0           59.2       49.0    48.6        62.2
SNS     41.3           47.0       38.9    41.5        60.0
Diff.   -9.7           -12.2      -10.1   -7.1        -2.2

Figure 4: Word (left) and sentence (right) similarity tasks on portions of the Chinese Wikipedia corpus (Spearman rank correlation against corpus size in MB).

Results: To see the effect of the training corpus size, we train all models on portions of Wikipedia (10, 50, 100, 300MB from the head). The results are shown in Figure 4. As can be seen, the proposed scne is competitive with or outperforms the baselines for both the word and sentence similarity tasks. Moreover, it is worth noting that scne provides high-quality representations even when the size of the training corpus is small, which is crucial for practical real-world settings where rich data is not available. In the next experiment, to see the effect of the noisiness of the training corpus, we test both a noisy SNS corpus and a Wikipedia corpus of the same size (100MB of Sina Weibo posts for the Chinese SNS corpus and 100MB of the Chinese Wikipedia corpus). The results are reported in Table 1. As can be seen, the performance of the segmentation-dependent methods (skipgram, sisg) decreases greatly with the noisiness of the corpus, while scne degrades only marginally. The other two segmentation-free methods (sembei, sembei-sum) performed poorly. This shows the efficacy of our method on noisy texts. On the other hand, in preliminary experiments on English (not shown), scne did not obtain better results than our segmentation-dependent baselines, and it will be future work to incorporate easily obtainable word boundary information into scne for segmented languages.
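For reference, the intrinsic evaluation protocol of this subsection reduces to a few lines. The sketch below uses our own helper names, with `embed` standing for any of the trained models (returning the null vector for OOV items under the non-compositional baselines).

```python
# Word/sentence similarity evaluation: Spearman's rank correlation between
# cosine similarities of the embeddings and human-annotated scores.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_correlation(pairs, human_scores, embed):
    """pairs: list of (item1, item2); embed(item) -> vector."""
    sims = [cosine(embed(p), embed(q)) for p, q in pairs]
    return spearmanr(sims, human_scores).correlation
```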

Table 2: Noun category prediction accuracies (higher is better) and coverages [%] (in parentheses, higher is better).

                  Wikipedia corpora                                                        Noisy SNS corpora
                  Chinese            Japanese           Korean              Chinese            Japanese           Korean
                  All      Intersec. All      Intersec. All      Intersec.  All      Intersec. All      Intersec. All      Intersec.
skipgram_basic    8.9 (11)    81.0   7.8 (10)    75.7   11.5 (15)   77.1    2.9 (5)     58.2   2.9 (7)     41.4   2.4 (7)     35.4
skipgram_rich     9.5 (12)    81.0   16.7 (20)   75.8   11.9 (15)   76.9    3.0 (5)     58.2   4.1 (9)     40.9   2.5 (7)     34.2
sisg_basic        79.2 (100)  82.3   72.2 (100)  75.7   72.2 (100)  76.2    71.0 (100)  64.8   67.1 (100)  46.9   63.4 (100)  39.8
sisg_rich         79.5 (100)  82.2   73.3 (100)  74.7   72.4 (100)  76.6    70.8 (100)  64.9   67.5 (100)  46.0   63.3 (100)  37.7
sembei            21.8 (25)   79.0   18.2 (23)   70.1   14.2 (19)   41.8    4.5 (7)     59.6   4.9 (10)    41.9   5.0 (13)    33.7
sembei-sum        76.8 (100)  74.2   69.9 (100)  61.3   66.3 (100)  56.0    72.3 (100)  56.4   66.3 (100)  40.7   64.3 (100)  34.8
scne (Proposed)   79.8 (100)  81.5   73.9 (100)  74.0   73.2 (100)  73.9    74.9 (100)  65.0   68.1 (100)  47.6   65.3 (100)  38.2
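The supervised protocol behind Table 2 and Section 3.3 below can be sketched as follows. The helper names `embed` and `covered` are assumptions for illustration; the setup mirrors Section 3.1: a logistic regression classifier with C tuned by cross validation, OOV nouns skipped in training and counted as errors in testing.

```python
# Sketch of the noun category prediction evaluation: OOV nouns are skipped
# when training the classifier and counted as errors at test time, which is
# why the "All" accuracies depend on coverage, while "Intersec." restricts
# evaluation to nouns covered by every model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def category_accuracy(train, test, embed, covered):
    """train/test: lists of (noun, category); embed(noun) -> vector."""
    X = np.array([embed(w) for w, _ in train if covered(w)])
    y = [c for w, c in train if covered(w)]
    clf = GridSearchCV(LogisticRegression(max_iter=1000),
                       {"C": [0.1, 0.5, 1, 5, 10]}, cv=3).fit(X, y)
    hits = sum(covered(w) and clf.predict([embed(w)])[0] == c for w, c in test)
    return hits / len(test)   # uncovered test nouns count as errors
```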

3.3 Noun Category Prediction

As a word-level downstream task, we conduct noun category prediction on Chinese, Japanese and Korean (although Korean has spacing, word boundaries are not obviously determined by spaces). Most settings are the same as those of Oshikiri (2017). Noun words and their semantic categories are extracted from Wikidata (Vrandečić and Krötzsch, 2014) with a predetermined semantic category set (food, song, music band name, manga, fictional character name, television series, drama, chemical compound, disease, taxon, city, island, country, year, business enterprise, public company, profession, university, language, book), and the classifier is trained to predict the semantic category of words from the learned word representations, where unseen words are skipped in training and treated as errors in testing. To see the effect of the noisiness of corpora, both a noisy SNS corpus and a Wikipedia corpus of the same size are examined as training corpora (for each language, we use 100MB of Wikipedia and SNS data; for the SNS data, we use Sina Weibo for Chinese and Twitter for the rest).

Results: The results are reported in Table 2. Since the set of covered nouns (i.e., non-OOV words) depends on the methods, we calculate accuracies in two ways for a fair comparison: using all the nouns and using the intersection of the covered nouns. scne achieved the highest accuracies in all the settings when using all the nouns, and also performed well when using the intersection of the covered nouns, especially for the noisy corpora.

Table 3: Sentiment classification accuracies [%].

                 Segmentation-free   Chinese   Japanese   Korean
pv-dbow_basic                        82.84     85.24      84.16
pv-dbow_rich                         83.47     85.55      84.80
pv-dm_basic                          76.96     80.67      66.35
pv-dm_rich                           77.94     81.37      67.32
sent2vec_basic                       85.09     87.12      82.31
sent2vec_rich                        85.39     87.20      82.34
skipgram_basic                       85.79     86.76      84.06
skipgram_rich                        85.77     87.16      84.48
sisg_basic                           85.67     87.22      84.34
sisg_rich                            85.04     87.25      84.35
sembei-sum       ✓                   83.41     80.80      74.98
scne_nmax=8      ✓                   87.07     87.42      84.15
scne_nmax=16     ✓                   87.76     88.03      86.74

3.4 Sentiment Analysis

As a sentence-level evaluation, we perform sentiment analysis on movie review data. We use 101k, 56k and 200k movie reviews and their scores respectively from Chinese, Japanese and Korean movie review websites (see Appendix A.1.6 for more details). Each review is labeled as positive or negative by its rating score. Sentence embedding models are trained using the whole movie reviews as the training corpus. Among the reviews, 5k positive and 5k negative reviews are randomly selected, and the selected reviews are used to train and test the classifiers as explained in Section 3.1.

Results: The results are reported in Table 3. The accuracies show that scne is also very effective in the sentence-level application. In this experiment, we observe that a larger n_max contributes to the performance improvement in the sentence-level application by allowing our model to capture composed representations for longer phrases or sentences.

4 Conclusion

We proposed a simple yet effective unsupervised method to acquire general-purpose vector representations of words, phrases and sentences seamlessly, which is especially useful for languages whose word boundaries are not obvious, i.e., unsegmented languages. Although our method does not rely on any manually annotated resources or a word segmenter, our extensive experiments show that our method outperforms the conventional approaches that depend on such resources.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful advice. This work was partially supported by JSPS KAKENHI grants 16H02789 to HS and 18J15053 to KF.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5(1):135–146.
Daan van Esch. 2012. Leiden Weibo Corpus.
J.R. Firth. 1957. A synopsis of linguistic theory 1930-1955. Studies in linguistic analysis, pages 1–32.
Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 2989–2993. European Language Resources Association.
Heewon Jeon. 2016. NIA (National Information Society Agency) Dictionary.
Peng Jin and Yunfang Wu. 2012. SemEval-2012 Task 4: Evaluating Chinese Word Similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 374–377. Association for Computational Linguistics.
Sun Junyi. 2013. Jieba.
Geewook Kim, Kazuki Fukui, and Hidetoshi Shimodaira. 2018. Word-like character n-gram embedding. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 148–152. Association for Computational Linguistics.
Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196. PMLR.
Mu Li, Jianfeng Gao, Changning Huang, and Jianfeng Li. 2003. Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17, SIGHAN '03, pages 1–7. Association for Computational Linguistics.
Xiao Luo, Maosong Sun, and Benjamin K. Tsou. 2002. Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7. Association for Computational Linguistics.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
Takamasa Oshikiri. 2017. Segmentation-Free Word Embedding for Unsegmented Languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 767–772. Association for Computational Linguistics.
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540. Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.
Itsumi Saito, Kugatsu Sadamitsu, Hisako Asano, and Yoshihiro Matsuo. 2014. Morphological Analysis for Japanese Noisy Text based on Character-level and Word-level Normalization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1773–1782. Dublin City University and Association for Computational Linguistics.
Yuya Sakaizawa and Mamoru Komachi. 2018. Construction of a Japanese Word Similarity Dataset. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 948–951. European Language Resources Association.
Toshinori Sato. 2015. Neologism dictionary based on the language resources on the Web for Mecab.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
Yan Shao, Christian Hardmeier, and Joakim Nivre. 2018. Universal Word Segmentation: Implementation and Interpretation. Transactions of the Association for Computational Linguistics, 6:421–435.
Chengjie Sun, Chang-Ning Huang, Xiaolong Wang, and Mu Li. 2005. Detecting Segmentation Errors in Chinese Annotated Corpus. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 1–8. Association for Computational Linguistics.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledge Base. Communications of the ACM, 57:78–85.
Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2017. Exploiting Word Internal Structures for Generic Chinese Sentence Representation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 298–303. Association for Computational Linguistics.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding Words and Sentences via Character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504–1515. Association for Computational Linguistics.
Jia Xu, Richard Zens, and Hermann Ney. 2004. Do We Need Chinese Word Segmentation for Statistical Machine Translation? In ACL SIGHAN Workshop 2004, pages 122–128. Association for Computational Linguistics.
Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural Word Segmentation with Rich Pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 839–849. Association for Computational Linguistics.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.
Jinman Zhao, Sidharth Mudgal, and Yingyu Liang. 2018. Generalizing Word Embeddings using Bag of Subwords. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 601–606. Association for Computational Linguistics.

A Appendices

A.1 Experimental Details

A.1.1 Hyperparameters Tuning

For skipgram, we performed a grid search over (h, γ) ∈ {1, 5, 10} × {0.01, 0.025}, where h is the size of the context window and γ is the initial learning rate. For sisg, we performed a grid search over (h, γ, n_min, n_max) ∈ {1, 5, 10} × {0.01, 0.025} × {1, 3} × {4, 8, 12}, where h is the size of the context window, γ is the initial learning rate, n_min is the minimum length of character n-gram and n_max is the maximum length of character n-gram. For pv-dbow, pv-dm and sent2vec, we performed a grid search over (h, γ) ∈ {5, 10} × {0.01, 0.05, 0.1, 0.2, 0.5}, where h is the size of the context window and γ is the initial learning rate. For sembei and scne, we used the initial learning rate 0.01 and n_min = 1. The maximum length of n-gram to consider, n_max, is grid searched over {4, 6, 8} in the word and sentence similarity tasks. In the noun category prediction task, we used n_max = 8 for sembei, and the n_max of scne is grid searched over {4, 6, 8}. For the sentiment analysis task, we tested both n_max = 8 and n_max = 16 for sembei and scne to see the effect of a large n_max. After carefully monitoring the loss curve and the performance in the word and sentence similarity tasks, we set the number of epochs to 10 for all methods. In preliminary experiments, we also tested 20 epochs for the word-segmentation-dependent baselines, but there were no significant differences. In the two supervised downstream tasks, the learned vector representations are combined with the logistic regression classifier. The parameter C, which is the inverse of the regularization strength of the classifier, is adjusted via a grid search over C ∈ {0.1, 0.5, 1, 5, 10}. Again, as explained in the main paper, the hyperparameters are grid searched on the determined validation set for all experiments.

A.1.2 Implementations

Here we provide the list of implementations of baselines which are used in our experiments. For skipgram (https://code.google.com/archive/p/word2vec/), sisg (https://github.com/facebookresearch/fastText), sembei (https://github.com/oshikiri/w2v-sembei), and sent2vec (https://github.com/epfml/sent2vec), we use the official implementations provided by the authors. Meanwhile, for pv-dbow and pv-dm, we employ a widely-used implementation from the Gensim library (https://radimrehurek.com/gensim/models/doc2vec.html).

A.1.3 Word Segmenters and Word Dictionaries for Unsegmented Languages

Below we list the word segmentation tools and word dictionaries which are used in our experiments. We employed a widely-used word segmentation tool for each language. For Chinese, we used jieba (https://github.com/fxsjy/jieba) with its default dictionary (https://github.com/fxsjy/jieba/blob/master/jieba/dict.txt) or with an extended dictionary (https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big), which fully supports both traditional and simplified Chinese characters. For Japanese, we used MeCab (http://taku910.github.io/mecab/) with its default dictionary called IPADIC, along with a specially designed neologism-extended dictionary called mecab-ipadic-NEologd (https://github.com/neologd/mecab-ipadic-neologd). Note that, because this extended dictionary is specially designed to include many neologisms, there is a significant word coverage improvement from using this dictionary, as can be seen in the Japanese noun category prediction task in the main paper. For Korean, we used mecab-ko (https://bitbucket.org/eunjeon/mecab-ko) with its default dictionary called mecab-ko-dic (https://bitbucket.org/eunjeon/mecab-ko-dic) along with another extended dictionary called NIADic (https://github.com/haven-jeon/NIADic).

A.1.4 Training Corpora

We prepared Wikipedia corpora and SNS corpora for Chinese, Japanese and Korean for our experiments. For the Wikipedia corpora, we used the first 10, 50, 100, 200 and 300MB of texts from the publicly available Wikipedia dumps (https://dumps.wikimedia.org/). The texts are extracted using the WikiExtractor tool (https://github.com/attardi/wikiextractor). For the Chinese SNS corpus, we used 100MB of the Leiden Weibo Corpus (van Esch, 2012) from the head. For the Japanese and Korean SNS corpora, we collected Japanese and Korean tweets using the Twitter Streaming API. We removed usernames and URLs from the SNS corpora. There were many informal words, emoticons and misspellings in the SNS corpora. We preserved them without preprocessing to see the effect of the noisiness of the training corpora in our experiments.

A.1.5 Preprocess of Wikidata

For the noun category prediction task, we extracted noun words and their semantic categories from Wikidata (Vrandečić and Krötzsch, 2014) following Oshikiri (2017). We determined the semantic category set used in our experiments as follows: First, we collected Wikidata objects that have Chinese, Japanese, Korean and English labels. Next, we sorted the categories by the number of noun words, and removed categories (e.g., Wikimedia category or Wikimedia template) that do not represent any semantic category. We also removed several categories that contain too many noun words (e.g., human) or too few noun words (e.g., academic discipline). Since there were several duplicated labels for different Wikidata objects, the number of nouns for each language is slightly different. Each category has at least 0.1k words and no more than 5k words. The numbers of extracted noun words used in our experiments were 22,468, 22,396 and 22,298 for Chinese, Japanese and Korean, respectively.

A.1.6 Movie Review Datasets

In the main paper, three movie review datasets are used to evaluate the quality of sentence embeddings. We used 101,114, 55,837 and 200,000 movie reviews and their rating scores from Yahoo奇摩電影 (https://github.com/fychao/ChineseMovieReviews), Yahoo!映画 (https://github.com/dennybritz/sentiment-analysis/tree/master/data) and Naver Movies (https://github.com/e9t/nsmc) for Chinese, Japanese and Korean, respectively.

A.2 Additional Experiment on Japanese

In this section, we show the results of Japanese word similarity experiments. We use the dataset of Sakaizawa and Komachi (2018), which contains 4,427 pairs of words with human similarity scores. We omit the sentence similarity task since there is no public, widely-used benchmark dataset for Japanese yet. Following the main paper, given a set of word pairs and their human-annotated similarity scores, we calculated Spearman's rank correlation between the cosine similarities of the embeddings and the human scores. We use 2-fold cross validation for hyperparameter tuning. The same grid search is performed as explained in Section A.1.1. To see the effect of the noisiness of the training corpora, we use two Japanese corpora, 100MB of a Wikipedia corpus and 100MB of a noisy SNS corpus (Twitter), which are also used in the Japanese noun category prediction task in the main paper. As seen in Table 4, the experimental results for Japanese are similar to those for Chinese in the main paper.

Table 4: Spearman rank correlations of the word similarity task on two different Japanese corpora.

        skipgram_rich  sisg_rich  sembei  sembei-sum  scne
Wiki.   8.3            15.4       4.0     9.3         24.1
SNS     5.3            12.7       2.8     9.3         23.0
Diff.   -3.0           -2.7       -1.2    -0.0        -1.1
