Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Bag-of-Embeddings for Text Classification

Peng Jin 1, Yue Zhang 2, Xingyuan Chen 1,*, Yunqing Xia 3
1 School of [...], Leshan Normal University, Leshan, China, 614000
2 Singapore University of Technology and Design, Singapore 487372
3 Search Technology Center, Microsoft, Beijing, China, 100087
* Corresponding author: Xingyuan Chen ([email protected])

Abstract

Words are central to text classification. It has been shown that simple Naive Bayes models with word and bigram features can give highly competitive accuracies when compared to more sophisticated models with part-of-speech, syntax and semantic features. Embeddings offer distributional features about words. We study a conceptually simple classification model that exploits multi-prototype word embeddings based on text classes. The key assumption is that words exhibit different distributional characteristics under different text classes. Based on this assumption, we train multi-prototype distributional word representations for different text classes. Given a new document, its text class is predicted by maximizing the probabilities of the embedding vectors of its words under the class. On two standard classification benchmark datasets, one balanced and the other imbalanced, our model outperforms state-of-the-art systems in both accuracy and macro-average F-1 score.

1 Introduction

Text classification is important for a wide range of web applications, such as web search [Chekuri et al., 1997], opinion mining [Vo and Zhang, 2015], and event detection [Kumaran and Allan, 2004]. Dominant approaches in the literature treat text classification as a standard classification problem, using supervised or semi-supervised methods. A key research topic is the design of effective feature representations.

Words are central to text classification. Bag-of-words models [Harris, 1954], such as Naive Bayes [Mccallum and Nigam, 1998], can give highly competitive baselines compared with much more sophisticated models with complex feature representations. The main intuition is that there is typically a salient set of words that signals each document class. For example, in news the words "coach", "sport" and "basketball" occur relatively frequently in sports, while the words "chipset", "compiler" and "Linux" are relatively unique to information technology.

More sources of information have been explored for text classification, including parts of speech [Lewis, 1995], syntax structures [Post and Bergsma, 2013; Tetsuji et al., 2010] and semantic compositionality [Moilanen and Pulman, 2007]. However, such features have demonstrated limited gains over bag-of-words features [Wang and Manning, 2012]. One useful feature beyond bag-of-words is bag-of-ngrams. Wang and Manning [2012] show that bigram features are particularly useful for sentiment classification. Bigrams offer a certain degree of compositionality while being relatively less sparse compared with larger n-gram features. For example, the bigram "abnormal return" strongly indicates finance, although both "abnormal" and "return" can be common across different classes. Similar examples include "world cup" and "large bank", where the bigrams indicate text classes but the individual words do not.

One intuitive reason behind the strength of bigrams is that they resolve the ambiguity of polysemous words. In the above examples the words "return", "cup" and "bank" have different meanings under different document classes, and the correct identification of their word sense under an n-gram context is useful for identifying the document class. For example, when the word "bank" occurs in a context with words such as "card" and "busy", it strongly indicates the "finance" sense. This fact suggests a simple extension to bag-of-word features that incorporates context and word sense information. We propose a natural extension to the skip-gram word embedding model [Mikolov et al., 2013] to this end.

Word embeddings are low-dimensional, dense vector representations of words, first proposed in neural language models [Bengio et al., 2003]. Traditionally, embeddings are trained during the training of a neural language model as part of the model parameters. Mikolov et al. [2013] define specific objective functions for efficient training of word embeddings, by simplifying the original training objective of a neural language model. The skip-gram model trains word embeddings by maximizing the probabilities of words given their context windows. Two sets of embeddings are defined for the same word, as a target word and as a context word, respectively. The probability of a target word is estimated from the cosine similarities between its target embedding and the context embeddings of its context words. This offers a way to estimate word probability via the embedding probability that readily integrates context information.

To further integrate word sense information, we make a simple extension by training multi-prototype target word embeddings, with one distinct vector trained for a word under each class. The context vectors of words remain the same across different classes. By associating word senses with document classes, we make the assumption that each word exhibits one sense in each document class, and that the sense of a word differs across different classes. This assumption is highly coarse-grained and does not correspond to the definition of word senses in linguistics. However, it works effectively in practice, and we find that the definition of sense here can capture subtle differences between word meanings across different document classes.

Under the above assumptions, the probability of a class given a document can be calculated from the probabilities of the embeddings of each word under this class. Since the probability of each word embedding is calculated separately, we call this model a bag-of-embeddings model. Training requires text corpora with class labels, some of which are obtained from naturally labeled data [Go et al., 2009] and some of which are hand-labeled [Lewis, 1995]. The bag-of-embeddings model is conceptually simple, with the only parameters being word embeddings. We show that maximum-likelihood training for document classification is consistent with the skip-gram objective for training multi-prototype word embeddings.

Experiments on two standard document classification benchmarks show that our model achieves higher accuracies and macro-F1 scores than state-of-the-art models. Our method achieves the best reported accuracies for both the balanced and the imbalanced dataset. The source code of this paper is released at https://github.com/hiccxy/Bag-of-embedding-for-text-classification.
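To make this parameterization concrete before the formal model in Section 3, the sketch below shows the two sets of vectors the model maintains: one context vector per word, shared across classes, and one target vector per (word, class) pair. This is our own illustrative Python, with toy words, class names and initialisation; it is not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100                                   # the paper uses 100-dimensional vectors
vocab = ["bank", "card", "cup", "team", "win"]
classes = ["finance", "sports"]

# One context (input) vector per word, shared across all classes.
context_vec = {w: rng.normal(scale=0.1, size=dim) for w in vocab}

# One target (output) vector per (word, class) pair: the multi-prototype part.
target_vec = {(w, c): rng.normal(scale=0.1, size=dim)
              for w in vocab for c in classes}

# "bank" gets a distinct target vector under each class, but a single context vector.
assert target_vec[("bank", "finance")] is not target_vec[("bank", "sports")]
assert context_vec["bank"].shape == (dim,)
```

The number of parameters therefore grows with the number of classes only on the target side; the shared context vectors tie the class-specific spaces together.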
2 Related Work

Text classification has traditionally been solved as a standard classification task, using approaches such as Naive Bayes [Mccallum and Nigam, 1998], logistic regression [Nigam et al., 1999], and support vector machines [Joachims, 1998]. Word features, and in particular bag-of-words features, have been used for classification. There have been research efforts to incorporate more complex features into text classification models. Lewis [1995] uses parts-of-speech and phrase information for text categorization; Tetsuji et al. [2010] use dependency tree features for sentiment classification; Post and Bergsma [2013] integrate syntax into text classification models by explicit features and implicit kernels; Moilanen and Pulman [2007] model semantic compositionality for sentiment classification. Wang and Manning [2012] show that bag-of-words and bigram features can give competitive accuracies compared with such more complex features.

Distributed word representations [Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013] have typically been used as additional features in discrete models for semi-supervised learning, or as inputs to neural network models. For text classification, word embeddings have mostly been used as inputs to neural models such as recursive tensor networks [Socher et al., 2011], dynamic pooling networks [Kalchbrenner et al., 2014] and deep convolutional neural networks [Santos and Gatti, 2014]. There has also been work on directly learning distributed vector representations of paragraphs and sentences [Le and Mikolov, 2014], which has been shown to be as useful as the above-mentioned neural network models for text classification. Such neural networks result in vector representations of text data. Along this line, Tang et al. [2014] is the closest in spirit to our work. They learn text embeddings specifically for end tasks, such as classification. We also learn word embeddings specifically for text classification. However, rather than learning one vector for each word, as Tang et al. [2014] do, we learn multi-prototype embeddings, with certain words having multiple vector forms according to the text class. In addition, we have a very simple document model, with the only parameters being word vectors. This simplicity demonstrates the effectiveness of incorporating text class information into distributed word representations.

Our work is also related to prior work on multi-prototype word embeddings [Reisinger and Mooney, 2010; Huang et al., 2012; Tian et al., 2014]. Previous methods define word prototypes according to word senses from ontologies, or induce word senses automatically. They share a more linguistic focus on word similarities. In contrast, our objective is to improve text classification performance, and hence we train multi-prototype embeddings based on text classes.

3 Method

3.1 The Skip-gram Model

Our bag-of-embeddings model extends the skip-gram model [Mikolov et al., 2013], which is a simplification of neural language models for efficient training of word embeddings. The skip-gram model works by maximizing the probabilities of words being predicted by their context words.

In particular, two sets of embeddings are defined for each word w, for when w is used as the output target word and as the input context word, respectively. We use \vec{v}(w) and \vec{u}(w) to denote the target (output) embedding vector and the context (input) embedding vector of w, respectively.

Given a target word w and a context word w' of w, the probability of \vec{v}(w) given \vec{u}(w') is defined based on the cosine similarity between \vec{v}(w) and \vec{u}(w'), as follows:

P(\vec{v}(w) \mid \vec{u}(w')) = \frac{e^{\vec{v}(w) \cdot \vec{u}(w')}}{\sum_{w'' \in V} e^{\vec{v}(w'') \cdot \vec{u}(w')}},

where V is the vocabulary.

The model can be regarded as a crude approximation of a language model, estimating the probability of a sentence s = w_1 w_2 \ldots w_{|s|} by

P(s) = \prod_{i=1}^{|s|} P(\vec{v}(w_i) \mid t(w_i, s)),

where t(w_i, s) is the context of w_i in s, which typically includes the T words before w_i and the T words after w_i. P(\vec{v}(w_i) \mid t(w_i, s)) is estimated by predicting \vec{v}(w_i) using each word w' in t(w_i, s):

P(\vec{v}(w_i) \mid t(w_i, s)) = \prod_{w' \in t(w_i, s)} P(\vec{v}(w_i) \mid \vec{u}(w')) = \prod_{w' \in t(w_i, s)} \frac{e^{\vec{v}(w_i) \cdot \vec{u}(w')}}{\sum_{w'' \in V} e^{\vec{v}(w'') \cdot \vec{u}(w')}}   (1)

This approximation does not give a highly accurate language model, but can be used to train word embeddings efficiently. The resulting embeddings can be used as input to train more sophisticated neural language models or to solve other NLP tasks [Mikolov et al., 2013].

Figure 1: Illustration of Embeddings.

Training is achieved by maximizing the likelihood of raw text using stochastic gradient descent, which is equivalent to maximizing \log P(\vec{v}(w_i) \mid \vec{u}(w')) for all w_i and w' in a corpus. Because the gradient of the probability in Equation (1) requires a summation over the vocabulary, the skip-gram model uses a rough approximation to the noise contrastive estimation (NCE) method (Gutmann and Hyvarinen, 2010), which instead maximizes

\log \sigma(\vec{v}(w_i) \cdot \vec{u}(w')) + \sum_{j=1}^{l} E_{w_j \sim P_n(w)} [\log \sigma(-\vec{v}(w_j) \cdot \vec{u}(w'))]   (2)

Here \sigma is the sigmoid activation function, P_n(w) is a distribution for negative samples, which is typically the unigram distribution raised to the 3/4th power (Mikolov et al., 2013), and l is the number of negative samples, set to a small number below 20. Equation (2) approximates the log probability \log P(\vec{v}(w_i) \mid \vec{u}(w')) by contrastive estimation between seen words and negative samples from a noise distribution; as l \to \infty, its gradient closely approximates the gradient of \log P(\vec{v}(w_i) \mid \vec{u}(w')).
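Equation (2) can be read directly as code. The sketch below (our own illustration, not the authors' implementation) evaluates the negative-sampling objective for a single (target, context) pair, drawing negatives from a toy unigram distribution raised to the 3/4 power; the vectors and counts are made-up placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_distribution(unigram_counts):
    """Unigram distribution raised to the 3/4 power, as used in Equation (2)."""
    words = list(unigram_counts)
    probs = np.array([unigram_counts[w] for w in words], dtype=float) ** 0.75
    return words, probs / probs.sum()

def negative_sampling_objective(v_target, u_context, neg_target_vectors):
    """log sigma(v.u) + sum_j log sigma(-v_j.u) for one observed (target, context) pair."""
    obj = np.log(sigmoid(v_target @ u_context))
    for v_neg in neg_target_vectors:
        obj += np.log(sigmoid(-(v_neg @ u_context)))
    return obj

# Toy usage with random vectors and a tiny count table (purely illustrative).
rng = np.random.default_rng(0)
dim, l = 100, 5
target = {w: rng.normal(scale=0.1, size=dim) for w in ["win", "game", "bank", "card"]}
context = {w: rng.normal(scale=0.1, size=dim) for w in target}
words, probs = noise_distribution({"win": 50, "game": 30, "bank": 15, "card": 5})
negatives = [target[w] for w in rng.choice(words, size=l, p=probs)]
print(negative_sampling_objective(target["win"], context["game"], negatives))
```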
3.2 The Bag-of-embeddings Model for Text Classification

The goal of text categorization is the classification of a given document d \in D into a fixed number of predefined categories C, where D is the set of documents. Although, according to Joachims [1998], each document d can be in multiple, exactly one, or no category at all, in this paper we simplify the task to finding the most likely class for d. We find this class by using

c = \arg\max_{c \in C} P(c \mid d)

We estimate the class probability P(c \mid d) by estimating the probability of d's vector form under c, denoted as \vec{v}(d, c). Denoting the words in d as w_1, w_2, \ldots, w_{|d|}, where w_i \in V, \vec{v}(d, c) consists of the class-specific target embedding sequence of w_1, w_2, \ldots, w_{|d|}:

\vec{v}(d, c) = [\vec{v}(w_1, c), \vec{v}(w_2, c), \ldots, \vec{v}(w_{|d|}, c)]

As a result,

P(c \mid d) = P(\vec{v}(d, c) \mid d) = P(\vec{v}(w_1, c), \vec{v}(w_2, c), \ldots, \vec{v}(w_{|d|}, c) \mid d)

Further assuming that the class-specific target embeddings are conditionally independent (i.e. a bag of embeddings), it follows that

P(c \mid d) = P(\vec{v}(w_1, c) \mid d) P(\vec{v}(w_2, c) \mid d) \cdots P(\vec{v}(w_{|d|}, c) \mid d)

Now, following the skip-gram model, assume that the probability of \vec{v}(w_i, c) depends only on the context window t(w_i, d) = w_{i-T}, w_{i-T+1}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+T} (i.e. 2T is the context window size). We then have

P(\vec{v}(w_i, c) \mid d) = P(\vec{v}(w_i, c) \mid t(w_i, d))

We use the skip-gram model to estimate this multi-prototype embedding probability. In particular, denoting the context embedding of a word w as \vec{u}(w), P(\vec{v}(w_i, c) \mid t(w_i, d)) is estimated by

P(\vec{v}(w_i, c) \mid t(w_i, d)) \approx \prod_{w' \in t(w_i, d)} P(\vec{v}(w_i, c) \mid \vec{u}(w'))

From the above equation, it can be seen that the bag-of-embeddings model is similar to the skip-gram model in the definition of target and context embeddings. However, it differs in that the target embeddings are class-dependent, and each word can have different target embeddings in different classes. On the other hand, each word has a unique context embedding across classes. Figure 1 gives an illustration of this difference.

Now, given P(\vec{v}(w_i, c) \mid t(w_i, d)), the document probability is

P(c \mid d) = P(\vec{v}(d, c) \mid d) \propto \prod_{w_i \in d} \prod_{w' \in t(w_i, d)} e^{\vec{v}(w_i, c) \cdot \vec{u}(w')}

Correspondingly,

\arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \sum_{w_i \in d} \sum_{w' \in t(w_i, d)} \vec{v}(w_i, c) \cdot \vec{u}(w')   (3)

The last equation corresponds to the estimation of the log probability, which is more efficient than the estimation of the conditional probability. The parameters \vec{u}(w) and \vec{v}(w, c_i) in Equation (3), which are used for prediction, are trained in a supervised way, as described in the following subsection.
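Equation (3) reduces prediction to an argmax over summed dot products. Below is a minimal sketch of this decision rule, assuming the class-specific target vectors and shared context vectors are stored in dictionaries as in the earlier sketch; the code is our own, with T set to 5 for the context window of 10 used later in Section 3.3.

```python
import numpy as np

def predict_class(doc_words, classes, target_vec, context_vec, T=5):
    """Equation (3): choose the class maximising sum_i sum_{w' in window} v(w_i, c) . u(w')."""
    scores = {c: 0.0 for c in classes}
    for i, w in enumerate(doc_words):
        window = doc_words[max(0, i - T):i] + doc_words[i + 1:i + 1 + T]
        for c in classes:
            v = target_vec.get((w, c))
            if v is None:
                continue                      # words unseen under a class contribute nothing
            for w_ctx in window:
                u = context_vec.get(w_ctx)
                if u is not None:
                    scores[c] += float(v @ u)
    return max(scores, key=scores.get)
```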

3.3 Training

Given a set of labeled documents D = {(d_1, c_1), (d_2, c_2), \ldots, (d_N, c_N)}, the bag-of-embeddings model is trained by maximizing its likelihood function

P(D) = \prod_{k=1}^{N} \prod_{w_i \in d_k} \prod_{w' \in t(w_i, d_k)} P(\vec{v}(w_i, c_k) \mid \vec{u}(w'))

By using stochastic gradient descent, maximizing P(D) consists of iteratively maximizing each P(\vec{v}(w_i, c_k) \mid \vec{u}(w')). This is consistent in form with the objective of the skip-gram model, which maximizes the probability of each word given each context word. The only difference is that the target word vector depends on the label class in the bag-of-embeddings model. We follow the skip-gram model and train P(\vec{v}(w_i, c_k) \mid \vec{u}(w')) using a simplification of NCE by negative sampling, where the objective is

\log \sigma(\vec{v}(w_i, c) \cdot \vec{u}(w')) + \sum_{j=1}^{l} E_{w_j \sim P_n(w)} [\log \sigma(-\vec{v}(w_i, c) \cdot \vec{u}(w_j))]

Consistent with the skip-gram model, \sigma is the sigmoid function, and P_n(w) is a noise distribution over words. Note, however, that in this equation the context word w' is sampled instead of the target word, because it is relatively more difficult to find a distribution of negative examples that includes the class label. P_n(w) is the unigram distribution raised to the 3/4th power, and all parameter settings are the same as those of Mikolov et al. [2013] except the iteration number, because our corpus is much smaller than theirs (empirically, the sampling method gives highly competitive accuracy when the iteration number is greater than 10, and the performance is very stable). For the parameters, we set l = 5, the iteration number to 5, the size of the context window to 10 and the dimension of the word vectors to 100.

Figure 2: Skip-gram model vs. bag-of-embeddings model.

3.4 Correlation with Skip-gram and Bag-of-words

The bag-of-embeddings model is closely related to the skip-gram model and the bag-of-words model. Compared with the bag-of-words model for document classification, the bag-of-embeddings model also makes the Naive Bayes assumption, by treating the target embedding vector of each word in the document as being independent of the others. On the other hand, it differs from the bag-of-words model in that it integrates context information. Rather than predicting the text class using unigrams alone, it relies on the distributional similarity between each target embedding and the context embeddings of the words in a context window. With respect to the feature space, the bag-of-embeddings model is a much richer model, with hundreds of dense parameters to represent each word. Our experiments show that it outperforms a bag-of-words model significantly.

The parameters of the bag-of-embeddings model are target and context embedding vectors. This is closely related to the skip-gram model. The main difference in terms of embeddings is that the bag-of-embeddings model integrates document class information, training multi-prototype embeddings for each word. The training methods of the skip-gram model and the bag-of-embeddings model are shown in Figure 2(a) and 2(b), respectively. Compared with the skip-gram model, the target embeddings of the bag-of-embeddings model contain class labels. On the other hand, the context embeddings remain single-prototype, which serves to connect the embedding space for target words between different classes. One additional difference between the bag-of-embeddings model and the skip-gram model is that the former is designed for document classification, while the latter is specifically designed to train embeddings.

In our experiments we find that the classification accuracies can be improved by 0.2% to 0.5% by adding w itself into the targets in Figure 2(b), and we therefore take this variation in choosing the context window t(w_i, d).
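The following sketch ties Sections 3.3 and 3.4 together: stochastic gradient ascent on the negative-sampling objective, with the target vector indexed by (word, label class) and negatives drawn on the context side. It is our own schematic under the stated hyper-parameters (l = 5, a window of 10, 100 dimensions); the learning rate is our placeholder, it is not the released implementation, and it assumes the embedding dictionaries are pre-initialised for every (word, class) pair and every word in the training data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bag_of_embeddings(docs, labels, target_vec, context_vec,
                            noise_words, noise_probs,
                            T=5, l=5, lr=0.025, epochs=5, seed=0):
    """Schematic SGD for Section 3.3: class-specific targets, shared contexts,
    negatives sampled on the context side from the 3/4-power unigram distribution."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for doc, c in zip(docs, labels):
            for i, w in enumerate(doc):
                v = target_vec[(w, c)]                 # class-specific target vector
                window = doc[max(0, i - T):i] + doc[i + 1:i + 1 + T]
                for w_ctx in window:
                    u = context_vec[w_ctx]             # shared context vector
                    g = lr * (1.0 - sigmoid(v @ u))    # gradient of log sigma(v.u)
                    dv, du = g * u, g * v
                    v += dv
                    u += du
                    for w_neg in rng.choice(noise_words, size=l, p=noise_probs):
                        u_neg = context_vec[w_neg]
                        g = -lr * sigmoid(v @ u_neg)   # gradient of log sigma(-v.u_neg)
                        dv, du_neg = g * u_neg, g * v
                        v += dv
                        u_neg += du_neg
    return target_vec, context_vec
```

Because the updates are applied in place to the dictionary-stored arrays, the trained vectors can be passed directly to the prediction sketch after Equation (3).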
4 Experiments

We evaluate our model on two multi-class text classification datasets, one imbalanced and the other nearly balanced. The experimental setup is as follows.

4.1 Experiment Setup

Data Sets. We choose the twenty newsgroups (20NG) dataset [Lang, 1995] (http://qwone.com/~jason/20Newsgroups/) for multi-class classification. We use the "bydate" data, which consist of 11,314 training instances and 7,532 test instances, partitioned (nearly) evenly across the classes. For training, the largest class has 600 instances and the smallest class has 377 instances.

For imbalanced classification, Lewis [1995] introduced the Reuters-21578 corpus (http://www.daviddlewis.com/resources/testcollections/reuters21578/). We choose this benchmark dataset following Debole and Sebastiani [2005] and perform the R8 evaluation, which requires that each class has at least one training and one test example, and in which only the most frequent eight categories (i.e. R8) are evaluated. R8 consists of 5,485 documents for training and 2,189 for testing. For training, the largest class has 2,840 instances and the smallest class has 41 instances. The pre-processing is conducted by turning all letters to lowercase, removing all words which occur less than five times, and replacing all punctuation with spaces.
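For reproduction purposes, the 20NG "bydate" split with the counts quoted above is available through common toolkits; the snippet below uses scikit-learn's loader as one convenient option. This is our choice of tooling, not something stated in the paper, and R8 has to be assembled separately from the Reuters-21578 distribution.

```python
from sklearn.datasets import fetch_20newsgroups

# The "bydate" split: 11,314 training and 7,532 test documents over 20 classes.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))   # 11314 7532
print(train.target_names[:5])
```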

Evaluation Metrics. We use the standard accuracy (AC), defined as the proportion of correctly predicted documents among all test documents. Precision, recall and F-1 score are also used. The macro-average F-1 score (MF1) takes into account skewed class label distributions by weighting each class uniformly.

Baseline Models. We compare our model with several strong baselines, including support vector machines and two state-of-the-art text representation models.

- SVM: We take a bag-of-words baseline using an SVM, which outperforms the Naive Bayes model. Following Wang et al. [2013], we use LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) with a linear kernel and optimize the parameter C ∈ {1e-4, ..., 1e+4} by ten-fold cross-validation. Bag-of-words features are used to represent the documents. It is denoted as "BoW-SVM" in Table 1.

- TWE: Liu et al. [2015] employ latent topic models to assign topics to each word in the whole corpus, and learn topical word embeddings based on both words and their topics (i.e. each word has different embeddings under different topics). Three models are proposed; the best one, TWE-1, regards each topic as a pseudo word and learns topic embeddings and word embeddings separately. The model outperforms the paragraph vector model of Le and Mikolov [2014].

- GoW: Rousseau et al. [2015] propose three graph-of-words approaches to capture the distance dependency between words. A linear SVM is used as the classifier, and their best model serves as our baseline. It is denoted as "GoW-SVM" in Table 1.

Table 1: Overall results.

Model      20NG AC   20NG MF1   R8 AC   R8 MF1
BoW-SVM    79.0      78.3       94.7    85.1
TWE        81.5      80.6       --      --
GoW-SVM    --        --         95.5    86.4
BoE        83.1      82.7       96.5    88.6

Table 2: Class-level results on the balanced dataset (precision and recall, %).

Class name                  SVM Pre.   SVM Rec.   BoE Pre.   BoE Rec.
alt.atheism                 67.8       72.1       60.1       87.4
comp.graphics               67.1       73.5       67.2       78.4
comp.os.ms-windows.misc     77.1       66.5       70.3       74.6
comp.sys.ibm.pc.hardware    62.8       72.4       65.6       76.5
comp.sys.mac.hardware       77.4       78.2       76.8       81.8
comp.windows.x              83.2       73.2       87.1       83.5
misc.forsale                81.3       88.2       85.9       84.6
rec.autos                   80.7       82.8       91.8       90.2
rec.motorcycles             92.3       87.9       94.4       94.0
rec.sport.baseball          89.8       89.2       96.8       92.7
rec.sport.hockey            93.3       93.7       97.2       96.2
sci.crypt                   92.2       86.1       93.6       91.7
sci.electronics             70.9       73.3       85.3       64.9
sci.med                     79.3       81.3       95.5       79.8
sci.space                   90.2       88.3       90.5       86.6
soc.religion.christian      77.3       87.9       89.7       85.7
talk.politics.guns          71.7       85.7       77.1       90.7
talk.politics.mideast       91.7       76.9       95.4       88.8
talk.politics.misc          71.7       56.5       77.6       61.6
talk.religion.misc          63.2       55.4       74.1       62.1

Table 3: Class-level results on the imbalanced dataset. "Tr.#" denotes the number of training instances and "Te.#" the number of test instances.

Class      Tr.#    Te.#    SVM Pre.   SVM Rec.   BoE Pre.   BoE Rec.
acq        1,596   696     94.2       97.1       97.7       98.0
crude      253     121     93.3       92.6       97.4       94.2
earn       2,840   1,083   98.3       98.9       99.4       98.6
grain      41      10      100        70.0       75.0       90.0
interest   190     81      79.2       70.4       91.2       76.5
money-fx   206     87      75.3       63.2       81.3       89.7
ship       108     36      89.3       69.4       81.8       75.0
trade      251     75      86.4       93.3       80.2       97.3

4.2 Results

Table 1 summarizes the performance of our model and the baseline models, where "BoE" denotes our proposed bag-of-embeddings model. Our model outperforms the state-of-the-art models TWE and GoW on the balanced and the imbalanced dataset, respectively. For both datasets, the bag-of-embeddings model gives the best results that we are aware of in the literature.

Tables 2 and 3 show detailed comparisons between our bag-of-embeddings model and the bag-of-words SVM on the balanced and imbalanced datasets, respectively, where the higher precision or recall is shown in bold. Our model gives higher precision for most of the classes, and higher recall for nearly all of the classes. This demonstrates the effectiveness of the bag-of-embeddings model in leveraging context information. The higher recall also shows the advantage of dense features in reducing feature sparsity compared with discrete bag-of-words features.
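For reference, the accuracy and macro-averaged F1 reported above can be computed with any standard implementation; the snippet below uses scikit-learn on a toy prediction list (illustrative only; the labels shown are not the paper's outputs).

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["earn", "acq", "earn", "crude", "grain", "earn"]
y_pred = ["earn", "acq", "acq",  "crude", "grain", "earn"]

print(accuracy_score(y_true, y_pred))              # fraction of correctly predicted documents
print(f1_score(y_true, y_pred, average="macro"))   # macro-F1: every class weighted uniformly
```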
4.3 Analysis

To better understand the mechanism of disambiguation in the bag-of-embeddings model, we analyze the interactions between target and context vectors, and the correlation between target embedding vectors under different classes. Figures 3(a) and 3(b) show the probabilities of the words "win" and "runs", respectively, calculated using different context word embeddings. We choose the most popular context words from different classes, following Lacoste-Julien et al. [2008].


Figure 3: Target embedding probabilities calculated by different context vectors. (a) Target embeddings of "win"; (b) target embeddings of "runs".

The x-axis represents the target embeddings in the 20 classes of Table 2, in the same order. It can be seen from Figure 3 that the same word has highly different probabilities across different classes, reflected by the inner product of its class-sensitive target embedding and a context word embedding. In addition, context words have a large influence on the distributions. For example, the word "win" has a large probability under the computer graphics class when the context word is "image", yet large probabilities under the two sports classes when the context words are "baseball" and "hockey", respectively. This demonstrates how word and context information interact in the bag-of-embeddings model. In contrast, a bag-of-words model relies on the words themselves rather than on word-context relations for disambiguating document classes.

Within each class, the training of target embedding vectors is essentially the same as in the skip-gram model, and therefore the embeddings trained for the classification model should exhibit a certain degree of distributional similarity. Table 4 confirms this intuition. For example, under various classes, the closest words to the word "win" include "windows", "dos" and "won", which are synonyms or morphological variations of the word, and "use", "chance" and "game", which are syntactically or semantically related words in a distributed context. In addition, the closest words of the same word differ when the class changes, again showing the class sensitivity of the embeddings. For example, under class 3, namely comp.os.ms-windows.misc in Table 2, the word "win" is close to "windows", "dos" and "program". On the other hand, under class 10, rec.sport.baseball, "win" is close to "team", "pitching" and "game", demonstrating class-specific distributional similarity.

Table 4: Nearest neighbors given a target embedding in a specific class. "tw": target word, "wf": target word frequency, "No": class number.

tw     wf    No   neighbors
win    5     2    transmitting, editor, monochrome, xm, xlib
win    261   3    windows, use, just, dos, program
win    33    4    windows, running, tue, dos, gates
win    7     5    tom, battle, runs, situation, viewed
win    5     8    metro, creator, sell, gen, wagon
win    5     9    chance, taught, single, rider, demonstrating
win    188   10   team, pitching, won, games, game
win    276   11   vs, game, tired, leafs, series
runs   41    2    software, supports, platforms, unix, workstations
runs   14    3    non, applications, windows, dos, use
runs   18    4    installed, set, hard, install
runs   13    5    cpu, wondered, week, dx, breaks
runs   5     8    deeper, theoretically, stronger, lowly, steep
runs   18    9    through, just, bikes, bought, day
runs   234   10   scored, run, game, team, games
runs   13    11   bang, hitter, things, press, hype

5 Conclusion and Future Work

We built a text classifier by using a Naive Bayes model with bag-of-embeddings probabilities. Compared with bag-of-words models, the bag-of-embeddings model exploits contextual information by deriving the probabilities of class-sensitive embedding vectors from their inner products with context words. Our model is conceptually simple, with the only parameters being embedding vectors, trained using a variation of the skip-gram method. Experiments on two standard datasets showed that our model outperforms state-of-the-art methods for both balanced and imbalanced data.

Future work includes two directions. First, we would consider leveraging unlabeled data for semi-supervised learning [Tang et al., 2015]. Second, we would exploit neural document models, which contain richer parameter spaces by capturing word relations, for potentially better accuracies.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61373056, 61272221 and 61272233).

References

[Bengio et al., 2003] Yoshua Bengio, Rejean Ducharme, Vincent Pascal, and Christian Jauvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.

[Chekuri et al., 1997] Chandra Chekuri, Michael Goldwasser, Prabhakar Raghavan, and Eli Upfal. Web search using automatic classification. In Proc. of WWW, 1997.

[Collobert et al., 2011] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, 2011.

[Debole and Sebastiani, 2005] Franca Debole and Fabrizio Sebastiani. An analysis of the relative hardness of Reuters-21578 subsets. J. Am. Soc. Inf. Sci. Tec., 56(6):584–596, 2005.

[Go et al., 2009] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. Technical report, Dept. of Comp. Sci., Stanford Univ., 2009.

[Harris, 1954] Zellig Harris. Distributional structure. Word, 10(1):146–162, 1954.

[Huang et al., 2012] Eric Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proc. of ACL, 2012.

[Joachims, 1998] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. of ECML, 1998.

[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proc. of ACL, 2014.

[Kumaran and Allan, 2004] Giridhar Kumaran and James Allan. Text classification and named entities for new event detection. In Proc. of SIGIR, 2004.

[Lacoste-Julien et al., 2008] Simon Lacoste-Julien, Fei Sha, and Michael Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Proc. of NIPS, 2008.

[Lang, 1995] Ken Lang. Newsweeder: Learning to filter netnews. In Proc. of ICML, 1995.

[Le and Mikolov, 2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proc. of ICML, 2014.

[Lewis, 1995] David D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proc. of SIGIR, 1995.

[Liu et al., 2015] Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proc. of AAAI, 2015.

[Mccallum and Nigam, 1998] Andrew Mccallum and Kamal Nigam. A comparison of event models for naive Bayes text classification. In Proc. of AAAI Workshop on Learning for Text Categorization, 1998.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proc. of ICLR Workshop, 2013.

[Moilanen and Pulman, 2007] Karo Moilanen and Stephen Pulman. Sentiment composition. In Proc. of RANLP, 2007.

[Nigam et al., 1999] Kamal Nigam, John Lafferty, and Andrew Mccallum. Using maximum entropy for text classification. In Proc. of IJCAI Workshop on Machine Learning for Information Filtering, 1999.

[Post and Bergsma, 2013] Matt Post and Shane Bergsma. Explicit and implicit syntactic features for text classification. In Proc. of ACL, 2013.

[Reisinger and Mooney, 2010] Joseph Reisinger and Raymond Mooney. Multi-prototype vector-space models of word meaning. In Proc. of NAACL, 2010.

[Rousseau et al., 2015] Francois Rousseau, Emmanouil Kiagias, and Michalis Vazirgiannis. Text categorization as a graph classification problem. In Proc. of ACL, 2015.

[Santos and Gatti, 2014] Cicero Santos and Maira Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proc. of COLING, 2014.

[Socher et al., 2011] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. of EMNLP, 2011.

[Tang et al., 2014] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In Proc. of ACL, 2014.

[Tang et al., 2015] Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proc. of KDD, 2015.

[Tetsuji et al., 2010] Nakagawa Tetsuji, Inui Kentaro, and Kurohashi Sadao. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proc. of NAACL-HLT, 2010.

[Tian et al., 2014] Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. A probabilistic model for learning multi-prototype word embeddings. In Proc. of COLING, 2014.

[Vo and Zhang, 2015] Duytin Vo and Yue Zhang. Target-dependent twitter sentiment classification with rich automatic features. In Proc. of IJCAI, 2015.

[Wang and Manning, 2012] Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proc. of ACL, 2012.

[Wang et al., 2013] Quan Wang, Jun Xu, Hang Li, and Nick Craswell. Regularized latent semantic indexing: A new approach to large-scale topic modeling. ACM Trans. Inf. Syst., 31(1):5–48, 2013.
