Document Classification with Distributions of Word Vectors

Document Classification with Distributions of Word Vectors Chao Xing∗y, Dong Wang∗, Xuewei Zhang∗z, Chao Liu∗ ∗Center for Speaker and Language Technologies (CSLT) Tsinghua University, Beijing 100084, P.R.China E-mail: [email protected]; [email protected] ySchool of Software, Beijing Jiaotong University, Beijing 100044, P.R.China E-mail: [email protected] zShenyang Jianzhu University, Sheyang 110168, P.R.China E-mail: [email protected] Abstract—The word-to-vector (W2V) technique represents LDA learns semantic clusters based on word co-occurrences, words as low-dimensional continuous vectors in such a way that which ignores semantic relevance among words and therefore semantic related words are close to each other. This produces is still a bag-of-words model. Second, the learning and infer- a semantic space where a word or a word collection (e.g., a document) can be well represented, and thus lends itself to a ence process is based on variational Bayesian approximation, multitude of applications including document classification. Our which is sensitive to the initial condition and is fairly slow. previous study demonstrated that representations derived from Third, the topics learned by LDA is highly determined by word vectors are highly promising in document classification word frequencies, which leads to difficulty in learning with and can deliver better performance than the conventional LDA less frequent but important topics. model. This paper extends the previous research and proposes to model distributions of word vectors in documents or document In the previous study [4], we proposed a document clas- classes. This extends the naive approach to deriving document representations by average pooling and explores the possibility sification approach based on word vectors and achieved very of modeling documents in the semantic space. Experiments on promising results. Word vectors are continuous representations the sohu text database confirmed that the new approach may of words derived from certain word-to-vector (W2V) model produce better performance on document classification. such as a neural network. By this representation, semantic or syntactic related words are located close to each other I. INTRODUCTION in the word vector space [5]. A seminal work on word The increasingly accumulated text documents on the In- vectors is conducted by Bengio and colleagues when studying ternet require effective document classification techniques. A neural language modeling [6], and the following work involves typical document classification system is composed of four various W2V models and efficient learning algorithms [7], components: text pre-processing, document vector extraction, [8], [9], [10], [11]. Recently, word vectors have been applied discriminative modeling and document classifier. Among these to a multitude of applications, including sentiment classifica- components, the document vector extraction component is tion [12], biometric name entity extraction [13], dependency particularly important. Since different documents may involve parsing [14], synonyms recognition [15]. different numbers of words, it is not trivial to represent Although word vectors have exhibited great potential on a variable-length document as a fixed-length vector while document classification in our previous work [4], the previous keeping the most class discriminant information. methods were rather simple. Particularly, we used the simple A popular approach to document vector extraction is based average pooling to extract document vectors, which ignores on various topic models. This approach first represents a the distribution of word vectors in a class or a document, document as a raw vector (e.g., TF-IDF), and then learns a leading to less representative document vectors. This paper group of topics B = fβ1; β2; ...βK g based on the raw vectors extends the research on W2V-based document classification. of a large corpus. A document is then represented by the image Basically, we argue that the distribution of word vectors is a vector of its raw vector projected onto the topic group B. Typ- more appropriate representation than the pooled centroid for a ical methods based on topic models include Latent Semantic document or a class. Two models are presented in this paper: Indexing (LSI) [1] and its probabilistic version, probabilistic a class specific Gaussian mixture model (CSGMM) which Latent Semantic Indexing (pLSI) [2]. A more comprehensive models word vectors of a document class, and a semantic approach is based on the Latent Dirichlet Allocation (LDA) space allocation (SSA) model which represents a document model [3], which places a Dirichlet distribution as a prior on as a posterior probability over a global GMM components in the topic distribution over B, and a document is represented the word vector space. by the posterior distribution that the document belongs to topics in B. The LDA-based approach usually delivers highly The rest of the paper is organized as follows: Section II competitive performance on a number of NLP tasks including introduces the W2V-based document classification and com- document classification, partly due to its nature of embedding pares it with the LDA-based approach; Section III presents the documents in the low-dimensional semantic (topic) space. proposed statistical models for word vectors. The experiments Despite the considerable success on document classification, are presented in Section IV, and the paper is concluded by the LDA model suffers from a number of disadvantages. First, Section V. 978-616-361-823-8 © 2014 APSIPA APSIPA 2014 II. W2V-BASED DOCUMENT CLASSIFICATION B. From word vector to document vector A. Word vectors Document classification largely relies on quality of the Word representation is a fundamental problem in natural document representations, or document vectors. The simple language processing. The conventional one-hot coding rep- vector space model (VSM) represents a document as a raw resents a word as a sparse vector of size jV j and all the TF-IDF vector, and the LDA-based approach, as well as other dimensions are zeros except the one that corresponds to the approaches based on topic models, represents a document as word. This simple presentation is discrete, high-dimensional an image of the document’s raw vector on a topic group B. and does not embed any semantic relationship among words. Due to the semantic interpolation of the topics, LDA can learn This often leads to much difficulty in model training and semantic relationships among words and thus usually delivers inference, for example, the well-known smoothness problem considerably good performance on document classification [3]. in language modeling [16]. The concept of word vectors offers a new approach to deriv- An alternative approach embeds words in a low-dimensional ing document vectors. Since word vectors represent semantic 1 continuous space where relevant words are close to each other. meanings of words , and the meaning of a document can The ‘relevance’ might be in the sense of semantic meanings, be regarded as an aggregation of meanings of the words it syntactic roles, sentimental polarities, or any others depending involves, document vectors can be derived from word vectors. on the model objectives [7], [8], [9], [10], [11]. This dense In the previous study [4], a simple average pooling approach and continuous word representation, often called word vector was proposed to derive document vectors from word vectors. (WV), offers a multitude of advantages: first, the dimension of Letting ci;j denote the word vector of the j-th word token of the vector is often much lower than the one-hot coding, leading document i, the document vector vi can be computed as: to much more efficient models; second, the word vector space Ji is continuous, offering the possibility to model texts using 1 X vi = ci;j (1) continuous models; third, the relationships among words are Ji embedded in their word vectors, providing a simple way to j=1 compute aggregated semantics for word collections such as where Ji is the number of word tokens in the document. paragraphs and documents. For these reasons, word vectors Both the LDA-based approach and W2V-based approach have been quickly adopted by the NLP community and have represent a document as a low-dimensional continuous vector, been applied to a multitude of text processing tasks [17], [13], and both embed semantic meanings in the vector. However, [14], [15]. there are a number of fundamental differences between them. A simple and efficient W2V model that we choose in this First, the LDA model learns the semantic group (topics) from work is the skip-gram model [18], where the training objective a collection of documents, and thus the topics are specific is to predict the left and right neighbours given a particular to the training corpus; the W2V model learns the semantic word, as shown in Fig. 1. This model is a neural network embedding for each word, and hence the semantic meaning where the input is a token of word wi, denoted by ewi . This is ‘general’ for all documents in the same language. Second, input token is mapped to its word vector cwi , by looking up the LDA model extracts topics by inferring shared patterns a embedding matrix U. This word vector, cwi , is then used to of word co-occurrences, hence purely frequency-driven; the predict the word vectors of its left and right C neighbouring W2V model, in contrast, extracts word semantic meanings by words (C = 2 in Fig. 1). looking at context similarities, thus beyond a bag-of-words model. Third, the LDA-based approach derives document vectors from global topics and so can be regarded as a top-down c'w i-2 approach, whereas the W2V-based approach derives document vectors from word vectors hence a bottom-up approach. c' wi-1 A number of experiments were conducted in [4] to inves- U tigate potential of the W2V-based approach and compare it e w wi C i with the LDA-based approach.

Document Classification with Distributions of Word Vectors

An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification

Fasttext.Zip: Compressing Text Classification Models

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms

Investigating Classification for Natural Language Processing Tasks

Linked Data Triples Enhance Document Relevance Classification

Natural Language Processing Security- and Defense-Related Lessons Learned

Using Wordnet to Disambiguate Word Senses for Text Classification

An Automatic Text Document Classification Using Modified Weight and Semantic Method

Natural Language Information Retrieval Text, Speech and Language Technology

The Impact of NLP Techniques in the Multilabel Text Classification Problem

Word Embeddings for Multi-Label Document Classification

Experiments with One-Class SVM and Word Embeddings for Document Classification