Self-Discriminative Learning for Unsupervised Document Embedding
Hong-You Chen∗1, Chin-Hua Hu∗1, Leila Wehbe2, Shou-De Lin1
1Department of Computer Science and Information Engineering, National Taiwan University
2Machine Learning Department, Carnegie Mellon University
∗ Equal contribution.

Abstract

Unsupervised document representation learning is an important task providing pre-trained features for NLP applications. Unlike most previous work, which learns the embedding based on self-prediction of the surface text, we explicitly exploit inter-document information and directly model the relations of documents in embedding space with a discriminative network and a novel objective. Extensive experiments on both small and large public datasets show the competitiveness of the proposed method. In evaluations on standard document classification, our model has errors that are relatively 5 to 13% lower than those of state-of-the-art unsupervised embedding models. The reduction in error is even more pronounced in the scarce-label setting.

1 Introduction

Rapid advances in deep methods for natural language processing have contributed to a growing need for vector representations of documents as input features. Applications for such vector representations include machine translation (Sutskever et al., 2014), text classification (Dai and Le, 2015), image captioning (Mao et al., 2015), multi-lingual document matching (Pham et al., 2015), question answering (Rajpurkar et al., 2016), and more. This work studies unsupervised training for encoders that can efficiently encode long paragraphs of text into compact vectors to be used as pre-trained features.

Existing solutions are mostly based on the assumption that a good document embedding can be learned by modeling intra-document information, that is, by predicting the occurrence of terms inside the document itself. We argue that such an assumption might not be sufficient to obtain meaningful document embeddings, as it does not consider inter-document relationships.

Traditional document representation models such as bag-of-words (BoW) and TF-IDF show competitive performance in some tasks (Wang and Manning, 2012). However, these models treat words as flat tokens, which may neglect other useful information such as word order and semantic distance. This in turn can limit the models' effectiveness on more complex tasks that require a deeper level of understanding. Further, BoW models suffer from high dimensionality and sparsity, which is likely to prevent them from being used as input features for downstream NLP tasks.

Continuous vector representations for documents are being developed. A successful thread of work is based on the distributional hypothesis and uses contextual information for context-word predictions. Similar to Word2Vec (Mikolov et al., 2013), PV (Le and Mikolov, 2014) is optimized by predicting the next words given their contexts in a document, but it is conditioned on a unique document vector. Word2Vec-based methods for computing document embeddings achieve state-of-the-art performance on document embedding. Such methods rely on one strong underlying assumption: it is necessary to train the document embedding to optimize the prediction of the target words in the document. In other words, the objective requires the model to learn to predict the target words in the surface text. We argue that there are several concerns with such a self-prediction assumption.

The strategy of predicting target words only exploits in-document information and does not explicitly model inter-document distances. We believe an ideal embedding space should also infer the relations among the training documents. For example, if all documents in the corpus are about machine learning, then the concept of machine learning becomes less critical in the embedding. However, if the corpus contains documents from different areas of computer science, then the concept of machine learning should be encoded in any document relevant to it. We therefore claim that the embedding of a document should depend not only on the document itself but also on the other documents in the corpus, even though previous work seldom makes this consideration.

In addition, accurate predictions at the lexicon or word level do not necessarily reflect that the "true semantics" have been learned. For example, in IMDB dataset review No. 10007:

"... the father did such a good job."

Obviously, good can be replaced with synonyms like nice without significantly altering the meaning of the sentence. However, since synonyms are treated as independent tokens in PV and Doc2VecC, the lexicon good must be predicted exactly. Moreover, to accurately predict the final word job, the embedding probably only needs to know that did a good job is a very common phrase, without having to understand the true meaning of job. This example shows that in order to accurately predict a local lexicon, the embedding might opt to encode the syntactic relationship instead of the true semantics. Enforcing document embeddings to make predictions at the word level could be too strong an objective. More specifically, we argue that the true semantics should depend not only on a small context, but also on the relations with other training documents at the document level.

To address the above concerns, we propose a novel model for learning document embeddings unsupervisedly. In contrast with previous work (PV and Doc2Vec), we model documents according to two aspects.

First, we abandon the concept of context-word prediction when training an embedding model. Instead, we propose a self-supervised learning framework to model inter-document information. Conceptually, we use the embedding to determine whether a sentence belongs to a document: our encoder is equipped with a discriminator that classifies whether a sentence embedding is derived from a document, given that document's embedding. This explicitly enforces documents to be spread reasonably in the embedding space, without any labels, so that they can be discriminated. To the best of our knowledge, this is the first deep embedding work to explicitly model the inter-document relationship.

Second, in our approach the predictions are inferred at the sentence level. This avoids the effect of only predicting the surface meaning at the word level (e.g., good vs. nice). Unlike previous work, our model is explicitly optimized to represent documents as combinations of sequence embeddings beyond the words seen in training.
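As a rough illustration of this self-discrimination idea, the sketch below pairs each sentence embedding with its own document embedding as a positive example and with a randomly drawn other document as a negative one, and trains a binary discriminator on the pairs. The encoder, the mean-pooled document embedding, the negative sampling, and all names here are illustrative assumptions, not the exact SDDE architecture or objective.

```python
# Illustrative sketch only: a sentence encoder plus a discriminator that decides
# whether a sentence embedding was drawn from a given document's embedding.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encodes a padded batch of word-id sequences into sentence embeddings e_s."""
    def __init__(self, vocab_size, h_w=128, h_s=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, h_w)      # word embedding table (U)
        self.rnn = nn.GRU(h_w, h_s, batch_first=True)   # sequence encoder

    def forward(self, word_ids):                        # word_ids: (batch, T)
        _, h = self.rnn(self.embed(word_ids))           # h: (1, batch, h_s)
        return h.squeeze(0)                             # (batch, h_s)

class Discriminator(nn.Module):
    """Scores whether a sentence embedding belongs to a document embedding."""
    def __init__(self, h_s=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * h_s, h_s), nn.ReLU(), nn.Linear(h_s, 1))

    def forward(self, e_s, d):                          # both (batch, h_s)
        return self.score(torch.cat([e_s, d], dim=-1)).squeeze(-1)

def self_discriminative_loss(encoder, disc, sentences, doc_ids, n_docs):
    """sentences: (batch, T) padded word ids; doc_ids: (batch,) source-document index.
    Positives pair a sentence with its own document; negatives pair it with a
    randomly drawn document (which may occasionally coincide with the true one)."""
    e_s = encoder(sentences)                                            # (batch, h_s)
    # Document embedding: mean of its sentences in the batch (an illustrative choice).
    d = torch.zeros(n_docs, e_s.size(1)).index_add_(0, doc_ids, e_s)
    counts = torch.zeros(n_docs).index_add_(0, doc_ids, torch.ones(len(doc_ids)))
    d = d / counts.clamp(min=1).unsqueeze(-1)
    pos = disc(e_s, d[doc_ids])                                         # matched pairs
    neg = disc(e_s, d[torch.randint(0, n_docs, doc_ids.shape)])         # mismatched pairs
    bce = nn.BCEWithLogitsLoss()
    return bce(pos, torch.ones_like(pos)) + bce(neg, torch.zeros_like(neg))
```

Training the encoder and discriminator jointly under such an objective rewards embeddings that keep documents distinguishable from one another, which is exactly the inter-document signal that word-prediction objectives do not provide.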
Below we summarize the key contributions:

• We present a deep and general framework and a novel objective for learning document representations unsupervisedly. Our models are end-to-end, easy to implement, and flexible to extend.

• We perform experiments on sentiment analysis and topic classification to show that our model, referred to as self-discriminative document embedding (SDDE), is competitive with state-of-the-art solutions based on traditional context-prediction objectives.

• Our extensive experiments quantitatively and qualitatively show that SDDE learns more effective features that capture more document-level information. To the best of our knowledge, SDDE is the first deep network to model inter-instance information at the document level.

• We further propose to evaluate unsupervised document embedding models in weakly-supervised classification, that is, with many unlabeled documents and only a few labels attached to some of them, which is a realistic scenario in which unsupervised embeddings can be particularly useful.

2 Related Work

Here we give an overview of other related methods for learning unsupervised text representations. Besides BoW and TF-IDF, Latent Dirichlet Allocation models (Deerwester et al., 1990; Blei et al., 2003) leverage the orthogonality of high-dimensional BoW features by clustering a probabilistic BoW into latent topics.

Several models extend Word2Vec (Mikolov et al., 2013), using context-word predictions to train document embeddings end-to-end. PV (Le and Mikolov, 2014) keeps a document embedding matrix in memory that is jointly trained. The number of trainable parameters is linear in the number of documents, which prohibits PV from being trained on a large corpus. Moreover, expensive inference is required for new documents at test time. To address these concerns, Doc2VecC (Chen, 2017) instead combines PV and a denoising autoencoder (DEA) (Chen et al., 2012) with BoW vectors as global document information. The final document embeddings are then produced by simply averaging the jointly trained word embeddings.
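To make the averaging step concrete, the sketch below builds a document vector as the mean of the word-embedding columns of its tokens. The matrix U, the toy vocabulary, and the helper name are assumptions for illustration only; they show the final averaging step, not Doc2VecC's training procedure.

```python
# Sketch of forming a document vector by averaging word embeddings (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "father": 1, "did": 2, "such": 3, "a": 4, "good": 5, "job": 6}
U = rng.standard_normal((128, len(vocab)))        # h_w x |V| word embedding matrix

def averaged_document_embedding(tokens, U, vocab):
    """Mean of the word-embedding columns u_w for the tokens in the document."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return U[:, ids].mean(axis=1)

doc = "the father did such a good job".split()
d = averaged_document_embedding(doc, U, vocab)    # shape (128,)
```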
Another thread of work uses two-stage pipelines to construct sentence/document embeddings from pre-trained word embeddings. Arora et al. (2017) propose post-processing weighting strategies on top of word embeddings to build sentence representations. WME (Wu et al., 2018) proposes a random-feature kernel method based on the distance between pairs of words, which also shows that inter-document information helps. However, its cost scales with the number of training samples, making it hard to apply to large-scale datasets.

We use the following notation:

• $\mathcal{X}$: the training corpus, of document size $n = |\mathcal{X}|$, in which each document $X_i$ is a set of sentences $\mathcal{S}_i$;

• $\mathcal{S}_i = \{s_i^1, \cdots, s_i^{|\mathcal{S}_i|}\}$: a document divided into a set of sentences, of set size $|\mathcal{S}_i|$, in which each sentence $s_i^j \in \mathbb{R}^{|V| \times T_j}$ contains a sequence of variable length $T_j$ of word one-hot vectors $w_j^1, \cdots, w_j^{T_j}$, each in $\mathbb{R}^{|V| \times 1}$. $\mathcal{S} = \bigcup_{i=1}^{n} \mathcal{S}_i$ is the set of all sentences in the training corpus, of size $m = |\mathcal{S}|$;

• $h_w$: the size of the word embedding, and $U \in \mathbb{R}^{h_w \times |V|}$: the word embedding projection matrix. We use $u_w$ to denote the column of $U$ for word $w$;

• $h_s$: the size of the sentence embedding, and $e_s \in \mathbb{R}^{h_s}$: the embedding of sentence $s$;

• $d_i \in \mathbb{R}^{h_s}$: document $X_i$'s embedding.

Our goal is to learn a function $F : \mathcal{X} \rightarrow \mathbb{R}^{h_s \times n}$.
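To make the notation concrete, the following sketch maps it onto array shapes. The toy sizes, contents, and helper names are illustrative assumptions; only the symbols ($|V|$, $h_w$, $h_s$, $T_j$, $n$, $m$) follow the definitions above.

```python
# Toy instantiation of the notation above; sizes and contents are illustrative only.
import numpy as np

V, h_w, h_s = 10_000, 128, 128                  # vocabulary size |V| and embedding sizes
U = np.zeros((h_w, V))                          # word embedding projection matrix U
                                                # (column u_w is the embedding of word w)

# The corpus X: n documents, each a set (here, a list) of sentences; each sentence
# is a sequence of word indices standing in for the |V| x T_j one-hot matrix s_i^j.
corpus = [
    [[12, 57, 903], [4, 4021]],                 # X_1: two sentences with T = 3 and T = 2
    [[7, 88, 88, 13]],                          # X_2: one sentence with T = 4
]
n = len(corpus)                                 # n = |X|
m = sum(len(S_i) for S_i in corpus)             # m = |S|, total number of sentences

def one_hot(sentence, vocab_size=V):
    """Expand word indices into the |V| x T_j one-hot matrix used in the notation."""
    s = np.zeros((vocab_size, len(sentence)))
    s[sentence, np.arange(len(sentence))] = 1.0
    return s

D = np.zeros((h_s, n))                          # the output of F: one column d_i per document
```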