Algorithms for Generalized Topic Modeling
Avrim Blum (Toyota Technological Institute at Chicago, [email protected]) and Nika Haghtalab (Computer Science Department, Carnegie Mellon University, [email protected])

Abstract

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In this work we consider a broad generalization of the traditional topic modeling framework, where we no longer assume that words are drawn i.i.d. and instead view a topic as a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this from unlabeled data only, and give efficient algorithms to do so, also discussing issues such as noise tolerance and sample complexity. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.

1 Introduction

Topic modeling is an area with significant recent work in the intersection of algorithms and machine learning [4, 5, 3, 1, 2, 8]. In topic modeling, a topic (such as sports, business, or politics) is modeled as a probability distribution over words, expressed as a vector a_i. A document is generated by first selecting a mixture w over topics, such as 80% sports and 20% business, and then choosing words i.i.d. from the associated mixture distribution, which in this case would be 0.8 a_sports + 0.2 a_business. Given a large collection of such documents (and some assumptions about the distributions a_i as well as the distribution over mixture vectors w), the goal is to recover the topic vectors a_i and then to use the a_i to correctly classify new documents according to their topic mixtures. Algorithms for this problem have been developed with strong provable guarantees even when documents consist of only two or three words each [5, 1, 18]. In addition, algorithms based on this problem formulation perform well empirically on standard datasets [9, 15].

As a theoretical model for document generation, however, an obvious problem with the standard topic modeling framework is that documents are not actually created by independently drawing words from some distribution. Moreover, important words within a topic often have meaningful correlations, like shooting a free throw or kicking a field goal. Better would be a model in which sentences are drawn i.i.d. from a distribution over sentences. Even better would be paragraphs drawn i.i.d. from a distribution over paragraphs (this would account for the word correlations that exist within a coherent paragraph). Or, even better, how about a model in which paragraphs are drawn non-independently, so that the second paragraph in a document can depend on what the first paragraph was saying, though presumably with some amount of additional entropy as well? This is the type of model we study here.

Note that an immediate problem with considering such a model is that now the task of learning an explicit distribution (over sentences or paragraphs) is hopeless. While a distribution over words can be reasonably viewed as a probability vector, one could not hope to learn or even represent an explicit distribution over sentences or paragraphs. Indeed, except in cases of plagiarism, one would not expect to see the same paragraph twice in the entire corpus. Moreover, this is likely to be true even if we assume paragraphs have some natural feature-vector representation. Instead, we bypass this issue by aiming to directly learn a predictor for documents: a function that, given a document, predicts its mixture over topics, without explicitly learning topic distributions. Another way to think of this is that our goal is not to learn a model that could be used to write a new document, but instead just a model that could be used to classify a document written by others.

This is much as in standard supervised learning, where algorithms such as SVMs learn a decision boundary (such as a linear separator) for making predictions on the labels of examples without explicitly learning the distributions D+ and D− over positive and negative examples respectively. However, our setting is unsupervised (we are not given labeled data containing the correct classifications of the documents in the training set) and furthermore, rather than each data item belonging to one of the k classes (topics), each data item belongs to a mixture of the k topics. Our goal is, given a new data item, to output what that mixture is.

We begin by describing our high-level theoretical formulation. This formulation can be viewed as a generalization both of standard topic modeling and of multi-view learning or co-training [10, 12, 11, 7, 21]. We then describe several natural assumptions under which we can indeed efficiently solve the problem, learning accurate topic mixture predictors.
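To make the standard generative process described above concrete, the following is a minimal sketch (not code from the paper): it samples a document from the mixture 0.8 a_sports + 0.2 a_business. The vocabulary size, the Dirichlet-random topic vectors, the random seed, and the document length are all arbitrary choices for illustration only.

```python
# Sketch of the standard topic-model generative process: pick mixture
# weights w, then draw words i.i.d. from the mixture distribution sum_i w_i a_i.
import numpy as np

rng = np.random.default_rng(0)
n_words = 1000                                 # vocabulary size (arbitrary)
a_sports = rng.dirichlet(np.ones(n_words))     # a topic is a distribution over words
a_business = rng.dirichlet(np.ones(n_words))
A = np.column_stack([a_sports, a_business])    # n_words x k matrix of topic vectors

def generate_document(w, doc_len):
    """Draw doc_len words i.i.d. from A @ w and return normalized word counts."""
    words = rng.choice(n_words, size=doc_len, p=A @ w)
    counts = np.bincount(words, minlength=n_words)
    return counts / doc_len

x_hat = generate_document(w=np.array([0.8, 0.2]), doc_len=500)
print(x_hat.sum())                             # approximately 1: a normalized document vector
```

The model studied in this paper drops the i.i.d.-word assumption behind this sketch and instead works with paragraphs whose features may be arbitrarily correlated, as formalized next.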
2 Preliminaries

We assume that paragraphs are described by n real-valued features and so can be viewed as points x in an instance space X ⊆ R^n. We assume that each document consists of at least two paragraphs and denote it by (x^1, x^2). Furthermore, we consider k topics and partial membership functions f_1, ..., f_k : X → [0, 1], such that f_i(x) determines the degree to which paragraph x belongs to topic i, and ∑_{i=1}^k f_i(x) = 1. For any vector of probabilities w ∈ R^k (which we sometimes refer to as mixture weights), we define X^w = {x ∈ R^n | ∀i, f_i(x) = w_i} to be the set of all paragraphs with partial membership values w. We assume that both paragraphs of a document have the same partial membership values, that is, (x^1, x^2) ∈ ∪_w X^w × X^w, although we also allow some noise later on. To better relate to the literature on multi-view learning, we also refer to topics as "classes" and refer to paragraphs as "views" of the document.

Much like the standard topic models, we consider an unlabeled sample set that is generated by a two-step process. First, we consider a distribution P over vectors of mixture weights and draw w according to P. Then we consider a distribution D_w over the set X^w × X^w and draw a document (x^1, x^2) according to D_w. We consider two settings. In the first setting, which is addressed in Section 3, the learner receives the instance (x^1, x^2). In the second setting, the learner receives samples (x̂^1, x̂^2) that have been perturbed by some noise. We discuss two noise models in Sections 4 and F.2. In both cases, the goal of the learner is to recover the partial membership functions f_i.

More specifically, in this work we consider partial membership functions of the form f_i(x) = f(v_i · x), where v_1, ..., v_k ∈ R^n are linearly independent and f : R → [0, 1] is a monotonic function. For the majority of this work, we consider f to be the identity function, so that f_i(x) = v_i · x. Define a_i ∈ span{v_1, ..., v_k} such that v_i · a_i = 1 and v_j · a_i = 0 for all j ≠ i. In other words, the matrix containing a_1, ..., a_k as its columns is the pseudo-inverse of the matrix containing v_1, ..., v_k as its rows.
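As a quick illustration of this noise-free formulation (a sketch under the assumptions above, not the paper's algorithm), one can obtain the v_i's from the a_i's with a pseudo-inverse and check that f_i(x) = v_i · x reads off the mixture weights of two views that share w but differ in directions orthogonal to every v_i. The dimensions, seed, and random vectors below are arbitrary.

```python
# Noise-free model of Section 2: topics a_1,...,a_k, v_i's given by the
# pseudo-inverse, and two views of one document sharing the same
# partial membership values w.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
A = rng.random((n, k))              # columns a_1,...,a_k (linearly independent w.h.p.)
V = np.linalg.pinv(A)               # k x n; rows v_1,...,v_k satisfy v_i . a_j = [i == j]

w = rng.dirichlet(np.ones(k))       # mixture weights drawn for this document
x_core = A @ w                      # a paragraph whose membership values are exactly w

# Perturb each view only in directions orthogonal to every v_i, so both
# views stay in X^w (their partial membership values are unchanged).
null_basis = np.linalg.svd(V)[2][k:].T          # n x (n-k) basis of the null space of V
x1 = x_core + null_basis @ rng.normal(size=n - k)
x2 = x_core + null_basis @ rng.normal(size=n - k)

print(np.allclose(V @ x1, w), np.allclose(V @ x2, w))   # True True: f_i(x) = v_i . x = w_i
```

Note that x1 and x2 can look very different as feature vectors even though they carry identical membership values; they only agree on their projection onto span{a_1, ..., a_k}.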
In the standard topic modeling framework, words are drawn i.i.d. from the mixture distribution ∑_{i=1}^k w_i a_i, and the document vector x̂ is then the vector of word counts, normalized by dividing by the number of words in the document so that ‖x̂‖_1 = 1.

As a thought experiment, consider infinitely long documents. In the standard framework, all infinitely long documents of a mixture weight w have the same representation x = ∑_{i=1}^k w_i a_i. This representation implies x · v_i = w_i for all i ∈ [k], where V = (v_1, ..., v_k) is the pseudo-inverse of A = (a_1, ..., a_k). Thus, by partitioning the document into two halves (views) x^1 and x^2, our noise-free model with f_i(x) = v_i · x generalizes the standard topic model for long documents. However, our model is substantially more general: features within a view can be arbitrarily correlated, the views themselves can also be correlated, and even in the zero-noise case, documents of the same mixture can look very different so long as they have the same projection to the span of the a_1, ..., a_k.

For a shorter document x̂, each feature x̂_i is drawn according to a distribution with mean x_i, where x = ∑_{i=1}^k w_i a_i. Therefore, x̂ can be thought of as a noisy measurement of x. The fewer the words in a document, the larger the noise in x̂. Existing work in topic modeling, such as [5, 2], provides elegant procedures for handling the large noise caused by drawing only 2 or 3 words according to the distribution induced by x. As we show in Section 4, our method can also tolerate large amounts of noise under some conditions. While our method cannot deal with documents that are only 2 or 3 words long, the benefit is a model that is much more general in many other respects.

Generalization of Co-training Framework

Here, we briefly discuss how our model is a generalization of the co-training framework. The standard co-training framework of [10] considers learning a binary classifier from primarily unlabeled instances, where each instance (x^1, x^2) is a pair of views that have the same classification.
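Returning to the noise discussion above, the following hedged numerical illustration (not the paper's algorithm) treats each view as a finite sample of words from the same mixture w, so its normalized count vector is a noisy measurement of x = ∑_i w_i a_i. Averaging the two views and applying the pseudo-inverse V is a naive plug-in estimator chosen only to show how the noise shrinks as the views get longer; the dimensions, seed, and topic vectors are arbitrary.

```python
# Short documents as noisy measurements of x = sum_i w_i a_i: longer views
# mean less noise in the recovered mixture weights.
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
A = np.column_stack([rng.dirichlet(np.ones(n)) for _ in range(k)])   # topic distributions
V = np.linalg.pinv(A)

def noisy_view(w, length):
    """A view of `length` words drawn i.i.d. from A @ w, as normalized counts."""
    words = rng.choice(n, size=length, p=A @ w)
    return np.bincount(words, minlength=n) / length

w = np.array([0.5, 0.3, 0.2])
for length in (20, 200, 2000):
    x1, x2 = noisy_view(w, length), noisy_view(w, length)   # two views, same w
    w_hat = V @ ((x1 + x2) / 2)                              # simple plug-in estimate of w
    print(length, np.round(w_hat, 3))                        # estimates approach w as length grows
```

In the special case where w is a one-hot vector, such a pair (x̂^1, x̂^2) can be viewed as a co-training-style instance: two views sharing a single underlying classification.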