The Author-Topic Model for Authors and Documents
Michal Rosen-Zvi, Dept. of Computer Science, UC Irvine, Irvine, CA 92697-3425, USA
Thomas Griffiths, Dept. of Psychology, Stanford University, Stanford, CA 94305, USA
Mark Steyvers, Dept. of Cognitive Sciences, UC Irvine, Irvine, CA 92697, USA
Padhraic Smyth, Dept. of Computer Science, UC Irvine, Irvine, CA 92697-3425, USA

Abstract

We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.

1 Introduction

Characterizing the content of documents is a standard problem addressed in information retrieval, statistical natural language processing, and machine learning. A representation of document content can be used to organize, classify, or search a collection of documents. Recently, generative models for documents have begun to explore topic-based content representations, modeling each document as a mixture of probabilistic topics (e.g., Blei, Ng, & Jordan, 2003; Hofmann, 1999). Here, we consider how these approaches can be used to address another fundamental problem raised by large document collections: modeling the interests of authors.

By modeling the interests of authors, we can answer a range of important queries about the content of document collections. With an appropriate author model, we can establish which subjects an author writes about, which authors are likely to have written documents similar to an observed document, and which authors produce similar work. However, research on author modeling has tended to focus on the problem of authorship attribution (who wrote which document), for which discriminative models based on relatively superficial features are often sufficient. For example, the "stylometric" approach (e.g., Holmes & Forsyth, 1995) finds stylistic features (e.g., frequency of certain stop words, sentence lengths, diversity of an author's vocabulary) that discriminate between different authors.

In this paper we describe a generative model for document collections, the author-topic model, that simultaneously models the content of documents and the interests of authors. This generative model represents each document with a mixture of topics, as in state-of-the-art approaches like Latent Dirichlet Allocation (Blei et al., 2003), and extends these approaches to author modeling by allowing the mixture weights for different topics to be determined by the authors of the document. By learning the parameters of the model, we obtain the set of topics that appear in a corpus and their relevance to different documents, as well as identifying which topics are used by which authors.

The paper is organized as follows. In Section 2, we discuss generative models for documents using authors and topics, and introduce the author-topic model. We devote Section 3 to describing the Gibbs sampler used for inferring the model parameters, and in Section 4 we present the results of applying this algorithm to two collections of computer science documents: NIPS conference papers and abstracts from the CiteSeer database. We conclude and discuss further research directions in Section 5.

2 Generative models for documents

We will describe three generative models for documents: one that models documents as a mixture of topics (Blei et al., 2003), one that models authors with distributions over words, and one that models both authors and documents using topics. All three models use the same notation. A document d is a vector of N_d words, w_d, where each w_id is chosen from a vocabulary of size V, and a vector of A_d authors a_d, chosen from a set of authors of size A. A collection of D documents is defined by D = {(w_1, a_1), ..., (w_D, a_D)}.

2.1 Modeling documents with topics

A number of recent approaches to modeling document content are based upon the idea that the probability distribution over words in a document can be expressed as a mixture of topics, where each topic is a probability distribution over words (e.g., Blei et al., 2003; Hofmann, 1999). We will describe one such model, Latent Dirichlet Allocation (LDA; Blei et al., 2003).¹ In LDA, the generation of a document collection is modeled as a three step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic.

This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Figure 1(a). In this model, φ denotes the matrix of topic distributions, with a multinomial distribution over V vocabulary items for each of T topics being drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each being drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution φ corresponding to z. Estimating φ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document respectively. A variety of algorithms have been used to estimate these parameters, including variational inference (Blei et al., 2003), expectation propagation (Minka & Lafferty, 2002), and Gibbs sampling (Griffiths & Steyvers, 2004). However, this topic model provides no explicit information about the interests of authors: while it is informative about the content of documents, authors may produce several documents, often with co-authors, and it is consequently unclear how the topics used in these documents might be used to describe the interests of the authors.

¹ The model we describe is actually the smoothed LDA model (Blei et al., 2003) with symmetric Dirichlet priors (Griffiths & Steyvers, 2004), as this is closest to the author-topic model.
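The three-step generative process just described can be made concrete with a short simulation. The following is a minimal sketch, not code from the paper: the values of T, V, N_d and the hyperparameters α and β are illustrative assumptions, and NumPy's Dirichlet and categorical samplers stand in for the distributions in Figure 1(a).

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 4, 20             # number of topics, vocabulary size (illustrative values)
alpha, beta = 0.5, 0.1   # symmetric Dirichlet hyperparameters (assumed)
N_d = 15                 # number of words in the document

# phi: one multinomial over the V vocabulary items per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)        # shape (T, V)

# Step 1: sample the document's distribution over topics from Dirichlet(alpha)
theta_d = rng.dirichlet(np.full(T, alpha))           # shape (T,)

# Steps 2-3: for each word, choose a topic z from theta_d, then a word from phi[z]
z = rng.choice(T, size=N_d, p=theta_d)               # topic assignments
w = np.array([rng.choice(V, p=phi[t]) for t in z])   # word indices

print("topic assignments:", z)
print("word indices:", w)
```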
2.2 Modeling authors with words

Topic models illustrate how documents can be modeled as mixtures of probability distributions. This suggests a simple method for modeling the interests of authors. Assume that a group of authors, a_d, decide to write the document d. For each word in the document an author is chosen uniformly at random, and a word is chosen from a probability distribution over words that is specific to that author.

This model is similar to a mixture model proposed by McCallum (1999) and is equivalent to a variant of LDA in which the mixture weights for the different topics are fixed. The underlying graphical model is shown in Figure 1(b). x indicates the author of a given word, chosen uniformly from the set of authors a_d. Each author is associated with a probability distribution over words φ, generated from a symmetric Dirichlet(β) prior. Estimating φ provides information about the interests of authors, and can be used to answer queries about author similarity and authors who write on subjects similar to an observed document. However, this author model does not provide any information about document content that goes beyond the words that appear in the document and the authors of the document.
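As an illustration only, not the paper's implementation, the author model's generative process can be sketched in the same style as the LDA sketch above: each author has a word distribution φ drawn from Dirichlet(β), and every word of a document is produced by an author picked uniformly from a_d. The sizes, hyperparameter, and author set below are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

A, V = 3, 20             # number of authors, vocabulary size (illustrative values)
beta = 0.1               # symmetric Dirichlet hyperparameter (assumed)
a_d = [0, 2]             # authors of document d (assumed)
N_d = 15                 # number of words in the document

# phi: one multinomial over words per author, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=A)          # shape (A, V)

# For each word: pick an author x uniformly from a_d, then a word from phi[x]
x = rng.choice(a_d, size=N_d)                          # author assignments
w = np.array([rng.choice(V, p=phi[xi]) for xi in x])   # word indices

print("author assignments:", x)
print("word indices:", w)
```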
2.3 The author-topic model

The author-topic model draws upon the strengths of the two models defined above, using a topic-based representation to model both the content of documents and the interests of authors. As in the author model, a group of authors, a_d, decide to write the document d. For each word in the document an author is chosen uniformly at random. Then, as in the topic model, a topic is chosen from a distribution over topics specific to that author, and the word is generated from the chosen topic.

The graphical model corresponding to this process is shown in Figure 1(c). As in the author model, x indicates the author responsible for a given word, chosen from a_d. Each author is associated with a distribution over topics, θ, chosen from a symmetric Dirichlet(α) prior. The mixture weights corresponding to the chosen author are used to select a topic z, and a word is generated according to the distribution φ corresponding to that topic, drawn from a symmetric Dirichlet(β) prior.

[Figure 1: Generative models for documents. (a) Latent Dirichlet Allocation (LDA; Blei et al., 2003), a topic model. (b) An author model. (c) The author-topic model.]

The LDA model has two sets of unknown parameters: the D document distributions θ, and the T topic distributions φ.
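For comparison with the two sketches above, the author-topic generative process described in Section 2.3 can be simulated as follows. As before, this is a hedged illustration with assumed sizes, hyperparameters, and author set, not the paper's inference procedure, which estimates the unknown parameters with Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

A, T, V = 3, 4, 20       # authors, topics, vocabulary size (illustrative values)
alpha, beta = 0.5, 0.1   # symmetric Dirichlet hyperparameters (assumed)
a_d = [0, 2]             # authors of document d (assumed)
N_d = 15                 # number of words in the document

# theta: one distribution over topics per author, drawn from Dirichlet(alpha)
theta = rng.dirichlet(np.full(T, alpha), size=A)       # shape (A, T)
# phi: one distribution over words per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)          # shape (T, V)

# For each word: choose an author x uniformly from a_d, a topic z from theta[x],
# and finally a word from phi[z]
x = rng.choice(a_d, size=N_d)                             # author assignments
z = np.array([rng.choice(T, p=theta[xi]) for xi in x])    # topic assignments
w = np.array([rng.choice(V, p=phi[zi]) for zi in z])      # word indices

print("author assignments:", x)
print("topic assignments:", z)
print("word indices:", w)
```

Note how the only change from the author model sketch is the extra layer of per-author topic weights theta, which is what lets the model describe an author's interests in terms of topics rather than raw word frequencies.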