
Roads to Text Meaning
Hồ Tú Bảo
Japan Advanced Institute of Science and Technology
Vietnamese Academy of Science and Technology
• Natural language processing
• Information retrieval and extraction on the Web
• The Internet and the Web in the 1990s: Latent Semantic Analysis and Topic Modeling

Archeology of computational linguistics (adapted from E. Hovy, COLING 2004)
• 1990s–2000s: Statistical learning: algorithms, evaluation, corpora
• 1980s: Standard resources and tasks: Penn Treebank, WordNet, MUC
• 1970s: Kernel (vector) spaces: clustering, information retrieval (IR)
• 1960s: Representation and transformation: finite state machines (FSM) and augmented transition networks (ATNs)
• 1960s: Representation beyond the word level: lexical features, tree structures, networks

PageRank algorithm (Google)
Google the phrase 'weather forecast' → answer: 4.2 million pages. How does Google (Larry Page, Sergey Brin) know which pages are the most important? Google assigns a number to each individual page (its PageRank number), computed via the eigenvalue problem Pw = λw. The current size of P is 4.2 × 10^9. Example link matrix P for three pages A, B, C:

      A    B    C
A    1/2  1/2   0
B    1/2   0    1
C     0   1/2   0

Latent semantic analysis & topic models
The LSA approach makes three claims:
(1) semantic information can be derived from a word-document co-occurrence matrix;
(2) dimensionality reduction is an essential part of this derivation;
(3) words and documents can be represented as points in Euclidean space.
Different from (3), topic models express the semantic information of words and documents by 'topics'. 'Latent' = 'hidden', 'unobservable', 'presently inactive', …

What is a topic?
The subject matter of a speech, text, meeting, discourse, etc. The topic of a text captures "what a document is about", i.e., the meaning of the text. A text can be represented by a "bag of words" for several purposes, and you can see the words. But how can you see (know) the topics of a text? How is a topic represented, discovered, etc.?
Topic modeling = finding the 'word patterns' of topics: a 'topic' consists of a cluster of words that frequently occur together.

Notation and terminology
A word is the basic unit of discrete data, drawn from a vocabulary indexed by {1, …, V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.
A document is a sequence of N words, denoted by d = (w_1, w_2, …, w_N).
A corpus is a collection of M documents, denoted by D = {d_1, d_2, …, d_M}.

Term frequency–inverse document frequency
The tf-idf of a word t_i in document d_j (Salton & McGill, 1983):
tf-idf_{i,j} = ( n_{i,j} / Σ_k n_{k,j} ) × log( |D| / |{d_j : t_i ∈ d_j}| )
where n_{i,j} is the number of times t_i occurs in d_j. This results in a t × d matrix, thus reducing the corpus to a fixed-length representation. It is used in search engines.

Vector space model in IR
         d1  d2  d3  d4  d5  d6   q1
rock      2   1   0   2   0   1    1
granite   1   0   1   0   0   0    0
marble    1   2   0   0   0   0    1
music     0   0   0   1   2   0    0
song      0   0   0   1   0   2    0
band      0   0   0   0   1   0    0

Documents and queries are compared by the cosine: cos(x, y) = x·y / (||x|| ||y||).
Given a query, say q1 = ('rock', 'marble'): cos(d3, q1) = 0 and cos(d5, q1) = 0, while cos(d4, q1) ≠ 0 and cos(d6, q1) ≠ 0. Yet d3 is more relevant to q1 than d4 or d6, even though cos(d3, q1) = 0. This is the problem of synonymy (one meaning can be expressed by multiple words, e.g. 'group', 'cluster') and polysemy (a word can have multiple meanings, e.g. 'rock').
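As a concrete illustration of the cosine comparisons above, here is a minimal sketch in Python/NumPy (not part of the original slides; the matrix is the toy term-document matrix from the table, and the helper name cos is mine):

```python
import numpy as np

# Toy term-document matrix from the slide: rows = terms, columns = d1..d6.
terms = ["rock", "granite", "marble", "music", "song", "band"]
C = np.array([
    [2, 1, 0, 2, 0, 1],   # rock
    [1, 0, 1, 0, 0, 0],   # granite
    [1, 2, 0, 0, 0, 0],   # marble
    [0, 0, 0, 1, 2, 0],   # music
    [0, 0, 0, 1, 0, 2],   # song
    [0, 0, 0, 0, 1, 0],   # band
], dtype=float)

# Query q1 = ('rock', 'marble') as a term vector.
q1 = np.array([1, 0, 1, 0, 0, 0], dtype=float)

def cos(x, y):
    """cos(x, y) = x.y / (|x| |y|)."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

for j in range(C.shape[1]):
    print(f"cos(d{j+1}, q1) = {cos(C[:, j], q1):.3f}")
# cos(d3, q1) = 0 even though d3 (the 'granite' document) is topically close to the
# query: the synonymy problem that motivates LSI below.
```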
LSI: Latent semantic indexing (Deerwester et al., 1990)
LSI is a dimensionality reduction technique that projects documents to a lower-dimensional semantic space and, in doing so, causes documents with similar topical content to be close to one another in the resulting space. In particular, two documents which share no terms with each other directly, but which do share many terms with a third document, will end up being similar in the projected space.
Similarity between LSI and PCA?

LSI: Latent semantic indexing
C = UDV^T by singular value decomposition, such that U^T U = I, V^T V = I, and D is a diagonal matrix whose diagonal entries are the singular values of C (the square roots of the eigenvalues of C^T C). Here C is the words × documents matrix, U is words × dims, D is dims × dims, and V^T is dims × documents.
The idea of LSI is to strip away most of the dimensions and keep only those which capture the most variation in the document collection (typically going from |V| = hundreds of thousands down to k between 100 and 200).

LSI: Example
LSI clusters documents in the reduced-dimension semantic space according to word co-occurrence patterns, and the dimensions loosely correspond with topic boundaries. Projecting the term-document matrix above onto two dimensions gives:

        D1      D2      D3      D4      D5      D6      Q1
Dim. 1  -0.888  -0.759  -0.615  -0.961  -0.388  -0.851  -0.845
Dim. 2   0.460   0.652   0.789  -0.276  -0.922  -0.525   0.534

(The slide plots D1-D6 and Q1 at these coordinates in the two-dimensional space.)
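A small sketch of this rank-2 projection in Python/NumPy (not from the slides; the names docs_2d and q_2d and the fold-in of the query are my own, and the printed coordinates should correspond to the Dim. 1 / Dim. 2 rows above only up to sign and normalization, since singular vectors are determined only up to sign):

```python
import numpy as np

# Same toy term-document matrix (rows: rock, granite, marble, music, song, band).
C = np.array([
    [2, 1, 0, 2, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 1, 2, 0],
    [0, 0, 0, 1, 0, 2],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

# Singular value decomposition C = U D V^T with U^T U = I and V^T V = I.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                   # keep only the top-k singular dimensions
U_k, D_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

docs_2d = D_k @ Vt_k                    # documents as columns in the 2-D latent space
q1 = np.array([1, 0, 1, 0, 0, 0.0])     # query ('rock', 'marble')
q_2d = U_k.T @ q1                       # fold the query into the same space

# Unit-length coordinates, comparable to the Dim. 1 / Dim. 2 rows above.
print(np.round(docs_2d / np.linalg.norm(docs_2d, axis=0), 3))

def cos(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# In the reduced space, d3 is typically no longer orthogonal to the query.
print(round(cos(docs_2d[:, 2], q_2d), 3))
```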
Exchangeability
A finite set of random variables {x_1, …, x_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:
p(x_1, …, x_N) = p(x_{π(1)}, …, x_{π(N)})
An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.

The bag-of-words assumption
Word order is ignored: "bag-of-words" means exchangeability, not i.i.d.
Theorem (De Finetti, 1935): if (x_1, x_2, …, x_N) are infinitely exchangeable, then the joint probability has a representation as a mixture
p(x_1, x_2, …, x_N) = ∫ dθ p(θ) ∏_{i=1}^{N} p(x_i | θ)
for some random variable θ.

Probabilistic topic models: key ideas
LSA factorizes the documents × words matrix C into U, D, and V over latent dimensions; topic models factorize the normalized co-occurrence matrix C into Φ (words × topics) and Θ (topics × documents).
Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words. Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.

Probabilistic topic models: processes
Generative model (generating a document): choose a distribution over topics and the document length; then, for each word w_i, choose a topic at random according to this distribution and choose a word from that topic's word distribution.
Statistical inference (the inverse direction): to know which topic model is most likely to have generated the data, infer
• the probability distribution over words associated with each topic,
• the distribution over topics for each document,
• the topic responsible for generating each word.

Graphical model notation:
• nodes are random variables (some observable, some latent);
• edges denote possible dependence;
• plates denote replicated structure;
• the graph encodes the pattern of conditional dependence between the ensemble of random variables, e.g. for y → x_1, …, x_N:
p(y, x_1, …, x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)

Mixture of unigrams model (Nigam et al., 2000)
A simple model in which each document has a single topic (appropriate for supervised classification). A document is generated by choosing a topic z and then generating its N_d words independently from the conditional multinomial distribution p(w|z); a topic is thus associated with a specific language model that generates words appropriate to the topic.
p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)
(Plate diagram: z → w_n, with the word plate of size N_d nested inside a document plate of size M.)

How to calculate?
p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)
We must estimate the multinomial distributions p(z) and p(w|z). If each document is annotated with a topic z, maximum likelihood estimation gives p(z), and counting the number of times each word w appears in all documents labeled with z and then normalizing gives p(w|z). If the topics of the documents are not known, the EM algorithm can be used to estimate both distributions. Once the model has been trained, inference can be performed using Bayes' rule to obtain the most likely topics for each document.
Limitations of this model: (1) a document can only contain a single topic; (2) the distributions have no priors and are assumed to be learned completely from the data.

Probabilistic latent semantic indexing (Hofmann, 1999)
pLSI: each word is generated from a single topic, but different words in the same document may be generated from different topics. Each document is represented as a list of mixing proportions for the mixture components:
p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)
Generative process:
• choose a document d_m with probability p(d);
• for each word w_n in d_m: choose a topic z_n from a multinomial conditioned on d_m, i.e. from p(z|d_m), then choose w_n from a multinomial conditioned on z_n, i.e. from p(w|z_n).

Limitations
The model allows multiple topics in each document, but the possible topic proportions have to be learned from the document collection. pLSI does not make any assumptions about how the mixture weights θ are generated, making it difficult to test the generalizability of the model to new documents. The topic distribution must be learned for each document in the collection, so the number of parameters grows with the number of documents.

Latent Dirichlet allocation
(Plate diagram: Dirichlet parameter α; per-document topic proportions θ_d, a point on the (T-1)-simplex; per-word topic assignment Z_{d,n}; observed word W_{d,n}; per-topic word proportions φ_t, a point on the (V-1)-simplex; topic hyperparameter β. The word plate of size N is nested inside the document plate of size M, and the topic plate has size T.)
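To make the LDA plate diagram concrete, here is a minimal sketch of its generative process in Python/NumPy (not from the slides; the corpus sizes T, V, M, N and the values of alpha and beta are assumed toy choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: T topics, V vocabulary words, M documents of N words each.
T, V, M, N = 3, 6, 5, 8
alpha = np.full(T, 0.1)    # Dirichlet parameter for per-document topic proportions
beta = np.full(V, 0.01)    # hyperparameter for per-topic word proportions

# Per-topic word proportions phi_t: each row is a point on the (V-1)-simplex.
phi = rng.dirichlet(beta, size=T)           # shape (T, V)

documents = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)          # per-document topic proportions, on the (T-1)-simplex
    words = []
    for n in range(N):
        z_dn = rng.choice(T, p=theta_d)     # per-word topic assignment Z_{d,n}
        w_dn = rng.choice(V, p=phi[z_dn])   # observed word W_{d,n}
        words.append(int(w_dn))
    documents.append(words)

print(documents)   # each document is a list of word indices into the vocabulary
```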