Topic Modeling: Roads to Text Meaning

Hồ Tú Bảo
Japan Advanced Institute of Science and Technology
Vietnamese Academy of Science and Technology

Archeology of computational linguistics (adapted from E. Hovy, COLING 2004)

• 1960s: Representation
  - finite state machines (FSMs) and augmented transition networks (ATNs)
• 1960s: Representation beyond the word level
  - lexical features, tree structures, networks
• 1970s: Kernel (vector) spaces
  - clustering, information retrieval (IR)
• 1980s: Standard resources and tasks
  - Penn Treebank, WordNet, MUC
• 1990s–2000s: Statistical learning
  - algorithms, evaluation, corpora
• Internet and Web in the 1990s
  - natural language processing transformation
  - information retrieval and extraction on the Web

PageRank algorithm (Google)

• Google the phrase 'weather forecast' → answer: 4.2 million pages.
• How does Google know which pages are the most important?
• Google assigns a number to each individual page (its PageRank), computed via the eigenvalue problem Pw = λw.
• Example link matrix over pages A, B, C:

         A    B    C
    A   1/2  1/2   0
    B   1/2   0    1
    C    0   1/2   0

• Current size of P: about 4.2 × 10^9 pages.
  (Larry Page, Sergey Brin)
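A minimal power-iteration sketch of this eigenvalue computation, reusing the 3-page link matrix from the slide (illustrative only: no damping factor, which real PageRank adds, and the matrix is assumed column-stochastic):

    import numpy as np

    # Column-stochastic link matrix for pages A, B, C (from the slide):
    # column j holds the probability of moving from page j to each page.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    # Power iteration: repeatedly apply P to a uniform start vector until it
    # converges to the dominant eigenvector w with Pw = w (eigenvalue 1).
    w = np.ones(3) / 3
    for _ in range(100):
        w_next = P @ w
        w_next /= w_next.sum()          # keep it a probability vector
        if np.allclose(w, w_next, atol=1e-12):
            break
        w = w_next

    print(dict(zip("ABC", w.round(3))))  # PageRank-style scores for A, B, C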

Latent semantic analysis & topic models

• The LSA approach makes three claims:
  (1) semantic information can be derived from a word-document co-occurrence matrix;
  (2) dimensionality reduction is an essential part of this derivation;
  (3) words and documents can be represented as points in Euclidean space.
• Departing from claim (3), topic models express the semantic information of words and documents by 'topics'.
• 'Latent' = 'hidden', 'unobservable', 'presently inactive', …

What is a topic?

• The subject matter of a speech, text, meeting, discourse, etc.
• The topic of a text captures "what a document is about", i.e., the meaning of the text.
• A text can be represented by a "bag of words" for many purposes, and you can see the words.
• But how can you see (know) the topics of the text? How is a topic represented, discovered, etc.?

Notation and terminology

• A word is the basic unit of discrete data, from a vocabulary indexed by {1,…,V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.
• A document is a sequence of N words, denoted by d = (w_1, w_2,…, w_N).
• A corpus is a collection of M documents, denoted by D = {d_1, d_2,…, d_M}.

Topic modeling = finding the 'word patterns' of topics

• A 'topic' consists of a cluster of words that frequently occur together.

Vector space model in IR

• Term-document matrix (with query q1):

                 d1  d2  d3  d4  d5  d6  q1
      rock        2   1   0   2   0   1   1
      granite     1   0   1   0   0   0   0
      marble      1   2   0   0   0   0   1
      music       0   0   0   1   2   0   0
      song        0   0   0   1   0   2   0
      band        0   0   0   0   1   0   0

• Documents and queries are compared by the cosine of their vectors:
     cos(x, y) = (x · y) / (|x| |y|)
• Given a query, say q1 = ('rock', 'marble'):
     cos(d3, q1) = 0, cos(d5, q1) = 0, cos(d4, q1) ≠ 0, cos(d6, q1) ≠ 0
  yet d3 is more relevant to q1 than d4 or d6, even though cos(d3, q1) = 0.
• Problem of synonymy (one meaning can be expressed by multiple words, e.g. 'group', 'cluster') and polysemy (a word can have multiple meanings, e.g. 'rock').

Term frequency–inverse document frequency

• tf-idf of a word t_i in document d_j (Salton & McGill, 1983):

     tf-idf(t_i, d_j) = ( n_{i,j} / Σ_k n_{k,j} ) × log( |D| / |{d : t_i ∈ d}| )

  where n_{i,j} = # times t_i occurs in d_j.
• Results in a t×d matrix, thus reducing the corpus to a fixed-length representation.
• Used by search engines.
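A short numpy sketch (an illustrative reconstruction, not code from the lecture) that builds the tf-idf matrix for the toy term-document counts above and ranks the documents against q1 by cosine similarity:

    import numpy as np

    terms = ["rock", "granite", "marble", "music", "song", "band"]
    # Columns: d1..d6 (raw term counts from the slide); the query is handled separately.
    counts = np.array([
        [2, 1, 0, 2, 0, 1],   # rock
        [1, 0, 1, 0, 0, 0],   # granite
        [1, 2, 0, 0, 0, 0],   # marble
        [0, 0, 0, 1, 2, 0],   # music
        [0, 0, 0, 1, 0, 2],   # song
        [0, 0, 0, 0, 1, 0],   # band
    ], dtype=float)

    # tf-idf(t_i, d_j) = (n_ij / sum_k n_kj) * log(|D| / df_i)
    tf = counts / counts.sum(axis=0, keepdims=True)
    df = (counts > 0).sum(axis=1)                 # document frequency of each term
    idf = np.log(counts.shape[1] / df)
    tfidf = tf * idf[:, None]

    # Query q1 = ('rock', 'marble') as a raw term-count vector.
    q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

    def cosine(x, y):
        if not x.any() or not y.any():
            return 0.0
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    for j in range(tfidf.shape[1]):
        print(f"cos(d{j+1}, q1) = {cosine(tfidf[:, j], q):.3f}")
    # d3 scores 0 despite being about rocks (synonymy), while d4/d6 score > 0
    # only because 'rock' is ambiguous (polysemy).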

LSI: Latent semantic indexing (Deerwester et al., 1990)

• LSI is a dimensionality reduction technique that projects documents to a lower-dimensional semantic space and, in doing so, causes documents with similar topical content to be close to one another in the resulting space.

  [Figure: the term-document matrix C (words × documents) factored as C = U D V^T, with U (words × dims), D (dims × dims, the singular values), and V^T (dims × documents).]

• In particular, two documents which share no terms with each other directly, but which do share many terms with a third document, will end up being similar in the projected space.
• C = UDV^T by singular value decomposition, such that U^T U = I, V^T V = I, and D is a diagonal matrix whose diagonal entries are the singular values of C.
• Idea of LSI: strip away most of the dimensions and keep only those which capture the most variation in the document collection (typically from |V| = hundreds of thousands down to k between 100 and 200).
• Similarity between LSI and PCA?
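A minimal sketch of the truncated SVD behind LSI, reusing the toy matrix from the earlier slide and keeping k = 2 dimensions (illustrative only; the variable names are my own):

    import numpy as np

    # Toy term-document matrix C
    # (rows: rock, granite, marble, music, song, band; columns: d1..d6).
    C = np.array([
        [2, 1, 0, 2, 0, 1],
        [1, 0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 0, 0, 1, 2, 0],
        [0, 0, 0, 1, 0, 2],
        [0, 0, 0, 0, 1, 0],
    ], dtype=float)

    # Singular value decomposition: C = U diag(s) Vt.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Strip away all but the k dimensions with the largest singular values.
    k = 2
    C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of C

    print(np.round(s, 3))     # singular values, largest first
    print(np.round(C_k, 2))   # smoothed co-occurrence counts in the latent space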


LSI: Example 1

• LSI clusters documents in the reduced-dimension semantic space according to word co-occurrence patterns (a small projection sketch follows after the next slide).
• Dimensions loosely correspond with topic boundaries.
• Two-dimensional LSI coordinates for the toy collection and the query:

               D1      D2      D3      D4      D5      D6      Q1
      Dim. 1  -0.888  -0.759  -0.615  -0.961  -0.388  -0.851  -0.845
      Dim. 2   0.460   0.652   0.789  -0.276  -0.922  -0.525   0.534

  [Figure: scatter plot of D1–D6 and Q1 in the two-dimensional LSI space; D1, D2, D3 and Q1 fall on one side (positive Dim. 2), D4, D5, D6 on the other.]

Exchangeability

• A finite set of random variables {x_1, …, x_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

     p(x_1, …, x_N) = p(x_π(1), …, x_π(N))

• An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.
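To reproduce a 2-D embedding like the LSI example above, documents and the query can be mapped into the reduced space and compared there. A hedged sketch (signs and scaling of the coordinates depend on the SVD convention, so the exact numbers may differ from the slide's table):

    import numpy as np

    # Same toy term-document matrix C as in the SVD sketch above.
    C = np.array([
        [2, 1, 0, 2, 0, 1],
        [1, 0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 0, 0, 1, 2, 0],
        [0, 0, 0, 1, 0, 2],
        [0, 0, 0, 0, 1, 0],
    ], dtype=float)

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2
    docs_2d = np.diag(s[:k]) @ Vt[:k, :]            # 2-D LSI coordinates of d1..d6

    q = np.array([1, 0, 1, 0, 0, 0], dtype=float)   # query q1 = ('rock', 'marble')
    q_2d = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q    # fold the query into the same space

    def cosine(x, y):
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    for j in range(docs_2d.shape[1]):
        print(f"cos(d{j+1}, q1) in LSI space = {cosine(docs_2d[:, j], q_2d):.3f}")
    # In the reduced space, d3 (which shares no terms with the query) should no
    # longer have zero similarity, because 'granite' co-occurs with 'rock'/'marble'.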

Bag-of-words assumption

• Word order is ignored.
• "Bag of words" means exchangeability, not i.i.d.
• Theorem (de Finetti, 1935): if (x_1, x_2, …, x_N) are infinitely exchangeable, then the joint probability has a representation as a mixture over some random variable θ:

     p(x_1, x_2, …, x_N) = ∫ p(θ) ∏_{i=1}^{N} p(x_i | θ) dθ

Probabilistic topic models: key ideas

• LSA factors the term-document matrix C into U, D, V over latent dimensions; topic models factor the normalized co-occurrence matrix C into word-topic distributions Φ and topic-document distributions Θ.
• Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words.
• Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.


Probabilistic topic models: processes

• Generative model: generating a document
  - choose a distribution over topics and the document length;
  - for each word w_i, choose a topic at random according to this distribution, and choose a word from that topic's word distribution.
• The model therefore specifies:
  - a probability distribution over words associated with each topic;
  - a distribution over topics for each document;
  - the topic responsible for generating each word.
• A topic is associated with a specific language model that generates words appropriate to the topic.
• Statistical inference (the inverse direction): to know which topic model is most likely to have generated the data, infer the hidden distributions from the observed words.

Mixture of unigrams model (Nigam et al., 2000)

• Simple: each document has exactly one topic (appropriate for supervised classification).
• Generates a document by
  - choosing a topic z,
  - generating the N_d words independently from the conditional multinomial distribution p(w|z).

     p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)
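A tiny illustration of this likelihood, with made-up numbers for p(z) and p(w|z) (nothing here comes from the lecture's data):

    import numpy as np

    vocab = ["rock", "granite", "marble", "music", "song", "band"]
    topics = ["geology", "music"]

    # Hypothetical model parameters.
    p_z = np.array([0.5, 0.5])                        # p(z)
    p_w_given_z = np.array([                          # p(w|z), one row per topic
        [0.30, 0.35, 0.30, 0.02, 0.02, 0.01],         # 'geology'
        [0.25, 0.01, 0.01, 0.25, 0.25, 0.23],         # 'music'
    ])

    def doc_likelihood(words):
        """p(d) = sum_z p(z) * prod_n p(w_n | z) for a bag of words."""
        idx = [vocab.index(w) for w in words]
        per_topic = p_w_given_z[:, idx].prod(axis=1)  # prod_n p(w_n | z), for each z
        return float(p_z @ per_topic)

    print(doc_likelihood(["rock", "granite", "marble"]))   # geology-flavoured document
    print(doc_likelihood(["rock", "song", "band"]))        # music-flavoured document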

Graphical model notation:

• Nodes are random variables.
• Edges denote possible dependence.
• Plates denote replicated structure.
• Observable variables are distinguished from latent variables (shaded vs. unshaded nodes).
• The graph encodes the pattern of conditional dependence between the ensemble of random variables, e.g.

     p(y, x_1, …, x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)

How to calculate?   p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)

• We must estimate the multinomial distributions p(z) and p(w|z).
• If each document is annotated with a topic z:
  - maximum likelihood estimation gives p(z);
  - counting how many times each word w appears in all documents labeled with z, and then normalizing, gives p(w|z).
• If topics are not known for the documents, the EM algorithm can be used to estimate them.
• Limitations of this mixture model:
  1. a document can only contain a single topic;
  2. the distributions have no priors and are learned completely from the data.

Probabilistic latent semantic indexing (Hofmann, 1999)

• pLSI: each word is generated from a single topic, but different words in the same document may be generated from different topics.
• Each document is represented as a list of mixing proportions for the mixture components:

     p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)

• Generative process:
  - choose a document d_m with probability p(d);
  - for each word w_n in d_m:
    · choose a topic z_n from a multinomial conditioned on d_m, i.e., from p(z|d_m);
    · choose a word w_n from a multinomial conditioned on z_n, i.e., from p(w|z_n).
• Once the model has been trained, inference can be performed using Bayes' rule to obtain the most likely topics for each document.
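A compact EM sketch for pLSI on a small count matrix. This is my own illustrative implementation (plain EM, not Hofmann's tempered EM), and the array names are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    # n[d, w]: how many times word w occurs in document d (toy data).
    n = np.array([
        [2, 1, 1, 0, 0, 0],
        [1, 0, 2, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [2, 0, 0, 1, 1, 0],
        [0, 0, 0, 2, 0, 1],
        [1, 0, 0, 0, 2, 0],
    ], dtype=float)
    D, W = n.shape
    K = 2                                            # number of topics

    # Random initialisation of p(z|d) and p(w|z); p(d) is proportional to
    # document length and omitted here.
    p_z_d = rng.dirichlet(np.ones(K), size=D)        # shape (D, K)
    p_w_z = rng.dirichlet(np.ones(W), size=K)        # shape (K, W)

    for _ in range(100):
        # E-step: responsibilities p(z | d, w) ∝ p(w|z) p(z|d).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]         # (D, K, W)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the multinomials from expected counts.
        expected = n[:, None, :] * post                       # (D, K, W)
        p_w_z = expected.sum(axis=0)                          # (K, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)                          # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    print(np.round(p_w_z, 2))   # word distribution of each topic
    print(np.round(p_z_d, 2))   # topic mixture of each document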


Limitations of pLSI

• The model allows multiple topics in each document, but
  - the possible topic proportions have to be learned from the document collection;
  - pLSI makes no assumptions about how the mixture weights θ are generated, making it difficult to test the generalizability of the model to new documents.
• A topic distribution must be learned for each document in the collection, so the number of parameters grows with the number of documents (what about a billion documents?).
• Blei et al. (2003) extended this model by introducing a Dirichlet prior on θ, calling the result latent Dirichlet allocation (LDA).

Latent Dirichlet allocation

• Graphical model: Dirichlet parameter α, per-document topic proportions θ_d (a point on the (T−1)-simplex), per-word topic assignment z_{d,n}, observed word w_{d,n}, per-topic word proportions φ_t (a point on the (V−1)-simplex), topic hyperparameter β.
• Generative process (a sketch of sampling from it follows below):
  1. Draw each topic φ_t ~ Dir(β), t = 1, …, T.
  2. For each document d:
     a. Draw topic proportions θ_d ~ Dir(α), and the document length N_d from a Poisson distribution with parameter ξ.
     b. For each word:
        i.  Draw z_{d,n} ~ Mult(θ_d).
        ii. Draw w_{d,n} ~ Mult(φ_{z_{d,n}}).
• Inference: from a collection of documents, infer
  - the per-word topic assignments z_{d,n},
  - the per-document topic proportions θ_d,
  - the per-topic word distributions φ_t,
  and use posterior expectations to perform the tasks: IR, similarity, …
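A small numpy sketch of this generative process (purely illustrative: the vocabulary, hyperparameters, and corpus sizes below are made up):

    import numpy as np

    rng = np.random.default_rng(42)

    vocab = np.array(["rock", "granite", "marble", "music", "song", "band"])
    V, T, M = len(vocab), 2, 3            # vocabulary size, topics, documents
    alpha, beta, xi = 0.5, 0.1, 8         # Dirichlet hyperparameters, Poisson mean length

    # 1. Draw each topic's word distribution phi_t ~ Dir(beta).
    phi = rng.dirichlet(np.full(V, beta), size=T)        # shape (T, V)

    documents = []
    for d in range(M):
        # 2a. Draw topic proportions theta_d ~ Dir(alpha) and a document length.
        theta = rng.dirichlet(np.full(T, alpha))
        N_d = max(1, rng.poisson(xi))
        words = []
        for _ in range(N_d):
            # 2b. Draw a topic assignment, then a word from that topic.
            z = rng.choice(T, p=theta)
            w = rng.choice(V, p=phi[z])
            words.append(vocab[w])
        documents.append(words)

    for d, words in enumerate(documents):
        print(f"doc {d}: {' '.join(words)}")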

Latent Dirichlet allocation: the model

• Dirichlet prior on the per-document topic distributions:

     p(θ | α) = ( Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i) ) θ_1^{α_1 − 1} ⋯ θ_k^{α_k − 1}

• β is a k×V matrix of word probabilities, β_ij = p(w^j = 1 | z^i = 1): the word distribution φ of each topic. θ gives the topic distribution of each document.
• Joint distribution of a topic mixture θ, a set of N topic assignments z, and a set of N words w:

     p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

• Marginal distribution of a document, obtained by integrating over θ and summing over z:

     p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

• Probability of a collection: the product of the marginal probabilities of the single documents:

     p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_{dn}} p(z_{dn} | θ_d) p(w_{dn} | z_{dn}, β) ) dθ_d


Generative process and inference: parameter estimation

  [Figure: the LDA plate diagram shown twice, once in the generative direction (α, β → θ, φ → z → w) and once in the inference direction (estimating θ, φ and the topic assignments from the observed words w).]

• Parameter estimation methods:
  - mean field variational methods (Blei et al., 2001, 2003)
  - expectation propagation (Minka & Lafferty, 2002)
  - Gibbs sampling (Griffiths & Steyvers, 2004)
  - collapsed variational inference (Teh et al., 2006)
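A minimal collapsed Gibbs sampler for LDA in the spirit of Griffiths & Steyvers (2004). This is my own didactic sketch (toy corpus, fixed symmetric hyperparameters), not the lecture's code:

    import numpy as np

    rng = np.random.default_rng(1)

    docs = [["rock", "granite", "marble", "rock"],
            ["music", "song", "band", "rock"],
            ["marble", "granite", "marble"],
            ["song", "music", "music", "band"]]
    vocab = sorted({w for d in docs for w in d})
    w_id = {w: i for i, w in enumerate(vocab)}

    K, V, D = 2, len(vocab), len(docs)
    alpha, beta = 0.5, 0.1

    # Count tables and initial random topic assignments.
    n_dk = np.zeros((D, K))            # topic counts per document
    n_kw = np.zeros((K, V))            # word counts per topic
    n_k = np.zeros(K)                  # total words per topic
    z = []                             # z[d][i]: topic of the i-th word of doc d
    for d, doc in enumerate(docs):
        z.append([])
        for word in doc:
            k = rng.integers(K)
            z[d].append(k)
            n_dk[d, k] += 1
            n_kw[k, w_id[word]] += 1
            n_k[k] += 1

    for _ in range(500):               # Gibbs sweeps (includes burn-in)
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, k = w_id[word], z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Add the new assignment back.
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)      # topic-word
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)  # doc-topic
    for k in range(K):
        top = np.argsort(phi[k])[::-1][:3]
        print(f"topic {k}:", [vocab[i] for i in top])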

Evolution of the models

• Mixture of unigrams:   p(w) = Σ_z p(z) ∏_{n=1}^{N} p(w_n | z)
• pLSI:                  p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)
• LDA adds the Dirichlet prior:   p(θ | α) = ( Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i) ) θ_1^{α_1 − 1} ⋯ θ_k^{α_k − 1}
• An analogous evolution happened for sequence models: Hidden Markov Models (HMMs) [Baum et al., 1970] → Maximum Entropy Markov Models (MEMMs) [McCallum et al., 2000], more accurate than HMMs → Conditional Random Fields (CRFs) [Lafferty et al., 2001], more accurate than MEMMs.

  [Figure: plate diagrams of the three topic models (mixture of unigrams, pLSI, LDA) alongside the state-observation chains of HMMs, MEMMs, and CRFs.]

A geometric interpretation

  [Figure: the word simplex spanned by word 1, word 2, and word 3, with the topic simplex embedded inside it.]


A geometric interpretation

• The mixture of unigrams model can place documents only at the corners of the topic simplex, since only a single topic is permitted for each document.
• The pLSI model allows multiple topics per document and can therefore place documents within the topic simplex, but only at one of the specified (learned) points.

  [Figure: the word simplex (word 1, word 2, word 3) with the embedded topic simplex (topic 1, topic 2, topic 3); document positions are marked at the corners for the mixture of unigrams model and at discrete interior points for pLSI.]

A geometric interpretation

• By contrast, the LDA model can place documents at any point within the topic simplex.

Inference

• We want to use LDA to compute the posterior distribution of the hidden variables given a document:

     p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this is intractable to compute in general. Marginalizing over the hidden variables, the marginal likelihood (3) can be written as

     p(w | α, β) = ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) ∫ ( ∏_{i=1}^{k} θ_i^{α_i − 1} ) ( ∏_{n=1}^{N} Σ_{i=1}^{k} ∏_{j=1}^{V} (θ_i β_{ij})^{w_n^j} ) dθ

• A variety of approximate inference algorithms exist for LDA.


Inference in practice

• Expectation maximization:
  - but poor results (local maxima).
• Gibbs sampling:
  - parameters: φ, θ;
  - start with an initial random assignment;
  - update each parameter using the other parameters;
  - converges after n iterations, following a burn-in period.

Example

• From 16,000 documents of the AP corpus, a 100-topic LDA model was trained.
• An example article from the AP corpus: each color codes a different factor (topic) from which the word is putatively generated.

TASA corpus: 37,000 texts, 300 topics

• By giving equal probability to the first two topics, one could construct a document about a person who has taken too many drugs, and how that affected color perception.

Topic models for text data (http://www.ics.uci.edu/~smyth/topics.html)

• David Blei's latent Dirichlet allocation (LDA) code in C, using variational learning.
• A topic-based browser of UCI and UCSD faculty research interests, built by Newman and Asuncion at UCI in 2005.
• A topic-based browser for 330,000 New York Times articles, by Dave Newman, UCI.
• Wray Buntine's topic-based search interface to Wikipedia.
• Dave Blei and John Lafferty's browsable 100-topic model of journal articles from Science.
• LSA tools and applications: http://LSA.colorado.edu


Directions for hidden topic discovery

• Hidden topic discovery from documents.
• Application in web search analysis and disambiguation.
• Application in medical information (disease classification).
• Application in digital libraries (information navigation).
• Potential applications in intelligent advertising and recommendation.
• Many others.

Visual words

• Idea: given a collection of images,
  - think of each image as a document;
  - think of the feature patches of each image as words ('visual words');
  - apply the LDA model to extract topics.
• J. Sivic et al., Discovering object categories in image collections. MIT AI Lab Memo AIM-2005-005, Feb. 2005.

Related works and problems

• Hyperlink modeling using LDA, Erosheva et al., PNAS, 2004.
• Finding scientific topics, Griffiths & Steyvers, PNAS, 2004.
• Author-Topic model for scientific literature, Rosen-Zvi et al., UAI, 2004.
• Author-Recipient-Topic model for email data, McCallum et al., IJCAI 2005.
• Modeling citation influences, Dietz et al., ICML 2007.
• Word sense disambiguation, Blei et al., 2007.
• Classifying short and sparse text and Web data, P.X. Hieu et al., WWW 2008.
• Automatic labeling of multinomial topic models, Mei et al., KDD 2007?
• LDA-based document models for ad-hoc retrieval, Croft et al., SIGIR 2006?
• Connection to language modeling?
• Topic models and emerging trend detection?
• Similarity between words and between documents → clustering?
• etc.

Topic analysis of Wikipedia

• Topic-oriented crawling to download documents from Wikipedia:
  - Arts: architecture, fine art, dancing, fashion, film, museum, music, …
  - Business: advertising, e-commerce, capital, finance, investment, …
  - Computers: hardware, software, database, digital, multimedia, …
  - Education: course, graduate, school, professor, university, …
  - Engineering: automobile, telecommunication, civil engineering, …
  - Entertainment: book, music, movie, movie star, painting, photos, …
  - Health: diet, disease, therapy, healthcare, treatment, nutrition, …
  - Mass media: news, newspaper, journal, television, …
  - Politics: government, legislation, party, regime, military, war, …
  - Science: biology, physics, chemistry, ecology, laboratory, patent, …
  - Sports: baseball, cricket, football, golf, tennis, olympic games, …
• Raw data: 3.5 GB, 471,177 documents.
• Preprocessing: remove duplicates, HTML, stop words, and rare words.
• Final data: 240 MB, 71,986 documents, 882,376 paragraphs, 60,649 unique words, 30,492,305 word tokens.
• Tools: JWikiDocs, a Java Wikipedia document crawling tool (http://jwebpro.sourceforge.net/), and GibbsLDA++, a C/C++ latent Dirichlet allocation implementation (http://gibbslda.sourceforge.net/).

(Source: the next few slides are adapted from P.X. Hieu, WWW 2008.)

Topic discovery from MEDLINE
(The full list of topics is at http://gibbslda.sourceforge.net/ohsumed-topics.txt)

• MEDLINE (Medical Literature Analysis and Retrieval System Online): a database of life sciences and biomedical information. It covers the fields of medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care.
• More than 15 million records from approximately 5,000 selected publications (NLM Systems, Feb. 2007), covering biomedicine and health from 1950 to the present.
• Our topic analysis ran on 400 MB of MEDLINE data comprising 348,566 medical abstracts from 1987 to 1991. The outputs (i.e., hidden topics) are used for disease classification.

Application in disease classification

• Comparison of disease classification accuracy with and without hidden topics, using a maximum entropy classifier.

  [Chart: classification accuracy (%) versus size of labeled training data (about 1.1K to 22.5K examples), comparing the baseline (without topics) against classification with hidden topic inference; reported accuracies fall roughly in the 58–67% range.]

Application in digital library

• Topic-based information navigation of journal articles [Blei & Lafferty 2007].
  Source: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Demo: information navigation in the journal Science
(http://www.cs.cmu.edu/~lemur/science/; this was done with the Correlated Topic Model (CTM), a more advanced variant of LDA.)

Potential in intelligent advertising

• Amazon recommendations, Google context-sensitive advertising.
• Can Amazon and Gmail understand us?

Applications in scientific trends: analyzing a topic

• Dynamic topic models [Blei & Lafferty 2006].
• Analyzed data: articles from the journal Science.
  Source: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Visualizing trends within a topic

Summary

• LSA and topic models are roads to text meaning.
• Both can be viewed as dimensionality reduction techniques.
• Exact inference is intractable, but we can approximate it.
• Many applications; fundamental for the digitalized era.
• How best to exploit latent information depends on the application, the field, researchers' backgrounds, …

Key references

• Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science. (citations: 3574)
• Hofmann, T. (1999). Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence (UAI). (citations: 722)
• Nigam, K., et al. (2000). Text classification from labeled and unlabeled documents using EM. (citations: 454)
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research. (citations: 699)

Some other references

• Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Seventh International Conference on World Wide Web (WWW7), Brisbane, Australia, pp. 107–117.
• Haveliwala, T. H. (2002). Topic-sensitive PageRank. 11th International Conference on World Wide Web (WWW 2002), Honolulu, Hawaii, USA.
• Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. NIPS 14, MIT Press.
• Nie, L., Davison, B. D., & Qi, X. (2006). Topical link analysis for web search. 29th ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA.


Mixture models

• In general, a mixture model is a model in which the independent variables are fractions of a total; a probability mixture model is a probability distribution that is a convex combination of other probability distributions. A discrete random variable X is a mixture of n component discrete random variables Y_i:

     f_X(x) = Σ_{i=1}^{n} a_i f_{Y_i}(x)

• In a parametric mixture model, the component distributions come from a parametric family with unknown parameters θ_i:

     f_X(x) = Σ_{i=1}^{n} a_i f_Y(x; θ_i)

• In a continuous mixture, the discrete weights are replaced by a mixing density h:

     f_X(x) = ∫_Θ h(θ) f_Y(x; θ) dθ,   with h(θ) ≥ 0 for all θ ∈ Θ and ∫_Θ h(θ) dθ = 1

Estimation in mixture models

• Given the component family Y and a sample from X, we would like to determine the a_i and θ_i values.
• The expectation-maximization (EM) algorithm is an iterative solution:
  - expectation step: with the current (guessed) parameters, compute for each data point x_j and component Y_i

       y_{i,j} = a_i f_Y(x_j; θ_i) / f_X(x_j)

  - maximization step:

       a_i = (1/N) Σ_j y_{i,j}   and   μ_i = ( Σ_j y_{i,j} x_j ) / ( Σ_j y_{i,j} )
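A sketch of these EM updates for a two-component 1-D Gaussian mixture with fixed unit variances (an illustrative choice, not from the lecture), matching the a_i and μ_i updates above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic sample from a two-component mixture (unit-variance Gaussians).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
    N, K = len(x), 2

    a = np.full(K, 1.0 / K)          # mixing weights a_i
    mu = np.array([-1.0, 1.0])       # component means (initial guesses)

    def component_pdf(x, mu):
        """Unit-variance Gaussian density f_Y(x; mu)."""
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    for _ in range(50):
        # E-step: responsibilities y_ij = a_i f_Y(x_j; mu_i) / f_X(x_j).
        weighted = a[:, None] * component_pdf(x[None, :], mu[:, None])   # (K, N)
        y = weighted / weighted.sum(axis=0, keepdims=True)

        # M-step: a_i = mean_j y_ij,  mu_i = sum_j y_ij x_j / sum_j y_ij.
        a = y.mean(axis=1)
        mu = (y @ x) / y.sum(axis=1)

    print("weights:", np.round(a, 3), "means:", np.round(mu, 3))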

Distributions

• Binomial distribution: the discrete probability distribution of the number of successes in n Bernoulli trials (success/failure outcomes):

     f(k; n, p) = P(X = k) = ( n! / (k! (n − k)!) ) p^k (1 − p)^(n − k)

  (e.g., n = 10, #red = 2, #blue = 3, p = 0.4, P(#red = 4) = …)

• Multinomial distribution: the discrete probability distribution of the number of occurrences x_i of each of k outcomes in n independent trials with outcome probabilities p_1, …, p_k:

     f(x_1, …, x_k; n, p_1, …, p_k) = P(X_1 = x_1 and … and X_k = x_k)
       = ( n! / (x_1! ⋯ x_k!) ) p_1^{x_1} ⋯ p_k^{x_k}   when Σ_{i=1}^{k} x_i = n, and 0 otherwise

  (e.g., n = 10, #red = 2, #blue = 2, #black = 1, p = 0.4, P(#red = 5, #blue = 2, #black = 3) = …)

• Beta distribution: used to model random variables that vary between two finite limits, characterized by two parameters α > 0 and β > 0; quite useful for modeling proportions:

     f(x) = ( Γ(α + β) / (Γ(α) Γ(β)) ) x^(α − 1) (1 − x)^(β − 1),   where Γ(z) = ∫_0^∞ t^(z − 1) e^(−t) dt

  Examples: the percentage of impurities in a certain manufactured product; the proportion of fat (by weight) in a piece of meat.

• Dirichlet distribution Dir(α): the multivariate generalization of the beta distribution, and the conjugate prior of the multinomial distribution in Bayesian statistics:

     f(x_1, …, x_{k−1}; α_1, …, α_k) = p(θ | α) = ( Γ(Σ_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) ) x_1^{α_1 − 1} ⋯ x_k^{α_k − 1}
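A short sketch evaluating these formulas numerically; the concrete parameter values (e.g., probabilities 0.4/0.3/0.3 in the multinomial example) are assumptions chosen for illustration:

    from math import comb, factorial, gamma
    from functools import reduce

    def binomial_pmf(k, n, p):
        """P(X = k) for X ~ Binomial(n, p)."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def multinomial_pmf(xs, n, ps):
        """P(X_1 = x_1, ..., X_k = x_k) for n trials with outcome probabilities ps."""
        if sum(xs) != n:
            return 0.0
        coef = factorial(n) / reduce(lambda acc, x: acc * factorial(x), xs, 1)
        return coef * reduce(lambda acc, xp: acc * xp[1]**xp[0], zip(xs, ps), 1.0)

    def beta_pdf(x, a, b):
        """Beta(a, b) density at x in (0, 1)."""
        return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

    # Binomial example in the spirit of the slide: 10 trials, p(red) = 0.4.
    print(binomial_pmf(4, 10, 0.4))

    # Multinomial example: 10 trials over red/blue/black with probs 0.4/0.3/0.3.
    print(multinomial_pmf([5, 2, 3], 10, [0.4, 0.3, 0.3]))

    # Beta(2, 5) density at a proportion of 0.1 (e.g., an impurity fraction).
    print(beta_pdf(0.1, 2, 5))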


Distributions: discrete vs. continuous

• Binomial distribution: discretely distributes the unit over two outcomes in n experiments.
• Beta distribution: continuously distributes the unit between two limits.
• Multinomial distribution: discretely distributes the unit over k outcomes in n experiments.
• Dirichlet distribution: continuously distributes the unit among k components (a point on the (k−1)-simplex).
• Poisson distribution:  f(k; λ) = λ^k e^(−λ) / k!

Simplex

• A simplex, sometimes called a hypertetrahedron, is the generalization of a tetrahedral region of space to n dimensions. The boundary of a k-simplex has k+1 0-faces (polytope vertices), k(k+1)/2 1-faces (polytope edges), and, in general, C(k+1, i+1) i-faces.

  [Figure: graphs of the n-simplexes for n = 2 to 7.]

LDA and exchangeability

• Example: N_d = 4 word tokens (w_1 = ?, w_2 = ?, w_3 = ?, w_4 = ?), e.g. P(w_1 = band, w_2 = music, w_3 = song, w_4 = rock | topic = music).
• We assume that words are generated by topics and that those topics are infinitely exchangeable within a document.
• By de Finetti's theorem:

     p(w, z) = ∫ p(θ) ( ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n) ) dθ

• By marginalizing out the topic variables, we recover the marginal likelihood, equation (3) above.


Notation for the topic-model formulation

• P(z): distribution over topics z in a particular document.
• P(w|z): probability distribution over words w given topic z.
• P(z_i = j): probability that the jth topic was sampled for the ith word token.
• P(w_i | z_i = j): probability of word w_i under topic j.
• Distribution over words within a document:

     p(w_i) = Σ_{j=1}^{T} p(w_i | z_i = j) p(z_i = j)

• φ^(j) = P(w | z = j): multinomial distribution over words for topic j.
• θ^(d) = P(z): multinomial distribution over topics for document d.
• φ and θ indicate which words are important for which topic and which topics are important for a particular document.

  [Figure: density on unigram distributions p(w | θ; β) under LDA for three words and four topics, where p(w | θ; β) = Σ_z p(w | z, β) p(z | θ). The triangle is the 2-D simplex of all multinomial distributions over three words: each vertex assigns probability one to a single word, the midpoint of an edge gives probability 0.5 to two of the words, and the centroid is the uniform distribution. The four points marked with an x are the multinomial distributions p(w|z) of the four topics, and the surface on top of the simplex is an example of a density over the (V − 1)-simplex (multinomial distributions over words) given by LDA.]

Bayesian statistics: general philosophy

• The interpretation of probability is extended to degree of belief (subjective probability), and this is used for hypotheses:

     P(H | x) = P(x | H) π(H) / ∫ P(x | H) π(H) dH

  where P(x | H) is the probability of the data assuming hypothesis H (the likelihood), π(H) is the prior probability (before seeing the data), P(H | x) is the posterior probability (after seeing the data), and the normalization involves a sum (or integral) over all possible hypotheses.
• Bayesian methods can provide a more natural treatment of non-repeatable phenomena: the probability that Kitajima wins a gold medal at the 2008 Olympics, …
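A tiny numeric illustration of this rule for a discrete set of hypotheses (the priors, likelihood model, and data below are invented for the example):

    import numpy as np

    # Three candidate hypotheses about a coin's bias P(heads).
    hypotheses = np.array([0.3, 0.5, 0.7])
    prior = np.array([0.2, 0.6, 0.2])          # pi(H): belief before seeing data

    # Observed data: 8 heads out of 10 tosses.
    heads, tosses = 8, 10
    likelihood = hypotheses**heads * (1 - hypotheses)**(tosses - heads)   # P(x | H)

    # Bayes' rule: posterior ∝ likelihood × prior, normalized over all hypotheses.
    posterior = likelihood * prior
    posterior /= posterior.sum()

    for h, p in zip(hypotheses, posterior):
        print(f"P(bias = {h} | data) = {p:.3f}")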
