
Latent Semantic Analysis (LSA)

Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

References:
1. G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. Harshman, L. A. Streeter, K. E. Lochbaum, "Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure," SIGIR 1988
2. J. R. Bellegarda, "Latent semantic mapping," IEEE Signal Processing Magazine, September 2005
3. T. K. Landauer, D. S. McNamara, S. Dennis, W. Kintsch (eds.), Handbook of Latent Semantic Analysis, Lawrence Erlbaum, 2007

Taxonomy of Classic IR Models

[Figure: taxonomy of classic IR models. User tasks: Retrieval (ad hoc, filtering) and Browsing. Classic models: Boolean, Vector, Probabilistic. Set-theoretic models: Fuzzy, Extended Boolean. Algebraic models: Generalized Vector, Latent Semantic Analysis (LSA), Neural Networks. Probabilistic models: Inference Network, Belief Network, Language Models. Structured models: Non-Overlapping Lists, Proximal Nodes. Browsing: Flat, Structure Guided, Hypertext.]

Classification of IR Models Along Two Axes

• Matching Strategy
– Literal term matching (matching word patterns between the query and documents)
• E.g., Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)
– Concept matching (matching word meanings between the query and documents)
• E.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), Word Topic Model (WTM)
• Learning Capability
– Term weighting, query expansion, document expansion, etc.
• E.g., Vector Space Model, Latent Semantic Indexing
• Most models are based on linear algebra operations
– Solid theoretical foundations (optimization algorithms)
• E.g., Hidden Markov Model (HMM), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), Word Topic Model (WTM)
• Most models belong to the language modeling approach

Two Perspectives for IR Models (cont.)

• Literal Term Matching vs. Concept Matching

Example (in Chinese): two queries and a news document.

Query 1: 「中國解放軍蘇愷戰機」 (Su-series fighters of the Chinese People's Liberation Army)
Query 2: 「中共新一代空軍戰力」 (the new-generation air-force strength of Communist China)

Document: 香港星島日報篇報導引述軍事觀察家的話表示,到二零零五年台灣將完全喪失空中優勢,原因是中國大陸戰機不論是數量或是性能上都將超越台灣,報導指出中國在大量引進俄羅斯先進武器的同時也得加快研發自製武器系統,目前西安飛機製造廠任職的改進型飛豹戰機即將部署尚未與蘇愷三十通道地對地攻擊住宅飛機,以督促遇到挫折的監控其戰機目前也已經取得了重大階段性的認知成果。根據日本媒體報導在台海戰爭隨時可能爆發情況之下北京方面的基本方針,使用高科技答應局部戰爭。因此,解放軍打算在二零零四年前又有包括蘇愷三十二期在內的兩百架蘇霍伊戰鬥機。
(Roughly: a Hong Kong Sing Tao Daily report quotes military observers as saying that by 2005 Taiwan will have completely lost air superiority, because mainland China's fighters will surpass Taiwan's in both quantity and performance; while importing advanced Russian weapons in large numbers, China is also speeding up development of its own weapon systems, and the PLA plans to field some two hundred Sukhoi fighters, including the Su-30, before 2004.)

– There are usually many ways to express a given concept, so literal terms in a user’s query may not match those of a relevant document

Latent Semantic Analysis (LSA)

• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM), or Two-Mode Factor Analysis

– Three important claims made for LSA
• The semantic information can be derived from a word-document co-occurrence matrix

• The dimension reduction is an essential part of its derivation

• Words and documents can be represented as points in the Euclidean space

– LSA exploits the meaning of words by removing “noise” that is present due to the variability in word choice • Namely, synonymy and polysemy that are found in documents

T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis. Hillsdale, NJ: Erlbaum.

Latent Semantic Analysis: Schematic

• Dimension Reduction and Feature Extraction
– PCA: project a feature vector $X \in R^n$ onto an orthonormal basis $\{\varphi_1, \dots, \varphi_k\}$ ($k \le n$), giving $y_i = \varphi_i^T X$ and the reconstruction $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$, which minimizes $\|\hat{X} - X\|^2$ for a given $k$
– SVD (in LSA): approximate the $m \times n$ word-document matrix $A$ by $A' = U_{m \times k}\, \Sigma_{k \times k}\, V^T_{k \times n}$, with $k \le r \le \min(m, n)$, which minimizes $\|A' - A\|_F^2$ for a given $k$; the rows of $U$ and $V$ live in the $k$-dimensional latent semantic space
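As a concrete illustration of the rank-k approximation above, here is a minimal NumPy sketch; the toy matrix (borrowed from the exercise at the end of these notes) and the choice k = 2 are for illustration only.

```python
import numpy as np

# Toy word-document matrix A (4 terms x 3 docs); values are illustrative only.
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2  # number of latent dimensions to keep

# Thin SVD: A = U diag(s) Vt, with orthonormal columns in U and orthonormal rows in Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated reconstruction A' = U_k Sigma_k V_k^T:
# the best rank-k approximation of A in the Frobenius norm.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("Frobenius error of the rank-%d approximation: %.4f" % (k, np.linalg.norm(A - A_k)))
```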

LSA: An Example

– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction

Projection of a vector $x$ onto an orthonormal basis $\{\varphi_1, \varphi_2\}$:
$y_1 = \|x\| \cos\theta_1 = \frac{x^T \varphi_1}{\|\varphi_1\|} = \varphi_1^T x$, where $\|\varphi_1\| = 1$ (and likewise $y_2 = \varphi_2^T x$)

LSA: Latent Structure Space

• Two alternative frameworks to circumvent vocabulary mismatch

[Diagram: documents can be enriched through a "doc terms structure model" (document expansion) and queries through a "query terms structure model" (query expansion); retrieval is then performed either by literal term matching or by latent semantic structure retrieval.]

LSA: Another Example (1/2)

[Figure: an example collection of twelve numbered documents.]

LSA: Another Example (2/2)

Query: "human computer interaction" (the figure also marks an OOV word)
– Words similar in meaning are "near" each other in the LSA space even if they never co-occur in a document
– Documents similar in concept are "near" each other in the LSA space even if they share no words in common

Three sorts of basic comparisons:
– Compare two words
– Compare two documents
– Compare a word to a document

LSA: Theoretical Foundation (1/10)

• Singular Value Decomposition (SVD) of the word-document matrix (rows = terms/units, columns = documents/compositions; Row A ⊆ $R^n$, Col A ⊆ $R^m$):
$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}$, with $r \le \min(m, n)$
– Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
– $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
• Keeping only the largest $k \le r$ singular values gives $A' = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}$, with $\|A\|_F^2 \ge \|A'\|_F^2$
– Docs and queries are represented in a $k$-dimensional space; the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$

LSA: Theoretical Foundation (2/10)

• The "term-document" matrix A has to do with the co-occurrences between terms (or units) and documents (or compositions)
– Contextual information for words in documents is discarded
• "bag-of-words" modeling

• Feature extraction for the entries $a_{i,j}$ of matrix A
1. Conventional tf-idf statistics

2. Or, $a_{i,j}$: occurrence frequency weighted by negative entropy,
$a_{i,j} = \frac{f_{i,j}}{d_j} \times (1 - \varepsilon_i)$, with document length $d_j = \sum_{i=1}^{m} f_{i,j}$
where $f_{i,j}$ is the occurrence count of term $i$ in document $j$, and $\varepsilon_i$ is the normalized entropy of term $i$ in the collection ($0 \le \varepsilon_i \le 1$):
$\varepsilon_i = -\frac{1}{\log n} \sum_{j=1}^{n} \frac{f_{i,j}}{\tau_i} \log \frac{f_{i,j}}{\tau_i}$, $\quad \tau_i = \sum_{j=1}^{n} f_{i,j}$
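A minimal NumPy sketch of this entropy-based weighting, assuming a small made-up count matrix `f` (zero counts contribute nothing to the entropy sum):

```python
import numpy as np

# Raw occurrence counts f[i, j] of term i in document j (made-up example).
f = np.array([[3.0, 0.0, 5.0],
              [0.0, 2.0, 2.0],
              [4.0, 1.0, 1.0]])
m, n = f.shape

tau = f.sum(axis=1)                  # tau_i: total count of term i in the collection
d = f.sum(axis=0)                    # d_j: document lengths

# Normalized entropy of each term: eps_i = -(1/log n) * sum_j (f_ij/tau_i) log(f_ij/tau_i)
p = np.where(f > 0, f / tau[:, None], 1.0)          # placeholder 1.0 where f = 0
eps = -(np.where(f > 0, p * np.log(p), 0.0)).sum(axis=1) / np.log(n)

# Entropy-weighted entries: a_ij = (f_ij / d_j) * (1 - eps_i)
A = (f / d[None, :]) * (1.0 - eps)[:, None]
print(np.round(A, 3))
```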

LSA: Theoretical Foundation (3/10)

• Singular Value Decomposition (SVD)
– $A^T A$ is a symmetric $n \times n$ matrix

• All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$, collected in $\mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$
• All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $v_j^T v_j = 1$, $V = [v_1\; v_2\; \dots\; v_n]$, $V^T V = I_{n \times n}$

• Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
– As the square roots of the eigenvalues of $A^T A$

– As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$

$\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$, so $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A.

LSA: Theoretical Foundation (4/10)

• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A:
$Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
– Suppose that A (or $A^T A$) has rank $r \le n$:
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$, $\quad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$
– Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A:
$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i$
($V$ is an orthonormal matrix, known in advance; $U$ is also an orthonormal matrix)
$\Rightarrow [u_1\; u_2\; \dots\; u_r]\, \Sigma_r = A\, [v_1\; v_2\; \dots\; v_r] \quad (m \times r)$
• Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$:
$\Rightarrow [u_1\; \dots\; u_r\; \dots\; u_m]\, \Sigma_{m \times n} = A\, [v_1\; \dots\; v_r\; \dots\; v_n]$
$\Rightarrow U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T$, with
$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
and $\|A\|_F^2 = \sum_i \sum_j a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$

LSA: Theoretical Foundation (5/10)

• $v_i$ spans the row space of A, $u_i$ spans the column space of A (the row space of $A^T$)

[Figure: the four fundamental subspaces of $A_{m \times n}$: $V_1$ spans the row space of A in $R^n$ and $V_2$ its orthogonal complement, the null space of A ($AX = 0$); $U_1$ spans the column space of A in $R^m$ and $U_2$ its orthogonal complement.]

$U \Sigma V^T = (U_1\; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T = A \quad$ (since $U\Sigma = AV$)

LSA: Theoretical Foundation (6/10)

• Additional Explanations
– Each row of U is related to the projection of the corresponding row of A onto the basis formed by the columns of V:
$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$
• the $i$-th entry of a row of U is related to the projection of the corresponding row of A onto the $i$-th column of V
– Each row of V is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of U:
$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$
• the $i$-th entry of a row of V is related to the projection of the corresponding row of $A^T$ onto the $i$-th column of U
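These two identities are easy to check numerically; a small sketch with an arbitrary random matrix (the shapes are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))      # arbitrary 5x4 "word-document" matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma, V = np.diag(s), Vt.T

# Rows of U (scaled by Sigma) are the projections of the rows of A onto the columns of V.
print(np.allclose(A @ V, U @ Sigma))      # U Sigma = A V

# Rows of V (scaled by Sigma) are the projections of the rows of A^T onto the columns of U.
print(np.allclose(A.T @ U, V @ Sigma))    # V Sigma = A^T U
```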

LSA: Theoretical Foundation (7/10)

• Fundamental comparisons based on SVD
– The original word-document matrix (A):
• compare two terms → dot product of two rows of A (i.e., an entry in $AA^T$)
• compare two docs → dot product of two columns of A (i.e., an entry in $A^T A$)
• compare a term and a doc → an individual entry of A
– The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$ (which stretches or shrinks each latent axis), and $V' = V_{n \times k}$:
• compare two terms: $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$ → dot product of two rows of $U'\Sigma'$
• compare two docs: $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$ → dot product of two rows of $V'\Sigma'$
• compare a query word and a doc → an individual entry of A' (scaled by the square roots of the singular values)
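A minimal sketch of these comparisons in the reduced space, using rows of $U'\Sigma'$ for terms and rows of $V'\Sigma'$ for documents; the toy matrix and k are illustrative, and cosine is used instead of a raw dot product:

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]       # rows of U' Sigma' (one per term)
doc_vecs = Vt[:k, :].T * s[:k]     # rows of V' Sigma' (one per doc)

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# Compare two terms: dot product (or cosine) of two rows of U' Sigma'.
print("term 0 vs term 1:", round(cos(term_vecs[0], term_vecs[1]), 3))

# Compare two documents: dot product (or cosine) of two rows of V' Sigma'.
print("doc 0 vs doc 2:", round(cos(doc_vecs[0], doc_vecs[2]), 3))
```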

LSA: Theoretical Foundation (8/10)

• Fold-in: find the representation for a pseudo-document q
– For objects (new queries or docs) that did not appear in the original analysis
• Fold in a new $m \times 1$ query (or doc) vector (see Figure A on the next page):
$\hat{q}_{1 \times k} = q^T_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
– Just like a row of V; the separate dimensions are differentially weighted
– The query is represented as the weighted sum of its constituent term vectors, scaled by the inverse of the singular values
• Cosine measure between the query and doc vectors (row vectors) in the latent semantic space (docs are sorted in descending order of their cosine values):
$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \dfrac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\;\|\hat{d}\Sigma\|}$
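A minimal sketch of the fold-in and the cosine ranking just described; the matrix, the query term counts, and k are all made-up values:

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Fold in an m x 1 query vector: q_hat = q^T U_k Sigma_k^{-1} (just like a row of V).
q = np.array([1.0, 0.0, 1.0, 0.0])           # (weighted) term counts of the query
q_hat = q @ U_k @ np.linalg.inv(S_k)

def sim(q_hat, d_hat, S):
    """sim(q_hat, d_hat) = cosine(q_hat Sigma, d_hat Sigma)."""
    qs, ds = q_hat @ S, d_hat @ S
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds) + 1e-12))

scores = [sim(q_hat, V_k[j], S_k) for j in range(V_k.shape[0])]
print("docs ranked by similarity:", np.argsort(scores)[::-1])
```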

LSA: Theoretical Foundation (9/10)

• Fold in a new $1 \times n$ term vector (see Figure B below):
$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$

LSA: Theoretical Foundation (10/10)

• Note that the first k columns of U and V are orthogonal, but the rows of U and V (i.e., the word and document vectors), consisting of k elements, are not orthogonal
• Alternatively, A can be written as the sum of k rank-1 matrices:
$A \approx A_k = \sum_{i=1}^{k} u_i \sigma_i v_i^T$

– $u_i$ and $v_i$ are the $i$-th columns of U and V, respectively (the left and right singular vectors)
• LSA with relevance feedback (query expansion):
$\hat{q}_{1 \times k} = q^T_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} + d^T_{1 \times n}\, V_{n \times k}$

– d is a binary vector whose elements specify which documents to add to the query
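A small sketch of this expansion, reusing the fold-in setup above; the binary feedback vector `d` (marking which documents to add) is a made-up example:

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k_inv, V_k = U[:, :k], np.diag(1.0 / s[:k]), Vt[:k, :].T

q = np.array([1.0, 0.0, 1.0, 0.0])   # query term vector (m x 1)
d = np.array([0.0, 0.0, 1.0])        # binary vector: documents judged relevant

# q_hat = q^T U_k Sigma_k^{-1} + d^T V_k
q_hat = q @ U_k @ S_k_inv + d @ V_k
print(np.round(q_hat, 3))
```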

LSA: A Simple Evaluation

• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated on the TDT-3 SD collection (using word-level indexing terms).

LSA: Pro and Con (1/2)

• Pro (Advantages) – A clean formal framework and a clearly defined optimization criterion (least-squares) • Conceptual simplicity and clarity

– Handle synonymy problems ("heterogeneous vocabulary")
• Replace individual terms as the descriptors of documents by independent "artificial concepts" that can be specified by any one of several terms (or documents) or combinations

– Good results for high-recall search • Take term co-occurrence into account

LSA: Pro and Con (2/2)

• Disadvantages – Contextual or positional information for words in documents is discarded (the so-called bag-of-words assumption)

– High computational complexity (e.g., computing the SVD)

– Exhaustive comparison of a query against all stored documents is needed (cannot make use of inverted files?)

– LSA offers only a partial solution to polysemy (e.g., bank, bass, …)
• Every term is represented as just one point in the latent space (i.e., as a weighted average of the different meanings of the term)

– To date, aside from folding-in, there is no optimal way to add information (new words or documents) to an existing word-document space
• Re-computing the SVD (or the reduced space) with the added information is a more direct and accurate solution

LSA: Junk E-mail Filtering

• One vector represents the centroid of all e-mails that are of interest to the user, while the other represents the centroid of all e-mails that are not of interest
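A minimal sketch of this idea, assuming the e-mails have already been mapped into an LSA space; all vectors and the dimension are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8                                              # LSA dimension (placeholder)

# Synthetic LSA-space vectors for e-mails the user marked as interesting / not interesting.
interesting = rng.standard_normal((20, k)) + 1.0
uninteresting = rng.standard_normal((20, k)) - 1.0

# One vector per class: the centroid of each group of e-mails.
c_pos = interesting.mean(axis=0)
c_neg = uninteresting.mean(axis=0)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# A new e-mail (also folded into the LSA space) is routed to the closer centroid.
new_mail = rng.standard_normal(k) + 0.8
print("interesting" if cosine(new_mail, c_pos) > cosine(new_mail, c_neg) else "junk")
```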

LSA: Dynamic Language Model Adaptation (1/4)

• Let $w_q$ denote the word about to be predicted, and $H_{q-1}$ the admissible LSA history (context) for this particular word
– The vector representation of $H_{q-1}$ is expressed by $\tilde{d}_{q-1}$
• which can then be projected into the latent semantic space (change of notation: $S = \Sigma$):
$\tilde{v}_{q-1} S = \tilde{d}^T_{q-1} U$
• Iteratively update $\tilde{d}_{q-1}$ and $\tilde{v}_{q-1}$ as the decoding evolves (word $w_i$ is the word just observed, $n_q$ the length of the history so far):
– VSM representation: $\tilde{d}_q = \frac{n_q - 1}{n_q}\, \tilde{d}_{q-1} + \frac{1 - \varepsilon_i}{n_q}\, [0 \dots 1 \dots 0]^T$
– LSA representation: $\tilde{v}_q S = \tilde{d}^T_q U = \frac{1}{n_q}\big[(n_q - 1)\, \tilde{v}_{q-1} S + (1 - \varepsilon_i)\, u_i\big]$
– or, with exponential decay: $\tilde{v}_q S = \frac{1}{n_q}\big[\lambda\,(n_q - 1)\, \tilde{v}_{q-1} S + (1 - \varepsilon_i)\, u_i\big]$
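A small sketch of the incremental history update above; `U`, the entropies `eps`, the decay factor, and the word ids are placeholders standing in for quantities that would come from an LSA model trained offline:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 1000, 50                      # vocabulary size and LSA dimension (placeholders)
U = rng.standard_normal((m, k))      # word vectors u_i (rows of U)
eps = rng.uniform(0.0, 1.0, size=m)  # normalized entropy of each word
lam = 0.95                           # exponential decay factor

v_tilde = np.zeros(k)                # running history representation (v~_q S)
n_q = 0                              # number of words seen so far

def update_history(word_id, v_tilde, n_q):
    """v~_q S = (1/n_q) [ lambda (n_q - 1) v~_{q-1} S + (1 - eps_i) u_i ]"""
    n_q += 1
    v_tilde = (lam * (n_q - 1) * v_tilde + (1.0 - eps[word_id]) * U[word_id]) / n_q
    return v_tilde, n_q

for w in [3, 17, 42, 3]:             # a few decoded (made-up) word ids
    v_tilde, n_q = update_history(w, v_tilde, n_q)
print(np.round(v_tilde[:5], 3))
```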

LSA: Dynamic Language Model Adaptation (2/4)

• Integration of LSA with N-grams

$\Pr(w_q \mid H^{(n+l)}_{q-1}) = \Pr(w_q \mid H^{(n)}_{q-1}, H^{(l)}_{q-1})$

where $H_{q-1}$ denotes some suitable history for word $w_q$, and the superscripts $(n)$ and $(l)$ refer to the n-gram component ($w_{q-1} w_{q-2} \dots w_{q-n+1}$, with $n > 1$) and the LSA component ($\tilde{d}_{q-1}$), respectively. This expression can be rewritten as:

$\Pr(w_q \mid H^{(n+l)}_{q-1}) = \dfrac{\Pr(w_q, H^{(l)}_{q-1} \mid H^{(n)}_{q-1})}{\sum_{w_i \in V} \Pr(w_i, H^{(l)}_{q-1} \mid H^{(n)}_{q-1})}$

LSA: Dynamic Language Model Adaptation (3/4)

• Integration of LSA with N-grams (cont.)

$\Pr(w_q, H^{(l)}_{q-1} \mid H^{(n)}_{q-1}) = \Pr(w_q \mid H^{(n)}_{q-1}) \cdot \Pr(H^{(l)}_{q-1} \mid w_q, H^{(n)}_{q-1})$
$= \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q w_{q-1} w_{q-2} \cdots w_{q-n+1})$
(assume the probability of the document history given the current word is not affected by the immediate context preceding it)
$= \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q)$
$= \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \dfrac{\Pr(w_q \mid \tilde{d}_{q-1}) \Pr(\tilde{d}_{q-1})}{\Pr(w_q)}$

so that

$\Pr(w_q \mid H^{(n+l)}_{q-1}) = \dfrac{\Pr(w_q \mid w_{q-1} \cdots w_{q-n+1}) \cdot \Pr(w_q \mid \tilde{d}_{q-1}) / \Pr(w_q)}{\sum_{w_i \in V} \Pr(w_i \mid w_{q-1} \cdots w_{q-n+1}) \cdot \Pr(w_i \mid \tilde{d}_{q-1}) / \Pr(w_i)}$

LSA: Dynamic Language Model Adaptation (4/4)

Intuitively, $\Pr(w_q \mid \tilde{d}_{q-1})$ reflects the "relevance" of word $w_q$ to the admissible history, as observed through $\tilde{d}_{q-1}$:
$\Pr(w_q \mid \tilde{d}_{q-1}) \approx K(w_q \mid \tilde{d}_{q-1}) = \cos(u_q S^{1/2}, \tilde{v}_{q-1} S^{1/2}) = \dfrac{u_q S\, \tilde{v}^T_{q-1}}{\|u_q S^{1/2}\|\;\|\tilde{v}_{q-1} S^{1/2}\|}$
As such, it will be highest for words whose meaning aligns most closely with the semantic fabric of $\tilde{d}_{q-1}$ (i.e., relevant "content" words), and lowest for words which do not convey any particular information about this fabric (e.g., "function" words like "the").
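A hedged sketch of the integrated probability: the n-gram and unigram distributions and all vectors are random stand-ins, and $\Pr(w \mid \tilde{d}_{q-1})$ is approximated by the cosine score $K(\cdot)$ above, floored at a small positive value so that the normalization is well defined:

```python
import numpy as np

rng = np.random.default_rng(3)
V, k = 20, 8                               # tiny vocabulary and LSA dimension (placeholders)
U = rng.standard_normal((V, k))            # word vectors u_i
S = np.diag(rng.uniform(1.0, 3.0, k))      # singular-value matrix S
v_hist = rng.standard_normal(k)            # current history vector v~_{q-1}

ngram = rng.dirichlet(np.ones(V))          # stand-in for Pr(w | w_{q-1} ... w_{q-n+1})
unigram = rng.dirichlet(np.ones(V))        # stand-in for Pr(w)

def relevance(i):
    # K(w_i | d~) = cos(u_i S^{1/2}, v~_{q-1} S^{1/2}), floored at a small positive value.
    a, b = U[i] @ np.sqrt(S), v_hist @ np.sqrt(S)
    return max(1e-6, float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))

# Pr(w_q | H) proportional to Pr(w_q | n-gram) * Pr(w_q | d~) / Pr(w_q), renormalized over V.
weights = np.array([ngram[i] * relevance(i) / unigram[i] for i in range(V)])
p_integrated = weights / weights.sum()
print(np.round(p_integrated, 3))
```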

J. Bellegarda, Latent Semantic Mapping: Principles & Applications (Synthesis Lectures on Speech and Audio Processing), 2008

LSA: Cross-lingual Language Model Adaptation (1/2)

• Assume that a document-aligned (instead of sentence-aligned) Chinese-English bilingual corpus is provided

Lexical triggers and latent semantic analysis for cross-lingual language model adaptation, TALIP 2004, 3(2)

LSA: Cross-lingual Language Model Adaptation (2/2)

• CL-LSA adapted Language Model
– $d_i^E$ is a relevant English doc of the Mandarin doc being transcribed, obtained by cross-lingual IR (CL-IR)

$P_{\text{Adapt}}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda \cdot P_{\text{CL-LSA-Unigram}}(c_k \mid d_i^E) + (1 - \lambda) \cdot P_{\text{BG-Trigram}}(c_k \mid c_{k-1}, c_{k-2})$

$P_{\text{CL-LSA-Unigram}}(c \mid d_i^E) = \sum_{e} P_T(c \mid e)\, P(e \mid d_i^E)$

$P_T(c \mid e) \approx \dfrac{\mathrm{sim}(\vec{c}, \vec{e})^{\gamma}}{\sum_{c'} \mathrm{sim}(\vec{c'}, \vec{e})^{\gamma}} \quad (\gamma \gg 1)$
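A minimal sketch of the translation probability above, taking sim(c, e) to be the cosine between Chinese and English word vectors assumed to live in a shared LSA space; every vector and distribution here is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(4)
k, n_zh, n_en = 16, 50, 40
zh_vecs = rng.standard_normal((n_zh, k))   # Chinese word vectors in the shared space
en_vecs = rng.standard_normal((n_en, k))   # English word vectors in the shared space
gamma = 10.0                               # gamma >> 1 sharpens the distribution

def p_translation(e_id):
    """P_T(c | e) ~ sim(c, e)^gamma / sum_c' sim(c', e)^gamma (negative sims floored)."""
    e = en_vecs[e_id]
    sims = zh_vecs @ e / (np.linalg.norm(zh_vecs, axis=1) * np.linalg.norm(e))
    sims = np.clip(sims, 1e-12, None) ** gamma
    return sims / sims.sum()

def p_cl_unigram(p_e_given_doc):
    """P_CL-LSA-Unigram(c | d_E) = sum_e P_T(c | e) P(e | d_E)."""
    return sum(p_e_given_doc[e] * p_translation(e) for e in range(n_en))

p_e_given_doc = rng.dirichlet(np.ones(n_en))   # stand-in for P(e | d_E)
print(round(float(p_cl_unigram(p_e_given_doc).sum()), 6))   # should be ~1.0
```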

LSA: SVDLIBC

• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library

• Download it at http://tedlab.mit.edu/~dr/

LSA: Exercise (1/4)

• Given a sparse term-document matrix
– E.g., 4 terms and 3 docs, stored column by column in a sparse text format (header: #terms, #docs, #nonzero entries; then, for each doc/column, the number of nonzero entries followed by row-value pairs):

4 3 6
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)

corresponding to the dense matrix

2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0

– Each entry can be weighted by a TF×IDF score
• Perform SVD to obtain term and document vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, …, 600) of LSA dimensionality
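The column-wise sparse text layout shown above is easy to generate; a small sketch that writes a dense NumPy matrix in that format (the output file name is just an example):

```python
import numpy as np

def write_sparse_text(A, path):
    """Write matrix A in the column-wise sparse text format shown above."""
    m, n = A.shape
    with open(path, "w") as f:
        f.write(f"{m} {n} {int(np.count_nonzero(A))}\n")  # #terms #docs #nonzero entries
        for j in range(n):                                # one block per document (column)
            rows = np.nonzero(A[:, j])[0]
            f.write(f"{len(rows)}\n")                     # nonzero entries in this column
            for i in rows:
                f.write(f"{i} {A[i, j]}\n")               # row index and value

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
write_sparse_text(A, "Term-Doc-Matrix")
```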

LSA: Exercise (2/4)

• Example: term-document matrix
– 51253 indexing terms, 2265 docs, 218852 nonzero entries; the sparse matrix file begins:

51253 2265 218852
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……

• SVD command (IR_svd.bat):

svd -r st -o LSA100 -d 100 Term-Doc-Matrix

– -r st: sparse text input format
– -o LSA100: prefix of the output files (LSA100-Ut, LSA100-S, LSA100-Vt)
– -d 100: number of reserved eigenvectors (LSA dimensions)
– Term-Doc-Matrix: name of the sparse matrix input file

LSA: Exercise (3/4)

• LSA100-Ut (for the 51253 words): header "100 51253"; each word vector ($u^T$) is 1×100
0.003 0.001 ……
0.002 0.002 ……
• LSA100-Vt (for the 2265 docs): header "100 2265"; each doc vector ($v^T$) is 1×100
0.021 0.035 ……
0.012 0.022 ……
• LSA100-S: 100 eigenvalues
2686.18
829.941
559.59
….

LSA: Exercise (4/4)

• Fold in a new $m \times 1$ query vector

$\hat{q}_{1 \times k} = q^T_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
– Just like a row of V; the separate dimensions are differentially weighted
– The query is represented as the weighted sum of its constituent term vectors

• Cosine measure between the query and doc vectors in the latent semantic space

$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \dfrac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\;\|\hat{d}\Sigma\|}$
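Putting the exercise together, a hedged sketch of the retrieval step: it assumes the -Ut/-Vt output files store a "rows cols" header followed by one latent dimension per row (as the headers above suggest) and that -S lists the values after a count line; it then folds in a query and ranks the documents by the cosine measure.

```python
import numpy as np

def load_dense(path):
    """Parse an output file with a 'rows cols' header followed by the matrix values."""
    with open(path) as f:
        rows, cols = map(int, f.readline().split())
        return np.loadtxt(f).reshape(rows, cols)

Ut = load_dense("LSA100-Ut")              # k x m: one row per latent dimension
Vt = load_dense("LSA100-Vt")              # k x n
s = np.loadtxt("LSA100-S", skiprows=1)    # the k singular values

def retrieve(query_term_vector, top=10):
    """Fold in q (length m), then rank docs by sim(q^, d^) = cosine(q^ Sigma, d^ Sigma)."""
    q_hat = (query_term_vector @ Ut.T) / s   # q^T U_k Sigma_k^{-1}
    q_s = q_hat * s                          # q^ Sigma
    d_s = (Vt * s[:, None]).T                # one row per doc: d^ Sigma
    scores = d_s @ q_s / (np.linalg.norm(d_s, axis=1) * np.linalg.norm(q_s) + 1e-12)
    return np.argsort(scores)[::-1][:top]

# Example: a query containing terms 508 and 596 once each (indices from the sample file above).
q = np.zeros(Ut.shape[1])
q[[508, 596]] = 1.0
print(retrieve(q))
```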
