Topic Modeling: Roads to Text Meaning

Hồ Tú Bảo
Japan Advanced Institute of Science and Technology
Vietnamese Academy of Science and Technology

Archeology of computational linguistics (adapted from E. Hovy, COLING 2004)

• 1960s: Representation
  - finite state machines (FSMs) and augmented transition networks (ATNs)
• 1960s: Representation beyond the word level
  - lexical features, tree structures, networks
• 1970s: Kernel (vector) spaces
  - clustering, information retrieval (IR)
• 1980s: Standard resources and tasks
  - Penn Treebank, WordNet, MUC
• 1990s–2000s: Statistical learning
  - algorithms, evaluation, corpora
• Internet and Web in the 1990s
  - natural language processing transformation
  - information retrieval and extraction on the Web

PageRank algorithm (Google)

• Google the phrase 'weather forecast' → answer: 4.2 million pages.
• How does Google know which pages are the most important?
• Google assigns a number to each individual page (its PageRank), computed via the eigenvalue problem Pw = λw.
• Example link matrix over pages A, B, C:

         A    B    C
    A   1/2  1/2   0
    B   1/2   0    1
    C    0   1/2   0

• Current size of P: about 4.2 × 10^9 pages.
  (Larry Page, Sergey Brin)
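A minimal power-iteration sketch of this eigenvalue computation, reusing the 3-page link matrix from the slide (illustrative only: no damping factor, which real PageRank adds, and the matrix is assumed column-stochastic):

    import numpy as np

    # Column-stochastic link matrix for pages A, B, C (from the slide):
    # column j holds the probability of moving from page j to each page.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    # Power iteration: repeatedly apply P to a uniform start vector until it
    # converges to the dominant eigenvector w with Pw = w (eigenvalue 1).
    w = np.ones(3) / 3
    for _ in range(100):
        w_next = P @ w
        w_next /= w_next.sum()          # keep it a probability vector
        if np.allclose(w, w_next, atol=1e-12):
            break
        w = w_next

    print(dict(zip("ABC", w.round(3))))  # PageRank-style scores for A, B, C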

Latent semantic analysis & topic models

• The LSA approach makes three claims:
  (1) semantic information can be derived from a word-document co-occurrence matrix;
  (2) dimensionality reduction is an essential part of this derivation;
  (3) words and documents can be represented as points in Euclidean space.
• Departing from claim (3), topic models express the semantic information of words and documents by 'topics'.
• 'Latent' = 'hidden', 'unobservable', 'presently inactive', …

What is a topic?

• The subject matter of a speech, text, meeting, discourse, etc.
• The topic of a text captures "what a document is about", i.e., the meaning of the text.
• A text can be represented by a "bag of words" for many purposes, and you can see the words.
• But how can you see (know) the topics of the text? How is a topic represented, discovered, etc.?

Notation and terminology

• A word is the basic unit of discrete data, from a vocabulary indexed by {1,…,V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.
• A document is a sequence of N words, denoted by d = (w_1, w_2,…, w_N).
• A corpus is a collection of M documents, denoted by D = {d_1, d_2,…, d_M}.

Topic modeling = finding the 'word patterns' of topics

• A 'topic' consists of a cluster of words that frequently occur together.

Vector space model in IR

• Term-document matrix (with query q1):

                 d1  d2  d3  d4  d5  d6  q1
      rock        2   1   0   2   0   1   1
      granite     1   0   1   0   0   0   0
      marble      1   2   0   0   0   0   1
      music       0   0   0   1   2   0   0
      song        0   0   0   1   0   2   0
      band        0   0   0   0   1   0   0

• Documents and queries are compared by the cosine of their vectors:
     cos(x, y) = (x · y) / (|x| |y|)
• Given a query, say q1 = ('rock', 'marble'):
     cos(d3, q1) = 0, cos(d5, q1) = 0, cos(d4, q1) ≠ 0, cos(d6, q1) ≠ 0
  yet d3 is more relevant to q1 than d4 or d6, even though cos(d3, q1) = 0.
• Problem of synonymy (one meaning can be expressed by multiple words, e.g. 'group', 'cluster') and polysemy (a word can have multiple meanings, e.g. 'rock').

Term frequency–inverse document frequency

• tf-idf of a word t_i in document d_j (Salton & McGill, 1983):

     tf-idf(t_i, d_j) = ( n_{i,j} / Σ_k n_{k,j} ) × log( |D| / |{d : t_i ∈ d}| )

  where n_{i,j} = # times t_i occurs in d_j.
• Results in a t×d matrix, thus reducing the corpus to a fixed-length representation.
• Used by search engines.
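A short numpy sketch (an illustrative reconstruction, not code from the lecture) that builds the tf-idf matrix for the toy term-document counts above and ranks the documents against q1 by cosine similarity:

    import numpy as np

    terms = ["rock", "granite", "marble", "music", "song", "band"]
    # Columns: d1..d6 (raw term counts from the slide); the query is handled separately.
    counts = np.array([
        [2, 1, 0, 2, 0, 1],   # rock
        [1, 0, 1, 0, 0, 0],   # granite
        [1, 2, 0, 0, 0, 0],   # marble
        [0, 0, 0, 1, 2, 0],   # music
        [0, 0, 0, 1, 0, 2],   # song
        [0, 0, 0, 0, 1, 0],   # band
    ], dtype=float)

    # tf-idf(t_i, d_j) = (n_ij / sum_k n_kj) * log(|D| / df_i)
    tf = counts / counts.sum(axis=0, keepdims=True)
    df = (counts > 0).sum(axis=1)                 # document frequency of each term
    idf = np.log(counts.shape[1] / df)
    tfidf = tf * idf[:, None]

    # Query q1 = ('rock', 'marble') as a raw term-count vector.
    q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

    def cosine(x, y):
        if not x.any() or not y.any():
            return 0.0
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    for j in range(tfidf.shape[1]):
        print(f"cos(d{j+1}, q1) = {cosine(tfidf[:, j], q):.3f}")
    # d3 scores 0 despite being about rocks (synonymy), while d4/d6 score > 0
    # only because 'rock' is ambiguous (polysemy).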

LSI: Latent semantic indexing (Deerwester et al., 1990)

• LSI is a dimensionality reduction technique that projects documents to a lower-dimensional semantic space and, in doing so, causes documents with similar topical content to be close to one another in the resulting space.

  [Figure: the term-document matrix C (words × documents) factored as C = U D V^T, with U (words × dims), D (dims × dims, the singular values), and V^T (dims × documents).]

• In particular, two documents which share no terms with each other directly, but which do share many terms with a third document, will end up being similar in the projected space.
• C = UDV^T by singular value decomposition, such that U^T U = I, V^T V = I, and D is a diagonal matrix whose diagonal entries are the singular values of C.
• Idea of LSI: strip away most of the dimensions and keep only those which capture the most variation in the document collection (typically from |V| = hundreds of thousands down to k between 100 and 200).
• Similarity between LSI and PCA?
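A minimal sketch of the truncated SVD behind LSI, reusing the toy matrix from the earlier slide and keeping k = 2 dimensions (illustrative only; the variable names are my own):

    import numpy as np

    # Toy term-document matrix C
    # (rows: rock, granite, marble, music, song, band; columns: d1..d6).
    C = np.array([
        [2, 1, 0, 2, 0, 1],
        [1, 0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 0, 0, 1, 2, 0],
        [0, 0, 0, 1, 0, 2],
        [0, 0, 0, 0, 1, 0],
    ], dtype=float)

    # Singular value decomposition: C = U diag(s) Vt.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    # Strip away all but the k dimensions with the largest singular values.
    k = 2
    C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of C

    print(np.round(s, 3))     # singular values, largest first
    print(np.round(C_k, 2))   # smoothed co-occurrence counts in the latent space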


LSI: Example 1

• LSI clusters documents in the reduced-dimension semantic space according to word co-occurrence patterns (a small projection sketch follows after the next slide).
• Dimensions loosely correspond with topic boundaries.
• Two-dimensional LSI coordinates for the toy collection and the query:

               D1      D2      D3      D4      D5      D6      Q1
      Dim. 1  -0.888  -0.759  -0.615  -0.961  -0.388  -0.851  -0.845
      Dim. 2   0.460   0.652   0.789  -0.276  -0.922  -0.525   0.534

  [Figure: scatter plot of D1–D6 and Q1 in the two-dimensional LSI space; D1, D2, D3 and Q1 fall on one side (positive Dim. 2), D4, D5, D6 on the other.]

Exchangeability

• A finite set of random variables {x_1, …, x_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

     p(x_1, …, x_N) = p(x_π(1), …, x_π(N))

• An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.
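To reproduce a 2-D embedding like the LSI example above, documents and the query can be mapped into the reduced space and compared there. A hedged sketch (signs and scaling of the coordinates depend on the SVD convention, so the exact numbers may differ from the slide's table):

    import numpy as np

    # Same toy term-document matrix C as in the SVD sketch above.
    C = np.array([
        [2, 1, 0, 2, 0, 1],
        [1, 0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0, 0],
        [0, 0, 0, 1, 2, 0],
        [0, 0, 0, 1, 0, 2],
        [0, 0, 0, 0, 1, 0],
    ], dtype=float)

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2
    docs_2d = np.diag(s[:k]) @ Vt[:k, :]            # 2-D LSI coordinates of d1..d6

    q = np.array([1, 0, 1, 0, 0, 0], dtype=float)   # query q1 = ('rock', 'marble')
    q_2d = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q    # fold the query into the same space

    def cosine(x, y):
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    for j in range(docs_2d.shape[1]):
        print(f"cos(d{j+1}, q1) in LSI space = {cosine(docs_2d[:, j], q_2d):.3f}")
    # In the reduced space, d3 (which shares no terms with the query) should no
    # longer have zero similarity, because 'granite' co-occurs with 'rock'/'marble'.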

Bag-of-words assumption

• Word order is ignored.
• "Bag of words" means exchangeability, not i.i.d.
• Theorem (de Finetti, 1935): if (x_1, x_2, …, x_N) are infinitely exchangeable, then the joint probability has a representation as a mixture over some random variable θ:

     p(x_1, x_2, …, x_N) = ∫ p(θ) ∏_{i=1}^{N} p(x_i | θ) dθ

Probabilistic topic models: key ideas

• LSA factors the term-document matrix C into U, D, V over latent dimensions; topic models factor the normalized co-occurrence matrix C into word-topic distributions Φ and topic-document distributions Θ.
• Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words.
• Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.


Probabilistic topic models: processes

• Generative model: generating a document
  - choose a distribution over topics and the document length;
  - for each word w_i, choose a topic at random according to this distribution, and choose a word from that topic's word distribution.
• The model therefore specifies:
  - a probability distribution over words associated with each topic;
  - a distribution over topics for each document;
  - the topic responsible for generating each word.
• A topic is associated with a specific language model that generates words appropriate to the topic.
• Statistical inference (the inverse direction): to know which topic model is most likely to have generated the data, infer the hidden distributions from the observed words.

Mixture of unigrams model (Nigam et al., 2000)

• Simple: each document has exactly one topic (appropriate for supervised classification).
• Generates a document by
  - choosing a topic z,
  - generating the N_d words independently from the conditional multinomial distribution p(w|z).

     p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)
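A tiny illustration of this likelihood, with made-up numbers for p(z) and p(w|z) (nothing here comes from the lecture's data):

    import numpy as np

    vocab = ["rock", "granite", "marble", "music", "song", "band"]
    topics = ["geology", "music"]

    # Hypothetical model parameters.
    p_z = np.array([0.5, 0.5])                        # p(z)
    p_w_given_z = np.array([                          # p(w|z), one row per topic
        [0.30, 0.35, 0.30, 0.02, 0.02, 0.01],         # 'geology'
        [0.25, 0.01, 0.01, 0.25, 0.25, 0.23],         # 'music'
    ])

    def doc_likelihood(words):
        """p(d) = sum_z p(z) * prod_n p(w_n | z) for a bag of words."""
        idx = [vocab.index(w) for w in words]
        per_topic = p_w_given_z[:, idx].prod(axis=1)  # prod_n p(w_n | z), for each z
        return float(p_z @ per_topic)

    print(doc_likelihood(["rock", "granite", "marble"]))   # geology-flavoured document
    print(doc_likelihood(["rock", "song", "band"]))        # music-flavoured document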

Graphical model notation:

• Nodes are random variables.
• Edges denote possible dependence.
• Plates denote replicated structure.
• Observable variables are distinguished from latent variables (shaded vs. unshaded nodes).
• The graph encodes the pattern of conditional dependence between the ensemble of random variables, e.g.

     p(y, x_1, …, x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)

How to calculate?   p(d) = Σ_z p(z) ∏_{n=1}^{N_d} p(w_n | z)

• We must estimate the multinomial distributions p(z) and p(w|z).
• If each document is annotated with a topic z:
  - maximum likelihood estimation gives p(z);
  - counting how many times each word w appears in all documents labeled with z, and then normalizing, gives p(w|z).
• If topics are not known for the documents, the EM algorithm can be used to estimate them.
• Limitations of this mixture model:
  1. a document can only contain a single topic;
  2. the distributions have no priors and are learned completely from the data.

Probabilistic latent semantic indexing (Hofmann, 1999)

• pLSI: each word is generated from a single topic, but different words in the same document may be generated from different topics.
• Each document is represented as a list of mixing proportions for the mixture components:

     p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)

• Generative process:
  - choose a document d_m with probability p(d);
  - for each word w_n in d_m:
    · choose a topic z_n from a multinomial conditioned on d_m, i.e., from p(z|d_m);
    · choose a word w_n from a multinomial conditioned on z_n, i.e., from p(w|z_n).
• Once the model has been trained, inference can be performed using Bayes' rule to obtain the most likely topics for each document.
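A compact EM sketch for pLSI on a small count matrix. This is my own illustrative implementation (plain EM, not Hofmann's tempered EM), and the array names are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    # n[d, w]: how many times word w occurs in document d (toy data).
    n = np.array([
        [2, 1, 1, 0, 0, 0],
        [1, 0, 2, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [2, 0, 0, 1, 1, 0],
        [0, 0, 0, 2, 0, 1],
        [1, 0, 0, 0, 2, 0],
    ], dtype=float)
    D, W = n.shape
    K = 2                                            # number of topics

    # Random initialisation of p(z|d) and p(w|z); p(d) is proportional to
    # document length and omitted here.
    p_z_d = rng.dirichlet(np.ones(K), size=D)        # shape (D, K)
    p_w_z = rng.dirichlet(np.ones(W), size=K)        # shape (K, W)

    for _ in range(100):
        # E-step: responsibilities p(z | d, w) ∝ p(w|z) p(z|d).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]         # (D, K, W)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the multinomials from expected counts.
        expected = n[:, None, :] * post                       # (D, K, W)
        p_w_z = expected.sum(axis=0)                          # (K, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)                          # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    print(np.round(p_w_z, 2))   # word distribution of each topic
    print(np.round(p_z_d, 2))   # topic mixture of each document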


Limitations of pLSI

• The model allows multiple topics in each document, but
  - the possible topic proportions have to be learned from the document collection;
  - pLSI makes no assumptions about how the mixture weights θ are generated, making it difficult to test the generalizability of the model to new documents.
• A topic distribution must be learned for each document in the collection, so the number of parameters grows with the number of documents (what about a billion documents?).
• Blei et al. (2003) extended this model by introducing a Dirichlet prior on θ, calling the result latent Dirichlet allocation (LDA).

Latent Dirichlet allocation

• Graphical model: Dirichlet parameter α, per-document topic proportions θ_d (a point on the (T−1)-simplex), per-word topic assignment z_{d,n}, observed word w_{d,n}, per-topic word proportions φ_t (a point on the (V−1)-simplex), topic hyperparameter β.
• Generative process (a sketch of sampling from it follows below):
  1. Draw each topic φ_t ~ Dir(β), t = 1, …, T.
  2. For each document d:
     a. Draw topic proportions θ_d ~ Dir(α), and the document length N_d from a Poisson distribution with parameter ξ.
     b. For each word:
        i.  Draw z_{d,n} ~ Mult(θ_d).
        ii. Draw w_{d,n} ~ Mult(φ_{z_{d,n}}).
• Inference: from a collection of documents, infer
  - the per-word topic assignments z_{d,n},
  - the per-document topic proportions θ_d,
  - the per-topic word distributions φ_t,
  and use posterior expectations to perform the tasks: IR, similarity, …
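A small numpy sketch of this generative process (purely illustrative: the vocabulary, hyperparameters, and corpus sizes below are made up):

    import numpy as np

    rng = np.random.default_rng(42)

    vocab = np.array(["rock", "granite", "marble", "music", "song", "band"])
    V, T, M = len(vocab), 2, 3            # vocabulary size, topics, documents
    alpha, beta, xi = 0.5, 0.1, 8         # Dirichlet hyperparameters, Poisson mean length

    # 1. Draw each topic's word distribution phi_t ~ Dir(beta).
    phi = rng.dirichlet(np.full(V, beta), size=T)        # shape (T, V)

    documents = []
    for d in range(M):
        # 2a. Draw topic proportions theta_d ~ Dir(alpha) and a document length.
        theta = rng.dirichlet(np.full(T, alpha))
        N_d = max(1, rng.poisson(xi))
        words = []
        for _ in range(N_d):
            # 2b. Draw a topic assignment, then a word from that topic.
            z = rng.choice(T, p=theta)
            w = rng.choice(V, p=phi[z])
            words.append(vocab[w])
        documents.append(words)

    for d, words in enumerate(documents):
        print(f"doc {d}: {' '.join(words)}")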

Latent Dirichlet allocation: the model

• Dirichlet prior on the per-document topic distributions:

     p(θ | α) = ( Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i) ) θ_1^{α_1 − 1} ⋯ θ_k^{α_k − 1}

• β is a k×V matrix of word probabilities, β_ij = p(w^j = 1 | z^i = 1): the word distribution φ of each topic. θ gives the topic distribution of each document.
• Joint distribution of a topic mixture θ, a set of N topic assignments z, and a set of N words w:

     p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

• Marginal distribution of a document, obtained by integrating over θ and summing over z:

     p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

• Probability of a collection: the product of the marginal probabilities of the single documents:

     p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_{dn}} p(z_{dn} | θ_d) p(w_{dn} | z_{dn}, β) ) dθ_d


Generative process and inference: parameter estimation

  [Figure: the LDA plate diagram shown twice, once in the generative direction (α, β → θ, φ → z → w) and once in the inference direction (estimating θ, φ and the topic assignments from the observed words w).]

• Parameter estimation methods:
  - mean field variational methods (Blei et al., 2001, 2003)
  - expectation propagation (Minka & Lafferty, 2002)
  - Gibbs sampling (Griffiths & Steyvers, 2004)
  - collapsed variational inference (Teh et al., 2006)
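A minimal collapsed Gibbs sampler for LDA in the spirit of Griffiths & Steyvers (2004). This is my own didactic sketch (toy corpus, fixed symmetric hyperparameters), not the lecture's code:

    import numpy as np

    rng = np.random.default_rng(1)

    docs = [["rock", "granite", "marble", "rock"],
            ["music", "song", "band", "rock"],
            ["marble", "granite", "marble"],
            ["song", "music", "music", "band"]]
    vocab = sorted({w for d in docs for w in d})
    w_id = {w: i for i, w in enumerate(vocab)}

    K, V, D = 2, len(vocab), len(docs)
    alpha, beta = 0.5, 0.1

    # Count tables and initial random topic assignments.
    n_dk = np.zeros((D, K))            # topic counts per document
    n_kw = np.zeros((K, V))            # word counts per topic
    n_k = np.zeros(K)                  # total words per topic
    z = []                             # z[d][i]: topic of the i-th word of doc d
    for d, doc in enumerate(docs):
        z.append([])
        for word in doc:
            k = rng.integers(K)
            z[d].append(k)
            n_dk[d, k] += 1
            n_kw[k, w_id[word]] += 1
            n_k[k] += 1

    for _ in range(500):               # Gibbs sweeps (includes burn-in)
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, k = w_id[word], z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Add the new assignment back.
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)      # topic-word
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)  # doc-topic
    for k in range(K):
        top = np.argsort(phi[k])[::-1][:3]
        print(f"topic {k}:", [vocab[i] for i in top])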

Evolution of the models

• Mixture of unigrams:   p(w) = Σ_z p(z) ∏_{n=1}^{N} p(w_n | z)
• pLSI:                  p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)
• LDA adds the Dirichlet prior:   p(θ | α) = ( Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i) ) θ_1^{α_1 − 1} ⋯ θ_k^{α_k − 1}
• An analogous evolution happened for sequence models: Hidden Markov Models (HMMs) [Baum et al., 1970] → Maximum Entropy Markov Models (MEMMs) [McCallum et al., 2000], more accurate than HMMs → Conditional Random Fields (CRFs) [Lafferty et al., 2001], more accurate than MEMMs.

  [Figure: plate diagrams of the three topic models (mixture of unigrams, pLSI, LDA) alongside the state-observation chains of HMMs, MEMMs, and CRFs.]

A geometric interpretation

  [Figure: the word simplex spanned by word 1, word 2, and word 3, with the topic simplex embedded inside it.]


A geometric interpretation

• The mixture of unigrams model can place documents only at the corners of the topic simplex, since only a single topic is permitted for each document.
• The pLSI model allows multiple topics per document and can therefore place documents within the topic simplex, but only at one of the specified (learned) points.

  [Figure: the word simplex (word 1, word 2, word 3) with the embedded topic simplex (topic 1, topic 2, topic 3); document positions are marked at the corners for the mixture of unigrams model and at discrete interior points for pLSI.]

A geometric interpretation

• By contrast, the LDA model can place documents at any point within the topic simplex.

Inference

• We want to use LDA to compute the posterior distribution of the hidden variables given a document:

     p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this is intractable to compute in general. Marginalizing over the hidden variables, the marginal likelihood (3) can be written as

     p(w | α, β) = ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) ∫ ( ∏_{i=1}^{k} θ_i^{α_i − 1} ) ( ∏_{n=1}^{N} Σ_{i=1}^{k} ∏_{j=1}^{V} (θ_i β_{ij})^{w_n^j} ) dθ

• A variety of approximate inference algorithms exist for LDA.


Inference in practice

• Expectation maximization:
  - but poor results (local maxima).
• Gibbs sampling:
  - parameters: φ, θ;
  - start with an initial random assignment;
  - update each parameter using the other parameters;
  - converges after n iterations, following a burn-in period.

Example

• From 16,000 documents of the AP corpus, a 100-topic LDA model was trained.
• An example article from the AP corpus: each color codes a different factor (topic) from which the word is putatively generated.

TASA corpus: 37,000 texts, 300 topics

• By giving equal probability to the first two topics, one could construct a document about a person who has taken too many drugs, and how that affected color perception.

Topic models for text data (http://www.ics.uci.edu/~smyth/topics.html)

• David Blei's latent Dirichlet allocation (LDA) code in C, using variational learning.
• A topic-based browser of UCI and UCSD faculty research interests, built by Newman and Asuncion at UCI in 2005.
• A topic-based browser for 330,000 New York Times articles, by Dave Newman, UCI.
• Wray Buntine's topic-based search interface to Wikipedia.
• Dave Blei and John Lafferty's browsable 100-topic model of journal articles from Science.
• LSA tools and applications: http://LSA.colorado.edu


Directions for hidden topic discovery

• Hidden topic discovery from documents.
• Application in web search analysis and disambiguation.
• Application in medical information (disease classification).
• Application in digital libraries (information navigation).
• Potential applications in intelligent advertising and recommendation.
• Many others.

Visual words

• Idea: given a collection of images,
  - think of each image as a document;
  - think of the feature patches of each image as words ('visual words');
  - apply the LDA model to extract topics.
• J. Sivic et al., Discovering object categories in image collections. MIT AI Lab Memo AIM-2005-005, Feb. 2005.

Related works and problems

• Hyperlink modeling using LDA, Erosheva et al., PNAS, 2004.
• Finding scientific topics, Griffiths & Steyvers, PNAS, 2004.
• Author-Topic model for scientific literature, Rosen-Zvi et al., UAI, 2004.
• Author-Recipient-Topic model for email data, McCallum et al., IJCAI 2005.
• Modeling citation influences, Dietz et al., ICML 2007.
• Word sense disambiguation, Blei et al., 2007.
• Classifying short and sparse text and Web data, P.X. Hieu et al., WWW 2008.
• Automatic labeling of multinomial topic models, Mei et al., KDD 2007?
• LDA-based document models for ad-hoc retrieval, Croft et al., SIGIR 2006?
• Connection to language modeling?
• Topic models and emerging trend detection?
• Similarity between words and between documents → clustering?
• etc.

Topic analysis of Wikipedia

• Topic-oriented crawling to download documents from Wikipedia:
  - Arts: architecture, fine art, dancing, fashion, film, museum, music, …
  - Business: advertising, e-commerce, capital, finance, investment, …
  - Computers: hardware, software, database, digital, multimedia, …
  - Education: course, graduate, school, professor, university, …
  - Engineering: automobile, telecommunication, civil engineering, …
  - Entertainment: book, music, movie, movie star, painting, photos, …
  - Health: diet, disease, therapy, healthcare, treatment, nutrition, …
  - Mass media: news, newspaper, journal, television, …
  - Politics: government, legislation, party, regime, military, war, …
  - Science: biology, physics, chemistry, ecology, laboratory, patent, …
  - Sports: baseball, cricket, football, golf, tennis, olympic games, …
• Raw data: 3.5 GB, 471,177 documents.
• Preprocessing: remove duplicates, HTML, stop words, and rare words.
• Final data: 240 MB, 71,986 documents, 882,376 paragraphs, 60,649 unique words, 30,492,305 word tokens.
• Tools: JWikiDocs, a Java Wikipedia document crawling tool (http://jwebpro.sourceforge.net/), and GibbsLDA++, a C/C++ latent Dirichlet allocation implementation (http://gibbslda.sourceforge.net/).

(Source: the next few slides are adapted from P.X. Hieu, WWW 2008.)

Topic discovery from MEDLINE
(The full list of topics is at http://gibbslda.sourceforge.net/ohsumed-topics.txt)

• MEDLINE (Medical Literature Analysis and Retrieval System Online): a database of life sciences and biomedical information. It covers the fields of medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care.
• More than 15 million records from approximately 5,000 selected publications (NLM Systems, Feb. 2007), covering biomedicine and health from 1950 to the present.
• Our topic analysis ran on 400 MB of MEDLINE data comprising 348,566 medical abstracts from 1987 to 1991. The outputs (i.e., hidden topics) are used for disease classification.

Application in disease classification

• Comparison of disease classification accuracy with and without hidden topics, using a maximum entropy classifier.

  [Chart: classification accuracy (%) versus size of labeled training data (about 1.1K to 22.5K examples), comparing the baseline (without topics) against classification with hidden topic inference; reported accuracies fall roughly in the 58–67% range.]

Application in digital library

• Topic-based information navigation of journal articles [Blei & Lafferty 2007].
  Source: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Demo: information navigation in the journal Science
(http://www.cs.cmu.edu/~lemur/science/; this was done with the Correlated Topic Model (CTM), a more advanced variant of LDA.)

Potential in intelligent advertising

• Amazon recommendations, Google context-sensitive advertising.
• Can Amazon and Gmail understand us?

Applications in scientific trends: analyzing a topic

• Dynamic topic models [Blei & Lafferty 2006].
• Analyzed data: articles from the journal Science.
  Source: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Visualizing trends within a topic

Summary

• LSA and topic models are roads to text meaning.
• Both can be viewed as dimensionality reduction techniques.
• Exact inference is intractable, but we can approximate it.
• Many applications; fundamental for the digitalized era.
• How best to exploit latent information depends on the application, the field, researchers' backgrounds, …

Key references

• Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science. (citations: 3574)
• Hofmann, T. (1999). Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence (UAI). (citations: 722)
• Nigam, K., et al. (2000). Text classification from labeled and unlabeled documents using EM. (citations: 454)
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research. (citations: 699)

Some other references

• Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Seventh International Conference on World Wide Web (WWW7), Brisbane, Australia, pp. 107–117.
• Haveliwala, T. H. (2002). Topic-sensitive PageRank. 11th International Conference on World Wide Web (WWW 2002), Honolulu, Hawaii, USA.
• Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. NIPS 14, MIT Press.
• Nie, L., Davison, B. D., & Qi, X. (2006). Topical link analysis for web search. 29th ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA.


Mixture models

• In general, a mixture model is a model in which the independent variables are fractions of a total; a probability mixture model is a probability distribution that is a convex combination of other probability distributions. A discrete random variable X is a mixture of n component discrete random variables Y_i:

     f_X(x) = Σ_{i=1}^{n} a_i f_{Y_i}(x)

• In a parametric mixture model, the component distributions come from a parametric family with unknown parameters θ_i:

     f_X(x) = Σ_{i=1}^{n} a_i f_Y(x; θ_i)

• In a continuous mixture, the discrete weights are replaced by a mixing density h:

     f_X(x) = ∫_Θ h(θ) f_Y(x; θ) dθ,   with h(θ) ≥ 0 for all θ ∈ Θ and ∫_Θ h(θ) dθ = 1

Estimation in mixture models

• Given the component family Y and a sample from X, we would like to determine the a_i and θ_i values.
• The expectation-maximization (EM) algorithm is an iterative solution:
  - expectation step: with the current (guessed) parameters, compute for each data point x_j and component Y_i

       y_{i,j} = a_i f_Y(x_j; θ_i) / f_X(x_j)

  - maximization step:

       a_i = (1/N) Σ_j y_{i,j}   and   μ_i = ( Σ_j y_{i,j} x_j ) / ( Σ_j y_{i,j} )
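A sketch of these EM updates for a two-component 1-D Gaussian mixture with fixed unit variances (an illustrative choice, not from the lecture), matching the a_i and μ_i updates above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic sample from a two-component mixture (unit-variance Gaussians).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
    N, K = len(x), 2

    a = np.full(K, 1.0 / K)          # mixing weights a_i
    mu = np.array([-1.0, 1.0])       # component means (initial guesses)

    def component_pdf(x, mu):
        """Unit-variance Gaussian density f_Y(x; mu)."""
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    for _ in range(50):
        # E-step: responsibilities y_ij = a_i f_Y(x_j; mu_i) / f_X(x_j).
        weighted = a[:, None] * component_pdf(x[None, :], mu[:, None])   # (K, N)
        y = weighted / weighted.sum(axis=0, keepdims=True)

        # M-step: a_i = mean_j y_ij,  mu_i = sum_j y_ij x_j / sum_j y_ij.
        a = y.mean(axis=1)
        mu = (y @ x) / y.sum(axis=1)

    print("weights:", np.round(a, 3), "means:", np.round(mu, 3))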

Distributions

• Binomial distribution: the discrete probability distribution of the number of successes in n Bernoulli trials (success/failure outcomes):

     f(k; n, p) = P(X = k) = ( n! / (k! (n − k)!) ) p^k (1 − p)^(n − k)

  (e.g., n = 10, #red = 2, #blue = 3, p = 0.4, P(#red = 4) = …)

• Multinomial distribution: the discrete probability distribution of the number of occurrences x_i of each of k outcomes in n independent trials with outcome probabilities p_1, …, p_k:

     f(x_1, …, x_k; n, p_1, …, p_k) = P(X_1 = x_1 and … and X_k = x_k)
       = ( n! / (x_1! ⋯ x_k!) ) p_1^{x_1} ⋯ p_k^{x_k}   when Σ_{i=1}^{k} x_i = n, and 0 otherwise

  (e.g., n = 10, #red = 2, #blue = 2, #black = 1, p = 0.4, P(#red = 5, #blue = 2, #black = 3) = …)

• Beta distribution: used to model random variables that vary between two finite limits, characterized by two parameters α > 0 and β > 0; quite useful for modeling proportions:

     f(x) = ( Γ(α + β) / (Γ(α) Γ(β)) ) x^(α − 1) (1 − x)^(β − 1),   where Γ(z) = ∫_0^∞ t^(z − 1) e^(−t) dt

  Examples: the percentage of impurities in a certain manufactured product; the proportion of fat (by weight) in a piece of meat.

• Dirichlet distribution Dir(α): the multivariate generalization of the beta distribution, and the conjugate prior of the multinomial distribution in Bayesian statistics:

     f(x_1, …, x_{k−1}; α_1, …, α_k) = p(θ | α) = ( Γ(Σ_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) ) x_1^{α_1 − 1} ⋯ x_k^{α_k − 1}
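A short sketch evaluating these formulas numerically; the concrete parameter values (e.g., probabilities 0.4/0.3/0.3 in the multinomial example) are assumptions chosen for illustration:

    from math import comb, factorial, gamma
    from functools import reduce

    def binomial_pmf(k, n, p):
        """P(X = k) for X ~ Binomial(n, p)."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def multinomial_pmf(xs, n, ps):
        """P(X_1 = x_1, ..., X_k = x_k) for n trials with outcome probabilities ps."""
        if sum(xs) != n:
            return 0.0
        coef = factorial(n) / reduce(lambda acc, x: acc * factorial(x), xs, 1)
        return coef * reduce(lambda acc, xp: acc * xp[1]**xp[0], zip(xs, ps), 1.0)

    def beta_pdf(x, a, b):
        """Beta(a, b) density at x in (0, 1)."""
        return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

    # Binomial example in the spirit of the slide: 10 trials, p(red) = 0.4.
    print(binomial_pmf(4, 10, 0.4))

    # Multinomial example: 10 trials over red/blue/black with probs 0.4/0.3/0.3.
    print(multinomial_pmf([5, 2, 3], 10, [0.4, 0.3, 0.3]))

    # Beta(2, 5) density at a proportion of 0.1 (e.g., an impurity fraction).
    print(beta_pdf(0.1, 2, 5))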


Distributions: discrete vs. continuous

• Binomial distribution: discretely distributes the unit over two outcomes in n experiments.
• Beta distribution: continuously distributes the unit between two limits.
• Multinomial distribution: discretely distributes the unit over k outcomes in n experiments.
• Dirichlet distribution: continuously distributes the unit among k components (a point on the (k−1)-simplex).
• Poisson distribution:  f(k; λ) = λ^k e^(−λ) / k!

Simplex

• A simplex, sometimes called a hypertetrahedron, is the generalization of a tetrahedral region of space to n dimensions. The boundary of a k-simplex has k+1 0-faces (polytope vertices), k(k+1)/2 1-faces (polytope edges), and, in general, C(k+1, i+1) i-faces.

  [Figure: graphs of the n-simplexes for n = 2 to 7.]

LDA and exchangeability

• Example: N_d = 4 word tokens (w_1 = ?, w_2 = ?, w_3 = ?, w_4 = ?), e.g. P(w_1 = band, w_2 = music, w_3 = song, w_4 = rock | topic = music).
• We assume that words are generated by topics and that those topics are infinitely exchangeable within a document.
• By de Finetti's theorem:

     p(w, z) = ∫ p(θ) ( ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n) ) dθ

• By marginalizing out the topic variables, we recover the marginal likelihood, equation (3) above.


Notation for the topic-model formulation

• P(z): distribution over topics z in a particular document.
• P(w|z): probability distribution over words w given topic z.
• P(z_i = j): probability that the jth topic was sampled for the ith word token.
• P(w_i | z_i = j): probability of word w_i under topic j.
• Distribution over words within a document:

     p(w_i) = Σ_{j=1}^{T} p(w_i | z_i = j) p(z_i = j)

• φ^(j) = P(w | z = j): multinomial distribution over words for topic j.
• θ^(d) = P(z): multinomial distribution over topics for document d.
• φ and θ indicate which words are important for which topic and which topics are important for a particular document.

  [Figure: density on unigram distributions p(w | θ; β) under LDA for three words and four topics, where p(w | θ; β) = Σ_z p(w | z, β) p(z | θ). The triangle is the 2-D simplex of all multinomial distributions over three words: each vertex assigns probability one to a single word, the midpoint of an edge gives probability 0.5 to two of the words, and the centroid is the uniform distribution. The four points marked with an x are the multinomial distributions p(w|z) of the four topics, and the surface on top of the simplex is an example of a density over the (V − 1)-simplex (multinomial distributions over words) given by LDA.]

Bayesian statistics: general philosophy

• The interpretation of probability is extended to degree of belief (subjective probability), and this is used for hypotheses:

     P(H | x) = P(x | H) π(H) / ∫ P(x | H) π(H) dH

  where P(x | H) is the probability of the data assuming hypothesis H (the likelihood), π(H) is the prior probability (before seeing the data), P(H | x) is the posterior probability (after seeing the data), and the normalization involves a sum (or integral) over all possible hypotheses.
• Bayesian methods can provide a more natural treatment of non-repeatable phenomena: the probability that Kitajima wins a gold medal at the 2008 Olympics, …
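A tiny numeric illustration of this rule for a discrete set of hypotheses (the priors, likelihood model, and data below are invented for the example):

    import numpy as np

    # Three candidate hypotheses about a coin's bias P(heads).
    hypotheses = np.array([0.3, 0.5, 0.7])
    prior = np.array([0.2, 0.6, 0.2])          # pi(H): belief before seeing data

    # Observed data: 8 heads out of 10 tosses.
    heads, tosses = 8, 10
    likelihood = hypotheses**heads * (1 - hypotheses)**(tosses - heads)   # P(x | H)

    # Bayes' rule: posterior ∝ likelihood × prior, normalized over all hypotheses.
    posterior = likelihood * prior
    posterior /= posterior.sum()

    for h, p in zip(hypotheses, posterior):
        print(f"P(bias = {h} | data) = {p:.3f}")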
