DOCUMENTS ORGANIZATION

TOPIC MODELS – LATENT DIRICHLET ALLOCATION

Fabio Stella, Associate Professor, Department of Informatics, Systems and Communication, University of Milano-Bicocca

Text Mining – Fabio Stella. Documents Organization: TOPIC MODELS – LDA

Part of the material presented in this lecture is taken from the following tutorial.

David Blei (2012). Tutorial on Topic Models. International Conference on Machine Learning (ICML 2012). http://www.cs.princeton.edu/~blei/papers/icml-2012-tutorial.pdf

Transcription and interpretation errors are the responsibility of the lecturer.


The lecture introduces:

- LATENT DIRICHLET ALLOCATION (LDA)

- TOPIC-WORDS DISTRIBUTION

- DOCUMENT-TOPICS DISTRIBUTION

- CORPUS-TOPICS DISTRIBUTION

TOPIC MODELS: LATENT DIRICHLET ALLOCATION

The first and most widely applied topic model is LATENT DIRICHLET ALLOCATION (LDA).

The main assumption underlying LDA is that documents exhibit multiple topics:

- each topic is a distribution over words

- each document is a mixture of corpus-wide topics

- each word is drawn from one of those topics

TOPIC MODELS: TOPIC-WORDS DISTRIBUTION

A TOPIC is a PROBABILITY DISTRIBUTION OVER THE WORDS OF THE VOCABULARY W_1:N (the vocabulary consists of the N unique words used to represent the document corpus).
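As a concrete sketch, a topic can be represented as a mapping from words to probabilities. The values below are the ten most frequent words of TOPIC 1 from the lecture's table; the rest of the N-word vocabulary is omitted, so the probabilities shown do not sum to one.

```python
# TOPIC 1 as a (truncated) probability distribution over vocabulary
# words; the ten most frequent words from the lecture's table, with
# the remaining vocabulary mass omitted.
topic_1 = {
    "people": 0.0101, "games": 0.0097, "music": 0.0080,
    "digital": 0.0078, "technology": 0.0073, "game": 0.0071,
    "mobile": 0.0063, "video": 0.0059, "players": 0.0045,
    "apple": 0.0043,
}

def p_word_given_topic(word, topic):
    """P(w | TOPIC): probability of drawing `word` from the topic."""
    return topic.get(word, 0.0)

# The three most probable words of TOPIC 1.
top3 = sorted(topic_1, key=topic_1.get, reverse=True)[:3]
print(top3)  # ['people', 'games', 'music']
```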

This is just a portion of the PROBABILITY DISTRIBUTION OVER THE WORDS OF THE VOCABULARY (only the 10 most frequent words of each topic are shown, to summarize the topic).

TOPIC-WORDS DISTRIBUTION P(W_1:N | TOPIC k), for each word w ∈ W_1:N:

TOPIC 1:  people 0.0101, games 0.0097, music 0.0080, digital 0.0078, technology 0.0073, game 0.0071, mobile 0.0063, video 0.0059, players 0.0045, apple 0.0043
TOPIC 2:  growth 0.0093, economy 0.0085, economic 0.0072, sales 0.0071, market 0.0068, china 0.0064, prices 0.0058, world 0.0054, bank 0.0051, rise 0.0049
TOPIC 3:  england 0.0101, game 0.0088, wales 0.0060, players 0.0058, ireland 0.0055, club 0.0054, win 0.0052, six 0.0049, cup 0.0046, time 0.0045
TOPIC 4:  people 0.0137, users 0.0093, net 0.0087, software 0.0078, internet 0.0066, mobile 0.0057, security 0.0055, service 0.0055, technology 0.0054, phone 0.0054
TOPIC 5:  labour 0.0134, government 0.0104, election 0.0103, party 0.0101, blair 0.0101, people 0.0092, brown 0.0073, minister 0.0070, howard 0.0056, prime 0.0055
TOPIC 6:  music 0.0145, awards 0.0089, band 0.0072, award 0.0064, album 0.0061, won 0.0055, song 0.0053, film 0.0050, british 0.0046, top 0.0045
TOPIC 7:  company 0.0124, firm 0.0099, deal 0.0079, shares 0.0061, yukos 0.0057, market 0.0047, oil 0.0046, financial 0.0046, bid 0.0044, offer 0.0044
TOPIC 8:  world 0.0109, win 0.0067, set 0.0067, champion 0.0063, final 0.0058, olympic 0.0057, won 0.0055, roddick 0.0046, time 0.0045, race 0.0044
TOPIC 9:  film 0.0242, films 0.0071, star 0.0047, series 0.0046, comedy 0.0044, movie 0.0042, director 0.0041, actor 0.0041, office 0.0040, festival 0.0039
TOPIC 10: law 0.0070, government 0.0069, people 0.0061, police 0.0058, court 0.0053, lord 0.0051, home 0.0049, told 0.0046, rights 0.0043, bill 0.0041

Consider a single word w ∈ W_1:N of a document. If the word is about TOPIC 1, then w will be:

- "people" with probability equal to 0.0101
- "games" with probability equal to 0.0097
- "music" with probability equal to 0.0080

TOPIC MODELS: DOCUMENT-TOPICS DISTRIBUTION

Each document d (a news article) of a document corpus D (the BBC news articles corpus) is associated with a DOCUMENT-TOPICS DISTRIBUTION θ_d = P(T_1:K | d).

For the news article below the distribution concentrates on two topics, P(T_8 | d) = 0.68 and P(T_10 | d) = 0.32:

P(T_1:10 | d) = (0, 0, 0, 0, 0, 0, 0, 0.68, 0, 0.32)

NEWS ARTICLE (with its document-topics distribution):

TOPIC:      1     2     3     4     5     6     7     8     9     10
P(T_k | d): 0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.68  0.00  0.32

"Minister digs in over doping row. The Belgian sports minister at the centre of the doping row says he will not apologise for making allegations against her. Claude Eerdekens claims the US Open champion tested positive for ephedrine at an exhibition event last month. Criticised for making the announcement, he said: 'I will never apologise. This product is banned and it's up to her to explain why it's there.' Kuznetsova says the stimulant may have been in a cold remedy she took. The Russian said she did nothing wrong by taking the medicine during the event. The Women's Association cleared Kuznetsova of any offence because the drug is not banned when taken out of competition. Eerdekens said he made the statement in order to protect the other three players that took part in the tournament: Belgian Justine Henin-Hardenne, Nathalie Dechy of France and Russia's Dementieva. But Dechy is fuming that she has been implicated in the row: 'How can you be happy when you see your face on the cover page and talking about doping?' Dechy said: 'I'm really upset about it and I think the Belgian government did a really bad job about this. I think we deserve an apology from the guy. You cannot say anything like this - you cannot say some stuff like this, saying it's one of these girls. This is terrible.' Dementieva is also angry and says that Dechy and herself are the real victims of the scandal: 'You have no idea what I have been through all these days. It's been too hard on me,' she said. 'The WTA are trying to handle this problem by saying there are three victims, but I see only two victims in this story - me and Nathalie Dechy, who really have nothing to do with this. To be honest with you, I don't feel like I want to talk to Sveta at all. I'm just very upset with the way everything has happened.'"
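The sparse mixture assigned to this article can be manipulated directly. A minimal sketch, using the θ_d vector from the slide:

```python
# Document-topics distribution θ_d = P(T_1:K | d) for the doping-row
# article: a sparse mixture of TOPIC 8 (sport) and TOPIC 10 (law).
K = 10
theta_d = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.68, 0.0, 0.32]

assert abs(sum(theta_d) - 1.0) < 1e-9  # a distribution sums to one

# Rank topics by their share of the document (1-based topic numbers).
ranked = sorted(range(K), key=lambda k: theta_d[k], reverse=True)
print([(k + 1, theta_d[k]) for k in ranked if theta_d[k] > 0])
# [(8, 0.68), (10, 0.32)]
```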

TOPIC MODELS: CORPUS-TOPICS DISTRIBUTION

A document corpus D (the BBC news articles corpus) is associated with a CORPUS-TOPICS DISTRIBUTION.

CORPUS-TOPICS DISTRIBUTION: P(T_1:K | D)

P(T_1:10 | D) = (0.08, 0.12, 0.14, 0.09, 0.12, 0.09, 0.10, 0.09, 0.08, 0.09)

(the value highlighted on the slide is P(T_3 | D) = 0.14, the most probable topic of the corpus)

TOPIC MODELS: LATENT DIRICHLET ALLOCATION

We are now ready to provide more insight into learning a LATENT DIRICHLET ALLOCATION topic model.

Given a document corpus D, learning a LATENT DIRICHLET ALLOCATION (LDA) topic model gives the following:

- the TOPIC-WORDS DISTRIBUTIONS P(W_1:N | T_1:K): one distribution over the N vocabulary words for each of the K topics (the ten topics listed earlier)

- the DOCUMENT-TOPICS DISTRIBUTION θ_d = P(T_1:K | d) for each document d of the corpus D

For the BBC news articles corpus: N = 6283 and K = 10.
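How these distributions are estimated is not covered on this slide. As an illustrative sketch (an assumption, not the lecture's estimator), a minimal collapsed Gibbs sampler recovers both P(W_1:N | T_1:K) and θ_d from toy documents:

```python
import random
from collections import defaultdict

# A minimal collapsed Gibbs sampler for LDA (a sketch). `docs` is a
# list of tokenized documents; K topics; alpha, beta are symmetric
# Dirichlet priors on θ_d and on the topic-words distributions.
def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    n_tw = defaultdict(int)   # (topic, word) -> count
    n_dt = defaultdict(int)   # (doc, topic) -> count
    n_t = [0] * K             # topic -> total count
    z = []                    # current topic assignment of each token
    for di, d in enumerate(docs):
        z.append([])
        for w in d:           # random initial assignments
            t = rng.randrange(K)
            z[di].append(t)
            n_tw[t, w] += 1; n_dt[di, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]  # remove the token's current assignment
                n_tw[t, w] -= 1; n_dt[di, t] -= 1; n_t[t] -= 1
                # P(topic | everything else), up to a constant
                weights = [(n_dt[di, k] + alpha) *
                           (n_tw[k, w] + beta) / (n_t[k] + V * beta)
                           for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t
                n_tw[t, w] += 1; n_dt[di, t] += 1; n_t[t] += 1
    # Topic-words distributions P(W | T_k) and document-topics θ_d.
    phi = [{w: (n_tw[k, w] + beta) / (n_t[k] + V * beta) for w in vocab}
           for k in range(K)]
    theta = [[(n_dt[di, k] + alpha) / (len(d) + K * alpha) for k in range(K)]
             for di, d in enumerate(docs)]
    return phi, theta

# Toy corpus: two "sport" and two "economy" documents.
docs = [["game", "win", "cup"], ["market", "shares", "oil"],
        ["game", "cup", "win"], ["oil", "market", "shares"]]
phi, theta = lda_gibbs(docs, K=2)
```

Each row of `theta` and each dictionary in `phi` is a proper probability distribution (sums to one), mirroring the two quantities listed above.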

TOPIC MODELS: LATENT DIRICHLET ALLOCATION


Given a LATENT DIRICHLET ALLOCATION (LDA) topic model, i.e. the topic-words distributions P(W_1:N | T_1:K) and the document-topics distribution θ_d = P(T_1:K | d), a document d consisting of n (non-unique) words (n = 17 in the slide's example) is

obtained as follows:

FOR i = 1:n    % for all words of document d
    Sample the i-th topic T_k from the document-topics distribution P(T_1:K | d)
    Sample a word w from the topic-words distribution P(W_1:N | T_k)
    Add the word w to document d
END
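The generative loop above can be sketched directly. The two toy topics and the θ_d vector below are hypothetical illustrations, not the lecture's fitted model:

```python
import random

# Sample a document: for each word position, draw a topic from θ_d,
# then draw a word from that topic's distribution over the vocabulary.
rng = random.Random(42)

topics = [
    {"game": 0.6, "win": 0.4},        # a toy "sport" topic
    {"market": 0.7, "shares": 0.3},   # a toy "economy" topic
]
theta_d = [0.68, 0.32]  # document-topics distribution P(T_1:K | d)

def generate_document(n):
    doc = []
    for _ in range(n):  # for each of the n word positions
        k = rng.choices(range(len(topics)), weights=theta_d)[0]
        words, probs = zip(*topics[k].items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

print(generate_document(5))
```

Note that word order is irrelevant to the model: LDA generates a bag of words, so the sampled document is exchangeable.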
