An Unsupervised Topic Segmentation Model Incorporating Word Order
Shoaib Jameel and Wai Lam
The Chinese University of Hong Kong
SIGIR 2013, Dublin, Ireland

One-line summary of the work: we will see how maintaining document structure, such as paragraphs, sentences, and word order, helps improve the performance of a topic model.

Outline
- Motivation
- Related Work
  - Probabilistic Unigram Topic Model (LDA)
  - Probabilistic N-gram Topic Models
  - Topic Segmentation Models
- Overview of our model
  - Our N-gram Topic Segmentation model (NTSeg)
- Text Mining Experiments with NTSeg
  - Word-Topic and Segment-Topic Correlation Graph
  - Topic Segmentation Experiment
  - Document Classification Experiment
  - Document Likelihood Experiment
- Conclusions and Future Directions

Motivation
Many works in the topic modeling literature assume exchangeability among the words. As a result, we see many ambiguous words in topics. For example, consider a few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model:

  Topic 1       Topic 2    Topic 3        Topic 4    Topic 5
  architecture  order      connectionist  potential  prior
  recurrent     first      role           membrane   bayesian
  network       second     binding        current    data
  module        analysis   structures     synaptic   evidence
  modules       small      distributed    dendritic  experts

The problem with the LDA model: words in topics are not insightful.

Latent Dirichlet Allocation Model (LDA) (Blei et al., JMLR-2003)
Generative process:
1. Draw θ(d) from Dirichlet(α), where each θ(d) is the topic distribution for document d.
2. Draw φ_z from Dirichlet(β), where φ_z is the word distribution for topic z.
3. For every word i in document d:
   a. Draw a topic z_i(d) from Multinomial(θ(d)).
   b. Draw a word w_i(d) from Multinomial(φ_{z_i(d)}).
[Figure: graphical model of LDA in plate notation, with nodes α, θ, z, w, φ, β and plates N (words), Z (topics), M (documents).]
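To make the three-step generative process above concrete, here is a minimal sketch in Python/NumPy. The corpus sizes and hyperparameter values are illustrative toy numbers, not settings from the paper.

    # Minimal sketch of the LDA generative process (toy sizes).
    import numpy as np

    rng = np.random.default_rng(0)
    V, Z, N = 8, 3, 20            # vocabulary size, number of topics, words in document d
    alpha, beta = 0.1, 0.01       # symmetric Dirichlet hyperparameters

    theta_d = rng.dirichlet(alpha * np.ones(Z))      # step 1: θ(d), topic dist. for document d
    phi = rng.dirichlet(beta * np.ones(V), size=Z)   # step 2: φ_z, word dist. for each topic z

    doc = []
    for _ in range(N):                               # step 3: for every word in document d
        z = rng.choice(Z, p=theta_d)                 #   3a: z_i ~ Multinomial(θ(d))
        w = rng.choice(V, p=phi[z])                  #   3b: w_i ~ Multinomial(φ_{z_i})
        doc.append(w)

Note that nothing in this process looks at the previous word: positions are exchangeable, which is exactly the bag-of-words assumption the next slides relax.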
Relaxing the Bag-of-Words Assumption in a Topic Model
Can the bag-of-words assumption be relaxed in a topic model? Doing so makes sense, as word order is how documents are written by humans, and also how they are read.
[Figure: the same sentence shown as an unordered bag of words (left) and in its original order, "Here things are a lot more organized. I can understand that it was actually the cat which noticed a dog." (right). The bag of words on the left is complete chaos; the ordered version on the right makes more sense to us.]

Relaxing the Bag-of-Words Assumption
Bigram Topic Model (BTM) (Wallach, ICML-2006)
Some properties of the model:
- A word is generated by both the topic and the previous word.
- Inspired by the Hierarchical Dirichlet Language Model.
- Better empirical results than the LDA model.
A limitation of the model:
- It always generates bigrams in a topic.
[Figure: graphical model of BTM in plate notation, with nodes α, θ, z, w, σ, δ and plates over documents (M) and topic-word pairs (ZV).]

Relaxing the Bag-of-Words Assumption
LDA-Collocation Model (LDACOL) (Griffiths et al., Psy. Rev-2007)
Some properties of the model:
- A word is generated by the topic, the previous word, and a binary bigram status variable.
- Each word has a topic assignment and a collocation assignment.
- Can form longer, higher-order phrases.
- Can generate both unigram and bigram words.
A limitation of the model:
- Only the first word in a bigram has a topic assignment.
[Figure: graphical model of LDACOL in plate notation, with nodes α, θ, ψ, γ, z, x, w, β, φ, σ, δ and plates over documents (M), topics (Z), and vocabulary (V).]

Relaxing the Bag-of-Words Assumption
Topical N-Gram Model (TNG) (Wang et al., ICDM-2007)
Some properties of the model:
- Extends the LDACOL model.
- Each word has a topic assignment and a collocation assignment.
- Can form longer, higher-order phrases.
- Can generate both unigram and bigram words.
A limitation of the model:
- Words in a bigram may have different topic assignments.
[Figure: graphical model of TNG in plate notation, with nodes α, θ, ψ, γ, z, x, w, β, φ, σ, δ and plates over documents (M), topics (Z), and topic-word pairs (ZV).]
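To make the role of the bigram status variable concrete, here is a hedged sketch of one TNG-style generative step in Python/NumPy: x decides whether the current word extends an n-gram (bigram emission conditioned on the previous word) or starts fresh (unigram emission). The sizes and flat priors are illustrative assumptions, not the distributions of Wang et al.

    # One TNG-style generative step (illustrative, toy sizes).
    import numpy as np

    rng = np.random.default_rng(1)
    V, Z = 8, 3
    theta_d = rng.dirichlet(0.1 * np.ones(Z))               # document's topic distribution
    phi = rng.dirichlet(0.01 * np.ones(V), size=Z)          # unigram dist. per topic
    sigma = rng.dirichlet(0.01 * np.ones(V), size=(Z, V))   # bigram dist. per (topic, prev word)
    psi = rng.beta(1.0, 1.0, size=(Z, V))                   # P(x_i = 1 | z_{i-1}, w_{i-1})

    def next_word(z_prev, w_prev):
        """Draw (z_i, w_i) given the previous word's topic and identity."""
        x = rng.random() < psi[z_prev, w_prev]   # bigram status variable x_i
        z = rng.choice(Z, p=theta_d)             # unlike LDACOL, every word gets its own topic
        w = rng.choice(V, p=sigma[z, w_prev] if x else phi[z])
        return z, w

    z, w = next_word(0, 0)   # chaining these calls generates an ordered word sequence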
What Topic N-gram Models Do: An Illustration
[Figure: a sample paper, shown with its first paragraph (an abstract giving necessary and sufficient conditions for uniqueness of the support vector solution for pattern recognition and regression estimation, for a general class of cost functions) and its second paragraph (acknowledgements by C. Burges and a reference to R. Fletcher, Practical Methods of Optimization, John Wiley and Sons, Inc., 2nd edition, 1987) marked.]

  Topic 1         Topic 2
  support vector  acknowledgements
  cost functions  reference

- Consider the document as a whole.
- Find topical n-grams in the document.

Bag of Words in Topic Segmentation
- These models maintain the document structure, such as paragraphs or sentences.
- They assume that words within a paragraph or a sentence are exchangeable.
- They introduce the notion of segment-topics (or super-topics) and word-topics.
[Figure: paragraph n and paragraph n+1 of document d, each shown with its words shuffled, illustrating exchangeability within a paragraph.]

A Topic Segmentation Model (LDSEG) (Shafiei et al., Canadian AI-2008)
Model properties:
- Performs topic segmentation.
- Can work at the paragraph and sentence level.
- A binary variable c signals segment-wise changes in topic.
- Segments come from a predefined number of super-topics.
- The super-topics comprise a mixture of word-topics.
- Segments exhibit multiple topics.
- Words are generated from a predefined number of word-topics; this region of the model is similar to the LDA model.
[Figure: graphical model of LDSEG in plate notation, with nodes τ, ρ, π, Ω, y, c, θ, α, β, z, w, φ and plates N (words), Z (topics), S (segments), M (documents); the θ-z-w region is the LDA-like part.]
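A loose sketch of this segment-level story in Python/NumPy follows. The names mirror the plate diagram (Ω, π, y, c, θ), but the sizes, priors, and exact conditionals are illustrative assumptions, not the LDSEG equations of Shafiei et al.

    # Loose sketch of LDSEG's segment-level story (illustrative, toy sizes).
    import numpy as np

    rng = np.random.default_rng(2)
    S, Y, Z, V, N = 4, 5, 10, 50, 30    # segments, super-topics, word-topics, vocab, words/segment

    Omega = rng.dirichlet(np.ones(Y))                 # distribution over super-topics
    alpha = rng.dirichlet(0.1 * np.ones(Z), size=Y)   # each super-topic mixes the word-topics
    phi = rng.dirichlet(0.01 * np.ones(V), size=Z)    # word dist. per word-topic
    pi = 0.3                                          # prob. the super-topic changes at a segment

    y = rng.choice(Y, p=Omega)
    for s in range(S):
        c = s == 0 or rng.random() < pi               # binary change variable c_s
        if c:
            y = rng.choice(Y, p=Omega)                # new super-topic for this segment
        theta_s = rng.dirichlet(50 * alpha[y])        # segment's word-topic mixture, tied to y
        segment = [rng.choice(V, p=phi[rng.choice(Z, p=theta_s)])
                   for _ in range(N)]                 # LDA-like word generation within the segment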
Topic Segmentation Illustration Using the LDSEG Model
[Figure: the same sample paper, now segmented; paragraph 1 (the abstract) is assigned Super-Topic 7 and paragraph 2 (the acknowledgements and reference) is assigned Super-Topic 4.]

  Word-Topic 1   Word-Topic 2       Super-Topic 1   Super-Topic 2
  support        acknowledgements   Word-Topic 1    Word-Topic 5
  vector         reference          Word-Topic 9    Word-Topic 8
  cost                              Word-Topic 12   Word-Topic 1

- Performs topic segmentation.
- Unigram words are assigned to the word-topics.
- Segments are assigned to the document-topics, or super-topics.

Our Research Contributions
- We propose a model called NTSeg.
- Our proposed model maintains document structure, such as paragraphs and sentences.
- It maintains the order of the words in the document.
- It detects and coordinates two topic granularity levels:
  - Segment-Topics
  - Word-Topics
- Derivation of the posterior inference scheme.
- Conducted extensive text mining experiments:
  - Shown improvement over state-of-the-art models.

Our Proposed Model (NTSeg)
[Figure: graphical model of NTSeg in plate notation. For each document d, segment s has a segment-topic y_s(d), a binary change variable c_s(d), and a word-topic distribution θ_s(d); each word position n in segment s has a word-topic z_{s,n}(d), a bigram status variable x_{s,n}(d), and a word w_{s,n}(d). Hyperparameter and parameter nodes include Ω, ρ, π(d), τ(d), α, K, β, φ, ψ, γ, σ, δ, with plates over documents (M), topics (Z), and topic-word pairs (ZV).]
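Reading the plate diagram above, NTSeg stacks an LDSEG-style segment level over super-topics on top of a TNG-style word level that preserves word order. Here is a hedged end-to-end sketch in Python/NumPy; all sizes, priors, and conditionals are illustrative assumptions, not the paper's actual distributions or inference scheme.

    # Hedged sketch of NTSeg's two coordinated levels (illustrative, toy sizes).
    import numpy as np

    rng = np.random.default_rng(3)
    S, Y, Z, V, N = 4, 5, 10, 50, 30

    Omega = rng.dirichlet(np.ones(Y))                       # distribution over segment-topics
    alpha = rng.dirichlet(0.1 * np.ones(Z), size=Y)         # word-topic mix per segment-topic
    phi = rng.dirichlet(0.01 * np.ones(V), size=Z)          # unigram dist. per word-topic
    sigma = rng.dirichlet(0.01 * np.ones(V), size=(Z, V))   # bigram dist. per (topic, prev word)
    psi = rng.beta(1.0, 1.0, size=(Z, V))                   # switch prob. for x_{s,n}
    pi = 0.3

    y, z_prev, w_prev = rng.choice(Y, p=Omega), 0, 0
    for s in range(S):                                      # segment level (y_s, c_s, θ_s)
        if s > 0 and rng.random() < pi:                     # c_s: change the segment-topic?
            y = rng.choice(Y, p=Omega)
        theta_s = rng.dirichlet(50 * alpha[y])
        for n in range(N):                                  # word level, order preserved
            x = rng.random() < psi[z_prev, w_prev]          # x_{s,n}: bigram status
            z = rng.choice(Z, p=theta_s)                    # z_{s,n}: word-topic
            w = rng.choice(V, p=sigma[z, w_prev] if x else phi[z])
            z_prev, w_prev = z, w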