arXiv:2108.10755v1 [cs.CL] 24 Aug 2021

More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Jin Cheevaprawatdomrong, Chulalongkorn University
Alexandra Schofield, Harvey Mudd College
Attapol T. Rutherford, Chulalongkorn University

Abstract

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test (χ²), t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The χ², t, and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those in unmerged models.

1 Introduction

Latent Dirichlet allocation (LDA) models (Blei et al., 2003) provide useful insights into the themes and trends in a large text collection through the unsupervised inference of topics, or probability distributions over unigram word types in the corpus. In this model, a topic is often interpreted based on its highest-probability words. Unfortunately, the bag-of-words rendering of the text as unigram counts in documents, with documents expressed in terms of proportions of topics, can obscure the context in which these tokens arise. For instance, with high probabilities of both “coffee” and “table”, it is tempting to interpret a topic as focusing on the furniture item (“coffee table”), but both words could be frequent in a discussion of cafes containing no coffee tables. This problem is amplified in languages without marked word boundaries, such as Chinese and Thai: tokenizers for these languages may split conceptual units into functional segments that, while standalone words, do not express the concept of the original text. Meaningful interpretation of topics can be lost without careful recombination of these words.

In this paper, we evaluate three techniques to merge multiple adjacent words into unified, conceptually meaningful phrasal tokens prior to LDA model inference: Pearson's chi-squared test, the t-statistic, and word pair encoding (WPE). We apply these merging strategies to different language families, including an Indo-European language (English), a Kra-Dai language (Thai), and a Sinitic language (Chinese). Inspired by silhouette coefficients, we also introduce a new method to assess the coherence of topics in a setting with variable vocabularies caused by different pre-processing treatments, which was not possible with previously proposed methods. Using this new metric and existing topic model evaluations, we find that all three approaches to merging adjacent words can improve the likelihood, coherence, and topic distinctiveness of LDA models.

2 Related Work

Despite their popularity in analyzing large amounts of text data, LDA models are notoriously complex to evaluate. One must evaluate both the statistical fit of a model and the human-registered thematic coherence of the high-probability words, or keys, of a topic, which may or may not correlate (Chang et al., 2009). Analyses often combine evaluations of fit (Wallach et al., 2009) and automated approximations of human judgments of coherence (Mimno et al., 2011) based on mutual information (Bouma, 2009), even with the expectation that these may only somewhat correlate with true human judgments (Lau et al., 2014). A limitation of these existing approaches, however, is that they expect the vocabulary and tokenization to remain constant between two models. For our evaluation, we use a normalized log-likelihood approach to capture fit while accounting for changes in vocabulary (Schofield and Mimno, 2016).

Pre-processing steps can meaningfully alter the results of LDA models even in languages with good tokenization heuristics such as English (Schofield and Mimno, 2016; Schofield et al., 2017). We believe that languages that do not have clear tokenization standards deserve investigation into what kind of processing is appropriate. Many works recognize that LDA results can be improved when the input includes phrases (Lindsey et al., 2012; Yu et al., 2013; El-Kishky et al., 2014; Wang et al., 2016; Bin et al., 2018; Li et al., 2018). We consider it valuable to specifically assess approaches to determining these phrases.

3 Collocations and Word Pair Encoding

Collocations consist of two or more words that can express conventional meaning. Since collocations can convey information about multi-word entities, context, and word usage, we hypothesize that the introduction of multi-word tokens, which capture collocations as unigrams through concatenation, can help achieve more useful and coherent topic models. For languages that do not have clear word boundaries, there is a possible additional benefit to multi-word tokens: it can be hard to intuit whether inferred word boundaries will have a large impact on the final results. Merging adjacent words into 'multi-word' tokens may help remedy the potential problem of a segmentation that is not optimal for topic modeling purposes.

Many methods are possible to select collocations from tokenized text, such as frequency, mean and variance, and statistical hypothesis testing. In this paper, we evaluate Pearson's chi-squared test (χ²) and the t-statistic for word co-occurrence, two hypothesis tests to determine if two words are collocated significantly more often than would occur randomly. We use the NLTK package (Bird, 2006) to compute both statistics. We impose a minimum frequency in the corpus for each selected bigram: otherwise, the top bigrams from the χ² test will contain only exceptionally rare words, as these are expected to co-occur so rarely that even a few co-occurrences can trigger significance.

Taking inspiration from byte-pair encoding, or BPE, we propose an alternative to obtain word-pair encoding (WPE) tokens. To do this, we first tokenize a large corpus and then collect bigram counts for all bigrams found in the corpus. Second, we merge the most frequent bigram to form a new WPE token. This new bigram is then treated as a word in all occurrences. Next, we continue to repeat the counting and merging process with one extra word type. Finally, we obtain a vocabulary list of both unigram and WPE tokens.

4 Evaluation Metrics

Held-Out Likelihood. When multi-word phrases are converted to individual tokens, the number of tokens in the document decreases while the size of the corpus vocabulary increases. It is therefore illogical to compare the likelihoods of the word-token model and the WPE-token model directly. In order to normalize the scores between two models that do not have the exact same vocabulary and tokens, we use the log-likelihood ratio between the LDA model likelihood and the null (unigram) likelihood for each model. In other words, we normalize the LDA model likelihood ($\mathcal{L}_{\mathrm{model}}$) by dividing it by the unigram likelihood ($\mathcal{L}_{\mathrm{unigram}}$), which is a subtraction in log space, as introduced by Schofield and Mimno (2016):

$$\mathrm{PTLL}_{\mathrm{norm}} = \frac{\log \mathcal{L}_{\mathrm{model}} - \log \mathcal{L}_{\mathrm{unigram}}}{N} \qquad (1)$$

where $N$ is the number of tokens, so that the score is a per-token quantity.

New Metric: Concatenation-based Embedding Silhouette (CBES). Previous measures of topic coherence rely on statistics from the training data and assume that the vocabularies are identical for both models, which is not the case for our settings. To address this, we propose a new metric called a concatenation-based embedding silhouette (CBES), which measures the coherence within the same topic and also the distinguishability of different topics in the LDA results. CBES extends silhouette coefficients (Rousseeuw, 1987), a common clustering evaluation metric, by projecting tokens and multi-word tokens into the same space and computing the silhouette coefficients in this vector space in the usual way.

A good topic should have all of its topic keys close to each other and away from other words that do not belong in the same topic. Silhouette coefficients computed in this vector space capture exactly this. It is critical that the embeddings for the two models that we want to compare come from the same embedding space. We achieve this by concatenating each document with the versions containing χ², t, and WPE collocations before training the embeddings. We use the gensim (Řehůřek and Sojka, 2010) implementation of word2vec with the Continuous Bag-of-Words (CBOW) algorithm (Mikolov et al., 2013) to obtain word embeddings.

5 Experiments

We test our methods on various corpora in English, Thai, and Chinese (Table 1). The English corpora are drawn from The New York Times (Sandhaus, 2008), the Yelp Dataset [1], and United States State of the Union addresses (1790 to 2018) divided into paragraphs [2]. The Thai data come from the news articles in Prachathai [3], the restaurant reviews from Wongnai [4], the BEST corpus [5], and the Thai National Corpus (Aroonmanakun, 2007). The Chinese data come from three corpora: the news articles from Chinanews, restaurant reviews from Dianping [6], and the movie reviews from Douban [7]. Each corpus is separated into 75% training documents and 25% test documents.

                                % merged
Corpus       Docs    Tokens    χ²       t        WPE
NYTimes      80K     2M        15.43    15.99    16.55
SOTU         21K     1M        12.21    12.90    13.53
Yelp         200K    13M       9.47     10.33    12.16
TNC          2K      4M        11.43    11.50    8.40
BEST         4K      6M        12.43    12.50    9.53
Wongnai      40K     8M        5.68     5.71     4.48
Prachathai   68K     119M      13.41    13.45    10.36
Chinanews    100K    3M        12.62    14.35    10.62
Dianping     100K    4M        3.22     3.84     2.59
Douban       200K    2M        4.61     5.18     3.99

Table 1: A survey of corpora providing the number of documents and tokens, as well as the percentage of unigram tokens merged using each approach.

We train the χ², t and WPE-based tokenizers for each language on Wikipedia articles for that language. For Thai and Chinese, we use the entire Wikipedia database, but for English we use the filtered Wiki103 dataset (Merity et al., 2016). English, Thai, and Chinese documents are tokenized with NLTK (Bird, 2006), Attacut (Chormai et al., 2020), and Stanford Word Segmenter (Tseng et al., 2005) respectively. We follow the same pre-processing steps for the training and the test documents: lemmatize and lowercase in English, and remove stopwords, symbols and digits for all languages. We limit the χ², t and WPE approaches to 100,000 types. Note that the top χ² collocations are full of specific names and rare words from Wikipedia because they appear together more often than they would randomly (Figure 1). We use MALLET (McCallum, 2002) with the default hyperparameters to train and evaluate topic models on both word and multi-word documents with 10, 50, and 100 topics. We run the experiment 10 times for each combination of corpus, type of model (word, χ², t or WPE) and number of topics to compute the means of the normalized held-out likelihood and CBES explained in Section 4.

χ²: debes jugar, euskaltel euskadi, taare zameen, chetro ketl, hetch hetchy, ngwat mahop, mullum malarum, pazz jop, phnom penh, eisernen kreuzes, sirimalle chettu, kasa vubu, moondram pirai, gjems onstad, lettow vorbeck, pather panchali, ioann zlatoust, kud wafter, poquita ropa, viribus unitis
t: united states, new york, world war, km h, take place, miles km, los angeles, united kingdom, first time, high school, tropical storm, new zealand, war ii, , mph km, h mph, north america, air force, two years, peak number
WPE: unite state, new york, take place, first time, unite kingdom, follow year, world war ii, also know, next day, new york city, high school, los angeles, north america, even though, new zealand, follow day, become first, also use, year old, take part

Figure 1: Different collocation scoring methods result in different top 20 English collocations.

Word: court federal judge charge case former trial say rule today state sentence supreme prison justice accuse order law file jury
χ²: today rule washington say judge state law court federal case legal supreme court may ban order abortion lawyers allow violate laws
t: washington today state say rule judge federal law case ban may court supreme court violate seek right settlement abortion july allow
WPE: judge federal today charge court case trial washington former say lawyer lawyers accuse supreme court rule state hear order file federal judge

Figure 2: Topic keys about judges in State of the Union

[1] www.yelp.com/dataset
[2] www.kaggle.com/rtatman/state-of-the-union-corpus-1989-2017
[3] github.com/PyThaiNLP/prachathai-67k
[4] www.kaggle.com/c/wongnai-challenge-review-rating-prediction
[5] thailang.nectec.or.th/downloadcenter
[6] github.com/zhangxiangxiao/glyph
[7] www.kaggle.com/utmhikari/doubanmovieshortcomments

             10 topics                       50 topics                       100 topics
Corpus       Word    χ²      t       WPE     Word    χ²      t       WPE     Word    χ²      t       WPE
NYTimes      .3781   .4263   .6223   .4460   .5595   .6105   .6237   .6439   .6091   .6501   .6651   .6867
SOTU         .2711   .3032   .3148   .3273   .3867   .4240   .4375   .4573   .4153   .4444   .4584   .4821
Yelp         .1717   .1974   .2028   .2149   .2846   .3226   .3303   .3506   .3201   .3586   .3672   .3883
TNC          .7614   .7512   .7468   .7578   1.0214  1.0459  1.0484  1.0405  1.0735  1.1027  1.1073  1.0972
BEST         .7029   .6742   .6773   .6969   .9210   .9293   .9306   .9358   .9899   1.0053  1.0067  1.0097
Wongnai      .2014   .2125   .2141   .2102   .3191   .3379   .3378   .3338   .3492   .3663   .3675   .3443
Prachathai   .4366   .4723   .4730   .4659   .7139   .7761   .7783   .7599   .8036   .8736   .8761   .8565
Chinanews    .5161   .5444   .6560   .5492   .8114   .8392   .9548   .8544   .9186   .9353   .9758   .9620
Dianping     .2571   .2629   .2649   .2617   .4087   .4144   .4179   .4536   .4538   .4594   .4631   .4969
Douban       .2974   .3027   .3079   .3046   .4136   .4139   .4211   .4168   .4464   .4417   .4492   .4451

Table 2: Normalized unigram log-likelihood per token improvement in collocation models.
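
The normalization in Equation 1 can be illustrated with a small Python sketch. This is only a toy calculation, not the MALLET-based evaluation used in the experiments: the add-one smoothed unigram baseline, the function names, and the model log-likelihood value are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_log_likelihood(train_tokens, test_tokens, alpha=1.0):
    """Held-out log-likelihood under a smoothed unigram model.

    The add-alpha smoothing is an assumption of this sketch; the paper
    does not state how its null (unigram) model is smoothed.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    denom = len(train_tokens) + alpha * len(vocab)
    return sum(math.log((counts[tok] + alpha) / denom) for tok in test_tokens)


def normalized_log_likelihood(log_l_model, log_l_unigram, n_tokens):
    """Equation 1: per-token log-likelihood ratio against the unigram baseline."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return (log_l_model - log_l_unigram) / n_tokens


train = "new york court rule new york judge".split()
test = "new york judge rule".split()

log_l_unigram = unigram_log_likelihood(train, test)
log_l_model = -6.0  # hypothetical held-out log-likelihood from a topic model
score = normalized_log_likelihood(log_l_model, log_l_unigram, len(test))
print(round(score, 4))
```

Because each tokenization is compared against its own unigram baseline and divided by its own token count, a positive score means the topic model explains the held-out tokens better than the baseline for that tokenization, which is what makes the word and multi-word columns of Table 2 comparable.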

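The two hypothesis tests described in Section 3 can be sketched from a 2x2 contingency table of adjacent word pairs. The paper computes them with NLTK; the version below is a from-scratch approximation in plain Python, so the function name `score_bigrams`, the way the marginal counts are taken, and the toy sentence are assumptions and may differ in detail from the NLTK implementation.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_freq=2):
    """Return {(w1, w2): (chi_sq, t_score)} for pairs seen at least min_freq times."""
    n = len(tokens) - 1                       # number of adjacent-pair positions
    pair_counts = Counter(zip(tokens, tokens[1:]))
    left_counts = Counter(tokens[:-1])        # how often a word starts a pair
    right_counts = Counter(tokens[1:])        # how often a word ends a pair
    scores = {}
    for (w1, w2), n11 in pair_counts.items():
        if n11 < min_freq:                    # minimum-frequency filter
            continue
        r1, c1 = left_counts[w1], right_counts[w2]
        n12 = r1 - n11                        # w1 followed by something else
        n21 = c1 - n11                        # something else followed by w2
        n22 = n - n11 - n12 - n21             # neither
        denom = r1 * c1 * (n - r1) * (n - c1)
        chi_sq = n * (n11 * n22 - n12 * n21) ** 2 / denom if denom else 0.0
        expected = r1 * c1 / n                # count expected under independence
        t_score = (n11 - expected) / math.sqrt(n11)
        scores[(w1, w2)] = (chi_sq, t_score)
    return scores


tokens = "the court in new york and the judge in new york met the court".split()
scores = score_bigrams(tokens, min_freq=2)
for pair, (chi_sq, t_score) in sorted(scores.items()):
    print(pair, round(chi_sq, 3), round(t_score, 3))
```

With `min_freq=1`, pairs seen only once would also be scored, and a pair of words that each occur exactly once would reach the maximum χ² value, which is the rare-word effect that the minimum frequency is meant to suppress.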
             10 topics                       50 topics                       100 topics
Corpus       Word    χ²      t       WPE     Word    χ²      t       WPE     Word    χ²      t       WPE
NYTimes      .0111   .0293   .0374   .0484   -.0509  -.0451  -.0415  -.0305  -.0820  -.0768  -.0748  -.0690
SOTU         -.0052  .0014   .0081   .0131   -.0603  -.0589  -.0573  -.0541  -.0802  -.0784  -.0773  -.0731
Yelp         -.0617  -.0524  -.0463  -.0374  -.1103  -.1028  -.0981  -.0912  -.1291  -.1249  -.1211  -.1141
TNC          -.0223  -.0125  -.0161  -.0196  -.0935  -.0810  -.0837  -.0862  -.1134  -.1031  -.1041  -.1072
BEST         -.0323  -.0171  -.0145  -.0261  -.0984  -.0863  -.0851  -.0831  -.1143  -.0987  -.1019  -.0986
Wongnai      -.0618  -.0658  -.0676  -.0655  -.1417  -.1406  -.1404  -.1406  -.1631  -.1591  -.1602  -.1616
Prachathai   -.0153  .0094   .0073   .0025   -.0814  -.0632  -.0600  -.0692  -.1124  -.0922  -.0900  -.0995
Chinanews    -.0003  .0091   .0207   .0222   -.0536  -.0530  -.0459  -.0406  -.0694  -.0679  -.0582  -.0545
Dianping     -.0614  -.0539  -.0518  -.0516  -.0977  -.0940  -.0908  -.0933  -.1143  -.1124  -.1126  -.1137
Douban       .0072   .0106   .0208   .0149   -.0763  -.0747  -.0734  -.0721  -.0994  -.1010  -.1002  -.0991

Table 3: Silhouette improvement in collocation models
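
To make the CBES scores more concrete, the silhouette computation over topic keys from Section 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Euclidean distance, the treatment of each (topic, key) pair as one point, the function names, and the two-dimensional toy vectors are assumptions. In the paper the vectors come from word2vec embeddings trained on the concatenated documents.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient (Rousseeuw, 1987) using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other keys of the same topic
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the keys of the nearest other topic
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def cbes(topic_keys, embeddings):
    """Silhouette of topic keys, with the topic index as the cluster label."""
    points, labels = [], []
    for topic_id, keys in enumerate(topic_keys):
        for key in keys:
            if key in embeddings:  # keys without a vector are skipped
                points.append(embeddings[key])
                labels.append(topic_id)
    return silhouette(points, labels)


# Hypothetical 2-D vectors standing in for the shared embedding space.
embeddings = {
    "judge": (0.0, 1.0), "supreme_court": (0.1, 0.9), "trial": (0.0, 0.8),
    "restaurant": (1.0, 0.0), "coffee_table": (0.9, 0.1), "menu": (0.8, 0.0),
}
good = cbes([["judge", "supreme_court", "trial"],
             ["restaurant", "coffee_table", "menu"]], embeddings)
mixed = cbes([["judge", "restaurant", "trial"],
              ["supreme_court", "coffee_table", "menu"]], embeddings)
print(round(good, 3), round(mixed, 3))
```

Topics whose keys sit close together and far from other topics score near 1, while topics that mix unrelated keys score near or below 0. Since word tokens and merged tokens share one embedding space, scores from models with different vocabularies can be compared directly.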

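The word-pair encoding procedure described in Section 3 (count bigrams, merge the most frequent one everywhere, repeat) can be sketched as below. This is a simplified illustration rather than the tokenizer used in the experiments: the underscore joiner, the `num_merges` and `min_count` stopping rules, and the function names are assumptions, whereas the paper caps the result at 100,000 types.

```python
from collections import Counter


def merge_pair(tokens, pair, joiner="_"):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_wpe(documents, num_merges, min_count=2, joiner="_"):
    """Greedily merge the most frequent adjacent word pair, BPE-style."""
    docs = [list(doc) for doc in documents]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for doc in docs:
            counts.update(zip(doc, doc[1:]))  # bigrams never cross documents
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_count:  # assumed stopping rule for this sketch
            break
        merges.append(pair)
        docs = [merge_pair(doc, pair, joiner) for doc in docs]
    return merges, docs


documents = [
    ["new", "york", "city", "hall"],
    ["new", "york", "city", "council"],
    ["new", "york", "times"],
]
merges, merged_docs = learn_wpe(documents, num_merges=10)
print(merges)
print(merged_docs)
```

Because a merged token is treated as an ordinary word in later rounds, the procedure can build tokens longer than two words, which is consistent with entries such as "world war ii" and "new york city" in the WPE list of Figure 1.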
6 Results and Discussion

In general, corpora containing news have higher percentages of merged words, while those containing restaurant and movie reviews tend to see lower percentages (Table 1). This could be because the news corpora are in a similar domain to that of Wikipedia, which we use to build the list of co-occurring words. In the χ², t and WPE models, where the input contains multi-word tokens, the results usually improve over the word-token model. Most of the exceptions are from corpora containing restaurant and movie reviews, in which the percentages of merged words are lower.

The normalized log-likelihood per token of the multi-word models is generally higher than that of the word model across languages and corpora (Table 2). This means the multi-word models are better than their corresponding word models at reproducing the statistics of the held-out data. We also see a general improvement in coherence in multi-word models (Table 3). Further, the higher CBES score indicates that topic keys are more semantically coherent and topics are more distinct. The topic keys from multi-word models form a coherent conceptual unit (Figure 2). We can see that supreme court in the multi-word models is more meaningful than supreme or court in the word model. Similarly, the meaning of federal judge is more precise than just federal and judge.

If we compare the topic keys of the word and multi-word models, we can come up with similar topics, because a human reader who understands English and has general knowledge of the world can infer from the surrounding topic keys that soviet and union, or super and bowl, or new and york are parts of connected words even though they are not explicitly merged. However, if we want to use these topic keys as input to other tasks such as query search or neural network modeling, it is useful to feed in the merged tokens to make explicit that the bowl here does not refer to the deep dish used for food, and that the union here does not refer to the worker association.

7 Conclusion

In this work, we improve the quality of LDA models by better processing the input text before training the model. We found that all three approaches to selecting candidate multi-word tokens (Pearson's chi-squared test, t-statistics, and word-pair encoding) improve the results of trained topic models across numerous metrics. We also propose a new evaluation metric necessary for evaluating LDA topic models in scenarios where pre-processing changes the corpus vocabulary. As a future direction, we would like to explore other collocation measures with applications to other language families that are morphologically rich.

References

Wirote Aroonmanakun. 2007. Creating the Thai National Corpus. MANUSYA: Journal of Humanities, 10(3):4–17.

GE Bin, Chun-hui HE, Sheng-ze HU, and GUO Cheng. 2018. Chinese news hot subtopic discovery and recommendation method based on key phrase and the LDA model. DEStech Transactions on Engineering and Technology Research, (ecar).

Steven Bird. 2006. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, pages 31–40.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-graber, and David Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.

Pattarawat Chormai, Ponrawee Prasertsom, Jin Cheevaprawatdomrong, and Attapol Rutherford. 2020. Syllable-based neural Thai word segmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4619–4637.

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, and Jiawei Han. 2014. Scalable topical phrase mining from text corpora. arXiv preprint arXiv:1406.6312.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 530–539.

Bing Li, Xiaochun Yang, Rui Zhou, Bin Wang, Chengfei Liu, and Yanchun Zhang. 2018. An efficient method for high quality and cohesive topical phrase mining. IEEE Transactions on Knowledge and Data Engineering, 31(1):120–137.

Robert Lindsey, William Headden, and Michael Stipicevic. 2012. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 214–222.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

Alexandra Schofield, Måns Magnusson, and David Mimno. 2017. Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 432–436.

Alexandra Schofield and David Mimno. 2016. Comparing apples to apple: The effects of stemmers on topic models. Transactions of the Association for Computational Linguistics, 4:287–300.

Huihsin Tseng, Pi-Chuan Chang, Galen Andrew, Dan Jurafsky, and Christopher D Manning. 2005. A conditional random field word segmenter for SIGHAN bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.

Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, page 1105–1112, New York, NY, USA. Association for Computing Machinery.

Minmei Wang, Bo Zhao, and Yihua Huang. 2016. PTR: phrase-based topical ranking for automatic keyphrase extraction in scientific publications. In International Conference on Neural Information Processing, pages 120–128. Springer.

Zhiguo Yu, Todd R Johnson, and Ramakanth Kavuluru. 2013. Phrase based topic modeling for semantic information processing in biomedicine. In 2013 12th International Conference on Machine Learning and Applications, volume 1, pages 440–445. IEEE.