arXiv:2108.10755v1 [cs.CL]
More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Jin Cheevaprawatdomrong (Chulalongkorn University)
Alexandra Schofield (Harvey Mudd College)
Attapol T. Rutherford (Chulalongkorn University)

24 Aug 2021

Abstract

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared (χ²) test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The χ², t and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.

1 Introduction

Latent Dirichlet allocation (LDA) models (Blei et al., 2003) provide useful insights into themes and trends in a large text collection through the unsupervised inference of topics, or probability distributions over unigram word types in the corpus. In this model, a topic is often interpreted based on its highest-probability words, with documents expressed in terms of proportions of each topic. Unfortunately, the context in which these tokens arise can be obscured in the bag-of-words rendering of text as unigram counts in documents. For instance, a topic with high probabilities of both “coffee” and “table” is tempting to interpret as focusing on the furniture item “coffee table”, but both words could be frequent in a discussion of cafes containing no coffee tables. This problem is amplified in languages without marked word boundaries, such as Chinese and Thai: tokenizers for these languages may split conceptual units into segments that, while functional as standalone words, do not express the concept of the original text. Meaningful interpretation of topics can be lost without careful recombination of these words.

In this paper, we evaluate three techniques to merge multiple adjacent words into conceptually unified phrasal tokens prior to LDA model inference: Pearson's chi-squared test, the t-statistic, and word pair encoding (WPE). We apply the merging strategies to languages from different families: an Indo-European language (English), a Kra-Dai language (Thai), and a Sinitic language (Chinese). Inspired by silhouette coefficients, we also introduce a new method to assess the coherence of topics in a setting with variable vocabularies caused by different pre-processing treatments, which was not possible with previously proposed methods. Using this new metric and existing topic model evaluations, we find that all three approaches to merging adjacent words can improve the likelihood, coherence, and topic distinctiveness of LDA models.

2 Related Work

Despite their popularity in analyzing large amounts of text data, LDA models are notoriously complex to evaluate. One must evaluate both the statistical fit of a model and the human-registered thematic coherence of the high-probability words, or keys, of a topic, and the two may not correlate (Chang et al., 2009). Analyses often combine evaluations of fit (Wallach et al., 2009) and automated approximations of human judgments of coherence (Bouma, 2009; Mimno et al., 2011) based on mutual information, even with the expectation that these may only somewhat correlate with true human judgments (Lau et al., 2014). A limitation of these existing approaches, however, is that they expect the vocabulary and tokenization to remain constant between two models. For our evaluation, we use a normalized log-likelihood approach to capture fit while accounting for changes in vocabulary (Schofield and Mimno, 2016).

Pre-processing steps can meaningfully alter the results of LDA models even in languages with good tokenization heuristics such as English (Schofield and Mimno, 2016; Schofield et al., 2017). We believe that languages that do not have clear tokenization standards deserve investigation into what kind of processing is appropriate. Many works recognize that LDA results can be improved when the input includes phrases (Lindsey et al., 2012; Yu et al., 2013; El-Kishky et al., 2014; Wang et al., 2016; Bin et al., 2018; Li et al., 2018). We consider it valuable to specifically assess approaches to determining these phrases.

3 Collocations and Word Pair Encoding

Collocations consist of two or more words that can express conventional meaning. Since collocations can convey information about multi-word entities, context, and word usage, we hypothesize that the introduction of multi-word tokens, which capture collocations as unigrams through concatenation, can help achieve more useful and coherent topic models. For languages that do not have clear word boundaries, there is a possible additional benefit to multi-word tokens: it can be hard to intuit whether inferred word boundaries will have a large impact on the final results. Merging adjacent words into 'multi-word' tokens may help remedy the potential problem of a segmentation that is not optimal for topic modeling purposes.

Many methods can be used to select collocations from tokenized text, such as frequency, mean and variance, and statistical hypothesis testing. In this paper, we evaluate Pearson's chi-squared test (χ²) and the t-statistic for word co-occurrence, two hypothesis tests that determine whether two words are collocated significantly more often than would occur randomly. We use the NLTK package (Bird, 2006) to compute both statistics. We impose a minimum frequency in the corpus for each selected bigram: otherwise, the top bigrams from the χ² test will contain only exceptionally rare words, as these are expected to co-occur so rarely that even a few co-occurrences can trigger significance.

Taking inspiration from byte-pair encoding, or BPE, we propose an alternative to obtain word-pair encoding (WPE) tokens. To do this, we first tokenize a large corpus and then collect bigram counts for all bigrams found in the corpus. Second, we merge the most frequent bigram to form a new WPE token. This new bigram is then treated as a word in all occurrences. Next, we repeat the counting and merging process with one extra word type. Finally, we obtain a vocabulary list of both unigram and WPE tokens.

4 Evaluation Metrics

Held-Out Likelihood. When multi-word phrases are converted to individual tokens, the number of tokens in the document decreases while the size of the corpus vocabulary increases. It is therefore illogical to compare the likelihoods of the word-token model and the WPE-token model directly.

In order to normalize the scores between two models that do not have exactly the same vocabulary and tokens, we use the log-likelihood ratio between the LDA model likelihood and the null (unigram) likelihood for each model. In other words, we normalize the LDA model likelihood ($\mathcal{L}_{\mathrm{model}}$) by dividing it by the unigram likelihood ($\mathcal{L}_{\mathrm{unigram}}$), as introduced by Schofield and Mimno (2016):

\[
\mathrm{PTLL}_{\mathrm{norm}} = \frac{\log \mathcal{L}_{\mathrm{model}} - \log \mathcal{L}_{\mathrm{unigram}}}{N} \qquad (1)
\]

New Metric: Concatenation-based Embedding Silhouette (CBES). Previous measures of topic coherence rely on statistics from the training data and assume that the vocabularies are identical for both models, which is not the case in our settings. To address this, we propose a new metric called the concatenation-based embedding silhouette (CBES), which measures both the coherence within a topic and the distinguishability of different topics in the LDA results. CBES extends silhouette coefficients (Rousseeuw, 1987), a common clustering evaluation metric, by projecting tokens and multiword tokens into the same space and computing the silhouette coefficients in this vector space in the usual way.

A good topic should have all of its topic keys close to each other and away from other

                               % merged
Corpus       Docs   Tokens   χ²      t       WPE
NYTimes      80K    2M       15.43   15.99   16.55
SOTU         21K    1M       12.21   12.90   13.53
Yelp         200K   13M      9.47    10.33   12.16
TNC          2K     4M       11.43   11.50   8.40
BEST         4K     6M       12.43   12.50   9.53
Wongnai      40K    8M       5.68    5.71    4.48
Prachathai   68K    119M     13.41   13.45   10.36
Chinanews    100K   3M       12.62   14.35   10.62
Dianping     100K   4M       3.22    3.84    2.59
Douban       200K   2M       4.61    5.18    3.99

Table 1: A survey of corpora providing the number of documents and tokens, as well as the percentage of unigram tokens merged using each approach.

χ²: debes jugar, euskaltel euskadi, taare zameen, chetro ketl, hetch hetchy, ngwat mahop, mullum malarum, pazz jop, phnom penh, eisernen kreuzes, sirimalle chettu, kasa vubu, moondram pirai, gjems onstad, lettow vorbeck, pather panchali, ioann zlatoust, kud wafter, poquita ropa, viribus unitis

t: united states, new york, world war, km h, take place, miles km, los angeles, united kingdom, first time, high school, tropical storm, new zealand, war ii, video game, mph km, h mph, north america, air force, two years, peak number

WPE: unite state, new york, take place, first time, unite kingdom, follow year, world war ii, also know, next day, new york city, high school, los angeles, north america, even though, new zealand, follow day, become first, also use, year old, take part

Figure 1: Different collocation scoring methods result in different top 20 English collocations.
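The WPE procedure described in Section 3 (count bigrams, merge the most frequent one, treat the merged pair as a word, repeat) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (count_bigrams, merge_pair, learn_wpe), the underscore joiner, the min_count stopping threshold, and the rule for breaking ties between equally frequent bigrams are all hypothetical choices introduced here.

```python
from collections import Counter


def count_bigrams(docs):
    """Count adjacent token pairs over all tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return counts


def merge_pair(doc, pair, joiner="_"):
    """Replace each left-to-right, non-overlapping occurrence of `pair`
    in `doc` with a single concatenated token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) == pair:
            merged.append(doc[i] + joiner + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


def learn_wpe(docs, num_merges, min_count=2):
    """Greedily merge the most frequent bigram, one new word type per step.

    Assumptions (not from the paper): ties are broken by the
    lexicographically largest pair, and merging stops early once the
    best bigram occurs fewer than `min_count` times.
    """
    docs = [list(doc) for doc in docs]
    merges = []
    for _ in range(num_merges):
        counts = count_bigrams(docs)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_count:
            break
        merges.append(pair)
        docs = [merge_pair(doc, pair) for doc in docs]
    # Vocabulary of the re-tokenized corpus: remaining unigrams plus WPE tokens.
    vocab = sorted({tok for doc in docs for tok in doc})
    return merges, docs, vocab


if __name__ == "__main__":
    corpus = [
        ["new", "york", "city", "is", "big"],
        ["i", "love", "new", "york", "city"],
        ["new", "york", "is", "big"],
    ]
    merges, merged_docs, vocab = learn_wpe(corpus, num_merges=2)
    print(merges)
    print(merged_docs)
    print(vocab)
```

Because a merged token is treated as an ordinary word in later rounds, a second merge can extend an earlier one, which is how three-word tokens such as "world war ii" and "new york city" in Figure 1 can arise from pairwise merging.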