Augmenting Word2vec with Latent Dirichlet Allocation Within a Clinical Application

Augmenting word2vec with latent Dirichlet allocation within a clinical application Akshay Budhkar Frank Rudzicz University of Toronto University of Toronto The Vector Institute Toronto Rehabilitation Institute-UHN [email protected] The Vector Institute [email protected] Abstract address the issue caused by the absence of contex- tual information embedded in these models. This paper presents three hybrid models that Here, we hypothesize that creating a hybrid of directly combine latent Dirichlet allocation LDA and word2vec models will produce discrim- and word embedding for distinguishing be- inative features. These complementing models tween speakers with and without Alzheimer’s have been previously combined for classification disease from transcripts of picture descrip- tions. Two of our models get F-scores over the by Liu et al.(2015), who introduced topical word current state-of-the-art using automatic meth- embeddings in which topics were inferred on a ods on the DementiaBank dataset. small local context, rather than over a complete document, and input to a skip-gram model. How- 1 Introduction ever, these models are limited when working with small context windows and are relatively expen- Word embedding projects word tokens into a sive to calculate when working with long texts as lower-dimensional latent space that captures se- they involve multiple LDA inferences per docu- mantic, morphological, and syntactic information ment. Mikolov et al.(2013). Separately but related, We introduce three new variants of hybrid LDA- the task of topic modelling also discovers latent word2vec models, and investigate the effect of semantic structures or topics in a corpus. Blei dropping the first component after principal com- et al.(2003) introduced latent Dirichlet allocation ponent analysis (PCA). These models can be (LDA), which is based on bag-of-words statistics thought of as extending the conglomeration of top- to infer topics in an unsupervised manner. LDA ical embedding models. We incorporate topical considers each document to be a probability dis- information into our word2vec models by using tribution over hidden topics, and each topic is a the final state of the topic-word distribution in the probability distribution over all words in the vo- LDA model during training. cabulary. Both the topic distributions and the word distributions assume distinct Dirichlet priors. 1.1 Motivation and related work The inferred probabilities over learned latent Alzheimer’s disease (AD) is a neurodegenera- topics of a given document (i.e., topic vectors) can tive disease that affects approximately 5.5 million arXiv:1808.03967v1 [cs.CL] 12 Aug 2018 be used along with a discriminative classifier, as Americans with annual costs of care up to $259B in the work by Luo and Li(2014), but other ap- in the United States, in 2017, alone (Alzheimer’s proaches such as TF-IDF (Lan et al., 2005) eas- Association et al., 2017). The existing state-of- ily outperform this model, like in the case of the the-art methods for detecting AD from speech Reuters-21578 corpus (Lewis et al., 1987). To ad- used extensive feature engineering, some of which dress this, Mcauliffe and Blei(2008) introduced a involved experienced clinicians. Fraser et al. supervised topic model, sLDA, with the intention (2016) investigated multiple linguistic and acous- of inferring latent topics that are predictive of the tic characteristics and obtained accuracies up to provided label. Similarly, Ramage et al.(2009) 81% with aggressive feature selection. introduced labeled LDA, another graphical model Standard methods that discover latent spaces variant of the LDA, to do text classification. Both from data, such as word2vec, allow for problem- these variants have competitive results, but do not agnostic frameworks that don’t involve extensive feature engineering. Yancheva and Rudzicz Sex (M/F) Age (years) (2016) took a step in this direction, clinically, by AD CT AD CT using vector-space topic models, again in detect- WLS -/- 681/685 - (-) 71.2 (4.4) ing AD, and achieved F-scores up to 74%. It is DB 82/158 82/151 71.8 (8.5) 65.2 (7.8) generally expensive to get sufficient labeled data Table 1: Demographics for DB and WLS for pa- for arbitrary pathological conditions. Given the tients with AD and controls (CT). All WLS par- sparse nature of data sets for AD, Noorian et al. ticipants are controls. Years are indicated by their (2017) augmented a clinical data set with norma- means and standard deviations. tive, unlabeled data, including the Wisconsin Lon- gitudinal Study (WLS), to effectively improve the Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Total state of binary classification of people with and CT 55 56 40 40 50 241 without AD. AD 56 54 70 70 60 310 In our experiments, we train our hybrid models on a normative dataset and apply them for classi- Table 2: DB test-data distribution fication on a clinical dataset. While we test and compare these results on detection of AD, this protocol (MacWhinney, 1992). We use a 5−fold framework can easily be applied to other text clas- group cross-validation (CV) to split this dataset sification problems. The goal of this project is to i) while ensuring that a particular participant does effectively augment word2vec with LDA for clas- not occur in both the train and test splits. Table sification, and ii) to improve the accuracy of de- 2 presents the distribution of Control and Demen- mentia detection using automatic methods. tia groups in the test split for each fold. WLS is used to train our LDA, word2vec and 2 Datasets hybrid models that are then used to generate fea- 2.1 Wisconsin Longitudinal Study ture vectors on the DB dataset. The feature vectors on the train set are used to train a discriminative The Wisconsin Longitudinal Study (WLS) is a classifier (e.g., SVM), that is then used to do the normative dataset where residents of Wisconsin AD/CT binary classification on the feature vectors (N = 10,317) born between 1938 and 1940 per- of the test set. form the Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination 2.3 Text pre-processing (Goodglass and Barresi, 2000). The audio ex- cerpts from the 2011 survey (N = 1,366) were con- During the training of our LDA and word2vec verted to text using the Kaldi open source auto- models, we filter out spaCy’s list of stop words matic speech recognition (ASR) engine, specifi- (Honnibal and Montani, 2017) from our datasets. cally using a bi-directional long short-term mem- For our LDA models trained on ASR transcripts, ory network trained to the Fisher data set (Cieri we remove the [UNK] and [NOISE] tokens gener- et al., 2004). We use this normative dataset to train ated by Kaldi. We also exclude the tokens um and our topic and word2vec models. uh, as they were the most prevalent words across most of the generated topics. We exclude all punc- 2.2 DementiaBank tuation and numbers from our datasets. DementiaBank (DB) is part of the TalkBank 3 Methods project (MacWhinney et al., 2011). Each participant was assigned to either the ‘Dementia’ group 3.1 Baselines (N = 167) or the ‘Control’ group (N = 97) based Once an LDA model is trained, it can be used to on their medical histories and an extensive neu- infer the topic distribution on a given document. ropsychological and physical assessment battery. We set the number of topics empirically to K=5 Additionally, since many subjects repeated their and K=25. engagement at yearly intervals (up to five years), We also use a pre-trained word2vec model we use 240 samples from those in the ‘Dementia’ trained on the Google News Dataset 1. The model group, and 233 from those in the ‘Control’ group. contains embeddings for 3 million unique words, Each speech sample was recorded and manually transcribed at the word level following the CHAT 1https://code.google.com/archive/p/word2vec/ though we extract the most frequent 1 million words for faster performance. Words in our corpus that do not exist in this model are replaced with the UNK token. We also train our own word vectors with 300 dimensions and window size of 2 to be consistent with the pre-trained variant. Words are required to appear at least twice to have a mapped word2vec embedding. Both models incorporate negative sampling to aid with better representa- tions for frequent words as discussed by Mikolov et al.(2013). Unless mentioned otherwise, the same parameters are used for all of our proposed word2vec-based models. Given these models, we represent a document Figure 1: Neural representation of topical by averaging the word embeddings for all the word2vec words in that document, i.e.: n P V Wi P p W i=1 i i avg word2vec = (1) topic vector = i=1 (2) n D V where n is the number of words in the document where V is the vocabulary size of our corpus, pi th and Wi is the word2vec embedding for the i is the probability that a given word appears in the word. This representation retains the number of topic, from LDA, and Wi is the word2vec embed- dimensions (N = 300) in the original model. ding of that word. Third, TF-IDF is a common numerical statistic Furthermore, this approach also represents a in information retrieval that measures the number given document (or transcript) using these topic of times a word occurs in a document, and through vectors as a linear combination of the topics in that the entire corpus. We use a TF-IDF vector rep- document. This combination can be thought of as resentation for each transcript for the top 1,000 a topic-influenced point representation of the doc- words after preprocessing.

Load more