Learning Disentangled Representations of Texts With

Home , Bottom of the Bottle

Learning Disentangled Representations of Texts with Application to Biomedical Abstracts Sarthak Jain Edward Banner Jan-Willem van de Meent Northeastern University Northeastern University Northeastern University [email protected] [email protected] [email protected]

Iain J. Marshall Byron C. Wallace King’s College London Northeastern University [email protected] [email protected]

Abstract Thus far in NLP, learned distributed representations have, with few exceptions (Ruder et al., We propose a method for learning disentan- 2016; He et al., 2017; Zhang et al., 2017), been en- gled representations of texts that code for dis- tangled: they indiscriminately encode all aspects tinct and complementary aspects, with the aim of texts. Rather than representing text via a mono- of affording efficient model transfer and interpretability. To induce disentangled embed- lithic vector, we propose to estimate multiple em- dings, we propose an adversarial objective beddings that capture complementary aspects of based on the (dis)similarity between triplets texts, drawing inspiration from the ML in vision of documents with respect to specific as- community (Whitney, 2016; Veit et al., 2017a). pects. Our motivating application is embed- As a motivating example we consider docu- ding biomedical abstracts describing clinical ments that describe clinical trials. Such publica- trials in a manner that disentangles the pop- tions constitute the evidence drawn upon to sup- ulations, interventions, and outcomes in a evidence-based medicine given trial. We show that our method learns port (EBM), in which representations that encode these clinically one formulates precise clinical questions with re- salient aspects, and that these can be effec- spect to the Populations, Interventions, Compara- tively used to perform aspect-specific retrieval. tors and Outcomes (PICO elements) of interest We demonstrate that the approach generalizes (Sackett et al., 1996).1 Ideally, learned represen- beyond our motivating application in experi- tations of such articles would factorize into em- ments on two multi-aspect review corpora. beddings for the respective PICO elements. This would enable aspect-specific similarity measures, 1 Introduction in turn facilitating retrieval of evidence concern- A classic problem that arises in (distributed) rep- ing a given condition of interest (i.e., in a spe- resentation learning is that it is difficult to deter- cific patient population), regardless of the inter- mine what information individual dimensions in ventions and outcomes considered. Better repre- an embedding encode. When training a classifier sentations may reduce the amount of supervision to distinguish between images of people and land- needed, which is expensive in this domain. scapes, we do not know a priori whether the model Our work is one of the first efforts to induce dis- is sensitive to differences in color, contrast, shapes entangled representations of texts,2 which we be- arXiv:1804.07212v2 [cs.CL] 3 Sep 2018 or textures. Analogously, in the case of natural lieve may be broadly useful in NLP. Concretely, language, when we calculate similarities between our contributions in this paper are as follows: document embeddings of user reviews, we cannot • We formalize the problem of learning disentan- know if this similarity primarily reflects user senti- gled representations of texts, and develop a rela- ment, the product discussed, or syntactic patterns. tively general approach for learning these from This lack of interpretability makes it difficult to aspect-specific similarity judgments expressed assess whether a learned representations is likely as triplets (s, d, o)a, which indicate that docu- to generalize to a new task or domain, hinder- ment d is more similar to document s than to ing model transferability. Disentangled represen- document o, with respect to aspect a. tations with known semantics could allow more ef- 1We collapse I and C because the distinction is arbitrary. ficient training in settings in which supervision is 2We review the few recent related works that do exist in expensive to obtain (e.g., biomedical NLP). Section5. • We perform extensive experiments that provide evidence that our approach yields disentangled representations of texts, both for our motivating task of learning PICO-specific embeddings of biomedical abstracts, and, more generally, for multi-aspect sentiment corpora.

2 Framework and Models Figure 1: We propose associating aspects with en- Recent approaches in computer vision have coders (low-level parameters are shared across as- emphasized unsupervised learning of disentan- pects; this is not shown) and training these with gled representations by incorporating information- triplets codifying aspect-wise relative similarities. theoretic regularizers into the objective (Chen et al., 2016; Higgins et al., 2017). These ap- all other aspects a0: In this case the model will proaches do not require explicit manual annota- use only the embeddings for aspect a to represent tions, but consequently they require post-hoc man- similarities. In general, we expect a compromise ual assignment of meaningful interpretations to between these extremes, and propose using nega- learned representations. We believe it is more nat- tive sampling to enable the model to learn targeted ural to use weak supervision to induce meaningful aspect-specific encodings. aspect embeddings. 2.2 Encoder Architecture 2.1 Learning from Aspect Triplets Designing an aspect-based model requires speci- As a general strategy for learning disentangled fying an encoder architecture. One consideration representations, we propose exploiting aspect- here is interpretability: a desirable property for specific document triplets (s, d, o)a: this signals aspect encoders is the ability to identify salient that s and d are more similar than are d and o, words for a given aspect. With this in mind, we with respect to aspect a (Karaletsos et al., 2015; propose using gated CNNs, which afford intro- Veit et al., 2017b), i.e., sima(d, s) > sima(d, o), spection via the token-wise gate activations. where sima quantifies similarity w.r.t. aspect a. Figure2 schematizes our encoder architecture. We associate with each aspect an encoder enca The input is a sequence of word indices d = (encoders share low-level layer parameters; see (w1, ..., wN ) which are mapped to m-dimensional Section 2.2 for architecture details). This is used to word embeddings and stacked into a matrix E = a a a [e , ..., e ]. These are passed through sequential obtain text embeddings (es , ed, eo). To estimate 1 N the parameters of these encoders we adopt a sim- convolutional layers C1, ..., CL, which induce rep- N×k ple objective that seeks to maximize the similarity resentations Hl ∈ R : a a between (ed, es ) and minimize similarity between a a Hl = fe(X ∗ Kl + bl) (2) (ed, eo), via the following maximum margin loss N×k a a a a a where X ∈ R is the input to layer Cl (either L(es , ed, eo) = max{0, 1 − sim(ed, es ) (1) a set of n-gram embeddings or Hl−1) and k is the + sim(ea, ea)} F ×k×k d o number of feature maps. Kernel Kl ∈ R k and bl ∈ R are parameters to be estimated, where i j Where similarity between documents and with F is the size of kernel window.3 An activation respect to a particular aspect a, sima(i, j), is sim- function fe is applied element-wise to the output ply the cosine similarity between the aspect em- a a of the convolution operations. We fix the size of beddings ei and ej . This allows for the same doc- N×k Hl−1 ∈ R by zero-padding where necessary. uments to be similar with respect to some aspects Keeping the size of feature maps constant across while dissimilar in terms of others. layers allows us to introduce residual connections; The above setup depends on the correlation be- the output of layer l is summed with the outputs of tween aspects in the training data. At one ex- preceding layers before being passed forward. treme, when triplets enforce identical similarities We multiply the output of the last convolutional for all aspects, the model cannot distinguish be- N×k N×1 layer HL ∈ R with gates g ∈ R to yield tween aspects at all. At the other extreme, triplets 3 N×m F ×m×k are present for only one aspect a, and absent for The input to C1 is E ∈ R , thus K1 ∈ R . atic Reviews (CDSR). Cochrane is an international E H1 H2 H3 organization that creates and curates biomedical ∗ ∗ ∗ systematic reviews. Briefly, such reviews seek to formally synthesize all relevant articles to answer K , b K , b K , b e 1 1 2 2 3 3 precise clinical questions, i.e., questions that specify a particular PICO frame. The CDSR consists d Shared Layers wg, bg of a set of reviews {Ri}. Reviews include mul- ∗ Convolution Op tiple articles (studies) {Sij}. Each study S con- Element Wise Sum ∗ g sists of an abstract A and a set of free text sum- Dot Product maries (sP , sI , sO) written by reviewers describing the respective P, I and O elements in S. Figure 2: Schematic of our encoder architecture. Reviews implicitly specify PICO frames, and

1×k thus two studies in any given review may be our final embedding ed ∈ R : viewed as equivalent with respect to their PICO aspects. We use this observation to derive docu- g = σ(HL · wg + bg) ment triplets. Recall that triplets for a given as- T (3) ed = g HL pect include two comparatively similar texts (s, d) k×1 and one relatively dissimilar (o). Suppose the as- where wg ∈ R and bg ∈ R are learned parameters and σ is the sigmoid activation function. We pect of interest is the trial population. Here we impose a sparsity-inducing constraint on g via the match a given abstract (d) with its matched population summary from the CDSR (s); this encour- `1 norm; this allows the gates to effectively serve as an attention mechanism over the input. Addi- ages the encoder to yield similar embeddings for tionally, to capture potential cross-aspect correla- the abstract and the population description. The tion, weights in the embedding and first convolu- dissimilar o is constructed to distinguish the given tional layers are shared between aspect encoders. abstract from (1) other aspect encodings (of interventions, outcomes), and, (2) abstracts for trials Alternative encoders. To assess the relative im- with different populations. portance of the specific encoder model architecture used, we conduct experiments in which we Concretely, to construct a triplet (s, d, o) for the fine-tune standard document representation mod- PICO data, we draw two reviews R1 and R2 from els via triplet-based training. Specifically, we con- the CDSR at random, and sample two studies from 0 sider a single-layer MLP with BoW inputs, and the first (s1, s1) and one from the second (s2). In- a Neural Variational Document Model (NVDM) tuitively, s2 will (very likely) comprise entirely 0 (Miao et al., 2016). For the NVDM we take a different PICO elements than (s1, s1), by virtue weighted sum of the original loss function and the of belonging to a different review. To formal- triplet-loss over the learned embeddings, where ize the preceding description, our triplet is then: 0P abstract P 0I 0O the weight is a model hyperparameter. (s = [s1 ], d = [s1 ], o = [s2 |s1 |s1 ]), where abstract s1 is the abstract for study s1, and aspect sum- 3 Varieties of Supervision maries for studies are denoted by superscripts. We include a concrete example of triplet construction Our approach entails learning from triplets that in the Appendix, Section D. codify relative similarity judgments with respect to specific aspects. We consider two approaches 3.2 Learning Directly from Aspect-Wise to acquiring such triplets: the first exploits aspect- Similarity Judgments specific summaries written for texts, and the second assumes a more general scenario in which we The preceding setup assumes a somewhat unique solicit aspect-wise triplet judgments directly. case in which we have access to aspect-specific summaries written for texts. As a more general 3.1 Deriving Triplets from Aspect Summaries setting, we also consider learning directly from In the case of our motivating example – disentan- triplet-wise supervision concerning relative simi- gled representations for articles describing clini- larity with respect to particular aspects (Amid and cal trials – we have obtained aspect-specific sum- Ukkonen, 2015; Veit et al., 2017a; Wilber et al., maries from the Cochrane Database of System- 2014). The assumption is that such judgments can be solicited directly from annotators, and thus the abstracts (Le and Mikolov, 2014). LDA: Latent approach may be applied to arbitrary domains, so Dirichlet Allocation. NVDM: A generative model long as meaningful aspects can be defined implic- of text where the representation is a vector of log- itly via pairwise similarities regarding them. frequencies that encode a topic (Miao et al., 2016). We do not currently have corpora with such ABAE: An autoencoder model that discovers la- judgments in NLP, so we constructed two datasets tent aspects in sentences (He et al., 2017). We ob- using aspect-specific sentiment ratings. Note that tain document embeddings by summing over con- this highlights the flexibility of exploiting aspect- stituent sentence embeddings. DSSM: A CNN wise triplet supervision as a means of learning based encoder trained with triplet loss over ab- disentangled representations: existing annotations stracts (Shen et al., 2014). can often be repurposed into such triplets. Hyperparameters and Settings. We use three layers for our CNN-based encoder (with 200 filters 4 Datasets and Experiments in each layer; window size of 5) and the PReLU We present a series of experiments on three cor- activation function (He et al., 2015) as fe. We use pora to assess the degree to which the learned rep- 200d word embeddings, initialized via pretraining resentations are disentangled, and to evaluate the over a corpus of PubMed abstracts (Pyysalo et al., utility of these embeddings in simple downstream 2013). We used the Adam optimization function retrieval tasks. We are particularly interested in the with default parameters (Kingma and Ba, 2014). ability to identify documents similar w.r.t. a target We imposed `2 regularization over all parameters, aspect. All parameter settings for baselines are re- the value of which was selected from the range ported in the Appendix (along with additional ex- (1e-2, 1e-6) as 1e-5. The `1 regularization pa- perimental results). The code is available at rameter for gates was chosen from the range (1e-2, https://github.com/successar/neural-nlp. 1e-8) as 1e-6. All model hyperparameters for our models and baselines were chosen via line search 4.1 PICO (EBM) Domain over a 10% validation set. We first evaluate embeddings quantitatively with Metric. For this evaluation, we used a held out set respect to retrieval performance. In particular, we of 15 systematic reviews (comprising 2,223 stud- assess whether the induced representations afford ies) compiled by Cohen et al.(2006). The idea is improved retrieval of abstracts relevant to a partic- that good representations should map abstracts in ular systematic review (Cohen et al., 2006; Wal- the same review (which describe studies with the lace et al., 2010). We then perform two evalua- same PICO frame) relatively near to one another. tions that explicitly assess the degree of disentan- To compute AUCs over reviews, we first calculate glement realized by the learned embeddings. all pairwise study similarities (i.e., over all studies The PICO dataset comprises 41K abstracts of in the Cohen corpus). We can then construct an articles describing clinical trials extracted from the ROC for a given abstract a from a particular re- CDSR. Each abstract is associated with a review view to calculate its AUC: this measures the prob- and three summaries, one per aspect (P/I/O). We ability that a study drawn from the same review keep all words that occur in ≥ 5 documents, con- will be nearer to a than a study from a different verting all others to unk. We truncate documents review. A summary AUC for a review is taken as to a fixed length (set to the 95th percentile). the mean of the study AUCs in that review. 4.1.1 Quantitative Evaluation Results. Table1 reports the mean AUCs over in- Baselines. We compare the proposed P, I and O dividual reviews in the Cohen et al.(2006) corpus, embeddings and their concatenation [P|I|O] to the and grand means over these (bottom row). In brief: following. TF-IDF: standard TF-IDF representa- The proposed PICO embeddings (concatenated) tion of abstracts. RR-TF: concatenated TF-IDF obtain an equivalent or higher AUC than base- vectors of sentences predicted to describe the re- line strategies on 12/14 reviews, and strictly higher spective PICO elements, i.e., sentence predictions AUCs in 11/14. It is unsurprising that we outper- made using the pre-trained model from (Wallace form unsupervised approaches, but we also best et al., 2016) — this model was trained using dis- RR-TF, which was trained with the same CDSR tant supervision derived from the CDSR. doc2vec: corpus (Wallace et al., 2016), and DSSM (Shen standard (entangled) distributed representations of et al., 2014), which exploits the same triplet loss Study TF-IDF Doc2Vec LDA NVDM ABAE RR-TF DSSM PIO [P|I|O] ACEInhib. 0.81 0.74 0.72 0.85 0.81 0.67 0.85 0.83 0.88 0.84 0.92 ADHD 0.90 0.82 0.83 0.93 0.77 0.85 0.83 0.86 0.75 0.91 0.89 Antihist. 0.81 0.73 0.67 0.79 0.84 0.72 0.91 0.88 0.84 0.89 0.91 Antipsych. 0.75 0.85 0.88 0.89 0.81 0.63 0.96 0.91 0.93 0.97 0.97 BetaBlockers 0.67 0.65 0.61 0.76 0.70 0.56 0.68 0.71 0.75 0.77 0.81 CCBlockers 0.67 0.60 0.67 0.70 0.69 0.58 0.76 0.73 0.69 0.74 0.77 Estrogens 0.87 0.85 0.60 0.94 0.85 0.82 0.96 1.00 0.98 0.83 1.00 NSAIDS 0.85 0.77 0.73 0.9 0.77 0.74 0.89 0.94 0.95 0.8 0.95 Opioids 0.81 0.75 0.80 0.83 0.77 0.76 0.86 0.80 0.83 0.92 0.92 OHG 0.79 0.80 0.70 0.89 0.90 0.72 0.90 0.90 0.95 0.95 0.96 PPI 0.81 0.79 0.74 0.85 0.82 0.68 0.94 0.94 0.87 0.87 0.95 MuscleRelax. 0.60 0.67 0.74 0.75 0.61 0.57 0.77 0.68 0.62 0.78 0.75 Statins 0.79 0.76 0.66 0.87 0.77 0.68 0.87 0.82 0.94 0.87 0.94 Triptans 0.92 0.82 0.83 0.92 0.75 0.81 0.97 0.93 0.79 0.97 0.97 Mean 0.79 0.76 0.73 0.85 0.78 0.70 0.87 0.85 0.84 0.87 0.91 Table 1: AUCs achieved using different representations on the Cohen et al. corpus. Models to the right of the | are supervised; those to the right of || constitute the proposed disentangled embeddings.

in a clinical trial mainly involving patients over qqq with and those who have undergone a C-section previ- coronary heart disease , ramipril reduced mortality while vitamin e had no preventive effect . ously (PWc); people at risk for colon cancer (CC); in a clinical trial mainly involving patients over qqq with men with and at risk of prostate cancer (PCt and coronary heart disease , ramipril reduced mortality while PCs, respectively) and individuals with atrial fib- vitamin e had no preventive effect . rillation (AF). This review is unusual in that it in a clinical trial mainly involving patients over qqq with coronary heart disease , ramipril reduced mortality while studies a single intervention (decision aids) across vitamin e had no preventive effect . different populations. Thus, if the model is suc- cessful in learning disentangled representations, Table 2: Gate activations for each aspect in a PICO the corresponding P vectors should roughly clus- abstract. Note that because gates are calculated at ter, while the I/C should not. the final convolution layer, activations are not in Figure3 shows a TSNE-reduced plot of the P, exact 1-1 correspondence with words. I/C and O embeddings induced by our model for these studies. Abstracts are color-coded to indi- as our model. We outperform the latter by an aver- cate the populations enumerated above. As hy- age performance gain of 4 points AUC (significant pothesized, P embeddings realize the clearest sep- at 95% level using independent 2-sample t-test). aration with respect to the populations, while the We now turn to the more important questions: I and O embeddings of studies do not co-localize are the learned representations actually disentan- to the same degree. This is reflected quantitatively gled, and do they encode the target aspects? Ta- in the AUC values achieved using each aspect em- ble2 shows aspect-wise gate activations for PICO bedding (listed on the Figure). This result implies elements over a single abstract; this qualitatively disentanglement along the desired axes. suggests disentanglement, but we next investigate Next we assembled 50 abstracts describing this in greater detail. trials involving hip replacement arthroplasty (HipRepl). We selected this topic because 4.1.2 Qualitative Evaluation HipRepl will either describe the trial population To assess the degree to which our PICO embed- (i.e., patients who have received hip replacements) dings are disentangled – i.e., capture complemen- or it will be the intervention, but not both. Thus, tary information relevant to the targeted aspects – we would expect that abstracts describing trials we performed two qualitative studies. in which HipRepl describes the population clus- First, we assembled 87 articles (not seen in ter in the corresponding embedding space, but not training) describing clinical trials from a review in the intervention space (and vice-versa). To test on the effectiveness of decision aids (Stacey et al., this, we first manually annotated the 50 abstracts, 2014) for: women with, at risk for, and geneti- associating HipRepl with either P or I. We used cally at risk for, breast cancer (BCt, BCs and BCg, these labels to calculate pairwise AUCs, reported respectively); type II diabetes (D); menopausal in Table3. The results imply that the popula- women (MW); pregnant women generally (PW) tion embeddings discriminate between studies that Figure 3: TSNE-reduced scatter of disentangled PICO embeddings of abstracts involving “decision aid” interventions. Abstracts are colored by known population group (see legend). Population embeddings for studies in the same group co-localize, more so than in the intervention and outcome space.

HipRepl I HipRepl P Mean Baseline Look Aroma Palate Taste Population 0.62 0.68 0.66 TF-IDF 0.63 0.62 0.62 0.61 Intervention 0.91 0.46 0.57 LDA 0.73 0.73 0.73 0.73 Outcome 0.89 0.42 0.54 Doc2Vec 0.61 0.61 0.61 0.61 NVDM 0.68 0.69 0.69 0.70 Table 3: AUCs realized over HipRepl studies us- ABAE 0.50 0.50 0.50 0.50 BoW + Triplet 0.85 0.90 0.90 0.92 ing different embeddings. Column: Study label NVDM + Triplet 0.90 0.91 0.92 0.95 (HipRepl as P or I). Row: Aspect embedding used. DSSM + Triplet 0.87 0.90 0.90 0.92 CNN + Triplet 0.92 0.93 0.94 0.96 Population american, area, breast, colorectal, diagnosis, inpatients, outpatients, stage, their, uk Table 5: AUC results for different representations Intervention adjunct, alone, an, discussion, intervention, on the BeerAdvocate data. Models beneath the methods, reduced, started, took, written second line are supervised. Outcome adults, area, either, eligible, importance, im- prove, mortality, pre, reduces, survival Look Aroma Palate Taste Table 4: Top ten most activated words, as deter- Look 0.92 0.89 0.88 0.87 Aroma 0.90 0.93 0.91 0.92 mined by the gating mechanism. Palate 0.89 0.92 0.94 0.95 Taste 0.90 0.94 0.95 0.96 enrolled patients with HipRepl and other studies. Table 6: Cross AUC results for different represen- Likewise, studies in which HipRepl was the inter- tations on the BeerAdvocate data. Row: Embed- vention are grouped in the interventions embedding used. Column: Aspect evaluated against. ding space, but not in the populations space. Aspect words. In Table4, we report the most activated unigrams for each aspect embedding on the views. Learned embeddings should capture differ- decision aids corpus. To derive these we use the ent aspects, e.g., taste or look in the case of beer. outputs of the gating mechanism (Eq.3), which is applied to all words in the input text. For each 4.2.1 Beer Reviews (BeerAdvocate) word, we average the activations across all ab- We conducted experiments on the BeerAdvocate stracts and find the top ten words for each aspect. dataset (McAuley et al., 2012), which contains The words align nicely with the PICO aspects, 1.5M reviews of beers that address four aspects: providing further evidence that our model learns appearance, aroma, palate, and taste. Free-text to focus on aspect-specific information. reviews are associated with aspect-specific numer- ical ratings for each of these, ranging from 1 to 4.2 Multi-Aspect Reviews 5. We consider ratings < 3 as negative, and > 3 We now turn from the specialized domain of as positive, and use these to generate triplets of biomedical abstracts to more general applications. reviews. For each aspect a, we construct triplets In particular, we consider learning disentangled (s, d, o)a by first randomly sampling a review d. representations of beer, hotel and restaurant re- We then select s to be a review with the same sen- Look Aroma Palate Taste aspect on our test set using different representa- Look -- 0.42 0.60 0.40 0.63 0.38 0.65 Aroma 0.33 0.69 -- 0.41 0.59 0.41 0.60 tions in Table5. Our model consistently outper- Palate 0.32 0.70 0.46 0.54 -- 0.49 0.52 forms baseline strategies over all aspects. Unsur- Taste 0.23 0.80 0.35 0.66 0.33 0.67 -- prisingly, the model outperforms unsupervised ap- 4 Table 7: ‘Decorrelated’ cross-AUC results on the proaches. We realize consistent though modest BeerAdvocate data, which attempt to mitigate improvement over triplet-supervised approaches confounding due to overall sentiment being cor- that use alternative encoders. related. Each cell reports metrics over subsets of In Table6 we present cross AUC evalua- reviews in which the sentiment differs between the tions. Rows correspond to the embedding used row and column aspects. The numbers in each cell and columns to the aspect evaluated against. are the AUCs w.r.t. sentiment regarding the col- As expected, aspect-embeddings perform better umn aspect achieved using the row and column w.r.t. the aspects for which they code, suggesting aspect representations, respectively. some disentanglement. However, the reduction in performance when using one aspect representation to discriminate w.r.t. others is not as pronounced timent with respect to a as d, and o to be a re- as above. This is because aspect ratings are highly view with the opposite sentiment regarding a. We correlated: if taste is positive, aroma is very likely selected 90K reviews for experiments, such that to be as well. Effectively, here sentiment entangles we had an equal number of positive and negative all of these aspects.5 reviews for each aspect. We only keep words ap- In Table7, we evaluate cross AUC perfor- pearing in at least 5 documents, converting all oth- mance for beer by first ‘decorrelating’ the aspects. ers to unk. We truncated reviews to 95 percentile Specifically, for each cell (k, k0) in the table, we length. We split our data into 80/10/10 ratio for first retrieve the subset of reviews in which the sen- training, validation and testing, respectively. timent w.r.t. k differs from the sentiment w.r.t. k0. Baselines. We used the same baselines as for Then we evaluate the AUC similarity of these re- 0 the PICO domain, save for RR-TF, which was views on the basis of sentiment concerning k us- 0 domain-specific. Here we also evaluate the result ing both k and k embeddings, yielding a pair of of replacing the CNN-based encoder with NVDM, AUCs (listed respectively). We observe that the 0 0 BoW and DSSM based encoders, respectively, using k embeddings to evaluate aspect k similar- each trained using triplet loss. ity yields better results than using k embeddings. We present the most activated words for each Hyperparameters and Settings. For the CNN- aspect (as per the gating mechanism) in Table8. based encoder, we used settings and hyperparam- And we present an illustrative review color-coded eters as described for the PICO domain. For the with aspect-wise gate activations in Table9. For BoW encoder, we used 800d output embeddings completeness, we reproduce the top words for as- and a PReLU activation function with `2 regular- pects discovered using He et al.(2017) in the Ap- ization set to 1e-5. For the NVDM based encoder, pendix; these do not obviously align with the tar- we used 200d embeddings. get aspects, which is unsurprising given that this is Metrics. We again performed an IR-type evalu- an unsupervised method. ation to assess the utility of representations. For each aspect k, we constructed an affinity matrix 4.2.2 Hotel & Restaurant Reviews k k A such that Aij = simk(ri, rj) for beer reviews Finally, we attempt to learn embeddings that dis- ri and rj. We consider two reviews similar un- entangle domain from sentiment in reviews. For der a given aspect k if they have the same (di- this we use a combination of TripAdvisor and chotomized) sentiment value for said aspect. We 4We are not sure why ABAE (He et al., 2017) performs compute AUCs for each review and aspect using so poorly on the review corpora. It may simply fail to promi- the affinity matrix Ak. The AUC values are aver- nently encode sentiment, which is important for these tasks. aged over reviews in the test set to obtain a final We note that this model performs reasonably well on the PICO data above, and qualitatively seems to recover reason- AUC metric for each aspect. We also report cross able aspects (though not specifically sentiment). AUC measures in which we use embeddings for 5Another view is that we are in fact inducing represen- aspect k to distinguish reviews under aspect k0. tations of pairs, and only the aspect varies across these; thus representations remain discrimina- Results We report the AUC measures for each tive (w.r.t. sentiment) across aspects. Look attractive, beautiful, fingers, pumpkin, quarter, received, retention, sheets, sipper, well-balanced timate similarity information implicitly. For ex- Aroma beer, cardboard, cheap, down, follows, medium- ample, triplet-based similarity embeddings may be light, rice, settled, skunked, skunky Palate bother, crafted, luscious, mellow, mint, range, rec- learned using ‘crowdkernels’ with applications to ommended, roasted, tasting, weight multi-view clustering (Amid and Ukkonen, 2015). Taste amazingly, down, highly, product, recommended, Models combining similarity with neural networks tasted, thoroughly, to, truly, wow mainly revolve around Siamese networks (Chopra Table 8: Most activated words for aspects on the et al., 2005) which use pairwise distances to learn beer corpus, as per the gating mechanism. embeddings (Schroff et al., 2015), a tactic we have followed here. Similarity judgments have also Yelp! ratings data. The former comprises reviews been used to generate document embeddings for of hotels, the latter of restaurants; both use a scale IR tasks (Shen et al., 2014; Das et al., 2016). of 1 to 5. We convert ratings into positive/negative Recently, He et al.(2017) introduced a neural labels as above. Here we consider aspects to be model for aspect extraction that relies on an at- the domain (hotel or restaurant) and the sentiment tention mechanism to identify aspect words. They (positive or negative). We aim to generate em- proposed an autoencoder variant designed to tease beddings that capture information about only one apart aspects. In contrast to the method we pro- of these aspects. We use 50K reviews from each pose, their approach is unsupervised; discovered dataset for training and 5K for testing. aspects may thus not have a clear interpretation. Baselines. We use the same baselines as for the Experiments reported here support this hypothe- BeerAdvocate data, and similarly use different en- sis, and we provide additional results using their coder models trained under triplet loss. model in the Appendix. Evaluation Metrics. We perform AUC and cross- Other recent work has focused on text gen- AUC evaluation as in the preceding section. For eration from factorized representations (Larsson the domain aspect, we consider two reviews simi- et al., 2017). And Zhang et al.(2017) proposed a lar if they are from the same domain, irrespective lightly supervised method for domain adaptation of sentiment. Similarly, reviews are considered using aspect-augmented neural networks. They similar with respect to the sentiment aspect if they exploited source document labels to train a clas- share a sentiment value, regardless of domain. sifier for a target aspect. They leveraged sentence- Results. In Table 10 we report the AUCs for each level scores codifying sentence relevance w.r.t. in- aspect on our test set using different representa- dividual aspects, which were derived from terms tions. Baselines perform reasonably well on the a priori associated with aspects. This supervi- domain aspect because reviews from different do- sion is used to construct a composite loss that cap- mains are quite dissimilar. Capturing sentiment in- tures both classification performance on the source formation irrespective of domain is more difficult, task and a term that enforces invariance between and most unsupervised models fail in this respect. source and target representations. In Table 11, we observe that cross AUC results are There is also a large body of work that uses much more pronounced than for the BeerAdvocate probabilistic generative models to recover latent data, as the domain and sentiment are uncorrelated structure in texts. Many of these models de- (i.e., sentiment is independent of domain). rive from Latent Dirichlet Allocation (LDA) (Blei et al., 2003), and some variants have explicitly rep- 5 Related Work resented topics and aspects jointly for sentiment Work in representation learning for NLP has tasks (Brody and Elhadad, 2010; Sauper et al., largely focused on improving word embeddings 2010, 2011; Mukherjee and Liu, 2012; Sauper and (Levy and Goldberg, 2014; Faruqui et al., 2015; Barzilay, 2013; Kim et al., 2013). Huang et al., 2012). But efforts have also been A bit more generally, aspects have also been in- made to embed other textual units, e.g. charac- terpreted as properties spanning entire texts, e.g., ters (Kim et al., 2016), and lengthier texts includ- a perspective or theme which may then color the ing sentences, paragraphs, and documents (Le and discussion of topics (Paul and Girju, 2010). This Mikolov, 2014; Kiros et al., 2015). intuition led to the development of the factorial Triplet-based judgments have been used in mul- LDA family of topic models (Paul and Dredze, tiple domains, including vision and NLP, to es- 2012; Wallace et al., 2014); these model individ- Look : deep amber hue , Aroma : deep amber hue Palate : deep amber hue , Taste :deep amber hue , this this brew is topped with a , this brew is topped with this brew is topped with a brew is topped with a finger finger of off white head . a finger of off white head . finger of off white head . of off white head . smell smell of dog unk , green unk smell of dog unk , green unk smell of dog unk , green unk of dog unk , green unk , , and slightly fruity . taste , and slightly fruity . taste , and slightly fruity . taste and slightly fruity . taste of of belgian yeast , coriander , of belgian yeast , coriander , of belgian yeast , coriander , belgian yeast , coriander , hard water and bready malt hard water and bready malt hard water and bready malt hard water and bready malt . light body , with little . light body , with little . light body , with little . light body , with little carbonation . carbonation . carbonation . carbonation .

Table 9: Gate activations for each aspect in an example beer review.

Baseline Domain Sentiment 7 Acknowledgements TF-IDF 0.59 0.52 Doc2Vec 0.83 0.56 This work was supported in part by National Li- LDA 0.90 0.62 NVDM 0.79 0.63 brary of Medicine (NLM) of the National In- ABAE 0.50 0.50 stitutes of Health (NIH) award R01LM012086, BoW + Triplet 0.99 0.91 NVDM + Triplet 0.99 0.91 by the Army Research Ofﬁce (ARO) award DSSM + Triplet 0.99 0.90 W911NF1810328, and research funds from CNN + Triplet 0.99 0.92 Northeastern University. The content is solely the responsibility of the authors. Table 10: AUC results for different representations on the Yelp!/TripAdvisor Data. Models beneath the second line are supervised. References

Baseline Domain Sentiment Ehsan Amid and Antti Ukkonen. 2015. Multiview Domain 0.988 0.512 triplet embedding: Learning attributes in multiple Sentiment 0.510 0.917 maps. In International Conference on Machine Learning. Table 11: Cross AUC results for different represen- David M Blei, Andrew Y Ng, and Michael I Jordan. tations for Yelp!/TripAdvisor Dataset. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan). ual word probability as a product of multiple latent Samuel Brody and Noemie Elhadad. 2010. An unsu- factors characterizing a text. This is similar to the pervised aspect-sentiment model for online reviews. In Proceedings of the Conference of the North Sparse Additive Generative (SAGE) model of text American Chapter of the Association for Computa- proposed by Eisenstein et al.(2011). tional Linguistics (NAACL). Association for Com- putational Linguistics.

6 Conclusions Xi Chen, Yan Duan, Rein Houthooft, John Schul- man, Ilya Sutskever, and Pieter Abbeel. 2016. In- We have proposed an approach for inducing disen- fogan: Interpretable Representation Learning by In- tangled representations of text. To learn such rep- formation Maximizing Generative Adversarial Nets. resentations we have relied on supervision codi- In Advances in Neural Information Processing Sys- tems. fied in aspect-wise similarity judgments expressed as document triplets. This provides a general su- Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. pervision framework and objective. We evaluated Learning a similarity metric discriminatively, with application to face verification. In CVPR 2005. this approach on three datasets, each with differ- IEEE Computer Society Conference on Computer ent aspects. Our experimental results demonstrate Vision and Pattern Recognition, 2005. that this approach indeed induces aspect-specific embeddings that are qualitatively interpretable and Aaron M Cohen, William R Hersh, K Peterson, and Po- Yin Yen. 2006. Reducing workload in systematic re- achieve superior performance on information review preparation using automated citation classifica- trieval tasks. tion. Journal of the American Medical Informatics Going forward, disentangled representations Association, 13(2). may afford additional advantages in NLP, e.g., by Arpita Das, Harish Yenala, Manoj Chinnakotla, and facilitating transfer (Zhang et al., 2017), or sup- Manish Shrivastava. 2016. Together we stand: porting aspect-focused summarization models. Siamese networks for similar question retrieval. In Proceedings of the 54th Annual Meeting of the As- Maria Larsson, Amanda Nilsson, and Mikael sociation for Computational Linguistics. Kageb˚ ack.¨ 2017. Disentangled Representa- tions for Manipulation of Sentiment in Text. arXiv Jacob Eisenstein, Amr Ahmed, and Eric P Xing. 2011. preprint arXiv:1712.10066. Sparse Additive Generative Models of Text. In Pro- ceedings of the International Conference on Ma- Quoc V Le and Tomas Mikolov. 2014. Distributed chine Learning (ICML). Representations of Sentences and Documents. In Proceedings of the International Conference on Ma- Manaal Faruqui, Jesse Dodge, Sujay K Jauhar, Chris chine Learning (ICML). Dyer, Eduard Hovy, and Noah A Smith. 2015. Retrofitting word vectors to semantic lexicons. Pro- Omer Levy and Yoav Goldberg. 2014. Dependency- ceedings of the Conference of the North American Based Word Embeddings. In Proceedings of the As- Chapter of the Association for Computational Lin- sociation of Computational Linguistics (ACL). guistics (NAACL). Julian McAuley, Jure Leskovec, and Dan Jurafsky. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian 2012. Learning attitudes and attributes from multi- Sun. 2015. Delving deep into rectifiers: Surpass- aspect reviews. In 2012 IEEE 12th International ing human-level performance on imagenet classifi- Conference on Data Mining (ICDM). cation. In Proceedings of the IEEE international conference on computer vision. Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Interna- Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel tional Conference on Machine Learning. Dahlmeier. 2017. An Unsupervised Neural Atten- tion Model for Aspect Extraction. In Proceedings Arjun Mukherjee and Bing Liu. 2012. Aspect extrac- of the Association for Computational Linguistics tion through semi-supervised modeling. In Proceed- (ACL). Association for Computational Linguistics. ings of the 50th Annual Meeting of the Associa- tion for Computational Linguistics: Long Papers- Irina Higgins, Loic Matthey, Arka Pal, Christopher Volume 1. Association for Computational Linguis- Burgess, Xavier Glorot, Matthew Botvinick, Shakir tics. Mohamed, and Alexander Lerchner. 2017. Beta- Vae: Learning Basic Visual Concepts with a Con- Michael Paul and Mark Dredze. 2012. Factorial LDA: strained Variational Framework. International Con- Sparse multi-dimensional text models. In Advances ference on Learning Representations. in Neural Information Processing Systems.

Eric H Huang, Richard Socher, Christopher D Man- Michael Paul and Roxana Girju. 2010. A two- ning, and Andrew Y Ng. 2012. Improving word dimensional topic-aspect model for discovering representations via global context and multiple word multi-faceted topics. In Proceedings of the AAAI prototypes. In Proceedings of the 50th Annual Meet- Conference on Artificial Intelligence. ing of the Association for Computational Linguis- tics. Association for Computational Linguistics. Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distribu- Theofanis Karaletsos, Serge Belongie, and Gunnar tional semantics resources for biomedical text pro- Ratsch.¨ 2015. Bayesian representation learn- cessing. In Proceedings of the 5th International ing with oracle constraints. arXiv preprint Symposium on Languages in Biology and Medicine. arXiv:1506.05011. Sebastian Ruder, Parsa Ghaffari, and John G Breslin. Suin Kim, Jianwen Zhang, Zheng Chen, Alice H Oh, 2016. A Hierarchical Model of Reviews for Aspect- and Shixia Liu. 2013. A Hierarchical Aspect- based Sentiment Analysis. Proceedings of the 2016 Sentiment Model for Online Reviews. In Proceed- Conference on Empirical Methods in Natural Lan- ings of the AAAI Conference on Artificial Intelli- guage Processing. gence. David L Sackett, William M C Rosenberg, J A Muir Yoon Kim, Yacine Jernite, David Sontag, and Alexan- Gray, R Brian Haynes, and W Scott Richardson. der M Rush. 2016. Character-aware neural language 1996. Evidence based medicine: what it is and what models. In Proceedings of the AAAI Conference on it isn’t. Bmj, 312(7023). Artificial Intelligence. Christina Sauper and Regina Barzilay. 2013. Auto- Diederik Kingma and Jimmy Ba. 2014. Adam: A matic aggregation by joint modeling of aspects and method for stochastic optimization. arXiv preprint values. Journal of Artificial Intelligence Research. arXiv:1412.6980. Christina Sauper, Aria Haghighi, and Regina Barzilay. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, 2010. Incorporating content structure into text anal- Richard Zemel, Raquel Urtasun, Antonio Torralba, ysis applications. In Proceedings of the 2010 Con- and Sanja Fidler. 2015. Skip-thought vectors. In ference on Empirical Methods in Natural Language Advances in neural information processing systems. Processing. Christina Sauper, Aria Haghighi, and Regina Barzilay. A Example of PICO Triplet generation 2011. Content models with attitude. In Proceed- ings of the 49th Annual Meeting of the Association As an illustrative example, we walk through con- for Computational Linguistics: Human Language struction of a single triplet for the PICO domain Technologies-Volume 1. in detail. Recall that we first randomly draw two Florian Schroff, Dmitry Kalenichenko, and James reviews, R1 and R2. In this case, R1 consists Philbin. 2015. Facenet: A unified embedding for of studies involving nocturnal enuresis and review face recognition and clustering. In Proceedings of R2 concerns asthma. From R1 we randomly select the IEEE Conference on Computer Vision and Pat- 0 tern Recognition. two studies S, S . (1) Here we sample abstract A for study S, Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, shown below: and Gregoire´ Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information In recent years the treatment of primary nocturnal enure- retrieval. In Proceedings of the 23rd ACM Inter- sis (PNE) with desmopressin (DDAVP) has been promising. national Conference on Conference on Information The route of administration until now had been intranasal, and Knowledge Management. but because the tablets were introduced for the treatment of diabetes insipidus they have also become available for the D Stacey, F Legar´ e,´ N F Col, C L Bennett, M J Barry, treatment of PNE. To find the optimal dosage of desmopressin tablets and to compare desmopressin’s efficacy with K B Eden, H Thomas, A Lyddiatt, R Thomson, placebo in a group of adolescents with severe monosymp- L Trevena, and J H C Wu. 2014. Decision aids tomatic enuresis. The long-term safety of desmopressin for people facing health treatment or screening de- was also studied in the same group of patients. The effect cisions. Cochrane Database of Systematic Reviews, of oral desmopressin (1-deamino-8-D-arginine-vasopressin) 1. (DDAVP tablets, Minirin) was investigated in 25 adolescents (ages 11 to 21 years) with severe monosymptomatic noc- Andreas Veit, Serge Belongie, and Theofanis Karalet- turnal enuresis. The first part of the dose-ranging study sos. 2017a. Conditional Similarity Networks. In comprised a single-blind dose titration period, followed by Proceedings of the IEEE Conference on Computer a double-blind, crossover efficacy period comparing desmo- Vision and Pattern Recognition. pressin with placebo. The final part was an open long-term study consisting of two 12-week treatment periods. The effi- Andreas Veit, Serge Belongie, and Theofanis Karalet- cacy of the drug was measured in reductions of the number of wet nights per week. During the first dose-titration period, the sos. 2017b. Conditional Similarity Networks. In majority of the patients were given desmopressin 400 micro- Computer Vision and Pattern Recognition (CVPR). grams, and the number of wet nights decreased from a mean of 4.9 to 2.8. During the double-blind period, a significant re- Byron C Wallace, Joel¨ Kuiper, Aakash Sharma, duction of wet nights was observed (1.8 vs 4.1 for placebo). Mingxi Brian Zhu, and Iain J Marshall. 2016. Ex- During the two long-term periods, 48% and 53% of the pa- tracting PICO sentences from clinical trial reports tients could be classified as responders (0 to 1 wet night per using supervised distant supervision. Journal of week) and 22% and 23.5% as intermediate responders (2 to Machine Learning Research, 17(132). 3 wet nights per week). No weight gain was observed due to water retention. After cessation of the drug, 44% of the pa- Byron C Wallace, Michael J Paul, Urmimala Sarkar, tients had a significant decrease in the number of wet nights. Thomas A Trikalinos, and Mark Dredze. 2014. A Oral desmopressin has a clinically significant effect on pa- large-scale quantitative analysis of latent factors and tients with PNE, and therapy is safe when administered as long-term treatment. sentiment in online doctor reviews. Journal of the American Medical Informatics Association, 21(6). For study S, the summaries in the CDSR are as Byron C Wallace, Thomas A Trikalinos, Joseph Lau, follow. First the P summary (sP ) , Carla Brodley, and Christopher H Schmid. 2010. Number of children: 10 Inclusion criteria: adolescents (pu- Semi-automated screening of biomedical citations berty stage at least 2, at least 12 years) Exclusion criteria: for systematic reviews. BMC bioinformatics, 11(1). treatment in previous 2 weeks, daytime wetting, UTI, urinary tract abnormalities Previous treatment: failed using alarms, William Whitney. 2016. Disentangled Repre- desmopressin, other drugs Age range 11-21, median 13 years sentations in Neural Models. arXiv preprint Baseline wetting 4.7 (SD 1.1) wet nights/week Department arXiv:1602.02383. of Paediatric Surgery, Sweden (s ) Michael J Wilber, Iljung S Kwak, and Serge J Belongie. I Summary I : 2014. Cost-effective hits for relative similarity com- A : desmopressin orally (dosage based on titration period) B parisons. In Second AAAI Conference on Human : placebo Duration 4 weeks each Computation and Crowdsourcing. The O summary (sO) Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. Wet nights during trial (number, mean, SD): A: 10, 1.8 (SD 2017. Aspect-augmented Adversarial Networks for 1.4); B: 10, 4.1 (1.5) Side effects: headache (5); abdominal Domain Adaptation. In Transactions of the Associ- pain (6); nausea and vertigo (1) All resolved while treatment ation for Computational Linguistics. continued (2) From study S0, the summaries in the CDSR B Implementation Details & Baseline are reproduced as follows. First, the P Summary Hyperparameters (s0 ): P All tokenisation has been done using default Number of children: 135 Dropouts: 23 excluded for non- 6 compliance, and 39 lost to follow up including 12 failed with spaCy tokenizer. alarms Inclusion criteria: monosymptomatic nocturnal enuresis, age > 5 years Exclusion criteria: previous treatment with B.1 PICO Domain DDAVP or alarm, urological pathology, diurnal enuresis, UTI Age, mean years: 11.2 Baseline wetting: A 21% dry nights, B.1.1 Baselines B 14% dry nights For the TF-IDF baseline, we use the implemen- 0 tation in scikit-learn7. The TF-IDF was fit on all I Summary (sI ): A (62): desmopressin 20 g intranasally increasing to 40 g CDSR data. The resulting transformation was ap- if response partial B (73): alarm (pad-and-bell) Duration of plied to Cohen corpus. We use cosine similarity to treatment 3 months. If failed at that time, changed to alterna- evaluate TF-IDF model. tive arm For Doc2Vec, we used the gensim implemen- 0 8 The O summary (sO) tation with an 800d embedding size, and a win- DRY nights at 3 months: A 85%; B: 90% Number not achiev- dow size of 10. These parameters were selected ing 14 dry nights: A 12/39; B: 6/37 Side effects: not men- via random search over validation data. We oth- tioned erwise used the default settings in gensim. The

00 vocabulary used was the same as above. (3) From review R2, we sample one study S . For LDA, we again used the scikit-learn imple- Matched summaries in the CDSR are as follows. mentation, setting the number of topics to 7; this 00 P Summary (sP ): was selected via line search over the validation set n = 8 Mean age = 52 Inclusion: intrinsic asthma, constant across the range 1 to 50. reversibility > 20% None had an acute exacerbation at time of study Exclusion: none listed For the NVDM baseline, we used hidden layer 00 comprising 500 dimensions and an output embed- I summary (sI ): ding dimension of 200. We used Tanh activa- #1: Atenolol 100 mg Metoprolol 100 mg Placebo #2: Terbu- tion functions. The model was trained using the taline (IV then inhaled) after Tx or placebo Adam optimizer, using a learning rate of 5e-5. We 00 O summary (sO): performed early stopping based on validation per- FEV1 Symptoms plexity. For the ABAE model, we use the same hyper- (4) We note how the summaries for S and S0 are parameters as reported in the original He et al. similar to each other (but not identical) since they (2017) paper. We experimented with 20, 50 and belong in the same review whereas they are quite 100 aspects. Of these, 100 performed best, which different from summaries for S00 which belongs in is what we used to generate the reported results. a different review. Now we construct the triplet For the DSSM model, we used 300d filters with (s, d, o)P as follows: filter windows of sizes (1, 6) and output embedding dimension of 256. We used tanh activation 0 s = sP : for both the convolution and output layers, and Number of children: 135 Dropouts: 23 excluded for non- imposed l2 regularization on weights (1e-5). The compliance, and 39 lost to follow up including 12 failed with model was trained using the Adam optimizer with alarms Inclusion criteria: monosymptomatic nocturnal enuresis, age > 5 years Exclusion criteria: previous treatment with early stopping on validation loss. DDAVP or alarm, urological pathology, diurnal enuresis, UTI Age, mean years: 11.2 Baseline wetting: A 21% dry nights, B.2 Beer Advocate Domain B 14% dry nights B.2.1 Baselines 00 0 0 So we have d : A and o = [sP |sI |sO]: For the TF-IDF baseline, we used the scikit-learn n = 8 Mean age = 52 Inclusion: intrinsic asthma, constant re- implementation. TF-IDF parameters were fit on versibility > 20% None had an acute exacerbation at time of study Exclusion: none listed — A (62): desmopressin the available training set. We used cosine similar- 20 g intranasally increasing to 40 g if response partial B ity for evaluation. (73): alarm (pad-and-bell) Duration of treatment 3 months. If 6 failed at that time, changed to alternative arm — DRY nights https://spacy.io/ at 3 months: A 85%; B: 90% Number not achieving 14 dry 7http://scikit-learn.org/ nights: A 12/39; B: 6/37 Side effects: not mentioned 8https://radimrehurek.com/gensim/ For Doc2Vec, we used the same method as tings as above for NVDM, and again used loss above, here setting (after line search) the embed- weighting of 0.001. ding size to 800d and window size to 7. For the DSSM baseline, we used 200d filters For the LDA baseline, we followed the same with filter windows of sizes (1, 4) and output em- strategy as above, arriving at 4 topics. bedding dimension of 256. We used tanh activa- For NVDM, we set the number of dimensions tions for both convolution and output layer and l2 in the hidden layer to 500, and set the embedding regularization on weights of 1e-6. The model was dimension to 145. As an activation for the hidden trained using the Adam optimizer with early stop- layer we used Tanh. Again the model was trained ping on validation loss. using the Adam optimizer, with a learning rate of 5e-5 and early stopping based on validation per- C Yelp!/TripAdvisor Dataset Word plexity. Activations For the ABAE baseline, we use the same hyperparameters as in He et al. (2017) paper, since Domain celebrity, cypress, free, inadequate, BeerAdvocate was one of the evaluation datasets kitchens, maroma, maya, scratchy, suburban, used in said work. supply For NVDM + Triplet Loss, we used same set- Sentiment awful, disgusting, horrible, tings as reported above for NVDM, with loss mediocre, poor, rude, sells, shame, unaccept- weighting of 0.001. able, worst For the DSSM baseline, we used 200d filters Table 12: Most activated words for ho- with filter windows of sizes (2, 4) and output em- tels/restaurant data. bedding dimension of 256. We used tanh activations for both convolution and output layer and l2 We did not have sufficient space to include the regularization on weights of 1e-6. The model was most activated words per aspect (inferred via the trained using Adam optimizer with early stopping gating mechanism) for the Yelp!/TripAdvisor cor- on validation loss. pus in the manuscript (as we did for the other do- B.3 Yelp! and TripAdvisor Domain mains) and so we include them here. B.3.1 Baselines D Highlighted Text To induce the baseline TF-IDF vectors, we used To visualize the output of gating mechanism g ∈ the scikit-learn implementation. TF-IDF parame- RN×1, we perform the following transformations: ters were fit on the available training set. Our vocabulary comprised those words that appeared in 1. Since the gating is applied on the top convo- at least 5 unique reviews, which corresponded to lution layer, there is no one-to-one correspon- 10K words. We used cosine similarity for evalua- dence between words and values in g. Hence, tion. we convolved the output a 5 length mean fil- For Doc2Vec, we set the embedding size to ter to smooth the gating output. 800d and the window size to 7. For LDA, we again 2. We normalize the range of values in g to [0, 1] used the scikit-learn implementation, setting the using the following equation: number of topics to 10. For NVDM, we set the number of dimensions g − min(g) g = (4) in the hidden layer to 500, and set the embedding max(g) − min(g) dimension to 200. As an activation for the hidden layer we used Tanh. Again the model was trained In the following subsections, we provide 3 ex- using the Adam optimizer, with a learning rate of amples from each of our datasets highlighted with 5e-5 and early stopping based on validation per- corresponding gating output. plexity. D.1 PICO Domain For ABAE, we use the same hyperparameters in He et al. (2017), since Yelp was another one of the Color Legend : Population Intervention Outcome Example 1 evaluation datasets used in their work. determined the efficacy of unk ( unk ) in a clinical For NVDM + Triplet Loss, we used same set- population of aggressive , urban children diagnosed with attention deficit unk disorder ( adhd ) . in previous studies of high perinatal mortality . unk children with adhd , unk has been shown to be effective aims : to assess maternal and neonatal complications when compared with placebo . eighteen unk children ( ages in pregnancies of diabetic women treated with oral qqq to qqq years ) , diagnosed with adhd and attending a unk hypoglycaemic agents during pregnancy . methods : a cohort treatment program for youth with unk behavior disorders , study including all unk registered , orally treated pregnant participated in a double-blind placebo trial with assessment diabetic patients set in a diabetic unk service at a university data obtained from staff in the program and parents at home hospital : qqq women treated with metformin , qqq women . based on staff ratings of the children ’s behavior in the treated with sulphonylurea during pregnancy and a reference program and an academic classroom , the children displayed group of qqq diabetic women treated with insulin during significant improvements in adhd symptoms and aggressive pregnancy . results : the prevalence of pre-eclampsia was behavior with unk and high-dose unk conditions . at home , significantly increased in the group of women treated with parents and unk reported few significant differences between metformin compared to women treated with sulphonylurea placebo and unk on behavior ratings . in both settings , unk or insulin ( qqq vs. qqq vs. qqq % , p ¡ qqq ) . no difference in was well tolerated with few side effects found during active neonatal morbidity was observed between the orally treated drug conditions . and unk group ; no cases of severe hypoglycaemia or jaundice determined the efficacy of unk ( unk ) in a clinical were seen in the orally treated groups . however , in the group population of aggressive , urban children diagnosed with of women treated with metformin in the third trimester , the attention deficit unk disorder ( adhd ) . in previous studies of perinatal mortality was significantly increased compared to unk children with adhd , unk has been shown to be effective women not treated with metformin ( qqq vs. qqq % , p ¡ qqq ) . when compared with placebo . eighteen unk children ( ages conclusion : treatment with metformin during pregnancy was qqq to qqq years ) , diagnosed with adhd and attending a unk associated with increased prevalence of pre-eclampsia and a treatment program for youth with unk behavior disorders , high perinatal mortality . participated in a double-blind placebo trial with assessment aims : to assess maternal and neonatal complications data obtained from staff in the program and parents at home in pregnancies of diabetic women treated with oral . based on staff ratings of the children ’s behavior in the hypoglycaemic agents during pregnancy . methods : a cohort program and an academic classroom , the children displayed study including all unk registered , orally treated pregnant significant improvements in adhd symptoms and aggressive diabetic patients set in a diabetic unk service at a university behavior with unk and high-dose unk conditions . at home , hospital : qqq women treated with metformin , qqq women parents and unk reported few significant differences between treated with sulphonylurea during pregnancy and a reference placebo and unk on behavior ratings . in both settings , unk group of qqq diabetic women treated with insulin during was well tolerated with few side effects found during active pregnancy . results : the prevalence of pre-eclampsia was drug conditions . significantly increased in the group of women treated with determined the efficacy of unk ( unk ) in a clinical metformin compared to women treated with sulphonylurea population of aggressive , urban children diagnosed with or insulin ( qqq vs. qqq vs. qqq % , p ¡ qqq ) . no difference in attention deficit unk disorder ( adhd ) . in previous studies of neonatal morbidity was observed between the orally treated unk children with adhd , unk has been shown to be effective and unk group ; no cases of severe hypoglycaemia or jaundice when compared with placebo . eighteen unk children ( ages were seen in the orally treated groups . however , in the group qqq to qqq years ) , diagnosed with adhd and attending a unk of women treated with metformin in the third trimester , the treatment program for youth with unk behavior disorders , perinatal mortality was significantly increased compared to participated in a double-blind placebo trial with assessment women not treated with metformin ( qqq vs. qqq % , p ¡ qqq ) . data obtained from staff in the program and parents at home conclusion : treatment with metformin during pregnancy was . based on staff ratings of the children ’s behavior in the associated with increased prevalence of pre-eclampsia and a program and an academic classroom , the children displayed high perinatal mortality . significant improvements in adhd symptoms and aggressive Example 3 behavior with unk and high-dose unk conditions . at home , purpose : congestive heart failure is an important cause parents and unk reported few significant differences between of patient morbidity and mortality . although several placebo and unk on behavior ratings . in both settings , unk randomized clinical trials have compared beta-blockers was well tolerated with few side effects found during active with placebo for treatment of congestive heart failure , a drug conditions . meta-analysis unk the effect on mortality and morbidity has Example 2 not been performed recently . data unk : the unk , unk , aims : to assess maternal and neonatal complications and unk of unk electronic unk were unk from qqq to july in pregnancies of diabetic women treated with oral qqq unk were also identified from unk of unk unk . study hypoglycaemic agents during pregnancy . methods : a cohort selection : all randomized clinical trials of beta-blockers study including all unk registered , orally treated pregnant versus placebo in chronic stable congestive heart failure were diabetic patients set in a diabetic unk service at a university included . data extraction : a specified protocol was followed hospital : qqq women treated with metformin , qqq women to extract data on patient characteristics , unk used , overall treated with sulphonylurea during pregnancy and a reference mortality , hospitalizations for congestive heart failure , and group of qqq diabetic women treated with insulin during study quality . data unk : a unk unk model was used to unk pregnancy . results : the prevalence of pre-eclampsia was the results . a total of qqq trials involving qqq qqq patients significantly increased in the group of women treated with were identified . there were qqq deaths among qqq patients metformin compared to women treated with sulphonylurea randomly assigned to placebo and qqq deaths among qqq or insulin ( qqq vs. qqq vs. qqq % , p ¡ qqq ) . no difference in patients assigned to unk therapy . in these groups , qqq neonatal morbidity was observed between the orally treated and qqq patients , respectively , required hospitalization for and unk group ; no cases of severe hypoglycaemia or jaundice congestive heart failure . the probability that unk therapy were seen in the orally treated groups . however , in the group reduced total mortality and hospitalizations for congestive of women treated with metformin in the third trimester , the heart failure was almost qqq % . the best estimates of these perinatal mortality was significantly increased compared to advantages are qqq unk unk and qqq fewer hospitalizations women not treated with metformin ( qqq vs. qqq % , p ¡ qqq ) . per qqq patients treated in the first year after therapy . the conclusion : treatment with metformin during pregnancy was probability that these benefits are clinically significant ( ¿ associated with increased prevalence of pre-eclampsia and a qqq unk unk or ¿ qqq fewer hospitalizations per qqq patients treated ) is qqq % . both selective and unk agents produced D.2 BeerAdvocate these unk effects . the results are unk to any unk publication Color Legend : Look Aroma Palate Taste unk . conclusions : unk therapy is associated with clinically Example 1 meaningful reductions in mortality and morbidity in patients with stable congestive heart failure and should be routinely appearance was gold , clear , no head , and fizzy . almost offered to all patients similar to those included in trials . looked like champagne . aroma was actually real good . sweet apricot , unk smith apple , almost like a unk wine . taste was purpose : congestive heart failure is an important cause unk . i unk a couple of bottles an tried them so if they were of patient morbidity and mortality . although several skunked , it was unk unk that were bad . there was no real randomized clinical trials have compared beta-blockers apricot taste , and no sweetness . it tasted like hay but not in with placebo for treatment of congestive heart failure , a the good wheat beer sense , more like dirty hay thats been meta-analysis unk the effect on mortality and morbidity has under the budweiser unk . mouthfeel was thin and carbonated not been performed recently . data unk : the unk , unk , . drinkability ... well this is the second beer in my long list and unk of unk electronic unk were unk from qqq to july that i ’ve actually had to pour out . qqq unk were also identified from unk of unk unk . study selection : all randomized clinical trials of beta-blockers appearance was gold , clear , no head , and fizzy . almost versus placebo in chronic stable congestive heart failure were looked like champagne . aroma was actually real good . sweet included . data extraction : a specified protocol was followed apricot , unk smith apple , almost like a unk wine . taste was to extract data on patient characteristics , unk used , overall unk . i unk a couple of bottles an tried them so if they were mortality , hospitalizations for congestive heart failure , and skunked , it was unk unk that were bad . there was no real study quality . data unk : a unk unk model was used to unk apricot taste , and no sweetness . it tasted like hay but not in the results . a total of qqq trials involving qqq qqq patients the good wheat beer sense , more like dirty hay thats been were identified . there were qqq deaths among qqq patients under the budweiser unk . mouthfeel was thin and carbonated randomly assigned to placebo and qqq deaths among qqq . drinkability ... well this is the second beer in my long list patients assigned to unk therapy . in these groups , qqq that i ’ve actually had to pour out . and qqq patients , respectively , required hospitalization for appearance was gold , clear , no head , and fizzy . almost congestive heart failure . the probability that unk therapy looked like champagne . aroma was actually real good . sweet reduced total mortality and hospitalizations for congestive apricot , unk smith apple , almost like a unk wine . taste was heart failure was almost qqq % . the best estimates of these unk . i unk a couple of bottles an tried them so if they were advantages are qqq unk unk and qqq fewer hospitalizations skunked , it was unk unk that were bad . there was no real per qqq patients treated in the first year after therapy . the apricot taste , and no sweetness . it tasted like hay but not in probability that these benefits are clinically significant ( ¿ the good wheat beer sense , more like dirty hay thats been qqq unk unk or ¿ qqq fewer hospitalizations per qqq patients under the budweiser unk . mouthfeel was thin and carbonated treated ) is qqq % . both selective and unk agents produced . drinkability ... well this is the second beer in my long list these unk effects . the results are unk to any unk publication that i ’ve actually had to pour out . unk . conclusions : unk therapy is associated with clinically appearance was gold , clear , no head , and fizzy . almost meaningful reductions in mortality and morbidity in patients looked like champagne . aroma was actually real good . sweet with stable congestive heart failure and should be routinely apricot , unk smith apple , almost like a unk wine . taste was offered to all patients similar to those included in trials . unk . i unk a couple of bottles an tried them so if they were purpose : congestive heart failure is an important cause skunked , it was unk unk that were bad . there was no real of patient morbidity and mortality . although several apricot taste , and no sweetness . it tasted like hay but not in randomized clinical trials have compared beta-blockers the good wheat beer sense , more like dirty hay thats been with placebo for treatment of congestive heart failure , a under the budweiser unk . mouthfeel was thin and carbonated meta-analysis unk the effect on mortality and morbidity has . drinkability ... well this is the second beer in my long list not been performed recently . data unk : the unk , unk , that i ’ve actually had to pour out . and unk of unk electronic unk were unk from qqq to july Example 2 qqq unk were also identified from unk of unk unk . study poured into a nonic pint glass ... jet black , unk no selection : all randomized clinical trials of beta-blockers highlights around the edges at all , dense tan unk head ... versus placebo in chronic stable congestive heart failure were looks great ! a bit of sediment in the bottom of the bottle . included . data extraction : a specified protocol was followed bottle read “ live ale , keep unk . ” the store where i bought to extract data on patient characteristics , unk used , overall it from had it on the shelf at room temp : ( the smell was mortality , hospitalizations for congestive heart failure , and surprising ... an earthy roasted smell , mixed with day old study quality . data unk : a unk unk model was used to unk coffee . also some weird “ off ” licorice notes . taste was dry the results . a total of qqq trials involving qqq qqq patients at the start , and very dry on the finish . bitter roast notes with were identified . there were qqq deaths among qqq patients a sort of unk . full bodied with appropriate carbonation . i was randomly assigned to placebo and qqq deaths among qqq very excited to try this beer , and i was pretty disappointed . patients assigned to unk therapy . in these groups , qqq this bottle could be old , or the flavors could be “ off ” from and qqq patients , respectively , required hospitalization for the room temp unk . congestive heart failure . the probability that unk therapy poured into a nonic pint glass ... jet black , unk no reduced total mortality and hospitalizations for congestive highlights around the edges at all , dense tan unk head ... heart failure was almost qqq % . the best estimates of these looks great ! a bit of sediment in the bottom of the bottle . advantages are qqq unk unk and qqq fewer hospitalizations bottle read “ live ale , keep unk . ” the store where i bought per qqq patients treated in the first year after therapy . the it from had it on the shelf at room temp : ( the smell was probability that these benefits are clinically significant ( ¿ surprising ... an earthy roasted smell , mixed with day old qqq unk unk or ¿ qqq fewer hospitalizations per qqq patients coffee . also some weird “ off ” licorice notes . taste was dry treated ) is qqq % . both selective and unk agents produced at the start , and very dry on the finish . bitter roast notes with these unk effects . the results are unk to any unk publication a sort of unk . full bodied with appropriate carbonation . i was unk . conclusions : unk therapy is associated with clinically very excited to try this beer , and i was pretty disappointed . meaningful reductions in mortality and morbidity in patients this bottle could be old , or the flavors could be “ off ” from with stable congestive heart failure and should be routinely the room temp unk . offered to all patients similar to those included in trials . poured into a nonic pint glass ... jet black , unk no highlights around the edges at all , dense tan unk head ... Example 2 we have just returned from our first trip to looks great ! a bit of sediment in the bottom of the bottle . rome and what a wonderful time we had ! our stay at unk bottle read “ live ale , keep unk . ” the store where i bought unk was everything we had hoped for and all we had read it from had it on the shelf at room temp : ( the smell was about it was true ! the room was small , but very clean and surprising ... an earthy roasted smell , mixed with day old comfortable . the staff was fantastic and friendly , very helpful coffee . also some weird “ off ” licorice notes . taste was dry and the breakfast every day was wonderful ! we loved the at the start , and very dry on the finish . bitter roast notes with lift up to the rooms ! the location was great and at night we a sort of unk . full bodied with appropriate carbonation . i was left the windows open and listened to the bustle of the street very excited to try this beer , and i was pretty disappointed . below and a unk off somewhere ... we felt so ’ roman ” ! this bottle could be old , or the flavors could be “ off ” from they gave us champagne on our last night ! the gentleman the room temp unk . who works the night shift even gave us a personal escort to poured into a nonic pint glass ... jet black , unk no unk unk , a maze of metro stops , and delivered us right at the highlights around the edges at all , dense tan unk head ... gate ! unk was just so nice ! i highly recommend this small looks great ! a bit of sediment in the bottom of the bottle . hotel and would absolutely stay there again if we are lucky bottle read “ live ale , keep unk . ” the store where i bought enough to return ! they deserve their # qqq tripadvisor rating it from had it on the shelf at room temp : ( the smell was ! thanks you , unk unk , for your part in making this trip such surprising ... an earthy roasted smell , mixed with day old a pleasure ! coffee . also some weird “ off ” licorice notes . taste was dry we have just returned from our first trip to rome and what at the start , and very dry on the finish . bitter roast notes with a wonderful time we had ! our stay at unk unk was everything a sort of unk . full bodied with appropriate carbonation . i was we had hoped for and all we had read about it was true ! the very excited to try this beer , and i was pretty disappointed . room was small , but very clean and comfortable . the staff this bottle could be old , or the flavors could be “ off ” from was fantastic and friendly , very helpful and the breakfast the room temp unk . every day was wonderful ! we loved the lift up to the rooms Example 3 ! the location was great and at night we left the windows this is a real “ nothing ” beer . pours a unk yellow color , open and listened to the bustle of the street below and a looking more like unk water than beer . strange unk smell , unk off somewhere ... we felt so ’ roman ” ! they gave us with minimal unk hop aroma unk away behind whatever it is champagne on our last night ! the gentleman who works the that unk in this brew . taste is equally as unk . nearly unk , the night shift even gave us a personal escort to unk unk , a maze most you get from this beer is a slightly sweet unk flavor and of metro stops , and delivered us right at the gate ! unk was a lot of carbonation in the beginning . watery and thin , but just so nice ! i highly recommend this small hotel and would something that goes down easy so the drinkability is unk up absolutely stay there again if we are lucky enough to return ! slightly . unk . they deserve their # qqq tripadvisor rating ! thanks you , unk this is a real “ nothing ” beer . pours a unk yellow color , unk , for your part in making this trip such a pleasure ! looking more like unk water than beer . strange unk smell , Example 3 with minimal unk hop aroma unk away behind whatever it is oh , unk . true to name , you ’re like a guilty indulgence that unk in this brew . taste is equally as unk . nearly unk , the that i want more of . while visiting cleveland , i went to unk most you get from this beer is a slightly sweet unk flavor and for a late dinner around qqq on a thursday . i sat at the bar , a lot of carbonation in the beginning . watery and thin , but and there was quite a full crowd because they have a second , something that goes down easy so the drinkability is unk up late happy hour from qqq with great deals on drinks and small slightly . unk . plates . i ordered the house red wine ( $ qqq at happy hour this is a real “ nothing ” beer . pours a unk yellow color , ! ) , and two suggestions of the bartender : the roasted dates looking more like unk water than beer . strange unk smell , to start and the steak entree with a side of brussels sprouts with minimal unk hop aroma unk away behind whatever it is . the dates ( $ qqq ) were incredibly decadent : roasted and that unk in this brew . taste is equally as unk . nearly unk , the topped with almonds , bacon , unk and parsley , they are a most you get from this beer is a slightly sweet unk flavor and party in your both of sweet and salty and spicy . the sirloin a lot of carbonation in the beginning . watery and thin , but ( $ qqq ) , which the bartender told me is a new menu item something that goes down easy so the drinkability is unk up , was beautifully cooked and topped with parmesan , truffle slightly . unk . butter , arugula and mushrooms . it was good , although there this is a real “ nothing ” beer . pours a unk yellow color , are other things on the menu that might be more exciting . the looking more like unk water than beer . strange unk smell , stand out dish of the evening was the fried brussels sprouts with minimal unk hop aroma unk away behind whatever it is , which i would very much like the recipe for so i can unk that unk in this brew . taste is equally as unk . nearly unk , the myself on them every night . those babies , topped with crispy most you get from this beer is a slightly sweet unk flavor and crumbled bits of unk , capers and walnuts , were pure bliss a lot of carbonation in the beginning . watery and thin , but . and let ’s be honest , how often do you really have the something that goes down easy so the drinkability is unk up opportunity to call unk sprouts unk ? probably not very often slightly . unk . . i unk off the evening with a glass of unk scotch , neat . perfection . D.3 Yelp!/TripAdvisor Dataset oh , unk . true to name , you ’re like a guilty indulgence that i want more of . while visiting cleveland , i went to unk Color Legend : Sentiment Domain for a late dinner around qqq on a thursday . i sat at the bar , Example 1 i love everything about this place , i find my and there was quite a full crowd because they have a second , self there more often then i want , the deserts are amazing , late happy hour from qqq with great deals on drinks and small so many choices , the deli is awesome , the sushi bar is great plates . i ordered the house red wine ( $ qqq at happy hour , you can not go wrong with this place , if you want a formal ! ) , and two suggestions of the bartender : the roasted dates dining you can sit at the full service restaurant . to start and the steak entree with a side of brussels sprouts i love everything about this place , i find my self there . the dates ( $ qqq ) were incredibly decadent : roasted and more often then i want , the deserts are amazing , so many topped with almonds , bacon , unk and parsley , they are a choices , the deli is awesome , the sushi bar is great , you party in your both of sweet and salty and spicy . the sirloin can not go wrong with this place , if you want a formal dining ( $ qqq ) , which the bartender told me is a new menu item you can sit at the full service restaurant . , was beautifully cooked and topped with parmesan , truffle butter , arugula and mushrooms . it was good , although there are other things on the menu that might be more exciting . the stand out dish of the evening was the fried brussels sprouts , which i would very much like the recipe for so i can unk myself on them every night . those babies , topped with crispy crumbled bits of unk , capers and walnuts , were pure bliss . and let ’s be honest , how often do you really have the opportunity to call unk sprouts unk ? probably not very often . i unk off the evening with a glass of unk scotch , neat . perfection . PICO Domain listening, respondent, perceived, motivation, attitude, participant, asking, apprehension, helpfulness, antismoking virtual, prototype, smart, tracking, internally, locked, haptic, handle, autism, autistic psychosocial, gluconate, mental, prospective, social, genetic, psychological, environmental, sulphadoxine, reproductive regain, obviously, drive, relieved, promoted, bid, diego, mdi, bocs, rescue workplace, caring, carers, helpfulness, occupational, awareness, spiritual, motivating, personal, educational euro, biochemical, virological, effectivity, yearly, oxidation, dos, reversion, quitter, audiologic semen, infertile, uncircumcised, sperm, fertility, azoospermia, fe, ejaculation, caput, particulate obstetrician, radiologist, cardiologist, technician, physician, midwife, nurse, gynecologist, physiotherapist, anaesthetist nonsignificant, bias, unadjusted, quadratic, constraint, apgar, undue, trend, nonstatistically, ckd dependency, dependence, comorbid, dependent, type, sensitivity, insensitive, manner, specific, comorbidity stem, progenitor, resident, concomitant, recruiting, mobilized, promoted, precursor, malondialdehyde, reduces anemic, cardiorespiratory, normothermic, hemiplegic, thromboelastography, hematological, hemodilution, dyslipidemic premenopausal, locoregional, postmenopausal, menopausal, cancer, nonmetastatic, nsclc, pelvic, tirilazad, operable conscious, intubated, anaesthetized, terlipressin, corticotropin, ventilated, resuscitation, anesthetized, resuscitated, vasopressin quasi, intergroup, midstream, static, voluntary, csa, vienna, proprioceptive, stroop, multinational trachea, malnourished, intestine, diaphragm, semitendinosus, gastrocnemius, undernourished, pancreas, lung, nonanemic pound, mile, customized, pack, per, hundred, liter, oz, thousand, litre lying, correlated, activity, matter, functioning, transported, gut, subscale, phosphodiesterase, impacting count, admission, disparity, spot, parasitaemia, visit, quartile, white, age, wbc Yelp!-TripAdvisor Domain like, hate, grow, love, coming, come, hygienic, desperate, reminds, unsanitary, complainer food, service, restaurant, resturants, steakhouse, restaraunt, tapa, eats, restraunts, ate, eatery bunny, chimney, ooze, sliver, masked, adorned, cow, concrete, coating, triangle, beige cause, get, brain, nt, annoying, guess, time, sometimes, anymore, cuz, hurry week, bosa, month, today, california, visited, newton, bought, yogurtland, exellent, recent caring, outgoing, demeanor, gm, personable, engaging, employee, respectful, interaction, associate, incompetent staff, excellent, personel, good, unseasoned, superb, unremarkable, excellant, tasty, personnel, unfailingly place, sounding, scenery, cuppa, temptation, kick, indulge, confection, coffeehouse, lavender sweetened, watery, raisin, flavoring, doughy, mmmm, crumble, hazelnut, consistency, lemonade, powdered upgrade, premium, room, staff, offer, amenity, perk, suite, whereas, price, pricing walk, lined, directed, nearest, headed, adjacent, gaze, grab, lovely, surrounded, field time, afternoon, night, day, moment, evening, went, rushed, place, boyfriend, came hotel, property, travelodge, radisson, metropolitan, intercontinental, lodging, ibis, sofitel, novotel, doubletree vegetable, dish, herb, premade, delicacy, chopped, masa, lentil, tamale, canned, omlettes stay, staying, return, trip, vacation, honeymoon, visit, outing, revisit, celebrate, stayed great, awesome, amazing, excellent, fantastic, phenomenal, exceptional, nice, outstanding, wonderful, good room, cupboard, bedcover, drapery, bathroom, dusty, carpeting, laminate, ensuite, washroom, bedspread utilized, day, excercise, recycled, tp, liner, borrowed, depot, vanished, restocked argued, blamed, demanded, called, registered, referred, transferred, contacted, claiming, denied, questioned neck, bruise, suspended, souvenir, tragedy, godiva, depot, blazing, peice, investigating BeerAdvocate Dataset shade, ruddy, transluscent, yellowish, translucent, hue, color, foggy, pours, branded recommend, like, disturb, conform, recomend, suggest, consider, imagine, liken, resemble longneck, swingtop, tapered, greeted, squat, ml, stubby, getting, stubbie, oz earthy, toffee, fruit, fruity, molasses, herbal, woody, floral, pineapple, raisin scarily, faster, quicker, obliged, inch, trace, warranted, mere, frighteningly, wouldve dubbel, quad, doppelbock, weizenbock, witbier, barleywine, tripel, dopplebock, brew, dipa lack, chestnut, purple, mahogany, nt, weakness, burgundy, coppery, fraction, garnet offering, product, beer, stuff, brew, gueuze, saison, porter, geuze, news mich, least, bass, pacifico, sumpin, inferior, grolsch, westy, yanjing, everclear foam, disk, head, heading, buts, edging, remains, lacing, froth, liquid starbucks, coffe, pepsi, expresso, kahlua, cofee, liqour, esspresso, cuban, grocery roasty, hop, chocolatey, hoppy, chocolaty, roasted, roast, robust, espresso, chocolately bodied, body, feel, barley, mouthfeel, tone, enjoyably, texture, moutfeel, mothfeel lasted, crema, flute, head, retentive, smell, fing, duper, compact, retention nt, rather, almost, never, kinda, fiddle, manner, easier, memphis, hard wafted, jumped, onset, proceeds, consisted, rife, consists, lotsa, citrisy, consisting hop, hoppy, assertive, inclined, tempted, doubtful, manageable, bearable, importantly, prominant great, good, decent, crazy, fair, killer, ridiculous, serious, insane, tremendous beer, brew, aperitif, nightcap, simply, intro, truly, ale, iipa, aipa wayne, harrisburg, massachusetts, arizona, jamaica, lincoln, kc, oklahoma, adult, odd

Table 13: Top words generated for each topic using ABAE method