Unsupervised dialogue intent detection via hierarchical topic model

Artem Popov (1,2), Victor Bulatov (1), Darya Polyudova (1), Eugenia Veselova (1)
(1) Moscow Institute of Physics and Technology
(2) Lomonosov Moscow State University

Abstract

One of the challenges in task-oriented dialogue system development is the scarce availability of labeled training data. The best way of getting it is to ask assessors to tag each dialogue according to its intent. Unfortunately, performing labeling without any provisional collection structure is difficult, since the very notion of the intent is ill-defined.

In this paper, we propose a hierarchical multimodal regularized topic model to obtain a first approximation of the intent set. Our rationale for using hierarchical models is their ability to take into account several degrees of dialogue relevancy. We attempt to build a model that can distinguish between subject-based (e.g. medicine and transport topics) and action-based (e.g. filing of an application and tracking application status) similarities. In order to achieve this, we divide the set of all features into several groups according to part-of-speech analysis. Various feature groups are treated differently on different hierarchy levels.

1 Introduction

One of the most important goals of task-oriented dialogue systems is to identify the user intention from the user utterances. State-of-the-art solutions like (Chen et al., 2017) require a lot of labeled data: the user's utterances (one or several per dialogue) have to be tagged with the intent of the dialogue.

This is a challenging task for a new dialogue collection because the set of all possible intents is unknown. Giving a provisional hierarchical collection structure to assessors could make the intent labeling challenge easier. The resulting labels will be more consistent and better suited for model training.

Simple intent analysis is based on empirical rules, e.g. the "question" intent contains the phrase "what is # of #" (Yan et al., 2017). More universal and robust dialogue systems should work without any supervision or predefined rules. Such systems can be implemented with automatic extraction of the semantic hierarchy from the query by multi-level clustering, based on different semantic frames (capability, location, characteristics etc.) in sentences (Chen et al., 2015). In our work intents represent a more complex entity which combines all intentions and objectives.

Many previous works take advantage of hierarchical structures in user intention analysis. In (Shepitsen et al., 2008) an automatic approach to document tagging through hierarchical clustering is used. However, this approach does not take advantage of peculiar phrase features, such as syntax or specific word order. Syntactic analysis of intention was applied in (Gupta et al., 2018) to decompose client intent. This hierarchical representation is similar to a constituency syntax tree: it contains intentions and objects as tree elements and demands deep analysis of every sentence. An attempt to extract subintents along with the main intent can be found in (Tang et al., 2018), but as shown below it is not necessary to apply neural networks for precise and efficient retrieval of multiple intents, especially in the unsupervised setting.

We propose a hierarchical multimodal regularized topic model as a simple and efficient solution for accurate approximation of the collection structure. The main contribution of this paper is the construction of a two-level hierarchical topic model using different features on the first and second levels. To the best of our knowledge, this is the first work that investigates that possibility.

We introduce a custom evaluation metric which measures the quality of hierarchical relations between topics and the quality of intent detection.

The hierarchy structure helps to make a provisional clustering more interpretable. Namely, we require first-level topics to describe the dialogue subject and second-level topics to describe the action the user is interested in. We accomplish this by incorporating information about part-of-speech (PoS) tags into the model.

This paper is organized as follows. Section 2 describes popular approaches to unsupervised text classification. Section 3 describes our reasoning behind the choice of model architecture. Section 4 briefly reviews our preprocessing pipeline and introduces several enhancements to existing NLP techniques. We demonstrate the results of our model in Section 5 and conclude our work in Section 6.

2 Text clustering approaches

2.1 Embedding approaches

The simplest way to build a clustering model on a collection of text documents includes two steps. On the first step, each document is mapped to a real-valued vector. On the second step, one of the standard clustering algorithms is applied to the resulting vectors.

There are many methods to build an embedding of a document. The simplest one is the tf-idf representation. Logistic regression on the tf-idf representation is quite a strong algorithm for the text classification problem and remains a respectable baseline even in deep neural network research (Park et al., 2019). However, the direct use of the tf-idf representation leads to poor results in the clustering problem because of the curse of dimensionality. Dimensionality reduction methods could be used to improve clustering quality: PCA or Uniform Manifold Approximation and Projection (UMAP, McInnes et al. (2018)).

Another popular approach makes use of different word embeddings (Esposito et al., 2016). First of all, each word is mapped to a real-valued vector. Then the document representation is derived from the embeddings of its words. The most popular embedding models belong to the word2vec family (Mikolov et al., 2013b): CBOW, Skip-gram and their modifications (Mikolov et al., 2013a). For a correct representation, word2vec models should be trained on a large collection of documents, for example, Wikipedia. Further improvement in the quality of clustering models with embeddings can be achieved through fine-tuning. Similarly to the tf-idf approach, dimensionality reduction is often employed for the clustering problem (Park et al., 2019). Several averaging schemes can be used to aggregate word embeddings into a document vector: mean, where all words contribute equally to the document, or idf-weighted, where rare words have a greater contribution than frequent words.
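To make the averaging schemes concrete, below is a minimal sketch (ours, not taken from the paper's implementation) of idf-weighted averaging of word embeddings into a document vector; the `word_vectors` lookup and `idf` weights are assumed to be precomputed, e.g. from a word2vec model and collection statistics.

```python
import numpy as np

def document_vector(tokens, word_vectors, idf, dim=300):
    """Idf-weighted average of word embeddings for one document."""
    vecs, weights = [], []
    for tok in tokens:
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            weights.append(idf.get(tok, 1.0))  # rare words contribute more
    if not vecs:                               # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.average(np.asarray(vecs), axis=0, weights=np.asarray(weights))

# toy usage with 3-dimensional embeddings
wv = {"tariff": np.array([1.0, 0.0, 0.0]), "plan": np.array([0.0, 1.0, 0.0])}
idf = {"tariff": 2.0, "plan": 0.5}
print(document_vector(["tariff", "plan", "unknown"], wv, idf, dim=3))
```

The resulting document vectors can then be reduced with PCA or UMAP and clustered as described above.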
2.2 Topic modeling

Another approach to the text clustering problem is topic modeling. A topic model simultaneously computes word and document embeddings and performs clusterization. It should be noted that in some cases topic-model-based embeddings outperform traditional word embeddings (Potapenko et al., 2017). The probability of the word w in the document d is represented by the formula below:

p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d) = \sum_{t \in T} \phi_{wt} \theta_{td},

where the matrix \Phi contains the probabilities \phi_{wt} of word w in topic t, and the matrix \Theta contains the probabilities \theta_{td} of topic t in document d.

Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2000) is the simplest topic model which describes words in documents by a mixture of hidden topics. The \Phi and \Theta distributions are obtained via maximization of the likelihood under probabilistic normalization and non-negativity constraints:

L(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in W} n_{dw} \log p(w \mid d) \to \max_{\Phi, \Theta},

\sum_{w \in W} \phi_{wt} = 1, \quad \phi_{wt} \ge 0,

\sum_{t \in T} \theta_{td} = 1, \quad \theta_{td} \ge 0.

This optimization problem can be effectively solved via the EM-algorithm or its online modifications (Kochedykov et al., 2017).

The Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) is an extension of pLSA with prior estimation of \Phi and \Theta, widely used in topic modelling. However, since the solution of both the pLSA and the LDA optimization problem is not unique, each solution may have different characteristics.
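For illustration, the likelihood above can be maximised with a few lines of dense-matrix EM. The sketch below is ours, not the authors' implementation (which relies on online EM in a topic modeling library), and assumes a small document-word count matrix n_dw.

```python
import numpy as np

def plsa_em(n_dw, num_topics, num_iters=50, seed=0):
    """Plain EM for pLSA on a dense |D| x |W| count matrix n_dw."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    phi = rng.random((n_words, num_topics))          # p(w|t), columns sum to 1
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((num_topics, n_docs))         # p(t|d), columns sum to 1
    theta /= theta.sum(axis=0, keepdims=True)
    for _ in range(num_iters):
        n_wt = np.zeros_like(phi)
        n_td = np.zeros_like(theta)
        for d in range(n_docs):
            # E-step: p(t|d,w) is proportional to phi_wt * theta_td
            p_tdw = phi * theta[:, d]                # shape (W, T)
            p_tdw /= p_tdw.sum(axis=1, keepdims=True) + 1e-12
            # M-step counts: spread the observed counts n_dw over topics
            counts = n_dw[d][:, None] * p_tdw
            n_wt += counts
            n_td[:, d] = counts.sum(axis=0)
        phi = n_wt / (n_wt.sum(axis=0, keepdims=True) + 1e-12)
        theta = n_td / (n_td.sum(axis=0, keepdims=True) + 1e-12)
    return phi, theta
```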

Additive Regularization of Topic Models (ARTM) (Vorontsov and Potapenko, 2015) is a non-Bayesian extension of the likelihood optimization task, providing robustness of the solution by applying different regularizers. Each regularizer is used to pursue different solution characteristics. For example, many varieties of LDA can be obtained from an ARTM model by using a certain smoothing regularizer; the pLSA model is an ARTM model without regularizers. Furthermore, documents can contain not only words but also terms of other modalities (e.g. authors, classes, n-grams), which allows us to select language features specific to our task. In this case, instead of a single \Phi matrix, we have several \Phi_m matrices, one for each modality m. The resulting functional to be optimized is the sum of the modality likelihoods weighted with coefficients \alpha_m, plus regularization terms:

\sum_{m} \alpha_m L(\Phi_m, \Theta_m) + R(\cup_m \Phi_m, \Theta) \to \max_{\Phi, \Theta}

3 Multilevel clustering

Our goal is to build a topic model with topics corresponding to the users' intents. We use the following operational definition of intent: two dialogues (as represented by the users' utterances) are said to have the same intent if both users would be satisfied with essentially the same reaction by the call centre operator. This definition, while inherently problematic, allows us to highlight several important practical problems:

• A simple bag-of-words (BoW) approach isn't sufficient. Compare: "I want my credit card to be blocked. What should I do?" and "My credit card is blocked, what should I do?".

• In some cases, the intent of a conversation is not robust to a single word change. "I want to make an appointment with a cardiologist" and "I want to make an appointment with a neurologist" are considered to have the same intent since they require the user to perform a virtually identical set of actions. However, "Payment of state duty for a passport" and "Payment of state duty for a vehicle" are vastly different.

To account for the BoW problem we add an n-gram modality and a ptdw smoothing regularizer (Skachkov and Vorontsov, 2018) for all tokens. The ptdw smoothing regularizer respects the sequential nature of text, making the distributions p(t|d,w) more stable for words w belonging to the same local segment. In a way, the p(t|d,w) distribution could be interpreted as the analogue of context embeddings in the topic modeling world. The p(t|d,w) distribution isn't used directly for topic representation, but it is used on the E-step of the EM-algorithm for the recalculation of \phi_{wt} and \theta_{td}.

In order to obtain more control over intent robustness we propose to use a two-level hierarchical topic model. The first level is responsible for coarse-grained similarity, while the second one could take into account less obvious but important differences.

The hierarchical ARTM model consists of two different ARTM models, one for each level, which are linked to each other. The first level of the hierarchical model can be any ARTM model. The second level is built using the regularizer from (Chirkova and Vorontsov, 2016) which ensures that each first-level topic is a convex combination of second-level topics. Various methods could be employed to ensure that each parent topic is connected to only a handful of relevant child topics: one can use either the interlevel sparsing regularizer (Chirkova and Vorontsov, 2016) or remove "bad" edges according to the EmbedSim metric (Belyy, 2018).
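As a rough illustration of how such a two-level model can be assembled, the sketch below uses the hierarchy support of the BigARTM library. This is not the authors' configuration: the file paths, topic counts and regularizer coefficient are placeholders, and the class and argument names (hARTM, add_level, parent_level_weight, HierarchySparsingThetaRegularizer) follow the BigARTM documentation as we recall it and should be checked against the installed version.

```python
import artm

# Placeholder input: a collection in BigARTM's Vowpal Wabbit format.
batches = artm.BatchVectorizer(data_path="dialogues_vw.txt",
                               data_format="vowpal_wabbit",
                               target_folder="batches")

hier = artm.hARTM(dictionary=batches.dictionary)
subject_level = hier.add_level(num_topics=20)              # coarse, subject-based topics
action_level = hier.add_level(num_topics=80,               # finer, action-based topics
                              parent_level_weight=1)
# Keep each parent topic connected to only a handful of child topics.
action_level.regularizers.add(
    artm.HierarchySparsingThetaRegularizer(name="sparse_links", tau=1.0))

hier.fit_offline(batch_vectorizer=batches, num_collection_passes=20)
```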
3.1 Distinct hierarchy levels

Building a two-level clustering model is a difficult task due to the inaccuracy of clustering algorithms. Provided that the documents in the model's first-level clusters are already similar to each other (as they should be), further separation could be complicated (especially if we attempt to subdivide each cluster by the same algorithm). In practice, the second-level clusters tend to repeat first-level clusters at a smaller scale instead of demonstrating some meaningful differences. In order to make our model able to distinguish new dissimilarities within clusters on the second level, we adjust the algorithm at the second level: in broad strokes, we base the second level of the model on different features.

In the context of our problem, we propose a separation based on the functional purpose of the model tokens. We divide all words and n-grams into two groups based on PoS analysis: "thematic" and "functional". The "functional" group consists of the verbs and the n-grams that contain at least one verb. The "thematic" group consists of the nouns and adjectives and the n-grams that contain at least one noun and have no verbs.
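A minimal sketch of this grouping (our illustration, not the authors' code) is shown below. The PoS tags are assumed to come from any morphological tagger for Russian (e.g. pymorphy2), and n-grams are assumed to be lemmas joined with underscores.

```python
def split_into_modalities(tokens, pos_of):
    """Split unigrams and n-grams (lemmas joined with '_') into the
    'thematic', 'functional' and 'common' groups of Section 3.1."""
    groups = {"thematic": [], "functional": [], "common": []}
    for tok in tokens:
        tags = {pos_of.get(w, "OTHER") for w in tok.split("_")}
        if "VERB" in tags:                      # verbs and n-grams containing a verb
            groups["functional"].append(tok)
        elif tags & {"NOUN", "ADJ"}:            # verb-free tokens with a noun or adjective
            groups["thematic"].append(tok)
        else:                                   # remaining tokens connect the two levels
            groups["common"].append(tok)
    return groups

# toy usage; in the model the 'thematic' modality gets a higher weight on the
# first hierarchy level and 'functional' on the second one
pos_of = {"change": "VERB", "tariff": "NOUN", "plan": "NOUN", "how": "OTHER"}
print(split_into_modalities(["how", "change", "tariff_plan", "change_tariff"], pos_of))
```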

Inspired by the multi-level (Tang et al., 2018) and multi-syntactic (Gupta et al., 2018) phrase annotations, along with the hierarchical partition, our approach is essential for the extraction of client goals and subgoals.

The purpose of the first hierarchy level is to determine the conversation subject (the entities the dialogue is about). Hence, at the first level of the hierarchy thematic tokens should have a noticeably higher weight than functional tokens. The purpose of the second level of the hierarchy, by contrast, is to determine the client intent concerning particular objects (e.g. what action the client is trying to perform). Here functional tokens should have a higher impact than thematic ones. The tokens unrelated to these two groups are used on both levels and serve as a connection between the layers.

4 Preprocessing

We use a standard preprocessing pipeline consisting of tokenization, lemmatization, part-of-speech tagging, n-gram extraction, named entity recognition and spell checking. In this section we describe some details of the preprocessing algorithms, since preprocessing is very important for morphologically rich languages such as Russian. The data preprocessing pipeline consists of many parts, therefore each part must be relatively fast. That is why we do not use powerful but heavy approaches such as (Devlin et al., 2018) for NER.

4.1 N-gram extraction

The conventional approach to surpassing the bag-of-words hypothesis of the model is adding n-grams or collocations into the model. To extract n-grams we use the TopMine algorithm (El-Kishky et al., 2014), based on word co-occurrence statistics.

However, we found it beneficial to implement some modifications. The first change alters the gathering and usage of word co-occurrence statistics: TopMine differentiates between the sequences (w1, w2) and (w2, w1), which is not desirable for synthetic languages with a less strict word order compared to English. To make it better suited to the Russian language, we use multisets as containers for collocations instead of sequences. The second change modifies the extraction process: while the original version of TopMine extracts only disjoint collocations and won't detect sub-collocations (e.g. if the n-gram "support vector machines" is extracted, the n-gram "support vector" will not be extracted), our modification will extract every high-scoring collocation at the cost of increased memory usage.
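The order-insensitive counting can be illustrated with a toy sketch (ours; it covers only the counting key, not the full TopMine phrase-merging procedure): candidate bigrams are keyed by the sorted pair of lemmas, so both word orders accumulate into the same counter.

```python
from collections import Counter

def count_unordered_bigrams(documents):
    """Count adjacent word pairs with an order-insensitive (multiset-like) key."""
    counts = Counter()
    for doc in documents:                       # doc is a list of lemmas
        for w, v in zip(doc, doc[1:]):
            counts[tuple(sorted((w, v)))] += 1  # "support vector" == "vector support"
    return counts

docs = [["support", "vector", "machine"], ["vector", "support"]]
print(count_unordered_bigrams(docs))
# Counter({('support', 'vector'): 2, ('machine', 'vector'): 1})
```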

4.2 Named entity recognition

There are a lot of references to the speakers' names, company and product names, streets, and cities in the dialogue collection. It makes sense to take some entities into account in a special way.

For the named entity recognition (NER) problem different methods are commonly used: rule-based, machine-learning-based or neural-network-based. We used the neural network from Arkhipov et al. (2017) pretrained on the PERSONS-1000 collection (Vlasova et al., 2014) in our experiments. We replace all person-related tokens by the <PERSON> tag.

4.3 Spell checking

Errors and typos in client utterances are common in the dialogue collection. The simplest way to deal with this problem is to apply a spell checking algorithm. We use the Jamspell algorithm (available on GitHub) because of its speed.

We make some modifications to adapt the Jamspell model to our case. First of all, the language model used to select the best correction candidates should be trained on the collection being clustered. This modification takes the collection specificity into account, so collection-specific words won't be treated as unknown.

Also, the set of candidates can be extended. According to published Yandex search engine error statistics (in Russian), word merging is one of the most frequent typos in the dialogues. Hence, we add candidates that are obtained by splitting a word in two.
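A toy sketch of this candidate extension (our illustration; in the real pipeline the candidates are re-ranked by the Jamspell language model trained on the collection): for a token, propose every split into two in-vocabulary words, which handles the frequent merged-words typo.

```python
def split_candidates(word, vocabulary):
    """Return all ways to split `word` into two known words."""
    return [(word[:i], word[i:])
            for i in range(1, len(word))
            if word[:i] in vocabulary and word[i:] in vocabulary]

vocab = {"tariff", "plan", "change"}
print(split_candidates("tariffplan", vocab))   # [('tariff', 'plan')]
```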

5 Experiments

We use two dialogue datasets from Russian call centres (∼90K dialogues each) in our experiments. The first dataset is collected from client dialogues with various public services. The second dataset consists of conversation logs of ISP tech support. All dialogues are between a user and a call agent; the mean length of a single dialogue is six utterances. Both datasets are proprietary.

5.1 Scoring metric

There are several approaches to measuring the quality of a topic model, especially its interpretability. The usual procedure involves evaluating the list of the most frequent topic words by human experts. However, this approach suffers from several fundamental limitations (Alekseev, 2018). Therefore we choose to employ a different method.

For each dataset, we collect a set of dialogue pairs to score our model. Following the reasoning outlined in Section 3, we generated a number of (d1, d2) pairs (where di is a dialogue) and asked three human experts to label them. To measure the quality of the model, we compare these labels to the labels predicted by the model.

The following list summarizes our approach to model estimation and the labeling guidelines for the human experts:

• 0: d1 and d2 have nothing in common. Such dialogues should correspond to different first-level topics.

• 1: both d1 and d2 are related to the same subject, but there are significant differences. Such dialogues should correspond to the same first-level topic, but to different second-level topics.

• 2: d1 shares an intent with d2. Such dialogues should correspond to the same first-level and second-level topics.

• ?: it is impossible to determine the intent for at least one of the dialogues.

We select the best model according to the accuracy metric on the given labeled pairs. Three sets of pairs are used for the estimation (∼12K and ∼1.5K for the first dataset, ∼1.5K for the second dataset). All model hyperparameters are tuned according to the accuracy on the 12K set ("1-big"). The two other sets are used to control overfitting ("1-small" and "2-small"). Notably, good performance on the 2-small set implies that the model generalizes beyond the initial training dataset.

The same preprocessing procedures are used for both datasets. All tokens are lemmatized, stop-tokens are deleted, and simple entities (e-mails, websites etc.) are replaced by their tags.
Operator utterances are deleted from the dialogue document (they are not informative in our datasets; for example, there are many cases where the operator fails to reply at all). Finally, each document is a concatenation of the user utterances from a single dialogue.

5.2 Baselines

As one of the baselines, we use the following procedure. First, we convert raw texts into real-valued vectors using pretrained embeddings or tf-idf scores in the way described in Section 2.1. Second, we cluster this dataset via the K-Means algorithm. Third, we treat each cluster as a separate collection and apply the K-Means algorithm again. As a result, we obtain both first-level and second-level clusters.

Other baseline models are a hierarchical topic model without any additional regularizers and a hierarchical topic model with \Phi and \Theta smoothing on both levels. For the K-Means-based algorithms we tune the embedding dimensionality and the number of clusters on both levels. For the topic-modeling-based algorithms we tune the number of topics on both levels. As shown in Table 1, the regularized topic model outperforms the K-Means approaches on two out of the three pair sets.

                     1-big   1-small   2-small
    hKmeans (tf-idf) 0.568   0.593     0.649
    hKmeans (emb.)   0.615   0.638     0.641
    hPLSA            0.603   0.675     0.633
    hARTM            0.636   0.683     0.631

    Table 1: Baselines accuracy

5.3 Proposed model performance

We use several NLP-based techniques described in Section 4 to improve the main model quality. We start with the hPLSA model. For each problem we test a few approaches and choose the best one. We add the main features one by one, e.g. we choose the best method for extracting n-grams and use it on the next step. We conduct all the experiments in the following order:

1. including an additional n-gram modality, choosing between the base and modified n-gram extraction methods, tuning the modality weights and the number of topics;

2. adding ptdw smoothing at the first model level for all tokens, tuning the regularizer coefficient and the number of topics;

3. replacing person-related named entities, choosing between the dictionary-based and RNN-based methods;

4. typo correction, choosing between the base and modified algorithm.
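Before turning to the results, here is how we read the pair-based accuracy from Section 5.1. The sketch below is our interpretation of that procedure, not the authors' code: a model's prediction for a pair is derived from its first- and second-level topic assignments and compared with the expert label, and pairs labelled "?" are skipped; the dictionaries mapping dialogue ids to topic ids are hypothetical.

```python
def predicted_label(first, second, d1, d2):
    """Map topic assignments of two dialogues to a pair label 0/1/2."""
    if first[d1] != first[d2]:
        return 0            # different first-level topics
    if second[d1] != second[d2]:
        return 1            # same subject, different second-level topics
    return 2                # same first- and second-level topic, i.e. same intent

def pair_accuracy(labelled_pairs, first, second):
    scored = [(d1, d2, y) for d1, d2, y in labelled_pairs if y != "?"]
    hits = sum(predicted_label(first, second, d1, d2) == y for d1, d2, y in scored)
    return hits / len(scored)

first = {"a": 0, "b": 0, "c": 1}
second = {"a": 3, "b": 4, "c": 7}
print(pair_accuracy([("a", "b", 1), ("a", "c", 0), ("b", "c", 2)], first, second))  # 0.666...
```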

                     1-big   1-small   2-small
    hPLSA            0.603   0.675     0.633
    + n-grams base   0.612   0.634     0.633
    + n-grams mod.   0.635   0.674     0.655
    + ptdw smooth.   0.640   0.678     0.660
    + NER dict.      0.634   0.661     0.635
    + NER NN         0.640   0.680     0.662
    + Jamspell       0.635   0.674     0.655
    + mod. Jamspell  0.657   0.686     0.663

    Table 2: NLP techniques quality improvement

As Table 2 demonstrates, our n-gram extraction method outperforms the traditional TopMine algorithm in this task. Replacing persons by a tag does not lead to a great improvement in quality. Our analysis of the hPLSA cluster top tokens shows that only 3% of the top tokens are related to persons; after the NER preprocessing the proportion of named entities in the top tokens reduces to 0.3%. At the same time, spellchecking improves the performance on all three pair sets. It should be noted that the standard Jamspell algorithm leads to a quality decrease.

Finally, we apply the feature grouping scheme proposed in Section 3.1. The results (Table 3) turned out to be reassuring: there is a noticeable performance boost on all of the pair sets.

                     1-big   1-small   2-small
    featured hARTM   0.657   0.686     0.663
    + groups         0.667   0.715     0.672

    Table 3: Grouping feature quality improvement

Further, we present some examples of the model performance. All example texts were translated from Russian to English. In Table 4 all subtopics of the topic "Tariff plan" are presented; each subtopic is described by its characteristic question.

    Tariff plan
    How to change the tariff plan?
    When did the tariff change happen?
    How often can I change my tariff plan?
    When will the changes take effect when the tariff is changed?
    Why can't I change the tariff?
    Why was the tariff plan changed without my knowledge?
    Why there are no available tariff plans for the transition?

    Table 4: Subtopics of the topic "Tariff plan"

In Table 5 we demonstrate the top documents corresponding to the subtopic "How do I switch from credit to advance payment?".

    How do I switch from credit to advance payment?
    How do I switch from credit to advance payment?
    Hi. Tell me can we change the credit system of payment to advance? Well thanks!
    I need to change my payment from credit to advance.
    How to disable credit payment system?
    Hello. Change the payment system from credit to advance!
    Good morning. How to change the payment system from credit to normal?
    Disable the credit payment system.

    Table 5: Top documents of the subtopic "How do I switch from credit to advance payment?"

6 Conclusion

In this paper, we report a success in formalizing a clustering process suitable for unsupervised inference of user intents.

The realization that any intent consists of two crucial parts, the entity relevant to the user's request and the action the user wishes to perform, helped us to choose a two-level hierarchical model as our main tool. This led us to design a custom quality metric which takes into account several degrees of dialogue relevancy.

Our next step was to devise a PoS-based feature separation and to leverage n-grams, named entities and spellchecking. This allowed us to construct a hierarchical multimodal regularized topic model which outperforms all baseline models.

Acknowledgments

We thank our colleagues Alexey Goncharov and Konstantin Vorontsov from the Machine Intelligence Laboratory who provided expertise that greatly assisted the research. We thank Evgeny Egorov for comments that greatly improved the manuscript.

References

V.A. Alekseev, V.G. Bulatov, and K.V. Vorontsov. 2018. Intra-text coherence as a measure of topic models' interpretability. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue, pages 1–13.

Mikhail Y. Arkhipov, Mikhail S. Burtsev, et al. 2017. Application of a hybrid Bi-LSTM-CRF model to the task of Russian named entity recognition. In Conference on Artificial Intelligence and Natural Language, Springer, pages 91–103.

A.V. Belyy, M.S. Seleznova, A.K. Sholokhov, and K.V. Vorontsov. 2018. Quality evaluation and improvement for hierarchical topic modeling. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue, pages 110–123.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2015. Learning semantic hierarchy with distributed representations for unsupervised spoken language understanding. In Sixteenth Annual Conference of the International Speech Communication Association.

Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, and Xiaofei He. 2017. Dialogue act recognition via CRF-attentive structured network. CoRR abs/1711.05568.

Nadezhda Chirkova and Konstantin Vorontsov. 2016. Additive regularization for hierarchical multimodal topic modeling. Journal of Machine Learning and Data Analysis 2(2):187–200.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han. 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment 8(3):305–316.

Fabrizio Esposito, Anna Corazza, and Francesco Cutugno. 2016. Topic modelling with word embeddings. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 129–134.

Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic parsing for task oriented dialog using hierarchical representations. In EMNLP.

Thomas Hofmann. 2000. Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. pages 914–920.

Denis Kochedykov, Murat Apishev, Lev Golitsyn, and Konstantin Vorontsov. 2017. Fast and modular regularized topic modelling. In 2017 21st Conference of Open Innovations Association (FRUCT), IEEE, pages 182–193.

Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jinuk Park, Chanhee Park, Jeongwoo Kim, Minsoo Cho, and Sanghyun Park. 2019. ADC: Advanced document clustering using contextualized representations. Expert Systems with Applications.

Anna Potapenko, Artem Popov, and Konstantin Vorontsov. 2017. Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks. In Conference on Artificial Intelligence and Natural Language, Springer, pages 167–180.

Andriy Shepitsen, Jonathan Gemmell, Bamshad Mobasher, and Robin Burke. 2008. Personalized recommendation in social tagging systems using hierarchical clustering. In Proceedings of the 2008 ACM Conference on Recommender Systems, ACM, pages 259–266.

Nikolay Skachkov and Konstantin Vorontsov. 2018. Improving topic models with segmental structure of texts. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue, pages 652–661.

Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, and Tony Jebara. 2018. Subgoal discovery for hierarchical dialogue policy learning. In EMNLP.

Nataliya Vlasova, Elena Syleymanova, and Igor Trofimov. 2014. The Russian language collection for the named-entity recognition task. Language Semantics: Models and Technologies, pages 36–40.

Konstantin Vorontsov and Anna Potapenko. 2015. Additive regularization of topic models. Machine Learning 101(1-3):303–323.

Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. 2017. Building task-oriented dialogue systems for online shopping. In Thirty-First AAAI Conference on Artificial Intelligence.
