Improving Out-of-Scope Detection in Intent Classification by Using Embeddings of the Graph Space of the Classes

Paulo Cavalin, Victor Henrique Alves Ribeiro, Ana Paula Appel, Claudio Pinhanez
IBM Research, São Paulo, SP, Brazil
[email protected]

Abstract

This paper explores how intent classification can be improved by representing the class labels not as a discrete set of symbols but as a space where the word graphs associated to each class are mapped using typical graph embedding techniques. The approach, inspired by a previous algorithm used for an inverse dictionary task, allows the classification algorithm to take into account inter-class similarities provided by the repeated occurrence of some words in the training examples of the different classes. The classification is carried out by mapping text embeddings to the word graph embeddings of the classes. Focusing solely on improving the representation of the class label set, we show, in experiments conducted on both private and public intent classification datasets, that better detection of out-of-scope (OOS) examples is achieved and, as a consequence, that the overall accuracy of intent classification is also improved. In particular, using the recently released Larson dataset, an error of about 9.9% has been achieved for OOS detection, beating the previous state-of-the-art result by more than 31 percentage points.

1 Introduction

Intent classification is usually applied for response selection in conversational systems, such as text-based chatbots. For the end-user to have the best possible experience with those systems, it is expected that an intent classifier is able not only to map an input utterance to the correct intent but also to detect when the utterance is not related to any of the intents, which we refer to as out-of-scope (OOS)¹ inputs or samples. In the light of this, this paper describes and evaluates a method which tries to capture the complexity of the set of intents by embedding them into a vector space created using word graphs, as described later. We show that, although the method is in some cases able to improve the accuracy of a text classifier on in-scope examples, it often has a tremendous impact on improving the ability of the text classifier to reject OOS text, without relying on OOS examples in the training set.

Notice that the intent classifier is typically implemented using standard text classification algorithms (Weiss et al., 2012; Larson et al., 2019; Casanueva et al., 2020). Consequently, to perform OOS sample detection, methods often rely on one-class classification or on threshold rejection-based techniques using the probability outputs for each class (Larson et al., 2019) or reconstruction errors (Ryu et al., 2017, 2018). There also exist approaches based on the assumption that OOS data can be collected and included in the training set (Tan et al., 2019; Larson et al., 2019). However, in practice, collecting OOS data can be a burden for intent classifier creation, which is generally carried out by domain experts and not by machine learning experts. Thus, in the ideal world, one should rely solely on in-scope data for this task, because it is very difficult to collect a set of data that appropriately represents the space of the very unpredictable OOS inputs.

The classes in a traditional text classifier are generally represented by a discrete set of symbols, and the classifier is trained with the help of a finite set of examples, where the classes are assumed to be independent and the sets of examples to be disjoint. But, in many cases, the classes are in fact associated with inter-connected higher-level concepts which could be formatted into more meaningful representations and better exploited in the classification process for an enhanced representation of the scope of the classifier.

In particular, we explore here the use of graphs, which represent information by means of nodes

¹Out-of-domain examples is also a common term in the literature.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3952–3961, November 16–20, 2020. ©2020 Association for Computational Linguistics

connected to each other by arcs. Recent research has demonstrated that nodes in a graph can be converted to an embedding, that is, projected into a vector space, which can then be mapped to sentences to cope with tasks such as the reverse dictionary problem (Hill et al., 2016; Kartsaklis et al., 2018). We propose here an adaptation of those ideas to an intent classifier, so that it uses such mappings to expand the representation of the class space, its scope, and class inter-dependencies, thus possibly making the OOS detection task easier.

This paper presents an investigation of exploiting information from word graphs associated to the intent classes to improve OOS sample detection in intent classification. By considering that each class is represented by a set of text examples and that different classes can be connected to each other by means of the repeated occurrence of words in their respective examples, we build a word graph where both class labels and words are represented by single nodes. The word nodes are connected to the class label nodes in accordance with their occurrence in the training samples and their respective class labels. Then, a typical graph embedding technique is used to represent classes with the embedding of their corresponding class label node. Instead of finding the class with the highest probability, the intent classifier searches for the class embedding which maps best to the sentence embedding of a given input sample.

We have implemented and tested this idea with different types of base methods for sentence embedding, such as Long Short-Term Memory (LSTM) neural networks and Bidirectional Encoder Representations from Transformers (BERT), and performed OOS detection by means of a simple threshold-based rejection. We conducted a thorough evaluation on both private and public intent classification datasets, such as the Larson dataset for this specific task (Larson et al., 2019). Our results show that the proposed word graph-based method improves OOS detection considerably, compared against the corresponding traditional classification algorithms based on combining the sentence embedding algorithm with softmax probabilities. In the case of the Larson dataset, where comparison against varied OOS detection methods is available, we show that our proposed approach dramatically reduces the previous state-of-the-art (SOTA) false acceptance rate by more than 30 percentage points, from 41.1% to 9.9%.

2 The Word Graph Method

This section presents a formal description of the methodology employed in this work.

2.1 Embedding the Set of Classes

An intent classification method is a function D which maps a set of sentences (potentially infinite) S = {s_1, s_2, ...} into a finite set of classes Ω = {ω_1, ω_2, ..., ω_c}:

    D : S → Ω,  D(s) = ω_i    (1)

To enable a numeric, easier handling of the input text, an embedding ξ : S → R^n is often used, mapping the space of sentences S into a vector space R^n, and defining a classification function E : R^n → Ω such that D(s) = E(ξ(s)). In typical intent classifiers, E is usually composed of a function C which computes the probability of s being in a given class, followed by the arg max function. In many intent classifiers, C is the softmax function.

    S →(ξ) R^n →(C) R^c →(arg max) Ω    (2)

This paper explores how to use embeddings on the other side of the classification functions, that is, by embedding the set Ω of classes into another vector space R^m. The idea is to use class embedding functions which somehow better capture inter-class relations such as similarities, using, for instance, information from the training sets, as we will show later. Formally, we use a class embedding function ψ : Ω → R^m, its inverse ψ⁻¹, and a function M : R^n → R^m to map the two vector spaces, so that D(s) = ψ⁻¹(M(ξ(s))).

    S →(ξ) R^n →(M) R^m →(ψ⁻¹) Ω    (3)

In this paper we use typical sentence embedding methods to implement ξ. To approximately construct the function M we employ a basic Mean Square Error (MSE) method using the training set composed of sentence examples for each class ω_i ∈ Ω. As we will see next, the training set will also be used to construct the embedding function for the set of classes ψ and an approximation of its inverse ψ⁻¹.

2.2 Adapting the Kartsaklis Method (LSTM)

In this paper we explore a text classification method proposed for the inverse dictionary problem, where text definitions of terms are mapped to the terms they define, proposed by Kartsaklis et al. (2018).
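To make the contrast between equations (2) and (3) concrete, the following NumPy sketch decodes a mapped sentence embedding to its nearest class embedding instead of taking an arg max over softmax probabilities. All shapes, names, and matrices here are illustrative assumptions, not the paper's actual trained models:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_softmax(xi_s, W):
    # Eq. (2): sentence embedding -> class probabilities -> arg max.
    # xi_s: sentence embedding in R^n; W: (c, n) stand-in for the scorer C.
    return int(np.argmax(softmax(W @ xi_s)))

def classify_class_embedding(xi_s, M, class_names, class_emb):
    # Eq. (3): map the sentence embedding into the class-embedding space
    # R^m with M, then approximate psi^-1 by returning the class whose
    # embedding lies closest to the mapped point.
    # M: (m, n) learned mapping; class_emb: (c, m), one row per class.
    z = M @ xi_s
    distances = np.linalg.norm(class_emb - z, axis=1)
    return class_names[int(np.argmin(distances))]
```

In the paper, M is fitted by minimizing the mean square error between M(ξ(s)) and the graph embedding of the correct class; in this sketch it is simply a given matrix.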

The embedding of the class set into the continuous vector space (equivalent to the ψ function in equation 3) is done by expanding the knowledge graph of the dictionary words with nodes corresponding to words related to those terms, and by performing random walks on the graph to compute graph embeddings related to each dictionary node, using the DeepWalk algorithm (Perozzi et al., 2014). Notice that DeepWalk is a two-way function mapping nodes into vectors and back.

An LSTM, composed of two layers and an attention mechanism, is used by Kartsaklis et al. (2018) for mapping the input texts to the output vector space. To map the two continuous vector spaces representing the definition texts and the dictionary terms, an MSE function, learned from the training dataset, is used. This approach achieves SOTA results on the reverse dictionary task and also on other tasks such as document classification and text-to-entity mapping.

In this work, the approach from Kartsaklis et al. (2018) is employed for mapping the classes into a vector space, although we do not use a knowledge graph as described later. Instead, we create a word graph G by associating each class to a node and connecting to each of them the nodes which correspond to words in the sentences of the training set of each class. We represent this by the function ζ, such that ζ(Ω) = G, which is also invertible. Substituting this in equation 3,

    S →(LSTM) R^n →(MSE) R^m →(DeepWalk⁻¹) G →(ζ⁻¹) Ω    (4)

In practice, we compute the mapping from the class embedding space into the class set, called here InvG : R^m → Ω, simply by computing the distance d between a point in R^m and the inverted projection of each class from Ω, and considering the closest class. That is, for each ω_i ∈ Ω, we consider the associated node in G and compute its mapping in R^m, as shown here:

    InvG(x) = arg min_{ω_i ∈ Ω} d(x, DeepWalk(G(ω_i)))    (5)

By substituting this function into equation 4, we obtain the algorithm we call here LSTM+:

    S →(LSTM) R^n →(MSE) R^m →(InvG) Ω    (6)

For comparison, the traditional corresponding classification method is tested, where the word graph embedding and associated functions are replaced by discrete softmax outputs. We call this simply LSTM:

    S →(LSTM) R^n →(softmax) R^c →(arg max) Ω    (7)

2.3 Replacing the LSTM with BERT

The natural language processing community has recently been focusing attention on the novel transformer models (Vaswani et al., 2017). This is due to the great performance improvement in several complex tasks, such as machine translation, question answering, and text classification. Moreover, such a performance is achieved without the use of convolutions or recurrence in neural networks. By using only the attention mechanism, models are built with lower computational costs, enabling the rapid development of larger and stronger models, which have been achieving SOTA performance in many different tasks.

BERT is one such model (Devlin et al., 2019). It is a language representation model pre-trained on unlabeled text and conditioned on both the left and right contexts. Therefore, a simple output layer can be fine-tuned to attain strong results in many different tasks. BERT is employed in this paper with the word graph embedding layer (identical to the one in LSTM+). We call this algorithm BERT+:

    S →(BERT) R^n →(MSE) R^m →(InvG) Ω    (8)

Like in the previous case, we also use the BERT algorithm with traditional discrete softmax outputs for comparison, called here BERT:

    S →(BERT) R^n →(softmax) R^c →(arg max) Ω    (9)

2.4 Replacing the LSTM with TFIDF

The term frequency-inverse document frequency (TFIDF) indicates the importance of a word given its frequency in a document from a corpus (Han et al., 2011). With this statistic, it is possible to detect key words which play important roles in a given document, adjusting for the fact that several words frequently appear in the corpus. Such a technique has been used to generate features in many NLP tasks.

TFIDF is used in this work with an additional output-dense, feed-forward network layer in two different approaches. In the first one, it uses linear outputs for regression with the word graph representation, using the Kartsaklis et al. (2018)-inspired algorithm exactly as we did for the LSTM and

BERT algorithms. We call this algorithm TFIDF+. For comparison, the feed-forward layer is also configured with discrete softmax outputs, called TFIDF.

2.5 Replacing the LSTM with Average Embedding

Average word embeddings are also used in this work. In particular, GloVe (Pennington et al., 2014) is employed for embedding each word of the document. Subsequently, the average of the word embeddings is computed to generate the sentence features. Such an average is computed according to equation 10, where x_j is the average embedding for sentence j, given the embedding x_ij of each of its N words. Finally, the computed features are input to a regression or a classification dense feed-forward network, similarly to the previous approaches.

    x_j = (1/N) Σ_{i=1}^{N} x_ij    (10)

As before, we consider a version where we use average word embeddings in substitution of the LSTM in algorithm LSTM+, called EMB+, and also a discrete version using softmax, EMB.

2.6 Out-of-scope Sample Detection

A rejection mechanism based on a pre-defined threshold is used for OOS detection. This method can be easily applied to all of the methods described previously, without the need for any specific training procedure or for OOS training data.

In greater detail, suppose that for each class ω_i ∈ Ω there is a score denoted φ_i ∈ Z, where |Z| = |Ω|. Given that max(Z) represents the highest score associated to a class, and that a rejection threshold θ has been defined on a validation set, samples can be classified as OOS whenever max(Z) < θ, and they are simply rejected, i.e. no classification output is produced for them. Otherwise, the sample is considered as an in-scope (IS) sample and the classification is conducted normally.

In this work, the scores in Z are represented either by the softmax probability computed for LSTM, BERT, TFIDF, and EMB, or by the similarity of sentence and graph embeddings for LSTM+, BERT+, TFIDF+, and EMB+. For the latter, the similarity is computed by means of the dot product between those two embeddings.

3 Experimental Evaluation

We performed a comparative evaluation of the performance of the classifiers described in the previous section using a public dataset described in (Larson et al., 2019), called here the Larson dataset²; a real dataset from a finance chatbot; and a pool of 40 datasets in two different languages from chatbots built using the same platform, to check for the reproducibility of the results from the finance chatbot dataset.

For those experiments, the methods were implemented as follows. For DeepWalk, the embedding size was set to 150, and the walk sizes to 20, for undirected graphs. For LSTM and LSTM+, we considered word embeddings with 200 elements, output sentence embeddings of size 150, and both methods were trained for 50 epochs. For both mapping sentences to graph embeddings and the softmax classifiers, we trained two-layer neural networks, with 800 hidden neurons for 1,000 epochs on the Larson dataset, and 300 hidden neurons for 20 epochs on the other datasets. Those parameters were set after preliminary evaluations.

3.1 Evaluation Metrics

We take into account a commonly-used metric for OOS detection, i.e. the equal error rate (EER) (Lane et al., 2007; Ryu et al., 2017, 2018; Tan et al., 2019), which corresponds to the classification error rate when the threshold θ is set to a value where the false acceptance rate (FAR) and the false rejection rate (FRR) are the closest. These two metrics are defined as:

    FAR = (Number of accepted OOS samples) / (Total of OOS samples)    (11)

    FRR = (Number of rejected IS samples) / (Total of IS samples)    (12)

In addition, the in-scope error rate (ISER) is considered to report IS performance, i.e. the error rate considering only IS samples, as the class error rate in (Tan et al., 2019). This metric is important to evaluate whether the alternative classification methods are able to keep up with the performance of their counterparts in the classification task.

3.2 Results on the Larson Dataset

In this section we present an evaluation on the Larson dataset (Larson et al., 2019), a recently proposed dataset which has been specifically designed to cope with intent classification and, most importantly, with the rejection of OOS samples, which are referred to in the paper as out-of-scope queries. There is a total of 22,500 in-scope samples,

²https://github.com/clinc/oos-eval
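The rejection rule of Section 2.6 and the metrics of equations (11) and (12) can be sketched as follows. The EER search below sweeps the observed top scores as candidate thresholds, which is an illustrative approximation, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def is_rejected(scores, theta):
    # Section 2.6: reject a sample as OOS when its top score max(Z) < theta.
    return max(scores) < theta

def far_frr(oos_scores, is_scores, theta):
    # Eq. (11): FAR = accepted OOS / total OOS.
    # Eq. (12): FRR = rejected IS / total IS.
    # oos_scores / is_scores hold max(Z) for each OOS / IS test sample.
    far = float(np.mean(np.asarray(oos_scores) >= theta))
    frr = float(np.mean(np.asarray(is_scores) < theta))
    return far, frr

def equal_error_rate(oos_scores, is_scores):
    # Pick the threshold where FAR and FRR are the closest and report
    # the corresponding error rate.
    thresholds = sorted(set(oos_scores) | set(is_scores))
    theta = min(thresholds,
                key=lambda t: abs(np.subtract(*far_frr(oos_scores, is_scores, t))))
    far, frr = far_frr(oos_scores, is_scores, theta)
    return (far + frr) / 2.0, theta
```

For the softmax-based methods the scores are class probabilities; for the word graph-based methods they are dot-product similarities between sentence and graph embeddings, so the same sweep applies unchanged.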

Method   EER   FAR   FRR   ISER
LSTM     13.4  23.7  16.4  12.2
BERT     12.8  35.3  10.1   6.7
TFIDF    13.7  17.7  17.3  11.7
EMB      18.0  22.8  22.8  18.1
LSTM+    11.7  20.8  11.9   8.4
BERT+     9.8   9.9  12.4   7.4
TFIDF+   19.2  29.5  26.4  24.2
EMB+     31.7  29.1  34.3  58.2

Table 1: Results on the Larson dataset (in %, the lower the better), for both out-of-scope and in-scope samples: equal error rate (EER), false acceptance rate (FAR), and false rejection rate (FRR); and for in-scope samples only: class error rate (ISER).

Figure 1: ROC curves on the Larson dataset.

evenly distributed across 150 classes, and 1,200 out-of-scope samples. From that, the in-scope samples are divided into 18,000 samples for training and 4,500 samples for test. From the OOS samples, we take only the same 1,000 examples used in (Larson et al., 2019) for test, for a direct comparison.

Table 1 presents a summary of the results on this dataset. We observe that the proposed word graph-based methods making use of LSTM and BERT sentence embeddings are able to outperform their corresponding softmax versions, where BERT+ achieves the lowest EER with 9.8% and a FAR value of 9.9%, beating SOTA results by at least 30 percentage points. In fact, Larson et al. (2019) report the best OOS recall as 66.0%, which is equivalent to a FAR of 34%. However, in the setting where no OOS sample is used for training, the reported FAR value is 41.1%, and our approach achieved 31.2 percentage points below that value.

BERT presents the best ISER, meaning that it is the best type of method for classifying in-scope samples. However, the method does not cope well with out-of-scope examples, and results in the highest values for FAR, which negatively affects the final error rate. Overall, as depicted in Figure 1, the graph-based methods tend to produce systems with larger ROC area-under-the-curve values, i.e. better systems overall.

We note also that TFIDF+ and EMB+ had poorer performance than TFIDF and EMB, respectively, which we believe owes mainly to the considerably higher ISER presented by the former. We have evaluated several configurations for those methods, but we have not been able to achieve lower values for ISER, which may indicate that it is more difficult to make use of such types of sentence embeddings in the proposed framework.

3.3 Results on the Finance Dataset

This dataset uses data extracted from a real chatbot of a large financial institution from Brazil, called here the Finance dataset. The chatbot has content related to products and concepts associated both with the institution and with finance in general. This set contains a total of 8,823 examples in Brazilian Portuguese, split into 6,176 for training and 2,647 for test, distributed over a total of 285 classes.

Besides the different language, this dataset allows us to complement the evaluation of the previous section with an unbalanced dataset. The number of samples per class is non-uniform: most of the classes (87%) contain fewer than 47 samples, but there is one class with a very large number of examples (1,189), and some classes have as few as 2 samples.

In addition, this dataset has not been conceived to deal with OOS samples. For this reason, we had to create a simulation of such a scenario by removing 85 randomly-selected classes and their corresponding samples from the training set, and then considering all test samples associated to the removed classes as OOS samples. We repeated that procedure five times to come up with five different samplings of OOS classes for a better statistical analysis. The resulting training set sizes vary from 3,383 to 4,796 samples, and the corresponding test sets contain about 35% of OOS samples on average. Results hereafter present an average over the results of the five samplings.

The results are presented in Table 2. LSTM+, with the proposed use of word graphs, achieved the best results in the four metrics. It is interesting not only that EER improves compared with LSTM, from 25.7% to 19.2%, but also that ISER

Method   EER        FAR        FRR        ISER
LSTM     25.7 ±4.6  32.2 ±2.4  32.0 ±7.8  17.5 ±2.7
BERT     24.6 ±3.3  29.2 ±5.3  33.4 ±3.3  17.9 ±1.9
TFIDF    22.0 ±1.1  26.9 ±4.4  26.5 ±3.5  43.8 ±8.0
AVG      25.8 ±2.7  30.5 ±4.0  33.5 ±5.5  48.5 ±8.2
LSTM+    19.2 ±1.6  20.6 ±5.7  22.2 ±2.8   9.5 ±1.3
BERT+    22.2 ±1.6  23.0 ±2.5  28.2 ±2.3  17.2 ±1.7
TFIDF+   24.8 ±0.8  28.5 ±2.3  35.8 ±1.7  53.3 ±7.1
AVG+     32.3 ±2.4  45.6 ±2.8  33.7 ±5.0  65.0 ±7.0

Table 2: Results on the Finance dataset (in %, the lower the better), for both out-of-scope and in-scope samples: equal error rate (EER), false acceptance rate (FAR), and false rejection rate (FRR); and for in-scope samples only: class error rate (ISER).

Dataset  #Samples  #In-scope classes  Median samples per in-scope class
1        13600     966                9.0
2        13064     75                 107.0
3        12733     76                 105.5
4        12948     206                38.0
5        12916     205                38.0
6        11905     196                38.0
7        12316     75                 105.0
8        7252      63                 75.0
9        8596      64                 96.0
10       8389      91                 61.0
11       12727     76                 105.5
12       8657      63                 100.0
13       11293     137                47.0
14       11120     137                45.0
15       12324     75                 105.0
16       8042      307                16.0
17       7851      302                16.0
18       22520     20                 502.0
19       18751     91                 108.0
20       12722     76                 105.5

Table 3: Characteristics of the English chatbot datasets.
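The out-of-scope simulation protocol used above for the Finance dataset (Section 3.3), removing randomly selected classes and relabeling their test samples as OOS, can be sketched as follows. The helper and its names are illustrative, not the authors' code:

```python
import random

def make_oos_split(train, test, n_remove, seed=0):
    """Simulate OOS data from in-scope-only data: drop n_remove randomly
    chosen classes from the training set and relabel their test samples
    as out-of-scope.

    train, test: lists of (sentence, label) pairs.
    """
    rng = random.Random(seed)
    classes = sorted({label for _, label in train})
    removed = set(rng.sample(classes, n_remove))
    new_train = [(s, y) for s, y in train if y not in removed]
    new_test = [(s, "OOS" if y in removed else y) for s, y in test]
    return new_train, new_test
```

Running this with several seeds and averaging the metrics over the runs would give per-sampling means and standard deviations of the kind reported in Table 2.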

Dataset  #Samples  #In-scope classes  Median samples per in-scope class
21       24377     33                 252.0
22       14416     31                 272.0
23       15899     271                38.0
24       22330     384                15.0
25       22426     468                13.0
26       23215     530                13.0
27       14417     31                 272.0
28       22426     468                13.0
29       23215     530                13.0
30       14280     169                60.0
31       18755     351                42.0
32       16578     393                13.0
33       19812     397                15.0
34       20884     390                15.0
35       18428     336                42.0
36       18728     425                13.0
37       16806     390                13.0
38       14838     6                  1773.5
39       17046     7                  1732.0
40       19110     378                39.0

Table 4: Characteristics of the Brazilian Portuguese chatbot datasets.

Figure 2: ROC curves on the Finance dataset.

also improves significantly, by 6 percentage points. In the case of the BERT-based algorithms, the difference in ISER is much smaller, with an improvement of only 0.7 percentage points. In general, the word graph-based BERT+ results in a better system than the softmax counterpart, i.e. BERT, where the former achieves an EER of 22.2% while the latter achieves 24.6%. The better performance of LSTM+ and BERT+ against LSTM and BERT, respectively, is confirmed by the ROC curves in Figure 2.

Similar to the results on Larson, TFIDF+ and AVG+ presented higher EER than TFIDF and AVG, respectively. The dramatic increases in ISER show that those word-graph implementations work poorly as classifiers for in-scope samples, and we believe that this directly affects the performance on the other metrics. In our opinion, such results indicate that one requirement to benefit from using word graphs to enhance class representations is to make use of sentence embeddings which produce an intent classifier whose ISER is at least comparable to that of softmax-based classifiers. Otherwise, the benefits of the proposed approach are negatively affected by the error which seems to be introduced by the in-scope cases.

3.4 Results on a Pool of Chatbots

Considering the diversity of ways in which intents are defined in professional chatbots, we scaled up our evaluation to multiple chatbot datasets obtained from a dialogue engine platform provider. Those datasets were made available by their developers to be used in improving the performance of the engine, but no personal or private information was accessed by us.

In total, 40 datasets, 20 in English (EN) and 20 in Brazilian Portuguese (PT-BR), were used for this experiment. The number of samples per dataset varies from 7,851 to 40,474, while the number of in-scope classes ranges from 6 to 966. For all datasets, the ratio of OOS samples is defined to be close to 20%, resulting in a median number of samples per in-scope class ranging from 9 to 502. In the experiment, we randomly assigned the classes into five different training and test datasets for a better statistical overview. Detailed numbers are provided in Table 3 and Table 4, where the diversity in terms of the number of classes and median samples per class among the different chatbots can be noticed.

Figure 3: EER and FRR for LSTM and LSTM+ on the Chatbots' datasets.

Figure 3 and Figure 4 present plots of the results comparing LSTM+ vs LSTM and BERT+ vs BERT, respectively, considering two metrics, i.e. EER and FRR, to provide an idea of the overall performance of the classifiers and of the number of examples which are wrongly not rejected. We observe that LSTM+ generally produces lower EER and FRR, and the statistical significance has been confirmed with the non-parametric paired Wilcoxon signed-rank test (Corder and Foreman, 2009). The mean EER and FRR values presented by LSTM+ were 24.0% and 26.6%, respectively, and those presented by LSTM were 26.7% and 33.8%. For BERT+ and BERT, the results show that BERT+ generally produces statistically-significant lower EER and FRR than BERT on the EN datasets, with mean values of 25.5% and 30.2%, respectively, against 26.5% and 34.5%. For PT-BR, though, no statistical difference has been found in EER, with a mean EER of 29.3% for BERT+ and 28.6% for BERT. But BERT+ achieves statistically-significant lower FRR than BERT, with a mean of 30.1% for the former versus 38.8% for the latter.

Even though BERT+ has not significantly outperformed BERT in some scenarios, such as with the PT-BR chatbots, we can observe the great improvement that the word graphs can bring to intent recognition if we take into account the ISER metric. That is, the difference in ISER of BERT+ against BERT is of 5 percentage points, where the former achieved 37% and the latter 32%. In other words, BERT+ can be considered quite worse than BERT for in-scope-only intent classification. But, although BERT+ has not been significantly better than BERT with PT-BR chatbots, the proposed word graph-based approach had a great impact in overcoming that 5-point difference, since both present similar EER values, and still had a huge impact on the FRR rates, since BERT+ presented significantly better values. Thus, it is likely that, by improving the mapping of sentence and graph embeddings for those datasets, and consequently reducing that 5-point gap in ISER, BERT+ will stand out as a significantly better approach than BERT.

4 Related work

Classification methods, such as those used for intent classification, have been broadly applied in several areas, with the goal of predicting, for an input sample, with which of the classes of the problem that sample is associated. In the case of single-label classification, the training process consists of approximating a probability function, for instance a softmax function for neural networks, by using as reference a one-hot-encoding representation of the class labels (Bishop, 2006).

OOS sample detection is a problem which may be critical for intent recognition in chatbots, so applying rejection mechanisms is important for detecting those cases (Feng and Lin, 2019; Larson et al., 2019; Zheng et al., 2019). Traditional classification can be implemented, for example, by training a specific OOS class to set up a rejection threshold, or even by training a binary classifier (Larson et al., 2019). Given that no specific domain information or structure is taken into account, those methods are roughly the same as those that have been previously applied to other classification problems (Fumera et al., 2003; Luckner and Homenda, 2014).

Some recent effort has been put specifically for

Figure 4: EER and FRR results for BERT and BERT+ on the Chatbots’ datasets.

OOS sample detection for intent recognition, either by considering OOS data during the training process (Tan et al., 2019) or solely by improving the in-scope sample representation by means of Auto-Encoders (Ryu et al., 2017) and Generative Adversarial Networks (GANs) (Ryu et al., 2018), which is more desirable since there is no reliance on tedious data gathering processes to represent unpredictable OOS inputs. The latter two methods are directly related to ours but, unfortunately, the lack of publicly-available source code and datasets has made it a challenge to reproduce the methods for a fair direct comparison with ours.

Recently, methods which are able to take advantage of graph information in machine learning models have been proposed. Some of them take advantage at the sample level, such as label propagation (Bui et al., 2018). Others, though, take advantage of graphs at the concept level, such as in (Hill et al., 2016; Kartsaklis et al., 2018; Prokhorov et al., 2019). Hill et al. (2016) demonstrate that sentence embeddings can be mapped onto graph embeddings, in reverse dictionary-like problems. Following that, Kartsaklis et al. (2018) demonstrated that textual features can improve such a mapping considerably. Those findings have opened an opportunity to enhance class modeling and hopefully better define the scope of a classifier, in particular of intent classifiers, since classes can be easily represented in a graph space by means of their relationship with individual words extracted from the training samples, as we did in this paper.

The previously-mentioned research has been put into practice mostly through advances in sentence embedding (Collobert et al., 2011; Pagliardini et al., 2018) and graph embedding techniques (Cai et al., 2018). Some of the latter are directly inspired by advances in word embeddings and convolutional neural networks, such as DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover and Leskovec, 2016).

5 Final Remarks

In this paper we propose the use of information from word graphs to enhance intent classification, more specifically, the detection of out-of-scope examples. Instead of working on the representation of the input text, we enhance the representation of the outputs, i.e. how classes and their corresponding labels are represented. The results demonstrate that the approach has a considerable positive impact on the detection of out-of-scope examples when an appropriate sentence embedding, such as LSTM or BERT, is used. On the publicly-available Larson dataset, the proposed approach beats the previously-published results by a high margin, particularly improving the false acceptance rate (FAR) from 41.1% to 9.9%.

In our view, the improved results are due to a better representation of the higher-level concepts associated to the classes. By connecting the intents to lower-level entities, i.e. the words associated to the intents, and therefore establishing inter-connections between the classes, the word graph space enriches the traditional representation of classes by means of classifier parameters which are learned solely from input examples.

We believe that the approach is general enough to be applied to other areas and presents ideas to develop more accurate classifiers in general, particularly in contexts where out-of-scope samples are common. In image classification problems, for instance, word graphs related to visual words could be computed. In addition, the proposed word graph method can be improved by exploiting combinations of the proposed expanded class representation with the traditional softmax-based method, which may also provide better accuracy for in-scope samples in some situations.

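At inference time, the rejection behavior discussed above can be reduced to a similarity threshold in the shared embedding space. The sketch below is a simplified stand-in (toy prototype vectors and an arbitrary threshold, not our trained models): it predicts the class whose graph-embedding prototype is closest in cosine similarity and rejects the input as out-of-scope when even the best similarity is too low.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def classify_with_rejection(text_vec, prototypes, threshold=0.5):
    """Predict the class whose prototype is most similar to the sentence
    embedding; reject as out-of-scope (OOS) when the best similarity falls
    below the threshold. The threshold would be tuned on a development set
    to trade false acceptance against false rejection."""
    scores = {label: cosine(text_vec, v) for label, v in prototypes.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= threshold else "OOS"
    return label, scores[best]

# Toy 3-d prototypes standing in for the graph embeddings of two intents.
prototypes = {
    "balance":  np.array([1.0, 0.1, 0.0]),
    "transfer": np.array([0.0, 1.0, 0.1]),
}

print(classify_with_rejection(np.array([0.9, 0.2, 0.0]), prototypes))  # near "balance": accepted
print(classify_with_rejection(np.array([0.0, 0.0, 1.0]), prototypes))  # far from both: rejected as OOS
```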
References

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

Thang D. Bui, Sujith Ravi, and Vivek Ramavajjala. 2018. Neural graph learning: Training neural networks using graphs. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM).

Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637.

Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, Online. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Gregory W. Corder and Dale I. Foreman. 2009. Nonparametric Statistics for Non-Statisticians. John Wiley & Sons, Inc., USA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yueqi Feng and Jiali Lin. 2019. Enhancing out-of-domain utterance detection with data augmentation based on word embeddings. arXiv preprint: 1911.10439.

Giorgio Fumera, Ignazio Pillai, and F. Roli. 2003. Classification with reject option in text categorisation systems. Pages 582–587.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.

Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Dimitri Kartsaklis, Mohammad Taher Pilehvar, and Nigel Collier. 2018. Mapping text to knowledge graph entities using multi-sense LSTMs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1959–1970, Brussels, Belgium. Association for Computational Linguistics.

I. Lane, T. Kawahara, T. Matsui, and S. Nakamura. 2007. Out-of-domain utterance detection using classification confidences of multiple topics. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):150–161.

Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. 2019. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1311–1316, Hong Kong, China. Association for Computational Linguistics.

Marcin Luckner and Władysław Homenda. 2014. Pattern recognition with rejection: Application to handwritten digits. In 2014 4th World Congress on Information and Communication Technologies (WICT 2014).

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540, New Orleans, Louisiana. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710.

Victor Prokhorov, Mohammad Taher Pilehvar, and Nigel Collier. 2019. Generating knowledge graph paths from textual definitions using sequence-to-sequence models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1968–1976, Minneapolis, Minnesota. Association for Computational Linguistics.

Seonghan Ryu, Seokhwan Kim, Junhwi Choi, Hwanjo Yu, and Gary Geunbae Lee. 2017. Neural sentence embedding using only in-domain sentences for out-of-domain sentence detection in dialog systems. Pattern Recognition Letters, 88(C):26–32.

Seonghan Ryu, Sangjun Koo, Hwanjo Yu, and Gary Geunbae Lee. 2018. Out-of-domain detection based on generative adversarial network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 714–718, Brussels, Belgium. Association for Computational Linguistics.

Ming Tan, Yang Yu, Haoyu Wang, Dakuo Wang, Saloni Potdar, Shiyu Chang, and Mo Yu. 2019. Out-of-domain detection for low-resource text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3566–3572, Hong Kong, China. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Sholom M. Weiss, Nitin Indurkhya, and Tong Zhang. 2012. Fundamentals of Predictive Text Mining. Springer Publishing Company, Incorporated.

Yinhe Zheng, Guanyi Chen, and Minlie Huang. 2019. Out-of-domain detection for natural language understanding in dialog systems. arXiv preprint: 1909.03862.