Out-of-Domain Detection for Low-Resource Text Classification Tasks

Ming Tan∗†, Yang Yu∗†, Haoyu Wang∗†, Dakuo Wang‡, Saloni Potdar†, Shiyu Chang‡, Mo Yu‡
† IBM Watson  ‡ IBM Research

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3566–3572, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics

Abstract

Out-of-domain (OOD) detection for low-resource text classification is a realistic but understudied task. The goal is to detect OOD cases with limited in-domain (ID) training data, since we observe that training data is often insufficient in machine learning applications. In this work, we propose an OOD-resistant Prototypical Network to tackle this zero-shot OOD detection and few-shot ID classification task. Evaluation on real-world datasets shows that the proposed solution outperforms state-of-the-art methods on the zero-shot OOD detection task, while maintaining competitive performance on the ID classification task.

Intent Label            Example
Help List               List what you can help me with.
                        Watson, I need your help
Schedule Appointment    Can you book a cleaning with my dentist for me?
                        Can you schedule my dentist's appointment?
End Meeting             You can end the meeting now
                        Meeting is over
···                     ···
OOD utterances          My birthday is coming!
                        blah blah...

Table 1: A few-shot ID training set for a conversation service for teleconference management, with OOD testing examples.

1 Introduction

Text classification tasks in real-world applications often consist of two components: in-domain (ID) classification and out-of-domain (OOD) detection (Liao et al., 2018; Kim and Kim, 2018; Shu et al., 2017; Shamekhi et al., 2018). ID classification refers to classifying a user's input with a label that exists in the training data, and OOD detection refers to assigning a special OOD tag to the input when it does not belong to any of the labels in the ID training dataset (Dai et al., 2007). Recent state-of-the-art deep learning (DL) approaches for the OOD detection and ID classification task often require massive amounts of ID or OOD labeled data (Kim and Kim, 2018). In reality, many applications have very limited ID labeled data (i.e., few-shot learning) and no OOD labeled data (i.e., zero-shot learning). Thus, existing methods for OOD detection do not perform well in this setting.

One such application is intent classification for conversational AI services, such as IBM Watson Assistant¹. For example, Table 1 shows some of the utterances a chat-bot builder provided for training. Each class may have fewer than 20 training utterances, due to the high cost of manual labelling by domain experts. Meanwhile, the user also expects the service to effectively reject irrelevant queries (as shown at the bottom of Table 1). The challenge of OOD detection lies in the undefined in-domain boundary. Although one can provide a certain amount of OOD samples to build a binary classifier for OOD detection, such samples may not adequately reflect the infinite OOD space. Recent approaches, such as (Shu et al., 2017), make remarkable progress on OOD detection with only ID examples. However, the amount of ID data these approaches rely on is not available in the few-shot scenario presented in Table 1.

This work aims to build a model that can detect OOD inputs with limited ID data and zero OOD training data, while classifying ID inputs with high accuracy. Learning similarities with the meta-learning strategy (Vinyals et al., 2016) has been proposed to deal with the problem of limited training examples for each label (few-shot learning). In this line of work, Prototypical Networks (Snell et al., 2017), which were originally introduced for few-shot image classification, have proven to be promising for few-shot ID text classification (Yu et al., 2018). However, the use of prototypical networks for OOD detection is unexplored.

To the best of our knowledge, this work is the first to adopt a meta-learning strategy to train an OOD-Resistant Prototypical Network for simultaneously detecting OOD examples and classifying ID examples. The contributions of this work are two-fold: 1) A unified solution using a prototypical network model which can detect OOD instances and classify ID instances in a real-world low-resource scenario. 2) Experiments and analysis on two datasets show that the proposed model outperforms previous work on the OOD detection task, while maintaining state-of-the-art ID classification performance.

∗ Equal contributions from the corresponding authors: fmingtan,yu,[email protected].
¹ https://www.ibm.com/cloud/watson-assistant/

2 Related Work

Out-of-Domain Detection  Existing methods often formulate the OOD task as a one-class classification problem, then use appropriate methods to solve it (e.g., one-class SVM (Schölkopf et al., 2001) and one-class DL-based classifiers (Ruff et al., 2018; Manevitz and Yousef, 2007)). A group of researchers also proposed an auto-encoder-based approach and its variation to tackle OOD tasks (Ryu et al., 2017, 2018). Recently, a few papers have investigated ID classification and OOD detection simultaneously (Kim and Kim, 2018; Shu et al., 2017), but they fail in a low-resource setting.

Few-Shot Learning  Few-shot learning approaches may help in this low-resource setting, and some recent work is promising in this regard. For example, (Vinyals et al., 2016; Bertinetto et al., 2016; Snell et al., 2017) use metric learning, learning a good similarity metric between input examples; other methods adopt a meta-learning framework and train the model to quickly adapt to new tasks with gradients on small samples, e.g., learning the optimization step sizes (Ravi and Larochelle) or the model initialization (Finn et al., 2017). Though most of these approaches are explored for computer vision, recent studies suggest that few-shot learning is promising in the text domain, including text classification (Yu et al., 2018; Jiang et al., 2018), relation extraction (Han et al., 2018), link prediction in knowledge bases (Xiong et al., 2018) and fine-grained entity typing (Xiong et al., 2019), and we put it to the test on the OOD detection task.

[Figure 1: Model overview: the model maximizes the likelihood of the ground-truth label of the ID example, minimizes the distance between the ID example and its ground-truth label, and maximizes the distance between the OOD example and all ID labels.]

3 Approach

In this paper, we target the zero-shot OOD detection problem for a few-shot meta-test dataset $D = (D^{train}, D^{test})$ by training a transferable prototypical network model from large-scale independent source datasets $T = \{T_1, T_2, \ldots, T_N\}$ for dynamic construction of the meta-train set. Each task $T_i$ contains labeled training examples (note that a test set is not required in meta-train). $D$ differs from a traditional supervised closed-domain classification dataset in two ways: 1) $D^{test}$ contains OOD testing examples, whereas $D^{train}$ only includes labeled examples for the target domain. 2) The training size for each label in $D^{train}$ is limited (e.g., fewer than 100 examples). Such limitations prevent existing methods from efficiently training a model for either ID classification or OOD detection using $D^{train}$ only.

We propose an OOD-resistant prototypical network for both OOD detection and few-shot ID classification. We follow (Snell et al., 2017) in few-shot image classification by training a prototypical network on $T$ and directly performing prediction on $D$ without additional training. Our method differs from the prior work in that, during meta-training, while we maximize the likelihood of the true label for an example in $T_i$, we also sample an example from another meta-train task $T_j$ for the purpose of OOD training, maximizing the distance between the OOD instance and the prototypical vector of each ID label.

3.1 General Framework

As in Fig. 1, the model is trained on a large-scale source dataset $T$ with the following steps:

1. Sample a training task $T_i$ from $T$ (e.g., the Book category of Amazon Review in Section 4), and another task $T_j$ from $T - T_i$ (e.g., the Apps-for-Android category).
2. Sample an ID training example $x_i^{in}$ from $T_i$, and a simulated OOD example $x_j^{out}$ from $T_j$.
3. Sample N labels (N=4) from $T_i$ in addition to the label of $x_i^{in}$. For the ground-truth label and the N negative labels, we select K training examples for each label (K-shot learning, we set K=20).

[...] 1)2 between the $E(\cdot)$-encoded representations of $x$ and the prototypical vector of a label. Our experiments show that this meta-learning approach is effective for ID classification, but is not good enough for detecting the OOD examples.

We propose two more training losses in addition to $\mathcal{L}_{in}$ for OOD detection. The rationale behind this addition is to adopt the examples from other tasks as simulated OOD examples for the current meta-train task. Specifically, we first define a hinge loss on $x_j^{out}$ and the closest ID supporting set in $S^{in}$, then we push the examples from another task away from the prototypical vectors of [...]
