Out-of-Domain Detection for Low-Resource Text Classification Tasks

Ming Tan ∗† Yang Yu ∗† Haoyu Wang ∗† Dakuo Wang ‡ Saloni Potdar † Shiyu Chang ‡ Mo Yu ‡ † IBM Watson ‡ IBM Research

Abstract Intent Label Example Help List List what you can help me Out-of-domain (OOD) detection for low- with. resource text classification is a realistic but Watson, I need your help understudied task. The goal is to detect the Schedule Appointment Can you book a cleaning with my dentist for me? OOD cases with limited in-domain (ID) train- Can you schedule my den- ing data, since we observe that training data tist’s appointment? is often insufficient in appli- End Meeting You can end the meeting cations. In this work, we propose an OOD- now resistant Prototypical Network to tackle this Meeting is over ··· ··· zero-shot OOD detection and few-shot ID classification task. Evaluation on real-world OOD utterances My birthday is coming! datasets show that the proposed solution out- blah blah... performs state-of-the-art methods in zero-shot OOD detection task, while maintaining a com- Table 1: A few-shot ID training set for a conversa- petitive performance on ID classification task. tion service for teleconference management, with OOD testing examples. 1 Introduction

Text classification tasks in real-world applications son Assistant1. For example, Table1 shows some often consists of 2 components- In-Doman (ID) of the utterances a chat-bot builder provided for classification and Out-of-Domain (OOD) detec- training. Each class may only have less than 20 tion components (Liao et al., 2018; Kim and Kim, training utterances, due to the high cost of man- 2018; Shu et al., 2017; Shamekhi et al., 2018). ual labelling by domain experts. Meanwhile, the ID classification refers to classifying a user’s in- user also expects the service to effectively reject put with a label that exists in the training data, and irrelevant queries (as shown at the bottom of Table OOD detection refers to designate a special OOD 1). The challenge of OOD detection is reflected to the input when it does not belong to any by the undefined in-domain boundary. Although of the labels in the ID training dataset (Dai et al., one can provide a certain amount of OOD sam- 2007). Recent state-of-the-art deep learning (DL) ples to build a binary classifier for OOD detec- approaches for OOD detection and ID classifica- tion, such samples may not efficiently reflect the tion task often require massive amounts of ID or infinite OOD space. Recent approaches, such as OOD labeled data (Kim and Kim, 2018). In re- (Shu et al., 2017), make remarkable progress on ality, many applications have very limited ID la- OOD detection with only ID examples. However, beled data (i.e., few-shot learning) and no OOD such condition on ID data cannot be satisfied by labeled data (i.e., zero-shot learning). Thus, ex- the few-shot scenario presented in Table1. isting methods for OOD detection do not perform This work aims to build a model that can de- well in this setting. tect OOD inputs with limited ID data and zero One such application is the intent classification OOD training data, while classifying ID inputs for conversational AI services, such as IBM Wat- with a high accuracy. Learning similarities with ∗ Equal contributions from the corresponding authors: {mingtan,yu,wanghaoy}@us.ibm.com. 1https://www.ibm.com/cloud/watson-assistant/

3566 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3566–3572, Hong Kong, China, November 3–7, 2019. c 2019 Association for Computational Linguistics the meta-learning strategy (Vinyals et al., 2016) Task !" Supporting Sets Prototypical Vectors Label-1 9"$ /,:*/; ℒ"$, ℒ%& example-1 has been proposed to deal with the problem of example-2 example-3 limited training examples for each label (few-shot ℒ''( Label-2 9"$ /,:*/< Prototypical Net- example-4 learning). In this line of work, example-5 works (Snell et al., 2017), which was originally 9"$ Label-3 /,:*/= introduced for few-shot image classification, has example-6 4$5'(*1 4(7) proven to be promising for few-shot ID text clas- 4$5'(*1 4 7 "$ '3& Random selection I) *+,-./* + OOD example +2 ,>% − .''/ " sification (Yu et al., 2018). However the usage of 01'- !" 01'- !2(not shown here) prototypical network for OOD detection is unex- Figure 1: Model overview: the model maximizes likeli- plored in this regard. hood of the ground-truth of ID example, minimizes dis- To the best of our knowledge, this work is the tance between ID example and ground-truth, and max- first one to adopt a meta-learning strategy to train imizes the distance of OOD example and all ID labels. an OOD-Resistant Prototypical Network for si- studies suggests that few-shot learning is promis- multaneously detecting OOD examples and classi- ing in the text domain, including text classifica- fying ID examples. The contributions of this work tion (Yu et al., 2018; Jiang et al., 2018), relation are two-fold: 1) Unified solution using a proto- extraction (Han et al., 2018), link prediction in typical network model which can detect OOD in- knowledge bases (Xiong et al., 2018) and fine- stances and classify ID instances in a real-world grained entity typing (Xiong et al., 2019), and we low-resource scenario. 2) Experiments and analy- put it to test with the OOD detection task. sis on two datasets prove that the proposed model outperforms previous work on the OOD detection 3 Approach task, while maintaining a state-of-the-art ID clas- sification performance. In this paper, we target solving the zero-shot OOD detection problem for a few-shot meta-test dataset 2 Related Work D = (Dtrain,Dtest) by training a transferable Out-of-Domain Detection Existing methods prototypical network model from large-scale in- often formulate the OOD task as a one-class clas- dependent source datasets T = {T1,T2, ..., TN } sification problem, then use appropriate methods for dynamic construction of the meta-train set. to solve it (e.g., one-class SVM (Scholkopf¨ et al., Each task Ti contains labeled training examples 2001) and one-class DL-based classifiers (Ruff (note that a test set is not required in meta- et al., 2018; Manevitz and Yousef, 2007). A group train). D is different from the traditional super- of researchers also proposed an auto-encoder- vised close-domain classification dataset from two folds: 1) Dtest contains OOD testing examples, based approach and its variation to tackle OOD train tasks (Ryu et al., 2017, 2018). Recently, a few pa- whereas D only includes labeled examples for the target domain. 2) The training size for each pers have investigated ID classification and OOD train detection simultaneously (Kim and Kim, 2018; label in D is limited (e.g. less than 100 exam- Shu et al., 2017), but they fail in a low resource ples). Such limitations prevent existing methods setting. from efficiently training a model for either ID clas- sification or OOD detection using Dtrain only. Few-Shot Learning While few-shot learning We propose an OOD-resistant prototypical net- approaches may help with this low-resource set- work for both OOD detection and few-shot ID ting, some recent work is promising in this regard. classification. We follow (Snell et al., 2017) in For example, (Vinyals et al., 2016; Bertinetto few-shot image classification by training a proto- et al., 2016; Snell et al., 2017) use metric learn- typical network on T and directly perform pre- ing by learning a good similarity metric between diction on D without additional training. But our input examples; some other methods adapt a method is different from the prior work in that dur- meta-learning framework, and train the model ing the meta-training, while we maximize the like- to quickly adapt to new tasks with gradients on lihood of the true label for an example in Ti, we small samples, e.g., learning the optimization step also sample an example from another meta-train sizes (Ravi and Larochelle) or model initializa- task Tj for the purpose of OOD training by max- tion (Finn et al., 2017). Though most of these ap- imizing the distance between the OOD instance proaches are explored for computer vision, recent and the prototypical vector of each ID label.

3567 3.1 General Framework 1)2 between the E(·)-encoded representations of As in Fig.1, on a large-scale source dataset T with x and the prototypical vector of a label. Our ex- the following steps: periments show this meta-learning approach is ef- 1. Sample a training task Ti from T (e.g., the ficient for ID classification, but is not good enough Book category of Amazon Review in Section for detecting the OOD examples. 4), and another task Tj from T − Ti (e.g. the We propose two more training losses in addi- Apps-for-Android category). tion to the Lin for OOD detection. The rationale in behind this addition is to adopt the examples from 2. Sample an ID training example xi from Ti, out other tasks as simulated OOD examples for the and a simulated OOD example xj from Tj. current meta-train tasks. Specifically, we first de- 3. Sample N labels (N=4) from Ti in addition out to the label of xin. For the ground-truth la- fine a hinge loss on xj and the closest ID sup- i in bel and N negative labels, we select K train- porting set in S , then we push the examples from ing examples for each label (K-shot learning, another task away from the prototypical vectors of we set K=20). If a label has less than K ex- ID supporting sets. amples, we replicate the selected example to out in Lood = max[0, max(F (xj ,Sl ) − M1)] (2) satisfy K. Therefore, (N + 1) × K examples l serve as a supporting set Sin = {Sin}N . l l=1 We expect optimizing only on Lin and Lood will 4. Given a batch of dynamically-constructed lead to lower confidences on ID classification, be- in out in meta-train set (xi , xj , S ), E(·) encodes cause the system tends to mistakenly reduce the in out in xi , xj and the examples in Sl using a scale of F in order to minimize the loss for OOD deep network (Any DL structure can be used examples. Therefore we add another loss to im- for the encoder, such as LSTM and CNN. prove the confidence of classifed ID labels. Here we use a one-layer CNN with a mean in in pooling. The detailed CNN hyper-parameters Lgt = max[0, M2 − F (xi ,Sli ))] (3) are introduced in Section5). The model is optimized on the three losses. 5. Following (Snell et al., 2017), a Prototypical

Vector representation for each label is gen- L = Lin + βLood + γLgt (4) erated, by averaging all the examples’ repre- sentations of that label. where α, β, γ, M1 and M2 are hyper-parameters, 6. The model is optimized by an objective func- whose detailed values are shown in Section5. in out in tion, defined by xi , xj and S . Details in During inference, the supporting set per label Section 3.2. is generated by averaging the encoded representa- 7. Repeat these steps for multiple epochs (5k in tions of all instances of that label in Dtrain and the in this paper) to train the model, and select the prediction is based on F (x, Sl ). OOD detection best model based on an independent meta- is decided with a confidence threshold. valid set T valid. T valid contains tasks that are homogeneous to the meta-test task D. 4 Datasets The only trainable parameters of this model are in Our methods are evaluated on two datasets and the encoder E(·). Therefore the trained model can each has many tasks and is divided into meta-train, be easily transferred to the few-shot target domain. meta-valid and meta-test sets, which are respec- tively used for background model training, evalu- 3.2 Training Objective and Runtime ation and hyper-parameter selection. Prototypical networks (Snell et al., 2017) mini- Amazon Review 3: We follow (Yu et al., 2018) mize a cross-entropy loss defined on the distance to construct multiple tasks using the Amazon Re- in metrics between xi and the supporting sets, view dataset (He and McAuley, 2016) . We con- exp αF (xin,Sin) vert it into a binary classification task of labeling L = − log i li in P in in (1) the review sentiment (positive/negative). It has 21 l0 exp αF (xi ,Sl0 ) 2Following (Snell et al., 2017), we also tried squared Eu- where li is the ground-truth label of xi, α is a re- clidean distance, but did not achieve better results. scaling factor. Here we define F as a cosine sim- 3We will release Amazon data and our code at ilarity score (mapped to the range between 0 and https://github.com/SLAD-ml/few-shot-ood

3568 categories of products, each of which is treated LSTM, we set the input embedding dimension as as a task. We randomly picked 13 categories as 100 and hidden as 200. We use RMSprop as the meta-train, 4 as meta-test and 4 as meta-valid. (an- optimizer with a learning rate of 0.001. We train other 3 original categories are discarded due to the LSTM with a batch size of 32 and 100 epochs. not enough examples to make a dataset). We con- For Autoencoder, we set the hidden size as 20. We struct a 2-way 100-shot problem per meta-test task use Adam as the optimizer with a learning rate of by sampling 100 reviews per label in a category. 0.001. We train the model with a batch size of 32 For the test examples in meta-test and meta-valid, for 10 epochs. we sample other categories’ examples as OOD, For vanilla CNN, we use the most common merged with a equal number of ID instances. We CNN architecture used in NLP tasks, where the used all available data for meta-train. convolutional layer on top of embedding has 128 filters followed by a ReLU and max pooling Conversation Dataset: An intent classification layer before the final softmax. We use Adam as the dataset for a AI conversational system. It has 539 optimizer with a learning rate of 0.001. We train categories/tasks. We allocate 497 tasks as meta- the model with a batch size 64 for 100 epochs. train, and 42 tasks as meta-test. This dataset is Our proposed model O-Proto uses the similar different and more difficult than the typical ID CNN architecture, the optimizer and the learn- few-shot learning data: 1) Both the meta-test and ing rate in the previous Vanilla CNN. The in- meta-train tasks are not restricted to N-way K- put word embeddings are pre-trained by 1-billion- shot classification, and the source dataset is highly token Wikipedia corpus. We set the batch size as imbalanced across labels; 2) Each task has a vari- 10. In Eq.1,2,3 and4, α, β, γ, M1 and M2 are ety of labels (utterance intents), whereas Amazon hyper-parameters, which we fix β, γ as 1.0 by de- data always has two labels. There are 29% OOD fault, and set α, M1 and M2 as 10.0, 0.4 and 0.8 testing instances in meta-test, which are human- according to the meta-valid performance of Ama- labeled and not generated from other tasks. zon dataset. The sentence encoder, CNN, has 200 filters, followed by a tanh and mean pooling layer 5 Experimental Results before the final regression layer. The maximum Baselines: We compare our model O-Proto with length of tokens per example is 40, and any 4 baselines: 1) OSVM (Scholkopf¨ et al., 2001): out of this range will be discarded. During train- OSVM is trained on meta-test set, and learn a do- ing, we set the size of sampled negative labels Step main boundary by only examining ID examples. 3 (section 3.1) to at most four, so there will be 2) LSTM-AutoEncoder (Ryu et al., 2017): Re- maximum five labels involved in a training step (1 cent work on OOD detection that uses only ID ex- positive, 4 negative). The supporting set size for amples to train an autoencoder for OOD detection. each label are 20. 3) Vanilla CNN: A classifier with a typical CNN To make a fair comparison, we follow the same structure that uses a confidence threshold for OOD hyper-parameters as O-Proto in Proto. Network, detection. 4) Proto. Network (Snell et al., 2017): except that the weight of Lood and Lgt, β and γ, A native prototypical network trained on T with are set to zero. only the loss Lin, which uses a confidence thresh- Evaluation Metrics: Following (Ryu et al., 2017; old for OOD detection. We test the Proto. Net- Lane et al., 2007), we use a commonly used OOD work with both CNN and bidirectional LSTM as detection metric, equal error rate (EER), which is the encoder E(·). the error rate when the confidence threshold is lo- cated where false acceptance rate (FAR) is equiv- Hyper Parameters: We introduce the hyper- alent to false rejection rate (FRR). parameters of our model and all baselines below. We use Python scikit-learn One-Class SVM as Number of accepted OOD sentences F AR = the basis of our OSVM implementation. We use Number of OOD sentences Number of rejected ID sentences Radial Basis Function (RBF) as the kernel and the FRR = gamma parameter is set to auto. We use squared Number of ID sentences hinge loss and L2 regularization. We use class error rate CER to reflect ID perfor- We follow the same architecture as proposed in mance. Lastly, we applied the threshold used in (Ryu et al., 2017) for the LSTM-Autoencoder. In the EER to ID test examples to test how many ID

3569 Conversation Amazon Review EER

(%) EER CER Comb. EER CER Comb. 0.45 0.250

OSVM 63.6 - - 47.6 - - 0.225 0.40 LSTM AutoEnc. 48.0 78.4 79.5 45.4 29.3 38.6 0.200 Vanilla CNN 26.4 76.8 77.6 47.7 34.4 42.8 0.175 0.35 EER Proto. Network 26.9 32.5 44.5 46.5 7.3 47.6 CER 0.150 O-Proto (Lin + Lgt) 27.6 33.3 46.2 47.8 7.4 48.9 0.30 0.125 O-Proto (Lin + Lood) 24.5 30.1 41.2 24.7 9.7 30.1 O-Proto (all) 24.1 29.6 40.8 24.0 9.1 29.1 0.100 0.25 Proto. with bilstm 25.0 32.5 42.6 45.1 6.8 46.0 CER 0.075

O-Proto with bilstm 22.0 30.5 39.8 21.9 9.0 27.1 0.0 0.1 0.5 1.0 2.0 5.0 10.0 20.0 100.0

Table 2: O-Proto is compared with other baselines for Conversation Figure 2: Various β for Amazon data and Amazon data

0.31 EER EER (%) K=1 =5 =10 =20 =100 0.40 0.30 Proto. 53.9 46.4 44.5 41.0 46.5 0.29 0.38 O-Proto. 31.1 25.1 24.0 23.4 24.1

0.28 0.36 Table 3: Compare our model with Prototype Network EER 0.27 CER 0.34 on EER when choosing various K-shot values 0.26

0.32 0.25 O-Proto, respectively. O-Proto without Lood com- 0.24 0.30 CER pletely loses the ability of OOD detection. Com- 0.0 0.1 0.5 1.0 2.0 5.0 10.0 20.0 100.0 pared to the one without Lgt, O-Proto with all losses gives a mild improvement. We also observe Figure 3: Various β for Conversation data a more stable testing performance among epochs during training. Finally, we replace the CNN en- examples are rejected by the OOD detection, as coders with bidirectional LSTMs (the bottom of Combined-CER (Comb. in Table2). Table 1), which yields the same dimension of sen- Results: Table2 compares the proposed model tence representations as CNN. For Conversation and baselines on EER and CER on the two data, we achieve the best performance on valida- 4 datasets. For prototypical-network related mod- tion set when α and γ are 0.5. We observe compa- els, we randomly initialize the model parame- rable performances with respect to Proto.Network ters by 10 times and report the averaged metrics. and O-Proto, showing that our proposed OOD ap- OSVM, LSTM-AutoEncoder and Vanilla CNN per- proach is not limited to a specific sentence encod- form poorly on OOD detection as expected, as ing architecture. they require larger number of ID examples which is not a few-shot scenario. Proto.Network has a Improvement in different K-shot settings: On K better performance on ID classification. But it the Amazon data, we construct different -shot is not designed for OOD, thus it does not per- tasks as meta-test (results shown in Table3), and form well on OOD detection. O-Proto achieves observe consistent improvements on EER. significantly better EERs (2.8% improvement on Effect of β: Fig.2 and3 show EER and CER Conversation, and 22.5% on Amazon), yield- with different β values on the two datasets. We ob- ing competitive results on CER compared to serve within a proper range of β (between 0.5 and Proto.Network. These lead to a remarkable im- 2.0), the model can provide stable improvement on provement on Combined-CER. O-Proto improves EER, guaranteeing competitive CER results. less on EER in Conversation than Amazon, be- cause some Conversation tasks actually come 6 Conclusion from the conversational service providers belong- Inspired by the Prototypical Network, we propose ing to similar business domains. Our model is bet- a new method to tackle the OOD detection task ter even when meta-train datasets are from slightly in low-resource settings. Evaluation on real-world different domains. Moreover, in Table2 we show datasets demonstrates that our method performs the ablation study by removing Lood and Lgt from favorably against state-of-the-art algorithms on the 4No CER reported for OSVM as it treats ID as one class OOD detection, without adversely affecting per- and does not support ID classification. formance on the few-shot ID classification.

3570 References Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. International Confer- Luca Bertinetto, Joao˜ F. Henriques, Jack Valmadre, ence on Learning Representations, 2017. Philip H. S. Torr, and Andrea Vedaldi. 2016. Learn- ing feed-forward one-shot learners. In Proceedings Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lu- of the 30th International Conference on Neural In- cas Deecke, Shoaib Ahmed Siddiqui, Alexander formation Processing Systems, NIPS’16, pages 523– Binder, Emmanuel Muller,¨ and Marius Kloft. 2018. 531, USA. Curran Associates Inc. Deep one-class classification. In Proceedings of the 35th International Conference on Machine Learn- Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong ing, pages 4393–4402, Stockholmsmssan, Stock- Yu. 2007. Co-clustering based classification for out- holm Sweden. PMLR. of-domain documents. In Proceedings of the 13th ACM SIGKDD international conference on Knowl- Seonghan Ryu, Seokhwan Kim, Junhwi Choi, Hwanjo edge discovery and data mining, pages 210–219. Yu, and Gary Lee. 2017. Neural sentence em- bedding using only in-domain sentences for out-of- Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. domain sentence detection in dialog systems. Pat- Model-agnostic meta-learning for fast adaptation of tern Recognition Letters, 88:26–32. deep networks. In Proceedings of the 34th In- ternational Conference on Machine Learning, vol- Seonghan Ryu, Sangjun Koo, Hwanjo Yu, and ume 70 of Proceedings of Machine Learning Re- Gary Geunbae Lee. 2018. Out-of-domain detection search, pages 1126–1135, International Convention based on generative adversarial network. In Pro- Centre, Sydney, Australia. PMLR. ceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pages 714– Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, 718. Association for Computational Linguistics. Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classifica- Bernhard Scholkopf,¨ John C. Platt, John C. Shawe- tion dataset with state-of-the-art evaluation. In Pro- Taylor, Alex J. Smola, and Robert C. Williamson. ceedings of the 2018 Conference on Empirical Meth- 2001. Estimating the support of a high-dimensional ods in Natural Language Processing, pages 4803– distribution. Neural Comput., 13(7):1443–1471. 4809. Ameneh Shamekhi, Q Vera Liao, Dakuo Wang, Ruining He and Julian McAuley. 2016. Ups and Rachel KE Bellamy, and Thomas Erickson. 2018. downs: Modeling the visual evolution of fashion Face value? exploring the effects of embodiment trends with one-class collaborative filtering. In pro- for a group facilitation agent. In Proceedings of the ceedings of the 25th international conference on 2018 CHI Conference on Human Factors in Com- world wide web, pages 507–517. puting Systems, page 391. ACM. Xiang Jiang, Mohammad Havaei, Gabriel Chartrand, Lei Shu, Hu Xu, and Bing Liu. 2017. Doc: Deep Hassan Chouaib, Thomas Vincent, Andrew Jesson, open classification of text documents. In Proceed- Nicolas Chapados, and Stan Matwin. 2018. At- ings of the 2017 Conference on Empirical Methods tentive task-agnostic meta-learning for few-shot text in Natural Language Processing, pages 2911–2916, classification. The Second Workshop on Meta- Copenhagen, Denmark. Association for Computa- Learning at NeurIPS. tional Linguistics. Joo-Kyung Kim and Young-Bum Kim. 2018. Joint Jake Snell, Kevin Swersky, and Richard Zemel. 2017. learning of domain classification and out-of-domain Prototypical networks for few-shot learning. In detection with dynamic class weighting for satis- I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, ficing false acceptance rates. In Proc. Interspeech R. Fergus, S. Vishwanathan, and R. Garnett, editors, 2018, pages 556–560. Advances in Neural Information Processing Systems 30, pages 4077–4087. Curran Associates, Inc. I. Lane, T. Kawahara, T. Matsui, and S. Nakamura. 2007. Out-of-domain utterance detection using clas- Oriol Vinyals, Charles Blundell, Timothy Lillicrap, sification confidences of multiple topics. IEEE Koray Kavukcuoglu, and Daan Wierstra. 2016. Transactions on Audio, Speech, and Language Pro- Matching networks for one shot learning. In Pro- cessing, 15(1):150–161. ceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Q. Vera Liao, Mas-ud Hussain, Praveen Chandar, pages 3637–3645, USA. Curran Associates Inc. Matthew Davis, Yasaman Khazaeni, Marco Patri- cio Crasso, Dakuo Wang, Michael Muller, N. Sadat Wenhan Xiong, Jiawei Wu, Deren Lei, Mo Yu, Shiyu Shami, and Werner Geyer. 2018. All work and no Chang, Xiaoxiao Guo, and William Yang Wang. play? In In Proceedings of the 2018 CHI Confer- 2019. Imposing label-relational inductive bias for ence on Human Factors in Computing Systems. extremely fine-grained entity typing. In Proceed- ings of the 2019 Conference of the North American Larry Manevitz and Malik Yousef. 2007. One-class Chapter of the Association for Computational Lin- document classification via neural networks. Neu- guistics: Human Language Technologies, Volume 1 rocomput., 70(7-9):1466–1481. (Long and Short Papers), pages 773–784.

3571 Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Nat- ural Language Processing, pages 1980–1990. Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text clas- sification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Pa- pers), pages 1206–1215. Association for Computa- tional Linguistics.

3572