arXiv:2009.13833v2 [cs.CL] 10 Oct 2020
HINT3: Raising the bar for Intent Detection in the Wild

Gaurav Arora, Chirag Jain, Manas Chaturvedi, Krupal Modi
Jio Haptik

Abstract

Intent Detection systems in the real world are exposed to complexities of imbalanced datasets containing varying perception of intent, unintended correlations and domain-specific aberrations. To facilitate benchmarking which can reflect near real-world scenarios, we introduce 3 new datasets created from live chatbots in diverse domains. Unlike most existing datasets that are crowdsourced, our datasets contain real user queries received by the chatbots and facilitate penalising unwanted correlations grasped during the training process. We evaluate 4 NLU platforms and a BERT based classifier and find that performance saturates at inadequate levels on test sets because all systems latch on to unintended patterns in training data.

1 Introduction

Over the last few years, task-oriented dialogue systems have gained increasing traction for applications like personal assistants, automated customer support agents, etc. This has led to the availability of several commercialised and/or open conversational bot building platforms. Most popular systems today involve intent detection as a vital part of their Natural Language Understanding (NLU) pipeline. Recent advances in transfer learning (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) have enabled systems that perform quite well on existing benchmarking datasets (Larson et al., 2019; Casanueva et al., 2020).

Definitions of intent often vary across users, tasks and domains. Perception of intent could range from a generic abstraction such as “Ordering a product” to extreme granularity such as “Enquiring for a discount on a specific product if ordered using a specific card”. Additionally, factors such as imbalanced data distribution in the training set, assumptions during training data generation, and the diverse background of domain experts involved in defining the classes make this task more challenging. During inference, these systems may be deployed to users with diverse cultural backgrounds who might frame their queries differently even when communicating in the same language. Furthermore, during inference, apart from correctly identifying in-scope queries, the system is expected to accurately reject out-of-scope (Larson et al., 2019) queries, adding on to the challenge.

Most existing datasets for intent detection are generated using crowdsourcing services. To accurately benchmark in real-world settings, we release 3 new single-domain datasets, each spanning multiple coarse and fine grain intents, with the test sets being drawn entirely from actual user queries on the live systems at scale instead of being crowdsourced. On these datasets, we find that the performance of existing systems saturates at unsatisfactory levels because they end up learning spurious patterns from the training dataset instead of generalising to the perceived meanings of intents.

We evaluate 4 NLU platforms - Dialogflow¹, LUIS², Rasa NLU³, Haptik⁴,⁵ and a BERT (Devlin et al., 2019) based classifier on all 3 datasets and highlight gaps in language understanding. We further probe into queries where all the current systems fail and question the efficacy of the current approach of learning. Additionally, we repeat all our experiments on the subset of training data and show a performance drop in all the systems despite retaining relevant and sufficient utterances in the training subset. We’ve made our datasets and code freely accessible on GitHub to promote transparency and reproducibility⁶.

¹ https://cloud.google.com/dialogflow
² https://www.luis.ai/
³ https://github.com/RasaHQ/rasa/
⁴ https://haptik.ai
⁵ Access requests for signup on Haptik are processed via contact form at https://haptik.ai/contact-us/
⁶ https://github.com/hellohaptik/HINT3

Example Intents Example Queries Dataset Train Test Type Label In-scope Out-of-Scope What are the If I order now do Generic OFFERS available offers Any other offers SOF i get 20% discount Give me some discount Mattress in lockdown period What is the 100 NIGHT Free 100 days Specific 100-night offer TRIAL OFFER trial Trial offer on I need try customisation Sir I want to fast I need Beginners Role of RECOMMEND gain weight. hair multivitamin electrolytes powder Curekart Generic PRODUCT i’m beginner Which is the Is it help for in gym best whey protein sperm count For biceps and Can i get Can diabetic tricep muscle rivamal 120 ml patient have it growth supplements at my home Connect with chat with customer Application is not agent service agent Power CHAT WITH responding during Generic My transaction PowerPlay11 play11 AN AGENT team joining is incorrect rummy issue My sign up Suddenly balance Why my current bonus is incorrect gone basketball teams Did not receive Why it is showing being shown?? my amount wrong balance

Table 1: Few examples of Intents and Queries in Train and Test set in HINT3 dataset

2 Prior Work

Despite intent detection being an important component of most dialogue systems, very few datasets have been collected from real users. Web Apps, Ask Ubuntu and Chatbot datasets from (Braun et al., 2017) contain a limited number of intents (<10), oversimplifying the task. More recent datasets like HWU64 from (Liu et al., 2019) and CLINC150 from (Larson et al., 2019) span a large number of intents in multiple domains but are generated using crowdsourcing services and hence are limited in the diversity of user expressions, which arises from, but is not limited to, domain specific presumptions, context from how and where the bot is made available, paraphrases emerging from cultural and ethnic diversity of the user base, conversational slang, etc. Our work has some similarity with CLINC150, in that they also highlight the problem of out-of-scope intent detection, and with BANKING77 from (Casanueva et al., 2020), which focuses on a single domain. However, all three - HWU64, CLINC150, BANKING77 - offer relatively large and well balanced training sets, which might not always be feasible to collect for every new domain. For all datasets mentioned so far, recent works have reported a reasonably high performance (>90% average) for in-scope queries. Despite this, gaps in language understanding become apparent when such systems are deployed. The datasets introduced in this paper and the further analysis of results attempt to recognise critical gaps in language understanding and call for further research into more robust methods.

Dataset        #Intent    #Queries
                          Train               Test
                          Full     Subset     in-scope   oos
SOFMattress    21         328      180        231        166
Curekart       28         600      413        452        539
Powerplay11    59         471      261        275        708

Table 2: Statistics of the 3 datasets in HINT3

3 Datasets

We introduce HINT3, a collection of datasets shown in Table 2 - SOFMattress, Curekart and Powerplay11, each containing a diverse set of intents in a single domain - mattress products retail, fitness supplements retail and online gaming respectively. Table 1 shows a few example intents of varying granularity in the HINT3 dataset, along with examples of training queries created by domain experts and in-scope, out-of-scope queries received from real users.

Figure 1: Matthew’s Correlation Coefficient and Accuracy across all datasets and platforms

3.1 Training Data Collection

Training data is prepared by a team of domain experts trying to emulate real users after in-depth research of historical user queries. The experts do not create an explicit set of out of scope queries primarily because the universe of such queries is infinitely big. Training datasets show class imbalance, occurrence of domain specific words, acronyms⁷. All training data queries are in English.

Dataset Variants

In addition to Full training sets, we create Subset versions for each training set. For each class, after retaining the first query we iterate over the rest, discarding a query if it has an entailment score (Bowman et al., 2015) greater than 0.6 in both directions with any of the queries retained so far, i.e. the subset version has the following property

$$
E(x_a, x_b) \le 0.6 \;\wedge\; E(x_b, x_a) \le 0.6, \quad a \ne b,\; a \in [1, |\hat{X}_i|],\; b \in [1, |\hat{X}_i|] \quad \forall\, i \in I
$$

where $I$ is the set of all intents, $\hat{X}_i$ is the set of queries retained for class $i$, and $E(h, p)$ is the entailment scoring function with $h$ as hypothesis and $p$ as premise. We use an ELMo model trained on SNLI (Peters et al., 2018; Parikh et al., 2016)⁸ for $E(h, p)$. These are intended to evaluate performance with only semantically different sentences in the training set, as ideally systems should already understand semantically similar queries to the ones present in the training set.

⁷ github.com/hellohaptik/HINT3/tree/master/data exploration
⁸ https://demo.allennlp.org/textual-entailment

              SOFMattress       Curekart          Powerplay11
              Full    Subset    Full    Subset    Full    Subset
Dialogflow    73.1    65.3      75.0    71.2      59.6    55.6
RASA          69.2    56.2      84.0    80.5      49.0    38.5
LUIS          59.3    49.3      72.5    71.6      48.0    44.0
Haptik        72.2    64.0      80.3    79.8      66.5    59.2
BERT          73.5    57.1      83.6    82.3      58.5    53.0

Table 3: Inscope Accuracy at low threshold=0.1 for Full and Subset data variants

3.2 Test Data Collection and Annotation

Our test sets contain the first message received by live systems from real users over a period of 15 days. Inter-annotator agreement was 75.8%, 80.0%

4.2 Metrics

We consider Accuracy and Matthew’s Correlation Coefficient⁹ as overall performance metrics for the systems.
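The Subset construction described under Dataset Variants can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: it follows the prose rule (a query is discarded when its entailment score exceeds 0.6 in both directions against any query already retained), and the names `build_subset`, `build_subsets` and `toy_entail` are hypothetical. In particular, `toy_entail` is only a token-overlap stand-in so that the example runs without a model; the paper's scores come from an ELMo model trained on SNLI.

```python
from typing import Callable, Dict, List

# Assumed interface: entail(hypothesis, premise) -> score in [0, 1].
EntailmentFn = Callable[[str, str], float]


def build_subset(queries: List[str], entail: EntailmentFn,
                 threshold: float = 0.6) -> List[str]:
    """Greedy filter for one intent class, preserving the original order."""
    if not queries:
        return []
    retained = [queries[0]]  # the first query of a class is always kept
    for candidate in queries[1:]:
        # Redundant only if some retained query entails it AND is entailed by it.
        redundant = any(
            entail(candidate, kept) > threshold
            and entail(kept, candidate) > threshold
            for kept in retained
        )
        if not redundant:
            retained.append(candidate)
    return retained


def build_subsets(train: Dict[str, List[str]], entail: EntailmentFn,
                  threshold: float = 0.6) -> Dict[str, List[str]]:
    """Apply the filter independently to every intent class."""
    return {intent: build_subset(qs, entail, threshold)
            for intent, qs in train.items()}


def toy_entail(hypothesis: str, premise: str) -> float:
    """Illustrative stand-in scorer (NOT the paper's model): the fraction of
    hypothesis tokens that also appear in the premise."""
    h = set(hypothesis.lower().split())
    p = set(premise.lower().split())
    return len(h & p) / len(h) if h else 0.0


if __name__ == "__main__":
    offers = [
        "what are the available offers",
        "what are the offers available",  # same tokens, so mutually "entailed"
        "give me some discount",
    ]
    print(build_subsets({"OFFERS": offers}, toy_entail))
```

Because the check requires the threshold to be exceeded in both directions, a short query that is merely contained in a longer one (high score one way, low score the other) is still kept; swapping `and` for `or` in the sketch would give a stricter filter.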
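For readers who want to reproduce the two overall metrics named in Section 4.2, a small pure-Python sketch is given here. It assumes the standard multiclass generalisation of Matthew's Correlation Coefficient computed from label counts, and it assumes that out-of-scope is treated as one more label; neither detail is confirmed by the text above, and the function names are illustrative.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of queries whose predicted label equals the gold label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass Matthews Correlation Coefficient.

    With s samples, c correct predictions, t_k gold occurrences and
    p_k predicted occurrences of label k:
        MCC = (c*s - sum_k t_k*p_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    Returns 0.0 when the denominator is zero (e.g. constant predictions).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(v * v for v in t_counts.values())
    var_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(var_true * var_pred)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    gold = ["OFFERS", "OFFERS", "oos", "oos"]
    pred = ["OFFERS", "oos", "oos", "oos"]
    print(accuracy(gold, pred), mcc(gold, pred))
```

Unlike accuracy, this coefficient drops to zero for a system that always predicts the same label, which is one reason it is informative on imbalanced test sets with many out-of-scope queries.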