
Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services

Mohammad-Ali Yaghoub-Zadeh-Fard

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering
Faculty of Engineering

May 2021

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed: ......

Date: ......

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.

Signed: ......

Date: ......

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.

Signed: ......

Date: ......

Dedication

I dedicate this thesis to the love of my life, Raha, for being there for me throughout the doctorate program and in all stages of my life.

I also dedicate this work to my family members. A special feeling of gratitude to my lovely parents, Mahnaz and Kazem, who left no stone unturned to support and encourage me in all stages of my life; my siblings Mahoumeh and Aliakbar, with whom I grew up; and my mother-, father-, and brother-in-law, Fatemeh, Alireza, and Mohsen, for their support and encouragement.

Acknowledgments

I would like to give my special gratitude to all the people who supported me during this journey:

• Special thanks to my supervisor, Scientia Professor Boualem Benatallah. He has been such an inspiration and so supportive during these years. I really enjoyed the opportunity to work under his supervision.

• I would like to thank my sponsor, the Australian Research Council (ARC), for supporting me financially during my research.

• I would like to thank the Australian Government and acknowledge that my work was partially supported by an Australian Government Research Training Program Scholarship.

• I would like to give my special thanks to Dr. Shayan Zamanirad for his help and assistance throughout my thesis. Doing research, exercising, and having fun together with Shayan, I finished this journey.

• Thanks to Dr. Moshe Chai Barukh and Professor Fabio Casati for their helpful comments on my work, and also for their brilliant ideas.

• Thanks to Dr. Amin Beheshti. His support and friendly attitude were inspiring and heart-warming.

• Thanks to Dr. Mohsen Afsharchi and Dr. Mozafar Bag-Mohammadi, who were my role models during my bachelor’s degree. Their friendly characters and their insights in computer science inspired me to take this journey in the first place.

Abstract

Dialog Systems (or simply bots) have recently become a popular human-computer interface for performing users’ tasks, by invoking the appropriate back-end APIs (Application Programming Interfaces) based on the user’s request in natural language. Building task-oriented bots, which aim at performing real-world tasks (e.g., booking flights), has become feasible with the continuous advances in Natural Language Processing (NLP), Artificial Intelligence (AI), and the countless number of devices which allow third-party software systems to access their back-end APIs. Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human languages. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks – called intents. To build a bot, developers need to manually translate APIs to utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets, by paraphrasing the initial utterances generated by the bot developers for each task. However, there are several unsolved issues. First, generating canonical utterances requires manual efforts, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy and incorrect (not conveying the same intent as the given task). This thesis first surveys the state-of-the-art approaches for collecting large sets of training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues of crowdsourced utterances (e.g., grammatical errors, semantic completeness). Moreover, we propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. Particularly, we propose a novel technique to promote the diversity of crowdsourced paraphrases by dynamically generating word suggestions while

crowd workers are paraphrasing a particular utterance. Moreover, we propose a novel technique to automatically translate APIs to canonical utterances. Finally, we present our platform to automatically generate bots out of API specifications. We also conduct thorough experiments to validate the proposed techniques and models.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

Publications

1 Introduction
1.1 Background, Motivations and Aims
1.2 Research Issues
1.2.1 Acquisition of Large Training Utterances in Scale
1.2.2 Assessing and Controlling Quality of Training Utterances
1.3 Contributions
1.3.1 User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
1.3.2 An Empirical Study of Crowdsourced Paraphrases
1.3.3 Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances
1.3.4 Automatic Malicious Worker Detection in Crowdsourced Paraphrases
1.3.5 Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications
1.3.6 REST2Bot: Bridging the Gap between Bot Platforms and REST APIs
1.4 Thesis structure


2 Utterance Acquisition Methods: Background and the State-of-the-art
2.1 Intent Recognition Methods
2.1.1 Rule-based Methods
2.1.2 Machine Learning Methods
2.1.3 Crowdsourcing-based Methods
2.1.4 Retrieval-based Methods
2.1.5 End-to-End Approaches
2.2 Problem Dimensions in Training Utterance Acquisition
2.2.1 Quality
2.2.2 Cost
2.3 Utterance Acquisition Methods
2.3.1 Utterance Acquisition from Bot Usage
2.3.2 Automatically Generating Utterances
2.3.3 Crowdsourcing User Utterances
2.4 Summary and Discussion
2.4.1 Continuous Crowd-Machine Collaboration
2.4.2 Integrating Quality Control and Crowdsourcing
2.4.3 Sharing Utterances
2.4.4 Gamification
2.4.5 Data Programming
2.5 Conclusion

3 An Empirical Study of Crowdsourced Paraphrases
3.1 Introduction
3.2 Related Work
3.2.1 Quality Control
3.2.2 ...
3.3 Paraphrase Dataset Collection
3.3.1 Methodology
3.4 Common Paraphrasing Issues
3.4.1 Spelling Errors
3.4.2 Linguistic Errors
3.4.3 Semantic Errors
3.4.4 Task Misunderstanding

3.4.5 Cheating
3.5 Dataset Annotation
3.5.1 Methodology
3.5.2 Statistics
3.6 Automatic Error Detection
3.6.1 Spelling Errors
3.6.2 Linguistic Errors
3.6.3 Translation
3.6.4 Answering to Canonical Utterance
3.6.5 Semantic Errors & Cheating
3.6.6 Incorrect Paraphrases Detection
3.7 Conclusion

4 Diversity-aware Crowdsourced Utterances
4.1 Introduction
4.2 Related Work
4.2.1 Priming in crowdsourcing
4.2.2 Word list expansion
4.3 Word Recommender
4.3.1 Synonym Sets Extractor
4.3.2 Recommendation Sets Generator
4.3.3 Recommendation Set Ranker
4.4 Experiments & Results
4.4.1 Task Design Experiment
4.4.2 Crowdsourced Paraphrasing
4.4.3 Results
4.4.4 Limitations
4.5 Conclusion

5 Automatic Malicious Worker Detection
5.1 Introduction
5.2 Related Work
5.3 Cheating Behaviors
5.3.1 Character-Level Edits
5.3.2 Word-Level Edits
5.3.3 Random Sentences


5.3.4 Answering to Canonical Utterance
5.3.5 Foreign Language
5.4 Cheating Detection
5.4.1 Feature Engineering
5.4.2 Malicious Worker Detection
5.5 Experiment & Results
5.5.1 Evaluation
5.5.2 Error Analysis
5.6 Conclusion

6 Automated Canonical Utterance Generation
6.1 Introduction
6.2 Related Work
6.2.1 REST APIs
6.2.2 Conversational Agents and Web APIs
6.3 The API2CAN Dataset
6.3.1 API2CAN Generation Process
6.3.2 Dataset Statistics
6.4 Neural Canonical Sentence Generation
6.4.1 Resources in REST
6.4.2 Resource-based Delexicalization
6.5 Parameter Value Sampling
6.6 API2CAN Service
6.7 Experiments & Results
6.7.1 Translation Methods
6.7.2 Canonical Utterance Generation
6.7.3 Parameter Value Sampling
6.8 Conclusion

7 REST2Bot: A Platform for Automated Bot Development
7.1 Introduction
7.2 System Overview
7.2.1 API Parser
7.2.2 Canonical Sentence Generator
7.2.3 Paraphraser
7.2.4 Conversation Manager Generator

7.2.5 Webhook Generator
7.3 Usecases
7.4 Conclusion

8 Conclusion
8.1 Summary of the Research Issues
8.2 Summary of the Research Outcomes
8.3 Future Research Directions
8.3.1 Integrating Quality Control and Crowdsourcing with Feedback Mechanism
8.3.2 Canonical Utterance Generation for Complex Intents
8.3.3 Acquisition of Training Dialogues

Bibliography


List of Figures

1.1 Human-Computer Interfaces
1.2 Single-turn Conversation vs Multi-turn Conversations
1.3 Typical Bot Development Process
1.4 Research Approach

2.1 Pipeline Architecture of Text-based Dialog Systems
2.2 Utterance, Intent, and Entity Relationship
2.3 Defined pattern/response pairs for Greeting intent [228]
2.4 Crowd workers extract the required parameters and intents in “Guardian” [89]
2.5 Utterance Acquisition Methods
2.6 Canonical Utterance Generation

3.1 Dataset Label Statistics

4.1 Word-Recommendation Overview
4.2 Crowd Workers’ Interface
4.3 Word Recommender Architecture
4.4 Synonym-Sets Creation for an Utterance
4.5 Recommendation-Set Generation
4.6 Likert Assessment of Word Clouds by Size
4.7 Average User Rating over Time

5.1 Crowdsourced Paraphrasing in figure-eight
5.2 Cheating rate across different domains

6.1 Example of an HTTP POST Request
6.2 Classical Training Data Generation Pipeline
6.3 Excerpt of an OpenAPI Specification
6.4 Process of Canonical Utterance Extraction


6.5 API2CAN Breakdown by HTTP Verb
6.6 API2CAN Breakdown by Length
6.7 Canonical Template Generation via Resource-based Delexicalization
6.8 Assessment of Generated Canonical Templates
6.9 Parameter Type and Location Statistics

7.1 Typical Bot Development Process vs Bot Development Process Using REST2Bot (Green Arrow)
7.2 REST2Bot Architecture - Building conversational bots from APIs specifications
7.3 Excerpt of ’s OpenAPI Specification
7.4 REST2Bot

List of Tables

2.1 Summary of Utterance Acquisition Methods
2.2 Crowdsourced paraphrasing strategies

3.1 Paraphrase Samples
3.2 Pairwise Inter-Annotator Agreement
3.3 Comparison of Spell Checkers
3.4 Comparison of Grammar Checkers
3.5 Language Detection
3.6 Summary of Feature Library
3.7 Automatic Answering Detection
3.8 Automatic Semantic Error Detection
3.9 Automatic Cheating Detection
3.10 Automatic Incorrect Paraphrase Detection

4.1 Crowdsourced Paraphrase Datasets
4.2 Diversity Comparison
4.3 Naturalness Comparison
4.4 Diversity of 1st, 2nd, and 3rd Paraphrases
4.5 Intent Detection Accuracy by Dataset
4.6 Worker Satisfaction

5.1 Summary of Feature Library
5.2 Automatic Malicious Worker Detection

6.1 Parameter Replacement Context Free Grammar
6.2 API2CAN Statistics
6.3 Resource Types
6.4 Excerpt of Transformation Rules
6.5 Translation Performance


6.6 Examples of Generated Canonical Templates

Publications

Journals

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, and Shayan Zamanirad, “User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities,” in IEEE Internet Computing, 2020.

Conferences

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Moshe Chai Barukh, and Shayan Zamanirad, “A study of incorrect paraphrases in crowdsourced user utterances,” in NAACL-HLT, 2019.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, and Shayan Zamanirad, “Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances,” in IUI, 2020.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, and Shayan Zamanirad, “Automatic canonical utterance generation for task-oriented bots from API specifications,” in EDBT, 2020.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Shayan Zamanirad, and Boualem Benatallah, “REST2Bot: Bridging the gap between bot platforms and REST APIs,” in WWW, 2020.


Chapter 1

Introduction

1.1 Background, Motivations and Aims

The human-computer interface, as the means of communication between the human user and a machine, has been evolving since the invention of the computer; from command-line environments to graphical user interfaces, and natural language interfaces (text-based and spoken interfaces). In today’s highly message-based culture, natural language interfaces have gained enormous traction, and building dialog systems (also known as virtual assistants, conversational agents, or simply bots) has come to attention as a way to facilitate human-computer interaction. Almost all of the big companies have already invested in virtual personal assistants. Alexa and Apple, to name a few, are collectively being used by millions of users worldwide [45]. Many other sophisticated bots are continuously being developed, from those that allow

Figure 1.1: Human-Computer Interfaces.

data scientists to assemble data analytic pipelines (e.g., AVA [97], Analyza [48]) to bots that act like humans (e.g., Microsoft’s Tay). Other applications of bots include entertainment, generating source code, querying databases, controlling home appliances, and even testing theories [207, 13, 97, 61, 171]. At the time of writing, over 1 billion Alexa and Google Assistant devices have been collectively implanted in our homes1 [228]. A conversation between a user and the computer can be as simple as asking a question and getting an answer, or it can be a complex conversation with a series of interactions (e.g., questions and answers). In terms of interaction modes, conversational bots can be divided into two main categories [101, 100]:

• Single-turn conversation. In single-turn mode, the user queries the bot, and the bot provides a response back to the user. As such, the user gets the requested information or performs the desired task in one quick interaction. After generating the response, the conversation is over, and neither the question nor the response will be remembered for future conversations. In other words, each conversation is performed in complete isolation.

• Multi-turn conversations. In multi-turn conversations, there is a dialog between the user and the bot. In each turn the bot responds appropriately to the user’s request based on the conversation history. As such, the bot may request additional information (e.g., location of the restaurant, type of cuisine, rate) if required for fulfilling the user’s request (e.g., finding a restaurant).

To perform a task (e.g., finding a restaurant), bots need to map the user’s utterance (e.g., “Is there any restaurant around Station Street?”) to executable forms such as back-end APIs (Application Programming Interfaces) (e.g., @flights.query(location=‘Station Street’)) and SQL commands (e.g., select name from restaurants where location=“Station Street”). In this thesis, we focus on one of the most popular forms of intents, namely those which are fulfilled by invoking APIs. The translation of user utterances to intents is challenging since human language is unbounded and rich, and users can express the same task in countless ways in natural language. Moreover, smart devices and software services are continuously developed and provide new APIs to interact with other

1https://www.cnet.com/news/google-assistant-expands-to-a-billion-devices-and-80-countries/

2 1.1 Background, Motivations and Aims


Figure 1.2: Single-turn Conversation vs Multi-turn Conversations

services, including bots. With the number of different APIs growing and evolving very rapidly, we need bot development to “scale” in terms of how effectively bots can be integrated with APIs. However, from an engineering perspective, the integration of bots and APIs is still lagging behind the deployment of new APIs, devices, and services [228]. Current solutions for bot development rely heavily on experts’ understanding of APIs and require extensive manual efforts to obtain training samples for building supervised machine learning models. Given the evolving nature of APIs and their abundance, large-scale integration of bots and APIs requires minimal expert involvement. After all, the value of virtual assistants is heavily related to their capabilities of integrating with a large number of evolving and heterogeneous APIs [228, 217]. Developing a bot typically implies the ability to invoke APIs corresponding to user utterances (e.g., “what will the weather be like tomorrow in NYC?"). This is typically done in two phases, as demonstrated in Figure 7.1: (i) training a Natural Language Understanding (NLU) model to map user utterances to a predefined set of intents2 and to extract the associated entities3; and (ii) developing Webhook functions to map intents to executable forms (e.g., APIs, SQL commands) and serve the user’s request by performing tasks (e.g., reporting the weather) [228, 218]. In this thesis, we concentrate on a very popular form of user intents which are fulfilled by invoking API methods.

2An intent is an abstract description of a task to be performed such as “book-flights”, “set-alarm”, “get-playlists”.
3An entity is a parameter (e.g., “location”, “date”) of the intent (e.g., “book-flights”).



Figure 1.3: Typical Bot Development Process

Supervised NLU techniques require definition of intents (e.g., booking hotels), entity types (e.g., location, date), and a set of annotated utterances in which entities are labeled with the entity types and intents. To obtain such sets of utterances, the typical approach is to obtain an initial utterance, called a canonical utterance, and then paraphrase it (either manually or automatically) to generate a diverse set of user utterances [217, 95]. Obtaining canonical utterances for each intent requires the creation of utterance templates or grammar rules [217, 25, 190]. Canonical utterances are then paraphrased (automatically or using crowdsourcing) to obtain a more diverse set of training utterances. However, bot developers are required to filter/remove (manually, automatically, or via crowdsourcing) unqualified utterances from the generated set of utterances to obtain a high-quality set of utterances and build reliable bots [219, 217]. The cleaned dataset can be used for training NLU models. Finally, bot developers provide Webhooks (i.e., intent-action rules that trigger API calls upon detection of associated intents) [218].
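To make the second phase concrete, the sketch below shows what a minimal Webhook-style handler could look like: it takes the intent and entities produced by an NLU model and maps them to a back-end API call. This is only an illustration under assumed names; the intent "get_weather", the entity keys, and the endpoint https://api.example.com are hypothetical and not tied to any specific bot platform or API.

```python
# Minimal, hypothetical webhook handler: map a detected intent and its entities
# to a back-end REST API call and produce a reply. Intent name, entity keys,
# and the API endpoint are assumptions for illustration only.
import requests

API_BASE = "https://api.example.com"  # hypothetical back-end API

def handle_intent(intent: str, entities: dict) -> str:
    if intent == "get_weather":
        # e.g., entities = {"city": "NYC", "date": "tomorrow"}
        resp = requests.get(f"{API_BASE}/weather",
                            params={"city": entities.get("city"),
                                    "date": entities.get("date")})
        forecast = resp.json().get("summary", "unknown")
        return f"The weather in {entities.get('city')} will be {forecast}."
    return "Sorry, I cannot handle that request yet."

# Example NLU output for "what will the weather be like tomorrow in NYC?":
# handle_intent("get_weather", {"city": "NYC", "date": "tomorrow"})
```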

Integration of bots and APIs, therefore, requires techniques to reduce expert involvement or automate some training data acquisition tasks (e.g., canonical utterance generation, paraphrasing, quality control of training utterances). In this thesis, we contribute novel techniques focusing on the acquisition of quality training utterances at scale and reducing the involvement of experts in the process of building bots.

1.2 Research Issues

In this section, we will discuss important issues regarding efficient and effective training data acquisition in dialog systems.

1.2.1 Acquisition of Large Training Utterances in Scale

The state-of-the-art solutions for building NLU models often rely on training utterances for a particular set of intents [217]. As mentioned earlier in this chapter, obtaining training data typically involves (i) obtaining initial utterances (also called canonical utterances) that capture the users’ particular intents, and (ii) paraphrasing the initial utterances into multiple variations to live up to human language richness [225, 203].

Canonical Utterance Generation. For a given intent and its entities, the first step in training data acquisition is to create a canonical utterance which conveys the same intent with the given set of entity values. While APIs are unleashing the power of software systems and devices, they are often difficult to understand for non-experts. Thus, generating canonical utterances is essential since they express the functionality of an API method in natural language and can be understood by non-experts. Existing solutions for generating canonical utterances often involve employing domain experts (e.g., API developers) to generate hand-crafted domain-specific templates or grammar rules [25, 190, 204]. Not only are such approaches costly since they rely on experts, but they are also domain-specific and not scalable [204, 25, 190]. As a result, building a bot for a new domain (a new API) requires manual efforts to modify the templates or hand-crafted grammar rules to generate canonical utterances [217]. In addition, API specifications (e.g., the names of methods and parameters) change over time, requiring modification of templates and grammar rules. It is thus essential to develop approaches with minimum involvement of experts [25, 190].

Paraphrasing Canonical Utterances. A diverse set of utterances can more

effectively represent the various ways that an intent can be expressed in natural language [217, 25, 190, 139]. Paraphrasing is thus essential, especially given the flexible but ambiguous nature of human languages [206]. A lack of variation in training utterances can cause incorrect intent recognition or entity resolution, and it can result in bots performing undesirable (even dangerous) tasks (e.g., pressing the accelerator rather than the brake pedal in a car) [85]. Automated or crowdsourced paraphrasing techniques have been used to diversify training utterances. Automatic paraphrase generation is potentially inexpensive [52, 50, 125]. However, even the state-of-the-art techniques fall short in producing sufficiently diverse utterances which are at the same time semantically correct [80, 75, 215, 18]. Due to the shortcomings of automated paraphrasing systems, crowdsourced paraphrasing has recently become popular [25]. However, such paraphrases may be generated by anonymous workers with varied skills and motivations. Consequently, they may contain incorrect paraphrases [46].

1.2.2 Assessing and Controlling Quality of Training Utterances

Collected utterances may contain incorrect samples which do not convey the expected intent [217]. The lack of high-quality training samples can be disastrous [138]. Microsoft’s Tay, as an example, quickly made a number of offensive comments because of the presence of biases in its training samples [85]. Particularly in crowdsourced utterances, malicious workers, spammers, and inexperienced workers may generate misleading and erroneous paraphrases [219]. Quality issues in crowdsourced utterances may also stem from misunderstanding the task of paraphrasing or missing information such as values for the parameters of intents [190]. This thesis thus focuses on solutions required for obtaining high-quality training utterances from crowd workers. The quality dimensions of such collections of utterances can be divided into two primary categories:

• Corpus-level. At the corpus level, bot developers aim to obtain a diverse set of utterances with minimal bias [85]. However, it has been shown in numerous studies that crowd workers are biased towards using the words in the given initial utterance (to be paraphrased) [217, 220, 139]. Bias towards the vocabulary and structure of the sentence to be paraphrased can be explained by the priming effect – an automatic, implicit and non-conscious activation


of information in memory [86]. According to the priming effect, exposure to a stimulus affects responses to a subsequent stimulus (e.g., asked to name a word starting with “str”, humans are more likely to form the word “strong” than “street” if they have previously been shown the word “strong”) [211]. As such, primed by the words in the given utterance, crowd workers are more likely to use the same vocabulary when paraphrasing [171, 95]. Thus, the priming effect may negatively impact the diversity of collected paraphrases. Current solutions for overcoming the priming effect, such as paraphrasing the paraphrases obtained by other crowd workers, have been shown to result in numerous semantically incorrect paraphrases [171, 95]. The new challenge is to effectively design a crowdsourcing task without biasing crowd workers.

• Utterance-level. At the utterance level, an utterance should convey the supposed intent without any divergence in meaning. In crowdsourcing settings, workers are supposed to provide correct paraphrases for a given utterance. However, it has been reported in many studies that crowd workers may generate incorrect paraphrases [11, 85, 112]. For example, spammers, malicious and even inexperienced crowd workers may provide misleading, erroneous, and semantically invalid paraphrases [118, 25]. Quality issues may also stem from misunderstanding the intent or not covering important information such as values of the intent parameters [190]. Thus, crowdsourced paraphrases need to be checked for quality, given that they are produced by unknown workers with varied skills and motivations [25, 46]. The common practice for removing unqualified paraphrases is to design another crowdsourcing task called a “validation task” [217, 139]. However, this approach is costly, as one has to pay for both the paraphrasing and validation tasks, making automated techniques a very appealing alternative. Moreover, quality control is more desirable if it is done before letting workers submit their paraphrases, since low-quality workers can be removed early on without any payment [112, 217, 219, 141]. To achieve this, it is therefore important to automatically recognize quality issues in crowdsourced paraphrases during the process of bot development. Existing automated approaches are limited to removing misspelled paraphrases [203] and discarding submissions from workers with low/high task completion times [123]. We need to characterize what kinds of paraphrases can be considered incorrect.


We also need to build efficient and effective approaches to detect incorrect paraphrases.

1.3 Contributions

In this thesis, we devise novel techniques and models to tackle the above-mentioned issues in training data acquisition for dialog systems. We build upon advances in natural language processing, machine learning, dialog systems, and crowdsourcing to address the raised issues in obtaining bot training utterances. The proposed concepts and techniques resolve important gaps and shortcomings in efficiently and effectively acquiring bot training utterances. They include (i) a comprehensive survey of existing approaches for obtaining quality training utterances; (ii) a study that characterizes how incorrect paraphrases are generated by crowd workers and a taxonomy of common quality problems in crowdsourced user utterances, as well as techniques to detect each category of incorrect paraphrases; (iii) improving the diversity of crowdsourced paraphrases via dynamic word recommendation; (iv) a novel technique to detect malicious workers who intentionally generate incorrect paraphrases in crowdsourcing tasks, particularly in crowdsourced user utterances; (v) an automatic approach for generating canonical utterances to directly translate a REST (REpresentational State Transfer) API to natural language; and (vi) a software prototype demonstrating all the techniques and models developed in this study. We investigate and develop software architectures, prototypes, evaluation studies and applications to assess the proposed models and techniques. In the rest of this section, we will briefly explain the contributions made in this study.

1.3.1 User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities4

Building conversational task-oriented bots requires large and diverse sets of annotated user utterances to learn mappings between natural language utterances and user intents. Given the complexity of human language as well as the recent advances in intent recognition (especially deep-learning-based approaches), bot

4This contribution has been published in IEEE Internet Computing (see [217])

developers now face a new challenge: efficiently and effectively collecting a large number of quality (e.g., diverse, unbiased) training samples. This work studies training user utterance acquisition methods along several important dimensions including cost and quality [217]. We discuss the state-of-the-art techniques, identify and explore open issues, and inform an outlook on future research directions [217]. In the following contributions, we first perform an empirical study to characterize crowdsourced utterances (e.g., their quality issues) and then address the identified open issues, namely the lack of diversity in crowdsourced utterances and the automatic generation of canonical utterances.

1.3.2 An Empirical Study of Crowdsourced Paraphrases5

Developing bots requires high-quality training samples. Crowdsourcing has widely been used to collect such datasets by paraphrasing an initial utterance into new variations. However, this approach often suffers from various quality issues, particularly language errors produced by unqualified crowd workers. More so, since workers are tasked to write open-ended text, it is very challenging to automatically assess the quality of paraphrased utterances. In this study, we investigate common crowdsourced paraphrasing issues and derive a taxonomy of such issues (e.g., cheating, spelling errors, grammatical errors, semantic errors, and task misunderstanding such as translation and answering) [219]. We collected a large set of crowdsourced paraphrases in several domains and propose an annotated dataset, called Para-Quality, for detecting the quality issues. We also investigated existing tools and services to provide baselines for detecting each category of issues and to assess whether existing tools are capable of detecting such issues automatically and accurately [219]. Overall, this work presents a data-driven view of incorrect paraphrases during the bot development process, and we pave the way towards automatic detection of unqualified paraphrases [219]. In the following contributions, we address two of the most important quality issues, namely the lack of diversity in training utterances and the detection of incorrect utterances generated by malicious workers in crowdsourced utterances.

5This contribution has been published in the proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (see [219])


1.3.3 Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances6

As discussed in Research Issues (see Sections 1.2.1 and 1.2.2) and investigated thoroughly in our prior contributions (see Sections 1.3.1 and 1.3.2), building task-oriented bots requires a large number of diverse utterances. Crowdsourcing may be an effective, inexpensive, and scalable technique for collecting such large datasets [217, 25, 190]. However, the diversity of the results suffers from the priming effect (i.e., workers are more likely to use the words in the sentence we ask them to paraphrase) [217, 139]. In this study, we leverage priming as an opportunity rather than a threat: we dynamically generate word suggestions by introducing a novel word-list enrichment method based on word-alignment techniques, to motivate crowd workers towards producing diverse utterances [220]. The suggestions are generated based on the already collected paraphrases to encourage diversity by suggesting new words/phrases to crowd workers. The key challenge is to make suggestions that can improve diversity without resulting in semantically invalid paraphrases. To achieve this, we propose a probabilistic model that generates continuously improved versions of word suggestions that balance diversity and semantic relevance. Our experiments show that the proposed approach improves the diversity of crowdsourced paraphrases [220]. Moreover, it decreases the task completion time (the time taken for doing a task by a crowd worker) and reduces the number of spelling errors [220]. Later in this study, the proposed crowdsourcing approach (crowdsourcing utterances via word recommendation) is used along with other contributions to build a platform for facilitating the process of building bots (see the contribution in Section 1.3.6).
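The following sketch illustrates the underlying idea of diversity-oriented word suggestion in a heavily simplified form: it counts which words have already been used in the collected paraphrases and proposes unused WordNet synonyms as suggestions. It is not the probabilistic model proposed in this thesis (which additionally balances diversity against semantic relevance); it only shows how already collected paraphrases can drive suggestions. It assumes NLTK with the WordNet corpus installed.

```python
# Simplified, hypothetical word-suggestion sketch: propose synonyms that have not
# yet appeared in the collected paraphrases. Not the thesis's actual model.
from collections import Counter
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def suggest_words(collected_paraphrases, top_k=5):
    used = Counter(w.lower() for p in collected_paraphrases for w in p.split())
    suggestions = set()
    for word in used:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemma_names():
                candidate = lemma.replace("_", " ").lower()
                if candidate not in used:
                    suggestions.add(candidate)
    return sorted(suggestions)[:top_k]

# suggest_words(["book a flight to Sydney", "reserve a flight to Sydney"])
# might suggest words such as "engage" or "reservation", depending on WordNet.
```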

1.3.4 Automatic Malicious Worker Detection in Crowdsourced Paraphrases

Crowdsourced utterances are generated by crowd workers with varied skills and motivations. They thus often suffer from various quality issues (e.g., semantic and linguistic errors) which are studied in prior contributions (see the research issue in Section 1.2.2 and the contributions in Section 1.3.1, 1.3.2, and 1.3.3). Such

6This contribution has been published in the proceedings of the 25th International Confer- ence on Intelligent User Interfaces (IUI) (see [220])

quality issues stem from the lack of quality assessment methods for open-ended text in crowdsourcing platforms (e.g., Mechanical Turk, Figure-Eight). Thus unqualified workers and spammers may generate erroneous paraphrases. Crowdsourced paraphrases must be verified to remove erroneous paraphrases, imposing an extra cost. Detecting such erroneous paraphrases requires an understanding of how malicious workers generate paraphrases. In this study, we provide a taxonomy of cheating behaviors in crowdsourced paraphrasing to give insight into how malicious workers intentionally generate erroneous paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then propose two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity7, user edit habits8) and propose a domain-independent method for modeling malicious workers. Our experiments indicate that the proposed approach can effectively detect malicious workers based on the paraphrases generated by them, and reduce the rate of incorrect paraphrases9 generated by malicious workers by 76 percent. The proposed approach for detecting malicious workers is used along with other contributions to build a platform for facilitating the generation of training utterances for bots (see the contribution in Section 1.3.6).
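As a rough illustration of one of the features mentioned above, the sketch below computes a multi-paraphrase similarity score: the average pairwise similarity among the paraphrases submitted by a single worker, where very high values may indicate trivial character- or word-level edits. It uses a simple token-overlap (Jaccard) similarity purely for illustration; the actual approach combines many features and a learned classifier.

```python
# Illustrative multi-paraphrase similarity feature (token-overlap based).
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def multi_paraphrase_similarity(paraphrases: list) -> float:
    pairs = list(combinations(paraphrases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# A worker who only swaps one word across submissions gets a suspiciously high score:
# multi_paraphrase_similarity(["book a flight", "book a plane", "book a trip"])
```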

1.3.5 Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications10

As discussed in Research Issues (see Section 1.2.1) and studied in the prior contribution in Section 1.3.1, existing approaches for generating canonical utterances require experts to generate utterance templates or grammar rules [190, 217, 25]. They thus do not scale and are domain-specific, making bots expensive to maintain [217]. With the development of REST APIs, many applications have been designed to harness their potential, and they are used to fulfill intents by invoking back-end APIs provided by software systems and smart devices [190]. The automatic generation of canonical utterances can be considered as a supervised

7computing the similarity between a set of paraphrases submitted by a crowd worker
8defined by how a crowd worker edits a sentence to generate a paraphrase
9the number of incorrect paraphrases by cheaters divided by the total number of paraphrases
10This contribution has been published in the proceedings of the 23rd International Conference on Extending Database Technology (EDBT) (see [218])

translation task in which an API method is translated into an utterance. However, the key challenge is the lack of training data for training domain-independent models. In this study, we propose API2CAN, an extensible dataset containing 14,370 pairs of API methods and utterances. The dataset is built by processing a large number of public APIs. However, deep-learning-based approaches such as sequence-to-sequence models require larger sets of training samples (ideally millions of samples) [217, 218]. To mitigate the absence of such large datasets, we define resources in REST APIs and propose a delexicalization technique (converting an API method and initial utterances to tagged sequences of resources) to let deep-learning-based approaches learn from such datasets [218]. In addition, we show how parameter values can be sampled to feed placeholders in a canonical template and generate canonical utterances. We also provide a systematic analysis of Web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances [218]. We also conduct both qualitative and quantitative experiments to verify the effectiveness of the proposed delexicalization technique and methods for sampling parameter values. Our experiments indicate that the proposed approaches can be effectively used to automatically generate canonical templates for REST APIs. In the next contribution, we use the proposed translation model to generate canonical utterances for REST APIs, along with the prior contributions, to facilitate the process of training utterance acquisition.
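To give a flavour of what translating an API method into a canonical template involves, the sketch below delexicalizes a REST endpoint in a naive, rule-like way: static path segments become resource words and parameterized segments become placeholders. The endpoint and the produced wording are hypothetical; the resource-based delexicalization proposed in this thesis is considerably richer (it types resources and learns the phrasing with a neural translation model).

```python
# Naive, hypothetical canonicalization of a REST endpoint; for illustration only.
import re

def naive_canonical_template(verb: str, path: str) -> str:
    verb_phrases = {"get": "get", "post": "create", "put": "update",
                    "patch": "update", "delete": "delete"}
    parts = [verb_phrases.get(verb.lower(), verb.lower())]
    pending_params, first = [], True
    for seg in reversed([s for s in path.strip("/").split("/") if s]):
        m = re.fullmatch(r"\{(.+)\}", seg)
        if m:                                      # path parameter -> placeholder
            pending_params.append(m.group(1))
            continue
        noun = seg if first else seg.rstrip("s")   # crude singularization of parent resources
        parts.append(("the " if first else "of the ") + noun)
        parts += [f"with {p} <<{p}>>" for p in pending_params]
        pending_params, first = [], False
    return " ".join(parts)

# naive_canonical_template("GET", "/users/{user_id}/repos")
# -> "get the repos of the user with user_id <<user_id>>"
```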

1.3.6 REST2Bot: Bridging the Gap between Bot Platforms and REST APIs11

Building bots requires collecting large but high-quality sets of training utterances (see Sections 1.2.1 and 1.2.2). Whereas existing bot development platforms (e.g., Dialogflow, Wit.ai) facilitate building bots, bot developers are still required to provide training data by defining corresponding intents (the user’s intention, such as booking a hotel) and entities (e.g., hotel location). Moreover, bot developers are required to build and deploy Webhook functions to invoke API methods on intent detection. In this contribution, we introduce REST2Bot, a tool that addresses these shortcomings (e.g., translating APIs to intents, and invoking APIs

11This contribution has been published in the proceedings of 2020 The Web Conference (WWW) (see [221])

based on detected intents) in bot development frameworks to automate several tasks in the life cycle of the bot development process [221]. REST2Bot relies on the proposed techniques and approaches for generating canonical utterances (see Section 1.3.5) and crowdsourcing utterances via word recommendation (see Section 1.3.3) to build bots on the desired bot development frameworks. It also relies on the state-of-the-art approaches to parse API specifications and generate deployable webhook functions to map intents and entities to APIs [228].

1.4 Thesis structure

The thesis structure follows the research approach presented in Figure 1.4. In each chapter, we present the related work to motivate the study and provide necessary background for understanding the chapter. We also provide essential background knowledge, discuss the state of the art and relevant terminology in Chapter 2. Chapter 3 presents our empirical study on how incorrect paraphrases are generated during crowdsourcing, proposes a taxonomy of common issues in crowdsourced paraphrases, and investigates existing tools and models for detecting each category of the issues. Chapter 4 proposes and presents techniques to dynamically generate word suggestions in crowdsourced paraphrasing to promote diversity while discouraging the generation of semantically incorrect paraphrases. Chapter 5 delves into an important category of incorrect paraphrases generated by insincere workers, and proposes an effective approach to detect malicious workers at the time of submission of crowdsourcing tasks. Chapter 6 proposes and presents a technique to generate the initial utterances (to be later paraphrased by crowdsourcing or automated paraphrasing systems) for REST APIs. Chapter 7 presents our prototype software which builds bots automatically from REST API specifications (also known as swagger documentations) using the proposed approaches for generating canonical sentences, the crowdsourcing approach proposed in this thesis, as well as the state-of-the-art approaches proposed in related work. Finally, Chapter 8 provides our conclusion and highlights future research directions and outlook.



Figure 1.4: Research Approach

Chapter 2

Utterance Acquisition Methods: Background and the State-of-the-art

This chapter presents background on dialog systems and introduces the main concepts and techniques that are relevant to the contributions presented in the following chapters. It reviews training user utterance acquisition methods (i.e., bot-usage-, automated-, and crowdsourcing-based approaches) along several important dimensions including cost and quality. We discuss state-of-the-art techniques and identify open issues in training utterance acquisition. In Section 2, we discuss background on dialog systems. Section 2.1 particularly discusses an important task in building such systems, namely intent recognition. Section 2.2 identifies the problem dimensions in training data acquisition for training intent recognition models. In Section 2.3, we overview utterance acquisition methods. Section 2.4 provides a summary and outlook for utterance acquisition methods. Finally, Section 2.5 provides our conclusion.

Part I - Dialog Systems

Dialog systems can generally be categorized into two classes: (i) non-task-oriented bots (usually referred to as chatbots), and (ii) task-oriented bots [31]. Non-task-oriented bots engage in open-domain conversations with users without a predefined goal for the conversations. It is also worth noting that even in task-oriented bots, around 80% of conversations are just chit-chat messages [31]. Such bots hardly keep the history of the conversation and its states, and they are therefore not



Figure 2.1: Pipeline Architecture of Text-based Dialog Systems

designed to perform specific user tasks (e.g., booking a flight or a hotel) [194]. Examples of non-task-oriented bots include: DBpedia chatbot [6], Mitsuku1, and Cleverbot2.

On the other hand, task-oriented bots allow users to accomplish simple/complex tasks (e.g., playing music, querying a database [43, 194]) using information provided by users during conversations. This thesis focuses on this type of bot, and the term “bots” will refer to task-oriented bots from now on. The main components of a text-based task-oriented dialog system are shown in Figure 2.1 [31]. Dialog systems often consist of four main units: Natural Language Understanding (NLU), Dialog State Tracking (DST), Dialog Policy (DP), and Natural Language Generation (NLG) [100]. In each turn of the conversation, the user provides an utterance (e.g., “book a flight from Sydney to Houston”) which is analyzed by the NLU component to detect the user’s intent (e.g., “flight booking”) and its entities (e.g., “from:Sydney”, “to:Houston”). “Intent” captures the user’s purpose, while “entity” (aka slot) describes a term/object from the utterance to formulate the parameter(s) needed to fulfill the intent [229]. Next, DST determines the current state of the dialog based on the given utterance and previous interactions. The state of the dialog indicates the current values of entities based on the history of the conversation (e.g., “intent:flight booking”, “from:Sydney”, “to:Houston”, “departure date:—”). The new dialog state is passed onto DP, which decides on the next actions (e.g., “ask for the departure date”). Finally, NLG produces an appropriate response (e.g., “When do you want to depart?”)3.

1https://www.pandorabots.com/mitsuku/
2https://www.cleverbot.com/
3The interested reader can refer to [100] for more details
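A minimal sketch of how the pipeline above could be wired for the flight-booking example is shown below: the NLU output (intent plus entities) updates the dialog state, and a trivial policy asks for the next missing slot. The intent name and slot names are assumptions made purely for illustration.

```python
# Toy dialog-state update and policy for the flight-booking example above.
REQUIRED_SLOTS = {"flight_booking": ["from", "to", "departure_date"]}

def update_state(state: dict, nlu_result: dict) -> dict:
    state["intent"] = nlu_result.get("intent", state.get("intent"))
    state.setdefault("slots", {}).update(nlu_result.get("entities", {}))
    return state

def next_action(state: dict) -> str:
    missing = [s for s in REQUIRED_SLOTS.get(state.get("intent"), [])
               if s not in state.get("slots", {})]
    return f"ask_for:{missing[0]}" if missing else "call_booking_api"

state = update_state({}, {"intent": "flight_booking",
                          "entities": {"from": "Sydney", "to": "Houston"}})
# next_action(state) -> "ask_for:departure_date", which NLG could verbalize as
# "When do you want to depart?"
```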


2.1 Intent Recognition Methods

At the heart of building task-oriented bots lies identifying the user’s intent as well as extracting related entities from a given utterance. To interact with users, it is essential for bots to detect and understand user intentions. This task is called intent recognition [106, 163, 81]. An intent refers to a user’s purpose, which a bot should respond to [163, 31]. For instance, in the utterance “is there any hotel around”, the user’s intent is to find a hotel in his/her neighborhood. Bot developers assign a short text such as “bookHotel” to each intent supported by the bot. User utterances may also contain important information about how the task must be performed. Bots, therefore, need to understand the provided information in order to be able to serve the user’s request [68, 31]. Such information is known as the entities or slots of the intent. These entities have data types (e.g., date, time, location, product categories). For example, in the utterance “find a hotel in Sydney”, the term Sydney is an entity of type location, indicating the hotel’s location. Figure 2.2 demonstrates an example illustrating the relationship between utterances, intents, and entities.


Figure 2.2: Utterance, Intent, and Entity Relationship

Generally, techniques for mapping natural language to intents can be divided into five major classes: rule-based methods, machine learning methods, crowdsourcing-based methods, retrieval-based methods, and end-to-end approaches. In the rest of this section, we present how these models work and argue why collecting data is necessary for each method.


2.1.1 Rule-based Methods

The history of chatbots goes back to ELIZA [207], a rule-based dialog system designed to simulate a psychologist, and rule-based approaches have still maintained their popularity [25, 190, 48, 97, 180]. In rule-based intent recognition, bot developers are required to define intents and a set of intent recognition rules. Such rules are in the form of pattern/response pairs as shown in Figure 2.3. As such, if a pattern matches, the corresponding response is returned back to the user [202, 228]. In other words, the bot detects the user’s intent by matching all rules with the utterance. For instance, given an utterance like “My name is Ali”, since the first three words “My name is” match the second pattern in Figure 2.3, the bot concludes that the user intent is Greeting and the user’s name is Ali. Next, the bot responds to Ali by saying “Nice to meet you, Ali.”.

+ pattern: hi bot
  response: Hi human!

+ pattern: my name is
  response: Nice to meet you, .

+ pattern: how are you
  response: I am good, how about you?

Figure 2.3: Defined pattern/response pairs for Greeting intent [228]
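The toy matcher below implements the spirit of the rules in Figure 2.3: each pattern is checked against the utterance and the first match produces the response. Real rule languages (AIML, RiveScript, ChatScript) support wildcards, variables, and topics; this sketch only handles a literal prefix with an optional captured remainder and is not tied to any of those languages.

```python
# Toy rule-based matcher inspired by the pattern/response pairs of Figure 2.3.
RULES = [
    ("hi bot", "Hi human!"),
    ("my name is", "Nice to meet you, {}."),
    ("how are you", "I am good, how about you?"),
]

def respond(utterance: str) -> str:
    text = utterance.lower().strip()
    for pattern, response in RULES:
        if text.startswith(pattern):
            remainder = utterance[len(pattern):].strip()  # e.g., the user's name
            return response.format(remainder) if "{}" in response else response
    return "Sorry, I did not understand that."

# respond("My name is Ali")  ->  "Nice to meet you, Ali."
```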

Several domain-specific languages have been proposed to create such rules [180]. Examples include the Artificial Intelligence Markup Language (AIML), an XML-based language which has been widely leveraged by many famous chatbots (e.g., ELIZA [208], PARRY [39] and ALICE [202]). Other examples of such languages include Rivescript4 and Chatscript5. Compared to AIML, these languages are much easier to understand, and provide additional features. In general, rule-based approaches suffer from two main issues [228]:

• Lack of flexibility: Since expert-written patterns are fixed, bot users are doomed to only use the controlled language defined for the interactions with the bot [31]. For instance, a simple rule for greeting such as (“hi bot”) can

4https://www.rivescript.com/
5https://github.com/ChatScript/ChatScript


only respond to the same utterance and will fail if the provided utterance differs (e.g., “hey bot” or simply “hi”). Thus, the chatbot is not likely to answer properly except when the defined set of utterances is used.

• Cost of maintenance: Hand-crafted rules written by experts require anticipating end-users’ language diversity, which is intensively time-consuming and costly [230, 228]. Moreover, as the number of rules grows, finding overlaps and conflicts between rules also makes this approach not scalable [166].

In spite of the above issues, rule-based intent recognition is useful in the following cases [228]: (i) lack of training data [223], and (ii) small-scale, limited bots (with a limited number of questions/answers) [72]. Rule-based models are mostly based on controlled natural languages to avoid the complexity and ambiguities imposed by natural language [97]. Given that rule-based methods are confined to a finite set of grammar rules, they are suitable for closed-domain dialog systems [117]. While rule-based methods are usually based on hand-crafted rules, they can also benefit from training data in two aspects: (1) annotated user utterances can be used for automatic grammar generation (called automatic rule induction) [6, 1, 135]; (2) training data is also beneficial in parse tree disambiguation for semantic parsers [204, 190, 25].

2.1.2 Machine Learning Methods

Most bots and bot development platforms rely on machine-learning-based classification methods to classify a given utterance into a predefined set of intents [186]. Such approaches are built with the flexibility limitation of rule-based techniques in mind. In such approaches, a classification model is trained on training datasets containing a large number of utterances, each labeled with both intents and their corresponding entities [122, 83]. Based on a diverse set of utterances for each intent, the model learns to differentiate between the utterances of different intents. For example, SVM (Support Vector Machine) has been shown to perform well on intent recognition tasks [129, 84]. Language processing platforms such as RasaNLU6 also use SVM as their default algorithm to identify intents from user utterances. Other examples of classification algorithms used in this task include CRF (Conditional Random Field) [94], and deep-learning-based classifiers [190, 186, 197].

6https://rasa.com/docs/rasa/nlu/components/#sklearnintentclassifier


There are also approaches to building unsupervised techniques for intent recognition. A common approach is to build a vector space model (VSM) on a collection of utterances, and use common metrics (e.g., cosine similarity) to measure the similarity between intents and utterances [158]. As such, text embedding techniques (e.g., word embeddings [131], Sentence-BERT [174], Universal Sentence Encoder [29], Concatenated Power Mean Embeddings [179], Sent2Vec [145], InferSent [41]) have been used to represent intents and utterances in a vector space [229, 228]. As an example, for a given intent with a set of utterances, BotBase first calculates a vector for each utterance by averaging the word embedding vectors of its constituent words; it then calculates the average vector of all the utterance vectors to obtain the vector representation of the intent [229]. At runtime, the user’s expression is converted to a vector and compared to the vectors of all intents to determine the intent closest to the user’s expression.
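The sketch below illustrates this vector-space idea: every intent is represented by the average of its utterance vectors, and a new utterance is assigned to the intent whose vector is most similar under cosine similarity. The embed() function is only a deterministic stand-in; in practice one would plug in word embeddings or a sentence encoder such as Sentence-BERT or the Universal Sentence Encoder.

```python
# Vector-space intent matching with a placeholder embedding function.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: a fixed pseudo-random vector per word, averaged.
    vecs = []
    for w in text.lower().split():
        seed = int.from_bytes(w.encode("utf-8"), "little") % (2**32)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def intent_vectors(training: dict) -> dict:
    """training maps intent name -> list of example utterances."""
    return {intent: np.mean([embed(u) for u in utts], axis=0)
            for intent, utts in training.items()}

def classify(utterance: str, vectors: dict) -> str:
    v = embed(utterance)
    cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(vectors, key=lambda intent: cosine(v, vectors[intent]))

# vectors = intent_vectors({"book_flight": ["book a flight to Sydney"],
#                           "find_hotel": ["find a hotel in Sydney"]})
# classify("is there any hotel around", vectors)  # -> likely "find_hotel"
```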

2.1.3 Crowdsourcing-based Methods

Crowdsourcing-based methods benefit from human-in-the-loop strategies to detect a user’s intent, determine the required parameters, and finally generate responses. Guardian [89], Chorus [114], Evorus [90], and CRQA [181] are examples of crowd-powered dialog systems.

Figure 2.4: Crowd workers extract the required parameters and intents in “Guardian”[89]

While crowd-based methods do not usually depend on intelligent systems, some crowd-powered approaches have recently added machine learning models to minimize the role of crowd workers and the costs of having a human in the loop [90].


In other words, these systems consider the best of both worlds by automatically creating responses when the machine learning algorithm is confident, and relying on crowd workers for the rest of the cases. As a result, training data are also beneficial in hybrid crowd-based bots.

2.1.4 Retrieval-based Methods

Retrieval-based methods borrow concepts from Information Retrieval (IR) systems: each intent is represented by a set of user utterances, and at runtime the user's utterance is compared to the existing utterances of each intent in the repository, and the most similar intent is selected [190, 117]. In comparison with rule-based approaches, instead of matching the user utterance against existing rules (patterns), the user utterance is compared to pre-collected utterances for each intent. The key to retrieval-based methods is a way to measure the similarity between a given utterance and the existing utterances in the repository. As a result, having sample utterances and their corresponding intents has at least two applications for retrieval-based methods: (1) the samples can be used to expand the repository, and (2) they can be used to train supervised matching techniques.
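A minimal sketch of such retrieval-based matching follows; word-overlap (Jaccard) similarity is a simple stand-in for the embedding- or IR-style similarity measures used in practice, and the repository contents are invented:

```python
# The user utterance is compared with every pre-collected utterance and the
# intent of the most similar one is returned.
repository = {
    "flight.book": ["book a flight", "I need a plane ticket to Houston"],
    "hotel.book": ["book a hotel room", "find me accommodation in Sydney"],
}

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def match_intent(utterance: str) -> str:
    scored = [
        (jaccard(utterance, example), intent)
        for intent, examples in repository.items()
        for example in examples
    ]
    return max(scored)[1]

print(match_intent("please book me a flight to Sydney"))  # -> flight.book
```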

2.1.5 End-to-End Approaches

As opposed to the above-mentioned approaches, a new trend in building bots is end-to-end approaches in which the user request is mapped directly to the correct answer (or to executable forms such as API methods or SQL queries) [2, 25, 190]. In other words, such approaches directly perform tasks rather than learning the mapping between utterances and intents and then performing tasks based on the detected intent and entities. Most such state-of-the-art approaches rely on deep learning techniques (especially sequence-to-sequence neural networks [193]) to translate a natural language request (e.g., “book a hotel in Station street”) directly into an appropriate response (e.g., “your hotel has been booked”) [190, 117, 167]. To further underline the significance of training data acquisition, it is worth mentioning that deep learning models are data hungry, and there is a strong correlation between the quality of a deep learning model and the size of its training data [73, 225]. Moreover, considering the fuzzy and ambiguous nature of human language [206], capturing users' intents requires

a very large corpus covering the variations in language usage [203]. Semantic parsing is another example of an end-to-end approach. A semantic parser aims to map utterances (natural language) to logical forms (executable code). It uses a grammar to parse the utterance, create a parse tree, and consequently derive the corresponding logical form [25, 190].

Part II - Training Utterance Acquisition

Building dialog systems requires training models or algorithms to address several natural language processing tasks such as dialogue state tracking [210, 224, 69], dialog act detection7 [176, 4, 63], and intent detection [229, 76]. Supervised techniques for any of these tasks require training data. In this thesis, we focus on training utterance acquisition for the task of intent detection, and we refer the interested reader to a survey of data-driven dialog systems [184]. Depending on the intent detection method, training samples consist of user utterances paired with intents or executable forms such as logical forms, database queries, and API (Application Programming Interface) calls (e.g., @flight.book(to=“Sydney”, from=“Houston”)) [25, 219]. Given the richness of human language, having a large and linguistically diverse set of annotated utterances is essential for effective training of intent recognition models (especially data-hungry deep-learning-based models) [190]. Ultimately, noisy training data (e.g., containing incorrectly labeled utterances) will likely lead to incorrect intent detection or entity recognition, with potentially disastrous effects [219]. Collecting such datasets is costly since a large number of user utterances is required and each utterance must be annotated by experts. We thus seek to understand the characteristics of a “good” dataset and how to obtain one efficiently. Based on an analysis of a wide range of literature and a selection of bot development tools, we survey existing approaches for obtaining labeled utterances for training intent detection methods8. We also discuss the quality and cost issues of each approach, and we identify opportunities for future research.

7 To detect the general intent of an utterance, such as classifying whether an utterance is providing information or asking a question.
8 This survey does not cover approaches for obtaining dialog-level training samples.


2.2 Problem Dimensions in Training Utterance Acquisition

As discussed earlier in this chapter, supervised intent and entity recognition methods require training user utterances which are annotated with their corresponding intents and entities (e.g., utterance=“book a flight Sydney to Houston”, intent=“flight booking”, to=“Houston”, from=“Sydney”). The costs associated with building a bot and its accuracy in serving users' requests are paramount concerns in this study. Accurate detection of users' intentions requires both effective machine learning models and high quality training data which reflect real user utterances [31]. Thus it is important to identify the properties of a high quality set of user utterances. In addition, given the huge range of language variation from person to person, obtaining a comprehensive set of user utterances even for a single intent is costly [190]. By analyzing metrics used in the literature to compare datasets for bot training, we identify the key properties of high quality datasets, both at the utterance level and for the entire corpus. At the utterance level, we identify naturalness, semantic completeness, and language correctness, as explained in Section 2.2.1. At the corpus level, the set of expressions for a given intent must also be diverse, ideally covering language variations. In the rest of this section, we discuss quality and cost dimensions9.

2.2.1 Quality

Diversity. Given the richness of natural language, diversity of user utterances is key to building effective dialog systems [190, 25]. Several metrics have been proposed to measure this, namely Type-Token Ratio (TTR) (also known as lexical diversity) [171], Paraphrase In N-gram Changes (PINC) [30], and Diversity (abbreviated as DIV) [103]. TTR calculates the ratio of unique words to the total number of words in the utterances. It rewards the use of new words without considering differences in sequences of words in utterances.

TTR = \frac{|\text{unique words}|}{|\text{all words}|}

9We do not cover “biases” in training samples since they are studied elsewhere [85, 198, 10, 134, 32, 26, 9, 119].

23 Chapter 2 Utterance Acquisition Methods: Background and the State-of-the-art

To address this limitation, PINC measures the percentage of n-gram changes between the initial utterance (u) and a collected utterance (p).

PINC(u, p) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|ngrams(n, u) \cap ngrams(n, p)|}{|ngrams(n, p)|} \right)

where N is the maximum n-gram order (usually 4), and ngrams(·) extracts the n-grams of the given sentence. The average of the PINC scores over all collected utterances is used to measure the diversity of the corpus, without considering inter-paraphrase n-gram changes. The DIV metric calculates the mean percentage of n-gram changes not only between the initial utterance and each collected utterance but also between any pair of collected utterances (C)10.

DIV(C) = \frac{1}{|C|^2} \sum_{x_i \in C} \sum_{x_j \in C} D(x_i, x_j)

D(x_i, x_j) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|ngrams(n, x_i) \cap ngrams(n, x_j)|}{|ngrams(n, x_i) \cup ngrams(n, x_j)|} \right)

Coverage (CVG) also measures how well a list of utterances covers the space of all possible utterances for an intent [103]. It is calculated by averaging the similarity score (percentage of n-gram matches) between each utterance in the real user queries (R) and its most similar utterance in the collected utterances (T).

CVG(T, R) = \frac{1}{|R|} \sum_{u \in R} \max_{x \in T} \left( 1 - D(u, x) \right)

By the definition of the Coverage metric, the higher the Coverage, the more realistic and higher quality the collected set.
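The following sketch implements TTR and the pairwise n-gram distance D(·, ·) underlying DIV; it is a rough illustration that glosses over edge cases such as sentences shorter than the maximum n-gram order:

```python
from itertools import chain

def ngrams(n: int, sentence: str) -> set:
    toks = sentence.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ttr(utterances: list[str]) -> float:
    # Type-Token Ratio: unique words over all words.
    words = list(chain.from_iterable(u.lower().split() for u in utterances))
    return len(set(words)) / len(words)

def ngram_distance(x: str, y: str, max_n: int = 4) -> float:
    # D(x_i, x_j): mean n-gram dissimilarity, as in the DIV metric above.
    total = 0.0
    for n in range(1, max_n + 1):
        gx, gy = ngrams(n, x), ngrams(n, y)
        if gx | gy:            # n-gram orders longer than both sentences are skipped
            total += 1 - len(gx & gy) / len(gx | gy)
    return total / max_n

def div(corpus: list[str]) -> float:
    # Mean pairwise distance over all (ordered) pairs, including self-pairs.
    return sum(ngram_distance(x, y) for x in corpus for y in corpus) / len(corpus) ** 2

paraphrases = ["book a flight to Houston", "get me a plane ticket to Houston"]
print(round(ttr(paraphrases), 2), round(div(paraphrases), 2))
```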

Naturalness. While it is essential to collect diverse utterances, training data should still be natural [220, 135]. Naturalness is often defined informally and measured by human evaluation, since there is no formal measure of how natural an utterance is. Informally, an utterance is called natural if it is

10 For the sake of consistency, we assume there is only a single intent; in the case of multiple intents, the average of the diversity (e.g., DIV) scores over all intents is computed.

similar to the utterances generated by real users in terms of wording and sentence structure (e.g., “find a flight” vs “I'm in quest of a flight”). In this sense, naturalness can be defined informally as the likelihood of an utterance occurring in the real world.

Semantic Completeness. An utterance is semantically incorrect when it fails to correspond to the initial intent (e.g., if we paraphrase “book a flight”, then the sentence “book a cruise holiday” is semantically incorrect) [219]. In other words, semantically incorrect utterances do not precisely express the same intent. Such utterances add noise to the training data which might confuse the classification model, resulting in unwanted actions (e.g., booking a cruise instead of a flight) [219]. With large datasets, manual verification of semantically incorrect utterances is challenging. The state-of-the-art automatic approaches for measuring the semantic similarity between utterances often rely on sentence embedding techniques (e.g., ELMo [156], BERT [47], Universal Sentence Encoder (USE) [29]) to encode sentences into vectors and compute the cosine similarity between the vectors. However, even the state-of-the-art methods often suffer from a lack of accuracy in detecting lexically similar but semantically different sentences [219].

Language Correctness. Users may make grammatical and spelling errors. Therefore, a diverse and natural set of utterances should also contain a wide variety of such errors [11, 85]. However, this also depends on the implementation of a bot. For example, utterances can be corrected by a spell corrector before detecting intents and entities, making the bot robust to typos. In such cases, having typos in the training utterances does not seem necessary. Moreover, bot users make errors which might differ from the errors appearing in utterances collected by a particular acquisition method [11, 85]. For example, given the word “flight”, bot users may make misspelling errors such as “fligt”, but they are unlikely to make typos such as “flight3423” (randomly adding characters to a word) [219]. Given that bot users do occasionally make such spelling and grammatical errors, bot developers should decide whether or not to collect such erroneous utterances.

2.2.2 Cost

Building a set of annotated user utterances requires a large set of utterances for each intent with all possible compositions of entities in user utterances [203].


For example, for a “flight booking” intent with two entities, we ideally want to collect expressions for the following cases: “book a flight”, “book a flight from Sydney”, “book a flight to Houston”, and “book a flight from Sydney to Houston”. The number of possible cases increases exponentially with the number of entities, making it expensive to obtain utterances for intents with a large number of entities. With n being the number of intent entities, the number of possible scenarios equals \sum_{k=0}^{n} \binom{n}{k} = 2^n. Moreover, dialog systems need to know whether a given user utterance falls outside their range of supported intents, and this also requires collecting out-of-topic utterances [112].

2.3 Utterance Acquisition Methods

Building conversational cognitive services often requires corpora of natural language utterances and their corresponding executable forms (1) to detect a user's intent; (2) to identify entities and parameters of the intent; (3) to automatically generate grammars in rule-based methods [6, 1, 135]; (4) to disambiguate potential parse trees generated by semantic parsers [204]; and (5) to minimize costs in crowd-powered systems [90]. Based on an analysis of a wide range of literature, we identified three main approaches for acquiring user utterances, as illustrated in Figure 2.5 and summarized in Table 2.1, and we discuss their quality and cost issues: (1) utterance acquisition from bot usage; (2) automatically generating user utterances; and (3) crowdsourcing user utterances11.

2.3.1 Utterance Acquisition from Bot Usage

One of the prevalent approaches for collecting user utterances is to obtain them directly from bot users by launching a fully functional bot or a prototype, initially trained with a small number of annotated utterances [48].

Quality Dimension

In this approach, user utterances are collected from real bot users which makes them natural. In addition, a deployed chatbot collects new utterances over time

11 It is worth noting that, in practice, a combination of these approaches is used to generate training datasets.

Table 2.1: Summary of Utterance Acquisition Methods. The table compares the approaches (bot usage: prototype and deployed bot; automatic methods: canonical utterance generation and paraphrasing; crowdsourcing: all sub-methods) along the naturalness, language correctness, semantic completeness, diversity, and cost dimensions discussed in this section.


Figure 2.5: Utterance Acquisition Methods. The taxonomy covers bot usage (prototypes, real bots); automatic approaches, comprising canonical sentence generation (sentence templates, domain-specific languages, generative grammars) and paraphrasing (rule-based, statistical, and neural-network-based); and crowdsourcing (sentence-based, including Chinese whispers and entity-image replacement; goal-based; and scenario-based, including video-based).

and consequently improves the diversity of the training samples. However, collected user utterances must be verified and annotated with intents and their entities. In addition, utterances need to be checked for grammatical errors and typos.

Cost Dimension

Building a prototype by itself can impose extra costs. This is especially true if the prototype's intent detection method (e.g., handcrafted rules) is meant to be replaced later with a completely different model (e.g., a machine-learning-based model). Moreover, if the prototype itself requires annotated utterances, collecting annotated utterances is still needed. Finally, given the quality issues mentioned above, user utterances must be quality-assessed and annotated to be useful in the training phase [219, 185, 53].

2.3.2 Automatically Generating Utterances

Automated utterance generation can be divided into two main steps: (i) creating an initial set of utterances (called canonical utterances); and (ii) automatically paraphrasing the initial set into new variations.


2.3.2.1 Canonical Utterance Generation

A canonical utterance is a basic imperative sentence (e.g., “get a flight to Sydney”) expressing an intent; canonical utterances are later paraphrased to diversify the training utterances. To generate canonical utterances, a common approach is to write sentence templates [203]. A sentence template is a canonical utterance with placeholders (“get a flight to DESTINATION”). In this approach, experts must provide seed values for the placeholders to populate canonical utterances. Likewise, domain specific languages (DSLs) are used to generate utterances; Chatito12 and Tracery13 are examples of such languages [40]. Using generative grammars is another approach, employed by semantic parsers as illustrated in Figure 2.6 [190, 171]. In this approach, logical forms are automatically generated based on expert-written generative grammars. The grammar is then used to automatically produce canonical utterances for the randomly generated logical forms [190]. Automatically generated sentences are typically very hard to understand, so heuristics are applied to make them more understandable (e.g., parameter entities in the flight booking example should be city names, not rural areas) [25].

Figure 2.6 illustrates this pipeline: a domain-specific grammar is created, random executable forms are generated, pairs of executable forms and corresponding canonical sentences are created, and the expressions are then paraphrased (e.g., @api.search(query=restaurants) paired with “search for nearby restaurants”, “give me a list of restaurants”, “find nearby eateries”).

Figure 2.6: Canonical Utterance Generation
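As a small illustration of the template approach, the sketch below expands a single hypothetical template and seed values into annotated canonical utterances; the intent label and placeholders are invented for the example:

```python
from itertools import product

# Sentence-template expansion in the spirit of DSLs such as Chatito/Tracery:
# placeholders are filled with expert-provided seed values, so each generated
# canonical utterance comes with its annotation for free.
template = "get a flight from {FROM} to {DESTINATION}"
seeds = {"FROM": ["Sydney", "Melbourne"], "DESTINATION": ["Houston", "Paris"]}

def expand(template: str, seeds: dict) -> list[dict]:
    slots = list(seeds)
    samples = []
    for values in product(*(seeds[s] for s in slots)):
        filling = dict(zip(slots, values))
        samples.append({
            "utterance": template.format(**filling),
            "intent": "flight.book",  # hypothetical intent label
            "entities": filling,
        })
    return samples

for sample in expand(template, seeds):
    print(sample)
```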

Quality Dimension

When using template or DSL approaches, the correctness of the generated canonical utterances depends heavily on the templates written by bot developers. Generative grammars and neural translators may also produce unnatural sentences containing grammatical and semantic errors [25]. However, such automatic approaches can

12https://github.com/rodrigopivi/Chatito 13http://tracery.io/

still generate an initial set of annotated utterances, despite these not necessarily being diverse and natural [25].

Cost Dimension

As mentioned earlier, generative grammars and neural translators occasionally produce low quality canonical utterances. Therefore, the common practice is to evaluate the quality of automatically generated utterances by hiring people [171, 139], albeit a costly exercise.

2.3.2.2 Automatic Paraphrasing

Paraphrasing is the task of expressing the meaning of a fragment of text using different words. It has numerous applications in natural language processing systems, such as evaluation of machine translation systems, sentence simplification, automatic plagiarism detection, text summarization, and natural language generation [104, 109, 124]. It is also used in information retrieval systems to reformulate users' queries [190, 25]. Paraphrasing is necessary for diversifying canonical utterances.

Automatic paraphrasing relies on machine translation techniques, namely Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT) [88, 50, 125]. RBMT methods rely on hand-crafted rules and knowledge bases (e.g., WordNet) to generate paraphrases [137, 152, 67]. SMT methods rely on statistical analysis of bilingual text corpora to generate paraphrasing rules (e.g., “significant quantity” → “a lot”). NMT-based approaches are built on an encoder-decoder architecture to directly paraphrase a sentence into new variations. These approaches require monolingual paraphrase datasets containing pairs of sentences and their paraphrases. However, such datasets are rare, as opposed to bilingual datasets which contain sentences in one language and their translations in another. In the absence of monolingual paraphrase datasets, the typical approach is language pivoting [125, 50]: a given utterance is translated into another language and then translated back into the source language to obtain a paraphrase [125, 88, 124]. An in-depth study of these techniques is out of the scope of this survey; the interested reader can refer to [88, 124].

Another promising line of work is finding equivalent question patterns. Notable scenarios in virtual assistants and question answering systems involve inquiries (e.g., “Is there any restaurant nearby?”, “where can I have my lunch”). Mining question answering systems has been shown to generate paraphrases for question templates [57, 50, 52]. By obtaining equivalent question patterns (e.g., “who founded <NOUN>” = “who is the owner of <NOUN>”), it is feasible to create a very large corpus of such question patterns. There are also question corpora, namely SQuAD [165], MS MARCO [140], WebQuestions [12], and WikiQA [226], waiting to be exploited by dialog systems.
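A minimal sketch of the language-pivoting idea described above follows; translate stands in for any machine translation system, and the tiny dictionary only fakes one English-German-English round trip for illustration:

```python
# Paraphrasing by language pivoting: translate the utterance into a pivot
# language and back. `translate` is a placeholder for a real MT model or API.
FAKE_MT = {
    ("en", "de", "book a flight to Houston"): "buche einen Flug nach Houston",
    ("de", "en", "buche einen Flug nach Houston"): "reserve a flight to Houston",
}

def translate(text: str, src: str, tgt: str) -> str:
    return FAKE_MT.get((src, tgt, text), text)

def pivot_paraphrase(utterance: str, pivot: str = "de") -> str:
    forward = translate(utterance, "en", pivot)  # English -> pivot language
    return translate(forward, pivot, "en")       # pivot language -> English

print(pivot_paraphrase("book a flight to Houston"))  # -> "reserve a flight to Houston"
```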

Quality Dimension

Automatically generated paraphrases suffer from several issues, such as grammatical errors and unnaturalness [215, 222]. Moreover, even the state-of-the-art models fall short of producing sufficiently diverse paraphrases [150, 80]. They also fail to produce multiple semantically correct paraphrases for a single expression [75, 215]. Current attempts to diversify generated paraphrases involve using beam search [125] in the decoding phase of the encoder-decoder models, or adding random noise to the output of the encoder to generate a new paraphrase [150]. However, preserving semantic meaning often leads to repetitions of the original utterance [214].

Cost Dimension

Automated paraphrase generation is potentially cost-free14. However, it includes hidden costs such as annotation costs. Automatically generated paraphrases need to be annotated either manually or automatically. If the entity values are not changed during paraphrasing, it is possible to automatically annotate the paraphrase; in this case, the same value is labeled with the corresponding entity from the source utterance. However, if paraphrasing changes the values of the entities, manual annotation is required. For example, given the expression “get 5 cheapest flights”, where 5 represents the number of desired items, it might be difficult to automatically detect the same entity in a paraphrase like “get a handful of cheapest flights” without expert-written rules. As such, annotation imposes extra costs.

14Not considering the need for building specific paraphrasing systems
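The sketch below illustrates the automatic annotation-transfer case described above: if the entity value still appears verbatim in the paraphrase, the label is copied over; otherwise the paraphrase is flagged for manual annotation. The naive substring check is only for illustration:

```python
# Transfer entity labels from a source utterance to a paraphrase when the
# entity values are preserved verbatim; return None when manual annotation
# (or expert-written rules) would be required.
def transfer_annotations(paraphrase: str, entities: dict):
    labels = {}
    for entity, value in entities.items():
        if value.lower() not in paraphrase.lower():
            return None  # value was rephrased; requires manual annotation
        labels[entity] = value
    return labels

source_entities = {"count": "5"}
print(transfer_annotations("get the 5 cheapest flights", source_entities))        # {'count': '5'}
print(transfer_annotations("get a handful of cheapest flights", source_entities)) # None
```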


2.3.3 Crowdsourcing User Utterances

Crowdsourcing is the practice of using the power of the crowd (a number of paid or unpaid people) to obtain information for a particular task. Crowdsourcing has been employed to obtain natural language corpora for dialog systems [11, 25, 190]. To the best of our knowledge, the first attempt to obtain paraphrases via crowdsourcing goes back to 2005, when Chklovski used gamification to collect paraphrases by asking contributors to guess paraphrases based on given hints (e.g., a few words and phrases inside the paraphrase) [34]. Later, three primary crowdsourcing strategies were suggested [203].

Sentence-based strategy. In this approach, crowd workers are asked to paraphrase a sentence into new variations [190, 25, 203, 112] (e.g., “book a flight from Sydney to Houston”). Other approaches extend the sentence-based strategy to improve the diversity of crowdsourced paraphrases; examples include Chinese whispers and entity-image replacement. Chinese whispers is a multi-round approach inspired by the children's game of the same name15. In the first round, workers paraphrase canonical utterances, but in subsequent rounds they paraphrase the paraphrases obtained in the previous rounds [139, 113]. In entity-image replacement, entities are replaced with corresponding images in the canonical utterances (e.g., replacing the word flight with an airplane image) [171]. As such, workers must paraphrase a mixture of text and images.

Goal-based strategy. In this approach, workers are given a goal (e.g., “book a flight”) and a set of possible values for its entities (From: “Sydney”, To: “Houston”) and asked to make proper sentences.

Scenario-based strategy. This approach employs storytelling. The story puts the worker in a situation in which they must perform a task by asking someone (“Your goal is to book a flight; you are in Sydney; your destination is Houston”). The worker is then asked to express themselves in the given scenario [203, 25, 112]. The main difference between this approach and the goal-based strategy is that scenario-based tasks are more explanatory, which can minimize potential ambiguities but increases the task completion time [203]. It has been reported that the crowd can perform tasks faster using the goal-based method in comparison with

15 This game is also called “telephone” in North America.


Table 2.2: Crowdsourced paraphrasing strategies

Sentence-based Strategy. Paraphrase the following sentence: “Book a flight from Sydney to Houston”

Goal-based Strategy. How would you state the following intent? Goal: “flight booking”; From: “Sydney”; To: “Houston”

Scenario-based Strategy. How would you state the following intent? “Your goal is to book a flight; you are in Sydney; your destination is Houston.”

sentence-based and scenario-based methods. Another variation of this approach (the so-called video-based strategy) is to present a video demonstrating an event (e.g., someone booking a flight ticket) [30]. Crowd workers are then asked to express the event in natural language.

Quality Dimension

The primary reason for crowdsourcing paraphrases is to acquire as many diverse utterances as possible. However, it has been shown that crowd workers are biased towards using the words in the given sentences, harming the diversity of collected paraphrases [220, 139, 95]. This phenomenon is called the priming effect: an automatic, non-conscious activation of information in memory, which is specifically related to perceptual identification of words and objects [196]. According to the priming effect, humans are more likely to reuse vocabulary they have just encountered in text or speech. Thus the priming effect may decrease the chance of obtaining diverse utterances. It has been reported that goal-based paraphrasing is less biased than the two other strategies [203]. Moreover, allowing a worker to submit multiple paraphrases for the same utterance also improves diversity [203]. However, obtaining diverse paraphrases that cover a wider range of possible user utterances requires more sophisticated crowdsourcing approaches. To overcome the priming effect, new approaches have been proposed, as mentioned earlier in this section. However, these approaches have their own limitations.


For example, entity-image replacement cannot be applied to non-entity words (e.g., abstract nouns). In the video-based approach, preparing appropriate videos is very time-consuming and costly. Using such approaches also results in a large number of semantically incorrect paraphrases [30]. When using Chinese whispers, an incorrect paraphrase may cascade into further incorrect results in subsequent iterations; this approach is thus prone to producing semantically incorrect paraphrases [95, 113].

It has been reported that up to 40% of crowdsourced paraphrases are not usable [219]. Crowdsourced paraphrases often include paraphrases which are not semantically equivalent to the given utterance (“book a flight from Sydney to Houston”). Examples include missing entity values (e.g., “book a flight to Houston”) and wrong entity values (e.g., “book a flight from NYC to Houston”) [203, 171]. Collecting paraphrases from a crowd is also not immune to unqualified workers, cheaters, or spammers [219, 225]. Some workers may generate incorrect paraphrases, especially if the given task places few constraints on the text they provide [219]. Without a gold set of tests, it is challenging to perform quality control on crowdsourced utterances, because there are countless possible paraphrases for a given utterance. Crowd workers might use improper synonyms to create a paraphrase [139], or, more dangerously, malicious workers might try to give inappropriate and misleading information [91]. Nevertheless, machine learning models trained on crowdsourced paraphrases have been shown to perform as accurately as those trained on utterances provided by real users, even though crowdsourcing might produce slightly unnatural user utterances (e.g., “I'm in quest of flights”) [44].

To address these quality issues, the quality of collected paraphrases must be assessed, imposing the extra cost of a validation task [139]. However, automated approaches such as finding outliers (potentially incorrect paraphrases) can reduce the cost of validation [113]. In outlier detection, each collected paraphrase is mapped to a vector (using a sentence embedding model), and the average vector is calculated. Paraphrases are then ranked by their distance to the average vector in ascending order, and the top-k% (for a defined threshold k) are considered valid without crowdsourced validation.
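The outlier-ranking step can be sketched as follows, with a random embed function standing in for a real sentence-embedding model and an assumed keep ratio:

```python
import numpy as np

# Embed each paraphrase, rank by distance to the mean vector, and accept the
# closest top-k%; the rest would be sent for crowdsourced validation.
def embed(sentence: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a sentence-embedding model (e.g., USE or BERT); random
    # vectors here, so the example output is not meaningful.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(dim)

def filter_outliers(paraphrases: list[str], keep_ratio: float = 0.8) -> list[str]:
    vectors = np.stack([embed(p) for p in paraphrases])
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    keep = int(len(paraphrases) * keep_ratio)
    ranked = [p for _, p in sorted(zip(distances, paraphrases))]
    return ranked[:keep]  # remaining paraphrases go to manual validation

paraphrases = ["book a flight", "reserve a flight", "purchase plane tickets", "asdf qwer"]
print(filter_outliers(paraphrases, keep_ratio=0.75))
```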

Cost Dimension

Even though crowdsourcing has made obtaining training data feasible, the quality is often unsatisfactory, requiring further cleaning. Paraphrased user utterances may contain unqualified paraphrases, which makes verification a necessity [219]. Quality control of crowdsourced paraphrases may require launching further crowdsourcing tasks to verify, correct, or remove low quality utterances [216]. Moreover, crowdsourced paraphrases must be annotated with entity labels. Since entity values in the initial utterance may be changed during paraphrasing, it is necessary to label the utterance with entities either manually or automatically. It is possible to eliminate the need for annotation by requiring crowd workers to preserve the entity values as they appear in the initial sentences [135, 11]. However, this might affect the diversity of crowdsourced paraphrases. For instance, given an initial utterance “book a room for one week” with the entity period=“one week”, a paraphrase like “book a room for 7 days” might be missed if the workers are obligated to use “one week” in the paraphrase.

It can be concluded, therefore, that the three main sources of cost in crowdsourcing user utterances are the crowdsourced paraphrasing itself, verification of the collected paraphrases, and annotating them. Crowdsourcing approaches must therefore seek methods to minimize the cost of each of these tasks.

An essential question in crowdsourcing paraphrases for a particular intent is determining when a sufficient number of utterances has been collected and no further crowdsourcing is needed. This depends on both the intent and the diversity of the crowdsourced paraphrases. Different intents require different levels of language variation [120]. For example, we need to obtain far more user utterances for a complex task like “booking a flight” than for an intent like “greetings”. A simple but effective approach is to stop crowdsourcing when the intent classifier reaches a predefined confidence threshold [11].

Another approach to minimizing the cost of crowdsourcing is to collect only high-value utterances (utterances which are more likely to improve the bot's performance) within a given budget (e.g., using a hierarchical probabilistic model [190]). The adaptive termination strategy is a similar approach based on the Coverage metric [120]. In this strategy, data collection is done in multiple rounds, and in each round the coverage between the previously collected paraphrases and the current round is calculated. This process continues until the coverage reaches a defined threshold (e.g., 0.69) [120].
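A rough sketch of this adaptive termination loop is given below; the simplified unigram-overlap distance stands in for the D(·, ·) function of Section 2.2.1, and the threshold is the example value mentioned above:

```python
# Keep crowdsourcing in rounds until the new batch is largely covered by the
# paraphrases already collected; then stop paying for further rounds.
def unigram_distance(x: str, y: str) -> float:
    # Simplified stand-in for the n-gram distance D(., .) of Section 2.2.1.
    gx, gy = set(x.lower().split()), set(y.lower().split())
    return 1 - len(gx & gy) / len(gx | gy)

def coverage(new_batch: list[str], collected: list[str]) -> float:
    if not collected:
        return 0.0
    return sum(
        max(1 - unigram_distance(u, x) for x in collected) for u in new_batch
    ) / len(new_batch)

def crowdsource_until_covered(rounds, threshold: float = 0.69) -> list[str]:
    collected: list[str] = []
    for batch in rounds:  # each round yields freshly crowdsourced paraphrases
        if coverage(batch, collected) >= threshold:
            break         # little new variation is being added
        collected.extend(batch)
    return collected

rounds = [["book a flight", "reserve a flight"],
          ["get me a flight", "book me a flight"],
          ["book a flight please", "reserve a flight now"]]
print(len(crowdsource_until_covered(rounds)))  # stops before the third round
```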


2.4 Summary and Discussion

Designing and engineering scalable and robust training data acquisition techniques for conversational bots remains a deeply challenging problem [218]. There are several aspects to this challenge, some of which have been investigated in this thesis and some of which are being investigated in complementary research studies, including privacy and ethics in dialog systems [85, 28]. In this section, we identify critical issues and directions spanning the scale and quality of intent training data acquisition.

2.4.1 Continuous Crowd-Machine Collaboration

An intent recognition task involves learning sub-tasks such as intent classification, entity recognition, and dialog act classification. Existing techniques usually consist of pipelines of crowdsourced, expert-enabled, or even (albeit simple) automated training data acquisition components, and automated techniques to develop models for these tasks. Cost-effective and scalable training of models that enable meaningful and robust conversations is key to the ubiquity and affordability of conversational bots, especially for small organizations. We envision that the next generation of training data acquisition processes should empower humans in efficient and effective data labeling, while augmenting machine-based data generation.

A key research area is that of methodologies and processes for selecting, pipelining, and tuning data acquisition tasks. As an example, the crowdsourcing approach requires a careful combination of worker selection, appropriate task design, test tasks, and termination strategies to ensure quality yet cost-effective data (such as for a customer support bot deployed for a new company or in a new domain). As another example, consider how we can estimate the error rate of a bot trained using a given data collection approach, and its consequent effects on customer (dis)satisfaction. Will automated approaches be able to mimic customer utterances? How broad and expensive does crowdsourcing need to be to provide satisfactory results, and how can we assess this? Answering these questions today is important not just from a market and cost perspective, but also to ensure the quality of the deployed bot from day one (and to avoid the kind of high-visibility errors that quickly go viral), as it is important for many businesses. Furthermore, we need methods to know when to stop crowdsourcing because we

have achieved “enough” diverse examples.

2.4.2 Integrating Quality Control and Crowdsourcing.

There is growing use of bots in critical processes, especially those which require interactions between citizens, businesses, and governments. There are significant quality control gaps and risks in bot training data acquisition. Building bots is notoriously complex, with many unsolved theoretical and technical challenges stemming from the scale of the systems, the rapid evolution of technologies, and growing concerns about the unintended, possibly costly and harmful consequences of the digital age. Indeed, empirical studies confirm that existing crowdsourcing technologies are vulnerable to both intended and unintended errors (e.g., prejudice, human bias, unfairness, inappropriate behavior) [85, 220, 219]. It is thus important to endow crowdsourcing methodologies with robust quality assessment and assurance mechanisms, which are key to their success and adoption.

Orchestrating human-machine conversations requires rich abstractions to reason about social bias, norms, and constraints (e.g., fairness, politeness, accountability). We believe that this in turn constitutes a critical and challenging issue whose investigation will require meaningful integration of evidence and methods from various disciplines, including crowdsourcing, natural language processing, machine learning, software testing and verification, as well as social and cognitive science [219, 220]. Interesting research directions include: (i) empirical studies to identify quality concerns, (ii) model and data quality assessment methods, and (iii) methods to improve the quality of crowdsourced paraphrases. For instance, the lack of diversity in user utterances obtained from the crowd may impact intent recognition models [220]. In this context, there is a need for techniques to measure and test the diversity of training data, and for paraphrasing methods that balance diversity and semantic relevance while considering other quality aspects (e.g., human bias, politeness).

Automatic quality control can not only reduce the need for the verification process, but also screen out unqualified or malicious workers. In this regard, online approaches are required to assess the quality of provided utterances at the time of paraphrasing. Early detection of unqualified workers can save resources, especially because crowdsourcing platforms lack mechanisms for not paying unqualified workers once a task is done.


2.4.3 Sharing Utterances

Knowledge graphs are key assets in many systems and are built to store information such as entities and relationships [172, 37]. The growth in the number of available intents is likely to increase the opportunity to reuse training data across domains and intents. By leveraging the combination of transfer learning and the dynamic synthesis of utterances over existing intent knowledge, the cost of bot training could be significantly reduced through reuse of training data based on the similarity and composability of intents.

There have recently been a few attempts to create shared resources. Examples include ThingPedia, a repository of applications for Internet of Things (IoT) devices built for the virtual assistant Almond [25]. It is a very small repository, but because it is built for virtual assistants its information is very useful for dialog systems. For a given API, ThingPedia contains a list of its methods, called functions; a function consists of a list of user expressions along with associated entities [25]. As another example, API-KG is a knowledge graph that offers: (i) searching for APIs that can support a desired intent/task, and (ii) a set of utterances and parameters that users may use to invoke API methods [229, 228]. Moreover, API-KG offers middleware for building bots on bot development platforms, enabling even non-expert developers to rapidly create bots for a given intent [228]. Building such knowledge graphs is beneficial for bot training because training data can be shared between bots, reducing the cost.

Prior works have prepared datasets for building both task-oriented and non-task-oriented dialog systems [184]. Available datasets are built for different tasks such as intent detection, entity resolution, speech act detection, and dialog state tracking. There are also multiple public datasets which can be used for training bots in various domains. Examples include the Second Dialog State Tracking Challenge (DSTC2) [82], the second Wizard-of-Oz dataset (WOZ2.0) [209], FRAMES [5], Machines Talking To Machines (M2M) [204], the Multi-Domain Wizard-of-Oz dataset (MultiWOZ) [55], and Schema-Guided Dialogue (SGD) [168]. Among these, SGD is the largest, containing over 16,000 dialogues spanning 16 domains. We refer the reader to [184], a survey of datasets which are publicly available for training dialogue systems.

In this context, there is a need for building more comprehensive knowledge graphs which connect intents to APIs, as well as tools to reduce the manual effort (e.g., suggesting appropriate APIs for an intent by considering APIs' Quality-of-


Service and cost) and repetitive tasks (e.g., training utterance acquisition) in training bots. For instance, many APIs share similar intents, and training utterances can be shared if similar API methods are known to bot developers. Both the Dropbox16 and Google Drive17 APIs, as an example, have methods for uploading files, and utterances can be shared between these API methods. A knowledge graph should therefore also capture the relations between APIs. Sharing user utterances in this way can cut costs [229].

2.4.4 Gamification

Gamification has been used in many fields to both engage volunteers and minimize expenses [8]. Gamification is the application of game elements (e.g., scoring, leader-boards) to encourage engagement with non-game activities (e.g., translation, paraphrasing). Interestingly, the first attempt to crowdsource user utterances was also based on a game [34]. Gamification can be considered a type of crowdsourcing in which users are motivated by the elements of the game rather than money. There are also recent attempts to gamify similar tasks such as translation (e.g., Flitto18).

In this context, gamification needs to be explored further for collecting user utterances from volunteers. The key research question is how to design engaging games that motivate players to implicitly generate training utterances without actually knowing that they are, for instance, paraphrasing. This requires careful design of genuinely interesting games, beyond merely employing game elements (e.g., leader-boards). Another consideration is to design games for various kinds of people, such as different age groups and specialties (e.g., API developers, bot developers, flight agents) with different motivations. This can help the acquisition of diverse utterances, since different categories of people (e.g., age groups, natives vs non-natives, educated vs non-educated) use different vocabularies and sentence structures [65, 22].

2.4.5 Data Programming

Data programming is a paradigm for automatically creating labeled training datasets [169]. For instance, Snorkel [170] allows the generation of large datasets

16https://www.dropbox.com/developers/documentation/ 17https://developers.google.com/drive/api/v3/manage-uploads 18https://www.flitto.com

from a large collection of documents by writing labeling functions. A labeling function is a programmatic function specified by domain experts based on heuristics, crowdsourced training sets, or available knowledge bases [170]. Using a generative model, Snorkel learns the accuracy of the labeling functions based on their overlapping and conflicting decisions. The domain experts then define a discriminative model to learn the label for each instance.

Even though data programming has not yet been used in this field, considering its success in other domains it is worth exploring its potential here. Applying data programming in this context requires: (i) finding resources which may contain potential utterances (e.g., web forums discussing flight booking), and (ii) developing labeling functions to label utterances with intents (e.g., “book-flight”). In particular, Internet users write about many different topics in social networks, question answering communities, and search engines [223]. Such resources can be used to obtain a collection of utterances in a single domain. Next, domain-dependent labeling functions can be created by experts. For example, intent-related phrases such as “want to buy” can be automatically extracted from questions posted on community sites [223]. Similarly, several studies have attempted to make use of customer reviews in the restaurant domain [144, 102, 23, 187]. Likewise, existing intent-detection models (both supervised and unsupervised) can be explored as potential labeling functions.
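As a hedged sketch of what such labeling functions might look like (the heuristics and intent names are invented, and a simple majority vote stands in for Snorkel's generative label model):

```python
# Data programming sketch: labeling functions vote on the intent of scraped
# sentences; conflicts and abstentions would normally be reconciled by a
# learned label model rather than the crude majority vote used here.
ABSTAIN = None

def lf_buy_keywords(text: str):
    return "purchase.intent" if "want to buy" in text.lower() else ABSTAIN

def lf_flight_keywords(text: str):
    return "flight.book" if "book a flight" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_buy_keywords, lf_flight_keywords]

def weak_label(text: str):
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("I want to buy a cheap ticket"))  # -> purchase.intent
print(weak_label("nice weather today"))            # -> None (abstain)
```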

2.5 Conclusion

In this chapter, we reviewed the background on dialog systems and intent detection methods. We also surveyed training utterance acquisition methods along two important dimensions, namely cost and quality. We further discussed the issues of existing approaches and provided a future outlook for efficiently collecting training utterances. As discussed in depth in this chapter, addressing these issues is difficult and challenging. In this thesis, we tackle some of them: we study quality issues in Chapter 3, propose a novel technique to promote the diversity of crowdsourced paraphrases in Chapter 4, propose a novel approach to automatically detect malicious crowd workers in Chapter 5, and propose a novel technique to automatically generate canonical utterances for APIs.

Chapter 3

An Empirical Study of Crowdsourced Paraphrases

A Study of Incorrect Paraphrases in Crowdsourced User Utterances

In this chapter, we investigate common crowdsourced paraphrasing issues and propose an annotated dataset, called Para-Quality1, for detecting such quality issues. We also investigate existing language tools (e.g., spell checkers) to provide baselines for detecting each category of issues. In a nutshell, this work presents a data-driven view of incorrect paraphrases in the bot development process and paves the way towards automatic detection of unqualified paraphrases.

The chapter is organized as follows. Section 3.1 provides a brief introduction. Section 3.2 discusses background and work related to the topics addressed in this chapter (e.g., quality control, semantic similarity between utterances). Section 3.3 presents our methodology for collecting crowdsourced paraphrases for various domains. Section 3.4 presents a taxonomy of common issues in crowdsourced paraphrases and discusses their characteristics. Section 3.5 describes our methodology for annotating the collected dataset with the common quality issues identified in Section 3.4. In Section 3.6, we examine existing tools and techniques to detect the identified categories of common issues in crowdsourced paraphrases. Finally, we provide concluding remarks in Section 3.7.

1https://github.com/unsw-cse-soc/ParaQuality


3.1 Introduction

As discussed in Chapter 2, paraphrasing canonical utterances is vital to cover the variety of ways an intent can be expressed [225]. As summarized in [127], a quality paraphrase has three components: semantic completeness, lexical difference, and syntactic difference. To obtain lexically and syntactically diverse paraphrases, crowdsourced paraphrasing has gained popularity in recent years. However, crowdsourced paraphrases need to be checked for quality, given that they are produced by unknown workers with varied skills and motivations [25, 46]. For example, spammers, malicious, and even inexperienced crowd workers may provide misleading, erroneous, and semantically invalid paraphrases [118, 25]. Quality issues may also stem from misunderstanding the intent or not covering important information such as the values of the intent parameters [190]. The common practice for quality assessment of crowdsourced paraphrases is to design another crowdsourcing task in which workers validate the output of others (see Chapter 2). However, this approach is costly, as it requires paying for the task twice, making domain-independent automated techniques a very appealing alternative [217]. Moreover, quality control is especially desirable if done before workers submit their paraphrases, since low quality workers can be screened early on without any payment. This can also allow crowdsourcing tasks to provide feedback to workers in order to assist them in generating high quality paraphrases [141, 142]. To achieve this, it is necessary to automatically recognize quality issues in crowdsourced paraphrases during the bot development process [217].

In this chapter, we investigate common paraphrasing errors when using crowdsourcing. We propose an annotated dataset called Para-Quality in which each paraphrase is labeled with error categories. Accordingly, this work presents a quantitative, data-driven study of incorrect paraphrases in the bot development process and paves the way towards enhanced automated detection of unqualified paraphrased utterances. More specifically, our contributions are two-fold:

• We obtained a sample set of 6000 paraphrases using crowdsourcing. To aim for broad diversity of samples, the initial expressions were sourced from 40 expressions of highly popular APIs from various domains. Next, we examined and analyzed these samples in order to identify a taxonomy of common paraphrase errors (e.g., cheating, misspelling, linguistic


errors). Accordingly, we constructed an annotated dataset called Para-Quality (using both crowdsourcing and manual verification), in which the paraphrases were labeled with a range of categorized errors.

• We investigated existing language tools (e.g., spell and grammar checkers, language identifiers) to detect potential errors. We formulated baselines for each category of errors to determine whether these tools are capable of automatically detecting such issues. Our experiments indicate that existing tools often have low precision and recall, and hence our results advocate the need for new approaches to the effective detection of paraphrasing issues.

3.2 Related Work

To the best of our knowledge, our work is the first to categorize paraphrasing issues and propose an annotated dataset for assessing the quality issues of crowdsourced utterances. The novel aspect of the proposed dataset is that the utterances in the dataset are labeled with the quality issues that we identify later in this chapter. Thus, the main aim of this dataset is to support the design of techniques and the training of models to automatically detect the quality issues of a given utterance. We refer the interested reader to Chapter 2 for background on dialog systems, intent detection methods, and available datasets for building dialog systems. Nevertheless, our work is related to the areas of (i) quality control in crowdsourced natural language datasets; and (ii) semantic similarity.

3.2.1 Quality Control

Quality can be assessed before or after data acquisition. While post-hoc methods evaluate quality once all paraphrases are collected, pre-hoc methods can prevent the submission of low quality paraphrases during crowdsourcing. The most prevalent post-hoc approach is launching a verification task to evaluate crowdsourced paraphrases [139, 195]. However, automatically removing misspelled paraphrases [203] and discarding submissions from workers with abnormally low or high task completion times [123] have also been applied in the literature. Machine learning models have also been explored in plagiarism detection systems to assure the quality of crowdsourced paraphrases [44, 24].


Pre-hoc methods, on the other hand, rely on online approaches to assess the quality of the data provided during crowdsourcing [141]. Sophisticated techniques are required to avoid the generation of erroneous paraphrases (e.g., automatic feedback generation has been used to assist crowd workers in generating high quality paraphrases). Precog [141] is an example of such a tool; it is based on a supervised method for generating automatic writing feedback for multi-paragraph text, designed mostly for crowdsourced product reviews [141, 142]. This chapter aims to pave the way for building automatic pre-hoc approaches and for providing appropriate online feedback to workers to assist them in generating appropriate paraphrases. However, the provided dataset can also be used for building post-hoc methods to automatically remove faulty paraphrases.

3.2.2 Semantic Similarity

Measuring the similarity between units of text plays an important role in Natural Language Processing (NLP). Several NLP tasks have been designed to cover various aspects and usages of textual similarity; examples include textual entailment, semantic textual similarity [227, 58], paraphrase detection [3, 93], and duplicate question detection [126], all of which are well studied in NLP. Moreover, recent successes in sentence encoders (e.g., Sent2Vec [145], InferSent [41], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179]) can be exploited to detect paraphrasing issues with greater accuracy. These techniques can be borrowed, with some domain-specific considerations, to build automatic quality control systems for detecting low quality paraphrases.
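For instance, a minimal similarity check between an initial utterance and a candidate paraphrase could look like the following, assuming the sentence-transformers package is installed; the model name is just one common choice, not the setup used in this thesis:

```python
# Embed both sentences with a pretrained sentence encoder and compare them
# with cosine similarity; a low score hints at a semantically divergent
# paraphrase.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(a: str, b: str) -> float:
    emb = model.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(similarity("book a flight from Sydney to Houston",
                 "reserve a plane ticket from Sydney to Houston"))
print(similarity("book a flight from Sydney to Houston",
                 "book a cruise holiday"))
```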

3.3 Paraphrase Dataset Collection

In the previous chapter, we discussed various techniques for the acquisition of training utterances for intent detection models, including crowdsourced utterances. In this section, we describe our methodology for collecting a corpus of crowdsourced utterances and identifying their quality issues, to pave the way for automatic quality control of crowdsourced utterances. Various types of paraphrasing issues have been reported in the literature, namely spelling errors [19], grammatical errors [95, 139], and missing slot values (which happen when a worker forgets to include an entity in a paraphrase) [190]. We

collected paraphrases for two main reasons: (i) to gain hands-on experience of how incorrect paraphrases are generated, and (ii) to annotate the dataset for building and evaluating paraphrasing quality control systems.

3.3.1 Methodology

We obtained 40 utterances for various intents associated with various API methods (i.e., Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Bing Image Search) indexed in ThingPedia [25] and API-KG [228]. We then launched a paraphrasing task on Figure-Eight2. Workers were asked to provide three paraphrases for a given expression [95], which is common practice in crowdsourced paraphrasing to reduce repetitive results [25, 95]. In the provided expression, parameter values were highlighted and crowd workers were asked to preserve them. Each worker's paraphrases for an initial utterance were normalized by lowercasing and removing punctuation. Next, the initial utterance and the paraphrases were compared to forbid submitting empty strings or repeated paraphrases, and checked to ensure that they contained the highlighted parameter values (which is also a common practice to avoid missing parameter values) [135]. We collected paraphrases from workers in English-speaking countries, and created a dataset containing 6000 paraphrases (2000 triple-paraphrases) in total3.
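The pre-submission checks can be sketched roughly as follows; this is an illustration of the logic rather than the actual task implementation (in our task design, parameter values were validated with regular expressions before submission):

```python
import string

# Normalize paraphrases (lowercase, strip punctuation), reject empty or
# repeated paraphrases, and require the highlighted parameter values to be
# preserved. The naive substring check stands in for the regex validation
# used in the real task.
def normalize(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def validate(initial: str, paraphrases: list[str], required_values: list[str]) -> bool:
    seen = {normalize(initial)}
    for p in paraphrases:
        norm = normalize(p)
        if not norm or norm in seen:  # empty, repeated, or identical to the source
            return False
        if any(normalize(v) not in norm for v in required_values):
            return False              # a highlighted parameter value is missing
        seen.add(norm)
    return True

print(validate("Book a flight from Sydney to Houston",
               ["Get me a flight from Sydney to Houston",
                "I need to fly from Sydney to Houston"],
               ["Sydney", "Houston"]))  # -> True
```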

3.4 Common Paraphrasing Issues

To characterize the types of paraphrasing issues, two authors of this chapter investigated the crowdsourced paraphrases and identified five primary categories of paraphrasing issues. We only considered paraphrase-level issues related to the validity of a paraphrase, without considering dataset-level quality issues such as lexical diversity [139] and bias [85].

3.4.1 Spelling Errors

Misspelling has been reported as one of the most common mistakes in paraphrasing [92, 203, 34, 19]. In our sample set, we also noticed that misspellings were generated both intentionally (as an act of cheating to quickly generate a paraphrase, such

2https://www.figure-eight.com 3https://github.com/unsw-cse-soc/ParaQuality


1. Correct
   Expression: Create a public playlist named new_playlist
   Paraphrase: Make a public playlist named new_playlist

2. Spelling Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: Estimate the taxi fare from the airport to hom

3. Spelling Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: Tell me about the far from airport to home

4. Spelling Errors
   Expression: Where should I try coffee near Newtown?
   Paraphrase: Find cafes near Newtown

5. Linguistic Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: How much for taxi from airport to home

6. Semantic Errors
   Expression: Are the burglar alarms in the office malfunctioning?
   Paraphrase: Is the burglar alarm faulty in our work place?

7. Semantic Errors
   Expression: Are the burglar alarms in the office malfunctioning?
   Paraphrase: Are the office alarms working?

8. Semantic Errors
   Expression: Is the TV in the house off?
   Paraphrase: Can you turn off the TV in the house if it's on?

9. Task Misunderstanding
   Expression: Estimate the taxi fare from the airport to home?
   Paraphrase: Airport to home is $50

10. Cheating
    Expression: Request a taxi from airport to home
    Paraphrases: A taxi from airport to home / Taxi from airport to home / From airport to home

11. Cheating
    Expression: I want reviews for McDonald at Kensington st.
    Paraphrases: I want reviews g for McDonald at Kensington st. / I want for reviews for McDonald at Kensington st. / I want reviegws for McDonald at Kensington st.

12. Cheating
    Expression: I want reviews for McDonald at Kensington st.
    Paraphrases: I want to do reviews for McDonald's in Kensington / I would like to do reviews for McDonald's in Kensington / I could do reviews for McDonald's in Kensington

13. Cheating
    Expression: Create a public playlist named NewPlaylist
    Paraphrases: That song hits the public NewPlaylist this year / Public really loved that NewPlaylist played on the event / Public saw many NewPlaylist this year

14. Cheating
    Expression: Estimate the taxi fare from the airport to home
    Paraphrases: What is the fare of taxi from airport to home / Tell me about fare from airport to home / You have high taxi fare airport to home

Table 3.1: Paraphrase Samples



3.4.2 Linguistic Errors

Linguistic errors are also common in crowdsourced natural language collections [95, 139]. These include verb errors, preposition errors, vocabulary errors (improper word substitutions), and incorrect singular/plural nouns, to name a few. Moreover, capitalization and article errors seem abundant (e.g., Example 5 in Table 3.1). Given that real bot users also make such errors, it is important to have linguistically incorrect utterances in the training samples [11]. However, at the very least, detecting linguistic errors can contribute to quality-aware selection of crowd workers.

3.4.3 Semantic Errors

Semantic errors occur when a paraphrase deviates from the meaning of the initial utterance (e.g., "find cafes in Chicago"). As reported in various studies, workers may forget to mention parameter values (also known as missing slots) (e.g., "find cafes")4 [44, 190, 171, 203, 19], provide wrong values (e.g., "find cafes in Paris") [190, 171, 203, 139, 19], or add unmentioned parameter values [203] (e.g., "find two cafes in Chicago"). Workers may also incorrectly use a singular noun instead of its plural form, and vice versa. For instance, in Example 6 of Table 3.1, the paraphrase only asks for the status of one specific burglar alarm while the expression asks for the status of all burglar alarms. Mistakes in paraphrasing complementary forms of words also exist in the crowdsourced dataset. For instance, in Example 7 of Table 3.1, assuming that the bot answers the question only by saying "YES" or "NO", the answer for the paraphrase differs from that of the expression; however, it would make no difference if the bot's response were more descriptive (e.g., "it's working", "it isn't working"). Finally, some paraphrases significantly diverge from the expressions. For instance, in Example 8 of Table 3.1, the intent of the paraphrase is to turn off the TV, whereas that of the initial utterance is to query the TV's status.

3.4.4 Task Misunderstanding

In some cases, workers misunderstood the task and provided translations in their own native languages (referred to as Translation issues) [44, 19, 11], and some

4In our task design, this type of error cannot happen since parameter values are checked using regular expressions before submission

mistakenly thought they should provide answers for expressions phrased as questions (referred to as Answering issues), such as Example 9 in Table 3.1. This occurred even though workers were provided with comprehensive instructions and examples. Setting aside the possibility of cheating, we infer that some workers simply did not read the instructions.

3.4.5 Cheating

Collecting paraphrases through crowdsourcing is not immune to unqualified workers, cheaters, or spammers [46, 44, 34]. Detecting malicious behaviour is vital because even constructive feedback may not guarantee quality improvements when workers act carelessly on purpose. Cheating is thus considered a special case of Semantic Error which is done intentionally. It is difficult even for experts to detect whether someone is cheating or unintentionally making mistakes. However, it becomes easier when we consider all three paraphrases written by a worker for a given expression at once. For example, in Example 10 of Table 3.1, the malicious worker removes words one by one to generate new paraphrases. In this example, we also notice that it is still possible for a cheater to produce a valid paraphrase accidentally, such as the first paraphrase in Example 10. Workers may also start providing faulty paraphrases after generating some correct paraphrases, as shown in Example 14 of Table 3.1. Based on our observations, the simplest mode of cheating is to add a few random characters to the source sentence, as shown in Example 11. Next is adding a few words to the source sentence without much editing, as shown in Example 12. Finally, there are cheaters who rewrite and change the sentences substantially in a very random way, such as in Example 13.
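As a rough illustration of how a worker's full triple can expose trivial-edit cheating (Examples 10 and 11), the sketch below flags triples whose paraphrases are near-verbatim copies of the source, using a character-level similarity ratio. The heuristic and its threshold are illustrative assumptions on our part, not the classifiers evaluated later in this chapter.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two strings (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_trivial_edit_cheating(utterance: str, paraphrases: list[str],
                                     threshold: float = 0.9) -> bool:
    """Flag a triple when every paraphrase is an almost verbatim copy of the source."""
    return all(similarity(utterance, p) >= threshold for p in paraphrases)

# Example 11-style triple: random characters injected into the source sentence.
print(looks_like_trivial_edit_cheating(
    "I want reviews for McDonald at Kensington st.",
    ["I want reviews g for McDonald at Kensington st.",
     "I want for reviews for McDonald at Kensington st.",
     "I want reviegws for McDonald at Kensington st."],
))  # -> True
```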

3.5 Dataset Annotation

We designed another crowdsourcing task to annotate the collected paraphrases according to the categories of issues devised above, namely using the following labels: Correct, Semantic Error, Misspelling, Linguistic Error, Translation, Answering, and Cheating. We split the category of misunderstanding issues into Translation and Answering because they require different detection methods.


3.5.1 Methodology

In the annotation task, crowd workers were instructed to label each paraphrase with its paraphrasing issues. Next, to further increase the quality of the annotations5, two authors of this chapter manually re-annotated the paraphrases to resolve disagreements between crowd annotators. Moreover, contradictory labels (e.g., a paraphrase cannot be labeled both Correct and Misspelling simultaneously) were checked to ensure consistency. The overall Kappa test showed high agreement between the annotators [128], with Kappa = 0.85; Table 3.2 shows the pairwise inter-annotator agreement [38]. Finally, the authors discussed and revised the re-annotated labels to resolve any remaining disagreements.

Label              Kappa

Correct            0.900
Misspelling        0.972
Linguistic Errors  0.879
Translation        1.000
Answering          0.855
Cheating           0.936
Semantic Errors    0.833

Table 3.2: Pairwise Inter-Annotator Agreement

3.5.2 Statistics

Figure 3.1 shows the frequencies of each label in the crowdsourced paraphrases, as well as their co-occurrences, in an UpSet plot [116] generated using Intervene [105]. Accordingly, we infer that only 61% of paraphrases are labeled Correct. This plot also shows how many times two labels co-occurred. For example, all paraphrases which are labeled Translation (24 times) are also labeled Cheating6.

5because of weak agreement between crowd workers
6We used Google Translate to check whether they were proper translations or just random sentences in other languages


[Figure 3.1 is an UpSet plot of label frequencies and their co-occurrences. Overall frequencies: Correct 3682 (61%), Cheating 1059 (18%), Linguistic Error 900 (15%), Semantic Error 714 (12%), Misspelling 278 (5%), Answering 148 (2%), Translation 24 (0%).]

Figure 3.1: Dataset Label Statistics

3.6 Automatic Error Detection

Automatically detecting paraphrasing issues, especially during the crowd task, can minimize the cost of crowdsourcing by eliminating malicious workers, reducing the number of erroneous paraphrases, and eliminating the need to launch another crowdsourced validation task. Moreover, by detecting Misspelling and Linguistic Errors, users can be provided with proper feedback to help them improve the quality of paraphrasing, by showing the source of the error and suggestions to address it (e.g., "Spelling error detected: articl → article"). Detecting Semantic Errors, such as missing parameter values, can also help crowd workers generate high-quality correct paraphrases. Automated methods can also be used to identify low-quality workers, and particularly cheaters, who may intentionally generate potentially large numbers of invalid paraphrases. Providing suggestions to cheaters will not help, and therefore early detection is of paramount importance. In a pre-hoc quality control approach for crowdsourced paraphrases, the most important metric seems to be the precision of detecting invalid paraphrases [141]. That is because the main aim of using such a quality control approach is rejecting invalid paraphrases without rejecting correct ones [24].

This is essential because rejecting correct paraphrases would be unfair and unproductive. For instance, sincere and trustworthy crowd workers might not get paid as a result of false positives (incorrectly detected errors). On the other hand, having high recall in detecting invalid paraphrases is important to eliminate faulty paraphrases and consequently obtain robust training samples. Moreover, such a quality control technique should ideally be domain-independent, accessible, and easy to operate, to minimize the cost of customization for a particular domain and the need for paid experts (e.g., an open-source pre-built machine learning model). In the rest of this section, we examine current tools and approaches and discuss their effectiveness in assessing the paraphrasing issues.

3.6.1 Spelling Errors

We employed several spell checkers, as listed in Table 3.3, to examine whether they are effective in recognizing spelling errors. We searched GitHub and ProgrammableWeb7, among other sources, to find available tools and APIs for this purpose.

Spell Checker       Precision  Recall  F1

Aspell8             0.249      0.618   0.354
Hunspell9           0.249      0.619   0.355
MySpell10           0.249      0.619   0.355
Norvig11            0.488      0.655   0.559
Ginger12            0.540      0.719   0.616
Yandex13            0.571      0.752   0.650
Bing Spell Check14  0.612      0.737   0.669
LanguageTool15      0.630      0.727   0.674

Table 3.3: Comparison of Spell Checkers

7https://www.programmableweb.com
8http://aspell.net/
9http://hunspell.github.io/
10http://www.openoffice.org/lingucomponent/dictionary.html
11https://github.com/barrust/pyspellchecker
12https://www.gingersoftware.com/grammarcheck
13https://tech.yandex.ru/speller/
14https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/
15https://languagetool.org


Even though detecting misspelled words seems easy with existing automatic spell checkers, they fall short in a few cases. This can also be concluded from Table 3.3 by considering the precision and recall of each tool in detecting only paraphrases with misspellings. For instance, spell checkers are often unable to identify homonyms [155], incorrectly flag proper nouns and unusual words [14], and sometimes do not identify wrong words that are properly spelled [33]. For instance, in Example 1 of Table 3.1, "new_playlist" is incorrectly detected as a misspelled word by LanguageTool (the best performer in Table 3.3). In Example 3, the word "far" is not detected even though the worker has misspelled the word "fare". In Example 4, the word "Newtown" (a suburb in Sydney) is mistakenly detected as a misspelling. Some of these deficiencies can be addressed. For instance, assuming that the initial expressions given to the crowd are free of typos, we can ignore false positives like "Newtown" and "new_playlist".
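A minimal sketch of this idea, assuming the pyspellchecker package (the Norvig-style checker listed in Table 3.3) is available: tokens that already occur in the initial expression are whitelisted, so placeholders and proper nouns such as "new_playlist" and "Newtown" are not reported. This is an illustration of the workaround, not the exact implementation used in our experiments.

```python
# pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

def misspellings(expression: str, paraphrase: str) -> dict:
    """Return {misspelled_word: suggested_correction}, ignoring any token that
    already occurs in the initial expression (assumed to be typo-free)."""
    whitelist = {t.lower() for t in expression.split()}
    tokens = [t.strip(".,!?").lower() for t in paraphrase.split()]
    candidates = [t for t in tokens if t and t not in whitelist]
    return {w: spell.correction(w) for w in spell.unknown(candidates)}

# Example 2 from Table 3.1: "hom" is flagged, while "home" in the expression is not.
print(misspellings("Estimate the taxi fare from the airport to home",
                   "Estimate the taxi fare from the airport to hom"))
```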

3.6.2 Linguistic Errors

We investigated how well grammar checkers perform in detecting linguistic errors. We employed several grammar checkers, as listed in Table 3.4. Our experiments show that grammar checkers have both low precision and low recall. Perelman [155] also conducted several experiments with major commercial and non-commercial grammar checkers and identified that grammar checkers are unreliable. Based on our observations, grammar checkers often fail to detect linguistic errors, as shown in Table 3.4. Examples include improper use of words (e.g., "Who is the latest scientific article of machine learning?"), random sequences of words generated by cheaters (e.g., "Come the next sing"), and missing articles16 (e.g., "I'm looking for flight for Tehran to Sydney"). Given these examples, we believe that language models can be used to measure the likelihood of a sequence of words in order to detect whether it is linguistically acceptable.

16Missing articles in expressions similar to newspaper headlines are not considered errors in the dataset (e.g., "Hotel near Disneyland")
17https://www.afterthedeadline.com
18https://www.grammarbot.io
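As a sketch of the language-model idea, the snippet below scores sentences with GPT-2 perplexity via the Hugging Face transformers library. This was not part of the experiments reported in this chapter, and any threshold for flagging a paraphrase as linguistically unacceptable would be an assumption that needs tuning.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Per-token perplexity under GPT-2; higher values suggest less fluent text."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return torch.exp(loss).item()

# A fluent paraphrase should score much lower than a random word sequence.
print(perplexity("How much is the taxi fare from the airport to home?"))
print(perplexity("Come the next sing"))
```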


Grammar Checker  Precision  Recall  F1

AfterDeadline17  0.228      0.069   0.106
Ginger           0.322      0.256   0.285
GrammarBot18     0.356      0.139   0.200
LanguageTool     0.388      0.098   0.156

Table 3.4: Comparison of Grammar Checkers

3.6.3 Translation

We also investigated several language detectors to evaluate how well they perform when crowd workers use another language instead of English. The results of the experiment in Table 3.5 indicate that these tools detect almost all sentences in other languages. However, they produce many false positives, including for correct English sentences (e.g., "play next song"). As a result, the tools in our experiment have low precision in detecting languages, as shown in Table 3.5. Most of the false positives are caused by sentences that contain unusual words such as misspellings and named entities (e.g., "Phil saying I got you"). One possible approach to improving the precision of such tools and APIs is to check whether a given paraphrase has spelling errors prior to using language detection tools. We therefore extended DetectLanguage (the best performing tool) by adding a constraint: a sentence is not considered to be written in another language unless it has at least two spelling errors. This constraint is based on the assumptions that spell checkers treat foreign words as spelling errors and that a sentence has at least two words. This approach (DetectLanguage+ in Table 3.5) significantly reduced the number of false positives and thus improved precision.
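A minimal sketch of the DetectLanguage+ constraint, assuming the open-source langdetect and pyspellchecker packages in place of the commercial DetectLanguage API:

```python
# pip install langdetect pyspellchecker
from langdetect import detect
from spellchecker import SpellChecker

spell = SpellChecker()

def is_foreign_language(paraphrase: str) -> bool:
    """Heuristic from Section 3.6.3: trust the language detector only when the
    sentence also has at least two words unknown to an English spell checker."""
    tokens = [t.strip(".,!?").lower() for t in paraphrase.split()]
    spelling_errors = len(spell.unknown([t for t in tokens if t]))
    return detect(paraphrase) != "en" and spelling_errors >= 2

print(is_foreign_language("play next song"))             # False despite detector noise
print(is_foreign_language("busca un restaurante cerca"))  # True
```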

3.6.4 Answering to Canonical Utterance

Dialog Acts (DAs) [100], also known as speech acts, represent the general intent of an utterance. DA tagging systems label utterances with a predefined set of utterance types (Directive, Commissive, Informative, etc.) [130]. Since DAs must remain consistent during paraphrasing, we employed a state-of-the-art, domain-independent, pre-trained DA tagger proposed in [130].

19https://fasttext.cc/blog/2017/10/02/blog-post.html [98]
20https://pypi.org/project/langdetect/
21https://github.com/saffsd/langid.py
22https://console.bluemix.net/apidocs/language-translator#identify-language
23https://ws.detectlanguage.com


Language Detector     Precision  Recall  F1

FastText19            0.072      1.000   0.135
LangDetect20          0.080      0.917   0.147
LanguageIdentifier21  0.080      0.917   0.147
IBM Watson22          0.170      0.958   0.289
DetectLanguage23      0.344      0.917   0.500
DetectLanguage+       0.909      1.000   0.952

Table 3.5: Language Detection

For example, if an initial utterance is a question (e.g., "are there any cafes nearby?"), it is acceptable to paraphrase it into a directive sentence (e.g., "find cafes nearby."), but its speech act cannot be informative (e.g., "there is a cafe on the corner."). Overall, due to the lack of any other domain-independent DA tagger for English, we only investigated this tagger. We found that it has a precision of 2% and a recall of 63%, which shows that detecting speech acts is a very challenging task, especially in domain-independent settings. Advances in speech act detection and the availability of public speech act datasets can assist in detecting this category of paraphrasing issues. Moreover, it is feasible to automatically generate pairs of questions and answers by mining datasets from the fields of Question Answering and dialog systems. Automatically building such pairs can help build a dataset diverse enough to be used in practice, and such a dataset can be fed into deep learning algorithms to yield better performance in detecting Answering issues.
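Table 3.6 later mentions simple functions for detecting questions, imperative sentences, and answering; the sketch below shows one such crude, rule-based consistency check. It is only an illustration of the idea and is far weaker than the pre-trained DA tagger of [130]; the keyword lists are our own assumptions.

```python
def crude_speech_act(utterance: str) -> str:
    """Very rough speech-act guess: 'question', 'directive', or 'informative'."""
    text = utterance.strip().lower()
    first = text.split()[0] if text else ""
    if text.endswith("?") or first in {"what", "where", "when", "who", "how",
                                       "is", "are", "do", "does", "can", "could"}:
        return "question"
    if first in {"find", "search", "get", "show", "list", "play", "create",
                 "estimate", "request", "tell", "make"}:
        return "directive"
    return "informative"

def acts_compatible(expression: str, paraphrase: str) -> bool:
    """Questions and directives may be interchanged; an informative paraphrase
    of a question/directive (e.g., an answer) is treated as a mismatch."""
    allowed = {"question", "directive"}
    a, b = crude_speech_act(expression), crude_speech_act(paraphrase)
    return (a in allowed and b in allowed) or a == b

# Example 9 from Table 3.1: the worker answered the question instead of paraphrasing.
print(acts_compatible("Estimate the taxi fare from the airport to home?",
                      "Airport to home is $50"))  # -> False
```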

3.6.5 Semantic Errors & Cheating

To the best of our knowledge, there is not yet an approach to distinguish between categories of semantically invalid paraphrases. Paraphrase detection and semantic textual similarity (STS) methods are designed to measure how semantically similar two pieces of text are. However, they do not differentiate between the types of errors (e.g., Cheating, Answering, Semantic Errors) that arise in our setting.


Category: N-gram Features (12 features)
  N-gram overlap, exclusive longest common prefix n-gram overlap, and SUMO, all proposed in [96], as well as Gaussian, Parabolic, and Trigonometric, proposed in [42], Paraphrase In N-gram Changes (PINC) [30], Bilingual Evaluation Understudy (BLEU) [149], Google's BLEU (GLEU) [213], NIST [49], Character n-gram F-score (CHRF) [160], and the length of the longest common sub-sequence.

Category: Semantic Similarity (15 features)
  Semantic Textual Similarity [58], Word Mover's Distance [110] between word embeddings of expression and paraphrase, cosine similarity and Euclidean distance between vectors of expression and paraphrase generated by Sent2Vec [145], InferSent [41], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179], tenses of sentences, pronouns used in the paraphrase, and mismatched named entities.

Category: Others (11 features)
  Number of spelling and grammatical errors detected by LanguageTool, task completion time, edit distance, normalized edit distance, word-level edit distance [58], length difference between expression and paraphrase (in characters and words), and simple functions to detect questions, imperative sentences, and answering.

Table 3.6: Summary of Feature Library

As such, these techniques are not directly applicable. In the rest of this section, we focus on building machine learning models to detect the paraphrasing errors.

Classifier       Precision  Recall  F1

Random Forest    0.947      0.129   0.226
Maximum Entropy  0.564      0.157   0.246
Decision Tree    0.527      0.350   0.421

Table 3.7: Automatic Answering Detection


Classifier     Precision  Recall  F1

Random Forest  0.798      0.120   0.209
Decision Tree  0.377      0.276   0.319
Naive Bayes    0.171      0.783   0.280

Table 3.8: Automatic Semantic Error Detection

Classifier          Precision  Recall  F1

SVM                 0.878      0.223   0.356
K-Nearest Neighbor  0.871      0.248   0.386
Random Forest       0.843      0.546   0.663
Maximum Entropy     0.756      0.440   0.557
Decision Tree       0.632      0.566   0.597
Naive Bayes         0.473      0.426   0.449

Table 3.9: Automatic Cheating Detection

For this purpose, we used 38 established features from the literature, as summarized in Table 3.6. Using these features and Weka [77], we built various classifiers to detect the following paraphrasing issues: Answering, Semantic Errors, and Cheating. We chose to test the five classification algorithms applied in the paraphrasing literature as mentioned in [24]: C4.5 Decision Tree, K-Nearest Neighbor (K=50), Maximum Entropy, Naive Bayes, and Support Vector Machines (SVM), using the default Weka 3.6.13 parameters for each algorithm. We also experimented with the Random Forest algorithm since it is a widely-used classifier. We did not apply deep learning based classifiers directly due to the lack of expressions in the collected dataset, which seems essential for developing domain-independent classifiers. While our dataset is reasonably large, it contains only 40 expressions (each having 150 paraphrases). Given that deep learning techniques are data-hungry [73, 225], many more expressions would be needed to use these kinds of models and eliminate the burden of manual feature engineering. Instead, we benefited from state-of-the-art sentence encoders via transfer learning, as listed in Table 3.6.
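To make the feature library concrete, the sketch below computes a few of the Table 3.6 features for an (expression, paraphrase) pair. The sentence encoder is stubbed out with a toy bag-of-characters vector so the example is self-contained; in the actual experiments, pre-trained encoders such as the Universal Sentence Encoder or Sent2Vec would supply the vectors.

```python
# pip install nltk numpy
import numpy as np
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def encode(sentence: str) -> np.ndarray:
    """Stub for a sentence encoder; a toy bag-of-characters vector stands in
    for a real model so this example runs without extra downloads."""
    vec = np.zeros(128)
    for ch in sentence.lower():
        vec[ord(ch) % 128] += 1
    return vec

def features(expression: str, paraphrase: str) -> dict:
    """A small subset of the feature library in Table 3.6."""
    e_tok, p_tok = expression.lower().split(), paraphrase.lower().split()
    e_vec, p_vec = encode(expression), encode(paraphrase)
    cosine = float(np.dot(e_vec, p_vec) / (np.linalg.norm(e_vec) * np.linalg.norm(p_vec)))
    return {
        "bleu": sentence_bleu([e_tok], p_tok, smoothing_function=SmoothingFunction().method1),
        "char_edit_distance": nltk.edit_distance(expression, paraphrase),
        "length_diff_words": abs(len(e_tok) - len(p_tok)),
        "cosine_similarity": cosine,
    }

print(features("Estimate the taxi fare from the airport to home",
               "How much for taxi from airport to home"))
```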


Tables 3.7, 3.8, and 3.9 show the performance of various classifiers (excluding classifiers with an F1 score below 0.2) for each of the paraphrasing issues using 10-fold cross validation. To keep the classifiers domain-independent, we split the dataset based on the expressions, without sharing any paraphrases of a single expression between the test and train samples. It can be seen that automatically detecting these quality issues is very challenging; even the best performing classifier has a very low F1 score, especially for detecting Answering and Semantic Error issues. Based on manual exploration, we also found that the classifiers fail to recognize complex cheating behaviours such as Example 13 in Table 3.1, as discussed in Section 3.4. Therefore, new approaches are required to accurately detect paraphrasing issues. Based on our explorations and a prior work [127], we postulate that accurately detecting linguistic errors such as grammatically incorrect paraphrases can play an indispensable role in detecting cheating behaviours. Moreover, advances in measuring semantic similarity between sentences can help differentiate between semantically invalid paraphrases and correct ones.
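The expression-level split can be reproduced with scikit-learn's GroupKFold, where the group of each paraphrase is the id of its source expression. In the sketch below the feature matrix and labels are random placeholders, so the printed score is meaningless; only the evaluation protocol is illustrated.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 38))           # placeholder: 38 features per paraphrase
y = rng.integers(0, 2, size=600)         # placeholder: 1 = cheating, 0 = otherwise
groups = np.repeat(np.arange(40), 15)    # placeholder: expression id of each paraphrase

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean F1 over expression-level folds: {np.mean(scores):.3f}")
```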

3.6.6 Incorrect Paraphrases Detection

We also assessed the performance of detecting incorrect paraphrases regardless of their categories. In this setting, we labeled all incorrect sentences with a single label ("Incorrect"). Table 3.10 shows the performance of various classifiers. Detecting incorrect paraphrases is useful for post-hoc quality control: it removes incorrect paraphrases after crowdsourcing and consequently eliminates the need for a crowdsourced validation task.

Classifier          Precision  Recall  F1

K-Nearest Neighbor  0.799      0.341   0.478
Random Forest       0.781      0.551   0.646
Maximum Entropy     0.721      0.489   0.583
SVM                 0.709      0.289   0.411
Decision Tree       0.633      0.585   0.608
Naive Bayes         0.574      0.557   0.565

Table 3.10: Automatic Incorrect Paraphrase Detection


3.7 Conclusion

In this chapter, we employed a data-driven approach to investigate and quantitatively study various crowdsourced paraphrasing issues. We discussed how automatic techniques for detecting various quality issues can assist the manual process of crowdsourced paraphrasing. We collected an annotated dataset of crowdsourced paraphrases in which each paraphrase is labeled with its associated paraphrasing issues. We used this dataset to assess existing tools and techniques and to determine whether they are sufficient for automatically detecting such issues. Our experiments revealed that automated detection of errors in paraphrases is a challenging task. In particular, detecting semantic divergence is difficult, especially determining whether it is intentional or unintentional (cheating behaviour vs. semantic error). While the proposed dataset has been annotated by two expert annotators, its labels may still be disputable, because it is difficult even for experts to come to a definite conclusion about the quality issues of a single utterance. There might also be other categories of quality issues which have not been seen in the collected dataset or reported in related work. An interesting extension to this work is devising automation-assisted methods for the detection of paraphrasing issues. This would be based on a two-way feedback mechanism: generating feedback for workers while the system simultaneously learns from the (data of) users to improve its machine intelligence. In time, we envision increasingly less dependence on users.

Chapter 4

Diversity-aware Crowdsourced Utterances

Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances

This chapter provides our solution to tackle the diversity problem in crowdsourced paraphrases. We leverage the priming effect as an opportunity rather than a threat to promote the diversity of crowdsourced paraphrases. We dynamically generate word suggestions to motivate crowd workers towards producing diverse utterances. The key challenge is to make suggestions that can improve diversity without resulting in semantically invalid paraphrases. To achieve this, we propose a probabilistic model that generates continuously improved versions of word suggestions that balance diversity and semantic relevance. Our experiments show that the proposed approach improves the diversity of crowdsourced paraphrases. In Section 4.1, we introduce the issues and the contributions we make. We discuss related work in Section 4.2. In Section 4.3, we describe the proposed techniques for diversifying crowdsourced paraphrases. In Section 4.4, we evaluate the proposed techniques and discuss the results. Finally, we provide concluding remarks and discuss future work in Section 4.5.

4.1 Introduction

As discussed in Chapter 2, building task-oriented bots requires processing a given user utterance (e.g., “search for a restaurant near the university”) to identify the

user's intent (e.g., business search) along with its entities (e.g., business= "restaurant", location= "university"). The success of intent recognition models heavily relies on obtaining large and high-quality corpora of annotated utterances [217]. An annotated utterance (e.g., "search for a restaurant near the university" where intent="business search", business="restaurant", and location="university") is a user utterance labeled with a specific intent and its corresponding entities. As discussed thoroughly in Chapters 2 and 3, obtaining training utterances typically involves two main steps: (i) obtaining a canonical utterance, and (ii) paraphrasing it into multiple variations [219, 225, 203]. Paraphrasing is necessary since having a diverse set of utterances in the training set can better represent the different ways in which people may specify an intent, especially given the ambiguous and flexible nature of human language [206, 217]. Crowdsourced paraphrasing is an effective and inexpensive way to obtain training utterances, as discussed in Chapters 2 and 3. However, outcomes from crowdsourcing tasks must be checked for quality since they are produced by workers with unknown or varied skills and motivations (see Chapter 3) [46]. For example, spammers, malicious or inexperienced workers can provide misleading and erroneous paraphrases [219]. We discussed various quality aspects of crowdsourced utterances in the previous chapter as well as in Chapter 2. This chapter focuses on a specific but important quality issue in crowdsourced paraphrases: the lack of diversity in the user utterances obtained from the crowd (used for training bots). Indeed, research has shown that crowd workers are biased towards the vocabulary and structure used in the initial sentences provided to them when performing the paraphrasing task, thereby negatively impacting the diversity of training utterances [203, 171, 25]. As mentioned in Chapter 2, bias towards the vocabulary and structure of the sentence to be paraphrased can be explained by the priming effect: an automatic, implicit and non-conscious activation of information in memory [86]. According to the priming effect, exposure to a stimulus affects responses to a subsequent stimulus (e.g., when asked to name a word starting with "str", humans are more likely to form the word "strong" than "street" if they have previously been shown the word "strong") [211]. As such, primed by the words in the given utterance, crowd workers are more likely to use the same vocabulary when paraphrasing [171, 95]. Thus, the priming effect may negatively impact the diversity of collected paraphrases.


[Figure 4.1 sketches the word-recommendation loop: text alignment produces expanded word lists, which are ranked into a recommendation word cloud (e.g., "eating_place", "resto", "eatery" for "search for a restaurant"); implicit feedback from newly collected paraphrases updates the ranking.]

Figure 4.1: Word-Recommendation Overview

In this work, we hypothesize and demonstrate that recommending words/phrases can positively prime crowd workers to use new vocabulary in their paraphrases. In other words, we leverage the priming effect itself to devise diversity-enhancing paraphrasing techniques, countering the negative impact of priming by the words inside the given utterance to be paraphrased. Inspired by automated paraphrase generation techniques, while also considering priming effects, we propose a novel hybrid (automated-crowdsourced) approach that dynamically suggests words/phrases to workers to assist them in recalling words. The suggestions can potentially improve the diversity of paraphrasing for a given intent. Suggesting words is challenging because we need to ensure that suggestions do not promote semantically invalid paraphrases. For example, if the intent is to "find a restaurant", suggesting words such as "kitchen" and "counter" (while related to "restaurant") would result in the generation of a paraphrase such as "find a counter" which does not convey the same intent. Therefore, a key challenge of dynamically generating these suggestions is recommending words/phrases that will improve diversity without resulting in semantically invalid paraphrases. We contribute a novel framework (depicted in Figure 4.1) combining techniques from dynamic word list expansion [60] and implicit relevance feedback [99] to provide diverse and semantically relevant training utterances for a given intent. More specifically, we make the following contributions:

• A novel word list expansion technique to improve diversity of training utterances. We formalize the generation of alternative words/phrases as a problem of expanding a list of seed words/phrases extracted from an utterance (to be paraphrased) and its previously collected paraphrases.


Our approach leverages word alignment [21] and word embedding [131] techniques. Word alignment is the task of identifying translation relationships between two sentences. For example, word alignment between "search for a restaurant near the university" and "find a restaurant close to the university" results in the following translation relationships (synonym sets): "search for" → "find", and "near" → "close to". Aligning words allows deriving a more accurate sense for each word, since a word may often have multiple meanings depending on its context. For instance, the word "near" may also mean "almost" or "approach" in different sentences. However, using word alignment, we can derive words like "close to" to understand the correct word sense. Word embedding can then be used to further enrich the synonym sets by mapping words to semantically similar candidates (e.g., "neighbourhood", "in close proximity") [154, 131, 59].

• A novel probabilistic model to recommend diverse but semantically relevant words. Unsupervised techniques such as word embedding can generate a large number of related words. Moreover, even when combined with techniques such as word alignment to help generate a highly related word list, the output may still not be perfect. As a simple example, alternatives generated using word embedding for the word "green" may include {"greenable", "virescent"}. These words would seem unnatural in paraphrases, although a machine would not be able to tell this. More importantly, obtaining a very large list of suggestions is not necessarily useful unless we devise methods for ranking (and continuously re-ranking) the results in order to choose and present the set of top-quality words to the crowd worker. To achieve this, we propose a probabilistic model that uses various indicators (e.g., implicit feedback from workers, diversity maximization and semantic relevance) to automatically and adaptively synthesize a reliable score to rank the generated word list. This allows the generation of improved versions of suggestions, based on continually monitoring implicit worker feedback, in order to balance diversity and semantic relevance.

• End-to-End Evaluation. Finally, we evaluate how these automatically generated suggestions impact the diversity of collected paraphrases on various domains, including 40 intents that are designed for highly popular APIs (e.g., Yelp, Spotify).


Experiments show that our approach improves not only lexical diversity but also other diversity metrics (PINC [30] and DIV [103]). Moreover, we found that it reduces task completion time and misspelling errors.

4.2 Related Work

In this section, we provide the required background and related work. To avoid repetition, we refer the reader to Chapters 2 and 3 for background on dialog systems, paraphrasing and crowdsourced utterances.

4.2.1 Priming in crowdsourcing

In experimental psychology, the term "priming" refers to a technique in which introducing one stimulus influences a person's response to a subsequent stimulus [86]. In other words, humans may be biased by prior stimuli that affect future processing of information [211]. Word-fragment completion (WFC) is an example of a task where priming may have an impact [188]: a person is given a fragment of a word like "str- - -" and is asked to complete the word. This person is more likely to form the word "strong" than "street" if she had been shown the word "strong" before performing the task [20]. This type of priming is called repetition priming. Repetition priming (also called direct priming) refers to a form of priming according to which the brain responds more quickly to a stimulus if it has experienced it previously [64]. Priming is used in several applications, from swaying public opinion in product marketing and political campaigns to sharpening memory skills [78, 17, 51, 196]. Priming has also been used in crowdsourcing to positively affect the performance of crowd workers [79, 66]. For example, research found that showing positive images (e.g., a photo of a smiling child) or listening to positive music when working on tasks can influence idea generation [136]. As another example, priming has been used to trigger workers' inherent motivation to excel in performing a task by showing them quotes of famous figures about "achievement" [66]. In crowdsourced paraphrasing, research has also shown that crowd workers are primed by the given utterance as well as by the examples and instructions presented to them.

However, in this case, priming negatively impacts diversity since crowd workers are more likely to use the same vocabulary as used in the given utterance and task description [139, 95, 171, 143]. In this chapter, we build upon this prior art but also look at priming as an opportunity rather than a problem. Specifically, inspired by tasks such as word-fragment completion, we leverage priming by suggesting a list of words/phrases that can potentially improve diversity. We hypothesize (and demonstrate) that priming the crowd with a list of potential lexical substitutions can mitigate the negative impact of priming by the words in the given utterance.

4.2.2 Word list expansion

Query rewriting, and more specifically query expansion methods, can also be adopted to build dynamic word suggestions. The primary goal of query expansion is to solve the term mismatch problem [27] by adding a few words to a given query. In this manner, it aims to mitigate the vocabulary problem when documents and users use different terminologies to express the same thing. In this chapter, we extend a query expansion model based on word embedding [111] by aligning words used in the given utterance and the already collected paraphrases. In this manner, we can distinguish the senses of multi-sense words. For example, word alignment between "search for a restaurant near the university" and "find a restaurant close to the university" detects the following translation relationships: "search for" → "find", and "near" → "close to". Thus, it can be inferred that the word "near" conveys a sense meaning "close to" (not other senses like "almost" or "approach"). Furthermore, we adapt the concept of implicit relevance feedback from the query expansion methods of information retrieval systems in order to measure whether a word is semantically relevant to the given intent. Relevance feedback in information retrieval systems is used to expand the query by using the contents of relevant documents [182]. However, we use implicit relevance feedback to dynamically remove noise (semantically irrelevant words) from the suggestions. We propose a probabilistic model to measure the likelihood of a word being relevant to the given intent by tracking how suggested words are used. In short, if a word has been suggested multiple times without being used in the collected paraphrases, it is considered semantically irrelevant for the given intent.


4.3 Word Recommender

Figure 4.2 illustrates an example of a paraphrasing task based on word recommendation on the Figure Eight1 crowdsourcing platform. The recommendations are shown in a word cloud as a graphical representation. Words inside the word cloud (on the left-hand side) have been generated as relevant words to the given utterance. In the given example, workers have been asked to paraphrase "search for a restaurant near the university", which is an utterance for the business search intent with "restaurant" and "the university" as values for the business and location parameters, respectively. Our initial implementation (as presented in Section 4.4.1) categorizes words with the same color as possible alternatives for one of the words in the utterance. Thereby, at the time of paraphrasing, crowd workers are able to pick words from this list and form new paraphrases.

[Figure 4.2 shows the task interface: the utterance "Search for a restaurant near the university", a word cloud of suggestions (e.g., "eatery", "find", "close_to", "eating_place", "place_to_eat", "neighbourhood"), and three required paraphrase input fields.]

Figure 4.2: Crowd Workers’ Interface

The generation of diverse (but relevant) words is done by the Word Recommender. Its main components are illustrated in Figure 4.3. Given an intent (e.g., business search) with an annotated utterance (e.g., Search for a business="restaurant" near location="the university") to be paraphrased, the Word Recommender first extracts all words/phrases from the utterance. It then builds a synonym set for each extracted word/phrase by aligning the collected paraphrases with the utterance. Next, for each synonym set, it finds a set of related words to appear in the recommendation list with the help of a word-embedding model. Finally, it (re-)ranks candidate words based on the words that appear in the collected paraphrases to generate a new recommendation list. This process is an ongoing effort: periodically, when workers provide paraphrases for a particular utterance, a new recommendation list is generated according to the collected paraphrases.

1https://www.figure-eight.com/


[Figure 4.3 depicts the Word Recommender pipeline for the annotated utterance Search for a business="restaurant" near location="the university": the Synonym-Sets Extractor produces synonym sets (e.g., {search_for, seek, ...}, {restaurant, eatery, ...}, {near, close, ...}); the Recom-Set Generator expands them via word embeddings into scored candidates (e.g., (close_to, 0.71), (vicinity, 0.55), (eatery, 0.7)); and the Recom-Set Ranker re-ranks them against the paraphrase database before the recommendations are shown alongside the three paraphrase fields.]

Figure 4.3: Word Recommender Architecture

Our system adopts a word embedding model called ConceptNet Numberbatch [189] to find related words for each word that appears in the given utterance. Word embedding methods map words into a vector space in such a way that similar words have similar vectors. As a result, a word embedding model can be used to find similar words by finding the closest neighbours of a given word. ConceptNet Numberbatch has a few characteristics which make it desirable in our work. Firstly, it has been built using retrofitting [59] to refine existing word embedding models with external knowledge bases (e.g., ConceptNet [189]). Secondly, it provides vectors not only for words but also for frequent n-grams up to tri-grams. As a result, using such a word embedding approach, our system is able to suggest both words and phrases. While we have used this model in our experiments, it can easily be substituted with any word embedding model.

The rest of this section provides further details about the three main components of the Word Recommender2.

4.3.1 Synonym Sets Extractor

Any word can have different meanings in different contexts [56]. Since our aim is to suggest words highly relevant to a given utterance, it is important to disambiguate the sense of a given word in a sentence. For example, the word "near" can be an adjective, adverb, or even a verb, and based on its part-of-speech (POS) it can have several synonyms such as close, approximate, skinny, dear, good3. To suggest appropriate substitutions for such a word, it is important to consider its role and sense in the utterance.

[Figure 4.4 illustrates synonym-set creation: the annotated utterance Search for a business="restaurant" near location="the university" is tokenized into {search_for}, {restaurant}, {near}, and each set is enriched via WSD/word alignment into, e.g., {search_for, look_for, seek, ...}, {restaurant, eatery, eating_place, ...}, {near, close, ...}.]

Figure 4.4: Synonym-Sets Creation for an Utterance

To this end, the Synonym Sets Extractor first extracts all words/phrases from a given annotated utterance4, excluding stop-words5 (since frequent words are less likely to result in repetition priming [191]), e.g., {near}, {search_for}, {restaurant}. As a design decision, we also excluded parameter values marked as fixed [135]. Fixing parameter values can be a double-edged sword. Preserving parameter values removes the need for manual annotation of collected paraphrases [135]; it also reduces variability where we do not want it (e.g., cities of the world have many possible values) and allows workers to focus on where we do want variability. On the other hand, if a parameter like "business=restaurant" is fixed, then we are deprived of diverse but relevant utterances such as "where to eat in the university". Best practices for fixing parameters are outside the scope of this chapter, and here we simply assume that the bot developer has the choice to do so. Thus, the Synonym Sets Extractor also ignores words which are marked as fixed.

2https://github.com/unsw-cse-soc/WordRecommender
3http://wordnetweb.princeton.edu/perl/webwn?s=near
4we assume that utterances are annotated (either manually or automatically)
5stop-words are words used frequently in text, such as "a", "an", "is"


Next, the Synonym Sets Extractor enriches each word by creating a set for each word/phrase and adding its synonyms to the set. In a nutshell, the Synonym Sets Extractor employs (i) a word sense disambiguation (WSD) method if no paraphrase has been collected yet, and (ii) a word alignment [21] method to find aligned words between the given utterance and the collected paraphrases. Word alignment is the task of finding the corresponding translation of a word between two pieces of text (e.g., a sentence and its paraphrase). For instance, if we collect a paraphrase like "search for a restaurant in the vicinity of Sydney", we can infer that the word "vicinity" in this context is aligned to "near". In our implementation, we used the sentence aligner proposed in [192] to find alternatives for a given word in the collected paraphrases. These alternatives are added to the synonym set of the word to enrich the word and disambiguate its meaning in the utterance. While word alignment can be used for disambiguation, it is not practical until a few paraphrases have been collected from crowd workers. To mitigate this issue when there are no paraphrases at the beginning of crowdsourcing, the proposed system uses the WSD method proposed in [153]. Using a dictionary such as WordNet [133], the WSD method is able to determine a set of synonyms for a word in the given sentence. Finally, by enriching each word/phrase of the given annotated utterance, the following synonym sets are obtained: {search_for, look_for, seek}, {restaurant, eatery, eating_place, eating_house}, and {near, close}.
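A minimal sketch of the cold-start path (no paraphrases collected yet), using NLTK's Lesk implementation and WordNet instead of the WSD method of [153], and without the sentence aligner of [192]:

```python
# pip install nltk   (requires nltk.download("punkt"), nltk.download("wordnet"),
#                     nltk.download("omw-1.4"), nltk.download("stopwords"))
import nltk
from nltk.corpus import stopwords
from nltk.wsd import lesk

STOP = set(stopwords.words("english"))

def synonym_sets(utterance: str, fixed_values=frozenset()) -> dict:
    """Cold-start synonym sets: one set per content word, seeded with the WordNet
    lemmas of the sense chosen by the Lesk algorithm."""
    tokens = nltk.word_tokenize(utterance.lower())
    sets = {}
    for word in tokens:
        if word in STOP or word in fixed_values or not word.isalpha():
            continue
        sense = lesk(tokens, word)                  # may be None for unknown words
        lemmas = set(sense.lemma_names()) if sense else set()
        sets[word] = {word} | lemmas
    return sets

print(synonym_sets("search for a restaurant near the university",
                   fixed_values={"university"}))
```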

4.3.2 Recommendation Sets Generator

The Recommendation Sets Generator expands a synonym set by adding relevant words to the set, resulting in a recommendation set. This component builds upon word embeddings: a method of mapping words into a vector space in such a way that similar words have similar vectors. In particular, it makes use of a word embedding model called ConceptNet Numberbatch [189] to find related words/phrases for each synonym set. Figure 4.5 demonstrates how related words are found by the Recommendation Sets Generator. For a given synonym set (e.g., {near, close}), the Recommendation Sets Generator calculates its mean vector by averaging the vectors of all words in the synonym set ($\vec{ss} = \frac{1}{|ss|}\sum_{s \in ss} \vec{s}$, where $ss$ stands for synonym set). Next, it finds the top-n neighbours of the mean vector $\vec{ss}$ using cosine similarity ($\cos(\vec{u},\vec{v}) = \vec{u}\cdot\vec{v} / (\lVert\vec{u}\rVert\,\lVert\vec{v}\rVert)$), and ranks the neighbours based on their cosine similarity to the vector of the given utterance $\vec{expr}$, defined as the average of the vectors of all content words (nouns, verbs, and adjectives) in the utterance [229].

It should be noted that, while it is possible to order the words/phrases in the candidate sets based on their similarity to the corresponding synonym set, we calculate their similarities to the utterance in order to give importance to words which are more relevant to the utterance. For example, in Figure 4.5 "vicinity" is ranked higher than "adjacent", even though the vectors of "adjacent" and "near" are closer than those of "vicinity" and "near"; since "vicinity" is closer to the utterance, it is given a higher score.
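A compact sketch of this step using gensim, assuming a local copy of the ConceptNet Numberbatch vectors in word2vec text format (the file name is an assumption):

```python
# pip install gensim numpy
import numpy as np
from gensim.models import KeyedVectors

# Path is an assumption; Numberbatch is distributed in word2vec text format.
kv = KeyedVectors.load_word2vec_format("numberbatch-en.txt", binary=False)

def mean_vector(words):
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommendation_set(synonym_set, utterance_content_words, top_n=20):
    """Expand a synonym set with embedding neighbours, then score each candidate
    by its similarity to the utterance vector (Section 4.3.2)."""
    ss_vec = mean_vector(synonym_set)
    expr_vec = mean_vector(utterance_content_words)
    neighbours = [w for w, _ in kv.similar_by_vector(ss_vec, topn=top_n)
                  if w not in synonym_set]
    return sorted(((w, cosine(kv[w], expr_vec)) for w in neighbours),
                  key=lambda pair: pair[1], reverse=True)

print(recommendation_set({"near", "close"},
                         ["search", "restaurant", "near", "university"]))
```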

[Figure 4.5 illustrates the four steps of recommendation-set generation for the synonym set [near, close, nigh] and the expression [search_for, restaurant, near, university]: (1) compute the synonym-set and expression vectors, (2) measure cosine distances over the vector space, (3) find the closest words to the synonym-set vector (e.g., close_to 0.43, adjacent 0.32, located 0.15, vicinity 0.06), and (4) re-order them by similarity to the expression vector (e.g., close_to 0.71, vicinity 0.55, located 0.34, adjacent 0.09).]

Figure 4.5: Recommendation-Set Generation

4.3.3 Recommendation Set Ranker

The goal of the Word Recommender is to estimate the probability that a word/phrase will improve the diversity of paraphrases if it appears in the recommendation list. Recommending infrequent words can improve diversity because they increase the chance of introducing n-grams not yet seen in the collected paraphrases, and thus increase the diversity metrics (e.g., PINC [30], DIV [103]). However, recommendations should also be semantically related to the given utterance to avoid the generation of semantically invalid paraphrases. If a recommended word is not semantically related to the given utterance, it will have no effect on the diversity of collected paraphrases because workers may not use it. Even more dangerously, showing irrelevant recommendations may result in the generation of semantically invalid paraphrases. For example, if the intent is to "find a restaurant", suggesting a word like "kitchen", while possibly related, will result in generating paraphrases which do not specify the same intent (e.g., "look for a kitchen").

To this end, we devise a probabilistic model to rank the words in recommendation sets by estimating the probability that a given word will increase diversity while being semantically related to the intent. To achieve this, we propose two probabilistic models for measuring the semantic similarity of a word to the intent, and one probabilistic model for estimating whether recommending a word will increase the diversity of paraphrases.

Similarity Probability Model (SM)

It is important to suggest only words which are highly related to the utterance. This is partly done in the previous step, when the words in a recommendation set are scored based on how close they are to the utterance (see Section 4.3.2). However, to obtain a probability distribution over the recommended words, we can use softmax normalization [111]:

$$P(w \mid SM) = \frac{\exp(\cos(\vec{w}, \vec{expr}))}{\sum_{w' \in rs} \exp(\cos(\vec{w'}, \vec{expr}))} \qquad (4.1)$$

where $rs$ stands for the recommendation set and $w \in rs$. In fact, this probability model constitutes a unigram language model defined over the recommendation set, assuming that all other words have zero probability.

Relevance-Feedback Probability Model (RM)

As mentioned before, recommendation sets are noisy and not all of the recommendations can be used in paraphrasing. Our premise is that an implicit relevance feedback model can assist the Word Recommender in distinguishing between noise and highly related words. Intuitively, if a word has been recommended but has rarely been used in the collected paraphrases, it is most likely an unrelated word and should not be present in the next updates. Formally, the maximum likelihood estimate (MLE) of $w$ with respect to the number of times it appeared in the recommendation list and was used in the paraphrases is:

$$p_{MLE}(w) = \frac{used(w)}{shown(w)}$$

where $shown(\cdot)$ counts how many times a term has been shown to crowd workers, and $used(\cdot)$ counts the number of times a word has been used in the paraphrases. However, this estimation disregards two essential aspects:

• Not all of the collected paraphrases are valid, and only valid paraphrases should be considered while counting.

• Even if a word is highly relevant to a given utterance, it does not necessarily mean that a worker will include the word in their paraphrases. For example, imagine that there are several synonyms for a word, and all workers pick only a single one of them, while the remaining synonyms might be just as appropriate as the chosen one. Such cases affect the chance of a word being selected, so we cannot assume that a word which has not been used is completely out of scope.

To address these aspects, let $\lambda$ be the probability that a given utterance is valid, and $\gamma$ the probability that a word is not used in a paraphrase regardless of whether it is an appropriate alternative; the revised relevance probability is:

$$p_{RMLE}(w) = \frac{\lambda \, used(w) + \gamma \, (shown(w) - used(w)) + \epsilon}{shown(w) + \epsilon}$$

In our experiments, we set $\lambda = 0.80$ (based on error rates reported in various works [190, 139]), $\epsilon = 0.1$ and $\gamma = 0.5$; these are initial configurations, and finding the optimal values requires further study. The small positive number $\epsilon = 0.1$ is added to the numerator and denominator to avoid division by zero. In summary, this function reduces the weights of words which appeared in the recommendation list but were rarely used, as well as of words which have already been exploited by the paraphrases. The relevance probability can be normalized to obtain the relevance probability distribution under the relevance model (RM):

$$P(w \mid RM) = \frac{p_{RMLE}(w)}{\sum_{w' \in rs} p_{RMLE}(w')} \qquad (4.2)$$

Diversity Probability Model (DM)

To encourage diversity, infrequent words must be given more priority because they are more likely to generate new n-grams which have not been seen in the collected paraphrases and thus improve the diversity metrics. To address this issue, we used a modified version of the BM25 inverse document frequency (IDF) [177] that avoids negative numbers:


$$idf(w) = \log\!\left(1 + \frac{N - f(w) + 0.5}{f(w) + 0.5}\right) \qquad (4.3)$$

where $f(\cdot)$ represents the frequency of the word in the collected paraphrases, and $N$ is the total number of collected paraphrases. The IDF values can be normalized to yield probabilities:

$$P(w \mid DM) = \frac{idf(w)}{\sum_{w' \in rs} idf(w')} \qquad (4.4)$$

Finally, we use a linear interpolation technique [132] to approximate to what extent a word increases diversity while preserving semantics:

$$P(w) = \alpha P(w \mid SM) + \beta P(w \mid RM) + \theta P(w \mid DM) \qquad (4.5)$$

where $\alpha, \beta, \theta \in [0, 1]$ are interpolation parameters ($\alpha + \beta + \theta = 1$) that control the trade-off between the probability models. In our experiments, we kept the interpolation parameters equal. Words in each recommendation set are ranked based on Equation 4.5, and the top-m words from each set are selected to appear in the current update of the recommendation list shown to workers, where m is the rounded value of the size of the list (see Section 4.4.1) divided by the number of recommendation sets. When the recommendation list is shown, words which belong to the same recommendation set are rendered in the same color. Moreover, their final scores determine how large they appear in the current update of the word cloud.
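Putting Equations 4.1-4.5 together, the sketch below scores one recommendation set. The cosine similarities, usage counts, and frequencies are assumed to be supplied by the components described earlier, and the numbers in the example are made up for illustration.

```python
import math

def ranker_scores(candidates, cos_to_expr, shown, used, freq, n_paraphrases,
                  lam=0.8, gamma=0.5, eps=0.1, alpha=1/3, beta=1/3, theta=1/3):
    """Score each candidate word with the interpolated model of Equation 4.5.

    candidates      : words in one recommendation set
    cos_to_expr[w]  : cosine similarity of w to the utterance vector
    shown[w]/used[w]: how often w was recommended / used in (valid) paraphrases
    freq[w]         : frequency of w in the collected paraphrases
    n_paraphrases   : total number of collected paraphrases (N in Eq. 4.3)
    """
    # Eq. 4.1: softmax over cosine similarities (similarity model, SM)
    exp_sim = {w: math.exp(cos_to_expr[w]) for w in candidates}
    p_sm = {w: exp_sim[w] / sum(exp_sim.values()) for w in candidates}

    # Eq. 4.2: smoothed relevance-feedback estimate (relevance model, RM)
    rmle = {w: (lam * used[w] + gamma * (shown[w] - used[w]) + eps) / (shown[w] + eps)
            for w in candidates}
    p_rm = {w: rmle[w] / sum(rmle.values()) for w in candidates}

    # Eqs. 4.3-4.4: BM25-style IDF, normalized (diversity model, DM)
    idf = {w: math.log(1 + (n_paraphrases - freq[w] + 0.5) / (freq[w] + 0.5))
           for w in candidates}
    p_dm = {w: idf[w] / sum(idf.values()) for w in candidates}

    # Eq. 4.5: linear interpolation of the three models
    return {w: alpha * p_sm[w] + beta * p_rm[w] + theta * p_dm[w] for w in candidates}

scores = ranker_scores(
    candidates=["close_to", "vicinity", "located", "adjacent"],
    cos_to_expr={"close_to": 0.71, "vicinity": 0.55, "located": 0.34, "adjacent": 0.09},
    shown={"close_to": 5, "vicinity": 5, "located": 5, "adjacent": 5},
    used={"close_to": 3, "vicinity": 1, "located": 0, "adjacent": 0},
    freq={"close_to": 3, "vicinity": 1, "located": 0, "adjacent": 0},
    n_paraphrases=30,
)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```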

4.4 Experiments & Results

Before running a comprehensive experiment, we conducted a pilot to decide on the design of the crowdsourcing task. Based on these observations and interviews, we then evaluated the approach on the Figure-Eight crowdsourcing platform.

4.4.1 Task Design Experiment

Participants. We recruited a convenience sample of 7 participants including 2 postdoctoral researchers (P1, P2), 3 PhD students (P3, P4, P5), 1 research assistant (P6) and 1 undergraduate student (P7).


Procedure. The experiment was conducted in the following parts: (a) firstly, a brief explanation of the task was provided to the participants, as well as a few examples of invalid paraphrases (to avoid biasing participants, we did not give them any valid paraphrasing examples); (b) next, participants were assigned 6 randomly chosen utterances and asked to provide 3 paraphrases for each. Each participant encountered two utterances with a word-cloud size of 10, two with a size of 15, and two with a size of 20; and (c) finally, a follow-up semi-structured interview was conducted about the user experience of using the word cloud during the paraphrasing process.

Does word recommendation help? One of the aims of this experiment was to find out whether the automatically generated word cloud helped workers during paraphrasing. All of our participants confirmed that the word cloud assisted them, especially those whose main language was not English:

“I think the word cloud can help the users with English as the second language...” (P5)

“I like the idea of giving you some words, it actually helped me specially in those paraphrasing questions that I didn’t have any alternatives in mind” (P3)

“Well, yes definitely, specially for me that English is not my mother tongue; I also learned some new words while I was making new sentences for [...]” (P1)

How many words are appropriate in the recommendation list? Figure 4.6 shows how participants rated different word clouds (recommendation lists) based on their sizes, on a Likert scale with 5 being very appropriate. Generally, most participants preferred a size of 10; it was even considered an accelerating factor by one of the participants:

“The more compact ones seemed more useful to me. I guess it is because you don’t have to inspect too many options, which allows you to perform the task more quickly.” (P2)


Figure 4.6: Likert Assessment of Word Clouds by Size

However, a few found the word cloud of size 15 preferable. Our investigations revealed that the size of the word cloud may also be a function of the number of words in the utterance, and for short utterances showing many alternatives for a single word is sometimes considered inappropriate and time-consuming:

"...having more words makes it hard to choose." (P4)

"..., also I think 10 is enough, even less, let's say 7 or 8; 20 is definitely too much." (P6)

Therefore, in our follow-up experiments we set a limit of 10 for the total size of the word cloud and only 5 for the alternatives of a single word.

Is coloring similar words with the same color helpful? While some of the participants found word coloring very helpful, others did not understand the semantics behind the coloring:

“Word coloring is great and should be kept as it is...” (P7)

“But the coloring was confusing. I cannot really assign a specific characteristics to the used colors...while doing the task, the semantics of the colors did not come immediately and naturally to my mind” (P2)

Based on the interviews, we decided to use coloring with muted colors (avoiding bright ones). To resolve misunderstandings about colors, we also explained the coloring in the task instructions.


4.4.2 Crowdsourced Paraphrasing

To verify our approach, we randomly selected 40 utterances indexed in API-KG [229] – a knowledge graph designed for RESTful APIs – and ThingPedia [25], including utterances from different domains: Yelp, Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Facebook, and Bing Image Search. Next, we launched five paraphrasing jobs on Figure-Eight: (i) a simple baseline which simply asks crowd workers to paraphrase the given utterances; (ii) the state-of-the-art approach named Chinese Whispers (CW) [139]; (iii) the proposed approach, called Word Recommender (WR); (iv) a recommendation method with words generated by an open-source query rewriting method called SearchBetter (SB)6; and (v) finally, inspired by the Taboo game, another baseline called Taboo Words (TW), which forbids workers from using the words that have high frequencies in the already collected paraphrases; we excluded stopwords, and in each round only 5 taboo words were shown to the crowd workers, since a higher number of taboo words makes the paraphrasing task very difficult7. We did not compare our system with approaches such as replacing entities with images or showing videos because they are difficult to adopt in general, as discussed earlier in this chapter.

For each of the five jobs, we collected 10 judgments per utterance. In our jobs, a judgment is a triple of paraphrases submitted by a worker. Workers were asked to provide three paraphrases for each utterance to reduce repetitive paraphrases [95], which is a common practice [25, 95] in crowdsourced paraphrasing. Each worker gained 10 cents per judgment, and in total we spent about $258 including the transaction fee charged by the platform. Over a span of 3 days, we collected 30 paraphrases per utterance per approach from English-speaking countries, and created five datasets containing 6,000 paraphrases in total.

Procedure. First, the workers were instructed to familiarize themselves with the task and its constraints (e.g., parameter values must be used in the paraphrases as they appear in the given initial utterance). Next, for a given utterance, participants were asked to provide three paraphrases. When using one of the word recommendation methods, each worker was also asked whether the generated recommendations helped them during the task, on a Likert scale of 1 to 5.

6available at https://github.com/hathix/searchbetter; we combined the two query rewriting methods of SearchBetter (Wikipedia and rewriters) and fed the suggestions into the word cloud in the order given by the framework
7we first experimented with 10 taboo words, but workers stopped completing the task and rated its difficulty as 1 out of 5


In the following section, we discuss different aspects of our experimentation.

Cleaning. After collecting the paraphrases, we launched another crowdsourcing task to assess the collected paraphrases and determine which are correct and which are incorrect. To this end, we assigned each paraphrase to 3 workers; each worker gained 2 cents per annotated triple of paraphrases, and in total we spent about $144. To further increase the quality of the annotations and resolve disagreements between crowd workers, two authors of this chapter manually checked, discussed, and revised the labels. As shown in Table 4.1, the sizes of the datasets are roughly equal after pruning, and roughly 20% of the collected paraphrases are incorrect (except for the TW and CW methods). This is also in line with the value chosen for λ = 0.8 in Equation 4.2. As mentioned before, CW is prone to producing many incorrect paraphrases [95]. Moreover, based on our observations, the TW method makes the task very difficult for crowd workers and, as a result, many incorrect paraphrases are generated to circumvent the forbidden words. In the following sections, we compare the diversity of the datasets based only on the correct paraphrases.

Table 4.1: Crowdsourced Paraphrase Datasets

Dataset                  Total Size   Correct     Incorrect
Baseline                 1200         935 (78%)   265 (22%)
Chinese Whispers (CW)    1200         823 (69%)   377 (31%)
SearchBetter (SB)        1200         986 (82%)   214 (18%)
Taboo Words (TW)         1200         770 (64%)   430 (36%)
Word Recommender (WR)    1200         974 (81%)   226 (19%)

4.4.3 Results

Does word recommendation improve diversity? The main aim of the proposed approach is to improve the diversity of collected paraphrases by stimulating users to use words/phrases that can add variety to the paraphrases collected for a given intent. To measure the diversity of the collected paraphrases, after removing punctuation marks, lowercasing, and lemmatizing the paraphrases, we calculated the four measures described in Section 4.2: (1) TTR, (2) PINC, (3) DIV, and (4) the vocabulary size.
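To make these measures concrete, the following is a minimal Python sketch of how TTR and PINC can be computed; the whitespace tokenization and the maximum n-gram order of 4 are assumptions made for illustration, not the exact implementation used in this thesis.

def ngrams(tokens, n):
    # all contiguous n-grams of a token list, as a set of tuples
    return set(zip(*(tokens[i:] for i in range(n))))

def ttr(paraphrases):
    # Type-Token Ratio: number of distinct words divided by total number of words
    tokens = [t for p in paraphrases for t in p.lower().split()]
    return len(set(tokens)) / len(tokens)

def pinc(source, paraphrase, max_n=4):
    # PINC rewards a paraphrase for introducing n-grams absent from the source
    src, par = source.lower().split(), paraphrase.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        par_ngrams = ngrams(par, n)
        if not par_ngrams:
            continue
        overlap = len(par_ngrams & ngrams(src, n)) / len(par_ngrams)
        scores.append(1 - overlap)
    return sum(scores) / len(scores) if scores else 0.0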

Table 4.2: Diversity Comparison

Dataset                  TTR     PINC    DIV     Vocabulary Size
Baseline                 0.258   0.653   0.382   1647
Chinese Whispers (CW)    0.278   0.695   0.365   1622
SearchBetter (SB)        0.285   0.724   0.484†  1713
Taboo Words (TW)         0.338†  0.733†  0.518†  1682
Word Recommender (WR)    0.313†  0.734†  0.543†  2064

Bold illustrates the best performance for each of the diversity measures, and a † indicates two-tailed statistical significance at the 0.01 level over baseline.

Using a two-tailed independent-samples t-test, there was a significant difference in lexical diversity (TTR) for WR (M = 0.32, SD = 0.72) over the baseline (M = 0.26, SD = 0.05); t(38) = 4.01, p = 0.0001. As shown in Table 4.2, these results suggest that WR does have an effect on lexical diversity; on average, using WR enhances lexical diversity by 21.32%. Although TTR is lower for longer documents, knowing that both datasets are almost equal in size, we can compare their TTRs. Moreover, WR increased the vocabulary size by 19.92%. By comparing the vocabulary sizes of the two datasets, we can also infer that using WR yields more diverse paraphrases. On the other hand, while SB (M = 0.29, SD = 0.84) improves TTR over the baseline, the improvement is not statistically significant; t(38) = 1.82, p = 0.07. Moreover, even though TW exceeds WR in terms of TTR, the two are not comparable since there is a large difference in the number of paraphrases in the two datasets, making TTR a less suitable metric for this comparison.

TTR cannot measure how structurally diverse two datasets are; to overcome this issue, PINC has been introduced. PINC measures the percentage of n-grams in the paraphrase that do not appear in the source sentence; in short, PINC rewards introducing new n-grams. An independent-samples t-test was also conducted to compare the PINC scores. The PINC values indicated that the paraphrases generated by WR (M = 0.73, SD = 0.09) are more diverse than those written using the baseline approach (M = 0.65, SD = 0.11); t(38) = 3.42, p = 0.001. WR also improves the mean average PINC by 12.4%. Using TW (M = 0.73, SD = 0.09) also yields a statistically significant improvement in PINC; t(38) = 3.39, p = 0.001.

One problem with PINC is that it only considers n-gram changes between the source sentence and its paraphrases, without considering those between two paraphrases. DIV [103] is another diversity measure which overcomes this issue through pair-wise n-gram comparison between paraphrases. Our experiments indicate that WR yields a 42.15% improvement over the baseline. Interestingly, CW has a lower DIV than the baseline; one reason may lie in the fact that CW tries to make paraphrases diverse with respect to the source utterance, but fails to promote diversity between paraphrases. The TW approach also improves DIV by 35.6%. Given the above results, we concluded that word recommendation, as well as showing taboo words, improves not only lexical diversity but also PINC and DIV. However, TW results in generating too many incorrect paraphrases, as shown in Section 4.4.
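For reference, the significance tests reported in this section follow a standard two-tailed independent-samples t-test; the sketch below assumes that the per-utterance diversity scores (e.g., TTR) of two datasets are available as Python lists, and is only an illustration of the procedure.

from scipy import stats

def compare_diversity(scores_a, scores_b):
    # two-tailed independent-samples t-test between per-utterance diversity scores
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return t_stat, p_value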

Does Word Recommender offer appropriate alternatives? To track the quality of our dynamic word recommendation list, we asked crowd workers to rate how helpful the generated list was for that particular task. Figure 4.7 illustrates the average user rating of the 40 utterances over time (from the 1st user to the 10th user) for both WR (green) and SB (orange). This figure reveals that the proposed approach for creating the recommendation list, together with the probability model for re-ranking, works properly and improves the quality of the generated list over time. Figure 4.7 also shows the overall ratings for the WR (the green pie chart) and SB (the orange pie chart) approaches. As shown in these pie charts, most of the suggestion lists are rated 4 in the WR approach, while the most frequent rating for SB is 1. On the other hand, the performance of SB fluctuates over time with almost no improving trend. Since both approaches use the same word-embedding model, we can conclude that WR outperforms SB mostly because of the probability model described in Section 4.3.3, which reduces noise over time while still proposing new words, whereas SB only proposes new words. Moreover, Figure 4.7 indicates that crowd workers are happier with WR's dynamic lists than with those of SB.

Figure 4.7: Average User Rating over Time (average rating of the recommendation lists from the 1st to the 10th worker for Word Recommender, WR, and SearchBetter, SB, together with pie charts of the overall rating distribution for each approach)

Does word recommendation impact the semantic error rate? While improving diversity is of paramount importance, it is also essential that an approach does not encourage collecting semantically invalid paraphrases. To ensure that our proposed approach does not increase the semantic error rate8, we compared the datasets. In this comparison, we deducted the paraphrases which have spelling errors from the total number of incorrect paraphrases, assuming that the remaining incorrect paraphrases semantically differ from the original utterance. We noticed that the datasets created by the word recommendation approaches (SB and WR) have fewer semantically incorrect paraphrases. Thus, it can be concluded that using word recommendation does not increase the number of semantically incorrect paraphrases in comparison to the rest of the approaches. On the other hand, while TW generates diverse paraphrases, it has the highest number of semantically incorrect paraphrases. Based on our observations, this lies in the fact that forbidding users from using particular words makes the task very difficult to perform, and many workers started to generate garbage paraphrases to accelerate the paraphrasing process.

Does word recommendation impact naturalness? We refer to naturalness as the likelihood of an utterance occurring in a real situation. To measure how naturalness is affected by each of the crowdsourcing methods, we randomly selected 100 utterances from each of the methods.

8percentage of semantically incorrect paraphrases

Next, we launched a crowdsourcing task on Figure-Eight. Crowd workers were asked to rate how likely a given utterance is to occur in a real conversation on a scale of 1 to 5, with 5 being highly likely. In total, we spent $21 on this annotation task. Table 4.3 gives the average naturalness score given by crowd workers. As shown in the table, all methods perform roughly alike; however, the baseline, followed by the proposed method, surpasses the rest. Given that the difference is not significant, it can be concluded that word recommendation does not significantly impact the naturalness of paraphrases.

Table 4.3: Naturalness Comparison

Dataset                  Naturalness
Baseline                 3.63
Chinese Whispers (CW)    3.50
SearchBetter (SB)        3.37
Taboo Words (TW)         3.31
Word Recommender (WR)    3.57

Does requiring workers to provide three paraphrases for a given utterance adversely affect diversity? It is debatable whether crowd workers should be required to provide more than one paraphrase for a given utterance. While our approach works with any number of paraphrases obtained from a worker for a given utterance, we decided to track the effect of word recommendation on the first, second, and third paraphrases separately. By comparing the paraphrases from the crowd, we observed that second paraphrases are generally more diverse than the first, and the same is true for the third paraphrases. This holds for almost all datasets. While this comparison indicates that crowd participants generated more diverse second and third paraphrases, it does not prove that obtaining three paraphrases from one worker yields greater diversity than asking three people to provide one paraphrase each. Reasoning about the number of paraphrases requires further studies.


Table 4.4: Diversity of 1st, 2nd, and 3rd Paraphrases

Dataset                  Metric   1st Paraphrase   2nd Paraphrase   3rd Paraphrase

Baseline                 TTR      0.404            0.430            0.471
                         PINC     0.622            0.657            0.685
                         DIV      0.365            0.410            0.499

Chinese Whispers (CW)    TTR      0.452            0.486            0.506
                         PINC     0.670            0.714            0.715
                         DIV      0.330            0.476            0.451

SearchBetter (SB)        TTR      0.408            0.472            0.488
                         PINC     0.704            0.726            0.747
                         DIV      0.333            0.540            0.503

Taboo Words (TW)         TTR      0.516            0.580            0.590
                         PINC     0.717            0.742            0.741
                         DIV      0.527            0.602            0.620

Word Recommender (WR)    TTR      0.458            0.530            0.504
                         PINC     0.702            0.751            0.750
                         DIV      0.499            0.608            0.608

Does higher diversity improve bots' performance? The main reason for improving the diversity of user utterances is to build a more accurate bot. To compare the performance of bots built with each method, we used the wit.ai9 platform and built a bot per API (Yelp, Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Facebook, and Bing Image Search) for each crowdsourcing method (Baseline, CW, SB, WR, and TW). Each bot was trained using the utterances crowdsourced by a particular method, excluding incorrect paraphrases. Next, in the absence of a gold dataset containing real user utterances, we evaluated each trained bot against the correct utterances in the other datasets. Table 4.5 shows the average intent-detection accuracy per bot.

9https://wit.ai


The bots trained on the dataset obtained by the proposed method yield a 35% accuracy improvement over the baseline dataset on average. Therefore, it can be inferred that diversity plays a role in the accuracy of intent detection in bot development platforms. Comparing the average accuracy of each bot with the diversity measures, we recognized that DIV is most in line with the bots' intent detection accuracy. As can be seen in Table 4.5, the bot trained on the CW dataset has the lowest accuracy, as it also has the lowest DIV value, while it outperforms the baseline in terms of TTR and PINC.

Table 4.5: Intent Detection Accuracy by Dataset

Dataset                  Accuracy
Baseline                 0.619
Chinese Whispers (CW)    0.582
SearchBetter (SB)        0.795†
Taboo Words (TW)         0.771†
Word Recommender (WR)    0.835†

Bold illustrates the best performance, and a † indicates two-tailed statistical significance at the 0.01 level over the baseline.

Does word recommendation reduce spelling errors? To count spelling errors, we used the Google Docs editor10 and manually counted the errors identified by the editor. We observed that the baseline, CW, SB, WR, and TW datasets have 29, 23, 11, 16, and 38 spelling errors, respectively. The reason behind such a reduction when using a word-recommendation-based approach (WR and SB) might lie in the fact that workers are less prone to making spelling errors when they are given the spellings of words they may use.

Does word recommendation reduce the task completion time? Task completion time indicates how long it takes for a worker to finish the task. Since the time calculated by platforms cannot account for the time a worker spends on unrelated activities (e.g., talking on the phone, having coffee), we calculated the interquartile mean (IQM) for all datasets. The IQM values for the baseline, CW, SB, TW, and WR were 47, 41, 40, 65, and 35 seconds per paraphrase, respectively.

10https://docs.google.com


Therefore, it can be inferred that using the proposed approach can slightly accelerate paraphrasing. This is also in line with the priming effect: an appropriate set of word/phrase recommendations can help workers retrieve related words faster. The proposed approach has the minimum completion time among all approaches, while the recommendation list generated by SB shows only a slight improvement in task completion time, which might be due to the lower quality of its recommendations in comparison to the proposed approach. On the other hand, forbidding workers from using taboo words increases the difficulty of the task, making the task completion time longer.
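As an illustration, the interquartile mean can be computed as below; this simple sketch assumes the number of observations is divisible by four (a weighted variant is needed otherwise) and is not the exact script used for the reported values.

def interquartile_mean(seconds):
    # drop the fastest and slowest quartiles and average the remaining times
    xs = sorted(seconds)
    q = len(xs) // 4
    middle = xs[q:len(xs) - q]
    return sum(middle) / len(middle)

print(interquartile_mean([20, 25, 30, 35, 40, 45, 90, 300]))  # 37.5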

Does word recommendation increase workers' satisfaction? Upon completion of the task, crowd workers were prompted to take a satisfaction survey to give feedback about various aspects of the task, including how easy they found the crowdsourcing job and how satisfied they were with the payment they received for providing the three paraphrases. Table 4.6 gives the average scores (on a scale of 0–5) given by crowd workers for each of the paraphrasing tasks. Based on the scores reported by Figure-Eight, workers found the word-recommendation approaches (SB and WR) easier and fairer regarding the payment. This indicates that recommending words facilitates paraphrasing. On the other hand, workers found the TW approach comparatively difficult and unfair.

Table 4.6: Worker Satisfaction

Dataset                  Ease of Job   Pay
Baseline                 3.98          3.66
Chinese Whispers (CW)    4.06          3.75
SearchBetter (SB)        4.47          4.20
Taboo Words (TW)         3.30          3.50
Word Recommender (WR)    4.62          4.64


4.4.4 Limitations

Word recommendation facilitates paraphrasing and improves the diversity of collected paraphrases. However, it is not immune to unqualified crowd workers. Cheaters and unqualified workers generate incorrect paraphrases, which can affect the quality of the recommendations [219]. While the design of the ranking probability model takes invalid paraphrases into account, they can still harm the recommendations. This can be mitigated by automatic quality control that only lets qualified paraphrases be submitted [219]. This requires automatic detection of incorrect paraphrases so that workers can only submit high-quality paraphrases. As such, noise can be reduced and consequently the quality of word suggestions can be improved. Another limitation of the proposed system arises in cases where the given words do not have many closely related words, since we used a fixed number of top-n neighbours for all words, as mentioned in Section 4.3.2. Choosing a proper value for n is debatable and depends on how a word embedding model is trained (e.g., its vector space dimension) [175, 54]. As future work, it is essential to dynamically determine the value of n for a given synonym set. Moreover, using domain-specific word embeddings can help de-noise word suggestions. As such, in a given domain, the proposed system can suggest highly related words and synonyms (e.g., in the programming domain, the word "Java" refers to a programming language, not to "coffee" or "mocha").

4.5 Conclusion

In this chapter, we showed how word recommendations can accelerate paraphrasing and improve diversity. We proposed a novel hybrid technique that combines existing advances in both automated methods and crowdsourcing. Our work aimed at addressing an important shortcoming in current crowdsourced paraphrasing, namely the priming effect. By recommending appropriate words, we sought to motivate crowd workers to enhance the diversity of their paraphrases. Our solution involved automated methods for selecting seed words and performing word expansion. Nevertheless, the main challenge is to recommend diverse but semantically relevant words. We thus devised a probabilistic model to continuously adapt the expanded list into an improved version; we relied specifically on monitoring implicit worker feedback. Ultimately, our end-to-end experiments indicate that the proposed method improved the diversity of paraphrases.

We observed that a major quality issue with serious effects is malicious workers who generate garbage paraphrases [118, 91, 195]. While we accounted for this problem in the proposed model through the implicit relevance feedback, it can still hurt the performance of the recommendation. Our future work will focus on automatically detecting invalid paraphrases. Another important aspect of crowdsourcing is to formalize and define when enough paraphrases have been collected for a given intent. Doing so is not easy [103], and it might depend on the intent detection algorithm, the desired accuracy, and many other factors. In future work, we will also target this problem, together with many other exciting opportunities as extensions to this work.


Chapter 5

Automatic Malicious Worker Detection

Automatic Malicious Worker Detection in Crowdsourced Paraphrases

In this chapter, we analyze the cheating behaviors observed in crowdsourced paraphrases. We provide a taxonomy of cheating behaviors in crowdsourced paraphrasing and discuss our solutions for detecting such behaviors. We also propose an effective approach to detect cheating behaviors. This chapter is organized as follows. We start with an introduction in Section 5.1. In Section 5.2, we discuss related work. In Section 5.3, we provide a detailed analysis of cheating behaviors and patterns in crowdsourced paraphrases. In Section 5.4, we describe the proposed techniques to detect cheaters efficiently. In Section 5.5, we evaluate the proposed techniques, and finally we provide concluding remarks and directions for future work in Section 5.6.

5.1 Introduction

We showed in Chapter 3 that a considerable percentage of crowdsourced paraphrases (up to 40%) may be of unacceptable quality [46, 217, 219]. In particular, malicious crowd workers may intentionally generate erroneous paraphrases [219, 46]. Thus, crowdsourced paraphrases must be checked for quality [219]. In this chapter, we investigate how malicious crowd workers (also called cheaters) generate erroneous paraphrases and categorize various types of cheating behaviors (e.g., using foreign languages, adding/removing characters) in crowdsourced paraphrases. Moreover, based on the identified cheating behaviors, we propose a set of features to be used in training machine learning models for detecting malicious workers.

More specifically, our contributions are two-fold:

• By manually inspecting the ParaQuality1 [219] dataset, which contains 6,000 paraphrases in 40 domains, we identify a taxonomy of common cheating behaviors in crowdsourced paraphrases (e.g., adding/removing random words to/from the sentence to generate a paraphrase). Accordingly, we discuss the characteristics of each category of cheating behaviors and how they can be detected automatically.

• Based on the identified characteristics of cheating behaviors, we identify various features from the literature to be used in the automatic detection of malicious workers [123, 219]. We also discuss the shortcomings of existing features, namely in calculating semantic similarity between multiple paraphrases and in tracking how a worker edits the given sentence to generate paraphrases. We thus propose two new features to overcome the identified shortcomings: (i) inter-paraphrase semantic similarity and (ii) the worker's editing patterns (e.g., which part of the sentence is rephrased to generate the paraphrase). Our experiments indicate that the proposed method is an effective approach to detect cheating behaviors in crowdsourced paraphrases.

5.2 Related Work

To the best of our knowledge, our work is the first to identify and categorize the cheating behaviors of crowd workers generating paraphrases. We refer the reader to Chapters 2 and 3 for background on crowdsourced utterances, their associated quality issues, and quality control in crowdsourcing tasks. Nevertheless, our work is related to the area of semantic similarity, which measures how semantically similar two sentences are. Measuring similarity between two sentences has a wide range of applications in Natural Language Processing (NLP) systems. Examples of such applications include plagiarism detection, question-answering systems, and paraphrase detection [227, 58, 126]. Recent advances in sentence embedding (e.g., the Universal Sentence Encoder [29]) can be exploited to detect if a given paraphrase conveys the same meaning.

1https://github.com/unsw-cse-soc/ParaQuality

However, existing approaches may fail to correctly estimate the extent of semantic similarity between two sentences. Examples include assigning high similarity values to a pair of sentences that share many words regardless of the semantics, and assigning a low semantic similarity to two exact paraphrases that do not share any words [205]. To reduce the impact of such issues, in this study, we introduce a new feature called IPSS. IPSS is built on the fact that a worker's paraphrases for the same sentence must semantically express the same meaning.

5.3 Cheating Behaviors

To characterize cheating behaviors in crowdsourced paraphrases, we investigated the paraphrases labeled Cheating in the ParaQuality dataset [219]. This dataset includes 6,000 annotated paraphrases (i.e., Correct, Cheating, Linguistic Errors) from 40 domains, in which crowd workers provided three paraphrases for a given sentence (in total, 40 sentences with 2,000 paraphrase triples). Based on manual inspections, we recognized 5 primary categories of cheating behaviors, as listed below.

5.3.1 Character-Level Edits

One of the common cheating behaviors is to add/remove one or a few characters to/from the given sentence (e.g., "what is the taxi fare from home to airport"), potentially resulting in spelling errors in the generated paraphrase (e.g., "what is the taxi fare from home to airportsdsdsd"). Sincere workers may also occasionally have typos in their paraphrases ("from" → "form"), but such typos may differ from those generated by malicious workers ("from" → "frommmm"). In particular, adding random characters may create nonsense words (also called gibberish words, e.g., "sdsds") in the paraphrase. In addition, such erroneous paraphrases and the corresponding sentences have a low edit distance2 since only a few characters are added to or removed from the given sentence. However, our manual inspections indicate that a sincere worker may also create a correct paraphrase with a low edit distance (e.g., "what is the taxi rate from home to airport"), which is not an act of cheating.

2edit distance between two strings is the minimum number of operations required to transform one string into the other
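For reference, the edit (Levenshtein) distance mentioned in the footnote can be computed with the standard dynamic-programming recurrence; the sketch below is a generic illustration rather than thesis-specific code.

def edit_distance(a, b):
    # minimum number of insertions, deletions, and substitutions to turn a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("from", "form"))              # 2
print(edit_distance("airport", "airportsdsdsd"))  # 6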


5.3.2 Word-Level Edits

Similar to the previous type of cheating behavior, cheaters may also add/remove words to/from the given sentence to generate a paraphrase (e.g., “what is the taxi rate from home to airport”). Having low edit-distance with the given sentence is the main characteristic of such a paraphrase. Moreover, removed/added words may result in grammatical errors (e.g., “what the taxi fare from home to airport”). Needless to say, simply having grammatical errors does not mean a worker is cheating. This type of cheating may also add new entities not mentioned in the given sentence (e.g., “what is the taxi rate from home to airport for 10 people”).

5.3.3 Random Sentences

Another cheating behavior is to rewrite the given sentence (e.g., "create a public playlist named NewPlaylist") substantially, in a very random way (e.g., "public saw many NewPlaylist this year"). Detecting such paraphrases requires sophisticated approaches that compute the semantic similarity between two sentences to decide whether they convey the same meaning. The reason stems from the fact that existing solutions fail to detect semantic divergence when sentences share many words but convey different meanings [58]. However, it is easier to detect this case of cheating when a crowd worker provides more than one paraphrase for a given sentence. Take the following paraphrases generated by a worker as an example for the sentence "Create a public playlist named NewPlaylist":

• That song hits the public NewPlaylist this year

• Public really loved that NewPlaylist played on the event

• Public saw many NewPlaylist this year

Since these are paraphrases for the same sentence, we can also consider the inter-paraphrase semantic similarity. Considering such similarities (as supplementary measures) can assist in detecting semantically incorrect paraphrases, as discussed further in Section 5.4.

5.3.4 Answering to Canonical Utterance

When a given sentence is a question (e.g., "what is the taxi fare from home to airport?") or a request for information (e.g., "I want to know the taxi fare from home to airport"), malicious workers may respond to the given sentence instead of paraphrasing it (e.g., "the taxi fare is $10").

However, this type of error may also result from the crowd worker misunderstanding the task [219]. Our analysis indicates that this type of cheating may introduce new named entities into the paraphrases which do not exist in the given sentence (e.g., "$10" in the above-mentioned paraphrase). Thus, any automated approach must check whether the entities in the paraphrase and the given sentence match. Comparing the speech act3 (e.g., answering, requesting) of a given sentence and its paraphrase can be useful in detecting these cases. However, it has been shown that existing tools for detecting speech acts do not yet perform well in domain-independent settings [219].

5.3.5 Foreign Language

Cheaters occasionally use their own native languages instead of paraphrasing in the given language. A sincere worker may also generate paraphrases in other languages as a result of misunderstanding the task [219]. However, our investigations indicate that malicious workers provide sentences in other languages which do not express the same meaning as the given sentence. Detecting such cases of cheating seems easy using language identification tools (e.g., DetectLanguage4). However, these tools may fail to detect the language of sentences with misspellings [219]. Misspellings substantially impact the performance of language detection techniques, and consequently the language is detected incorrectly for misspelled sentences. Given that the majority of words in a sentence written in another language would be considered misspellings by spell checkers for the expected language, we can condition the use of language detection tools on the sentence having many spelling errors [219].
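This heuristic can be sketched as follows; the choice of libraries (langdetect for language identification and pyspellchecker for counting out-of-vocabulary words) and the 0.5 misspelling threshold are illustrative assumptions, not the tools used in this thesis.

from langdetect import detect
from spellchecker import SpellChecker

spell = SpellChecker()

def looks_foreign(paraphrase, misspelling_ratio=0.5):
    # only trust the language detector when most words look misspelled in English
    words = paraphrase.lower().split()
    if not words:
        return False
    misspelled = spell.unknown(words)
    if len(misspelled) / len(words) < misspelling_ratio:
        return False
    return detect(paraphrase) != "en"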

3the function of a sentence in communication (e.g., answering, requesting, greeting)
4https://ws.detectlanguage.com/
5https://www.nist.gov/
6https://github.com/rrenaud/Gibberish-Detector
7https://kite.com/python/docs/nltk.probability.LaplaceProbDist
8https://github.com/delph-in/pydelphin


Category              Features

Semantic Similarity   Semantic textual similarity method proposed in [58]; Word Mover's Distance [110] between the word embeddings of the sentence and those of its paraphrase; cosine similarities and euclidean distances between the vectors of the expression and the paraphrase generated by Sentence-BERT [174], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179].

Editing Patterns      Edit distance between the sentence and its paraphrase; n-gram overlap, exclusive longest common prefix n-gram overlap, and Sumo [42]; Gaussian, Parabolic, and Trigonometric functions proposed in [96]; Paraphrase In N-gram Changes (PINC) [30]; Bilingual Evaluation Understudy (BLEU) [149]; Google's BLEU (GLEU) [213]; NIST's5 n-gram score function [49]; Character n-gram F-score (CHRF) [160]; and the length of the longest common subsequence.

Language Correctness  A function to detect if the paraphrase is written in English [219]; count of gibberish words in the paraphrase6; count of spelling and grammatical errors in the paraphrase; language-model score based on the probability distribution of n-grams in the given paraphrase7; functions to detect if the given paraphrase is a question, an answer, or an imperative sentence [219]; a function to detect whether the tenses/pronouns/entities of the sentence and paraphrase match; and a function to detect if the paraphrase can be parsed as a proper English sentence by delph-in8.

General               Difference between the count of characters in the expression and in the paraphrase; difference between the count of words in the expression and in the paraphrase; and the time spent by the worker paraphrasing the given sentence.

Table 5.1: Summary of Feature Library

5.4 Cheating Detection

Deep-learning-based approaches have gained popularity because they eliminate the burden of manual feature engineering. However, using such techniques requires a large amount of high-quality annotated data [73, 225]. Because of the lack of such datasets, we manually identified a set of features, as discussed in this section.


Paraphrase the following sentence:

Search for a restaurant near the university

Paraphrase 1 (required)

Paraphrase 2 (required)

Paraphrase 3 (required)

Figure 5.1: Crowdsourced Paraphrasing in figure-eight

5.4.1 Feature Engineering

Detecting a cheating behavior requires a deep understanding of how paraphrases are generated by workers. As such, based on the identified types of cheating and their characteristics, we identified high-level categories of features required for detecting cheating behaviors, including semantic similarity metrics, editing patterns, language correctness, and a few general features, as listed in Table 5.1. Asking for triple paraphrases is a common practice in crowdsourced paraphrasing [219, 95]. Figure 5.1 shows a sample paraphrasing task designed in figure-eight9 for obtaining three paraphrases for a given sentence. Given a sentence and three paraphrases, we propose two novel sets of features: Inter-Paraphrase Semantic Similarity and Edit Features.

Inter-Paraphrase Semantic Similarity (IPSS). Measuring semantic similarity between two units of text has numerous applications and has been given much attention by researchers [58, 29]. Yet, existing approaches may fail to correctly estimate the extent of semantic similarity between two sentences. Examples include assigning high similarity values to a pair of sentences that share a large number of words regardless of the semantics, and assigning a low semantic similarity to two equivalent paraphrases that do not share any words [205]. To mitigate the impact of such issues, we propose a new metric called Inter-Paraphrase Semantic Similarity (IPSS). IPSS depends on the semantic similarity of a paraphrase not only with the given sentence but also with the other two paraphrases submitted by the same worker for the same sentence. Our intuition is that the three paraphrases and the given sentence must semantically convey the same meaning, and thus the inter-paraphrase semantic similarity should also be high (as well as the similarity between the sentence and each paraphrase).

9https://www.figure-eight.com

Moreover, the inter-paraphrase similarity scores must be close to each other since they are equivalent paraphrases (indicating a low variance). As such, the model can detect if a malicious worker is generating random paraphrases (expressing different meanings).

\[ PS = \{\, sim(u, m) \mid \forall u \in \{s\} \cup P,\ \forall m \in P,\ m \neq u \,\} \]

\[ IPSS(PS) = \frac{mean(PS)}{variance(PS) + \epsilon} \qquad (5.1) \]

where P represents the set of paraphrases \(\{p_1, p_2, \ldots, p_n\}\) for the given sentence s, and sim(·) is a function to measure the semantic similarity between two pieces of text. In our implementation, we used three different similarity functions which have been reported to outperform existing solutions in the semantic textual similarity task: (1) the cosine similarity between embeddings generated by the Universal Sentence Encoder [29], (2) the cosine similarity between embeddings generated by Sentence-BERT [174], and (3) the similarity model proposed in [58].
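A minimal sketch of IPSS follows, using a single similarity function (cosine similarity over embeddings from the sentence-transformers library); the model name and the single-similarity simplification are assumptions, since the actual feature combines the three similarity functions listed above.

from statistics import mean, variance
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def ipss(sentence, paraphrases, eps=1e-6):
    texts = [sentence] + list(paraphrases)
    emb = model.encode(texts, convert_to_tensor=True)
    sims = []
    for i in range(len(texts)):          # u ranges over {s} ∪ P
        for j in range(1, len(texts)):   # m ranges over P
            if i != j:
                sims.append(float(util.cos_sim(emb[i], emb[j])))
    return mean(sims) / (variance(sims) + eps)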

Worker's Editing Patterns. As mentioned earlier, a common editing habit of cheaters is to create paraphrases by inserting/removing text at a particular position of the given sentence (or the previous paraphrase). On the other hand, sincere workers usually create diverse paraphrases by editing the given sentence extensively. Thus, locating the positions (on a scale of [0, 1], with 0 indicating the beginning and 1 the end of the paraphrase) of the main edited parts in the paraphrases can give insight into how a worker generates paraphrases. For a given pair of sentences (p1, p2), we define the edit position (EP) as follows:

\[ EP(p_1, p_2) = \begin{cases} 1 - \dfrac{index_{p_1}(lcs(p_1, p_2))}{length(p_1) + \epsilon}, & \text{if } lcs(p_1, p_2) \neq \emptyset \\ 0.5, & \text{otherwise} \end{cases} \]

where lcs denotes the longest common subsequence between the given sentences, and index_{p_1}(·) gives its position within p_1. Intuitively, it can be an act of cheating if a worker keeps editing the same part of a sentence to generate a paraphrase based on the previous paraphrase (or the sentence) without much rephrasing. Thereby, such paraphrases should have similar edit position values, and the variance of their edit positions is low. Based on that, we define a new feature for training machine learning models, called the Edit Position Score (EPS), as follows:

\[ PS = \{\, EP(p_i, p_j) \mid \forall i, j \in \{1, 2, \ldots, n\},\ i = j + 1 \,\} \cup \{ EP(s, p_1) \} \]

\[ EPS(PS) = variance(PS) \qquad (5.2) \]

EPS only considers the position of the longest common subsequence between sentences. Given that a sincere worker provides diverse paraphrases, the paraphrases should not have many common n-grams. As such, we also calculate the n-gram distance (NGD) between a sentence (s) and all of the corresponding paraphrases (P) generated by a worker:

\[ NgSet = \bigcup_{p \in P \cup \{s\}} \ \bigcup_{n=1}^{5} ngrams(p, n) \]

\[ likelihood(ngram) = \frac{\sum_{p \in P \cup \{s\}} count(ngram, p)}{|P| + 1} \]

\[ NGD(s, P) = \frac{1}{|NgSet|} \sum_{ngram \in NgSet} \big(1 - likelihood(ngram)\big) \qquad (5.3) \]

where ngrams(·) extracts the n-grams of a given sentence, and count(·) returns the number of occurrences of a given n-gram in the given sentence.
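The two editing-pattern features can be sketched as below; the longest matching block from Python's difflib is used here as a stand-in for the lcs function, and whitespace tokenization is assumed, so this is an illustration rather than the exact feature-extraction code.

from difflib import SequenceMatcher
from statistics import variance

def edit_position(p1, p2, eps=1e-6):
    # position (in [0,1]) within p1 of the longest common block of p1 and p2
    match = SequenceMatcher(None, p1, p2).find_longest_match(0, len(p1), 0, len(p2))
    if match.size == 0:
        return 0.5
    return 1 - match.a / (len(p1) + eps)

def eps_score(sentence, paraphrases):
    # Equation 5.2: variance of edit positions across consecutive paraphrases
    positions = [edit_position(sentence, paraphrases[0])]
    positions += [edit_position(paraphrases[i + 1], paraphrases[i])
                  for i in range(len(paraphrases) - 1)]
    return variance(positions)

def ngram_distance(sentence, paraphrases, max_n=5):
    # Equation 5.3: average "novelty" of n-grams across the sentence and paraphrases
    texts = [sentence] + list(paraphrases)
    def ngrams(t, n):
        toks = t.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    ngram_set = {g for t in texts for n in range(1, max_n + 1) for g in ngrams(t, n)}
    total = 0.0
    for g in ngram_set:
        count = sum(ngrams(t, len(g)).count(g) for t in texts)
        total += 1 - count / (len(paraphrases) + 1)
    return total / len(ngram_set)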

5.4.2 Malicious Worker Detection

It has been shown in [219] that a Random Forest classifier outperformed other classifiers in terms of recall and F1 score, while a Support Vector Machine (SVM) classifier achieved the highest precision. As such, we propose a model based on these algorithms. Using the aforementioned features and the Random Forest classifier, we estimate whether a paraphrase (p) for a given sentence (s) indicates a cheating (ch) behavior: p(ch|p, s). In other words, the probability that a paraphrase was generated by a cheater is the proportion of trees in the ensemble which classified the paraphrase as cheating. However, a malicious worker may also have shown cheating behaviors in their previously submitted paraphrases. The problem can therefore be reformulated as the probability of a worker cheating based on their provided paraphrases (in particular, we only consider the last three paraphrases submitted by the worker): pworker(ch|s, p1, p2, p3). To estimate whether a worker cheats, we used Support Vector Regression (SVR) conditioned on the probabilities of all submitted paraphrases showing cheating behaviors:

\[ p_{worker}(ch \mid s, p_1, p_2, p_3) = p_{worker}\big(ch \mid p(ch \mid p_1, s),\ p(ch \mid p_2, s),\ p(ch \mid p_3, s)\big) \qquad (5.4) \]
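A schematic sketch of this two-stage model with scikit-learn is given below; it assumes the per-paraphrase feature vectors (IPSS, editing patterns, and the other features of Table 5.1) have already been extracted and that cheating is encoded as label 1.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR

paraphrase_clf = RandomForestClassifier(n_estimators=100)
worker_model = SVR()

def fit(paraphrase_features, paraphrase_labels, worker_triples, worker_labels):
    # Stage 1: p(ch|p, s) as the proportion of trees voting "cheating"
    paraphrase_clf.fit(paraphrase_features, paraphrase_labels)
    # Stage 2: SVR over the three per-paraphrase probabilities of each worker triple
    triple_probs = np.vstack([paraphrase_clf.predict_proba(t)[:, 1] for t in worker_triples])
    worker_model.fit(triple_probs, worker_labels)

def worker_cheating_probability(triple_features):
    probs = paraphrase_clf.predict_proba(triple_features)[:, 1]
    return float(worker_model.predict(probs.reshape(1, -1))[0])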

5.5 Experiment & Results

Similar to [219], we used 10-fold cross-validation on the ParaQuality dataset, ensuring that the test and training folds never share paraphrases from the same domain, so that the proposed method is evaluated in a domain-independent manner.
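In practice, such a domain-independent split can be obtained with scikit-learn's GroupKFold, using the domain of each paraphrase as the group; the snippet below is a sketch of this setup rather than the exact evaluation script.

from sklearn.model_selection import GroupKFold

def domain_independent_folds(features, labels, domains, n_splits=10):
    # every sample of a domain stays in one fold, so train and test never share a domain
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(features, labels, groups=domains):
        yield train_idx, test_idx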

5.5.1 Evaluation

Table 5.2 reports the performance of the proposed method in comparison to prior work. The proposed method for estimating the worker's cheating probability (Worker Modeling) performs promisingly in comparison with the methods proposed in [219], surpassing existing approaches with higher recall and F1 scores. In the ParaQuality dataset, 18% of the paraphrases are labeled as cheating [219]. Figure 5.2 shows the cheating rate10 for each domain (40 domains in total). As shown in this figure, the proposed method works well across different domains, which makes it appropriate for use in crowdsourcing platforms for paraphrasing in any domain. Overall, if we automatically reject triple paraphrases which are detected as cheating, the cheating rate drops to 8.2%. However, if the worker is also prevented from continuing paraphrasing when they are first suspected of cheating, the cheating rate drops to 4.4%. This also indicates that submitted paraphrases can be verified automatically in real time, and if the submitted paraphrases are detected as cheating cases, the worker can be banned automatically from the paraphrasing task (potentially saving money and improving the quality of the collected dataset).

5.5.2 Error Analysis

While the proposed approach outperforms the prior work, it still fails in a few cases. The related work indicates that existing language tools frequently fail to detect linguistic errors (e.g., grammatical and spelling errors), and this affects the proposed method as well [219].

10Cheating rate is the percentage of invalid paraphrases generated by cheaters.


Model (Features)                                               Precision   Recall   F1
Worker Modeling (All Features + IPSS + Editing Patterns)       0.832       0.606    0.701
Paraphrase Modeling (All Features + IPSS + Editing Patterns)   0.848       0.555    0.671
Worker Modeling (IPSS + Editing Patterns)                      0.719       0.586    0.646
Worker Modeling (IPSS)                                         0.737       0.518    0.609
SVM [219]                                                      0.878       0.223    0.356
K-Nearest Neighbor [219]                                       0.871       0.248    0.386
Random Forest [219]                                            0.843       0.546    0.663
Maximum Entropy [219]                                          0.756       0.440    0.557
Decision Tree [219]                                            0.632       0.566    0.597
Naive Bayes [219]                                              0.473       0.426    0.449

Table 5.2: Automatic Malicious Worker Detection

Moreover, our investigations indicate that the key to detecting invalid paraphrases is measuring semantic similarity. Using state-of-the-art approaches to measure semantic similarity (i.e., the Universal Sentence Encoder), the proposed feature (IPSS) improved the model's performance over the prior work. However, we still require more accurate approaches to measure the similarity between a given sentence and its paraphrases. In addition, using the worker's editing-pattern features may also result in rejecting valid paraphrases with a low edit distance to the given sentence (e.g., "request a taxi to take me home" and "get a taxi to take me home"). Such paraphrases are valid; however, they might not be proper paraphrases, since the main reason for paraphrasing is obtaining diverse paraphrases (ideally with new wording and structure).

Figure 5.2: Cheating rate across different domains (cheating rate per domain for the original dataset, the paraphrase-modeling approach, and the worker-modeling approach)

Thus, rejecting such paraphrases may also be desirable in many applications [95, 171]. Our evaluation also indicates that in 78.4% of the cases in which the proposed model incorrectly detected a cheating behavior, the paraphrases were not proper paraphrases (they suffer from other types of quality issues such as misspellings or grammatical errors). Thus, in most cases, the proposed method only bans cheaters and low-quality workers from continuing the job (and is less likely to remove high-quality workers).

5.6 Conclusion

In this chapter, we employed a data-driven approach to investigate various cheating behaviors and their characteristics in crowdsourced paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then proposed two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity, user editing habits) and proposed a domain-independent method for modeling malicious workers. We observed that the dataset is small for training state-of-the-art deep learning models. Moreover, the dataset is unbalanced, and many of the utterances do not suffer from the cheating problem. Likewise, spelling and linguistic errors are also often interpreted as cheating behavior by the trained models. While the proposed methods positively contributed to this task, several extensions could overcome these issues. One possible extension to this work is obtaining a new collection of utterances which suffer from the cheating problem. However, it is time-consuming to crowdsource utterances and annotate them with quality issues. An alternative approach is to design another crowdsourcing task and ask workers to behave as cheaters and provide inappropriate paraphrases. Meanwhile, to prevent workers from entering overly obvious cases of cheating (e.g., random characters such as "khkjdla ld"), the proposed model can be used to accept only utterances which are not detected as cheating. As such, we can obtain a larger set of cheating utterances and build more robust cheating detection models.

Chapter 6

Automated Canonical Utterance Generation

Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications

This chapter proposes an automated approach for the generation of canonical utterances for REST (REpresentational State Transfer) APIs. We show that the generation of canonical utterances can be considered a supervised translation task in which an API method is translated into an utterance. Moreover, we propose API2CAN, a dataset containing 14,370 pairs of API methods and utterances. The dataset is built by processing a large number of public APIs. In addition, we formalize and define resources in REST APIs, and we propose a delexicalization technique (converting an API method and initial utterances into tagged sequences of resources) to let deep-learning-based approaches learn from the proposed dataset. The rest of this chapter is organized as follows. We start with an introduction in Section 6.1. In Section 6.2, we discuss related work. Section 6.3 proposes the API2CAN dataset and discusses how we built the dataset automatically by processing a large set of API specifications. In Section 6.4.2, we present the proposed approach for translating API operations to natural language. Section 6.5 presents the proposed approach for sampling values for the parameters of a given API operation. In Section 6.7, we evaluate the proposed approaches. Finally, Section 6.8 provides our conclusion.


6.1 Introduction

APIs are used for connecting devices, managing data, and invoking services [190, 7, 200]. In particular, because of its simplicity, REST is the most dominant approach for designing Web APIs [146, 178, 148]. Meanwhile, thanks to the advances in machine learning and the availability of web services, building natural language interfaces has gained attention from both researchers and organizations (e.g., Apple's Siri, Google's Virtual Assistant, IBM's Watson, Microsoft's Cortana). As discussed in Chapter 2, virtual assistants often employ supervised models which require a large set of natural language utterances (e.g., "get a customer with id being 1") paired with their corresponding intents or executable forms (e.g., SQL queries, API calls, logical forms). In this chapter, given the popularity of REST APIs (based on the well-known HTTP protocol), we focus on one of the most common types of executable forms, called operations.

Annotated Utterance Get a customer with id being 1

Operation GET /customers/ {customer_id}

HTTP Verb Endpoint (URI/Path) Parameter

In REST APIs, an operation (also called an API method) consists of an HTTP verb (e.g., GET, POST), an endpoint (e.g., /customers), and a set of parameters1 (e.g., query parameters). Figure 6.1 shows the different parts of a REST request in HTTP. As shown in Figure 6.2 and discussed in Chapter 2, collecting such pairs requires obtaining canonical utterances and paraphrasing them into new variations in order to live up to the richness of human language [25, 190, 219]. Paraphrasing approaches (e.g., crowdsourcing, automatic paraphrasing systems) have made the second step less costly [219, 25, 139], but existing approaches (e.g., utterance templates and rule-based approaches) for generating the canonical utterances are still limited, and they are not scalable (see Chapter 2) [25]. In other words, adding new APIs to a particular virtual assistant requires manual effort to revise hand-crafted grammars and generate training samples for new domains. With the growing number of APIs and modifications of existing APIs, automated bot development has become paramount, especially for virtual assistants which aim at servicing a wide range of tasks [25, 190].

1In this chapter, to show the parameters of an operation, we use curly brackets with two parts separated by a colon (e.g., {customer_id:1}): the first part gives the name of the parameter, and the second part indicates a sample value for the parameter


POST /customers/1/accounts?brief=true HTTP/1.1
Host: bank.api
Accept: application/json
Content-Type: application/json
...
Authorization: Bearer mt0dgHmLJMV_PxH23Y

{
  "account-type": "saving",
  "opening-date": "01/01/2020"
}

Figure 6.1: Example of an HTTP POST Request (annotated in the original figure with the HTTP method, endpoint and path parameter, query parameter, protocol, header parameters, and body/payload)

Figure 6.2: Classical Training Data Generation Pipeline (API spec → executable form generation, e.g., GET /customers/{customer_id} → canonical sentence generation, e.g., "get the customer with id being «customer_id»" → paraphrase generation, e.g., "get the first customer" → training utterances)

Supervised approaches such as sequence-to-sequence models can be used for translating operations to canonical utterances. However, the key challenge is the lack of training data (pairs of operations and canonical utterances) for training domain-independent models. In this chapter, we propose API2CAN, a dataset containing 14,370 pairs of operations and canonical utterances. The dataset is generated automatically by processing a large set of OpenAPI specifications2 (based on the description/summary of each operation). However, deep-learning-based approaches such as sequence-to-sequence models require much larger sets of samples to train from (ideally millions of training samples). That is to say, sequence-to-sequence models easily overfit small training datasets, and issues such as out-of-vocabulary (OOV) words can negatively impact their performance.

2previously known as Swagger specification

To overcome such issues, we propose a delexicalization technique to convert an operation into a sequence of predefined tags (e.g., singleton, collection) based on RESTful principles and design guidelines (e.g., the use of plural names for collections of resources, the use of HTTP verbs). In summary, our contribution is three-fold:

• A Dataset. We propose a dataset called API2CAN, containing annotated canonical templates (a canonical utterance in which parameter values have been replaced with placeholders, e.g., "get a customer with id being «id»") for 14,370 operations of 985 REST APIs. We automatically built the dataset by processing a large set of OpenAPI specifications, and we converted operation descriptions to canonical templates based on a set of heuristics (e.g., extracting a candidate sentence, injecting parameter placeholders into the method descriptions, removing unnecessary words). We then split the dataset into three parts (test, train, and validation sets).

• A Delexicalization Technique. Deep-learning algorithms such as sequence-to-sequence models require millions of training pairs to learn from. To help such models learn from smaller datasets, we propose a delexicalization technique to convert the input (operation) and output (canonical template) of such models into a sequence of predefined tags called resource identifiers. The proposed approach is based on the concept of a resource in RESTful design. In particular, we formalize various kinds of resources (e.g., collection, singleton) in REST APIs. Next, using the identified resource types, we propose a delexicalization technique to replace mentions of each resource (e.g., customers) with a corresponding resource identifier (e.g., Collection_1). As such, for a given operation (e.g., GET /customers/{customer_id}), the model learns to translate the delexicalized operation (e.g., GET Collection_1 Singleton_1) to a delexicalized canonical template (e.g., "get a Collection_1 with Singleton_1 being «Singleton_1»"). A resource identifier consists of two parts: (1) the type of resource and (2) a number n which indicates the n-th occurrence of that resource type in the given operation. Resource identifiers are then used at translation time to lexicalize the output of the sequence-to-sequence model (e.g., "get a Collection_1 with Singleton_1 being «Singleton_1»") to generate a canonical template (e.g., "get a customer with customer id being «customer_id»").


Delexicalization is done to reduce the impact of OOV words and to force the model to learn the pattern of translating the resources in an operation to a canonical template (rather than translating a sequence of words); a simplified sketch of this idea appears after this list.

• Analysis of Public REST APIs. We analyze and give insight into a large set of public REST APIs. This includes how REST APIs are designed in practice and how they drift from RESTful principles (design guidelines such as using plural names and the appropriate use of HTTP verbs). We also provide insight into the distribution of parameters (e.g., parameter types and locations) and how values can be sampled for various types of parameters to generate canonical utterances out of canonical templates using API specifications (e.g., example values, similar parameters with sample values). Automatically sampling values for parameters is essential for the automatic generation of canonical utterances because current bot development platforms (e.g., IBM Watson) require annotated utterances (not canonical templates with placeholders).
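The following sketch illustrates the delexicalization idea on URL paths only: plural path segments are tagged as collections and path parameters as singletons. The actual resource tagger in this chapter formalizes more resource types and also handles query/body parameters; the plural test via the inflect library is an assumption made for this illustration.

import re
import inflect

plural_engine = inflect.engine()

def delexicalize(verb, path):
    tags, mapping = [], {}
    counter = {"Collection": 0, "Singleton": 0}
    for segment in path.strip("/").split("/"):
        if re.fullmatch(r"\{.+\}", segment):
            kind = "Singleton"                                 # a path parameter
        elif plural_engine.singular_noun(segment.replace("-", " ")):
            kind = "Collection"                                # plural name -> collection of resources
        else:
            kind = "Singleton"
        counter[kind] += 1
        tag = f"{kind}_{counter[kind]}"
        tags.append(tag)
        mapping[tag] = segment.strip("{}")
    return f"{verb} " + " ".join(tags), mapping

delexicalized, mapping = delexicalize("GET", "/customers/{customer_id}")
# delexicalized -> "GET Collection_1 Singleton_1"
# mapping       -> {"Collection_1": "customers", "Singleton_1": "customer_id"}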

6.2 Related Work

In this section, we discuss the background on building natural language interfaces for REST APIs, and we refer the interested reader to Chapter 2 for a detailed overview of dialog systems and how training utterances are acquired.

6.2.1 REST APIs

REST is an architectural style and a guideline for how to use the HTTP protocol3 when designing Web services [62]. RESTful web services leverage HTTP using specific architectural principles (i.e., addressability, uniform interface) [151]. Since REST is a guideline without standardization, it is not surprising that API developers only partially follow the guidelines or interpret REST in their own ways [178]. In particular, according to the uniform interface principle, resources must be accessed and manipulated using HTTP methods (e.g., DELETE, GET) and status codes (e.g., using "201" to show that a resource was created, and "404" to show that a resource does not exist) [218, 178]. The uniform interface requires APIs to be developed uniformly to ensure that API users can understand the functionality of each operation without reading tedious and long descriptions.

3REST isn’t protocol-specific, but it is designed over HTTP nowadays

To ensure a uniform interface, API developers are required to follow design patterns (e.g., using plural names to name collections of resources, using lowercase letters in paths). Existing works have listed not only these patterns but also anti-patterns in designing interfaces of REST APIs [147, 178, 157]. Examples of anti-patterns include using underscores in paths and adding file extensions in paths [148, 146].

In this chapter, we build upon existing works on designing interfaces for REST APIs. In particular, we formalize resource types based on patterns and anti-patterns recognized in prior works and build a resource tagger to annotate the segments of a given operation with resource types.

6.2.2 Conversational Agents and Web APIs

Research on conversational agents dates back decades [207]. However, only a few works have targeted web APIs, particularly because of the lack of training samples [190, 7, 200]. In the absence of training data, operation descriptions have been used for detecting the user’s intent [200]. However, operations often lack proper descriptions (or have long descriptions containing unnecessary information), and operation descriptions may share the same vocabulary within a single API, making it difficult for the bot to differentiate between operations [200]. Moreover, these descriptions are rarely similar to the natural language utterances which bot users use to interact with bots. That is to say, these descriptions were originally written to document operations, not to train bots [7, 200].

In our work, by adopting ideas from the principles of RESTful design and machine translation techniques, we tackle the main issue, which is creating canonical utterances for RESTful APIs. As opposed to current techniques, the proposed approach is domain-independent and can automatically generate initial utterances without human effort. We thus pave the way for automating the process of building virtual assistants that serve a large number of tasks, by automating the generation of training datasets for new/updated APIs.


6.3 The API2CAN Dataset

In this section, we explain the process of building the API2CAN dataset, and we provide its statistics (e.g., size).

6.3.1 API2CAN Generation Process

To generate the training dataset (pairs of operations and canonical utterances), we obtained the OpenAPI specifications indexed in OpenAPI Directory4. OpenAPI Directory is a Wikipedia for REST APIs5, and the OpenAPI specification is a standard documentation format for REST APIs. As shown in Figure 6.3, an OpenAPI specification includes a description and information about the parameters (e.g., data types, examples) of each operation. We obtained the latest version of each API indexed in OpenAPI Directory, collecting 983 APIs containing 18,277 operations in total (18.59 operations per API on average). Finally, we generated canonical utterances for each of the extracted operations as described in the rest of this section and illustrated in Figure 6.4.

paths:
  /customers/{customer_id}:
    get:
      description: gets a customer by its id
      summary: returns a customer by its id
      parameters:
        - { name: customer_id, in: path, description: customer identifier, required: true, type: string }

Figure 6.3: Excerpt of an OpenAPI Specification

Candidate Sentence Extraction. We extract a candidate sentence from either the summary or the description of the operation specification. For a given operation, the description (and summary) of the operation (e.g., “gets a [customer](#/definitions/Customer) by id. The response contains ...”) is pre-processed by removing HTML tags, lowercasing, and removing hyperlinks (e.g., “gets a customer by id. the response contains ...”), and then it is split into its sentences (e.g., “gets a customer by id.”, “the response contains ...”). Next, the first sentence starting with a verb (e.g., “gets a customer by id”) is chosen as a potential canonical utterance, and its verb is converted to its imperative form (e.g., “get a customer by id”).

4https://github.com/APIs-guru/openapi-directory/tree/master/APIs 5https://apis.guru/browse-apis/



[Figure 6.4 illustrates the extraction pipeline: from the operation description/summary (e.g., “... gets the [customer](#/definitions/Customer) by id ...”), a candidate sentence starting with a verb is extracted (“gets the customer by id”), converted to an imperative sentence (“get the customer by id”), and the path parameters are injected using the CFG (“get the customer with id being «id»”) to produce the canonical template.]

Figure 6.4: Process of Canonical Utterance Extraction

Parameter Injection. While the extracted sentence is usually a proper English sentence, it cannot be considered a user utterance. That is because the sentence often points to the parameters of the operation without specifying their values. For example, given an operation like “GET /customers/{customer_id}”, the extracted sentence is often similar to “get a customer by id” or “return a customer”. However, we are interested in annotated canonical utterances such as “get the customer with id being «id»” and “get the customer when its id is «id»”, where “«id»” is a sampled value for customer_id. To consider parameter values in the extracted sentence, we created a context-free grammar (CFG), briefly shown in Table 6.1. This grammar has been created based on our observations of how operation descriptions are written by API developers (how parameters are mentioned in the extracted candidate sentences). With this


Table 6.1: Parameter Replacement Context-Free Grammar

Rules:
  N   −→ {PN} | {NPN} | {LPN} | {RN} | {NRN} | {LRN}
  CPX −→ ‘by’ | ‘based on’ | ‘by given’ | ‘based on given’ | ...
  R   −→ N | CPX N | N CPX N

Symbols:
  {PN}   Parameter Name (e.g., “customer_id”, “CustomerID”, “CustomersID”)
  {NPN}  Normalized PN, obtained by splitting concatenated words and lowercasing (e.g., “customer id”, “customers id”)
  {LPN}  Lemmatized NPN (e.g., “customer id”)
  {RN}   Resource Name (e.g., “Customers”)
  {NRN}  Normalized RN (e.g., “customers”)
  {LRN}  Lemmatized NRN (e.g., “customer”)

With this grammar, a list of possible mentions of parameters in the operation description is generated (e.g., “by customer id”, “based on id”, “with the specified id”). Then the lengthiest mention found in the sentence is replaced with “with NPN being «PN»”, where NPN and PN are the human-readable version of the parameter name (e.g., customer_id −→ customer id) and its actual name, respectively (e.g., “get a customer with customer id being «customer_id»”).

We also observed that path parameters are usually not mentioned in operation descriptions in API specifications. For example, in an operation description like “returns an account for a given customer”, the path parameters accountId and customerId are absent, but the lemmatized names of the collections “customer” and “account” are present. By using the information obtained from detecting such resources (see Section 6.4.2), it is possible to convert the description into “return an account with id being «account_id» for a given customer with id being «customer_id»”.

In the process of generating the API2CAN dataset, a few types of parameters were automatically ignored. In particular, we did not include header parameters6 since they are mostly used for authentication, caching, or exchanging information such as Content-Type and User-Agent; such parameters do not specify entities of users’ intentions.
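To illustrate the parameter-injection step, the following sketch enumerates the mentions produced by a small fragment of the grammar in Table 6.1 and replaces the lengthiest match with the “with NPN being «PN»” clause. This is a simplified, hypothetical implementation (the helper names and the reduced connector list are ours, not the released API2CAN code):

import re


def normalize(param_name):
    # {NPN}: split concatenated words (camelCase / snake_case) and lowercase,
    # e.g., "customer_id" -> "customer id", "CustomerID" -> "customer id"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", param_name.replace("_", " "))
    return spaced.lower()


def candidate_mentions(param_name):
    # A tiny fragment of the grammar: R -> N | CPX N, with CPX drawn from Table 6.1
    names = {param_name, normalize(param_name)}
    connectors = ["by", "based on", "by given", "based on given", "with the specified"]
    mentions = set(names)
    for c in connectors:
        for n in names:
            mentions.add("{} {}".format(c, n))
    return mentions


def inject_parameter(sentence, param_name):
    # Replace the lengthiest mention found in the sentence with "with NPN being «PN»"
    replacement = "with {} being «{}»".format(normalize(param_name), param_name)
    for mention in sorted(candidate_mentions(param_name), key=len, reverse=True):
        if mention in sentence:
            return sentence.replace(mention, replacement, 1)
    return sentence + " " + replacement  # parameter not mentioned: append the clause


print(inject_parameter("get a customer by customer id", "customer_id"))
# -> get a customer with customer id being «customer_id»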

6Header fields are components of the header section of request in the Hypertext Transfer Protocol (HTTP).


Likewise, using a list of predefined parameter names (e.g., auth, v1.1), we automatically ignored authentication and versioning parameters because bot users are not expected to directly specify such parameters while talking to a bot. Moreover, since the payload of an operation can contain inner objects, we assume that all attributes in the expected payload of an operation

are flattened. This is done by concatenating the ancestors’ attributes with the inner objects’ attributes. For instance, the parameters in the following payload are flattened to “customer name” and “customer surname”:

{
  "customer": {
    "name": "string",
    "surname": "string"
  }
}

As such, we convert complex objects to a list of parameters that can be asked from a user during a conversation.
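A minimal sketch of this flattening step (our own illustrative code, not the exact API2CAN implementation) recursively concatenates ancestor attribute names with inner attribute names:

def flatten_payload(schema, prefix=""):
    """Flatten a nested payload schema into a list of human-readable parameter names."""
    params = []
    for name, value in schema.items():
        readable = (prefix + " " + name).strip()
        if isinstance(value, dict):
            # Inner object: recurse, carrying the ancestor's attribute name as prefix
            params.extend(flatten_payload(value, readable))
        else:
            params.append(readable)
    return params


payload = {"customer": {"name": "string", "surname": "string"}}
print(flatten_payload(payload))  # ['customer name', 'customer surname']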

6.3.2 Dataset Statistics

By processing all API specifications, we were able to automatically generate a dataset called API2CAN7, which includes 14,370 pairs of operations and their corresponding canonical utterances. We next divided the dataset into three parts as summarized in Table 6.2, and manually checked and corrected the extracted utterances in the test dataset to ensure a fair assessment of models learned on the dataset8.

Table 6.2: API2CAN Statistics

Dataset              APIs   Size
Train Dataset        858    13,029
Validation Dataset   50     433
Test Dataset         50     908

Figure 6.5 shows the number of operations in API2CAN based on the HTTP verbs (e.g., GET, POST). As shown in Figure 6.5, the majority of operations are of GET methods which are usually used for retrieving information (e.g., “get

7https://github.com/unsw-cse-soc/API2CAN
8The train and validation datasets will also be manually revised in the near future.



the list of customers”), followed by POST methods, which are usually used for creating resources (e.g., “creating a new customer”). The DELETE, PUT, and PATCH methods are also used for removing (e.g., “delete a customer by id being «id»”), replacing (e.g., “replace a customer by id being «id»”), and partially updating (e.g., “update a customer by id being «id»”) a resource.

[Pie chart: GET 7,749 operations (54%) and POST 3,391 (23%), with the remaining 1,414 (10%), 1,259 (9%), and 557 (4%) operations spread across DELETE, PUT, and PATCH.]

Figure 6.5: API2CAN Breakdown by HTTP Verb

Figure 6.6 also presents the distribution of the number of segments in the operations9 as well as the number of words in the generated canonical templates. As shown in Figure 6.6, most operations consist of fewer than 14 segments, with 4 being the most common. Given the typical number of segments in the operations, Neural Machine Translation (NMT)-based approaches can be used for the generation of canonical sentences [162, 36]. On the other hand, the canonical sentences in the API2CAN dataset are longer. The reason behind such lengthier utterances is the existence of parameters: operations with more parameters tend to produce lengthier templates. However, given the maximum length of the canonical sentences, NMT-based approaches can still perform well [36].

6.4 Neural Canonical Sentence Generation

Neural Machine Translation (NMT) systems are usually based on an encoder-decoder architecture to directly translate a sentence in one language to a sentence in a different language. As shown in Figure 6.7, generating a canonical template for a given operation can also be considered a translation task. As such, the operation

9For example, “GET /customers/{customer_id}” has two segments: “customers” and “{customer_id}”

[Figure 6.6 plots, in thousands, the frequency of operation segment counts and canonical template word counts for lengths from 0 to 32.]

Figure 6.6: API2CAN Breakdown by Length

is encoded into a vector, and the vector is then decoded into an annotated canonical template. However, the main challenge in building such a translation model is the lack of a large training dataset. Since deep-learning models are data-hungry, training requires a very large and diverse set of training samples (ideally millions of pairs of operations and their associated user utterances). As mentioned in the previous section, we automatically generated a dataset called API2CAN. However, such a dataset is still not large enough for training sequence-to-sequence models.

Having a large set of training samples also requires a very large and diverse set of operations. However, such a large set of APIs and operations is not available. One of the serious repercussions of the lack of such a set of operations is that training samples lack a very large number of words that could possibly appear in operations (but did not appear in the training dataset). As a result, models trained on such datasets will face many out-of-vocabulary words at runtime. To address this issue, we propose a delexicalization technique called resource-based delexicalization. As such, we reduce the impact of the out-of-vocabulary problem and force the model to learn the pattern of translating the resources in an operation to a canonical template (instead of translating a sequence of words).

6.4.1 Resources in REST

In RESTful design, the primary data representation is called a resource. A resource is an object with a type, associated data, relationships to other resources, and a

set of HTTP verbs (e.g., GET, POST) that operate on it. Designing RESTful APIs often involves following conventions in structuring URIs (endpoints) and naming resources. Examples include using plural nouns for naming resources, using the “GET” method for retrieving a resource, and using the “POST” method for creating a new resource.

In RESTful design, resources can be of various types. Most commonly, a resource can be a document or a collection. A document, which is also called a singleton resource, represents a single instance of the resource. For example, “/customers/{customer_id}” represents a customer that is identified by a path parameter (“customer_id”). On the other hand, a collection resource represents all instances of a resource type, such as “/customers”. Resources can also be nested. As such, a resource may contain a sub-collection (“/customers/{customer_id}/accounts”) or a singleton resource (e.g., “/customers/{customer_id}/accounts/{account_id}”).

In RESTful design, CRUD actions (create, retrieve, update, and delete) over resources are expressed by HTTP verbs (e.g., GET, POST). For example, “GET /customers” represents the action of getting the list of customers, and “POST /customers” indicates the action of creating a new customer. However, some actions might not fit into the world of conventional CRUD operations. In such cases, controller resources are used. Controller resources are like executable functions, with inputs and return values. REST APIs rely on action controllers to perform application-specific actions that cannot be logically mapped to one of the standard HTTP verbs. For example, an operation such as “GET /customers/{customer_id}/activate” can be used to activate a customer. Moreover, while it is unconventional, adjectives are also occasionally used for filtering resources. For example, “GET /customers/activated” means getting the list of all activated customers. In this chapter, such adjectives are called attribute controllers.

While the above-mentioned principles are followed by many API developers, there are still many APIs that violate these principles. By manually exploring APIs and prior works [147, 178, 157], we identified some unconventional resource types used in designing operations, as summarized in Table 6.3. A common drift from RESTful principles is the use of programming conventions in naming resources (e.g., “createActor”, “get_customers”). Aggregation functions (e.g., sum, count) and the expected output format of an operation (e.g., “json”, “tsb”, “txt”) are also used in designing endpoints. Words similar to “search” (e.g., “query”, “item-search”) are used to indicate that the operation looks for resources based on given criteria.

Moreover, collections are occasionally filtered/sorted by using keywords such as “filtered-by”, “sort-by”, or by appending a resource name to “By” (e.g., “ByName”, “ByID”). Segments in the endpoints may also indicate API versions (e.g., v1.12) or authentication endpoints (e.g., auth, login). Even though the aforementioned types of resources are against the conventional design guidelines of RESTful design, they are important to detect since they are still used by API developers in practice.

Table 6.3: Resource Types

Resource Type          Example
Collection             /customers
Singleton              /customers/{customer_id}
Action Controller      /customers/{customer_id}/activate
Attribute Controller   /customers/activated
API Specs              /api/swagger.yaml
Versioning             /api/v1.2/search
Function               /AddNewCustomer
Filtering              /customers/ByGroup/{group-name}
Search                 /customers/search
Aggregation            /customers/count
File Extension         /customers/json
Authentication         /api/auth

6.4.2 Resource-based Delexicalization

In resource-based delexicalization, the input (API call) and output (canonical template) of the sequence-to-sequence model are converted to a sequence of resource identifiers, as shown in Figure 6.7. This is done by replacing mentions of resources (e.g., customers, customer) with a corresponding resource identifier (e.g., Collection_1). A resource identifier consists of two parts: (i) the type of resource and (ii) a number n which indicates the n-th occurrence of that resource type

in a given operation. This number is later used in the lexicalization of the output of the sequence-to-sequence model to generate a canonical template. To detect resource types, we used the Resource Tagger shown in Algorithm 1. We convert the raw sequence of words in a given operation (e.g., “GET /customers/{customer_id}/accounts”) to a sequence of resource identifiers (e.g., “get Collection_1 Singleton_1 Collection_2”). Likewise, mentions of resources in the canonical templates are replaced with their corresponding resource identifiers (e.g., “get all Collection_1 for the Collection_2 with Singleton_1 being Singleton_1”). The intuition behind these conversions is to help the model focus on translating a sequence of resources instead of words.

When using the model for generating canonical templates, the tagged resource identifiers are replaced with their corresponding resource names (e.g., Collection_2 −→ customers). Meanwhile, in the process of replacing resource tags, grammatical errors might occasionally occur, such as having plural nouns instead of singular forms. To make the final generated canonical template more robust, we used LanguageTool10 (an open-source tool for automatically detecting and correcting linguistic errors) to correct linguistic errors in the generated canonical templates.
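The two directions of this mapping can be sketched as follows. This is a simplified illustration under our own naming (the real implementation relies on the Resource Tagger of Algorithm 1, and for brevity we pass already-singularized, human-readable resource names):

def delexicalize(tagged_resources, verb):
    """Map tagged resources (name, type) to identifiers such as Collection_1, Singleton_1."""
    counters, identifiers, mapping = {}, [], {}
    for name, rtype in tagged_resources:            # e.g., ("customer", "Collection")
        counters[rtype] = counters.get(rtype, 0) + 1
        identifier = "{}_{}".format(rtype, counters[rtype])
        identifiers.append(identifier)
        mapping[identifier] = name
    return "{} {}".format(verb.lower(), " ".join(identifiers)), mapping


def lexicalize(template, mapping):
    """Replace resource identifiers in the model output with the corresponding resource names."""
    for identifier, name in mapping.items():
        template = template.replace(identifier, name)
    return template


source, mapping = delexicalize([("customer", "Collection"), ("customer id", "Singleton")], "GET")
print(source)  # get Collection_1 Singleton_1
# Assume the sequence-to-sequence model emits this delexicalized canonical template:
model_output = "get a Collection_1 with Singleton_1 being «Singleton_1»"
print(lexicalize(model_output, mapping))
# -> get a customer with customer id being «customer id»
# (a grammar checker such as LanguageTool is then applied to fix, e.g., plural/singular issues)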

6.5 Parameter Value Sampling

To obtain canonical utterances, values must be sampled for the parameters (placeholders) inside a given canonical template. The sampled values help to generate canonical utterances, which are understandable sentences without any placeholders. Canonical utterances can later be paraphrased, either automatically or manually by crowd workers, to diversify the training samples. In this section, we investigate how values can be sampled for parameters of REST APIs. More specifically, we identified five main sources as follows.

1. Common Parameters. Parameters such as identifiers (e.g., customer_id), email addresses, and dates are ubiquitous in REST APIs. We built a set of such parameters paired with values. As such, a short random string or numeric value is generated for identifiers based on the parameter data

10https://languagetool.org


Algorithm 1: Resource Tagger
Input: segments of the operation
Result: List of resources
  resources ←− [];
  for i ←− length(segments) down to 1 do
      current ←− segments[i];
      resource ←− new Resource();
      resource.name ←− current;
      previous ←− φ;
      if i > 1 then
          previous ←− segments[i − 1];
      end
      resource.type ←− “Unknown”;
      if current is a path parameter then
          if previous is a plural noun and an identifier then
              resource.type ←− “Singleton”;
              resource.collection ←− previous;
          else
              resource.type ←− “Unknown Param”;
          end
      else
          if current starts with “by” then
              resource.type ←− “Filtering”;
          else if current in [“count”, “min”, ...] then
              resource.type ←− “Aggregation”;
          else if current in [“auth”, “token”, ...] then
              resource.type ←− “Authentication”;
          else if current in [“pdf”, “json”, ...] then
              resource.type ←− “File Extension”;
          else if current in [“version”, “v1”, ...] then
              resource.type ←− “Versioning”;
          else if current in [“swagger.yaml”, ...] then
              resource.type ←− “API Specs”;
          else if any of [“search”, “query”, ...] in current then
              resource.type ←− “Search”;
          else if current is a phrase and starts with a verb then
              resource.type ←− “Function”;
          else if current is a plural noun then
              resource.type ←− “Collection”;
          else if current is a verb then
              resource.type ←− “Action Controller”;
          else if current is an adjective then
              resource.type ←− “Attribute Controller”;
          end
      end
      resources.append(resource);
  end
  return reversed(resources)


[Figure 6.7 illustrates the approach: the operation “GET /customers/{id}/accounts” is converted to the sequence of resource identifiers “get Collection_1 Singleton_1 Collection_2”, which the recurrent encoder-decoder translates into the delexicalized template “get all Collection_1 for the Collection_2 with Singleton_1 being «Singleton_1»”; this output is then lexicalized into the canonical template “get all accounts for the customer with id being «id»”.]

Figure 6.7: Canonical Template Generation via Resource-based Delexicalization

type. Likewise, mock email addresses and dates are generated automatically.

2. API Invocation. By invocation of API methods that return a list of resources (e.g., “GET /customers”), we can obtain a large number of values for various attributes (e.g., customer names, customer ids) of the resource. Such values are reliable since they correspond to real values of entities in the retrieved resources. Thus they can be used reliably to generate canonical utterances out of canonical templates.

3. OpenAPI Specification. An OpenAPI specification may include example or default values11 for the parameters of each operation. Since these

11An example illustrates what the value is supposed to be for a given parameter. But a default value is what the server uses if the client does not provide the value.


values are provided by API owners, they are reliable. Moreover, the API specification specifies the data types of parameters, which can also be used to automatically generate values for parameters in the absence of example and default values (see the sketch after this list). For example, in the case of enumeration types (e.g., gender −→ [MALE, FEMALE]), one of the elements is randomly selected as a parameter value. In the case of numeric parameters (e.g., size), a random number is generated within the range specified in the API specification (e.g., between 1 and 10). Likewise, for parameters whose values follow regular expressions (e.g., “[0-9]%” indicates a string that has a single digit before a percent sign), random sequences are generated to fulfill the given pattern in the API specification (e.g., “8%”).

4. Similar Parameters. Having a large set of API specifications, example values can be found from similar parameters (sharing the same name and data type). This is possible by processing the parameters of API repositories such as the OpenAPI Directory.

5. Named Entities. Knowledge graphs provide information about various entities (e.g., cities, people, restaurants, books, authors). Examples of such knowledge graphs include Freebase [15], DBpedia [115], Wikidata [201], and YAGO [172]. For a given entity type such as “restaurant” in the restaurant domain, these knowledge graphs might contain numerous entities (e.g., KFC, Domino’s). Such knowledge bases can be used to sample values for a given parameter if the name of the parameter matches an entity type. In this chapter, we use Wikidata to sample values for entity types. Wikidata is a knowledge graph which is populated by processing Wikimedia projects such as Wikipedia.
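As a concrete illustration of source 3 (OpenAPI specifications), the following hedged sketch samples a value from the example/default values, enumerations, and numeric ranges that a specification may declare. The code and the parameter-schema dictionary are our own simplification, not the released implementation:

import random
import string


def sample_value(param):
    """Sample a value for a parameter given (part of) its OpenAPI schema."""
    if "example" in param:                      # values provided by the API owner are preferred
        return param["example"]
    if "default" in param:
        return param["default"]
    if "enum" in param:                         # e.g., gender -> [MALE, FEMALE]
        return random.choice(param["enum"])
    if param.get("type") == "integer":          # random number within the declared range
        return random.randint(param.get("minimum", 0), param.get("maximum", 10))
    if param.get("type") == "string":
        # a regex in "pattern" would need a dedicated generator; fall back to a short random string
        return "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return None


print(sample_value({"name": "size", "type": "integer", "minimum": 1, "maximum": 10}))
print(sample_value({"name": "gender", "type": "string", "enum": ["MALE", "FEMALE"]}))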

6.6 API2CAN Service

The API2CAN dataset is also accompanied by a set of microservices called the API2CAN Service. The service is implemented as a standalone open-source REST API in Python, and it is accessible from https://github.com/unsw-cse-soc/API2CAN. This REST service provides several functionalities as follows:


• Parsing OpenAPI Specification. This microservice is able to parse the given API specification (in YAML format) and extract API elements such as operations and their parameters in JSON format.

• Generating Canonical Utterances. This microservice generates canonical utterances based on the approaches used in the generation of the API2CAN dataset. In a nutshell, two approaches are used for generating canonical utterances, as introduced in this chapter: (i) converting the operation summary, as briefly introduced in the previous section, and (ii) the resource-based translator, which relies on the notion of resources in REST APIs and is also proposed in this chapter.

• Sampling Parameter Values. This microservice generates values (e.g., “Sydney”, “Paris”) for the parameters (e.g., “to”) of a given operation based on the approaches introduced in this chapter. The generated values can be used to populate placeholders inside generated canonical utterances (e.g., “to=Sydney” in “book a flight to Sydney”).

6.7 Experiments & Results

Before delving into the experiments, we briefly explain the training process in the case of using neural translation methods. We trained the neural models using the Adam optimizer [107] with an initial learning rate of 0.998, a dropout of 0.4 between recurrent layers (e.g., LSTM, BiLSTM), and a batch size of 512. It is worth noting that these hyper-parameters are initial configurations set based on the size of the dataset and values suggested in the literature; finding optimized values requires further studies. Furthermore, when not using delexicalization, we also populate the word embeddings of the model with GloVe [154]. At translation time, we used beam search with a beam size of 10 to obtain multiple translations for a given operation, and then the first translation with the same number of placeholders as the number of parameters in the given operation is considered as its canonical template. Moreover, we replaced generated unknown tokens with the source token that had the highest attention weight to mitigate the out-of-vocabulary problem.
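The selection rule applied to the beam output can be summarized with the following sketch (illustrative code under our own names; it merely counts «...» placeholders and picks the first candidate whose count matches the number of parameters of the operation):

import re


def pick_canonical_template(beam_candidates, expected_params):
    """Return the first beam candidate whose number of placeholders matches the operation."""
    for candidate in beam_candidates:
        placeholders = re.findall(r"«[^»]+»", candidate)
        if len(placeholders) == expected_params:
            return candidate
    return beam_candidates[0] if beam_candidates else None  # fall back to the best-scored candidate


candidates = [
    "get the customer",                      # 0 placeholders, rejected
    "get the customer with id being «id»",   # 1 placeholder, accepted
]
print(pick_canonical_template(candidates, expected_params=1))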


6.7.1 Translation Methods

We trained translation models using different sequence-to-sequence architectures, and we also built a rule-based translator, as described next. Given the size of the API2CAN dataset, we configured the models with at most two layers for both the encoder and the decoder.

GRU. This model consists of two layers (each having 256 units) of Gated Recurrent Units (GRUs) [35] for both the encoder and the decoder, using the attention mechanism [121].

LSTM. This model consists of two layers (each having 256 units) of LSTM for both the encoder and the decoder, using the attention mechanism [121].

CNN. We also built a sequence-to-sequence model based on a Convolutional Neural Network (CNN), as proposed in [71]. In particular, we used 3x3 convolutions (one layer of 256 units) with the attention mechanism [121].

BiLSTM-LSTM. This model consists of two layers (each having 256 units) of Bidirectional Long Short-Term Memory (BiLSTM) [74] for the encoder, and two layers (each having 256 units) of Long Short-Term Memory (LSTM) [87] for the decoder, using the attention mechanism [121].

Transformer. The Transformer architecture [199] has been shown to perform very strongly in machine translation tasks [47, 164]. We used the Transformer model implemented by OpenNMT [108] with the same hyper-parameters as the original paper [199]. For an in-depth explanation of the model, we refer the interested reader to the original paper [199].

Rule-based (RB) Translator. Based on the concept of resources in REST APIs, we also built a rule-based translation system to translate operations to canonical templates (shown in Algorithm 2). First, the algorithm extracts the resources of a given operation based on the resource types extracted by the Resource Tagger algorithm (see Algorithm 1). Next, the algorithm iterates over an ordered set of transformation rules to transform the operation to a canonical template. A transformation rule is a hand-crafted Python function which is able to translate a specific HTTP method (e.g., GET) and sequence of resource types (e.g., a collection resource followed by a singleton resource) to a canonical template. We had created 33 transformation rules at the time of writing this chapter,


Algorithm 2: Rule-Based Translator
Input: operation, transformation_rules written by experts
Result: A canonical template
  resources ←− ResourceTagger(operation);
  foreach t ∈ transformation_rules do
      canonical ←− t.transform(resources, operation.verb);
      if canonical ≠ φ then
          param_clause ←− to_clause(operation.parameters);
          canonical ←− canonical + " " + param_clause;
          return canonical;
      end
  end
  return φ

some of which are listed in Table 6.4. In this table, {c}, {s}, and {a} stand for collection, singleton, and attribute controller respectively, and the singular(.) function returns the singular form of a given name. For instance, in the case of an operation like “GET /customers”, given that the bot user requests a collection of customers, the provided transformer (rule number 1 in Table 6.4) is able to generate a canonical template such as “get the list of customers”. The following Python function presents the transformation rule implementation which is able to translate such operations (a single collection resource when the HTTP method is “GET”):

def transform(resources, verb):
    # Rule 1: GET on a single collection resource, e.g., "GET /customers"
    if verb != "GET" or len(resources) != 1:
        return
    if resources[0].type != "Collection":
        return
    collection = resources[0]
    return "get the list of {}".format(collection.name)

A transformer is written based on the assumption that a sequence of resource types has a special meaning. For example, considering “GET /customers/{id}/accounts” and “GET /users/{user_id}/aliases”, both operations share the same HTTP verb and sequence of resource types (a singleton followed by a collection). In such cases, possible canonical templates are “get accounts of a customer when its id is «id»” and “get aliases of a user when its user id is «user_id»”. Thus such a sequence of resource types can be converted to a rule like “get {collection} of a {singleton.collection} when its {singleton.name} is «{singleton.name}»”,



in which “{}” represents placeholders and singleton.collection represents the name of the collection for the given singleton resource (e.g., customers, users). Thus, adding a new transformation rule means generalizing a specific sequence of resource types that is not considered by the existing rules. However, as discussed earlier, since many APIs do not conform to RESTful principles, creating a comprehensive set of transformation rules is very challenging.
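For instance, the rule just described could be written as another transformation function in the same style as the earlier example. The sketch below is our own illustration (the Resource tuple, the naive singular(.) helper, and the attribute names follow our simplified examples rather than the released code):

from collections import namedtuple

Resource = namedtuple("Resource", ["name", "type", "collection"])


def singular(noun):
    # naive singularization for illustration only (e.g., "customers" -> "customer")
    return noun[:-1] if noun.endswith("s") else noun


def transform(resources, verb):
    # Rule: collection / singleton / collection under GET, e.g., GET /customers/{id}/accounts
    # -> "get {collection} of a {singleton.collection} when its {singleton.name} is «{singleton.name}»"
    if verb != "GET" or len(resources) != 3:
        return
    _, singleton, collection = resources
    if singleton.type != "Singleton" or collection.type != "Collection":
        return
    return "get {} of a {} when its {} is «{}»".format(
        collection.name, singular(singleton.collection), singleton.name, singleton.name)


resources = [Resource("customers", "Collection", None),
             Resource("id", "Singleton", "customers"),
             Resource("accounts", "Collection", None)]
print(transform(resources, "GET"))  # get accounts of a customer when its id is «id»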

Table 6.4: Excerpt of Transformation Rules

#  Resource Sequence −→ Transformation Rule
1  Rule:    GET /{c}/ −→ get list of {c.name}
   Example: GET /customers −→ get list of customers
2  Rule:    DELETE /{c}/ −→ delete all {c.name}
   Example: DELETE /customers −→ delete all customers
3  Rule:    GET /{c}/{s}/ −→ get the {singular(c.name)} with {s.name} being <{s.name}>
   Example: GET /customers/{id} −→ get the customer with id being <id>
4  Rule:    DELETE /{c}/{s}/ −→ delete the {singular(c.name)} with {s.name} being <{s.name}>
   Example: DELETE /customers/{id} −→ delete the customer with id being <id>
6  Rule:    PUT /{c}/{s}/ −→ replace the {singular(c.name)} with {s.name} being <{s.name}>
   Example: PUT /customers/{id} −→ replace the customer with id being <id>
7  Rule:    GET /{c}/{a}/ −→ get {a.name} {singular(c.name)}
   Example: GET /customers/first −→ get first customer
8  Rule:    GET /{c1}/{s}/{c2}/ −→ get the list of {c2.name} of the {singular(c1.name)} with {s.name} being <{s.name}>
   Example: GET /customers/{id}/accounts −→ get the list of accounts of the customer with id being <id>

6.7.2 Canonical Utterance Generation

Quantitative Analysis. For each of the aforementioned NMT architectures, we trained models with and without the proposed resource-based delexicalization approach described in Section 6.4.2. In these experiments, we did not tune

any hyper-parameters and trained the models on the training dataset. For each baseline, we saved the model after 10,000 steps and used the model with the minimum perplexity on the validation set to compare with other configurations. Table 6.5 presents the performance of each model in terms of machine translation metrics: bilingual evaluation understudy (BLEU) [149], Google’s BLEU Score (GLEU) [212], and Character n-gram F-score (CHRF) [159].

In the case of the RB-Translator, the hand-crafted transformation rules are able to generate canonical templates for 26% of the operations. Creating such transformation rules is very challenging for lengthy operations as well as for those not following RESTful principles. We did not include the RB-Translator’s performance in Table 6.5 because the results are not comparable to the rest. Our experiments indicate that the RB-Translator performs reasonably well (BLEU=0.744, GLEU=0.746, and CHRF=0.850). However, the BiLSTM-LSTM model built on the proposed dataset using the resource-based delexicalization technique outperforms the RB-Translator (BLEU=0.876, GLEU=0.909, and CHRF=0.971), ignoring the operations which the RB-Translator could not translate. As the experiments indicate, the delexicalized BiLSTM-LSTM outperforms the rest of the translation systems, and resource-based delexicalization improves the performance of NMT systems by a large margin.

Table 6.5: Translation Performance

Translation Method            BLEU    GLEU    CHRF
Delexicalized BiLSTM-LSTM     0.582   0.532   0.686
Delexicalized Transformer     0.511   0.462   0.619
Delexicalized LSTM            0.503   0.470   0.652
Delexicalized CNN             0.507   0.458   0.601
Delexicalized GRU             0.481   0.450   0.623
BiLSTM-LSTM                   0.318   0.266   0.421
Transformer                   0.295   0.248   0.397
LSTM                          0.278   0.226   0.381
CNN                           0.271   0.220   0.379
GRU                           0.251   0.198   0.347


Qualitative Analysis. Table 6.6 gives a few examples of canonical templates generated by the proposed translator (delexicalized BiLSTM-LSTM). While the machine-translation metrics in Table 6.5 do not show very strong translation performance, our manual inspections revealed that these metrics do not reflect the actual performance of the proposed translators. Therefore, we conducted another experiment to manually evaluate the translated operations. To this end, we asked two experts to rate the generated canonical templates using a Likert scale (in a range of 1 to 5, with 5 indicating the most appropriate canonical sentence). In the experiment, the experts were given pairs of generated canonical utterances and operations (including the description of the operation in the API specification) and were asked to rate the generated canonical templates in a range of 1 to 5. Figure 6.8 shows the Likert assessment for the best performing models in Table 6.5. Based on this experiment, canonical templates generated by the RB-Translator are rated 4.47 out of 5, and those of the delexicalized BiLSTM-LSTM are rated 4.06 out of 5 (averaging the scores given by the annotators). The overall Kappa test showed a high agreement coefficient between the raters, with Kappa being 0.86 [128]. Based on manual inspections, as also shown in Table 6.6, we observed that when APIs are designed based on RESTful principles, the delexicalized BiLSTM-LSTM performs as well as the RB-Translator.

Figure 6.8 also shows how well the automatically generated dataset (API2CAN) represents the corresponding operations. Based on the rates given by the annotators, the dataset (training part) is of decent quality while being noisy, indicating that the proposed set of heuristics for generating the dataset is well-defined. Given the promising quality of the generated canonical templates, we concluded that the noise in the dataset can be ignored. However, it is still desirable to manually clean the dataset.

Error Analysis. Even though the proposed method outperforms the baselines, it still occasionally fails in generating high-quality canonical templates. Based on our investigations, there are three main sources of error in generating the canonical templates: (i) detecting resource types, (ii) translating APIs which do not conform to RESTful principles, and (iii) lengthy operations with many segments. Detecting resource types requires natural language processing tools to detect parts of speech (POS) of a word (e.g., verb, noun, adjective), and to detect if a


Table 6.6: Examples of Generated Canonical Templates

Operation  GET /v2/taxonomies/
Canonical  fetch all taxonomies

Operation  PUT /api/v2/shop_accounts/{id}
Canonical  update a shop account with id being «id»

Operation  DELETE /api/v1/user/devices/{serial}
Canonical  delete a device with serial being «serial»

Operation  GET /user/ratings/query
Canonical  get a list of ratings that match the query

Operation  GET /v1/getLocations
Canonical  get a list of locations

Operation  POST /series/{id}/images/query
Canonical  query the images of the series with id being «id»

Operation  PUT /api/hotel/v0/hotels/{hotelId}/rateplans/batch/$rates
Canonical  set rates for rate plans of a hotel with hotel id being «hotelId»

[Figure 6.8 shows the Likert-scale (1-5) rating distributions, from 0% to 100%, for the API2CAN training data, the RB-Translator, and the Delexicalized NM-Translator.]

Figure 6.8: Assessment of Generated Canonical Templates

given noun is plural or singular (particularly for unknown words or phrases and uncountable nouns). However, these tools occasionally fail. Specifically, POS taggers are built for detecting the parts of speech of words inside a sentence. Thus it is not surprising that they fail to detect whether a word like “rate” is a verb or a noun in a given operation. For example, an operation like GET /participation/rate can indicate both “get the rate of participations” and “rate the participants”. Another

source of such issues is tokenization. It is common in APIs to concatenate words (e.g., whoami, addons, registrierkasseuuid, AddToIMDB). While it seems trivial for an individual to split these words, existing tools frequently fail. Such issues affect the process of detecting resources and consequently impact the generation of canonical templates negatively.

Unconventional API design (not conforming to RESTful principles) also extensively impacts the quality of generated canonical templates. Common drifts from RESTful principles include using the wrong HTTP verb (e.g., “POST” for retrieving information), using singular nouns for collections (e.g., /customer), and adding non-resource parts to the path of the operation (e.g., adding a response format like “json” in /customers/json). Since those API developers (who do not conform to design guidelines) follow their own conventions instead of accepted rules, the automatic generation of canonical templates is challenging.

Lengthy operations (those with roughly more than 10 segments) are naturally rare in REST APIs. Such lengthy operations convey more complex intents than those with fewer segments. As shown in Figure 6.6, such operations are unfortunately also rare in the proposed dataset (API2CAN), which impacts the translation of lengthy operations.

6.7.3 Parameter Value Sampling

This section provides an analysis of parameters in RESTful APIs and evaluates the proposed parameter sampling approach, which is used for generating canonical utterances out of canonical templates. To this end, we processed the API specifications indexed in OpenAPI Directory. Based on our analysis, the dataset contains 145,971 parameters in total, which indicates that an operation has 8.5 parameters on average.

Figure 6.9 presents statistics of the parameters in the whole list of API specifications in the OpenAPI Directory. As shown in the right-hand pie chart, most of the parameters are located in the payload (body) of APIs, followed by query and path parameters. Figure 6.9 also shows the percentages of parameter data types in the collection, with strings being the most common type of parameters. About 1.5% of string parameters are defined by regular expressions, and 4.8% of them can be associated with an entity type12. String parameters are followed

12We looked up the parameter name in Wikidata to find if there is associating entity type

by integers, booleans, numbers, and enumerations. Moreover, some parameters are left without any type, or they are given general parameter types such as “object” without any schema. These parameters are combined in the left-hand pie chart in Figure 6.9 under a single label, “others”. Moreover, 28% of parameters are required (not optional), 10.6% of parameters have no assigned value in the API specifications, and 26% of all parameters are identifiers (e.g., id, UUID). Thus, sampling values is required only for less than 10.6% of parameters (those without any values). In particular, value sampling for string parameters requires more attention. That is because string parameters are widely used, and it is more difficult to automatically assign values to them in comparison to other types of parameters (e.g., integers, enumerations).

To evaluate how well the proposed method generates sample values for parameters, we conducted an experiment. Since generating sample values for data types such as numbers and enumerations is straightforward, we only considered string parameters in this experiment. To this end, we randomly selected 200 parameters and asked an expert to annotate whether the sampled value is appropriate for the given parameter or not. The results indicate that 68 percent of the sampled values are appropriate for the given parameters. The main reason for inappropriate sampled values is noise in the API specifications. For instance, developers occasionally describe the parameters in the example part instead of the description part of the documentation; for a string parameter like “customer_id”, the example part may be filled with “a valid customer id”. Moreover, sometimes the same parameter name is used in different contexts for different purposes. For example, a parameter named “name” can be used to represent the name of a person, a school, or any other object.

6.8 Conclusion

This chapter aimed at addressing an important shortcoming in current approaches for acquiring canonical utterances. In this chapter, we demonstrated that the generation of canonical utterances can be considered a machine translation task. As such, our work also aimed at addressing an important challenge in training supervised neural machine translators, namely the lack of training data for translating operations to canonical templates. By processing a large set of API specifications and based on a set of heuristics, we built a dataset called API2CAN.

[Figure 6.9 summarizes parameter statistics. Parameter types: String 112,443 (77%), Integer 12,449 (8%), Boolean 9,847 (7%), Number 3,993 (3%), Enum 1,238 (1%), and others 4%. Parameter locations: Body 58,856 (40%), Query 51,381 (35%), Path 22,851 (16%), Header 11,119 (8%), and Form 1,788 (1%).]

Figure 6.9: Parameter Type and Location Statistics

However, deep-learning-based approaches require larger sets of training samples to train domain-independent models. Thus, by formalizing and defining resources in REST APIs, we proposed a delexicalization technique to convert an operation to a tagged sequence of resources to help sequence-to-sequence models learn from such a dataset. In addition, we showed how parameter values can be sampled to fill placeholders in a canonical template and generate canonical utterances. We also gave a systematic analysis of web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances.

In this study, we observed that the proposed techniques often fail in generating proper canonical utterances for lengthy operations. This is particularly because lengthy operations are rare in the proposed dataset. In addition, a more sophisticated resource tagger is required to detect resource types in a given operation. The proposed resource tagger is based on existing POS taggers, which occasionally fail in detecting POS tags for segments of an endpoint. One interesting extension to this work can be improving/augmenting the dataset (API2CAN). Moreover, given that fulfilling complex intents usually requires a combination of operations [25, 61], another extension can be generating canonical utterances for compositions of operations. To achieve this, it is required to detect the relations between operations and generate canonical templates for complex tasks (e.g., tasks requiring conditional operations or compositions of multiple operations).

Chapter 7

REST2Bot: A Platform for Automated Bot Development

REST2Bot: Bridging the Gap between Bot Platforms and REST APIs

In this chapter, we introduce REST2Bot1, a tool that addresses shortcomings in bot development frameworks (e.g., translating APIs to intents, and invoking APIs based on detected intents) by automating several tasks in the life cycle of the bot development process. REST2Bot relies on automated approaches for parsing OpenAPI specifications, generating training data, building bots on desired bot development frameworks, and generating deployable webhook functions to map intents and entities to APIs.

The rest of this chapter is organized as follows. Section 7.1 provides our introduction. Section 7.2 provides an overview of the proposed prototype. In Section 7.3, we provide an overview of the motivations and benefits of the effective integration of APIs and conversational services. Finally, Section 7.4 provides our conclusion and future work.

7.1 Introduction

As discussed throughout this thesis, developing a bot typically implies the ability to invoke APIs corresponding to user utterances (e.g., “what will the weather be like tomorrow in NYC?"). This is done in two phases as briefly shown in Figure

1https://github.com/unsw-cse-soc/REST2Bot


[Figure 7.1 depicts the typical bot development pipeline — from API specifications, defining intents and parameters, creating initial utterances, and paraphrasing utterances, to training the natural language understanding model, creating a webhook server, and deploying it — and shows how REST2Bot automates these steps end to end.]

Figure 7.1: Typical Bot Development Process vs Bot Development Process Using REST2Bot (Green Arrow)

7.1: (i) training a Natural Language Understanding (NLU) model to map user utterances to intents, and (ii) developing webhook functions to map intents to APIs. Machine-learning based NLU techniques require the definition of intents (e.g., booking hotels), entity types (e.g., location, date), and a set of annotated utterances in which entities are labeled with the entity types and intents. As studied in Chapter 2, the typical approach is to obtain canonical utterances for given intents (or APIs) and then paraphrase them (either manually or automatically) to generate multiple and diverse utterances. Moreover, bot developers (manually) specify mappings between intents and corresponding API calls through the development of webhooks (i.e., intent-action rules that trigger API calls upon detection of associated intents).

Although single-domain/application conversational bots (e.g., flight booking) are useful, the premise of our research is that the ubiquity of such bots will have more value if they can easily integrate and reuse concomitant capabilities across a large number of evolving and heterogeneous devices, data sources, and applications (e.g., a flight/hotel/car bookings all-in-one bot). With the number of different APIs growing and evolving very rapidly, we need bot development to “scale” in terms of how effectively bots can be integrated with APIs.

Motivated by the above concerns, we developed the REST2Bot system for

[Figure 7.2 shows the REST2Bot pipeline — API Parser, Canonical Sentence Generator, Paraphraser, Conversation Manager Generator, and Webhook Generator — together with external resources (e.g., Wikidata, Apache Joshua, Nematus, Amazon Mechanical Turk) and target bot platforms (e.g., wit.ai), grouped into training the natural language understanding model and building the webhook server.]

Figure 7.2: REST2Bot Architecture - Building Conversational Bots from API Specifications

the rapid and semi-automated integration of a potentially large and evolving set of intents and APIs [219, 218, 220]. At the heart of the REST2Bot system is the idea of providing a middleware that includes knowledge and processing techniques useful to train and automate various activities in the bot development life cycle. In a nutshell, this middleware is a set of microservices (e.g., automated canonical utterance generation, automated paraphrasing, automated generation of webhook functions) for automating the bot development process.

7.2 System Overview

The REST2Bot architecture (Figure 7.2) consists of a pipeline of components that automate several tasks in the bot development process. First, API Parser obtains elements of the APIs (e.g., descriptions, operations, parameters) by parsing API specifications. Second, Canonical Sentence Generator automatically generates canonical utterances (e.g., “get the tracks of the album with id being 1”) for each operation. Third, canonical utterances are paraphrased by Paraphraser to obtain diverse utterances to train NLU models by the Conversation Manager Generator component on a given bot development platform (e.g., Dialogflow). Finally, Webhook Generator generates executable Webhook functions to invoke appropriate APIs based on the intents and entities detected by the chosen bot development framework at runtime.


7.2.1 API Parser

The API Parser parses OpenAPI specifications to extract API elements (e.g., operations, parameters). Our prototype is an adaptation of Swagger-Parser in Python2. The OpenAPI specification is an industry standard for REST API documentation. Figure 7.3 shows an excerpt of an OpenAPI specification for Spotify’s API3. An OpenAPI specification includes the operation description, summary, and information about its parameters (e.g., data types, description, and examples). Given that each operation is designed to perform a specific task, each of the extracted operations can be considered a single intent4 (e.g., getting the list of tracks for a given album):

Intent: get_album_tracks
Operation: GET /albums/{id}/tracks
(HTTP verb: GET; endpoint/path: /albums/{id}/tracks; parameter: {id})
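A minimal sketch of this extraction step could look as follows. It uses PyYAML directly rather than the adapted Swagger-Parser and is simplified to the fields used later in the pipeline; the file name is illustrative:

import yaml

HTTP_VERBS = {"get", "post", "put", "patch", "delete"}


def extract_operations(spec_path):
    """Extract (verb, path, description, parameters) records from an OpenAPI/Swagger YAML file."""
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    operations = []
    for path, methods in spec.get("paths", {}).items():
        for verb, details in methods.items():
            if verb not in HTTP_VERBS or not isinstance(details, dict):
                continue
            operations.append({
                "verb": verb.upper(),
                "path": path,
                "description": details.get("description", ""),
                "summary": details.get("summary", ""),
                "parameters": details.get("parameters", []),
            })
    return operations


for op in extract_operations("spotify.yaml"):
    print(op["verb"], op["path"])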

7.2.2 Canonical Sentence Generator

Based on the techniques and models proposed in Chapter 6, the Canonical Sentence Generator (CSG) automatically translates an operation to canonical templates (e.g., “get the tracks of the album with id being «id»”)5. Generating canonical sentences is essential since they represent corresponding expressions for operations, and they are later paraphrased by the Paraphraser component to generate the training samples required for training NLU models. The following approaches are used to create canonical sentences:

• API-Description Transformer. CSG relies on a set of handcrafted heuristics to automatically transform the description of a given operation into a canonical template [218]. The description of an operation usually consists of several sentences (e.g., “gets an album’s tracks by a given album id. See the following page:”) describing the functionality of the operation. In short, the description of the operation is split into its sentences (e.g., “gets

2https://github.com/Trax-air/swagger-parser 3Spotify Web API endpoints provides access to music artists, albums, and tracks 4current implementation does not support intents requiring API compositions 5a canonical template is a sentence in which entities are replaced with placeholders


basePath: /v1
host: api.spotify.com
info:
  title: Spotify
  version: v1
paths:
  /albums: ...
  /albums/{id}: ...
  /albums/{id}/tracks:
    get:
      description: get an album's tracks
      parameters:
        - { name: id, in: path, description: the Spotify ID for the album, required: true, type: string }

Figure 7.3: Excerpt of Spotify’s OpenAPI Specification

an album’s tracks by a given album id.”, “See the following page:”). Then the first sentence starting with a verb (e.g., “gets an album’s tracks by a given album id”) is chosen as a potential canonical sentence, and its verb is converted to imperative form (e.g., “get an album’s tracks by a given album id”). Next parameters are injected into the candidate sentence (e.g., “gets an album’s tracks by album id being «id»”) based on a set of hand-crafted rules [218] (e.g., replacing entity mentions like “given album id” with a phrase with placeholders like “album id being «id».”).

• Neural Translator. Canonical sentences generated by the API-Description Transformer are of high quality, but many operations lack descriptions [218]. Since not all operations contain proper descriptions, REST2Bot also relies on the Neural Translator proposed in [218]. The Neural Translator is based on an encoder-decoder architecture to directly translate an operation (e.g., GET /customers) to a canonical template (e.g., “get the list of customers”). REST2Bot uses the same encoder-decoder architecture6 and training process (e.g., training dataset, hyper-parameters, resource-based delexicalization [218]).

Having generated canonical templates, the placeholders (entities) in the canonical templates are replaced with values. Values are sampled using the various

6two layers of Bidirectional Long-Short Term Memory (BiLSTM) [74] for encoding, and two layers of Long-Short Term Memory (LSTM) [87] for the decoder using attention mechanism [121]

methods proposed in [218, 229]. Examples include parameter example values extracted from the Swagger specification. Moreover, for entity types (e.g., restaurants), REST2Bot uses knowledge graphs (e.g., Wikidata [201]) to obtain sample values (e.g., KFC, Domino’s).

7.2.3 Paraphraser

Given that human language is rich and an intent can be expressed in countless ways, having a diverse set of utterances is of paramount importance [219]. A lack of variation in training samples may result in bots making incorrect intent detections or entity resolutions, and therefore performing undesirable (even dangerous) tasks (e.g., pressing the accelerator instead of the brake pedal in a car) [219]. Existing solutions for paraphrasing involve either automated or crowdsourcing techniques [220, 219, 190]. REST2Bot is integrated with popular crowdsourcing platforms (e.g., figure-eight7, MTurk8), can automatically create paraphrasing tasks on the specified platform using the word recommendation approach proposed in Chapter 4, and can train bots based on the collected paraphrases. Moreover, the Paraphraser relies on three paraphrasing components to obtain diverse variations of each canonical utterance:

• Common Prefix/Postfix Concatenation. The Paraphraser relies on a set of handcrafted common prefixes/postfixes (e.g., “can you please”, “please”, “I want to”) which can be concatenated with a canonical sentence (e.g., “create a new album”) without changing its meaning (e.g., “can you please create a new album”); see the sketch after this list.

• Statistical Paraphrasing. The proposed system also relies on a statistical paraphrasing system called Apache Joshua [161]. Apache Joshua is a statistical machine translation decoder for phrase-based machine translation, and it relies on the Paraphrase Database (PPDB) to generate paraphrases [67].

• Neural Paraphrasing. Inspired by language pivoting in Neural Machine Translation (NMT) systems, the Paraphraser uses two pre-built NMT models proposed in [183]: one for translating English sentences to German, and the other for translating from German back to English.

7https://www.figure-eight.com
8https://www.mturk.com/
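The first of these components is simple enough to sketch directly; the prefix/postfix lists below are illustrative, not the full hand-crafted set:

PREFIXES = ["can you please", "please", "I want to", "I would like to"]
POSTFIXES = ["for me", "right now"]


def prefix_postfix_paraphrases(canonical):
    """Generate meaning-preserving variations by concatenating common prefixes/postfixes."""
    variations = ["{} {}".format(p, canonical) for p in PREFIXES]
    variations += ["{} {}".format(canonical, s) for s in POSTFIXES]
    return variations


print(prefix_postfix_paraphrases("create a new album"))
# e.g., ['can you please create a new album', ..., 'create a new album for me', ...]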


7.2.4 Conversation Manager Generator

Conversation Manager Generator (CMG) is used to instantiate a conversation manager in a third-party bot development platform (e.g., Dialogflow) [228]. To support this functionality, we develop extensions called generator plugins embedded inside the CMG component. Each plugin is a program that takes as input the training data (e.g., annotated utterances) generated by the Paraphraser, and exports it to the conversation manager in order to train the included NLU model. This is done by exploiting the REST APIs provided by the bot development platform (e.g., Dialogflow). A minimal plugin sketch is given below.
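The sketch pushes one intent and its training utterances to a bot development platform over HTTP. The endpoint path, payload shape, and authentication header are hypothetical placeholders, since each real plugin has to follow the intent-management API of its target platform (e.g., Dialogflow).

import requests

def export_intent(platform_url, api_key, intent_name, utterances):
    """Push one intent and its training utterances to a bot platform.
    The /intents endpoint and the payload layout below are hypothetical;
    a real plugin would follow the target platform's own REST API."""
    payload = {
        "displayName": intent_name,
        "trainingPhrases": [{"text": u} for u in utterances],
    }
    response = requests.post(
        platform_url + "/intents",
        json=payload,
        headers={"Authorization": "Bearer " + api_key},
        timeout=30,
    )
    response.raise_for_status()

export_intent(
    "https://bot-platform.example.com/v1/agent",  # hypothetical platform URL
    "API_KEY",
    "GetAlbumTracks",
    ["get an album's tracks by album id being 4aawyAB9vmqN3uQ7FjRGTy",
     "can you please get the tracks of this album"],
)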

7.2.5 Webhook Generator

Webhook Generator (WG) automatically generates webhook functions written in the developer’s selected programming language (e.g., Python9) [228]. For each detected user intent and its associated parameter values, the generated webhook invokes the corresponding APIs and returns the response. The output of the Webhook Generator is fully functional and ready-to-deploy source code for a webhook server. This source code can be reused and customized by bot developers. A minimal sketch of such a generated webhook is given below.
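As a rough sketch of what the generated Python code might look like, the Flask application below maps a detected intent and its parameters to a backend REST call. The intent name, URL template, and request/response format are illustrative assumptions; the actual generated code is derived from the OpenAPI specification and the target platform's webhook protocol.

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative mapping from intent names to backend operations; in the
# generated code this mapping comes from the parsed OpenAPI specification.
INTENT_TO_OPERATION = {
    "GetAlbumTracks": "https://api.example.com/albums/{id}/tracks",
}

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json(force=True)
    intent = body["intent"]                 # detected user intent
    params = body.get("parameters", {})     # resolved entity values
    url = INTENT_TO_OPERATION[intent].format(**params)
    api_response = requests.get(url, timeout=30)  # invoke the mapped REST operation
    return jsonify({"fulfillmentText": api_response.text})

if __name__ == "__main__":
    app.run(port=5000)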

7.3 Use Cases

REST2Bot can facilitate the bot development process while still using existing bot development platforms. REST2Bot is offered as both a RESTful API and a web-based interface. REST2Bot gives the user control over the settings of the different components. This includes selecting algorithms for creating canonical sentences (e.g., deep learning approach, rule-based approach), paraphrasing models (e.g., Apache Joshua, appending common prefixes), the bot development platform (e.g., Wit.ai, Dialogflow), and the programming framework (e.g., Python Flask, Spring Boot) used to build the webhook server. REST2Bot can be used in the following scenarios:

• Training Dataset Generation. In REST2Bot, the developer starts by providing a list of API specifications (e.g., Spotify, Yelp) chosen for building the bot. REST2Bot extracts all operations from the given set of APIs. Next, based on the selected approaches for translation (API-Description Transformer

9The current implementation at this stage supports template code written in Python.


Figure 7.4: REST2Bot User Interface

and Neural Translator), operations are translated to canonical sentences by the Canonical Sentence Generator (CSG) component. Then, paraphrasing methods (common prefix, statistical, and neural paraphrasing) are selected by the bot developer to diversify the training dataset, which is handled by the Paraphraser component. Since paraphrases are generated automatically, REST2Bot also scores the generated paraphrases (using the Paraphraser component) to keep only high-quality paraphrases based on a user-provided score threshold (the similarity of a canonical utterance and a given paraphrase is calculated by the method proposed in [29]; a minimal filtering sketch is given after this list). The outcome is a dataset containing natural language utterances to train conversation-enabled services such as bots (e.g., on-premise development platforms) and IoT devices (e.g., smart home appliances, aged-care gadgets).

• Automate Bot Development. Using the generated training dataset,


REST2Bot trains the NLU model (the main component within the conversation manager) by defining intents (e.g., PlayMusic, FindRestaurants) and entity types (e.g., album name, restaurant location) in a supported bot platform (e.g., Dialogflow) chosen by the bot developer. Next, REST2Bot automatically creates webhook functions to map intents to APIs in a given programming framework (e.g., Python Flask). This process is handled by the CMG and WG components of the REST2Bot architecture. Finally, in the demo, bot users are able to interact with the generated bot.
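As referenced in the first scenario above, the sketch below filters automatically generated paraphrases by cosine similarity to the canonical utterance. The embed callable stands in for a sentence encoder such as the one used in [29]; the 0.7 default threshold is an arbitrary illustrative value rather than the one used in REST2Bot.

import numpy as np

def filter_paraphrases(canonical, paraphrases, embed, threshold=0.7):
    """Keep only paraphrases whose cosine similarity to the canonical
    utterance reaches the user-provided threshold. `embed` maps a list of
    sentences to a matrix of sentence vectors (one row per sentence)."""
    vectors = np.asarray(embed([canonical] + paraphrases))
    canon, rest = vectors[0], vectors[1:]
    sims = rest @ canon / (np.linalg.norm(rest, axis=1) * np.linalg.norm(canon))
    return [p for p, s in zip(paraphrases, sims) if s >= threshold]

# Example usage with any sentence encoder:
# keep = filter_paraphrases("create a new album",
#                           ["please create a new album", "delete the album"],
#                           embed=my_encoder, threshold=0.7)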

7.4 Conclusion

This chapter aimed at addressing an important shortcoming in current approaches for building bots, namely automatic training data generation and the construction of webhook functions to invoke APIs. Future work includes adding new plugins (e.g., for more bot development platforms and programming frameworks for webhook servers). Moreover, extensions of REST2Bot will allow developers to share training datasets to minimize the cost of crowdsourced paraphrasing. Another feasible extension is automatic deployment of the webhook server using cloud platforms (e.g., Microsoft Azure, Google Cloud).


Chapter 8

Conclusion

Dialog systems are growing rapidly and new approaches are being developed; as a result, many users have started to investigate the usefulness of such interfaces. Nevertheless, training data acquisition remains a bottleneck, and nearly all systems depend on gathering high-quality training datasets to build robust virtual assistants. This thesis identified the main approaches for acquiring user utterances for training intent detection systems, enumerated the existing challenges in this area along with existing solutions, and provided solutions for some of the challenges. This chapter provides a summary of the undertaken research issues in Section 8.1, as well as a summary of the research outcomes in Section 8.2. Finally, Section 8.3 discusses future research directions.

8.1 Summary of the Research Issues

This thesis presented multiple novel techniques to fill critical gaps in automating and semi-automating the process of building conversational bots. Chapter 2 provided background and identified open issues and the state of the art of approaches for collecting training utterances. In the second study (Chapter 3), we conducted an empirical study on how incorrect paraphrases are generated by crowd-workers in order to build online quality-control systems for the automatic detection of incorrect paraphrases (e.g., semantically incorrect ones). In Chapter 4, we proposed a novel framework and technique for crowdsourcing user utterances that collects semantically correct yet diverse paraphrases by continuously showing a list of potential words to enhance the diversity of collected paraphrases. In Chapter 5, we introduced a new technique to identify insincere crowd-workers who intentionally provide incorrect paraphrases. In Chapter 6, we addressed a fundamental

issue in generating canonical utterances by automatically translating REST APIs to natural language expressions. Finally, in Chapter 7, we demonstrated the software prototype built using the techniques introduced in this thesis as well as state-of-the-art approaches in this domain.

8.2 Summary of the Research Outcomes

As a first outcome, in Chapter 2, we surveyed training utterance acquisition methods and discussed their issues with respect to two main dimensions: cost and quality. We discussed the state-of-the-art techniques, identified and explored open issues, and provided an outlook on future research directions (some of which are addressed in this thesis) [217].

The second outcome, presented in Chapter 3, includes a study of how incorrect paraphrases are generated by crowd-workers, a taxonomy of incorrect paraphrases, baselines for detecting each type of incorrect paraphrase, and an annotated dataset of crowdsourced paraphrases. We discussed how each category of incorrect paraphrases can be detected automatically based on existing state-of-the-art approaches in natural language processing.

The third outcome, introduced in Chapter 4, consists of a framework and techniques that leverage word embeddings and word alignment to promote the diversity of crowdsourced paraphrases. We focused on the automated generation of word/phrase suggestions based on already collected paraphrases, recommending suggestions that are more likely to enhance diversity according to existing diversity metrics. The experimental results obtained by applying our approach to crowdsourcing tasks demonstrate the feasibility and effectiveness of the proposed techniques, improving diversity while not promoting semantically incorrect paraphrases.

In our fourth outcome, proposed in Chapter 5, we employed a data-driven approach to investigate various cheating behaviors and their characteristics in crowdsourced paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then proposed two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity, user edit habits) and proposed a domain-independent method for modeling malicious workers.

The fifth outcome, introduced in Chapter 6, consists of a novel model to

directly translate REST APIs to canonical utterances. We demonstrated that the generation of canonical utterances can be considered a machine translation task. As such, our work also aimed at addressing an important challenge in training supervised neural machine translators, namely the lack of training data for translating operations to canonical templates. By processing a large set of API specifications and based on a set of heuristics, we built a dataset called API2CAN. Since deep-learning-based approaches require larger sets of training samples to train domain-independent models, by formalizing and defining resources in REST APIs, we proposed a delexicalization technique that converts an operation to a tagged sequence of resources to help sequence-to-sequence models learn from such a dataset. In addition, we showed how parameter values can be sampled to fill placeholders in a canonical template and generate canonical utterances. We also provided a systematic analysis of web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances.

Finally, Chapter 7 presented our final outcome, a software prototype called REST2Bot: a tool that addresses shortcomings of bot development frameworks (e.g., translating APIs to intents, and invoking APIs based on detected intents) to automate several tasks in the life cycle of the bot development process. REST2Bot relies on automated approaches for parsing OpenAPI specifications, generating training data, building bots on the desired bot development frameworks, and generating deployable webhook functions to map intents and entities to APIs.

8.3 Future Research Directions

We discussed several open issues and research directions in Chapters 2 and 3. While we targeted some of these issues in this thesis, there are still many challenges waiting to be addressed. In this section, we discuss future work and an outlook for training utterance acquisition.


8.3.1 Integrating Quality Control and Crowdsourcing with a Feedback Mechanism

Considering the issue of collecting quality training utterances, we proposed various techniques and models to diversify crowdsourced utterances (Chapter 4) and to detect their quality issues (Chapters 3 and 5). A potential extension to the proposed approaches is a feedback mechanism: generating feedback for workers to inform them about quality issues (e.g., spelling errors, semantic incompleteness). At the same time, the system can learn from workers’ feedback on detected quality issues (incorrect suggestions) to improve its machine intelligence. Over time, we envision increasingly less dependence on users. This also requires developing software services which can easily be integrated with existing crowdsourcing platforms. Such a software service must have characteristics such as being domain-independent and highly reliable to be useful in crowdsourcing tasks.

After the acquisition of training utterances, it is also essential to reduce biases in the collection of utterances. In this thesis, we only focused on one type of bias, namely lexical bias, but a quality set of utterances must not suffer from other biases either (e.g., gender bias [16], personal view bias [173], blatant racial slurs [138]). This requires in-depth studies to identify various kinds of biases and approaches to automatically detect and resolve them.

8.3.2 Canonical Utterance Generation for Complex Intents

Efficient and cost-effective acquisition of training utterances is linked with how canonical utterances are generated for intents. In Chapter 6, we proposed a technique to automatically generate canonical utterances for REST APIs. While many tasks can be accomplished by invoking a single API operation, complex intents require compositions of operations (e.g., if-else conditions). To achieve this, it is necessary to detect the relations between operations and generate canonical templates for complex tasks (e.g., tasks requiring conditional operations or compositions of multiple operations). While REST APIs are one of the most popular forms of intents, there are still other types of intents such as SQL queries and programming code [204, 97, 70]. Building natural language interfaces for such intents requires special considerations (e.g., mining code repositories, programming language documentation, and online

communities such as Stack Overflow1) for generating canonical utterances.

8.3.3 Acquisition of Training Dialogues

In this thesis, we discussed and addressed some of the issues (e.g., quality, automation) of collecting training utterances for intent detection systems. However, multi-turn dialog systems require more complicated training samples, including multiple utterances which may be exchanged between the user and the bot in a single conversation [168]. This requires understanding how users interact with bots and simulating the interaction between the user and the bot [61, 168]. In future work, we will target these problems, together with many other exciting opportunities, as extensions to this work.

1https://stackoverflow.com/


Bibliography

[1] Abdalghani Abujabal et al. “Automated Template Generation for Ques- tion Answering over Knowledge Graphs”. In: Proceedings of the 26th In- ternational Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 1191–1200. isbn: 978-1-4503-4913-0. doi: 10.1145/3038912.3052583. url: https://doi.org/10.1145/3038912.3052583. [2] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. “A comparative survey of recent natural language interfaces for databases”. In: The VLDB Journal 28 (2019), pp. 793 –819. [3] Basant Agarwal et al. “A deep network model for paraphrase detection in short text messages”. In: Information Processing & Management 54.6 (2018), pp. 922–937. [4] Ali Ahmadvand, Jason Ingyu Choi, and Eugene Agichtein. “Contextual Dialogue Act Classification for Open-Domain Conversational Agents”. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019, pp. 1273–1276. [5] Layla El Asri et al. “Frames: A corpus for adding memory to goal-oriented dialogue systems”. In: arXiv preprint arXiv:1704.00057 (2017). [6] Ram G. Athreya, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck. “En- hancing Community Interactions with Data-Driven Chatbots–The DB- pedia Chatbot”. In: Companion Proceedings of the The Web Conference 2018. WWW ’18. Lyon, France: International World Wide Web Confer- ences Steering Committee, 2018, pp. 143–146. isbn: 978-1-4503-5640-4. doi: 10.1145/3184558.3186964. url: https://doi.org/10.1145/ 3184558.3186964. [7] Petr Babkin et al. “Bootstrapping Chatbots for Novel Domains”. In: Work- shop at NIPS on Learning with Limited Labeled Data (LLD). 2017. [8] Sahar Badihi and Abbas Heydarnoori. “CrowdSummarizer: Automated Generation of Code Summaries for Java Programs through Crowdsourc- ing”. In: IEEE Software 34.2 (2017), pp. 71–80. [9] Moath Bagarish, Riyad Alshammari, and A. Nur Zincir-Heywood. “Are There Bots even in FIFA World Cup 2018 Tweets?” In: 2019 15th Inter- national Conference on Network and Service Management (CNSM). 2019, pp. 1–5. doi: 10.23919/CNSM46954.2019.9012743.


[10] Alessandro Balestrucci. “How many bots are you following?” In: arXiv preprint arXiv:2001.05222 (2020). [11] Rucha Bapat, Pavel Kucherbaev, and Alessandro Bozzon. “Effective Crowdsourced Generation of Training Data for Chatbots Natural Lan- guage Understanding”. In: Web Engineering. Ed. by Tommi Mikkonen, Ralf Klamma, and Juan Hernández. Cham: Springer International Pub- lishing, 2018. isbn: 978-3-319-91662-0. [12] Jonathan Berant et al. “ on Freebase from Question- Answer Pairs”. In: Proceedings of the 2013 Conference on Empirical Meth- ods in Natural Language Processing. Seattle, Washington, USA: Associa- tion for Computational Linguistics, Oct. 2013, pp. 1533–1544. url: https: //www.aclweb.org/anthology/D13-1160. [13] Abraham Bernstein, Foster Provost, and Shawndra Hill. “Toward intel- ligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification”. In: IEEE Transactions on knowledge and data engineering 17.4 (2005), pp. 503–518. [14] Michael S. Bernstein et al. “Soylent: A Word Processor with a Crowd Inside”. In: vol. 58. 8. New York, NY, USA: ACM, July 2015, pp. 85–94. doi: 10.1145/2791285. url: http://doi.acm.org/10.1145/2791285. [15] Kurt Bollacker et al. “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. SIG- MOD ’08. Vancouver, Canada: ACM, 2008, pp. 1247–1250. isbn: 978-1- 60558-102-6. doi: 10.1145/1376616.1376746. url: http://doi.acm. org/10.1145/1376616.1376746. [16] Tolga Bolukbasi et al. “Man is to Computer Programmer As Woman is to Homemaker? Debiasing Word Embeddings”. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Barcelona, Spain: Curran Associates Inc., 2016, pp. 4356–4364. isbn: 978-1-5108-3881-9. url: http://dl.acm.org/citation.cfm?id= 3157382.3157584. [17] David M Boush. “How advertising slogans can prime evaluations of brand extensions”. In: Psychology & Marketing 10.1 (1993), pp. 67–78. [18] Florin Brad and Traian Rebedea. “Neural Paraphrase Generation using Transfer Learning”. In: Proceedings of the 10th International Conference on Natural Language Generation. 2017, pp. 257–261. [19] Patricia Braunger et al. “Towards an Automatic Assessment of Crowd- sourced Data for NLU”. In: Proceedings of the Eleventh International Con- ference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), May 2018. url: https://www.aclweb.org/anthology/L18-1315.


[20] A.M. Brickman and Yaakov Stern. “Aging and Memory in Humans”. In: Sage Encyclopedia of Neuroscience 1 (Jan. 2010), pp. 175–180. doi: 10. 1016/B978-008045046-9.00745-2. [21] Peter F. Brown et al. “The Mathematics of Statistical Machine Trans- lation: Parameter Estimation”. In: Comput. Linguist. 19.2 (June 1993), pp. 263–311. issn: 0891-2017. url: http://dl.acm.org/citation.cfm? id=972470.972474. [22] Marc Brysbaert et al. “How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age”. In: Frontiers in psychology 7 (2016), p. 1116. [23] Dhyana Buckley. “The Persuasive Effects of Stylistic Variation in the Restaurant Review Domain”. PhD thesis. UC Santa Cruz, 2018. [24] Steven Burrows, Martin Potthast, and Benno Stein. “Paraphrase Acquisi- tion via Crowdsourcing and Machine Learning”. In: vol. 4. 3. New York, NY, USA: ACM, July 2013, 43:1–43:21. doi: 10.1145/2483669.2483676. url: http://doi.acm.org/10.1145/2483669.2483676. [25] Giovanni Campagna et al. “Almond: The Architecture of an Open, Crowd- sourced, Privacy-Preserving, Programmable Virtual Assistant”. In: Pro- ceedings of the 26th International Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 341–350. isbn: 978-1-4503-4913-0. doi: 10.1145/ 3038912.3052562. url: https://doi.org/10.1145/3038912.3052562. [26] Lorenzo Cannone and Matteo Di Pierro. “Detection and classification of harmful bots in human-bot interactions on Twitter”. In: MSc Thesis, Po- litecnico di Milano (2018). [27] Claudio Carpineto and Giovanni Romano. “A Survey of Automatic Query Expansion in Information Retrieval”. In: vol. 44. 1. New York, NY, USA: ACM, Jan. 2012, 1:1–1:50. doi: 10.1145/2071389.2071390. url: http: //doi.acm.org/10.1145/2071389.2071390. [28] Fabio Casati et al. “Operating enterprise AI as a service”. In: International Conference on Service-Oriented Computing. Springer. 2019, pp. 331–344. [29] Daniel Cer et al. “Universal sentence encoder”. In: arXiv preprint arXiv:1803.11175 (2018). [30] David L. Chen and William B. Dolan. “Collecting Highly Parallel Data for Paraphrase Evaluation”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. HLT ’11. Portland, Oregon: Association for Computational Linguistics, 2011, pp. 190–200. isbn: 978-1-932432-87-9. url: http://dl. acm.org/citation.cfm?id=2002472.2002497.


[31] Hongshen Chen et al. “A Survey on Dialogue Systems: Recent Advances and New Frontiers”. In: SIGKDD Explor. Newsl. 19.2 (Nov. 2017), 25–35. issn: 1931-0145. doi: 10.1145/3166054.3166058. url: https://doi. org/10.1145/3166054.3166058. [32] Chun Cheng, Yun Luo, and Changbin Yu. “Dynamic mechanism of social bots interfering with public opinion in network”. In: Physica A: Statistical Mechanics and its Applications (2020), p. 124163. [33] Wendy A. Chisholm and Shawn Lawton Henry. “Interdependent Com- ponents of Web Accessibility”. In: Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A). W4A ’05. Chiba, Japan: ACM, 2005, pp. 31–37. isbn: 1-59593-219-4. doi: 10.1145/ 1061811 . 1061818. url: http : / / doi . acm . org / 10 . 1145 / 1061811 . 1061818. [34] Timothy Chklovski. “Collecting Paraphrase Corpora from Volunteer Con- tributors”. In: Proceedings of the 3rd International Conference on Knowl- edge Capture. K-CAP ’05. Banff, Alberta, Canada: ACM, 2005, pp. 115– 120. isbn: 1-59593-163-5. doi: 10.1145/1088622.1088644. url: http: //doi.acm.org/10.1145/1088622.1088644. [35] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Process- ing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIG- DAT, a Special Interest Group of the ACL. 2014, pp. 1724–1734. url: https://www.aclweb.org/anthology/D14-1179/. [36] Kyunghyun Cho et al. “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. In: Proceedings of SSST-8, Eighth Work- shop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103– 111. doi: 10.3115/v1/W14- 4012. url: https://www.aclweb.org/ anthology/W14-4012. [37] Cuong Xuan Chu, Niket Tandon, and Gerhard Weikum. “Distilling Task Knowledge from How-To Communities”. In: Proceedings of the 26th In- ternational Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 805–814. isbn: 978-1-4503-4913-0. doi: 10.1145/3038912.3052715. url: https://doi.org/10.1145/3038912.3052715. [38] Jacob Cohen. “A coefficient of agreement for nominal scales”. In: Educa- tional and psychological measurement 20.1 (1960), pp. 37–46.


[39] K. M. Colby. “Human-Computer Conversation in A Cognitive Therapy Program”. In: Machine Conversations. Ed. by Yorick Wilks. Boston, MA: Springer US, 1999, pp. 9–19. isbn: 978-1-4757-5687-6. doi: 10.1007/978- 1- 4757- 5687- 6_3. url: https://doi.org/10.1007/978- 1- 4757- 5687-6_3. [40] Kate Compton, Benjamin Filstrup, and Michael Mateas. “Tracery: Ap- proachable story grammar authoring for casual users”. In: 2014 Electronic Literature Organization Conference, ELO 2014. AI Access Foundation. 2014, pp. 64–67. [41] Alexis Conneau et al. “Supervised Learning of Universal Sentence Repre- sentations from Natural Language Inference Data”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 670–680. url: https://www.aclweb.org/anthology/D17-1070. [42] Joao Cordeiro, Gael Dias, and Pavel Brazdil. “A metric for paraphrase detection”. In: Computing in the Global Information Technology, 2007. ICCGI 2007. International Multi-Conference on. IEEE. 2007, pp. 7–7. [43] Justin Cranshaw et al. “Calendar.Help: Designing a Workflow-Based Scheduling Agent with Humans in the Loop”. In: CHI ’17. Denver, Col- orado, USA, 2017. isbn: 978-1-4503-4655-9. [44] Scott Crossley et al. “Combining Click-stream Data with NLP Tools to Better Understand MOOC Completion”. In: Proceedings of the Sixth In- ternational Conference on Learning Analytics & Knowledge. LAK ’16. Ed- inburgh, United Kingdom: ACM, 2016, pp. 6–14. isbn: 978-1-4503-4190-5. doi: 10.1145/2883851.2883931. url: http://doi.acm.org/10.1145/ 2883851.2883931. [45] Robert Dale. “The return of the chatbots”. In: Natural Language Engi- neering 22.5 (2016), pp. 811–817. [46] Florian Daniel et al. “Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions”. In: vol. 51. 1. New York, NY, USA: ACM, Jan. 2018, 7:1–7:40. doi: 10.1145/ 3148148. url: http://doi.acm.org/10.1145/3148148. [47] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transform- ers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 (Long and Short Pa- pers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. url: https: //www.aclweb.org/anthology/N19-1423.


[48] Kedar Dhamdhere et al. “Analyza: Exploring Data with Conversation”. In: Proceedings of the 22Nd International Conference on Intelligent User Interfaces. IUI ’17. Limassol, Cyprus: ACM, 2017, pp. 493–504. isbn: 978- 1-4503-4348-0. doi: 10.1145/3025171.3025227. url: http://doi.acm. org/10.1145/3025171.3025227. [49] George Doddington. “Automatic Evaluation of Machine Translation Qual- ity Using N-gram Co-occurrence Statistics”. In: Proceedings of the Sec- ond International Conference on Human Language Technology Research. HLT ’02. San Diego, California: Morgan Kaufmann Publishers Inc., 2002, pp. 138–145. url: http://dl.acm.org/citation.cfm?id=1289189. 1289273. [50] Li Dong et al. “Learning to Paraphrase for Question Answering”. In: arXiv preprint arXiv:1708.06022 (2017). [51] Todd Donovan, Caroline J Tolbert, and Daniel A Smith. “Priming presi- dential votes by direct democracy”. In: The Journal of Politics 70.4 (2008), pp. 1217–1231. [52] Nan Duan et al. “Question Generation for Question Answering”. In: Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Lan- guage Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 866–874. doi: 10.18653/v1/D17-1090. url: https://www.aclweb.org/anthology/D17-1090. [53] Layla El Asri et al. “Frames: a corpus for adding memory to goal-oriented dialogue systems”. In: The 18th SIGDIAL. Saarbrücken, Germany: Asso- ciation for Computational Linguistics, 2017, pp. 207–219. doi: 10.18653/ v1/W17-5526. url: http://aclweb.org/anthology/W17-5526. [54] Ábel Elekes, Martin Schäler, and Klemens Böhm. “On the various se- mantics of similarity in word embedding models”. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries. IEEE Press. 2017, pp. 139–148. [55] Mihail Eric et al. MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines. July 2019. [56] Katrin Erk and Sebastian Padó. “A Structured Vector Space Model for Word Meaning in Context”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP ’08. Honolulu, Hawaii: Association for Computational Linguistics, 2008, pp. 897–906. url: http: //dl.acm.org/citation.cfm?id=1613715.1613831. [57] Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. “Paraphrase-driven learning for open question answering”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2013, pp. 1608–1618.


[58] Roghayeh Fakouri-Kapourchali, Mohammad-Ali Yaghoub-Zadeh-Fard, and Mehdi Khalili. “Semantic Textual Similarity as a Service”. In: Service Research and Innovation. Ed. by Amin Beheshti et al. Cham: Springer International Publishing, 2018, pp. 203–215. isbn: 978-3-319-76587-7. [59] Manaal Faruqui et al. “Retrofitting word vectors to semantic lexicons”. In: arXiv preprint arXiv:1411.4166 (2014). [60] Ethan Fast, Binbin Chen, and Michael S. Bernstein. “Empath: Under- standing Topic Signals in Large-Scale Text”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. CHI ’16. San Jose, California, USA: ACM, 2016, pp. 4647–4657. isbn: 978-1-4503-3362- 7. doi: 10.1145/2858036.2858535. url: http://doi.acm.org/10. 1145/2858036.2858535. [61] Ethan Fast et al. “Iris: A Conversational Agent for Complex Tasks”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. CHI ’18. Montreal QC, Canada: ACM, 2018, 473:1–473:12. isbn: 978-1-4503-5620-6. doi: 10.1145/3173574.3174047. url: http://doi. acm.org/10.1145/3173574.3174047. [62] Roy Fielding et al. Hypertext transfer protocol–HTTP/1.1. 1999. [63] Mauajama Firdaus et al. “A Deep Multi-task Model for Dialogue Act Clas- sification, Intent Detection and Slot Filling”. In: Cognitive Computation (2020), pp. 1–20. [64] Kenneth I Forster and Chris Davis. “Repetition priming and frequency at- tenuation in lexical access.” In: Journal of experimental psychology: Learn- ing, Memory, and Cognition 10.4 (1984), p. 680. [65] Charlotte Fox and James E Birren. “Some factors affecting vocabulary size in later maturity: Age, education, and length of institutionalization”. In: Journal of gerontology 4.1 (1949), pp. 19–26. [66] Ujwal Gadiraju and Stefan Dietze. “Improving Learning Through Achieve- ment Priming in Crowdsourced Information Finding Microtasks”. In: Pro- ceedings of the Seventh International Learning Analytics & Knowledge Conference. LAK ’17. Vancouver, British Columbia, Canada: ACM, 2017, pp. 105–114. isbn: 978-1-4503-4870-6. doi: 10.1145/3027385.3027402. url: http://doi.acm.org/10.1145/3027385.3027402. [67] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. “PPDB: The paraphrase database”. In: Proceedings of the 2013 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013, pp. 758–764. [68] Jianfeng Gao, Michel Galley, Lihong Li, et al. “Neural approaches to con- versational AI”. In: Foundations and Trends R in Information Retrieval 13.2-3 (2019), pp. 127–298.


[69] Shuyang Gao et al. “Dialog State Tracking: A Neural Reading Comprehen- sion Approach”. In: Jan. 2019, pp. 264–273. doi: 10.18653/v1/W19-5932. [70] Zhipeng Gao et al. “Checking Smart Contracts with Structural Code Em- bedding”. In: IEEE Transactions on Software Engineering (2020), pp. 1–1. doi: 10.1109/TSE.2020.2971482. [71] Jonas Gehring et al. “Convolutional Sequence to Sequence Learning”. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. ICML’17. Sydney, NSW, Australia: JMLR.org, 2017, pp. 1243– 1252. url: http://dl.acm.org/citation.cfm?id=3305381.3305510. [72] Nguyen Giang. Survey on Script-based languages to write a Chat- bot. Available: https://www.slideshare.net/NguyenGiang102/survey-on- scriptbased-languages-to-write-a-chatbot. [Online; accessed 15-July-2019]. 2018. [73] Ian Goodfellow et al. Deep learning. Vol. 1. MIT press Cambridge, 2016. [74] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. “Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition”. In: Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005. Ed. by Włodzisław Duch et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 799–804. isbn: 978-3-540-28756-8. [75] Ankush Gupta et al. “A Deep Generative Framework for Paraphrase Gen- eration”. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018). [76] E Haihong et al. “A novel bi-directional interrelated model for joint intent detection and slot filling”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 5467–5471. [77] Mark Hall et al. “The WEKA Data Mining Software: An Update”. In: vol. 11. 1. New York, NY, USA: ACM, Nov. 2009, pp. 10–18. doi: 10. 1145/1656274.1656278. url: http://doi.acm.org/10.1145/1656274. 1656278. [78] Jennifer L Harris, John A Bargh, and Kelly D Brownell. “Priming effects of television food advertising on eating behavior.” In: Health psychology 28.4 (2009), p. 404. [79] Lane Harrison et al. “Influencing Visual Judgment Through Affective Priming”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’13. Paris, France: ACM, 2013, pp. 2949– 2958. isbn: 978-1-4503-1899-0. doi: 10 .1145 / 2470654. 2481410. url: http://doi.acm.org/10.1145/2470654.2481410. [80] Sadid A Hasan et al. “Neural clinical paraphrase generation with atten- tion”. In: Proceedings of the Clinical Natural Language Processing Work- shop (ClinicalNLP). 2016, pp. 42–53.


[81] Homa B Hashemi, Amir Asiaee, and Reiner Kraft. “Query intent detection using convolutional neural networks”. In: International Conference on Web Search and Data Mining, Workshop on Query Understanding. 2016. [82] Matthew Henderson, Blaise Thomson, and Jason D. Williams. “The Sec- ond Dialog State Tracking Challenge”. In: Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIG- DIAL). Philadelphia, PA, U.S.A.: Association for Computational Linguis- tics, June 2014, pp. 263–272. doi: 10.3115/v1/W14-4337. url: https: //www.aclweb.org/anthology/W14-4337. [83] Matthew Henderson et al. “A Repository of Conversational Datasets”. In: CoRR abs/1904.06472 (2019). [84] Matthew S Henderson. “Discriminative methods for statistical spoken di- alogue systems”. PhD thesis. University of Cambridge, 2015. [85] Peter Henderson et al. “Ethical Challenges in Data-Driven Dialogue Sys- tems”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’18. New Orleans, LA, USA, 2018, pp. 123–129. isbn: 978-1-4503-6012-8. [86] Dirk Hermans, Jan De Houwer, and Paul Eelen. “The affective priming effect: Automatic activation of evaluative information in memory”. In: Cognition & Emotion 8.6 (1994), pp. 515–533. [87] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780. [88] Shaohan Huang et al. “Dictionary-Guided Editing Networks for Para- phrase Generation”. In: CoRR (2018). [89] Ting-Hao Kenneth Huang, Walter S Lasecki, and Jeffrey P Bigham. “Guardian: A crowd-powered spoken dialog system for web apis”. In: Third AAAI conference on human computation and crowdsourcing. 2015. [90] Ting-Hao Kenneth Huang et al. “Evorus: A Crowd-powered Conversa- tional Assistant That Automates Itself Over Time”. In: Adjunct Publi- cation of the 30th Annual ACM Symposium on User Interface Software and Technology. UIST ’17. Québec City, QC, Canada: ACM, 2017, pp. 155–157. isbn: 978-1-4503-5419-6. doi: 10.1145/3131785.3131823. url: http://doi.acm.org/10.1145/3131785.3131823. [91] Ting-Hao Kenneth Huang et al. “"Is There Anything Else I Can Help You With?" Challenges in Deploying an On-Demand Crowd-Powered Conver- sational Agent”. In: Fourth AAAI Conference on Human Computation and Crowdsourcing. 2016.


[92] Michimasa Inaba et al. “Statistical Response Method and Learning Data Acquisition Using Gamified Crowdsourcing for a Non-task-oriented Dia- logue Agent”. In: Revised Selected Papers of the 6th International Con- ference on Agents and Artificial Intelligence - Volume 8946. ICAART 2014. Angers, France: Springer-Verlag, 2015, pp. 119–136. isbn: 978-3- 319-25209-4. doi: 10.1007/978-3-319-25210-0_8. url: https://doi. org/10.1007/978-3-319-25210-0_8. [93] Fuad Issa et al. “Abstract Meaning Representation for Paraphrase De- tection”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers). Vol. 1. 2018, pp. 442–452. [94] Minwoo Jeong and G. Geunbae Lee. “Triangular-Chain Conditional Ran- dom Fields”. In: Trans. Audio, Speech and Lang. Proc. 16.7 (Sept. 2008), pp. 1287–1302. issn: 1558-7916. doi: 10.1109/TASL.2008.925143. url: http://dx.doi.org/10.1109/TASL.2008.925143. [95] Youxuan Jiang, Jonathan K. Kummerfeld, and Walter S. Lasecki. “Under- standing Task Design Trade-offs in Crowdsourced Paraphrase Collection”. In: Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers). Vancouver, Canada: Asso- ciation for Computational Linguistics, 2017, pp. 103–109. doi: 10.18653/ v1/P17-2017. url: http://aclweb.org/anthology/P17-2017. [96] Cordeiro Joao, Dias Gaël, and Brazdil Pavel. “New functions for unsu- pervised asymmetrical paraphrase detection”. In: Journal of Software 2.4 (2007), pp. 12–23. [97] Rogers Jeffrey Leo John, Navneet Potti, and Jignesh M Patel. “Ava: From Data to Insights Through Conversations.” In: CIDR. 2017. [98] Armand Joulin et al. “Bag of Tricks for Efficient Text Classification”. In: Proceedings of the 15th Conference of the European Chapter of the Asso- ciation for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 427–431. url: http://aclweb.org/anthology/E17-2068. [99] Seikyung Jung, Jonathan L. Herlocker, and Janet Webster. “Click Data As Implicit Relevance Feedback in Web Search”. In: vol. 43. 3. Tarrytown, NY, USA: Pergamon Press, Inc., May 2007, pp. 791–807. doi: 10.1016/j.ipm. 2006.07.021. url: http://dx.doi.org/10.1016/j.ipm.2006.07.021. [100] D Jurafsky and JH Martin. “Dialog Systems and Chatbots”. In: Speech and language processing (2018). [101] Dan Jurafsky and James H Martin. Speech and language processing. Vol. 3. Pearson London, 2017.


[102] Juraj Juraska and Marilyn Walker. “Characterizing Variation in Crowd- Sourced Data for Training Neural Language Generators to Produce Stylis- tically Varied Outputs”. In: CoRR (2018). [103] Yiping Kang et al. “Data Collection for : A Startup Per- spective”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 3 (Industry Papers). New Orleans - Louisiana: Association for Computational Linguistics, June 2018, pp. 33–40. doi: 10. 18653/v1/N18-3005. url: https://www.aclweb.org/anthology/N18- 3005. [104] David Kauchak and Regina Barzilay. “Paraphrasing for Automatic Eval- uation”. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. New York City, USA: Association for Computational Linguistics, June 2006, pp. 455–462. url: https://www. aclweb.org/anthology/N06-1058. [105] Aziz Khan and Anthony Mathelier. “Intervene: a tool for intersection and visualization of multiple gene or genomic region sets”. In: BMC bioinfor- matics 18.1 (2017), p. 287. [106] Joo-Kyung Kim et al. “Intent detection using semantically enriched word embeddings”. In: Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE. 2016, pp. 414–419. [107] Diederik Kingma and Jimmy Ba. “Adam: A Method for Stochastic Opti- mization”. In: International Conference on Learning Representations (Dec. 2014). [108] Guillaume Klein et al. “OpenNMT: Open-Source Toolkit for Neural Ma- chine Translation”. In: Proceedings of ACL 2017, System Demonstrations. Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 67–72. url: https://www.aclweb.org/anthology/P17-4012. [109] Reno Kriz et al. “Simplification Using Paraphrases and Context-Based Lexical Substitution”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long Papers). New Or- leans, Louisiana: Association for Computational Linguistics, June 2018, pp. 207–217. doi: 10.18653/v1/N18-1019. url: https://www.aclweb. org/anthology/N18-1019. [110] Matt Kusner et al. “From word embeddings to document distances”. In: International Conference on Machine Learning. 2015, pp. 957–966.


[111] Saar Kuzi, Anna Shtok, and Oren Kurland. “Query Expansion Using Word Embeddings”. In: Proceedings of the 25th ACM International on Confer- ence on Information and Knowledge Management. CIKM ’16. Indianapo- lis, Indiana, USA: ACM, 2016, pp. 1929–1932. isbn: 978-1-4503-4073-1. doi: 10.1145/2983323.2983876. url: http://doi.acm.org/10.1145/ 2983323.2983876. [112] Stefan Larson et al. “An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”. In: Proceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, : Association for Computational Linguistics, Nov. 2019, pp. 1311–1316. doi: 10.18653/v1/D19-1131. url: https://www.aclweb. org/anthology/D19-1131. [113] Stefan Larson et al. “Outlier Detection for Improved Data Quality and Diversity in Dialog Systems”. In: NAACL-HLT. Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 517–527. doi: 10.18653/v1/N19-1051. url: https://www.aclweb.org/anthology/ N19-1051. [114] Walter S. Lasecki et al. “Chorus: A Crowd-powered Conversational Assis- tant”. In: Proceedings of the 26th Annual ACM Symposium on User Inter- face Software and Technology. UIST ’13. St. Andrews, Scotland, United Kingdom: ACM, 2013, pp. 151–162. isbn: 978-1-4503-2268-3. doi: 10 . 1145/2501988.2502057. url: http://doi.acm.org/10.1145/2501988. 2502057. [115] Jens Lehmann et al. “DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia”. In: Semantic Web 6.2 (2015), pp. 167–195. [116] Alexander Lex et al. “UpSet: visualization of intersecting sets”. In: IEEE transactions on visualization and computer graphics 20.12 (2014), pp. 1983–1992. [117] Feng-Lin Li et al. “AliMe Assist : An Intelligent Assistant for Creating an Innovative E-commerce Experience”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. CIKM ’17. Singapore, Singapore: ACM, 2017, pp. 2495–2498. isbn: 978-1-4503-4918- 5. doi: 10.1145/3132847.3133169. url: http://doi.acm.org/10. 1145/3132847.3133169. [118] Guoliang Li et al. “Crowdsourced data management: A survey”. In: IEEE Transactions on Knowledge and Data Engineering 28.9 (2016), pp. 2296– 2319.


[119] Greeshma Lingam, Rashmi Ranjan Rout, and DVLN Somayajulu. “Deep Q-Learning and Particle Swarm Optimization for Bot Detection in On- line Social Networks”. In: 2019 10th International Conference on Com- puting, Communication and Networking Technologies (ICCCNT). IEEE. 2019, pp. 1–6. [120] Phoebe Liu et al. “Optimizing the design and cost for crowdsourced conver- sational utterances”. In: Proc. KDD-Data Collection, Curation, Labeling. 2019. [121] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Ap- proaches to Attention-based Neural Machine Translation”. In: Proceed- ings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sept. 2015, pp. 1412–1421. doi: 10.18653/v1/D15-1166. url: https: //www.aclweb.org/anthology/D15-1166. [122] Paweł Łupkowski and Jonathan Ginz. “A corpus-based taxonomy of ques- tion responses”. In: IWCS ’13 - International Workshop on Computational Semantics. 2013. [123] Xiao Ma, Trishala Neeraj, and Mor Naaman. “A Computational Approach to Perceived Trustworthiness of Airbnb Host Profiles”. In: 2017. url: https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/ 15630. [124] Nitin Madnani and Bonnie J. Dorr. “Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods”. In: Computational Lin- guistics 36.3 (2010), pp. 341–387. doi: 10.1162/coli_a_00002. url: https://www.aclweb.org/anthology/J10-3003. [125] Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. “Paraphrasing revisited with neural machine translation”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Vol. 1. 2017, pp. 881–893. [126] Sandya Mannarswamy and Saravanan Chidambaram. “GEMINIO: Find- ing Duplicates in a Question Haystack”. In: Advances in Knowledge Dis- covery and Data Mining. Ed. by Dinh Phung et al. Cham: Springer Inter- national Publishing, 2018, pp. 104–114. isbn: 978-3-319-93037-4. [127] Philip M. McCarthy, Rebekah H. Guess, and Danielle S. McNamara. “The components of paraphrase evaluations”. In: Behavior Research Methods 41.3 (2009), pp. 682–690. issn: 1554-3528. doi: 10.3758/BRM.41.3.682. url: https://doi.org/10.3758/BRM.41.3.682. [128] Mary L McHugh. “Interrater reliability: the kappa statistic”. In: Bio- chemia medica: Biochemia medica 22.3 (2012), pp. 276–282.


[129] Marcelo Mendoza and Juan Zamora. “Identifying the Intent of a User Query Using Support Vector Machines”. In: Aug. 2009, pp. 131–142. doi: 10.1007/978-3-642-03784-9_13. [130] Stefano Mezza et al. “ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents”. In: Proceedings of the 27th Inter- national Conference on Computational Linguistics. Santa Fe, New Mex- ico, USA: Association for Computational Linguistics, 2018, pp. 3539–3551. url: http://aclweb.org/anthology/C18-1300. [131] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and Their Compositionality”. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS’13. Lake Tahoe, Nevada: Curran Associates Inc., 2013, pp. 3111– 3119. url: http://dl.acm.org/citation.cfm?id=2999792.2999959. [132] David R. H. Miller, Tim Leek, and Richard M. Schwartz. “A Information Retrieval System”. In: Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval. SIGIR ’99. Berkeley, California, USA: ACM, 1999, pp. 214–221. isbn: 1-58113-096-1. doi: 10.1145/312624. 312680. url: http://doi.acm.org/10.1145/312624.312680. [133] George A. Miller. “WordNet: A Lexical Database for English”. In: vol. 38. 11. New York, NY, USA: ACM, Nov. 1995, pp. 39–41. doi: 10.1145/ 219717.219748. url: http://doi.acm.org/10.1145/219717.219748. [134] Andrea Millimaggi and Florian Daniel. “On Twitter Bots Behaving Badly: Empirical Study of Code Patterns on GitHub”. In: International Confer- ence on Web Engineering. Springer. 2019, pp. 187–202. [135] Margaret Mitchell, Dan Bohus, and Ece Kamar. “Crowdsourcing Language Generation Templates for Dialogue Systems”. In: (June 2014), pp. 172– 180. doi: 10.3115/v1/W14- 5003. url: https://www.aclweb.org/ anthology/W14-5003. [136] Robert R Morris, Mira Dontcheva, and Elizabeth M Gerber. “Priming for better performance in microtask crowdsourcing environments”. In: IEEE Internet Computing 16.5 (2012), pp. 13–19. [137] Courtney Napoles, Chris Callison-Burch, and Matt Post. “Sentential Para- phrasing as Black-Box Machine Translation”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Com- putational Linguistics: Demonstrations. San Diego, California: Associa- tion for Computational Linguistics, 2016, pp. 62–66. url: http://www. aclweb.org/anthology/N16-3013. [138] Gina Neff and Peter Nagy. “, algorithms, and politics| talking to bots: symbiotic agency and the case of tay”. In: International Journal of Communication 10 (2016), p. 17.


[139] Matteo Negri et al. “Chinese Whispers: Cooperative Paraphrase Acquisi- tion”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey: European Lan- guage Resources Association (ELRA), May 2012, pp. 2659–2665. url: http://www.lrec-conf.org/proceedings/lrec2012/pdf/772_Paper. pdf. [140] Tri Nguyen et al. “MS MARCO: A human generated machine reading comprehension dataset”. In: arXiv preprint arXiv:1611.09268 (2016). [141] Hamed Nilforoshan, Jiannan Wang, and Eugene Wu. “PreCog: Im- proving Crowdsourced Data Quality Before Acquisition”. In: CoRR abs/1704.02384 (2017). [142] Hamed Nilforoshan and Eugene Wu. “Leveraging Quality Prediction Mod- els for Automatic Writing Feedback”. In: Proceedings of the Twelfth In- ternational Conference on Web and Social Media, ICWSM 2018, Stan- ford, California, USA, June 25-28, 2018. 2018, pp. 211–220. url: https: //aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17802. [143] Jekaterina Novikova, Oliver Lemon, and Verena Rieser. “Crowd-sourcing nlg data: Pictures elicit better data”. In: arXiv preprint arXiv:1608.00339 (2016). [144] Shereen Oraby, Sheideh Homayon, and Marilyn Walker. “Harvesting cre- ative templates for generating stylistically varied restaurant reviews”. In: CoRR (2017). [145] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. “Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features”. In: NAACL 2018 - Conference of the North American Chapter of the As- sociation for Computational Linguistics. 2018. [146] Francis Palma et al. “Are RESTful APIs Well-Designed? Detection of their Linguistic (Anti)Patterns”. In: Service-Oriented Computing. Ed. by Alistair Barros et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 171–187. isbn: 978-3-662-48616-0. [147] Francis Palma et al. “Detection of REST Patterns and Antipatterns: A Heuristics-Based Approach”. In: Service-Oriented Computing. Ed. by Xavier Franch et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 230–244. isbn: 978-3-662-45391-9. [148] Francis Palma et al. “Semantic Analysis of RESTful APIs for the Detec- tion of Linguistic Patterns and Antipatterns”. In: International Journal of Cooperative Information Systems (May 2017), p. 1742001. doi: 10.1142/ S0218843017420011.


[149] Kishore Papineni et al. “BLEU: A Method for Automatic Evaluation of Machine Translation”. In: Proceedings of the 40th Annual Meeting on As- sociation for Computational Linguistics. ACL ’02. Philadelphia, Pennsyl- vania: Association for Computational Linguistics, 2002, pp. 311–318. doi: 10.3115/1073083.1073135. url: https://doi.org/10.3115/1073083. 1073135. [150] Sunghyun Park et al. “Paraphrase Diversification Using Counterfactual Debiasing”. In: Proceedings of the AAAI Conference on Artificial Intel- ligence 33 (July 2019), pp. 6883–6891. doi: 10 . 1609 / aaai . v33i01 . 33016883. [151] Cesare Pautasso. “RESTful web services: principles, patterns, emerging technologies”. In: Web Services Foundations. Springer, 2014, pp. 31–51. [152] Ellie Pavlick et al. “PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification”. In: Pro- ceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Lan- guage Processing (Volume 2: Short Papers). Beijing, China: Association for Computational Linguistics, 2015, pp. 425–430. url: http : / / www . aclweb.org/anthology/P15-2070. [153] Maria Pelevina et al. “Making Sense of Word Embeddings”. In: Proceed- ings of the 1st Workshop on Representation Learning for NLP. Berlin, Ger- many: Association for Computational Linguistics, 2016, pp. 174–183. doi: 10.18653/v1/W16-1620. url: http://www.aclweb.org/anthology/ W16-1620. [154] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vectors for Word Representation”. In: Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162. url: https://www.aclweb. org/anthology/D14-1162. [155] Les Perelman. “Grammar checkers do not work”. In: WLN: A Journal of Writing Center Scholarship 40.7-8 (2016), pp. 11–20. [156] Matthew Peters et al. “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 2227–2237. doi: 10.18653/ v1/N18-1202. url: https://www.aclweb.org/anthology/N18-1202.


[157] Fabio Petrillo et al. “Are REST APIs for Cloud Computing Well-Designed? An Exploratory Study”. In: Service-Oriented Computing. Ed. by Quan Z. Sheng et al. Cham: Springer International Publishing, 2016, pp. 157–170. isbn: 978-3-319-46295-0. [158] A. S. Popov et al. “Unsupervised dialogue intent detection via hierarchical ”. In: RANLP. 2019. [159] Maja Popović. “chrF: character n-gram F-score for automatic MT eval- uation”. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal: Association for Computational Linguistics, Sept. 2015, pp. 392–395. doi: 10 . 18653 / v1 / W15 - 3049. url: https : //www.aclweb.org/anthology/W15-3049. [160] Maja Popović. “chrF deconstructed: beta parameters and n-gram weights”. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Berlin, Germany: Association for Computational Linguistics, 2016, pp. 499–504. doi: 10.18653/v1/W16-2341. url: http: //aclweb.org/anthology/W16-2341. [161] Matt Post, Yuan Cao, and Gaurav Kumar. “Joshua 6: A phrase-based and hierarchical statistical machine translation system”. In: The Prague Bulletin of Mathematical Linguistics (2015). [162] Jean Pouget-Abadie et al. “Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation”. In: Proceed- ings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Lin- guistics, Oct. 2014, pp. 78–85. doi: 10.3115/v1/W14-4009. url: https: //www.aclweb.org/anthology/W14-4009. [163] Chen Qu et al. “User Intent Prediction in Information-seeking Conver- sations”. In: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval. CHIIR ’19. Glasgow, Scotland UK: ACM, 2019, pp. 25–33. isbn: 978-1-4503-6025-8. doi: 10 . 1145 / 3295750 . 3298924. url: http://doi.acm.org/10.1145/3295750.3298924. [164] Alec Radford et al. “Improving language understanding by generative pre-training”. In: URL https://s3-us-west-2. amazonaws. com/openai- assets/research-covers/languageunsupervised/language understanding pa- per. pdf (2018). [165] Pranav Rajpurkar et al. “Squad: 100,000+ questions for machine compre- hension of text”. In: arXiv preprint arXiv:1606.05250 (2016). [166] Kiran Ramesh et al. “A Survey of Design Techniques for Conversational Agents”. In: Information, Communication and Computing Technology. Ed. by Saroj Kaushik et al. Singapore: Springer Singapore, 2017, pp. 336–350. isbn: 978-981-10-6544-6.


[167] Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. “Scalable Multi- Domain Dialogue State Tracking”. In: arXiv preprint arXiv:1712.10224 (2017). [168] Abhinav Rastogi et al. “Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset”. In: ArXiv abs/1909.05855 (2020). [169] Alexander J Ratner et al. “Data programming: Creating large training sets, quickly”. In: Advances in Neural Information Processing Systems. 2016, pp. 3567–3575. [170] Alexander J. Ratner et al. “Snorkel: Fast Training Set Generation for ”. In: Proceedings of the 2017 ACM International Conference on Management of Data. SIGMOD ’17. Chicago, Illinois, USA: ACM, 2017, pp. 1683–1686. isbn: 978-1-4503-4197-4. doi: 10 . 1145 / 3035918 . 3056442. url: http : / / doi . acm . org / 10 . 1145 / 3035918 . 3056442. [171] Abhilasha Ravichander et al. “How Would You Say It? Eliciting Lexically Diverse Dialogue for Supervised Semantic Parsing”. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 2017, pp. 374– 383. [172] Thomas Rebele et al. “YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames”. In: The Semantic Web – ISWC 2016. Ed. by Paul Groth et al. Cham: Springer International Publishing, 2016, pp. 177–185. isbn: 978-3-319-46547-0. [173] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. “Linguistic models for analyzing and detecting biased language”. In: Pro- ceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2013, pp. 1650–1659. [174] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: arXiv preprint arXiv:1908.10084 (2019). [175] Navid Rekabsaz, Mihai Lupu, and Allan Hanbury. “Uncertainty in neu- ral network word embedding: Exploration of threshold for similarity”. In: arXiv preprint arXiv:1606.06086 (2016). [176] Jared Rivera et al. “Annotation Process for the Dialog Act Classification of a Taglish E-commerce Q\&A Corpus”. In: Proceedings of the Second Workshop on Economics and Natural Language Processing. 2019, pp. 61– 68. [177] Stephen E Robertson et al. “Okapi at TREC-3”. In: Nist Special Publica- tion Sp 109 (1995), p. 109.

[178] Carlos Rodríguez et al. “REST APIs: A Large-Scale Analysis of Compliance with Principles and Best Practices”. In: Web Engineering. Ed. by Alessandro Bozzon, Philippe Cudré-Mauroux, and Cesare Pautasso. Cham: Springer International Publishing, 2016, pp. 21–39. isbn: 978-3-319-38791-8.
[179] Andreas Rücklé et al. “Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations”. In: CoRR abs/1803.01400 (2018).
[180] Md Shahriare Satu, Md Hasnat Parvez, et al. “Review of integrated applications with AIML based chatbot”. In: 2015 International Conference on Computer and Information Engineering (ICCIE). IEEE. 2015, pp. 87–90.
[181] Denis Savenkov and Eugene Agichtein. “CRQA: Crowd-powered real-time automatic question answering system”. In: Fourth AAAI Conference on Human Computation and Crowdsourcing. 2016.
[182] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge University Press, 2008.
[183] Rico Sennrich et al. “The University of Edinburgh’s Neural MT Systems for WMT17”. In: Proceedings of the 2nd Conference on Machine Translation. 2017.
[184] Iulian Vlad Serban et al. “A survey of available corpora for building data-driven dialogue systems”. In: arXiv preprint arXiv:1512.05742 (2015).
[185] Pararth Shah et al. “Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). New Orleans - Louisiana: Association for Computational Linguistics, June 2018, pp. 41–51. doi: 10.18653/v1/N18-3006. url: https://www.aclweb.org/anthology/N18-3006.
[186] Pararth Shah et al. “Building a conversational agent overnight with dialogue self-play”. In: arXiv:1801.04871 (2018).
[187] Charese Smiley et al. “The E2E NLG Challenge: A Tale of Two Systems”. In: Proceedings of the 11th International Conference on Natural Language Generation. 2018, pp. 472–477.
[188] María J Soler, Carmen Dasí, and Juan C Ruiz. “Priming in word stem completion tasks: comparison with previous results in word fragment completion tasks”. In: Frontiers in psychology 6 (2015), p. 1172.
[189] Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. 2017. url: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972.

[190] Yu Su et al. “Building Natural Language Interfaces to Web APIs”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. CIKM ’17. Singapore, Singapore: ACM, 2017, pp. 177–186. isbn: 978-1-4503-4918-5. doi: 10.1145/3132847.3133009. url: http://doi.acm.org/10.1145/3132847.3133009.
[191] Karen Sullivan. “If you study a word do you use it more often? Lexical repetition priming in a corpus of Natural Semantic Metalanguage publications”. In: Corpora 10.3 (2015), pp. 277–290.
[192] Md Arafat Sultan, Steven Bethard, and Tamara Sumner. “Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence”. In: Transactions of the Association for Computational Linguistics 2 (2014), pp. 219–230.
[193] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–3112.
[194] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw. “Understanding Chatbot-mediated Task Management”. In: arXiv preprint arXiv:1802.03109 (2018).
[195] Martin Tschirsich and Gerold Hintz. “Leveraging crowdsourcing for paraphrase recognition”. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 2013, pp. 205–213.
[196] Endel Tulving and Daniel L Schacter. “Priming and human memory systems”. In: Science 247.4940 (1990), pp. 301–306.
[197] Gokhan Tur et al. “Towards Deeper Understanding Deep Convex Networks for Semantic Utterance Classification”. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. url: https://www.microsoft.com/en-us/research/publication/towards-deeper-understanding-deep-convex-networks-for-semantic-utterance-classification/.
[198] Stefanie Ullmann and Marcus Tomalin. “Quarantining online hate speech: technical and ethical perspectives”. In: Ethics and Information Technology (2019), pp. 1–12.
[199] Ashish Vaswani et al. “Attention is All You Need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 6000–6010. isbn: 978-1-5108-6096-4. url: http://dl.acm.org/citation.cfm?id=3295222.3295349.

[200] Mandana Vaziri et al. “Generating Chat Bots from Web API Specifications”. In: Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. Onward! 2017. Vancouver, BC, Canada: ACM, 2017, pp. 44–57. isbn: 978-1-4503-5530-8. doi: 10.1145/3133850.3133864. url: http://doi.acm.org/10.1145/3133850.3133864.
[201] Denny Vrandečić and Markus Krötzsch. “Wikidata: A Free Collaborative Knowledgebase”. In: Commun. ACM 57.10 (Sept. 2014), pp. 78–85. issn: 0001-0782. doi: 10.1145/2629489. url: http://doi.acm.org/10.1145/2629489.
[202] Richard Wallace. “The elements of AIML style”. In: Alice AI Foundation 139 (2003).
[203] W. Y. Wang et al. “Crowdsourcing the acquisition of natural language corpora: Methods and observations”. In: 2012 IEEE Spoken Language Technology Workshop (SLT). 2012, pp. 73–78. doi: 10.1109/SLT.2012.6424200.
[204] Yushi Wang, Jonathan Berant, and Percy Liang. “Building a Semantic Parser Overnight”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 1332–1342. doi: 10.3115/v1/P15-1129. url: https://www.aclweb.org/anthology/P15-1129.
[205] Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. “Sentence Similarity Learning by Lexical Decomposition and Composition”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 1340–1349. url: https://www.aclweb.org/anthology/C16-1127.
[206] Thomas Wasow, Amy Perfors, and David Beaver. “The puzzle of ambiguity”. In: Morphology and the web of grammar: Essays in memory of Steven G. Lapointe (2005), pp. 265–282.
[207] Joseph Weizenbaum. “ELIZA - a Computer Program for the Study of Natural Language Communication Between Man and Machine”. In: Commun. ACM 9.1. New York, NY, USA: ACM, Jan. 1966, pp. 36–45. doi: 10.1145/365153.365168. url: http://doi.acm.org/10.1145/365153.365168.
[208] Joseph Weizenbaum. “ELIZA - a Computer Program for the Study of Natural Language Communication Between Man and Machine”. In: Commun. ACM 9.1 (Jan. 1966), pp. 36–45. issn: 0001-0782. doi: 10.1145/365153.365168. url: http://doi.acm.org/10.1145/365153.365168.

[209] Tsung-Hsien Wen et al. “A Network-based End-to-End Trainable Task-oriented Dialogue System”. In: EACL. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 438–449. url: http://www.aclweb.org/anthology/E17-1042.
[210] Jason Williams, Antoine Raux, and Matthew Henderson. “The dialog state tracking challenge series: A review”. In: Dialogue & Discourse 7.3 (2016), pp. 4–33.
[211] Diana S Woodruff-Pak. “Eyeblink classical conditioning in HM: delay and trace paradigms.” In: Behavioral neuroscience 107.6 (1993), p. 911.
[212] Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016).
[213] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: CoRR abs/1609.08144 (2016).
[214] Jingjing Xu et al. “DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text”. In: CoRR (2018).
[215] Qiongkai Xu et al. “D-PAGE: Diverse Paraphrase Generation”. In: CoRR abs/1808.04364 (2018).
[216] Wei Xu et al. “Extracting lexically divergent paraphrases from Twitter”. In: Transactions of ACL 2 (2014), pp. 435–448.
[217] M. Yaghoub-Zadeh-Fard et al. “User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities”. In: IEEE Internet Computing PP (Mar. 2020), pp. 1–1. doi: 10.1109/MIC.2020.2978157.
[218] Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, and Shayan Zamanirad. “Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications”. In: Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020. Ed. by Angela Bonifati et al. OpenProceedings.org, 2020, pp. 1–12. doi: 10.5441/002/edbt.2020.02. url: https://doi.org/10.5441/002/edbt.2020.02.
[219] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “A Study of Incorrect Paraphrases in Crowdsourced User Utterances”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 295–306. doi: 10.18653/v1/N19-1026. url: https://www.aclweb.org/anthology/N19-1026.

[220] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances”. In: IUI ’20: 25th International Conference on Intelligent User Interfaces, Cagliari, Italy, March 17-20, 2020. Ed. by Fabio Paternò et al. ACM, 2020, pp. 55–66. doi: 10.1145/3377325.3377486. url: https://doi.org/10.1145/3377325.3377486.
[221] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “REST2Bot: Bridging the Gap between Bot Platforms and REST APIs”. In: The Web Conference. 2020.
[222] Rui Yan. ““Chitty-Chitty-Chat Bot”: Deep Learning for Conversational AI.” In: IJCAI. 2018, pp. 5520–5526.
[223] Zhao Yan et al. “Building task-oriented dialogue systems for online shopping”. In: Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[224] Guohua Yang, Xiaojie Wang, and Caixia Yuan. “Hierarchical Dialog State Tracking with Unknown Slot Values”. In: Neural Processing Letters 50.2 (2019), pp. 1611–1625.
[225] Jie Yang et al. “Leveraging Crowdsourcing Data for Deep Active Learning An Application: Learning Intents in Alexa”. In: Proceedings of the 2018 World Wide Web Conference. WWW ’18. Lyon, France: International World Wide Web Conferences Steering Committee, 2018, pp. 23–32. isbn: 978-1-4503-5639-8. doi: 10.1145/3178876.3186033. url: https://doi.org/10.1145/3178876.3186033.
[226] Yi Yang, Wen-tau Yih, and Christopher Meek. “WikiQA: A challenge dataset for open-domain question answering”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 2013–2018.
[227] Yinfei Yang et al. “Learning Semantic Textual Similarity from Conversations”. In: Proceedings of The Third Workshop on Representation Learning for NLP. Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 164–174. url: http://aclweb.org/anthology/W18-3022.
[228] Shayan Zamanirad. “Superimposition of Natural Language Conversations over Software Enabled Services”. PhD thesis. UNSW Sydney, 2020.
[229] Shayan Zamanirad et al. “Programming bots by synthesizing natural language expressions into API invocations”. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press. 2017, pp. 832–837.
[230] Zheng Zhang et al. “Memory-augmented dialogue management for task-oriented dialogue systems”. In: ACM Transactions on Information Systems (TOIS) 37.3 (2019), p. 34.
