
Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services

Mohammad-Ali Yaghoub-Zadeh-Fard

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering
Faculty of Engineering

May 2021

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed: ......

Date: ......

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.

Signed: ......

Date: ......

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.

Signed: ......

Date: ......

Dedication

I dedicate this thesis to the love of my life, Raha, for being there for me throughout the doctorate program and in all stages of my life.

I also dedicate this work to my family members. A special feeling of gratitude to my lovely parents, Mahnaz and Kazem, who left no stone unturned to support and encourage me in all stages of my life; my siblings Mahoumeh and Aliakbar, with whom I grew up; and my mother-, father-, and brother-in-law, Fatemeh, Alireza, and Mohsen, for their support and encouragement.

Acknowledgments

I would like to give my special gratitude to all the people who supported me during this journey:

• Special thanks to my supervisor, Scientia Professor Boualem Benatallah. He has been such an inspiration and so supportive during these years. I really enjoyed the opportunity to work under his supervision.

• I would like to thank my sponsor, the Australian Research Council (ARC), for supporting me financially during my research.

• I would like to thank the Australian Government and acknowledge that my work was partially supported by an Australian Government Research Training Program Scholarship.

• I would like to give my special thanks to Dr. Shayan Zamanirad for his help and assistance throughout my thesis. Doing research, exercising, and having fun together with Shayan, I finished this journey.

• Thanks to Dr. Moshe Chai Barukh and Professor Fabio Casati for their helpful comments on my work, and also for their brilliant ideas.

• Thanks to Dr. Amin Beheshti. His support and friendly attitude were inspiring and heart-warming.

• Thanks to Dr. Mohsen Afsharchi and Dr. Mozafar Bag-Mohammadi, who were my role models during my bachelor’s degree. Their friendly characters and their insights in computer science inspired me to take this journey in the first place.

Abstract

Dialog Systems (or simply bots) have recently become a popular human-computer interface for performing users’ tasks, by invoking the appropriate back-end APIs (Application Programming Interfaces) based on the user’s request in natural language. Building task-oriented bots, which aim at performing real-world tasks (e.g., booking flights), has become feasible with the continuous advances in Natural Language Processing (NLP), Artificial Intelligence (AI), and the countless number of devices which allow third-party software systems to access their back-end APIs. Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human languages. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks – called intents. To build a bot, developers need to manually translate APIs to utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets, by paraphrasing the initial utterances generated by the bot developers for each task. However, there are several unsolved issues. First, generating canonical utterances requires manual efforts, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy and incorrect (not conveying the same intent as the given task). This thesis first surveys the state-of-the-art approaches for collecting large sets of training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues of crowdsourced utterances (e.g., grammatical errors, semantic completeness). Moreover, we propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. Particularly, we propose a novel technique to promote the diversity of crowdsourced paraphrases by dynamically generating word suggestions while

crowd workers are paraphrasing a particular utterance. Moreover, we propose a novel technique to automatically translate APIs to canonical utterances. Finally, we present our platform to automatically generate bots out of API specifications. We also conduct thorough experiments to validate the proposed techniques and models.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

Publications

1 Introduction
1.1 Background, Motivations and Aims
1.2 Research Issues
1.2.1 Acquisition of Large Training Utterances in Scale
1.2.2 Assessing and Controlling Quality of Training Utterances
1.3 Contributions
1.3.1 User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
1.3.2 An Empirical Study of Crowdsourced Paraphrases
1.3.3 Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances
1.3.4 Automatic Malicious Worker Detection in Crowdsourced Paraphrases
1.3.5 Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications
1.3.6 REST2Bot: Bridging the Gap between Bot Platforms and REST APIs
1.4 Thesis structure


2 Utterance Acquisition Methods: Background and the State-of-the-art
2.1 Intent Recognition Methods
2.1.1 Rule-based Methods
2.1.2 Machine Learning Methods
2.1.3 Crowdsourcing-based Methods
2.1.4 Retrieval-based Methods
2.1.5 End-to-End Approaches
2.2 Problem Dimensions in Training Utterance Acquisition
2.2.1 Quality
2.2.2 Cost
2.3 Utterance Acquisition Methods
2.3.1 Utterance Acquisition from Bot Usage
2.3.2 Automatically Generating Utterances
2.3.3 Crowdsourcing User Utterances
2.4 Summary and Discussion
2.4.1 Continuous Crowd-Machine Collaboration
2.4.2 Integrating Quality Control and Crowdsourcing
2.4.3 Sharing Utterances
2.4.4 Gamification
2.4.5 Data Programming
2.5 Conclusion

3 An Empirical Study of Crowdsourced Paraphrases
3.1 Introduction
3.2 Related Work
3.2.1 Quality Control
3.2.2 ...
3.3 Paraphrase Dataset Collection
3.3.1 Methodology
3.4 Common Paraphrasing Issues
3.4.1 Spelling Errors
3.4.2 Linguistic Errors
3.4.3 Semantic Errors
3.4.4 Task Misunderstanding

3.4.5 Cheating
3.5 Dataset Annotation
3.5.1 Methodology
3.5.2 Statistics
3.6 Automatic Error Detection
3.6.1 Spelling Errors
3.6.2 Linguistic Errors
3.6.3 Translation
3.6.4 Answering to Canonical Utterance
3.6.5 Semantic Errors & Cheating
3.6.6 Incorrect Paraphrases Detection
3.7 Conclusion

4 Diversity-aware Crowdsourced Utterances
4.1 Introduction
4.2 Related Work
4.2.1 Priming in crowdsourcing
4.2.2 Word list expansion
4.3 Word Recommender
4.3.1 Synonym Sets Extractor
4.3.2 Recommendation Sets Generator
4.3.3 Recommendation Set Ranker
4.4 Experiments & Results
4.4.1 Task Design Experiment
4.4.2 Crowdsourced Paraphrasing
4.4.3 Results
4.4.4 Limitations
4.5 Conclusion

5 Automatic Malicious Worker Detection
5.1 Introduction
5.2 Related Work
5.3 Cheating Behaviors
5.3.1 Character-Level Edits
5.3.2 Word-Level Edits
5.3.3 Random Sentences


5.3.4 Answering to Canonical Utterance
5.3.5 Foreign Language
5.4 Cheating Detection
5.4.1 Feature Engineering
5.4.2 Malicious Worker Detection
5.5 Experiment & Results
5.5.1 Evaluation
5.5.2 Error Analysis
5.6 Conclusion

6 Automated Canonical Utterance Generation
6.1 Introduction
6.2 Related Work
6.2.1 REST APIs
6.2.2 Conversational Agents and Web APIs
6.3 The API2CAN Dataset
6.3.1 API2CAN Generation Process
6.3.2 Dataset Statistics
6.4 Neural Canonical Sentence Generation
6.4.1 Resources in REST
6.4.2 Resource-based Delexicalization
6.5 Parameter Value Sampling
6.6 API2CAN Service
6.7 Experiments & Results
6.7.1 Translation Methods
6.7.2 Canonical Utterance Generation
6.7.3 Parameter Value Sampling
6.8 Conclusion

7 REST2Bot: A Platform for Automated Bot Development
7.1 Introduction
7.2 System Overview
7.2.1 API Parser
7.2.2 Canonical Sentence Generator
7.2.3 Paraphraser
7.2.4 Conversation Manager Generator

7.2.5 Webhook Generator
7.3 Usecases
7.4 Conclusion

8 Conclusion
8.1 Summary of the Research Issues
8.2 Summary of the Research Outcomes
8.3 Future Research Directions
8.3.1 Integrating Quality Control and Crowdsourcing with Feedback Mechanism
8.3.2 Canonical Utterance Generation for Complex Intents
8.3.3 Acquisition of Training Dialogues

Bibliography


List of Figures

1.1 Human-Computer Interfaces
1.2 Single-turn Conversation vs Multi-turn Conversations
1.3 Typical Bot Development Process
1.4 Research Approach

2.1 Pipeline Architecture of Text-based Dialog Systems
2.2 Utterance, Intent, and Entity Relationship
2.3 Defined pattern/response pairs for Greeting intent [228]
2.4 Crowd workers extract the required parameters and intents in “Guardian” [89]
2.5 Utterance Acquisition Methods
2.6 Canonical Utterance Generation

3.1 Dataset Label Statistics

4.1 Word-Recommendation Overview
4.2 Crowd Workers’ Interface
4.3 Word Recommender Architecture
4.4 Synonym-Sets Creation for an Utterance
4.5 Recommendation-Set Generation
4.6 Likert Assessment of Word Clouds by Size
4.7 Average User Rating over Time

5.1 Crowdsourced Paraphrasing in figure-eight
5.2 Cheating rate across different domains

6.1 Example of an HTTP POST Request
6.2 Classical Training Data Generation Pipeline
6.3 Excerpt of an OpenAPI Specification
6.4 Process of Canonical Utterance Extraction


6.5 API2CAN Breakdown by HTTP Verb
6.6 API2CAN Breakdown by Length
6.7 Canonical Template Generation via Resource-based Delexicalization
6.8 Assessment of Generated Canonical Templates
6.9 Parameter Type and Location Statistics

7.1 Typical Bot Development Process vs Bot Development Process Using REST2Bot (Green Arrow)
7.2 REST2Bot Architecture - Building conversational bots from APIs specifications
7.3 Excerpt of ’s OpenAPI Specification
7.4 REST2Bot

List of Tables

2.1 Summary of Utterance Acquisition Methods
2.2 Crowdsourced paraphrasing strategies

3.1 Paraphrase Samples
3.2 Pairwise Inter-Annotator Agreement
3.3 Comparison of Spell Checkers
3.4 Comparison of Grammar Checkers
3.5 Language Detection
3.6 Summary of Feature Library
3.7 Automatic Answering Detection
3.8 Automatic Semantic Error Detection
3.9 Automatic Cheating Detection
3.10 Automatic Incorrect Paraphrase Detection

4.1 Crowdsourced Paraphrase Datasets
4.2 Diversity Comparison
4.3 Naturalness Comparison
4.4 Diversity of 1st, 2nd, and 3rd Paraphrases
4.5 Intent Detection Accuracy by Dataset
4.6 Worker Satisfaction

5.1 Summary of Feature Library
5.2 Automatic Malicious Worker Detection

6.1 Parameter Replacement Context Free Grammar
6.2 API2CAN Statistics
6.3 Resource Types
6.4 Excerpt of Transformation Rules
6.5 Translation Performance


6.6 Examples of Generated Canonical Templates

Publications

Journals

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, and Shayan Zamanirad, “User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities,” in IEEE Internet Computing, 2020.

Conferences

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Moshe Chai Barukh, and Shayan Zamanirad, “A study of incorrect paraphrases in crowdsourced user utterances,” in NAACL-HLT, 2019.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, and Shayan Zamanirad, “Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances,” in IUI, 2020.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, and Shayan Zamanirad, “Automatic canonical utterance generation for task-oriented bots from API specifications,” in EDBT, 2020.

• Mohammad-Ali Yaghoub-Zadeh-Fard, Shayan Zamanirad, and Boualem Benatallah, “REST2Bot: Bridging the gap between bot platforms and REST APIs,” in WWW, 2020.


Chapter 1

Introduction

1.1 Background, Motivations and Aims

The human-computer interface, as the means of communication between the human user and a machine, has been evolving since the invention of the computer; from command-line environments to graphical user interfaces, and natural language interfaces (text-based and spoken interfaces). In today’s highly message-based culture, natural language interfaces have gained enormous traction, and building dialog systems (also known as virtual assistants, conversational agents, or simply bots) has come to attention as a way to facilitate human-computer interaction. Almost all of the big companies have already invested in virtual personal assistants. Alexa and Apple, to name a few, are collectively being used by millions of users worldwide [45]. Many other sophisticated bots are continuously being developed, from those that allow

Figure 1.1: Human-Computer Interfaces.

data scientists to assemble data analytic pipelines (e.g., AVA [97], Analyza [48]) to bots that act like humans (e.g., Microsoft’s Tay). Other applications of bots include entertainment, generating source code, querying databases, controlling home appliances, and even testing theories [207, 13, 97, 61, 171]. At the time of writing, over 1 billion Alexa and Google Assistant devices have been collectively implanted in our homes1 [228]. A conversation between a user and the computer can be as simple as asking a question and getting an answer, or it can be a complex conversation with a series of interactions (e.g., questions and answers). In terms of interaction modes, conversational bots can be divided into two main categories [101, 100]:

• Single-turn conversation. In single-turn mode, the user queries the bot, and the bot provides a response back to the user. As such, the user gets the requested information or performs the desired task in one quick interaction. After generating the response, the conversation is over, and neither the question nor the response will be remembered for future conversations. In other words, each conversation is performed in complete isolation.

• Multi-turn conversations. In multi-turn conversations, there is a dialog between the user and the bot. In each turn the bot responds appropriately to the user’s request based on the conversation history. As such, the bot may request additional information (e.g., location of the restaurant, type of cuisine, rate) if required for fulfilling the user’s request (e.g., finding a restaurant).

To perform a task (e.g., finding a restaurant), bots need to map the user’s utterance (e.g., “Is there any restaurant around Station Street?”) to executable forms such as back-end APIs (Application Programming Interfaces) (e.g., @flights.query(location=‘Station Street’)) and SQL commands (e.g., select name from restaurants where location=“Station Street”). In this thesis, we focus on one of the most popular forms of intents, namely those which are fulfilled by invoking APIs. The translation of user utterances to intents is challenging since human language is unbounded and rich, and users can express the same task in countless ways in natural language. Moreover, smart devices and software services are continuously developed and provide new APIs to interact with other

1https://www.cnet.com/news/google-assistant-expands-to-a-billion-devices-and-80-countries/

2 1.1 Background, Motivations and Aims


Figure 1.2: Single-turn Conversation vs Multi-turn Conversations

services, including bots. With the number of different APIs growing and evolving very rapidly, we need bot development to “scale” in terms of how effectively bots can be integrated with APIs. However, from an engineering perspective, the integration of bots and APIs is still lagging behind the deployment of new APIs, devices, and services [228]. Current solutions for bot development rely heavily on experts’ understanding of APIs and require extensive manual efforts to obtain training samples for building supervised machine learning models. Given the evolving nature of APIs and their abundance, large-scale integration of bots and APIs requires minimal expert involvement. After all, the value of virtual assistants is heavily related to their capabilities of integrating with a large number of evolving and heterogeneous APIs [228, 217]. Developing a bot typically implies the ability to invoke APIs corresponding to user utterances (e.g., “what will the weather be like tomorrow in NYC?"). This is typically done in two phases, as demonstrated in Figure 7.1: (i) training a Natural Language Understanding (NLU) model to map user utterances to a predefined set of intents2 and to extract the associated entities3; and (ii) developing Webhook functions to map intents to executable forms (e.g., APIs, SQL commands) and serve the user’s request by performing tasks (e.g., reporting the weather) [228, 218]. In this thesis, we concentrate on a very popular form of user intents which are fulfilled by invoking API methods.

2An intent is an abstract description of a task to be performed such as “book-flights”, “set-alarm”, “get-playlists”.
3An entity is a parameter (e.g., “location”, “date”) of the intent (e.g., “book-flights”).



Figure 1.3: Typical Bot Development Process

Supervised NLU techniques require definition of intents (e.g., booking hotels), entity types (e.g., location, date), and a set of annotated utterances in which entities are labeled with the entity types and intents. To obtain such sets of utterances, the typical approach is to obtain an initial utterance, called a canonical utterance, and then paraphrase it (either manually or automatically) to generate a diverse set of user utterances [217, 95]. Obtaining canonical utterances for each intent requires the creation of utterance templates or grammar rules [217, 25, 190]. Canonical utterances are then paraphrased (automatically or using crowdsourcing) to obtain a more diverse set of training utterances. However, bot developers are required to filter/remove (manually, automatically, or via crowdsourcing) unqualified utterances from the generated set of utterances to obtain a high-quality set of utterances and build reliable bots [219, 217]. The cleaned dataset can be used for training NLU models. Finally, bot developers provide Webhooks (i.e., intent-action rules that trigger API calls upon detection of associated intents) [218].
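To make the second phase concrete, the sketch below shows what a minimal Webhook-style handler could look like: it takes the intent and entities produced by an NLU model and maps them to a back-end API call. This is only an illustration under assumed names; the intent "get_weather", the entity keys, and the endpoint https://api.example.com are hypothetical and not tied to any specific bot platform or API.

```python
# Minimal, hypothetical webhook handler: map a detected intent and its entities
# to a back-end REST API call and produce a reply. Intent name, entity keys,
# and the API endpoint are assumptions for illustration only.
import requests

API_BASE = "https://api.example.com"  # hypothetical back-end API

def handle_intent(intent: str, entities: dict) -> str:
    if intent == "get_weather":
        # e.g., entities = {"city": "NYC", "date": "tomorrow"}
        resp = requests.get(f"{API_BASE}/weather",
                            params={"city": entities.get("city"),
                                    "date": entities.get("date")})
        forecast = resp.json().get("summary", "unknown")
        return f"The weather in {entities.get('city')} will be {forecast}."
    return "Sorry, I cannot handle that request yet."

# Example NLU output for "what will the weather be like tomorrow in NYC?":
# handle_intent("get_weather", {"city": "NYC", "date": "tomorrow"})
```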

Integration of bots and APIs, therefore, requires techniques to reduce expert involvement or automate some training data acquisition tasks (e.g., canonical utterance generation, paraphrasing, quality control of training utterances). In this thesis, we contribute novel techniques focusing on the acquisition of quality training utterances at scale and reducing the involvement of experts in the process of building bots.

1.2 Research Issues

In this section, we will discuss important issues regarding efficient and effective training data acquisition in dialog systems.

1.2.1 Acquisition of Large Training Utterances in Scale

The state-of-the-art solutions for building NLU models often rely on training utterances for a particular set of intents [217]. As mentioned earlier in this chapter, obtaining training data typically involves (i) obtaining initial utterances (also called canonical utterances) that capture the users’ particular intents, and (ii) paraphrasing the initial utterances into multiple variations to live up to human language richness [225, 203].

Canonical Utterance Generation. For a given intent and its entities, the first step in training data acquisition is to create a canonical utterance which conveys the same intent with the given set of entity values. While APIs are unleashing the power of software systems and devices, they are often difficult to understand for non-experts. Thus, generating canonical utterances is essential since they express the functionality of an API method in natural language and can be understood by non-experts. Existing solutions for generating canonical utterances often involve employing domain experts (e.g., API developers) to generate hand-crafted domain-specific templates or grammar rules [25, 190, 204]. Not only are such approaches costly since they rely on experts, but they are also domain-specific and not scalable [204, 25, 190]. As a result, building a bot for a new domain (a new API) requires manual efforts to modify the templates or hand-crafted grammar rules to generate canonical utterances [217]. In addition, API specifications (e.g., the names of methods and parameters) change over time, requiring modification of templates and grammar rules. It is thus essential to develop approaches with minimum involvement of experts [25, 190].

Paraphrasing Canonical Utterances. A diverse set of utterances can more

effectively represent the various ways that an intent can be expressed in natural language [217, 25, 190, 139]. Paraphrasing is thus essential, especially given the flexible but ambiguous nature of human languages [206]. A lack of variation in training utterances can cause incorrect intent recognition or entity resolution, and it can result in bots performing undesirable (even dangerous) tasks (e.g., pressing the accelerator rather than the brake pedal in a car) [85]. Automated or crowdsourced paraphrasing techniques have been used to diversify training utterances. Automatic paraphrase generation is potentially inexpensive [52, 50, 125]. However, even the state-of-the-art techniques fall short in producing sufficiently diverse utterances which are at the same time semantically correct [80, 75, 215, 18]. Due to the shortcomings of automated paraphrasing systems, crowdsourced paraphrasing has recently become popular [25]. However, such paraphrases may be generated by anonymous workers with varied skills and motivations. Consequently, they may contain incorrect paraphrases [46].

1.2.2 Assessing and Controlling Quality of Training Utterances

Collected utterances may contain incorrect samples which do not convey the expected intent [217]. The lack of high-quality training samples can be disastrous [138]. Microsoft’s Tay, as an example, quickly made a number of offensive comments because of the presence of biases in its training samples [85]. Particularly in crowdsourced utterances, malicious workers, spammers, and inexperienced workers may generate misleading and erroneous paraphrases [219]. Quality issues in crowdsourced utterances may also stem from misunderstanding the task of paraphrasing or missing information such as values for the parameters of intents [190]. This thesis thus focuses on solutions required for obtaining high-quality training utterances from crowd workers. The quality dimensions of such collections of utterances can be divided into two primary categories:

• Corpus-level. At the corpus level, bot developers aim to obtain a diverse set of utterances with minimal bias [85]. However, it has been shown in numerous studies that crowd workers are biased towards using the words in the given initial utterance (to be paraphrased) [217, 220, 139]. Bias towards the vocabulary and structure of the sentence to be paraphrased can be explained by the priming effect – an automatic, implicit and non-conscious activation


of information in memory [86]. According to the priming effect, exposure to a stimulus affects responses to a subsequent stimulus (e.g., asked to name a word starting with “str”, humans are more likely to form the word “strong” than “street” if they have previously been shown the word “strong”) [211]. As such, primed by the words in the given utterance, crowd workers are more likely to use the same vocabulary when paraphrasing [171, 95]. Thus, the priming effect may negatively impact the diversity of collected paraphrases. Current solutions for overcoming the priming effect, such as paraphrasing the paraphrases obtained by other crowd workers, have been shown to result in numerous semantically incorrect paraphrases [171, 95]. The new challenge is to effectively design a crowdsourcing task without biasing crowd workers.

• Utterance-level. At the utterance level, an utterance should convey the supposed intent without any divergence in meaning. In crowdsourcing settings, workers are supposed to provide correct paraphrases for a given utterance. However, it has been reported in many studies that crowd workers may generate incorrect paraphrases [11, 85, 112]. For example, spammers, malicious and even inexperienced crowd workers may provide misleading, erroneous, and semantically invalid paraphrases [118, 25]. Quality issues may also stem from misunderstanding the intent or not covering important information such as values of the intent parameters [190]. Thus, crowdsourced paraphrases need to be checked for quality, given that they are produced by unknown workers with varied skills and motivations [25, 46]. The common practice for removing unqualified paraphrases is to design another crowdsourcing task called a “validation task” [217, 139]. However, this approach is costly, as one has to pay for both the paraphrasing and validation tasks, making automated techniques a very appealing alternative. Moreover, quality control is more desirable if it is done before letting workers submit their paraphrases, since low-quality workers can be removed early on without any payment [112, 217, 219, 141]. To achieve this, it is therefore important to automatically recognize quality issues in crowdsourced paraphrases during the process of bot development. Existing automated approaches are limited to removing misspelled paraphrases [203] and discarding submissions from workers with low/high task completion times [123]. We need to characterize what kinds of paraphrases can be considered incorrect.


We also need to build efficient and effective approaches to detect incorrect paraphrases.

1.3 Contributions

In this thesis, we devise novel techniques and models to tackle the above-mentioned issues in training data acquisition for dialog systems. We build upon advances in natural language processing, machine learning, dialog systems, and crowdsourcing to address the raised issues in obtaining bot training utterances. The proposed concepts and techniques resolve important gaps and shortcomings in efficiently and effectively acquiring bot training utterances. They include (i) a comprehensive survey of existing approaches for obtaining quality training utterances; (ii) a study that characterizes how incorrect paraphrases are generated by crowd workers and a taxonomy of common quality problems in crowdsourced user utterances, as well as techniques to detect each category of incorrect paraphrases; (iii) improving the diversity of crowdsourced paraphrases via dynamic word recommendation; (iv) a novel technique to detect malicious workers who intentionally generate incorrect paraphrases in crowdsourcing tasks, particularly in crowdsourced user utterances; (v) an automatic approach for generating canonical utterances to directly translate a REST (REpresentational State Transfer) API to natural language; and (vi) a software prototype demonstrating all the techniques and models developed in this study. We investigate and develop software architectures, prototypes, evaluation studies and applications to assess the proposed models and techniques. In the rest of this section, we will briefly explain the contributions made in this study.

1.3.1 User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities4

Building conversational task-oriented bots requires large and diverse sets of annotated user utterances to learn mappings between natural language utterances and user intents. Given the complexity of human language as well as the recent advances in intent recognition (especially deep-learning-based approaches), bot

4This contribution has been published in IEEE Internet Computing (see [217])

developers now face a new challenge: efficiently and effectively collecting a large number of quality (e.g., diverse, unbiased) training samples. This work studies training user utterance acquisition methods along several important dimensions including cost and quality [217]. We discuss the state-of-the-art techniques, identify and explore open issues, and inform an outlook on future research directions [217]. In the following contributions, we first perform an empirical study to characterize crowdsourced utterances (e.g., their quality issues) and then address the identified open issues, namely the lack of diversity in crowdsourced utterances and the automatic generation of canonical utterances.

1.3.2 An Empirical Study of Crowdsourced Paraphrases5

Developing bots requires high-quality training samples. Crowdsourcing has widely been used to collect such datasets by paraphrasing an initial utterance into new variations. However, this approach often suffers from various quality issues, particularly language errors produced by unqualified crowd workers. More so, since workers are tasked to write open-ended text, it is very challenging to automatically assess the quality of paraphrased utterances. In this study, we investigate common crowdsourced paraphrasing issues and derive a taxonomy of such issues (e.g., cheating, spelling errors, grammatical errors, semantic errors, and task misunderstanding such as translation and answering) [219]. We collected a large set of crowdsourced paraphrases in several domains and propose an annotated dataset, called Para-Quality, for detecting the quality issues. We also investigated existing tools and services to provide baselines for detecting each category of issues and to assess whether existing tools are capable of detecting such issues automatically and accurately [219]. Overall, this work presents a data-driven view of incorrect paraphrases during the bot development process, and we pave the way towards automatic detection of unqualified paraphrases [219]. In the following contributions, we address two of the most important quality issues, namely the lack of diversity in training utterances and the detection of incorrect utterances generated by malicious workers in crowdsourced utterances.

5This contribution has been published in the proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (see [219])


1.3.3 Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances6

As discussed in Research Issues (see Sections 1.2.1 and 1.2.2) and investigated thoroughly in our prior contributions (see Sections 1.3.1 and 1.3.2), building task-oriented bots requires a large number of diverse utterances. Crowdsourcing may be an effective, inexpensive, and scalable technique for collecting such large datasets [217, 25, 190]. However, the diversity of the results suffers from the priming effect (i.e., workers are more likely to use the words in the sentence we ask them to paraphrase) [217, 139]. In this study, we leverage priming as an opportunity rather than a threat: we dynamically generate word suggestions by introducing a novel word-list enrichment method based on word-alignment techniques, to motivate crowd workers towards producing diverse utterances [220]. The suggestions are generated based on the already collected paraphrases to encourage diversity by suggesting new words/phrases to crowd workers. The key challenge is to make suggestions that can improve diversity without resulting in semantically invalid paraphrases. To achieve this, we propose a probabilistic model that generates continuously improved versions of word suggestions that balance diversity and semantic relevance. Our experiments show that the proposed approach improves the diversity of crowdsourced paraphrases [220]. Moreover, it decreases the task completion time (the time taken for doing a task by a crowd worker) and reduces the number of spelling errors [220]. Later in this study, the proposed crowdsourcing approach (crowdsourcing utterances via word recommendation) is used along with other contributions to build a platform for facilitating the process of building bots (see the contribution in Section 1.3.6).
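The following sketch illustrates the underlying idea of diversity-oriented word suggestion in a heavily simplified form: it counts which words have already been used in the collected paraphrases and proposes unused WordNet synonyms as suggestions. It is not the probabilistic model proposed in this thesis (which additionally balances diversity against semantic relevance); it only shows how already collected paraphrases can drive suggestions. It assumes NLTK with the WordNet corpus installed.

```python
# Simplified, hypothetical word-suggestion sketch: propose synonyms that have not
# yet appeared in the collected paraphrases. Not the thesis's actual model.
from collections import Counter
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def suggest_words(collected_paraphrases, top_k=5):
    used = Counter(w.lower() for p in collected_paraphrases for w in p.split())
    suggestions = set()
    for word in used:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemma_names():
                candidate = lemma.replace("_", " ").lower()
                if candidate not in used:
                    suggestions.add(candidate)
    return sorted(suggestions)[:top_k]

# suggest_words(["book a flight to Sydney", "reserve a flight to Sydney"])
# might suggest words such as "engage" or "reservation", depending on WordNet.
```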

1.3.4 Automatic Malicious Worker Detection in Crowdsourced Paraphrases

Crowdsourced utterances are generated by crowd workers with varied skills and motivations. They thus often suffer from various quality issues (e.g., semantic and linguistic errors) which are studied in prior contributions (see the research issue in Section 1.2.2 and the contributions in Section 1.3.1, 1.3.2, and 1.3.3). Such

6This contribution has been published in the proceedings of the 25th International Confer- ence on Intelligent User Interfaces (IUI) (see [220])

quality issues stem from the lack of quality assessment methods for open-ended text in crowdsourcing platforms (e.g., Mechanical Turk, Figure-Eight). Thus unqualified workers and spammers may generate erroneous paraphrases. Crowdsourced paraphrases must be verified to remove erroneous paraphrases, imposing an extra cost. Detecting such erroneous paraphrases requires an understanding of how malicious workers generate paraphrases. In this study, we provide a taxonomy of cheating behaviors in crowdsourced paraphrasing to give insight into how malicious workers intentionally generate erroneous paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then propose two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity7, user edit habits8) and propose a domain-independent method for modeling malicious workers. Our experiments indicate that the proposed approach can effectively detect malicious workers based on the paraphrases generated by them, and reduce the rate of incorrect paraphrases9 generated by malicious workers by 76 percent. The proposed approach for detecting malicious workers is used along with other contributions to build a platform for facilitating the generation of training utterances for bots (see the contribution in Section 1.3.6).
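As a rough illustration of one of the features mentioned above, the sketch below computes a multi-paraphrase similarity score: the average pairwise similarity among the paraphrases submitted by a single worker, where very high values may indicate trivial character- or word-level edits. It uses a simple token-overlap (Jaccard) similarity purely for illustration; the actual approach combines many features and a learned classifier.

```python
# Illustrative multi-paraphrase similarity feature (token-overlap based).
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def multi_paraphrase_similarity(paraphrases: list) -> float:
    pairs = list(combinations(paraphrases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# A worker who only swaps one word across submissions gets a suspiciously high score:
# multi_paraphrase_similarity(["book a flight", "book a plane", "book a trip"])
```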

1.3.5 Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications10

As discussed in Research Issues (see Section 1.2.1) and studied in the prior contribution in Section 1.3.1, existing approaches for generating canonical utterances require experts to generate utterance templates or grammar rules [190, 217, 25]. They thus do not scale and are domain-specific, making bots expensive to maintain [217]. With the development of REST APIs, many applications have been designed to harness their potential, and they are used to fulfill intents by invoking back-end APIs provided by software systems and smart devices [190]. The automatic generation of canonical utterances can be considered as a supervised

7computing the similarity between a set of paraphrases submitted by a crowd worker
8defined by how a crowd worker edits a sentence to generate a paraphrase
9the number of incorrect paraphrases by cheaters divided by the total number of paraphrases
10This contribution has been published in the proceedings of the 23rd International Conference on Extending Database Technology (EDBT) (see [218])

translation task in which an API method is translated into an utterance. However, the key challenge is the lack of training data for training domain-independent models. In this study, we propose API2CAN, an extensible dataset containing 14,370 pairs of API methods and utterances. The dataset is built by processing a large number of public APIs. However, deep-learning-based approaches such as sequence-to-sequence models require larger sets of training samples (ideally millions of samples) [217, 218]. To mitigate the absence of such large datasets, we define resources in REST APIs and propose a delexicalization technique (converting an API method and initial utterances to tagged sequences of resources) to let deep-learning-based approaches learn from such datasets [218]. In addition, we show how parameter values can be sampled to feed placeholders in a canonical template and generate canonical utterances. We also provide a systematic analysis of Web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances [218]. We also conduct both qualitative and quantitative experiments to verify the effectiveness of the proposed delexicalization technique and methods for sampling parameter values. Our experiments indicate that the proposed approaches can be effectively used to automatically generate canonical templates for REST APIs. In the next contribution, we use the proposed translation model to generate canonical utterances for REST APIs, along with the prior contributions, to facilitate the process of training utterance acquisition.
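To give a flavour of what translating an API method into a canonical template involves, the sketch below delexicalizes a REST endpoint in a naive, rule-like way: static path segments become resource words and parameterized segments become placeholders. The endpoint and the produced wording are hypothetical; the resource-based delexicalization proposed in this thesis is considerably richer (it types resources and learns the phrasing with a neural translation model).

```python
# Naive, hypothetical canonicalization of a REST endpoint; for illustration only.
import re

def naive_canonical_template(verb: str, path: str) -> str:
    verb_phrases = {"get": "get", "post": "create", "put": "update",
                    "patch": "update", "delete": "delete"}
    parts = [verb_phrases.get(verb.lower(), verb.lower())]
    pending_params, first = [], True
    for seg in reversed([s for s in path.strip("/").split("/") if s]):
        m = re.fullmatch(r"\{(.+)\}", seg)
        if m:                                      # path parameter -> placeholder
            pending_params.append(m.group(1))
            continue
        noun = seg if first else seg.rstrip("s")   # crude singularization of parent resources
        parts.append(("the " if first else "of the ") + noun)
        parts += [f"with {p} <<{p}>>" for p in pending_params]
        pending_params, first = [], False
    return " ".join(parts)

# naive_canonical_template("GET", "/users/{user_id}/repos")
# -> "get the repos of the user with user_id <<user_id>>"
```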

1.3.6 REST2Bot: Bridging the Gap between Bot Platforms and REST APIs11

Building bots requires collecting large but high-quality sets of training utterances (see Sections 1.2.1 and 1.2.2). Whereas existing bot development platforms (e.g., Dialogflow, Wit.ai) facilitate building bots, bot developers are still required to provide training data by defining corresponding intents (the user’s intention, such as booking a hotel) and entities (e.g., hotel location). Moreover, bot developers are required to build and deploy Webhook functions to invoke API methods on intent detection. In this contribution, we introduce REST2Bot, a tool that addresses these shortcomings (e.g., translating APIs to intents, and invoking APIs

11This contribution has been published in the proceedings of 2020 The Web Conference (WWW) (see [221])

based on detected intents) in bot development frameworks to automate several tasks in the life cycle of the bot development process [221]. REST2Bot relies on the proposed techniques and approaches for generating canonical utterances (see Section 1.3.5) and crowdsourcing utterances via word recommendation (see Section 1.3.3) to build bots on the desired bot development frameworks. It also relies on the state-of-the-art approaches to parse API specifications and generate deployable webhook functions to map intents and entities to APIs [228].

1.4 Thesis structure

The thesis structure follows the research approach presented in Figure 1.4. In each chapter, we present the related work to motivate the study and provide necessary background for understanding the chapter. We also provide essential background knowledge, discuss the state of the art and relevant terminology in Chapter 2. Chapter 3 presents our empirical study on how incorrect paraphrases are generated during crowdsourcing, proposes a taxonomy of common issues in crowdsourced paraphrases, and investigates existing tools and models for detecting each category of the issues. Chapter 4 proposes and presents techniques to dynamically generate word suggestions in crowdsourced paraphrasing to promote diversity while discouraging the generation of semantically incorrect paraphrases. Chapter 5 delves into an important category of incorrect paraphrases generated by insincere workers, and proposes an effective approach to detect malicious workers at the time of submission of crowdsourcing tasks. Chapter 6 proposes and presents a technique to generate the initial utterances (to be later paraphrased by crowdsourcing or automated paraphrasing systems) for REST APIs. Chapter 7 presents our prototype software which builds bots automatically from REST API specifications (also known as swagger documentations) using the proposed approaches for generating canonical sentences, the crowdsourcing approach proposed in this thesis, as well as the state-of-the-art approaches proposed in related work. Finally, Chapter 8 provides our conclusion and highlights future research directions and outlook.



Figure 1.4: Research Approach

Chapter 2

Utterance Acquisition Methods: Background and the State-of-the-art

This chapter presents background on dialog systems and introduces the main concepts and techniques that are relevant to the contributions presented in the following chapters. It reviews training user utterance acquisition methods (i.e., bot-usage-, automated-, and crowdsourcing-based approaches) along several important dimensions including cost and quality. We discuss state-of-the-art techniques and identify open issues in training utterance acquisition. In Section 2, we discuss background on dialog systems. Section 2.1 particularly discusses an important task in building such systems, namely intent recognition. Section 2.2 identifies the problem dimensions in training data acquisition for training intent recognition models. In Section 2.3, we overview utterance acquisition methods. Section 2.4 provides a summary and outlook for utterance acquisition methods. Finally, Section 2.5 provides our conclusion.

Part I - Dialog Systems

Dialog systems can generally be categorized into two classes: (i) non-task-oriented bots (usually referred to as chatbots), and (ii) task-oriented bots [31]. Non-task-oriented bots engage in open-domain conversations with users without a predefined goal for the conversations. It is also worth noting that even in task-oriented bots, around 80% of conversations are just chit-chat messages [31]. Such bots hardly keep the history of the conversation and its states, and they are therefore not



Figure 2.1: Pipeline Architecture of Text-based Dialog Systems

designed to perform specific user tasks (e.g., booking a flight or a hotel) [194]. Examples of non-task-oriented bots include: DBpedia chatbot [6], Mitsuku1, and Cleverbot2.

On the other hand, task-oriented bots allow users to accomplish simple/complex tasks (e.g., playing music, querying a database [43, 194]) using information provided by users during conversations. This thesis focuses on this type of bot, and the term “bots” will refer to task-oriented bots from now on. The main components of a text-based task-oriented dialog system are shown in Figure 2.1 [31]. Dialog systems often consist of four main units: Natural Language Understanding (NLU), Dialog State Tracking (DST), Dialog Policy (DP), and Natural Language Generation (NLG) [100]. In each turn of the conversation, the user provides an utterance (e.g., “book a flight from Sydney to Houston”) which is analyzed by the NLU component to detect the user’s intent (e.g., “flight booking”) and its entities (e.g., “from:Sydney”, “to:Houston”). “Intent” captures the user’s purpose, while “entity” (aka slot) describes a term/object from the utterance to formulate the parameter(s) needed to fulfill the intent [229]. Next, DST determines the current state of the dialog based on the given utterance and previous interactions. The state of the dialog indicates the current values of entities based on the history of the conversation (e.g., “intent:flight booking”, “from:Sydney”, “to:Houston”, “departure date:—”). The new dialog state is passed onto DP, which decides on the next actions (e.g., “ask for the departure date”). Finally, NLG produces an appropriate response (e.g., “When do you want to depart?”)3.

1https://www.pandorabots.com/mitsuku/
2https://www.cleverbot.com/
3The interested reader can refer to [100] for more details
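A minimal sketch of how the pipeline above could be wired for the flight-booking example is shown below: the NLU output (intent plus entities) updates the dialog state, and a trivial policy asks for the next missing slot. The intent name and slot names are assumptions made purely for illustration.

```python
# Toy dialog-state update and policy for the flight-booking example above.
REQUIRED_SLOTS = {"flight_booking": ["from", "to", "departure_date"]}

def update_state(state: dict, nlu_result: dict) -> dict:
    state["intent"] = nlu_result.get("intent", state.get("intent"))
    state.setdefault("slots", {}).update(nlu_result.get("entities", {}))
    return state

def next_action(state: dict) -> str:
    missing = [s for s in REQUIRED_SLOTS.get(state.get("intent"), [])
               if s not in state.get("slots", {})]
    return f"ask_for:{missing[0]}" if missing else "call_booking_api"

state = update_state({}, {"intent": "flight_booking",
                          "entities": {"from": "Sydney", "to": "Houston"}})
# next_action(state) -> "ask_for:departure_date", which NLG could verbalize as
# "When do you want to depart?"
```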


2.1 Intent Recognition Methods

At the heart of building task-oriented bots lies identifying the user’s intent as well as extracting related entities from a given utterance. To interact with users, it is essential for bots to detect and understand user intentions. This task is called intent recognition [106, 163, 81]. An intent refers to a user’s purpose, which a bot should respond to [163, 31]. For instance, in the utterance “is there any hotel around”, the user’s intent is to find a hotel in his/her neighborhood. Bot developers assign a short text such as “bookHotel” to each intent supported by the bot. User utterances may also contain important information about how the task must be performed. Bots, therefore, need to understand the provided information in order to be able to serve the user’s request [68, 31]. Such information is known as the entities or slots of the intent. These entities have data types (e.g., date, time, location, product categories). For example, in the utterance “find a hotel in Sydney”, the term Sydney is an entity of type location, indicating the hotel’s location. Figure 2.2 demonstrates an example illustrating the relationship between utterances, intents, and entities.


Figure 2.2: Utterance, Intent, and Entity Relationship

Generally, techniques for mapping natural language to intents can be divided into five major classes: rule-based methods, machine learning methods, crowdsourcing-based methods, retrieval-based methods, and end-to-end approaches. In the rest of this section, we present how these models work and argue why collecting data is necessary for each method.


2.1.1 Rule-based Methods

The history of chatbots goes back to ELIZA [207], a rule-based dialog system designed to simulate a psychologist, and rule-based approaches have still maintained their popularity [25, 190, 48, 97, 180]. In rule-based intent recognition, bot developers are required to define intents and a set of intent recognition rules. Such rules are in the form of pattern/response pairs as shown in Figure 2.3. As such, if a pattern matches, the corresponding response is returned back to the user [202, 228]. In other words, the bot detects the user’s intent by matching all rules with the utterance. For instance, given an utterance like “My name is Ali”, since the first three words “My name is” match the second pattern in Figure 2.3, the bot concludes that the user intent is Greeting and the user’s name is Ali. Next, the bot responds to Ali by saying “Nice to meet you, Ali.”.

+ pattern: hi bot
  response: Hi human!

+ pattern: my name is
  response: Nice to meet you, .

+ pattern: how are you
  response: I am good, how about you?

Figure 2.3: Defined pattern/response pairs for Greeting intent [228]
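The toy matcher below implements the spirit of the rules in Figure 2.3: each pattern is checked against the utterance and the first match produces the response. Real rule languages (AIML, RiveScript, ChatScript) support wildcards, variables, and topics; this sketch only handles a literal prefix with an optional captured remainder and is not tied to any of those languages.

```python
# Toy rule-based matcher inspired by the pattern/response pairs of Figure 2.3.
RULES = [
    ("hi bot", "Hi human!"),
    ("my name is", "Nice to meet you, {}."),
    ("how are you", "I am good, how about you?"),
]

def respond(utterance: str) -> str:
    text = utterance.lower().strip()
    for pattern, response in RULES:
        if text.startswith(pattern):
            remainder = utterance[len(pattern):].strip()  # e.g., the user's name
            return response.format(remainder) if "{}" in response else response
    return "Sorry, I did not understand that."

# respond("My name is Ali")  ->  "Nice to meet you, Ali."
```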

Several domain-specific languages have been proposed to create such rules [180]. Examples include the Artificial Intelligence Markup Language (AIML), an XML-based language which has been widely leveraged by many famous chatbots (e.g., ELIZA [208], PARRY [39] and ALICE [202]). Other examples of such languages include Rivescript4 and Chatscript5. Compared to AIML, these languages are much easier to understand, and provide additional features. In general, rule-based approaches suffer from two main issues [228]:

• Lack of flexibility: Since expert-written patterns are fixed, bot users are doomed to only use the controlled language defined for the interactions with the bot [31]. For instance, a simple rule for greeting such as (“hi bot”) can

4https://www.rivescript.com/
5https://github.com/ChatScript/ChatScript


only respond to the same utterance and will fail if the provided utterance differs (e.g., “hey bot” or simply “hi”). Thus, the chatbot is not likely to answer properly except when the defined set of utterances is used.

• Cost of maintenance: Hand-crafted rules written by experts require anticipating end-users’ language diversity, which is intensively time-consuming and costly [230, 228]. Moreover, as the number of rules grows, finding overlaps and conflicts between rules also makes this approach not scalable [166].

In spite of the above issues, rule-based intent recognition is useful in the following cases [228]: (i) lack of training data [223], and (ii) small-scale, limited bots (with a limited number of questions/answers) [72]. Rule-based models are mostly based on controlled natural languages to avoid the complexity and ambiguities imposed by natural language [97]. Given that rule-based methods are confined to a finite set of grammar rules, they are suitable for closed-domain dialog systems [117]. While rule-based methods are usually based on hand-crafted rules, they can also benefit from training data in two aspects: (1) annotated user utterances can be used for automatic grammar generation (called automatic rule induction) [6, 1, 135]; (2) training data is also beneficial in parse tree disambiguation for semantic parsers [204, 190, 25].

2.1.2 Machine Learning Methods

Most bots and bot development platforms rely on machine-learning-based classification methods to classify a given utterance into a predefined set of intents [186]. Such approaches are built with the flexibility limitation of rule-based techniques in mind. In such approaches, a classification model is trained on training datasets containing a large number of utterances, each labeled with both intents and their corresponding entities [122, 83]. Based on a diverse set of utterances for each intent, the model learns to differentiate between the utterances of different intents. For example, SVM (Support Vector Machine) has been shown to perform well on intent recognition tasks [129, 84]. Language processing platforms such as RasaNLU6 also use SVM as their default algorithm to identify intents from user utterances. Other examples of classification algorithms used in this task include CRF (Conditional Random Field) [94], and deep-learning-based classifiers [190, 186, 197].

6https://rasa.com/docs/rasa/nlu/components/#sklearnintentclassifier


There are also approaches to building unsupervised techniques for intent recognition. A common approach is to build a vector space model (VSM) on a collection of utterances, and use common metrics (e.g., cosine similarity) to measure the similarity between intents and utterances [158]. As such, text embedding techniques (e.g., word embeddings [131], Sentence-BERT [174], Universal Sentence Encoder [29], Concatenated Power Mean Embeddings [179], Sent2Vec [145], InferSent [41]) have been used to represent intents and utterances in a vector space [229, 228]. As an example, for a given intent with a set of utterances, BotBase first calculates a vector for each utterance by averaging the word embedding vectors of its constituent words; it then calculates the average vector of all the utterance vectors to obtain the vector representation of the intent [229]. At runtime, the user’s expression is converted to a vector and compared to the vectors of all intents to determine the intent closest to the user’s expression.
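The sketch below illustrates this vector-space idea: every intent is represented by the average of its utterance vectors, and a new utterance is assigned to the intent whose vector is most similar under cosine similarity. The embed() function is only a deterministic stand-in; in practice one would plug in word embeddings or a sentence encoder such as Sentence-BERT or the Universal Sentence Encoder.

```python
# Vector-space intent matching with a placeholder embedding function.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: a fixed pseudo-random vector per word, averaged.
    vecs = []
    for w in text.lower().split():
        seed = int.from_bytes(w.encode("utf-8"), "little") % (2**32)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def intent_vectors(training: dict) -> dict:
    """training maps intent name -> list of example utterances."""
    return {intent: np.mean([embed(u) for u in utts], axis=0)
            for intent, utts in training.items()}

def classify(utterance: str, vectors: dict) -> str:
    v = embed(utterance)
    cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(vectors, key=lambda intent: cosine(v, vectors[intent]))

# vectors = intent_vectors({"book_flight": ["book a flight to Sydney"],
#                           "find_hotel": ["find a hotel in Sydney"]})
# classify("is there any hotel around", vectors)  # -> likely "find_hotel"
```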

2.1.3 Crowdsourcing-based Methods

Crowdsourcing-based methods benefit from human-in-the-loop strategies to detect a user’s intent, determine the required parameters, and finally generate responses. Guardian [89], Chorus [114], Evorus [90], and CRQA [181] are examples of crowd-powered dialog systems.

Figure 2.4: Crowd workers extract the required parameters and intents in “Guardian”[89]

While crowd-based methods do not usually depend on intelligent systems, some crowd-powered approaches have recently added machine learning models to minimize the role of crowd workers and the costs of having a human in the loop [90].


In other words, these systems consider the best of both worlds by automatically creating responses when the machine learning algorithm is confident, and relying on crowd workers for the rest of the cases. As a result, training data are also beneficial in hybrid crowd-based bots.

2.1.4 Retrieval-based Methods

Retrieval-based methods borrow concepts from Information Retrieval (IR) systems: each intent is represented by a set of user utterances, and at runtime the user's utterance is compared to the existing utterances of each intent in the repository, and the most similar intent is selected [190, 117]. In comparison with rule-based approaches, instead of matching the user utterance against existing rules (patterns), the user utterance is compared to pre-collected utterances for each intent. The key to retrieval-based methods is a way to measure the similarity between a given utterance and the existing utterances in the repository. As a result, having sample utterances and their corresponding intents has at least two applications for retrieval-based methods: (1) the samples can be used to expand the repository, and (2) they can be used to train supervised matching techniques.
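A minimal sketch of such retrieval-based matching follows; word-overlap (Jaccard) similarity is a simple stand-in for the embedding- or IR-style similarity measures used in practice, and the repository contents are invented:

```python
# The user utterance is compared with every pre-collected utterance and the
# intent of the most similar one is returned.
repository = {
    "flight.book": ["book a flight", "I need a plane ticket to Houston"],
    "hotel.book": ["book a hotel room", "find me accommodation in Sydney"],
}

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def match_intent(utterance: str) -> str:
    scored = [
        (jaccard(utterance, example), intent)
        for intent, examples in repository.items()
        for example in examples
    ]
    return max(scored)[1]

print(match_intent("please book me a flight to Sydney"))  # -> flight.book
```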

2.1.5 End-to-End Approaches

As opposed to the above-mentioned approaches, a new trend in building bots is end-to-end approaches in which the user request is mapped directly to the correct answer (or to executable forms such as API methods or SQL queries) [2, 25, 190]. In other words, such approaches directly perform tasks rather than learning the mapping between utterances and intents and then performing tasks based on the detected intent and entities. Most such state-of-the-art approaches rely on deep learning techniques (especially sequence-to-sequence neural networks [193]) to translate a natural language request (e.g., “book a hotel in Station street”) directly into an appropriate response (e.g., “your hotel has been booked”) [190, 117, 167]. To further underline the significance of training data acquisition, it is worth mentioning that deep learning models are data hungry, and there is a strong correlation between the quality of a deep learning model and the size of its training data [73, 225]. Moreover, considering the fuzzy and ambiguous nature of human language [206], capturing users' intents requires

a very large corpus covering the variations in language usage [203]. Semantic parsing is another example of an end-to-end approach. A semantic parser aims to map utterances (natural language) to logical forms (executable code). It uses a grammar to parse the utterance, create a parse tree, and consequently derive the corresponding logical form [25, 190].

Part II - Training Utterance Acquisition

Building dialog systems requires training models or algorithms to address several natural language processing tasks such as dialogue state tracking [210, 224, 69], dialog act detection7 [176, 4, 63], and intent detection [229, 76]. Supervised techniques for any of these tasks require training data. In this thesis, we focus on training utterance acquisition for the task of intent detection, and we refer the interested reader to a survey of data-driven dialog systems [184]. Depending on the intent detection method, training samples consist of user utterances paired with intents or executable forms such as logical forms, database queries, and API (Application Programming Interface) calls (e.g., @flight.book(to=“Sydney”, from=“Houston”)) [25, 219]. Given the richness of human language, having a large and linguistically diverse set of annotated utterances is essential for effective training of intent recognition models (especially data-hungry deep-learning-based models) [190]. Ultimately, noisy training data (e.g., containing incorrectly labeled utterances) will likely lead to incorrect intent detection or entity recognition, with potentially disastrous effects [219]. Collecting such datasets is costly since a large number of user utterances is required and each utterance must be annotated by experts. We thus seek to understand the characteristics of a “good” dataset and how to obtain one efficiently. Based on an analysis of a wide range of literature and a selection of bot development tools, we survey existing approaches for obtaining labeled utterances for training intent detection methods8. We also discuss the quality and cost issues of each approach, and we identify opportunities for future research.

7 To detect the general intent of an utterance, such as classifying whether an utterance is providing information or asking a question.
8 This survey does not cover approaches for obtaining dialog-level training samples.


2.2 Problem Dimensions in Training Utterance Acquisition

As discussed earlier in this chapter, supervised intent and entity recognition methods require training user utterances which are annotated with their corresponding intents and entities (e.g., utterance=“book a flight Sydney to Houston”, intent=“flight booking”, to=“Houston”, from=“Sydney”). The costs associated with building a bot and its accuracy in serving users' requests are paramount concerns in this study. Accurate detection of users' intentions requires both effective machine learning models and high quality training data which reflect real user utterances [31]. Thus it is important to identify the properties of a high quality set of user utterances. In addition, given the huge range of language variation from person to person, obtaining a comprehensive set of user utterances even for a single intent is costly [190]. By analyzing metrics used in the literature to compare datasets for bot training, we identify the key properties of high quality datasets, both at the utterance level and for the entire corpus. At the utterance level, we identify naturalness, semantic completeness, and language correctness, as explained in Section 2.2.1. At the corpus level, the set of expressions for a given intent must also be diverse, ideally covering language variations. In the rest of this section, we discuss quality and cost dimensions9.

2.2.1 Quality

Diversity. Given the richness of natural language, diversity of user utterances is key to building effective dialog systems [190, 25]. Several metrics have been proposed to measure this, namely Type-Token Ratio (TTR) (also known as lexical diversity) [171], Paraphrase In N-gram Changes (PINC) [30], and Diversity (abbreviated as DIV) [103]. TTR calculates the ratio of unique words to the total number of words in the utterances. It rewards the use of new words without considering differences in sequences of words in utterances.

TTR = \frac{|\text{unique words}|}{|\text{all words}|}

9We do not cover “biases” in training samples since they are studied elsewhere [85, 198, 10, 134, 32, 26, 9, 119].

23 Chapter 2 Utterance Acquisition Methods: Background and the State-of-the-art

To address this limitation, PINC measures the percentage of n-gram changes between the initial utterance (u) and a collected utterance (p).

PINC(u, p) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|ngrams(n, u) \cap ngrams(n, p)|}{|ngrams(n, p)|} \right)

where N is the maximum n-gram order (usually 4), and ngrams(·) extracts the n-grams of the given sentence. The average of the PINC scores over all collected utterances is used to measure the diversity of the corpus, without considering inter-paraphrase n-gram changes. The DIV metric calculates the mean percentage of n-gram changes not only between the initial utterance and each collected utterance but also between any pair of collected utterances (C)10.

DIV(C) = \frac{1}{|C|^2} \sum_{x_i \in C} \sum_{x_j \in C} D(x_i, x_j)

D(x_i, x_j) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|ngrams(n, x_i) \cap ngrams(n, x_j)|}{|ngrams(n, x_i) \cup ngrams(n, x_j)|} \right)

Coverage (CVG) also measures how well a list of utterances covers the space of all possible utterances for an intent [103]. It is calculated by averaging the similarity score (percentage of n-gram matches) between each utterance in the real user queries (R) and its most similar utterance in the collected utterances (T).

CVG(T, R) = \frac{1}{|R|} \sum_{u \in R} \max_{x \in T} \left( 1 - D(u, x) \right)

By the definition of the Coverage metric, the higher the Coverage, the more realistic and higher quality the collected set.
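The following sketch implements TTR and the pairwise n-gram distance D(·, ·) underlying DIV; it is a rough illustration that glosses over edge cases such as sentences shorter than the maximum n-gram order:

```python
from itertools import chain

def ngrams(n: int, sentence: str) -> set:
    toks = sentence.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ttr(utterances: list[str]) -> float:
    # Type-Token Ratio: unique words over all words.
    words = list(chain.from_iterable(u.lower().split() for u in utterances))
    return len(set(words)) / len(words)

def ngram_distance(x: str, y: str, max_n: int = 4) -> float:
    # D(x_i, x_j): mean n-gram dissimilarity, as in the DIV metric above.
    total = 0.0
    for n in range(1, max_n + 1):
        gx, gy = ngrams(n, x), ngrams(n, y)
        if gx | gy:            # n-gram orders longer than both sentences are skipped
            total += 1 - len(gx & gy) / len(gx | gy)
    return total / max_n

def div(corpus: list[str]) -> float:
    # Mean pairwise distance over all (ordered) pairs, including self-pairs.
    return sum(ngram_distance(x, y) for x in corpus for y in corpus) / len(corpus) ** 2

paraphrases = ["book a flight to Houston", "get me a plane ticket to Houston"]
print(round(ttr(paraphrases), 2), round(div(paraphrases), 2))
```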

Naturalness. While it is essential to collect diverse utterances, training data should still be natural [220, 135]. Naturalness is often defined informally and measured by human evaluation, since there is no formal measure of how natural an utterance is. Informally, an utterance is called natural if it is

10 For the sake of consistency, we assume there is only a single intent; in the case of multiple intents, the average of the diversity (e.g., DIV) scores over all intents is computed.

similar to the utterances generated by real users in terms of wording and sentence structure (e.g., “find a flight” vs “I'm in quest of a flight”). In this sense, naturalness can be defined informally as the likelihood of an utterance occurring in the real world.

Semantic Completeness. An utterance is semantically incorrect when it fails to correspond to the initial intent (e.g., if we paraphrase “book a flight”, then the sentence “book a cruise holiday” is semantically incorrect) [219]. In other words, semantically incorrect utterances do not precisely express the same intent. Such utterances add noise to the training data which might confuse the classification model, resulting in unwanted actions (e.g., booking a cruise instead of a flight) [219]. With large datasets, manual verification of semantically incorrect utterances is challenging. The state-of-the-art automatic approaches for measuring the semantic similarity between utterances often rely on sentence embedding techniques (e.g., ELMo [156], BERT [47], Universal Sentence Encoder (USE) [29]) to encode sentences into vectors and compute the cosine similarity between the vectors. However, even the state-of-the-art methods often suffer from a lack of accuracy in detecting lexically similar but semantically different sentences [219].

Language Correctness. Users may make grammatical and spelling errors. Therefore, a diverse and natural set of utterances should also contain a wide variety of such errors [11, 85]. However, this also depends on the implementation of a bot. For example, utterances can be corrected by a spell corrector before detecting intents and entities, making the bot robust to typos. In such cases, having typos in the training utterances does not seem necessary. Moreover, bot users make errors which might differ from the errors appearing in utterances collected by a particular acquisition method [11, 85]. For example, given the word “flight”, bot users may make misspelling errors such as “fligt”, but they are unlikely to make typos such as “flight3423” (randomly adding characters to a word) [219]. Given that bot users do occasionally make such spelling and grammatical errors, bot developers should decide whether or not to collect such erroneous utterances.

2.2.2 Cost

Building a set of annotated user utterances requires a large set of utterances for each intent with all possible compositions of entities in user utterances [203].


For example, for a “flight booking” intent with two entities, we ideally want to collect expressions for the following cases: “book a flight”, “book a flight from Sydney”, “book a flight to Houston”, and “book a flight from Sydney to Houston”. The number of possible cases increases exponentially with the number of entities, making it expensive to obtain utterances for intents with a large number of entities. With n being the number of intent entities, the number of possible scenarios equals \sum_{k=0}^{n} \binom{n}{k} = 2^n. Moreover, dialog systems need to know whether a given user utterance falls outside their range of supported intents, and this also requires collecting out-of-topic utterances [112].

2.3 Utterance Acquisition Methods

Building conversational cognitive services often requires corpora of natural language utterances and their corresponding executable forms (1) to detect a user's intent; (2) to identify entities and parameters of the intent; (3) to automatically generate grammars in rule-based methods [6, 1, 135]; (4) to disambiguate potential parse trees generated by semantic parsers [204]; and (5) to minimize costs in crowd-powered systems [90]. Based on an analysis of a wide range of literature, we identified three main approaches for acquiring user utterances, as illustrated in Figure 2.5 and summarized in Table 2.1, and we discuss their quality and cost issues: (1) utterance acquisition from bot usage; (2) automatically generating user utterances; and (3) crowdsourcing user utterances11.

2.3.1 Utterance Acquisition from Bot Usage

One of the prevalent approaches for collecting user utterances is to obtain them directly from bot users by launching a fully functional bot or a prototype, initially trained with a small number of annotated utterances [48].

Quality Dimension

In this approach, user utterances are collected from real bot users which makes them natural. In addition, a deployed chatbot collects new utterances over time

11 It is worth noting that, in practice, a combination of these approaches is used to generate training datasets.

Table 2.1: Summary of Utterance Acquisition Methods. The table compares the approaches (bot usage: prototype and deployed bot; automatic methods: canonical utterance generation and paraphrasing; crowdsourcing: all sub-methods) along the naturalness, language correctness, semantic completeness, diversity, and cost dimensions discussed in this section.


Figure 2.5: Utterance Acquisition Methods. The taxonomy covers bot usage (prototypes, real bots); automatic approaches, comprising canonical sentence generation (sentence templates, domain-specific languages, generative grammars) and paraphrasing (rule-based, statistical, and neural-network-based); and crowdsourcing (sentence-based, including Chinese whispers and entity-image replacement; goal-based; and scenario-based, including video-based).

and consequently improves the diversity of the training samples. However, collected user utterances must be verified and annotated with intents and their entities. In addition, utterances need to be checked for grammatical errors and typos.

Cost Dimension

Building a prototype by itself can impose extra costs. This is especially true if the prototype's intent detection method (e.g., handcrafted rules) is meant to be replaced later with a completely different model (e.g., a machine-learning-based model). Moreover, if the prototype itself requires annotated utterances, collecting annotated utterances is still needed. Finally, given the quality issues mentioned above, user utterances must be quality-assessed and annotated to be useful in the training phase [219, 185, 53].

2.3.2 Automatically Generating Utterances

Automated utterance generation can be divided into two main steps: (i) creating an initial set of utterances (called canonical utterances); and (ii) automatically paraphrasing the initial set into new variations.


2.3.2.1 Canonical Utterance Generation

A canonical utterance is a basic imperative sentence (e.g., “get a flight to Sydney”) expressing an intent; canonical utterances are later paraphrased to diversify the training utterances. To generate canonical utterances, a common approach is to write sentence templates [203]. A sentence template is a canonical utterance with placeholders (“get a flight to DESTINATION”). In this approach, experts must provide seed values for the placeholders to populate canonical utterances. Likewise, domain specific languages (DSLs) are used to generate utterances; Chatito12 and Tracery13 are examples of such languages [40]. Using generative grammars is another approach, employed by semantic parsers as illustrated in Figure 2.6 [190, 171]. In this approach, logical forms are automatically generated based on expert-written generative grammars. The grammar is then used to automatically produce canonical utterances for the randomly generated logical forms [190]. Automatically generated sentences are typically very hard to understand, so heuristics are applied to make them more understandable (e.g., parameter entities in the flight booking example should be city names, not rural areas) [25].

Figure 2.6 illustrates this pipeline: a domain-specific grammar is created, random executable forms are generated, pairs of executable forms and corresponding canonical sentences are created, and the expressions are then paraphrased (e.g., @api.search(query=restaurants) paired with “search for nearby restaurants”, “give me a list of restaurants”, “find nearby eateries”).

Figure 2.6: Canonical Utterance Generation
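As a small illustration of the template approach, the sketch below expands a single hypothetical template and seed values into annotated canonical utterances; the intent label and placeholders are invented for the example:

```python
from itertools import product

# Sentence-template expansion in the spirit of DSLs such as Chatito/Tracery:
# placeholders are filled with expert-provided seed values, so each generated
# canonical utterance comes with its annotation for free.
template = "get a flight from {FROM} to {DESTINATION}"
seeds = {"FROM": ["Sydney", "Melbourne"], "DESTINATION": ["Houston", "Paris"]}

def expand(template: str, seeds: dict) -> list[dict]:
    slots = list(seeds)
    samples = []
    for values in product(*(seeds[s] for s in slots)):
        filling = dict(zip(slots, values))
        samples.append({
            "utterance": template.format(**filling),
            "intent": "flight.book",  # hypothetical intent label
            "entities": filling,
        })
    return samples

for sample in expand(template, seeds):
    print(sample)
```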

Quality Dimension

When using template or DSL approaches, the correctness of the generated canonical utterances depends heavily on the templates written by bot developers. Generative grammars and neural translators may also produce unnatural sentences containing grammatical and semantic errors [25]. However, such automatic approaches can

12https://github.com/rodrigopivi/Chatito 13http://tracery.io/

still generate an initial set of annotated utterances, despite these not necessarily being diverse and natural [25].

Cost Dimension

As mentioned earlier, generative grammars and neural translators occasionally produce low quality canonical utterances. Therefore, the common practice is to evaluate the quality of automatically generated utterances by hiring people [171, 139], albeit a costly exercise.

2.3.2.2 Automatic Paraphrasing

Paraphrasing is the task of expressing the meaning of a fragment of text using different words. It has numerous applications in natural language processing systems, such as evaluation of machine translation systems, sentence simplification, automatic plagiarism detection, text summarization, and natural language generation [104, 109, 124]. It is also used in information retrieval systems to reformulate users' queries [190, 25]. Paraphrasing is necessary for diversifying canonical utterances.

Automatic paraphrasing relies on machine translation techniques, namely Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT) [88, 50, 125]. RBMT methods rely on hand-crafted rules and knowledge bases (e.g., WordNet) to generate paraphrases [137, 152, 67]. SMT methods rely on statistical analysis of bilingual text corpora to generate paraphrasing rules (e.g., “significant quantity” → “a lot”). NMT-based approaches are built on an encoder-decoder architecture to directly paraphrase a sentence into new variations. These approaches require monolingual paraphrase datasets containing pairs of sentences and their paraphrases. However, such datasets are rare, as opposed to bilingual datasets which contain sentences in one language and their translations in another. In the absence of monolingual paraphrase datasets, the typical approach is language pivoting [125, 50]: a given utterance is translated into another language and then translated back into the source language to obtain a paraphrase [125, 88, 124]. An in-depth study of these techniques is out of the scope of this survey; the interested reader can refer to [88, 124].

Another promising line of work is finding equivalent question patterns. Notable scenarios in virtual assistants and question answering systems involve inquiries (e.g., “Is there any restaurant nearby?”, “where can I have my lunch”). Mining question answering systems has been shown to generate paraphrases for question templates [57, 50, 52]. By obtaining equivalent question patterns (e.g., “who founded <NOUN>” = “who is the owner of <NOUN>”), it is feasible to create a very large corpus of such question patterns. There are also question corpora, namely SQuAD [165], MS MARCO [140], WebQuestions [12], and WikiQA [226], waiting to be exploited by dialog systems.
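A minimal sketch of the language-pivoting idea described above follows; translate stands in for any machine translation system, and the tiny dictionary only fakes one English-German-English round trip for illustration:

```python
# Paraphrasing by language pivoting: translate the utterance into a pivot
# language and back. `translate` is a placeholder for a real MT model or API.
FAKE_MT = {
    ("en", "de", "book a flight to Houston"): "buche einen Flug nach Houston",
    ("de", "en", "buche einen Flug nach Houston"): "reserve a flight to Houston",
}

def translate(text: str, src: str, tgt: str) -> str:
    return FAKE_MT.get((src, tgt, text), text)

def pivot_paraphrase(utterance: str, pivot: str = "de") -> str:
    forward = translate(utterance, "en", pivot)  # English -> pivot language
    return translate(forward, pivot, "en")       # pivot language -> English

print(pivot_paraphrase("book a flight to Houston"))  # -> "reserve a flight to Houston"
```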

Quality Dimension

Automatically generated paraphrases suffer from several issues, such as grammatical errors and unnaturalness [215, 222]. Moreover, even the state-of-the-art models fall short of producing sufficiently diverse paraphrases [150, 80]. They also fail to produce multiple semantically correct paraphrases for a single expression [75, 215]. Current attempts to diversify generated paraphrases involve using beam search [125] in the decoding phase of the encoder-decoder models, or adding random noise to the output of the encoder to generate a new paraphrase [150]. However, preserving semantic meaning often leads to repetitions of the original utterance [214].

Cost Dimension

Automated paraphrase generation is potentially cost-free14. However, it includes hidden costs such as annotation costs. Automatically generated paraphrases need to be annotated either manually or automatically. If the entity values are not changed during paraphrasing, it is possible to automatically annotate the paraphrase; in this case, the same value is labeled with the corresponding entity from the source utterance. However, if paraphrasing changes the values of the entities, manual annotation is required. For example, given the expression “get 5 cheapest flights”, where 5 represents the number of desired items, it might be difficult to automatically detect the same entity in a paraphrase like “get a handful of cheapest flights” without expert-written rules. As such, annotation imposes extra costs.

14Not considering the need for building specific paraphrasing systems
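The sketch below illustrates the automatic annotation-transfer case described above: if the entity value still appears verbatim in the paraphrase, the label is copied over; otherwise the paraphrase is flagged for manual annotation. The naive substring check is only for illustration:

```python
# Transfer entity labels from a source utterance to a paraphrase when the
# entity values are preserved verbatim; return None when manual annotation
# (or expert-written rules) would be required.
def transfer_annotations(paraphrase: str, entities: dict):
    labels = {}
    for entity, value in entities.items():
        if value.lower() not in paraphrase.lower():
            return None  # value was rephrased; requires manual annotation
        labels[entity] = value
    return labels

source_entities = {"count": "5"}
print(transfer_annotations("get the 5 cheapest flights", source_entities))        # {'count': '5'}
print(transfer_annotations("get a handful of cheapest flights", source_entities)) # None
```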


2.3.3 Crowdsourcing User Utterances

Crowdsourcing is the practice of using the power of the crowd (a number of paid or unpaid people) to obtain information for a particular task. Crowdsourcing has been employed to obtain natural language corpora for dialog systems [11, 25, 190]. To the best of our knowledge, the first attempt to obtain paraphrases via crowdsourcing goes back to 2005, when Chklovski used gamification to collect paraphrases by asking contributors to guess paraphrases based on given hints (e.g., a few words and phrases inside the paraphrase) [34]. Later, three primary crowdsourcing strategies were suggested [203].

Sentence-based strategy. In this approach, crowd workers are asked to paraphrase a sentence into new variations [190, 25, 203, 112] (e.g., “book a flight from Sydney to Houston”). Other approaches extend the sentence-based strategy to improve the diversity of crowdsourced paraphrases; examples include Chinese whispers and entity-image replacement. Chinese whispers is a multi-round approach inspired by the children's game of the same name15. In the first round, workers paraphrase canonical utterances, but in subsequent rounds they paraphrase the paraphrases obtained in the previous rounds [139, 113]. In entity-image replacement, entities are replaced with corresponding images in the canonical utterances (e.g., replacing the word flight with an airplane image) [171]. As such, workers must paraphrase a mixture of text and images.

Goal-based strategy. In this approach, workers are given a goal (e.g., “book a flight”) and a set of possible values for its entities (From: “Sydney”, To: “Houston”) and asked to make proper sentences.

Scenario-based strategy. This approach employs storytelling. The story puts the worker in a situation in which they must perform a task by asking someone (“Your goal is to book a flight; you are in Sydney; your destination is Houston”). The worker is then asked to express themselves in the given scenario [203, 25, 112]. The main difference between this approach and the goal-based strategy is that scenario-based tasks are more explanatory, which can minimize potential ambiguities but increases the task completion time [203]. It has been reported that the crowd can perform tasks faster using the goal-based method in comparison with

15 This game is also called “telephone” in North America.


Table 2.2: Crowdsourced paraphrasing strategies

Sentence-based Strategy. Paraphrase the following sentence: “Book a flight from Sydney to Houston”

Goal-based Strategy. How would you state the following intent? Goal: “flight booking”; From: “Sydney”; To: “Houston”

Scenario-based Strategy. How would you state the following intent? “Your goal is to book a flight; you are in Sydney; your destination is Houston.”

sentence-based and scenario-based methods. Another variation of this approach (the so-called video-based strategy) is to present a video demonstrating an event (e.g., someone booking a flight ticket) [30]. Crowd workers are then asked to express the event in natural language.

Quality Dimension

The primary reason for crowdsourcing paraphrases is to acquire as many diverse utterances as possible. However, it has been shown that crowd workers are biased towards using the words in the given sentences, harming the diversity of collected paraphrases [220, 139, 95]. This phenomenon is called the priming effect: an automatic, non-conscious activation of information in memory, which is specifically related to perceptual identification of words and objects [196]. According to the priming effect, humans are more likely to reuse vocabulary they have just encountered in text or speech. Thus the priming effect may decrease the chance of obtaining diverse utterances. It has been reported that goal-based paraphrasing is less biased than the two other strategies [203]. Moreover, allowing a worker to submit multiple paraphrases for the same utterance also improves diversity [203]. However, obtaining diverse paraphrases that cover a wider range of possible user utterances requires more sophisticated crowdsourcing approaches. To overcome the priming effect, new approaches have been proposed, as mentioned earlier in this section. However, these approaches have their own limitations.


For example, entity-image replacement cannot be applied to non-entity words (e.g., abstract nouns). In the video-based approach, preparing appropriate videos is very time-consuming and costly. Using such approaches also results in a large number of semantically incorrect paraphrases [30]. When using Chinese whispers, an incorrect paraphrase may cascade into further incorrect results in subsequent iterations; this approach is thus prone to producing semantically incorrect paraphrases [95, 113].

It has been reported that up to 40% of crowdsourced paraphrases are not usable [219]. Crowdsourced paraphrases often include paraphrases which are not semantically equivalent to the given utterance (“book a flight from Sydney to Houston”). Examples include missing entity values (e.g., “book a flight to Houston”) and wrong entity values (e.g., “book a flight from NYC to Houston”) [203, 171]. Collecting paraphrases from a crowd is also not immune to unqualified workers, cheaters, or spammers [219, 225]. Some workers may generate incorrect paraphrases, especially if the given task places few constraints on the text they provide [219]. Without a gold set of tests, it is challenging to perform quality control on crowdsourced utterances, because there are countless possible paraphrases for a given utterance. Crowd workers might use improper synonyms to create a paraphrase [139], or, more dangerously, malicious workers might try to give inappropriate and misleading information [91]. Nevertheless, machine learning models trained on crowdsourced paraphrases have been shown to perform as accurately as those trained on utterances provided by real users, even though crowdsourcing might produce slightly unnatural user utterances (e.g., “I'm in quest of flights”) [44].

To address these quality issues, the quality of collected paraphrases must be assessed, imposing the extra cost of a validation task [139]. However, automated approaches such as finding outliers (potentially incorrect paraphrases) can reduce the cost of validation [113]. In outlier detection, each collected paraphrase is mapped to a vector (using a sentence embedding model), and the average vector is calculated. Paraphrases are then ranked by their distance to the average vector in ascending order, and the top-k% (for a defined threshold k) are considered valid without crowdsourced validation.
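The outlier-ranking step can be sketched as follows, with a random embed function standing in for a real sentence-embedding model and an assumed keep ratio:

```python
import numpy as np

# Embed each paraphrase, rank by distance to the mean vector, and accept the
# closest top-k%; the rest would be sent for crowdsourced validation.
def embed(sentence: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a sentence-embedding model (e.g., USE or BERT); random
    # vectors here, so the example output is not meaningful.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(dim)

def filter_outliers(paraphrases: list[str], keep_ratio: float = 0.8) -> list[str]:
    vectors = np.stack([embed(p) for p in paraphrases])
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    keep = int(len(paraphrases) * keep_ratio)
    ranked = [p for _, p in sorted(zip(distances, paraphrases))]
    return ranked[:keep]  # remaining paraphrases go to manual validation

paraphrases = ["book a flight", "reserve a flight", "purchase plane tickets", "asdf qwer"]
print(filter_outliers(paraphrases, keep_ratio=0.75))
```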

Cost Dimension

Even though crowdsourcing has made obtaining training data feasible, the quality is often unsatisfactory, requiring further cleaning. Paraphrased user utterances may contain unqualified paraphrases, which makes verification a necessity [219]. Quality control of crowdsourced paraphrases may require launching further crowdsourcing tasks to verify, correct, or remove low quality utterances [216]. Moreover, crowdsourced paraphrases must be annotated with entity labels. Since entity values in the initial utterance may be changed during paraphrasing, it is necessary to label the utterance with entities either manually or automatically. It is possible to eliminate the need for annotation by requiring crowd workers to preserve the entity values as they appear in the initial sentences [135, 11]. However, this might affect the diversity of crowdsourced paraphrases. For instance, given an initial utterance “book a room for one week” with the entity period=“one week”, a paraphrase like “book a room for 7 days” might be missed if the workers are obligated to use “one week” in the paraphrase.

It can be concluded, therefore, that the three main sources of cost in crowdsourcing user utterances are the crowdsourced paraphrasing itself, verification of the collected paraphrases, and annotating them. Crowdsourcing approaches must therefore seek methods to minimize the cost of each of these tasks.

An essential question in crowdsourcing paraphrases for a particular intent is determining when a sufficient number of utterances has been collected and no further crowdsourcing is needed. This depends on both the intent and the diversity of the crowdsourced paraphrases. Different intents require different levels of language variation [120]. For example, we need to obtain far more user utterances for a complex task like “booking a flight” than for an intent like “greetings”. A simple but effective approach is to stop crowdsourcing when the intent classifier reaches a predefined confidence threshold [11].

Another approach to minimizing the cost of crowdsourcing is to collect only high-value utterances (utterances which are more likely to improve the bot's performance) within a given budget (e.g., using a hierarchical probabilistic model [190]). The adaptive termination strategy is a similar approach based on the Coverage metric [120]. In this strategy, data collection is done in multiple rounds, and in each round the coverage between the previously collected paraphrases and the current round is calculated. This process continues until the coverage reaches a defined threshold (e.g., 0.69) [120].
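A rough sketch of this adaptive termination loop is given below; the simplified unigram-overlap distance stands in for the D(·, ·) function of Section 2.2.1, and the threshold is the example value mentioned above:

```python
# Keep crowdsourcing in rounds until the new batch is largely covered by the
# paraphrases already collected; then stop paying for further rounds.
def unigram_distance(x: str, y: str) -> float:
    # Simplified stand-in for the n-gram distance D(., .) of Section 2.2.1.
    gx, gy = set(x.lower().split()), set(y.lower().split())
    return 1 - len(gx & gy) / len(gx | gy)

def coverage(new_batch: list[str], collected: list[str]) -> float:
    if not collected:
        return 0.0
    return sum(
        max(1 - unigram_distance(u, x) for x in collected) for u in new_batch
    ) / len(new_batch)

def crowdsource_until_covered(rounds, threshold: float = 0.69) -> list[str]:
    collected: list[str] = []
    for batch in rounds:  # each round yields freshly crowdsourced paraphrases
        if coverage(batch, collected) >= threshold:
            break         # little new variation is being added
        collected.extend(batch)
    return collected

rounds = [["book a flight", "reserve a flight"],
          ["get me a flight", "book me a flight"],
          ["book a flight please", "reserve a flight now"]]
print(len(crowdsource_until_covered(rounds)))  # stops before the third round
```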


2.4 Summary and Discussion

Designing and engineering scalable and robust training data acquisition techniques for conversational bots remains a deeply challenging problem [218]. There are several aspects to this challenge, some of which have been investigated in this thesis and some of which are being investigated in complementary research studies, including privacy and ethics in dialog systems [85, 28]. In this section, we identify critical issues and directions spanning the scale and quality of intent training data acquisition.

2.4.1 Continuous Crowd-Machine Collaboration

An intent recognition task involves learning sub-tasks such as intent classification, entity recognition, and dialog act classification. Existing techniques usually consist of pipelines of crowdsourced, expert-enabled, or even (albeit simple) automated training data acquisition components, and automated techniques to develop models for these tasks. Cost-effective and scalable training of models that enable meaningful and robust conversations is key to the ubiquity and affordability of conversational bots, especially for small organizations. We envision that the next generation of training data acquisition processes should empower humans in efficient and effective data labeling, while augmenting machine-based data generation.

A key research area is that of methodologies and processes for selecting, pipelining, and tuning data acquisition tasks. As an example, the crowdsourcing approach requires a careful combination of worker selection, appropriate task design, test tasks, and termination strategies to ensure quality yet cost-effective data (such as for a customer support bot deployed for a new company or in a new domain). As another example, consider how we can estimate the error rate of a bot trained using a given data collection approach, and its consequent effects on customer (dis)satisfaction. Will automated approaches be able to mimic customer utterances? How broad and expensive does crowdsourcing need to be to provide satisfactory results, and how can we assess this? Answering these questions today is important not just from a market and cost perspective, but also to ensure the quality of the deployed bot from day one (and to avoid the kind of high-visibility errors that quickly go viral), as it is important for many businesses. Furthermore, we need methods to know when to stop crowdsourcing because we

have achieved “enough” diverse examples.

2.4.2 Integrating Quality Control and Crowdsourcing.

There is growing use of bots in critical processes, especially those which require interactions between citizens, businesses, and governments. There are significant quality control gaps and risks in bot training data acquisition. Building bots is notoriously complex, with many unsolved theoretical and technical challenges stemming from the scale of the systems, the rapid evolution of technologies, and growing concerns about the unintended, possibly costly and harmful consequences of the digital age. Indeed, empirical studies confirm that existing crowdsourcing technologies are vulnerable to both intended and unintended errors (e.g., prejudice, human bias, unfairness, inappropriate behavior) [85, 220, 219]. It is thus important to endow crowdsourcing methodologies with robust quality assessment and assurance mechanisms, which are key to their success and adoption.

Orchestrating human-machine conversations requires rich abstractions to reason about social bias, norms, and constraints (e.g., fairness, politeness, accountability). We believe that this in turn constitutes a critical and challenging issue whose investigation will require meaningful integration of evidence and methods from various disciplines, including crowdsourcing, natural language processing, machine learning, software testing and verification, as well as social and cognitive science [219, 220]. Interesting research directions include: (i) empirical studies to identify quality concerns, (ii) model and data quality assessment methods, and (iii) methods to improve the quality of crowdsourced paraphrases. For instance, the lack of diversity in user utterances obtained from the crowd may impact intent recognition models [220]. In this context, there is a need for techniques to measure and test the diversity of training data, and for paraphrasing methods that balance diversity and semantic relevance while considering other quality aspects (e.g., human bias, politeness).

Automatic quality control can not only reduce the need for the verification process, but also screen out unqualified or malicious workers. In this regard, online approaches are required to assess the quality of provided utterances at the time of paraphrasing. Early detection of unqualified workers can save resources, especially because crowdsourcing platforms lack mechanisms for not paying unqualified workers once a task is done.


2.4.3 Sharing Utterances

Knowledge graphs are key assets in many systems and are built to store information such as entities and relationships [172, 37]. The growth in the number of available intents is likely to increase the opportunity to reuse training data across domains and intents. By leveraging the combination of transfer learning and the dynamic synthesis of utterances over existing intent knowledge, the cost of bot training could be significantly reduced through reuse of training data based on the similarity and composability of intents.

There have recently been a few attempts to create shared resources. Examples include ThingPedia, a repository of applications for Internet of Things (IoT) devices built for the virtual assistant Almond [25]. It is a very small repository, but because it is built for virtual assistants its information is very useful for dialog systems. For a given API, ThingPedia contains a list of its methods, called functions; a function consists of a list of user expressions along with associated entities [25]. As another example, API-KG is a knowledge graph that offers: (i) searching for APIs that can support a desired intent/task, and (ii) a set of utterances and parameters that users may use to invoke API methods [229, 228]. Moreover, API-KG offers middleware for building bots on bot development platforms, enabling even non-expert developers to rapidly create bots for a given intent [228]. Building such knowledge graphs is beneficial for bot training because training data can be shared between bots, reducing the cost.

Prior works have prepared datasets for building both task-oriented and non-task-oriented dialog systems [184]. Available datasets are built for different tasks such as intent detection, entity resolution, speech act detection, and dialog state tracking. There are also multiple public datasets which can be used for training bots in various domains. Examples include the Second Dialog State Tracking Challenge (DSTC2) [82], the second Wizard-of-Oz dataset (WOZ2.0) [209], FRAMES [5], Machines Talking To Machines (M2M) [204], the Multi-Domain Wizard-of-Oz dataset (MultiWOZ) [55], and Schema-Guided Dialogue (SGD) [168]. Among these, SGD is the largest, containing over 16,000 dialogues spanning 16 domains. We refer the reader to [184], a survey of datasets which are publicly available for training dialogue systems.

In this context, there is a need for building more comprehensive knowledge graphs which connect intents to APIs, as well as tools to reduce the manual effort (e.g., suggesting appropriate APIs for an intent by considering APIs' Quality-of-


Service and cost) and repetitive tasks (e.g., training utterance acquisition) in training bots. For instance, many APIs share similar intents, and training utterances can be shared if similar API methods are known to bot developers. Both the Dropbox16 and Google Drive17 APIs, as an example, have methods for uploading files, and utterances can be shared between these API methods. A knowledge graph should therefore also capture the relations between APIs. Sharing user utterances in this way can cut costs [229].

2.4.4 Gamification

Gamification has been used in many fields to both engage volunteers and minimize expenses [8]. Gamification is the application of game elements (e.g., scoring, leader-boards) to encourage engagement with non-game activities (e.g., translation, paraphrasing). Interestingly, the first attempt to crowdsource user utterances was also based on a game [34]. Gamification can be considered a type of crowdsourcing in which users are motivated by the elements of the game rather than money. There are also recent attempts to gamify similar tasks such as translation (e.g., Flitto18).

In this context, gamification needs to be explored further for collecting user utterances from volunteers. The key research question is how to design engaging games that motivate players to implicitly generate training utterances without actually knowing that they are, for instance, paraphrasing. This requires careful design of genuinely interesting games, beyond merely employing game elements (e.g., leader-boards). Another consideration is to design games for various kinds of people, such as different age groups and specialties (e.g., API developers, bot developers, flight agents) with different motivations. This can help the acquisition of diverse utterances, since different categories of people (e.g., age groups, natives vs non-natives, educated vs non-educated) use different vocabularies and sentence structures [65, 22].

2.4.5 Data Programming

Data programming is a paradigm for automatically creating labeled training datasets [169]. For instance, Snorkel [170] allows the generation of large datasets

16https://www.dropbox.com/developers/documentation/ 17https://developers.google.com/drive/api/v3/manage-uploads 18https://www.flitto.com

from a large collection of documents by writing labeling functions. A labeling function is a programmatic function specified by domain experts based on heuristics, crowdsourced training sets, or available knowledge bases [170]. Using a generative model, Snorkel learns the accuracy of the labeling functions based on their overlapping and conflicting decisions. The domain experts then define a discriminative model to learn the label for each instance.

Even though data programming has not yet been used in this field, considering its success in other domains it is worth exploring its potential here. Applying data programming in this context requires: (i) finding resources which may contain potential utterances (e.g., web forums discussing flight booking), and (ii) developing labeling functions to label utterances with intents (e.g., “book-flight”). In particular, Internet users write about many different topics in social networks, question answering communities, and search engines [223]. Such resources can be used to obtain a collection of utterances in a single domain. Next, domain-dependent labeling functions can be created by experts. For example, intent-related phrases such as “want to buy” can be automatically extracted from questions posted on community sites [223]. Similarly, several studies have attempted to make use of customer reviews in the restaurant domain [144, 102, 23, 187]. Likewise, existing intent-detection models (both supervised and unsupervised) can be explored as potential labeling functions.
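As a hedged sketch of what such labeling functions might look like (the heuristics and intent names are invented, and a simple majority vote stands in for Snorkel's generative label model):

```python
# Data programming sketch: labeling functions vote on the intent of scraped
# sentences; conflicts and abstentions would normally be reconciled by a
# learned label model rather than the crude majority vote used here.
ABSTAIN = None

def lf_buy_keywords(text: str):
    return "purchase.intent" if "want to buy" in text.lower() else ABSTAIN

def lf_flight_keywords(text: str):
    return "flight.book" if "book a flight" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_buy_keywords, lf_flight_keywords]

def weak_label(text: str):
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("I want to buy a cheap ticket"))  # -> purchase.intent
print(weak_label("nice weather today"))            # -> None (abstain)
```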

2.5 Conclusion

In this chapter, we reviewed the background on dialog systems and intent detection methods. We also surveyed training utterance acquisition methods along two important dimensions, namely cost and quality. We further discussed the issues of existing approaches and provided a future outlook for efficiently collecting training utterances. As discussed in depth in this chapter, addressing these issues is difficult and challenging. In this thesis, we tackle some of them: we study quality issues in Chapter 3, propose a novel technique to promote the diversity of crowdsourced paraphrases in Chapter 4, propose a novel approach to automatically detect malicious crowd workers in Chapter 5, and propose a novel technique to automatically generate canonical utterances for APIs.

Chapter 3

An Empirical Study of Crowdsourced Paraphrases

A Study of Incorrect Paraphrases in Crowdsourced User Utterances

In this chapter, we investigate common crowdsourced paraphrasing issues and propose an annotated dataset, called Para-Quality1, for detecting such quality issues. We also investigate existing language tools (e.g., spell checkers) to provide baselines for detecting each category of issues. In a nutshell, this work presents a data-driven view of incorrect paraphrases in the bot development process and paves the way towards automatic detection of unqualified paraphrases.

The chapter is organized as follows. Section 3.1 provides a brief introduction. Section 3.2 discusses background and work related to the topics addressed in this chapter (e.g., quality control, semantic similarity between utterances). Section 3.3 presents our methodology for collecting crowdsourced paraphrases for various domains. Section 3.4 presents a taxonomy of common issues in crowdsourced paraphrases and discusses their characteristics. Section 3.5 describes our methodology for annotating the collected dataset with the common quality issues identified in Section 3.4. In Section 3.6, we examine existing tools and techniques to detect the identified categories of common issues in crowdsourced paraphrases. Finally, we provide concluding remarks in Section 3.7.

1https://github.com/unsw-cse-soc/ParaQuality


3.1 Introduction

As discussed in Chapter 2, paraphrasing canonical utterances is vital to cover the variety of ways an intent can be expressed [225]. As summarized in [127], a quality paraphrase has three components: semantic completeness, lexical difference, and syntactic difference. To obtain lexically and syntactically diverse paraphrases, crowdsourced paraphrasing has gained popularity in recent years. However, crowdsourced paraphrases need to be checked for quality, given that they are produced by unknown workers with varied skills and motivations [25, 46]. For example, spammers, malicious, and even inexperienced crowd workers may provide misleading, erroneous, and semantically invalid paraphrases [118, 25]. Quality issues may also stem from misunderstanding the intent or not covering important information such as the values of the intent parameters [190]. The common practice for quality assessment of crowdsourced paraphrases is to design another crowdsourcing task in which workers validate the output of others (see Chapter 2). However, this approach is costly, as it requires paying for the task twice, making domain-independent automated techniques a very appealing alternative [217]. Moreover, quality control is especially desirable if done before workers submit their paraphrases, since low quality workers can be screened early on without any payment. This can also allow crowdsourcing tasks to provide feedback to workers in order to assist them in generating high quality paraphrases [141, 142]. To achieve this, it is necessary to automatically recognize quality issues in crowdsourced paraphrases during the bot development process [217].

In this chapter, we investigate common paraphrasing errors when using crowdsourcing. We propose an annotated dataset called Para-Quality in which each paraphrase is labeled with error categories. Accordingly, this work presents a quantitative, data-driven study of incorrect paraphrases in the bot development process and paves the way towards enhanced automated detection of unqualified paraphrased utterances. More specifically, our contributions are two-fold:

• We obtained a sample set of 6000 paraphrases using crowdsourcing. To aim for broad diversity of samples, the initial expressions were sourced from 40 expressions of highly popular APIs from various domains. Next, we examined and analyzed these samples in order to identify a taxonomy of common paraphrase errors (e.g., cheating, misspelling, linguistic


errors). Accordingly, we constructed an annotated dataset called Para-Quality (using both crowdsourcing and manual verification), in which the paraphrases were labeled with a range of categorized errors.

• We investigated existing language tools (e.g., spell and grammar checkers, language identifiers) to detect potential errors. We formulated baselines for each category of errors to determine whether these tools are capable of automatically detecting such issues. Our experiments indicate that existing tools often have low precision and recall, and hence our results advocate the need for new approaches to the effective detection of paraphrasing issues.

3.2 Related Work

To the best of our knowledge, our work is the first to categorize paraphrasing issues and propose an annotated dataset for assessing the quality issues of crowdsourced utterances. The novel aspect of the proposed dataset is that the utterances in the dataset are labeled with the quality issues that we identify later in this chapter. Thus, the main aim of this dataset is to support the design of techniques and the training of models to automatically detect the quality issues of a given utterance. We refer the interested reader to Chapter 2 for background on dialog systems, intent detection methods, and available datasets for building dialog systems. Nevertheless, our work is related to the areas of (i) quality control in crowdsourced natural language datasets; and (ii) semantic similarity.

3.2.1 Quality Control

Quality can be assessed before or after data acquisition. While post-hoc methods evaluate quality once all paraphrases are collected, pre-hoc methods can prevent the submission of low quality paraphrases during crowdsourcing. The most prevalent post-hoc approach is launching a verification task to evaluate crowdsourced paraphrases [139, 195]. However, automatically removing misspelled paraphrases [203] and discarding submissions from workers with abnormally low or high task completion times [123] have also been applied in the literature. Machine learning models have also been explored in plagiarism detection systems to assure the quality of crowdsourced paraphrases [44, 24].


Pre-hoc methods, on the other hand, rely on online approaches to assess the quality of the data provided during crowdsourcing [141]. Sophisticated techniques are required to avoid the generation of erroneous paraphrases (e.g., automatic feedback generation has been used to assist crowd workers in generating high quality paraphrases). Precog [141] is an example of such a tool; it is based on a supervised method for generating automatic writing feedback for multi-paragraph text, designed mostly for crowdsourced product reviews [141, 142]. This chapter aims to pave the way for building automatic pre-hoc approaches and for providing appropriate online feedback to workers to assist them in generating appropriate paraphrases. However, the provided dataset can also be used for building post-hoc methods to automatically remove faulty paraphrases.

3.2.2 Semantic Similarity

Measuring the similarity between units of text plays an important role in Natural Language Processing (NLP). Several NLP tasks have been designed to cover various aspects and usages of textual similarity; examples include textual entailment, semantic textual similarity [227, 58], paraphrase detection [3, 93], and duplicate question detection [126], all of which are well studied in NLP. Moreover, recent successes in sentence encoders (e.g., Sent2Vec [145], InferSent [41], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179]) can be exploited to detect paraphrasing issues with greater accuracy. These techniques can be borrowed, with some domain-specific considerations, to build automatic quality control systems for detecting low quality paraphrases.
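For instance, a minimal similarity check between an initial utterance and a candidate paraphrase could look like the following, assuming the sentence-transformers package is installed; the model name is just one common choice, not the setup used in this thesis:

```python
# Embed both sentences with a pretrained sentence encoder and compare them
# with cosine similarity; a low score hints at a semantically divergent
# paraphrase.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(a: str, b: str) -> float:
    emb = model.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(similarity("book a flight from Sydney to Houston",
                 "reserve a plane ticket from Sydney to Houston"))
print(similarity("book a flight from Sydney to Houston",
                 "book a cruise holiday"))
```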

3.3 Paraphrase Dataset Collection

In the previous chapter, we discussed various techniques for the acquisition of training utterances for intent detection models, including crowdsourced utterances. In this section, we describe our methodology for collecting a corpus of crowdsourced utterances and identifying their quality issues, to pave the way for automatic quality control of crowdsourced utterances. Various types of paraphrasing issues have been reported in the literature, namely spelling errors [19], grammatical errors [95, 139], and missing slot values (which happen when a worker forgets to include an entity in a paraphrase) [190]. We

collected paraphrases for two main reasons: (i) to gain hands-on experience of how incorrect paraphrases are generated, and (ii) to annotate the dataset for building and evaluating paraphrasing quality control systems.

3.3.1 Methodology

We obtained 40 utterances for various intents associated with various API methods (i.e., Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Bing Image Search) indexed in ThingPedia [25] and API-KG [228]. We then launched a paraphrasing task on Figure-Eight2. Workers were asked to provide three paraphrases for a given expression [95], which is common practice in crowdsourced paraphrasing to reduce repetitive results [25, 95]. In the provided expression, parameter values were highlighted and crowd workers were asked to preserve them. Each worker's paraphrases for an initial utterance were normalized by lowercasing and removing punctuation. Next, the initial utterance and the paraphrases were compared to forbid submitting empty strings or repeated paraphrases, and checked to ensure that they contained the highlighted parameter values (which is also a common practice to avoid missing parameter values) [135]. We collected paraphrases from workers in English-speaking countries, and created a dataset containing 6000 paraphrases (2000 triple-paraphrases) in total3.
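The pre-submission checks can be sketched roughly as follows; this is an illustration of the logic rather than the actual task implementation (in our task design, parameter values were validated with regular expressions before submission):

```python
import string

# Normalize paraphrases (lowercase, strip punctuation), reject empty or
# repeated paraphrases, and require the highlighted parameter values to be
# preserved. The naive substring check stands in for the regex validation
# used in the real task.
def normalize(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def validate(initial: str, paraphrases: list[str], required_values: list[str]) -> bool:
    seen = {normalize(initial)}
    for p in paraphrases:
        norm = normalize(p)
        if not norm or norm in seen:  # empty, repeated, or identical to the source
            return False
        if any(normalize(v) not in norm for v in required_values):
            return False              # a highlighted parameter value is missing
        seen.add(norm)
    return True

print(validate("Book a flight from Sydney to Houston",
               ["Get me a flight from Sydney to Houston",
                "I need to fly from Sydney to Houston"],
               ["Sydney", "Houston"]))  # -> True
```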

3.4 Common Paraphrasing Issues

To characterize the types of paraphrasing issues, two authors of this chapter investigated the crowdsourced paraphrases and identified five primary categories of paraphrasing issues. We only considered paraphrase-level issues related to the validity of a paraphrase, without considering dataset-level quality issues such as lexical diversity [139] and bias [85].

3.4.1 Spelling Errors

Misspelling has been reported as one of the most common mistakes in paraphrasing [92, 203, 34, 19]. In our sample set, we also noticed that misspellings were generated both intentionally (as an act of cheating to quickly generate a paraphrase, such

2https://www.figure-eight.com 3https://github.com/unsw-cse-soc/ParaQuality


1. Correct
   Expression: Create a public playlist named new_playlist
   Paraphrase: Make a public playlist named new_playlist

2. Spelling Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: Estimate the taxi fare from the airport to hom

3. Spelling Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: Tell me about the far from airport to home

4. Spelling Errors
   Expression: Where should I try coffee near Newtown?
   Paraphrase: Find cafes near Newtown

5. Linguistic Errors
   Expression: Estimate the taxi fare from the airport to home
   Paraphrase: How much for taxi from airport to home

6. Semantic Errors
   Expression: Are the burglar alarms in the office malfunctioning?
   Paraphrase: Is the burglar alarm faulty in our work place?

7. Semantic Errors
   Expression: Are the burglar alarms in the office malfunctioning?
   Paraphrase: Are the office alarms working?

8. Semantic Errors
   Expression: Is the TV in the house off?
   Paraphrase: Can you turn off the TV in the house if it's on?

9. Task Misunderstanding
   Expression: Estimate the taxi fare from the airport to home?
   Paraphrase: Airport to home is $50

10. Cheating
    Expression: Request a taxi from airport to home
    Paraphrases: A taxi from airport to home / Taxi from airport to home / From airport to home

11. Cheating
    Expression: I want reviews for McDonald at Kensington st.
    Paraphrases: I want reviews g for McDonald at Kensington st. / I want for reviews for McDonald at Kensington st. / I want reviegws for McDonald at Kensington st.

12. Cheating
    Expression: I want reviews for McDonald at Kensington st.
    Paraphrases: I want to do reviews for McDonald's in Kensington / I would like to do reviews for McDonald's in Kensington / I could do reviews for McDonald's in Kensington

13. Cheating
    Expression: Create a public playlist named NewPlaylist
    Paraphrases: That song hits the public NewPlaylist this year / Public really loved that NewPlaylist played on the event / Public saw many NewPlaylist this year

14. Cheating
    Expression: Estimate the taxi fare from the airport to home
    Paraphrases: What is the fare of taxi from airport to home / Tell me about fare from airport to home / You have high taxi fare airport to home

Table 3.1: Paraphrase Samples



3.4.2 Linguistic Errors

Linguistic errors are also common in crowdsourced natural language collections [95, 139]. These include verb errors, preposition errors, vocabulary errors (improper word substitutions), and incorrect singular/plural nouns, to name a few. Moreover, capitalization and article errors seem abundant (e.g., Example 5 in Table 3.1). Given that real bot users also make such errors, it is important to have linguistically incorrect utterances in the training samples [11]. However, at the very least, detecting linguistic errors can contribute to quality-aware selection of crowd workers.

3.4.3 Semantic Errors

Semantic errors occur when a paraphrase deviates from the meaning of the initial utterance (e.g., "find cafes in Chicago"). As reported in various studies, workers may forget to mention parameter values (also known as missing slots) (e.g., "find cafes")4 [44, 190, 171, 203, 19], provide wrong values (e.g., "find cafes in Paris") [190, 171, 203, 139, 19], or add unmentioned parameter values [203] (e.g., "find two cafes in Chicago"). Workers may also incorrectly use a singular noun instead of its plural form, and vice versa. For instance, in Example 6 of Table 3.1, the paraphrase only asks for the status of one specific burglar alarm while the expression asks for the status of all burglar alarms. Mistakes in paraphrasing complementary forms of words also exist in the crowdsourced dataset. For instance, in Example 7 of Table 3.1, assuming that the bot answers the question only by saying "YES" or "NO", the answer for the paraphrase differs from that of the expression; however, it would make no difference if the bot's response were more descriptive (e.g., "it's working", "it isn't working"). Finally, some paraphrases significantly diverge from the expressions. For instance, in Example 8 of Table 3.1, the intent of the paraphrase is to turn off the TV, whereas that of the initial utterance is to query the TV's status.

3.4.4 Task Misunderstanding

In some cases, workers misunderstood the task and provided translations in their own native languages (referred to as Translation issues) [44, 19, 11], and some

4In our task design, this type of error cannot happen since parameter values are checked using regular expressions before submission

mistakenly thought they should provide answers for expressions phrased as questions (referred to as Answering issues), such as Example 9 in Table 3.1. This occurred even though workers were provided with comprehensive instructions and examples. Setting aside the possibility of cheating, we infer that some workers simply did not read the instructions.

3.4.5 Cheating

Collecting paraphrases through crowdsourcing is not immune to unqualified workers, cheaters, or spammers [46, 44, 34]. Detecting malicious behaviour is vital because even constructive feedback may not guarantee quality improvements when workers act carelessly on purpose. Cheating is thus considered a special case of Semantic Error which is done intentionally. It is difficult even for experts to detect whether someone is cheating or unintentionally making mistakes. However, it becomes easier when we consider all three paraphrases written by a worker for a given expression at once. For example, in Example 10 of Table 3.1, the malicious worker removes words one by one to generate new paraphrases. In this example, we also notice that it is still possible for a cheater to produce a valid paraphrase accidentally, such as the first paraphrase in Example 10. Workers may also start providing faulty paraphrases after generating some correct paraphrases, as shown in Example 14 of Table 3.1. Based on our observations, the simplest mode of cheating is to add a few random characters to the source sentence, as shown in Example 11. Next is adding a few words to the source sentence without much editing, as shown in Example 12. Finally, there are cheaters who rewrite and change the sentences substantially in a very random way, such as in Example 13.
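As a rough illustration of how a worker's full triple can expose trivial-edit cheating (Examples 10 and 11), the sketch below flags triples whose paraphrases are near-verbatim copies of the source, using a character-level similarity ratio. The heuristic and its threshold are illustrative assumptions on our part, not the classifiers evaluated later in this chapter.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two strings (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_trivial_edit_cheating(utterance: str, paraphrases: list[str],
                                     threshold: float = 0.9) -> bool:
    """Flag a triple when every paraphrase is an almost verbatim copy of the source."""
    return all(similarity(utterance, p) >= threshold for p in paraphrases)

# Example 11-style triple: random characters injected into the source sentence.
print(looks_like_trivial_edit_cheating(
    "I want reviews for McDonald at Kensington st.",
    ["I want reviews g for McDonald at Kensington st.",
     "I want for reviews for McDonald at Kensington st.",
     "I want reviegws for McDonald at Kensington st."],
))  # -> True
```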

3.5 Dataset Annotation

We designed another crowdsourcing task to annotate the collected paraphrases according to the categories of issues devised above, namely using the following labels: Correct, Semantic Error, Misspelling, Linguistic Error, Translation, Answering, and Cheating. We split the category of misunderstanding issues into Translation and Answering because they require different detection methods.


3.5.1 Methodology

In the annotation task, crowd workers were instructed to label each paraphrase with its paraphrasing issues. Next, to further increase the quality of the annotations5, two authors of this chapter manually re-annotated the paraphrases to resolve disagreements between crowd annotators. Moreover, contradictory labels (e.g., a paraphrase cannot be labeled both Correct and Misspelling simultaneously) were checked to ensure consistency. The overall Kappa test showed high agreement between the annotators [128], with Kappa = 0.85; Table 3.2 shows the pairwise inter-annotator agreement [38]. Finally, the authors discussed and revised the re-annotated labels to resolve any remaining disagreements.

Label              Kappa

Correct            0.900
Misspelling        0.972
Linguistic Errors  0.879
Translation        1.000
Answering          0.855
Cheating           0.936
Semantic Errors    0.833

Table 3.2: Pairwise Inter-Annotator Agreement

3.5.2 Statistics

Figure 3.1 shows the frequencies of each label in the crowdsourced paraphrases, as well as their co-occurrences, in an UpSet plot [116] generated using Intervene [105]. Accordingly, we infer that only 61% of paraphrases are labeled Correct. This plot also shows how many times two labels co-occurred. For example, all paraphrases which are labeled Translation (24 times) are also labeled Cheating6.

5because of weak agreement between crowd workers
6We used Google Translate to check whether they were proper translations or just random sentences in other languages


[Figure 3.1 is an UpSet plot of label frequencies and their co-occurrences. Overall frequencies: Correct 3682 (61%), Cheating 1059 (18%), Linguistic Error 900 (15%), Semantic Error 714 (12%), Misspelling 278 (5%), Answering 148 (2%), Translation 24 (0%).]

Figure 3.1: Dataset Label Statistics

3.6 Automatic Error Detection

Automatically detecting paraphrasing issues, especially during the crowd task, can minimize the cost of crowdsourcing by eliminating malicious workers, reducing the number of erroneous paraphrases, and eliminating the need to launch another crowdsourced validation task. Moreover, by detecting Misspelling and Linguistic Errors, users can be provided with proper feedback to help them improve the quality of paraphrasing, by showing the source of the error and suggestions to address it (e.g., "Spelling error detected: articl → article"). Detecting Semantic Errors, such as missing parameter values, can also help crowd workers generate high-quality correct paraphrases. Automated methods can also be used to identify low-quality workers, and particularly cheaters, who may intentionally generate potentially large numbers of invalid paraphrases. Providing suggestions to cheaters will not help, and therefore early detection is of paramount importance. In a pre-hoc quality control approach for crowdsourced paraphrases, the most important metric seems to be the precision of detecting invalid paraphrases [141]. That is because the main aim of using such a quality control approach is rejecting invalid paraphrases without rejecting correct ones [24].

This is essential because rejecting correct paraphrases would be unfair and unproductive. For instance, sincere and trustworthy crowd workers might not get paid as a result of false positives (incorrectly detected errors). On the other hand, having high recall in detecting invalid paraphrases is important to eliminate faulty paraphrases and consequently obtain robust training samples. Moreover, such a quality control technique should ideally be domain-independent, accessible, and easy to operate, to minimize the cost of customization for a particular domain and the need for paid experts (e.g., an open-source pre-built machine learning model). In the rest of this section, we examine current tools and approaches and discuss their effectiveness in assessing the paraphrasing issues.

3.6.1 Spelling Errors

We employed several spell checkers, as listed in Table 3.3, to examine whether they are effective in recognizing spelling errors. We searched GitHub and ProgrammableWeb7, among other sources, to find available tools and APIs for this purpose.

Spell Checker       Precision  Recall  F1

Aspell8             0.249      0.618   0.354
Hunspell9           0.249      0.619   0.355
MySpell10           0.249      0.619   0.355
Norvig11            0.488      0.655   0.559
Ginger12            0.540      0.719   0.616
Yandex13            0.571      0.752   0.650
Bing Spell Check14  0.612      0.737   0.669
LanguageTool15      0.630      0.727   0.674

Table 3.3: Comparison of Spell Checkers

7https://www.programmableweb.com
8http://aspell.net/
9http://hunspell.github.io/
10http://www.openoffice.org/lingucomponent/dictionary.html
11https://github.com/barrust/pyspellchecker
12https://www.gingersoftware.com/grammarcheck
13https://tech.yandex.ru/speller/
14https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/
15https://languagetool.org


Even though detecting misspelled words seems easy with existing automatic spell checkers, they fall short in a few cases. This can also be concluded from Table 3.3 by considering the precision and recall of each tool in detecting only paraphrases with misspellings. For instance, spell checkers are often unable to identify homonyms [155], incorrectly flag proper nouns and unusual words [14], and sometimes do not identify wrong words that are properly spelled [33]. For instance, in Example 1 of Table 3.1, "new_playlist" is incorrectly detected as a misspelled word by LanguageTool (the best performer in Table 3.3). In Example 3, the word "far" is not detected even though the worker has misspelled the word "fare". In Example 4, the word "Newtown" (a suburb in Sydney) is mistakenly detected as a misspelling. Some of these deficiencies can be addressed. For instance, assuming that the initial expressions given to the crowd are free of typos, we can ignore false positives like "Newtown" and "new_playlist".
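A minimal sketch of this idea, assuming the pyspellchecker package (the Norvig-style checker listed in Table 3.3) is available: tokens that already occur in the initial expression are whitelisted, so placeholders and proper nouns such as "new_playlist" and "Newtown" are not reported. This is an illustration of the workaround, not the exact implementation used in our experiments.

```python
# pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

def misspellings(expression: str, paraphrase: str) -> dict:
    """Return {misspelled_word: suggested_correction}, ignoring any token that
    already occurs in the initial expression (assumed to be typo-free)."""
    whitelist = {t.lower() for t in expression.split()}
    tokens = [t.strip(".,!?").lower() for t in paraphrase.split()]
    candidates = [t for t in tokens if t and t not in whitelist]
    return {w: spell.correction(w) for w in spell.unknown(candidates)}

# Example 2 from Table 3.1: "hom" is flagged, while "home" in the expression is not.
print(misspellings("Estimate the taxi fare from the airport to home",
                   "Estimate the taxi fare from the airport to hom"))
```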

3.6.2 Linguistic Errors

We investigated how well grammar checkers perform in detecting linguistic errors. We employed several grammar checkers, as listed in Table 3.4. Our experiments show that grammar checkers have both low precision and low recall. Perelman [155] also conducted several experiments with major commercial and non-commercial grammar checkers and identified that grammar checkers are unreliable. Based on our observations, grammar checkers often fail to detect linguistic errors, as shown in Table 3.4. Examples include improper use of words (e.g., "Who is the latest scientific article of machine learning?"), random sequences of words generated by cheaters (e.g., "Come the next sing"), and missing articles16 (e.g., "I'm looking for flight for Tehran to Sydney"). Given these examples, we believe that language models can be used to measure the likelihood of a sequence of words in order to detect whether it is linguistically acceptable.

16Missing articles in expressions similar to newspaper headlines are not considered errors in the dataset (e.g., "Hotel near Disneyland")
17https://www.afterthedeadline.com
18https://www.grammarbot.io
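As a sketch of the language-model idea, the snippet below scores sentences with GPT-2 perplexity via the Hugging Face transformers library. This was not part of the experiments reported in this chapter, and any threshold for flagging a paraphrase as linguistically unacceptable would be an assumption that needs tuning.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Per-token perplexity under GPT-2; higher values suggest less fluent text."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return torch.exp(loss).item()

# A fluent paraphrase should score much lower than a random word sequence.
print(perplexity("How much is the taxi fare from the airport to home?"))
print(perplexity("Come the next sing"))
```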


Grammar Checker  Precision  Recall  F1

AfterDeadline17  0.228      0.069   0.106
Ginger           0.322      0.256   0.285
GrammarBot18     0.356      0.139   0.200
LanguageTool     0.388      0.098   0.156

Table 3.4: Comparison of Grammar Checkers

3.6.3 Translation

We also investigated several language detectors to evaluate how well they perform when crowd workers use another language instead of English. The results of the experiment in Table 3.5 indicate that these tools detect almost all sentences in other languages. However, they produce many false positives, including for correct English sentences (e.g., "play next song"). As a result, the tools in our experiment have low precision in detecting languages, as shown in Table 3.5. Most of the false positives are caused by sentences that contain unusual words such as misspellings and named entities (e.g., "Phil saying I got you"). One possible approach to improving the precision of such tools and APIs is to check whether a given paraphrase has spelling errors prior to using language detection tools. We therefore extended DetectLanguage (the best performing tool) by adding a constraint: a sentence is not considered to be written in another language unless it has at least two spelling errors. This constraint is based on the assumptions that spell checkers treat foreign words as spelling errors and that a sentence has at least two words. This approach (DetectLanguage+ in Table 3.5) significantly reduced the number of false positives and thus improved precision.
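A minimal sketch of the DetectLanguage+ constraint, assuming the open-source langdetect and pyspellchecker packages in place of the commercial DetectLanguage API:

```python
# pip install langdetect pyspellchecker
from langdetect import detect
from spellchecker import SpellChecker

spell = SpellChecker()

def is_foreign_language(paraphrase: str) -> bool:
    """Heuristic from Section 3.6.3: trust the language detector only when the
    sentence also has at least two words unknown to an English spell checker."""
    tokens = [t.strip(".,!?").lower() for t in paraphrase.split()]
    spelling_errors = len(spell.unknown([t for t in tokens if t]))
    return detect(paraphrase) != "en" and spelling_errors >= 2

print(is_foreign_language("play next song"))             # False despite detector noise
print(is_foreign_language("busca un restaurante cerca"))  # True
```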

3.6.4 Answering to Canonical Utterance

Dialog Acts (DAs) [100], also known as speech acts, represent the general intent of an utterance. DA tagging systems label utterances with a predefined set of utterance types (Directive, Commissive, Informative, etc.) [130]. Since DAs must remain consistent during paraphrasing, we employed a state-of-the-art, domain-independent, pre-trained DA tagger proposed in [130].

19https://fasttext.cc/blog/2017/10/02/blog-post.html [98]
20https://pypi.org/project/langdetect/
21https://github.com/saffsd/langid.py
22https://console.bluemix.net/apidocs/language-translator#identify-language
23https://ws.detectlanguage.com


Language Detector     Precision  Recall  F1

FastText19            0.072      1.000   0.135
LangDetect20          0.080      0.917   0.147
LanguageIdentifier21  0.080      0.917   0.147
IBM Watson22          0.170      0.958   0.289
DetectLanguage23      0.344      0.917   0.500
DetectLanguage+       0.909      1.000   0.952

Table 3.5: Language Detection

For example, if an initial utterance is a question (e.g., "are there any cafes nearby?"), it is acceptable to paraphrase it into a directive sentence (e.g., "find cafes nearby."), but its speech act cannot be informative (e.g., "there is a cafe on the corner."). Overall, due to the lack of any other domain-independent DA tagger for English, we only investigated this tagger. We found that it has a precision of 2% and a recall of 63%, which shows that detecting speech acts is a very challenging task, especially in domain-independent settings. Advances in speech act detection and the availability of public speech act datasets can assist in detecting this category of paraphrasing issues. Moreover, it is feasible to automatically generate pairs of questions and answers by mining datasets from the fields of Question Answering and dialog systems. Automatically building such pairs can help build a dataset diverse enough to be used in practice, and such a dataset can be fed into deep learning algorithms to yield better performance in detecting Answering issues.
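Table 3.6 later mentions simple functions for detecting questions, imperative sentences, and answering; the sketch below shows one such crude, rule-based consistency check. It is only an illustration of the idea and is far weaker than the pre-trained DA tagger of [130]; the keyword lists are our own assumptions.

```python
def crude_speech_act(utterance: str) -> str:
    """Very rough speech-act guess: 'question', 'directive', or 'informative'."""
    text = utterance.strip().lower()
    first = text.split()[0] if text else ""
    if text.endswith("?") or first in {"what", "where", "when", "who", "how",
                                       "is", "are", "do", "does", "can", "could"}:
        return "question"
    if first in {"find", "search", "get", "show", "list", "play", "create",
                 "estimate", "request", "tell", "make"}:
        return "directive"
    return "informative"

def acts_compatible(expression: str, paraphrase: str) -> bool:
    """Questions and directives may be interchanged; an informative paraphrase
    of a question/directive (e.g., an answer) is treated as a mismatch."""
    allowed = {"question", "directive"}
    a, b = crude_speech_act(expression), crude_speech_act(paraphrase)
    return (a in allowed and b in allowed) or a == b

# Example 9 from Table 3.1: the worker answered the question instead of paraphrasing.
print(acts_compatible("Estimate the taxi fare from the airport to home?",
                      "Airport to home is $50"))  # -> False
```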

3.6.5 Semantic Errors & Cheating

To the best of our knowledge, there is not yet an approach to distinguish between categories of semantically invalid paraphrases. Paraphrase detection and semantic textual similarity (STS) methods are designed to measure how semantically similar two pieces of text are. However, they do not differentiate between the types of errors (e.g., Cheating, Answering, Semantic Errors) that arise in our setting.


Category: N-gram Features (12 features)
  N-gram overlap, exclusive longest common prefix n-gram overlap, and SUMO, all proposed in [96], as well as Gaussian, Parabolic, and Trigonometric, proposed in [42], Paraphrase In N-gram Changes (PINC) [30], Bilingual Evaluation Understudy (BLEU) [149], Google's BLEU (GLEU) [213], NIST [49], Character n-gram F-score (CHRF) [160], and the length of the longest common sub-sequence.

Category: Semantic Similarity (15 features)
  Semantic Textual Similarity [58], Word Mover's Distance [110] between word embeddings of expression and paraphrase, cosine similarity and Euclidean distance between vectors of expression and paraphrase generated by Sent2Vec [145], InferSent [41], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179], tenses of sentences, pronouns used in the paraphrase, and mismatched named entities.

Category: Others (11 features)
  Number of spelling and grammatical errors detected by LanguageTool, task completion time, edit distance, normalized edit distance, word-level edit distance [58], length difference between expression and paraphrase (in characters and words), and simple functions to detect questions, imperative sentences, and answering.

Table 3.6: Summary of Feature Library

As such, these techniques are not directly applicable. In the rest of this section, we focus on building machine learning models to detect the paraphrasing errors.

Classifier       Precision  Recall  F1

Random Forest    0.947      0.129   0.226
Maximum Entropy  0.564      0.157   0.246
Decision Tree    0.527      0.350   0.421

Table 3.7: Automatic Answering Detection


Classifier     Precision  Recall  F1

Random Forest  0.798      0.120   0.209
Decision Tree  0.377      0.276   0.319
Naive Bayes    0.171      0.783   0.280

Table 3.8: Automatic Semantic Error Detection

Classifier          Precision  Recall  F1

SVM                 0.878      0.223   0.356
K-Nearest Neighbor  0.871      0.248   0.386
Random Forest       0.843      0.546   0.663
Maximum Entropy     0.756      0.440   0.557
Decision Tree       0.632      0.566   0.597
Naive Bayes         0.473      0.426   0.449

Table 3.9: Automatic Cheating Detection

For this purpose, we used 38 established features from the literature, as summarized in Table 3.6. Using these features and Weka [77], we built various classifiers to detect the following paraphrasing issues: Answering, Semantic Errors, and Cheating. We chose to test the five classification algorithms applied in the paraphrasing literature as mentioned in [24]: C4.5 Decision Tree, K-Nearest Neighbor (K=50), Maximum Entropy, Naive Bayes, and Support Vector Machines (SVM), using the default Weka 3.6.13 parameters for each algorithm. We also experimented with the Random Forest algorithm since it is a widely-used classifier. We did not apply deep learning based classifiers directly due to the lack of expressions in the collected dataset, which seems essential for developing domain-independent classifiers. While our dataset is reasonably large, it contains only 40 expressions (each having 150 paraphrases). Given that deep learning techniques are data-hungry [73, 225], many more expressions would be needed to use these kinds of models and eliminate the burden of manual feature engineering. Instead, we benefited from state-of-the-art sentence encoders via transfer learning, as listed in Table 3.6.
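To make the feature library concrete, the sketch below computes a few of the Table 3.6 features for an (expression, paraphrase) pair. The sentence encoder is stubbed out with a toy bag-of-characters vector so the example is self-contained; in the actual experiments, pre-trained encoders such as the Universal Sentence Encoder or Sent2Vec would supply the vectors.

```python
# pip install nltk numpy
import numpy as np
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def encode(sentence: str) -> np.ndarray:
    """Stub for a sentence encoder; a toy bag-of-characters vector stands in
    for a real model so this example runs without extra downloads."""
    vec = np.zeros(128)
    for ch in sentence.lower():
        vec[ord(ch) % 128] += 1
    return vec

def features(expression: str, paraphrase: str) -> dict:
    """A small subset of the feature library in Table 3.6."""
    e_tok, p_tok = expression.lower().split(), paraphrase.lower().split()
    e_vec, p_vec = encode(expression), encode(paraphrase)
    cosine = float(np.dot(e_vec, p_vec) / (np.linalg.norm(e_vec) * np.linalg.norm(p_vec)))
    return {
        "bleu": sentence_bleu([e_tok], p_tok, smoothing_function=SmoothingFunction().method1),
        "char_edit_distance": nltk.edit_distance(expression, paraphrase),
        "length_diff_words": abs(len(e_tok) - len(p_tok)),
        "cosine_similarity": cosine,
    }

print(features("Estimate the taxi fare from the airport to home",
               "How much for taxi from airport to home"))
```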


Tables 3.7, 3.8, and 3.9 show the performance of various classifiers (excluding classifiers with an F1 score below 0.2) for each of the paraphrasing issues using 10-fold cross validation. To keep the classifiers domain-independent, we split the dataset based on the expressions, without sharing any paraphrases of a single expression between the test and train samples. It can be seen that automatically detecting these quality issues is very challenging; even the best performing classifier has a very low F1 score, especially for detecting Answering and Semantic Error issues. Based on manual exploration, we also found that the classifiers fail to recognize complex cheating behaviours such as Example 13 in Table 3.1, as discussed in Section 3.4. Therefore, new approaches are required to accurately detect paraphrasing issues. Based on our explorations and a prior work [127], we postulate that accurately detecting linguistic errors such as grammatically incorrect paraphrases can play an indispensable role in detecting cheating behaviours. Moreover, advances in measuring semantic similarity between sentences can help differentiate between semantically invalid paraphrases and correct ones.
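The expression-level split can be reproduced with scikit-learn's GroupKFold, where the group of each paraphrase is the id of its source expression. In the sketch below the feature matrix and labels are random placeholders, so the printed score is meaningless; only the evaluation protocol is illustrated.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 38))           # placeholder: 38 features per paraphrase
y = rng.integers(0, 2, size=600)         # placeholder: 1 = cheating, 0 = otherwise
groups = np.repeat(np.arange(40), 15)    # placeholder: expression id of each paraphrase

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean F1 over expression-level folds: {np.mean(scores):.3f}")
```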

3.6.6 Incorrect Paraphrases Detection

We also assessed the performance of detecting incorrect paraphrases regardless of their categories. In this setting, we labeled all incorrect sentences with a single label ("Incorrect"). Table 3.10 shows the performance of various classifiers. Detecting incorrect paraphrases is useful for post-hoc quality control: it removes incorrect paraphrases after crowdsourcing and consequently eliminates the need for a crowdsourced validation task.

Classifier          Precision  Recall  F1

K-Nearest Neighbor  0.799      0.341   0.478
Random Forest       0.781      0.551   0.646
Maximum Entropy     0.721      0.489   0.583
SVM                 0.709      0.289   0.411
Decision Tree       0.633      0.585   0.608
Naive Bayes         0.574      0.557   0.565

Table 3.10: Automatic Incorrect Paraphrase Detection


3.7 Conclusion

In this chapter, we employed a data-driven approach to investigate and quantitatively study various crowdsourced paraphrasing issues. We discussed how automatic techniques for detecting various quality issues can assist the manual process of crowdsourced paraphrasing. We collected an annotated dataset of crowdsourced paraphrases in which each paraphrase is labeled with its associated paraphrasing issues. We used this dataset to assess existing tools and techniques and to determine whether they are sufficient for automatically detecting such issues. Our experiments revealed that automated detection of errors in paraphrases is a challenging task. In particular, detecting semantic divergence is difficult, especially determining whether it is intentional or unintentional (cheating behaviour vs. semantic error). While the proposed dataset has been annotated by two expert annotators, its labels may still be disputable, because it is difficult even for experts to come to a definite conclusion about the quality issues of a single utterance. There might also be other categories of quality issues which have not been seen in the collected dataset or reported in related work. An interesting extension to this work is devising automation-assisted methods for the detection of paraphrasing issues. This would be based on a two-way feedback mechanism: generating feedback for workers while the system simultaneously learns from the (data of) users to improve its machine intelligence. In time, we envision increasingly less dependence on users.

Chapter 4

Diversity-aware Crowdsourced Utterances

Dynamic Word Recommendation to Obtain Diverse Crowdsourced Paraphrases of User Utterances

This chapter provides our solution to tackle the diversity problem in crowdsourced paraphrases. We leverage the priming effect as an opportunity rather than a threat to promote the diversity of crowdsourced paraphrases. We dynamically generate word suggestions to motivate crowd workers towards producing diverse utterances. The key challenge is to make suggestions that can improve diversity without resulting in semantically invalid paraphrases. To achieve this, we propose a probabilistic model that generates continuously improved versions of word suggestions that balance diversity and semantic relevance. Our experiments show that the proposed approach improves the diversity of crowdsourced paraphrases. In Section 4.1, we introduce the issues and the contributions we make. We discuss related work in Section 4.2. In Section 4.3, we describe the proposed techniques for diversifying crowdsourced paraphrases. In Section 4.4, we evaluate the proposed techniques and discuss the results. Finally, we provide concluding remarks and discuss future work in Section 4.5.

4.1 Introduction

As discussed in Chapter 2, building task-oriented bots requires processing a given user utterance (e.g., “search for a restaurant near the university”) to identify the

user's intent (e.g., business search) along with its entities (e.g., business= "restaurant", location= "university"). The success of intent recognition models heavily relies on obtaining large and high-quality corpora of annotated utterances [217]. An annotated utterance (e.g., "search for a restaurant near the university" where intent="business search", business="restaurant", and location="university") is a user utterance labeled with a specific intent and its corresponding entities. As discussed thoroughly in Chapters 2 and 3, obtaining training utterances typically involves two main steps: (i) obtaining a canonical utterance, and (ii) paraphrasing it into multiple variations [219, 225, 203]. Paraphrasing is necessary since having a diverse set of utterances in the training set can better represent the different ways in which people may specify an intent, especially given the ambiguous and flexible nature of human language [206, 217]. Crowdsourced paraphrasing is an effective and inexpensive way to obtain training utterances, as discussed in Chapters 2 and 3. However, outcomes from crowdsourcing tasks must be checked for quality since they are produced by workers with unknown or varied skills and motivations (see Chapter 3) [46]. For example, spammers, malicious or inexperienced workers can provide misleading and erroneous paraphrases [219]. We discussed various quality aspects of crowdsourced utterances in the previous chapter as well as in Chapter 2. This chapter focuses on a specific but important quality issue in crowdsourced paraphrases: the lack of diversity in the user utterances obtained from the crowd (used for training bots). Indeed, research has shown that crowd workers are biased towards the vocabulary and structure used in the initial sentences provided to them when performing the paraphrasing task, thereby negatively impacting the diversity of training utterances [203, 171, 25]. As mentioned in Chapter 2, bias towards the vocabulary and structure of the sentence to be paraphrased can be explained by the priming effect: an automatic, implicit and non-conscious activation of information in memory [86]. According to the priming effect, exposure to a stimulus affects responses to a subsequent stimulus (e.g., when asked to name a word starting with "str", humans are more likely to form the word "strong" than "street" if they have previously been shown the word "strong") [211]. As such, primed by the words in the given utterance, crowd workers are more likely to use the same vocabulary when paraphrasing [171, 95]. Thus, the priming effect may negatively impact the diversity of collected paraphrases.


[Figure 4.1 sketches the word-recommendation loop: text alignment produces expanded word lists, which are ranked into a recommendation word cloud (e.g., "eating_place", "resto", "eatery" for "search for a restaurant"); implicit feedback from newly collected paraphrases updates the ranking.]

Figure 4.1: Word-Recommendation Overview

In this work, we hypothesize and demonstrate that recommending words/phrases can positively prime crowd workers to use new vocabulary in their paraphrases. In other words, we leverage the priming effect itself to devise diversity-enhancing paraphrasing techniques, countering the negative impact of priming by the words inside the given utterance to be paraphrased. Inspired by automated paraphrase generation techniques, while also considering priming effects, we propose a novel hybrid (automated-crowdsourced) approach that dynamically suggests words/phrases to workers to assist them in recalling words. The suggestions can potentially improve the diversity of paraphrasing for a given intent. Suggesting words is challenging because we need to ensure that suggestions do not promote semantically invalid paraphrases. For example, if the intent is to "find a restaurant", suggesting words such as "kitchen" and "counter" (while related to "restaurant") would result in the generation of a paraphrase such as "find a counter" which does not convey the same intent. Therefore, a key challenge of dynamically generating these suggestions is recommending words/phrases that will improve diversity without resulting in semantically invalid paraphrases. We contribute a novel framework (depicted in Figure 4.1) combining techniques from dynamic word list expansion [60] and implicit relevance feedback [99] to provide diverse and semantically relevant training utterances for a given intent. More specifically, we make the following contributions:

• A novel word list expansion technique to improve diversity of training utterances. We formalize the generation of alternative words/phrases as a problem of expanding a list of seed words/phrases extracted from an utterance (to be paraphrased) and its previously collected paraphrases.


Our approach leverages word alignment [21] and word embedding [131] techniques. Word alignment is the task of identifying translation relationships between two sentences. For example, word alignment between "search for a restaurant near the university" and "find a restaurant close to the university" results in the following translation relationships (synonym sets): "search for" → "find", and "near" → "close to". Aligning words allows deriving a more accurate sense for each word, since a word may often have multiple meanings depending on its context. For instance, the word "near" may also mean "almost" or "approach" in different sentences. However, using word alignment, we can derive words like "close to" to understand the correct word sense. Word embedding can then be used to further enrich the synonym sets by mapping words to semantically similar candidates (e.g., "neighbourhood", "in close proximity") [154, 131, 59].

• A novel probabilistic model to recommend diverse but semantically relevant words. Unsupervised techniques such as word embedding can generate a large number of related words. Moreover, even when combined with techniques such as word alignment to help generate a highly related word list, the output may still not be perfect. As a simple example, alternatives generated using word embedding for the word "green" may include {"greenable", "virescent"}. These words would seem unnatural in paraphrases, although a machine would not be able to tell this. More importantly, obtaining a very large list of suggestions is not necessarily useful unless we devise methods for ranking (and continuously re-ranking) the results in order to choose and present the set of top-quality words to the crowd worker. To achieve this, we propose a probabilistic model that uses various indicators (e.g., implicit feedback from workers, diversity maximization and semantic relevance) to automatically and adaptively synthesize a reliable score to rank the generated word list. This allows the generation of improved versions of suggestions, based on continually monitoring implicit worker feedback, in order to balance diversity and semantic relevance.

• End-to-End Evaluation. Finally, we evaluate how these automatically generated suggestions impact the diversity of collected paraphrases on various domains, including 40 intents that are designed for highly popular APIs (e.g., Yelp, Spotify).


Experiments show that our approach improves not only lexical diversity but also other diversity metrics (PINC [30] and DIV [103]). Moreover, we found that it reduces task completion time and misspelling errors.

4.2 Related Work

In this section, we provide the required background and related work. To avoid repetition, we refer the reader to Chapters 2 and 3 for background on dialog systems, paraphrasing and crowdsourced utterances.

4.2.1 Priming in crowdsourcing

In experimental psychology, the term "priming" refers to a technique in which introducing one stimulus influences a person's response to a subsequent stimulus [86]. In other words, humans may be biased by prior stimuli that affect future processing of information [211]. Word-fragment completion (WFC) is an example of a task where priming may have an impact [188]: a person is given a fragment of a word like "str- - -" and is asked to complete the word. This person is more likely to form the word "strong" than "street" if she had been shown the word "strong" before performing the task [20]. This type of priming is called repetition priming. Repetition priming (also called direct priming) refers to a form of priming according to which the brain responds more quickly to a stimulus if it has experienced it previously [64]. Priming is used in several applications, from swaying public opinion in product marketing and political campaigns to sharpening memory skills [78, 17, 51, 196]. Priming has also been used in crowdsourcing to positively affect the performance of crowd workers [79, 66]. For example, research found that showing positive images (e.g., a photo of a smiling child) or listening to positive music when working on tasks can influence idea generation [136]. As another example, priming has been used to trigger workers' inherent motivation to excel in performing a task by showing them quotes of famous figures about "achievement" [66]. In crowdsourced paraphrasing, research has also shown that crowd workers are primed by the given utterance as well as by the examples and instructions presented to them.

However, in this case, priming negatively impacts diversity since crowd workers are more likely to use the same vocabulary as used in the given utterance and task description [139, 95, 171, 143]. In this chapter, we build upon this prior art but also look at priming as an opportunity rather than a problem. Specifically, inspired by tasks such as word-fragment completion, we leverage priming by suggesting a list of words/phrases that can potentially improve diversity. We hypothesize (and demonstrate) that priming the crowd with a list of potential lexical substitutions can mitigate the negative impact of priming by the words in the given utterance.

4.2.2 Word list expansion

Query rewriting, and more specifically query expansion methods, can also be adopted to build dynamic word suggestions. The primary goal of query expansion is to solve the term mismatch problem [27] by adding a few words to a given query. In this manner, it aims to mitigate the vocabulary problem when documents and users use different terminologies to express the same thing. In this chapter, we extend a query expansion model based on word embedding [111] by aligning words used in the given utterance and the already collected paraphrases. In this manner, we can distinguish the senses of multi-sense words. For example, word alignment between "search for a restaurant near the university" and "find a restaurant close to the university" detects the following translation relationships: "search for" → "find", and "near" → "close to". Thus, it can be inferred that the word "near" conveys a sense meaning "close to" (not other senses like "almost" or "approach"). Furthermore, we adapt the concept of implicit relevance feedback from the query expansion methods of information retrieval systems in order to measure whether a word is semantically relevant to the given intent. Relevance feedback in information retrieval systems is used to expand the query by using the contents of relevant documents [182]. However, we use implicit relevance feedback to dynamically remove noise (semantically irrelevant words) from the suggestions. We propose a probabilistic model to measure the likelihood of a word being relevant to the given intent by tracking how suggested words are used. In short, if a word has been suggested multiple times without being used in the collected paraphrases, it is considered semantically irrelevant for the given intent.


4.3 Word Recommender

Figure 4.2 illustrates an example of a paraphrasing task based on word recommendation on the Figure Eight1 crowdsourcing platform. The recommendations are shown in a word cloud as a graphical representation. Words inside the word cloud (on the left-hand side) have been generated as relevant words to the given utterance. In the given example, workers have been asked to paraphrase "search for a restaurant near the university", which is an utterance for the business search intent with "restaurant" and "the university" as values for the business and location parameters, respectively. Our initial implementation (as presented in Section 4.4.1) categorizes words with the same color as possible alternatives for one of the words in the utterance. Thereby, at the time of paraphrasing, crowd workers are able to pick words from this list and form new paraphrases.

[Figure 4.2 shows the task interface: the utterance "Search for a restaurant near the university", a word cloud of suggestions (e.g., "eatery", "find", "close_to", "eating_place", "place_to_eat", "neighbourhood"), and three required paraphrase input fields.]

Figure 4.2: Crowd Workers’ Interface

The generation of diverse (but relevant) words is done by the Word Recommender. Its main components are illustrated in Figure 4.3. Given an intent (e.g., business search) with an annotated utterance (e.g., Search for a business="restaurant" near location="the university") to be paraphrased, the Word Recommender first extracts all words/phrases from the utterance. It then builds a synonym set for each extracted word/phrase by aligning the collected paraphrases with the utterance. Next, for each synonym set, it finds a set of related words to appear in the recommendation list with the help of a word-embedding model. Finally, it (re-)ranks candidate words based on the words that appear in the collected paraphrases to generate a new recommendation list. This process is an ongoing effort: periodically, when workers provide paraphrases for a particular utterance, a new recommendation list is generated according to the collected paraphrases.

1https://www.figure-eight.com/


[Figure 4.3 depicts the Word Recommender pipeline for the annotated utterance Search for a business="restaurant" near location="the university": the Synonym-Sets Extractor produces synonym sets (e.g., {search_for, seek, ...}, {restaurant, eatery, ...}, {near, close, ...}); the Recom-Set Generator expands them via word embeddings into scored candidates (e.g., (close_to, 0.71), (vicinity, 0.55), (eatery, 0.7)); and the Recom-Set Ranker re-ranks them against the paraphrase database before the recommendations are shown alongside the three paraphrase fields.]

Figure 4.3: Word Recommender Architecture

Our system adopts a word embedding model called ConceptNet Numberbatch [189] to find related words for each word that appears in the given utterance. Word embedding methods map words into a vector space in such a way that similar words have similar vectors. As a result, a word embedding model can be used to find similar words by finding the closest neighbours of a given word. ConceptNet Numberbatch has a few characteristics which make it desirable in our work. Firstly, it has been built using retrofitting [59] to refine existing word embedding models with external knowledge bases (e.g., ConceptNet [189]). Secondly, it provides vectors not only for words but also for frequent n-grams up to tri-grams. As a result, using such a word embedding approach, our system is able to suggest both words and phrases. While we have used this model in our experiments, it can easily be substituted with any word embedding model.

The rest of this section provides further details about the three main components of the Word Recommender2.

4.3.1 Synonym Sets Extractor

Any word can have different meanings in different contexts [56]. Since our aim is to suggest words highly relevant to a given utterance, it is important to disambiguate the sense of a given word in a sentence. For example, the word "near" can be an adjective, adverb, or even a verb, and based on its part-of-speech (POS) it can have several synonyms such as close, approximate, skinny, dear, good3. To suggest appropriate substitutions for such a word, it is important to consider its role and sense in the utterance.

[Figure 4.4 illustrates synonym-set creation: the annotated utterance Search for a business="restaurant" near location="the university" is tokenized into {search_for}, {restaurant}, {near}, and each set is enriched via WSD/word alignment into, e.g., {search_for, look_for, seek, ...}, {restaurant, eatery, eating_place, ...}, {near, close, ...}.]

Figure 4.4: Synonym-Sets Creation for an Utterance

To this end, the Synonym Sets Extractor first extracts all words/phrases from a given annotated utterance4, excluding stop-words5 (since frequent words are less likely to result in repetition priming [191]), e.g., {near}, {search_for}, {restaurant}. As a design decision, we also excluded parameter values marked as fixed [135]. Fixing parameter values can be a double-edged sword. Preserving parameter values removes the need for manual annotation of collected paraphrases [135]; it also reduces variability where we do not want it (e.g., cities of the world have many possible values) and allows workers to focus on where we do want variability. On the other hand, if a parameter like "business=restaurant" is fixed, then we are deprived of diverse but relevant utterances such as "where to eat in the university". Best practices for fixing parameters are outside the scope of this chapter, and here we simply assume that the bot developer has the choice to do so. Thus, the Synonym Sets Extractor also ignores words which are marked as fixed.

2https://github.com/unsw-cse-soc/WordRecommender
3http://wordnetweb.princeton.edu/perl/webwn?s=near
4we assume that utterances are annotated (either manually or automatically)
5stop-words are words used frequently in text, such as "a", "an", "is"


Next, the Synonym Sets Extractor enriches each word by creating a set for each word/phrase and adding its synonyms to the set. In a nutshell, the Synonym Sets Extractor employs (i) a word sense disambiguation (WSD) method if no paraphrase has been collected yet, and (ii) a word alignment [21] method to find aligned words between the given utterance and the collected paraphrases. Word alignment is the task of finding the corresponding translation of a word between two pieces of text (e.g., a sentence and its paraphrase). For instance, if we collect a paraphrase like "search for a restaurant in the vicinity of Sydney", we can infer that the word "vicinity" in this context is aligned to "near". In our implementation, we used the sentence aligner proposed in [192] to find alternatives for a given word in the collected paraphrases. These alternatives are added to the synonym set of the word to enrich the word and disambiguate its meaning in the utterance. While word alignment can be used for disambiguation, it is not practical until a few paraphrases have been collected from crowd workers. To mitigate this issue when there are no paraphrases at the beginning of crowdsourcing, the proposed system uses the WSD method proposed in [153]. Using a dictionary such as WordNet [133], the WSD method is able to determine a set of synonyms for a word in the given sentence. Finally, by enriching each word/phrase of the given annotated utterance, the following synonym sets are obtained: {search_for, look_for, seek}, {restaurant, eatery, eating_place, eating_house}, and {near, close}.
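A minimal sketch of the cold-start path (no paraphrases collected yet), using NLTK's Lesk implementation and WordNet instead of the WSD method of [153], and without the sentence aligner of [192]:

```python
# pip install nltk   (requires nltk.download("punkt"), nltk.download("wordnet"),
#                     nltk.download("omw-1.4"), nltk.download("stopwords"))
import nltk
from nltk.corpus import stopwords
from nltk.wsd import lesk

STOP = set(stopwords.words("english"))

def synonym_sets(utterance: str, fixed_values=frozenset()) -> dict:
    """Cold-start synonym sets: one set per content word, seeded with the WordNet
    lemmas of the sense chosen by the Lesk algorithm."""
    tokens = nltk.word_tokenize(utterance.lower())
    sets = {}
    for word in tokens:
        if word in STOP or word in fixed_values or not word.isalpha():
            continue
        sense = lesk(tokens, word)                  # may be None for unknown words
        lemmas = set(sense.lemma_names()) if sense else set()
        sets[word] = {word} | lemmas
    return sets

print(synonym_sets("search for a restaurant near the university",
                   fixed_values={"university"}))
```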

4.3.2 Recommendation Sets Generator

The Recommendation Sets Generator expands a synonym set by adding relevant words to the set, resulting in a recommendation set. This component builds upon word embeddings: a method of mapping words into a vector space in such a way that similar words have similar vectors. In particular, it makes use of a word embedding model called ConceptNet Numberbatch [189] to find related words/phrases for each synonym set. Figure 4.5 demonstrates how related words are found by the Recommendation Sets Generator. For a given synonym set (e.g., {near, close}), the Recommendation Sets Generator calculates its mean vector by averaging the vectors of all words in the synonym set ($\vec{ss} = \frac{1}{|ss|}\sum_{s \in ss} \vec{s}$, where $ss$ stands for synonym set). Next, it finds the top-n neighbours of the mean vector $\vec{ss}$ using cosine similarity ($\cos(\vec{u},\vec{v}) = \vec{u}\cdot\vec{v} / (\lVert\vec{u}\rVert\,\lVert\vec{v}\rVert)$), and ranks the neighbours based on their cosine similarity to the vector of the given utterance $\vec{expr}$, defined as the average of the vectors of all content words (nouns, verbs, and adjectives) in the utterance [229].

It should be noted that, while it is possible to order the words/phrases in the candidate sets based on their similarity to the corresponding synonym set, we calculate their similarities to the utterance in order to give importance to words which are more relevant to the utterance. For example, in Figure 4.5 "vicinity" is ranked higher than "adjacent", even though the vectors of "adjacent" and "near" are closer than those of "vicinity" and "near"; since "vicinity" is closer to the utterance, it is given a higher score.
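A compact sketch of this step using gensim, assuming a local copy of the ConceptNet Numberbatch vectors in word2vec text format (the file name is an assumption):

```python
# pip install gensim numpy
import numpy as np
from gensim.models import KeyedVectors

# Path is an assumption; Numberbatch is distributed in word2vec text format.
kv = KeyedVectors.load_word2vec_format("numberbatch-en.txt", binary=False)

def mean_vector(words):
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommendation_set(synonym_set, utterance_content_words, top_n=20):
    """Expand a synonym set with embedding neighbours, then score each candidate
    by its similarity to the utterance vector (Section 4.3.2)."""
    ss_vec = mean_vector(synonym_set)
    expr_vec = mean_vector(utterance_content_words)
    neighbours = [w for w, _ in kv.similar_by_vector(ss_vec, topn=top_n)
                  if w not in synonym_set]
    return sorted(((w, cosine(kv[w], expr_vec)) for w in neighbours),
                  key=lambda pair: pair[1], reverse=True)

print(recommendation_set({"near", "close"},
                         ["search", "restaurant", "near", "university"]))
```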

[Figure 4.5 illustrates the four steps of recommendation-set generation for the synonym set [near, close, nigh] and the expression [search_for, restaurant, near, university]: (1) compute the synonym-set and expression vectors, (2) measure cosine distances over the vector space, (3) find the closest words to the synonym-set vector (e.g., close_to 0.43, adjacent 0.32, located 0.15, vicinity 0.06), and (4) re-order them by similarity to the expression vector (e.g., close_to 0.71, vicinity 0.55, located 0.34, adjacent 0.09).]

Figure 4.5: Recommendation-Set Generation

4.3.3 Recommendation Set Ranker

The goal of the Word Recommender is to estimate the probability that a word/phrase will improve the diversity of paraphrases if it appears in the recommendation list. Recommending infrequent words can improve diversity because they increase the chance of introducing n-grams not yet seen in the collected paraphrases, and thus increase the diversity metrics (e.g., PINC [30], DIV [103]). However, recommendations should also be semantically related to the given utterance to avoid the generation of semantically invalid paraphrases. If a recommended word is not semantically related to the given utterance, it will have no effect on the diversity of collected paraphrases because workers may not use it. Even more dangerously, showing irrelevant recommendations may result in the generation of semantically invalid paraphrases. For example, if the intent is to "find a restaurant", suggesting a word like "kitchen", while possibly related, will result in generating paraphrases which do not specify the same intent (e.g., "look for a kitchen").

To this end, we devise a probabilistic model to rank the words in recommendation sets by estimating the probability that a given word will increase diversity while being semantically related to the intent. To achieve this, we propose two probabilistic models for measuring the semantic similarity of a word to the intent, and one probabilistic model for estimating whether recommending a word will increase the diversity of paraphrases.

Similarity Probability Model (SM)

It is important to suggest only words which are highly related to the utterance. This is partly done in the previous step, when the words in a recommendation set are scored based on how close they are to the utterance (see Section 4.3.2). However, to obtain a probability distribution over the recommended words, we can use softmax normalization [111]:

$$P(w \mid SM) = \frac{\exp(\cos(\vec{w}, \vec{expr}))}{\sum_{w' \in rs} \exp(\cos(\vec{w'}, \vec{expr}))} \qquad (4.1)$$

where $rs$ stands for the recommendation set and $w \in rs$. In fact, this probability model constitutes a unigram language model defined over the recommendation set, assuming that all other words have zero probability.

Relevance-Feedback Probability Model (RM)

As mentioned before, recommendation sets are noisy and not all of the recommendations can be used in paraphrasing. Our premise is that an implicit relevance feedback model can assist the Word Recommender in distinguishing between noise and highly related words. Intuitively, if a word has been recommended but has rarely been used in the collected paraphrases, it is most likely an unrelated word and should not be present in the next updates. Formally, the maximum likelihood estimate (MLE) of $w$ with respect to the number of times it appeared in the recommendation list and was used in the paraphrases is:

$$p_{MLE}(w) = \frac{used(w)}{shown(w)}$$

where $shown(\cdot)$ counts how many times a term has been shown to crowd workers, and $used(\cdot)$ counts the number of times a word has been used in the paraphrases. However, this estimation disregards two essential aspects:

• Not all of the collected paraphrases are valid, and only valid paraphrases should be considered while counting.

• Even if a word is highly relevant to a given utterance, it does not necessarily mean that a worker will include the word in their paraphrases. For example, imagine that there are several synonyms for a word, and all workers pick only a single one of them, while the remaining synonyms might be just as appropriate as the chosen one. Such cases affect the chance of a word being selected, so we cannot assume that a word which has not been used is completely out of scope.

To address these aspects, let $\lambda$ be the probability that a given utterance is valid, and $\gamma$ the probability that a word is not used in a paraphrase regardless of whether it is an appropriate alternative; the revised relevance probability is:

$$p_{RMLE}(w) = \frac{\lambda \, used(w) + \gamma \, (shown(w) - used(w)) + \epsilon}{shown(w) + \epsilon}$$

In our experiments, we set $\lambda = 0.80$ (based on error rates reported in various works [190, 139]), $\epsilon = 0.1$ and $\gamma = 0.5$; these are initial configurations, and finding the optimal values requires further study. The small positive number $\epsilon = 0.1$ is added to the numerator and denominator to avoid division by zero. In summary, this function reduces the weights of words which appeared in the recommendation list but were rarely used, as well as of words which have already been exploited by the paraphrases. The relevance probability can be normalized to obtain the relevance probability distribution under the relevance model (RM):

$$P(w \mid RM) = \frac{p_{RMLE}(w)}{\sum_{w' \in rs} p_{RMLE}(w')} \qquad (4.2)$$

Diversity Probability Model (DM)

To encourage diversity, infrequent words must be given more priority because they are more likely to generate new n-grams which have not been seen in the collected paraphrases and thus improve the diversity metrics. To address this issue, we used a modified version of the BM25 inverse document frequency (IDF) [177] that avoids negative numbers:


$$idf(w) = \log\!\left(1 + \frac{N - f(w) + 0.5}{f(w) + 0.5}\right) \qquad (4.3)$$

where $f(\cdot)$ represents the frequency of the word in the collected paraphrases, and $N$ is the total number of collected paraphrases. The IDF values can be normalized to yield probabilities:

$$P(w \mid DM) = \frac{idf(w)}{\sum_{w' \in rs} idf(w')} \qquad (4.4)$$

Finally, we use a linear interpolation technique [132] to approximate to what extent a word increases diversity while preserving semantics:

$$P(w) = \alpha P(w \mid SM) + \beta P(w \mid RM) + \theta P(w \mid DM) \qquad (4.5)$$

where $\alpha, \beta, \theta \in [0, 1]$ are interpolation parameters ($\alpha + \beta + \theta = 1$) that control the trade-off between the probability models. In our experiments, we kept the interpolation parameters equal. Words in each recommendation set are ranked based on Equation 4.5, and the top-m words from each set are selected to appear in the current update of the recommendation list shown to workers, where m is the rounded value of the size of the list (see Section 4.4.1) divided by the number of recommendation sets. When the recommendation list is shown, words which belong to the same recommendation set are rendered in the same color. Moreover, their final scores determine how large they appear in the current update of the word cloud.
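Putting Equations 4.1-4.5 together, the sketch below scores one recommendation set. The cosine similarities, usage counts, and frequencies are assumed to be supplied by the components described earlier, and the numbers in the example are made up for illustration.

```python
import math

def ranker_scores(candidates, cos_to_expr, shown, used, freq, n_paraphrases,
                  lam=0.8, gamma=0.5, eps=0.1, alpha=1/3, beta=1/3, theta=1/3):
    """Score each candidate word with the interpolated model of Equation 4.5.

    candidates      : words in one recommendation set
    cos_to_expr[w]  : cosine similarity of w to the utterance vector
    shown[w]/used[w]: how often w was recommended / used in (valid) paraphrases
    freq[w]         : frequency of w in the collected paraphrases
    n_paraphrases   : total number of collected paraphrases (N in Eq. 4.3)
    """
    # Eq. 4.1: softmax over cosine similarities (similarity model, SM)
    exp_sim = {w: math.exp(cos_to_expr[w]) for w in candidates}
    p_sm = {w: exp_sim[w] / sum(exp_sim.values()) for w in candidates}

    # Eq. 4.2: smoothed relevance-feedback estimate (relevance model, RM)
    rmle = {w: (lam * used[w] + gamma * (shown[w] - used[w]) + eps) / (shown[w] + eps)
            for w in candidates}
    p_rm = {w: rmle[w] / sum(rmle.values()) for w in candidates}

    # Eqs. 4.3-4.4: BM25-style IDF, normalized (diversity model, DM)
    idf = {w: math.log(1 + (n_paraphrases - freq[w] + 0.5) / (freq[w] + 0.5))
           for w in candidates}
    p_dm = {w: idf[w] / sum(idf.values()) for w in candidates}

    # Eq. 4.5: linear interpolation of the three models
    return {w: alpha * p_sm[w] + beta * p_rm[w] + theta * p_dm[w] for w in candidates}

scores = ranker_scores(
    candidates=["close_to", "vicinity", "located", "adjacent"],
    cos_to_expr={"close_to": 0.71, "vicinity": 0.55, "located": 0.34, "adjacent": 0.09},
    shown={"close_to": 5, "vicinity": 5, "located": 5, "adjacent": 5},
    used={"close_to": 3, "vicinity": 1, "located": 0, "adjacent": 0},
    freq={"close_to": 3, "vicinity": 1, "located": 0, "adjacent": 0},
    n_paraphrases=30,
)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```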

4.4 Experiments & Results

Before running a comprehensive experiment, we conducted a pilot to decide on the design of the crowdsourcing task. Based on these observations and interviews, we then evaluated the approach on the Figure-Eight crowdsourcing platform.

4.4.1 Task Design Experiment

Participants. We recruited a convenience sample of 7 participants including 2 postdoctoral researchers (P1, P2), 3 PhD students (P3, P4, P5), 1 research assistant (P6) and 1 undergraduate student (P7).


Procedure. The experiment was conducted in the following parts: (a) firstly, a brief explanation of the task was provided to the participants, as well as a few examples of invalid paraphrases (to avoid biasing participants, we did not give them any valid paraphrasing examples); (b) next, participants were assigned 6 randomly chosen utterances and asked to provide 3 paraphrases for each. Each participant encountered two utterances with a word-cloud size of 10, two with a size of 15, and two with a size of 20; and (c) finally, a follow-up semi-structured interview was conducted about the user experience of using the word cloud during the paraphrasing process.

Does word recommendation help? One of the aims of this experiment was to find out whether the automatically generated word cloud helped workers during paraphrasing. All of our participants confirmed that the word cloud assisted them, especially those whose main language was not English:

“I think the word cloud can help the users with English as the second language...” (P5)

“I like the idea of giving you some words, it actually helped me specially in those paraphrasing questions that I didn’t have any alternatives in mind” (P3)

“Well, yes definitely, specially for me that English is not my mother tongue; I also learned some new words while I was making new sentences for [...]” (P1)

How many words are appropriate in the recommendation list? Figure 4.6 shows how participants rated different word clouds (recommendation lists) based on their sizes, on a Likert scale with 5 being very appropriate. Generally, most participants preferred a size of 10; it was even considered an accelerating factor by one of the participants:

“The more compact ones seemed more useful to me. I guess it is because you don’t have to inspect too many options, which allows you to perform the task more quickly.” (P2)


Figure 4.6: Likert Assessment of Word Clouds by Size

However, a few found the word cloud of size 15 preferable. Our investigations revealed that the size of the word cloud may also be a function of the number of words in the utterance, and for short utterances showing many alternatives for a single word is sometimes considered inappropriate and time-consuming:

"...having more words makes it hard to choose." (P4)

"..., also I think 10 is enough, even less, let's say 7 or 8; 20 is definitely too much." (P6)

Therefore, in our follow-up experiments we set a limit of 10 for the total size of the word cloud and only 5 for the alternatives of a single word.

Is coloring similar words with the same color helpful? While some of the participants found word coloring very helpful, others did not understand the semantics behind the coloring:

“Word coloring is great and should be kept as it is...” (P7)

“But the coloring was confusing. I cannot really assign a specific characteristics to the used colors...while doing the task, the semantics of the colors did not come immediately and naturally to my mind” (P2)

Based on the interviews, we decided to use coloring with muted colors (avoiding bright ones). To resolve misunderstandings about colors, we also explained the coloring in the task instructions.


4.4.2 Crowdsourced Paraphrasing

To verify our approach, we randomly selected 40 utterances indexed in API-KG [229] – a knowledge graph designed for RESTful APIs – and ThingPedia [25], including utterances from different domains: Yelp, Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Facebook, and Bing Image Search. Next, we launched five paraphrasing jobs on Figure-Eight: (i) a simple baseline which simply asks crowd workers to paraphrase the given utterances; (ii) the state-of-the-art approach named Chinese Whispers (CW) [139]; (iii) the proposed approach, called Word Recommender (WR); (iv) a recommendation method with words generated by an open-source query rewriting method called SearchBetter (SB)6; and (v) finally, inspired by the Taboo game, another baseline called Taboo Words (TW), which forbids workers from using the words that have high frequencies in the already collected paraphrases; we excluded stopwords, and in each round only 5 taboo words were shown to the crowd workers, since a higher number of taboo words makes the paraphrasing task very difficult7. We did not compare our system with approaches such as replacing entities with images or showing videos because they are difficult to adopt in general, as discussed earlier in this chapter.

For each of the five jobs, we collected 10 judgments per utterance. In our jobs, a judgment is a triple of paraphrases submitted by a worker. Workers were asked to provide three paraphrases for each utterance to reduce repetitive paraphrases [95], which is a common practice [25, 95] in crowdsourced paraphrasing. Each worker gained 10 cents per judgment, and in total we spent about $258 including the transaction fee charged by the platform. Over a span of 3 days, we collected 30 paraphrases per utterance per approach from English-speaking countries, and created five datasets containing 6,000 paraphrases in total.

Procedure. First, the workers were instructed to familiarize themselves with the task and its constraints (e.g., parameter values must be used in the paraphrases as they appear in the given initial utterance). Next, for a given utterance, participants were asked to provide three paraphrases. When using one of the word recommendation methods, each worker was also asked whether the generated recommendations helped them during the task, on a Likert scale of 1 to 5.

6available at https://github.com/hathix/searchbetter; we combined the two query rewriting methods of SearchBetter (Wikipedia and rewriters) and fed the suggestions into the word cloud in the order given by the framework
7we first experimented with 10 taboo words, but workers stopped completing the task and rated its difficulty as 1 out of 5


In the following section, we discuss different aspects of our experimentation.

Cleaning. After collecting the paraphrases, we launched another crowdsourcing task to assess the collected paraphrases and determine which are correct and which are incorrect. To this end, we assigned each paraphrase to 3 workers; each worker gained 2 cents per annotated triple of paraphrases, and in total we spent about $144. To further increase the quality of the annotations and resolve disagreements between crowd workers, two authors of this chapter manually checked, discussed, and revised the labels. As shown in Table 4.1, the sizes of the datasets are roughly equal after pruning, and roughly 20% of the collected paraphrases are incorrect (except for the TW and CW methods). This is also in line with the value chosen for λ = 0.8 in Equation 4.2. As mentioned before, CW is prone to producing many incorrect paraphrases [95]. Moreover, based on our observations, the TW method makes the task very difficult for crowd workers and, as a result, many incorrect paraphrases are generated to circumvent the forbidden words. In the following sections, we compare the diversity of the datasets based only on the correct paraphrases.

Table 4.1: Crowdsourced Paraphrase Datasets

Dataset                  Total Size   Correct     Incorrect
Baseline                 1200         935 (78%)   265 (22%)
Chinese Whispers (CW)    1200         823 (69%)   377 (31%)
SearchBetter (SB)        1200         986 (82%)   214 (18%)
Taboo Words (TW)         1200         770 (64%)   430 (36%)
Word Recommender (WR)    1200         974 (81%)   226 (19%)

4.4.3 Results

Does word recommendation improve diversity? The main aim of the proposed approach is to improve the diversity of collected paraphrases by stimulating users to use words/phrases that can add variety to the paraphrases collected for a given intent. To measure the diversity of the collected paraphrases, after removing punctuation marks, lowercasing, and lemmatizing the paraphrases, we calculated the four measures described in Section 4.2: (1) TTR, (2) PINC, (3) DIV, and (4) the vocabulary size.
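To make these measures concrete, the following is a minimal Python sketch of how TTR and PINC can be computed; the whitespace tokenization and the maximum n-gram order of 4 are assumptions made for illustration, not the exact implementation used in this thesis.

def ngrams(tokens, n):
    # all contiguous n-grams of a token list, as a set of tuples
    return set(zip(*(tokens[i:] for i in range(n))))

def ttr(paraphrases):
    # Type-Token Ratio: number of distinct words divided by total number of words
    tokens = [t for p in paraphrases for t in p.lower().split()]
    return len(set(tokens)) / len(tokens)

def pinc(source, paraphrase, max_n=4):
    # PINC rewards a paraphrase for introducing n-grams absent from the source
    src, par = source.lower().split(), paraphrase.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        par_ngrams = ngrams(par, n)
        if not par_ngrams:
            continue
        overlap = len(par_ngrams & ngrams(src, n)) / len(par_ngrams)
        scores.append(1 - overlap)
    return sum(scores) / len(scores) if scores else 0.0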

Table 4.2: Diversity Comparison

Dataset                  TTR     PINC    DIV     Vocabulary Size
Baseline                 0.258   0.653   0.382   1647
Chinese Whispers (CW)    0.278   0.695   0.365   1622
SearchBetter (SB)        0.285   0.724   0.484†  1713
Taboo Words (TW)         0.338†  0.733†  0.518†  1682
Word Recommender (WR)    0.313†  0.734†  0.543†  2064

Bold illustrates the best performance for each of the diversity measures, and a † indicates two-tailed statistical significance at the 0.01 level over baseline.

Using a two-tailed independent-samples t-test, there was a significant difference in lexical diversity (TTR) for WR (M = 0.32, SD = 0.72) over the baseline (M = 0.26, SD = 0.05); t(38) = 4.01, p = 0.0001. As shown in Table 4.2, these results suggest that WR does have an effect on lexical diversity; on average, using WR enhances lexical diversity by 21.32%. Although TTR is lower for longer documents, knowing that both datasets are almost equal in size, we can compare their TTRs. Moreover, WR increased the vocabulary size by 19.92%. By comparing the vocabulary sizes of the two datasets, we can also infer that using WR yields more diverse paraphrases. On the other hand, while SB (M = 0.29, SD = 0.84) improves TTR over the baseline, the improvement is not statistically significant; t(38) = 1.82, p = 0.07. Moreover, even though TW exceeds WR in terms of TTR, the two are not comparable since there is a large difference in the number of paraphrases in the two datasets, making TTR a less suitable metric for this comparison.

TTR cannot measure how structurally diverse two datasets are; to overcome this issue, PINC has been introduced. PINC measures the percentage of n-grams in the paraphrase that do not appear in the source sentence; in short, PINC rewards introducing new n-grams. An independent-samples t-test was also conducted to compare the PINC scores. The PINC values indicated that the paraphrases generated by WR (M = 0.73, SD = 0.09) are more diverse than those written using the baseline approach (M = 0.65, SD = 0.11); t(38) = 3.42, p = 0.001. WR also improves the mean average PINC by 12.4%. Using TW (M = 0.73, SD = 0.09) also yields a statistically significant improvement in PINC; t(38) = 3.39, p = 0.001.

One problem with PINC is that it only considers n-gram changes between the source sentence and its paraphrases, without considering those between two paraphrases. DIV [103] is another diversity measure which overcomes this issue through pair-wise n-gram comparison between paraphrases. Our experiments indicate that WR yields a 42.15% improvement over the baseline. Interestingly, CW has a lower DIV than the baseline; one reason may lie in the fact that CW tries to make paraphrases diverse with respect to the source utterance, but fails to promote diversity between paraphrases. The TW approach also improves DIV by 35.6%. Given the above results, we concluded that word recommendation, as well as showing taboo words, improves not only lexical diversity but also PINC and DIV. However, TW results in generating too many incorrect paraphrases, as shown in Section 4.4.
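For reference, the significance tests reported in this section follow a standard two-tailed independent-samples t-test; the sketch below assumes that the per-utterance diversity scores (e.g., TTR) of two datasets are available as Python lists, and is only an illustration of the procedure.

from scipy import stats

def compare_diversity(scores_a, scores_b):
    # two-tailed independent-samples t-test between per-utterance diversity scores
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return t_stat, p_value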

Does Word Recommender offer appropriate alternatives? To track the quality of our dynamic word recommendation list, we asked crowd workers to rate how helpful the generated list was for that particular task. Figure 4.7 illustrates the average user rating of the 40 utterances over time (from the 1st user to the 10th user) for both WR (green) and SB (orange). This figure reveals that the proposed approach for creating the recommendation list, together with the probability model for re-ranking, works properly and improves the quality of the generated list over time. Figure 4.7 also shows the overall ratings for the WR (the green pie chart) and SB (the orange pie chart) approaches. As shown in these pie charts, most of the suggestion lists are rated 4 in the WR approach, while the most frequent rating for SB is 1. On the other hand, the performance of SB fluctuates over time with almost no improving trend. Since both approaches use the same word-embedding model, we can conclude that WR outperforms SB mostly because of the probability model described in Section 4.3.3, which reduces noise over time while still proposing new words, whereas SB only proposes new words. Moreover, Figure 4.7 indicates that crowd workers are happier with WR's dynamic lists than with those of SB.

Figure 4.7: Average User Rating over Time (average rating of the recommendation lists from the 1st to the 10th worker for Word Recommender, WR, and SearchBetter, SB, together with pie charts of the overall rating distribution for each approach)

Does word recommendation impact the semantic error rate? While improving diversity is of paramount importance, it is also essential that an approach does not encourage collecting semantically invalid paraphrases. To ensure that our proposed approach does not increase the semantic error rate8, we compared the datasets. In this comparison, we deducted the paraphrases which have spelling errors from the total number of incorrect paraphrases, assuming that the remaining incorrect paraphrases semantically differ from the original utterance. We noticed that the datasets created by the word recommendation approaches (SB and WR) have fewer semantically incorrect paraphrases. Thus, it can be concluded that using word recommendation does not increase the number of semantically incorrect paraphrases in comparison to the rest of the approaches. On the other hand, while TW generates diverse paraphrases, it has the highest number of semantically incorrect paraphrases. Based on our observations, this lies in the fact that forbidding users from using particular words makes the task very difficult to perform, and many workers started to generate garbage paraphrases to accelerate the paraphrasing process.

Does word recommendation impact naturalness? We refer to naturalness as the likelihood of an utterance occurring in a real situation. To measure how naturalness is affected by each of the crowdsourcing methods, we randomly selected 100 utterances from each of the methods.

8percentage of semantically incorrect paraphrases

Next, we launched a crowdsourcing task on Figure-Eight. Crowd workers were asked to rate how likely a given utterance is to occur in a real conversation on a scale of 1 to 5, with 5 being highly likely. In total, we spent $21 on this annotation task. Table 4.3 gives the average naturalness score given by crowd workers. As shown in the table, all methods perform roughly alike; however, the baseline, followed by the proposed method, surpasses the rest. Given that the difference is not significant, it can be concluded that word recommendation does not significantly impact the naturalness of paraphrases.

Table 4.3: Naturalness Comparison

Dataset                  Naturalness
Baseline                 3.63
Chinese Whispers (CW)    3.50
SearchBetter (SB)        3.37
Taboo Words (TW)         3.31
Word Recommender (WR)    3.57

Does requiring workers to provide three paraphrases for a given utterance adversely affect diversity? It is debatable whether crowd workers should be required to provide more than one paraphrase for a given utterance. While our approach works with any number of paraphrases obtained from a worker for a given utterance, we decided to track the effect of word recommendation on the first, second, and third paraphrases separately. By comparing the paraphrases from the crowd, we observed that second paraphrases are generally more diverse than the first, and the same is true for the third paraphrases. This holds for almost all datasets. While this comparison indicates that crowd participants generated more diverse second and third paraphrases, it does not prove that obtaining three paraphrases from one worker yields greater diversity than asking three people to provide one paraphrase each. Reasoning about the number of paraphrases requires further studies.


Table 4.4: Diversity of 1st, 2nd, and 3rd Paraphrases

Dataset                  Metric   1st Paraphrase   2nd Paraphrase   3rd Paraphrase

Baseline                 TTR      0.404            0.430            0.471
                         PINC     0.622            0.657            0.685
                         DIV      0.365            0.410            0.499

Chinese Whispers (CW)    TTR      0.452            0.486            0.506
                         PINC     0.670            0.714            0.715
                         DIV      0.330            0.476            0.451

SearchBetter (SB)        TTR      0.408            0.472            0.488
                         PINC     0.704            0.726            0.747
                         DIV      0.333            0.540            0.503

Taboo Words (TW)         TTR      0.516            0.580            0.590
                         PINC     0.717            0.742            0.741
                         DIV      0.527            0.602            0.620

Word Recommender (WR)    TTR      0.458            0.530            0.504
                         PINC     0.702            0.751            0.750
                         DIV      0.499            0.608            0.608

Does higher diversity improve bots' performance? The main reason for improving the diversity of user utterances is to build a more accurate bot. To compare the performance of bots built with each method, we used the wit.ai9 platform and built a bot per API (Yelp, Skyscanner, Spotify, Scopus, Expedia, Open Weather, Amazon AWS, Gmail, Facebook, and Bing Image Search) for each crowdsourcing method (Baseline, CW, SB, WR, and TW). Each bot was trained using the utterances crowdsourced by a particular method, excluding incorrect paraphrases. Next, in the absence of a gold dataset containing real user utterances, we evaluated each trained bot against the correct utterances in the other datasets. Table 4.5 shows the average intent-detection accuracy per bot.

9https://wit.ai


The bots trained on the dataset obtained by the proposed method yield a 35% accuracy improvement over the baseline dataset on average. Therefore, it can be inferred that diversity plays a role in the accuracy of intent detection in bot development platforms. Comparing the average accuracy of each bot with the diversity measures, we recognized that DIV is most in line with the bots' intent detection accuracy. As can be seen in Table 4.5, the bot trained on the CW dataset has the lowest accuracy, as it also has the lowest DIV value, while it outperforms the baseline in terms of TTR and PINC.

Table 4.5: Intent Detection Accuracy by Dataset

Dataset                  Accuracy
Baseline                 0.619
Chinese Whispers (CW)    0.582
SearchBetter (SB)        0.795†
Taboo Words (TW)         0.771†
Word Recommender (WR)    0.835†

Bold illustrates the best performance, and a † indicates two-tailed statistical significance at the 0.01 level over the baseline.

Does word recommendation reduce spelling errors? To count spelling errors, we used the Google Docs editor10 and manually counted the errors identified by the editor. We observed that the baseline, CW, SB, WR, and TW datasets have 29, 23, 11, 16, and 38 spelling errors, respectively. The reason behind such a reduction when using a word-recommendation-based approach (WR and SB) might lie in the fact that workers are less prone to making spelling errors when they are given the spellings of words they may use.

Does word recommendation reduce the task completion time? Task completion time indicates how long it takes for a worker to finish the task. Since the time calculated by platforms cannot account for the time a worker spends on unrelated activities (e.g., talking on the phone, having coffee), we calculated the interquartile mean (IQM) for all datasets. The IQM values for the baseline, CW, SB, TW, and WR were 47, 41, 40, 65, and 35 seconds per paraphrase, respectively.

10https://docs.google.com


Therefore, it can be inferred that using the proposed approach can slightly accelerate paraphrasing. This is also in line with the priming effect: an appropriate set of word/phrase recommendations can help workers retrieve related words faster. The proposed approach has the minimum completion time among all approaches, while the recommendation list generated by SB shows only a slight improvement in task completion time, which might be due to the lower quality of its recommendations in comparison to the proposed approach. On the other hand, forbidding workers from using taboo words increases the difficulty of the task, making the task completion time longer.
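As an illustration, the interquartile mean can be computed as below; this simple sketch assumes the number of observations is divisible by four (a weighted variant is needed otherwise) and is not the exact script used for the reported values.

def interquartile_mean(seconds):
    # drop the fastest and slowest quartiles and average the remaining times
    xs = sorted(seconds)
    q = len(xs) // 4
    middle = xs[q:len(xs) - q]
    return sum(middle) / len(middle)

print(interquartile_mean([20, 25, 30, 35, 40, 45, 90, 300]))  # 37.5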

Does word recommendation increase workers' satisfaction? Upon completion of the task, crowd workers were prompted to take a satisfaction survey to give feedback about various aspects of the task, including how easy they found the crowdsourcing job and how satisfied they were with the payment they received for providing the three paraphrases. Table 4.6 gives the average scores (on a scale of 0–5) given by crowd workers for each of the paraphrasing tasks. Based on the scores reported by Figure-Eight, workers found the word-recommendation approaches (SB and WR) easier and fairer regarding the payment. This indicates that recommending words facilitates paraphrasing. On the other hand, workers found the TW approach comparatively difficult and unfair.

Table 4.6: Worker Satisfaction

Dataset                  Ease of Job   Pay
Baseline                 3.98          3.66
Chinese Whispers (CW)    4.06          3.75
SearchBetter (SB)        4.47          4.20
Taboo Words (TW)         3.30          3.50
Word Recommender (WR)    4.62          4.64


4.4.4 Limitations

Word recommendation facilitates paraphrasing and improves the diversity of collected paraphrases. However, it is not immune to unqualified crowd workers. Cheaters and unqualified workers generate incorrect paraphrases, which can affect the quality of the recommendations [219]. While the design of the ranking probability model takes invalid paraphrases into account, they can still harm the recommendations. This can be mitigated by automatic quality control that only lets qualified paraphrases be submitted [219]. This requires automatic detection of incorrect paraphrases so that workers can only submit high-quality paraphrases. As such, noise can be reduced and consequently the quality of word suggestions can be improved. Another limitation of the proposed system arises in cases where the given words do not have many closely related words, since we used a fixed number of top-n neighbours for all words, as mentioned in Section 4.3.2. Choosing a proper value for n is debatable and depends on how a word embedding model is trained (e.g., its vector space dimension) [175, 54]. As future work, it is essential to dynamically determine the value of n for a given synonym set. Moreover, using domain-specific word embeddings can help de-noise word suggestions. As such, in a given domain, the proposed system can suggest highly related words and synonyms (e.g., in the programming domain, the word "Java" refers to a programming language, not to "coffee" or "mocha").

4.5 Conclusion

In this chapter, we showed how word recommendations can accelerate paraphrasing and improve diversity. We proposed a novel hybrid technique that combines existing advances in both automated methods and crowdsourcing. Our work aimed at addressing an important shortcoming in current crowdsourced paraphrasing, namely the priming effect. By recommending appropriate words, we sought to motivate crowd workers to enhance the diversity of their paraphrases. Our solution involved automated methods for selecting seed words and performing word expansion. Nevertheless, the main challenge is to recommend diverse but semantically relevant words. We thus devised a probabilistic model to continuously adapt the expanded list into an improved version; we relied specifically on monitoring implicit worker feedback. Ultimately, our end-to-end experiments indicate that the proposed method improved the diversity of paraphrases.

We observed that a major quality issue with serious effects is malicious workers who generate garbage paraphrases [118, 91, 195]. While we accounted for this problem in the proposed model through the implicit relevance feedback, it can still hurt the performance of the recommendation. Our future work will focus on automatically detecting invalid paraphrases. Another important aspect of crowdsourcing is to formalize and define when enough paraphrases have been collected for a given intent. Doing so is not easy [103], and it might depend on the intent detection algorithm, the desired accuracy, and many other factors. In future work, we will also target this problem, together with many other exciting opportunities as extensions to this work.


Chapter 5

Automatic Malicious Worker Detection

Automatic Malicious Worker Detection in Crowdsourced Paraphrases

In this chapter, we analyze the cheating behaviors observed in crowdsourced paraphrases. We provide a taxonomy of cheating behaviors in crowdsourced paraphrasing and discuss our solutions for detecting such behaviors. We also propose an effective approach to detect cheating behaviors. This chapter is organized as follows. We start with an introduction in Section 5.1. In Section 5.2, we discuss related work. In Section 5.3, we provide a detailed analysis of cheating behaviors and patterns in crowdsourced paraphrases. In Section 5.4, we describe the proposed techniques to detect cheaters efficiently. In Section 5.5, we evaluate the proposed techniques, and finally we provide concluding remarks and directions for future work in Section 5.6.

5.1 Introduction

We showed in Chapter 3 that a considerable percentage of crowdsourced paraphrases (up to 40%) may be of unacceptable quality [46, 217, 219]. In particular, malicious crowd workers may intentionally generate erroneous paraphrases [219, 46]. Thus, crowdsourced paraphrases must be checked for quality [219]. In this chapter, we investigate how malicious crowd workers (also called cheaters) generate erroneous paraphrases and categorize various types of cheating behaviors (e.g., using foreign languages, adding/removing characters) in crowdsourced paraphrases. Moreover, based on the identified cheating behaviors, we propose a set of features to be used in training machine learning models for detecting malicious workers.

More specifically, our contributions are two-fold:

• By manually inspecting the ParaQuality1 [219] dataset, which contains 6,000 paraphrases in 40 domains, we identify a taxonomy of common cheating behaviors in crowdsourced paraphrases (e.g., adding/removing random words to/from the sentence to generate a paraphrase). Accordingly, we discuss the characteristics of each category of cheating behaviors and how they can be detected automatically.

• Based on the identified characteristics of cheating behaviors, we identify various features from the literature to be used in the automatic detection of malicious workers [123, 219]. We also discuss the shortcomings of existing features, namely in calculating semantic similarity between multiple paraphrases and in tracking how a worker edits the given sentence to generate paraphrases. We thus propose two new features to overcome the identified shortcomings: (i) inter-paraphrase semantic similarity and (ii) the worker's editing patterns (e.g., which part of the sentence is rephrased to generate the paraphrase). Our experiments indicate that the proposed method is an effective approach to detect cheating behaviors in crowdsourced paraphrases.

5.2 Related Work

To the best of our knowledge, our work is the first to identify and categorize the cheating behaviors of crowd workers generating paraphrases. We refer the reader to Chapters 2 and 3 for background on crowdsourced utterances, their associated quality issues, and quality control in crowdsourcing tasks. Nevertheless, our work is related to the area of semantic similarity, which measures how semantically similar two sentences are. Measuring similarity between two sentences has a wide range of applications in Natural Language Processing (NLP) systems. Examples of such applications include plagiarism detection, question-answering systems, and paraphrase detection [227, 58, 126]. Recent advances in sentence embedding (e.g., the Universal Sentence Encoder [29]) can be exploited to detect if a given paraphrase conveys the same meaning.

1https://github.com/unsw-cse-soc/ParaQuality

However, existing approaches may fail to correctly estimate the extent of semantic similarity between two sentences. Examples include assigning high similarity values to a pair of sentences that share many words regardless of the semantics, and assigning a low semantic similarity to two exact paraphrases that do not share any words [205]. To reduce the impact of such issues, in this study, we introduce a new feature called IPSS. IPSS is built on the fact that a worker's paraphrases for the same sentence must semantically express the same meaning.

5.3 Cheating Behaviors

To characterize cheating behaviors in crowdsourced paraphrases, we investigated the paraphrases labeled Cheating in the ParaQuality dataset [219]. This dataset includes 6,000 annotated paraphrases (i.e., Correct, Cheating, Linguistic Errors) from 40 domains, in which crowd workers provided three paraphrases for a given sentence (in total, 40 sentences with 2,000 paraphrase triples). Based on manual inspections, we recognized 5 primary categories of cheating behaviors, as listed below.

5.3.1 Character-Level Edits

One of the common cheating behaviors is to add/remove one or a few characters to/from the given sentence (e.g., "what is the taxi fare from home to airport"), potentially resulting in spelling errors in the generated paraphrase (e.g., "what is the taxi fare from home to airportsdsdsd"). Sincere workers may also occasionally have typos in their paraphrases ("from" → "form"), but such typos may differ from those generated by malicious workers ("from" → "frommmm"). In particular, adding random characters may create nonsense words (also called gibberish words, e.g., "sdsds") in the paraphrase. In addition, such erroneous paraphrases and the corresponding sentences have a low edit distance2 since only a few characters are added to or removed from the given sentence. However, our manual inspections indicate that a sincere worker may also create a correct paraphrase with a low edit distance (e.g., "what is the taxi rate from home to airport"), which is not an act of cheating.

2edit distance between two strings is the minimum number of operations required to transform one string into the other
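For reference, the edit (Levenshtein) distance mentioned in the footnote can be computed with the standard dynamic-programming recurrence; the sketch below is a generic illustration rather than thesis-specific code.

def edit_distance(a, b):
    # minimum number of insertions, deletions, and substitutions to turn a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("from", "form"))              # 2
print(edit_distance("airport", "airportsdsdsd"))  # 6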


5.3.2 Word-Level Edits

Similar to the previous type of cheating behavior, cheaters may also add/remove words to/from the given sentence to generate a paraphrase (e.g., “what is the taxi rate from home to airport”). Having low edit-distance with the given sentence is the main characteristic of such a paraphrase. Moreover, removed/added words may result in grammatical errors (e.g., “what the taxi fare from home to airport”). Needless to say, simply having grammatical errors does not mean a worker is cheating. This type of cheating may also add new entities not mentioned in the given sentence (e.g., “what is the taxi rate from home to airport for 10 people”).

5.3.3 Random Sentences

Another cheating behavior is to rewrite the given sentence (e.g., "create a public playlist named NewPlaylist") substantially, in a very random way (e.g., "public saw many NewPlaylist this year"). Detecting such paraphrases requires sophisticated approaches that compute the semantic similarity between two sentences to decide whether they convey the same meaning. The reason stems from the fact that existing solutions fail to detect semantic divergence when sentences share many words but convey different meanings [58]. However, it is easier to detect this case of cheating when a crowd worker provides more than one paraphrase for a given sentence. Take the following paraphrases generated by a worker as an example for the sentence "Create a public playlist named NewPlaylist":

• That song hits the public NewPlaylist this year

• Public really loved that NewPlaylist played on the event

• Public saw many NewPlaylist this year

Since these are paraphrases for the same sentence, we can also consider the inter-paraphrase semantic similarity. Considering such similarities (as supplementary measures) can assist in detecting semantically incorrect paraphrases, as discussed further in Section 5.4.

5.3.4 Answering to Canonical Utterance

When a given sentence is a question (e.g., "what is the taxi fare from home to airport?") or a request for information (e.g., "I want to know the taxi fare from home to airport"), malicious workers may respond to the given sentence instead of paraphrasing it (e.g., "the taxi fare is $10").

However, this type of error may also result from the crowd worker misunderstanding the task [219]. Our analysis indicates that this type of cheating may introduce new named entities into the paraphrases which do not exist in the given sentence (e.g., "$10" in the above-mentioned paraphrase). Thus, any automated approach must check whether the entities in the paraphrase and the given sentence match. Comparing the speech act3 (e.g., answering, requesting) of a given sentence and its paraphrase can be useful in detecting these cases. However, it has been shown that existing tools for detecting speech acts do not yet perform well in domain-independent settings [219].

5.3.5 Foreign Language

Cheaters occasionally use their own native languages instead of paraphrasing in the given language. A sincere worker may also generate paraphrases in other languages as a result of misunderstanding the task [219]. However, our investigations indicate that malicious workers provide sentences in other languages which do not express the same meaning as the given sentence. Detecting such cases of cheating seems easy using language identification tools (e.g., DetectLanguage4). However, these tools may fail to detect the language of sentences with misspellings [219]. Misspellings substantially impact the performance of language detection techniques, and consequently the language is detected incorrectly for misspelled sentences. Given that the majority of words in a sentence written in another language would be considered misspellings by spell checkers for the expected language, we can condition the use of language detection tools on the sentence having many spelling errors [219].
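This heuristic can be sketched as follows; the choice of libraries (langdetect for language identification and pyspellchecker for counting out-of-vocabulary words) and the 0.5 misspelling threshold are illustrative assumptions, not the tools used in this thesis.

from langdetect import detect
from spellchecker import SpellChecker

spell = SpellChecker()

def looks_foreign(paraphrase, misspelling_ratio=0.5):
    # only trust the language detector when most words look misspelled in English
    words = paraphrase.lower().split()
    if not words:
        return False
    misspelled = spell.unknown(words)
    if len(misspelled) / len(words) < misspelling_ratio:
        return False
    return detect(paraphrase) != "en"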

3the function of a sentence in communication (e.g., answering, requesting, greeting)
4https://ws.detectlanguage.com/
5https://www.nist.gov/
6https://github.com/rrenaud/Gibberish-Detector
7https://kite.com/python/docs/nltk.probability.LaplaceProbDist
8https://github.com/delph-in/pydelphin


Category              Features

Semantic Similarity   Semantic textual similarity method proposed in [58]; Word Mover's Distance [110] between the word embeddings of the sentence and those of its paraphrase; cosine similarities and euclidean distances between the vectors of the expression and the paraphrase generated by Sentence-BERT [174], Universal Sentence Encoder [29], and Concatenated Power Mean Embeddings [179].

Editing Patterns      Edit distance between the sentence and its paraphrase; n-gram overlap, exclusive longest common prefix n-gram overlap, and Sumo [42]; Gaussian, Parabolic, and Trigonometric functions proposed in [96]; Paraphrase In N-gram Changes (PINC) [30]; Bilingual Evaluation Understudy (BLEU) [149]; Google's BLEU (GLEU) [213]; NIST's5 n-gram score function [49]; Character n-gram F-score (CHRF) [160]; and the length of the longest common subsequence.

Language Correctness  A function to detect if the paraphrase is written in English [219]; count of gibberish words in the paraphrase6; count of spelling and grammatical errors in the paraphrase; language-model score based on the probability distribution of n-grams in the given paraphrase7; functions to detect if the given paraphrase is a question, an answer, or an imperative sentence [219]; a function to detect whether the tenses/pronouns/entities of the sentence and paraphrase match; and a function to detect if the paraphrase can be parsed as a proper English sentence by delph-in8.

General               Difference between the count of characters in the expression and in the paraphrase; difference between the count of words in the expression and in the paraphrase; and the time spent by the worker paraphrasing the given sentence.

Table 5.1: Summary of Feature Library

5.4 Cheating Detection

Deep-learning-based approaches have gained popularity because they eliminate the burden of manual feature engineering. However, using such techniques requires a large amount of high-quality annotated data [73, 225]. Because of the lack of such datasets, we manually identified a set of features, as discussed in this section.


Paraphrase the following sentence:

Search for a restaurant near the university

Paraphrase 1 (required)

Paraphrase 2 (required)

Paraphrase 3 (required)

Figure 5.1: Crowdsourced Paraphrasing in figure-eight

5.4.1 Feature Engineering

Detecting a cheating behavior requires a deep understanding of how paraphrases are generated by workers. As such, based on the identified types of cheating and their characteristics, we identified high-level categories of features required for detecting cheating behaviors, including semantic similarity metrics, editing patterns, language correctness, and a few general features, as listed in Table 5.1. Asking for triple paraphrases is a common practice in crowdsourced paraphrasing [219, 95]. Figure 5.1 shows a sample paraphrasing task designed in figure-eight9 for obtaining three paraphrases for a given sentence. Given a sentence and three paraphrases, we propose two novel sets of features: Inter-Paraphrase Semantic Similarity and Edit Features.

Inter-Paraphrase Semantic Similarity (IPSS). Measuring semantic similarity between two units of text has numerous applications and has been given much attention by researchers [58, 29]. Yet, existing approaches may fail to correctly estimate the extent of semantic similarity between two sentences. Examples include assigning high similarity values to a pair of sentences that share a large number of words regardless of the semantics, and assigning a low semantic similarity to two equivalent paraphrases that do not share any words [205]. To mitigate the impact of such issues, we propose a new metric called Inter-Paraphrase Semantic Similarity (IPSS). IPSS depends on the semantic similarity of a paraphrase not only with the given sentence but also with the other two paraphrases submitted by the same worker for the same sentence. Our intuition is that the three paraphrases and the given sentence must semantically convey the same meaning, and thus the inter-paraphrase semantic similarity should also be high (as well as the similarity between the sentence and each paraphrase).

9https://www.figure-eight.com

Moreover, the inter-paraphrase similarity scores must be close to each other since they are equivalent paraphrases (indicating a low variance). As such, the model can detect if a malicious worker is generating random paraphrases (expressing different meanings).

\[ PS = \{\, sim(u, m) \mid \forall u \in \{s\} \cup P,\ \forall m \in P,\ m \neq u \,\} \]

\[ IPSS(PS) = \frac{mean(PS)}{variance(PS) + \epsilon} \qquad (5.1) \]

where P represents the set of paraphrases \(\{p_1, p_2, \ldots, p_n\}\) for the given sentence s, and sim(·) is a function to measure the semantic similarity between two pieces of text. In our implementation, we used three different similarity functions which have been reported to outperform existing solutions in the semantic textual similarity task: (1) the cosine similarity between embeddings generated by the Universal Sentence Encoder [29], (2) the cosine similarity between embeddings generated by Sentence-BERT [174], and (3) the similarity model proposed in [58].
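A minimal sketch of IPSS follows, using a single similarity function (cosine similarity over embeddings from the sentence-transformers library); the model name and the single-similarity simplification are assumptions, since the actual feature combines the three similarity functions listed above.

from statistics import mean, variance
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def ipss(sentence, paraphrases, eps=1e-6):
    texts = [sentence] + list(paraphrases)
    emb = model.encode(texts, convert_to_tensor=True)
    sims = []
    for i in range(len(texts)):          # u ranges over {s} ∪ P
        for j in range(1, len(texts)):   # m ranges over P
            if i != j:
                sims.append(float(util.cos_sim(emb[i], emb[j])))
    return mean(sims) / (variance(sims) + eps)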

Worker's Editing Patterns. As mentioned earlier, a common editing habit of cheaters is to create paraphrases by inserting/removing text at a particular position of the given sentence (or the previous paraphrase). On the other hand, sincere workers usually create diverse paraphrases by editing the given sentence extensively. Thus, locating the positions (on a scale of [0, 1], with 0 indicating the beginning and 1 the end of the paraphrase) of the main edited parts in the paraphrases can give insight into how a worker generates paraphrases. For a given pair of sentences (p1, p2), we define the edit position (EP) as follows:

\[ EP(p_1, p_2) = \begin{cases} 1 - \dfrac{index_{p_1}(lcs(p_1, p_2))}{length(p_1) + \epsilon}, & \text{if } lcs(p_1, p_2) \neq \emptyset \\ 0.5, & \text{otherwise} \end{cases} \]

where lcs denotes the longest common subsequence between the given sentences, and index_{p_1}(·) gives its position within p_1. Intuitively, it can be an act of cheating if a worker keeps editing the same part of a sentence to generate a paraphrase based on the previous paraphrase (or the sentence) without much rephrasing. Thereby, such paraphrases should have similar edit position values, and the variance of their edit positions is low. Based on that, we define a new feature for training machine learning models, called the Edit Position Score (EPS), as follows:

\[ PS = \{\, EP(p_i, p_j) \mid \forall i, j \in \{1, 2, \ldots, n\},\ i = j + 1 \,\} \cup \{ EP(s, p_1) \} \]

\[ EPS(PS) = variance(PS) \qquad (5.2) \]

EPS only considers the position of the longest common subsequence between sentences. Given that a sincere worker provides diverse paraphrases, the paraphrases should not have many common n-grams. As such, we also calculate the n-gram distance (NGD) between a sentence (s) and all of the corresponding paraphrases (P) generated by a worker:

\[ NgSet = \bigcup_{p \in P \cup \{s\}} \ \bigcup_{n=1}^{5} ngrams(p, n) \]

\[ likelihood(ngram) = \frac{\sum_{p \in P \cup \{s\}} count(ngram, p)}{|P| + 1} \]

\[ NGD(s, P) = \frac{1}{|NgSet|} \sum_{ngram \in NgSet} \big(1 - likelihood(ngram)\big) \qquad (5.3) \]

where ngrams(·) extracts the n-grams of a given sentence, and count(·) returns the number of occurrences of a given n-gram in the given sentence.
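The two editing-pattern features can be sketched as below; the longest matching block from Python's difflib is used here as a stand-in for the lcs function, and whitespace tokenization is assumed, so this is an illustration rather than the exact feature-extraction code.

from difflib import SequenceMatcher
from statistics import variance

def edit_position(p1, p2, eps=1e-6):
    # position (in [0,1]) within p1 of the longest common block of p1 and p2
    match = SequenceMatcher(None, p1, p2).find_longest_match(0, len(p1), 0, len(p2))
    if match.size == 0:
        return 0.5
    return 1 - match.a / (len(p1) + eps)

def eps_score(sentence, paraphrases):
    # Equation 5.2: variance of edit positions across consecutive paraphrases
    positions = [edit_position(sentence, paraphrases[0])]
    positions += [edit_position(paraphrases[i + 1], paraphrases[i])
                  for i in range(len(paraphrases) - 1)]
    return variance(positions)

def ngram_distance(sentence, paraphrases, max_n=5):
    # Equation 5.3: average "novelty" of n-grams across the sentence and paraphrases
    texts = [sentence] + list(paraphrases)
    def ngrams(t, n):
        toks = t.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    ngram_set = {g for t in texts for n in range(1, max_n + 1) for g in ngrams(t, n)}
    total = 0.0
    for g in ngram_set:
        count = sum(ngrams(t, len(g)).count(g) for t in texts)
        total += 1 - count / (len(paraphrases) + 1)
    return total / len(ngram_set)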

5.4.2 Malicious Worker Detection

It has been shown in [219] that a Random Forest classifier outperformed other classifiers in terms of recall and F1 score, while a Support Vector Machine (SVM) classifier achieved the highest precision. As such, we propose a model based on these algorithms. Using the aforementioned features and the Random Forest classifier, we estimate whether a paraphrase (p) for a given sentence (s) indicates a cheating (ch) behavior: p(ch|p, s). In other words, the probability that a paraphrase was generated by a cheater is the proportion of trees in the ensemble which classified the paraphrase as cheating. However, a malicious worker may also have shown cheating behaviors in their previously submitted paraphrases. The problem can therefore be reformulated as the probability of a worker cheating based on their provided paraphrases (in particular, we only consider the last three paraphrases submitted by the worker): pworker(ch|s, p1, p2, p3). To estimate whether a worker cheats, we used Support Vector Regression (SVR) conditioned on the probabilities of all submitted paraphrases showing cheating behaviors:

\[ p_{worker}(ch \mid s, p_1, p_2, p_3) = p_{worker}\big(ch \mid p(ch \mid p_1, s),\ p(ch \mid p_2, s),\ p(ch \mid p_3, s)\big) \qquad (5.4) \]
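A schematic sketch of this two-stage model with scikit-learn is given below; it assumes the per-paraphrase feature vectors (IPSS, editing patterns, and the other features of Table 5.1) have already been extracted and that cheating is encoded as label 1.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR

paraphrase_clf = RandomForestClassifier(n_estimators=100)
worker_model = SVR()

def fit(paraphrase_features, paraphrase_labels, worker_triples, worker_labels):
    # Stage 1: p(ch|p, s) as the proportion of trees voting "cheating"
    paraphrase_clf.fit(paraphrase_features, paraphrase_labels)
    # Stage 2: SVR over the three per-paraphrase probabilities of each worker triple
    triple_probs = np.vstack([paraphrase_clf.predict_proba(t)[:, 1] for t in worker_triples])
    worker_model.fit(triple_probs, worker_labels)

def worker_cheating_probability(triple_features):
    probs = paraphrase_clf.predict_proba(triple_features)[:, 1]
    return float(worker_model.predict(probs.reshape(1, -1))[0])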

5.5 Experiment & Results

Similar to [219], we used 10-fold cross-validation on the ParaQuality dataset, ensuring that the test and training folds never share paraphrases from the same domain, so that the proposed method is evaluated in a domain-independent manner.
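In practice, such a domain-independent split can be obtained with scikit-learn's GroupKFold, using the domain of each paraphrase as the group; the snippet below is a sketch of this setup rather than the exact evaluation script.

from sklearn.model_selection import GroupKFold

def domain_independent_folds(features, labels, domains, n_splits=10):
    # every sample of a domain stays in one fold, so train and test never share a domain
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(features, labels, groups=domains):
        yield train_idx, test_idx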

5.5.1 Evaluation

Table 5.2 reports the performance of the proposed method in comparison to prior work. The proposed method for estimating the worker's cheating probability (Worker Modeling) performs promisingly in comparison with the methods proposed in [219], surpassing existing approaches with higher recall and F1 scores. In the ParaQuality dataset, 18% of the paraphrases are labeled as cheating [219]. Figure 5.2 shows the cheating rate10 for each domain (40 domains in total). As shown in this figure, the proposed method works well across different domains, which makes it appropriate for use in crowdsourcing platforms for paraphrasing in any domain. Overall, if we automatically reject triple paraphrases which are detected as cheating, the cheating rate drops to 8.2%. However, if the worker is also prevented from continuing paraphrasing when they are first suspected of cheating, the cheating rate drops to 4.4%. This also indicates that submitted paraphrases can be verified automatically in real time, and if the submitted paraphrases are detected as cheating cases, the worker can be banned automatically from the paraphrasing task (potentially saving money and improving the quality of the collected dataset).

5.5.2 Error Analysis

While the proposed approach outperforms the prior work, it still fails in a few cases. The related work indicates that existing language tools frequently fail to detect linguistic errors (e.g., grammatical and spelling errors), and this affects the proposed method as well [219].

10Cheating rate is the percentage of invalid paraphrases generated by cheaters.


Model (Features)                                               Precision   Recall   F1
Worker Modeling (All Features + IPSS + Editing Patterns)       0.832       0.606    0.701
Paraphrase Modeling (All Features + IPSS + Editing Patterns)   0.848       0.555    0.671
Worker Modeling (IPSS + Editing Patterns)                      0.719       0.586    0.646
Worker Modeling (IPSS)                                         0.737       0.518    0.609
SVM [219]                                                      0.878       0.223    0.356
K-Nearest Neighbor [219]                                       0.871       0.248    0.386
Random Forest [219]                                            0.843       0.546    0.663
Maximum Entropy [219]                                          0.756       0.440    0.557
Decision Tree [219]                                            0.632       0.566    0.597
Naive Bayes [219]                                              0.473       0.426    0.449

Table 5.2: Automatic Malicious Worker Detection

Moreover, our investigations indicate that the key to detecting invalid paraphrases is measuring semantic similarity. Using state-of-the-art approaches to measure semantic similarity (i.e., the Universal Sentence Encoder), the proposed feature (IPSS) improved the model's performance over the prior work. However, we still require more accurate approaches to measure the similarity between a given sentence and its paraphrases. In addition, using the worker's editing-pattern features may also result in rejecting valid paraphrases with a low edit distance to the given sentence (e.g., "request a taxi to take me home" and "get a taxi to take me home"). Such paraphrases are valid; however, they might not be proper paraphrases, since the main reason for paraphrasing is obtaining diverse paraphrases (ideally with new wording and structure).

Figure 5.2: Cheating rate across different domains (cheating rate per domain for the original dataset, the paraphrase-modeling approach, and the worker-modeling approach)

Thus, rejecting such paraphrases may also be desirable in many applications [95, 171]. Our evaluation also indicates that in 78.4% of the cases in which the proposed model incorrectly detected a cheating behavior, the paraphrases were not proper paraphrases (they suffer from other types of quality issues such as misspellings or grammatical errors). Thus, in most cases, the proposed method only bans cheaters and low-quality workers from continuing the job (and is less likely to remove high-quality workers).

5.6 Conclusion

In this chapter, we employed a data-driven approach to investigate various cheating behaviors and their characteristics in crowdsourced paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then proposed two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity, user editing habits) and proposed a domain-independent method for modeling malicious workers. We observed that the dataset is small for training state-of-the-art deep learning models. Moreover, the dataset is unbalanced, and many of the utterances do not suffer from the cheating problem. Likewise, spelling and linguistic errors are also often interpreted as cheating behavior by the trained models. While the proposed methods positively contributed to this task, several extensions could overcome these issues. One possible extension to this work is obtaining a new collection of utterances which suffer from the cheating problem. However, it is time-consuming to crowdsource utterances and annotate them with quality issues. An alternative approach is to design another crowdsourcing task and ask workers to behave as cheaters and provide inappropriate paraphrases. Meanwhile, to prevent workers from entering overly obvious cases of cheating (e.g., random characters such as "khkjdla ld"), the proposed model can be used to accept only utterances which are not detected as cheating. As such, we can obtain a larger set of cheating utterances and build more robust cheating detection models.

Chapter 6

Automated Canonical Utterance Generation

Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications

This chapter proposes an automated approach for the generation of canonical utterances for REST (REpresentational State Transfer) APIs. We show that the generation of canonical utterances can be considered a supervised translation task in which an API method is translated into an utterance. Moreover, we propose API2CAN, a dataset containing 14,370 pairs of API methods and utterances. The dataset is built by processing a large number of public APIs. In addition, we formalize and define resources in REST APIs, and we propose a delexicalization technique (converting an API method and initial utterances into tagged sequences of resources) to let deep-learning-based approaches learn from the proposed dataset. The rest of this chapter is organized as follows. We start with an introduction in Section 6.1. In Section 6.2, we discuss related work. Section 6.3 proposes the API2CAN dataset and discusses how we built the dataset automatically by processing a large set of API specifications. In Section 6.4.2, we present the proposed approach for translating API operations to natural language. Section 6.5 presents the proposed approach for sampling values for the parameters of a given API operation. In Section 6.7, we evaluate the proposed approaches. Finally, Section 6.8 provides our conclusion.


6.1 Introduction

APIs are used for connecting devices, managing data, and invoking services [190, 7, 200]. In particular, because of its simplicity, REST is the most dominant approach for designing Web APIs [146, 178, 148]. Meanwhile, thanks to the advances in machine learning and the availability of web services, building natural language interfaces has gained attention from both researchers and organizations (e.g., Apple's Siri, Google's Virtual Assistant, IBM's Watson, Microsoft's Cortana). As discussed in Chapter 2, virtual assistants often employ supervised models which require a large set of natural language utterances (e.g., "get a customer with id being 1") paired with their corresponding intents or executable forms (e.g., SQL queries, API calls, logical forms). In this chapter, given the popularity of REST APIs (based on the well-known HTTP protocol), we focus on one of the most common types of executable forms, called operations.

Annotated Utterance Get a customer with id being 1

Operation GET /customers/ {customer_id}

HTTP Verb Endpoint (URI/Path) Parameter

In REST APIs, an operation (also called an API method) consists of an HTTP verb (e.g., GET, POST), an endpoint (e.g., /customers), and a set of parameters1 (e.g., query parameters). Figure 6.1 shows the different parts of a REST request in HTTP. As shown in Figure 6.2 and discussed in Chapter 2, collecting such pairs requires obtaining canonical utterances and paraphrasing them into new variations in order to live up to the richness of human language [25, 190, 219]. Paraphrasing approaches (e.g., crowdsourcing, automatic paraphrasing systems) have made the second step less costly [219, 25, 139], but existing approaches (e.g., utterance templates and rule-based approaches) for generating the canonical utterances are still limited, and they are not scalable (see Chapter 2) [25]. In other words, adding new APIs to a particular virtual assistant requires manual effort to revise hand-crafted grammars and generate training samples for new domains. With the growing number of APIs and modifications of existing APIs, automated bot development has become paramount, especially for virtual assistants which aim at servicing a wide range of tasks [25, 190].

1In this chapter, to show the parameters of an operation, we use curly brackets with two parts separated by a colon (e.g., {customer_id:1}): the first part gives the name of the parameter, and the second part indicates a sample value for the parameter


POST /customers/1/accounts?brief=true HTTP/1.1
Host: bank.api
Accept: application/json
Content-Type: application/json
...
Authorization: Bearer mt0dgHmLJMV_PxH23Y

{
  "account-type": "saving",
  "opening-date": "01/01/2020"
}

Figure 6.1: Example of an HTTP POST Request (annotated in the original figure with the HTTP method, endpoint and path parameter, query parameter, protocol, header parameters, and body/payload)

Figure 6.2: Classical Training Data Generation Pipeline (API spec → executable form generation, e.g., GET /customers/{customer_id} → canonical sentence generation, e.g., "get the customer with id being «customer_id»" → paraphrase generation, e.g., "get the first customer" → training utterances)

Supervised approaches such as sequence-to-sequence models can be used for translating operations to canonical utterances. However, the key challenge is the lack of training data (pairs of operations and canonical utterances) for training domain-independent models. In this chapter, we propose API2CAN, a dataset containing 14,370 pairs of operations and canonical utterances. The dataset is generated automatically by processing a large set of OpenAPI specifications2 (based on the description/summary of each operation). However, deep-learning-based approaches such as sequence-to-sequence models require much larger sets of samples to train from (ideally millions of training samples). That is to say, sequence-to-sequence models easily overfit small training datasets, and issues such as out-of-vocabulary (OOV) words can negatively impact their performance.

2previously known as Swagger specification

To overcome such issues, we propose a delexicalization technique to convert an operation into a sequence of predefined tags (e.g., singleton, collection) based on RESTful principles and design guidelines (e.g., the use of plural names for collections of resources, the use of HTTP verbs). In summary, our contribution is three-fold:

• A Dataset. We propose a dataset called API2CAN, containing annotated canonical templates (a canonical utterance in which parameter values have been replaced with placeholders, e.g., "get a customer with id being «id»") for 14,370 operations of 985 REST APIs. We automatically built the dataset by processing a large set of OpenAPI specifications, and we converted operation descriptions to canonical templates based on a set of heuristics (e.g., extracting a candidate sentence, injecting parameter placeholders into the method descriptions, removing unnecessary words). We then split the dataset into three parts (test, train, and validation sets).

• A Delexicalization Technique. Deep-learning algorithms such as sequence-to-sequence models require millions of training pairs to learn from. To help such models learn from smaller datasets, we propose a delexicalization technique to convert the input (operation) and output (canonical template) of such models into a sequence of predefined tags called resource identifiers. The proposed approach is based on the concept of a resource in RESTful design. In particular, we formalize various kinds of resources (e.g., collection, singleton) in REST APIs. Next, using the identified resource types, we propose a delexicalization technique to replace mentions of each resource (e.g., customers) with a corresponding resource identifier (e.g., Collection_1). As such, for a given operation (e.g., GET /customers/{customer_id}), the model learns to translate the delexicalized operation (e.g., GET Collection_1 Singleton_1) to a delexicalized canonical template (e.g., "get a Collection_1 with Singleton_1 being «Singleton_1»"). A resource identifier consists of two parts: (1) the type of resource and (2) a number n which indicates the n-th occurrence of that resource type in the given operation. Resource identifiers are then used at translation time to lexicalize the output of the sequence-to-sequence model (e.g., "get a Collection_1 with Singleton_1 being «Singleton_1»") to generate a canonical template (e.g., "get a customer with customer id being «customer_id»").


Delexicalization is done to reduce the impact of OOV words and to force the model to learn the pattern of translating the resources in an operation to a canonical template (rather than translating a sequence of words); a simplified sketch of this idea appears after this list.

• Analysis of Public REST APIs. We analyze and give insight into a large set of public REST APIs. This includes how REST APIs are designed in practice and how they drift from RESTful principles (design guidelines such as using plural names and the appropriate use of HTTP verbs). We also provide insight into the distribution of parameters (e.g., parameter types and locations) and how values can be sampled for various types of parameters to generate canonical utterances out of canonical templates using API specifications (e.g., example values, similar parameters with sample values). Automatically sampling values for parameters is essential for the automatic generation of canonical utterances because current bot development platforms (e.g., IBM Watson) require annotated utterances (not canonical templates with placeholders).
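The following sketch illustrates the delexicalization idea on URL paths only: plural path segments are tagged as collections and path parameters as singletons. The actual resource tagger in this chapter formalizes more resource types and also handles query/body parameters; the plural test via the inflect library is an assumption made for this illustration.

import re
import inflect

plural_engine = inflect.engine()

def delexicalize(verb, path):
    tags, mapping = [], {}
    counter = {"Collection": 0, "Singleton": 0}
    for segment in path.strip("/").split("/"):
        if re.fullmatch(r"\{.+\}", segment):
            kind = "Singleton"                                 # a path parameter
        elif plural_engine.singular_noun(segment.replace("-", " ")):
            kind = "Collection"                                # plural name -> collection of resources
        else:
            kind = "Singleton"
        counter[kind] += 1
        tag = f"{kind}_{counter[kind]}"
        tags.append(tag)
        mapping[tag] = segment.strip("{}")
    return f"{verb} " + " ".join(tags), mapping

delexicalized, mapping = delexicalize("GET", "/customers/{customer_id}")
# delexicalized -> "GET Collection_1 Singleton_1"
# mapping       -> {"Collection_1": "customers", "Singleton_1": "customer_id"}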

6.2 Related Work

In this section, we discuss the background on building natural language interfaces for REST APIs, and we refer the interested reader to Chapter 2 for a detailed overview of dialog systems and how training utterances are acquired.

6.2.1 REST APIs

REST is an architectural style and a guideline for how to use the HTTP protocol3 when designing Web services [62]. RESTful web services leverage HTTP using specific architectural principles (i.e., addressability, uniform interface) [151]. Since REST is a guideline without standardization, it is not surprising that API developers only partially follow the guidelines or interpret REST in their own ways [178]. In particular, according to the uniform interface principle, resources must be accessed and manipulated using HTTP methods (e.g., DELETE, GET) and status codes (e.g., using "201" to show that a resource was created, and "404" to show that a resource does not exist) [218, 178]. The uniform interface requires APIs to be developed uniformly to ensure that API users can understand the functionality of each operation without reading tedious and long descriptions.

3REST isn’t protocol-specific, but it is designed over HTTP nowadays

To ensure a uniform interface, API developers are required to follow design patterns (e.g., using plural names to name collections of resources, using lowercase letters in paths). Existing works have listed not only these patterns but also anti-patterns in designing interfaces of REST APIs [147, 178, 157]. Examples of anti-patterns include using underscores in paths and adding file extensions in paths [148, 146].

In this chapter, we build upon existing works on designing interfaces for REST APIs. In particular, we formalize resource types based on patterns and anti-patterns recognized in prior works and build a resource tagger to annotate the segments of a given operation with resource types.

6.2.2 Conversational Agents and Web APIs

Research on conversational agents dates back decades [207]. However, only a few works have targeted web APIs, particularly because of the lack of training samples [190, 7, 200]. In the absence of training data, operation descriptions have been used for detecting the user’s intent [200]. However, operations often lack proper descriptions (or have long descriptions containing unnecessary information), and operation descriptions may share the same vocabulary within a single API, making it difficult for the bot to differentiate between operations [200]. Moreover, these descriptions are rarely similar to the natural language utterances which bot users use to interact with bots. That is to say, these descriptions were originally written to document operations, not to train bots [7, 200].

In our work, by adopting ideas from the principles of RESTful design and machine translation techniques, we tackle the main issue, which is creating canonical utterances for RESTful APIs. As opposed to current techniques, the proposed approach is domain-independent and can automatically generate initial utterances without human effort. We thus pave the way for automating the process of building virtual assistants that serve a large number of tasks, by automating the generation of training datasets for new/updated APIs.


6.3 The API2CAN Dataset

In this section, we explain the process of building the API2CAN dataset, and we provide its statistics (e.g., size).

6.3.1 API2CAN Generation Process

To generate the training dataset (pairs of operations and canonical utterances), we obtained the OpenAPI specifications indexed in OpenAPI Directory4. OpenAPI Directory is a Wikipedia for REST APIs5, and the OpenAPI specification is a standard documentation format for REST APIs. As shown in Figure 6.3, an OpenAPI specification includes a description and information about the parameters (e.g., data types, examples) of each operation. We obtained the latest version of each API indexed in OpenAPI Directory, collecting 983 APIs containing 18,277 operations in total (18.59 operations per API on average). Finally, we generated canonical utterances for each of the extracted operations as described in the rest of this section and illustrated in Figure 6.4.

paths:
  /customers/{customer_id}:
    get:
      description: gets a customer by its id
      summary: returns a customer by its id
      parameters:
        - { name: customer_id, in: path, description: customer identifier, required: true, type: string }

Figure 6.3: Excerpt of an OpenAPI Specification

Candidate Sentence Extraction. We extract a candidate sentence from either the summary or the description of the operation specification. For a given operation, the description (and summary) of the operation (e.g., “gets a [customer](#/definitions/Customer) by id. The response contains ...”) is pre-processed by removing HTML tags, lowercasing, and removing hyperlinks (e.g., “gets a customer by id. the response contains ...”), and then it is split into its sentences (e.g., “gets a customer by id.”, “the response contains ...”). Next, the first sentence starting with a verb (e.g., “gets a customer by id”) is chosen as a potential canonical utterance, and its verb is converted to its imperative form (e.g., “get a customer by id”).

4https://github.com/APIs-guru/openapi-directory/tree/master/APIs 5https://apis.guru/browse-apis/



[Figure 6.4 illustrates the extraction pipeline: from the operation description/summary (e.g., “... gets the [customer](#/definitions/Customer) by id ...”), a candidate sentence starting with a verb is extracted (“gets the customer by id”), converted to an imperative sentence (“get the customer by id”), and the path parameters are injected using the CFG (“get the customer with id being «id»”) to produce the canonical template.]

Figure 6.4: Process of Canonical Utterance Extraction

Parameter Injection. While the extracted sentence is usually a proper English sentence, it cannot be considered a user utterance. That is because the sentence often points to the parameters of the operation without specifying their values. For example, given an operation like “GET /customers/{customer_id}”, the extracted sentence is often similar to “get a customer by id” or “return a customer”. However, we are interested in annotated canonical utterances such as “get the customer with id being «id»” and “get the customer when its id is «id»”, where “«id»” is a sampled value for customer_id. To consider parameter values in the extracted sentence, we created a context-free grammar (CFG), briefly shown in Table 6.1. This grammar has been created based on our observations of how operation descriptions are written by API developers (how parameters are mentioned in the extracted candidate sentences). With this


Table 6.1: Parameter Replacement Context-Free Grammar

Rules:
  N   −→ {PN} | {NPN} | {LPN} | {RN} | {NRN} | {LRN}
  CPX −→ ‘by’ | ‘based on’ | ‘by given’ | ‘based on given’ | ...
  R   −→ N | CPX N | N CPX N

Symbols:
  {PN}   Parameter Name (e.g., “customer_id”, “CustomerID”, “CustomersID”)
  {NPN}  Normalized PN, obtained by splitting concatenated words and lowercasing (e.g., “customer id”, “customers id”)
  {LPN}  Lemmatized NPN (e.g., “customer id”)
  {RN}   Resource Name (e.g., “Customers”)
  {NRN}  Normalized RN (e.g., “customers”)
  {LRN}  Lemmatized NRN (e.g., “customer”)

With this grammar, a list of possible mentions of parameters in the operation description is generated (e.g., “by customer id”, “based on id”, “with the specified id”). Then the lengthiest mention found in the sentence is replaced with “with NPN being «PN»”, where NPN and PN are the human-readable version of the parameter name (e.g., customer_id −→ customer id) and its actual name, respectively (e.g., “get a customer with customer id being «customer_id»”).

We also observed that path parameters are usually not mentioned in operation descriptions in API specifications. For example, in an operation description like “returns an account for a given customer”, the path parameters accountId and customerId are absent, but the lemmatized names of the collections “customer” and “account” are present. By using the information obtained from detecting such resources (see Section 6.4.2), it is possible to convert the description into “return an account with id being «account_id» for a given customer with id being «customer_id»”.

In the process of generating the API2CAN dataset, a few types of parameters were automatically ignored. In particular, we did not include header parameters6 since they are mostly used for authentication, caching, or exchanging information such as Content-Type and User-Agent; such parameters do not specify entities of users’ intentions.
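To illustrate the parameter-injection step, the following sketch enumerates the mentions produced by a small fragment of the grammar in Table 6.1 and replaces the lengthiest match with the “with NPN being «PN»” clause. This is a simplified, hypothetical implementation (the helper names and the reduced connector list are ours, not the released API2CAN code):

import re


def normalize(param_name):
    # {NPN}: split concatenated words (camelCase / snake_case) and lowercase,
    # e.g., "customer_id" -> "customer id", "CustomerID" -> "customer id"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", param_name.replace("_", " "))
    return spaced.lower()


def candidate_mentions(param_name):
    # A tiny fragment of the grammar: R -> N | CPX N, with CPX drawn from Table 6.1
    names = {param_name, normalize(param_name)}
    connectors = ["by", "based on", "by given", "based on given", "with the specified"]
    mentions = set(names)
    for c in connectors:
        for n in names:
            mentions.add("{} {}".format(c, n))
    return mentions


def inject_parameter(sentence, param_name):
    # Replace the lengthiest mention found in the sentence with "with NPN being «PN»"
    replacement = "with {} being «{}»".format(normalize(param_name), param_name)
    for mention in sorted(candidate_mentions(param_name), key=len, reverse=True):
        if mention in sentence:
            return sentence.replace(mention, replacement, 1)
    return sentence + " " + replacement  # parameter not mentioned: append the clause


print(inject_parameter("get a customer by customer id", "customer_id"))
# -> get a customer with customer id being «customer_id»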

6Header fields are components of the header section of request in the Hypertext Transfer Protocol (HTTP).


Likewise, using a list of predefined parameter names (e.g., auth, v1.1), we automatically ignored authentication and versioning parameters because bot users are not expected to directly specify such parameters while talking to a bot. Moreover, since the payload of an operation can contain inner objects, we assume that all attributes in the expected payload of an operation

are flattened. This is done by concatenating the ancestors’ attributes with the inner objects’ attributes. For instance, the parameters in the following payload are flattened to “customer name” and “customer surname”:

{
  "customer": {
    "name": "string",
    "surname": "string"
  }
}

As such, we convert complex objects to a list of parameters that can be asked from a user during a conversation.
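A minimal sketch of this flattening step (our own illustrative code, not the exact API2CAN implementation) recursively concatenates ancestor attribute names with inner attribute names:

def flatten_payload(schema, prefix=""):
    """Flatten a nested payload schema into a list of human-readable parameter names."""
    params = []
    for name, value in schema.items():
        readable = (prefix + " " + name).strip()
        if isinstance(value, dict):
            # Inner object: recurse, carrying the ancestor's attribute name as prefix
            params.extend(flatten_payload(value, readable))
        else:
            params.append(readable)
    return params


payload = {"customer": {"name": "string", "surname": "string"}}
print(flatten_payload(payload))  # ['customer name', 'customer surname']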

6.3.2 Dataset Statistics

By processing all API specifications, we were able to automatically generate a dataset called API2CAN7, which includes 14,370 pairs of operations and their corresponding canonical utterances. We next divided the dataset into three parts as summarized in Table 6.2, and manually checked and corrected the extracted utterances in the test dataset to ensure a fair assessment of models learned on the dataset8.

Table 6.2: API2CAN Statistics

Dataset              APIs   Size
Train Dataset        858    13,029
Validation Dataset   50     433
Test Dataset         50     908

Figure 6.5 shows the number of operations in API2CAN based on the HTTP verbs (e.g., GET, POST). As shown in Figure 6.5, the majority of operations are of GET methods which are usually used for retrieving information (e.g., “get

7https://github.com/unsw-cse-soc/API2CAN
8The train and validation datasets will also be manually revised in the near future.



the list of customers”), followed by POST methods, which are usually used for creating resources (e.g., “creating a new customer”). The DELETE, PUT, and PATCH methods are also used for removing (e.g., “delete a customer by id being «id»”), replacing (e.g., “replace a customer by id being «id»”), and partially updating (e.g., “update a customer by id being «id»”) a resource.

[Pie chart: GET 7,749 operations (54%) and POST 3,391 (23%), with the remaining 1,414 (10%), 1,259 (9%), and 557 (4%) operations spread across DELETE, PUT, and PATCH.]

Figure 6.5: API2CAN Breakdown by HTTP Verb

Figure 6.6 also presents the distribution of the number of segments in the operations9 as well as the number of words in the generated canonical templates. As shown in Figure 6.6, most operations consist of fewer than 14 segments, with 4 being the most common. Given the typical number of segments in the operations, Neural Machine Translation (NMT)-based approaches can be used for the generation of canonical sentences [162, 36]. On the other hand, the canonical sentences in the API2CAN dataset are longer. The reason behind such lengthier utterances is the existence of parameters: operations with more parameters tend to produce lengthier templates. However, given the maximum length of the canonical sentences, NMT-based approaches can still perform well [36].

6.4 Neural Canonical Sentence Generation

Neural Machine Translation (NMT) systems are usually based on an encoder-decoder architecture to directly translate a sentence in one language to a sentence in a different language. As shown in Figure 6.7, generating a canonical template for a given operation can also be considered a translation task. As such, the operation

9For example, “GET /customers/{customer_id}” has two segments: “customers” and “{customer_id}”

[Figure 6.6 plots, in thousands, the frequency of operation segment counts and canonical template word counts for lengths from 0 to 32.]

Figure 6.6: API2CAN Breakdown by Length

is encoded into a vector, and the vector is then decoded into an annotated canonical template. However, the main challenge in building such a translation model is the lack of a large training dataset. Since deep-learning models are data-hungry, training requires a very large and diverse set of training samples (ideally millions of pairs of operations and their associated user utterances). As mentioned in the previous section, we automatically generated a dataset called API2CAN. However, such a dataset is still not large enough for training sequence-to-sequence models.

Having a large set of training samples also requires a very large and diverse set of operations. However, such a large set of APIs and operations is not available. One of the serious repercussions of the lack of such a set of operations is that training samples lack a very large number of words that could possibly appear in operations (but did not appear in the training dataset). As a result, models trained on such datasets will face many out-of-vocabulary words at runtime. To address this issue, we propose a delexicalization technique called resource-based delexicalization. As such, we reduce the impact of the out-of-vocabulary problem and force the model to learn the pattern of translating the resources in an operation to a canonical template (instead of translating a sequence of words).

6.4.1 Resources in REST

In RESTful design, the primary data representation is called a resource. A resource is an object with a type, associated data, relationships to other resources, and a

set of HTTP verbs (e.g., GET, POST) that operate on it. Designing RESTful APIs often involves following conventions in structuring URIs (endpoints) and naming resources. Examples include using plural nouns for naming resources, using the “GET” method for retrieving a resource, and using the “POST” method for creating a new resource.

In RESTful design, resources can be of various types. Most commonly, a resource can be a document or a collection. A document, which is also called a singleton resource, represents a single instance of the resource. For example, “/customers/{customer_id}” represents a customer that is identified by a path parameter (“customer_id”). On the other hand, a collection resource represents all instances of a resource type, such as “/customers”. Resources can also be nested. As such, a resource may contain a sub-collection (“/customers/{customer_id}/accounts”) or a singleton resource (e.g., “/customers/{customer_id}/accounts/{account_id}”).

In RESTful design, CRUD actions (create, retrieve, update, and delete) over resources are expressed by HTTP verbs (e.g., GET, POST). For example, “GET /customers” represents the action of getting the list of customers, and “POST /customers” indicates the action of creating a new customer. However, some actions might not fit into the world of conventional CRUD operations. In such cases, controller resources are used. Controller resources are like executable functions, with inputs and return values. REST APIs rely on action controllers to perform application-specific actions that cannot be logically mapped to one of the standard HTTP verbs. For example, an operation such as “GET /customers/{customer_id}/activate” can be used to activate a customer. Moreover, while it is unconventional, adjectives are also occasionally used for filtering resources. For example, “GET /customers/activated” means getting the list of all activated customers. In this chapter, such adjectives are called attribute controllers.

While the above-mentioned principles are followed by many API developers, there are still many APIs that violate these principles. By manually exploring APIs and prior works [147, 178, 157], we identified some unconventional resource types used in designing operations, as summarized in Table 6.3. A common drift from RESTful principles is the use of programming conventions in naming resources (e.g., “createActor”, “get_customers”). Aggregation functions (e.g., sum, count) and the expected output format of an operation (e.g., “json”, “tsb”, “txt”) are also used in designing endpoints. Words similar to “search” (e.g., “query”, “item-search”) are used to indicate that the operation looks for resources based on given criteria.

Moreover, collections are occasionally filtered/sorted by using keywords such as “filtered-by”, “sort-by”, or by appending a resource name to “By” (e.g., “ByName”, “ByID”). Segments in the endpoints may also indicate API versions (e.g., v1.12) or authentication endpoints (e.g., auth, login). Even though the aforementioned types of resources are against the conventional design guidelines of RESTful design, they are important to detect since they are still used by API developers in practice.

Table 6.3: Resource Types

Resource Type          Example
Collection             /customers
Singleton              /customers/{customer_id}
Action Controller      /customers/{customer_id}/activate
Attribute Controller   /customers/activated
API Specs              /api/swagger.yaml
Versioning             /api/v1.2/search
Function               /AddNewCustomer
Filtering              /customers/ByGroup/{group-name}
Search                 /customers/search
Aggregation            /customers/count
File Extension         /customers/json
Authentication         /api/auth

6.4.2 Resource-based Delexicalization

In resource-based delexicalization, the input (API call) and output (canonical template) of the sequence-to-sequence model are converted to a sequence of resource identifiers, as shown in Figure 6.7. This is done by replacing mentions of resources (e.g., customers, customer) with a corresponding resource identifier (e.g., Collection_1). A resource identifier consists of two parts: (i) the type of resource and (ii) a number n which indicates the n-th occurrence of that resource type

in a given operation. This number is later used in the lexicalization of the output of the sequence-to-sequence model to generate a canonical template. To detect resource types, we used the Resource Tagger shown in Algorithm 1. We convert the raw sequence of words in a given operation (e.g., “GET /customers/{customer_id}/accounts”) to a sequence of resource identifiers (e.g., “get Collection_1 Singleton_1 Collection_2”). Likewise, mentions of resources in the canonical templates are replaced with their corresponding resource identifiers (e.g., “get all Collection_1 for the Collection_2 with Singleton_1 being Singleton_1”). The intuition behind these conversions is to help the model focus on translating a sequence of resources instead of words.

When using the model for generating canonical templates, the tagged resource identifiers are replaced with their corresponding resource names (e.g., Collection_2 −→ customers). Meanwhile, in the process of replacing resource tags, grammatical errors might occasionally occur, such as having plural nouns instead of singular forms. To make the final generated canonical template more robust, we used LanguageTool10 (an open-source tool for automatically detecting and correcting linguistic errors) to correct linguistic errors in the generated canonical templates.
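The two directions of this mapping can be sketched as follows. This is a simplified illustration under our own naming (the real implementation relies on the Resource Tagger of Algorithm 1, and for brevity we pass already-singularized, human-readable resource names):

def delexicalize(tagged_resources, verb):
    """Map tagged resources (name, type) to identifiers such as Collection_1, Singleton_1."""
    counters, identifiers, mapping = {}, [], {}
    for name, rtype in tagged_resources:            # e.g., ("customer", "Collection")
        counters[rtype] = counters.get(rtype, 0) + 1
        identifier = "{}_{}".format(rtype, counters[rtype])
        identifiers.append(identifier)
        mapping[identifier] = name
    return "{} {}".format(verb.lower(), " ".join(identifiers)), mapping


def lexicalize(template, mapping):
    """Replace resource identifiers in the model output with the corresponding resource names."""
    for identifier, name in mapping.items():
        template = template.replace(identifier, name)
    return template


source, mapping = delexicalize([("customer", "Collection"), ("customer id", "Singleton")], "GET")
print(source)  # get Collection_1 Singleton_1
# Assume the sequence-to-sequence model emits this delexicalized canonical template:
model_output = "get a Collection_1 with Singleton_1 being «Singleton_1»"
print(lexicalize(model_output, mapping))
# -> get a customer with customer id being «customer id»
# (a grammar checker such as LanguageTool is then applied to fix, e.g., plural/singular issues)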

6.5 Parameter Value Sampling

To obtain canonical utterances, values must be sampled for the parameters (placeholders) inside a given canonical template. The sampled values help to generate canonical utterances, which are understandable sentences without any placeholders. Canonical utterances can later be paraphrased, either automatically or manually by crowd workers, to diversify the training samples. In this section, we investigate how values can be sampled for parameters of REST APIs. More specifically, we identified five main sources as follows.

1. Common Parameters. Parameters such as identifiers (e.g., customer_id), email addresses, and dates are ubiquitous in REST APIs. We built a set of such parameters paired with values. As such, a short random string or numeric value is generated for identifiers based on the parameter data

10https://languagetool.org


Algorithm 1: Resource Tagger
Input: segments of the operation
Result: List of resources
  resources ←− [];
  for i ←− length(segments) down to 1 do
      current ←− segments[i];
      resource ←− new Resource();
      resource.name ←− current;
      previous ←− φ;
      if i > 1 then
          previous ←− segments[i − 1];
      end
      resource.type ←− “Unknown”;
      if current is a path parameter then
          if previous is a plural noun and an identifier then
              resource.type ←− “Singleton”;
              resource.collection ←− previous;
          else
              resource.type ←− “Unknown Param”;
          end
      else
          if current starts with “by” then
              resource.type ←− “Filtering”;
          else if current in [“count”, “min”, ...] then
              resource.type ←− “Aggregation”;
          else if current in [“auth”, “token”, ...] then
              resource.type ←− “Authentication”;
          else if current in [“pdf”, “json”, ...] then
              resource.type ←− “File Extension”;
          else if current in [“version”, “v1”, ...] then
              resource.type ←− “Versioning”;
          else if current in [“swagger.yaml”, ...] then
              resource.type ←− “API Specs”;
          else if any of [“search”, “query”, ...] in current then
              resource.type ←− “Search”;
          else if current is a phrase and starts with a verb then
              resource.type ←− “Function”;
          else if current is a plural noun then
              resource.type ←− “Collection”;
          else if current is a verb then
              resource.type ←− “Action Controller”;
          else if current is an adjective then
              resource.type ←− “Attribute Controller”;
          end
      end
      resources.append(resource);
  end
  return reversed(resources)


[Figure 6.7 illustrates the approach: the operation “GET /customers/{id}/accounts” is converted to the sequence of resource identifiers “get Collection_1 Singleton_1 Collection_2”, which the recurrent encoder-decoder translates into the delexicalized template “get all Collection_1 for the Collection_2 with Singleton_1 being «Singleton_1»”; this output is then lexicalized into the canonical template “get all accounts for the customer with id being «id»”.]

Figure 6.7: Canonical Template Generation via Resource-based Delexicalization

type. Likewise, mock email addresses and dates are generated automatically.

2. API Invocation. By invocation of API methods that return a list of resources (e.g., “GET /customers”), we can obtain a large number of values for various attributes (e.g., customer names, customer ids) of the resource. Such values are reliable since they correspond to real values of entities in the retrieved resources. Thus they can be used reliably to generate canonical utterances out of canonical templates.

3. OpenAPI Specification. An OpenAPI specification may include example or default values11 for the parameters of each operation. Since these

11An example illustrates what the value is supposed to be for a given parameter. But a default value is what the server uses if the client does not provide the value.


values are provided by API owners, they are reliable. Moreover, the API specification specifies the data types of parameters, which can also be used to automatically generate values for parameters in the absence of example and default values (see the sketch after this list). For example, in the case of enumeration types (e.g., gender −→ [MALE, FEMALE]), one of the elements is randomly selected as a parameter value. In the case of numeric parameters (e.g., size), a random number is generated within the range specified in the API specification (e.g., between 1 and 10). Likewise, for parameters whose values follow regular expressions (e.g., “[0-9]%” indicates a string that has a single digit before a percent sign), random sequences are generated to fulfill the given pattern in the API specification (e.g., “8%”).

4. Similar Parameters. Having a large set of API specifications, example values can be found from similar parameters (sharing the same name and data type). This is possible by processing the parameters of API repositories such as the OpenAPI Directory.

5. Named Entities. Knowledge graphs provide information about various entities (e.g., cities, people, restaurants, books, authors). Examples of such knowledge graphs include Freebase [15], DBpedia [115], Wikidata [201], and YAGO [172]. For a given entity type such as “restaurant” in the restaurant domain, these knowledge graphs might contain numerous entities (e.g., KFC, Domino’s). Such knowledge bases can be used to sample values for a given parameter if the name of the parameter matches an entity type. In this chapter, we use Wikidata to sample values for entity types. Wikidata is a knowledge graph which is populated by processing Wikimedia projects such as Wikipedia.
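As a concrete illustration of source 3 (OpenAPI specifications), the following hedged sketch samples a value from the example/default values, enumerations, and numeric ranges that a specification may declare. The code and the parameter-schema dictionary are our own simplification, not the released implementation:

import random
import string


def sample_value(param):
    """Sample a value for a parameter given (part of) its OpenAPI schema."""
    if "example" in param:                      # values provided by the API owner are preferred
        return param["example"]
    if "default" in param:
        return param["default"]
    if "enum" in param:                         # e.g., gender -> [MALE, FEMALE]
        return random.choice(param["enum"])
    if param.get("type") == "integer":          # random number within the declared range
        return random.randint(param.get("minimum", 0), param.get("maximum", 10))
    if param.get("type") == "string":
        # a regex in "pattern" would need a dedicated generator; fall back to a short random string
        return "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return None


print(sample_value({"name": "size", "type": "integer", "minimum": 1, "maximum": 10}))
print(sample_value({"name": "gender", "type": "string", "enum": ["MALE", "FEMALE"]}))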

6.6 API2CAN Service

The API2CAN dataset is also accompanied by a set of microservices called the API2CAN Service. The service is implemented as a standalone open-source REST API in Python, and it is accessible from https://github.com/unsw-cse-soc/API2CAN. This REST service provides several functionalities as follows:


• Parsing OpenAPI Specification. This microservice is able to parse the given API specification (in YAML format) and extract API elements such as operations and their parameters in JSON format.

• Generating Canonical Utterances. This microservice generates canonical utterances based on the approaches used in the generation of the API2CAN dataset. In a nutshell, two approaches are used for generating canonical utterances, as introduced in this chapter: (i) converting the operation summary, as briefly introduced in the previous section, and (ii) the resource-based translator, which relies on the notion of resources in REST APIs and is also proposed in this chapter.

• Sampling Parameter Values. This microservice generates values (e.g., “Sydney”, “Paris”) for the parameters (e.g., “to”) of a given operation based on the approaches introduced in this chapter. The generated values can be used to populate placeholders inside generated canonical utterances (e.g., “to=Sydney” in “book a flight to Sydney”).

6.7 Experiments & Results

Before delving into the experiments, we briefly explain the training process in the case of using neural translation methods. We trained the neural models using the Adam optimizer [107] with an initial learning rate of 0.998, a dropout of 0.4 between recurrent layers (e.g., LSTM, BiLSTM), and a batch size of 512. It is worth noting that these hyper-parameters are initial configurations set based on the size of the dataset and values suggested in the literature; finding optimized values requires further studies. Furthermore, when not using delexicalization, we also populate the word embeddings of the model with GloVe [154]. At translation time, we used beam search with a beam size of 10 to obtain multiple translations for a given operation, and then the first translation with the same number of placeholders as the number of parameters in the given operation is considered as its canonical template. Moreover, we replaced generated unknown tokens with the source token that had the highest attention weight to mitigate the out-of-vocabulary problem.
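The selection rule applied to the beam output can be summarized with the following sketch (illustrative code under our own names; it merely counts «...» placeholders and picks the first candidate whose count matches the number of parameters of the operation):

import re


def pick_canonical_template(beam_candidates, expected_params):
    """Return the first beam candidate whose number of placeholders matches the operation."""
    for candidate in beam_candidates:
        placeholders = re.findall(r"«[^»]+»", candidate)
        if len(placeholders) == expected_params:
            return candidate
    return beam_candidates[0] if beam_candidates else None  # fall back to the best-scored candidate


candidates = [
    "get the customer",                      # 0 placeholders, rejected
    "get the customer with id being «id»",   # 1 placeholder, accepted
]
print(pick_canonical_template(candidates, expected_params=1))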


6.7.1 Translation Methods

We trained translation models using different sequence-to-sequence architectures, and we also built a rule-based translator, as described next. Given the size of the API2CAN dataset, we configured the models with at most two layers for both the encoder and the decoder.

GRU. This model consists of two layers (each having 256 units) of Gated Recurrent Units (GRUs) [35] for both the encoder and the decoder, using the attention mechanism [121].

LSTM. This model consists of two layers (each having 256 units) of LSTM for both the encoder and the decoder, using the attention mechanism [121].

CNN. We also built a sequence-to-sequence model based on a Convolutional Neural Network (CNN), as proposed in [71]. In particular, we used 3x3 convolutions (one layer of 256 units) with the attention mechanism [121].

BiLSTM-LSTM. This model consists of two layers (each having 256 units) of Bidirectional Long Short-Term Memory (BiLSTM) [74] for the encoder, and two layers (each having 256 units) of Long Short-Term Memory (LSTM) [87] for the decoder, using the attention mechanism [121].

Transformer. The Transformer architecture [199] has been shown to perform very strongly in machine translation tasks [47, 164]. We used the Transformer model implemented by OpenNMT [108] with the same hyper-parameters as the original paper [199]. For an in-depth explanation of the model, we refer the interested reader to the original paper [199].

Rule-based (RB) Translator. Based on the concept of resources in REST APIs, we also built a rule-based translation system to translate operations to canonical templates (shown in Algorithm 2). First, the algorithm extracts the resources of a given operation based on the resource types extracted by the Resource Tagger algorithm (see Algorithm 1). Next, the algorithm iterates over an ordered set of transformation rules to transform the operation to a canonical template. A transformation rule is a hand-crafted Python function which is able to translate a specific HTTP method (e.g., GET) and sequence of resource types (e.g., a collection resource followed by a singleton resource) to a canonical template. We had created 33 transformation rules at the time of writing this chapter,


Algorithm 2: Rule-Based Translator
Input: operation, transformation_rules written by experts
Result: A canonical template
  resources ←− ResourceTagger(operation);
  foreach t ∈ transformation_rules do
      canonical ←− t.transform(resources, operation.verb);
      if canonical ≠ φ then
          param_clause ←− to_clause(operation.parameters);
          canonical ←− canonical + " " + param_clause;
          return canonical;
      end
  end
  return φ

some of which are listed in Table 6.4. In this table, {c}, {s}, and {a} stand for collection, singleton, and attribute controller respectively, and the singular(.) function returns the singular form of a given name. For instance, in the case of an operation like “GET /customers”, given that the bot user requests a collection of customers, the provided transformer (rule number 1 in Table 6.4) is able to generate a canonical template such as “get the list of customers”. The following Python function presents the transformation rule implementation which is able to translate such operations (a single collection resource when the HTTP method is “GET”):

def transform(resources, verb):
    # Rule 1: GET on a single collection resource, e.g., "GET /customers"
    if verb != "GET" or len(resources) != 1:
        return
    if resources[0].type != "Collection":
        return
    collection = resources[0]
    return "get the list of {}".format(collection.name)

A transformer is written based on the assumption that a sequence of resource types has a special meaning. For example, considering “GET /customers/{id}/accounts” and “GET /users/{user_id}/aliases”, both operations share the same HTTP verb and sequence of resource types (a singleton followed by a collection). In such cases, possible canonical templates are “get accounts of a customer when its id is «id»” and “get aliases of a user when its user id is «user_id»”. Thus such a sequence of resource types can be converted to a rule like “get {collection} of a {singleton.collection} when its {singleton.name} is «{singleton.name}»”,



in which “{}” represents placeholders and singleton.collection represents the name of the collection for the given singleton resource (e.g., customers, users). Thus, adding a new transformation rule means generalizing a specific sequence of resource types that is not considered by the existing rules. However, as discussed earlier, since many APIs do not conform to RESTful principles, creating a comprehensive set of transformation rules is very challenging.
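For instance, the rule just described could be written as another transformation function in the same style as the earlier example. The sketch below is our own illustration (the Resource tuple, the naive singular(.) helper, and the attribute names follow our simplified examples rather than the released code):

from collections import namedtuple

Resource = namedtuple("Resource", ["name", "type", "collection"])


def singular(noun):
    # naive singularization for illustration only (e.g., "customers" -> "customer")
    return noun[:-1] if noun.endswith("s") else noun


def transform(resources, verb):
    # Rule: collection / singleton / collection under GET, e.g., GET /customers/{id}/accounts
    # -> "get {collection} of a {singleton.collection} when its {singleton.name} is «{singleton.name}»"
    if verb != "GET" or len(resources) != 3:
        return
    _, singleton, collection = resources
    if singleton.type != "Singleton" or collection.type != "Collection":
        return
    return "get {} of a {} when its {} is «{}»".format(
        collection.name, singular(singleton.collection), singleton.name, singleton.name)


resources = [Resource("customers", "Collection", None),
             Resource("id", "Singleton", "customers"),
             Resource("accounts", "Collection", None)]
print(transform(resources, "GET"))  # get accounts of a customer when its id is «id»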

Table 6.4: Excerpt of Transformation Rules

#  Resource Sequence −→ Transformation Rule
1  Rule:    GET /{c}/ −→ get list of {c.name}
   Example: GET /customers −→ get list of customers
2  Rule:    DELETE /{c}/ −→ delete all {c.name}
   Example: DELETE /customers −→ delete all customers
3  Rule:    GET /{c}/{s}/ −→ get the {singular(c.name)} with {s.name} being <{s.name}>
   Example: GET /customers/{id} −→ get the customer with id being <id>
4  Rule:    DELETE /{c}/{s}/ −→ delete the {singular(c.name)} with {s.name} being <{s.name}>
   Example: DELETE /customers/{id} −→ delete the customer with id being <id>
6  Rule:    PUT /{c}/{s}/ −→ replace the {singular(c.name)} with {s.name} being <{s.name}>
   Example: PUT /customers/{id} −→ replace the customer with id being <id>
7  Rule:    GET /{c}/{a}/ −→ get {a.name} {singular(c.name)}
   Example: GET /customers/first −→ get first customer
8  Rule:    GET /{c1}/{s}/{c2}/ −→ get the list of {c2.name} of the {singular(c1.name)} with {s.name} being <{s.name}>
   Example: GET /customers/{id}/accounts −→ get the list of accounts of the customer with id being <id>

6.7.2 Canonical Utterance Generation

Quantitative Analysis. For each of the aforementioned NMT architectures, we trained models with and without the proposed resource-based delexicalization approach described in Section 6.4.2. In these experiments, we did not tune

any hyper-parameters and trained the models on the training dataset. For each baseline, we saved the model after 10,000 steps and used the model with the minimum perplexity on the validation set to compare with other configurations. Table 6.5 presents the performance of each model in terms of machine translation metrics: bilingual evaluation understudy (BLEU) [149], Google’s BLEU Score (GLEU) [212], and Character n-gram F-score (CHRF) [159].

In the case of the RB-Translator, the hand-crafted transformation rules are able to generate canonical templates for 26% of the operations. Creating such transformation rules is very challenging for lengthy operations as well as for those not following RESTful principles. We did not include the RB-Translator’s performance in Table 6.5 because the results are not comparable to the rest. Our experiments indicate that the RB-Translator performs reasonably well (BLEU=0.744, GLEU=0.746, and CHRF=0.850). However, the BiLSTM-LSTM model built on the proposed dataset using the resource-based delexicalization technique outperforms the RB-Translator (BLEU=0.876, GLEU=0.909, and CHRF=0.971), ignoring the operations which the RB-Translator could not translate. As the experiments indicate, the delexicalized BiLSTM-LSTM outperforms the rest of the translation systems, and resource-based delexicalization improves the performance of NMT systems by a large margin.

Table 6.5: Translation Performance

Translation Method            BLEU    GLEU    CHRF
Delexicalized BiLSTM-LSTM     0.582   0.532   0.686
Delexicalized Transformer     0.511   0.462   0.619
Delexicalized LSTM            0.503   0.470   0.652
Delexicalized CNN             0.507   0.458   0.601
Delexicalized GRU             0.481   0.450   0.623
BiLSTM-LSTM                   0.318   0.266   0.421
Transformer                   0.295   0.248   0.397
LSTM                          0.278   0.226   0.381
CNN                           0.271   0.220   0.379
GRU                           0.251   0.198   0.347


Qualitative Analysis. Table 6.6 gives a few examples of canonical templates generated by the proposed translator (delexicalized BiLSTM-LSTM). While the machine-translation metrics in Table 6.5 do not show very strong translation performance, our manual inspections revealed that these metrics do not reflect the actual performance of the proposed translators. Therefore, we conducted another experiment to manually evaluate the translated operations. To this end, we asked two experts to rate the generated canonical templates using a Likert scale (in a range of 1 to 5, with 5 indicating the most appropriate canonical sentence). In the experiment, the experts were given pairs of generated canonical utterances and operations (including the description of the operation in the API specification) and were asked to rate the generated canonical templates in a range of 1 to 5. Figure 6.8 shows the Likert assessment for the best performing models in Table 6.5. Based on this experiment, canonical templates generated by the RB-Translator are rated 4.47 out of 5, and those of the delexicalized BiLSTM-LSTM are rated 4.06 out of 5 (averaging the scores given by the annotators). The overall Kappa test showed a high agreement coefficient between the raters, with Kappa being 0.86 [128]. Based on manual inspections, as also shown in Table 6.6, we observed that when APIs are designed based on RESTful principles, the delexicalized BiLSTM-LSTM performs as well as the RB-Translator.

Figure 6.8 also shows how well the automatically generated dataset (API2CAN) represents the corresponding operations. Based on the rates given by the annotators, the dataset (training part) is of decent quality while being noisy, indicating that the proposed set of heuristics for generating the dataset is well-defined. Given the promising quality of the generated canonical templates, we concluded that the noise in the dataset can be ignored. However, it is still desirable to manually clean the dataset.

Error Analysis. Even though the proposed method outperforms the baselines, it still occasionally fails in generating high-quality canonical templates. Based on our investigations, there are three main sources of error in generating the canonical templates: (i) detecting resource types, (ii) translating APIs which do not conform to RESTful principles, and (iii) lengthy operations with many segments. Detecting resource types requires natural language processing tools to detect parts of speech (POS) of a word (e.g., verb, noun, adjective), and to detect if a


Table 6.6: Examples of Generated Canonical Templates

Operation  GET /v2/taxonomies/
Canonical  fetch all taxonomies

Operation  PUT /api/v2/shop_accounts/{id}
Canonical  update a shop account with id being «id»

Operation  DELETE /api/v1/user/devices/{serial}
Canonical  delete a device with serial being «serial»

Operation  GET /user/ratings/query
Canonical  get a list of ratings that match the query

Operation  GET /v1/getLocations
Canonical  get a list of locations

Operation  POST /series/{id}/images/query
Canonical  query the images of the series with id being «id»

Operation  PUT /api/hotel/v0/hotels/{hotelId}/rateplans/batch/$rates
Canonical  set rates for rate plans of a hotel with hotel id being «hotelId»

[Figure 6.8 shows the Likert-scale (1-5) rating distributions, from 0% to 100%, for the API2CAN training data, the RB-Translator, and the Delexicalized NM-Translator.]

Figure 6.8: Assessment of Generated Canonical Templates

given noun is plural or singular (particularly for unknown words or phrases and uncountable nouns). However, these tools occasionally fail. Specifically, POS taggers are built for detecting the parts of speech of words inside a sentence. Thus it is not surprising that they fail to detect whether a word like “rate” is a verb or a noun in a given operation. For example, an operation like GET /participation/rate can indicate both “get the rate of participations” and “rate the participants”. Another

source of such issues is tokenization. It is common in APIs to concatenate words (e.g., whoami, addons, registrierkasseuuid, AddToIMDB). While it seems trivial for an individual to split these words, existing tools frequently fail. Such issues affect the process of detecting resources and consequently impact the generation of canonical templates negatively.

Unconventional API design (not conforming to RESTful principles) also extensively impacts the quality of generated canonical templates. Common drifts from RESTful principles include using the wrong HTTP verb (e.g., “POST” for retrieving information), using singular nouns for collections (e.g., /customer), and adding non-resource parts to the path of the operation (e.g., adding a response format like “json” in /customers/json). Since those API developers (who do not conform to design guidelines) follow their own conventions instead of accepted rules, the automatic generation of canonical templates is challenging.

Lengthy operations (those with roughly more than 10 segments) are naturally rare in REST APIs. Such lengthy operations convey more complex intents than those with fewer segments. As shown in Figure 6.6, such operations are unfortunately also rare in the proposed dataset (API2CAN), which impacts the translation of lengthy operations.

6.7.3 Parameter Value Sampling

This section provides an analysis of parameters in RESTful APIs and evaluates the proposed parameter sampling approach, which is used for generating canonical utterances out of canonical templates. To this end, we processed the API specifications indexed in OpenAPI Directory. Based on our analysis, the dataset contains 145,971 parameters in total, which indicates that an operation has 8.5 parameters on average.

Figure 6.9 presents statistics of the parameters in the whole list of API specifications in the OpenAPI Directory. As shown in the right-hand pie chart, most of the parameters are located in the payload (body) of APIs, followed by query and path parameters. Figure 6.9 also shows the percentages of parameter data types in the collection, with strings being the most common type of parameters. About 1.5% of string parameters are defined by regular expressions, and 4.8% of them can be associated with an entity type12. String parameters are followed

12We looked up the parameter name in Wikidata to find if there is associating entity type

by integers, booleans, numbers, and enumerations. Moreover, some parameters are left without any type, or they are given general parameter types such as “object” without any schema. These parameters are combined in the left-hand pie chart in Figure 6.9 under a single label, “others”. Moreover, 28% of parameters are required (not optional), 10.6% of parameters have no assigned value in the API specifications, and 26% of all parameters are identifiers (e.g., id, UUID). Thus, sampling values is required only for less than 10.6% of parameters (those without any values). In particular, value sampling for string parameters requires more attention. That is because string parameters are widely used, and it is more difficult to automatically assign values to them in comparison to other types of parameters (e.g., integers, enumerations).

To evaluate how well the proposed method generates sample values for parameters, we conducted an experiment. Since generating sample values for data types such as numbers and enumerations is straightforward, we only considered string parameters in this experiment. To this end, we randomly selected 200 parameters and asked an expert to annotate whether the sampled value is appropriate for the given parameter or not. The results indicate that 68 percent of the sampled values are appropriate for the given parameters. The main reason for inappropriate sampled values is noise in the API specifications. For instance, developers occasionally describe the parameters in the example part instead of the description part of the documentation; for a string parameter like “customer_id”, the example part may be filled with “a valid customer id”. Moreover, sometimes the same parameter name is used in different contexts for different purposes. For example, a parameter named “name” can be used to represent the name of a person, a school, or any other object.

6.8 Conclusion

This chapter aimed at addressing an important shortcoming in current approaches for acquiring canonical utterances. In this chapter, we demonstrated that the generation of canonical utterances can be considered a machine translation task. As such, our work also aimed at addressing an important challenge in training supervised neural machine translators, namely the lack of training data for translating operations to canonical templates. By processing a large set of API specifications and based on a set of heuristics, we built a dataset called API2CAN.

[Figure 6.9 summarizes parameter statistics. Parameter types: String 112,443 (77%), Integer 12,449 (8%), Boolean 9,847 (7%), Number 3,993 (3%), Enum 1,238 (1%), and others 4%. Parameter locations: Body 58,856 (40%), Query 51,381 (35%), Path 22,851 (16%), Header 11,119 (8%), and Form 1,788 (1%).]

Figure 6.9: Parameter Type and Location Statistics

However, deep-learning-based approaches require larger sets of training samples to train domain-independent models. Thus, by formalizing and defining resources in REST APIs, we proposed a delexicalization technique to convert an operation to a tagged sequence of resources to help sequence-to-sequence models learn from such a dataset. In addition, we showed how parameter values can be sampled to fill placeholders in a canonical template and generate canonical utterances. We also gave a systematic analysis of web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances.

In this study, we observed that the proposed techniques often fail in generating proper canonical utterances for lengthy operations. This is particularly because lengthy operations are rare in the proposed dataset. In addition, a more sophisticated resource tagger is required to detect resource types in a given operation. The proposed resource tagger is based on existing POS taggers, which occasionally fail in detecting POS tags for segments of an endpoint. One interesting extension to this work can be improving/augmenting the dataset (API2CAN). Moreover, given that fulfilling complex intents usually requires a combination of operations [25, 61], another extension can be generating canonical utterances for compositions of operations. To achieve this, it is required to detect the relations between operations and generate canonical templates for complex tasks (e.g., tasks requiring conditional operations or compositions of multiple operations).

Chapter 7

REST2Bot: A Platform for Automated Bot Development

REST2Bot: Bridging the Gap between Bot Platforms and REST APIs

In this chapter, we introduce REST2Bot1, a tool that addresses shortcomings in bot development frameworks (e.g., translating APIs to intents, and invoking APIs based on detected intents) by automating several tasks in the life cycle of the bot development process. REST2Bot relies on automated approaches for parsing OpenAPI specifications, generating training data, building bots on desired bot development frameworks, and generating deployable webhook functions to map intents and entities to APIs.

The rest of this chapter is organized as follows. Section 7.1 provides our introduction. Section 7.2 provides an overview of the proposed prototype. In Section 7.3, we provide an overview of the motivations and benefits of the effective integration of APIs and conversational services. Finally, Section 7.4 provides our conclusion and future work.

7.1 Introduction

As discussed throughout this thesis, developing a bot typically implies the ability to invoke APIs corresponding to user utterances (e.g., “what will the weather be like tomorrow in NYC?"). This is done in two phases as briefly shown in Figure

1https://github.com/unsw-cse-soc/REST2Bot


[Figure 7.1 depicts the typical bot development pipeline — from API specifications, defining intents and parameters, creating initial utterances, and paraphrasing utterances, to training the natural language understanding model, creating a webhook server, and deploying it — and shows how REST2Bot automates these steps end to end.]

Figure 7.1: Typical Bot Development Process vs Bot Development Process Using REST2Bot (Green Arrow)

7.1: (i) training a Natural Language Understanding (NLU) model to map user utterances to intents, and (ii) developing webhook functions to map intents to APIs. Machine-learning based NLU techniques require the definition of intents (e.g., booking hotels), entity types (e.g., location, date), and a set of annotated utterances in which entities are labeled with the entity types and intents. As studied in Chapter 2, the typical approach is to obtain canonical utterances for given intents (or APIs) and then paraphrase them (either manually or automatically) to generate multiple and diverse utterances. Moreover, bot developers (manually) specify mappings between intents and corresponding API calls through the development of webhooks (i.e., intent-action rules that trigger API calls upon detection of associated intents).

Although single-domain/application conversational bots (e.g., flight booking) are useful, the premise of our research is that the ubiquity of such bots will have more value if they can easily integrate and reuse concomitant capabilities across a large number of evolving and heterogeneous devices, data sources, and applications (e.g., a flight/hotel/car bookings all-in-one bot). With the number of different APIs growing and evolving very rapidly, we need bot development to “scale” in terms of how effectively bots can be integrated with APIs.

Motivated by the above concerns, we developed the REST2Bot system for

[Figure 7.2 shows the REST2Bot pipeline — API Parser, Canonical Sentence Generator, Paraphraser, Conversation Manager Generator, and Webhook Generator — together with external resources (e.g., Wikidata, Apache Joshua, Nematus, Amazon Mechanical Turk) and target bot platforms (e.g., wit.ai), grouped into training the natural language understanding model and building the webhook server.]

Figure 7.2: REST2Bot Architecture - Building Conversational Bots from API Specifications

the rapid and semi-automated integration of a potentially large and evolving set of intents and APIs [219, 218, 220]. At the heart of the REST2Bot system is the idea of providing a middleware that includes knowledge and processing techniques useful to train and automate various activities in the bot development life cycle. In a nutshell, this middleware is a set of microservices (e.g., automated canonical utterance generation, automated paraphrasing, automated generation of webhook functions) for automating the bot development process.

7.2 System Overview

The REST2Bot architecture (Figure 7.2) consists of a pipeline of components that automate several tasks in the bot development process. First, API Parser obtains elements of the APIs (e.g., descriptions, operations, parameters) by parsing API specifications. Second, Canonical Sentence Generator automatically generates canonical utterances (e.g., “get the tracks of the album with id being 1”) for each operation. Third, canonical utterances are paraphrased by Paraphraser to obtain diverse utterances to train NLU models by the Conversation Manager Generator component on a given bot development platform (e.g., Dialogflow). Finally, Webhook Generator generates executable Webhook functions to invoke appropriate APIs based on the intents and entities detected by the chosen bot development framework at runtime.


7.2.1 API Parser

The API Parser parses OpenAPI specifications to extract API elements (e.g., operations, parameters). Our prototype is an adaptation of Swagger-Parser in Python2. The OpenAPI specification is an industry standard for REST API documentation. Figure 7.3 shows an excerpt of an OpenAPI specification for Spotify’s API3. An OpenAPI specification includes the operation description, summary, and information about its parameters (e.g., data types, description, and examples). Given that each operation is designed to perform a specific task, each of the extracted operations can be considered a single intent4 (e.g., getting the list of tracks for a given album):

Intent: get_album_tracks
Operation: GET /albums/{id}/tracks
(HTTP verb: GET; endpoint/path: /albums/{id}/tracks; parameter: {id})
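A minimal sketch of this extraction step could look as follows. It uses PyYAML directly rather than the adapted Swagger-Parser and is simplified to the fields used later in the pipeline; the file name is illustrative:

import yaml

HTTP_VERBS = {"get", "post", "put", "patch", "delete"}


def extract_operations(spec_path):
    """Extract (verb, path, description, parameters) records from an OpenAPI/Swagger YAML file."""
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    operations = []
    for path, methods in spec.get("paths", {}).items():
        for verb, details in methods.items():
            if verb not in HTTP_VERBS or not isinstance(details, dict):
                continue
            operations.append({
                "verb": verb.upper(),
                "path": path,
                "description": details.get("description", ""),
                "summary": details.get("summary", ""),
                "parameters": details.get("parameters", []),
            })
    return operations


for op in extract_operations("spotify.yaml"):
    print(op["verb"], op["path"])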

7.2.2 Canonical Sentence Generator

Based on the techniques and models proposed in Chapter 6, the Canonical Sentence Generator (CSG) automatically translates an operation to canonical templates (e.g., “get the tracks of the album with id being «id»”)5. Generating canonical sentences is essential since they represent corresponding expressions for operations, and they are later paraphrased by the Paraphraser component to generate the training samples required for training NLU models. The following approaches are used to create canonical sentences:

• API-Description Transformer. CSG relies on a set of handcrafted heuristics to automatically transform the description of a given operation into a canonical template [218]. The description of an operation usually consists of several sentences (e.g., “gets an album’s tracks by a given album id. See the following page:”) describing the functionality of the operation. In short, the description of the operation is split into its sentences (e.g., “gets

2https://github.com/Trax-air/swagger-parser 3Spotify Web API endpoints provides access to music artists, albums, and tracks 4current implementation does not support intents requiring API compositions 5a canonical template is a sentence in which entities are replaced with placeholders


basePath: /v1
host: api.spotify.com
info:
  title: Spotify
  version: v1
paths:
  /albums: ...
  /albums/{id}: ...
  /albums/{id}/tracks:
    get:
      description: get an album's tracks
      parameters:
        - { name: id, in: path, description: the Spotify ID for the album, required: true, type: string }

Figure 7.3: Excerpt of Spotify’s OpenAPI Specification

an album’s tracks by a given album id.”, “See the following page:”). Then the first sentence starting with a verb (e.g., “gets an album’s tracks by a given album id”) is chosen as a potential canonical sentence, and its verb is converted to imperative form (e.g., “get an album’s tracks by a given album id”). Next parameters are injected into the candidate sentence (e.g., “gets an album’s tracks by album id being «id»”) based on a set of hand-crafted rules [218] (e.g., replacing entity mentions like “given album id” with a phrase with placeholders like “album id being «id».”).

• Neural Translator. Canonical sentences generated by the API-Description Transformer are of high quality, but many operations lack descriptions [218]. Since not all operations contain proper descriptions, REST2Bot also relies on the Neural Translator proposed in [218]. The Neural Translator is based on an encoder-decoder architecture to directly translate an operation (e.g., GET /customers) to a canonical template (e.g., “get the list of customers”). REST2Bot uses the same encoder-decoder architecture6 and training process (e.g., training dataset, hyper-parameters, resource-based delexicalization [218]).

Having generated canonical templates, the placeholders (entities) in the canonical templates are replaced with values. Values are sampled using the various

6two layers of Bidirectional Long-Short Term Memory (BiLSTM) [74] for encoding, and two layers of Long-Short Term Memory (LSTM) [87] for the decoder using attention mechanism [121]

methods proposed in [218, 229]. Examples include parameter example values extracted from the Swagger specification. Moreover, for entity types (e.g., restaurants), REST2Bot uses knowledge graphs (e.g., Wikidata [201]) to obtain sample values (e.g., KFC, Domino’s).

7.2.3 Paraphraser

Given that human language is rich and an intent can be expressed in countless ways, having a diverse set of utterances is of paramount importance [219]. A lack of variation in training samples may result in bots making incorrect intent detections or entity resolutions, and therefore performing undesirable (even dangerous) tasks (e.g., pressing the accelerator instead of the brake pedal in a car) [219]. Existing solutions for paraphrasing involve either automated or crowdsourcing techniques [220, 219, 190]. REST2Bot is integrated with popular crowdsourcing platforms (e.g., figure-eight7, MTurk8), can automatically create paraphrasing tasks on the specified platform using the word recommendation approach proposed in Chapter 4, and can train bots based on the collected paraphrases. Moreover, the Paraphraser relies on three paraphrasing components to obtain diverse variations of each canonical utterance:

• Common Prefix/Postfix Concatenation. The Paraphraser relies on a set of handcrafted common prefixes/postfixes (e.g., “can you please”, “please”, “I want to”) which can be concatenated with a canonical sentence (e.g., “create a new album”) without changing its meaning (e.g., “can you please create a new album”); see the sketch after this list.

• Statistical Paraphrasing. The proposed system also relies on a statistical paraphrasing system called Apache Joshua [161]. Apache Joshua is a statistical machine translation decoder for phrase-based machine translation, and it relies on the Paraphrase Database (PPDB) to generate paraphrases [67].

• Neural Paraphrasing. Inspired by language pivoting in Neural Machine Translation (NMT) systems, the Paraphraser uses two pre-built NMT models proposed in [183]: one for translating English sentences to German, and the other for translating from German back to English.

7https://www.figure-eight.com
8https://www.mturk.com/
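The first of these components is simple enough to sketch directly; the prefix/postfix lists below are illustrative, not the full hand-crafted set:

PREFIXES = ["can you please", "please", "I want to", "I would like to"]
POSTFIXES = ["for me", "right now"]


def prefix_postfix_paraphrases(canonical):
    """Generate meaning-preserving variations by concatenating common prefixes/postfixes."""
    variations = ["{} {}".format(p, canonical) for p in PREFIXES]
    variations += ["{} {}".format(canonical, s) for s in POSTFIXES]
    return variations


print(prefix_postfix_paraphrases("create a new album"))
# e.g., ['can you please create a new album', ..., 'create a new album for me', ...]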


7.2.4 Conversation Manager Generator

Conversation Manager Generator (CMG) is used to instantiate a conversation manager in a third-party bot development platform (e.g., Dialogflow) [228]. To support this functionality, we develop extensions called generator plugins embedded inside the CMG component. Each plugin is a program that takes as input the training data (e.g., annotated utterances) generated by the Paraphraser, and exports it to the conversation manager in order to train the included NLU model. This is done by exploiting the REST APIs provided by the bot development platform (e.g., Dialogflow). A minimal plugin sketch is given below.
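The sketch pushes one intent and its training utterances to a bot development platform over HTTP. The endpoint path, payload shape, and authentication header are hypothetical placeholders, since each real plugin has to follow the intent-management API of its target platform (e.g., Dialogflow).

import requests

def export_intent(platform_url, api_key, intent_name, utterances):
    """Push one intent and its training utterances to a bot platform.
    The /intents endpoint and the payload layout below are hypothetical;
    a real plugin would follow the target platform's own REST API."""
    payload = {
        "displayName": intent_name,
        "trainingPhrases": [{"text": u} for u in utterances],
    }
    response = requests.post(
        platform_url + "/intents",
        json=payload,
        headers={"Authorization": "Bearer " + api_key},
        timeout=30,
    )
    response.raise_for_status()

export_intent(
    "https://bot-platform.example.com/v1/agent",  # hypothetical platform URL
    "API_KEY",
    "GetAlbumTracks",
    ["get an album's tracks by album id being 4aawyAB9vmqN3uQ7FjRGTy",
     "can you please get the tracks of this album"],
)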

7.2.5 Webhook Generator

Webhook Generator (WG) automatically generates webhook functions written in the developer’s selected programming language (e.g., Python9) [228]. For each detected user intent and its associated parameter values, the generated webhook invokes the corresponding APIs and returns the response. The output of the Webhook Generator is fully functional and ready-to-deploy source code for a webhook server. This source code can be reused and customized by bot developers. A minimal sketch of such a generated webhook is given below.
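As a rough sketch of what the generated Python code might look like, the Flask application below maps a detected intent and its parameters to a backend REST call. The intent name, URL template, and request/response format are illustrative assumptions; the actual generated code is derived from the OpenAPI specification and the target platform's webhook protocol.

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative mapping from intent names to backend operations; in the
# generated code this mapping comes from the parsed OpenAPI specification.
INTENT_TO_OPERATION = {
    "GetAlbumTracks": "https://api.example.com/albums/{id}/tracks",
}

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json(force=True)
    intent = body["intent"]                 # detected user intent
    params = body.get("parameters", {})     # resolved entity values
    url = INTENT_TO_OPERATION[intent].format(**params)
    api_response = requests.get(url, timeout=30)  # invoke the mapped REST operation
    return jsonify({"fulfillmentText": api_response.text})

if __name__ == "__main__":
    app.run(port=5000)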

7.3 Use Cases

REST2Bot can facilitate the bot development process while still using existing bot development platforms. REST2Bot is offered as both a RESTful API and a web-based interface. REST2Bot gives the user control over the settings of the different components. This includes selecting algorithms for creating canonical sentences (e.g., deep learning approach, rule-based approach), paraphrasing models (e.g., Apache Joshua, appending common prefixes), the bot development platform (e.g., Wit.ai, Dialogflow), and the programming framework (e.g., Python Flask, Spring Boot) used to build the webhook server. REST2Bot can be used in the following scenarios:

• Training Dataset Generation. In REST2Bot, the developer starts by providing a list of API specifications (e.g., Spotify, Yelp) chosen for building the bot. REST2Bot extracts all operations from the given set of APIs. Next, based on the selected approaches for translation (API-Description Transformer

9The current implementation at this stage supports template code written in Python.


Figure 7.4: REST2Bot User Interface

and Neural Translator), operations are translated to canonical sentences by the Canonical Sentence Generator (CSG) component. Then, paraphrasing methods (common prefix, statistical, and neural paraphrasing) are selected by the bot developer to diversify the training dataset, which is handled by the Paraphraser component. Since paraphrases are generated automatically, REST2Bot also scores the generated paraphrases (using the Paraphraser component) to keep only high-quality paraphrases based on a user-provided score threshold (the similarity of a canonical utterance and a given paraphrase is calculated by the method proposed in [29]; a minimal filtering sketch is given after this list). The outcome is a dataset containing natural language utterances to train conversation-enabled services such as bots (e.g., on-premise development platforms) and IoT devices (e.g., smart home appliances, aged-care gadgets).

• Automate Bot Development. Using the generated training dataset,


REST2Bot trains the NLU model (the main component within the conversation manager) by defining intents (e.g., PlayMusic, FindRestaurants) and entity types (e.g., album name, restaurant location) in a supported bot platform (e.g., Dialogflow) chosen by the bot developer. Next, REST2Bot automatically creates webhook functions to map intents to APIs in a given programming framework (e.g., Python Flask). This process is handled by the CMG and WG components of the REST2Bot architecture. Finally, in the demo, bot users are able to interact with the generated bot.
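As referenced in the first scenario above, the sketch below filters automatically generated paraphrases by cosine similarity to the canonical utterance. The embed callable stands in for a sentence encoder such as the one used in [29]; the 0.7 default threshold is an arbitrary illustrative value rather than the one used in REST2Bot.

import numpy as np

def filter_paraphrases(canonical, paraphrases, embed, threshold=0.7):
    """Keep only paraphrases whose cosine similarity to the canonical
    utterance reaches the user-provided threshold. `embed` maps a list of
    sentences to a matrix of sentence vectors (one row per sentence)."""
    vectors = np.asarray(embed([canonical] + paraphrases))
    canon, rest = vectors[0], vectors[1:]
    sims = rest @ canon / (np.linalg.norm(rest, axis=1) * np.linalg.norm(canon))
    return [p for p, s in zip(paraphrases, sims) if s >= threshold]

# Example usage with any sentence encoder:
# keep = filter_paraphrases("create a new album",
#                           ["please create a new album", "delete the album"],
#                           embed=my_encoder, threshold=0.7)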

7.4 Conclusion

This chapter aimed at addressing an important shortcoming in current approaches for building bots, namely automatic training data generation and the construction of webhook functions to invoke APIs. Future work includes adding new plugins (e.g., for more bot development platforms and programming frameworks for webhook servers). Moreover, extensions of REST2Bot will allow developers to share training datasets to minimize the cost of crowdsourced paraphrasing. Another feasible extension is automatic deployment of the webhook server using cloud platforms (e.g., Microsoft Azure, Google Cloud).


Chapter 8

Conclusion

Dialog systems are growing rapidly and new approaches are being developed; as a result, many users have started to investigate the usefulness of such interfaces. Nevertheless, training data acquisition remains a bottleneck, and nearly all systems depend on gathering high-quality training datasets to build robust virtual assistants. This thesis identified the main approaches for acquiring user utterances for training intent detection systems, enumerated the existing challenges in this area along with existing solutions, and provided solutions for some of the challenges. This chapter provides a summary of the undertaken research issues in Section 8.1, as well as a summary of the research outcomes in Section 8.2. Finally, Section 8.3 discusses future research directions.

8.1 Summary of the Research Issues

This thesis presented multiple novel techniques to fill critical gaps in automating and semi-automating the process of building conversational bots. Chapter 2 provided background and identified open issues and the state of the art of approaches for collecting training utterances. In the second study (Chapter 3), we conducted an empirical study on how incorrect paraphrases are generated by crowd-workers in order to build online quality-control systems for the automatic detection of incorrect paraphrases (e.g., semantically incorrect ones). In Chapter 4, we proposed a novel framework and technique for crowdsourcing user utterances that collects semantically correct yet diverse paraphrases by continuously showing a list of potential words to enhance the diversity of collected paraphrases. In Chapter 5, we introduced a new technique to identify insincere crowd-workers who intentionally provide incorrect paraphrases. In Chapter 6, we addressed a fundamental

issue in generating canonical utterances by automatically translating REST APIs to natural language expressions. Finally, in Chapter 7, we demonstrated the software prototype built using the techniques introduced in this thesis as well as state-of-the-art approaches in this domain.

8.2 Summary of the Research Outcomes

As a first outcome, in Chapter 2, we surveyed training utterance acquisition methods and discussed their issues with respect to two main dimensions: cost and quality. We discussed the state-of-the-art techniques, identified and explored open issues, and provided an outlook on future research directions (some of which are addressed in this thesis) [217].

The second outcome, presented in Chapter 3, includes a study of how incorrect paraphrases are generated by crowd-workers, a taxonomy of incorrect paraphrases, baselines for detecting each type of incorrect paraphrase, and an annotated dataset of crowdsourced paraphrases. We discussed how each category of incorrect paraphrases can be detected automatically based on existing state-of-the-art approaches in natural language processing.

The third outcome, introduced in Chapter 4, consists of a framework and techniques that leverage word embeddings and word alignment to promote the diversity of crowdsourced paraphrases. We focused on the automated generation of word/phrase suggestions based on already collected paraphrases, recommending suggestions that are more likely to enhance diversity according to existing diversity metrics. The experimental results obtained by applying our approach to crowdsourcing tasks demonstrate the feasibility and effectiveness of the proposed techniques, improving diversity while not promoting semantically incorrect paraphrases.

In our fourth outcome, proposed in Chapter 5, we employed a data-driven approach to investigate various cheating behaviors and their characteristics in crowdsourced paraphrases. Moreover, we identified various features from the literature based on the characteristics of each type of cheating behavior. We then proposed two new features to overcome shortcomings in the literature (e.g., multi-paraphrase similarity, user edit habits) and proposed a domain-independent method for modeling malicious workers.

The fifth outcome, introduced in Chapter 6, consists of a novel model to

directly translate REST APIs to canonical utterances. We demonstrated that the generation of canonical utterances can be considered a machine translation task. As such, our work also aimed at addressing an important challenge in training supervised neural machine translators, namely the lack of training data for translating operations to canonical templates. By processing a large set of API specifications and based on a set of heuristics, we built a dataset called API2CAN. Since deep-learning-based approaches require larger sets of training samples to train domain-independent models, by formalizing and defining resources in REST APIs, we proposed a delexicalization technique that converts an operation to a tagged sequence of resources to help sequence-to-sequence models learn from such a dataset. In addition, we showed how parameter values can be sampled to fill placeholders in a canonical template and generate canonical utterances. We also provided a systematic analysis of web APIs and their parameters, indicating the importance of string parameters in automating the generation of canonical utterances.

Finally, Chapter 7 presented our final outcome, a software prototype called REST2Bot: a tool that addresses shortcomings of bot development frameworks (e.g., translating APIs to intents, and invoking APIs based on detected intents) to automate several tasks in the life cycle of the bot development process. REST2Bot relies on automated approaches for parsing OpenAPI specifications, generating training data, building bots on the desired bot development frameworks, and generating deployable webhook functions to map intents and entities to APIs.

8.3 Future Research Directions

We discussed several open issues and research directions in Chapters 2 and 3. While we targeted some of these issues in this thesis, there are still many challenges waiting to be addressed. In this section, we discuss future work and an outlook for training utterance acquisition.


8.3.1 Integrating Quality Control and Crowdsourcing with a Feedback Mechanism

Considering the issue of collecting quality training utterances, we proposed various techniques and models to diversify crowdsourced utterances (Chapter 4) and to detect their quality issues (Chapters 3 and 5). A potential extension to the proposed approaches is a feedback mechanism: generating feedback for workers to inform them about quality issues (e.g., spelling errors, semantic incompleteness). At the same time, the system can learn from workers’ feedback on detected quality issues (incorrect suggestions) to improve its machine intelligence. Over time, we envision increasingly less dependence on users. This also requires developing software services which can easily be integrated with existing crowdsourcing platforms. Such a software service must have characteristics such as being domain-independent and highly reliable to be useful in crowdsourcing tasks.

After the acquisition of training utterances, it is also essential to reduce biases in the collection of utterances. In this thesis, we only focused on one type of bias, namely lexical bias, but a quality set of utterances must not suffer from other biases either (e.g., gender bias [16], personal view bias [173], blatant racial slurs [138]). This requires in-depth studies to identify various kinds of biases and approaches to automatically detect and resolve them.

8.3.2 Canonical Utterance Generation for Complex Intents

Efficient and cost-effective acquisition of training utterances is linked with how canonical utterances are generated for intents. In Chapter 6, we proposed a technique to automatically generate canonical utterances for REST APIs. While many tasks can be accomplished by invoking a single API operation, complex intents require compositions of operations (e.g., if-else conditions). To achieve this, it is necessary to detect the relations between operations and generate canonical templates for complex tasks (e.g., tasks requiring conditional operations or compositions of multiple operations). While REST APIs are one of the most popular forms of intents, there are still other types of intents such as SQL queries and programming code [204, 97, 70]. Building natural language interfaces for such intents requires special considerations (e.g., mining code repositories, programming language documentation, and online

communities such as Stack Overflow1) for generating canonical utterances.

8.3.3 Acquisition of Training Dialogues

In this thesis, we discussed and addressed some of the issues (e.g., quality, automation) of collecting training utterances for intent detection systems. However, multi-turn dialog systems require more complicated training samples, including multiple utterances which may be exchanged between the user and the bot in a single conversation [168]. This requires understanding how users interact with bots and simulating the interaction between the user and the bot [61, 168]. In future work, we will target these problems, together with many other exciting opportunities, as extensions to this work.

1https://stackoverflow.com/


Bibliography

[1] Abdalghani Abujabal et al. “Automated Template Generation for Ques- tion Answering over Knowledge Graphs”. In: Proceedings of the 26th In- ternational Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 1191–1200. isbn: 978-1-4503-4913-0. doi: 10.1145/3038912.3052583. url: https://doi.org/10.1145/3038912.3052583. [2] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. “A comparative survey of recent natural language interfaces for databases”. In: The VLDB Journal 28 (2019), pp. 793 –819. [3] Basant Agarwal et al. “A deep network model for paraphrase detection in short text messages”. In: Information Processing & Management 54.6 (2018), pp. 922–937. [4] Ali Ahmadvand, Jason Ingyu Choi, and Eugene Agichtein. “Contextual Dialogue Act Classification for Open-Domain Conversational Agents”. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019, pp. 1273–1276. [5] Layla El Asri et al. “Frames: A corpus for adding memory to goal-oriented dialogue systems”. In: arXiv preprint arXiv:1704.00057 (2017). [6] Ram G. Athreya, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck. “En- hancing Community Interactions with Data-Driven Chatbots–The DB- pedia Chatbot”. In: Companion Proceedings of the The Web Conference 2018. WWW ’18. Lyon, France: International World Wide Web Confer- ences Steering Committee, 2018, pp. 143–146. isbn: 978-1-4503-5640-4. doi: 10.1145/3184558.3186964. url: https://doi.org/10.1145/ 3184558.3186964. [7] Petr Babkin et al. “Bootstrapping Chatbots for Novel Domains”. In: Work- shop at NIPS on Learning with Limited Labeled Data (LLD). 2017. [8] Sahar Badihi and Abbas Heydarnoori. “CrowdSummarizer: Automated Generation of Code Summaries for Java Programs through Crowdsourc- ing”. In: IEEE Software 34.2 (2017), pp. 71–80. [9] Moath Bagarish, Riyad Alshammari, and A. Nur Zincir-Heywood. “Are There Bots even in FIFA World Cup 2018 Tweets?” In: 2019 15th Inter- national Conference on Network and Service Management (CNSM). 2019, pp. 1–5. doi: 10.23919/CNSM46954.2019.9012743.


[10] Alessandro Balestrucci. “How many bots are you following?” In: arXiv preprint arXiv:2001.05222 (2020). [11] Rucha Bapat, Pavel Kucherbaev, and Alessandro Bozzon. “Effective Crowdsourced Generation of Training Data for Chatbots Natural Lan- guage Understanding”. In: Web Engineering. Ed. by Tommi Mikkonen, Ralf Klamma, and Juan Hernández. Cham: Springer International Pub- lishing, 2018. isbn: 978-3-319-91662-0. [12] Jonathan Berant et al. “ on Freebase from Question- Answer Pairs”. In: Proceedings of the 2013 Conference on Empirical Meth- ods in Natural Language Processing. Seattle, Washington, USA: Associa- tion for Computational Linguistics, Oct. 2013, pp. 1533–1544. url: https: //www.aclweb.org/anthology/D13-1160. [13] Abraham Bernstein, Foster Provost, and Shawndra Hill. “Toward intel- ligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification”. In: IEEE Transactions on knowledge and data engineering 17.4 (2005), pp. 503–518. [14] Michael S. Bernstein et al. “Soylent: A Word Processor with a Crowd Inside”. In: vol. 58. 8. New York, NY, USA: ACM, July 2015, pp. 85–94. doi: 10.1145/2791285. url: http://doi.acm.org/10.1145/2791285. [15] Kurt Bollacker et al. “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. SIG- MOD ’08. Vancouver, Canada: ACM, 2008, pp. 1247–1250. isbn: 978-1- 60558-102-6. doi: 10.1145/1376616.1376746. url: http://doi.acm. org/10.1145/1376616.1376746. [16] Tolga Bolukbasi et al. “Man is to Computer Programmer As Woman is to Homemaker? Debiasing Word Embeddings”. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Barcelona, Spain: Curran Associates Inc., 2016, pp. 4356–4364. isbn: 978-1-5108-3881-9. url: http://dl.acm.org/citation.cfm?id= 3157382.3157584. [17] David M Boush. “How advertising slogans can prime evaluations of brand extensions”. In: Psychology & Marketing 10.1 (1993), pp. 67–78. [18] Florin Brad and Traian Rebedea. “Neural Paraphrase Generation using Transfer Learning”. In: Proceedings of the 10th International Conference on Natural Language Generation. 2017, pp. 257–261. [19] Patricia Braunger et al. “Towards an Automatic Assessment of Crowd- sourced Data for NLU”. In: Proceedings of the Eleventh International Con- ference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), May 2018. url: https://www.aclweb.org/anthology/L18-1315.


[20] A.M. Brickman and Yaakov Stern. “Aging and Memory in Humans”. In: Sage Encyclopedia of Neuroscience 1 (Jan. 2010), pp. 175–180. doi: 10. 1016/B978-008045046-9.00745-2. [21] Peter F. Brown et al. “The Mathematics of Statistical Machine Trans- lation: Parameter Estimation”. In: Comput. Linguist. 19.2 (June 1993), pp. 263–311. issn: 0891-2017. url: http://dl.acm.org/citation.cfm? id=972470.972474. [22] Marc Brysbaert et al. “How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age”. In: Frontiers in psychology 7 (2016), p. 1116. [23] Dhyana Buckley. “The Persuasive Effects of Stylistic Variation in the Restaurant Review Domain”. PhD thesis. UC Santa Cruz, 2018. [24] Steven Burrows, Martin Potthast, and Benno Stein. “Paraphrase Acquisi- tion via Crowdsourcing and Machine Learning”. In: vol. 4. 3. New York, NY, USA: ACM, July 2013, 43:1–43:21. doi: 10.1145/2483669.2483676. url: http://doi.acm.org/10.1145/2483669.2483676. [25] Giovanni Campagna et al. “Almond: The Architecture of an Open, Crowd- sourced, Privacy-Preserving, Programmable Virtual Assistant”. In: Pro- ceedings of the 26th International Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 341–350. isbn: 978-1-4503-4913-0. doi: 10.1145/ 3038912.3052562. url: https://doi.org/10.1145/3038912.3052562. [26] Lorenzo Cannone and Matteo Di Pierro. “Detection and classification of harmful bots in human-bot interactions on Twitter”. In: MSc Thesis, Po- litecnico di Milano (2018). [27] Claudio Carpineto and Giovanni Romano. “A Survey of Automatic Query Expansion in Information Retrieval”. In: vol. 44. 1. New York, NY, USA: ACM, Jan. 2012, 1:1–1:50. doi: 10.1145/2071389.2071390. url: http: //doi.acm.org/10.1145/2071389.2071390. [28] Fabio Casati et al. “Operating enterprise AI as a service”. In: International Conference on Service-Oriented Computing. Springer. 2019, pp. 331–344. [29] Daniel Cer et al. “Universal sentence encoder”. In: arXiv preprint arXiv:1803.11175 (2018). [30] David L. Chen and William B. Dolan. “Collecting Highly Parallel Data for Paraphrase Evaluation”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. HLT ’11. Portland, Oregon: Association for Computational Linguistics, 2011, pp. 190–200. isbn: 978-1-932432-87-9. url: http://dl. acm.org/citation.cfm?id=2002472.2002497.


[31] Hongshen Chen et al. “A Survey on Dialogue Systems: Recent Advances and New Frontiers”. In: SIGKDD Explor. Newsl. 19.2 (Nov. 2017), 25–35. issn: 1931-0145. doi: 10.1145/3166054.3166058. url: https://doi. org/10.1145/3166054.3166058. [32] Chun Cheng, Yun Luo, and Changbin Yu. “Dynamic mechanism of social bots interfering with public opinion in network”. In: Physica A: Statistical Mechanics and its Applications (2020), p. 124163. [33] Wendy A. Chisholm and Shawn Lawton Henry. “Interdependent Com- ponents of Web Accessibility”. In: Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A). W4A ’05. Chiba, Japan: ACM, 2005, pp. 31–37. isbn: 1-59593-219-4. doi: 10.1145/ 1061811 . 1061818. url: http : / / doi . acm . org / 10 . 1145 / 1061811 . 1061818. [34] Timothy Chklovski. “Collecting Paraphrase Corpora from Volunteer Con- tributors”. In: Proceedings of the 3rd International Conference on Knowl- edge Capture. K-CAP ’05. Banff, Alberta, Canada: ACM, 2005, pp. 115– 120. isbn: 1-59593-163-5. doi: 10.1145/1088622.1088644. url: http: //doi.acm.org/10.1145/1088622.1088644. [35] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Process- ing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIG- DAT, a Special Interest Group of the ACL. 2014, pp. 1724–1734. url: https://www.aclweb.org/anthology/D14-1179/. [36] Kyunghyun Cho et al. “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. In: Proceedings of SSST-8, Eighth Work- shop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103– 111. doi: 10.3115/v1/W14- 4012. url: https://www.aclweb.org/ anthology/W14-4012. [37] Cuong Xuan Chu, Niket Tandon, and Gerhard Weikum. “Distilling Task Knowledge from How-To Communities”. In: Proceedings of the 26th In- ternational Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 805–814. isbn: 978-1-4503-4913-0. doi: 10.1145/3038912.3052715. url: https://doi.org/10.1145/3038912.3052715. [38] Jacob Cohen. “A coefficient of agreement for nominal scales”. In: Educa- tional and psychological measurement 20.1 (1960), pp. 37–46.


[39] K. M. Colby. “Human-Computer Conversation in A Cognitive Therapy Program”. In: Machine Conversations. Ed. by Yorick Wilks. Boston, MA: Springer US, 1999, pp. 9–19. isbn: 978-1-4757-5687-6. doi: 10.1007/978- 1- 4757- 5687- 6_3. url: https://doi.org/10.1007/978- 1- 4757- 5687-6_3. [40] Kate Compton, Benjamin Filstrup, and Michael Mateas. “Tracery: Ap- proachable story grammar authoring for casual users”. In: 2014 Electronic Literature Organization Conference, ELO 2014. AI Access Foundation. 2014, pp. 64–67. [41] Alexis Conneau et al. “Supervised Learning of Universal Sentence Repre- sentations from Natural Language Inference Data”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 670–680. url: https://www.aclweb.org/anthology/D17-1070. [42] Joao Cordeiro, Gael Dias, and Pavel Brazdil. “A metric for paraphrase detection”. In: Computing in the Global Information Technology, 2007. ICCGI 2007. International Multi-Conference on. IEEE. 2007, pp. 7–7. [43] Justin Cranshaw et al. “Calendar.Help: Designing a Workflow-Based Scheduling Agent with Humans in the Loop”. In: CHI ’17. Denver, Col- orado, USA, 2017. isbn: 978-1-4503-4655-9. [44] Scott Crossley et al. “Combining Click-stream Data with NLP Tools to Better Understand MOOC Completion”. In: Proceedings of the Sixth In- ternational Conference on Learning Analytics & Knowledge. LAK ’16. Ed- inburgh, United Kingdom: ACM, 2016, pp. 6–14. isbn: 978-1-4503-4190-5. doi: 10.1145/2883851.2883931. url: http://doi.acm.org/10.1145/ 2883851.2883931. [45] Robert Dale. “The return of the chatbots”. In: Natural Language Engi- neering 22.5 (2016), pp. 811–817. [46] Florian Daniel et al. “Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions”. In: vol. 51. 1. New York, NY, USA: ACM, Jan. 2018, 7:1–7:40. doi: 10.1145/ 3148148. url: http://doi.acm.org/10.1145/3148148. [47] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transform- ers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 (Long and Short Pa- pers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. url: https: //www.aclweb.org/anthology/N19-1423.


[48] Kedar Dhamdhere et al. “Analyza: Exploring Data with Conversation”. In: Proceedings of the 22Nd International Conference on Intelligent User Interfaces. IUI ’17. Limassol, Cyprus: ACM, 2017, pp. 493–504. isbn: 978- 1-4503-4348-0. doi: 10.1145/3025171.3025227. url: http://doi.acm. org/10.1145/3025171.3025227. [49] George Doddington. “Automatic Evaluation of Machine Translation Qual- ity Using N-gram Co-occurrence Statistics”. In: Proceedings of the Sec- ond International Conference on Human Language Technology Research. HLT ’02. San Diego, California: Morgan Kaufmann Publishers Inc., 2002, pp. 138–145. url: http://dl.acm.org/citation.cfm?id=1289189. 1289273. [50] Li Dong et al. “Learning to Paraphrase for Question Answering”. In: arXiv preprint arXiv:1708.06022 (2017). [51] Todd Donovan, Caroline J Tolbert, and Daniel A Smith. “Priming presi- dential votes by direct democracy”. In: The Journal of Politics 70.4 (2008), pp. 1217–1231. [52] Nan Duan et al. “Question Generation for Question Answering”. In: Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Lan- guage Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 866–874. doi: 10.18653/v1/D17-1090. url: https://www.aclweb.org/anthology/D17-1090. [53] Layla El Asri et al. “Frames: a corpus for adding memory to goal-oriented dialogue systems”. In: The 18th SIGDIAL. Saarbrücken, Germany: Asso- ciation for Computational Linguistics, 2017, pp. 207–219. doi: 10.18653/ v1/W17-5526. url: http://aclweb.org/anthology/W17-5526. [54] Ábel Elekes, Martin Schäler, and Klemens Böhm. “On the various se- mantics of similarity in word embedding models”. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries. IEEE Press. 2017, pp. 139–148. [55] Mihail Eric et al. MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines. July 2019. [56] Katrin Erk and Sebastian Padó. “A Structured Vector Space Model for Word Meaning in Context”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP ’08. Honolulu, Hawaii: Association for Computational Linguistics, 2008, pp. 897–906. url: http: //dl.acm.org/citation.cfm?id=1613715.1613831. [57] Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. “Paraphrase-driven learning for open question answering”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2013, pp. 1608–1618.


[58] Roghayeh Fakouri-Kapourchali, Mohammad-Ali Yaghoub-Zadeh-Fard, and Mehdi Khalili. “Semantic Textual Similarity as a Service”. In: Service Research and Innovation. Ed. by Amin Beheshti et al. Cham: Springer International Publishing, 2018, pp. 203–215. isbn: 978-3-319-76587-7. [59] Manaal Faruqui et al. “Retrofitting word vectors to semantic lexicons”. In: arXiv preprint arXiv:1411.4166 (2014). [60] Ethan Fast, Binbin Chen, and Michael S. Bernstein. “Empath: Under- standing Topic Signals in Large-Scale Text”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. CHI ’16. San Jose, California, USA: ACM, 2016, pp. 4647–4657. isbn: 978-1-4503-3362- 7. doi: 10.1145/2858036.2858535. url: http://doi.acm.org/10. 1145/2858036.2858535. [61] Ethan Fast et al. “Iris: A Conversational Agent for Complex Tasks”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. CHI ’18. Montreal QC, Canada: ACM, 2018, 473:1–473:12. isbn: 978-1-4503-5620-6. doi: 10.1145/3173574.3174047. url: http://doi. acm.org/10.1145/3173574.3174047. [62] Roy Fielding et al. Hypertext transfer protocol–HTTP/1.1. 1999. [63] Mauajama Firdaus et al. “A Deep Multi-task Model for Dialogue Act Clas- sification, Intent Detection and Slot Filling”. In: Cognitive Computation (2020), pp. 1–20. [64] Kenneth I Forster and Chris Davis. “Repetition priming and frequency at- tenuation in lexical access.” In: Journal of experimental psychology: Learn- ing, Memory, and Cognition 10.4 (1984), p. 680. [65] Charlotte Fox and James E Birren. “Some factors affecting vocabulary size in later maturity: Age, education, and length of institutionalization”. In: Journal of gerontology 4.1 (1949), pp. 19–26. [66] Ujwal Gadiraju and Stefan Dietze. “Improving Learning Through Achieve- ment Priming in Crowdsourced Information Finding Microtasks”. In: Pro- ceedings of the Seventh International Learning Analytics & Knowledge Conference. LAK ’17. Vancouver, British Columbia, Canada: ACM, 2017, pp. 105–114. isbn: 978-1-4503-4870-6. doi: 10.1145/3027385.3027402. url: http://doi.acm.org/10.1145/3027385.3027402. [67] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. “PPDB: The paraphrase database”. In: Proceedings of the 2013 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013, pp. 758–764. [68] Jianfeng Gao, Michel Galley, Lihong Li, et al. “Neural approaches to con- versational AI”. In: Foundations and Trends R in Information Retrieval 13.2-3 (2019), pp. 127–298.


[69] Shuyang Gao et al. “Dialog State Tracking: A Neural Reading Comprehen- sion Approach”. In: Jan. 2019, pp. 264–273. doi: 10.18653/v1/W19-5932. [70] Zhipeng Gao et al. “Checking Smart Contracts with Structural Code Em- bedding”. In: IEEE Transactions on Software Engineering (2020), pp. 1–1. doi: 10.1109/TSE.2020.2971482. [71] Jonas Gehring et al. “Convolutional Sequence to Sequence Learning”. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. ICML’17. Sydney, NSW, Australia: JMLR.org, 2017, pp. 1243– 1252. url: http://dl.acm.org/citation.cfm?id=3305381.3305510. [72] Nguyen Giang. Survey on Script-based languages to write a Chat- bot. Available: https://www.slideshare.net/NguyenGiang102/survey-on- scriptbased-languages-to-write-a-chatbot. [Online; accessed 15-July-2019]. 2018. [73] Ian Goodfellow et al. Deep learning. Vol. 1. MIT press Cambridge, 2016. [74] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. “Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition”. In: Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005. Ed. by Włodzisław Duch et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 799–804. isbn: 978-3-540-28756-8. [75] Ankush Gupta et al. “A Deep Generative Framework for Paraphrase Gen- eration”. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018). [76] E Haihong et al. “A novel bi-directional interrelated model for joint intent detection and slot filling”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 5467–5471. [77] Mark Hall et al. “The WEKA Data Mining Software: An Update”. In: vol. 11. 1. New York, NY, USA: ACM, Nov. 2009, pp. 10–18. doi: 10. 1145/1656274.1656278. url: http://doi.acm.org/10.1145/1656274. 1656278. [78] Jennifer L Harris, John A Bargh, and Kelly D Brownell. “Priming effects of television food advertising on eating behavior.” In: Health psychology 28.4 (2009), p. 404. [79] Lane Harrison et al. “Influencing Visual Judgment Through Affective Priming”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’13. Paris, France: ACM, 2013, pp. 2949– 2958. isbn: 978-1-4503-1899-0. doi: 10 .1145 / 2470654. 2481410. url: http://doi.acm.org/10.1145/2470654.2481410. [80] Sadid A Hasan et al. “Neural clinical paraphrase generation with atten- tion”. In: Proceedings of the Clinical Natural Language Processing Work- shop (ClinicalNLP). 2016, pp. 42–53.


[81] Homa B Hashemi, Amir Asiaee, and Reiner Kraft. “Query intent detection using convolutional neural networks”. In: International Conference on Web Search and Data Mining, Workshop on Query Understanding. 2016. [82] Matthew Henderson, Blaise Thomson, and Jason D. Williams. “The Sec- ond Dialog State Tracking Challenge”. In: Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIG- DIAL). Philadelphia, PA, U.S.A.: Association for Computational Linguis- tics, June 2014, pp. 263–272. doi: 10.3115/v1/W14-4337. url: https: //www.aclweb.org/anthology/W14-4337. [83] Matthew Henderson et al. “A Repository of Conversational Datasets”. In: CoRR abs/1904.06472 (2019). [84] Matthew S Henderson. “Discriminative methods for statistical spoken di- alogue systems”. PhD thesis. University of Cambridge, 2015. [85] Peter Henderson et al. “Ethical Challenges in Data-Driven Dialogue Sys- tems”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’18. New Orleans, LA, USA, 2018, pp. 123–129. isbn: 978-1-4503-6012-8. [86] Dirk Hermans, Jan De Houwer, and Paul Eelen. “The affective priming effect: Automatic activation of evaluative information in memory”. In: Cognition & Emotion 8.6 (1994), pp. 515–533. [87] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780. [88] Shaohan Huang et al. “Dictionary-Guided Editing Networks for Para- phrase Generation”. In: CoRR (2018). [89] Ting-Hao Kenneth Huang, Walter S Lasecki, and Jeffrey P Bigham. “Guardian: A crowd-powered spoken dialog system for web apis”. In: Third AAAI conference on human computation and crowdsourcing. 2015. [90] Ting-Hao Kenneth Huang et al. “Evorus: A Crowd-powered Conversa- tional Assistant That Automates Itself Over Time”. In: Adjunct Publi- cation of the 30th Annual ACM Symposium on User Interface Software and Technology. UIST ’17. Québec City, QC, Canada: ACM, 2017, pp. 155–157. isbn: 978-1-4503-5419-6. doi: 10.1145/3131785.3131823. url: http://doi.acm.org/10.1145/3131785.3131823. [91] Ting-Hao Kenneth Huang et al. “"Is There Anything Else I Can Help You With?" Challenges in Deploying an On-Demand Crowd-Powered Conver- sational Agent”. In: Fourth AAAI Conference on Human Computation and Crowdsourcing. 2016.


[92] Michimasa Inaba et al. “Statistical Response Method and Learning Data Acquisition Using Gamified Crowdsourcing for a Non-task-oriented Dia- logue Agent”. In: Revised Selected Papers of the 6th International Con- ference on Agents and Artificial Intelligence - Volume 8946. ICAART 2014. Angers, France: Springer-Verlag, 2015, pp. 119–136. isbn: 978-3- 319-25209-4. doi: 10.1007/978-3-319-25210-0_8. url: https://doi. org/10.1007/978-3-319-25210-0_8. [93] Fuad Issa et al. “Abstract Meaning Representation for Paraphrase De- tection”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers). Vol. 1. 2018, pp. 442–452. [94] Minwoo Jeong and G. Geunbae Lee. “Triangular-Chain Conditional Ran- dom Fields”. In: Trans. Audio, Speech and Lang. Proc. 16.7 (Sept. 2008), pp. 1287–1302. issn: 1558-7916. doi: 10.1109/TASL.2008.925143. url: http://dx.doi.org/10.1109/TASL.2008.925143. [95] Youxuan Jiang, Jonathan K. Kummerfeld, and Walter S. Lasecki. “Under- standing Task Design Trade-offs in Crowdsourced Paraphrase Collection”. In: Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers). Vancouver, Canada: Asso- ciation for Computational Linguistics, 2017, pp. 103–109. doi: 10.18653/ v1/P17-2017. url: http://aclweb.org/anthology/P17-2017. [96] Cordeiro Joao, Dias Gaël, and Brazdil Pavel. “New functions for unsu- pervised asymmetrical paraphrase detection”. In: Journal of Software 2.4 (2007), pp. 12–23. [97] Rogers Jeffrey Leo John, Navneet Potti, and Jignesh M Patel. “Ava: From Data to Insights Through Conversations.” In: CIDR. 2017. [98] Armand Joulin et al. “Bag of Tricks for Efficient Text Classification”. In: Proceedings of the 15th Conference of the European Chapter of the Asso- ciation for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 427–431. url: http://aclweb.org/anthology/E17-2068. [99] Seikyung Jung, Jonathan L. Herlocker, and Janet Webster. “Click Data As Implicit Relevance Feedback in Web Search”. In: vol. 43. 3. Tarrytown, NY, USA: Pergamon Press, Inc., May 2007, pp. 791–807. doi: 10.1016/j.ipm. 2006.07.021. url: http://dx.doi.org/10.1016/j.ipm.2006.07.021. [100] D Jurafsky and JH Martin. “Dialog Systems and Chatbots”. In: Speech and language processing (2018). [101] Dan Jurafsky and James H Martin. Speech and language processing. Vol. 3. Pearson London, 2017.


[102] Juraj Juraska and Marilyn Walker. “Characterizing Variation in Crowd- Sourced Data for Training Neural Language Generators to Produce Stylis- tically Varied Outputs”. In: CoRR (2018). [103] Yiping Kang et al. “Data Collection for : A Startup Per- spective”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 3 (Industry Papers). New Orleans - Louisiana: Association for Computational Linguistics, June 2018, pp. 33–40. doi: 10. 18653/v1/N18-3005. url: https://www.aclweb.org/anthology/N18- 3005. [104] David Kauchak and Regina Barzilay. “Paraphrasing for Automatic Eval- uation”. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. New York City, USA: Association for Computational Linguistics, June 2006, pp. 455–462. url: https://www. aclweb.org/anthology/N06-1058. [105] Aziz Khan and Anthony Mathelier. “Intervene: a tool for intersection and visualization of multiple gene or genomic region sets”. In: BMC bioinfor- matics 18.1 (2017), p. 287. [106] Joo-Kyung Kim et al. “Intent detection using semantically enriched word embeddings”. In: Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE. 2016, pp. 414–419. [107] Diederik Kingma and Jimmy Ba. “Adam: A Method for Stochastic Opti- mization”. In: International Conference on Learning Representations (Dec. 2014). [108] Guillaume Klein et al. “OpenNMT: Open-Source Toolkit for Neural Ma- chine Translation”. In: Proceedings of ACL 2017, System Demonstrations. Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 67–72. url: https://www.aclweb.org/anthology/P17-4012. [109] Reno Kriz et al. “Simplification Using Paraphrases and Context-Based Lexical Substitution”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long Papers). New Or- leans, Louisiana: Association for Computational Linguistics, June 2018, pp. 207–217. doi: 10.18653/v1/N18-1019. url: https://www.aclweb. org/anthology/N18-1019. [110] Matt Kusner et al. “From word embeddings to document distances”. In: International Conference on Machine Learning. 2015, pp. 957–966.


[111] Saar Kuzi, Anna Shtok, and Oren Kurland. “Query Expansion Using Word Embeddings”. In: Proceedings of the 25th ACM International on Confer- ence on Information and Knowledge Management. CIKM ’16. Indianapo- lis, Indiana, USA: ACM, 2016, pp. 1929–1932. isbn: 978-1-4503-4073-1. doi: 10.1145/2983323.2983876. url: http://doi.acm.org/10.1145/ 2983323.2983876. [112] Stefan Larson et al. “An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”. In: Proceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, : Association for Computational Linguistics, Nov. 2019, pp. 1311–1316. doi: 10.18653/v1/D19-1131. url: https://www.aclweb. org/anthology/D19-1131. [113] Stefan Larson et al. “Outlier Detection for Improved Data Quality and Diversity in Dialog Systems”. In: NAACL-HLT. Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 517–527. doi: 10.18653/v1/N19-1051. url: https://www.aclweb.org/anthology/ N19-1051. [114] Walter S. Lasecki et al. “Chorus: A Crowd-powered Conversational Assis- tant”. In: Proceedings of the 26th Annual ACM Symposium on User Inter- face Software and Technology. UIST ’13. St. Andrews, Scotland, United Kingdom: ACM, 2013, pp. 151–162. isbn: 978-1-4503-2268-3. doi: 10 . 1145/2501988.2502057. url: http://doi.acm.org/10.1145/2501988. 2502057. [115] Jens Lehmann et al. “DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia”. In: Semantic Web 6.2 (2015), pp. 167–195. [116] Alexander Lex et al. “UpSet: visualization of intersecting sets”. In: IEEE transactions on visualization and computer graphics 20.12 (2014), pp. 1983–1992. [117] Feng-Lin Li et al. “AliMe Assist : An Intelligent Assistant for Creating an Innovative E-commerce Experience”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. CIKM ’17. Singapore, Singapore: ACM, 2017, pp. 2495–2498. isbn: 978-1-4503-4918- 5. doi: 10.1145/3132847.3133169. url: http://doi.acm.org/10. 1145/3132847.3133169. [118] Guoliang Li et al. “Crowdsourced data management: A survey”. In: IEEE Transactions on Knowledge and Data Engineering 28.9 (2016), pp. 2296– 2319.


[119] Greeshma Lingam, Rashmi Ranjan Rout, and DVLN Somayajulu. “Deep Q-Learning and Particle Swarm Optimization for Bot Detection in On- line Social Networks”. In: 2019 10th International Conference on Com- puting, Communication and Networking Technologies (ICCCNT). IEEE. 2019, pp. 1–6. [120] Phoebe Liu et al. “Optimizing the design and cost for crowdsourced conver- sational utterances”. In: Proc. KDD-Data Collection, Curation, Labeling. 2019. [121] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Ap- proaches to Attention-based Neural Machine Translation”. In: Proceed- ings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sept. 2015, pp. 1412–1421. doi: 10.18653/v1/D15-1166. url: https: //www.aclweb.org/anthology/D15-1166. [122] Paweł Łupkowski and Jonathan Ginz. “A corpus-based taxonomy of ques- tion responses”. In: IWCS ’13 - International Workshop on Computational Semantics. 2013. [123] Xiao Ma, Trishala Neeraj, and Mor Naaman. “A Computational Approach to Perceived Trustworthiness of Airbnb Host Profiles”. In: 2017. url: https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/ 15630. [124] Nitin Madnani and Bonnie J. Dorr. “Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods”. In: Computational Lin- guistics 36.3 (2010), pp. 341–387. doi: 10.1162/coli_a_00002. url: https://www.aclweb.org/anthology/J10-3003. [125] Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. “Paraphrasing revisited with neural machine translation”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Vol. 1. 2017, pp. 881–893. [126] Sandya Mannarswamy and Saravanan Chidambaram. “GEMINIO: Find- ing Duplicates in a Question Haystack”. In: Advances in Knowledge Dis- covery and Data Mining. Ed. by Dinh Phung et al. Cham: Springer Inter- national Publishing, 2018, pp. 104–114. isbn: 978-3-319-93037-4. [127] Philip M. McCarthy, Rebekah H. Guess, and Danielle S. McNamara. “The components of paraphrase evaluations”. In: Behavior Research Methods 41.3 (2009), pp. 682–690. issn: 1554-3528. doi: 10.3758/BRM.41.3.682. url: https://doi.org/10.3758/BRM.41.3.682. [128] Mary L McHugh. “Interrater reliability: the kappa statistic”. In: Bio- chemia medica: Biochemia medica 22.3 (2012), pp. 276–282.


[129] Marcelo Mendoza and Juan Zamora. “Identifying the Intent of a User Query Using Support Vector Machines”. In: Aug. 2009, pp. 131–142. doi: 10.1007/978-3-642-03784-9_13. [130] Stefano Mezza et al. “ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents”. In: Proceedings of the 27th Inter- national Conference on Computational Linguistics. Santa Fe, New Mex- ico, USA: Association for Computational Linguistics, 2018, pp. 3539–3551. url: http://aclweb.org/anthology/C18-1300. [131] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and Their Compositionality”. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS’13. Lake Tahoe, Nevada: Curran Associates Inc., 2013, pp. 3111– 3119. url: http://dl.acm.org/citation.cfm?id=2999792.2999959. [132] David R. H. Miller, Tim Leek, and Richard M. Schwartz. “A Information Retrieval System”. In: Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval. SIGIR ’99. Berkeley, California, USA: ACM, 1999, pp. 214–221. isbn: 1-58113-096-1. doi: 10.1145/312624. 312680. url: http://doi.acm.org/10.1145/312624.312680. [133] George A. Miller. “WordNet: A Lexical Database for English”. In: vol. 38. 11. New York, NY, USA: ACM, Nov. 1995, pp. 39–41. doi: 10.1145/ 219717.219748. url: http://doi.acm.org/10.1145/219717.219748. [134] Andrea Millimaggi and Florian Daniel. “On Twitter Bots Behaving Badly: Empirical Study of Code Patterns on GitHub”. In: International Confer- ence on Web Engineering. Springer. 2019, pp. 187–202. [135] Margaret Mitchell, Dan Bohus, and Ece Kamar. “Crowdsourcing Language Generation Templates for Dialogue Systems”. In: (June 2014), pp. 172– 180. doi: 10.3115/v1/W14- 5003. url: https://www.aclweb.org/ anthology/W14-5003. [136] Robert R Morris, Mira Dontcheva, and Elizabeth M Gerber. “Priming for better performance in microtask crowdsourcing environments”. In: IEEE Internet Computing 16.5 (2012), pp. 13–19. [137] Courtney Napoles, Chris Callison-Burch, and Matt Post. “Sentential Para- phrasing as Black-Box Machine Translation”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Com- putational Linguistics: Demonstrations. San Diego, California: Associa- tion for Computational Linguistics, 2016, pp. 62–66. url: http://www. aclweb.org/anthology/N16-3013. [138] Gina Neff and Peter Nagy. “, algorithms, and politics| talking to bots: symbiotic agency and the case of tay”. In: International Journal of Communication 10 (2016), p. 17.


[139] Matteo Negri et al. “Chinese Whispers: Cooperative Paraphrase Acquisi- tion”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey: European Lan- guage Resources Association (ELRA), May 2012, pp. 2659–2665. url: http://www.lrec-conf.org/proceedings/lrec2012/pdf/772_Paper. pdf. [140] Tri Nguyen et al. “MS MARCO: A human generated machine reading comprehension dataset”. In: arXiv preprint arXiv:1611.09268 (2016). [141] Hamed Nilforoshan, Jiannan Wang, and Eugene Wu. “PreCog: Im- proving Crowdsourced Data Quality Before Acquisition”. In: CoRR abs/1704.02384 (2017). [142] Hamed Nilforoshan and Eugene Wu. “Leveraging Quality Prediction Mod- els for Automatic Writing Feedback”. In: Proceedings of the Twelfth In- ternational Conference on Web and Social Media, ICWSM 2018, Stan- ford, California, USA, June 25-28, 2018. 2018, pp. 211–220. url: https: //aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17802. [143] Jekaterina Novikova, Oliver Lemon, and Verena Rieser. “Crowd-sourcing nlg data: Pictures elicit better data”. In: arXiv preprint arXiv:1608.00339 (2016). [144] Shereen Oraby, Sheideh Homayon, and Marilyn Walker. “Harvesting cre- ative templates for generating stylistically varied restaurant reviews”. In: CoRR (2017). [145] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. “Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features”. In: NAACL 2018 - Conference of the North American Chapter of the As- sociation for Computational Linguistics. 2018. [146] Francis Palma et al. “Are RESTful APIs Well-Designed? Detection of their Linguistic (Anti)Patterns”. In: Service-Oriented Computing. Ed. by Alistair Barros et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 171–187. isbn: 978-3-662-48616-0. [147] Francis Palma et al. “Detection of REST Patterns and Antipatterns: A Heuristics-Based Approach”. In: Service-Oriented Computing. Ed. by Xavier Franch et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 230–244. isbn: 978-3-662-45391-9. [148] Francis Palma et al. “Semantic Analysis of RESTful APIs for the Detec- tion of Linguistic Patterns and Antipatterns”. In: International Journal of Cooperative Information Systems (May 2017), p. 1742001. doi: 10.1142/ S0218843017420011.


[149] Kishore Papineni et al. “BLEU: A Method for Automatic Evaluation of Machine Translation”. In: Proceedings of the 40th Annual Meeting on As- sociation for Computational Linguistics. ACL ’02. Philadelphia, Pennsyl- vania: Association for Computational Linguistics, 2002, pp. 311–318. doi: 10.3115/1073083.1073135. url: https://doi.org/10.3115/1073083. 1073135. [150] Sunghyun Park et al. “Paraphrase Diversification Using Counterfactual Debiasing”. In: Proceedings of the AAAI Conference on Artificial Intel- ligence 33 (July 2019), pp. 6883–6891. doi: 10 . 1609 / aaai . v33i01 . 33016883. [151] Cesare Pautasso. “RESTful web services: principles, patterns, emerging technologies”. In: Web Services Foundations. Springer, 2014, pp. 31–51. [152] Ellie Pavlick et al. “PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification”. In: Pro- ceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Lan- guage Processing (Volume 2: Short Papers). Beijing, China: Association for Computational Linguistics, 2015, pp. 425–430. url: http : / / www . aclweb.org/anthology/P15-2070. [153] Maria Pelevina et al. “Making Sense of Word Embeddings”. In: Proceed- ings of the 1st Workshop on Representation Learning for NLP. Berlin, Ger- many: Association for Computational Linguistics, 2016, pp. 174–183. doi: 10.18653/v1/W16-1620. url: http://www.aclweb.org/anthology/ W16-1620. [154] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vectors for Word Representation”. In: Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162. url: https://www.aclweb. org/anthology/D14-1162. [155] Les Perelman. “Grammar checkers do not work”. In: WLN: A Journal of Writing Center Scholarship 40.7-8 (2016), pp. 11–20. [156] Matthew Peters et al. “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 2227–2237. doi: 10.18653/ v1/N18-1202. url: https://www.aclweb.org/anthology/N18-1202.


[157] Fabio Petrillo et al. “Are REST APIs for Cloud Computing Well-Designed? An Exploratory Study”. In: Service-Oriented Computing. Ed. by Quan Z. Sheng et al. Cham: Springer International Publishing, 2016, pp. 157–170. isbn: 978-3-319-46295-0. [158] A. S. Popov et al. “Unsupervised dialogue intent detection via hierarchical ”. In: RANLP. 2019. [159] Maja Popović. “chrF: character n-gram F-score for automatic MT eval- uation”. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal: Association for Computational Linguistics, Sept. 2015, pp. 392–395. doi: 10 . 18653 / v1 / W15 - 3049. url: https : //www.aclweb.org/anthology/W15-3049. [160] Maja Popović. “chrF deconstructed: beta parameters and n-gram weights”. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Berlin, Germany: Association for Computational Linguistics, 2016, pp. 499–504. doi: 10.18653/v1/W16-2341. url: http: //aclweb.org/anthology/W16-2341. [161] Matt Post, Yuan Cao, and Gaurav Kumar. “Joshua 6: A phrase-based and hierarchical statistical machine translation system”. In: The Prague Bulletin of Mathematical Linguistics (2015). [162] Jean Pouget-Abadie et al. “Overcoming the Curse of Sentence Length for Neural Machine Translation using Automatic Segmentation”. In: Proceed- ings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Lin- guistics, Oct. 2014, pp. 78–85. doi: 10.3115/v1/W14-4009. url: https: //www.aclweb.org/anthology/W14-4009. [163] Chen Qu et al. “User Intent Prediction in Information-seeking Conver- sations”. In: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval. CHIIR ’19. Glasgow, Scotland UK: ACM, 2019, pp. 25–33. isbn: 978-1-4503-6025-8. doi: 10 . 1145 / 3295750 . 3298924. url: http://doi.acm.org/10.1145/3295750.3298924. [164] Alec Radford et al. “Improving language understanding by generative pre-training”. In: URL https://s3-us-west-2. amazonaws. com/openai- assets/research-covers/languageunsupervised/language understanding pa- per. pdf (2018). [165] Pranav Rajpurkar et al. “Squad: 100,000+ questions for machine compre- hension of text”. In: arXiv preprint arXiv:1606.05250 (2016). [166] Kiran Ramesh et al. “A Survey of Design Techniques for Conversational Agents”. In: Information, Communication and Computing Technology. Ed. by Saroj Kaushik et al. Singapore: Springer Singapore, 2017, pp. 336–350. isbn: 978-981-10-6544-6.


[167] Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. “Scalable Multi- Domain Dialogue State Tracking”. In: arXiv preprint arXiv:1712.10224 (2017). [168] Abhinav Rastogi et al. “Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset”. In: ArXiv abs/1909.05855 (2020). [169] Alexander J Ratner et al. “Data programming: Creating large training sets, quickly”. In: Advances in Neural Information Processing Systems. 2016, pp. 3567–3575. [170] Alexander J. Ratner et al. “Snorkel: Fast Training Set Generation for ”. In: Proceedings of the 2017 ACM International Conference on Management of Data. SIGMOD ’17. Chicago, Illinois, USA: ACM, 2017, pp. 1683–1686. isbn: 978-1-4503-4197-4. doi: 10 . 1145 / 3035918 . 3056442. url: http : / / doi . acm . org / 10 . 1145 / 3035918 . 3056442. [171] Abhilasha Ravichander et al. “How Would You Say It? Eliciting Lexically Diverse Dialogue for Supervised Semantic Parsing”. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 2017, pp. 374– 383. [172] Thomas Rebele et al. “YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames”. In: The Semantic Web – ISWC 2016. Ed. by Paul Groth et al. Cham: Springer International Publishing, 2016, pp. 177–185. isbn: 978-3-319-46547-0. [173] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. “Linguistic models for analyzing and detecting biased language”. In: Pro- ceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2013, pp. 1650–1659. [174] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: arXiv preprint arXiv:1908.10084 (2019). [175] Navid Rekabsaz, Mihai Lupu, and Allan Hanbury. “Uncertainty in neu- ral network word embedding: Exploration of threshold for similarity”. In: arXiv preprint arXiv:1606.06086 (2016). [176] Jared Rivera et al. “Annotation Process for the Dialog Act Classification of a Taglish E-commerce Q\&A Corpus”. In: Proceedings of the Second Workshop on Economics and Natural Language Processing. 2019, pp. 61– 68. [177] Stephen E Robertson et al. “Okapi at TREC-3”. In: Nist Special Publica- tion Sp 109 (1995), p. 109.

[178] Carlos Rodríguez et al. “REST APIs: A Large-Scale Analysis of Compliance with Principles and Best Practices”. In: Web Engineering. Ed. by Alessandro Bozzon, Philippe Cudré-Mauroux, and Cesare Pautasso. Cham: Springer International Publishing, 2016, pp. 21–39. isbn: 978-3-319-38791-8.
[179] Andreas Rücklé et al. “Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations”. In: CoRR abs/1803.01400 (2018).
[180] Md Shahriare Satu, Md Hasnat Parvez, et al. “Review of integrated applications with AIML based chatbot”. In: 2015 International Conference on Computer and Information Engineering (ICCIE). IEEE. 2015, pp. 87–90.
[181] Denis Savenkov and Eugene Agichtein. “CRQA: Crowd-powered real-time automatic question answering system”. In: Fourth AAAI Conference on Human Computation and Crowdsourcing. 2016.
[182] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge University Press, 2008.
[183] Rico Sennrich et al. “The University of Edinburgh’s Neural MT Systems for WMT17”. In: Proceedings of the 2nd Conference on Machine Translation. 2017.
[184] Iulian Vlad Serban et al. “A survey of available corpora for building data-driven dialogue systems”. In: arXiv preprint arXiv:1512.05742 (2015).
[185] Pararth Shah et al. “Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). New Orleans - Louisiana: Association for Computational Linguistics, June 2018, pp. 41–51. doi: 10.18653/v1/N18-3006. url: https://www.aclweb.org/anthology/N18-3006.
[186] Pararth Shah et al. “Building a conversational agent overnight with dialogue self-play”. In: arXiv:1801.04871 (2018).
[187] Charese Smiley et al. “The E2E NLG Challenge: A Tale of Two Systems”. In: Proceedings of the 11th International Conference on Natural Language Generation. 2018, pp. 472–477.
[188] María J Soler, Carmen Dasí, and Juan C Ruiz. “Priming in word stem completion tasks: comparison with previous results in word fragment completion tasks”. In: Frontiers in psychology 6 (2015), p. 1172.
[189] Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. 2017. url: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972.

[190] Yu Su et al. “Building Natural Language Interfaces to Web APIs”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. CIKM ’17. Singapore, Singapore: ACM, 2017, pp. 177–186. isbn: 978-1-4503-4918-5. doi: 10.1145/3132847.3133009. url: http://doi.acm.org/10.1145/3132847.3133009.
[191] Karen Sullivan. “If you study a word do you use it more often? Lexical repetition priming in a corpus of Natural Semantic Metalanguage publications”. In: Corpora 10.3 (2015), pp. 277–290.
[192] Md Arafat Sultan, Steven Bethard, and Tamara Sumner. “Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence”. In: Transactions of the Association for Computational Linguistics 2 (2014), pp. 219–230.
[193] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–3112.
[194] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw. “Understanding Chatbot-mediated Task Management”. In: arXiv preprint arXiv:1802.03109 (2018).
[195] Martin Tschirsich and Gerold Hintz. “Leveraging crowdsourcing for paraphrase recognition”. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 2013, pp. 205–213.
[196] Endel Tulving and Daniel L Schacter. “Priming and human memory systems”. In: Science 247.4940 (1990), pp. 301–306.
[197] Gokhan Tur et al. “Towards Deeper Understanding Deep Convex Networks for Semantic Utterance Classification”. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. url: https://www.microsoft.com/en-us/research/publication/towards-deeper-understanding-deep-convex-networks-for-semantic-utterance-classification/.
[198] Stefanie Ullmann and Marcus Tomalin. “Quarantining online hate speech: technical and ethical perspectives”. In: Ethics and Information Technology (2019), pp. 1–12.
[199] Ashish Vaswani et al. “Attention is All You Need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 6000–6010. isbn: 978-1-5108-6096-4. url: http://dl.acm.org/citation.cfm?id=3295222.3295349.

[200] Mandana Vaziri et al. “Generating Chat Bots from Web API Specifications”. In: Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. Onward! 2017. Vancouver, BC, Canada: ACM, 2017, pp. 44–57. isbn: 978-1-4503-5530-8. doi: 10.1145/3133850.3133864. url: http://doi.acm.org/10.1145/3133850.3133864.
[201] Denny Vrandečić and Markus Krötzsch. “Wikidata: A Free Collaborative Knowledgebase”. In: Commun. ACM 57.10 (Sept. 2014), pp. 78–85. issn: 0001-0782. doi: 10.1145/2629489. url: http://doi.acm.org/10.1145/2629489.
[202] Richard Wallace. “The elements of AIML style”. In: Alice AI Foundation 139 (2003).
[203] W. Y. Wang et al. “Crowdsourcing the acquisition of natural language corpora: Methods and observations”. In: 2012 IEEE Spoken Language Technology Workshop (SLT). 2012, pp. 73–78. doi: 10.1109/SLT.2012.6424200.
[204] Yushi Wang, Jonathan Berant, and Percy Liang. “Building a Semantic Parser Overnight”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 1332–1342. doi: 10.3115/v1/P15-1129. url: https://www.aclweb.org/anthology/P15-1129.
[205] Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. “Sentence Similarity Learning by Lexical Decomposition and Composition”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 1340–1349. url: https://www.aclweb.org/anthology/C16-1127.
[206] Thomas Wasow, Amy Perfors, and David Beaver. “The puzzle of ambiguity”. In: Morphology and the web of grammar: Essays in memory of Steven G. Lapointe (2005), pp. 265–282.
[207] Joseph Weizenbaum. “ELIZA - a Computer Program for the Study of Natural Language Communication Between Man and Machine”. In: Commun. ACM 9.1. New York, NY, USA: ACM, Jan. 1966, pp. 36–45. doi: 10.1145/365153.365168. url: http://doi.acm.org/10.1145/365153.365168.
[208] Joseph Weizenbaum. “ELIZA - a Computer Program for the Study of Natural Language Communication Between Man and Machine”. In: Commun. ACM 9.1 (Jan. 1966), pp. 36–45. issn: 0001-0782. doi: 10.1145/365153.365168. url: http://doi.acm.org/10.1145/365153.365168.

[209] Tsung-Hsien Wen et al. “A Network-based End-to-End Trainable Task-oriented Dialogue System”. In: EACL. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 438–449. url: http://www.aclweb.org/anthology/E17-1042.
[210] Jason Williams, Antoine Raux, and Matthew Henderson. “The dialog state tracking challenge series: A review”. In: Dialogue & Discourse 7.3 (2016), pp. 4–33.
[211] Diana S Woodruff-Pak. “Eyeblink classical conditioning in HM: delay and trace paradigms.” In: Behavioral neuroscience 107.6 (1993), p. 911.
[212] Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016).
[213] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: CoRR abs/1609.08144 (2016).
[214] Jingjing Xu et al. “DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text”. In: CoRR (2018).
[215] Qiongkai Xu et al. “D-PAGE: Diverse Paraphrase Generation”. In: CoRR abs/1808.04364 (2018).
[216] Wei Xu et al. “Extracting lexically divergent paraphrases from Twitter”. In: Transactions of ACL 2 (2014), pp. 435–448.
[217] M. Yaghoub-Zadeh-Fard et al. “User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities”. In: IEEE Internet Computing PP (Mar. 2020), pp. 1–1. doi: 10.1109/MIC.2020.2978157.
[218] Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, and Shayan Zamanirad. “Automatic Canonical Utterance Generation for Task-Oriented Bots from API Specifications”. In: Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020. Ed. by Angela Bonifati et al. OpenProceedings.org, 2020, pp. 1–12. doi: 10.5441/002/edbt.2020.02. url: https://doi.org/10.5441/002/edbt.2020.02.
[219] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “A Study of Incorrect Paraphrases in Crowdsourced User Utterances”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 295–306. doi: 10.18653/v1/N19-1026. url: https://www.aclweb.org/anthology/N19-1026.

[220] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances”. In: IUI ’20: 25th International Conference on Intelligent User Interfaces, Cagliari, Italy, March 17-20, 2020. Ed. by Fabio Paternò et al. ACM, 2020, pp. 55–66. doi: 10.1145/3377325.3377486. url: https://doi.org/10.1145/3377325.3377486.
[221] Mohammad-Ali Yaghoub-Zadeh-Fard et al. “REST2Bot: Bridging the Gap between Bot Platforms and REST APIs”. In: The Web Conference. 2020.
[222] Rui Yan. ““Chitty-Chitty-Chat Bot”: Deep Learning for Conversational AI.” In: IJCAI. 2018, pp. 5520–5526.
[223] Zhao Yan et al. “Building task-oriented dialogue systems for online shopping”. In: Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[224] Guohua Yang, Xiaojie Wang, and Caixia Yuan. “Hierarchical Dialog State Tracking with Unknown Slot Values”. In: Neural Processing Letters 50.2 (2019), pp. 1611–1625.
[225] Jie Yang et al. “Leveraging Crowdsourcing Data for Deep Active Learning An Application: Learning Intents in Alexa”. In: Proceedings of the 2018 World Wide Web Conference. WWW ’18. Lyon, France: International World Wide Web Conferences Steering Committee, 2018, pp. 23–32. isbn: 978-1-4503-5639-8. doi: 10.1145/3178876.3186033. url: https://doi.org/10.1145/3178876.3186033.
[226] Yi Yang, Wen-tau Yih, and Christopher Meek. “WikiQA: A challenge dataset for open-domain question answering”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 2013–2018.
[227] Yinfei Yang et al. “Learning Semantic Textual Similarity from Conversations”. In: Proceedings of The Third Workshop on Representation Learning for NLP. Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 164–174. url: http://aclweb.org/anthology/W18-3022.
[228] Shayan Zamanirad. “Superimposition of Natural Language Conversations over Software Enabled Services”. PhD thesis. UNSW Sydney, 2020.
[229] Shayan Zamanirad et al. “Programming bots by synthesizing natural language expressions into API invocations”. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press. 2017, pp. 832–837.
[230] Zheng Zhang et al. “Memory-augmented dialogue management for task-oriented dialogue systems”. In: ACM Transactions on Information Systems (TOIS) 37.3 (2019), p. 34.
