Superimposition of Natural Language Conversations over Software Enabled Services

Shayan Zamanirad

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering, Faculty of Engineering, January 2020

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ......

Date ......

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ......

Date ......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ......

Date ......

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure.

Publications can be used in their thesis in lieu of a Chapter if:

• The student contributed greater than 50% of the content in the publication and is the “primary author”, i.e. the student was responsible primarily for the planning, execution and preparation of the work for publication
• The student has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator.
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not.

☐ This thesis contains no publications, either published or submitted for publication (if this box is checked, you may delete all the material on page 2)

☒ Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement (if this box is checked, you may delete all the material on page 2)

☐ This thesis has publications (either published or submitted for publication) incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE’S DECLARATION

I declare that:

• I have complied with the Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.

Name / Signature / Date (dd/mm/yy)

Postgraduate Coordinator’s Declaration (to be filled in where publications are used in lieu of Chapters)

I declare that:

• the information below is accurate
• where listed publication(s) have been used in lieu of Chapter(s), their use complies with the Thesis Examination Procedure
• the minimum requirements for the format of the thesis have been met.

PGC’s Name / PGC’s Signature / Date (dd/mm/yy)

For each publication incorporated into the thesis in lieu of a Chapter, provide all of the requested details and signatures required.

Details of publication #1:
Full title: “Dynamic Event Type Recognition and Tagging for Data-driven Insights in Law-Enforcement”
Authors: Shayan Zamanirad; Boualem Benatallah; Moshe Chai Barukh; Carlos Rodriguez; Reza Nouri
Journal or book name: Springer Computing Journal
Volume/page numbers: NA
Date accepted/published:
Status: ☐ Published | ☒ Accepted and In press | ☐ In progress (submitted)

The Candidate’s Contribution to the Work

I am the first author of the paper and the primary contributor to the design of the proposed concepts, techniques, implementation and experimentation. I acknowledged my collaborators in the Acknowledgements section.

Location of the work in the thesis and/or how the work is incorporated in the thesis:

Chapter 3: Event Embeddings - Data-driven Insights in Law-Enforcement

Primary Supervisor’s Declaration

I declare that:

• the information above is accurate
• this has been discussed with the PGC and it is agreed that this publication can be included in this thesis in lieu of a Chapter
• all of the co-authors of the publication have reviewed the above information and have agreed to its veracity by signing a ‘Co-Author Authorisation’ form.

Supervisor’s name / Supervisor’s signature / Date (dd/mm/yy)

Dedication

To my mother, who keeps showering me with her love and kindness. Your love and prayers made me keep moving forward.

To the rock of my life, my father, who kept asking me when my research would end even though I kept giving him the wrong answers. I try my best to make you proud.

To my only brother, who is my only best friend, my hero. God bless you for all you did for me.

You are my life, a diamond in my life, my happiness, my strength, you are everything to me... I love you, I love you very much, forever... My Dianne.

Acknowledgements

I would like to thank all the people who supported me during this major step in my life. My family, friends and colleagues who helped me in achieving this thesis:

• I would like to thank my sponsor, Data to Decisions Cooperative Research Centre, for giving me the opportunity to be involved in this endeavour.

• I would like to thank my supervisor, Scientia Professor Boualem Benatallah. You have been such an inspiration and guide during these years. Without you, I would never have been able to do this work. Your energy and knowledge have always amazed me. Thank you for honing my skills to become a better researcher, to be tough in failures, and to be impactful and ethical in all my work. I really appreciated and enjoyed the opportunity to work with you every day during these years.

• Thanks to Professor Fabio Casati for his helpful guidance and comments on my work. I really appreciated the opportunity to work with you. It was always useful to have your expert eye on my work.

• I am thankful to everyone in the Service-Oriented Computing (SOC) group at UNSW, especially Dr Carlos Rodriguez and Dr Moshe Chai Barukh. Thanks to Dr Seyed Mehdi Reza Beheshti for your invaluable help. Special thanks to my former colleague Reza Nouri for all the fun and tough moments we had together during the projects. I gained an enormous amount of knowledge from you.

• Special thanks to MohammadAli for all his helpful insights and suggestions. I will never forget our tea times in the kitchen where most of our ideas emerged and came to life.

• My brother, Mojtaba, thanks for the good laughs and fun times. I am glad that I had you in my Ph.D. journey.

• My older sister, Maisie, thank you for everything. Your comments and advice were always helpful in achieving this milestone.

• Special thanks to John and Olga, my brother and sister from Colombia. I hope to keep your friendship for the rest of our lives. John, you are amazing. I will never forget how much we enjoyed the conference trips we attended together.

• To my aunties, Roghi and Latifeh, my uncle Kia, and my grandmother Esi, thank you for your prayers and for being present in spirit all this time.

• Special thanks to FlatIcon0. Most of the icons used in this thesis are from FlatIcon.

0https://www.flaticon.com

Abstract

Digital assistants and their instantiations in the form of messaging or chat bots, software robots and virtual assistants have become the quintessential engine for understanding user needs, expressed in natural language, and for fulfilling such needs by invoking the appropriate back-end software services. Continuous improvements in Natural Language Processing (NLP), Artificial Intelligence (AI), messaging interfaces and devices allow natural language-based interactions between users and a deluge of software enabled services, including interactions with “data sources”, “applications”, “resources” and “physical assets” (e.g., sensors). Increasingly, organisations leverage digital assistants to increase productivity and automate business processes in various application domains including office tasks, travel, healthcare, and e-government services.

Nonetheless, despite this early adoption, digital assistant technologies are still only in their preliminary stages of development, with several unsolved theoretical and technical challenges stemming from the lack of effective support for a wide range of possibly ambiguous user intents and for leveraging the large and growing number of services. More specifically, the lack of latent knowledge to represent the different types of software services and the lack of support for complex interactions between users and services inhibit the design and engineering of effective and efficient techniques that harness the full potential of natural interactions between users and software enabled services.

This thesis advances the fundamental and practical understanding of natural language based conversations between users, resources, services and devices. In this thesis we build upon advances in NLP and entity recognition and devise novel concepts and techniques to address important shortcomings in natural-language based conversational systems. Inspired by word embeddings, their extensions and impacts, we develop novel vector space models and techniques to capture, represent and reason about rich latent knowledge about user intents, semi-structured and textual artefacts (e.g., emails), data structures (e.g., attributes in an indexing schema over data sources) and API elements (e.g., API methods) to support potentially ambiguous natural language user requests and tasks. We develop extended state-machine based models to capture conversation patterns among users and services. We provide validation and evaluation of the proposed models and techniques.

Contents

Acknowledgements i

Abstract iii

Contents v

List of Figures xii

List of Tables xviii

1 Introduction 1

1.1 Background, Motivations and Aims ...... 1

1.2 Research Issues ...... 4

1.2.1 From Raw Unstructured Information Items to Semantically Annotated Information Items ...... 4

1.2.2 Schema-less and natural language access to heterogeneous information sources ...... 5

1.2.3 User Intents and APIs integration ...... 7

1.3 Contributions ...... 8

1.4 Thesis structure ...... 12

2 Background and State of the Art 13

2.1 Intent Recognition ...... 15

2.1.1 Rule based Techniques ...... 16

2.1.2 Traditional Classification-based Techniques ...... 19

2.1.3 Deep learning-based Techniques ...... 21

2.2 Dialogue Management ...... 22

2.2.1 Flow Based Models ...... 23

2.2.2 Deterministic State Machines based Models ...... 25

2.2.3 Probabilistic State Machines based Models ...... 30

2.2.4 Memory based Learning Models ...... 31

2.3 Natural Language Generation ...... 33

2.4 Term Embeddings ...... 35

2.4.1 Word Embedding ...... 35

2.4.2 Sentence Embedding ...... 37

2.4.3 Document Embedding ...... 40

2.4.4 Domain Specific Embeddings ...... 41

3 Event Embeddings 42

3.1 Introduction ...... 43

3.2 Related work ...... 45

3.3 Evidence Collection & Analysis ...... 47

3.4 Dynamic Event Type Recognition and Tagging ...... 49

3.4.1 Training Data Generation ...... 51

3.4.1.1 Datasets ...... 52

3.4.1.2 Training Data for Event Type Recognition ...... 53

3.4.2 Event Type Recognition ...... 54

3.4.2.1 Event-Type Vector Encoding ...... 55

3.4.2.2 Tuning Event-Type Vectors ...... 55

3.4.2.3 Event-Type Recognizer ...... 57

3.4.2.4 Event Information Extraction ...... 58

3.4.2.5 Event Recognizer REST APIs ...... 59

3.4.3 Insights and Discovery ...... 60

3.5 Experiments ...... 61

3.5.1 Event-Type Recognition ...... 62

3.5.1.1 Conventional Validation ...... 62

3.5.1.2 K-Fold Cross Validation ...... 63

3.5.1.3 Human Validation ...... 63

3.5.1.4 Effect of Word Embedding Model ...... 64

3.6 Discussion ...... 66

3.7 Concluding Remarks ...... 67

4 Attribute Embedding based Indexing of Heterogeneous Information Sources 69

4.1 Introduction ...... 70

4.2 Related work ...... 72

4.3 Security Vulnerability Information Model ...... 77

4.4 Collecting, Enriching and Indexing Security Vulnerability Information ...... 79

4.4.1 Security Vulnerability Information Collection, Adaptation and Enrichment 80

4.4.2 Security Vulnerability Information Indexing ...... 81

4.5 Security Vulnerability Information Embedding ...... 83

4.5.1 Training the Vector Space Model (VSM) ...... 84

4.5.2 Attribute Embedding ...... 85

4.5.3 Tuning Attribute Embedding ...... 86

4.5.4 Attribute Value Embedding ...... 87

4.5.5 Attribute/Value Recognition REST APIs ...... 88

4.6 Security Vulnerability Information Querying with NL Support ...... 89

4.7 Evaluation ...... 93

4.8 Conclusion and Future Work ...... 97

5 API Elements Embeddings 98

5.1 Introduction ...... 98

5.2 Related Work ...... 101

5.2.1 Summary ...... 111

5.3 Approach overview ...... 111

5.4 API Knowledge Graph (API-KG) ...... 115

5.5 Deriving API vectors ...... 116

5.5.1 Training the Vector Space Model (VSM) ...... 117

5.5.2 Populating the Knowledge Graph ...... 118

5.5.2.1 API Description Embedding ...... 119

5.5.2.2 API Method Embedding ...... 120

5.5.2.3 API Parameter Embedding ...... 121

5.5.3 API Parameter Enrichment - Acquiring Mentions ...... 123

5.6 Building bots using API-KG ...... 125

5.6.1 API-KG REST APIs ...... 125

5.6.2 Bot development scenario ...... 126

5.7 Experiments ...... 131

5.7.1 Bot Development using only API-KG ...... 131

5.7.1.1 Results - Benefits of using API-KG ...... 132

5.7.1.2 Results - Issues of using API-KG ...... 133

5.7.2 Bot Development using Bot-Builder ...... 134

5.7.2.1 Results - Searching with API-KG vs others ...... 134

5.7.2.2 Results - Bot Development using Bot Builder ...... 135

5.7.3 Bot Development by Non-developers ...... 137

5.7.4 Effect of Vectorization Techniques ...... 138

5.8 Conclusions and Limitations ...... 140

6 Multi-Turn and Multi-Intent User Chatbot Conversations 142

6.1 Introduction ...... 142

6.2 Human-Chatbot Conversations ...... 144

6.3 Conversation State Machines ...... 146

6.3.1 Transitions between States ...... 148

6.4 Generating State Machines ...... 150

6.4.1 Generating Intent States from Bot Specification ...... 150

6.4.2 Generating Transitions between States ...... 150

6.5 Conversation Manager Service ...... 153

6.5.1 Conversation Manager Architecture ...... 154

6.5.2 State Machine Generator ...... 154

6.5.3 Dialog Act Recogniser ...... 154

6.5.4 Slot Memory Service ...... 155

6.5.5 User-Chatbot Conversation Scenarios ...... 155

6.6 Extended Bot Builder - Conversations Support ...... 157

6.6.1 The Bot Builder Architecture ...... 158

6.6.2 Defining Bot Specification ...... 158

6.7 Validation ...... 159

6.7.1 Chatbot Development by Developers ...... 159

6.8 Conclusions and Future Work ...... 161

7 Conclusion 162

7.1 Summary of the Research Issues ...... 162

7.2 Summary of the Research Outcomes ...... 163

7.3 Future Research Directions ...... 164

BIBLIOGRAPHY 165

List of Figures

1.1 Research Approach ...... 12

2.1 Relevant APIs to the search query ...... 14

2.2 An utterance belongs to an intent and contains entities ...... 16

2.3 Defined pattern/response pairs for Greeting intent ...... 16

2.4 Defined rules to return corresponding results from database [8] ...... 17

2.5 Rules written by AIML (Left) and Rivescript (Right) scripting languages ..... 18

2.6 An excerpt of training dataset for machine learning based intent recognition model 20

2.7 Extracted features to define rules for entity extraction (e.g. meeting duration) from email content [51] ...... 21

2.8 Given a process model, the approach generates a set of rules to deploy a flow-based chatbot [153] ...... 24

2.9 Quick reply is an instant input (answer) from user ...... 24

2.10 A Carousel contains list of items with more details ...... 25

2.11 Building a flow-based chatbot to book doctor appointments by using the Chatfuel platform ...... 26

2.12 Dialog management by exploiting a defined finite state machine [112] ...... 27

2.13 An excerpt of conversation between user and Ava [109] ...... 27

2.14 User interaction with Daisy to explore available features and supported models [140] 28

2.15 Template code for Correlation command includes sample utterances and follow-up questions for required arguments, all defined by developer [69] . . . . 29

2.16 Iris state machine that supports command composition - call methods recursively, and sequencing - referencing previous command results [69] ...... 30

2.17 Devy’s finite state machine for handling workflows [192] ...... 31

2.18 Over informative answer from user for the question asked by state based chatbot [169] ...... 32

2.19 A sequence model (e.g. LSTM) is considered as a blackbox with a source sequence (user utterance) and a target sequence (chatbot response) ...... 33

2.20 Two-dimensional projection of vector space model that represents countries and their capital cities [176] ...... 35

2.21 Skip-gram model vs CBOW model - w(t) is the target word and w(t−2)...w(t+2) are context words [176] ...... 36

2.22 Subword-level in FastText - Given the word “accomodation”, which is a misspelling of accommodation, we still get closest words ...... 38

2.23 A sequence to sequence machine translation neural network - encoder and decoder are connected together through a hidden state which represents the input sentence [232] ...... 39

2.24 Two-layer DAN - it first converts the given sentence to a vector by averaging all its words, then feeds a feed-forward DNN to generate a new sentence [107] . . . 40

3.1 Illustration of an investigator’s workspace...... 48

3.2 Our framework for dynamic recognition and tagging of event-types for insights and discovery ...... 50

3.3 Detailed architecture for Training Data Generation, Event Type Recognition, and Insight and Discovery...... 51

3.4 Excerpt of one of the case files in our case dataset...... 52

3.5 Excerpt of the gold standard data showing two sentences and their corresponding event-types...... 52

3.6 Event Type Vector Encoding Using Seed n-grams ...... 54

3.7 Tuning Event Type Vector by using training dataset ...... 56

3.8 Recognition of Event Types from Evidence Logs ...... 58

3.9 Swagger documentation for Event Regognition Service ...... 59

3.10 Case Walls for Law-Enforcement ...... 61

3.11 Event Type Recognizer - Precision/Recall/F-Score for 0-50% training set (testing set is fixed on 50% of gold dataset) ...... 62

4.1 User can add (A) additional entities (rows) and (B) additional columns from list of suggestions [292] ...... 75

4.2 Example of a CI similarity query - find similar customers based on their purchased items [25] ...... 76

4.3 Converting tables into sentences to create a corpora to train embedding model [25] 77

4.4 (a) Security Vulnerability Information Model [54]. (b) Architecture for collecting, enriching, indexing and querying security vulnerability information. The bottom part of the architecture operates offline, while the upper part operates online. . . . . 79

4.5 Index-ready JSON representation of security vulnerability information model (introduced in Figure 4.4(a)): (a) JSON schema of information model (shaded attributes correspond to enrichments), (b) Example of a single document containing vulnerability information, (c) JSON schema for storing attribute mentions, (d) example of two attributes and their possible mentions ...... 82

4.6 Pipeline for constructing attribute- and value-embeddings...... 84

4.7 Attribute Vector Encoding Using Extracted Words ...... 86

4.8 Tuning Attribute Vector by using mentions ...... 87

4.9 Generating Value Vector by using indexed values ...... 88

4.10 Swagger documentation for Attribute/Value Embedding Service ...... 89

4.11 Steps for NL to ElasticSearch’s DSL query translation...... 90

4.12 Dependency tree indicates the attachments between tokens (words) in an NL query 91

4.13 Translating NL queries into ElasticSearch’s DSL query...... 92

5.1 An answer in StackOverflow which adds more details to the API documentation [255] ...... 104

5.2 A partial code completion - (Hx) are empty places which are fulfilled by SLANG [212] ...... 106

5.3 An example of API sequences and annotation for a Java method IOUtils.copyLarge [83] ...... 107

5.4 An example of Stack Overflow post used to extract the keyword-API mapping: (a) question, (b) accepted answer [208] ...... 110

5.5 Typical Bot development process (Left) vs Bot development process using API-KG (Right) ...... 112

5.6 Approach overview ...... 113

Page xv of 194 5.7 Finding relevant APIs for the given goal ...... 114

5.8 API Knowledge Graph (Yelp API) ...... 116

5.9 Generating an API description vector: (i) Keyword extraction, (ii) Keyword extension, (iii) Generating the final vector by averaging the vectors of keywords . . 119

5.10 Crowdsourcing task to provide three paraphrases per annotated utterance ..... 121

5.11 Generating API method embedding - (i) API owner provides an initial utterance that best describes the interaction with the method, (ii) initial utterance is then paraphrased by crowd workers to collect more utterances, (iii) collected paraphrases are then used to generate a method embedding ...... 122

5.12 Generating an API parameter vector: (i) Value extraction, (ii) Value extension, (iii) Generating the final vector by averaging the vectors of values ...... 123

5.13 Choosing a target synset: (i) Retrieve synsets from BabelNet, (ii) Extract key tokens from the naming of synsets, (iii) Generate a vector per synset by averaging the vectors of tokens, (iv) Choose the synset with the closest vector to the vector of the parameter ...... 124

5.14 Swagger documentation for API-KG ...... 126

5.15 Relevant APIs to the search query ...... 127

5.16 Building chatbot using Bot Builder ...... 128

5.17 Relevant API Methods to seed utterances ...... 129

5.18 Utterances associated to an API Method ...... 130

6.1 Types of human-chatbot conversations - from less to more natural ...... 145

6.2 User changes the intent to know about her calendar schedule ...... 146

6.3 Transition between intent-states based on user intent - current intent-state is denoted by blue color, “new intent” transition is highlighted in orange ...... 147

6.4 Transition to nested slot-value state - current nested slot-value state is denoted by red color ...... 148

6.5 Transition to nested slot-intent state - user’s answer is a request to another intent, state machine moves from “location” nested slot-value state to a nested slot-intent state (“GetUserDetails”) to obtain the value for the missing slot ...... 149

6.6 Conversation Manager Architecture ...... 153

6.7 Bot Builder Architecture - Automated Chatbot Development ...... 157

List of Tables

2.1 An illustrative example of a user utterance ...... 23

3.1 Examples of Evidence Sources ...... 47

3.2 Event Type Recognizer - Precision(P)/ Recall(R)/ F-Score(F)/ Average(Avg) for 5-Fold Cross-Validation ...... 64

3.3 Event Type Recognizer - Precision(P) and Recall(R) while using different embedding models ...... 66

4.1 Sample of questions asked when diagnosing security vulnerabilities (extracted from [233]). Terms that are relevant for the security vulnerability domain are underlined...... 78

4.2 Examples of adapted questions. The questions in bold font are the original questions (Q) from [233], while questions in regular font are examples of adapted questions (AQ). We used a total of 65 variants of these AQs for the evaluation. . . 94

4.3 Evaluation results. We report on average values for |Rel|. We also report on the average values for R-Precision when using no embedding, GoogleNews embedding, Wikipedia embedding and Security embedding. We use the metric P@10 for questions with large |Rel| [226]. Entries marked with N/A mean that the approach was not able to return any results whatsoever...... 95

5.1 Examples of API Methods, their associated utterance, and possible paraphrase. . 120

5.2 Comparison between API method vectorization techniques for given natural language search queries ...... 140

6.1 Examples of Dialog Acts in a conversation between user and chatbot...... 152


Chapter 1

Introduction

1.1 Background, Motivations and Aims

Software enabled services are central to the operation of digital processes [14, 28]. They are pervasive in core processes that streamline the delivery of structured services such as HR, procurement, payroll, and banking (e.g., loans). They are also pervasive in support processes that enhance the efficiency and effectiveness of indirect activities and streamline the management of day-to-day tasks (e.g., sending emails, scheduling meetings, recording notes, managing tasks). Continuous improvements in connectivity, user interfaces and software platforms allow access to a deluge of software enabled services, including data services, Internet of Things (IoT) services, document management services, cloud resource services, task management services and platform services. With the advent of widely available software enabled technologies, coupled with intensifying global competition and fluid business and social requirements, organizations are rapidly shifting to the digitization of their processes. Accordingly, organizations have embraced the radical changes that are necessary for increased productivity and effectiveness. Capabilities arising from advances in digital transformation technologies have enabled organizations to increase productivity, embrace automation, and extend business to locations far beyond their normal operations. Now, at all levels, software services enabled digital transformation is firmly recognised as a strategic priority for modern organizations [138, 197]. It is also firmly recognised that the online service-enabled economy - also called the digital economy - is a central priority for economic development. As economies undergo significant structural change, digital strategies and innovation must provide industries across the spectrum with tools to create a competitive edge and build more value into their services [19, 138, 52].

Clearly, advances in online service technologies have already transformed the Internet into global workplaces, social forums, and collaboration and business platforms – allowing services to be delivered via the Internet from any location to any other location. In addition, services integration has also matured in recent years [18, 3]. More specifically, while Software as a Service (SaaS) transformed the enterprise software industry over the last decades by reducing the cost of software-enabled operations and improving the automation of processes, the enabling engine for scalable integration and automation is Application Programming Interfaces (APIs). APIs allow programmatic access to heterogeneous and autonomous data sources and applications via standard protocols and languages. They are at the heart of integrating services and streamlining process automation. In a nutshell, once an online service has reached a threshold of popularity, organisations are competitively compelled to implement APIs in order to allow third-party developers to write auxiliary or satellite ‘apps’ which add new uses to the original service, enrich its features and accessibility, enhance its agility and accelerate overall development and integration. As mentioned before, APIs essentially unlock application, data source and device silos through standardized interaction protocols and access interfaces [190, 138, 102]. They are the glue of online services and their interactions. They are fundamental to the Web; social media already depend heavily on APIs, as do cloud and enterprise services (e.g., document management tools, databases, platforms, applications, appliances, IoT and sensors) [3, 152].

However, while software enabled services and APIs have enabled organisations to increase efficiency and streamline services integration and process automation, new usability, productivity and effectiveness challenges have also emerged. Human users involved in process delivery typically access services by separately accessing a variety of (cloud-based) software tools and interfaces like apps, Web sites, and productivity, task management and collaboration interfaces. For instance, users may access, query, integrate and analyze data using common user productivity services (e.g., spreadsheets, database interfaces, and SaaS applications and tools such as CRM, social media, email and collaboration tools) over underlying data services. Characteristic examples of such a paradigm are numerous, across government, enterprise and the consumer arena. Consequently, processes are in general hidden and highly unstructured, i.e., there is no visibility and traceability over the end-to-end process interactions. In addition, even sophisticated end users regularly resort to low-efficiency manual methods to draw information from one service and use it elsewhere to support their tasks. This problem worsens as the variety of available services and the flexibility of available tools increase. At the other extreme of the spectrum, structured processes known as automated business processes are managed using dedicated enterprise software systems called workflow management systems [257]. The main drawbacks of structured processes stem from their inherent development and maintenance costs, and thus they fail to cater for the much needed agility in today’s dynamic environments.

Users should be empowered to benefit from the power of SaaS and APIs in performing their day-to-day activities in digitally enabled processes. However, a commonly overlooked limitation of SaaS technologies is that they do not make available services accessible to human users in a natural manner. For instance, Web browsers and applications allow users to access underlying information and services through clicking, scrolling and form filling on visual interfaces. At the same time, conversational Artificial Intelligence (AI) and its instantiations in the form of messaging or chat bots, task-oriented conversational bots, software robots, and digital or virtual assistants have emerged as a new paradigm to naturally access services and perform tasks through natural language (text or voice) conversations with software services and humans. Conversational AI based services enable the understanding of user needs, expressed in natural language, and the fulfilment of such needs by invoking the appropriate back-end services. We will use the term conversational bots to refer to all the different instantiations of conversational AI based services.

Applications such as Apple Siri, Google Assistant, Amazon Alexa, Baidu and Microsoft Cortana have enabled an increasingly large number of conversational bots in different application domains such as marketing, health and customer support. At present, efforts in conversational bots focus on making software technologies human-inclusive: more than 100 million Alexa devices and almost 1 billion Google Assistant devices have been installed in our homes collectively1. Similarly, organizations are also embracing conversational bots as “side-by-side digital co-workers” [80, 75]. Forrester analysts predict this will be the case for more than 40% of all companies by the end of 2019 [116].

Conversational AI is part of a larger and major transformation in software enabled services, namely AI enabled process augmentation, including the augmentation of human work and AI-powered digital assistants [66, 156, 192]. From a service engineering perspective, the integration of bots and software enabled services has not kept pace with our ability to deploy individual devices and services. Despite advances in various research areas on the one hand, and the greater availability and

1https://www.cnet.com/news/google-assistant-expands-to-a-billion-devices-and-80-countries/

malleability of data and services on the other, the process still breaks down when we attempt to put it all together. For instance, current bot development techniques rely on human understanding of different APIs and extensive manual programming to produce bots that interact with enterprise services. This is clearly unrealistic in large-scale and evolving environments. The ubiquity of conversational bots will have little value if they cannot easily integrate and reuse concomitant capabilities across a large number of evolving and heterogeneous information sources, databases, resources, devices and applications. This is a very ambitious objective, and the investigation of the related research issues requires a meaningful integration of concepts and techniques from machine learning, knowledge representation and extraction, API engineering and natural language conversations between users and back-end services.

In this thesis, we contribute novel abstractions and techniques focusing on re-conceptualizing the integration of existing natural-language based conversational systems and back-end services to better leverage available software-enabled capabilities across a large number of evolving and heterogeneous information sources, databases, resources, devices and applications. The abstractions and techniques seek to enable and, indeed, semi-automate the augmentation of services with latent knowledge and interaction models that are essential to reason about potentially ambiguous user tasks and semi-structured artefacts (e.g., emails, PDF files), and to support natural language interactions between users, bots and back-end services (e.g., integrated data services and APIs).

1.2 Research Issues

1.2.1 From Raw Unstructured Information Items to Semantically Annotated Information Items

Most knowledge-driven processes involve accessing and understanding a large number of items, including documents (e.g., PDF and Word files, spreadsheets), conversations (e.g., emails, tweets, social media posts, text messages, interactions in collaboration platforms like Yammer and Slack) and other information (e.g., notes), to extract information, generate insights and make decisions.

This is the case, for instance, in law enforcement investigation processes. Investigation cases may last for years. Investigators have to collect, annotate and organize troves of information (e.g., witness statements, forensic reports and telephone intercepts) to identify evidence. Knowledge-driven processes are highly cognitive, both with respect to collecting and analysing information and with respect to inferring dependencies between information to eventually produce insights (e.g., evidence and evidence items). Despite advances in various research areas, most cognitive tasks are performed manually; this is no doubt tedious, error prone and highly inefficient [12, 207]. For instance, in investigations this may lead to offences not being identified due to limited manual processing power. Search and query systems are not accurate in identifying specific evidence elements (e.g., events such as phone calls, bank transfers, travel movements). It is very time consuming for investigators to keep track of relevant events (facts about things that happened in reality) and identify possible offences from raw evidence items and logs. Overall, traditional search, enterprise search and query techniques merely scratch the surface of the knowledge-driven process analysis problem, as both their applicability and their effectiveness are very limited, and consequently their usefulness in real-life situations such as investigations is negligible.

The challenge is devising scalable and effective entity-based enrichment, exploration and manipulation techniques that will unlock knowledge-driven processes. For instance, to enable investigators to understand large-scale investigation items, there is a need to extract relevant entities from information items, find similarities among them, classify them into groups, and more. Effective cognitive support would not only help extract events, but also help analyse and attach semantics to evidence items and ask natural language questions to identify information that is relevant to a given event, entity, offence, etc. Users should be able to identify evidence items that mention events. There should also be assistance to reconstruct chains of events, identify the parties involved and understand their temporal dynamics, among other aspects. Without such techniques, knowledge-driven investigations will be tedious, as they would be like querying a database where the relations among tables are not modelled. Such a development represents a significant undertaking, not only in terms of the effort and time required, but also in complexity.

1.2.2 Schema-less and natural language access to heterogeneous information sources

As mentioned before, the ubiquity of conversational bots will have little value if they cannot easily integrate and reuse concomitant capabilities across a large number of evolving and heterogeneous information sources, databases, resources, devices and applications. Daily knowledge-driven processes are conducted through ad-hoc processes that require access to information stored in various data sources and services. No single information source can fully support the requirements of a given process in terms of information inquiries. For instance, cyber security analysts and professionals need integrated access to information to become aware and informed about security vulnerabilities. They find information from several sources (e.g., vulnerability databases, security bulletins and advisories, social media) to identify newly discovered vulnerabilities, learn about existing vulnerabilities, identify new exploits and vulnerable software packages, link information, etc. [294].

However, as in several other domains, while much of the security vulnerability information may be available from both private and public information sources, such information is in many cases scattered across different, heterogeneous and complex information silos (e.g., vulnerability databases, security bulletins and advisories, social media) [295]. Knowledge-driven processes over complex information silos are time consuming, frustrating, error prone, repetitive, and often bloated with unnecessary work. Even when sophisticated indexing techniques such as ElasticSearch [81] are used to provide an integrated view over various data sources, the interfaces that are used to access and integrate information are not appropriate for domain analysts [115]. Query interfaces used to access information through integrated and indexed sources are in general keyword search, SQL-like or Domain Specific Language (DSL) based [219]. Keyword search techniques are known for their accuracy limitations [115, 219]. The other techniques presuppose technical expertise comparable to that of professional users, including employing different low-level APIs to access various data sources, together with procedural data flow constructs [91, 50, 92].

While existing techniques in data management, information retrieval and indexing have produced promising results that are certainly useful, more advanced techniques are needed that cater for the understanding of ad-hoc and knowledge-driven processes, interact with users in natural language, improve productivity, and effectively support users’ daily tasks. More specifically, existing indexing and integration techniques lack latent knowledge about index and information source attributes. This knowledge is essential to reason about potentially ambiguous user intents and effectively map them to queries over indexed data sources.

1.2.3 User Intents and APIs integration

While enriching non-structured information items with meta-data like events and supporting natural language queries over integrated information sources are important requirements for the effective integration of bots and software enabled services, modern organisations use a large number of applications, like task and document management (e.g., Trello, Jira, Google Docs, Dropbox) and collaboration apps (e.g., Slack, Yammer). In addition, APIs can be used to provide uniform access to these apps and facilitate their integration. As mentioned before, APIs are the engine for online services integration, which is essential for streamlining both knowledge-driven processes (e.g., managing documents through the Google Docs or Dropbox APIs) and structured processes (e.g., triggering workflows using a workflow management system’s APIs).

APIs enable an actionable framework across many different applications, data sources and devices. It is estimated that there are already 50,000 APIs, and that this number will grow rapidly over the next few years [73]. This growth will come from APIs for linking cloud resources and applications as well as APIs for appliances, mobile devices, sensors and vehicles. We believe that the seamless integration of bots and APIs will chart an effective new paradigm to make conversational services both user- and process-centred. A main, hard challenge in achieving this new paradigm is linking low-level API abstractions (e.g., API calls) to high-level bot abstractions (e.g., user intents, user utterances) [289].

In existing bot and API integration techniques, developers leverage NLP and machine learning capabilities to recognise user intents [165, 271, 227]. However, the burden of integrating user intents with APIs is shifted to developers, who must bolt intents onto existing low-level APIs. This leads to an inflexible and costly environment which adds considerable complexity, demands extensive programming effort, requires extensive bot training and perpetuates closed cloud solutions. Effective integration of natural language conversations and APIs requires rich API abstractions to reason about potentially ambiguous user intents and effectively integrate intents and APIs. We need more dynamic and knowledge-driven techniques that provide high-level and latent knowledge based reasoning about API elements (i.e., descriptions, methods and parameters) and automated support for matching intents and APIs. Furthermore, a user intent may be complex, and its realisation may require complex conversations between users, bots and APIs. Consequently, designing an effective integration of natural language conversations and API-enabled services remains a deeply challenging problem.

1.3 Contributions

We build upon advances in NLP, machine learning, information indexing, knowledge graphs, and dialog and conversation modelling techniques. We contribute innovative concepts and techniques to scale the integration of natural language based and software enabled services [28]. The proposed concepts and techniques resolve important gaps in the integration of natural language based conversations and software enabled services. They enable new efficiencies to bridge these gaps, including: (i) the enrichment of unstructured information items with entities and event types supporting their semantic understanding; (ii) the semantic augmentation of indexing attributes of integrated information sources to effectively support multi-entity mentions and ambiguous user queries over heterogeneous data sources; (iii) latent-knowledge based middleware techniques and services to support the effective integration of user intents and APIs; (iv) hierarchical state machine based models to represent and reason about complex interactions between users, bots and APIs. We investigate and develop software architectures, prototypes, evaluation studies and applications to assess the proposed models and techniques.

From Raw Information Item to Semantic Information Item - Enrichment of unstructured information items with entities and event types.

This study was conducted in the context of knowledge-driven law enforcement investigations1. The study provided us with formidable challenges that are relevant not only to this domain but to most knowledge-driven processes where semantic understanding of raw unstructured information items (e.g., emails, PDF files, communication messages, social media posts) is required. Our objective is to enable investigators to understand, organize and query the large amount of unstructured information items collected and generated during investigation processes. We build upon advances in NLP (e.g., extracting information from unstructured information items) [91, 48], word embeddings [176] and knowledge-based enrichment (e.g., extracting entity mentions from knowledge graphs) [185] to enable the recognition of events from investigation case-related information (e.g., collected evidence items). We encode event-types as vectors in a vector space model

1Data to Decisions CRC, Data Curation Foundry project

based on the distributional semantics of sentences in evidence items. The proposed approach confers robust event recognition because it caters for the automated identification and enrichment of variations and word mentions across various information items. Event and event-type vector based similarity and matching techniques are then used to identify when sentences in evidence items relate to a particular event-type.
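To make the event-type encoding and matching concrete, below is a minimal, self-contained Python sketch under stated assumptions: the tiny word vectors, seed terms and event-type names are invented for illustration, and a real pipeline would use a pre-trained word embedding model (e.g., word2vec) plus the tuning step described in Chapter 3:

    import numpy as np

    # Toy 3-dimensional word vectors; a real system would load a pre-trained
    # word2vec/fastText model instead (these values are illustrative only).
    word_vectors = {
        "phone":    np.array([0.9, 0.1, 0.0]),
        "call":     np.array([0.8, 0.2, 0.1]),
        "rang":     np.array([0.7, 0.3, 0.1]),
        "bank":     np.array([0.1, 0.9, 0.2]),
        "transfer": np.array([0.2, 0.8, 0.3]),
        "wired":    np.array([0.1, 0.7, 0.4]),
    }

    def embed(tokens):
        """Average the vectors of the tokens that have an embedding."""
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Event-type vectors encoded by averaging seed n-grams (hypothetical seeds).
    event_types = {
        "PhoneCall":    embed(["phone", "call"]),
        "BankTransfer": embed(["bank", "transfer"]),
    }

    def recognise(sentence_tokens, threshold=0.6):
        """Return the closest event-type, or None if nothing is similar enough."""
        v = embed(sentence_tokens)
        if v is None:
            return None
        best_sim, best_name = max((cosine(v, tv), name)
                                  for name, tv in event_types.items())
        return best_name if best_sim >= threshold else None

    print(recognise(["the", "phone", "rang"]))        # -> PhoneCall
    print(recognise(["money", "wired", "overseas"]))  # -> BankTransfer

Averaging is the simplest composition choice; the approach in Chapter 3 further tunes the event-type vectors against training data to sharpen the match.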

Semantic Augmentation of indexing attributes in integrated information sources.

This study was conducted in the context of knowledge-driven security vulnerability investigations2 [219]. The study provided us with formidable challenges that are relevant not only to this domain but to most knowledge-driven processes, where integrated data services must be endowed with capabilities to reason about potentially ambiguous natural language information search and linkage queries and to effectively map them to structured queries over an integrated index of a large number of heterogeneous information sources. Our objective is to improve the accuracy of the data retrieval operations involved in security vulnerability information search and linkage (e.g., identifying newly discovered vulnerabilities, learning about existing vulnerabilities, identifying new exploits and vulnerable software packages) from various information sources (e.g., vulnerability databases, security bulletins and advisories, social media) [115]. The proposed techniques will allow security analysts and professionals to leverage available information sources to gain insights into potential or existing security vulnerabilities and improve awareness and security assurance strategies in general.

We build upon advances in NLP (e.g., natural language queries) [48], word embeddings [176, 174] and knowledge-based enrichment (e.g., WordNet, BabelNet and ConceptNet) [178, 186, 235] to enable the augmentation of indexing attributes with the semantics that is essential to support multi-entity mentions and ambiguous user queries over heterogeneous data sources. We propose a novel attribute-based embedding indexing mechanism over vulnerability information data sources that leverages knowledge graph-based enrichment. We devise query mapping techniques that are able to translate NL queries into ElasticSearch queries. These techniques allow retrieving and correlating vulnerability information from large and heterogeneous information sources. They do not require the users to have precise knowledge of index or information source schemas.
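As a concrete illustration of the query mapping step, the Python sketch below turns (attribute, value) pairs recognised in an NL question into an ElasticSearch bool/match query; the field names and the example question are assumptions for illustration, not the exact index schema of Chapter 4:

    # Minimal sketch: the attribute/value recognition step (via attribute and
    # value embeddings) is assumed to have already produced (attribute, value)
    # pairs from the user's natural language question.
    def to_elasticsearch_query(recognised_pairs):
        """Build an ElasticSearch bool query from recognised attribute/value pairs."""
        return {
            "query": {
                "bool": {
                    "must": [
                        {"match": {attribute: value}}
                        for attribute, value in recognised_pairs
                    ]
                }
            }
        }

    # Hypothetical NL question: "Which Windows vulnerabilities allow remote
    # code execution?" recognised as two attribute/value pairs:
    query = to_elasticsearch_query([
        ("vulnerable_product", "windows"),
        ("impact", "remote code execution"),
    ])
    print(query)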

2Data to Decisions CRC, Data Curation Foundry project

Latent-knowledge based middleware techniques and services to support effective integration of user intents and APIs.

Effectively integrating conversational services with the information and functionality that is accessible through APIs will allow these services to take full advantage of available software-enabled technologies in order to keep up with the rapid increase of data and process enabled opportunities. We therefore set forth the semi-automated integration of user intents in conversational services with a potentially large and evolving set of APIs as a key feature to achieve the above objective [289]. This will allow a new generation of conversational services where ad-hoc process tools (e.g., apps) and structured process management systems (e.g., workflow management systems) are augmented with conversational digital assistance. This type of assistance will have bot-like natural language conversational interfaces, with various layers of integration intelligence as its core components. It will enable a framework where conversational services and APIs work together in tandem to unify disparate process and work tools, giving life to new services that allow processes to benefit from the power of NLP, conversational AI and APIs.

We build upon advances in NLP (e.g., user utterance understanding, entity recognition) [48], word embeddings [176, 174] and knowledge-based enrichment (e.g., extraction of entity mentions from knowledge graphs) to enable the augmentation of API elements (e.g., descriptions, methods and invocation parameters) with the semantics that is essential to support the effective mapping of high-level user intent abstractions (e.g., user goals, user utterances) to low-level API abstractions (e.g., API calls). We propose to represent API elements as vectors in an extended vector space model. We combine crowdsourcing, knowledge graph and word embedding techniques to build and enrich API element embeddings. We propose knowledge-powered middleware techniques and services to interact with APIs based on how users would call a method in natural language.
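The core matching idea can be sketched in Python as follows; the two-dimensional vectors and Yelp-style method names are invented for illustration, whereas the actual approach in Chapter 5 builds method vectors from owner-seeded utterances and crowdsourced paraphrases over a pre-trained embedding model:

    import numpy as np

    # Illustrative word vectors (2-d for readability; not from a real model).
    vectors = {
        "find":       np.array([0.9, 0.1]),
        "restaurant": np.array([0.8, 0.2]),
        "reviews":    np.array([0.1, 0.9]),
        "rating":     np.array([0.2, 0.8]),
    }

    def embed(tokens):
        vecs = [vectors[t] for t in tokens if t in vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Each API method is represented by the average vector of its example
    # utterances (reduced here to a few keywords; method names are hypothetical).
    method_vectors = {
        "yelp.search_businesses": embed(["find", "restaurant"]),
        "yelp.get_reviews":       embed(["reviews", "rating"]),
    }

    def match_method(utterance_tokens):
        """Route an utterance to the API method with the most similar vector."""
        u = embed(utterance_tokens)
        if u is None:
            return None
        return max(method_vectors, key=lambda m: cosine(u, method_vectors[m]))

    print(match_method(["find", "me", "a", "restaurant"]))  # yelp.search_businesses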

Hierarchical state machine based models to represent and reason about complex interactions between users, bots and APIs.

A key distinguishing feature of conversational services is dialog patterns, i.e., the interaction styles needed to fulfil user intents (e.g., a question from the bot to the user to resolve the value of a missing intent parameter, an invocation of an API by the bot to resolve the value of a missing parameter, a question from the bot to the user to confirm an inferred intent value or make a choice among several options, extracting an intent parameter value from the history of user and bot interactions) [99]. Interactions involve user utterances, conversation management acts and actions (e.g., API calls, natural language response generation). Instead of relying on low-level scripting mechanisms or provider-specific rule engines, we argue that models and languages for describing natural language conversations between users, bots and APIs should be endowed with intuitive and automation-friendly constructs that can be used to specify a range of dialog patterns. We propose the concept of conversation state machines as an abstraction to represent and reason about dialog patterns. Conversation state machines represent multi-turn and multi-intent conversations where states represent intents, their parameters and the actions to realise them. Transitions between states are triggered when certain conditions are satisfied (e.g., detection of a new intent, detection of a missing intent parameter). Transitions automatically trigger actions to perform the desired intent fulfilment operations. We propose the automated generation of run-time nested conversation state machines that are used to deploy and control conversations with respect to user intents.
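A minimal Python sketch of the conversation state machine idea follows, with invented intent and slot names; it captures intent states, slot filling and two transition triggers (new intent, missing slot), and omits the nested slot-intent states and dialog acts developed in Chapter 6:

    class IntentState:
        """An intent state holds the intent's required slots and filled values."""
        def __init__(self, name, required_slots):
            self.name = name
            self.required_slots = required_slots
            self.slots = {}

        def missing_slot(self):
            for s in self.required_slots:
                if s not in self.slots:
                    return s
            return None

    class ConversationStateMachine:
        def __init__(self, intents):
            self.intents = intents  # intent name -> list of required slots
            self.current = None

        def step(self, new_intent=None, filled_slots=None):
            if new_intent is not None:  # transition: a new intent was detected
                self.current = IntentState(new_intent, self.intents[new_intent])
            if filled_slots:
                self.current.slots.update(filled_slots)
            missing = self.current.missing_slot()
            if missing:  # transition to a nested slot-value state
                return f"ask user for '{missing}'"
            return f"fulfil '{self.current.name}' (e.g., via an API call)"

    csm = ConversationStateMachine({"BookRestaurant": ["location", "time"]})
    print(csm.step(new_intent="BookRestaurant"))          # ask user for 'location'
    print(csm.step(filled_slots={"location": "Sydney"}))  # ask user for 'time'
    print(csm.step(filled_slots={"time": "7pm"}))         # fulfil 'BookRestaurant'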

1.4 Thesis structure

The thesis structure follows the research approach presented in Figure 1.1. In each chapter we present the work related to the study reported in that chapter to motivate the study and illustrate its relevance. We provide essential background knowledge and terminology in Chapter 2. Chapter 3 proposes and presents techniques to automate event identification and enrichment in raw unstructured information items. Chapter 4 proposes and presents an attribute-based embedding indexing mechanism to support flexible natural language queries over structured vulnerability information stored in data services. Chapter 5 proposes and presents latent knowledge-driven techniques to facilitate the interaction with APIs via intent-based conversations (e.g., task-oriented conversational bots). Chapter 6 proposes and presents an extended state-machine based model and techniques to support multi-turn, multi-intent natural language conversations between users and services.

[Figure 1.1 outlines the research approach: Chapter 2 reviews background and state-of-the-art approaches; Chapter 3 covers the enrichment of unstructured information items with entities and event types; Chapter 4 covers the semantic augmentation of indexing attributes in integrated information sources; Chapter 5 covers latent-knowledge based middleware techniques and services to support effective integration of user intents and APIs; Chapter 6 covers hierarchical state machine based models to represent and reason about complex interactions between users, bots and APIs.]

Figure 1.1: Research Approach

Chapter 2

Background and State of the Art

In this chapter, we discuss background on dialog systems and distributed representations of words and their contexts (i.e., word embeddings and their extensions). We introduce the main concepts and techniques that are relevant to the contributions we present in the following chapters. As mentioned in the introduction, our work investigates language understanding, extended term embeddings and conversational services.

In Section 2.1, we discuss user intent recognition techniques. Section 2.2 discusses existing approaches for dialog management. In Section 2.3, we overview techniques to generate natural language responses. Finally, in Section 2.4, we discuss the main term embedding models.

Part I - Dialog Systems

Dialog systems are computer programs that provide natural language conversations between users and software systems. The input of such systems is natural language utterances (text/voice). The system then generates an appropriate response (in the form of text/voice) back to the user. Dialog systems are generally categorized into two classes [39]: (i) non-task oriented, and (ii) task oriented.

Non-task oriented dialog systems focus on open domain conversations with users (i.e., conversations with no predefined goal). As reported in [39], around 80% of conversations in an online shopping scenario are chit-chat messages. Examples of this type of dialog system include the DBpedia chatbot [10], which answers faceted questions sourced from DBpedia, and Cleverbot1 and Mitsuku2, which handle open-domain conversations. Non-task oriented dialog systems hardly keep track of conversation states and are therefore not designed to perform specific user tasks (e.g., travel booking, task management) [253].

Figure 2.1: The main components of a dialog system: Natural Language Understanding (NLU), Dialogue Management (DM), Natural Language Generation (NLG)

In general, three main approaches have been proposed to build non-task oriented dialog systems:

• Providing question and answer pairs in the form of handcrafted rules, an approach that suffers from scalability and flexibility issues in large scale cases [229, 224].

• Exploiting generative sequence-to-sequence models to build an entire phrase word by word, conditioned on a user utterance [262, 77].

• Retrieval-based methods, which learn to select responses from external repositories (e.g. knowledge graphs [280, 97, 179]).

Task-oriented dialog systems, or simply Chatbots3, on the other hand, allow users to accomplish a goal (e.g. maintain schedules [51], organise projects [253]) using information provided by users during conversations. Chatbots perform tasks [39] by utilising several specialised components. These components, together with their interactions, are shown in Figure 2.1:

• Language Understanding Component: The Natural Language Understanding (NLU) component parses user inputs into a structured format that can be used by chatbots. We will discuss the functions of this component in Section 2.1.

1https://www.cleverbot.com/ 2https://www.pandorabots.com/mitsuku/ 3Since the focus of this dissertation is on task-oriented dialog systems, the term “chatbots” refers to this type of dialog systems for simplicity.

• Dialogue Management Component: this is the core component of a chatbot [39]. It manages the conversation flow, checks user inputs in each turn, and chooses the next actions based on the conversation history [39, 67]. In Section 2.2, we will unfold existing techniques and approaches for this component.

• Natural Language Generation (NLG) Component: Generating human-like responses as a result of actions performed by chatbots (e.g. invoking an API, querying a database) is the responsibility of this component [228]. Later, in Section 2.3, we will discuss NLG techniques in more detail.

2.1 Intent Recognition

In order to be able to interact with users, it is essential for chatbots to understand user intentions, a task known as intent recognition [120, 205, 86].

An intent refers to a user's purpose, which a chatbot should be able to respond to [205, 39]. For example, if a user says "Show me some Italian restaurants near UNSW", the user wants to know about available restaurants in a specific area. Intents are specified using short names such as "FindRestaurant" or "BookTable". Typically, bot developers define a number of intents that the bot will be able to handle.

Defining an intent requires feeding chatbots with a set of sample user utterances. An utterance refers to anything that a user says whilst conversing with a chatbot [39, 296]. As an example, when a user says "Show me some Italian restaurants near UNSW", the entire sentence is the utterance. There may often be several utterances that say the same thing. For example, there are various ways to ask about "finding a restaurant" (e.g. "is there any Italian restaurant near UNSW?", "Can you help me to find a italian resto around UNSW?"). Based on the user's utterance, a chatbot can recognize the intent. Thus, providing more numerous and diverse utterances helps in the creation of more robust chatbots [281].
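To make this concrete, the sketch below (in Python) shows how such intent definitions are commonly represented: each intent name maps to a set of sample utterances. The "BookTable" utterances are hypothetical illustrations, not examples from a specific platform.

    # A minimal sketch of intent definitions: each intent name maps to
    # sample utterances that express it. "FindRestaurant" follows the
    # running example above; "BookTable" entries are hypothetical.
    intents = {
        "FindRestaurant": [
            "Show me some Italian restaurants near UNSW",
            "is there any Italian restaurant near UNSW?",
            "Can you help me to find a italian resto around UNSW?",
        ],
        "BookTable": [
            "Book a table for two at 7pm",
            "reserve a table at an Italian place tonight",
        ],
    }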

User utterances may carry important information that is necessary for a chatbot to understand in order to be able to serve the correct answer [77, 39]. Such information is known as entities or slots. These entities have types (e.g. date, time). In the previous example, when the user says "Show me some Italian restaurants near UNSW", the term UNSW is an entity of type location and it indicates that the user is referring to a specific place. Figure 2.2 shows an example that illustrates the relationship between intents, utterances and entities. In the following sections, we will describe different approaches to recognize user intents.

Figure 2.2: An utterance ("Show me some Italian restaurants near UNSW", intent: findRestaurant) belongs to an intent and contains entities

2.1.1 Rule-based Techniques

A traditional approach to map user utterances into intents is to adopt hand-crafted rules [224]. In this approach, bot developers define intents by encoding a set of intent recognition rules. Such rules are in the form of pattern/response pairs (shown in Figure 2.3). They are similar to if-statements in programming languages [268].

    + pattern: hi bot
      response: Hi human!

    + pattern: my name is *
      response: Nice to meet you, *.

    + pattern: how are you
      response: I am good, how about you?

Figure 2.3: Defined pattern/response pairs for the Greeting intent

For a given user utterance, a chatbot recognizes its intent by matching all rules against the utterance. It does this by checking the order of words in the utterance. For example, if a user says "My name is John", since the first three words "My name is" match the second pattern in Figure 2.3, the chatbot concludes that the user intent is Greeting. Thus, the answer should be "Nice to meet you, John.", considering that the wildcard * in that rule is a placeholder and can match anything.
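As an illustration, the pattern/response pairs of Figure 2.3 can be approximated with ordinary regular expressions. This is a minimal sketch of the matching loop, not the mechanism of any particular rule engine:

    import re

    # Pattern/response pairs mirroring Figure 2.3; the second rule captures
    # the wildcard part of the utterance so it can be echoed back.
    RULES = [
        (re.compile(r"^hi bot$", re.IGNORECASE), lambda m: "Hi human!"),
        (re.compile(r"^my name is (.+)$", re.IGNORECASE),
         lambda m: "Nice to meet you, {}.".format(m.group(1))),
        (re.compile(r"^how are you$", re.IGNORECASE),
         lambda m: "I am good, how about you?"),
    ]

    def respond(utterance):
        # Try each rule in order; the first matching pattern produces
        # the response.
        for pattern, template in RULES:
            match = pattern.match(utterance.strip())
            if match:
                return template(match)
        return None  # no rule matched, i.e. the lack-of-flexibility problem

    print(respond("My name is John"))  # -> Nice to meet you, John.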

Figure 2.4: Defined rules to return corresponding results from a database [8]

[8] proposed a pattern-matching system that uses rules (shown in Figure 2.4) to map user natural language utterances into SQL queries. In [10], the authors propose a rule-based chatbot based on DBpedia to improve community discussions and interactions in the DBpedia community. At the core of the DBpedia chatbot, there is a request router that determines the type of user question (intention). There are three different types of questions: (i) questions about DBpedia, e.g. "Is DBpedia down right now?", (ii) fact questions, e.g. "What is the Capital of France?", and (iii) jokes or casual conversations, e.g. "Hi, How are you?", "What's up?". In order to answer those questions, they used the DBpedia mailing list as a source of question/answer pairs and built patterns (rules). Whenever a user question matches one of those patterns, the chatbot returns the associated answer from the question/answer pairs extracted before. To answer factual questions, the chatbot benefits from WDAqua's QANARY question answering system1, WolframAlpha and OpenStreetMap. Finally, for banter, they simply use a predefined set of responses for each question. TQBot [173] is a virtual tutor, which is empowered with rules that guide students to find answers (course contents) related to their natural language questions (e.g. "give me some information about mammals"). Similarly, Charlie [172] is an online learning platform with the ability to communicate with students using controlled natural language (i.e. take quizzes, check marks).

A number of languages have been proposed to specify intent recognition rules [224]. Artificial Intelligence Markup Language (AIML), a derivative of XML, is a widely used language in this context [268]. ELIZA [272], PARRY [47] and ALICE [268] are among the first generations of pattern-matching chatbots; ALICE in particular is built on AIML2. AIML consists of units called categories and topics. Categories are blocks of rules, each consisting of (i) a pattern, which defines the user input (e.g. "Hi bot"), and (ii) a template, which indicates the response of the chatbot (e.g. "Hi human") to the user's input. Topics, on the other hand, are collections of categories. Figure 2.5-Left shows a chatbot which is able to answer simple user utterances using pre-defined AIML rules. The main drawback of AIML is that it is too verbose and its underlying pattern matching is quite primitive and generic; thus, it requires a lot of rules to perform even simple tasks [169].

1https://github.com/WDAqua/Qanary
2Pandorabots is an online platform to build rule-based chatbots using AIML.

AIML (Left) patterns: "Hi bot", "What is your name", "How are you". RiveScript (Right):

    + hi bot
    - Hi human

    + what is your name
    - My name is chatbot.

    + how are you
    - I am good, how are you?

Figure 2.5: Rules written in the AIML (Left) and RiveScript (Right) scripting languages

RiveScript3 and ChatScript4 are other scripting languages that provide an easy-to-understand syntax and extensive additional features compared to AIML. Comparing the rules written in AIML (Figure 2.5-Left) and in RiveScript (Figure 2.5-Right), it can be seen that the latter has a clearer structure and an easier-to-follow syntax. It is thus easier to use for defining rules, especially when chatbots utilise a large number of rules [79]. In general, rule-based approaches suffer from two main issues:

• Lack of flexibility: Users are not free to express their utterances in every possible way they want, because there might be no rule that matches an utterance [39]. For example, a bot developer defines a rule ("hi bot" in Figure 2.5-Right) to answer greeting messages. However, at the time of using the chatbot, users might greet in other ways (e.g. "hey chatbot", "hello bot", or simply "hi"). Thus, the chatbot is not likely to answer properly.

• Cost of maintenance: Writing hand-crafted rules that anticipate end-users' language diversity is intensively time consuming for developers [293]. As the number of rules grows, finding overlaps and conflicts between rules also causes other issues [209]. For instance, understanding the reason for a chatbot's behaviour (e.g. which rule has been applied/matched for a given user utterance) by checking a large number of rules is hard. Thus, machine learning approaches have been proposed in recent years to address the limitations faced by rule-based approaches [293, 228].

3https://www.rivescript.com/ 4https://github.com/ChatScript/ChatScript

Despite the above issues, rule-based approaches are the best option when: (i) there is no training data available [281], and (ii) the scope of the conversations is small (a limited number of questions/answers) and it is unlikely to grow [79].

2.1.2 Traditional Classification-based Techniques

Framing intent recognition as a classification problem [228], where a user utterance is classified into one of the available classes (i.e., intents), has provided an opportunity to use machine learning algorithms to overcome the flexibility limitations of rule-based techniques. In approaches based on machine learning, a classification model is trained on data sets containing a large number of utterances, each labeled with an intent and its entities [158, 89].

Figure 2.6 shows the common structure of such a training data set. Witnessing a variety of utterances for an intent, the model learns the different ways that users may ask for that intent. Since mapping a user utterance to a pre-defined intent is considered a classification problem, standard classification algorithms can be used for this task. For instance, SVM5, with its generalisation capability on unseen data, performs well on intent recognition tasks [170, 90]. Language processing services like RasaNLU6 also use SVM as their default algorithm to identify intents from user utterances.
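A minimal sketch of such a classifier, assuming scikit-learn's TF-IDF features and a linear SVM; the toy utterances and intent labels are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy training set in the format of Figure 2.6: utterances labeled
    # with intents.
    utterances = [
        "is there any Italian restaurant near UNSW?",
        "show me some Italian restaurants near UNSW",
        "book a table for two at 7pm",
        "reserve a table at an Italian place tonight",
    ]
    labels = ["FindRestaurant", "FindRestaurant", "BookTable", "BookTable"]

    # TF-IDF turns each utterance into a feature vector; LinearSVC then
    # learns a separating hyperplane per intent.
    classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    classifier.fit(utterances, labels)

    print(classifier.predict(["can you find a Thai place around Central?"]))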

[108] utilises a triangular CRF7 model coupled with an additional random variable that captures user intent on top of a standard CRF. [159] proposed a technique that uses the output of a discriminative semantic concept classifier to generate a semantic tree for a given user utterance. The generated tree is used for both the intent detection and entity extraction tasks.

Cal is a virtual assistant that helps users schedule meetings using conversations over emails [51]. Although the proposed approach is mostly based on classification models, predefined rules (shown in Figure 2.7) are also utilised to gather relevant information from emails, such as meeting duration or time options. Furthermore, using crowd workers to correct misclassifications (e.g. an email wrongly classified as a "meeting request") is another feature of Cal.

5Support Vector Machine
6https://rasa.com/docs/rasa/nlu/components/#sklearnintentclassifier
7Conditional Random Field

Figure 2.6: An excerpt of a training dataset for a machine learning based intent recognition model

For example, when Cal is cc'd in an email from user Bob to user Alice saying "Alice, can we get together sometime next week to discuss the CHI reviews?", it attempts to automatically determine whether the message relates to an existing meeting request (e.g. updating/canceling) or to a new meeting (e.g. invitations) using classification models. If this step is successful, Cal then exploits Named Entity Recognizer (NER) [184] models and a set of rules to extract entities (e.g. date, phone number, meeting duration). Next, it checks Bob's calendar availability (permission is given beforehand), and sends a ballot email to Alice with available meeting time options (e.g., "Alice, I checked Bob's calendar and here are available times: Wed Sep 20 at 9am, [...] -Cal"). Once Alice chooses one of the options (e.g., "Great. Wednesday at 9am works!"), Cal adds the meeting information to Bob's calendar and sends a notification to Alice. If Cal is not able to classify an email properly (e.g. a low confidence score), a crowd-sourcing task is created automatically on the Amazon Mechanical Turk platform, and a worker is asked to provide a classification (e.g. New meeting) given the email message. The class label provided by the worker, together with the message, is then collected to re-train and improve the classification model [51].

Figure 2.7: Extracted features used to define rules for entity extraction (e.g. meeting duration) from email content [51]

TaskBot [254] is a chatbot that helps team members stay focused on their work by automating management tasks (e.g. setting deadlines, assigning new tasks to people). Using TaskBot, workers are freed from switching between communication, productivity and task management tools in a project. They can assign tasks to other members or even to themselves (e.g. as a reminder) by mentioning @TaskBot in the chat platform (e.g. Microsoft Teams) they use to chat with each other. TaskBot utilises a machine learning based NLU service, namely Microsoft LUIS8, to identify the user's intention. For example, when a team member says "@Bob, don't forget to finish the report. @TaskBot", TaskBot identifies the intent as Task assignment. As another example, when someone in the team says "@TaskBot, please remind me to leave at 5pm.", TaskBot knows that the user is asking for the Self-Reminders intent. It then uses the Wunderlist9 task management platform to add, assign, track and terminate tasks.

8https://www.luis.ai/
9https://www.wunderlist.com/

2.1.3 Deep learning-based Techniques

Applying deep learning to recognize intents and detect entities from user utterances is a more recently explored approach [228, 256]. Integrating feature engineering into the learning procedure and capturing long dependencies between words in utterances have made deep learning approaches popular in the language processing area [77, 182, 183]. In this approach, utterances from the training data set are converted into sequences of numbers (known as IDs), where each number refers to an indexed word in a so-called vocabulary dictionary, which is built by scanning the training dataset and storing its unique words. These sequences of numbers are then used by a neural network (NN) [275] to update the weights of its hidden layers and learn the sequence pattern of words for each intent.
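The conversion of utterances into ID sequences can be sketched as follows; this is a simplified whitespace tokenizer, whereas real systems typically add padding and subword tokenization:

    def build_vocab(training_utterances):
        # Reserve IDs for padding and unknown words, then index every
        # unique token encountered in the training data.
        vocab = {"<pad>": 0, "<unk>": 1}
        for utterance in training_utterances:
            for token in utterance.lower().split():
                vocab.setdefault(token, len(vocab))
        return vocab

    def encode(utterance, vocab):
        # Map each token to its ID; unseen tokens fall back to <unk>.
        return [vocab.get(token, vocab["<unk>"])
                for token in utterance.lower().split()]

    vocab = build_vocab(["show me flights to Lyon",
                         "find Italian restaurants near UNSW"])
    print(encode("show me restaurants near Lyon", vocab))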

Using deep belief networks (DBN) [60] to route customer calls for technical assistance into one of 35 call-types (classes) was an early attempt at using deep learning for intent classification [223]. Using this call routing system, when a user says "hi there i was wondering um my car is kind of old and has been breaking down frequently just wondering if you've got good rates." [130], the caller is routed to the "Consumer Lending Department" as one of the predefined call-types. In this approach, the DBN is used to discover features (e.g. a word pair appears, the parse tree contains a particular tag) from unlabeled data by learning a generative model (called a Restricted Boltzmann Machine); the extracted features are then used to initialize a feed-forward neural network that classifies customer utterances into call-types.

[228] leveraged deep convex networks (DCNs) for intent classification and achieved better accuracy than a boosting-based classifier [256]. [108] improved the triangular CRF model [132] by exploiting convolutional neural networks (CNNs) to automate feature extraction. [84] applied recursive neural networks (RecNNs) to the intent recognition task. [210] proposed a joint recurrent neural network (RNN) for the intent and entity extraction tasks.

2.2 Dialogue Management

Typical chatbot development is based on the hypothetical idea that user-chatbot interactions are single turn, which means that every single user utterance is complete and carries all the information required by the chatbot to perform a task. Thus, a bot developer needs to build or reuse an intent recognition model (discussed in Section 2.1). She defines intents, provides sample utterances, and annotates utterances with entities. However, studies on human-chatbot conversation patterns reveal that, in practice, conversations are rather multi turn [41, 211]: there may be missing information (e.g. location) in the user's initial utterance (e.g. "Can you please find some Italian restaurants for me?") that needs to be obtained by the chatbot by asking the user questions (e.g. "Where are you?"). To support complex and multi-turn conversations, a new component called the Dialogue Manager has been embedded within the body of recently built chatbots [228]. This component compares the required information with the information acquired during a conversation, asks the user to provide any missing items, and allows a corresponding component (within the chatbot) to perform the task once everything needed has been obtained.

The following example shows the importance of the dialog manager component in action. When a chatbot receives a user utterance (e.g. "Show me flights to Lyon for tomorrow 10AM"), the NLU extracts the user's intent (e.g. "SearchFlight") and slots (e.g. destinationCity: Lyon). The output of the NLU component is shown in Table 2.1. Such information is then used by the Dialogue Manager (DM) to update the state1 of the conversation and accordingly choose the next system actions to complete the user's intent (e.g. "asking the user to provide more information"). As there is missing information (e.g. departureCity) in the user's initial utterance, the Dialogue Manager asks the user to provide this information prior to the final system action (e.g. invoking the Expedia API to get information about flights). The outcome of an action is a message that is sent to the NLG2 in the form of either a template "There is {flightCode} for {departureDate} at {departureTime}" or, in more sophisticated systems, a dialog act (e.g. inform(flightCode: "Qantas QA402", departureTime: "10AM")). The NLG component then transforms this template into a natural-sounding sentence as a response to the user (e.g. "There is Qantas QA402 flight for tomorrow at 10AM").

Table 2.1: An illustrative example of a user utterance
Sentence   Show me flights to Lyon for tomorrow 10AM
Intent     SearchFlight
Slots      destinationCity: [Lyon], departureDate: [tomorrow], departureTime: [10AM]
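The bookkeeping that the Dialogue Manager performs in this example can be sketched as a slot-filling loop. The slot names follow Table 2.1; the prompt texts and function names are illustrative:

    # Required slots per intent, following the SearchFlight example
    # of Table 2.1.
    REQUIRED_SLOTS = {
        "SearchFlight": ["destinationCity", "departureCity",
                         "departureDate", "departureTime"],
    }
    PROMPTS = {"departureCity": "Which city are you departing from?"}

    def next_action(intent, state):
        # Ask for the first missing slot; once the state is complete,
        # fulfil the intent (e.g. invoke a flight-search API with the
        # collected slot values).
        for slot in REQUIRED_SLOTS[intent]:
            if slot not in state:
                return ("ask", PROMPTS.get(slot, "Please provide " + slot))
        return ("fulfil", intent)

    state = {"destinationCity": "Lyon",
             "departureDate": "tomorrow",
             "departureTime": "10AM"}
    print(next_action("SearchFlight", state))  # asks for the departure city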

In a nutshell, the dialogue manager keeps track of conversation information that is entered implicitly or explicitly and manages complex interactions with users [39, 210]. In the following sections, we will discuss existing approaches for Dialog Management.

2.2.1 Flow Based Models

Defining workflows that contain activities, triggers and actions is the focus of flow based approaches. In this approach, also known as "chatbot development with zero coding", the developer creates a workflow for the chatbot by defining the conversation flow between user and chatbot, the transitions between flows (e.g. triggers), along with supporting actions in each flow (e.g. call Webhooks, link to third-party services).

[153] proposes a system that takes a business process model, namely a Business Process Model and Notation (BPMN)3 2.0 model, and generates a list of rules to create a chatbot. Developers precisely describe their desired process model using a graphical notation (BPMN [263]). The specified model is used by the system to generate AIML rules. Such rules are loaded into an AIML interpreter to deploy/run the chatbot. Users can ask questions related to the business model such as "who is the actor performing a given task?", "please notify me when exclusive gateways are reached", or "To whom should the document be sent?". By leveraging such a chatbot, users can interact with a business process using controlled natural language. Figure 2.8 shows the steps taken by the proposed approach to generate a chatbot that represents and interacts with a process model.

1A state indicates the current values of slots based on the progress of the conversation (e.g., "intent:SearchFlight", "destinationCity:Lyon", "departureDate:tomorrow", "departureTime:10AM", "departureCity:—").
2Natural Language Generation
3https://www.omg.org/spec/BPMN/2.0/

Figure 2.8: Given a process model, the approach generates a set of rules to deploy a flow-based chatbot [153]

Apart from research-based prototypes and languages, there are also various platforms that developers can use to build flow based chatbots. For instance, Chatfuel4, FlowXO5, ManyChat6, MotionAI7 and Botsify8 provide a design canvas (shown in Figure 2.11) to draw conversation workflows using visual elements (e.g. Carousel, Quick Reply, Text, etc.). Each element on the canvas represents a task in a conversation flow (e.g. read user input, play audio, send email). For example, a Carousel (shown in Figure 2.10) allows users to choose from a list of options, each containing a title, description, image and button.

Quick Reply (shown in Figure 2.9) is an element that allows users to reply to a chatbot's question (e.g. "Please choose your preferred doctor") by choosing an answer from a predefined list of allowable answers.

Figure 2.9: Quick reply is an instant input (answer) from user

A workflow, defined by drawing elements on the canvas, is then converted into a list of rules (e.g. AIML) to deploy a chatbot. For example, [35] leverages the Chatfuel platform to build a chatbot that helps users achieve specific goals regarding their diet (e.g. losing weight).

4https://chatfuel.com/ 5https://flowxo.com/ 6https://manychat.com/ 7http://www.motion.ai/ 8https://botsify.com/

Figure 2.10: A Carousel contains a list of items with more details

Likewise, [231] is a chatbot that leverages the Chatfuel platform to help students find answers to their administrative questions, such as payment, attendance, and course progression.

2.2.2 Deterministic State Machines based Models

In this approach, the bot developer defines the structure of the user-chatbot conversation by restrictively specifying states and the transitions between them using an FSM (Finite State Machine) model [211, 265]. In each state, the bot developer sets the actions (e.g. ask the user for a departure city) that the chatbot should perform; transitions between states depend on defined conditions (e.g. if the user answers the "departure city" question, then move to the next state to ask for the destination city). Figure 2.12 shows an example of a state machine conversation model for a "flight booking" chatbot. For example, when the chatbot is in the "Ask for one-way flight" state, two transitions are available based on what the user answers: (i) "Yes" moves to the "Get final confirmation" state, and (ii) "No" moves to "Ask for return date".
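A fragment of such a state machine for the flight-booking example can be sketched as a transition table; the state and input names are illustrative:

    # Each (state, user_answer) pair maps to the next state, mirroring
    # the "Ask for one-way flight" example above; unlisted inputs keep
    # the conversation in the current state.
    TRANSITIONS = {
        ("ask_one_way_flight", "yes"): "get_final_confirmation",
        ("ask_one_way_flight", "no"): "ask_return_date",
        ("ask_return_date", "<date>"): "get_final_confirmation",
    }

    def step(state, user_answer):
        return TRANSITIONS.get((state, user_answer), state)

    print(step("ask_one_way_flight", "no"))  # -> ask_return_date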

For example, Ava [109] is a chatbot that leverages a state machine to help data scientists perform analytics tasks (e.g. split a dataset, use algorithms, visualize outputs) through natural language conversations. Given a user input (e.g. "do a 60-40 split"), the chatbot maps this utterance into executable code containing a sequence of calls to the underlying libraries to split a dataset into 60% for training and 40% for testing. Conversation between Ava and the data scientist is guided by a storyboard. A storyboard defines the dialogue, the sequence of states and the corresponding actions that Ava must take as a response in each state to drive the conversation towards task completion (e.g. train a decision tree model).

Figure 2.11: Building a flow-based chatbot to book doctor appointments using the Chatfuel platform

Mapping user input to a beginning state (the starting point of the storyboard) is handled by finding a predefined rule (written using AIML) that matches the user utterance. Figure 2.13 shows an interaction between a user and Ava in order to split the dataset and train a decision tree model.

Likewise, Daisy [140] is a chatbot that helps data scientists execute data analysis tasks on biomedical data using natural language. The proposed chatbot leverages predefined states in a finite state machine controlling the conversation flow, where each state represents a certain dialogue situation. A transition between states is triggered upon recognizing a pattern that matches a user utterance. Figure 2.14 shows interactions between a user and Daisy using controlled natural language utterances.

Figure 2.12: Dialog management by exploiting a defined finite state machine [112]

Iris [69] is another chatbot that performs data science commands (e.g. lexical analysis) by leveraging state machines. In Iris, each command (e.g. "compute pearson correlation: x and y") is linked to a template (shown in Figure 2.15). Templates are created by developers. Each template contains a set of sample utterances to indicate the different ways that users can ask Iris for the command, using natural language.

At the time of user-Iris interaction, for a given user utterance, Iris first classifies the utterance into a command by using a logistic regression model trained on the sample utterances for each command. Next, Iris extracts arguments/values from the user input by applying patterns, which again are defined in each template's code. Figure 2.15 shows the template code for the Pearson Correlation metric [122], together with sample utterances and required arguments, all specified by the developer.

Figure 2.13: An excerpt of a conversation between user and Ava [109]

Figure 2.14: User interaction with Daisy to explore available features and supported models [140]

Conversation between Iris and the user is managed by a state machine (shown in Figure 2.16), which is generated at runtime. Armed with this state machine, Iris is able to support command composition and sequencing. Command composition is the ability to nest commands in order to perform an action. For example, when Iris asks the user (e.g. "What is the array for y?") to enter an array object for further actions, the user can ask Iris to execute a command as the answer (e.g. "generate a random array"). The output of the user-requested command is then the answer to Iris's question. In other words, Iris is able to nest commands within each other through conversation.

Command sequencing, on the other hand, is the ability to reference the value of previous commands in the current interaction. For example, after a command has just returned a result (e.g. the sum of two numbers), the user can ask Iris to "multiply _that in 2", where _that refers to the output of the previous command (e.g. the sum).

Figure 2.15: Template code for the Pearson Correlation command, including sample utterances and follow-up questions for required arguments, all defined by the developer [69]

Devy [192] is a context-aware conversational agent that helps developers by automating development tasks and actions (e.g. create branches, commit code, push code, open pull requests) in workflows. At the core of Devy, there is a context-aware development model which acts as a knowledge base of automatically collected parameters required for performing low-level actions, without developer involvement. The proposed context model supports only version control actions (Git) and code hosting actions (GitHub). Besides the context model, Devy uses an intent service that translates the developer's utterance into the developer's intent, with the help of the context model. After determining the intent, the intent service uses a finite state machine (shown in Figure 2.17) to (i) execute the intent in the form of workflow actions, and (ii) reason about what the developer's input command means in each state (context).

To conclude, deterministic approaches work very well in small scale use cases where user inputs/options are limited and known a priori. Conversation flow traceability, a feature that helps track the interpretation of each user input and the chatbot's actions for further fixes/improvements, is an important advantage of this approach [153, 109]. However, deterministic approaches come with shortcomings in versatility and robustness in situations where the user does not follow the predefined sequence of states [169]. For instance, if the user's initial utterance carries all the required information (e.g. "Is there any Italian cafe near George Street?") or at least part of it (e.g. "Any Italian cafe?"), the chatbot cannot handle the situation, because it is programmed to ask for the information one piece at a time by following the defined sequence of states (e.g. type → location → search for places). Thus, the user cannot deviate from pre-defined conversation paths. Figure 2.18 exemplifies a situation where the user's answer has more information than what the chatbot expects [169]. In this case, the chatbot asks the user to re-enter the information in the next turn (as it follows the defined states). [69] addresses this issue by leveraging a classifier model together with a defined list of hand-crafted rules to extract all possible information from each user input. However, this approach raises another issue, which is the cost of writing rules for each piece of required information in large scale use cases.

Figure 2.16: Iris state machine that supports command composition (calling methods recursively) and sequencing (referencing previous command results) [69]

2.2.3 Probabilistic State Machines based Models

Due to the ambiguity of natural language, uncertainty in interpreting user utterances is common in intent/entity extraction [48, 169]. Thus, probabilistic models have been proposed to address the issues of deterministic methods. Instead of handcrafting rules or states/transitions as dialog paths, probabilistic methods learn from actual conversations by formalizing user-chatbot interactions as hidden states with random sequences and transition probabilities [90]. The Markov decision process (MDP), a widely used probabilistic model, is a statistical state machine built on the assumption that the next user action can be predicted based on the current state of a conversation [90]. Unlike deterministic state machines, where transitions from one state to another depend on a set of rules, in an MDP transitions are based on probabilities. Such probabilities are learned from example conversations (e.g. Gutenberg9). Given the current state of a conversation, the probabilities associated with transitions to other states are mostly computed based on the conversation history [141].

9https://www.gutenberg.org/

Figure 2.17: Devy's finite state machine for handling workflows [192]

Using the MDP model, we can imagine the chatbot as an agent which travels within a network of interconnected states [90]. Starting from an initial state (which can be any state, depending on user input), the chatbot transitions from state to state by performing actions. Since the user's answer in each state is uncertain, the transitions are non-deterministic, meaning that moving between states does not follow predefined patterns, as the flow depends on the user's responses to chatbot actions. Thus, this approach addresses the issue of deterministic approaches, namely following predefined and fixed conversation flows [90]. For example, in a flight booking chatbot, the user utterance "show me flights to Lyon" will lead to a high probability that the destination city is filled with "Lyon", while the state with all slots empty is given a low probability. Thus, the chatbot skips the "asking for destination city" state and jumps to other relevant states (e.g. asking for the departure date).

An MDP requires a training data set to build a model that encodes user-chatbot interactions as states. Unlike deterministic approaches, which require extensive development and maintenance costs to manually construct rules covering all conversation paths (with no guarantee that the rules can optimally cover all possible aspects), an MDP is trained on a data set of real human-human (or human-chatbot) dialogues to cover variations of conversations. The success of such a model, however, is highly dependent on the quality of the training data set. Thus, acquiring sufficient and high quality training data for domains with a large number of states is an issue for probabilistic models [90].
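As a simplified sketch of the learning step, transition probabilities can be estimated by maximum-likelihood counting over observed state sequences; a full MDP additionally models actions and rewards, which are omitted here:

    from collections import Counter, defaultdict

    def estimate_transition_probabilities(dialogs):
        # dialogs: each dialog is the sequence of states it visited.
        counts = defaultdict(Counter)
        for states in dialogs:
            for current, following in zip(states, states[1:]):
                counts[current][following] += 1
        # Normalize the counts into per-state probability distributions.
        return {
            state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
            for state, nexts in counts.items()
        }

    dialogs = [
        ["ask_destination", "ask_departure_date", "search_flights"],
        ["ask_departure_date", "search_flights"],  # destination given up front
    ]
    print(estimate_transition_probabilities(dialogs))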

2.2.4 Memory based Learning Models

Figure 2.18: An over-informative answer from the user to the question asked by a state-based chatbot [169]

The most recent approaches in the area of dialog management are memory-based techniques. In MDP models, the next state and action depend only on the current state, regardless of all previous states and actions. Unlike MDP models, memory-based learning models are based on the intuition that the machine uses its memory to take previous actions/decisions into consideration. This memory is trained on previous conversations (as sequences of actions) to choose an optimal action at each turn of the conversation.

Using a sequence-to-sequence model, also known as Seq2Seq, is the main approach for generating actions in a dialog management system. Sequence-to-sequence models are widely used in machine translation [43]. The idea behind this approach is to generate a target sequence (e.g. a sentence in another language) by looking at a source sequence (e.g. the original sentence "Hi, how are you?") [39, 124, 144]. In language translation, the source sequence is a sentence written in one language (e.g. Chinese) and the system generates an equivalent sentence in another language (e.g. Italian). This approach can be used for chatbot dialogues with an adaptation: in a dialog system, the source sequence is the user utterance along with the dialogue history [144, 39], and the target sequence is a corresponding action (e.g. an API call, a database query) performed by the chatbot.

Seq2seq approaches consider actions as sequences of tokens ([26]), and they use memory networks ([293, 242]) to learn mappings from user utterances (sourced from dialog history) into sequences of tokens (actions).
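A skeletal sketch of this idea in PyTorch: an encoder consumes the token IDs of the utterance (plus history) and a decoder emits action tokens. This illustrates the source-to-target mapping only and is not a reproduction of the memory-network architectures cited above:

    import torch
    import torch.nn as nn

    class Seq2SeqActionModel(nn.Module):
        def __init__(self, utterance_vocab, action_vocab, hidden=128):
            super().__init__()
            self.src_embed = nn.Embedding(utterance_vocab, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.tgt_embed = nn.Embedding(action_vocab, hidden)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, action_vocab)

        def forward(self, utterance_ids, action_ids):
            # Encode the source sequence (user utterance + dialog history).
            _, hidden_state = self.encoder(self.src_embed(utterance_ids))
            # Decode the target sequence of action tokens, conditioned on
            # the encoder's final hidden state (teacher forcing in training).
            decoded, _ = self.decoder(self.tgt_embed(action_ids), hidden_state)
            return self.out(decoded)  # per-step logits over action vocabulary

    model = Seq2SeqActionModel(utterance_vocab=1000, action_vocab=50)
    logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 50, (1, 4)))
    print(logits.shape)  # torch.Size([1, 4, 50])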

[293] uses an Encoder-Decoder recurrent neural network (RNN) model to train the whole system. However, as the proposed system uses a supervised approach, it has two main limitations: (i) a lot of training data is needed, and (ii) the system may struggle to choose proper actions because of the lack of dialog intent changes in the training data. [145] proposes an end-to-end task completion dialog system (for booking movie tickets) that leverages a neural network and reinforcement learning (RL) [114]. RL is used to improve robustness in order to handle the noise introduced by users in real conversations. [285] proposed a neural tracker based on long short-term memory (LSTM) [137, 246] that accepts raw user utterances (without any pre-processing) and generates sentences as chatbot responses (shown in Figure 2.19).

Figure 2.19: A sequence model (e.g. LSTM) considered as a black box, with a source sequence (user utterance) and a target sequence (chatbot response); the surrounding components are the Natural Language Understanding (NLU), the Dialog State Tracker (DST), the Natural Language Generation (NLG) and the end-user

2.3 Natural Language Generation

Natural Language Generation (NLG) is another important component in a dialog system architecture (Figure 2.1). NLG converts the outcome of an action performed by the dialogue manager (see Section 2.2) into a natural language response (e.g. a sentence). For example, in a "hotel booking" chatbot, the NLG component takes the output of the dialog manager (e.g. inform(hotelName: "Hyatt", roomType: "Harbour View studio", checkinDate: "July 3rd")) and generates a response to the user (e.g. "Hyatt hotel offers a Harbour View studio available for July 3rd").

There exist several approaches for language generation. The most common approach is template-based generation [238]. In this approach, domain experts design a set of templates, known as intermediary forms of utterances, consisting of symbols (e.g. "{hotelName} offers {roomType} for {checkinDate}"). NLG then converts these templates into responses [266] by replacing the symbols (fill-in-the-blank words) with actual values (e.g. obtained by invoking the Expedia API to get hotel details). However, the cost of writing and maintaining templates makes it challenging to adapt these models to new domains [267]. Moreover, the quality of the NLG component is limited by the quality of the hand-crafted templates [238].
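Template-based generation essentially amounts to string substitution, as in this sketch of the hotel-booking example:

    # The template is the intermediary form; NLG fills the symbols with
    # the values returned by the dialogue manager's action (e.g. the
    # response of a hotel-search API).
    template = "{hotelName} offers {roomType} available for {checkinDate}"
    response = template.format(
        hotelName="Hyatt",
        roomType="Harbour View studio",
        checkinDate="July 3rd",
    )
    print(response)  # Hyatt offers Harbour View studio available for July 3rd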

The above challenges in template-based NLG systems motivated researchers to consider more data-driven approaches, known as generative methods [77]. The intuition behind generative methods is to train a model (from a large data set) that can produce sentences on the fly. For example, given the words ["dish", "tomorrow", "I", "cook"], the model can generate a sentence like "I cook dish tomorrow", but only if it has seen similar sentences (e.g. "cook dish tomorrow", "I cook dish") in the dataset [126, 160]. Bengio et al. [21] proposed an approach that uses a neural network [275] to predict the next word given the preceding words in a sequence. However, the major issue with this approach is the fixed size of the input (given by the user) and output (generated by the model) sentences. Cho et al. [43] overcame this limitation by using two RNN models based on the Encoder-Decoder architecture. The encoder first converts the input sequence (e.g. a sentence) into a vector representation. Next, the generated vector is fed to a decoder to predict a sequence of symbols (e.g. words) and form a sentence. Tsung et al. [273] introduced an LSTM-based model that takes as input a dialogue act1 (e.g. inform) and generates a response that represents the given dialogue act (e.g. "Your answer is ...").

Part II - Term Embeddings

Our work in the next three chapters builds upon word embedding and its extensions [176, 7]. In this section we provide background on this area. A word embedding is a distributed representation of a term (e.g. a word, sentence, or paragraph) in a low-dimensional space called a vector space [176]. The basic idea of such a representation is that terms that share similar contexts (and semantics) have closer vectors in the corresponding vector space. For instance, in an embedding model, the vector of the word "cat" is closer to the vector of the word "dog" than to that of "salt", because both are animals (similar semantics) and they are more likely to appear near each other in documents (similar context). Recently, embedding models have been successfully used in different areas such as language processing [142], sentiment analysis [248] and document similarity [131, 110]. In the following sections, we will describe different types of embeddings and their usages.

1A dialogue act is a phrase that serves a function in the dialog (i.e. the act that the speaker performs, such as a question or a statement).

Figure 2.20: Two-dimensional projection of a vector space model that represents countries and their capital cities [176]

2.4 Term Embeddings

2.4.1 Word Embedding

Word embeddings are vector representations of words built on the assumption that words with similar contexts have similar meanings [176] (as shown in Figure 2.20). Word embeddings are usually generated from large amounts of unstructured text data in an unsupervised manner. The most popular algorithm to generate such embeddings is Word2vec [176]. Word2vec is a three-layer neural network with a single hidden layer [176]. Its input is a text corpus and its output is a set of feature (e.g. context) vectors for the words within the corpus (a vector representation of each word in the corpus). The network starts with a set of randomly initialized vectors for the words (the vocabulary dictionary). It then scans the corpus with a fixed window size. Each window has a target word and n (where n is the window size) neighboring words called the context. The algorithm modifies the target word vector to become closer to the vectors of the other words within the context.

In general, there are two methods to construct word vectors while training word2vec: CBOW1 and Skip-gram [176]. In the CBOW method, the network takes the context words (e.g. "the quick brown ... jumped over the lazy dog") as the input and predicts the corresponding target word (e.g. "fox"). Skip-gram, on the other hand, inverts the context and the target and predicts each context word from a target word. Figure 2.21 illustrates these two models. CBOW is faster and has better representations for more frequent words, whereas Skip-gram works

1Continuous Bag of Words

well with small training data and represents rare words well [176].
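Training such a model takes only a few lines with the gensim library; the corpus below is a toy example, whereas real models are trained on millions of sentences:

    from gensim.models import Word2Vec

    # Each training example is a tokenized sentence; sg=1 selects
    # Skip-gram, sg=0 (the default) selects CBOW.
    sentences = [
        ["show", "me", "flights", "to", "lyon"],
        ["find", "italian", "restaurants", "near", "unsw"],
        ["book", "flights", "to", "paris", "tomorrow"],
    ]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    print(model.wv["flights"].shape)         # (100,) vector for a word
    print(model.wv.most_similar("flights"))  # nearest words in the space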

Figure 2.21: Skip-gram model vs CBOW model - w(t) is the target word and w(t−2) ... w(t+2) are the context words [176]

Regardless of the method (CBOW or Skip-gram), the algorithm computes the dot product between the target word vector and the context word vectors and minimizes a loss function (e.g. Softmax, Cross Entropy), with the model error propagated by Stochastic Gradient Descent (SGD) [27]. A loss function is a measure of how well a prediction model predicts the expected outcome. Each time two words appear near each other (in a sentence from the training data), their vectors become closer to each other in the vector space. However, this base approach comes with limitations. In a large corpus, there may be some sentences (e.g. "took his book while wearing jacket") where non-relevant words (e.g. book, jacket) appear together only a few times. Updating their word vectors would nevertheless result in very close vectors, even though those words are not relevant. To address this issue, Mikolov et al. [176] proposed a technique called Negative Sampling. The idea is that whenever the distance between two word vectors decreases (the words appeared together in the corpus), some random words are selected and their distance to the target word vector is increased. In this way, the algorithm ensures that non-similar words that do not appear in the same context stay far from each other.

Global Vectors for Word Representation (GloVe) [198] is considered an alternative to word2vec. Compared to word2vec, which is based on contextual windows, GloVe utilises a log-bilinear regression model [181] to consider co-occurrences of words in a global manner. The aim of GloVe is to achieve the same vector space structure as Word2Vec, but by considering the global relations between words (global word-word co-occurrence counts) [118]. These relationships between words are learned by calculating the ratio of their co-occurrence probabilities. GloVe creates a co-occurrence matrix for the entire corpus and fills this matrix by estimating the co-occurrence probability of a given word within the context of other words. For example, the words "ice" and "steam" are related because both frequently co-occur with a shared property (e.g. the word "water"), and both co-occur with an unrelated word (e.g. "fashion") infrequently; the ratio of their co-occurrence probabilities thus reveals their relatedness.

FastText, another alternative to Word2vec, introduces subword embeddings [23]. The idea is that instead of generating a vector per word, a vector is calculated for sub-words, usually n-grams, which are later combined by a composition function2 (e.g. sum) to compute the final word embedding. In other words, FastText treats each word as a composition of characters, and the vector for a word is formed from the sum of its n-gram vectors. For example, the vector of the word "orange" is the sum of the vectors of ["ora", "oran", "orang", "orange", "ran", "rang", "range", "ang", "ange", "nge"] (with smallest n-gram 3 and largest n-gram 6).
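The decomposition into character n-grams can be sketched directly; for "orange" with n between 3 and 6 this reproduces the list above (the actual FastText implementation additionally wraps words in boundary markers before extracting n-grams):

    def char_ngrams(word, n_min=3, n_max=6):
        # All substrings of length n_min..n_max; the word vector is then
        # the sum of the vectors of these subword units.
        return [word[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(word) - n + 1)]

    print(char_ngrams("orange"))
    # ['ora', 'ran', 'ang', 'nge', 'oran', 'rang', 'ange',
    #  'orang', 'range', 'orange']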

One advantage of this approach is that the vocabulary size is significantly smaller when the training data set is large. This makes FastText more computationally efficient than word2vec [176]. Another advantage is that, thanks to sub-words, morphological variations of words (prefixes or suffixes) are captured properly [7]. Word2vec treats variations of a word (e.g. cook, cooking, cooked) as separate tokens and creates an embedding for each token separately (which increases the noise). FastText, on the contrary, relates the morphological variations of a word through their shared sub-words [7]. Moreover, exploiting sub-word level vectors allows the trained word embedding to find similar words even for unseen words, because the vector of an unseen word is constructed from the vectors of its character n-grams. Figure 2.22 exemplifies this feature by showing the closest words to a misspelled word ("accomodation").

2A composition function combines multiple word vectors into a single vector.

Figure 2.22: Subword-level information in FastText - given the word "accomodation", which is a misspelling of "accommodation", we still get the closest words

2.4.2 Sentence Embedding

There are several unsupervised approaches to represent an utterance of words (i.e., a sentence) in a vector space. Perhaps the simplest is to calculate the average of all word vectors within a given utterance [56, 136]. The issue with the averaging method is that it performs poorly in tasks such as sentiment analysis, since it does not consider word order (a common problem in standard bag-of-words models); thus, it fails to recognize many sophisticated linguistic contexts, such as sarcasm [136]. However, the averaging method is considered a baseline across different use cases, such as short text similarity tasks [118, 133, 13]. More sophisticated approaches (based on both supervised and unsupervised deep learning models) have recently emerged to encode sentences into embeddings.
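The averaging baseline is straightforward to implement on top of any word embedding model; the sketch below assumes a gensim-style keyed-vectors object:

    import numpy as np

    def sentence_vector(tokens, word_vectors):
        # Average the vectors of the tokens found in the vocabulary; a
        # sentence with only out-of-vocabulary tokens falls back to the
        # zero vector.
        vectors = [word_vectors[t] for t in tokens if t in word_vectors]
        if not vectors:
            return np.zeros(word_vectors.vector_size)
        return np.mean(vectors, axis=0)

    # e.g. sentence_vector("weather condition is sunny for today".split(),
    #                      model.wv)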

Recent approaches employ sequence-to-sequence models, which convert a sentence (e.g. "Weather condition is sunny for today", an input sequence consisting of words) into a vector (using word vectors). The model then decodes the generated vector back into a new sentence (e.g. "Today weather is sunny"). Figure 2.23 shows an example of a sequence-to-sequence model which generates a new sentence for a given input sentence. Skip-thoughts [123] is an example of such an approach for sentence embedding. It relies on an encoder-decoder deep neural architecture [44]: a neural network takes a target sentence (as an input sequence of words) and generates a vector representation of the given sentence. This vector is then fed into a decoder to predict the context sentences (which are represented as vectors).

Sent2Vec [194], inspired by the architecture and success of Word2Vec, defines a sentence embedding as the average of its word embeddings. It thus optimizes the word vectors towards an additive combination over the sentence, by means of an unsupervised objective function. An objective function (e.g. Softmax, Cross Entropy) is a measure of how well a prediction model predicts the expected outcome (it measures the error).

Figure 2.23: A sequence-to-sequence machine translation neural network - the encoder and decoder are connected together through a hidden state which represents the input sentence [232]

Supervised methods, on the other hand, make use of transfer learning [195] to encode sentences into vectors for a specific supervised task. On tasks like Semantic Textual Similarity (STS) [68], models trained on existing training data have been shown to outperform unsupervised techniques [36]. Examples include InferSent [49], which is based on a BiLSTM [101] network with a max pooling layer. The Universal Sentence Encoder [36] also proposes two encoding models to generate sentence embeddings. The first encoder, called the Transformer encoder, computes vector representations of the words in a sentence by leveraging attention-based neural networks [259], taking word order and sentence structure into account, and finally sums the word representations to generate a sentence embedding. This encoder targets high accuracy at the cost of longer training time and a larger data set, as it needs to be trained with more sentences with different structures. The second encoder, by contrast, utilises a Deep Averaging Network (DAN) [107]: an input sentence is first converted into a vector by averaging all its word vectors (word ordering is ignored), and the sentence vector is then passed through a feed-forward deep neural network (DNN). Figure 2.24 shows the structure of the proposed approach. The intuition behind the DNN is that each layer learns a more abstract representation of the input than the previous one [20]. For example, given an input sentence "I really loved Rosamund Pike's performance in the movie Gone Girl", consider two variants obtained by replacing "loved" with "liked", and "liked" with "despised". The vector averages of the resulting sentences are almost alike; however, the vectors associated with the first two sentences are slightly more similar to each other than they are to the last sentence ("I really despised Rosamund Pike's performance in the movie Gone Girl").

Figure 2.24: A two-layer DAN - it first converts the given sentence to a vector by averaging all its word vectors, then feeds the result through a feed-forward DNN [107]

2.4.3 Document Embedding

Paragraph Vector (also known as Doc2Vec) [136] is one of the first attempts to build paragraph-level embeddings. Similar to Word2Vec [176], Doc2Vec has two different models for training document embeddings: PV-DBOW and PV-DM. PV-DBOW learns paragraph vectors by using the paragraph vector to predict words sampled from that paragraph; this is analogous to the Skip-gram model in the Word2vec algorithm, which uses a target word to predict context words. PV-DM, on the other hand, resembles the CBoW model: it utilises a context window of words, together with the paragraph vector, to predict a target word. Unlike PV-DBOW, in PV-DM word vectors are generated along with the document vectors; these word vectors are equivalent to the ones generated by word2vec.
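With gensim, both training modes are exposed through the dm flag (dm=0 for PV-DBOW, dm=1 for PV-DM); a toy sketch:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [
        TaggedDocument(words="show me flights to lyon".split(), tags=["doc0"]),
        TaggedDocument(words="find italian restaurants near unsw".split(),
                       tags=["doc1"]),
    ]
    # dm=0 trains PV-DBOW; dm=1 would train PV-DM instead.
    model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40, dm=0)

    # Infer an embedding for a previously unseen document.
    vector = model.infer_vector("flights to paris tomorrow".split())
    print(vector.shape)  # (50,)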

Doc2VecC [40] is another approach to document embedding. In Doc2VecC, a document embedding is generated by averaging the embeddings of all words within that document. Doc2VecC reduces the computational cost of training by ignoring parts of randomly chosen documents and zeroing out dimensions of their vectors. Adding such noise during training avoids over-fitting and improves the generality of the trained model by creating different variations of the training data. This technique is similar to adding Gaussian noise3 to a linear regression model or exploiting the drop-out technique [237] in deep neural models.

3https://en.wikipedia.org/wiki/Gaussian_noise

2.4.4 Domain Specific Embeddings

Although general purpose embedding models have proven useful for general-purpose NLP tasks (e.g. sentence classification [121]), in domain-specific tasks (e.g. expert recommendation) they may fall short in correctly assigning semantics to terminology that is specific to the domain [96, 65]. The limited ability of general-purpose word embeddings to cope with ambiguities and specialized terminology can become an important barrier to the effectiveness of systems that rely on such embeddings. Thus, researchers have proposed methods to infer domain-specific embedding models. For instance, [65] introduces a word embedding model for the software engineering domain. This model, which is trained on Stack Overflow posts, captures the meaning of words in the domain of software engineering (e.g. the word 'kill' has different meanings in news and in software engineering). In [95], the authors build a system for recommending software experts by utilising a word embedding model trained on Stack Overflow posts. By extracting Stack Overflow questions together with their top-voted answers (such answers are good indicators of expert users), they create a corpus which is fed into the Word2vec algorithm. The trained vector space model is then used to create clusters, called topics, where each cluster contains vectors of questions/answers with similar semantics. Using clustering techniques (e.g., constrained Laplacian Rank [193]) saves computation time when finding similarities between a user input (question) and target experts.

Chapter 3

Event Embeddings

Entity recognition in data-driven Law Enforcement Investigations

In this chapter we present extensions and applications of term embedding techniques in order to support natural language interactions with textual items. Central to our approach is the automated recognition of entities and events (e.g., people, phone numbers, phone calls) from (textual) items. Entity and event recognition techniques are part of a broader set of term embedding based techniques that we extended and leveraged for various computational tasks in text-based language understanding, including conversational services, natural language based interactions with data services, and text-based information items. Work in this chapter was mostly motivated by our research on term embedding techniques for text understanding and our involvement in a major application of these techniques in law enforcement investigations.

In Section 3.1, we introduce the issues and the contributions we make. We describe data collection and annotation in a typical law enforcement investigation in Section 3.3. Section 3.4 presents our proposed approach to the dynamic recognition and tagging of event-types. We discuss the evaluation and validation of the proposed techniques in Section 3.5 and Section 3.6. Related work is presented in Section 3.2, and finally we provide concluding remarks and discuss future work in Section 3.7.

3.1 Introduction

Law enforcement has long benefited from advances in Information and Communication Technologies (ICT), including information and knowledge management as well as big data analytics and intelligence. Techniques stemming from social network analysis [4], data mining [119], machine learning [295] and video analytics [148] have pushed the frontiers of the ability to process case-related information as part of law enforcement investigations. However, the ever increasing amount of information gathered today as part of investigation tasks poses an important challenge: important bits of information that may prove vital in a given case can be missed.

As a real-life example, an investigator may be tasked to retrieve the passenger manifests for all flights between two cities over a given time window and then check for evidence that a person of interest did indeed travel between these cities on a certain date; similarly for bank transactions, witness statements or telephone records. Ultimately, each piece of data needed to prove an offense is obtained and then manually analyzed by the investigator for its significance to the case. Clearly, as the number of such tasks builds up, and relevant information is added to the case, the cognitive load on investigators increases [240, 12]. This is further exacerbated in today's technological landscape, where information may be obtained from multiple, disparate sources (e.g., emails, phone calls, SMSs and social media).

The primary objective of an investigation is to uncover sufficient evidence from various information sources (e.g., witness statements, emails, documents, collaboration messages, social media posts, bank transactions). In our work we focus on textual and unstructured information items (e.g., emails, SMSs, PDF files). Events are key elements for understanding and reconstructing what has happened from a collection of evidence items. Events are categorized by types. For example, the event-type Phone Call represents all sorts of voice communications that happen through traditional landlines, cellular phones and communication software applications. Another example is the event-type Bank Transaction, which represents all transactions performed with a bank as an intermediary, including ATM operations, check cashing and transfers between bank accounts. The identification of individual events is not only one of the first steps toward understanding the case; it also facilitates the reconstruction of chains of hypotheses and facts to understand what developed in the case, the identification of the parties involved, and the understanding of its temporal dynamics, among other aspects [215]. However, identifying events from unstructured information items is a hard challenge due to the inherent ambiguity of natural language (e.g., various mentions of the same entity or word, jargon, misspellings).

In this chapter, we aim to facilitate the work of investigators with a framework and techniques for deriving insights from data. Inspired by word embeddings and their impact in various application domains [24], we develop a novel vector space model and techniques to represent event types. While basic word embedding only considers 'words' as vectors, we focus on novel extensions to represent specific constituents of events. More specifically, we focus on leveraging event embedding techniques for the auto-recognition and dynamic tagging of event-types from evidence logs. This work makes the following main contributions:

1. We devise a framework for data-driven insights: from generating training data, to processing data into contextualized knowledge, to techniques for insights and discovery. A key component we propose is the notion of dynamic tags. Tags, in general, are useful for injecting semantics into data. However, we extend the basic idea of tags with the notion of dynamic tags – tags that are automatically created via machine processing. The use of tags provides a powerful mechanism for the organization, navigation, summarisation and understanding of data [261]. This is particularly useful in the context of large amounts of information. Additionally, tags also enable powerful capabilities for visualization (as we show later), which greatly helps in decision making.

2. We leverage NLP techniques and word embeddings [177] to enable the recognition of events from case-related corpora. Word embedding is a technique that investigates how linguistic elements (words) are semantically similar based on their distributional properties [139]. Conventionally, words are represented as vectors in a vector space model (VSM) [177], where words that are semantically similar are closer to each other in the space vis-à-vis the remaining words. Extrapolating upon this idea, we encode event-types as vectors based on the distributional semantics of evidence items. We then leverage the similarity properties of the vectors to help identify when sentences relate to a particular event-type. Our framework also includes methods for ongoing and incremental feedback from domain experts.

3. Finally, we have validated our proposed approach with respect to accuracy. We have leveraged a publicly available dataset of 4K real cases originally obtained from AustLII¹ [62],

¹ www.austlii.edu.au

from which sentences containing events were extracted and annotated (using a combination of manual and automated methods). This establishes our ground truth. We then compare the performance of our framework in automatically identifying event-types against the annotated data. Our results are very promising, as reported in Section 3.5.

3.2 Related work

Data-driven insights in the legal domain. Several studies can be found where NLP and ML techniques are used for the analysis of legal corpora for organization, understanding and prediction purposes. Works like [155, 207] focus on legal document clustering, where, e.g., Lu and Conrad [155] propose to cluster legal documents using a classification-based approach equipped with topic segmentation. The work demonstrates that clustering legal documents can be done effectively and efficiently with this approach by leveraging metadata currently available in legal corpora. On a different front, the problem of document summarization has also been addressed in the legal domain [202, 76, 129]. Here, Polsley et al. [202] propose CaseSummarizer, a system that helps in the automatic summarization of legal texts. In their approach, the authors propose to leverage existing, well-established methods from natural language processing (e.g., part-of-speech tagging and TF-IDF [164]) and domain-specific constructs, and show that CaseSummarizer performs better than non-domain-specific alternatives.

The problem of outcome prediction for court cases has also caught the attention of researchers [243, 157]. Here, Luo et al. [157] focus on predicting charges in criminal cases taking into account both the facts described in cases and the articles stemming from criminal law. They propose an attention-based neural network that helps predict both the charges and the relevant articles applicable to a case. Finally, action, event and offense identification has also been explored in the legal domain, however, within a broader scope. For example, on the action and event identification front, Soria et al. [234] propose SALEM, a framework for automatically annotating legal text with semantics. The framework allows for the annotation of entities and actions, and the connection of these elements to specific types of regulations (e.g., obligations, penalties and prohibitions) as they emerge from legal texts. Similarly, Liu and Liao [147] address a related problem in the context of civil law. They use instance-based classification techniques on top of a Chinese legal corpus to classify documents based on the law articles involved in the case.

The works discussed above focus mostly on analyzing collections of legal documents and providing holistic and retrospective insights into them, except for the line of work on predicting outcomes of court cases [243, 157]. Our work instead focuses on utilising existing corpora for training event-type embeddings for recognition, with the final aim of providing insights into cases and proactively assisting investigators as they proceed with their investigative tasks.

Data-driven insights in law enforcement. The law enforcement domain has also traditionally benefited from NLP and ML for supporting criminal investigation tasks. In the area of crime and authorship, Rahman et al. [17] address the problem of topic and author identification, as well as the level of contribution of authors to topics over time, on a dataset made of chat logs. Two extensions are proposed to standard probabilistic topic modeling [269], namely LDA-Topics over Time (LDA-TOT) and Author-Topics over Time (A-TOT). Zhen et al. [295], instead, investigate authorship in the context of cybercrime by analyzing the content of e-mails and forums. They study the use of traditional ML algorithms (C4.5, Neural Networks and SVM) [276] with different combinations of features, including style markers (e.g., number of uppercase characters), structural features (e.g., special characters) and content-specific features (e.g., references to prices in subjects of e-mails). The combination of structural features and style markers seems to produce the best results.

Other works focus on spatio-temporal, textual analysis [149, 88]. For example, Helbich et al. [88] study the use of narrative crime reports to build maps where such documents are clustered and correlated to a geographical space in order to gain insights into the geography of crime. The authors used a combination of self-organizing maps [125] and point pattern analysis [220] to help identify clusters of documents and map them to geographical locations. Similarly, Liu et al. [149] propose a search engine that leverages spatio-temporal visualizations and information retrieval to help investigators query and geographically render crime reports, where techniques from information extraction [164], indexing [249] and clustering [276] are employed.

Works like [244, 38, 127], instead, explore the use of entity and relation extraction techniques to help crime investigation tasks. Here, Sun et al. [244] propose to leverage information extraction techniques [164] to extract terrorism-related events from documents, while Chau et al. [38] propose to extract relevant entities found in police reports through lexical rule patterns obtained with a neural network trained for the purpose. In the same line, Ku et al. [127] propose to extract crime investigation information such as locations, people and vehicles using standard NLP techniques [164]. Finally, crime matching and clustering has also been addressed in the context of criminal investigations. For example, works like [119, 57] both leverage text clustering techniques for crime matching and forensic analysis. The former addresses the problem of matching crimes and criminals to previous cases, while the latter focuses on demonstrating the usefulness of clustering techniques to support information retrieval and authorship analysis.

In our work, we utilise recent advances in NLP and ML, including powerful techniques such as Word2vec [177]. The framework we propose, therefore, not only leverages the knowledge that can be obtained from law enforcement corpora alone but can also benefit from more general knowledge (i.e., general word embeddings and knowledge graphs). Together, this equips our framework with robust tools that can help gain useful insights into cases.

3.3 Evidence Collection & Analysis

Data collected (in this case, evidence) during an investigation provides fundamental information to help understand a case and build legal arguments in a court of law [161]. Evidence can originate from different sources, including interview transcripts, surveillance (e.g., security cameras), telecommunication channels (e.g., SMS messages), witness statements, investigation materials (e.g., habitation check reports) and electronic devices (e.g., laptops). Table 3.1 shows the different sources from which evidence is collected during an investigation.

Table 3.1: Examples of Evidence Sources

Evidence source           Examples
Interview                 Video, audio and transcript
Surveillance              Video, audio, photograph, tracking, notes, and runsheet/log
Telecommunication         Transcript, audio, SMS, call log and e-mail
Witness statement         Police and civilian narratives, Border Force, Customs, Immigration, translator, and forensics
Investigation material    Warrant, evidentiary certificate, photograph, company search, habitation search, notes, and forensic item
Electronic device         Mobile phone, computer, tablet, SIM card, portable storage and GPS

Typically, a case involves a large amount of evidence that is collected and stored for further analysis [12]. We refer to the collection of evidence as the evidence log. Relevant aspects of managing such evidence logs in the context of police investigations include the ability to quickly search for evidence and identify how it is connected to the various events, people, locations, objects and offenses involved in the investigation.

Figure 3.1: Illustration of an investigator’s workspace.

Figure 3.1 shows an illustration of an investigator’s typical workspace. We also show how evidence items are typically annotated to add semantics to data. Moreover, we exemplify how relationships can be inferred between two items that would otherwise remain unconnected. In Figure 3.1, we have a witness statement annotated by the investigator with two event-types: first, a travel movement of Andrew, who went to meet Phillip; and then a bank transaction between Phillip and Andrew. Next, we have an email exchange between Andrew and another person, Norman. Person names, as named entities [164], as well as timestamps, are also annotated. Third, we have video surveillance footage that recorded Norman stealing a car.

Beyond evidence annotation for summarization and search purposes, the resulting annotations can be exploited to infer further useful insights. For example, correlations can be drawn from co-occurring and semantically similar annotations across different evidence items in the log. This can again help connect the dots between events, people, organizations, offenses and artifacts involved in a case. In addition, annotations can also be leveraged along with other analysis dimensions such as time and space. For example, the combination of annotations related to event types (e.g., bank transaction) can be analyzed from a time perspective to help reconstruct how the sequence of events shaped a given case. This can be elaborated even further by considering the space perspective, where event-related annotations are correlated to geographical locations plotted on a map, e.g., to help investigators precisely characterize the locations where the events took place. In fact, later in Section 3.4.3, we illustrate examples of how some of these interesting insights can be visualized using our implemented tool.

The tasks and examples discussed above provide an overview of how evidence is analyzed and annotated manually in an investigation. In this chapter, we replace such manual event type annotations with dynamic tags, and we show how they can be automatically recognized from evidence items. We also showcase how our tool, Case Walls, can help facilitate insights and discovery in investigation processes. We discuss all this next.

3.4 Dynamic Event Type Recognition and Tagging

The evidence collection and analysis scenario discussed in the previous section shows how tightly data can be connected to a case. Moreover, much of such data is nowadays increasingly generated through digital means. At the same time, social and other Web-oriented platforms are promoting the creation of tremendous amounts of data on a daily basis. As discussed previously, a significant part of the investigation work therefore focuses on the analysis of data for the recognition of events relevant to a case, the discovery of insights and the presentation of such results to end-users (investigators) to support critical decision making within cases [215]. In this section, we propose a framework to support the process of automatically recognizing and tagging event-types from evidence collected as part of an investigation process.

The key rationale behind our proposed architecture is that much of the evidence in an investigation is described and represented through textual reports (e.g., phone call transcripts and e-mails). In our proposed solution, we thus focus on the processing and analysis of textual data that can provide insights into the case. As we will discuss next, we therefore leverage Natural Language Processing (NLP) techniques [162, 177] that can help us process such data for the automatic recognition and tagging of event-types.

Figure 3.2 illustrates our overall architecture, which consists of three main layers. Starting with Training Data Generation, this layer mines existing domain knowledge, relevant corpora and general-purpose knowledge graphs to formulate seed data. People (domain experts or crowd) involvement in this layer ranges from filtering the extracted data to taking an active role in the processing of data in order to formulate a seed dataset that will help our automatic event recognition approach. Next, at the heart of the framework, the Data Processing layer transforms raw data from evidence logs into semantically rich event-type representations. This layer leverages state-of-the-art NLP techniques [162, 177] in order to support the automatic recognition of event-types out of the items found in the evidence log of a case. More specifically, we leverage word embedding techniques [177] in order to encode various event-types of interest into a semantically rich VSM, similar to how words are represented as vectors through word embedding techniques.

Finally, at the Insights and Discovery layer, raw data is presented appropriately for end-users to understand and make decisions. A key component of this layer is the use of a dynamic tagging mechanism. Tagging in general is useful for assigning semantics to data, which thereby enables powerful analysis (the simplest example: if two items share the same tag, we can infer a relationship between them) [261]. Our architecture, however, transcends the idea of basic tagging and also introduces the notion of dynamic tags: these are tags that are automatically created from raw evidence logs. Another important component for insights and discovery is the use of visualization techniques. Accordingly, we propose supporting this by providing suitable visualizations for associating the dynamic tags with evidence items in the log.

In addition to the above, an important feature common across the whole framework is Feedback and Learning. In our proposal, this can be achieved in three ways: Firstly, the process for seed generation is not one-off but re-invoked to improve the system’s perception of knowledge. The data being processed can be periodically monitored and analyzed, e.g., by domain experts, and updated accordingly based on their feedback. Secondly, feedback can also be obtained from real-time data. For example, when investigation data is inputted for processing, the existing VSM can be updated to reflect, for instance, new event-types of relevance to a case. Thirdly, the use of dynamic tags also enables end-users (investigators) to accept or reject tags (e.g., if tags were added incorrectly), thus providing additional feedback to the system.

Figure 3.2: Our framework for dynamic recognition and tagging of event-types for insights and discovery

Figure 3.3: Detailed architecture for Training Data Generation, Event Type Recognition, and Insight and Discovery.

In the following sections, we present the details of our architecture, starting from the Training Data Generation layer. We then move on to discuss the remaining layers for Event Type Recognition, and Insights and Discovery. We use Figure 3.3 as a reference to explain the components of each layer of our architecture in more detail.

3.4.1 Training Data Generation

Given the textual nature of most of the evidence items found in investigations, in this section we describe the corpus (dataset) used to extract raw sentences and generate sentences labeled with event-types. Such sentences are used as training data for the event recognition technique we propose in the next section.

Page 51 of 194 Its businesses are focused on media … According to an independent expert's report in evidence … 

Figure 3.4: Excerpt of one of the case files in our case dataset. [ ... { "sentence": "While in the house they telephoned him and asked for his whereabouts.", "eventType": "Phone_Call" }, { "sentence": "The appellant arrived in […] on 7 December 2003 … ", " eventType ": "Travel_Movement" } ... ]

Figure 3.5: Excerpt of the gold standard data showing two sentences and their corresponding event-types.

3.4.1.1 Datasets

The dataset we use for training purposes was obtained from past court cases (including both criminal and civil cases) in Australia, which are made publicly available by the Australasian Legal Information Institute (AustLII) [106]. While this court dataset does not contain police investigation data, it contains rich descriptions of events, facts, transcripts of communications and other relevant details of the cases, which are similar to the content found in police investigation documents.

More specifically, in this work we use the pre-processed dataset available at the UCI Machine Learning repository [62]. This dataset contains a total of approximately 4K cases, where each case record is made available as an XML file. The XML file consists of a list of sentence nodes (e.g., "In all these circumstances, subject to the applicants...") that contain the text recorded in each case. Figure 3.4 shows an excerpt of one of these files. The file contains additional metadata, such as catchphrases (summaries of the case) and links (URLs) to the original source (AustLII’s website [106]). We ignore these metadata since they do not contain sentences relevant to event-type recognition.
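For illustration, a minimal sketch of pulling the sentence nodes out of one such XML file is shown below; the file name is hypothetical, and we assume <sentence> elements as described above.

```python
# A minimal sketch of extracting the sentence nodes from one case file.
# We assume <sentence> elements somewhere in the document, per the
# description above; the file name is illustrative.
import xml.etree.ElementTree as ET

tree = ET.parse("case_06_1.xml")  # hypothetical case file name
sentences = [node.text.strip() for node in tree.iter("sentence") if node.text]
print(len(sentences), sentences[:2])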

3.4.1.2 Training Data for Event Type Recognition

Armed with the dataset described above, we transform and process it into valid training data. For illustration purposes, consider an extract from our dataset below:

“Mr. Martens said that during 1987 whilst working part-time in Belmont [...]. However, in January 2002, Mr. Martens telephoned him at the offices of Mr. Hoffman, and engaged him as a tax accountant for him and his wife Victoria.”

In the paragraph above, the sentence “...in January 2002, Mr. Martens telephoned him at the offices of Mr. Hoffman, and engaged him as a tax accountant...” makes reference to an event of type Phone Call. In this layer of our framework, the goal is to mine our case dataset to identify sentences containing event-types and generate labeled data that can be used for training purposes.

First, we split the dataset using the Sentence Splitter (see the bottom layer in Figure 3.3). The output of this component consists of a list of sentences similar to the one exemplified above. Notice, however, that not all sentences will necessarily contain relevant event-types. Therefore, in the next step, the Open Relation Extractor uses relation extraction techniques [9] to help identify a list of sentences containing event-types (e.g., Phone Call). Relation extraction is the task of extracting tuples of the form <subject, relation, object>, in which subject and object are related to each other by relation. The output of this component is a list of pairs {<sent_i, rels_j>_k}, where sent_i represents a sentence (e.g., “However, in January 2002, Mr. Martens telephoned him as ...”) and rels_j is a list of relation tuples identified in sentence sent_i. As an example, the Open Relation Extractor will produce the tuple <Mr. Martens, telephoned, him> for our exemplary sentence. The relations in these tuples (e.g., telephoned) can indicate the presence (or not) of an event-type of relevance (in this case, the verb telephoned indicates the presence of an event of type Phone Call).
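As an illustration of this step, the sketch below extracts such tuples with the OpenIE annotator of Stanford CoreNLP through the stanza client; this is one possible realization of the Open Relation Extractor, not necessarily the exact tool used in our implementation.

```python
# A hedged sketch of the Open Relation Extractor step using Stanford CoreNLP's
# OpenIE annotator via stanza's client (requires a local CoreNLP installation
# with CORENLP_HOME set).
from stanza.server import CoreNLPClient

text = ("However, in January 2002, Mr. Martens telephoned him at the offices "
        "of Mr. Hoffman, and engaged him as a tax accountant.")

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "depparse", "natlog", "openie"],
                   be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            # e.g., <Mr. Martens, telephoned, him>
            print(f"<{triple.subject}, {triple.relation}, {triple.object}>")
```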

The Open Relation Extractor extracts all sorts of relations from sentences, regardless of whether they refer to event-types of interest or not. In the sentence above, besides the relation tuple <Mr. Martens, telephoned, him>, additional relation tuples that are not related to our list of relevant event-types can also be extracted, such as <Mr. Martens, working part-time in, Belmont>. The Sentence Filtering component (see Figure 3.3) helps in filtering out sentences that do not contain relevant events. In order to do this, it takes as input the list of sentences from the case dataset and the identified relations within each of them (i.e., {<sent_i, rels_j>}), as well as a list of curated seed relations (e.g., generated by domain experts). It then keeps only sentences containing relations (e.g., telephoned or contacted with) found among the list of seed relations. The final list of remaining sentences is then labeled with the event-types they contain and passed on to the upper layer.

Figure 3.6: Event Type Vector Encoding Using Seed n-grams

Following this approach, a total of 500 sentences were extracted from our dataset, each labeled with one of the event-types of interest (Phone Call, Bank Transaction and Travel Movement). In the following section, we discuss how this training data is exploited for the purpose of event-type recognition.

3.4.2 Event Type Recognition

In order to recognize event-types from evidence items, we first need to find a suitable representation for event-types that are relevant in the context of investigations. The key intuition behind our proposed approach for event-type recognition is that, similar to how word embeddings [177] are used to represent words in a VSM, we can also encode event-types into vectors. The resulting VSM for event-types can thus be leveraged in order to help recognize event-types within investigation data. Using the architecture shown in Figure 3.3 as reference, we elaborate next on the details of how we achieve this.

3.4.2.1 Event-Type Vector Encoding

While general-purpose word embedding models often consider “words” as vectors, we have extended this idea to represent event-types in a VSM. We start by expressing event-types as a list of n-grams [164]. For example, n-grams like {call on}, {ring back}, {phone call}, {later call} and {make contact on} convey the idea that we are referring to a Phone Call event-type.

In order to build a vector for the event-type Phone Call, we therefore start by first building vectors for each of these n-grams. These individual vectors are then combined to formulate the event-type vector. Figure 3.6 shows the steps taken to build a vector for the Phone Call event-type. This is based on an initial set of seed n-grams provided by a human expert (e.g. {call on}, {ring back}, {phone call}, {later call}, {make contact on}). Alternatively, this could also be sourced from an initial training data sample (e.g., see how the Ngram Selector component in Figure 3.3 is fed by the Training Data Generation layer). Using these seed n-grams, we then build vectors for each one of them by using a word embedding model [177] (see the Event Type to Ngram Encoder component in Figure 3.3).

Equation 3.1 formally describes the computations used to encode both individual n-grams and event-types as vectors:

v(\mathit{ngram}) = \frac{1}{n}\sum_{k=1}^{n} v(w_k) \qquad\qquad v(\mathit{evtype}) = \frac{1}{m}\sum_{i=1}^{m} v(\mathit{ngram}_i) \qquad (3.1)

Here, v(ngram) is the vector of an n-gram, obtained by averaging the vectors v(w_k) of each word w_k in the n-gram, where n is the number of words in that n-gram. Then, v(evtype) is the vector for an event-type, generated by averaging the set of n-gram vectors (i.e., v(ngram_i)) corresponding to the event-type, where m is the number of n-grams linked to the event-type. The result of this is the encoding of event-types into a VSM. Next, we will discuss how we can fine-tune these event-type vectors in order to obtain a more precise representation of event-types.
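As a concrete illustration of Equation 3.1, the sketch below encodes the Phone Call event-type, assuming a pre-trained word2vec model loaded with gensim; the model file name is illustrative and out-of-vocabulary words are simply skipped.

```python
# A minimal sketch of Equation 3.1 with gensim; the model file name is
# illustrative, and words missing from the vocabulary are skipped.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def ngram_vector(ngram):
    # v(ngram): average of the word vectors v(w_k) of the n-gram's words
    words = [w for w in ngram.split() if w in kv]
    return np.mean([kv[w] for w in words], axis=0)

def event_type_vector(seed_ngrams):
    # v(evtype): average of the vectors of the n-grams linked to the event-type
    return np.mean([ngram_vector(ng) for ng in seed_ngrams], axis=0)

phone_call = event_type_vector(
    ["call on", "ring back", "phone call", "later call", "make contact on"])
```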

3.4.2.2 Tuning Event-Type Vectors

In order to fine-tune our event-type vectors, we need to ensure that the underlying n-grams accurately represent each event-type. One way of achieving this is to build our n-gram vectors using sample sentences (and thus, n-grams) from our target domain. The intuition behind this is that the use of domain-specific corpora will help us obtain n-grams that are tightly connected to the terminology used within the domain.

Figure 3.7: Tuning Event Type Vectors by using the training dataset

For our law enforcement scenario, we therefore utilise the dataset of court cases introduced previously. In Section 3.4.1 we explained in more detail how we acquired this training dataset. Our training dataset is a collection of sentences (e.g., S.1) that are labeled with at least one event-type.

“33 After receiving this email, Mr. Kaplan spoke to Mr. Gibson by telephone on the same day.” (S.1)

Following our architecture in Figure 3.3, the N-gram Selector component extracts all keywords (nouns and verbs) from each sentence of the corpus (see step 1 in Figure 3.7). Next, it removes stop words and uses stemming techniques to get the root form of each remaining keyword. It then creates a list of n-grams out of these keywords. If we consider, e.g., sentence S.1 above, possible n-grams are {receive email, email speak, speak telephone, receive email speak, email speak telephone}.

The n-grams are then converted into vector representations, which are then compared with the vectors of existing event-types (step 2 in Figure 3.7). Next, the N-gram Selector component tries to find and select the top n-gram – the one that (i) has the closest vector to an event-type vector in our VSM, and (ii) has a similarity score greater than a given threshold¹. After finding the top matching n-grams, the N-gram Selector component sends these n-grams to the N-gram Encoder (step 3) to update the corresponding event-type vector.

Learning from Real-time Data. The tuning of event-type vectors, as described above, is done using pre-existing training data. In addition to this, our proposed solution also allows for fine-tuning event-type vectors through continuous feedback from users and real-time investigation data. Here, as corpora from the inputted evidence log are processed, feedback is sent to our Event Type Recognizer to update the existing event-type vectors. For example, the sentence “Later, he, Ari, used his phone to transfer 6 thousand dollars to Philip” contains the n-gram {transfer, thousands, dollar}, which can be considered relevant to the event-type Bank Transaction. Such n-grams are sent to the N-gram Selector component to start the processing pipeline for updating the corresponding event-type vectors.
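A minimal sketch of this selection-and-update loop is given below, reusing kv, ngram_vector and event_type_vector from the previous snippet; the 70% threshold follows footnote 1, while the update rule (re-encoding the event-type after appending the accepted n-gram) is a simplified reading of the N-gram Encoder step.

```python
# A simplified N-gram Selector loop; reuses ngram_vector and
# event_type_vector from the previous sketch.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_top_ngram(candidate_ngrams, evtype_vec, threshold=0.70):
    # keep the candidate closest to the event-type vector, if close enough
    scored = [(cosine(ngram_vector(ng), evtype_vec), ng)
              for ng in candidate_ngrams]
    best_score, best_ngram = max(scored)
    return best_ngram if best_score >= threshold else None

def update_event_type(seed_ngrams, new_ngram):
    # an accepted n-gram joins the seed list and the vector is re-encoded
    seed_ngrams.append(new_ngram)
    return event_type_vector(seed_ngrams)
```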

3.4.2.3 Event-Type Recognizer

The task of the Event Type Recognizer component (see Figure 3.3) is to detect the possible event-type(s) that may appear in each sentence of an evidence log. This task starts by breaking down each paragraph of an evidence log into separate sentences (step 1 in Figure 3.8). From each sentence, all keywords (nouns and verbs) are extracted. We specifically consider verbs because they refer to the performance of activities and actions (e.g., talking, travelling, transferring), and nouns (e.g., phone, bar, money) because they convey the general purpose of a sentence when considered in conjunction with verbs (e.g., talking on the phone, going to the bar, transferring money) [196, 70]. Other elements such as persons or dates are also detected and tagged separately.
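The keyword extraction step can be sketched as follows, here with nltk's POS tagger and Porter stemmer standing in for the pipeline's actual components; the sample sentence is taken from Figure 3.8.

```python
# Simplified keyword extraction: keep nouns and verbs, drop stop words,
# stem what remains (nltk stands in for the pipeline's actual components).
# One-off downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def keywords(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    return [stemmer.stem(w) for w, tag in tagged
            if (tag.startswith("NN") or tag.startswith("VB"))
            and w.isalpha() and w not in stop]

print(keywords("they had a phone conversation with Dr. Green in hospital"))
# e.g., ['phone', 'convers', 'hospit'] (illustrative output)
```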

The extracted keywords are then converted into their stem forms and used to build a list of possible n-grams² (e.g., {phone conversation, conversation hospital, hospital pretend, pretend problem, phone conversation hospital, ...}). The n-grams are then converted into vector representations (step 2 in Figure 3.8), which are then compared separately with our previously built event-type vectors (Sections 3.4.2.1 and 3.4.2.2). If the similarity between the vectors (i.e., a sentence’s n-gram vector and an event-type vector) exceeds a specified threshold (α = 55% as an initial setup), the Event Type Recognizer records the corresponding event-types as detected event-types (e.g., Phone Call in step 3 of Figure 3.8). Notice that, if a sentence contains more than one event-type, the Event Type Recognizer will still be able to capture all of them (depending on the similarity threshold used). We also provide the possibility to flag event-types as “suggested”: this corresponds to event-types that score a similarity value falling between β and α (i.e., for our initial setup, between β = 40% and α = 55%). These parameters can be fine-tuned to reflect the investigator’s preferences in terms of precision and recall.

¹ We set an initial threshold of 70%; this parameter can be tuned as needed.
² We choose bigrams and trigrams since, based on our observations, averaging the vectors of more than 3 words results in an embedding that is not semantically meaningful.

Figure 3.8: Recognition of Event Types from Evidence Logs
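Putting the pieces together, a minimal sketch of the recognition step is shown below, reusing kv, ngram_vector and cosine from the earlier snippets; keyword extraction is deliberately simplified here, and the α/β thresholds follow the values above.

```python
# A minimal sketch of the Event Type Recognizer, reusing kv, ngram_vector
# and cosine from the earlier snippets. Keyword extraction is simplified
# (the full pipeline uses POS tagging and stemming, as described above).
ALPHA, BETA = 0.55, 0.40  # "detected" / "suggested" similarity thresholds

def recognize(sentence, event_vectors):
    # event_vectors: dict mapping event-type name -> encoded vector
    words = [w for w in sentence.lower().split() if w in kv]
    # candidate bigrams and trigrams built from the sentence keywords
    ngrams = [" ".join(words[i:i + n])
              for n in (2, 3) for i in range(len(words) - n + 1)]
    detected, suggested = set(), set()
    for ng in ngrams:
        for evtype, vec in event_vectors.items():
            sim = cosine(ngram_vector(ng), vec)
            if sim >= ALPHA:
                detected.add(evtype)
            elif sim >= BETA:
                suggested.add(evtype)
    return detected, suggested - detected
```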

3.4.2.4 Event Information Extraction

In addition to recognizing event types, we also consider the possibility of extracting event information that can help provide a more holistic view of the events being analyzed. While different event-types can be characterized by different bits of information (e.g., a bank transaction may be characterized by the amount of money transferred, while a phone call may be characterized by the phone number making the call), we specifically focus on extracting three key entities that are relevant for any event during an investigation [63], namely the date/time when the event happened, the location where it took place, and the parties (e.g., people) involved in the event. To do so, we leverage Information Extraction techniques [162] that allow us to recognize such entities from the textual descriptions of an event. More specifically, we leverage named-entity recognition techniques [184] to perform this task, and we rely on the state-of-the-art tools provided by the Stanford CoreNLP Toolkit [162]. For instance, take the sentence below, which contains an event of type phone call.

“However, in January 2002, Mr. Martens telephoned Mr. Hoffman at his office in Lan- caster, and engaged him as a tax accountant...”

By performing named-entity recognition on the sentence above, we can extract the named entities “January 2002” (date), “Lancaster” (location), and “Mr. Martens” and “Mr. Hoffman” (parties involved). The extracted named entities can be used to annotate the recognized event, thus producing a richer representation thereof. As we will discuss in Section 3.4.3, these pieces of information are key to providing insights and guiding the discovery process in the investigation of cases.
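A minimal sketch of this step is shown below using stanza, the Stanford NLP Group's Python package, as a convenient stand-in for the CoreNLP toolkit referenced above.

```python
# A minimal named-entity recognition sketch using stanza; run
# stanza.download("en") once beforehand to fetch the English models.
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
doc = nlp("However, in January 2002, Mr. Martens telephoned Mr. Hoffman "
          "at his office in Lancaster, and engaged him as a tax accountant.")
for ent in doc.ents:
    print(ent.text, ent.type)  # e.g., "January 2002 DATE", "Lancaster GPE"
```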

3.4.2.5 Event Recognizer REST APIs

To assist developers in using the Event Recognizer service, we have made its APIs available at http://eventify.ngrok.io. The service allows developers to detect events (e.g., Phone Call) that may appear in sentences. In addition, developers are able to tune the generated event vectors to improve the accuracy of the service using the available endpoint. Figure 3.9 shows the list of endpoints

offered by the Event Recognizer service.

Figure 3.9: Swagger documentation for the Event Recognition Service
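A hypothetical client call is sketched below; only the base URL comes from the text, while the /recognize endpoint name and the request/response shapes are illustrative assumptions.

```python
# A hypothetical call to the Event Recognizer service; the /recognize
# endpoint and the payload/response shapes are illustrative assumptions.
import requests

resp = requests.post(
    "http://eventify.ngrok.io/recognize",
    json={"sentence": "While in the house they telephoned him and "
                      "asked for his whereabouts."},
)
print(resp.json())  # e.g., {"eventTypes": ["Phone_Call"]} (assumed shape)
```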

3.4.3 Insights and Discovery

After event recognition and enrichment (with event-related information such as timestamps and people involved), we can now illustrate how this information can be presented to end-users in the form of dynamic tags in Case Walls, our tool for supporting investigations in law enforcement (see the top layer in Figure 3.3). As mentioned earlier, tags inject semantics into data, and we utilise them to produce powerful visualizations for insights and discovery. Figure 3.10 shows Case Walls in action. The Digital Assistants Wall (see the upper part of the figure) helps users automatically recognize event-types within evidence items (see the text file in the Workspace Wall). Three event-types are automatically recognized by the Digital Assistant, namely phone calls, travel movements and bank transactions. Furthermore, the corresponding statements are dynamically tagged in the text file displayed in the Workspace Wall (see the highlighted sentences). With the help of the Digital Assistants, the user can approve or reject the detected event-types, thus providing feedback to the system to support learning.

From the dynamically recognized and tagged events, users can also derive useful insights that can help correlate events, people, locations and objects found in evidence items. At the bottom part of Figure 3.10 (see the Visualization Widgets), we show how travel movements are plotted on a timeline, along with other extracted elements such as Person, location, and timestamps. These elements correspond to the complementary information that is automatically extracted along with the recognized event-types, as discussed in the previous section.

The first timeline (top) presented in the Visualization Widgets shows a plot where the y-axis is set to location and the x-axis to time. In this timeline, we are thus able to track the movement of Persons across time. In the second timeline (bottom), we plot bank transaction event-types. We set the y-axis using a formula for the frequency of these event-types (which determines the size of the green dots), again across time. By juxtaposing these two timelines side-by-side, we are able to discover that after each travel movement and meeting between Andrew and Philip, there is a bank transaction shortly thereafter.

The example above is just one instance of the type of insights investigators can gain by using our framework. Similar widgets can also be developed to support insights and discovery, e.g., to further connect events with offences, allegations and even past (or current) investigations. In the next section, we turn our attention to experiments that help validate the effectiveness of the techniques proposed for the automatic recognition of events.

Figure 3.10: Case Walls for Law Enforcement

3.5 Experiments

We demonstrate the effectiveness of our approach by conducting three experiments to test the precision of the proposed event-type recognition approach. We provide the details of our experiments next.

Page 61 of 194 %tann e,ie,b nyuigvcosbitbsdo h nta e fse -rm (section n-grams seed of set initial the on to with based amounts even built precision high The vectors reasonably 3.4.2.1). using is only event-types by three i.e., all set, for training precision 0% the Here, 3.11. Figure in shown norgl tnaddtstcnitn fa of consisting dataset standard gold the our of in ability the evaluate to is experiment this of aim The Recognition Event-Type 3.5.1 fixed next. is set (testing set training dataset) 0-50% gold for of Precision/Recall/F-Score 50% - on Recognizer Type Event 3.11: Figure eeaut h efrac fthe of performance the evaluate We Validation Conventional 3.5.1.1 with tagged correctly extraction are relation sentences types. that combining event guarantee appropriate by to the 3.4.1.2) order in Section curation (see manual and dataset [9] cases techniques 4K our from generated is three etst rm0 o5% h i fti xeieti oso h feto riigo the on training of effect the show to is the experiment (w.r.t. this set of training aim the The the of while performance 50%. size relative to the 0% from increase set) we test as explore approach we the Furthermore, of dataset. performance standard and the gold sentences the output to the recognized both event-types comparing We corresponding by sets. the F-Score test 50% and and Recall training Precision, 50% standard into the dataset measure standard gold the split we where [276], approach Transaction as event-type an Precision 100% 25% 50% 75% 0% event-types: 01 02 03 04 50 45 40 35 30 25 20 15 10 5 0 Phone Call epciey sw nraetepretg ftann e fo %t 0) we 50%), to 0% (from set training of percentage the increase we As respectively. , Training (Percentage) Travel Movement detected hn al rvlMvmn n akTransaction Bank and Movement Travel Call, Phone vn yeRecognizer Type Event Bank Transaction is α =%55

Recall 100% .3 0.80 0.93, 25% 50% 75% 0% vn yeRecognizer Type Event n as and , 01 02 03 04 50 45 40 35 30 25 20 15 10 5 0 Phone Call ae6 f194 of 62 Page 500 and Training (Percentage) sbigue.Ntc httetrsod oidentify to thresholds the that Notice used. being is Travel Movement suspected etne Fgr .) ahlbldwt n of one with labeled each 3.5), (Figure sentences 0.90 for Bank Transaction vn yeRecognizer Type Event is hn al rvlMovement Travel Call, Phone β =%40 hog ovninlvalidation conventional a through

F-Score 100% 25% 50% 75% 0% h vlainrslsare results evaluation The . h odsadr dataset standard gold The . 01 02 03 04 50 45 40 35 30 25 20 15 10 5 0 Phone Call Training (Percentage) Travel Movement oietf events identify to and Bank Transaction Bank can see that precision improves to 0.99, 0.87 and 1.00 for Phone Call, Travel Movement and Bank Transaction, respectively.

Recall, prior to training, reaches a value of 0.50 for Bank Transaction, while for Phone Call and Travel Movement the values are 0.91 and 0.88, respectively. This low recall is caused by the fact that we have a rather small set of n-grams to start with for Bank Transaction in comparison to the other two event-types (Phone Call and Travel Movement events were found at a higher frequency in our training dataset). However, as the system learns a higher number of n-grams, the recall goes up to 1.00 (at a 50% training set size).

3.5.1.2 K-Fold Cross Validation

In this experiment, we evaluate the performance of the Event Type Recognizer using k-fold cross validation [276], where we randomly partition the gold standard dataset into k equal-sized subsets. We choose k = 5 and perform 5 rounds (folds) of training and testing. In each round, we choose a different subset as the test set, while we train our vectors on the remaining 4 subsets; this corresponds to an experimental setting of 80% training, 20% testing. For each fold, the results produced on the test set are used to compute precision, recall and F-score (using our gold standard data as reference), which are then averaged to estimate the overall performance of the Event Type Recognizer. The results of our cross validation test are reported in Table 3.2. The average precision is very promising: Phone Call (0.94), Travel Movement (0.91) and, interestingly, Bank Transaction (1.00). The recall is also promising; it is above 0.95 for both Phone Call and Travel Movement, while for Bank Transaction recall reaches 0.88. These results show that if we train the system to learn more precise n-grams, it can indeed boost performance.
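The evaluation harness can be sketched as follows with scikit-learn; train_event_vectors and predict are placeholders for the tuning and recognition steps described earlier, and the data shown is illustrative.

```python
# A minimal 5-fold cross-validation harness in the spirit of Section 3.5.1.2;
# train_event_vectors and predict are placeholders for the actual components.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support

sentences = np.array(["they telephoned him", "he flew to Sydney",
                      "she wired the money", "he rang her back"] * 5)
labels = np.array(["Phone_Call", "Travel_Movement",
                   "Bank_Transaction", "Phone_Call"] * 5)

def train_event_vectors(sents, labs):
    return None  # placeholder: tune event-type vectors on the training fold

def predict(sent, vectors):
    return "Phone_Call"  # placeholder: run the Event Type Recognizer

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(sentences):
    vectors = train_event_vectors(sentences[train_idx], labels[train_idx])
    preds = [predict(s, vectors) for s in sentences[test_idx]]
    p, r, f, _ = precision_recall_fscore_support(
        labels[test_idx], preds, average=None, zero_division=0,
        labels=["Phone_Call", "Travel_Movement", "Bank_Transaction"])
    print(p, r, f)  # per-event-type scores for this fold, as in Table 3.2
```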

3.5.1.3 Human Validation

To further evaluate the accuracy of our approach, we ran the system on the 4K cases dataset and then randomly picked a total of 500 event-type recognitions. We then asked two participants (postdoctoral researchers) to independently classify each recognized event-type as correct or incorrect. To measure the reliability of these judgments, we used Cohen’s Kappa coefficient (κ) [46], which measures the agreement between two individual raters.

Table 3.2: Event Type Recognizer - Precision (P) / Recall (R) / F-Score (F) / Average (Avg) for 5-Fold Cross-Validation

        Phone Call          Travel Movement     Bank Transaction
Folds   P     R     F       P     R     F       P     R     F
F1      0.89  1.00  0.94    0.90  0.98  0.94    1.00  0.83  0.91
F2      1.00  0.90  0.95    0.89  1.00  0.94    1.00  0.63  0.77
F3      0.94  0.97  0.96    0.94  1.00  0.97    1.00  0.83  0.91
F4      0.97  0.88  0.92    0.84  1.00  0.91    1.00  1.00  1.00
F5      0.89  1.00  0.94    0.96  1.00  0.98    1.00  1.00  1.00
Avg     0.94  0.95  0.94    0.91  1.00  0.95    1.00  0.86  0.92

A typical interpretation of this coefficient in relation to agreement levels is: Poor (κ < 0.20), Fair (0.20 ≤ κ < 0.40), Moderate (0.40 ≤ κ < 0.60), Good (0.60 ≤ κ < 0.80), Very good (0.80 ≤ κ ≤ 1.00) [6]. Kappa compares an Observed Accuracy with an Expected Accuracy (random chance) in a confusion matrix. As it accounts for random chance, it is less misleading than simply using accuracy as a metric. The results show that the two participants agreed on 96.05% of all the recognitions (κ = 0.77). The overall average accuracy was 88.4% for Phone Call, 80.2% for Travel Movement and 77.1% for Bank Transaction.
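For reference, a minimal computation of this coefficient with scikit-learn, using illustrative rater labels:

```python
# Inter-rater agreement with Cohen's kappa; the label lists are illustrative.
from sklearn.metrics import cohen_kappa_score

rater_a = ["correct", "correct", "incorrect", "correct"]
rater_b = ["correct", "incorrect", "incorrect", "correct"]
print(cohen_kappa_score(rater_a, rater_b))
```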

3.5.1.4 Effect of Word Embedding Model

The aim of this experiment is to assess the impact of using different word embedding models (trained on different corpora) on the performance of our Event Type Recognizer. We first shuffled the gold standard dataset and split it into two sets: 70% training, 30% test. We then evaluated the performance of the Event Type Recognizer - in terms of Precision and Recall - while using different word embedding models. The list of embedding models used for this test is given below:

• GoogleNews: A publicly available word2vec model trained on Google News articles¹ with about 3 million words and phrases.

• WikiNewsFast [22]: A fastText² model pre-trained on Wikipedia (2017), the UMBC webbase corpus and statmt.org news, with 16 billion words.

¹ https://code.google.com/archive/p/word2vec/
² https://github.com/facebookresearch/fastText

Page 64 of 194 • GigaGlove: A pre-trained model on Wikipedia (2014) and Gigaword (5th edition) by using Glove3 algorithm.

• Numberbatch [235]: A model pre-trained on ConceptNet⁴, an open multilingual knowledge graph.

• TF-IDF⁵: A baseline model that measures the similarity of sentences (documents) by representing words according to their importance/frequency in the corpus (training set).

• WikiW2V: Our own word embedding model trained on Wikipedia.
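For reference, most of the models above (or close approximations of them) can be loaded interchangeably behind the recognizer via gensim's downloader; the alias names below are gensim-data identifiers and only approximate the exact models we used.

```python
# Loading alternative embedding models behind the same recognizer; the
# aliases are gensim-data identifiers that approximate (not exactly match)
# the models listed above. WikiW2V and TF-IDF are not available this way.
import gensim.downloader as api

MODELS = {
    "GoogleNews": "word2vec-google-news-300",
    "WikiNewsFast": "fasttext-wiki-news-subwords-300",
    "GigaGlove": "glove-wiki-gigaword-300",
    "Numberbatch": "conceptnet-numberbatch-17-06-300",
}
kv = api.load(MODELS["GigaGlove"])  # returns gensim KeyedVectors
```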

The results are reported in Table 3.3. Here, we can see that the Event Type Recognizer performs very well when it uses an embedding model trained on general-purpose corpora (e.g., Wikipedia), whereas performance is dramatically reduced when traditional keyword matching techniques are used. For instance, for Phone Call, the Event Type Recognizer reaches its best Precision (0.96) and Recall (1.00) when it uses the WikiW2V model. Likewise, it reaches its maximum Precision score for Travel Movement when WikiW2V is used, while the best Recall is obtained when either the WikiNewsFast or the Numberbatch model is used. For Bank Transaction, Precision is highest when the pre-trained GloVe embedding model is used, and Recall reaches its best score with both the WikiNewsFast and WikiW2V models.

While results are very positive when the Event Type Recognizer uses general-purpose word embedding models, performance is rather modest when the TF-IDF model is employed. The reason for such low performance is the high percentage of False Positives. As TF-IDF considers only (i) the frequency of a word in each training sentence, and (ii) the weight of rare words across all training sentences in the corpus, it does not exploit the contextual meaning of words in sentences. This leads to losing the semantic similarities/relations between words and sentences. As a practical example, consider the sentence “Mr Green talked with his secretary in a meeting, and asked her to send his phone number to Mr Ari before 2PM”. Although the sentence has four important keywords (talked, asked, phone, number) that are somewhat related to a phone communication, it does not actually describe a Phone Call event, and purely frequency-based matching would wrongly recognize one.

³ https://nlp.stanford.edu/projects/glove/
⁴ http://conceptnet.io/
⁵ Term Frequency Inverse Document Frequency

Table 3.3: Event Type Recognizer - Precision (P) and Recall (R) using different embedding models

              Phone Call    Travel Movement   Bank Transaction
Models        P     R       P     R           P     R
GoogleNews    0.73  0.90    0.48  1.00        0.79  0.38
WikiNewsFast  0.92  0.94    0.84  1.00        0.95  1.00
GigaGlove     0.83  0.95    0.86  0.96        1.00  0.80
Numberbatch   0.81  0.95    0.55  1.00        0.72  0.84
WikiW2V       0.96  1.00    0.93  0.93        0.97  1.00
TFIDF         0.80  0.33    0.73  0.27        0.78  0.46

3.6 Discussion

The framework discussed in this chapter is part of a larger effort to support insights and discovery in police investigations. This initiative is realized through our prototype platform, Case Walls for Law Enforcement (or just Case Walls for short). Our platform provides a collaborative, assistive and analyst-friendly environment to both manage and analyze case data. We do this through features that can be categorized along the following dimensions:

• Cognitive intelligence. First, Case Walls provides a powerful set of computational techniques that support cognitive tasks performed as part of police investigations. Such tasks typically require the ability to obtain insights into cases in order to make decisions. In this context, our platform can help in tasks such as the identification of relevant events in textual data (e.g., from police narratives). In addition, it is also able to automatically detect the presence of a potential offense (i.e., a violation of a law) from such data, and moreover correlate the elements of an offense (e.g., fraud) to a set of events (e.g., a bank transaction to a certain person). Finally, it also provides support for natural language (NL) searches on top of insights obtained from such intelligent processing of data (e.g., searching for events and people within a case).

• End-user Digital Assistance. Layered upon cognitive intelligence, Case Walls then provides a set of digital assistants (in the form of chatbots) that can help investigators perform their tasks. For example, by leveraging the event and offense detection intelligence discussed before, the digital assistants can facilitate the tagging of evidence by automatically recommending tags that can be used for annotating evidence items. Other types of digital assistance include support for investigation task creation (e.g., habitation checks), entity search (e.g., searching for persons of interest) and NL queries (e.g., to search for offenses). We designed and implemented a range of digital assistance techniques, including: (i) interactive tagging, where investigators are automatically suggested tags but can also add or remove tags to teach the system; (ii) natural-language search capabilities (e.g., search for evidence, events, offenses); (iii) context awareness and producing required information as needed (e.g., suggesting evidence to substantiate an offense); and (iv) investigation briefing.

The dynamic event type recognition and tagging approach proposed in this chapter plays an important role in the realization of the features above. While this chapter discussed our approach in the context of event type recognition, the techniques can be easily extrapolated to other important investigation tasks such as tagging of evidence items with potential offences (as discussed above).

The experimental results presented in the previous section demonstrate the effectiveness of our approach for the auto-recognition of event-types. The proposed technique, however, comes with its own limitations. As each event-type vector evolves over time (learning/adding new n-grams), it may drift away and no longer represent the event-type precisely. In order to keep vectors precise, we can leverage Reinforcement Learning [114] to learn from positive/negative feedback from investigators, rewarding/punishing the system by adding/removing n-grams.

Another issue worth noting is that a growing number of event types might degrade recognition performance. As the number of event types expands, recognition tasks will become harder and may require a supervised approach. We aim to explore this concern as we collect more training data. We also plan to conduct larger-scale experiments with training in the midst of real-life investigations. Finally, in order to apply our framework to other domains, further experiments and evaluations are necessary, e.g., using existing publicly available datasets (news archives, tweets, etc.).

3.7 Concluding Remarks

The approach discussed in this chapter represents a step toward the practical realization of Cognitive Case Management (CCM) for Law Enforcement, where data, NLP, AI and computational processing power are considered first-class citizens. The implications of leveraging such technologies in this context translate into increased productivity and efficiency, and improved decision making and insights when managing cases.

The increased productivity and efficiency comes from the help that the technologies above can provide in automating repetitive, tedious and lengthy tasks. Without such help, investigators not only risk wasting a large amount of time that could otherwise be spent on tasks of a higher cognitive nature (e.g., identifying patterns in a case), but they are also exposed to the risk of missing important bits of information that could be relevant to the case. This is particularly relevant in cases involving very large corpora. Bringing these functionalities into the context and operations of investigators (e.g., identifying events while entering allegations into the system) can help avoid context switching and create deeper awareness and insights into the current task.

The improved decision making and insights can be obtained by leveraging the outcomes of the cognitive assistance provided by the discussed framework. For example, further data analytics techniques can be applied to the recognized events in order to identify named entities (e.g., organizations and locations) and relations among them, tags and summaries can be automatically derived from event descriptions, and patterns can be recognized from event dynamics. Finally, deeper exploration of cases can be supported on top of the insights above in order to accelerate discovery, pattern recognition and linking.

The future has in store many further exciting opportunities, such as codifying offenses from legislation and methods for correlating collections of events and mapping them to offenses. This can help automate the task of determining whether the elements of an offense can indeed be substantiated. Furthermore, we are confident the foundations of our work can be applied and extended to many other domains involving investigative tasks (e.g., science and research).

Acknowledgement. We acknowledge Data to Decisions CRC (D2D-CRC) for funding this re- search.

Chapter 4

Attribute Embedding based Indexing of Heterogeneous Information Sources

Natural Language Queries over Security Vulnerability Information Sources

This chapter presents attribute embedding techniques we devised to build flexible indexing methods over information stored in data services. More specifically, our work focuses on applying these techniques to the indexing of information in vulnerability information sources. We also discuss how these techniques can be used to effectively support natural language queries over vulnerability data sources. We show that attribute-based embeddings dramatically improve natural interactions with data sources. While we use security information as the context, the proposed techniques are applicable to other data sources.

The rest of this chapter is organized as follows. Section 4.2 discusses related work. Section 4.3 presents the security vulnerability information model used for our indexing techniques. Section 4.4 presents the techniques we propose to create attribute-based embeddings over an index of integrated data sources. Section 4.6 discusses how NL queries are answered over integrated vulnerability information data sources. In Section 4.7, we discuss the evaluation and validation of the techniques presented in this chapter. Finally, we provide concluding remarks in Section 4.8.

4.1 Introduction

In July 2017, one of the most notorious cyber security attacks was discovered in Equifax’s dispute portal servers, which resulted in a breach of personal information of approximately 145 million individuals1. The attack was possible due to a known and unpatched vulnerability found in their servers running Apache Struts2. Reports estimate a breach-related cost of $439 million through the end of 2018 [166].

While events like these have significantly raised concerns regarding software security, the ever increasing reliance on software systems to support business operations and the rising number of new vulnerabilities reported every year make keeping software systems secure a very difficult and time-consuming knowledge-driven process [294]. Recent cybersecurity studies and reports [213] show that in 2017, approximately 21,000 vulnerabilities were discovered and reported, 31% more than in the year before. Furthermore, as of mid-2018, more than 10,000 vulnerabilities had been disclosed [214], of which 16.6% have high severity scores (CVSSv2 [199]).

In knowledge-driven security vulnerability related processes (e.g., understanding, inquiry), in order for security analysts and professionals to become aware and informed about security vulnerabilities, integrated access to such information is needed. However, while much of the security vulnerability information is publicly available, it is in many cases scattered across different, heterogeneous and complex information silos that have low or no integration [115]. The cybersecurity domain has traditionally relied on multiple sources when it comes to inquiring about security vulnerabilities. One of the most widely used sources is the National Vulnerability Database (NVD)3, a U.S. government repository of security vulnerability information. NVD provides a list of vulnerabilities dating back to 1988, where each vulnerability is uniquely identified by its CVE ID (Common Vulnerabilities and Exposures). Another example is the Zero Day Initiative or ZDI4, which allows security researchers to privately report 0-day vulnerabilities to vendors. Vulnerabilities are collaboratively made public by ZDI and the affected vendor through a joint advisory.

1 https://www.gao.gov/assets/700/694158.pdf
2 https://struts.apache.org
3 https://nvd.nist.gov
4 https://www.zerodayinitiative.com

In addition to the resources above, other useful sources of vulnerability-related information include security bulletins and advisories created and managed by vendors. Examples include Mozilla’s security advisory5, Redhat’s product security center6 and the Apache Security Team7. Further sources exist that provide useful archives of exploits8, breach reports9, vendor-specific patches10 and crowd-sourced vulnerability reports11. Vulners12 aims at partly mitigating the heterogeneity and complexity of security vulnerability information by normalizing and aggregating the available sources. However, there is no integration among the different sources, and the query interface is still at the complexity level of the Domain Specific Language (DSL) of the underlying indexing and search engine (ElasticSearch). Besides the multiplicity and heterogeneity of information sources, accessing such information may require different forms of query mechanisms, including manual keyword search, the use of DSLs and REST API calls, among other mechanisms.

In this chapter, we aim at mitigating the above issues in vulnerability-related knowledge-driven processes. We focus on making vulnerability information sources accessible to security analysts and professionals through Natural Language Interfaces (NLIs). We build upon advances in NLP, word embeddings and knowledge-based enrichment to augment indexing attributes with embeddings, providing semantics that is essential to support multi-entity mentions and ambiguous user queries over heterogeneous data sources. More specifically, we propose to leverage multiple, heterogeneous and complementary sources, which we further enrich using state-of-the-art Knowledge Graphs (KGs) [236] and attribute embeddings to enable a richer, semantic representation of vulnerability information. Such integrated and semantically enriched information is capable of providing information not only about vulnerabilities alone, but also about affected software and vendors, associated exploits, attacks and patches, which jointly can help understand and mitigate the risks posed by security vulnerabilities. In order to overcome the complexity of querying such heterogeneous information, we propose a Natural Language Interface (NLI) that leverages embedding techniques [177] and allows security professionals and analysts to seamlessly query security vulnerability information. Such an NLI does not require learning specialized query languages, nor does it need familiarity with the underlying information schema. In summary, the

5 https://www.mozilla.org/en-US/security/advisories/
6 https://access.redhat.com/security/
7 https://www.apache.org/security/
8 https://www.exploit-db.com
9 https://breachlevelindex.com
10 https://portal.msrc.microsoft.com
11 https://www.hackerone.com
12 https://www.vulners.com

contributions of this chapter are:

• We leverage a unified architecture to collect and integrate information across vulnerability information sources.

• We propose a novel indexing mechanism over vulnerability information data sources. This mechanism relies on KG-based enrichments and attribute embeddings.

• We devise an NLI that is able to translate NL expressions into queries that are executable by the underlying index and search engine, namely ElasticSearch. Such an NLI does not require familiarity with the underlying security information schema.

• We evaluate the proposed techniques and demonstrate that they are effective in answering questions typically asked by security analysts and professionals using NL queries. More specifically, we study the performance of answering security vulnerability questions using attribute embedding based indexing with both domain-specific word embeddings (built from a security vulnerability corpus) and general-purpose ones (e.g., GoogleNews13 and Wikipedia14).

4.2 Related work

We explore related work from two perspectives that are key to our work: Information querying with NL support and structured embeddings.

Information Querying with NL support. The use of NL for querying information sources has been largely enabled thanks to research carried out mainly at the intersection of areas such as Natural Language Processing (NLP) [91], database systems [50] and Semantic Web technologies [92]. Among the works that combine NLP with the Semantic Web, Tablan et al. [247] proposes QuestIO, an NLI that allows for querying structured information in a domain-independent fashion. The work leverages both NLP techniques and Semantic Web technologies to query structured information. More specifically, QuestIO translates queries expressed in NL to formal ontology

13 https://code.google.com/archive/p/word2vec/
14 https://fasttext.cc/docs/en/pretrained-vectors.html

languages. A key advantage brought by QuestIO is the possibility of querying ontologies without the need to learn formal ontology languages. Lopez et al. [154] proposes PowerAqua, a Q&A system for querying information stored in heterogeneous semantic resources. It leverages distributed, ontology-based semantic markup on the Web to help answer questions expressed in NL. This system does not assume any prior knowledge about the semantic resources being queried. PowerAqua achieves this through steps that include linguistic and query classification analysis, syntactic term mapping and resource discovery, semantic mapping from user terminology into ontology terminology, relation similarity computation and triple linking, and answer generation. Kaufmann et al. [117] proposes Querix, a system that provides an NLI to query ontologies by leveraging clarification dialogs. Querix aims to address the challenges posed (to end-users) by the machine-understandable frameworks that are typically used to query ontologies in the context of the Semantic Web. A key advantage of Querix is that it allows users to provide clarifications when ambiguities emerge while querying ontologies.

Other works have focused on providing NLIs to query databases. Li and Jagadish [143] proposes NaLIR (Natural Language Interface for Relational databases), an NLI that is able to translate NL queries written in English into SQL statements. It leverages so-called Query Trees, a query representation that lies between a linguistic parse tree and a SQL statement. The resulting SQL queries may include comparison predicates, conjunctions, aggregations, nesting and different types of joins. Similarly to Querix [117] above, when ambiguities are found, NaLIR generates multiple possible interpretations of the NL query and interactively lets the user choose the appropriate one. In the same direction, PRECISE [203] is a system that translates NL queries into SQL queries by focusing on key database elements, namely relations, attributes and values. The system focuses on so-called Wh questions (“what”, “which”, “where”, “who”, “when”) and builds an attribute/value graph that helps match tokens (found in NL queries) representing database attributes and values to actual attributes and values in a database. PRECISE further relies on a lexicon to help expand the NL query vocabulary that can be used by end-users. TiQi [204] proposes an NLI that leverages PRECISE [203] to query traceability information in software repositories. In order to do this, it provides a domain-specific model that is endowed with software-traceability-related concepts, a set of heuristic rules (for phrase-based synonym matching, definition replacements, etc.) and project-specific terminology collected from trace experts. By leveraging these artefacts, the authors are able to customise PRECISE [203] to support answering unstructured NL

trace queries.

The works discussed above focus on the complex task of matching NL queries to formal database and ontology queries by trying to close the syntactic and semantic gaps between such NL and formal queries. In this chapter, we do not aim at advancing the field of NLIs along these lines. Instead, we leverage such techniques and propose a novel indexing mechanism over vulnerability information data sources that relies on KG-based enrichments and attribute embeddings. On top of this, we devise an NLI that is able to translate NL expressions into queries that are executable by the underlying index and search engine used to index security vulnerability information. Such an NLI does not require learning ad-hoc query languages, nor does it need familiarity with the underlying information schema.

Structured Embeddings. Besides representing text (e.g., a word, sentence or paragraph) as vectors in a vector space model (discussed in Chapter 2), other types of embeddings exist, such as data item embeddings, which focus on representing values, columns and tables in a vector space.

Databases store information based on a schema that describes the data types, keys and functional dependencies. Knowing the schema allows a user to query relevant information (e.g., financial transactions). However, as tables and columns in a database are a form of structured data, the user is forced to know the schema of the stored data (such as table names and columns) in order to be able to write correct database queries. Recently, a number of approaches have been proposed to facilitate data processing and analysis tasks (e.g., data integration, record linkage) by leveraging distributed representation models [189], inspired by the success of the likes of word embeddings and their extensions, to represent the semantics of data elements (e.g., tuples, cells, columns, tables).

Table2Vec [59, 292] recommends database rows and columns by utilising word embedding models. It helps users by offering “smart suggestions”, that is, recommending (i) additional entities (rows) and (ii) additional columns to be added to the table. Figure 4.1 shows an example of such recommendations. The recommendations are based on two different embeddings: Table2VecW and Table2VecH. Table2VecW provides vectorial representations of all values (words) in a table, generated by using pre-trained word embedding models (Google News and DBpedia). Table2VecH, on the other hand, represents columns (e.g., engine, country) in a vector space. At usage time, when the user is working with a table, Table2Vec helps the user fill each column (e.g., Constructor in Figure 4.1)

by suggesting new values (e.g., “McLaren”, “Mercedes”, “Red Bull”). Table2Vec generates a vector for all the values (e.g., “Ferrari”, “Force India”, “Hass”, “Manor” in Figure 4.1) within the column, and it queries Table2VecW (the value embeddings) to find the top-n values with the closest vectors to the generated vector. New column suggestions (e.g., Sessions, Races Entered), on the other hand, are based on the similarity between the vector generated from the table columns (e.g., constructor, engine, country and base in Figure 4.1) and the closest column vectors in Table2VecH. A minimal sketch of this value-suggestion step is shown after Figure 4.1.

Figure 4.1: The user can add (A) additional entities (rows) and (B) additional columns from a list of suggestions [292]
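To make the mechanics concrete, the following is a minimal sketch of the value-suggestion step, assuming a gensim KeyedVectors model as a stand-in for Table2VecW (the function name, vectors file and top_n parameter are our assumptions, not Table2Vec’s):

    import numpy as np
    from gensim.models import KeyedVectors

    # Any pre-trained word vectors can stand in for Table2VecW here.
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    def suggest_values(column_values, top_n=3):
        # Average the vectors of the values already in the column...
        vectors = [wv[v] for v in column_values if v in wv]
        column_vector = np.mean(vectors, axis=0)
        # ...and return the nearest neighbours not already present.
        neighbours = wv.similar_by_vector(column_vector,
                                          topn=top_n + len(column_values))
        return [w for w, _ in neighbours if w not in column_values][:top_n]

    print(suggest_values(["Ferrari", "McLaren", "Mercedes"]))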

TabVec [78] targets the issue of finding data tables on the Web and organizing data in a large corpus. TabVec creates a vector space model for data tables to represent their semantics. It uses these representations to identify clusters of table types using the K-means clustering algorithm [150]. As these clusters do not have any labels, the authors manually assign a label to each cluster: they obtain the centroid point of each cluster and ask crowd workers to label the tables corresponding to each point (e.g., entity tables, matrix tables, list tables). These labeled clusters are then used to classify new tables (extracted from HTML webpages) by comparing the vector of each cluster with the vector of the new table. However, the drawback of this approach is its table representation technique. As TabVec uses the deviation from the median of the word vectors within a table to construct the final table vector, it incorporates information from cells (e.g., adjacent cells) that are not conceptually similar.

Bordawekar et al. [25] focus on the problem of discovering semantic relationships between data in databases by using a vector space model. Such relationships (e.g., dangerous vacation packages) are hidden and left to be explored by users expressing queries (e.g., “which are the most

dangerous vacation packages?”). By offering a new class of SQL-based queries called Cognitive Intelligence (CI) queries (Figure 4.2), their approach helps users exploit meaningful relationships between values in the database. CI queries leverage word embeddings (Figure 4.3) to enable complex queries such as semantic matching, inductive reasoning (such as analogies or semantic clustering), and predictive queries. For example, using this approach, a user is able to query “provide the names of the 10 employees most related to John Dolittle” on a human resources database. While traditional SQL-based systems are not able to answer this query (for lack of semantics), the proposed approach returns a list of people who know the employee (e.g., worked with him (similar workplace), played with him (members of the same team)). However, the main drawback of the proposed approach is the technique used to generate the word embeddings. In order to train the embedding model, they use cell values and column names from the database to construct a corpus (Figure 4.3). As out-of-vocabulary (OOV) words are a common issue in word embedding models [23], adding new cell values or new columns to the database requires re-training the model. OOV refers to a word in a user query that was not part of the vocabulary lexicon (dictionary) at the time of training the word embedding model; thus, there is no vector representation for that word in the vector space.

Figure 4.2: Example of a CI similarity query - find similar customers based on their purchased items [25]

Petrovski et al. [201] propose an approach to vectorize database tables and columns by using their cells (content). They build vectors for table cells by leveraging a trained word embedding model. As table cells are not semantic atoms and can contain multiple words (e.g., New York), they compute individual cell embeddings as the average of the embeddings of the different words in a cell. Next, they construct column embeddings by aggregating the vectors of the cells within that column. Column embeddings are then used as a substitute for the encoding of the column descriptions (headers). To build table embeddings, they first construct a corpus of sentences corresponding to columns. Each sentence is a concatenation of all the cells within one table column; they generate 10 different sentences per column by shuffling the cells of each column 10 times. Using this corpus, they generate table embeddings by training Word2vec [176]. The generated embeddings are then used to paraphrase user queries with the table and column names that exist in the database. The approach computes the similarity between the vectors of keywords (from the user query) and the generated table and column

embeddings. For example, given the query “what is the length (miles) of endpoints west-lake park to wilshire?”, the system replaces the word “length” with the column name “distance” because of their high similarity.

Figure 4.3: Converting tables into sentences to create a corpus for training the embedding model [25]

In this chapter, on top of the above techniques, namely natural language translation and structured embeddings, we enhance the indexing mechanism over vulnerability information services. The proposed extensions transform traditional schema-based indexing into attribute embedding based indexing. The implication of the proposed approach is more flexible, schema-less natural language queries. It should be noted that the proposed techniques are applicable to other domains as well.

4.3 Security Vulnerability Information Model

Building an integrated source of security vulnerability information requires, first of all, the identification of useful information sources that will meet the information needs of users inquiring about security vulnerabilities. Several such information sources exist, as previously discussed in the related work section. However, what are the concrete elements users inquire about in the context of security vulnerabilities? In an empirical study on questions asked while diagnosing security vulnerabilities, Smith et al. [233] identified a total of 78 questions typically asked in this context, which are categorized into (i) vulnerabilities, attacks and fixes (e.g., “how can I prevent this attack”), (ii) code and applications (e.g., “where is this method defined?”), (iii) individual questions (e.g., “have I seen this before”), and (iv) problem solving support (e.g., “can my team members/resources provide me with more information?”).

Table 4.1: Sample of questions asked when diagnosing security vulnerabilities (extracted from [233]). Terms that are relevant for the security vulnerability domain are underlined.
Q1: “Is this a real vulnerability?”
Q2: “What are the possible attacks that could occur?”
Q3: “How can I prevent this attack?”
Q4: “How can I replicate an attack to exploit this vulnerability?”
Q5: “What is the problem (potential attack)?”
Q6: “What are the alternatives for fixing this?”
Q7: “How do I fix this vulnerability?”
Q8: “How serious is this vulnerability?”
Q9: “Are all these vulnerabilities the same severity?”

The study above provides a useful guide to typical vulnerability-related questions. It is worth noting, however, the wide range of questions asked in relation to security vulnerabilities, many of which go beyond inquiring strictly about security vulnerability information. For example, category (ii) focuses mostly on questions related to the programming code being analyzed, while category (iii) involves developers’ self-reflection, understanding and expectation questions. In our work, we leverage the results of this study and exclusively focus on questions related to inquiring about security vulnerability information that is publicly available on the Web. Table 4.1 shows an excerpt of the questions as they emerged from [233].

To make our investigations tractable and feasible, we focus on the five main information elements targeted by the questions listed in Table 4.1 (see the underlined terms), namely vulnerabilities, weaknesses, attacks, fixes and exploits. Leveraging our experience and results from previous research [2, 54] on security vulnerability discovery, exploration and understanding, we derive the model shown in Figure 4.4(a), which aims at capturing the information elements listed above. In this model, the vulnerability entity represents a reported vulnerability, which is characterized by properties such as an id (i.e., the CVE of the vulnerability), publishedDate and description, among other properties. A vulnerability typically exists in a software that is developed by a vendor. Moreover, a vulnerability is typically reported by a discoverer (e.g., a white hat hacker). Vulnerabilities are further characterized by a weakness that is uniquely identified by an identifier known as a CWE (Common Weakness Enumeration, https://cwe.mitre.org). Besides the information above, we also consider additional entities such as exploits, attacks and patches. An exploit is a piece of software or data that can take advantage of an existing vulnerability. An exploit can be used to attack a software containing a vulnerability. Finally, a patch is a fix to a software that can help mitigate the risks of being attacked due to a vulnerability.

The information model presented in Figure 4.4(a) integrates different entities that jointly provide a more complete and richer picture of security vulnerabilities. In the next section, we discuss in more detail how existing, disparate information silos publicly available on the Web can be integrated, enriched and indexed to provide unified access to security vulnerability information.

Figure 4.4: (a) Security Vulnerability Information Model [54]. (b) Architecture for collecting, enriching, indexing and querying security vulnerability information. The bottom part of the architecture operates offline, while the upper part operates online.

4.4 Collecting, Enriching and Indexing Security Vulnerability Information

The approach proposed in this chapter consists of four phases, namely, (i) information source collection, adaptation and enrichment, (ii) information indexing, (iii) information vectorization, and (iv) information querying with NL query support. This section presents the first three (we discuss the latter separately in Section 4.6). We show the overall architecture of our proposed solution in Figure 4.4(b) and use it as a reference to elaborate on each of the phases above.

4.4.1 Security Vulnerability Information Collection, Adaptation and Enrichment

At the bottom of Figure 4.4(b), we show examples of various publicly available information sources that can be used to collect security vulnerability information. The collection of such information can be done through various mechanisms, including REST API calls and web data extraction. For example, while the information provided by Vulners is represented as JSON documents and accessible through REST APIs, SecurityTracker’s information1 is available mainly as HTML Web pages, which requires Web data extraction techniques [72]. The tasks of accessing, collecting, adapting and integrating the information sources into the target representation (which is presented in the next section) therefore require the creation of dedicated adapters for each information source of interest. A list of adapters is exemplified in the Adapters component at the bottom of Figure 4.4(b).

In order to have a richer representation of the integrated information resulting from the previous step, we propose to enrich such information with semantics from the security vulnerability domain (see the Enrichment component in Figure 4.4(b)). Such enrichment allows more fluidity in expressing NL queries. Consider, for example, an NL query such as what vulnerabilities are there in Internet Explorer?. In this query, a user may choose to ask the same question using different mentions of Internet Explorer, such as IE or simply Explorer. The enrichment of the original security vulnerability information (e.g., attribute values) aims at enabling the use of alternative mentions for such entities, thus allowing for more flexibility in NL expressions.

Security vulnerability information enrichment is performed at two levels. First, at the attribute level, we store the various mentions that can be used to refer to an attribute in the information model (see Figure 4.4(a)). For example, for the attribute vulnerability.publishedDate (we use the dot notation, entity.attribute, to refer to attributes of an entity), other mentions of the attribute name, such as {publication date, release date, announcement date, ...}, are stored in the indexing database. In this way, whenever a user uses any of these mentions in an NL query (e.g., release date), the mention used in the user query is linked to the target attribute (i.e., vulnerability.publishedDate). Second, at the value level, we store mentions of attribute values for each attribute. For example, for the attribute weakness.name, a possible weakness (i.e., attribute value) in the context of security vulnerabilities is Improper Neutralization of Input During Web Page Generation. This weakness

1https://securitytracker.com/

is, however, also commonly referred to as CWE-79, XSS and Cross-Site Scripting2. With this enrichment we can therefore refer to weakness CWE-79 by using any of its alternative mentions.

The above enrichment techniques are implemented by leveraging named-entity recognition [162], Knowledge Graphs (KGs) [236] and word embeddings [177] (see the Enrichment component in Figure 4.4(b)). Named-entity recognition is used to recognize named entities appearing in the attribute values of the security vulnerability information. In this work, we focus on three main entity types that are relevant in this domain, namely software, weakness and vendor. Examples of such named entities include Internet Explorer (software), Microsoft (vendor) and XSS (weakness). The named-entity recognizers (we use Stanford’s NER [162]) are trained using a combination of publicly available lists of named entities (e.g., for software, NVD’s Common Platform Enumeration (CPE) list3), which are extended with alternative mentions using KGs (e.g., ConceptNet [236]). Such alternative mentions for named entities are obtained by using the REST APIs of ConceptNet [236]. In addition, for enrichment at the schema level, we use word embeddings [177] trained on data from Information Security Stack Exchange4, which allows the enrichment of attribute mentions with semantically related terms from the security domain. Next, we show how we adapt this collected information for indexing purposes.
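As an illustration, alternative mentions can be pulled from ConceptNet’s public REST API; the sketch below queries synonym edges for a term (the endpoint and relation are ConceptNet’s, while the function name and the restriction to /r/Synonym alone, rather than the richer set of relations one might use in practice, are our simplifying assumptions):

    import requests

    def alternative_mentions(term, limit=20):
        # ConceptNet nodes are lowercase with underscores,
        # e.g. /c/en/internet_explorer
        node = "/c/en/" + term.lower().replace(" ", "_")
        resp = requests.get("http://api.conceptnet.io/query",
                            params={"node": node, "rel": "/r/Synonym",
                                    "limit": limit})
        mentions = set()
        for edge in resp.json().get("edges", []):
            for side in ("start", "end"):
                label = edge[side].get("label", "")
                if label and label.lower() != term.lower():
                    mentions.add(label)
        return sorted(mentions)

    print(alternative_mentions("internet explorer"))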

4.4.2 Security Vulnerability Information Indexing

The majority of security vulnerability information available on the Web is of an unstructured or semi-structured nature [115]. Some information consists of textual descriptions of vulnerability-related artifacts such as exploits, patches, breaches and security advisory bulletins. In order to efficiently query this information, we therefore propose to rely on existing indexing techniques that are capable of efficiently dealing with such unstructured and semi-structured information. More specifically, in this work we rely on ElasticSearch5 (see the Index / Search Engine component in Figure 4.4(b)), an open-source index and search engine based on Apache Lucene6.

In ElasticSearch, indexes are flat collections of documents represented as JSON objects. In order to represent the information model shown in Figure 4.4(a), we therefore need to translate that

2 https://cwe.mitre.org/data/definitions/79.html
3 https://nvd.nist.gov/products/cpe
4 https://security.stackexchange.com
5 https://www.elastic.co
6 https://lucene.apache.org

model into JSON documents while still keeping the relationships among the different entities of the model7. We do so by denormalizing (flattening) [87] the model in Figure 4.4(a). In databases, flattening means that we store the data in one large table containing all the information, with little imposed structure. The result is shown in Figure 4.5(a). In this representation, we have one document per vulnerability. This document is self-contained (and vulnerability-centric), meaning that all related entities (as shown in Figure 4.4(a)) are contained within the same document. As an example, Figure 4.5(b) shows a document for vulnerability CVE-2009-1295, which also includes related entities such as the software affected (e.g., Ubuntu) and the weaknesses involved (e.g., Configuration). This representation allows us to effectively retrieve documents from the index to answer queries such as Vulnerabilities in Ubuntu, with weakness CWE-16, which involve relationships (encoded in the JSON representation) found in the security vulnerability information model (Figure 4.4(a)).

Figure 4.5: Index-ready JSON representation of the security vulnerability information model (introduced in Figure 4.4(a)): (a) JSON schema of the information model (shaded attributes correspond to enrichments); (b) example of a single document containing vulnerability information; (c) JSON schema for storing attribute mentions; (d) example of two attributes and their possible mentions.

7https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html

In addition to the attributes of the original information model (Figure 4.4(a)), we add to each document the enrichments discussed in the previous section (see the shaded attributes in Figure 4.5(a) and (b)). More specifically, we extend the original information model with additional attributes that contain mentions of named entities that are relevant in this domain. As explained before, in this work we focus on the entity types software, vendor and weakness. Figure 4.5(b) shows examples of alternative mentions for the named entities Ubuntu OS (the software), Ubuntu (the vendor) and CWE-16 (the weakness).

Finally, besides indexing security vulnerability information, we also keep a separate index in which we store the different mentions of the attributes of the information model. The mentions are stored using the schema shown in Figure 4.5(c), where each JSON document stores the name of an attribute (as used in the main index schema) and its alternative mentions. Figure 4.5(d) shows an example JSON document for the attribute publishedDate. In the next section, we show how the enrichments (i.e., entity and attribute mentions) discussed above are used by the natural language interface (NLI) we discuss later in Section 4.6.
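As a minimal sketch of what indexing one denormalized, enriched document looks like with the official Python client (the index and type names are our assumptions; the doc_type argument matches the ElasticSearch 5.x API used in this work):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # A self-contained, vulnerability-centric document (cf. Figure 4.5(b)),
    # including the enrichment attributes that hold alternative mentions.
    doc = {
        "cveId": "CVE-2009-1295",
        "publishedDate": "2009-04-30",
        "cvssScore": 1.9,
        "software": [{"id": "39398", "name": "Ubuntu",
                      "vendor": {"id": "75847", "name": "Ubuntu"}}],
        "weakness": [{"id": "CWE-16", "name": "Configuration"}],
        "softwareMentions": ["Ubuntu Linux", "Ubuntu OS"],
        "vendorMentions": ["Ubuntu", "Canonical Ltd."],
        "weaknessMentions": ["CWE-16", "Inappropriate Configuration"],
    }

    es.index(index="vulnerabilities", doc_type="vulnerability",
             id=doc["cveId"], body=doc)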

4.5 Security Vulnerability Information Embedding

In this section, we develop a novel extension on top of the data indexed in ElasticSearch (EL). The proposed extension transforms traditional schema-based indexing into attribute-based indexing. We introduce this extension as attribute- and value-based embeddings, which cover every attribute and its values in the indexed data. Figure 4.6 illustrates a pipeline that consists of two main layers. Starting with Training Vector Space Model, this layer mines existing domain knowledge to formulate a training dataset. This dataset is then used to train a word embedding model by utilizing the Word2Vec [176] algorithm. Next, the Attribute/Value Embedding layer transforms the indexed security vulnerability information into semantically rich attribute and value vectorial representations. This layer builds upon the word embedding model provided by the layer below (Training Vector Space Model) in order to encode indexed attributes and values into a semantically rich VSM, similar to how words are represented as vectors through word embedding techniques. These attribute and value embeddings are then used by the natural language interface (Section 4.6) to map user queries into ElasticSearch (EL) queries. In the following sections, we present the details of the proposed pipeline, starting from training the vector space model.

Figure 4.6: Pipeline for constructing attribute and value embeddings.

4.5.1 Training the Vector Space Model (VSM)

Vector space models, and more specifically word embeddings, are widely used in NLP. One vector is created for each word, such that words with similar meanings are close to each other in the vector space. While there are many pre-trained models available (e.g., trained from GoogleNews1 or Numberbatch2), we decided to build a word embedding model targeting the security vulnerability domain in order to address the terminology and ambiguity problems that come with the use of general-purpose word embedding models (e.g., the word virus in security has a different meaning than in biology) [65]. We will show later (Section 4.7) that the choice of embedding model (pre-trained models vs. a domain-specific model) can affect the performance of the Query Translator (shown in Figure 4.4) in converting user NL utterances into DSL queries. To create the vector space model we leveraged Vulners3, a vulnerability database containing descriptions of software vulnerabilities, attacks, patches and exploits in machine-readable format.

We processed the training dataset with Word2Vec [175], and trained it using a skip-gram model with negative sampling. We set the number of dimensions for the model to 150 and the negative sampling rate to 10 sample words, settings which have been shown to work well with large datasets [176, 174]. We chose 150 dimensions as it produces good empirical results in terms of vector accuracy. Using more than 150 dimensions shows very similar performance, with the disadvantage of a higher cost in terms of the computational resources and training times needed [176].
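A minimal sketch of this training step with gensim (4.x argument names; the toy corpus, window size and frequency cut-off are our assumptions, while the dimensionality, architecture and negative sampling rate follow the text):

    from gensim.models import Word2Vec

    # Stand-in corpus: in the thesis this is the tokenized Vulners data.
    sentences = [
        ["buffer", "overflow", "in", "apache", "struts"],
        ["cross", "site", "scripting", "vulnerability", "in", "login", "page"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=150,  # 150 dimensions, as chosen empirically above
        sg=1,             # skip-gram architecture
        negative=10,      # negative sampling with 10 noise words
        window=5,         # assumed context window (not stated in the text)
        min_count=1,      # assumed frequency cut-off; higher for a real corpus
        workers=4,
    )
    model.wv.save("security_vulnerability.kv")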

The resulting word embedding model is available online4. After enough training, the neural network learns a deep representation of each word in a vector space. We then use this vector space

1 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
2 https://github.com/commonsense/conceptnet-numberbatch
3 https://vulners.com
4 At https://tinyurl.com/y3l6kxx7. This material is not public and currently available for review purposes only.

model to vectorize indexed attributes and their values.

4.5.2 Attribute Embedding

Armed with the VSM described above, we can now process the text-based information and associate vectors with attributes (e.g., vulnerability_publishedDate) and values (e.g., 2018-10-21). We build vectors using the averaging technique, as it has been shown to work well for short texts [56, 53]. Although more complex vectorization techniques could be used, our experiments later show that this simple averaging approach in fact outperforms the alternatives.

Generating an initial vector for an attribute (e.g., vulnerability_publishedDate) is performed by leveraging the name of the attribute. We first extract a bag of words (“vulnerability”, “published”, “date”) from the attribute name. This extraction is done by utilising two heuristics:

• Split words wherever there is an underscore between them. This rule extracts “vulnerability” and “publishedDate” from the vulnerability_publishedDate attribute name.

• Split words that follow the camel case naming convention. This rule splits the word “publishedDate”, extracted by the first rule, into the two words “published” and “date”.

This bag of words (“vulnerability”, “published”, “date”) is then filtered by removing stop words (e.g., “on”, “the”, “and”) and stemming the remaining words. The result is a list of words (“vulnerability”, “publish”, “date”) that is considered an initial set of seed words describing the attribute, and contributes to initializing the attribute vector.
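A minimal sketch of the two splitting heuristics plus the stop-word and stemming filter (NLTK’s stop-word list and Porter stemmer are stand-ins for whatever resources the actual implementation used):

    import re
    from nltk.corpus import stopwords      # requires nltk.download("stopwords")
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def seed_words(attribute_name):
        parts = []
        for chunk in attribute_name.split("_"):        # heuristic 1: underscores
            # heuristic 2: camelCase boundaries
            parts += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk)
        return [stemmer.stem(w.lower()) for w in parts
                if w.lower() not in stop_words]

    print(seed_words("vulnerability_publishedDate"))
    # e.g. ['vulner', 'publish', 'date'] (exact stems depend on the stemmer)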

In order to build a vector for the attribute vulnerability_publishedDate, we therefore start by building vectors for each of the seed words (“vulnerability”, “publish”, “date”) by using the word embedding model we built before (see Section 4.5.1). These individual vectors are then aggregated (through vector averaging) to formulate the attribute vector. Figure 4.7 shows the steps taken to build a vector for the vulnerability_publishedDate attribute. Equation 4.1 formally describes the computation used to encode an attribute as a vector from its individual seed words:

$$v(attr_{vec}) = \frac{1}{n} \sum_{i=1}^{n} v(word_i) \qquad (4.1)$$

Figure 4.7: Attribute Vector Encoding Using Extracted Words

Here, $v(word_i)$ is the vector representation of the word $word_i$, obtained from the word embedding model we proposed in Section 4.5.1; $v(attr_{vec})$ is the vector for an attribute, generated by averaging the vectors of the seed words corresponding to that attribute; and $n$ is the number of seed words. The result is the encoding of attribute names into a VSM. Next, we discuss how we can fine-tune these attribute vectors in order to obtain a more precise representation of attributes.
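Equation 4.1 amounts to a few lines with numpy; in the sketch below, wv is assumed to be the gensim KeyedVectors of the security embedding model from Section 4.5.1:

    import numpy as np

    def attribute_vector(seed_words, wv):
        # Equation 4.1: average the word vectors of the attribute's seed words.
        vectors = [wv[w] for w in seed_words if w in wv]
        if not vectors:
            return None  # no seed word is in the model's vocabulary
        return np.mean(vectors, axis=0)

    v_published_date = attribute_vector(["vulnerability", "publish", "date"], wv)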

4.5.3 Tuning Attribute Embedding

Initial attribute vectors are generated by utilising a limited number of seed words extracted from attribute names. However, due to this lack of word diversity, such vectors may not precisely reflect the semantics of the attributes; thus, we need to fine-tune them. In order to fine-tune the attribute vectors, we need to obtain n-grams that accurately represent each attribute. N-grams are combinations of adjacent words of length n that can be found in a text/sentence (e.g., “exploit title”, “severity score”). One way of achieving this is to utilise mentions of an attribute’s name. As discussed earlier (Section 4.4.1), we store a curated list of mentions that can be used to refer to an attribute in the information model. For example, for the attribute vulnerability_publishedDate, we store other mentions (such as “publication date”, “release date”, “announcement date”). In this way, whenever a user uses any of these mentions in an NL query (e.g., “vulnerabilities with release date 2018-03-12”), we can associate it with the target attribute (i.e., vulnerability_publishedDate).

We treat the mentions of an attribute as a list of n-grams that are then converted into vector representations by applying Equation 4.1. Such vectors are combined with the initial attribute vector to update the corresponding vector. Figure 4.8 exemplifies the tuning process.

Page 86 of 194 Attribute: "vulnerabilty_vendor" Mentions: ["supplier", "product owner", "manufacturer", "owner", ...]

(1) word extraction . . date . . time. . utilize breach connection. exploit. . . [ supplier, product,¬owner, manufactuter¬] . ignore . usage . . . operation . disregard . . owner .. . send . manufactuter attack ...... vendor . liability . . . product . risk . exposure (2) finding word vectors . Model Projection supplier . vulnerability Embedding 2D Word

Initial Vector manufacturer supplier product owner (3) generating mentions vectors

Tuned Vector manufacturer supplier product owner (4) tuning attribute vector

Figure 4.8: Tuning Attribute Vector by using mentions tations by applying equation 4.1. Such vectors are combined with the initial attribute vector to update the corresponding vector. Figure 4.8 exemplifies the tuning process.
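A sketch of the tuning step, under the assumption that “combined” means a plain average of the initial vector with the mention n-gram vectors (the text does not pin down the exact combination function):

    import numpy as np

    def tune_attribute_vector(initial_vector, mentions, wv):
        mention_vectors = []
        for mention in mentions:
            # Each mention n-gram is itself averaged word-by-word (Equation 4.1).
            words = [wv[w] for w in mention.lower().split() if w in wv]
            if words:
                mention_vectors.append(np.mean(words, axis=0))
        return np.mean([initial_vector] + mention_vectors, axis=0)

    tuned = tune_attribute_vector(
        v_published_date,
        ["publication date", "release date", "announcement date"], wv)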

4.5.4 Attribute Value Embedding

Generating a single value vector that represents the semantics of all the values linked to an attribute is handled using the same technique we used to create attribute vectors. Figure 4.9 shows the steps for generating such a vector. The process starts by extracting the indexed values and applying the following heuristic to normalize numeric values and convert them into the textual format needed for vector generation:

If a value is of type date (e.g., 2018-02-14), number (e.g., 67) or version (e.g., 3.5), then replace that value with the “date”, “number” or “version” label as a placeholder. This process is known as delexicalization [180, 58] in NLP, and is used to achieve better generalization in training data. We normalize numeric values using this technique (i) to reduce redundancy [167], and (ii) to capture the semantic meaning of the value(s) [15]. We use the Stanford Core NER5 Parser [162] to identify the type of numeric values.

5 Named Entity Recognizer
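A sketch of this normalization heuristic, using regular expressions as a stand-in for the NER-based type detection described above:

    import re

    DATE_RE    = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # e.g. 2018-02-14
    VERSION_RE = re.compile(r"^\d+(\.\d+)+$")        # e.g. 3.5, 0.108.4
    NUMBER_RE  = re.compile(r"^\d+(\.\d+)?$")        # e.g. 67

    def delexicalize(value):
        # Replace dates, versions and plain numbers with placeholder labels.
        value = value.strip()
        if DATE_RE.match(value):
            return "date"
        if VERSION_RE.match(value):  # check before NUMBER_RE: "3.5" matches both
            return "version"
        if NUMBER_RE.match(value):
            return "number"
        return value

    print([delexicalize(v) for v in ["2018-02-14", "3.5", "67", "Ubuntu"]])
    # -> ['date', 'version', 'number', 'Ubuntu']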

Figure 4.9: Generating a Value Vector by using indexed values

Ignoring non-informative parts of values (e.g., “the”, “is”) and stemming the values is the next step before deriving vectors. Finally, a single value vector is obtained by averaging (see Equation 4.1) the vectors of the normalized/filtered values, acquired from the word embedding model we proposed in Section 4.5.1. Next, in Section 4.6, we show how we leverage the embeddings built in this section to detect attributes and values in user NL queries.

4.5.5 Attribute/Value Recognition REST APIs

To assist developers in detecting attributes and values in user utterances, we provide our system in the form of a Web service, available at http://vulnerquerify.ngrok.io. Figure 4.10 shows the list of available REST APIs. There are two categories of APIs: (i) vectorization endpoints, and (ii) recognition endpoints.

Vectorization endpoints are used only by the admin of the service to generate attribute and value vectors. These endpoints offer different embedding models to choose from, such as the security vulnerability embedding model (see Section 4.5) as well as general embedding models such as GoogleNews6, NumberBatch7 and Wikipedia8. By using the recognition endpoints, on the other hand, developers are able to detect whether a keyword (extracted from a user utterance) is an attribute mention or an attribute value, leveraging the generated vectors.
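For illustration, a call to a recognition endpoint might look as follows; the route, parameter names and response shape here are hypothetical (the actual routes are listed in the Swagger documentation, Figure 4.10):

    import requests

    BASE_URL = "http://vulnerquerify.ngrok.io"

    # Hypothetical endpoint and parameters; consult the Swagger docs for the
    # real ones.
    resp = requests.get(BASE_URL + "/recognize",
                        params={"keyword": "severity score", "model": "security"})
    print(resp.json())
    # hypothetical response, e.g.:
    # {"type": "attribute", "attribute": "cvssScore", "similarity": 0.81}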

6 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
7 https://github.com/commonsense/conceptnet-numberbatch
8 https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

Figure 4.10: Swagger documentation for the Attribute/Value Embedding Service

4.6 Security Vulnerability Information Querying with NL Support

In order to answer NL queries on top of the index and the (attribute and value) embeddings introduced in the previous section, our approach first needs to understand the semantic and syntactic structure of users’ NL queries. In this section we discuss how this translation takes place (see the Query Translator component in Figure 4.4(b)). Since this work uses ElasticSearch for indexing and searching, we show how we translate into ElasticSearch’s own DSL1. Figure 4.11 summarizes the steps of the translation process.

Our translation operates on a subset of ElasticSearch’s DSL to support attribute-based queries. In order to do this, we focus on two features of this DSL: (i) attribute selection, and (ii) attribute-based filtering. Attribute selection is used to select the attributes that will appear in the result set. Attribute-based filtering is used to specify query conditions that help filter (with attribute:value conditions) the entries to be retrieved from the index [219]. Next, we discuss how NL queries are translated into this subset of the DSL.

NL query pre-processing. The first step toward translating an NL query (e.g., “What vulnerabilities are there in Ubuntu, with a severity score of 10?”) consists in dividing the query into tokens that carry the semantics of the user’s NL expression. We use the Stanford Core NLP Parser [162], which helps us identify the tokens in the query. Once tokenization is performed, the next step is removing stop words that do not contribute to the semantics of the NL expression. Our system

1https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

removes the tokens What, are, there, in, with, a and of. This leaves us with the tokens vulnerabilities, Ubuntu, severity score and 10, which we call key tokens (see Figure 4.13). These key tokens are used for identifying the attributes and values in the next step.

Figure 4.11: Steps for NL to ElasticSearch’s DSL query translation: NL query pre-processing (tokenization, stop word removal, key token identification); attribute/value identification; attribute-token to value-token attachments (dependency parsing); and DSL query generation.
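A minimal sketch of this pre-processing step (NLTK is a stand-in for the Stanford Core NLP parser used in the thesis, and the grouping of multi-word key tokens such as severity score is omitted for brevity):

    from nltk.tokenize import word_tokenize   # requires nltk.download("punkt")
    from nltk.corpus import stopwords         # and nltk.download("stopwords")

    stop_words = set(stopwords.words("english"))

    def key_tokens(nl_query):
        tokens = word_tokenize(nl_query)
        # Drop stop words and punctuation, keeping the key tokens.
        return [t for t in tokens if t.lower() not in stop_words and t.isalnum()]

    print(key_tokens(
        "What vulnerabilities are there in Ubuntu, with a severity score of 10?"))
    # -> ['vulnerabilities', 'Ubuntu', 'severity', 'score', '10']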

Attribute/Value identification. Once the key tokens are identified, the next task consists in recognizing the attributes and values within the tokens (extracted from the NL query). The system needs to perform two tasks: (i) attribute selection, and (ii) attribute-value based filtering. In the first case, the user expresses her information needs by indicating the information item (attribute) she is interested in. For example, in the NL query “What vulnerabilities are there in Ubuntu, with a severity score of 10?”, the user is asking about vulnerabilities in a software, which can be thought of as the list of CVEs (see the cveId attribute in Figure 4.5(a) and (b)) of vulnerabilities affecting such software. The system fulfills this task by finding the attributes in the index that are most semantically similar to the key tokens, using the attribute embeddings. In the second task (attribute-value based filtering), the system considers filtering conditions that a user may express in an NL query. For instance, in the query “What vulnerabilities are there in Ubuntu, with a severity score of 10?”, the user does not want just any vulnerability, but only the ones affecting the software Ubuntu. To perform this task, the system identifies key tokens that refer to values stored in our index, which can be used as filtering conditions.

Accomplishing the above tasks requires the system to utilize the vector representations of attributes and values we constructed before (see Section 4.5). The system takes each key token and generates a vector that represents the semantics of that token. It does this by exploiting the word embedding model we proposed in Section 4.5.1. Next, the system finds the attribute stored in the index (see, e.g., Figure 4.5(d)) whose vector is closest (above a given threshold2) to the key token’s vector representation. If the system finds a match between a key token and an attribute (e.g., the similarity between the key token severity score and the attribute cvssScore is more than 0.60), it saves the corresponding attribute name (e.g., cvssScore) for later and designates that key

2As an initial threshold, we set th =0.60.

token as an attribute-token [203].

Figure 4.12: Dependency tree indicating the attachments between tokens (words) in an NL query

When a key token cannot be matched to any attribute, the system considers it as an attribute value. This is done by comparing the key token’s vector representation to all the value vectors generated for each attribute (see Section 4.5.4). For example, the key token Ubuntu does not match any attribute name. However, its vector is close to the value vector of the softwareMention attribute (see, e.g., Figure 4.5(b)), with more than 0.423 similarity. Since a value mapping could be found to one of the attributes, we designate Ubuntu as a value-token, and we associate it with the corresponding attribute (softwareMention). As a result, the overall identification task yields a mapping of attribute-tokens to attribute names (e.g., severity score → cvssScore) and a mapping of value-tokens to attribute names (e.g., Ubuntu → softwareMentions).
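The identification logic can be sketched as follows, with the thresholds from the text (0.60 for attributes, 0.40 for values); the dictionaries mapping attribute names to the vectors built in Sections 4.5.2-4.5.4 are assumed as inputs:

    import numpy as np

    ATTRIBUTE_THRESHOLD = 0.60
    VALUE_THRESHOLD = 0.40

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def recognize(token_vector, attribute_vectors, value_vectors):
        # First try to match the token against the attribute embeddings...
        name, score = max(((a, cosine(token_vector, v))
                           for a, v in attribute_vectors.items()),
                          key=lambda pair: pair[1])
        if score >= ATTRIBUTE_THRESHOLD:
            return ("attribute-token", name)
        # ...then fall back to the per-attribute value embeddings.
        name, score = max(((a, cosine(token_vector, v))
                           for a, v in value_vectors.items()),
                          key=lambda pair: pair[1])
        if score >= VALUE_THRESHOLD:
            return ("value-token", name)
        return ("unmatched", None)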

Attribute-token to value-token dependencies. In some cases, users express their queries by explicitly mentioning conditions that need to be satisfied. For example, in the query “What vulnerabilities are there in Ubuntu, with a severity score of 10?”, the user is asking about vulnerabilities in Ubuntu, where such vulnerabilities have a given severity score (severity score of 10). In order to identify such conditions, the system needs to detect whether any of the attribute-tokens is related to a value-token [203]. To do so, we utilise the dependency parsing technique [162] delivered by the Stanford Parser4 to detect such relations (also known as attachments [203]). An attachment is the syntactic dependency between one word and other word(s) in a sentence. Figure 4.12 shows such attachments between the words in a tree produced by the Stanford Parser. For example, as can be seen in Figure 4.12, the dependency parsing results indicate that the attribute-token severity score is attached to the value-token 10. From this dependency attachment, the system infers that the attribute corresponding to the attribute-token severity score (the attribute cvssScore in this case) should contain the value-token 10 (i.e., cvssScore:10). We use this dependency attachment next to build query conditions using the search engine’s DSL.

On the other hand, a key token can potentially be a value for multiple attributes (i.e., the key token vector is

3 The similarity threshold for value vectors is set to 0.40, unlike that for attribute vectors, which is 0.60.
4 https://nlp.stanford.edu/software/lex-parser.shtml

similar to multiple attribute value vectors to a comparable degree). In this case, the system exploits the extracted dependency attachments (mentioned above) to resolve the ambiguity. As an example, for the query “What vulnerabilities are there in Ubuntu, with a severity score of 10?”, the system first compares the vector of the token 105 with all attribute value vectors (see Section 4.5.4). The outcome is a list of candidate attributes (e.g., impactScore, exploitabilityScore, cvssScore) whose value vectors are closest to the vector of the token 10. The next step is to utilise the dependency attachments extracted before. By considering these attachments, the system identifies that the token 10 is a value for cvssScore (the attachment between severity score and 10).

Figure 4.13: Translating NL queries into ElasticSearch’s DSL query. For the query “What vulnerabilities are there in Ubuntu, with a severity score of 10?”, the key tokens { vulnerabilities, Ubuntu, severity score, 10 } map to a DSL query of the form:

    ... "_source" : [ "cveId" ],
    "query" : { "bool" : { "should" : [
        { "match_phrase" : { "softwareMention" : "Ubuntu" } },
        { "match_phrase" : { "cvssScore" : "10" } }
    ] } } ...

where vulnerabilities is an attribute-token, Ubuntu a value-token, and cvssScore:10 comes from an attachment.

DSL query generation. Given the mappings and dependency attachments identified in the previous steps, the system is now ready to generate the DSL query to be executed by the search engine. Figure 4.13 shows how ElasticSearch’s DSL query is generated from the NL query. In order to generate the DSL, the system applies the following three mapping rules (a minimal code sketch applying them follows the list):

• MR1) Attribute-tokens are used for attribute selection in ElasticSearch’s DSL: the system maps the attribute names corresponding to attribute-tokens (without dependency attachments)

5 Numerical values are normalized before we generate vectors for them.

to the _source attribute of the DSL (see the mapping of vulnerabilities to cveId in Figure 4.13). The attribute names listed will appear in the result set returned by ElasticSearch (this can be thought of as the projection operation in relational databases).

• MR2) Value-tokens without a dependency attachment are used as query conditions: the system maps value-tokens to conditions under the query attribute of ElasticSearch’s DSL. For example, Figure 4.13 shows that Ubuntu is translated to the condition softwareMention:“Ubuntu” (because “Ubuntu” was found among the values of the attribute softwareMention).

• MR3) Attachments are used as query conditions: the system maps attribute-tokens (with attachments) to the DSL’s query attribute by creating a condition where the attribute (corresponding to the attribute-token) should contain the attached value-token. In Figure 4.13, this results in the condition cvssScore:10.
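The sketch below applies MR1-MR3; the function signature and the dictionary-based inputs are our assumptions, while the output matches the shape shown in Figure 4.13:

    import json

    def build_dsl_query(selection_attrs, value_conditions, attachment_conditions):
        # MR2 and MR3 both become match_phrase conditions under a bool/should.
        should = [{"match_phrase": {attr: val}}
                  for attr, val in {**value_conditions,
                                    **attachment_conditions}.items()]
        # MR1: attribute selection via _source (akin to relational projection).
        return {"_source": selection_attrs,
                "query": {"bool": {"should": should}}}

    query = build_dsl_query(["cveId"],
                            {"softwareMention": "Ubuntu"},   # MR2
                            {"cvssScore": "10"})             # MR3
    print(json.dumps(query, indent=2))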

4.7 Evaluation

We conducted experiments with the objective of evaluating the feasibility of our approach to effectively answer NL queries related to security vulnerabilities. We compare the performance of our approach using four different word embedding models: (i) our security vulnerability embedding model (see the details in Section 4.5), (ii) the GoogleNews word embedding1, (iii) the Wikipedia word embedding2, and (iv) the Numberbatch3 word embedding. In addition, we compare these approaches with a baseline implementation where no embeddings are used for matching attribute and value tokens (i.e., only exact matching of words is used). We present the details of the experimental setting, evaluation mechanism and results below.

Questions used in the evaluation. We use the questions from Smith et al. [233] for our evaluation. We adapted the questions listed in Table 4.1 to questions that are more contextualized to the information model supported by our solution (see the examples in Table 4.2). This adaptation is necessary in order to turn questions that make references to generic or vague information items into more concrete questions that can be answered with the information we support.

1https://code.google.com/archive/p/word2vec/ 2https://fasttext.cc/docs/en/pretrained-vectors.html 3https://github.com/commonsense/conceptnet-numberbatch

Table 4.2: Examples of adapted questions. The questions in bold font are the original questions (Q) from [233], while questions in regular font are examples of adapted questions (AQ). We used a total of 65 variants of these AQs for the evaluation.

Q1: "Is this a real vulnerability?"
AQ1: "What are the details of vulnerability CVE-2004-1305?"
Q2: "What are the possible attacks that could occur?"
AQ2: "What are the possible attacks on Firefox?"
Q3: "How can I prevent this attack?"
AQ3: "How can I prevent attacks exploiting shellshock?"
Q4: "How can I replicate an attack to exploit this vulnerability?"
AQ4: "Is there any exploit for vulnerability CVE-2004-1305?"
Q5: "What is the problem (potential attack)?"
AQ5: "What's the weakness in HeartBleed?"
Q6: "What are the alternatives for fixing this?"
AQ6: "Is there a workaround to protect against Dirty COW?"
Q7: "How do I fix this vulnerability?"
AQ7: "What are the patches to remediate CVE-2017-3561?"
Q8: "How serious is this vulnerability?"
AQ8: "What's the severity of shellshock?"
Q9: "Are all these vulnerabilities the same severity?"
AQ9: "What's the severity of vulnerabilities CVE-2017-3561 and CVE-2017-3563?"

For example, the reference to "this vulnerability" in question Q8 cannot be answered by our solution without providing a reference to a vulnerability (e.g., by using a CVE identifier or similar). We created variants for each of the AQs (e.g., using references to different vulnerabilities and software), which gave us a total of 65 variants (a mean of approximately 7 variants per AQ). We use these variants for the purpose of our evaluation.

Dataset. We collected security vulnerability information from Vulners and NVD (for vulnerabilities, weaknesses, discoverers, software and vendors), ExploitDB (for exploits), Breach Level Index (for attacks), and SecurityTracker (for patches). The dataset we used for evaluation consisted of a sample of approximately 102K vulnerabilities, 25K exploits, 33K patches, 21K affected software and 12K affected vendors, and 124 weaknesses. These sources were integrated and enriched as discussed in previous sections.

Implementation. The proposed solution was implemented based on the architecture shown in Figure 4.4(b). The adapters for the sources listed above were implemented using Python 2.7. For enrichment, we used ConceptNet (as explained earlier) and word embeddings trained on security vulnerability information sources (see the details in Section 4.5). In addition, we also used

Table 4.3: Evaluation results. We report average values for |Rel|, and average values for R-Precision when using no embedding, the GoogleNews embedding, the Wikipedia embedding, the Numberbatch embedding and the Security embedding. We use the metric P@10 for questions with large |Rel| [226]. Entries marked N/A mean that the approach was not able to return any results whatsoever.

Quest.  |Rel|    No Embed.     GoogleNews  Wikipedia  Numberbatch  Security
AQ1     1.30     1.00          1.00        0          0            1.00
AQ2     265.40   1.00 (P@10)   N/A         0 (P@10)   0 (P@10)     0.90 (P@10)
AQ3     1.60     0.85          1.00        1.00       1.00         1.00
AQ4     1.00     1.00          0           0          0            1.00
AQ5     1.17     1.00          1.00        1.00       1.00         1.00
AQ6     1.60     0.85          1.00        0          1.00         1.00
AQ7     1.10     1.00          1.00        1.00       1.00         1.00
AQ8     1.17     1.00          N/A         N/A        N/A          1.00
AQ9     2.00     1.00          0           0          0            1.00

pre-trained, general-purpose models based on Google News4, Wikipedia5 and Numberbatch6 for comparison purposes (in our experiments, we use a cosine similarity threshold th = 0.60 when using these embeddings). We used ElasticSearch 5.5.2 as our indexing and search engine. Tokenization, dependency parsing and named-entity recognition were done using Stanford Core NLP 3.8. The NLI and query translator were implemented using Python 2.7.

Expert evaluation. We fed the 65 AQs discussed before into our system and obtained the corresponding answers. These answers were evaluated by a domain expert in the area of cybersecurity, who judged whether the answers provided by our solution satisfy the input NL query. Since the results returned by ElasticSearch are ranked using a TF/IDF-based scoring system, we use the metric R-Precision, which is typically used in Information Retrieval [226]. R-Precision is computed as TP/|Rel|, where TP is the number of true positives and |Rel| is the total number of relevant results for a given query (here, TP is computed only over the top |Rel| answers). For queries with a potentially large number of results we employ Precision@n (P@n), a metric that measures the number of relevant results among the top n answers. This is useful in scenarios where end-users only check the top results (e.g., web search) [163]. Precision@n is computed as P@n = TP/n, where TP is the number of true positives and n corresponds to the number of top results to be considered (here, TP is computed only over the top n results). In this evaluation we consider the top-10 results and therefore compute P@10.
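For clarity, the two metrics can be expressed directly in code; this is a plain restatement of the formulas above, where ranked is the ordered result list returned by the search engine and relevant the set of relevant documents for the query.

```python
def r_precision(ranked, relevant):
    """R-Precision = TP / |Rel|, with TP counted over the top |Rel| answers."""
    if not relevant:
        return 0.0
    top = ranked[:len(relevant)]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def precision_at_n(ranked, relevant, n=10):
    """P@n = TP / n, with TP counted over the top n answers."""
    top = ranked[:n]
    return sum(1 for doc in top if doc in relevant) / n
```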

Results and discussion. The results of our evaluation are presented in Table 4.3. Most questions

4https://code.google.com/archive/p/word2vec/ 5https://fasttext.cc/docs/en/pretrained-vectors.html 6https://github.com/commonsense/conceptnet-numberbatch 7https://stanfordnlp.github.io/CoreNLP/

have a low |Rel| value, except for question AQ2. High |Rel| values were observed whenever an NL query contained value-tokens that can appear among the values of attributes inside entries that are not relevant to the NL query. For example, in one of the questions for AQ2 the expert expected to get vulnerabilities related to the Firefox browser only. The system, however, also returned vulnerabilities affecting Mozilla Firefox OS. The results for the utilised approaches show that No Embed. and Security managed to obtain a P@10 of 1.00 and 0.90, respectively. The remaining approaches were not able to obtain relevant results among the top 10 returned results (e.g., only information about breached data was returned, with no details about existing vulnerabilities and exploits, which represent the actual potential attacks and weaknesses that can affect a software product). In the case of question AQ8, we can observe that only No Embed. and Security were able to return useful results. For the remaining approaches, no results were returned (see the entries marked N/A). In these cases, we observed that these approaches were not able to identify any attribute tokens at the similarity threshold used in our experiments (i.e., th = 0.60), and thus no queries were generated. Overall, we can see that No Embed. and Security obtained the best results across the various AQs used in our experiments, outperforming the approaches based on general-purpose embeddings.

The proposed solution and evaluation come with their own limitations. Range queries (e.g., vulnerabilities with severity between 7 and 10) and questions that imply Yes/No answers are not supported in the current version. While our solution cannot provide answers for the former, in the latter case the answer provided is either an empty result set (for "No" answers) or a list of results (for "Yes" answers). In addition, questions involving aggregate functions such as maximum/minimum values (e.g., what's the latest vulnerability in Ubuntu?) and sums/counts (e.g., how many vulnerabilities are there in FreeBSD?) are not currently supported. The same applies to comparison operators such as > and <. Moreover, our evaluation focuses only on the set of questions obtained from [233] and involves only one expert evaluator. More thorough evaluations are needed, with a larger and more varied set of questions, involving also pilot users and additional evaluators. We plan to expand our evaluation in this direction.

4.8 Conclusion and Future Work

This chapter proposes an approach and architecture for supporting the exploration and understanding of security vulnerabilities. Our approach stems from the pressing need for a unified, integrated and easy-to-query security vulnerability information platform that helps businesses mitigate the threats from the growing number of security vulnerabilities. By leveraging (attribute and value) embedding techniques, users can query our index without needing to become familiar with the underlying schema of the information stored in the index. These NL query capabilities make our solution a good candidate for integration into productivity tools used in software development and devops environments (e.g., through chatbots), which can help bring security vulnerability information seamlessly into context while performing core development and devops tasks. Our proposed solution can also be applied to other use cases where data is indexed in searchable data sources.

Directions for future work include the development of domain-specific ontologies and knowledge graphs for supporting more complex queries beyond attribute-based queries (e.g., relationship-based queries), further enrichment using intelligent information taggers that leverage advancements in NLP and AI, and the use of alternative sources (e.g., Twitter) for obtaining updates on the latest cybersecurity developments (e.g., 0-day vulnerabilities and attacks).

Acknowledgement. We acknowledge the Data to Decisions CRC (D2D-CRC) for funding this research.

Chapter 5

API Elements Embeddings

Latent Knowledge-Based Middleware for Integrating Bots and APIs

This chapter presents the latent knowledge-driven techniques we introduce to facilitate the effective integration of APIs, dynamic composition and intent-based conversations (e.g., task-oriented conversational bots). Central to our approach is a latent knowledge-driven middleware that represents intents and API elements (i.e., API methods, API call parameters, and API and intent descriptions) as vectors (i.e., representations of element meanings and associations) in a vector space.

The rest of this chapter is organized as follows: we start with an introduction in Section 5.1. In Section 5.2 we discuss related work. In Section 5.3, we provide an overview of the motivations and benefits of the effective integration of APIs and conversational services. In Section 5.5, we describe the techniques we propose to create vectors for API elements (e.g., API descriptions, methods and parameters) by leveraging and extending word embedding models. In Section 5.7, we discuss the evaluation of the proposed techniques, and finally we provide concluding remarks and directions for future work in Section 5.8.

5.1 Introduction

As mentioned in the introduction chapter, messaging bots, software robots, and virtual assistants (hereafter simply called bots) are used by millions of people every day [100, 71]. Applications such as Siri, Google Now, Amazon Alexa, Baidu and Cortana have a presence in our living rooms and are with us all the time. New bots are developed continuously, from those providing psychological counseling to task-oriented bots that help book flights and hotels. They use human-friendly interfaces based on natural language (e.g., text or voice) to access a complex cognitive backend, which tries to understand user needs and serve them by invoking the proper services. Despite the interest and the presence of development support tools, creating and maintaining bots still presents several challenges.

Developing a bot typically implies the ability to transform a natural language user expression (such as "what will the weather be like tomorrow in NYC?") into one or more intents, corresponding to the identification of the tasks the user wants to accomplish (e.g., weather forecast). The bot then extracts relevant parameters (e.g., location and date of forecast). Finally, it maps the intent and parameters to back-end services (e.g., API calls) to obtain the results. We emphasised before that, while APIs are rather fundamental to the Web, they have far-reaching ramifications; social media already depend heavily on APIs, as do cloud services and enterprise services (e.g., CRM, databases, ERP, platforms, applications, appliances and devices, sensors, vehicles) [138]. Much of the information we receive about the world will therefore be API-regulated. There is hence a need to evolve bots through an API-powered paradigm in which people, services, resources and devices establish on-demand interactions, to realise useful and natural conversations and to obtain services. Clearly, the ubiquity of conversational bots will have little value if they cannot easily integrate and reuse concomitant capabilities across a large number of evolving and heterogeneous devices, data sources and applications. At the same time, APIs are unlocking application, data source and device silos through standardised interaction protocols and access interfaces.

The process of integrating bots and APIs/applications requires techniques such as NLP, entity extraction and recognition, intent recognition and mapping of intents to executable computational programs (e.g., API calls) [100, 165, 289]. Today, finding the right API and then training the bot to map utterances into correct API invocations is largely a manual and labor-intensive process. While platforms such as Wit or DialogFlow simplify some of the NLP tasks, bot developers still have a lot on their plate: they are required to define a goal for the bot, identify the relevant API(s), and then train the bot extensively to identify API methods and the parameters required to invoke them. This approach presents two challenges (or, rather, missed opportunities): First, developers leveraging the same API for their bots would likely be doing similar and possibly overlapping

work, and it would be helpful if some of this effort could be done once per API as opposed to once per bot using that API. Second, with the number of different APIs growing and evolving very rapidly, we need bots to "scale" in terms of how effectively developers can find APIs of interest, train the bots, and ultimately map user expressions into API calls.

In this chapter, we devise a knowledge-driven middleware that represents API elements (i.e., API methods, API call parameters, API descriptions) as vectors in a vector space (building upon the concept of word embeddings [176, 198]), instead of API definitions as is common in mainstream service discovery techniques. More specifically, the intuition we follow here is to map API methods into vectors based on the sets of utterances users would say to invoke these methods. The rationale behind this approach is to make it easier to search API repositories for relevant methods by uttering how we would request such functionality in natural language.

A key element of the middleware is an API Knowledge Graph (API-KG, or just KG). API-KG contains knowledge that facilitates: (i) searching for APIs that can support a desired bot functionality, where the search query can be expressed in terms of intents or utterances to be supported; (ii) training the bot, by providing a diverse set of utterances users may say to invoke the service, as well as the extraction of API invocation parameters out of the utterances; and (iii) mapping the extracted information into API invocations (e.g., HTTP calls to the service). Moreover, the intuition behind API-KG is that it serves as a repository of knowledge such that the effort that previously needed to be done once per (bot, API) pair can now be re-used countless times.

On top of API-KG, we develop the Bot Builder, a system that enables even non-skilled developers to rapidly create bots in a matter of minutes, simply by inputting the bot's goal/intent and using similarity between vectors in the vector space of API elements. More specifically, this chapter makes the following main contributions:

– We propose to represent API elements as vectors in an extended vector space model. We combine crowdsourcing, knowledge graph and word embedding techniques to build and enrich API element embeddings. We propose knowledge-powered middleware techniques and services to interact with APIs based on how users would call the methods in natural language. The proposed middleware techniques allow bot developers to generate operational bots from user intents and API methods and to integrate with bot development and deployment platforms such as Wit.ai.

– We show how to effectively build such embeddings and how we can facilitate and semi-automate the mapping of user intents and expressions to a potentially large and evolving set of APIs. APIs, along with their vector representations, are captured in an API Knowledge Graph (API-KG).

– We devise an approach to enrich API-KG via a three-step process: (i) firstly, we obtain manual user input as seeds (either through API administrators, experts and/or crowdsourcing); (ii) secondly, deep learning is used to significantly enhance the set of keywords; and (iii) thirdly, we apply ongoing and incremental enrichment using additional topics and/or keywords from knowledge graphs. Our work relies on BabelNet [185], one of the most comprehensive multilingual machine-readable semantic networks.

– Layered on top of API-KG, we show how a Bot Builder can take any method of interest in the API-KG and deploy a trained bot that provides an NL interface to the method over a chatbot platform such as Wit.ai. API-KG, together with the instant bot building functionality, is currently accessible via a user-friendly UI at: http://apikg.ngrok.io/.

– Finally, we validate the effectiveness of our approach by assessing the effectiveness of the API search and the ease of deploying bots over the identified APIs of interest. Although enabling end-user development is not a main goal of this chapter, we also show that users with no bot development experience can easily create and deploy a bot.

5.2 Related Work

In this section, we discuss related work in the context of the integration of bots and APIs. We first start with bot development approaches and challenges. Then, we explain existing API repositories and their differences compared to our proposed API knowledge graph. Next, we examine code representation and recommendation approaches. We then discuss API representation approaches. Finally, we investigate crowdsourcing techniques to collect natural language utterances.

Bot Development. As discussed in Chapter 2, chatbot development approaches include: (i) rule-based, (ii) flow-based, and (iii) machine learning (ML)-based. In this chapter, we focus on bots developed using ML approaches, which represent the common trend today, and specifically on bots where the resolution of the intent (the identification of the user's request and its relevant

parameters) results in a call to a service API for the required action. Examples of platforms supporting this approach are Google Dialogflow1, Facebook wit.ai2, Microsoft Bot Framework3, Amazon Lex4 and the IBM Watson AI assistant service5.

The typical bot development workflow for machine-learning-based chatbots is as follows [260]: (1) define user intents; (2) add as many sample user utterances per intent as possible; (3) identify APIs to satisfy the intents; and finally (4) write backend code to map intents to APIs, functions or queries. Examples of bots that follow this workflow include Devy [192] for interacting with code repositories in natural language, TaskBot [253] for managing project tasks, Calendar.help [51] for maintaining schedules, a tone-aware chatbot [94] that generates tones (e.g., sadness, politeness, satisfaction) in responses to customer questions, the Bing developer assistant [291], which suggests code snippets to developers based on free-text queries, and the DBpedia chatbot [11], which answers questions related to DBpedia. However, all these examples share the same essential limitations [225, 290, 146]:

1. Finding the right API is very time-consuming; in some cases it even requires reading through posts/comments on forums or code repositories to understand an API's proper functionality [222, 93, 168].

2. Collecting rich training data is difficult, with respect to both quality and quantity of data [252, 286, 245].

In this chapter, we focus on solving these two main issues by maintaining knowledge about APIs, and proposing novel vector space model techniques to more effectively suggest the most relevant APIs. We now assess related work in these areas:

API Repositories. API-KG is not the first or the only API repository. ProgrammableWeb6 is arguably one of the largest indexes of APIs, with more than 17,000 APIs. However, it contains mostly metadata (such as the data exchange formats supported) or links to API documentation and endpoints. It does not include semantic representations of APIs, their methods and parameters, or any provision for advanced search. OpenAPI Directory7, a wikipedia for RESTful APIs, goes a step further by indexing APIs and their endpoints. It currently has around 1000 APIs. However, it does

1https://dialogflow.com/ 2https://wit.ai 3https://dev.botframework.com/ 4https://aws.amazon.com/lex/ 5https://www.ibm.com/watson/ai-assistant/ 6https://www.programmableweb.com/ 7https://apis.guru/browse-apis/

not support searching for APIs based on intents or utterances (only keyword search). It does not contain information that can be useful to train bots over the APIs. Thingpedia [33] is an open API repository dedicated to the Internet of Things (IoT). It currently indexes 65 devices' APIs and includes a few sample utterances for each endpoint. API-KG includes similar knowledge, but surpasses these works by offering a wide range of utterances per API method and possible values for each parameter, as well as the ability to extend the knowledge graph with the addition of mentions of parameters. Moreover, API-KG offers a unique natural language querying interface (see Section 5.3), and is accessible via a RESTful Web service.

Code Representations and Suggestions.

CUTIe [277] is a linkage tool for Eclipse that maps Java source code to API usage tutorials by leveraging a word embedding model. CUTIe first partitions a code fragment into multiple snippets, where each corresponds to one API method (this is based on the number of API methods used in the code). CUTIe then generates a query based on the nodes in the abstract syntax tree (AST)8 parsed from the code snippet. Next, a vector is generated by aggregating the vectors of the words within the code snippet. The generated vector is then compared with the vectors of documents (e.g., HTML pages) to find the closest tutorials for the code snippet.

Similar to CUTIe, [200] proposed an approach to find relevant fragments of tutorials that explain how to use a given API. They classified tutorial sections using the MaxEnt text classification algorithm based on structural (e.g., table of contents, HTML header tags) and linguistic features (e.g., dependencies between words, number of words). For a given user query that contains API method names, the approach retrieves tutorial sections that contain the query words. Next, candidate sections are classified as relevant (i.e., containing an explanation such as snippets) or irrelevant using the classification model.

SISE [255] is an augmentation approach that enriches API documentation with sentences extracted from StackOverflow using text summarization (LexRank) and pattern-based (heuristic) approaches. Figure 5.1 shows an example of a query about "DataSource and Driver Manager on J2SE", where a StackOverflow user added a sentence to clarify and explain the usage of the API in different situations, which is the part missing from the API documentation.

8A tree representation of the syntactic structure of source code where each node of the tree denotes a construct within the code.

Figure 5.1: An answer in StackOverflow which adds more details to the API documentation [255]

They first extract StackOverflow threads related to API methods, and then manually annotate answers as "meaningful" or "not meaningful" to construct a dataset. Then, they use a summarization algorithm to extract a summary from each answer. Next, following the fragmentation approach proposed by [217], they split API documentation into sections. From each section, they apply heuristics (e.g., keywords such as "must", "not", "null") to extract important parts (e.g., sentences). Important parts are those that explain the usage of an API method. Finally, the dataset of query and annotated answer ("meaningful" or "not meaningful") pairs, together with the extracted parts, is fed into a Support Vector Machine (SVM) classifier to train a model.

CommandSpace [1] focuses on linking user natural language terms (e.g., "give comic book effect to an image") and application language terms (e.g., "From the menu bar, click on Filter, then click on Render, then click on Clouds") by utilising a word embedding model. The objective is to simplify the usage of application commands. To do this, CommandSpace generates an embedding model from Adobe Photoshop usage documents that contain commands, features and their descriptions. For a given user input, e.g., "rotate picture", CommandSpace finds the five most relevant commands with the closest vectors to the vector of the user query (v(query) = v(rotate) + v(picture)). It then displays the result in the form of commands (e.g., image > rotate canvas > arbitrary, image > rotate canvas > flip canvas horizontal). If the user clicks on the displayed task, the corresponding webpage or document is shown to the user.

Similar to CommandSpace, QF-Graph [74] addresses the gap between the terms used in user queries and the terms used in the system features necessary to achieve user goals. They propose QF-Graph, a graph of mappings between users' search queries and relevant system features. To generate this graph, they processed logs from publicly available applications to extract Google search queries expressed by real users. Next, they used those search queries to collect system features (commands and actions) mentioned within the Web pages returned by a search engine. Using this technique, they build the links between search queries and system features. However, the major drawback

of QF-Graph is its exact matching technique. It strictly matches the user's vocabulary with terms found in the query logs. Thus, users are forced to use terms that exist in the logs.

APIBook [287] is a search-based recommendation approach for API methods. Given an API description as user input, e.g., "How can I clear or empty the StringBuilder?", APIBook extracts descriptive keywords (e.g., "clear", "empty") and method keywords (e.g., "StringBuilder"). Method keywords are extracted by a matching algorithm that scans all possible API method names stored in APIBook's database. Descriptive keywords are used to generate a description vector by utilising the TF-IDF algorithm. The description vector is then used to perform a comparison analysis: comparing the similarities between the description vector and the indexed APIs in APIBook's database. The output is a list of candidate APIs. Next, from the list of candidate APIs, APIBook chooses/ranks the methods that match best with the type keywords (e.g., methods that have the same input/output parameter names and types).

anyCode [85] is a code assistant system that synthesises Java source code from a given user query within the Eclipse IDE. anyCode's input consists of (i) a description (e.g., "copy file fname to bname") entered by the user, and (ii) an incomplete Java code fragment (extracted automatically from the code editor). It then suggests a list of methods that can be used to complete the extracted code. To support this functionality, anyCode first extracts a list of words from the input description and the partial code fragment. anyCode then expands the list of extracted words by adding their synonyms obtained from WordNet. This list of words, called WordGroups, is then used to identify method signatures. The outcome is a list of candidate methods that have the highest matching score (the number of words matched with the WordGroups list) and the highest declaration unigram score (the popularity and frequency of a method in the corpus). Finally, a probabilistic context-free grammar (PCFG) model is used to expand the method signatures (e.g., replace literals, local variables).

SLANG [212] formulates the code suggestion problem as a code completion task where parts of the code are missing. To perform code completion, SLANG focuses on the sequences of method invocations within the training dataset to train a Recurrent Neural Network (RNN) model. The result of training is probabilities associated with sequences of method invocations. At usage time, given partial code with empty places, SLANG extracts the sequences of method invocations from the code. It then utilises the trained RNN model to predict the method invocation or sequence of method invocations with the highest probability. Figure 5.2 shows an example of partial code completed by SLANG.

Figure 5.2: A partial code completion - (Hx) are empty places which are filled in by SLANG [212]

Similar to SLANG, SWIM [206] is a code generator which returns code snippets using possible APIs that can accomplish a given query [30]. SWIM uses a word alignment model that calculates the probability of using an API method to solve a given user task. The word alignment model is generated based on clickthrough data9 collected by the Bing search engine. The clickthrough data maps words (from user queries) to the corresponding APIs that contain those words in their names. These word-to-API mappings are then used to generate code snippets via a code synthesizer model. The synthesizer model is trained on sequences of API method invocations collected from open-source projects on Github. However, SWIM is unable to detect the semantics of an NL query. For example, as described in their paper, SWIM does not distinguish the user query "convert int to string" from "convert string to int". This is because it ignores the sequence of words and only considers a bag-of-words representation.

Portfolio [168] is a search engine to find and visualize relevant methods and their usages for a given user query. It extracts the dependencies between methods in source code from FreeBSD10 projects using the spreading activation network (SAN) algorithm [286]. Portfolio then applies the PageRank [135] algorithm to the method dependencies to model the usage patterns of methods.

9A list of (query, URL) pairs recorded by search engines. Each pair indicates that the user clicked on the result URL within the list of the query results. 10http://www.freebsd.org/ports

Figure 5.3: An example of API sequences and annotation for a Java method IOUtils.copyLarge [83]

PageRank is an algorithm to rank nodes based on their importance in a graph. The importance is calculated based on the number of edges pointing to a given node and the rank of the origin nodes.

DeepAPI [83] is a deep learning approach to generate API usage sequences for a given natural language input. DeepAPI formulates the API suggestion problem as a machine translation problem: given an NL query (which contains a set of keywords, e.g., "create file"), DeepAPI translates it into sequences of API methods (File.exists → File.createNewFile). The language model is a Recurrent Neural Network (RNN) Encoder-Decoder [43] trained using annotated API sequences extracted from Java projects on Github (shown in Figure 5.3). Using these annotation pairs (i.e., <annotation, API sequence>), DeepAPI trains the RNN model, which encodes each sequence of words (the annotation in the pair) into a vector (word representation) and decodes API sequences based on the generated vector. Unlike SWIM, which exploits a word alignment algorithm to find relevant API sequences, DeepAPI learns the sequences of words in the NL query and maps these sequences into sequences of API calls. Thus, it can distinguish the semantic differences between queries with different word sequences (e.g., "convert int to string" vs. "convert string to int").

Chan et al. [37] propose a code search approach that takes a query (e.g., "create sql statement") and returns a graph where each node is a method with a high similarity score to the query. Moreover, edges in the graph denote the invocation relationships between methods. The approach generates an API graph by representing classes and methods, as well as their invocation relationships, in an API library. For example, the object-oriented library Java Platform Standard Edition v1.6

(JSE 1.6) provides more than 28,900 methods. Given a query, their system first finds a set of candidate API methods by measuring the similarities between the query and method names using TF-IDF. These candidate methods are then used to rank sub-graphs within the API graph. Among all sub-graphs, a sub-graph is selected if it has the highest number of nodes (methods) matching words from the user query. The limitation of this approach is its matching technique: to get proper results, the user must provide exact API method names in the query.

All the above-mentioned approaches rely on code snippets to build method vectors, which hinders scaling to support more APIs and adding more knowledge about their methods and parameters. We devise techniques to construct fine-grained API, method, and parameter embeddings from natural language corpora rather than from code snippets. Thus our approach is also complementary to code-based approaches.

API Representations and Recommendations. With the increasing proliferation and utilisation of APIs, many languages and models have been developed to describe service interfaces and protocols. Examples include RAML, Swagger, WSDL and WADL. Extensions of these modelling languages focus on service behaviours, semantics and meta-data and rely on definition-based matching techniques to support service discovery and composition. While these techniques are certainly useful, what is lacking are more advanced techniques to leverage rich and latent knowledge about intents and APIs to efficiently map a broad range of user utterances into user intents and API calls. Recent approaches to model APIs as part of recommendation systems focus on API embeddings. These are mappings of APIs onto a vector space, much like what is done for word embeddings.

API2Vec [191] provides an API embedding generated from usage examples of methods in the Java and C# programming languages. They first extract sequences of API methods from code repositories to represent such methods in usage examples. Then, they annotate these extracted sequences of methods by generating an abstract syntax tree (AST) [188]. The annotation is based on the syntactic units in each API method (e.g., literals, variable declarations). The annotated sequences are then used to train a Continuous Bag of Words (CBOW) model by converting each sequence into a sentence. The generated API embedding model can be used to find pairs of API elements in Java and C# that share the same usage relations.

BIKER [98] recommends API methods based on embeddings of previous questions (posts) on StackOverflow. For a given query (programming task), BIKER compares the semantic similarity

between a vector representation of the query and the vectors of previous questions to choose a list of candidates with the highest similarities. In the next step, it extracts the APIs mentioned in the answers within the StackOverflow posts (candidate questions) by exploiting heuristics (e.g., HTML tags, camel-case naming). Finally, to (re)rank the list of recommended APIs, it compares the semantic similarity between the query and both the StackOverflow post (where the API was mentioned in the answers) and the API's official documentation. The returned result contains relevant APIs (for the query) with some additional information that is relevant to developers (e.g., similar questions, code snippets). The main difference between our work and BIKER is that we take a bot-oriented approach: we map APIs to a vector space based on how bot users would request the corresponding functionality, collect bot training information and semi-automate the process of bot development over the chosen APIs.

Nguyen et al. [258] find similarities between Java classes by vectorizing them using their descriptions as part of code snippets in Javadoc documentation. This vector representation can be learned from a large corpus of software documentation, as it typically contains short natural language texts that describe the functionality of APIs, their relationships and usage scenarios. For example, from the description of the FileInputStream class in Javadoc - "A FileInputStream obtains input bytes from a file in a file system. It reads streams of raw bytes such as image data..." - we can see that FileInputStream is closely related to "read", "input" and "file". The intuition behind this approach is that APIs with similar functionalities have similar vectors, and such vectors can be used to search for APIs without explicit word matching.

RACK [208] utilises mappings between keyword tokens (extracted from questions on Stack Overflow) and API methods (extracted from the accepted answers to those questions) to recommend a list of relevant methods for a given user query. To construct these mappings, they first extract keywords from the title of each question on Stack Overflow by applying stop word removal, splitting and stemming. Then, they parse the accepted answer of each question to extract API class names using an HTML parser (Jsoup11), island parsing [216] and regular expressions describing the Java language specification [82]. For example, they extract tokens (e.g., generate, md5, hash) from a question title (e.g., "How can I generate an MD5 hash?") and API names (e.g., MessageDigest) from the accepted answer shown in Figure 5.4. These keyword-API mappings are then used to find/rank the API methods that match best with the keywords extracted from the user query.

11http://jsoup.org.

Figure 5.4: An example of Stack Overflow post used to extract the keyword-API mapping: (a) question, (b) accepted answer [208]

Thung et al. [250] utilise a word embedding model to map feature requests posted in the JIRA code repository to relevant methods from a set of libraries. The proposed approach takes as input a new feature request with a summary (e.g., "Add scanner batching to Export job") and a description (e.g., "When a single row is too large for the RS heap then an OOME can take out the entire RS. Setting scanner batching in custom scans helps avoiding this scenario, but for the supplied Export job this is not set.") and recommends a list of methods that can be used to implement the requested feature. It first searches for similar closed or resolved feature requests from the past (training dataset) by comparing their summary and description vectors generated using the TF-IDF algorithm. From the candidate feature requests, their approach measures the relevance of the API methods used to implement those (old) requests to the newly requested feature by comparing the description vector (of the requested feature) with the description vectors of the API methods. However, the limitation of this approach is that it requires a large amount of information, such as feature requests from JIRA repositories, official API method documentation, and code snippets.

These approaches do not focus on acquiring fine-grained latent knowledge about intents and API elements from text corpora and knowledge graphs, intent-to-API integration patterns, or dynamic API synthesis to support user intents.

Crowdsourcing Utterance Acquisition. Crowdsourcing has been employed to obtain natural

language corpora for chatbots [16, 33, 241]. To the best of our knowledge, the first attempt at obtaining paraphrases via crowdsourcing goes back to 2005, when Chklovski used gamification to collect paraphrases by asking contributors to guess paraphrases based on given hints (e.g., a few words and phrases inside the paraphrase) [42]. Later, three primary crowdsourcing strategies were suggested [270]. In sentence-based paraphrasing, crowd workers are asked to paraphrase an existing utterance to obtain new variations of the utterance [241, 33, 270] (e.g., "book a flight from Sydney to Houston"). In goal-based paraphrasing [270], workers are given a goal (e.g., "book a flight") and a set of possible values for its entities (From: "Sydney", To: "Houston") and asked to make proper sentences. Scenario-based methods employ a storytelling approach: the story puts the worker in a situation to perform a task ("Your goal is to book a flight; you are in Sydney; your destination is Houston"). Next, the worker is asked to express the intent [270, 33].

5.2.1 Summary

In conclusion, research from these efforts is complementary to our work, and some elements can be adapted to it. We recognise that supporting a broad range of possibly complex interactions between users and numerous evolving APIs is challenging, because APIs are not designed with interoperability with particular user intents in mind (as is often the assumption in existing bot development techniques). There are still major challenges, including that building bots requires rich abstractions to support fuzzy and ambiguous natural language user utterances, whereas the latent knowledge required to integrate intents and APIs has until now rarely been extracted and organized into a knowledge graph of user utterances, intent and API element embeddings. This makes both developing and training bots a time-consuming and costly process.

5.3 Approach overview

Figure 5.5 illustrates the typical bot development workflow. Starting with a goal (e.g., "I want to build a chatbot to find restaurants and cafes in the city"), developers then specify the user intent(s) that the bot should support [5]. For example, if a user says "are there any cafes nearby?", the user's intent is to retrieve a list of cafes and restaurants around them. Intents are given a name, often a verb and a noun, such as "FindRestaurant" or "BookTable".

Figure 5.5: Typical Bot development process (Left) vs Bot development process using API-KG (Right)

Once the intent(s) are defined, the developer specifies a set of user utterances that capture each intent, such as: "Show me some Italian cafes in Manhattan", "Let me see if there is any Italian restaurant nearby". The aim here is to feed and train an NLU model1 [151] to understand user intention(s) and the possible ways they might express their requests. For each utterance, the developer might also need to identify the entity mentions within the utterance, for instance, 'Italian', 'cafe', 'Manhattan'. Implementing a webhook2 server and establishing API calls (i.e., running queries, executing commands or calling methods) to process user intents in the backend is the next step; here, the developer needs to find relevant APIs and methods that can satisfy each intent (a minimal sketch of such a webhook is shown below). Finally, the developer would check the chatbot's accuracy by inputting unseen utterances, and debug the backend for possible improvements.
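As an illustration of this backend step, a minimal Flask webhook that maps a resolved intent to a backend API call might look as follows; the endpoint URL, parameter names and payload shape are hypothetical, and real NLU platforms each have their own request format.

```python
from flask import Flask, jsonify, request
import requests

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    # The NLU platform posts the resolved intent and entities here.
    payload = request.get_json()
    intent = payload["intent"]               # e.g. "FindRestaurant"
    entities = payload.get("entities", {})   # e.g. {"type": "Italian", "location": "Manhattan"}
    if intent == "FindRestaurant":
        # Map the intent to a backend API call (hypothetical search endpoint).
        resp = requests.get(
            "https://api.example.com/businesses/search",
            params={"term": entities.get("type"),
                    "location": entities.get("location")},
        )
        return jsonify(resp.json())
    return jsonify({"error": "unsupported intent"}), 400
```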

While this approach is significantly better than coding a bot from scratch, and a leap compared to the technology of just a few years ago, it does come with the following limitations: (a) finding the right API (and especially the right API method) among the large and ever-growing number of APIs is a major challenge, and at present there is no way of searching for APIs based on user intent; (b) even assuming we have found these APIs/methods, we need to train a bot to recognize a large variety of user utterances, and the different ways users may refer to the same parameter; and (c) lastly, even if we have invested time to find APIs and train the associated bots, the development effort cannot be re-used and remains "embedded" in the bot. We cannot capture and re-utilise this wealth of bot-to-API knowledge.

Bot Builder and API-KG. As a paradigm shift with respect to current bot development, we propose a knowledge-driven middleware to overcome and reduce the above-mentioned limitations. Figure 5.6 shows the bot development process when the Bot Builder, Knowledge Explorer and API-KG are available.

1A Natural Language Understanding model is used to understand the semantics of user utterances and to extract intents and entities 2A webhook is an HTTP request that is sent automatically whenever certain criteria are fulfilled

Figure 5.6: Approach overview

The bot developer begins by describing the bot they want to build [step 1 in Figure 5.6]. This can be done in two main ways: i) by stating the bot's goal ("I want to build a bot to find restaurants in NYC", as exemplified in Figure 5.7), or ii) by example, that is, stating user utterances we would like the bot to understand and act upon (such as "give me the name of a Chinese restaurant near Central Park").

The Explorer then converts these requests into a vector representation and uses them to query the KG [step 2], which stores vector representations of APIs, the intents supported by the API, methods, parameters, as well as sets of utterances used to invoke such methods. We discuss later how utterances and vectors are created and added to the KG, as well as how the KG is queried for finding APIs and methods that match the user's request; a minimal similarity-search sketch is given below.
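A minimal sketch of how such a KG query could work: the request vector is compared, by cosine similarity, against the aggregated utterance vectors stored on Method nodes. The function and input names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_methods(request_vec, method_vectors, top_k=3):
    """Rank Method nodes by similarity of the developer's request to the
    aggregated utterance vector stored on each node in the KG."""
    scored = sorted(
        ((cosine(request_vec, vec), method_id)
         for method_id, vec in method_vectors.items()),
        reverse=True,
    )
    return [method_id for _, method_id in scored[:top_k]]
```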

The Explorer returns different information to developers [step 3] depending on their preferences: it can return the name and description of one or more APIs matching the user needs (e.g.

Figure 5.7: Finding relevant APIs for the given goal

"Yelp") as shown in Figure 5.7, or the method signatures, or the intents corresponding to a given method.

By looking at the APIs and the intents and methods they support, developers can then select the APIs and methods they wish to leverage [step 4]. Steps 1 and 4 are all the developer needs to do for a simple deployment. Based on the selection, the Explorer retrieves the selected API and methods from the API-KG [step 5], along with the training utterances associated with such methods and stored in the KG.

The Bot Builder, layered on top of the Explorer, after receiving the information on the selected API needed to invoke the methods (intent, methods, example utterances, entity types, e.g., "location", and parameters, e.g., "NYC"), deploys this information on a chatbot platform such as Wit.ai3. In addition to the training information, the Bot Builder also uploads onto the platform the template required to map the entities extracted from a user request (e.g., type = "Chinese" and location = "NYC") into the JSON object required by the API method, as well as the URL to be invoked (a hedged sketch of this upload step is shown below). With this information, the chatbot platform can train an NLU model and run the chatbot, invoking the backend service as needed. The Bot Builder can be accessed via a user-friendly UI at: http://apikg.ngrok.io/.
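A hedged sketch of the upload step, assuming Wit.ai's HTTP endpoint for adding annotated training utterances; the exact endpoint, version parameter and payload shape depend on the Wit.ai API version, so treat this as illustrative only.

```python
import requests

WIT_TOKEN = "..."  # server access token of the target Wit.ai app

def upload_training_data(intent, annotated_utterances, version="20200101"):
    """Push KG training utterances to Wit.ai.

    Sketch only: the /utterances endpoint and payload shape shown here
    may differ across Wit.ai API versions.
    """
    body = [
        {
            "text": utt["text"],          # e.g. "find a cafe in NYC"
            "intent": intent,             # e.g. "SearchBusiness"
            "entities": utt["entities"],  # annotated spans from the KG
            "traits": [],
        }
        for utt in annotated_utterances
    ]
    resp = requests.post(
        "https://api.wit.ai/utterances",
        params={"v": version},
        headers={"Authorization": "Bearer " + WIT_TOKEN},
        json=body,
    )
    resp.raise_for_status()
```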

Finally, we point out that the current approach does not yet support more complex bot requirements where an intent (or a set of related intents) may be supported by a set of APIs, and does not do any sort of automated composition - this can be added, but it requires ad hoc work by the bot developer.

3https://wit.ai/

5.4 API Knowledge Graph (API-KG)

A knowledge graph is a labeled directed graph G = (V, E, L), where V is a set of vertices (nodes), L is a set of labels and E ⊆ V × L × V is a set of ordered triples [64]. API-KG is a knowledge graph with specific types of nodes and relationships. In particular, nodes v ∈ V can describe APIs, API Methods, Method Parameters, parameter Values, or Utterances. Part of the information in the KG is the graph representation of what we find in API specification documents. Such documents typically include the name, the list of methods, and the parameters for each method (optionally with their types), along with textual descriptions of the functionality of each method and of the API as a whole [282, 34]. This information is added to the API, Method and Parameter nodes, both in textual form (essentially with the same content as the API specification document, though structured) and in vectorized representation (discussed later).
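To make the definition concrete, here is a minimal sketch of API-KG as a set of labeled triples over the node types above; the labels such as hasMethod and hasParameter are illustrative, not the exact schema used in our implementation.

```python
# Edges are ordered triples (subject, label, object), matching E ⊆ V × L × V.
triples = {
    ("Yelp", "hasMethod", "Search"),
    ("Search", "hasParameter", "Location"),
    ("Location", "hasValue", "New York"),
    ("Search", "hasUtterance",
     "find a seafood restaurant close to the central park"),
}

def neighbours(node, label):
    """All objects connected to `node` by an edge with the given label."""
    return {o for s, l, o in triples if s == node and l == label}

print(neighbours("Yelp", "hasMethod"))  # {'Search'}
```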

In addition, API-KG includes additional attributes for each node that are useful for our goal of searching APIs, that is, for assisting the Knowledge Explorer in matching a bot developer's request with the correct API or method that can fulfill it, as well as for training chatbots that use the deployed methods. For example, Method nodes are enriched with information on the intents that can be fulfilled by calling that method (a string, such as SearchBusiness), again stored in full text and in vectorized form.

Utterance nodes correspond to examples of how users could request the functionality represented by a method ("find a seafood restaurant close to the central park"). The utterances are annotated to denote which parts of the utterance correspond to method parameters (e.g., denoting that Central Park corresponds to the location parameter). We discuss later how utterances are obtained. Utterances are associated with a Method node, and one Method can have several utterances.

Finally, Value nodes are associated with method parameters and denote possible values for such parameters, along with alternative forms for specifying the same value. For example, the value New York can be represented as "NY", "NYC", "N.Y.C", and more. The KG captures this richness, which is important for mapping user requests into API invocations with the correct parameters.

Both utterances and values are stored in full textual form as well as in a vector representation, as a set of word vectors. In addition, we enrich the Method node with a single vectorized representation which aggregates the information from all its utterances, and enrich the Parameter node with an aggregated vectorized representation of the values for that parameter.

Figure 5.8: API Knowledge Graph (Yelp API)

As an example, the nodes with gray edges in Figure 5.8 denote the part of the API-KG populated with the information available for the Yelp interface, imported from its source document1. Solid arrows are connections between nodes and dashed rounded rectangles are properties of a node.

This information has a dual purpose. Firstly, the vectorized representations are leveraged at bot development time to help bot developers identify the right API and methods they want to include in the bot. This is particularly useful when users search APIs by uttering example utterances corresponding to what they need (e.g., "I want the bot to support customers asking 'find me an Indian restaurant near Rockefeller Center' "), which is one of the features we support. Secondly, the textual information for utterances and values is loaded into a bot building platform (such as Wit.ai) via that platform's APIs as training data, as discussed earlier. In this way, anybody wanting to use the functionality of the API (Yelp, in our example) has a trained bot available out of the box, with the training work done once per API rather than once per chatbot leveraging that API.

Next, we discuss: i) how this information can be obtained and ii) how the word vectors are generated, before presenting how it is used at bot development time and at runtime.

5.5 Deriving API vectors

We now describe how API-KG can be generated semi-automatically. In general, there are at least four possible sources of API-KG knowledge: (i) existing API documentation (provided by the developer

1https://raw.githubusercontent.com/APIs-guru/unofficial_openapi_specs/master/yelp.com/v3/swagger.yaml

or by the community, as done in ProgrammableWeb1); (ii) information explicitly inserted into API-KG by API developers; (iii) information provided by the community via active or passive crowdsourcing (e.g., harvesting knowledge from StackOverflow, explicitly asking the community via crowdsourcing platforms, or collecting users' feedback on bot platform responses to progressively improve the KG); or (iv) information available from observing API usage, such as call logs or, once the bot is deployed, bot conversations.

We leverage existing API documentation - automatically imported into the KG - and active crowdsourcing as methods for obtaining API knowledge, and we do this to show that it is indeed feasible to build and bootstrap the API-KG. In the following, however, we will only briefly touch upon information sourcing, and focus instead on how to build the KG - and specifically how to generate vectors - once the source information is available.

5.5.1 Training the Vector Space Model (VSM)

Vector space models, and more specifically word embeddings, are widely used in NLP [120, 230, 55]. As mentioned in Chapter 2, in word embeddings a vector is created per word, where words with similar meanings have closer vectors. Representing words as vectors based on their meanings empowers NLP applications by adding semantics [284].

The most popular algorithms (see Chapter 2) learn these representations by scanning and analyzing big datasets containing a large number of sentences. However, there is a single assumption that all these algorithms rely on: words that appear in similar contexts have similar meanings [176]. The task of generating these vectorial representations is always to factorize a "word to word" matrix that holds word co-occurrence counts, or the Point-wise Mutual Information (PMI) metric [29]. These matrices are usually called U and V, and define two separate embedding spaces: U is a matrix that contains the final word embeddings, and V is a temporary (intermediate) set of embeddings that contains the representations used for context (or neighbour) words (e.g., "I want to learn" around a target word, e.g., "python"). An important aspect of these vectorial representations is the ability to obtain complex word analogies, e.g., "king - man + woman = queen", using simple arithmetic [175].
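As an illustration of such analogy arithmetic, here is a short example using gensim with a small publicly available pre-trained model; any word2vec/GloVe model would do, and the similarity score shown is indicative only.

```python
import gensim.downloader as api

# Any word2vec/GloVe model works; this one is small enough to download quickly.
model = api.load("glove-wiki-gigaword-100")

# "king - man + woman ~= queen": simple arithmetic over the embedding matrix.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.77)]
```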

1https://programmableweb.com

Page 117 of 194 Leveraged on the concept of embeddings, we represent APIs and their elements in the vector space. while word embedding only considers “words” in a vector, we extend this to represent all the constituents of APIs, namely API descriptions, methods and parameters. Mapping APIs and their related knowledge into vectors on a vector space allows us to derive semantic relationships to determine when APIs might have similar functionality (e.g. “Dropbox”, “GoogleDrive”), methods from different APIs might share similar intent (e.g. “UploadFile”, “CreateFile”), and parameters from different methods might have similar type and values (e.g. “location”, “path”).

While there are many pre-trained models available (e.g., trained from GoogleNews2 or Stackoverflow3), we decided to build our own, since we found that vectors corresponding to bigrams and trigrams are important to capture intents and entities from user utterances, and many pre-trained models do not cover them. We will show later (Section 5.7.4) that the choice of embedding model (pre-trained embedding models vs our trained embedding model) can affect the performance of the Knowledge Explorer in finding relevant APIs/API methods.

To create the vector space model we leveraged monolingual English Wikipedia articles4 as a source, since Wikipedia is general and covers a variety of topics. The corpus we used contains more than 2 billion words. We processed the training dataset with Word2Vec [176], and trained it using a skip-gram model with negative sampling. We set the number of dimensions for our model to 300 and the negative sampling rate to 10 sample words, which has been shown to work well with big-sized datasets [176, 274, 175]. We set the context word window to 5, as this setting is suitable for utterances whose average length is less than 11 words, as is the case for user utterances [274]. We made the resulting word embedding model available online5.
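A minimal gensim sketch with the hyperparameters reported above (skip-gram, 300 dimensions, 10 negative samples, window of 5); the corpus file name is hypothetical, and the dimensionality parameter is called size rather than vector_size in gensim versions before 4.0.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-tokenized sentence per line (hypothetical Wikipedia text dump).
sentences = LineSentence("wikipedia_corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # 300-dimensional vectors ("size" in gensim < 4.0)
    sg=1,             # skip-gram architecture
    negative=10,      # negative sampling with 10 noise words
    window=5,         # context window of 5 words
)
model.wv.save("wiki_word2vec.kv")  # reusable KeyedVectors file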

5.5.2 Populating the Knowledge Graph

Armed with the VSM described above, we can now process the text-based information and associate vectors to API, Method, and Parameter nodes. We build vectors using an averaging technique, which has been shown to work well for short texts [56, 53]. Although more complex vectorization techniques may be used, we later show that this simple averaging approach, when used as in our usage scenario - to search an API repository containing hundreds or thousands of APIs based on a user utterance - outperforms more sophisticated approaches in our experiments.

2 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
3 https://github.com/vefstathiou/SO_word2vec
4 https://dumps.wikimedia.org/
5 https://tinyurl.com/ya2hr56u

Figure 5.9: Generating an API description vector: (i) Keyword extraction, (ii) Keyword extension, (iii) Generating the final vector by averaging the vectors of keywords

5.5.2.1 API Description Embedding

We generate API description vectors by first retrieving API specification documents: (i) importing API descriptions provided in the OpenAPI6 specification by the API owner, (ii) crawling the ProgrammableWeb repository7 to extract descriptions, and (iii) crawling the web pages of the API owner8. Once we obtain the documents, API description vectors are generated by extracting nouns from the description of an API. This gives us information about the topics and functionality of an API [196, 200, 251]; we focus on nouns because it has been shown that they are central in inferring the purpose of a description [196, 218]. We use the Stanford Core NLP Parser [162], which helps us identify keywords in the description. Once tokenization is performed, we proceed with removing stop words that do not contribute to the semantics of the description. Next, we add additional keywords by considering the top-n closest neighbors of each extracted noun in our vector space (with n=59). Finally, we average all vectors of the extracted nouns and their neighbours to generate the API description vector.

6 We leverage the Flex framework (https://flex-swagger.readthedocs.io/en/latest/) for this.
7 https://www.programmableweb.com/api/yelp-fusion
8 https://www.yelp.com/about
9 Our work plan includes investigating how performance changes as we vary this number.

Table 5.1: Examples of API Methods, their associated utterance, and possible paraphrase.

API Method                                                        | Annotated Utterance                                    | Annotated Paraphrase Utterance
https://api.yelp.com/.../search?term=[term]&location=[location]  | Is there any Italian cafe in Baylands park?            | Greek eatery near Canterbury cathedral please.
http://api.openweathermap.org/.../weather?q=q                    | What is the weather like in Forest Hill SF?            | How does the weather look like in San Basilio?
https://api.elsevier.com/.../scopus?query=query                  | What is the latest publication about Internet of Things? | Recently published papers about resource allocation in private Cloud?

Figure 5.9 shows the steps taken by the system to generate an embedding for the Yelp API description.
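A sketch of this pipeline is shown below, reusing the vectors lookup loaded earlier. NLTK's tokenizer and part-of-speech tagger stand in for the Stanford Core NLP Parser purely to keep the example self-contained, and the example description is the Yelp text from Figure 5.9:

import numpy as np
import nltk  # requires the punkt, averaged_perceptron_tagger and stopwords data

STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def description_vector(description, vectors, n_neighbours=5):
    # (1) extract keywords: keep nouns, drop stop words and out-of-vocabulary terms
    tagged = nltk.pos_tag(nltk.word_tokenize(description.lower()))
    nouns = [w for w, tag in tagged
             if tag.startswith("NN") and w not in STOPWORDS and w in vectors]
    # (2)-(3) extend the keywords with the top-n closest neighbours of each noun
    keywords = list(nouns)
    for noun in nouns:
        keywords.extend(w for w, _ in vectors.most_similar(noun, topn=n_neighbours))
    # (4) the API description vector is the average of all keyword vectors
    return np.mean([vectors[w] for w in keywords], axis=0)

v_yelp = description_vector(
    "Yelp is a website to find businesses like dentists, hair stylists, "
    "mechanics and restaurants.", vectors)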

5.5.2.2 API Method Embedding

In order to vectorize methods (Figure 5.11), we first obtain a set of annotated natural language utterances that reflect the functionality implemented by a method, such as “search for any Indian restaurant near UNSW”. These utterances are added by API owners or the admin of API-KG. Utterances include annotations to denote that Indian restaurant is a type of cuisine and UNSW is a location. There are many ways in which this set of utterances can be obtained. In our current implementation, we adopt the active crowdsourcing approach proposed in [33, 241], where we ask crowd workers via the Figure Eight crowdsourcing platform to provide different ways to request a certain functionality. Figure 5.10 shows an example of our crowdsourcing task to collect paraphrases for the example utterance associated with the Yelp Business Search method10.

Such richness of utterances is important in bot training, as different users will likely utter somewhat different sentences to invoke a service. Specifically, we collect 3 annotated utterances per API method, since [33] shows that using 3 different paraphrases ensures variability and diversity in the utterance set. Such a set of annotated utterances (Table 5.1) can be beneficial to all bots leveraging the API [105]. We then use these annotations to link entity mentions (values) to method parameters. Utterances can also simply be inserted by the API-KG admins at API import time, or provided by the API owners if they so wish, though the crowdsourcing approach allows for automation, scale and diversity.

10https://www.yelp.com/developers/documentation/v3/business_search

Figure 5.10: Crowdsourcing task to provide three paraphrases per annotated utterance

After collecting paraphrased utterances we perform the following steps: (i) first, we extract keywords (nouns and verbs) from each annotated utterance - we consider verbs because methods are about performing activities/actions (e.g. booking, listing, forecasting), and nouns (e.g. ticket, restaurant, weather) because nouns convey the general purpose of an utterance when considered together with verbs (e.g. booking a ticket, listing restaurants, forecasting the weather); (ii) next, we add more keywords by considering the top-n neighbors in the VSM - as an initial threshold we set n=5; (iii) we then obtain the utterance vector by averaging the keyword vectors; (iv) finally, we generate the API method vector by averaging the utterance vectors. Equation 5.1 formally describes these steps:

\[
v(\text{uttr}) = \frac{1}{n}\sum_{k=1}^{n} v(kw_k), \qquad
v(\text{method}) = \frac{1}{m}\sum_{i=1}^{m} v(\text{uttr}_i)
\tag{5.1}
\]

In the equation above, v(uttr) is the vector of an utterance, obtained by averaging its keyword vectors; n is the number of keywords in that utterance; v(method) is the vector of a method, generated by averaging its utterance vectors; and m is the number of utterances linked to the method.
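Under the same assumptions (a vectors lookup and pre-extracted keywords), Equation 5.1 transcribes directly into code; the paraphrase keywords below are illustrative:

import numpy as np

def utterance_vector(keywords, vectors):
    # v(uttr): average of the keyword vectors of one utterance
    return np.mean([vectors[k] for k in keywords if k in vectors], axis=0)

def method_vector(keyword_lists, vectors):
    # v(method): average of the utterance vectors linked to the method
    return np.mean([utterance_vector(kws, vectors) for kws in keyword_lists], axis=0)

paraphrases = [  # keywords (nouns/verbs) of three crowdsourced paraphrases
    ["search", "indian", "restaurant", "unsw"],
    ["find", "chinese", "resto", "times_square"],
    ["look", "indian", "eatery", "bondi_junction"],
]
# v_method = method_vector(paraphrases, vectors)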

5.5.2.3 API Parameter Embedding

Vectors for parameters are generated from their possible values. We acquire such possible values from the annotated paraphrased utterances obtained as discussed above. We then use our word embedding model to generate one vector per parameter by averaging the vectors of the values linked to that parameter in API-KG.

Figure 5.11: Generating API method embedding - (i) the API owner provides an initial utterance that best describes the interaction with the method, (ii) the initial utterance is then paraphrased by crowd workers to collect more utterances, (iii) the collected paraphrases are then used to generate a method embedding

First, we extract parameter values from the annotated utterances (the red and blue colored words in Table 5.1). Next, we add more values by considering the top-n closest neighbors in the word embedding model using the cosine similarity metric (as an initial threshold, n=5). Finally, we retrieve the vector of each value from the embedding model and average them to generate a vector for the parameter (Equation 5.2), where n is the number of values and v(vl_i) is the vector of value vl_i. Figure 5.12 shows the steps taken by the system to generate embeddings for the Yelp Business Search method parameters: term and location.

\[
v(\text{parameter}) = \frac{1}{n}\sum_{i=1}^{n} v(vl_i)
\tag{5.2}
\]
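The analogous computation for parameters, including the value extension step, might look as follows; multi-word values are assumed to resolve to single vocabulary entries (e.g. underscore-joined bigrams), which is a simplification:

import numpy as np

def parameter_vector(values, vectors, n_neighbours=5):
    # values observed in annotated utterances, kept if in the vocabulary
    extended = [v for v in values if v in vectors]
    # value extension: add each value's top-n closest neighbours
    for value in list(extended):
        extended.extend(w for w, _ in vectors.most_similar(value, topn=n_neighbours))
    # v(parameter): average of the value vectors (Equation 5.2)
    return np.mean([vectors[v] for v in extended], axis=0)

# e.g. values of the "term" parameter extracted from the Table 5.1 paraphrases
v_term = parameter_vector(["mcdonalds", "chinese", "indian", "eatery"], vectors)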

We note that in this approach we associate one vector to each method and to each parameter. There are many other possible approaches to processing paraphrased utterances, such as keeping one vector per utterance, or clustering vectors into n clusters and then averaging the vectors per cluster. The averaging approach we followed proved to work well and reduces the number of vectors to compare, thereby shortening bot building time and allowing for faster processing. As such, it provides a reasonable baseline that has proven to be surprisingly effective, as we discuss later.

Figure 5.12: Generating an API parameter vector: (i) Value extraction, (ii) Value extension, (iii) Generating the final vector by averaging the vectors of values

5.5.3 API Parameter Enrichment - Acquiring Mentions

In some cases, relying on parameter embeddings alone is not sufficient to obtain satisfactory mappings of values within user utterances to API parameters. Indeed, in many situations there might be mentions of a parameter value that are not contained in our training corpus: for example, the parameter value New York can also be expressed as NY, New York City or N.Y.C., which are other possible values of the parameter (e.g. location).

We address the problem by leveraging complementary resources such as open knowledge graphs [274, 71, 61]. While there is a range of different knowledge graphs, in this work we utilise BabelNet - a comprehensive multilingual machine-readable semantic network [187]. For example, for a parameter value (e.g. “New York”), we add all the mentions of that value, such as “NY”, “NYC”, “N.Y.C”, “New York City”, as other possible values of the parameter (e.g. location). In the following, we discuss how these mentions are extracted and added to a parameter.

BabelNet groups words of different languages into sets of synonyms, called synsets [187]. Each synset is a set of words that share the same meaning. For example, the word “New York" is expressed by the following synsets:

Figure 5.13: Choosing a target synset: (i) Retrieve synsets from BabelNet, (ii) Extract key tokens from the names of synsets, (iii) Generate a vector per synset by averaging the vectors of tokens, (iv) Choose the synset whose vector is closest to the vector of the parameter

• Literature and Theatre: New York Magazine11, New York Novel12.

• Food and Drink: New York Wine

• Language and Linguistics: New York Typeface

• Geography and Places: New York City

The first step toward finding mentions of a parameter value (e.g. New York) consists in choosing the most relevant synset among all retrieved synsets (listed above). In computational linguistics, this technique is known as Word Sense Disambiguation (WSD) [104]: identifying which “sense” or meaning of a word is used in a particular context. To apply WSD, we first take the domain label property (e.g. “Geography and Places”) of each synset and identify the tokens within it using the Stanford Core NLP Parser [162]. Once tokenization is performed, we proceed with removing stop words (e.g. “and”, “or”, “the”) that do not contribute to the semantics of the label. This leaves

11 An American biweekly magazine concerned with life, culture, politics, and style generally, and with a particular emphasis on New York City.
12 A historical novel by British novelist Edward Rutherfurd.

us with the tokens geography and place, which we call key tokens. These key tokens are used for identifying the relevant synset.

Once key tokens are identified, the next task consists in comparing the semantic similarity between each synset and the parameter (e.g. location) to which the value (“New York”) is linked. We take each synset and generate a vector that represents the semantics of that synset by averaging the vectors of its key tokens (extracted in the previous step). We do this by utilising the embedding model we proposed in Section 5.5.1. Next, we try to find a synset whose vector is the closest (above a given threshold13) to the parameter (“location”) embedding. If we find a match between a synset and the parameter (e.g., the similarity between “Geography and Places” and “location” is more than 0.80), we choose the corresponding synset as the target. Figure 5.13 shows the steps taken by the system to choose a target synset.
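A sketch of this disambiguation step, reusing the vectors lookup from Section 5.5.1 (the domain labels and the 0.80 threshold are as described above; the tokenisation is deliberately simplified):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_target_synset(domain_labels, v_parameter, vectors, threshold=0.80):
    best_label, best_score = None, threshold
    for label in domain_labels:
        # key tokens: the domain label minus stop words and punctuation
        tokens = [t for t in label.lower().replace(",", "").split()
                  if t not in {"and", "or", "the"} and t in vectors]
        v_synset = np.mean([vectors[t] for t in tokens], axis=0)
        score = cosine(v_synset, v_parameter)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label  # None if no synset clears the threshold

labels = ["Music, Sound and Dancing", "Food and Drink",
          "Literature and Theatre", "Geography and Places"]
# target = choose_target_synset(labels, vectors["location"], vectors)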

Once we find the target synset14, we pick the top-n (with n=1015) synonyms (e.g. “New York City”, “NYC”, “Greater New York”, “City of New York”, “N.Y”) included in the target synset and add them to the API-KG as mentions of the parameter value (New York).

5.6 Building bots using API-KG

5.6.1 API-KG REST APIs

To assist bot developers in using API-KG to build bots, we have made our API Knowledge Graph available as a Web service at http://apikg-dc.eu.ngrok.io/api/docs/. It also has a user-friendly web interface, available at http://apikg.ngrok.io/, which integrates all the available APIs. The service allows bot developers to use the knowledge stored in the API knowledge graph to build chatbots. Figure 5.14 shows the list of available REST APIs. There are three categories of endpoints: (i) Intent endpoints, (ii) Knowledge endpoints, and (iii) Bot Development endpoints.

Intent endpoints are used to interact with the Knowledge Explorer component (see Figure 5.6) to search for and find relevant APIs and Methods for a given goal and sample utterances, respectively.

13 As an initial threshold, we set th = 0.80.
14 https://babelnet.org/synset?word=bn:00041611n&details=1&lang=EN&orig=new york
15 The value of 10 is set as a reasonable baseline that has worked well in our experiments; a deeper investigation of the optimal value is left as future work.

Figure 5.14: Swagger documentation for API-KG

Knowledge endpoints, as the name suggests, provide the information stored in the API-KG, such as the list of APIs, methods and parameters, together with the utterances linked to methods and the values attached to parameters. Finally, by using Bot Development endpoints, developers are able to interact with the Bot Builder component to train and deploy their chatbots directly on supported platforms (e.g. Wit.ai).

5.6.2 Bot development scenario

We now show how API-KG supports users in finding APIs of interest, in training a bot that leverages those APIs, and finally in deploying the bot.

Let’s consider developers who would like to create a bot to find restaurants. They may input a sentence such as “a bot to find restaurants in NYC”. This triggers an API call that returns a list of APIs, each with a similarity score matching the intent to the API descriptions (see Figure 5.7).

Figure 5.15: Relevant APIs to the search query

As shown in Figure 5.15, in this case Yelp is returned as the most suitable match. Developers can then directly deploy the service as a bot (as long as the API method has training information associated with it in the KG). Moreover, users with more advanced development skills can explore each returned API (internally this is done via a call to API-KG of the form ‘GET /apis/{id}/’) to gather information about the API’s name, description, methods, method training utterances and method parameters.

Querying over the API-KG is executed by creating a query vector from the developer’s request and comparing it with the API vectors in the KG to find the closest match. First, we construct the query vector by extracting nouns (we focus on nouns for the same reasons stated earlier for building API vectors in the KG), using Stanford’s CoreNLP [165] library. We then query the VSM to retrieve the vector associated to each noun, and average them to obtain our query vector. Finally, we measure the cosine similarity of the query vector with the API vectors in the KG and return the most similar ones to the developer (the current version shows the top 5), ordered with the most similar on top. Notice that the comparison is made with API vectors in the KG because the goal of this use case is to search by the intended goal of the bot as a whole.
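The search then reduces to a cosine ranking of the query vector against the stored API vectors. A sketch, reusing the cosine helper from the synset example; the api_vectors dictionary is an assumed in-memory stand-in for the KG:

import numpy as np

def search_apis(request_nouns, api_vectors, vectors, top_k=5):
    # query vector: average of the noun vectors from the developer's request
    v_query = np.mean([vectors[n] for n in request_nouns if n in vectors], axis=0)
    scored = [(name, cosine(v_query, v_api)) for name, v_api in api_vectors.items()]
    # return the top-k APIs, most similar first
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# api_vectors maps API names to their description vectors, e.g. {"Yelp": v_yelp, ...}
# search_apis(["bot", "restaurants", "nyc"], api_vectors, vectors)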

In basic usage, the developer’s work can stop here. Once an API has been selected, the explorer can employ Bot Builder, which is layered atop, to deploy the API and create the bot (Figure 5.16). In order to create a bot, Bot Builder develops three components: (i) Utterance Parser (UP), (ii) API Manager (APM), and (iii) Bot Response Generator (BRG). UP is similar to an NLU component (details in Chapter 2): it identifies the user intent and extracts entity types/values. Bot Builder trains UP by connecting to a third-party bot platform (such as Wit.ai) via the platform’s APIs, adding one intent per API method and loading the annotated training utterances and sample values from the API-KG into the platform. APM maps the detected intent and entities

(the output of the UP component) into the invocation of an API method. To develop this, Bot Builder uses a pre-built template code1 developed by API-KG admins. BRG is a component similar to NLG (details in Chapter 2): it produces natural-sounding messages from the output of the APM component. Bot Builder uses a pre-built template code to develop BRG as well. At this stage, BRG utilises rules (discussed in Chapter 2) to generate bot responses.

Figure 5.16: Building chatbot using Bot Builder

However, we may want a finer degree of control, mixing methods from different APIs to create our bot. Furthermore, we may want to search for relevant APIs and methods “by example”, i.e., by providing utterances we would like the bot to support, as opposed to describing the bot’s purpose. Developers can do so by providing a set of seed utterances that they want the bot to support, from which a set of “similar” API methods will be retrieved. Given that user utterances are typically in the form of action requests, they are more naturally linked to methods, and therefore when searching the KG we focus in this case on the vectors associated to Method nodes. This is illustrated in Figure 5.17.

As can be seen in the figure, method similarity refers to the similarity between the given utterances and an API method. Computing the similarity is analogous to the process described above; however, here the query vector is built from the utterances as follows. Firstly, we extract nouns and verbs from the utterances via CoreNLP (as discussed, verbs are also important in matching intents). Secondly, we extract the corresponding word vectors and average them, thereby obtaining one vector per utterance, and then further average these to obtain our query vector. This is then compared with the vectors associated to API methods, using cosine similarity as a metric.

1A template code is a function that performs a specific task

Figure 5.17: Relevant API Methods to seed utterances

Developers are also able to obtain other utterances that are relevant to a given API method (i.e. Yelp-Search) (internally this is done by calling ‘GET /method/{id}/expressions’). The call produces the associated set of utterances, as shown in Figure 5.18, which developers can further browse to see if the method indeed fits their need. Once methods of interest have been selected, Bot Builder can once again automatically load them, along with their training utterances, into a bot platform of choice. Developers may also further add or remove training utterances as they wish.
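For illustration, fetching this knowledge over HTTP might look as follows; the endpoint paths follow the ones quoted in the text, while the identifiers and the JSON response shape are assumptions made for the example:

import requests

BASE = "http://apikg-dc.eu.ngrok.io/api"

# Hypothetical identifiers; real ids would come from a prior search call.
api = requests.get(f"{BASE}/apis/42/").json()
utterances = requests.get(f"{BASE}/method/7/expressions").json()
for utterance in utterances:
    print(utterance)  # annotated training utterances for the method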

Bots accept user utterances in natural language. This means not only that utterances may be phrased in various ways, but also that the bot should be capable of identifying the entity types (e.g., location) within user utterances. Bot platforms can do this well if, again, they have training data, which in this case means sample values for the entity type (e.g., "London", "Paris", "New York" for location). This means that training the bot with potential parameter values is imperative. The API-KG does have values for each parameter, but such values are limited to those extracted from crowdsourced training utterances, as discussed earlier. Besides loading these sample values into the bot platform, we can help developers devise a broader set of values by searching for “similar” parameters and then instructing Bot Builder to add values from such similar parameters to our bot entity types. For example, if our selected method has a location parameter with associated values "London", "Paris", "New York", we may query API-KG to retrieve similar parameters from methods of different APIs (e.g. q from OpenWeatherMap, originPlace and destinationPlace from SkyScanner, city from Expedia). A parameter such as location of a particular API is thus associated to similar parameters from other APIs, and we can then use the associated values of these other parameters as training input (e.g. (“location”, [“London”, “Paris”, “New York”])), since we expect them to also be relevant.

Figure 5.18: Utterances associated to an API Method

Developers can then choose to add some or all values from a returned parameter to the entity type of interest in the bot platform. The query is executed by computing the similarity between the vectors of the parameter values from the selected methods and the vectors associated with the parameter nodes in API-KG, using the usual averaging mechanism. The most similar parameters are then returned to the developer, along with their lists of values. For example, if we want to restrict to Californian cities, the query will return other location parameters that have Californian cities as values as closer matches than parameters that accept locations from other continents, or worldwide locations. Developers may also select specific values if they desire, instead of “importing” all values from the returned parameters - most likely because they want their bot to be restricted to, e.g., Californian cities even if the underlying service supports worldwide locations.

Overall, the above process allows developers to quickly create a trained bot that can be used as is or as a starting point for modifications and extensions. Although end-user design was not a primary goal of this work, the preliminary studies discussed next indicate that Bot Builder might also be used by non-developers thanks to its simple interface.

5.7 Experiments

We evaluate API-KG and Bot Builder along three dimensions: (i) First, from the developer perspective, we study if and how API-KG helps in finding the proper APIs and methods to develop and then deploy a trained bot over the chosen API. (ii) Second, we want to understand if non-developers can create a bot, given a short (5-minute) tutorial; this helps us understand the potential of end-user bot development. (iii) Finally, we test how the specific - and rather simple - technique we chose to vectorize utterances and API-KG queries compares with a set of alternatives, including recently proposed sentence vectorization techniques.

5.7.1 Bot Development using only API-KG

While our initial inclination was to conduct a productivity study of the time taken to develop bots with or without our approach, it seemed obvious that using API-KG would help significantly. We therefore focused on a user-experience study. We conducted an experiment assigning participants a bot development task, and then qualitatively assessed the user experience through in-depth interviews. We summarize the overall benefits and issues below.

Participants were asked to complete the task using the API-KG APIs1 (except Bot Builder - no automation) in any way they preferred. Since our focus is on offering/reusing knowledge of APIs, our experiment is conducted only on APIs for which we have sufficient knowledge.

Participants. We recruited 5 participants with exposure to software development and machine learning. The recruitment was based on a convenience sample of bachelor and master students in computer science who had no connection with the project. Two of them had experience in bot development while the others had none.

Procedure. (a) Firstly, a 30-minute tutorial of API-KG was offered, with a discussion of its added value; (b) next, participants were assigned a chatbot development task that simulated a real-world use case; and (c) finally, a follow-up semi-structured interview was conducted about the user experience of API-KG during the bot development process.

1http://apikg-dc.eu.ngrok.io/api/docs/

Experiment. Participants were asked to build a chatbot on any of these topics: 1. Weather forecast, 2. Search videos, or 3. Search restaurants. Task duration was limited to 3 hours, and they were free to use any online documentation, tutorials or code samples. Moreover, participants were allowed to use utterances (e.g. as training data) provided by API-KG as they are, or modify them before use. Although this task is relatively straightforward, it shows that access to knowledge about an API will indeed assist bot developers even if their goals are not too complicated.

5.7.1.1 Results - Benefits of using API-KG

The most common benefit was the high quality and diversity of the obtained utterances:

“The feature I liked the most was the user utterances. It saves a lot of time not having to think of all possible paraphrases..." (P1)

“Getting a bunch of expressions was very relieving. Having built several bots, I had found it very exhausting to collect domain-specific user expressions." (P3)

“Getting plenty of values for parameters is a cool feature that helped me a lot to generate new expressions by extending existing expressions..." (P4)

The second main advantage of using API-KG is that it helps easily find relevant APIs:

“I like the suggestions about which api and method you can use for a specific use case. I think it’s intuitive and easy to use." (P4)

“finding relevant apis to meet the goal is nice! At first, I thought it’s just indexing or keyword searching, but it seems powerful that I don’t have to say exact words to get suggestions." (P3)

Overall, in answer to “Do you think that API-KG adds value? Why or why not?”, P1 highlighted one of the key features:

“Yes, I do. It saves a lot of time in finding phrases and paraphrases, also entity values. Also very useful if you do not have good domain knowledge and don’t know what exactly people may ask. For example, even though I have some basic knowledge about international cuisine, I wouldn’t know all the different types of sailing boats for a sailing shopping bot, or the best places around the world for surfing to build a “sporting" bot and so on."

5.7.1.2 Results - Issues of using API-KG

P5 commented that “training bots with the same set of utterances will lead to build same bots, no matter who builds the bot". While it is true that using the same platform and utterances will result in the same bots, our aim is to provide method utterances and parameter values for bootstrapping the training part of bot development, as also mentioned by P4: “... it provides a starting point for building a new bot."

As another issue, P5 reported that “while API method suggestion feature works like a charm, when I ask for API suggestions, in some cases, the first ranked API is not the one I’m looking for.". After some investigation, we realized that participants sometimes do not provide appropriate descriptions of their goals (“a shopping bot" (P5), “a bot to help on organizing trips" (P4)), and consequently the results might not be ranked as they expect. We are also aware that more sophisticated algorithms can be employed as we gather more knowledge about APIs. We discuss later the performance of finding relevant APIs/Methods using different vectorization techniques.

Surprisingly, P1 and P5 stated that integration with existing bot building platforms and having a user interface would be useful to ease the use of API-KG. Since participants in this experiment were not aware of the integration feature of API-KG with bot platforms (e.g. Wit.ai), they pointed out that having an automated mechanism to import utterances into the platform and train the NLU model would be very helpful and save a lot of time, as they had to copy/paste and annotate utterances (again) and values inside their preferred platforms separately.

“Integrate with bot-building platforms so that we don’t have to process the example utterances and values before feeding to the platform." (P5)

“While the endpoints and knowledge is great, I would have also liked integration with platforms I used. Including those GUI-based (not just API)." (P1)

5.7.2 Bot Development using Bot Builder

The objectives of this experiment were to i) understand how developers compare searching for APIs/Methods using API-KG versus leveraging ProgrammableWeb or other existing web resources, and ii) obtain feedback on the bot development experience with Bot Builder.

Participants. We recruited 8 participants, all with experience in building chatbots. The recruitment was based on a convenience sample of bachelor and master students in computer science who had no connection with the project.

Experiment. Each participant was given a scenario to implement, which required them to first search for relevant APIs and then develop and deploy a bot over the selected APIs and methods. We were therefore looking to i) qualitatively compare API-KG with other API search methods and ii) assess the bot development experience once the API was found. Participants were shown a short instructional video on API-KG (and on an alternative method of API search, with ProgrammableWeb), and then asked to search for methods related to "Weather forecast", "Search videos", and "Search restaurants". Some participants were asked to experiment with API-KG first and then with other means (specifically with ProgrammableWeb, or with whatever other source they wished), while others were asked to use API-KG after having tried other means. They were then asked to develop the chatbot with Bot Builder. The complete experiments, along with the instructional videos and the follow-up questions we asked participants, are available to the interested reader2.

5.7.2.1 Results - Searching with API-KG vs others

We asked participants to describe their search strategies (without API-KG) to find related APIs/Methods:

“I searched on Google and followed the links “recursively" in a tree-like way and searching immediately after if good documentation were available” (P1)

2At https://tinyurl.com/ya2hr56u. This material is not public and currently available for review purposes only.

“Just used ProgrammableWeb. The results I was looking for were right there. Just followed the links provided and I landed on https://developers.google.com/youtube/ which provides the full details of this API, code examples, etc.” (P5)

As an answer to “How would you compare the API search process with and without API-KG?", participants mentioned:

“API-KG brought me straight to the API I was looking for (YouTube); ranked first in position. ProgrammableWeb returned the result in position #50, which mean I had to browse two pages of results to find it (alternatives APIs such as Bing video search came among top results, but I was not interested in them)." (P5).

Moreover, participants emphasized the ability to search for API methods using seed utterances, which is missing in other search options:

“Without API-KG, you have to search your goal and after that find to your own the method. With API-KG you can specify directly the utterance and find the method. I think this is the feature that I like more in API-KG. It is useful, direct and intuitive." (P4).

“I think it is interesting that I can get results based on the utterances I would use. I tried the same in ProgrammableWeb and it returned 0 (zero) results when using the search queries above." (P5).

On the other hand, P2 quantitatively compared the results to the free search (Googling) option: “Searching with API-KG is surely great and faster than google but it needs to expand the quantity of API provider (found 2 on API-KG and more than 10 by searching with Google)".

This is indeed a limitation stemming from the fact that API-KG is only populated with a few thousand methods at this time.

5.7.2.2 Results - Bot Development using Bot Builder

Participants mentioned two main benefits of using Bot Builder. Firstly, the ability to enumerate and annotate training utterances directly inside the bot platform:

“It lets you start with a predefined template and with a list of pre-compiled annotated utterances, I liked the speed for bootstrap of a project." (P1)

“Adding training utterances is a challenge for me because I have to think of real utter- ances that people can use, and annotate them one by one." (P4)

Secondly, the ability to provide pre-built template codes as a starting point for backend coding:

“... for someone starting to develop chatbots, it is easier to have pre-built source code. It can be a really good point to learn how build a good chatbot." (P6)

“Automation feature requires less coding and training. It greatly simplifies the process in just 4 steps. I liked the fact that the generated code is in one of the languages I know and that I can easily customize." (P5)

These observations support the claims from our study on bot development processes discussed earlier in Section 5.2. Overall, to answer “How would typical bot development compare when using Bot Builder versus existing solutions?”, participants responded:

“The typical bot development is way slower because you have to write on your own all the training expressions and backend code. Bot builder helps to save time, so I think it is a great feature." (P7)

“Bot builder cuts down time from the initial setup phase of the bot which sometimes needs a lot of boilerplate, specially writing code from scratch is a lot slower as a lot of boilerplate is needed" (P6)

“I didn’t really feel that was me creating the bot. All of the work was done in the background by the service." (P8)

Nevertheless, P8 mentioned that “..., support for other bot building platforms (e.g. DialogFlow) is also need[ed]", and P5 suggested that “add[ing] other languages such as Javascript and Typescript besides Python will make template codes more prominent" (the template referred to here is the code that maps the detected intent and entities into the invocation of the API method, with the correct JSON object).

In summary, the feedback was quite positive while pointing at useful extensions that can be added to the platform to accommodate the preferences of different users.

5.7.3 Bot Development by Non-developers

In this qualitative study we sought to understand the difficulties non-developers face when developing bots with Bot Builder. The goal was to see what non-developers could do after being exposed to a training of two minutes, and consequently to obtain a preliminary assessment of whether Bot Builder can be used by non-developers and, if not, to understand what is missing.

Participants. We recruited 6 participants from fields other than Computer Science (Sociology, Nursing, Medicine, etc.). None of them had development experience, in either software development or building chatbots.

Experiment. Each participant was first exposed to a short tutorial (2 minutes) on API-KG and on how to build bots using Bot Builder. Participants were then asked to build and deploy a bot on any of these three topics: 1. Weather forecast, 2. Search videos, 3. Search hotels. The deployment is performed by Bot Builder automatically, by accessing wit.ai and loading the required information. Users can specify their own wit.ai account if they have one, but this is not the case for non-developers, so the deployment is done to a general Bot Builder account. Users were asked to answer follow-up questions in a Google form. The tutorial, the experiment description as seen by the participants, and the follow-up questions are available online (see earlier Footnote 2).

Results. Interestingly, all the participants could successfully build their chatbots (with the exception of one person, who did not look at the tutorial before doing the test). Participants were asked to describe their experience:

“I can’t believe that is quite easy to develop a chatbot using the program." (P3)

“A new experience for a health person to create a chatbot." (P2)

“It was a really clean and straight-forward experience, given also the demo video" (P4)

We also asked participants to tell us how they would improve their chatbot if they could. P2 mentioned that “Incorporating more places and also including temperature will make my chatbot really handy”,

P6 said that “I like that it is simple and direct to the point. It could have been better if more places/cities are incorporated", and P3 stated that “I like that it immediately answered my queries. I think it can improve by giving more accurate answers". The limitations reported were related to cases where participants assumed that the Yelp bot would also answer queries related to hotels, or that the weather bot would give forecasts for an entire country ("How cold is in Germany now?"). These experiences point to the need for capturing "similar" domains or categories of parameter values that are not supported by a specific API, and either explicitly mentioning the lack of support for these cases or adding an intent and training utterances so that the bot explicitly answers that the domain is not supported when related utterances are uttered by the user (e.g., training a bot to also mention that hotels are not supported when such a query is made).

5.7.4 Effect of Vectorization Techniques

As discussed earlier in this chapter, we use a simple averaging technique to build vectors for APIs/Methods and their parameters (Section 5.5). Our approach uses these vectors to suggest appropriate APIs and Methods for given natural language search queries. In the following, we study how accurate these suggestions are and how they compare when we replace averaging with other vectorization techniques.

Dataset and Procedure. To evaluate the different vectorization approaches we proceeded as follows: we crowdsourced a total of 3790 utterances for 100 API methods in the KG (an average of roughly 38 utterances per method). We then randomly selected 70% of the utterances for each method (2653 utterances in total) to build the method vectors with the different approaches; the remaining 30% (1137) was used as a test set to issue queries to the API-KG and compute the different evaluation metrics.

Evaluation Metrics. In essence, what we test is that, when we query the API-KG with an utterance from the test set, the method that the utterance was associated with is among the "top k" returned. We compare vectorization techniques using Success Rate at k (SR@k), Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR), which are widely used metrics in classification and information retrieval systems [163, 45, 279, 288, 134, 208, 278].

SR@k assesses whether the relevant method is inside the top-k result set. Notice that since there is only one correct answer, it does not make sense to measure and compare precision and recall. NDCG considers the position of the correct answer within the top-k list - having highly relevant answers at the top of the list is better than having them at the bottom - while MRR similarly measures how far down the recommended list we need to go to find the first correct answer.
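For clarity, these metrics can be computed as follows when each query has exactly one correct method (rank is the 1-based position of that method in the returned list, or None if it is absent):

import math

def success_rate_at_k(ranks, k):
    # fraction of queries whose correct method appears within the top k
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    # with a single relevant item, DCG = 1/log2(rank + 1) and the ideal DCG is 1
    return sum(1.0 / math.log2(r + 1) for r in ranks
               if r is not None and r <= k) / len(ranks)

def mrr(ranks):
    # mean of the reciprocal rank of the first (and only) correct answer
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2, 1]  # 1-based rank of the correct method per test query
print(success_rate_at_k(ranks, 3), ndcg_at_k(ranks, 3), mrr(ranks))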

Vectorization Approaches. To assess the quality of the search results, we first experiment with the simple averaging method discussed in Section 5.5 (we refer to it as Vector Per Method or VPM in the following), as well as a variant where we build one vector per utterance (by averaging keywords). In this variant (called Vector Per Utterance or VPE in the following), a method is therefore associated with several utterance vectors, as opposed to just one vector as in the previous case.

Furthermore, we experiment with the most popular vectorization methods that have publicly available models. Specifically, we experiment with InferSent [49], Sent2Vec [194], the recently released Universal Sentence Encoder (USE) [36], and Concatenated Power Mean Embeddings (CPME) [221]. We briefly discussed these methods in the Related Work section, so we do not detail them here. For each of the above algorithms, we implemented two approaches: we first combined all utterances into one long utterance and then generated a method vector from it with the algorithm. Next, in what we refer to as the Keywords version, we only consider keywords (nouns and verbs) when building the long utterance to be vectorized. The code and data for the experiments are available at the same URL (see Footnote 2).

Results. As shown in Table 5.2, the averaging technique, more specifically one vector per method (VPM), has the highest overall MRR. In terms of success rates, Sent2Vec is slightly superior in SR@1 and NDCG@1, USE and VPM perform best at top 3, while VPM outperforms the other methods at top 5.

For these reasons we adopt the simple VPM technique as the API-KG search method - though we observe that Sent2Vec and USE would give similar results. Notice that the specific vectorization and search algorithm is a pluggable module of API-KG and Knowledge Explorer. With the continuing progress in NLP, it is likely that new algorithms will be devised that clearly outperform VPM; in that case, they can be integrated to improve method search quality.

Table 5.2: Comparison between API method vectorization techniques for given natural language search queries

Vectorization        | SR@1  | NDCG@1 | SR@3  | NDCG@3 | SR@5  | NDCG@5 | MRR
Averaging (VPM)      | 0.882 | 0.882  | 0.930 | 0.915  | 0.967 | 0.959  | 0.948
Averaging (VPE)      | 0.653 | 0.653  | 0.811 | 0.786  | 0.841 | 0.811  | 0.734
Sent2Vec (Original)  | 0.902 | 0.902  | 0.936 | 0.912  | 0.961 | 0.950  | 0.941
Sent2Vec (Keywords)  | 0.722 | 0.722  | 0.831 | 0.828  | 0.844 | 0.828  | 0.775
InferSent (Original) | 0.673 | 0.673  | 0.792 | 0.781  | 0.821 | 0.795  | 0.746
InferSent (Keywords) | 0.475 | 0.475  | 0.603 | 0.592  | 0.613 | 0.597  | 0.541
USE (Original)       | 0.841 | 0.841  | 0.940 | 0.913  | 0.960 | 0.943  | 0.892
USE (Keywords)       | 0.663 | 0.663  | 0.821 | 0.807  | 0.861 | 0.826  | 0.746
CPME (Original)      | 0.603 | 0.603  | 0.772 | 0.757  | 0.841 | 0.790  | 0.714
CPME (Keywords)      | 0.752 | 0.752  | 0.900 | 0.879  | 0.937 | 0.893  | 0.826

With VPM, Knowledge Explorer achieves a success rate of 0.882 for top 1 (nearly 90% of the time the correct method is returned at the top of the list), which grows to nearly 97% for top 5. This shows that API-KG can discriminate well between different APIs - different from the perspective of how users would invoke them. Notice that the scores are so high because the API set under test does not contain several methods for the exact same functionality. If we had API methods offering the same functionality that the user would invoke with the same utterance, then Knowledge Explorer would not be able to (and indeed, should not) discriminate among them based on the search utterance. It would not make sense to do so, as they all represent relevant answers (and it would not make sense to consider only one answer as correct). In this case the ranking would likely have to be based on reputation, popularity, or other criteria.

5.8 Conclusions and Limitations

In this chapter we proposed methods and tools to collect API descriptions and map them into vector representations in a way that makes API search effective and that simplifies bot development over retrieved APIs. API representations, stored in a knowledge graph, and Bot Builder have been shown to be beneficial for developers and usable by non-developers in our qualitative studies, and quantitative studies of method retrieval have shown good performance.

The work as presented has several limitations and many possible extensions, which are part of our current work. A limitation of the current analysis is that we have not explored search performance and quality when the number of APIs grows large, e.g., greater than 10,000. Such a scenario will likely also require the incorporation of some form of reputation or popularity mechanism in addition to relevance. A parallel thread of work is also exploring how many paraphrases per method we need, and how diverse they should be, to perform search effectively in large API-KGs. We will also extend the current studies, adding quantitative studies to the qualitative ones described here.

Another issue we have not addressed is how to facilitate the identification of the parameter scope and the mapping between possible parameter values (which we obtain and expand as discussed in the chapter) and the specific parameter values accepted by the method. For example, we may expand values for the "city" parameter of a method and include the mentions "New York", "NYC", "NYC, NY" and so on, but then the method might be strict in what it can accept to indicate New York City. Similarly, a location parameter for the method may only accept cities, or may also accept countries, or perhaps lat-lon coordinates. These are issues that we also experienced in our preliminary tests, and a promising way to address them is to have access to API invocation logs with successful and unsuccessful examples.

We are continuously growing the API-KG, and as we release it for public use we expect the community to contribute by adding APIs as well as template codes for mapping intents into API calls. We also plan to extend API-KG by considering information from StackOverflow, GitHub and other sources of community-provided knowledge.

Chapter 6

Multi-Turn and Multi-Intent User Chatbot Conversations

Hierarchical State Machine based Conversation Model and Services

This chapter presents techniques we devised to support multi-turn and multi-topic natural language conversations with APIs. We build upon Hierarchical State Machines (HSMs) to develop a novel conversation engine that represents conversations as sequences of states each covering a topic. States may contain nested states to handle complex conversations.

The rest of this chapter is organized as follows. Section 6.1 provides an introduction explaining the issues and proposed solutions. Section 6.2 provides background information on human-chatbot conversations. Section 6.3 introduces extended HSM-based models to represent human-chatbot-API conversations. Sections 6.6 and 6.7 present the validation and evaluation of the proposed methods. We provide concluding remarks in Section 6.8.

6.1 Introduction

In the previous chapter, we focused on intent recognition in the context of single-turn user-chatbot interactions. This means that every single user utterance is considered complete and carries all the information required by the chatbot to perform a task. However, studies on human-chatbot conversation patterns (e.g. [171]) reveal that, in practice, conversations are rather multi-turn: there may be missing information (e.g. location) in users’ utterances that needs to be filled in by the chatbot before an actual API call can be invoked. Other examples include the invocation of an API by the bot to resolve the value of a missing parameter, a question by a bot to a user to confirm an inferred intent value or to make a choice among several options, or extracting an intent parameter value from the history of user and bot interactions. In addition, according to studies on human-chatbot dialogue patterns (e.g. [239]), switching between different intents is natural behaviour for users. Thus, there is a need for more dynamic and rich abstractions to represent and reason about multi-turn and multi-intent conversational patterns. The main challenge in achieving this objective arises from variations in open-ended interactions and the large space of APIs that are potentially unknown to developers.

In this chapter, we propose a multi-turn and multi-topic conversational model that leverages Hierarchical State Machines (HSMs) [283]. HSMs are a well-known model suited to describing reactive behaviours, which are very relevant for conversations, but other specific user-bot-API conversation behaviours must be modelled too. More specifically, HSMs reduce the complexity that may be caused by the number of states needed to specify interactions between users, bots and services.

Conversations are represented as a sequence of states, each covering an intent. A state may rely on a nested state machine to manage the tasks required to handle an intent to completion. Transitions between states are triggered when certain conditions are satisfied (e.g., detection of a new intent, detection of a missing intent parameter). The proposed conversation model and engine, together with new techniques to identify implicit information from dialogues (by exploiting Dialog Acts [239]), enable chatbots to manage tangled and multi-turn conversational situations. We extend Bot Builder (presented in the previous chapter), a service that semi-automates chatbot development and deployment, to support multi-turn and multi-intent conversations. More specifically, we make the following contributions in this chapter:

• We propose the concept of conversation state machines as an abstraction to represent and reason about dialog patterns. Conversation state machines represent multi-turn and multi-intent conversations, where states represent intents, their parameters and the actions to realise them. Transitions automatically trigger actions to perform the desired intent fulfilment operations.

The proposed model extends the hierarchical state machine model to effectively support complex user intents through conversations among users, bots, services and API invocations. Nested states represent interactions to resolve missing slot values.

• We propose dialog act recognition techniques to identify state transition conditions. We use dialog acts to specify interaction styles between users, bots and APIs (e.g., the user submits an utterance, the bot detects a missing slot value, the bot asks the user to provide the missing slot value, the user submits a new utterance to supply the missing value).

• We develop a conversation management service that is used to initiate, monitor and control the run-time interactions between users, bots and APIs. The knowledge required at runtime by the conversation management service is extracted from the bot specification (i.e., developer-supplied user intents) and user utterances. In this way, the conversation manager automates the generation of run-time nested conversation state machines that are used to deploy, monitor and control conversations with respect to user intents and utterances.

• We extend the Bot Builder (presented in the previous chapter) to build and manage multi-intent, multi-turn conversations.

• We provide a validation and evaluation of the techniques proposed in this chapter.

6.2 Human-Chatbot Conversations

A conversation between a user and a chatbot can be formulated as a sequence of utterances. For example, to answer the user utterance “Please remind @Sarah that we have a meeting tomorrow by 1:30PM”, after performing the task (e.g. setting a reminder) the chatbot replies with, e.g., “Ok, reminder sent to Sarah”.

Studies on human conversation patterns [103, 158] reveal that human-chatbot dialogue can be divided into three categories [69, 39] (from “less” to “most” challenging situations) (Figure 6.1):

• Single Intent - Single Turn: The interaction between user and chatbot is in the form of utterance-response pairs. The assumption here is that the user provides all the required information (e.g. slots/values) at once, in one single utterance [289]. Thus, each utterance from the user (e.g.

Figure 6.1: Types of human-chatbot conversations - from less to more natural

“Please text Bob that we are in meeting room 401K”) has a reply from the chatbot (“'We are in meeting room 401K' is sent to Bob”). This type of conversation is stateless, i.e., each user utterance is treated separately, without using any knowledge from past conversations or context. Thus, if any information is missing in the user utterance (e.g. location, date), the chatbot is not able to perform the required task (e.g. book a flight ticket) [290]. Finally, in this type of conversation, the parties talk only about one specific intent (e.g. scheduling meetings [51]) during the whole conversation.

• Single Intent - Multi Turn: Providing missing information (e.g. location, date) to complete an intent is a common behaviour that people follow in their daily conversations [103, 112]. For example, while talking to a friend on the phone we may ask, “I’m going to have lunch, do you have any suggestion?” to get some ideas for lunch. However, without specifying the place where we are (e.g. “UNSW”) or our preference for today (e.g. “Thai food”, “sandwich”, “noodles”), our friend is unlikely to be able to give us concrete suggestions. Thus, she asks questions to get more details (e.g. “Where are you?”, “What do you prefer?”, “Do you like something soupy?”). Similarly, the information (e.g. “departureDate”, “destinationCity”) that a chatbot needs to perform a task (e.g. call the Expedia API to book a flight ticket) is often scattered across multiple user utterances. In Chapter 2, we discussed different approaches that utilise a dialog manager to address this issue. The dialog manager is the component, in a dialog system architecture, that chooses the next chatbot actions, maintains the conversation context, keeps the conversation state and tracks intent slot values (see Chapter 2 for more details).

• Multi Intent - Multi Turn: In this type of conversation, the intent continuously changes. Figure 6.2 exemplifies a dialogue where the user changes the intent by asking about “her appointment on the weekend”.

Changing intent is something people usually do in their day-to-day conversations [103]. However, participating in a multi-intent conversation, where conversation information is scattered across multiple utterances, is a challenging task for dialog management systems [228, 281]. Such difficulty stems mainly from the challenges of identifying user intent changes [171, 264] and of tracking slot information for each intent in a conversation. In the following sections, we explain our approach to empower chatbots to handle this type of conversation by utilising Hierarchical State Machines (HSMs).

1. User: Book a table at Time for Thai please
2. Chatbot: What is the date?
3. User: hmmm... never mind! Do I have any appointment on Saturday?
4. Chatbot: I cannot see anything on your calendar, you look free for Saturday.
5. User: Ok, thanks!

Figure 6.2: User changes the intent to know about her calendar schedule

6.3 Conversation State Machines

We propose to represent user-chatbot-API conversations using an extended hierarchical state machine model. In this model, a state machine contains a set of states representing user intents. We call these states “intent-states”. An intent-state characterizes the fulfillment of a specific user intent. In the following, we describe the different types of states and the transitions between states:

Basic Intent State: When a user utterance carries all the required information to fulfill the user intent, chatbot does not need to communicate with user to get any further information. We call this state a basic intent state, where chatbot has everything needed to perform the required action (e.g., an API call). For example, given a user utterance (e.g. “What is the weather forecast in Sydney?”) with an intent (e.g. “GetWeather”), chatbot invokes an API (e.g. OpenWeatherMap to get the weather condition) and returns a message to user (e.g. “We have scattered showers in Sydney.”). The interaction between user and chatbot is straightforward, without any further question from chatbot.
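To make the distinction concrete, the following is a minimal sketch of how an intent-state could be represented; the class, field, and method names are illustrative assumptions rather than the thesis implementation:

```python
from dataclasses import dataclass, field

@dataclass
class IntentState:
    intent: str                                          # e.g. "GetWeather"
    required_slots: list = field(default_factory=list)   # e.g. ["location"]
    slot_values: dict = field(default_factory=dict)      # values seen so far

    def missing_slots(self):
        """Slots still needed before the required action (e.g. an API call)."""
        return [s for s in self.required_slots if s not in self.slot_values]

    def is_basic(self):
        """Basic intent state: the utterance carried all required information."""
        return not self.missing_slots()
```

Under this sketch, “What is the weather forecast in Sydney?” yields IntentState("GetWeather", ["location"], {"location": "Sydney"}), for which is_basic() is True, so no nested state is needed.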

Nested State: If a user utterance has missing information, then the chatbot needs to communicate with user to get the missing intent slot values before it performs further actions to fulfill the intent. In this situation the intent-state relies on other nested states [69] to complete the intent. More specifically, a nested state is used by a bot to ask user for missing values of intent slots. Based

Page 146 of 194 nue epneanse tt sdvddit w aeois (i) categories: two into divided is state nested a blue response by user denoted on is intent-state current orange - in intent highlighted user is transition on intent” based “new color, intent-states between Transition 6.3: Figure o question bot in example, For chatbot. by asked question slot-filling hsase osntrqieayfrhrpoesn si rvdstemsigso (e.g. slot missing the provides it as Date processing further any require not does answer This state slot-intent nested bantemsigvle o xml,cnieigthe to considering process example, to needs For chatbot the value. that intent) missing new the (i.e, obtain request new a with utterance an provides but ae h aetitn-tt “oketuat)tigr rniint etdso-netstate slot-intent nested a to transition this a (“CheckCalendar”). in triggers value (“BookRestaurant”) slot the intent-state obtain parent to the order case, In is “CheckCalendar”. which intent e.g. another intent-state identifies another processing by whose represented utterance another is user the from answer The question chatbot etdso-au state slot-value Nested etdso-netstate slot-intent Nested value. ) Wa stedate?” the is “What Wa stedate?” the is “What . "What istheweatherforecastinSydney?" ersnsastainweeue osntpoieso vle directly, “value” slot provide not does user where situation a represents ersnsastainweeue xlctypoie vle o the for “value” provides explicitly user where situation a represents Slot/Value Intent End-User "We havelightraininSydney" : "GetWeather" t fulfill (to :

"location: Sydney" srrpiswith replies user , ae17o 194 of 147 Page bookingDate

etuatbooking restaurant Greeting Wihdyo h ekn mIfree?” I am weekend the of day “Which lt,ue nwr with answers user slot), SearchBusiness etuatbooking restaurant SearchBusiness GetWeather Greeting Business Weather Search Get etdso-au state slot-value nested

Greeting

hto,gvntechat- the given chatbot, GetWeather hto,gvnthe given chatbot, Ti Sunday” “This booking- n (ii) and , . . ifrn net(..“eWahr) hs ttigr rniint oefo “SearchBusiness” from move to transition a triggers it Thus, “GetWeather”). (e.g. intent different utters user after, Then e.g. request, intent-state). (new another “SearchBusiness” state the to intent-state) (current ing” tt ahn si Getn”itn-tt,ue ssfrrsarn ugsin.Ue utterance User suggestions. restaurant for asks user i.e. intent-state, “Greeting” in is machine state usint srt eov nitn aaee au)o pntedtcigitn wthin (i) switch transitions: intent of detecting types the three upon identify We or value) intent). new parameter a intent detecting an (i.e, clarification resolve conversations a to asking user (e.g., a performed to are actions question when triggered are states between Transitions States between Transitions 6.3.1 rniinbtentoitn-tts( is of intent-states intent example two an (an shows between intent 6.3 new Figure transition a state). conversation identifies current utterance the by user handled new not a is of that processing the if intent-state new a an intent New (ii) iue64 rniint etdso-au tt urn etdso-au tt sdntdb e color red by denoted is state slot-value nested current - state slot-value nested to Transition 6.4: Figure etdslot.value nested AyIainrsarn erKingsford” near restaurant Italian “Any rniin ee otemvmnsbtenitn-tts h tt ahn rnisto transits machine state The intent-states. between movements the to refer transitions "What istheweatherforecast?" n (iii) and , Wa stewahrfrcs nSydney?” in forecast weather the is “What "We havelightraininSydney" End-User End-User "I'm inSydney" "Where areyou?" etdslot.intent nested erhuies GetWeather SearchBusiness, ae18o 194 of 148 Page rgesatasto omv rmtesae“Greet- state the from move to transition a triggers transitions. location location nest ed slot.value ful .Ti e srutrnehsa has utterance user new This ). .Freape suigthat assuming example, For ). fi

lled Basic Basic

GetWeather GetWeather nwintent” “new e intent new , etdslot.intent Nested conversation the continue to “GetWeather” state user. parent with the to “location” state slot-value nested (e.g. information provide to user asks you?” chatbot are where 6.4) Figure (de- in “location” colour state red slot-value with nested picted the to intent-state) (current “GetWeather” state parent from etdslot.value Nested 6.3). Figure in state colored (blue intent-state “GetWeather” to the obtain to (“GetUserDetails”) machine state slot state slot-intent missing intent, nested the another a for to to request value state a slot-value is nested answer user’s “location” from - state moves slot-intent nested to Transition 6.5: Figure Gtete”itn-tt.Freape ofilamsigso (e.g. slot missing a fill to of example, upon example For slot an missing intent-state. shows “GetWeather” the 6.4 for Figure “value” value. a such provides for user request bot if a state slot-value a to moves machine state The .Ue ele ihavle(..“’ nSde”.Tesaemciemvsbc from back moves machine state The Sydney”). in “I’m (e.g. value a with replies User ). End-User End-User "We havelightraininSydney" "What istheweatherforecast?" "Where ismyhometown?" rniinrpeet h oeeto tt ahn onse ltvlestate. slot-value nested to machine state of movement the represents transition rniin niaetemvmnso tt ahn onse ltitn states. slot-intent nested to machine state of movements the indicate transitions "Where areyou?" ae19o 194 of 149 Page location location nest ed slot.value ful fi lled Basic Basic

location

nest

GetWeather GetWeather ed slot.value nse slot.value” “nested Basic user's hometown nested slot.intent

location GetWeather ,saemciemoves machine state ), GetUser Details rniinwithin transition “Where Figure 6.5 shows an example of “nested slot.intent” transition from “GetWeather” intent-state. For example, to fill a missing slot (e.g. location), chatbot asks user to provide information (e.g. “Where are you?”). User’s answer (e.g. “Where is my home town?”) carries another intent (e.g. “GetUserDetails”) to obtain the missing value. This new intent is handled by a nested state in which the value of the missing value is obtained (e.g., by invoking an API).

6.4 Generating State Machines

We devise the “State Machine Generator” (SMG), a service that generates a state machine that allows chatbot to manage conversations at run-time. SMG takes as input (i) user utterances, and (ii) a bot specification, which is a set of intents (e.g. Greeting, SearchBusiness, GetWeather, GoodBye). In the following, we explain the steps taken by SMG to generate a state machine.

6.4.1 Generating Intent States from Bot Specification

When SMG receives a bot specification (i.e., a set of user intents), it creates an intent-state per user intent. For example, an intent-state, namely the SearchBusiness state, is created to represent the user intent “SearchBusiness” (i.e., get a list of restaurants and cafes).

SMG creates intent-states for two types of user intents. The first type of user intent is general communication intent (e.g. Greeting, GoodBye). These intents are fulfilled using (question, answer) pairs that do not require any API invocation. The second type of user intent requires the invocation of an API method to be completed. For example, for the “GetWeather” user intent, the chatbot needs to invoke the OpenWeatherMap API to retrieve weather conditions and fulfill the user intent.
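The sketch below illustrates this step under stated assumptions: the set of general communication intents, the “basic” child state, and the function name are ours for illustration; the required slots per intent would come from API-KG (Chapter 5).

```python
GENERAL_INTENTS = {"Greeting", "GoodBye"}

def intent_states(bot_spec, required_slots):
    """bot_spec: set of intent names; required_slots: intent -> list of slots."""
    states = []
    for intent in bot_spec:
        if intent in GENERAL_INTENTS:
            # Fulfilled with (question, answer) pairs; no API invocation.
            states.append(intent)
        else:
            # API-backed intents nest one child state per required API slot,
            # plus a "basic" child for the slot-complete situation.
            states.append({"name": intent,
                           "children": required_slots[intent] + ["basic"]})
    return states
```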

6.4.2 Generating Transitions between States

At run time, the bot generates three types of transitions, namely “new intent state”, “intent state to nested slot.value state”, and “intent state to nested slot.intent state”. To generate these transitions, we leverage dialog acts [32]. In this section, we first describe dialog acts and then explain how

Page 150 of 194 SMG generates transitions using dialog acts.

Dialog Acts. Understanding user needs and engaging in natural language conversations requires chatbot to identify hidden actions in user utterances, called dialog acts. Whether the user is making a statement, asking a question, or negotiating on suggestions, these are all hidden acts in user utterances [32]. Asking questions, giving orders, or making acknowledgements are things that people commonly do in conversations. Such actions are called dialog acts [111, 31].

In a nutshell, dialog acts convey the meaning of utterances at the level of illocutionary force [239, 111]. For instance, 42 dialog acts were identified in [239]. Inspired by this work and empirical studies on human-chatbot conversations [103, 290], we adapted dialog acts to the requirements of multi intent - multi turn chatbots that leverage APIs. Table 6.1 shows examples of these dialog acts.

More specifically, we focus on the following dialog acts:

• U-New Intent: This act indicates that user has a new intent. For example, when user says “Which day of this weekend am I free?”, her intention is to know about her available time for the weekend (e.g. the CheckCalendar intent).

• C-Request Information: This act indicates that chatbot asks user to provide a missing slot value. For example, chatbot asks the user (e.g. “Where are you?”) to provide her location.

• U-Provide Information: This act indicates that user provides information (e.g. “15 March”) for a former question asked by the chatbot (e.g. “What is the date?”).

• U-Provide Nested Intent: This act indicates that user provides an utterance (e.g. “When is my birthday?”) to answer a former question asked by chatbot (e.g. “What is the date?”). The completion of this utterance requires a transition to a nested intent state.

• C-Provide Information: This act indicates that chatbot replies to a user request by providing an answer. For example, the chatbot answer (e.g. “I found Mamma Teresa, it’s in 412 Anzac Parade...”) to the question asked by user (e.g. “Is there an Italian restaurant around?”).

Generating Transitions. We annotate user-chatbot conversation messages using dialog acts. Thus, sequences of dialog acts (e.g. <U-New Intent, C-Request Information, U-Provide Information, C-Provide Information>) can be inferred from

Table 6.1: Examples of Dialog Acts in a conversation between user and chatbot.

User: Is there any Italian restaurant around? [New Intent]
Chatbot: Where are you? [Request Information]
User: I’m in Kingsford. [Provide Information]
Chatbot: I found Mamma Teresa, it’s in “412 Anzac Parade...” [Provide Information]

conversations. We call these sequences dialog act patterns. Dialog act patterns are used by the SMG to generate state transitions.

New Intent-State: The SMG generates this type of transition upon identifying one of the following dialog act patterns: (i) the <U-new intent> pattern, or (ii) a sequence that contains the <..., C-provide info, U-new intent> pattern. The first pattern describes a situation where user starts a conversation by uttering a request (e.g. “What is the weather forecast for Sydney today?”) which is annotated with the U-New Intent dialog act. This triggers a “new intent-state” transition from “Greeting” (current intent-state) to the “GetWeather” intent-state (as shown in Figure 6.3 with blue color).

The second pattern represents a situation where user utters another request (annotated with the U-New Intent dialog act) right after an answer from chatbot. The chatbot answer is related to a request asked by user in previous conversation turns. For example, when chatbot answers a user request with e.g. “The weather in Sydney is sunny today”, the user utters a new request, e.g. “I want to drink slushy, is there any McDonald’s around?”. The new user utterance carries a new intent (e.g. SearchRestaurant). This triggers a “new intent-state” transition from the “GetWeather” intent-state to the “SearchRestaurant” intent-state.

Intent state to nested slot.value state: The SMG generates this type of transition upon identifying the following pattern: <..., C-request info, U-provide info>. This pattern describes a situation where user utters a request with missing information that the chatbot needs before it can fulfill the user intent. Thus, chatbot asks user to provide the missing information, and the user answer provides the missing value. Figure 6.4 shows an example of a “nested slot.value” transition within the “GetWeather” intent-state. For example, when user asks for the weather condition (“What is the weather forecast?”), a “nested slot.value” transition from the “GetWeather” intent-state (current state) to the “location” nested slot-value state is created. Chatbot then asks “Where are you?” and user replies with “I’m in Sydney”. The state machine then goes back to the parent intent-state “GetWeather”.

Figure 6.6: Conversation Manager Architecture - UP (Utterance Parser) extracts intents and slots/values; SM (Slot Memory) stores slots/values per intent; APM (API Manager) invokes APIs; DAR (Dialog Act Recogniser) detects user dialog acts; SMG (State Machine Generator) generates states and transitions; BRG (Bot Response Generator) generates NL responses

Nested slot.value state to nested slot.intent state: The SMG generates this type of transition upon identifying the following pattern: <..., C-request info, U-provide nested intent>. This pattern describes a situation where chatbot asks user to provide a value for a missing slot (e.g. location), and user answers with another request whose intent computes this value using another service. For example, when chatbot asks “Where are you?”, user answers with the utterance “Where is my home town?”, which is annotated with the U-Provide Nested Intent dialog act. This pattern triggers a “nested slot.intent” transition from the “location” nested slot-value state (current state) to the “GetUserDetails” nested intent-state (depicted with green color in Figure 6.5). In this state, chatbot invokes an API to get the user’s home town. The result (e.g. “Sydney”) is the value for the missing slot (e.g. location).
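Putting the three patterns together, the following is a minimal sketch of how the SMG could map the tail of an annotated dialog-act sequence to a generated transition; the function, trigger, and state names are illustrative assumptions, not the thesis implementation:

```python
def generate_transition(machine, acts, current_state, new_intent=None, slot=None):
    """Inspect the tail of the dialog-act sequence and add the corresponding
    transition to a pyTransitions-style state machine."""
    tail = tuple(acts[-2:])
    if acts[-1] == "U-New Intent":
        # <U-new intent> or <..., C-provide info, U-new intent>
        machine.add_transition("new_intent", source=current_state, dest=new_intent)
    elif tail == ("C-Request Information", "U-Provide Information"):
        # <..., C-request info, U-provide info>: nested slot.value transition
        machine.add_transition("fill_slot", source=current_state,
                               dest=f"{current_state}_{slot}")
    elif tail == ("C-Request Information", "U-Provide Nested Intent"):
        # <..., C-request info, U-provide nested intent>: nested slot.intent
        machine.add_transition("nested_intent",
                               source=f"{current_state}_{slot}", dest=new_intent)
```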

6.5 Conversation Manager Service

In order to support multi-intent and multi-turn conversations, we devise a service that initiates, monitors and controls conversations. This service is called the conversation manager. It utilises a set of components to communicate with users, manage the hierarchical state machine, and invoke APIs.

6.5.1 Conversation Manager Architecture

Figure 6.6 shows the architecture of the conversation manager service. In terms of software architecture, the conversation manager relies on the following components: Utterance Parser (UP), Dialog Act Recogniser (DAR), State Machine Generator (SMG), Slot Memory (SM), API Manager (APM) and Bot Response Generator (BRG). We presented the UP, APM and BRG components in the previous chapter. In this section, we present the DAR, SMG, and SM implementation details.
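Before detailing the individual components, the following is a minimal sketch of how one conversation turn could flow through these components; all component interfaces (parse, classify, store, and so on) are illustrative assumptions:

```python
def handle_turn(utterance, up, dar, smg, sm, apm, brg, machine):
    """One conversation turn through the conversation manager components."""
    intent, slots = up.parse(utterance)        # UP: extract intent and slot/values
    act = dar.classify(utterance)              # DAR: detect the user dialog act
    sm.store(intent, slots)                    # SM: remember slot/values per intent
    missing = sm.missing_slots(intent)
    smg.apply(machine, act, intent, missing)   # SMG: generate/fire a transition
    if not missing:
        result = apm.invoke(intent, sm.fetch(intent))  # APM: call the target API
        return brg.respond(result)             # BRG: natural language answer
    return brg.ask_for(missing[0])             # ask user for the next missing slot
```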

6.5.2 State Machine Generator

SMG leverages pyTransitions1, an off-the-shelf Python library, to generate state machines. To generate intent-states for general user intents, SMG initialises the Machine class from the library with the intent-states (e.g. Greeting, GoodBye) along with an initial state (e.g. Greeting). To generate intent-states for intents with API invocations, SMG uses an extension module of the library. It imports the NestedState class from the library with the initialization arguments “name” and “children”: “name” refers to the name of the intent (e.g. SearchBusiness) and “children” refers to the required slots of the API (e.g. term, location). SMG retrieves the required parameters of an API from API-KG (Chapter 5). To generate transitions, SMG uses the “add_transition” operation in the Machine class of the pyTransitions library. A transition is generated by passing the “source” (e.g. GetWeather) and “destination” (e.g. SearchBusiness) states as arguments to the operation.
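The sketch below illustrates this generation step with pyTransitions; it is a minimal example under stated assumptions (the trigger names, the “basic” child state, and the concrete intents are ours for illustration, and HierarchicalMachine with its default underscore separator for nested state names is used for the nested case):

```python
from transitions.extensions import HierarchicalMachine

class Bot:
    pass

bot = Bot()

states = [
    "Greeting",
    "GoodBye",
    # API-backed intents are nested: children are the required API slots
    # (retrieved from API-KG) plus a "basic" slot-complete state.
    {"name": "GetWeather", "children": ["location", "basic"]},
    {"name": "SearchBusiness", "children": ["term", "location", "basic"]},
]

machine = HierarchicalMachine(model=bot, states=states, initial="Greeting")

# "new intent-state" transitions between intent-states.
machine.add_transition("new_intent", source="Greeting", dest="GetWeather")
machine.add_transition("new_intent", source="GetWeather", dest="SearchBusiness")

# "nested slot.value" transition into (and back out of) a missing-slot state.
machine.add_transition("request_slot", source="GetWeather", dest="GetWeather_location")
machine.add_transition("slot_filled", source="GetWeather_location", dest="GetWeather_basic")
```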

6.5.3 Dialog Act Recogniser

DAR classifies a user utterance into a corresponding dialog act class. It can use any classification model, such as Naive Bayes, Maximum Entropy or Support Vector Machines (SVM). In the current implementation, DAR uses a Bi-LSTM classifier [128] trained on the Switchboard Dialogue Act Corpus [113], available online2, which contains 1155 human-to-human conversations with dialog act annotations.

1https://github.com/pytransitions/transitions
2https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz
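As an illustration only, a Bi-LSTM utterance classifier of the kind DAR relies on could look like the following sketch in PyTorch; it is not the implementation from [128], and the hyperparameters and names are assumptions:

```python
import torch
import torch.nn as nn

class DialogActClassifier(nn.Module):
    """Bi-LSTM classifier: token ids of one utterance -> dialog act class."""

    def __init__(self, vocab_size, num_acts, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_acts)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.bilstm(embedded)  # final hidden states
        # Concatenate the last forward and backward hidden states.
        utterance_repr = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(utterance_repr)  # (batch, num_acts)

# e.g. the act classes used in this chapter:
# ["U-New Intent", "C-Request Information", "U-Provide Information",
#  "U-Provide Nested Intent", "C-Provide Information"]
```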

6.5.4 Slot Memory Service

“Remembering” the information that user provides in each turn is an essential feature for chatbots. This feature is indispensable when it comes to multi-intent conversations, where conversations involve several intents and slot values might be missing in some conversation turns [69, 264]. Having a component that helps recall such information from user utterances throughout the conversation is therefore necessary. This is where the Slot Memory (SM) service comes into play:

• It records extracted information from user utterances (e.g. intents, slots/values) sourced from the utterance parser component in each turn of the conversation.

• It updates the value of slots (e.g. location) when user provides new values (e.g. “George street”) in later turns of the conversation.

There are two sources that the SM service can acquire information from: (i) conversation history, and (ii) user preferences. The current version of SM maintains conversation history only.

SM uses Redis3 to store, update and fetch slot/value information per intent. Redis is a key-value memory database. Thus, each key represents an intent (e.g. SearchBusiness) and the value is a set of slot/value pairs (e.g. “location”: “Barker street”).
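A minimal sketch of this storage scheme is shown below; using a Redis hash per intent (via the redis-py client) is one way to realise the intent-to-slot/value mapping, and the concrete keys and values are the running examples from this chapter:

```python
import redis

# Each Redis key is an intent; its value holds the slot -> value pairs
# recorded for that intent so far.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.hset("SearchBusiness", mapping={"term": "italian restaurant",
                                  "location": "Barker street"})

# User provides a new value in a later turn: the slot is overwritten.
r.hset("SearchBusiness", "location", "George street")

print(r.hgetall("SearchBusiness"))
# {'term': 'italian restaurant', 'location': 'George street'}
```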

6.5.5 User-Chatbot Conversation Scenarios

Scenario A - All slot values are provided by user
Given a user utterance, e.g. “Show me list of Italian restaurants in Kingsford”, the conversation manager first triggers the UP component to parse the utterance. The output is the intent SearchBusiness together with the slot/value pairs [“term”: “italian restaurant”, “location”: “Kingsford”]. Next, the conversation manager triggers DAR to detect the user dialog act of the utterance. The output of DAR is the U-New Intent dialog act. Based on the user intent and dialog act, SMG generates a “new intent-state” transition from “Greeting” (current intent-state) to the “SearchBusiness” intent-state (as shown in Figure 6.3). After this transition, the conversation manager notifies SMG that all the required slots are fulfilled. Thus the state machine moves to the “basic state” inside the “SearchBusiness” intent-state.

3https://redis.io/

In this state, the conversation manager triggers APM to invoke the Yelp API because there are no missing slot values. The result of the API invocation is then passed to BRG to generate a natural-sounding response such as “There is one restaurant found: Mamma Teresa in ‘412 Anzac Parade ...’”.

Scenario B - Missing slot value is provided by user
User now asks for the weather forecast by uttering “Do I need umbrella today?”. After extracting the user intent (e.g. CheckWeather) and dialog act (U-New Intent), SMG generates a “new intent-state” transition. Thus the state machine moves from “SearchBusiness” (current intent-state) to the “CheckWeather” intent-state. Next, the conversation manager notifies SMG that the “location” slot is missing. Thus, SMG generates a “nested slot.value” transition. The state machine moves to the “nested slot-value” state to obtain the value of the “location” slot. In this state, chatbot requests user to fill the “location” slot by asking “Where are you?”. User replies with “I’m in Sydney”. As the user dialog act is U-Provide Info, the state machine goes back to the parent state “CheckWeather” and triggers the OpenWeatherMap API to retrieve the weather condition (because all slot values are available).

Scenario C - Missing slot value is processed by chatbot
In this case, the user asks for a table reservation by uttering “Book a table in Time for Thai for two”. This utterance carries the BookRestaurant intent with [“name”: “Time for Thai”, “numPeople”: “2”] as slot/value pairs. The utterance is annotated with a U-New Intent dialog act. However, the “bookingDate” slot is missing. Thus, the state machine moves to the “nested slot-value” state based on the transition generated by SMG. In this case, when the chatbot asks user to fulfill the missing value of the “bookingDate” slot, user replies with “When is my birthday?”. The answer from user is annotated with a U-Provide Nested Intent dialog act. Thus, SMG generates a “nested slot.intent” transition and the state machine moves to the nested slot-intent state “GetUserDetails” accordingly. From this state, an API is invoked to retrieve the user’s birthday. Next, the state machine moves back to the parent “BookRestaurant” intent-state, where the OpenTable API is invoked to book a table because all slot values are available and hence the API parameter values are known.
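As a usage note, Scenario B maps onto the machine sketched in Section 6.5.2 (with GetWeather standing in for CheckWeather); the trigger names are the assumptions introduced there:

```python
bot.trigger("new_intent")    # "Do I need umbrella today?" -> GetWeather
bot.trigger("request_slot")  # "location" missing -> GetWeather_location ("Where are you?")
bot.trigger("slot_filled")   # "I'm in Sydney" -> GetWeather_basic; the API can be invoked
print(bot.state)             # "GetWeather_basic"
```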

Figure 6.7: Bot Builder Architecture - Automated Chatbot Development. [The Bot Builder receives a bot specification from the developer and fetches API knowledge from API-KG; the Assembler combines template codes (templates/rules) to build the webhook server, and the Integrator instantiates the conversation manager.]

6.6 Extended Bot Builder - Conversations Support

In this section, we explain how bot developers can build chatbots that support multi intent - multi turn conversations. We extended the capabilities of the Bot Builder service introduced in Chapter 5 to achieve this objective. There are two deployment options: (i) using the conversation manager provided by existing bot platforms, or (ii) using our proposed conversation manager. In the current implementation, Bot Builder supports the first option.

In Chapter 5, we provided API-KG along with Knowledge Explorer to (i) provide a rich source of knowledge about APIs, and (ii) facilitate the process of finding target APIs/methods that are relevant to develop bots. Together with these two services, we also introduced Bot Builder, a service to semi-automate the chatbot development process. The first version of Bot Builder was designed to support single intent - single turn conversations. The extended version is designed to support multi intent - multi turn conversations. Thus, in the rest of this section, the term Bot Builder refers to the extended version.

6.6.1 The Bot Builder Architecture

Figure 6.7 provides the architectural view of Bot Builder. The Bot Builder relies on a set of components to semi-automate chatbot development. When Bot Builder receives a bot specification (i.e., a set of user intents), it fetches the required knowledge from API-KG and triggers both the Integrator and Assembler components for further action.

Assembler. This component combines code templates (see details in Chapter 5) to generate a webhook server written in the developer’s selected programming language (e.g. Python1). The output of the Assembler is fully-functional and ready-to-deploy source code for a webhook server. This source code is reused and customized by bot developers.

Integrator. This component is used to instantiate a conversation manager in a third-party platform (e.g. DialogFlow). To support this functionality, we developed extensions called integration plugins, embedded inside the Integrator component. Figure 6.7 shows these plugins. Each plugin is a program that takes as input the details of an intent (e.g. utterances with annotations, slots with values) retrieved from API-KG, and exports this information to the conversation manager. This is done by exploiting the REST APIs provided by the implementation platform (e.g. DialogFlow).
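A minimal sketch of such a plugin is shown below; the payload shape follows Dialogflow’s v2 REST API for creating intents, but the endpoint details should be checked against the platform documentation, and the token, project, and intent dictionary layout are assumptions:

```python
import requests

def export_intent(intent, token, project):
    """Export an API-KG intent (name, utterances, slots) to Dialogflow."""
    payload = {
        "displayName": intent["name"],
        "trainingPhrases": [
            {"type": "EXAMPLE", "parts": [{"text": utterance}]}
            for utterance in intent["utterances"]
        ],
        "parameters": [
            {"displayName": slot, "entityTypeDisplayName": "@sys.any"}
            for slot in intent["slots"]
        ],
    }
    url = f"https://dialogflow.googleapis.com/v2/projects/{project}/agent/intents"
    response = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    return response.json()
```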

6.6.2 Defining Bot Specification

Building chatbots with Bot Builder involves the specification of desired user intents. The bot developer specifies a desired user intent that the bot should support as follows: (i) by stating a goal (e.g. “find cafe and restaurants”), or (ii) by providing example utterances, i.e., by stating user utterances the bot should understand and act upon (such as “give me name of a Chinese restaurant near Central Park”). Based on the specification of a desired intent, the Bot Builder retrieves actual intents using Knowledge Explorer (see details in Chapter 5). The result is a set of intents that can be used to deploy an instance of a bot, as explained in Section 6.4.1.

1The current implementation supports template codes written in Python.

6.7 Validation

We ran a user study, similar to our experiment in Chapter 5, to validate our approach. We involved developers and explored if and how Bot Builder can effectively help in deploying trained chatbots over multiple intents.

6.7.1 Chatbot Development by Developers

Participants. We involved participants with experience in building chatbots. We included PhD students in Computer Science who had no connection with our project.

Study design. Each participant was given a scenario for which they were asked to build a chatbot. Participants were given three possible scenarios to choose from, namely video search, weather forecasting and restaurant search. Participants were asked to use at least two scenarios in their chatbots. This way, they could check whether their chatbots are able to manage multiple intents or not. After developing and deploying their chatbots, participants were asked to answer a follow-up questionnaire.

Findings. Participants mentioned two main benefits of using Bot Builder. Firstly, the ability to train models and build state machines for multiple intents automatically:

“It minimises the need for technical knowledge; specially the specification of APIs and train- ing complex bots like building multi domain bots.”

“The feature that I like was auto training. The tool trained the bot platform directly without any problem. Also, we can get the code for state machine and that’s nice! in case if I want to change or understand how it works.”

“As a developer, I’m always concern about privacy. I liked your tool that trains the models and auto-generates the code; but what I like the most is that you give everything to developer to check and run.”

Secondly, the ability to generate and provide the source code for the webhook service:

“it is a great idea to reduce the workload for building a multi-topic chatbots for developers, it looks after the initial requirements. I do prefer to have a small code in hand to start from, rather than starting from scratch.”

“Given source code is in good quality; although it is not in my coding style, it was easy to follow and modify as I did a bit.”

Furthermore, when asked the question “Does your chatbot answer your questions about multiple topics?”, participants responded:

“it did, except a few attempts; rephrasing my expressions helped the bot to finally recognise my intentions. Interestingly, in one case, it also asked me to provide more information about a missed entity in my expression."

“It is a great idea to have a Chatbot that would respond to different topics. ( more like a real helpdesk) an intelligent bot or I would call it a MegaBot.”

“it answered most of my questions correctly, although some of the answers were not correct, as it seems more training data would help to improve the performance of my bot.”

When asked the question “How does bot development compare when using Bot Builder versus existing solutions?”, participants responded:

“Considering manual efforts required when training bots, I believe it facilitates the develop- ment process by and large.”

“This is a great tool for bot developers, it can be used as a template that has some predefined features with the capacity to be programmed for more advanced chatbots. As a start point, it can also reduce the workload for developers.”

Nevertheless, one of the participants also mentioned that “template codes should be also in other languages such as Java as some developers prefer Java more that Python”. In summary, the feedback was both positive and useful in terms of suggestions for possible extensions that accommodate the preferences of different users.

6.8 Conclusions and Future Work

In this chapter we proposed a novel approach for the management of multi-intent multi-turn conversations based on an extended HSM model. The proposed techniques leverage dialog acts to characterize conversation patterns. We proposed state machine generation techniques to support the automated initiation, monitoring and control of conversations. Furthermore, we extended Bot Builder to support the effective and automated building and deployment of chatbots that support multi-intent conversations.

Our work also comes with its own limitations and space for possible improvements. For example, the current version of Bot Builder does not yet support building chatbots with more complex requirements such as complex intent (de)compositions. The studies we ran in this work are qualitative in nature and limited in the number of participants. We plan to extend our current study with quantitative studies, exploring how chatbots with our conversation manager perform compared to chatbots using conversation managers provided by third-party platforms.

Chapter 7

Conclusion

This chapter wraps up the thesis with a summary of the undertaken research issues (Section 7.1) and a summary of the research outcomes (Section 7.2). We also discuss some future research directions (Section 7.3).

7.1 Summary of the Research Issues

Our research in this thesis presented multiple novel concepts and techniques to fill critical gaps in bridging ambiguous natural language conversations with conversational services. In the first study (Chapter 3), we proposed extensions of term embedding techniques to support natural language interactions with unstructured textual information items (e.g. emails, PDF documents) collected through law enforcement investigations. In the second study (Chapter 4), we proposed attribute embedding techniques to support flexible natural language queries over vulnerability information stored in heterogeneous data services. In the third study (Chapter 5), we introduced latent knowledge-driven middleware services and techniques to facilitate the interactions between users, bots and APIs through intent-based conversations (e.g., task-oriented conversational bots). In Chapter 6, we extended state-machine based models and techniques to support multi-turn multi-intent conversation patterns.

7.2 Summary of the Research Outcomes

The first outcome, introduced in Chapter 3, consists of a framework and techniques that leverage event embeddings for the automated data curation and tagging of evidence items and their linkage to investigation meta-data. We focused on the automated recognition of entities and events (e.g., people, phone numbers, phone calls) from unstructured evidence items. The experimental results obtained by applying our approach to a real legal dataset demonstrate the feasibility of the proposed techniques by achieving good performance in the task of automatically recognising and tagging entities and events.

The second outcome, presented in Chapter 4, consists of a novel attribute embedding indexing mechanism to mitigate the complexity of querying vulnerability information through integrated indexing of heterogeneous information sources. The proposed techniques transform traditional schema-based indexing into attribute embedding based indexing. We showed that attribute-based embedding can support flexible, schema-less natural language queries over indexed data. While we use vulnerability information as context, our proposed techniques are applicable to other domains as well.

The third outcome, introduced in Chapter 5, consists of a latent knowledge-driven middleware that represents intent and API elements (i.e., API methods, API call parameters, API and intent descriptions) as vectors (i.e., representations of element meanings and associations) in a vector space by extending term embedding techniques. This representation makes it easier to search API repositories for relevant methods by uttering how we would request such functionality in natural language. We showed how to effectively build such embeddings and how we can facilitate and semi-automate the mapping of user intents and utterances to a potentially large and evolving set of APIs. We also validated our approach by assessing the benefits in terms of effectiveness of the API search and ease of deployment of bots over the identified APIs of interest.

Finally, the fourth outcome, introduced in Chapter 6, consists of a novel conversation model and engine for multi-intent multi-turn conversations. The proposed model extends hierarchical states to cater for wide variations in potentially complex interactions between users, bots, and APIs. We showed that the proposed approach empowers conversational systems to manage complex user interactions.

7.3 Future Research Directions

The future has in store many further exciting opportunities. Leveraging the approach we presented in Chapter 3 can help codify offenses from legislation, correlate collections of events, and map them to offenses. This can help automate the task of determining whether the elements of an offense can indeed be substantiated. Moreover, the foundations of the proposed approach can be extended to other domains involving knowledge-driven investigative tasks (e.g., science and research).

Considering the development of knowledge graphs for supporting more complex queries beyond attribute-based queries (e.g., relationship-based queries) is one possible extension of the approach we proposed in Chapter 4. Using alternative sources (e.g., Twitter) for obtaining updates on the latest cybersecurity developments (e.g., 0-day vulnerabilities and attacks) is another future direction.

Extending the Intent-API interaction patterns supported by the knowledge-driven middleware described in Chapter 5 is another future research direction. While our proposed approach focused on the Intent-SingleAPI interaction pattern, other patterns, Intent-APIGroup and Intent-CompositeAPI, can be captured in conversational services. In an Intent-APIGroup, the interaction style is a container of substitutable API calls (e.g., providing similar capabilities but with different qualities of service, such as Dropbox and GoogleDrive). The interaction style provides a description of a desired API call category. A specific API call is selected by the system based on user requirements (e.g., preferences for quality of service like price, preferences for the location of the service). On the other hand, the Intent-CompositeAPI pattern provides support for complex and multi-intent user requests and tasks. The interaction style of this pattern refers to a composition (e.g., a sequence of API calls) that aggregates multiple API calls (e.g., a flight booking API call and a hotel booking API call), either statically or dynamically, to realise complex user utterances, i.e., utterances that involve more than one intent (e.g., “booking a hotel and a flight”).

Bibliography

[1] Eytan Adar, Mira Dontcheva, and Gierad Laput, Commandspace: Modeling the relationships between tasks, descriptions and features, Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (New York, NY, USA), UIST ’14, ACM, 2014, pp. 167–176.

[2] Mortada Al-Banna, Crowdsourcing software vulnerability discovery: Expertise indicators, organizations perceptions and quality control., Ph.D. Thesis, Computer Science and Engineering, Faculty of Engineering, UNSW, 2018.

[3] Ala Al-Fuqaha, Abdallah Khreishah, Mohsen Guizani, Ammar Rayes, and Mehdi Mohammadi, Toward better horizontal integration among iot services, IEEE Communications Magazine 53 (2015), no. 9, 72–79.

[4] Noora Al Mutawa, Ibrahim Baggili, and Andrew Marrington, Forensic analysis of social networking applications on mobile devices, Digital Investigation 9 (2012), S24–S33.

[5] J Allen and C R Perrault, Readings in natural language processing, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1986, pp. 441–458.

[6] Douglas G Altman, Practical statistics for medical research, CRC press, 1990.

[7] Jon Ezeiza Alvarez, A review of word embedding and document similarity algorithms applied to academic text, 2017.

[8] Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch, Natural language interfaces to databases - an introduction, CoRR cmp-lg/9503016 (1995).

[9] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning, Leveraging linguistic structure for open domain information extraction, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2015, pp. 344–354.

[10] Ram G Athreya, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck, Enhancing community interactions with data-driven chatbots – the dbpedia chatbot, WWW, 2018, p. 4.

[11] Ram G. Athreya, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck, Enhancing community interactions with data-driven chatbots – the dbpedia chatbot, Companion Proceedings of the The Web Conference 2018 (Republic and Canton of Geneva, Switzerland), WWW ’18, International World Wide Web Conferences Steering Committee, 2018, pp. 143–146.

[12] Chris Baber, Paul Smith, James Cross, John E Hunter, and Richard McMaster, Crime scene investigation as distributed cognition, Pragmatics & Cognition 14 (2006), no. 2, 357–385.

[13] Petr Babkin, Md Faisal Mahbub Chowdhury, Alfio Gliozzo, Martin Hirzel, and Avraham Shinnar, Bootstrapping chatbots for novel domains, Workshop at NIPS on Learning with Limited Labeled Data (LLD), 2017. https://lld-workshop.github.io/papers/LLD_2017_paper_10.pdf

[14] Jianbo Bai, Hong Xiao, Xianghua Yang, and Guofang Zhang, Study on integration technologies of building automation systems based on web services, 2009 ISECS International Colloquium on Computing, Communication, Control, and Management, vol. 4, IEEE, 2009, pp. 262–266.

[15] Rucha Bapat, Pavel Kucherbaev, and Alessandro Bozzon, Effective crowdsourced generation of training data for chatbots natural language understanding, Web Engineering (Cham) (Tommi Mikkonen, Ralf Klamma, and Juan Hernández, eds.), Springer International Publishing, 2018, pp. 114–128.

[16] Rucha Bapat, Pavel Kucherbaev, and Alessandro Bozzon, Effective crowdsourced generation of training data for chatbots natural language understanding, Web Engineering, 2018.

[17] Abdur Rahman MA Basher and Benjamin CM Fung, Analyzing topics and authors in chat logs for crime investigation, Knowledge and information systems 39 (2014), no. 2, 351–381.

[18] Boualem Benatallah, Fabio Casati, Daniela Grigori, Hamid R Motahari Nezhad, and Farouk Toumani, Developing adapters for web services integration, International Conference on Advanced Information Systems Engineering, Springer, 2005, pp. 415–429.

[19] Boualem Benatallah, Mohand-Said Hacid, Alain Leger, Christophe Rey, and Farouk Toumani, On automating web services discovery, The VLDB Journal – The International Journal on Very Large Data Bases 14 (2005), no. 1, 84–96.

[20] Yoshua Bengio, Aaron Courville, and Pascal Vincent, Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35 (2013), no. 8, 1798–1828.

[21] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, A neural probabilistic language model, Journal of machine learning research 3 (2003), no. Feb, 1137–1155.

[22] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).

[23] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017), 135–146.

[24] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Kalai, Man is to computer programmer as woman is to homemaker? debiasing word embeddings, CoRR abs/1607.06520 (2016).

[25] Rajesh Bordawekar, Bortik Bandyopadhyay, and Oded Shmueli, Cognitive database: A step towards endowing relational databases with artificial intelligence capabilities, arXiv preprint arXiv:1712.07199 (2017).

[26] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio, Learning structured embeddings of knowledge bases, Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

[27] Léon Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.

[28] Athman Bouguettaya, Munindar Singh, Michael Huhns, Quan Z Sheng, Hai Dong, Qi Yu, Azadeh Ghari Neiat, Sajib Mistry, Boualem Benatallah, Brahim Medjahed, et al., A service computing manifesto: the next 10 years, Communications of the ACM 60 (2017), no. 4, 64–72.

[29] Gerlof Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL (2009), 31–40.

[30] Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer, The mathematics of statistical machine translation: Parameter estimation, Computational linguistics 19 (1993), no. 2, 263–311.

[31] Harry Bunt, The semantics of dialogue acts, Proceedings of the Ninth International Conference on Computational Semantics, Association for Computational Linguistics, 2011, pp. 1–13.

[32] Harry Bunt, Jan Alexandersson, Jean Carletta, Jae-Woong Choe, Alex Chengyu Fang, Koiti Hasida, Kiyong Lee, Volha Petukhova, Andrei Popescu-Belis, Laurent Romary, et al., Towards an iso standard for dialogue act annotation, Seventh conference on International Language Resources and Evaluation (LREC’10), 2010.

[33] Giovanni Campagna, Rakesh Ramesh, Silei Xu, Michael Fischer, and Monica S Lam, Almond: The architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant, Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 341–350.

[34] Hanyang Cao, Jean-Rémy Falleri, and Xavier Blanc, Automated generation of rest api specification from plain html documentation, Service-Oriented Computing (Cham) (Michael Maximilien, Antonio Vallecillo, Jianmin Wang, and Marc Oriol, eds.), Springer International Publishing, 2017, pp. 453–461.

[35] Jacky Casas, Elena Mugellini, and Omar Abou Khaled, Food diary coaching chatbot, Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, ACM, 2018, pp. 1676–1680.

[36] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al., Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).

[37] Wing-Kwan Chan, Hong Cheng, and David Lo, Searching connected api subgraph via text phrases, Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, 2012, p. 10.

[38] Michael Chau, Jennifer J Xu, and Hsinchun Chen, Extracting meaningful entities from police narrative reports, Proceedings of the 2002 annual national conference on Digital government research, Digital Government Society of North America, 2002, pp. 1–5.

[39] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang, A survey on dialogue systems: Recent advances and new frontiers, ACM SIGKDD Explorations Newsletter 19 (2017), no. 2, 25–35.

[40] Minmin Chen, Efficient vector representation for documents through corruption, CoRR abs/1707.02377 (2017).

[41] Yun-Nung Chen, Dilek Hakkani-Tür, Jianfeng Gao, and Li Deng, End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding., 2016.

[42] Timothy Chklovski, Collecting paraphrase corpora from volunteer contributors, K-CAP, 2005, pp. 115–120.

[43] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, Learning phrase representations using rnn encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).

[44] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).

[45] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon, Novelty and diversity in information retrieval evaluation, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2008, pp. 659–666.

[46] Jacob Cohen, A coefficient of agreement for nominal scales, Educational and psychological measurement 20 (1960), no. 1, 37–46.

[47] K. M. Colby, Human-computer conversation in a cognitive therapy program, pp. 9–19, Springer US, Boston, MA, 1999.

[48] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa, Natural language processing (almost) from scratch, Journal of machine learning research 12 (2011), no. Aug, 2493–2537.

[49] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes, Supervised learning of universal sentence representations from natural language inference data, arXiv preprint arXiv:1705.02364 (2017).

[50] Carlos Coronel and Steven Morris, Database systems: design, implementation, & management, Cengage Learning, 2016.

[51] Justin Cranshaw, Emad Elwany, Todd Newman, Rafal Kocielnik, Bowen Yu, Sandeep Soni, Jaime Teevan, and Andrés Monroy-Hernández, Calendar.help: Designing a workflow-based scheduling agent with humans in the loop, Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’17, ACM, 2017, pp. 2382–2393.

[52] Francisco Curbera, Matthew Duftler, Rania Khalaf, William Nagy, Nirmal Mukhi, and Sanjiva Weerawarana, Unraveling the web services web: an introduction to soap, wsdl, and uddi, IEEE Internet computing 6 (2002), no. 2, 86–93.

[53] Andrew M Dai, Christopher Olah, and Quoc V Le, Document embedding with paragraph vectors, arXiv preprint arXiv:1507.07998 (2015).

[54] Kirtana Darabal, Vulnerability Exploration and Understanding Services, Master Thesis, Computer Science and Engineering, Faculty of Engineering, UNSW, 2018.

[55] Arjun Das, Debasis Ganguly, and Utpal Garain, Named entity recognition with word embeddings and wikipedia categories for a low-resource language, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 16 (2017), no. 3, 18:1–18:19.

[56] Cedric De Boom, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt, Representation learning for very short texts using weighted word embedding aggregation, Pattern Recognition Letters 80 (2016), 150–156.

[57] Sergio Decherchi, Simone Tacconi, Judith Redi, Alessio Leoncini, Fabio Sangiacomo, and Rodolfo Zunino, Text clustering for digital forensics analysis, Computational Intelligence in Security for Information Systems, Springer, 2009, pp. 29–36.

[58] Mathieu Dehouck and Pascal Denis, Delexicalized word embeddings for cross-lingual dependency parsing, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 241–250.

[59] Li Deng, Shuo Zhang, and Krisztian Balog, Table2vec: Neural word and entity embeddings for table population and retrieval, arXiv preprint arXiv:1906.00041 (2019).

[60] Anoop Deoras and Ruhi Sarikaya, Deep belief network based semantic taggers for spoken language understanding., 2013.

[61] Kedar Dhamdhere, Kevin S McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan, Analyza: Exploring data with conversation, Proceedings of the 22nd International Conference on Intelligent User Interfaces, ACM, 2017, pp. 493–504.

[62] Dua Dheeru and Efi Karra Taniskidou, UCI machine learning repository, 2017, [Online; last accessed 26 March 2019].

[63] R Emerson Dobash and Russell P Dobash, The nature and antecedents of violent events, The British Journal of Criminology 24 (1984), no. 3, 269–288.

[64] Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, and Jens Lehmann, Earl: Joint entity and relation linking for question answering over knowledge graphs, The Semantic Web – ISWC 2018 (Cham) (Denny Vrandečić, Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, and Elena Simperl, eds.), Springer International Publishing, 2018, pp. 108–126.

[65] Vasiliki Efstathiou, Christos Chatzilenas, and Diomidis Spinellis, Word embeddings for the software engineering domain, Proceedings of the 15th International Conference on Mining Software Repositories (New York, NY, USA), MSR ’18, ACM, 2018, pp. 38–41.

[66] Anthony Elliott, The culture of ai: Everyday life and the digital revolution, Routledge, 2019.

[67] Mihail Eric and Christopher D Manning, Key-value retrieval networks for task-oriented dialogue, arXiv preprint arXiv:1705.05414 (2017).

[68] Roghayeh Fakouri-Kapourchali, Mohammad-Ali Yaghoub-Zadeh-Fard, and Mehdi Khalili, Semantic textual similarity as a service, Service Research and Innovation, Springer, 2015, pp. 203–215.

[69] Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein, Iris: A conversational agent for complex tasks, CHI ’18, 2018.

[70] Ethan Fast, William McGrath, Pranav Rajpurkar, and Michael S Bernstein, Augur: Mining human behaviors from fiction to power interactive systems, Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ACM, 2016, pp. 237–247.

[71] Ethan Fast, William McGrath, Pranav Rajpurkar, and Michael S. Bernstein, Augur: Mining human behaviors from fiction to power interactive systems, Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’16, ACM, 2016, pp. 237–247.

[72] Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner, Web data extraction, applications and techniques: A survey, Knowledge-based systems 70 (2014), 301–323.

[73] McKenzie Fetzer, How apis are changing the business landscape, Available: https://blog.cloud-elements.com/apis-changing-business-landscape, 2015, [Online; accessed 7-September-2019].

[74] Adam Fourney, Richard Mann, and Michael Terry, Query-feature graphs: Bridging user vocabulary and system functionality, Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (New York, NY, USA), UIST ’11, ACM, 2011, pp. 207–216.

[75] Susan Galer, Ai predictions 2019: Humans work side-by-side digital ‘co-workers’ in 40 percent of companies, Available: https://www.forbes.com/sites/sap/2019/02/12/ai-predictions-2019-humans-work-side-by-side-digital-co-workers-in-40-percent-of-companies/793cd7685819, 2019, [Online; accessed 10-September-2019].

[76] Filippo Galgani, Paul Compton, and Achim Hoffmann, Citation based summarisation of legal texts, Pacific Rim International Conference on Artificial Intelligence, Springer, 2012, pp. 40–52.

[77] Jianfeng Gao, Michel Galley, Lihong Li, et al., Neural approaches to conversational ai, Foundations and Trends® in Information Retrieval 13 (2019), no. 2-3, 127–298.

[78] Majid Ghasemi-Gol and Pedro Szekely, Tabvec: Table vectors for classification of web tables, arXiv preprint arXiv:1802.06290 (2018).

[79] Nguyen Giang, Survey on script-based languages to write a chatbot, Available: https://www.slideshare.net/NguyenGiang102/survey-on-scriptbased-languages-to-write-a-chatbot, 2018, [Online; accessed 15-July-2019].

[80] Laurence Goasduff, Chatbots will appeal to modern workers, Available: https://www.gartner.com/smarterwithgartner/chatbots-will-appeal-to-modern-workers/, 2019, [Online; accessed 10-September-2019].

[81] Clinton Gormley and Zachary Tong, Elasticsearch: The definitive guide, 1st ed., O’Reilly Media, Inc., 2015.

[82] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha, The java language specification, Addison-Wesley Professional, 2000.

[83] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim, Deep api learning, Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, 2016, pp. 631–642.

[84] Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin, Dialog-to-action: Conversational question answering over a large-scale knowledge base, Advances in Neural Information Processing Systems 31 (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), Curran Associates, Inc., 2018, pp. 2942–2951.

[85] Tihomir Gvero and Viktor Kuncak, Synthesizing java expressions from free-form queries, Acm Sigplan Notices 50 (2015), no. 10, 416–432.

[86] H. Hashemi and A. Asiaee, Query intent detection using convolutional neural networks.

[87] Scott Heimendinger, Flattening multi-dimensional data sets into de-normalized form, May 27 2010, US Patent App. 12/324,538.

[88] Marco Helbich, Julian Hagenauer, Michael Leitner, and Ricky Edwards, Exploration of unstructured narrative crime reports: an unsupervised neural network and point pattern analysis approach, Cartography and Geographic Information Science 40 (2013), no. 4, 326–336.

[89] Matthew Henderson, Pawel Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrksic, Georgios P. Spithourakis, Pei-hao Su, Ivan Vulic, and Tsung-Hsien Wen, A repository of conversational datasets, CoRR abs/1904.06472 (2019).

[90] Matthew S Henderson, Discriminative methods for statistical spoken dialogue systems, Ph.D. thesis, University of Cambridge, 2015.

[91] Julia Hirschberg and Christopher D Manning, Advances in natural language processing, Science 349 (2015), no. 6245, 261–266.

[92] Pascal Hitzler, Markus Krotzsch, and Sebastian Rudolph, Foundations of semantic web technologies, CRC press, 2009.

[93] Raphael Hoffmann, James Fogarty, and Daniel S. Weld, Assieme: Finding and leveraging implicit references in a web search interface for programmers, Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology (New York, NY, USA), UIST ’07, ACM, 2007, pp. 13–22.

[94] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju, Touch your heart: A tone-aware chatbot for customer care on social media, Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’18, ACM, 2018, pp. 415:1–415:12.

[95] C. Huang, L. Yao, X. Wang, B. Benatallah, and Q. Z. Sheng, Expert as a service: Software expert recommendation via knowledge domain embeddings in stack overflow, 2017 IEEE International Conference on Web Services (ICWS), June 2017, pp. 317–324.

[96] Chaoran Huang, Lina Yao, Xianzhi Wang, Boualem Benatallah, and Xiang Zhang, Software expert discovery via knowledge domain embeddings in a collaborative network, Pattern Recognition Letters (2018).

[97] Jizhou Huang, Ming Zhou, and Dan Yang, Extracting chatbot knowledge from online discussion forums., IJCAI, vol. 7, 2007, pp. 423–428.

[98] Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang, Api method recommendation without worrying about the task-api knowledge gap, ASE 2018, ACM, 2018, pp. 293–304.

Page 174 of 194 [99] Ting-Hao Kenneth Huang, Joseph Chee Chang, and Jeffrey P Bigham, Evorus: A crowd-powered conversational assistant built to automate itself over time, Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ACM, 2018, p. 295.

[100] Ting-Hao Kenneth Huang, Walter S Lasecki, and Jeffrey P Bigham, Guardian: A crowd-powered spoken dialog system for web apis, Third AAAI conference on human computation and crowdsourcing, AAAI Press, 2015.

[101] Zhiheng Huang, Wei Xu, and Kai Yu, Bidirectional lstm-crf models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).

[102] Richard Hull and Hamid R Motahari Nezhad, Rethinking bpm in a cognitive world: Transforming how we learn and perform business processes, International Conference on Business Process Management, Springer, 2016, pp. 3–19.

[103] Ian Hutchby and Robin Wooffitt, Conversation analysis, Polity, 2008.

[104] Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli, Embeddings for word sense disambiguation: An evaluation study, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 897–907.

[105] Vladimir Ilievski, Claudiu Musat, Andreea Hossmann, and Michael Baeriswyl, Goal-oriented chatbot dialog management bootstrapping with transfer learning, Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 4115–4121.

[106] Australasian Legal Information Institute, Austlii: Free, comprehensive and independent access to australasian law, Available: www.austlii.edu.au, 2018, [Online; accessed 7-May-2018].

[107] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III, Deep unordered composition rivals syntactic methods for text classification, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2015, pp. 1681–1691.

[108] Minwoo Jeong and G. Geunbae Lee, Triangular-chain conditional random fields, Trans. Audio, Speech and Lang. Proc. 16 (2008), no. 7, 1287–1302.

[109] Rogers Jeffrey Leo John, Navneet Potti, and Jignesh M Patel, Ava: From data to insights through conversations.

[110] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov, Bag of tricks for efficient text classification, arXiv preprint arXiv:1607.01759 (2016).

[111] Dan Jurafsky and James H Martin, Speech and language processing, vol. 3, Pearson London, 2017.

[112] Daniel Jurafsky and James H. Martin, Speech and language processing (2nd edition), Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.

[113] Daniel Jurafsky, Elizabeth Shriberg, and Debra Biasca, Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13, Tech. Report 97-02, University of Colorado, Boulder Institute of Cognitive Science, Boulder, CO, 1997.

[114] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore, Reinforcement learning: A survey, Journal of artificial intelligence research 4 (1996).

[115] Panos Kampanakis, Security automation and threat information-sharing options, IEEE Security & Privacy 12 (2014), no. 5, 42–51.

[116] Tom Kaneshige and Daniel Hong, Predictions 2019: This is the year to invest in humans, as backlash against chatbots and ai begins, Available: https://go.forrester.com/blogs/predictions-2019-chatbots-and-ai-backlash/, 2018, [Online; accessed 10-September-2019].

[117] Esther Kaufmann, Abraham Bernstein, and Renato Zumstein, Querix: A natural language interface to query ontologies based on clarification dialogs, ISWC, 2006, pp. 980–981.

[118] Tom Kenter, Alexey Borisov, and Maarten de Rijke, Siamese cbow: Optimizing word embeddings for sentence representations, arXiv preprint arXiv:1606.04640 (2016).

[119] Mohammad Reza Keyvanpour, Mostafa Javideh, and Mohammad Reza Ebrahimi, Detecting and investigating crime by means of data mining: a general crime matching framework, Procedia Computer Science 3 (2011), 872–880.

[120] Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin Cao, and Ye-Yi Wang, Intent detection using semantically enriched word embeddings, Spoken Language Technology Workshop (SLT), 2016 IEEE, IEEE, 2016, pp. 414–419.

[121] Yoon Kim, Convolutional neural networks for sentence classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014).

[122] Wilhelm Kirch (ed.), Pearson’s correlation coefficient, pp. 1090–1091, Springer Netherlands, Dordrecht, 2008.

[123] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, Skip-thought vectors, Advances in neural information processing systems, 2015, pp. 3294–3302.

[124] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al., Moses: Open source toolkit for statistical machine translation, Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, 2007, pp. 177–180.

[125] Teuvo Kohonen, The self-organizing map, Proceedings of the IEEE 78 (1990), no. 9.

[126] Ravi Kondadadi, Blake Howald, and Frank Schilder, A statistical nlg framework for aggregated planning and realization, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 1406–1415.

[127] Chih Hao Ku, Alicia Iriberri, and Gondy Leroy, Natural language processing and e-government: crime information extraction from heterogeneous data sources, Proceedings of the 2008 international conference on Digital government research, Digital Government Society of North America, 2008, pp. 162–170.

[128] Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, and Sachindra Joshi, Dialogue act sequence labeling using hierarchical encoder with crf, AAAI ’18, 2018.

[129] Ravi Kumar and K Raghuveer, Legal document summarization using latent dirichlet allocation, International Journal of Computer Science and Telecommunications 3 (2012), 114–117.

[130] Hong-Kwang Kuo and Vaibhava Goel, Active learning with minimum expected error for spoken language understanding, 2005, pp. 437–440.

[131] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger, From word embeddings to document distances, Proceedings of the 32nd International Conference on Machine Learning (ICML 2015) (2015), 957–966.

[132] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the Eighteenth International Conference on Machine Learning (San Francisco, CA, USA), ICML ’01, Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.

[133] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao, Recurrent convolutional neural networks for text classification, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, AAAI Press, 2015, pp. 2267–2273.

[134] An Ngoc Lam, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen, Combining deep learning with information retrieval to localize buggy files for bug reports (n), Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, IEEE, 2015, pp. 476–481.

[135] Amy N. Langville and Carl D. Meyer, Google’s pagerank and beyond: The science of search engine rankings, Princeton University Press, Princeton, NJ, USA, 2012.

[136] Quoc Le and Tomas Mikolov, Distributed representations of sentences and documents, Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, JMLR.org, 2014, pp. II–1188–II–1196.

[137] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, Deep learning, Nature 521 (2015), no. 7553, 436–444.

[138] Angel Lagares Lemos, Florian Daniel, and Boualem Benatallah, Web service composition: A survey of techniques and tools, ACM Comput. Surv. 48 (2015), no. 3, 33:1–33:41.

[139] Alessandro Lenci, Distributional semantics in linguistic and cognitive research, Italian journal of linguistics 20 (2008), no. 1, 1–31.

[140] Rogers Jeffrey Leo John, Jignesh M. Patel, Andrew L. Alexander, Vikas Singh, and Nagesh Adluru, A natural language interface for dissemination of reproducible biomedical data science, Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (Cham) (Alejandro F. Frangi, Julia A. Schnabel, Christos Davatzikos, Carlos Alberola-López, and Gabor Fichtinger, eds.), Springer International Publishing, 2018, pp. 197–205.

[141] Esther Levin, Roberto Pieraccini, and Wieland Eckert, Using markov decision process for learning dialogue strategies, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 1, IEEE, 1998, pp. 201–204.

[142] Omer Levy and Yoav Goldberg, Dependency-based word embeddings, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 302–308.

[143] Fei Li and Hosagrahar V Jagadish, Nalir: an interactive natural language interface for querying relational databases, ACM SIGMOD, ACM, 2014, pp. 709–712.

[144] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky, Deep reinforcement learning for dialogue generation, arXiv preprint arXiv:1606.01541 (2016).

[145] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz, End-to-end task-completion neural dialogue systems, arXiv preprint arXiv:1703.01008 (2017).

[146] Q. Vera Liao, Muhammed Mas-ud Hussain, Praveen Chandar, Matthew Davis, Yasaman Khazaeni, Marco Patricio Crasso, Dakuo Wang, Michael Muller, N. Sadat Shami, and Werner Geyer, All work and no play?, CHI ’18, 2018.

[147] Chao-Lin Liu and Ting-Ming Liao, Classifying criminal charges in chinese for web-based legal services, Asia-Pacific Web Conference, Springer, 2005, pp. 64–75.

[148] Honghai Liu, Shengyong Chen, and Naoyuki Kubota, Intelligent video systems and analytics: A survey, IEEE Transactions on Industrial Informatics 9 (2013), no. 3, 1222–1233.

[149] Xutong Liu, Changshu Jian, and Chang-Tien Lu, A spatio-temporal-textual crime search engine, Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2010, pp. 528–529.

[150] Stuart Lloyd, Least squares quantization in pcm, IEEE transactions on information theory 28 (1982), no. 2, 129–137.

[151] Robert B Loatman, Stephen D Post, Chih-King Yang, and John C Hermansen, Natural language understanding system, April 3 1990, US Patent 4,914,590.

[152] Richard K Lomotey, Joseph Pry, Sumanth Sriramoju, Emmanuel Kaku, and Ralph Deters, Middleware framework for iot services integration, 2017 IEEE International Conference on AI & Mobile Services (AIMS), IEEE, 2017, pp. 89–92.

[153] Anselmo López, Josep Sànchez-Ferreres, Josep Carmona, and Lluís Padró, From process models to chatbots, Advanced Information Systems Engineering (Cham) (Paolo Giorgini and Barbara Weber, eds.), Springer International Publishing, 2019, pp. 383–398.

[154] Vanessa Lopez, Miriam Fernández, Enrico Motta, and Nico Stieler, Poweraqua: Supporting users in querying and exploring the semantic web, Semantic Web 3 (2012), no. 3, 249–265.

[155] Qiang Lu, Jack G Conrad, Khalid Al-Kofahi, and William Keenan, Legal document clustering with built-in topic segmentation, Proceedings of the 20th ACM international conference on Information and knowledge management, ACM, 2011.

[156] Giuseppe Lugano, Virtual assistants and self-driving cars, 2017 15th International Conference on ITS Telecommunications (ITST), IEEE, 2017, pp. 1–5.

[157] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao, Learning to predict charges for criminal cases with legal basis, arXiv preprint arXiv:1707.09168 (2017).

[158] Paweł Łupkowski and Jonathan Ginzburg, A corpus-based taxonomy of question responses, IWCS ’13, 2013.

[159] F. Mairesse, M. Gasic, F. Jurcicek, S. Keizer, B. Thomson, K. Yu, and S. Young, Spoken language understanding from unaligned data using discriminative classification models, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 4749–4752.

[160] François Mairesse and Steve Young, Stochastic language generation in dialogue using factored language models, Computational Linguistics 40 (2014), no. 4, 763–799.

[161] Giandomenico Majone, Evidence, argument, and persuasion in the policy process, Yale University Press, 1989.

[162] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky, The stanford corenlp natural language processing toolkit, ACL, 2014, pp. 55–60.

[163] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to information retrieval, Cambridge University Press, New York, NY, USA, 2008.

[164] Christopher D Manning and Hinrich Schütze, Foundations of statistical natural language processing, MIT press, 1999.

[165] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky, The Stanford CoreNLP natural language processing toolkit, Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60.

[166] John McCrank and Jim Finkle, Equifax breach could be most costly in corporate history, Available: https://www.reuters.com/article/us-equifax-cyber/equifax-breach-could-be-most-costly-in-corporate-history-idUSKCN1GE257, 2018, [Online; accessed 10-September-2019].

[167] Ryan McDonald, Slav Petrov, and Keith Hall, Multi-source transfer of delexicalized dependency parsers, Proceedings of the Conference on Empirical Methods in Natural Language Processing (Stroudsburg, PA, USA), EMNLP ’11, Association for Computational Linguistics, 2011, pp. 62–72.

[168] Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu, Portfolio: Finding relevant functions and their usage, Proceedings of the 33rd International Conference on Software Engineering (New York, NY, USA), ICSE ’11, ACM, 2011, pp. 111–120.

[169] Michael F McTear, Spoken dialogue technology: enabling the conversational user interface, ACM Computing Surveys (CSUR) 34 (2002), no. 1, 90–169.

[170] Marcelo Mendoza and Juan Zamora, Identifying the intent of a user query using support vector machines, 2009, pp. 131–142.

[171] Martino Mensio, Giuseppe Rizzo, and Maurizio Morisio, Multi-turn qa: A rnn contextual approach to intent classification for goal-oriented systems, WWW ’18, 2018.

[172] F. A. Mikic, J. C. Burguillo, M. Llamas, D. A. Rodriguez, and E. Rodriguez, Charlie: An aiml-based chatterbot which works as an interface among ines and humans, 2009 EAEEIE Annual Conference, June 2009, pp. 1–6.

[173] Fernando Mikic Fonte, Juan Carlos Burguillo, and Martín Llamas Nistal, Tq-bot: an aiml-based tutor and evaluator bot, Journal of Universal Computer Science 15 (2009), 1486–1495.

[174] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).

[175] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, Efficient estimation of word representations in vector space, arXiv:1301.3781 (2013).

[176] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), Curran Associates, Inc., 2013, pp. 3111–3119.

[177] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems, 2013, pp. 3111–3119.

[178] George A Miller, Wordnet: a lexical database for english, Communications of the ACM 38 (1995), no. 11, 39–41.

[179] Aniello Minutolo, Massimo Esposito, and Giuseppe De Pietro, A conversational chatbot based on knowledge-graphs for factoid medical questions., SoMeT, 2017, pp. 139–152.

[180] Margaret Mitchell, Dan Bohus, and Ece Kamar, Crowdsourcing language generation templates for dialogue systems, Proceedings of the INLG and SIGDIAL 2014 Joint Session (2014), 172–180.

[181] Andriy Mnih and Koray Kavukcuoglu, Learning word embeddings efficiently with noise-contrastive estimation, Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (USA), NIPS’13, Curran Associates Inc., 2013, pp. 2265–2273.

[182] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young, Multi-domain dialog state tracking using recurrent neural networks, ACL ’15.

[183] Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young, Neural belief tracker: Data-driven dialogue state tracking, arXiv preprint arXiv:1606.03777 (2016).

[184] David Nadeau and Satoshi Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (2007), no. 1, 3–26.

[185] Roberto Navigli and Simone Paolo Ponzetto, Babelnet: Building a very large multilingual semantic network, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (Stroudsburg, PA, USA), ACL ’10, Association for Computational Linguistics, 2010, pp. 216–225.

[186] Roberto Navigli and Simone Paolo Ponzetto, Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, Artificial Intelligence 193 (2012), 217–250.

[187] Roberto Navigli and Simone Paolo Ponzetto, BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, Artificial Intelligence 193 (2012), 217–250.

[188] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks, Understanding source code evolution using abstract syntax tree matching, ACM SIGSOFT Software Engineering Notes 30 (2005), no. 4, 1–5.

[189] José Luis Neves and Rajesh Bordawekar, Demonstrating ai-enabled sql queries over relational data using a cognitive database, (2018).

[190] Hamid Reza Motahari Nezhad, Boualem Benatallah, Fabio Casati, and Regis Saint-Paul, From business processes to process spaces, IEEE Internet Computing 15 (2010), no. 1, 22–30.

[191] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N. Nguyen, Exploring api embedding for api usages and applications, Proceedings of the 39th International Conference on Software Engineering (Piscataway, NJ, USA), ICSE ’17, IEEE Press, 2017, pp. 438–449.

[192] Nick C. Bradley, Thomas Fritz, and Reid Holmes, Context-aware conversational developer assistants, Proceedings of the 40th International Conference on Software Engineering (New York, NY, USA), ICSE ’18, ACM, 2018, pp. 37–42.

[193] Feiping Nie, Xiaoqian Wang, Michael I. Jordan, and Heng Huang, The constrained laplacian rank algorithm for graph-based clustering, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, AAAI Press, 2016, pp. 1969–1976.

[194] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi, Unsupervised learning of sentence embeddings using compositional n-gram features, arXiv preprint arXiv:1703.02507 (2017).

[195] Sinno Jialin Pan and Qiang Yang, A survey on transfer learning, IEEE Transactions on knowledge and data engineering 22 (2009), no. 10, 1345–1359.

[196] Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit Paradkar, Inferring method specifications from natural language api descriptions, Software Engineering (ICSE), 2012 34th International Conference on, IEEE, 2012, pp. 815–825.

[197] Michael P Papazoglou and Dimitrios Georgakopoulos, Service-oriented computing, Communications of the ACM 46 (2003), no. 10, 25–28.

[198] Jeffrey Pennington, Richard Socher, and Christopher Manning, Glove: Global vectors for word representation, Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[199] Peter Mell, Karen Scarfone, and Sasha Romanosky, A complete guide to the common vulnerability scoring system, Available: https://www.first.org/cvss/v2/guide, 2007, [Online; accessed 21-July-2019].

[200] Gayane Petrosyan, Martin P Robillard, and Renato De Mori, Discovering information explaining api types using text classification, Proceedings of the 37th International Conference on Software Engineering-Volume 1, IEEE Press, 2015, pp. 869–879.

[201] Bojan Petrovski, Ignacio Aguado, Andreea Hossmann, Michael Baeriswyl, and Claudiu Musat, Embedding individual table columns for resilient sql chatbots, arXiv preprint arXiv:1811.00633 (2018).

[202] Seth Polsley, Pooja Jhunjhunwala, and Ruihong Huang, Casesummarizer: A system for automated summarization of legal texts, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 258–262.

[203] Ana-Maria Popescu, Oren Etzioni, and Henry Kautz, Towards a theory of natural language interfaces to databases, IUI ’03, ACM, 2003, pp. 149–157.

[204] Piotr Pruski, Sugandha Lohar, William Goss, Alexander Rasin, and Jane Cleland-Huang, Tiqi: answering unstructured natural language trace queries, Req. Eng. 20 (2015), no. 3, 215–232.

[205] Chen Qu, Liu Yang, W. Bruce Croft, Yongfeng Zhang, Johanne R. Trippas, and Minghui Qiu, User intent prediction in information-seeking conversations, Proceedings of the 2019 Conference on Human Information Interaction and Retrieval (New York, NY, USA), CHIIR ’19, ACM, 2019, pp. 25–33.

[206] Mukund Raghothaman, Yi Wei, and Youssef Hamadi, Swim: Synthesizing what I mean – code search and idiomatic snippet synthesis, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 2016, pp. 357–367.

[207] K Raghuveer, Legal documents clustering using latent dirichlet allocation, IAES Int. J. Artif. Intell 2 (2012), no. 1, 34–37.

[208] Mohammad Masudur Rahman, Chanchal K Roy, and David Lo, Rack: Automatic api recommendation using crowdsourced knowledge, Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, vol. 1, IEEE, 2016, pp. 349–359.

[209] Kiran Ramesh, Surya Ravishankaran, Abhishek Joshi, and K. Chandrasekaran, A survey of design techniques for conversational agents, Information, Communication and Computing Technology (Singapore) (Saroj Kaushik, Daya Gupta, Latika Kharb, and Deepak Chahal, eds.), Springer Singapore, 2017, pp. 336–350.

[210] Abhinav Rastogi, Raghav Gupta, and Dilek Hakkani-Tur, Multi-task learning for joint language understanding and dialogue state tracking, arXiv preprint arXiv:1811.05408 (2018).

[211] Antoine Raux and Maxine Eskenazi, A finite-state turn-taking model for spoken dialog systems, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009, pp. 629–637.

[212] Veselin Raychev, Martin Vechev, and Eran Yahav, Code completion with statistical language models, ACM SIGPLAN Notices, vol. 49, ACM, 2014, pp. 419–428.

[213] RBS, The 2017 year end vulnerability quick view report, Available: https://pages.riskbasedsecurity.com/2017-ye-vulnerability-quickview-report, 2017, [Online; accessed 11-September-2019].

[214] RBS, More than 10,000 vulnerabilities disclosed so far in 2018 – over 3,000 you may not know about, Available: https://www.riskbasedsecurity.com/2018/08/13/more-than-10000-vulnerabilities-disclosed-so-far-in-2018-over-3000-you-may-not-know-about/, 2018, [Online; accessed 20-August-2019].

[215] Richard Tewksbury, Qualitative versus quantitative methods: Understanding why qualitative methods are superior for criminology and criminal justice.

[216] Peter C Rigby and Martin P Robillard, Discovering essential code elements in informal documentation, 2013 35th International Conference on Software Engineering (ICSE), IEEE, 2013, pp. 832–841.

[217] Martin P. Robillard and Yam B. Chhetri, Recommending reference api documentation, Empirical Software Engineering 20 (2015), no. 6, 1558–1586.

[218] Martin P Robillard and Yam B Chhetri, Recommending reference api documentation, Empirical Software Engineering 20 (2015), no. 6, 1558–1586.

[219] Carlos Rodriguez, Shayan Zamanirad, Reza Nouri, Kirtana Darabal, Boualem Benatallah, and Mortada Al-Banna, Security vulnerability information service with natural language query support, Advanced Information Systems Engineering (Cham) (Paolo Giorgini and Barbara Weber, eds.), Springer International Publishing, 2019, pp. 497–512.

[220] Barry S Rowlingson and Peter J Diggle, Splancs: spatial point pattern analysis code in s-plus, Computers & Geosciences 19 (1993), no. 5, 627–655.

[221] Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych, Concatenated power mean embeddings as universal cross-lingual sentence representations, arXiv (2018).

[222] Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum, How developers search for code: A case study, Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (New York, NY, USA), ESEC/FSE 2015, ACM, 2015, pp. 191–201.

[223] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, Deep belief nets for natural language call-routing, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 5680–5683.

[224] Md Shahriare Satu, Md Hasnat Parvez, et al., Review of integrated applications with aiml based chatbot, 2015 International Conference on Computer and Information Engineering (ICCIE), IEEE, 2015, pp. 87–90.

[225] Ari Schlesinger, Kenton P. O’Hara, and Alex S. Taylor, Let’s talk about race: Identity, chatbots, and ai, Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’18, ACM, 2018, pp. 315:1–315:14.

[226] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan, Introduction to information retrieval, vol. 39, Cambridge University Press, 2008.

[227] Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al., A deep reinforcement learning chatbot, arXiv preprint arXiv:1709.02349 (2017).

[228] Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck, Building a conversational agent overnight with dialogue self-play, arXiv:1801.04871 (2018).

[229] Bayan Abu Shawar and Eric Atwell, Chatbots: are they really useful?, Ldv forum, vol. 22, 2007, pp. 29–49.

[230] Scharolta Katharina Sienčnik, Adapting word2vec to named entity recognition, Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, no. 109, Linköping University Electronic Press, 2015, pp. 239–243.

[231] Jagdish Singh, Minnu Joesph, and Khurshid Jabbar, Rule-based chatbot for student enquiries, Journal of Physics: Conference Series 1228 (2019), 012060.

[232] Smerity, Peeking into the neural network architecture used for google’s neural machine translation, Available: https://smerity.com/articles/2016/googlenmtarch.html, 2016, [Online; accessed 02-September-2019].

[233] Justin Smith, Brittany Johnson, Emerson Murphy-Hill, Bill Chu, and Heather Richter Lipford, Questions developers ask while diagnosing potential security vulnerabilities with static analysis, ESEC/SIGSOFT FSE 2015, ACM, 2015, pp. 248–259.

[234] Claudia Soria, Roberto Bartolini, Alessandro Lenci, Simonetta Montemagni, and Vito Pirrelli, Automatic extraction of semantics in law documents, Proceedings of the V Legislative XML Workshop, 2007, pp. 253–266.

[235] Robert Speer, Joshua Chin, and Catherine Havasi, Conceptnet 5.5: An open multilingual graph of general knowledge., AAAI, 2017, pp. 4444–4451.

[236] Robert Speer and Catherine Havasi, Representing general relational knowledge in conceptnet 5., LREC, 2012, pp. 3679–3686.

[237] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014), 1929–1958.

[238] Amanda Stent, Rashmi Prasad, and Marilyn Walker, Trainable sentence planning for complex information presentation in spoken dialog systems, Proceedings of the 42nd annual meeting on association for computational linguistics, Association for Computational Linguistics, 2004, p. 79.

[239] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer, Dialogue act modeling for automatic tagging and recognition of conversational speech, Computational linguistics 26 (2000), no. 3, 339–373.

[240] Ezra Stotland and Michael Pendleton, Workload, stress, and strain among police officers, Behavioral Medicine 15 (1989), no. 1, 5–17.

[241] Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, and Mark Encarnacion, Building natural language interfaces to web apis, (2017).

[242] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al., End-to-end memory networks, Advances in neural information processing systems, 2015, pp. 2440–2448.

[243] Octavia-Maria Sulea, Marcos Zampieri, Mihaela Vela, and Josef van Genabith, Predicting the law area and decisions of french supreme court cases, arXiv preprint arXiv:1708.01681 (2017).

[244] Zhen Sun, Ee-Peng Lim, Kuiyu Chang, Teng-Kwee Ong, and Rohan Kumar Gunaratna, Event-driven document selection for terrorism information extraction, International Conference on Intelligence and Security Informatics, Springer, 2005, pp. 37–48.

[245] Supervise.ly, Big challenge in deep learning: Training data.

[246] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, Sequence to sequence learning with neural networks, Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA), NIPS’14, MIT Press, 2014, pp. 3104–3112.

[247] Valentin Tablan, Danica Damljanovic, and Kalina Bontcheva, A natural language query interface to structured information, ESWC, Springer, 2008, pp. 361–375.

[248] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin, Learning sentiment-specific word embedding for twitter sentiment classification, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1555–1565.

[249] Yufei Tao and Dimitris Papadias, Efficient historical r-trees, Scientific and Statistical Database Management, 2001. SSDBM 2001. Proceedings. Thirteenth International Conference on, IEEE, 2001, pp. 223–232.

[250] Ferdian Thung, Shaowei Wang, David Lo, and Julia Lawall, Automatic recommendation of api methods from feature requests, Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering (Piscataway, NJ, USA), ASE’13, IEEE Press, 2013, pp. 290–300.

[251] Ferdian Thung, Shaowei Wang, David Lo, and Julia Lawall, Automatic recommendation of api methods from feature requests, Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, IEEE Press, 2013, pp. 290–300.

[252] A. Torralba and A. A. Efros, Unbiased look at dataset bias, CVPR 2011, June 2011, pp. 1521–1528.

[253] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw, Understanding chatbot-mediated task management, arXiv preprint arXiv:1802.03109 (2018).

[254] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw, Understanding chatbot-mediated task management, Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’18, ACM, 2018, pp. 58:1–58:6.

[255] Christoph Treude and Martin P Robillard, Augmenting api documentation with insights from stack overflow, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 2016, pp. 392–403.

[256] Gokhan Tur, Li Deng, Dilek Hakkani-Tür, and Xiaodong He, Towards deeper understanding: Deep convex networks for semantic utterance classification, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012.

[257] Wil van der Aalst and Kees van Hee, Workflow management: models, methods, and systems, MIT Press, 2004.

[258] Thanh Van Nguyen, Anh Tuan Nguyen, and Tien N. Nguyen, Characterizing api elements in software documentation with vector representation, Proceedings of the 38th International Conference on Software Engineering Companion (New York, NY, USA), ICSE ’16, ACM, 2016, pp. 749–751.

[259] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you need, Advances in neural information processing systems, 2017, pp. 5998–6008.

[260] Mandana Vaziri, Louis Mandel, Avraham Shinnar, Jérôme Siméon, and Martin Hirzel, Generating chat bots from web api specifications, Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, ACM, 2017, pp. 44–57.

[261] Jesse Vig, Shilad Sen, and John Riedl, The tag genome: Encoding community knowledge to support novel interaction, ACM Transactions on Interactive Intelligent Systems (TiiS) 2 (2012), no. 3, 13.

[262] Oriol Vinyals and Quoc Le, A neural conversational model, arXiv preprint arXiv:1506.05869 (2015).

[263] Mark von Rosing, Stephen White, Fred Cummins, and Henk de Man, Business process model and notation-bpmn., 2015.

[264] Alexander Wachtel, Sebastian Weigelt, and Walter F Tichy, Initial implementation of natural language turn-based dialog system, Procedia Computer Science 84 (2016), 49–56.

[265] Ferdinand Wagner, Ruedi Schmuki, Thomas Wagner, and Peter Wolstenholme, Modeling software with finite state machines: a practical approach, Auerbach Publications, 2006.

[266] Marilyn A Walker, Owen C Rambow, and Monica Rogati, Training a sentence planner for spoken dialogue using boosting, Computer Speech & Language 16 (2002), no. 3-4, 409–433.

[267] Marilyn A Walker, Amanda Stent, François Mairesse, and Rashmi Prasad, Individual and domain adaptation in sentence planning for dialogue, Journal of Artificial Intelligence Research 30 (2007), 413–456.

[268] Richard Wallace, The elements of aiml style, (2003).

[269] Hanna M Wallach, Topic modeling: beyond bag-of-words, Proceedings of the 23rd international conference on Machine learning, ACM, 2006, pp. 977–984.

[270] William Yang Wang, Dan Bohus, Ece Kamar, and Eric Horvitz, Crowdsourcing the acquisition of natural language corpora: Methods and observations, SLT, IEEE, 2012, pp. 73–78.

[271] Chen Wei, Zhichen Yu, and Simon Fong, How to build a chatbot: Chatbot framework and its capabilities, Proceedings of the 2018 10th International Conference on Machine Learning and Computing, ACM, 2018, pp. 369–373.

[272] Joseph Weizenbaum, Eliza—a computer program for the study of natural language communication between man and machine, Commun. ACM 9 (1966), no. 1, 36–45.

[273] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young, Semantically conditioned lstm-based natural language generation for spoken dialogue systems, arXiv preprint arXiv:1508.01745 (2015).

[274] Sanjaya Wijeratne, Lakshika Balasuriya, Amit P Sheth, and Derek Doran, Emojinet: An open service and api for emoji sense discovery., ICWSM, 2017, pp. 437–447.

[275] Tadeusz Wilusz, Neural networks – a comprehensive foundation, Neurocomputing 8 (1995), 359–360.

[276] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal, Data mining: Practical machine learning tools and techniques, Morgan Kaufmann, 2016.

[277] Naihao Wu, Daqing Hou, and Qingkun Liu, Linking usage tutorials into api client code, Proceedings of the 3rd International Workshop on CrowdSourcing in Software Engineering (New York, NY, USA), CSI-SE ’16, ACM, 2016, pp. 22–28.

[278] Bowen Xu, Zhenchang Xing, Xin Xia, and David Lo, Answerbot: Automated generation of answer summary to developers’ technical questions, Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (Piscataway, NJ, USA), ASE 2017, IEEE Press, 2017, pp. 706–716.

[279] Rui Yan and Dongyan Zhao, Coupled context modeling for deep chit-chat: Towards conversations between human and computer, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA), KDD ’18, ACM, 2018, pp. 2574–2583.

[280] Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou, Docchat: An information retrieval approach for chatbot engines using unstructured documents, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 516–525.

[281] Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li, Building task-oriented dialogue systems for online shopping, AAAI ’17, 2017.

[282] Jinqiu Yang, Erik Wittern, Annie T. T. Ying, Julian Dolby, and Lin Tan, Towards extracting web api specifications from documentation, Proceedings of the 15th International Conference on Mining Software Repositories (New York, NY, USA), MSR ’18, ACM, 2018, pp. 454–464.

[283] Mihalis Yannakakis, Hierarchical state machines, TCS 2000, Springer, 2000, pp. 315–330.

[284] Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu, From word embeddings to document similarities for improved information retrieval in software engineering, Proceedings of the 38th international conference on software engineering, ACM, 2016, pp. 404–415.

[285] Koichiro Yoshino, Takuya Hiraoka, Graham Neubig, and Satoshi Nakamura, Dialogue state tracking using long short term memory neural networks.

[286] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao, Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop, arXiv preprint arXiv:1506.03365 (2015).

[287] Haibo Yu, Wenhao Song, and Tsunenori Mine, Apibook: An effective approach for finding apis, Proceedings of the 8th Asia-Pacific Symposium on Internetware (New York, NY, USA), Internetware ’16, ACM, 2016, pp. 45–53.

[288] Qian Yu, Wai Lam, and Zihao Wang, Responding e-commerce product questions via exploiting qa collections and reviews, Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 2192–2203.

[289] Shayan Zamanirad, Boualem Benatallah, Moshe Chai Barukh, Fabio Casati, and Carlos Rodriguez, Programming bots by synthesizing natural language expressions into api invocations, Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (Piscataway, NJ, USA), ASE 2017, IEEE Press, 2017, pp. 832–837.

[290] Jennifer Zamora, I’m sorry, dave, i’m afraid i can’t do that: Chatbot perception and expectations, HAI ’17, 2017.

[291] Hongyu Zhang, Anuj Jain, Gaurav Khandelwal, Chandrashekhar Kaushik, Scott Ge, and Wenxiang Hu, Bing developer assistant: Improving developer productivity by recommending sample code, Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (New York, NY, USA), FSE 2016, ACM, 2016, pp. 956–961.

[292] Shuo Zhang and Krisztian Balog, Entitables: Smart assistance for entity-focused tables, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 255–264.

[293] Zheng Zhang, Minlie Huang, Zhongzhou Zhao, Feng Ji, Haiqing Chen, and Xiaoyan Zhu, Memory-augmented dialogue management for task-oriented dialogue systems, ACM Transactions on Information Systems (TOIS) 37 (2019), no. 3, 34.

[294] Mingyi Zhao, Jens Grossklags, and Peng Liu, An empirical study of web vulnerability discovery ecosystems, ACM SIGSAC, ACM, 2015, pp. 1105–1117.

[295] Rong Zheng, Yi Qin, Zan Huang, and Hsinchun Chen, Authorship analysis in cybercrime investigation, International Conference on Intelligence and Security Informatics, Springer, 2003, pp. 59–73.

[296] Victor Zue, Stephanie Seneff, James R Glass, Joseph Polifroni, Christine Pao, Timothy J Hazen, and Lee Hetherington, Jupiter: a telephone-based conversational interface for weather information, IEEE Transactions on speech and audio processing 8 (2000), no. 1, 85–96.
