Bot2Vec: Learning Representations of Chatbots

Jonathan Herzig1  Tommy Sandbank∗2  Michal Shmueli-Scheuer1  David Konopnicki1  John Richards1

1IBM Research  2Similari
{hjon,shmueli,davidko}@il.ibm.com, [email protected], [email protected]

Abstract

Chatbots (i.e., bots) are becoming widely used in multiple domains, along with supporting bot programming platforms. These platforms are equipped with novel testing tools aimed at improving the quality of individual chatbots. Doing so requires an understanding of what sort of bots are being built (captured by their underlying conversation graphs) and how well they perform (derived through analysis of conversation logs). In this paper, we propose a new model, BOT2VEC, that embeds bots into a compact representation based on their structure and usage logs. Then, we utilize BOT2VEC representations to improve the quality of two bot analysis tasks. Using conversation data and graphs of more than 90 bots, we show that BOT2VEC representations improve detection performance by more than 16% for both tasks.

1 Introduction

As conversational systems (i.e., chatbots) become more pervasive, careful analysis of their capabilities becomes important. Conversational systems are being used for a variety of support, service, and sales applications that were formerly handled by human agents. Thus, organizations deploying such systems must be able to understand bot behavior to improve their performance. In many cases, such an analysis can be viewed as a classification task whose goal is to check whether a bot or a particular instance of a conversation satisfies some property (e.g., is the conversation successful?). Models for these downstream classification tasks should benefit from conditioning on representations that capture global bot behavior.

For a conversation itself, there exists a natural way to represent it as the concatenation of the human and bot utterances. As for a bot, the question of its representation is more complicated: bots are complex objects that execute logic in order to drive conversations with users. How should they best be represented?

Many commercial companies provide bot programming platforms. These platforms provide tools and services to develop bots, and to monitor and improve their quality. Due to the increasing popularity of bots, thousands or tens of thousands of bots could be deployed by different companies on each platform1. Although bots might have different purposes and different underlying structures, the ability to understand bot behavior at a high level could inspire new tools and services benefitting all bots on the platform. In this work, we explore a commercial platform, and study different bot representations.

Inspired by the success of recently proposed learned embeddings for objects such as graphs (Narayanan et al., 2017), nodes (Grover and Leskovec, 2016), documents (Le and Mikolov, 2014) and words (Mikolov et al., 2013a), we propose a new model, BOT2VEC, that learns bot embeddings, and propose both content- and graph-based representations. While previous graph embedding representations consider static local structures in the graph (Narayanan et al., 2017; Grover and Leskovec, 2016), our graph representation is based on dynamic conversation paths. As bots are usually represented on bot platforms as some form of directed graph, with conversations represented as traversals on the graph, this approach seems reasonable. It captures the way the bot is actually used, in addition to how it is structured.

In this paper, our goal is to consider various bot embeddings and two different but realistic classification tasks, and test whether some bot representations are more appropriate for these tasks. The first task, at the level of entire bots, aims to detect

∗ Work was done while working at IBM.
1 https://www.techemergence.com/chatbot-comparison-microsoft-amazon-google/

Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM), pages 75–84, Minneapolis, June 6–7, 2019. © 2019 Association for Computational Linguistics

[Figure 1 appears here: a customer support bot graph. Levels run left to right; nodes include Welcome, Make_a_payment, Payment_type, Account_operation, Store_information, Technical_problem, Headset_problem, Device_information and Anything_Else, with coordinates such as (4), (4,1) and (4,1,2); the legend marks jump nodes, negative edges and positive edges.]

Figure 1: Example of a customer support bot graph.

whether the bot is in production (i.e., in use with real human users) or not. The second task, at the conversation level within each bot, aims to detect problematic conversations with a deployed bot in support of focusing improvement efforts.

The main contributions of this paper are threefold: (1) this is the first research that leverages information from multiple bots in order to improve bot quality, (2) this is the first research to propose an embedding approach for bots based on their structure and how this structure is exploited during conversations, and (3) we empirically evaluate these representations for two classification tasks using data from more than 90 conversational bots. We find that our proposed representations lead to more than 16% improvement in classification for our two tasks, with our structure-based representation performing better than our content-based representation. This suggests that the representations explored in this paper are valuable across a range of possible tasks.

2 Bot Overview

Although different programming models can be used to create bots, in practice, most commercial conversational system platforms represent the conversation control flow for bots as graphs. In this paper, we use a bot paradigm based on one of the publicly available commercial platforms, but it is quite general and can be adapted to fit other bot programming platforms. Our example in Figure 1 shows a part of a customer support bot graph, and we use it to explain how such a graph is used in the context of a conversation.

At every step (i.e., every turn) of a conversation with a bot, the human user expresses an utterance and the bot analyzes it, determines how to respond and updates its internal state. This determination is executed by traversing the graph, starting from a special node called the root node, and moving along the nodes of the graph according to a given set of rules as described below. Note that this description aims to present and explain key abstractions rather than the implementation details of an actual bot platform.

2.1 Graph Components

Every node in the graph has two internal parts: a user intent, and an optional reply of the bot. Given a user utterance, an intent classifier is used to determine whether the user utterance matches the intent associated with the node. For example, the Technical problem node has been defined to capture cases where users encounter a technical problem with a product, and this is what is being expressed in the utterance at hand (e.g., “I’m having some issues with my headset”). In this case, the classifier should be able to classify this utterance as relevant to this intent. In practice, the intent classifier is trained from examples of utterances and their corresponding intents, written by bot programmers.

Every node has two optional outgoing edges: a positive edge and a negative edge. If a user utterance has been classified positively, the optional node answer is presented to the user, and the execution moves to the node which is the target of the positive edge. When the execution moves along a positive edge, it is possible to obtain additional input from the user, and continue the evaluation using this input. If a user utterance has been classified negatively, the execution moves to the node which is the target of the negative edge. As shown in Figure 1, we represent nodes connected by negative edges along the vertical axis, and nodes connected by positive edges along the horizontal axis.

When there is no positive edge and the classification is positive, the execution stops, and the system waits for the next user utterance. When there is no negative edge and the classification is negative, execution jumps back to the root node to resume evaluation from the start.

In addition to the nodes that we just described, there are special sink nodes (in our example, the

Agent: Hi, I’m Eva, how may I help you?
Customer: I’m having some issues with my headset
Agent: Which model are you using?
Customer: A wireless one.
Agent: Please make sure that they are charged.
Customer: Yes they are, and they work well with other devices...
Agent: Got you. In this case go and try our echo test.
Customer: Well, I’ll stop by one of your stores and see a technician.
Agent: I’m not trained on that yet, but I’m still learning. You may want to rephrase your question and try again.
Customer: What is your closest location?
Agent: Check here to see our locations in your city.

Figure 2: Example of a possible conversation.

Anything Else node), which are not the source of any edges, and which typically trigger a special default message like “I’m still in a learning phase, is there anything else I can help you with?”. As bots support only a limited set of intents, this mechanism is used to let the user know that some intent is beyond the knowledge of the bot, and to initiate a recovery process.

2.2 Graph Execution

A conversation starts by traversing the graph from the root node. The root node is special in that it does not expect a user utterance, and it only has a positive edge. Its optional response, which can be a greeting message for example, is only output once at the beginning of the conversation.

Consequently, a user utterance defines a path in the graph, and each conversation between a human and the bot can be represented as a sequence of paths in the graph. Figure 2 shows an example of such a conversation.

The nodes that are evaluated for the first user utterance in Figure 2 (“I’m having some issues with my headset”) are marked in bold in Figure 1. Thus, the path that is created by the analysis of this utterance starts with the root node Welcome, then moves to the Make a payment node, checking whether this utterance expresses the user intention to make a payment. Since it does not, control moves to the Account operation node, and then, in turn, to the Store information node, along the negative edges, until it reaches the Technical problem node. Here, the internal classifier determines that the utterance indeed expresses that the user encountered a technical problem. As a result, the control moves along the positive edge to the Headset problem node. Once the node’s reply is presented to the user (“Which model are you using?”), the system then waits for the next user utterance. The next user utterance (“A wireless one.”) leads to the Wireless model node, hence the resulting path for this utterance is a continuation of the previous path.

Note that nodes connected vertically by negative edges represent alternative understandings of an utterance. That is, in our example, an utterance can be identified as Account operation, Store information or Technical problem, etc. Nodes connected by horizontal positive edges represent specializations of the analysis. That is, after the utterance is classified as Technical problem, moving along the positive edge will check whether the utterance expresses a Headset problem, or (moving again vertically along negative edges) an HD problem or, alternatively, a Battery problem.

In addition, special jump nodes are nodes that allow the conversation to jump to a designated node. In our example the node below Charge headset, which refers to the Echo test node, is a jump. Such jump nodes are not essential, but simplify the graph by preventing duplication of subgraphs.

2.3 Notations

We define the depth of the bot graph as the maximum number of nodes from left to right (ignoring the root node), i.e., nodes connected by positive edges. The depth of a node v is defined as the number of positive edges used to traverse the graph from the root node to v. In our example from Figure 1 the depth of the graph is 5, while the depth of Headset problem is 2. We define the level l as the set of all the nodes whose depth is l. We define the width of the graph at level l as the maximum number of nodes connected by negative edges at this level. In our example, the width of level 1 is 6, while the width of level 2 is 3.

To further simplify notations, we consider a grid layout that defines coordinates for the nodes from left to right and from top to bottom. For example, node Technical problem is mapped to (4), which means that it is the 4th node from top to bottom at level 1. The node Headset problem is mapped to (4,1), meaning that it is the 1st node at level 2 of the 4th node at level 1. Similarly, the node HD problem is mapped to coordinate (4,2) and Wireless model is (4,1,2). Note that nodes located deeper in the graph are mapped to a longer list of

coordinates. The maximal possible length of a coordinate for a node is the depth of the graph.

2.4 Bot Behavior

The graph of a bot determines its behavior, and thus, the structure of the graph captures interesting properties of the bot. For example, there are bots designed to handle simple Q&A conversations, as opposed to bots that handle filling in the details of complex transactions. For Q&A bots, the graph is likely to be of depth 1, with many nodes at this level, representing various alternative questions and answers. For bots handling complex conversations, the graphs are likely to be deeper in order to handle more complicated cases. In general, bots handling narrow use-cases, and which are very specific in their dialog capabilities, are likely to have fewer nodes and more jumps to sink nodes. Thus, in order to capture the bot behavior we should consider the different characteristics of its graph, and this is what we would like to capture in our representation.

3 Bot2Vec Framework

3.1 Representation Learning

In this work, we employ a neural network model to learn the BOT2VEC representation. The training input to this model is either a content-based representation or a structure-based representation of conversations between a human and a bot. Both are described in the following sections. The result of the training is a vector representation for each bot in the dataset.

To learn this BOT2VEC representation, a fully connected network with N hidden layers is used. During training, the input to this network is the representation of a conversation (either content-based or structure-based), and the ground truth is a one-hot vector of the bot that handled this conversation. In other words, given a conversation c, the network predicts which bot handled c using softmax, that is, a distribution over the bots. Thus, the output layer vector of the model has the size of the number of bots in the dataset. Once the model is trained, the representation of a bot b is the weights vector V_b, where V is the output embedding matrix (the weights connecting the last hidden layer to the output layer).

The motivation for choosing this representation is that the training procedure (using cross-entropy loss) should drive similar bots to similar representations, given that they handle similar conversations. In the context of learning word representations using the skip-gram model (Mikolov et al., 2013b), the output embeddings were found to be of good quality (Press and Wolf, 2017). Hereafter, we denote the content-based model as BOT2VEC-C, and the structure-based model as BOT2VEC-S.

We now describe how a conversation is represented using its textual content and using its bot graph structure, which is used as input for the model.

3.2 Content-based Representation

Conversations between users and bots occur in natural language, and as shown in Figure 2, are composed of user utterances and bot responses. The first step for creating our textual representation of a single conversation is to build a vocabulary, which is the union of all terms across all conversations of all the bots in the dataset. We first mask tokens that might reveal the bot’s identity, such as bot names, URLs, HTML tags, etc. To neglect additional bot-specific words (which are probably infrequent), we take the k most popular terms to be the vocabulary.

Now, for a given conversation, we create two vectors with term frequency (TF) entries according to the vocabulary just defined, one for the user utterances and one for the bot responses. The conversation is then represented as their concatenation.

3.3 Structure-based Representation

Our goal in this representation is to characterize bot behavior by analyzing its conversations with respect to the structure of the bot graph. This learned representation should capture the characteristics of how bots are being utilized. We would like to capture, for example, which nodes are being visited during a conversation with a user, at which nodes the conversation turns end, etc.

Recall that each conversation can be represented as a sequence of paths on the bot graph (a path for each turn). We now describe how to represent a path as a vector (bin vector below), and how to aggregate paths of a single conversation.

Bin vector To be able to compare bots with different bot graph structures, we define a common fixed-size bin vector to represent paths of different bots.
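As a concrete sketch, the content-based representation of Section 3.2 (a vocabulary of the k most popular terms, then a user-side and a bot-side term-frequency vector per conversation) might look as follows. The whitespace tokenizer and the conversation layout are simplifying assumptions, not the paper's exact pipeline, and identity-revealing tokens are assumed to be masked beforehand:

```python
from collections import Counter

def build_vocabulary(conversations, k):
    """Top-k most frequent terms across all conversations of all bots."""
    counts = Counter(tok for conv in conversations
                     for side in ("user", "bot")
                     for utt in conv[side]
                     for tok in utt.lower().split())
    return [term for term, _ in counts.most_common(k)]

def tf_vector(utterances, vocab):
    """Term-frequency entries over the fixed vocabulary."""
    counts = Counter(tok for utt in utterances for tok in utt.lower().split())
    return [counts[term] for term in vocab]

def content_representation(conv, vocab):
    """A conversation is the concatenation of its user-side and
    bot-side TF vectors."""
    return tf_vector(conv["user"], vocab) + tf_vector(conv["bot"], vocab)
```

For instance, with the (assumed) vocabulary ["headset", "model"], a conversation whose user said “my headset is broken” and “wireless headset” and whose bot replied “which headset model” is represented as [2, 0, 1, 1].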

[Figure 3 appears here: the nodes of the Figure 1 graph, listed by coordinate, mapped into 3 sections holding 3, 2 and 1 bins respectively.]

Figure 3: A mapping of a bot graph to the bin vector.

[Figure 4 appears here: the bin vectors for Turn 1 and Turn 5 of the conversation in Figure 2, and the aggregated conversation vector; each bin holds four counters labeled s, f, r and u.]

Figure 4: Bin vector representation of turns in a conversation.

We create a bin vector such that each node in the bot graph is mapped to a single bin based on its coordinates in the graph. Each bin vector is divided into S sections, and each section s is divided into b_s bins (Figure 3). Since the idea is to represent a conversation path in a standardized and compact way across different bots, each level in the bot graph is mapped to a section in the bin vector, and each node in the bot graph is mapped to a bin in the appropriate section (see Algorithm 1). Several levels might be mapped to the same section, and several nodes can be mapped to the same bin (sections do not necessarily have the same number of bins). The number of sections and bins in the bin vector are set based on the depths and widths of all the bot graphs (e.g., the average depth of the graphs and the average width of each level in the graphs). For example, the bin vector in Figure 3 represents the mapping of all nodes for the bot graph in Figure 1. This bin vector has 3 sections. There are 3 bins in section 1, 2 bins in section 2, and 1 bin in section 3.

Algorithm 1: Mapping a node to a section and a bin.
Input: n_l: node depth; n_w: node width; D: graph depth; w = w_1, w_2, ..., w_D: graph width at each level; S: number of sections; b = b_1, b_2, ..., b_S: number of bins per section.
Output: n_s: section index; n_b: bin index.
1: n_s = ⌊(n_l − 1)/D × S + 1⌋
2: n_b = ⌊(n_w − 1)/w_{n_l} × b_{n_s} + 1⌋
3: return n_s, n_b

Utterance modeling We now explain how each utterance is represented using the bin vector. As mentioned above, each user utterance in a conversation is represented by a path in the bot graph, whose nodes can be mapped to sections and bins in the bin vector. In order to capture how every utterance is being analyzed by the bot, we distinguish between different types of nodes in the path:

1. A success (s) node is the last node of the path, if it is not a sink node.
2. A failure (f) node is the last node of the path, if it is a sink node.
3. All the other nodes that belong to the path are regular (r) nodes.
4. Nodes that do not belong to the path are uninvolved (u) nodes.

When representing a path, we consider the type of the node it is mapped to, in the corresponding bins in the bin vector: each bin maintains 4 counters, one counter for each type of node mentioned above (success, failure, regular and uninvolved). That is, the mapping of the first user utterance “I’m having some issues with my headset” to the bin vector is as follows:

• The first node in the bot graph that is visited is Make a payment (1). This node is mapped to the first bin in section 1 of the bin vector. Thus the regular counter is set to 1 for this bin.

• The second node traversed in the bot graph is Account operation (2), which is mapped to the same first bin of section 1 of the bin vector. Hence, the regular counter of this bin is set to 2 in the bin vector.

• Similarly, nodes Store information (3) and Technical problem (4) are visited, and that sets the regular counter of bin 2 in section 1 to 2.

• Finally, the Headset problem (4,1) node is visited, and that sets the success counter of bin 1 in section 1 to 1, as this is the last node that is being visited for this utterance. Now we can update the uninvolved counters of the bins according to the nodes that were not visited during the traversal.
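Algorithm 1 is short enough to transcribe directly. A sketch in code, using the example graph's dimensions (depth 5; widths 6 and 3 at the first two levels, with the widths of deeper levels being illustrative assumptions; sections of 3, 2 and 1 bins):

```python
from math import floor

def map_node(n_l, n_w, D, w, S, b):
    """Algorithm 1: map a node at depth n_l and width position n_w
    to a (section, bin) pair. `w` holds the graph width per level and
    `b` the number of bins per section (1-indexed in the paper,
    0-indexed Python lists here)."""
    n_s = floor((n_l - 1) / D * S + 1)
    n_b = floor((n_w - 1) / w[n_l - 1] * b[n_s - 1] + 1)
    return n_s, n_b

# Dimensions for the Figure 1 graph; only w[0]=6 and w[1]=3 are stated
# in the text, the remaining widths are assumed for illustration.
D, S = 5, 3
w, b = [6, 3, 2, 2, 1], [3, 2, 1]

map_node(1, 4, D, w, S, b)  # Technical problem (4)  -> (1, 2)
map_node(2, 1, D, w, S, b)  # Headset problem (4,1)  -> (1, 1)
```

These calls reproduce the worked example above: Make a payment (1) and Account operation (2) fall in bin 1 of section 1, Store information (3) and Technical problem (4) in bin 2 of section 1, and Headset problem (4,1) in bin 1 of section 1.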

When this path is mapped to the bin vector, we obtain the vector as shown in “Turn 1” of Figure 4.

Conversation modeling As the input to the model is a conversation, we now describe how it is represented. We aggregate the bin vectors of the user utterance paths by summing each counter based on the node types (s, f, r or u) across all the bins in the matching sections. This aggregation captures different patterns of the conversation, such as how many times nodes which are mapped to a bin are visited, how many turns ended successfully in the mapped nodes vs. how many turns failed in these nodes, etc. Figure 4 depicts the detailed vectors for the first and the fifth customer utterance from Figure 2, as well as the aggregated vector obtained for the whole conversation.

4 Classification Tasks

BOT2VEC representations could be used for a variety of bot analytics tasks. In this research, we have examined two such tasks.

Detecting production bots Several companies provide bot development platforms that are used to create and manage conversational bots. Based on analyses of the logs of one commercial platform, we have found that a large percentage of bots are not being used with real users. From the platform provider perspective, understanding bot interaction with actual users could inspire the development of new tools and services that could assist all of the bots that use the platform. Thus, it is important to first determine which bots are used in production, rather than in debugging or testing. This is made difficult by the fact that bot testing often involves somewhat realistic simulations of conversations. Thus, in this binary classification task, a bot should be classified as either a production bot or not, given all of its conversations.

Detecting egregious conversations Once in production, bot log analysis forms the basis for continuous improvement. Finding the areas most in need of improvement is complicated by the fact that bots may have thousands of conversations per day, making it hard to find conversations failing from causes such as faulty classification of user intent, bugs in dialog descriptions, and inadequate use of conversational context. Recently, a new analysis (Sandbank et al., 2018) aims at detecting egregious conversations, those in which the bot behaves so badly that a human agent, if available, would be needed to salvage them. Finding these egregious conversations can help identify where improvement efforts should be focused. In this task, a conversation c should be classified as egregious or not, with the BOT2VEC representation potentially improving performance.

5 Experiments

5.1 Data

We collected two months of data from 92 bots, including their graphs and conversation logs. The bot domains included health, finance, banking, travel, HR, IT support, and more. Figure 5 summarizes the information about the number of conversations, number of nodes and graph depth for the bots.

Figure 5: Bots summary: #conversations, #nodes and depth.

In total, we collected 1.3 million conversations, with a minimum of 110 conversations and a maximum of 161,000 conversations per bot. For 62% of the bots, the number of conversations varied between 1,000 and 10,000. Bot graph depth ranged from 2 to 52 levels with an average depth of 7; the total number of nodes ranged from 11 to 1,088 with an average of 160 nodes per bot.

5.2 Experimental Setting

Common bin vector As explained, to capture comparable behavior across bots, we create one common bin vector using the average depth and average width for each level of the bots’ graphs. Specifically, first, based on the average depth, we define the number of sections to be 7. Then we set the number of bins for each section (based on the average width per level per bot) to 108, 10, 6, 17, 8, 4, and 1, respectively.

Bot2Vec implementation details
Content-based: The content-based model input is comprised of two vectors of size k = 5000 each,

one vector representing the user utterances and the second vector representing the bot responses. The two vectors are concatenated and passed through a fully connected layer with 5000 units. We also calculate the squared difference of the two vectors and an element-wise multiplication of the vectors to capture the interaction between the user and the bot. The three vectors are then concatenated and passed through another two fully connected layers with 1000 and 100 units.
Structure-based: The input is a single vector with the size of 616 (the total number of bins (154) times the 4 counters per bin). This input vector is passed through two fully connected layers with 100 and 20 units.

Model      | F1-score | % improvement
BOT-STAT   | 0.519    | -
BOT2VEC-C  | 0.545    | 5.0
BOT2VEC-S  | 0.616    | 18.6

Table 1: Production bots detection - classification results on the test set.

Model      | F1-score | % improvement
EGR        | 0.537    | -
BOT-STAT   | 0.597    | 11.0
BOT2VEC-C  | 0.617    | 14.8
BOT2VEC-S  | 0.626    | 16.4

Table 2: Egregious conversations - classification results on the test set.

For both models, the last hidden layer is connected to the output layer (with a size equal to the total number of bots).
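The shapes of the structure-based network, and how a bot's representation is read off the output embedding matrix V, can be sketched in NumPy. The weights below are random stand-ins; training with the cross-entropy loss and optimizer settings described here would produce the real parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bots, n_bins, hidden1, hidden2 = 92, 154, 100, 20
input_dim = n_bins * 4                      # 616: four counters per bin

# Random stand-ins for the trained weights (biases omitted for brevity).
W1 = rng.normal(size=(input_dim, hidden1))
W2 = rng.normal(size=(hidden1, hidden2))
V = rng.normal(size=(n_bots, hidden2))      # output embedding matrix

def predict_bot(x):
    """Softmax distribution over the bots for one conversation vector."""
    h1 = np.maximum(x @ W1, 0)              # ReLU hidden layers
    h2 = np.maximum(h1 @ W2, 0)             # (dropout is applied in training only)
    logits = V @ h2
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

def bot2vec(b):
    """BOT2VEC-S representation of bot b: row b of the output matrix V."""
    return V[b]
```

After training, `predict_bot` is no longer needed; only the rows of V are kept as the per-bot representations fed to the downstream classifiers.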
All hidden layers consist of ReLU activation units, and are regularized using a dropout rate of 0.5. The models were optimized using an Adam optimizer with a 0.001 learning rate.

5.3 Task 1 - Production Bots Detection

Ground truth To annotate bots, we randomly sampled 100 conversations from our dataset. The sampled conversations were tagged by two different expert judges. Given a full conversation, each judge tagged the conversation as production or test/debugging. If more than 50% of the conversations were tagged as production, then the bot was tagged as production. In addition, if the bot was annotated as not-production, the experts had to provide a list of reasons for their choice (e.g., repeating user ids, repeating bot responses, etc.). We generated true binary labels by considering a bot to be a production bot if both judges agreed. Judges had a Cohen’s Kappa coefficient of 0.95 which indicates a high level of agreement. This process generated a production bot class size of 40 (44% of the 92 bots).

Baseline model Inspired by (McIntire et al., 2010; Zhang et al., 2018) we implemented a baseline model denoted BOT-STAT as follows: for each bot we calculated features, such as the number of unique customer sentences, number of conversations, number of unique agent responses, and statistical measures (mean, median, percentile) of the following metrics: number of turns of a conversation, number of tokens in each turn in a conversation, and the time of a turn in a conversation. In total we implemented 17 features.

In our implementation we used an SVM classifier (as we only have 92 samples), measured the F1-score of the production bot class, and evaluated the models using 10-fold cross-validation.

Results Table 1 depicts the classification results for the three models we explored. The BOT2VEC-S model outperformed the other models with a relative improvement of 18.6% over the baseline. The performance of BOT2VEC-C is slightly better than the baseline, which indicates that the information that was captured by the content of the conversations was helpful to detect the usage of the bot. The structure-based representation, however, seems to capture bot variability more effectively, i.e., the coverage of visited nodes, different conversation patterns, etc.

5.4 Task 2 - Egregious Conversations Detection

Ground truth To collect ground truth data, we randomly sampled 12 bots (from the production bots), and for each bot 100 conversations were annotated following the methodology in (Sandbank et al., 2018), namely, given the full conversation, each judge tagged whether the conversation was egregious or not. Judges had a Cohen’s Kappa coefficient of 0.93 which indicates a high level of agreement. The size of the egregious class varied between the bots, ranging from 8% to 48% of the conversations. All the conversations were aggregated to one dataset.

Baseline model We implemented the state-of-the-art EGR model, presented in (Sandbank et al., 2018). In addition, our models are an extension of the EGR model, such that for each conversation,

81 its bot representation vector was concatenated to log act detection (Kumar et al., 2018), and improv- the EGR model’s original feature vector. ing fluency and coherency (Gangadharaiah et al., We measured the F1-score of the egregious 2018). Yuwono et al.(2018) learn the quality of class, and evaluated the models using 10-fold chatbot responses by combining word representa- cross-validation. tions of human and chatbot responses using neural approaches. The main difference between these Results Table2 summarizes the classification works and ours is that we analyze multiple bots results for all models. Specifically, the BOT2VEC- within a service to generate representations use- S outperforms all other models with a relative ful to each whereas others analyze a single bot at improvement of 16.4%. This suggests that the a time. In addition, we show that our representa- structure-based representation of the bot encapsu- tions are beneficial across different tasks. Finally, lates information which helps the model to distin- none of these works consider the structure of the guish between egregious and non-egregious con- bot as part of the representation. versations. Moreover, although the EGR model Learning embeddings for different objects is receives the text of the conversation as input, the one of the most explored tasks2. As mentioned content-based representation of the bot also helped above, graphs and nodes representations were pro- to improve the performance of the task. posed in (Narayanan et al., 2017; Grover and 5.5 Structure-based Analysis Leskovec, 2016). Both works considered static lo- cal structures in graphs, whereas our graph repre- In practice, bots belong to various application do- sentation is based on dynamic conversation paths. mains like banking, IT and HR. 
Motivated by The work in (Mikolov et al., 2013b) suggested a the semantic similarities between word embed- Word2Vec skip-gram model such that a word is dings (Mikolov et al., 2013b), we further analyzed predicted given its context. In our work, we take a the structure representation BOT2VEC-S w.r.t bots similar approach fitted to the more complex struc- that belong to the same domain. For the set of ture of bots, and predict a bot id given a represen- production bots, IT, HR, and the banking domains tation of a conversation. were prominent with 10, 7, and 6 bots respectively, Recently, Guo et al.(2018); Pereira and D ´ıaz while the other bots belonged to a long tail of do- (2018) presented a chatbot usage analysis over mains like travel, medical, etc. For the prominent several bots. Guo et al.(2018) compared the per- domains (that had more than 5 bots), we calculated formance of various chatbots that participated in the average cosine distance between vector repre- the Alexa prize challenge and implemented the sentations for pairs of bots that belong to the do- same scenario. To do so, the authors used dif- main vs. pairs of bots from different domains. We ferent measures such as conversational depth and find that the average distance between bots within breadth, users engagement, coherency, and more. their domain is 0.614, while the distance between Pereira and D´ıaz(2018) suggested a list of quality bots from different domains is 0.694. Thus, the attributes, and analyzed 100 popular chatbots in representations of bots that belong to the same do- . Our work focuses on learn- main appear to have, as one would expect, a higher ing bot representations targeted towards classifica- level of similarity. tion tasks on the level of bots and their conversa- 6 Related Work tions. Despite the popularity of chatbots, research on 7 Conclusions bot representations and usage analysis is still un- In this paper, we suggest two BOT2VEC models der explored. 
Works on chatbot representations that capture a bot representation based either on are mostly concentrated on neural response gen- the structure of the bot or the content of its con- eration (Xu et al., 2017; Li et al., 2016; Herzig versations. We showed that utilizing these repre- et al., 2017) and slot filling (Ma and Hovy, 2016; sentations improves two platform analysis tasks, Kurata et al., 2016). In these works, conversa- both for bot level and conversation level tasks. Fu- tion history is used to generate the next bot re- ture work includes extension of the model to en- sponse. Other works use conversation represen- capsulate both the content and the structure based tations for improving specific tasks useful in di- alog, like intent detection (Kato et al., 2017), dia- 2https://github.com/MaxwellRebo/awesome-2vec
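The evaluation protocol above can be sketched in a few lines: concatenate each conversation's feature vector with its bot's representation vector, then estimate the egregious-class F1 with 10-fold cross-validation. The feature values, bot embeddings, and the nearest-centroid classifier below are synthetic stand-ins, not the EGR model or our actual features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-conversation features, bot ids, and bot vectors.
n, n_feat, bot_dim, n_bots = 200, 20, 8, 5
conv_features = rng.normal(size=(n, n_feat))
bot_ids = rng.integers(0, n_bots, size=n)
bot_vecs = rng.normal(size=(n_bots, bot_dim))
labels = rng.integers(0, 2, size=n)            # 1 = egregious

# Concatenate each conversation's features with its bot's representation.
X = np.hstack([conv_features, bot_vecs[bot_ids]])

def f1_positive(y_true, y_pred):
    # F1 of the positive (egregious) class only.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# 10-fold cross-validation with a toy nearest-centroid classifier.
folds = np.array_split(rng.permutation(n), 10)
scores = []
for fold in folds:
    test = np.zeros(n, dtype=bool)
    test[fold] = True
    centroids = np.stack(
        [X[~test][labels[~test] == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X[test][:, None, :] - centroids[None], axis=2)
    scores.append(f1_positive(labels[test], dists.argmin(axis=1)))

print(f"mean egregious-class F1 over 10 folds: {np.mean(scores):.3f}")
```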
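The domain analysis of Section 5.5 reduces to a simple computation: average cosine distance over same-domain bot pairs vs. different-domain pairs. The sketch below uses random vectors in place of learned BOT2VEC-S representations (with 10 IT, 7 HR, and 6 banking bots, matching the counts above), so it illustrates the computation only, not the reported 0.614 vs. 0.694 result.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bot vectors tagged with a domain (random stand-ins for
# the learned BOT2VEC-S representations).
bots = {
    f"bot{i}": ("it" if i < 10 else "hr" if i < 17 else "banking",
                rng.normal(size=100))
    for i in range(23)
}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Partition all bot pairs into within-domain and cross-domain pairs.
within, across = [], []
for (dom_a, vec_a), (dom_b, vec_b) in combinations(bots.values(), 2):
    (within if dom_a == dom_b else across).append(cosine_distance(vec_a, vec_b))

print(f"avg within-domain distance: {np.mean(within):.3f}")
print(f"avg cross-domain distance:  {np.mean(across):.3f}")
```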
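The training signal described in the related work discussion, predicting a bot id given a conversation representation, can be viewed as a softmax classifier over bot ids, by rough analogy with the skip-gram objective. The sketch below is an illustrative analogue on synthetic conversation vectors, not the actual BOT2VEC training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bots, conv_dim, n_convs = 5, 16, 400

# Synthetic conversation representations: noisy copies of a per-bot prototype.
bot_of_conv = rng.integers(0, n_bots, size=n_convs)
prototypes = rng.normal(size=(n_bots, conv_dim))
convs = prototypes[bot_of_conv] + 0.1 * rng.normal(size=(n_convs, conv_dim))

# Train softmax regression to predict the bot id from the conversation
# vector; the columns of W act as learned per-bot output vectors.
W = np.zeros((conv_dim, n_bots))
for _ in range(300):                               # plain batch gradient descent
    logits = convs @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = convs.T @ (probs - np.eye(n_bots)[bot_of_conv]) / n_convs
    W -= 0.1 * grad

acc = (convs @ W).argmax(axis=1) == bot_of_conv
print(f"bot-id prediction accuracy: {acc.mean():.2f}")
```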

References

Rashmi Gangadharaiah, Balakrishnan Narayanaswamy, and Charles Elkan. 2018. What we need to learn if we want to do and not just talk. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 25–32. Association for Computational Linguistics.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM.

Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram. 2018. Topic-based evaluation for conversational bots. CoRR, abs/1801.03622.

Jonathan Herzig, Michal Shmueli-Scheuer, Tommy Sandbank, and David Konopnicki. 2017. Neural response generation for customer service based on personality traits. In Proceedings of the 10th International Conference on Natural Language Generation, pages 252–256. Association for Computational Linguistics.

Tsuneo Kato, Atsushi Nagai, Naoki Noda, Ryosuke Sumitomo, Jianming Wu, and Seiichi Yamamoto. 2017. Utterance intent classification of a spoken dialogue system with efficiently untied recursive autoencoders. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 60–64. Association for Computational Linguistics.

Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, and Sachindra Joshi. 2018. Dialogue act sequence labeling using hierarchical encoder with CRF. In AAAI. AAAI Press.

Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. 2016. Leveraging sentence-level information with encoder LSTM for semantic slot filling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2077–2083. Association for Computational Linguistics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-1188–II-1196. JMLR.org.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

J. P. McIntire, L. K. McIntire, and P. R. Havig. 2010. Methods for chatbot detection in distributed text-based communications. In 2010 International Symposium on Collaborative Technologies and Systems.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. CoRR, abs/1707.05005.

Juanan Pereira and Oscar Díaz. 2018. A quality analysis of Facebook Messenger's most popular chatbots. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC '18.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics.

Tommy Sandbank, Michal Shmueli-Scheuer, Jonathan Herzig, David Konopnicki, John Richards, and David Piorkowski. 2018. Detecting egregious conversations between customers and virtual agents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1802–1811. Association for Computational Linguistics.

Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, and Chao Qi. 2017. Neural response generation via GAN with an approximate embedding layer. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 617–626. Association for Computational Linguistics.

Steven Kester Yuwono, Wu Biao, and Luis Fernando D'Haro. 2018. Automated scoring of chatbot responses in conversational dialogue. In Proceedings of the International Workshop on Spoken Dialog System Technology.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? CoRR, abs/1801.07243.
