Learning Personas from Dialogue with Attentive Memory Networks
Eric Chu*, Prashanth Vijayaraghavan*, Deb Roy
MIT Media Lab
Abstract

The ability to infer persona from dialogue can have applications in areas ranging from computational narrative analysis to personalized dialogue generation. We introduce neural models to learn persona embeddings in a supervised character trope classification task. The models encode dialogue snippets from IMDB into representations that can capture the various categories of film characters. The best-performing models use a multi-level attention mechanism over a set of utterances. We also utilize prior knowledge in the form of textual descriptions of the different tropes. We apply the learned embeddings to find similar characters across different movies, and cluster movies according to the distribution of the embeddings. The use of short conversational text as input, and the ability to learn from prior knowledge using memory, suggests these methods could be applied to other domains.

1 Introduction

Individual personality plays a deep and pervasive role in shaping social life. Research indicates that it can relate to the professional and personal relationships we develop (Barrick and Mount, 1993; Shaver and Brennan, 1992), the technological interfaces we prefer (Nass and Lee, 2000), the behavior we exhibit on social media networks (Selfhout et al., 2010), and the political stances we take (Jost et al., 2009).

With increasing advances in human-machine dialogue systems, and the widespread use of social media in which people express themselves via short text messages, there is growing interest in systems that have an ability to understand different personality types. Automated personality analysis based on short text could open up a range of potential applications, such as dialogue agents that sense personality in order to generate more interesting and varied conversations.

We define persona as a person's social role, which can be categorized according to their conversations, beliefs, and actions. To learn personas, we start with the character tropes data provided in the CMU Movie Summary Corpus (Bamman et al., 2014). It consists of 72 manually identified, commonly occurring character archetypes and examples of each. In the character trope classification task, we predict the character trope based on a batch of dialogue snippets.

In their original work, the authors use Wikipedia plot summaries to learn latent variable models that provide a clustering from words to topics and topics to personas; their persona clusterings were then evaluated by measuring similarity to the ground-truth character trope clusters. We asked the question: could personas also be inferred through dialogue? Because we use quotes as a primary input and not plot summaries, we believe our model is extensible to areas such as dialogue generation and conversational analysis.

Our contributions are:

1. Data collection of IMDB quotes and character trope descriptions for characters from the CMU Movie Summary Corpus.
2. Models that greatly outperform the baseline model in the character trope classification task. Our experiments show the importance of multi-level attention over words in dialogue, and over a set of dialogue snippets.
3. An examination of how prior knowledge in the form of textual descriptions of the persona categories may be used. We find that a 'Knowledge-Store' memory initialized with descriptions of the tropes is particularly useful. This ability may allow these models to be used more flexibly in new domains and with different persona categories.

* The first two authors contributed equally to this work.

arXiv:1810.08717v1 [cs.CL] 19 Oct 2018

Character Trope                 Character       Movie
Corrupt corporate executive     Les Grossman    Tropic Thunder
Retired outlaw                  Butch Cassidy   Butch Cassidy and the Sundance Kid
Lovable rogue                   Wolverine       X-Men

Table 1: Example tropes and characters
2 Related Work

Prior to data-driven approaches, personalities were largely measured by asking people questions and assigning traits according to some fixed set of dimensions, such as the Big Five traits of openness, conscientiousness, extraversion, agreeableness, and neuroticism (Tupes and Christal, 1992). Computational approaches have since advanced to infer these personalities based on observable behaviors such as the actions people take and the language they use (Golbeck et al., 2011).

Our work builds on recent advances in neural networks that have been used for natural language processing tasks such as reading comprehension (Sukhbaatar et al., 2015) and dialogue modeling and generation (Vinyals and Le, 2015; Li et al., 2016; Shang et al., 2015). This includes the growing literature on attention mechanisms and memory networks (Bahdanau et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2016).

The ability to infer and model personality has applications in storytelling agents, dialogue systems, and psychometric analysis. In particular, personality-infused agents can help "chit-chat" bots avoid repetitive and uninteresting utterances (Walker et al., 1997; Mairesse and Walker, 2007; Li et al., 2016; Zhang et al., 2018). The more recent neural models do so by conditioning on a 'persona' embedding; our model could help produce those embeddings.

Finally, in the field of literary analysis, graphical models have been proposed for learning character personas in novels (Flekova and Gurevych, 2015; Srivastava et al., 2016), folktales (Valls-Vargas et al., 2014), and movies (Bamman et al., 2014). However, these models often use more structured inputs than dialogue to learn personas.

3 Datasets

Characters in movies can often be categorized into archetypal roles and personalities. To understand the relationship between dialogue and personas, we utilized three different datasets for our models: (a) the Movie Character Trope dataset, (b) the IMDB Dialogue Dataset, and (c) the Character Trope Description Dataset. We collected the IMDB Dialogue and Trope Description datasets, and both are made publicly available.[1]

[1] https://pralav.github.io/emnlp_personas/

3.1 Character Tropes Dataset

The CMU Movie Summary dataset provides tropes commonly occurring in stories and media (Bamman et al., 2014). There are a total of 72 tropes, which span 433 characters and 384 movies. Each trope contains between 1 and 25 characters, with a median of 6 characters per trope. Tropes and canonical examples are shown in Table 1.

3.2 IMDB Dialogue Snippet Dataset

To obtain the utterances spoken by the characters, we crawled the IMDB Quotes page for each movie. Though not every utterance spoken by a character may be available, as the quotes are submitted by IMDB users, many quotes are typically found for most of the characters, especially the famous characters in the Character Tropes dataset. The distribution of quotes per trope is displayed in Figure 1. Our models were trained on 13,874 quotes and validated and tested on sets of 1,734 quotes each.

We refer to each IMDB quote as a (contextualized) dialogue snippet, as each quote can contain several lines between multiple characters, as well as italicized text giving context to what might be happening when the quote took place. Figure 2 shows a typical dialogue snippet. 70.3% of the quotes are multi-turn exchanges, with a mean of 3.34 turns per multi-turn exchange. While the character's own lines alone can be highly indicative of the trope, our models show that accounting for context and the other characters' lines improves performance. The context, for instance, can give clues to typical scenes and actions that are associated with certain tropes, while the other characters' lines give further detail into the relationship between the character and his or her environment.
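To make the snippet structure concrete, here is a minimal sketch of splitting a contextualized quote into the three components used later in the paper: the character's own lines, the bracketed context, and other characters' lines. The line format (speaker-prefixed lines, bracketed context) is a simplified stand-in rather than the actual IMDB markup, and `split_snippet` is a hypothetical helper, not the authors' code.

```python
def split_snippet(snippet, character):
    """Split a dialogue snippet into (D, E, O): the target character's own
    lines, contextual stage directions, and other characters' lines."""
    D, E, O = [], [], []
    for line in snippet.strip().splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            E.append(line[1:-1])                 # bracketed context
        elif ":" in line:
            speaker, text = line.split(":", 1)   # "SPEAKER: utterance"
            (D if speaker.strip() == character else O).append(text.strip())
    return D, E, O

quote = """
[on the rooftop]
BATMAN: I'm not wearing hockey pads.
GORDON: Then what are you?
"""
D, E, O = split_snippet(quote, "BATMAN")
```

In a real pipeline each component would then be tokenized and padded to the fixed length T described in the problem formulation.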
3.3 Character Trope Description Dataset

We also incorporate descriptions of each of the character tropes, using the corresponding descriptions scraped from TVTropes.[2] Each description contains several paragraphs describing the typical characteristics, actions, personalities, etc. of the trope. As we demonstrate in our experiments, the use of these descriptions improves classification performance. This could allow our model to be applied more flexibly to tropes beyond the movie character tropes; as one example, we could store descriptions of personalities based on the Big Five traits in our Knowledge-Store memory.

[2] http://tvtropes.org

[Figure 1: Number of IMDB dialogue snippets per trope. The figure, a bar chart over the 72 tropes (e.g. classy_cat_burglar, byronic_hero, corrupt_corporate_executive), is not reproduced here.]

[Figure 2: Example IMDB dialogue snippet containing multiple characters and context. Figure not reproduced here.]

4 Problem Formulation

Our goal is to train a model that can take a batch of dialogue snippets from the IMDB dialogue dataset and predict the character trope. Each character C in the character tropes dataset is associated with a corresponding ground-truth trope category P. Let S = (D, E, O) be a dialogue snippet associated with a character, where

    D = [w_1^D, w_2^D, ..., w_T^D],  E = [w_1^E, w_2^E, ..., w_T^E],  O = [w_1^O, w_2^O, ..., w_T^O].

D refers to the character's own lines, E is the contextual information, and O denotes the other characters' lines. We define all three components to have a fixed sequence length T and pad when necessary. Let N_P be the total number of character tropes and N_diag be the number of dialogue snippets for a given trope. We sample a set of N_diag snippets related to the trope as inputs to our model.

5 Attentive Memory Network

The Attentive Memory Network consists of two major components: (a) Attentive Encoders and (b) a Knowledge-Store Memory Module. Figure 3 outlines the overall model. We describe the components in the following sections.

5.1 Attentive Encoders

Not every piece of dialogue may be reflective of a latent persona. In order to learn to ignore words and dialogue snippets that are not informative about the trope, we use a multi-level attentive encoder that operates (a) at the individual snippet level, and (b) across multiple snippets.

Attentive Snippet Encoder. The snippet encoder extracts features from a single dialogue snippet S = (D, E, O). A snippet encoder is fed each of these textual inputs to extract features and encode them into an embedding space. We use a recurrent neural network as our encoder, explained in detail in Section 5.1.1. In order to capture the trope-reflective words in the input text, we augment our model with a self-attention layer which scores each word in the given text for its relevance. Section 5.1.2 explains how the attention weights are computed. The output of this encoder is an encoded snippet embedding.

Attentive Inter-Snippet Encoder. As shown in Figure 3, the snippet embeddings e^S = (e_1^S, e_2^S, ..., e_{N_diag}^S) from the snippet encoder are fed to the inter-snippet encoder. This encoder captures the inter-snippet relationship using recurrence with attention over the N_diag snippets and determines their importance. Some of the dialogue snippets may not be informative about the trope, and the model learns to assign low attention scores to such snippets.

[Figure 3: Illustration of the Attentive Memory Network. The network takes dialogue snippets as input and predicts the associated character trope. In this example, dialogue snippets associated with the character trope "Bruiser with a Soft Center" are given as input to the model.]
The resulting attended summary vector from this phase is the persona representation z, defined as:

    z = γ_D^s D^s + γ_E^s E^s + γ_O^s O^s,   with γ_D^s + γ_E^s + γ_O^s = 1    (1)

where γ_D^s, γ_E^s, γ_O^s are learnable weight parameters, and D^s, E^s, O^s refer to the summary vectors of the character's lines, contextual information, and other characters' lines, respectively. In Section 7, we experiment with models that have γ_E^s and γ_O^s set to 0 to understand how the contextual information and other characters' lines contribute to overall performance.

5.1.1 Encoder

Given an input sequence (x_1, x_2, ..., x_T), we use a recurrent neural network to encode the sequence into hidden states (h_1, h_2, ..., h_T). In our experiments, we use gated recurrent units (GRU) (Chung et al., 2014) rather than LSTMs (Hochreiter and Schmidhuber, 1997) because the latter are more computationally expensive. We use bidirectional GRUs and concatenate the forward and backward hidden states to get h_t for t = 1, ..., T.

5.1.2 Attention

We define an attention mechanism Attn that computes a summary s from the resultant hidden states h_t of a GRU by learning to generate weights α_t. This can be interpreted as the relative importance given to a hidden state h_t in forming an overall summary vector for the sequence. Formally, we define it as:

    a_t = f_attn(h_t)            (2)
    α_t = softmax(a_t)           (3)
    s = Σ_{t=1}^{T} α_t h_t      (4)

where f_attn is a two-layer fully connected network in which the first layer projects h_t ∈ R^{d_h} to an attention hidden space g_t ∈ R^{d_a}, and the second layer produces a relevance score for every hidden state at timestep t.
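As a concrete illustration (not the authors' code), the attention of Equations 2-4 can be sketched over precomputed hidden states. The tanh nonlinearity between the two layers of f_attn is an assumption (the paper only specifies a two-layer scorer), and the bidirectional-GRU states are taken as given random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attn(H, W1, w2):
    """Additive self-attention of Eqs. (2)-(4): a two-layer scorer over
    hidden states H (T x d_h) yields weights alpha and a summary s."""
    G = np.tanh(H @ W1)   # layer 1: project h_t into attention space g_t
    a = G @ w2            # layer 2: one relevance score per timestep (Eq. 2)
    alpha = softmax(a)    # Eq. (3)
    s = alpha @ H         # Eq. (4): weighted sum of hidden states
    return s, alpha

T, d_h, d_a = 5, 8, 4               # toy sizes; real dims are hyperparameters
H = rng.normal(size=(T, d_h))       # stand-in for bidirectional-GRU states
W1 = rng.normal(size=(d_h, d_a))
w2 = rng.normal(size=(d_a,))
s, alpha = attn(H, W1, w2)
```

The same mechanism is applied at both levels of the model: over words within a snippet, and over the N_diag snippet embeddings in the inter-snippet encoder.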
5.2 Memory Modules

Our model consists of a read-only 'Knowledge-Store' memory, and we also test a recent read-write memory. External memories have been shown to help on natural language processing tasks (Sukhbaatar et al., 2015; Kumar et al., 2016; Kaiser and Nachum, 2017), and we find similar improvements in learning capability.

5.2.1 Knowledge-Store Memory

The main motivation behind the Knowledge-Store memory module is to incorporate prior domain knowledge. In our work, this knowledge refers to the trope descriptions described in Section 3.3. Related works have initialized their memory networks with positional encodings of word embeddings (Sukhbaatar et al., 2015; Kumar et al., 2016; Miller et al., 2016). To incorporate the descriptions, we represent them with skip-thought vectors (Kiros et al., 2015) and use them to initialize the memory keys K_M ∈ R^{N_P × d_K}, where N_P is the number of tropes, and d_K is set to the size of the embedded trope description R^D, i.e. d_K = ||R^D||.

The values in the memory represent learnable embeddings of the corresponding trope categories, V_M ∈ R^{N_P × d_V}, where d_V is the size of the trope category embeddings. The network learns to use the persona representation z from the encoder phase to find relevant matches in the memory. This corresponds to calculating similarities between z and the keys K_M. Formally, this is calculated as:

    z_M = f_z(z)                                              (5)
    p_M^(i) = softmax(z_M · K_M[i])   ∀i ∈ {1, ..., N_P}      (6)

where f_z: R^{d_h} → R^{d_K} is a fully-connected layer that projects the persona representation into the space of the memory keys K_M. Based on the match probabilities p_M^(i), the values V_M are weighted and cumulatively added to the original persona representation as:

    r_out = Σ_{i=1}^{N_P} p_M^(i) · V_M[i]    (7)

We iteratively combine our mapped persona representation z_M with the information from the memory r_out. The above process is repeated n_hop times, with the memory-mapped persona representation z_M updated as follows:

    z_M^{hop} = f_r(z_M^{hop-1}) + r_out    (8)

where z_M^0 = z_M, and f_r: R^{d_V} → R^{d_K} is a fully-connected layer. Finally, we transform the resulting z_M^{n_hop} using another fully-connected layer, f_out: R^{d_K} → R^{d_h}, via:

    ẑ_M = f_out(z_M^{n_hop})    (9)
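A minimal numerical sketch of the Knowledge-Store read of Equations 5-9, with toy dimensions and plain linear maps standing in for the fully-connected layers f_z, f_r, and f_out; we set d_K = d_V so the hop update of Equation 8 type-checks. This is illustrative only, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

N_P, d_K, d_V, d_h, n_hop = 72, 16, 16, 8, 2   # toy sizes; d_K == d_V here
K_M = rng.normal(size=(N_P, d_K))  # keys: skip-thought trope descriptions
V_M = rng.normal(size=(N_P, d_V))  # values: learnable trope embeddings

# Linear stand-ins for the fully connected layers.
W_z   = rng.normal(size=(d_h, d_K))   # f_z
W_r   = rng.normal(size=(d_K, d_K))   # f_r
W_out = rng.normal(size=(d_K, d_h))   # f_out

z = rng.normal(size=(d_h,))        # persona representation from the encoders
z_M = z @ W_z                      # Eq. (5): project into key space
for _ in range(n_hop):
    p = softmax(K_M @ z_M)         # Eq. (6): match probabilities over tropes
    r_out = p @ V_M                # Eq. (7): probability-weighted value read
    z_M = z_M @ W_r + r_out        # Eq. (8): hop update
z_hat = z_M @ W_out                # Eq. (9): map back to the encoder space
```

With the descriptions fixed as keys, each hop pulls the query toward the value embeddings of the best-matching tropes, which is the "sharpening" effect discussed in the experiments.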
5.2.2 Read-Write Memory

We also tested a Read-Write memory following (Kaiser and Nachum, 2017), which was originally designed to remember rare events. In our case, these 'rare' events might be key dialogue snippets that are particularly indicative of a latent persona. The memory consists of keys, which are activations of a specific layer of the model, i.e. the persona representation z, and values, which are the ground-truth labels, i.e. the trope categories. Over time, it is able to facilitate predictions based on past data with similar activations stored in the memory. For every new example, the network writes to memory for future lookup. A memory of size N_M is defined as:

    M = (K_{N_M × d_h}, V_{N_M}, A_{N_M})    (10)

Memory Read. We use the persona embedding z as a query to the memory. We calculate the cosine similarities between z and the keys in M, take the softmax over the top-k neighbors, and compute a weighted embedding ẑ_M using those scores.

Memory Write. We update the memory in a similar fashion to the original work by (Kaiser and Nachum, 2017), which takes into account the maximum age of items as stored in A_{N_M}.

6 Objective Losses

To train our model, we utilize the different objective losses described below.

6.1 Classification Loss

We calculate the probability of a character belonging to a particular trope category P through Equation 11, where f_P: R^{d_h} → R^{N_P} is a fully-connected layer, and z is the persona representation produced by the multi-level attentive encoders described in Equation 1. We then optimize the categorical cross-entropy loss between the predicted and true tropes as in Equation 12, where N_P is the total number of tropes, q_j is the predicted probability that the input character falls under trope j, and p_j ∈ {0, 1} denotes the ground truth of whether the input snippets come from characters of the j-th trope.

    q = softmax(f_P(z))                      (11)
    J_CE = -Σ_{j=1}^{N_P} p_j log(q_j)       (12)

6.2 Trope Description Triplet Loss

In addition to using trope descriptions to initialize the Knowledge-Store memory, we also test learning from the trope descriptions through a triplet loss (Hoffer and Ailon, 2015). We again use skip-thought vectors to represent the descriptions. Specifically, we want to maximize the similarity of representations obtained from dialogue snippets with their corresponding description, and minimize their similarity with negative examples. We implement this as:

    R^P = f_D(z)                                            (13)
    J_T = max(0, s(R^P, R_n^D) - s(R^P, R_p^D) + α_T)       (14)

where f_D: R^{d_h} → R^{||R^D||} is a fully-connected layer. The triplet ranking loss is then Equation 14, where α_T is a learnable margin parameter and s(·,·) denotes the similarity between the trope embedding R^P and the positive (R_p^D) and negative (R_n^D) trope descriptions.
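The triplet ranking loss of Equation 14 can be sketched as follows, assuming cosine similarity for s(·,·) and a fixed (rather than learned) margin α_T; the vectors here are toy stand-ins for the projected persona representation and the skip-thought description embeddings.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity, one plausible choice for s(., .) in Eq. (14)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(R_P, R_pos, R_neg, alpha_T=0.2):
    """Eq. (14): hinge loss pushing the persona representation R_P closer
    to its own trope description R_pos than to a negative one R_neg."""
    return max(0.0, cos(R_P, R_neg) - cos(R_P, R_pos) + alpha_T)

anchor = np.array([1.0, 0.0, 0.0])   # projected persona representation R^P
pos    = np.array([0.9, 0.1, 0.0])   # matching description: high similarity
neg    = np.array([0.0, 1.0, 0.0])   # negative description: low similarity
loss = triplet_loss(anchor, pos, neg)
```

When the positive description is already much closer than the negative by more than the margin, the hinge is inactive and the loss is zero; swapping the roles of the two descriptions produces a large positive loss.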
Trope Description Triplet Loss with Memory Module. If a memory module is used, we compute a new triplet loss in place of the one described in Equation 14. Models that use a memory module should learn a representation ẑ_M, based on either the prior knowledge stored in the memory (as in the Knowledge-Store memory) or the top-k key matches (as in the Read-Write memory), that is similar to the representation of the trope descriptions. This is achieved by replacing the persona embedding z in Equation 13 with the memory output ẑ_M as shown in Equation 15, where f_DM: R^{d_h} → R^{||R^D||} is a fully-connected layer. To compute the new loss, we combine the representations obtained from Equations 13 and 15 through a learnable parameter γ that determines the importance of each representation. Finally, we utilize this combined representation R̂^P to calculate the loss as shown in Equation 17.

    R_M^P = f_DM(ẑ_M)                                          (15)
    R̂^P = γ R^P + (1 - γ) R_M^P                                (16)
    J_MT = max(0, s(R̂^P, R_n^D) - s(R̂^P, R_p^D) + α_MT)       (17)

6.3 Read-Write Memory Losses

When the Read-Write memory is used, we add two extra loss functions. The first is a Memory Ranking Loss J_MR as done in (Kaiser and Nachum, 2017), which learns based on whether a query with the persona embedding z returns nearest neighbors with the correct trope. The second is a Memory Classification Loss J_MCE that uses the values returned by the memory to predict the trope. The full details for both are found in Supplementary Section A.

6.4 Overall Loss

We combine the above losses through:

    J = β_CE · J_CE + β_T · Ĵ_T + β_MR · J_MR + β_MCE · J_MCE    (18)

    Ĵ_T = J_MT if a memory module is used, and J_T otherwise,

where β = [β_CE, β_MCE, β_T, β_MR] are learnable weights such that Σ_i β_i = 1. Depending on which variant of the model is being used, the list β is modified to contain only the relevant losses. For example, when the Knowledge-Store memory is used, we set β_MR = β_MCE = 0 and β is modified to β = [β_CE, β_T]. We discuss different variants of our model in the next section.

7 Experiments

We experimented with combinations of our various modules and losses. The experimental results and ablation studies are described in the following sections, and the experimental details are described in Supplementary Section B. The different model permutation names in Table 2, e.g. "attn 3 tropetrip ks-mem ndialog16", are defined as follows:

• baseline vs. attn: The 'baseline' model uses only one dialogue snippet S to predict the trope, i.e. N_diag = 1. Hence, the inter-snippet encoder is not used. The 'attn' model operates on N_diag dialogue snippets, using the inter-snippet encoder to assign an attention score to each snippet S_i.

• char vs. 3: To measure the importance of context and other characters' lines, we have two variants: 'char' uses only the character's lines, while '3' uses the character's lines, other characters' lines, and all context lines. Formally, in 'char' mode, we set γ_E^s and γ_O^s to 0 in Equation 1. In '3' mode, (γ_E^s, γ_O^s, γ_D^s) are learned by the model.

• tropetrip: The presence of 'tropetrip' indicates that the triplet loss on the trope descriptions was used. If '-500' is appended to 'tropetrip', then the 4800-dimensional skip-thought embeddings representing the descriptions in Equations 15 and 17 are projected to 500 dimensions using a fully connected layer.

• ks-mem vs. rw-mem: 'ks-mem' refers to the Knowledge-Store memory, and 'rw-mem' refers to the Read-Write memory.

• ndialog: The number of dialogue snippets N_diag used as input for the attention models. Any attention model without an explicit N_diag listed uses N_diag = 8.

7.1 Ablation Results

Baseline vs. Attention Model. The attention model shows a large improvement over the baseline models. This matches our intuition that not every quote is strongly indicative of the character trope; some may be largely expository or 'chit-chat' pieces of dialogue. Example attention scores are shown in Section 7.2.

Though our experiments showed marginal improvement between using the 'char' data and the '3' data, we found that using all 3 inputs had greater performance for models with the triplet loss and read-only memory. This is likely because the others' lines and context capture more of the social dynamics and situations that are described in the trope descriptions. Subsequent results are shown only for the 'attn 3' models.

Model                                   Accuracy  Precision  Recall  F1
baseline char                           0.286     0.538      0.286   0.339
baseline 3                              0.287     0.552      0.288   0.349
attn char                               0.630     0.586      0.628   0.600
attn 3                                  0.630     0.590      0.630   0.603
attn 3 tropetrip                        0.615     0.566      0.615   0.583
attn 3 tropetrip-500                    0.644     0.601      0.642   0.615
attn 3 ks-mem                           0.678     0.648      0.676   0.657
attn 3 rw-mem                           0.635     0.598      0.635   0.611
attn 3 tropetrip ks-mem                 0.663     0.628      0.662   0.639
attn 3 tropetrip-500 ks-mem             0.654     0.618      0.652   0.629
attn 3 tropetrip rw-mem                 0.649     0.602      0.648   0.617
attn 3 tropetrip-500 rw-mem             0.644     0.608      0.644   0.619
attn 3 tropetrip-500 ks-mem ndialog16   0.740     0.707      0.741   0.718
attn 3 tropetrip-500 ks-mem ndialog32   0.750     0.750      0.750   0.750
attn 3 tropetrip ks-mem ndialog16       0.740     0.722      0.741   0.728
attn 3 tropetrip ks-mem ndialog32       0.731     0.712      0.731   0.718
Table 2: Experimental results. Details and analysis are given in Section 7.1. The best-performing results in each block are bolded. The first block examines the baseline model vs. the attention model, as well as the use of different inputs. The second block uses the triplet loss, and the third block uses our memory modules. The fourth block combines the triplet loss and memory module, which the fifth block extends to larger N_diag.
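The loss combination of Equation 18, which differs across the model variants in Table 2, can be sketched as below. Parameterizing the constraint Σ_i β_i = 1 with a softmax over raw weights is one plausible implementation; the paper does not specify how the constraint is enforced, so treat this as an assumption.

```python
import numpy as np

def softmax(b):
    e = np.exp(b - b.max())
    return e / e.sum()

def overall_loss(losses, b_raw, active):
    """Eq. (18): combine the active objective losses with weights beta that
    sum to 1; inactive losses (per model variant) get beta = 0."""
    beta = np.zeros(len(losses))
    beta[active] = softmax(b_raw[active])
    return float(beta @ losses)

# Loss order: [J_CE, J_T_hat, J_MR, J_MCE]. The Knowledge-Store variant
# sets beta_MR = beta_MCE = 0 by keeping only the first two active.
losses = np.array([1.2, 0.4, 0.9, 0.7])
b_raw = np.zeros(4)                       # equal raw weights
J = overall_loss(losses, b_raw, active=[0, 1])
```

With equal raw weights over the two active losses, each gets β = 0.5, so J = 0.5 · 1.2 + 0.5 · 0.4 = 0.8.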
Trope Description Triplet Loss. Adding the trope description loss alone provided relatively small gains in performance, though we see greater gains when combined with memory. While both use the descriptions, perhaps this is because the Knowledge-Store memory matches an embedding against all the tropes, whereas the trope triplet loss is only provided information from one positive and one negative example.

Memory Modules. The Knowledge-Store memory in particular was helpful. Initialized with the trope descriptions, this memory can 'sharpen' queries toward one of the tropes. The Read-Write memory had smaller gains in performance. It may be that more data is required to take advantage of the write capabilities.

Combined Trope Description Triplet Loss and Memory Modules. Using the triplet loss with memory modules led to greater performance when compared to the attn 3 model, but the performance sits around the use of either triplet only or memory only. However, when we increase N_diag to 16 or 32, we find a jump in performance. This is likely because the model has both increased learning capacity and a larger sample of data at every batch, which means at least some of the N_diag quotes should be informative about the trope.

0.330  Rumors of my death have been greatly exaggerated.
0.181  What does it matter? Nothing's real down there. Our life is here.
0.155  I don't give a shit who he's connected to. … I want you vacate this guy off the premises, and I want you to exit him off his feet and use his head to open the fucking door.
0.082  I'm the man.
0.081  I had to see you. Stacy, I swear... I- I don't blame you for hating me. Or for wanting to break up. I just- Let me explain. About my family. I ju- I didn't want you to know. See, my dad's a drunk. Alright? … Cause when you love somebody, you never give up on them.
0.061  Not so bad... Oh! 'Ello, beastie.
0.055  I heard things.
0.054  Fuck! Fuck! That's it, I'm screwed. It's over.

Figure 4: Attention scores for a batch of dialogues for the "byronic hero" trope.[3]

7.2 Attention Scores

Because the inter-snippet encoder provides such a large gain in performance compared to the baseline model, we provide an example illustrating the weights placed on a batch of N_diag snippets. Figure 4 shows the attention scores for the character's lines in the "byronic hero" trope. Matching what we might expect for an antihero personality, we find the top-weighted line to be full of confidence and heroic bluster, while the middle lines hint at the characters' personal turmoil. We also find the lowly weighted sixth and seventh lines to be largely uninformative (e.g. "I heard things."), and the last line to be perhaps too pessimistic and negative for a hero, even a byronic one.

[3] TVTropes.org defines a byronic hero as "Sometimes an Anti-Hero, others an Anti-Villain, or even Just a Villain, Byronic heroes are charismatic characters with strong passions and ideals, but who are nonetheless deeply flawed individuals...."

7.3 Purity Scores of Character Clusters

Finally, we measure our ability to recover the trope 'clusters' (with one trope being a cluster of its characters) with our embeddings, through the purity score used in (Bamman et al., 2014). Equation 19 measures the amount of overlap between two clusterings, where N is the total number of characters, g_i is the i-th ground-truth cluster, and c_j is the j-th predicted cluster.

    Purity = (1/N) Σ_i max_j |g_i ∩ c_j|    (19)
We use a simple agglomerative clustering method on our embeddings with a parameter k for the number of clusters. The methods in (Bamman et al., 2014) contain a similar hyperparameter for the number of persona clusters. We note that the metrics are not completely comparable because not every character in the original dataset was found on IMDB. The results are shown in Table 3. It might be expected that our model performs better because we use the character tropes themselves as training data. However, dialogue may be noisier than the movie summary data; their better-performing Persona Regression (PR) model also uses useful metadata features such as the movie genre and character gender. We simply note that our scores are comparable or higher.

k     PR    DP     AMN
25    42.9  39.63  48.4
50    36.5  31.0   48.1
100   30.3  24.4   45.2

Table 3: Cluster purity scores. k is the number of clusters, PR and DP are the Persona Regression and Dirichlet Persona models from (Bamman et al., 2014), and AMN is our attentive memory network.

8 Application: Narrative Analysis

We collected IMDB quotes for the top 250 movies on IMDB. For every character, we calculated a character embedding by taking the average embedding produced by passing all of their dialogue through our model. We then calculated movie embeddings by taking the weighted sum of all the character embeddings in the movie, with each weight being the percentage of quotes the character had in the movie. By computing distances between pairs of character or movie embeddings, we can potentially unearth notable similarities. We note some of the interesting clusters below.

8.1 Clustering Characters

• Grumpy old men: Carl Fredricksen (Up); Walt Kowalski (Gran Torino)
• Shady tricksters, crooks, well versed in deceit: Ugarte (Casablanca); Eames (Inception)
• Intrepid heroes, adventurers: Indiana Jones (Indiana Jones and the Last Crusade); Nemo (Finding Nemo); Murph (Interstellar)

8.2 Clustering Movies

• Epics, historical tales: Amadeus, Ben-Hur
• Tortured individuals, dark, violent: Donnie Darko, Taxi Driver, Inception, The Prestige
• Gangsters, excess: Scarface, Goodfellas, Reservoir Dogs, The Departed, The Wolf of Wall Street

9 Conclusion

We used the character trope classification task as a test bed for learning personas from dialogue. Our experiments demonstrate that the use of a multi-level attention mechanism greatly outperforms a baseline GRU model. We were also able to leverage prior knowledge in the form of textual descriptions of the tropes; in particular, using these descriptions to initialize our Knowledge-Store memory improved performance. Because we use short text and can leverage domain knowledge, we believe future work could use our models for applications such as personalized dialogue systems.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2014. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352.

Murray R. Barrick and Michael K. Mount. 1993. Autonomy as a moderator of the relationships between the Big Five personality dimensions and job performance. Journal of Applied Psychology, 78(1):111.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Lucie Flekova and Iryna Gurevych. 2015. Personality profiling of fictional characters using sense-level links between lexical resources. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1805–1816.

Jennifer Golbeck, Cristina Robles, Michon Edmondson, and Karen Turner. 2011. Predicting personality from Twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 149–156. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer.

John T. Jost, Tessa V. West, and Samuel D. Gosling. 2009. Personality and ideology as determinants of candidate preferences and Obama conversion in the 2008 US presidential election. Du Bois Review: Social Science Research on Race, 6(1):103–124.

Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. arXiv preprint.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Clifford Nass and Kwan Min Lee. 2000. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 329–336. ACM.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Maarten Selfhout, William Burk, Susan Branje, Jaap Denissen, Marcel Van Aken, and Wim Meeus. 2010. Emerging late adolescent friendship networks and Big Five personality traits: A social network approach. Journal of Personality, 78(2):509–538.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Phillip R. Shaver and Kelly A. Brennan. 1992. Attachment styles and the "Big Five" personality traits: Their connections with each other and with romantic relationship outcomes. Personality and Social Psychology Bulletin, 18(5):536–545.

Shashank Srivastava, Snigdha Chaturvedi, and Tom M. Mitchell. 2016. Inferring interpersonal relations in narrative summaries. In AAAI, pages 2807–2813.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Ernest C Tupes and Raymond E Christal. 1992. Recur- Richard Zemel, Raquel Urtasun, Antonio Torralba, rent personality factors based on trait ratings. Jour- and Sanja Fidler. 2015. Skip-thought vectors. In nal of personality, 60(2):225–251. Advances in neural information processing systems, Josep Valls-Vargas, Jichen Zhu, and Santiago Ontanon.´ pages 3294–3302. 2014. Toward automatic role identification in unan- notated folk tales. In Tenth Artificial Intelligence Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit and Interactive Digital Entertainment Conference. Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Oriol Vinyals and Quoc Le. 2015. A neural conversa- Ask me anything: Dynamic memory networks for tional model. arXiv preprint arXiv:1506.05869. natural language processing. In International Con- ference on Machine Learning, pages 1378–1387. Marilyn A Walker, Janet E Cahn, and Stephen J Whit- taker. 1997. Improvising linguistic style: Social and Jiwei Li, Michel Galley, Chris Brockett, Georgios P affective bases for agent personality. In Proceedings Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. of the first international conference on Autonomous A persona-based neural conversation model. arXiv agents, pages 96–105. ACM. preprint arXiv:1603.06155. Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Franc¸ois Mairesse and Marilyn Walker. 2007. Person- Szlam, Douwe Kiela, and Jason Weston. 2018. Per- age: Personality generation for dialogue. In Pro- sonalizing dialogue agents: I have a dog, do you ceedings of the 45th Annual Meeting of the Associa- have pets too? arXiv preprint arXiv:1801.07243. tion of Computational Linguistics, pages 496–503. A Read-Write Memory Losses
A.1 Memory Ranking Loss $J_{MR}$

We want the model to learn to make efficient matches with the memory keys in order to facilitate lookup of past data. To do this, after computing the k nearest neighbors $(n_1, \ldots, n_k)$, we find a positive neighbor as the smallest index $p$ such that $V[p] = P$ and a negative neighbor as the smallest index $n$ such that $V[n] \neq P$. We define the memory ranking loss as:
$$J_{MR} = \max\big(0,\ s(z, K[n]) - s(z, K[p]) + \alpha_{MR}\big) \tag{20}$$

where $\alpha_{MR}$ is a learnable margin parameter and $s(\cdot, \cdot)$ denotes the similarity between the persona embedding $z$ and the key representations of the positive ($K[p]$) and negative ($K[n]$) neighbors. The above equation is consistent with the memory loss defined in the original work (Kaiser et al., 2017).
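A minimal numpy sketch of this neighbor selection and hinge loss follows. It is illustrative only: the function and variable names are hypothetical, the similarity $s$ is assumed to be cosine similarity, and the margin $\alpha_{MR}$ is treated here as a fixed constant rather than a learnable parameter.

```python
import numpy as np

def memory_ranking_loss(z, keys, values, label, k=8, margin=0.1):
    """Hinge loss over the k nearest memory keys, in the spirit of Eq. (20).

    z: persona embedding, shape (d,)
    keys: memory key matrix K, shape (m, d)
    values: memory value labels V, shape (m,)
    label: ground-truth persona category P
    """
    # Similarity s(z, K[i]); cosine similarity is assumed here.
    sims = keys @ z / (np.linalg.norm(keys, axis=1) * np.linalg.norm(z) + 1e-8)
    neighbors = np.argsort(-sims)[:k]                   # n_1, ..., n_k, nearest first
    pos = [i for i in neighbors if values[i] == label]  # candidates with V[p] == P
    neg = [i for i in neighbors if values[i] != label]  # candidates with V[n] != P
    if not pos or not neg:
        return 0.0  # no valid positive/negative pair among the neighbors
    p, n = pos[0], neg[0]  # nearest positive and nearest negative
    return max(0.0, sims[n] - sims[p] + margin)
```

The loss is zero once the nearest positive key outscores the nearest negative key by the margin, which is the usual behavior of a triplet-style ranking objective.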
A.2 Memory Classification Loss $J_{MCE}$
The Read-Write memory returns $\hat{z}_M$ and values $v_M = V[n_i]\ \forall i \in \{n_1, \ldots, n_k\}$. The probability of the given input dialogues belonging to a particular persona category $P$ is computed from the values $v_M$ returned by the memory via:
$$q^M = \mathrm{softmax}(f_{PM}(v_M)) \tag{21}$$
where $f_{PM} : \mathbb{R}^k \mapsto \mathbb{R}^{N_P}$ is a fully-connected layer. We replace $q_j$ with $q^M_j$ in Equation 12 and calculate the categorical cross-entropy to get $J_{MCE}$.

B Experimental details

The vocabulary size was set to 20,000. We used a GRU hidden size of 200 and a word embedding size of 300, and the word embedding lookup was initialized with GloVe (Pennington et al., 2014). For the Read-Write memory module, we used k = 8 when calculating the nearest neighbors and a memory size of 150. Our models were trained using Adam (Kingma and Ba, 2014).
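For concreteness, Equation (21) amounts to a single fully-connected layer followed by a softmax. The numpy sketch below is a toy instance: the weights, bias, and input values are illustrative stand-ins, not trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def memory_classification_probs(v_M, W, b):
    """q^M = softmax(f_PM(v_M)), with f_PM: R^k -> R^{N_P} a fully-connected layer."""
    return softmax(W @ v_M + b)

rng = np.random.default_rng(0)
k, n_personas = 8, 72                  # k nearest neighbors; 72 character tropes
W = rng.normal(size=(n_personas, k))   # illustrative weights for f_PM
b = np.zeros(n_personas)
v_M = rng.normal(size=k)               # stand-in for the values returned from memory
q_M = memory_classification_probs(v_M, W, b)  # a distribution over trope categories
```

The output $q^M$ is a valid probability distribution over the persona categories, which is what makes the categorical cross-entropy in $J_{MCE}$ well-defined.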