
Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models

Maarten Sap‡†∗  Eric Horvitz†  Yejin Choi‡♦  Noah A. Smith‡♦  James W. Pennebaker♣

†Microsoft Research
‡Paul G. Allen School for Computer Science & Engineering, University of Washington
♦Allen Institute for Artificial Intelligence
♣Department of Psychology, University of Texas at Austin

[email protected], [email protected]

Abstract

We investigate the use of NLP as a measure of the cognitive processes involved in storytelling, contrasting imagination and recollection of events. To facilitate this, we collect and release HIPPOCORPUS, a dataset of 7,000 stories about imagined and recalled events. We introduce a measure of narrative flow and use this to examine the narratives for imagined and recalled events. Additionally, we measure the differential recruitment of knowledge attributed to semantic versus episodic memory (Tulving, 1972) for imagined and recalled storytelling by comparing the frequency of descriptions of general commonsense events with more specific realis events. Our analyses show that imagined stories have a substantially more linear narrative flow, compared to recalled stories in which adjacent sentences are more disconnected. In addition, while recalled stories rely more on autobiographical events based on episodic memory, imagined stories express more commonsense knowledge based on semantic memory. Finally, our measures reveal the effect of narrativization of memories in stories (e.g., stories about frequently recalled memories flow more linearly; Bartlett, 1932). Our findings highlight the potential of using NLP tools to study the traces of human cognition in language.

RECALLED (7 concrete events): "My daughter gave birth to her first child. She and her husband were overwhelmed by emotions. ... Her husband called me and then drove her to the hospital. I joined her at the hospital. When we got to the hospital things got complicated. Her husband tried his best to be with her and to keep her strong. She eventually delivered perfectly."

IMAGINED (1 concrete event): "We recently attended a family wedding. It was the first time in a decade we all got together. ... My older brother is getting married to a rich tycoon lady. He will be very happy. I hope he doesn't get too greedy." (matched ATOMIC tuple: "PersonX gets married" causes PersonX to "be happy")

Figure 1: Snippets from two stories from HIPPOCORPUS (top: recalled, bottom: imagined). Concrete or realis events (in gray) are more frequent in recalled stories, whereas general or commonsense events (underlined) are associated with imagined stories.

[*] Research conducted during an internship at Microsoft Research.

1 Introduction

When telling stories, people draw from their own experiences (episodic knowledge; Conway et al., 1996, 2003) and from their general world knowledge (semantic knowledge; Bartlett, 1932; Oatley, 1999). For example, in Figure 1 (top), a recalled story about a birth will likely recount concrete events from that day, relying heavily on the author's episodic memory (Tulving, 1972). On the other hand, an imagined story about a wedding (Figure 1, bottom) will largely draw from the author's commonsense knowledge about the world (Kintsch, 1988; Graesser et al., 1981).

We harness neural language and commonsense models to study how cognitive processes of recollection and imagination are engaged in storytelling. We rely on two key aspects of stories: narrative flow (how the story reads) and semantic vs. episodic knowledge (the types of events in the story). We propose as a measure of narrative flow the likelihood of sentences under generative language models conditioned on varying amounts of history. Then, we quantify semantic knowledge by measuring the frequency of commonsense events (from the ATOMIC knowledge graph; Sap et al., 2019), and episodic knowledge by counting realis events (Sims et al., 2019), both shown in Figure 1.

We introduce HIPPOCORPUS,[1] a dataset of 6,854 diary-like short stories about salient life events, to examine the cognitive processes of remembering and imagining. Using a crowdsourcing pipeline, we collect pairs of recalled and imagined stories written about the same topic. By design, authors of recalled stories rely on their episodic memory to tell their story.

We demonstrate that our measures can uncover differences in imagined and recalled stories in HIPPOCORPUS. Imagined stories contain more commonsense events and elaborations, whereas recalled stories are more dense in concrete events. Additionally, imagined stories flow substantially more linearly than recalled stories. Our findings provide evidence that surface language reflects the differences in cognitive processes used in imagining and remembering.

Additionally, we find that our measures can uncover narrativization effects, i.e., the transforming of a memory into a narrative with repeated recall or passing of time (Bartlett, 1932; Reyna and Brainerd, 1995; Christianson, 2014). We find that with increased temporal distance or increased frequency of recollection, recalled stories flow more linearly, express more commonsense knowledge, and are less concrete.

[1] Available at http://aka.ms/hippocorpus.

2 HIPPOCORPUS Creation

We construct HIPPOCORPUS, containing 6,854 stories (Table 1), to enable the study of imagined and recalled stories, as most prior corpora are either limited in size or topic (e.g., Greenberg et al., 1996; Ott et al., 2011). See Appendix A for additional details (e.g., worker demographics; §A.2).

            # stories   # sents   # words
recalled      2,779      17.8      308.9
imagined      2,756      17.5**    274.2**
retold        1,319      17.3*     296.8**
total         6,854

Table 1: HIPPOCORPUS data statistics. ** and * indicate significant difference from recalled at p < 0.001 and p < 0.05, respectively.

2.1 Data Collection

We collect first-person perspective stories in three stages on Amazon Mechanical Turk (MTurk), using a pairing mechanism to account for topical variation between imagined and recalled stories.[2]

Stage 1: recalled. We ask workers to write a 15–25 sentence story about a memorable or salient event that they experienced in the past 6 months. Workers also write a 2–3 sentence summary to be used in subsequent stages, and indicate how long ago the events took place (in weeks or months; TIMESINCEEVENT).

Stage 2: imagined. A new set of workers write imagined stories, using a randomly assigned summary from stage 1 as a prompt. Pairing imagined stories with recalled stories allows us to control for variation in the main topic of stories.

Stage 3: retold. After 2–3 months, we contact workers from stage 1 and ask them to re-tell their stories, providing them with the summary of their story as prompt.

Post-writing questionnaire (all stages). Immediately after writing, workers describe the main topic of the story in a short phrase. We then ask a series of questions regarding the personal significance of their story (including frequency of recalling the event: FREQUENCYOFRECALL; see §A.1 for questionnaire details). Optionally, workers could report their demographics.

[2] With IRB approval from the Ethics Advisory Board at Microsoft Research, we restrict workers to the U.S., and ensure they are fairly paid ($7.5–9.5/h).

3 Measures

To quantify the traces of imagination and recollection recruited during storytelling, we devise a measure of a story's narrative flow, and of the types of events it contains (concrete vs. general).

3.1 Narrative Flow

Inspired by recent work on discourse modeling (Kang et al., 2019; Nadeem et al., 2019), we use language models to assess the narrative linearity of a story by measuring how sentences relate to their context in the story.

We compare the likelihoods of sentences under two generative models (Figure 2). The bag model makes the assumption that every sentence is drawn independently from the main theme of the story (represented by E). On the other hand, the chain model assumes that a story begins with a theme, and sentences linearly follow each other.[3]

(i) bag:   E → s_0, s_1, ..., s_i (each sentence drawn independently from E)
(ii) chain: E → s_0 → s_1 → ... → s_i

Figure 2: Two probabilistic graphical models representing (i) bag-like and (ii) chain-like (linear) story representations. E represents the theme of the story.

∆l is computed as the difference in negative log-likelihoods between the bag and chain models:

  ∆l(s_i) = −(1 / |s_i|) [ log p(s_i | E) − log p(s_i | E, s_{1:i−1}) ]    (1)

where the log-probability of a sentence s in a context C (e.g., topic E and history s_{1:i−1}) is the sum of the log-probabilities of its tokens w_t in context: log p(s | C) = Σ_t log p(w_t | C, w_{0:t−1}).

We compute the likelihood of sentences using OpenAI's GPT language model (Radford et al., 2018, trained on a large corpus of English fiction), and we set E to be the summary of the story, but find similar trends using the main event of the story or an empty sequence.

[3] Note that this is a sentence-level version of surprisal as defined by expectation theory (Hale, 2001; Levy, 2008).

3.2 Episodic vs. Semantic Knowledge

We measure the quantity of episodic and semantic knowledge expressed in stories, as proxies for the differential recruitment of episodic and semantic memory (Tulving, 1972) in stories.

Realis event detection. We first analyze the prevalence of realis events, i.e., factual and non-hypothesized events, such as "I visited my mom" (as opposed to irrealis events which have not happened, e.g., "I should visit my mom"). By definition, realis events are claimed by the author to have taken place, which makes them more likely to be drawn from autobiographical or episodic memory in diary-like stories.

We train a realis event tagger (using BERT-base; Devlin et al., 2019) on the annotated literary events corpus by Sims et al. (2019), which slightly outperforms the original authors' models. We provide further training details in Appendix B.1.

Semantic and commonsense knowledge. We measure the amount of commonsense knowledge included explicitly in stories, as a proxy for semantic memory, a form of memory that is thought to encode general knowledge about the world (Tulving, 1972). While this includes facts about how events unfold (i.e., scripts or schemas; Schank and Abelson, 1977; van Kesteren et al., 2012), here we focus on commonsense knowledge, which is also encoded in semantic memory (McRae and Jones, 2013).

Given the social focus of our stories, we use the social commonsense knowledge graph ATOMIC (Sap et al., 2019).[4] For each story, we first match possible ATOMIC events to sentences by selecting events that share noun chunks and verb phrases with sentences (e.g., "getting married" matches "PersonX gets married"; Figure 1). We then search the matched sentences' surrounding sentences for commonsense inferences (e.g., "be very happy" matches "happy"; Figure 1). We describe this algorithm in further detail in Appendix B.2. In our analyses, the measure quantifies the number of story sentences with commonsense tuple matches in the two preceding and following sentences.

[4] ATOMIC contains social and inferential knowledge about the causes (e.g., "X wants to start a family") and effects (e.g., "X throws a party", "X feels loved") of everyday situations like "PersonX decides to get married".

3.3 Lexical and Stylistic Measures

To supplement our analyses, we compute several coarse-grained lexical counts for each story in HIPPOCORPUS. Such approaches have been used in prior efforts to investigate author mental states, temporal orientation, or counterfactual thinking in language (Tausczik and Pennebaker, 2010; Schwartz et al., 2015; Son et al., 2017).

We count psychologically relevant word categories using the Linguistic Inquiry and Word Count lexicon (LIWC; Pennebaker et al., 2015), focusing only on the cognitive processes, positive emotion, negative emotion, and I-word categories, as well as the ANALYTIC and TONE summary variables.[5] Additionally, we measure the average concreteness level of words in stories using the lexicon by Brysbaert et al. (2014).

[5] See liwc.wpengine.com/interpreting-liwc-output/ for more information on LIWC variables.

4 Imagining vs. Remembering

We summarize the differences between imagined and recalled stories in HIPPOCORPUS in Table 2. For our narrative flow and lexicon-based analyses, we perform paired t-tests. For realis and commonsense event measures, we perform linear regressions controlling for story length.[6] We Holm-correct for multiple comparisons for all our analyses (Holm, 1979).

measure                  effect size (d or β)   direction
avg. ∆l (linearity)      0.52***                imagined
realis events            0.10**                 recalled
commonsense              0.15***                imagined
lexicon-based:
  ANALYTIC               0.26***                recalled
  concreteness           0.13***                recalled
  neg. emotion           0.07***                imagined
  TONE                   0.12***                imagined
  I-words                0.17***                imagined
  pos. emotion           0.22***                imagined
  cog. processes         0.30***                imagined

Table 2: Summary of differences between imagined and recalled stories, according to proposed measures (top), and lexical or word-count measures (bottom). All associations are significant when controlling for multiple comparisons (***: p < 0.001; **: p < 0.01).

Figure 3: Density plot showing differences in likelihoods of sentences between chain and bag model (∆l), for recalled (green), imagined (purple), and retold (dark gray dashed) stories. Vertical lines represent mean ∆l values for each story type. All three story types differ significantly (p < 0.001).

Imagined stories flow more linearly. We compare ∆l, i.e., pairwise differences in NLL for sentences when conditioned on the full history vs. no history (density plot shown in Figure 3). When averaging ∆l over the entire story, we find that sentences in imagined stories are substantially more predictable based on the context set by prior sentences than sentences in remembered stories. This effect is also present with varying history sizes (see Figure 5 in Appendix C.1).

Recalled stories are more event-dense. As seen in Table 2, we find that imagined stories contain significantly fewer realis events (controlling for story length).[7]

Imagined stories express more commonsense knowledge. Using the same analysis method, our results show that sentences in imagined stories are more likely to have commonsense inferences in their neighborhood compared to recalled stories.

Lexical differences. Lexicon-based counts uncover additional differences between imagined and recalled stories. Namely, imagined stories are more self-focused (I-words), more emotional (TONE, positive and negative emotion) and evoke more cognitive processes.[8] In contrast, recalled stories are more concrete and contain more logical or hierarchical descriptions (ANALYTIC).

Discussion. Our interpretation of these findings is that the consolidated memory of the author's life events permeates in a more holistic manner through the sentences in the recalled story. Imagined stories are more fluent and contain more commonsense elaborations, which suggests that authors compose a story as a sequence, relying more on preceding sentences and commonsense knowledge to generate the story.

While our findings on linearity hold when using different language models trained on Wikipedia articles (Dai et al., 2019) or English web text (mostly news articles; Radford et al., 2019), a limitation of the findings is that GPT is trained on a large corpus of fiction, which may boost linearity scores for imagined (vs. recalled) sentences. Future work could explore the sensitivity of our results to changes in the language model's training domain or neural architecture.

[6] Linear regressions use z-scored variables. We confirm that our findings hold with multivariate regressions as well as when adding participant random effects in Appendix C.2.
[7] Note that simply using verb count instead of number of realis events yields the opposite effect, supporting our choice of measure.
[8] The cognitive processes LIWC category counts occurrences of words indicative of cognitive activity (e.g., "think", "because", "know").

5 Narrativization of Recalled Stories

We further investigate how our narrative and commonsense measures can be used to uncover the narrativization of recalled events (in recalled and retold stories). These analyses aim to investigate the hypothesis that memories are narrativized over time (Bartlett, 1932), and that distant autobiographical memories are supplemented with semantic or commonsense knowledge (Reyna and Brainerd, 1995; Roediger III et al., 1996; Christianson, 2014; Brigard, 2014).

First, we compare the effects of recency of the event described (TIMESINCEEVENT: a continuous variable representing the log time since the event).[9] Then, we contrast recalled stories to their retold counterparts in pairwise comparisons. Finally, we measure the effect of how frequently the experienced event is thought or talked about (FREQUENCYOFRECALL: a continuous variable ranging from very rarely to very frequently).[10] As in §4, we Holm-correct for multiple comparisons.

Temporal distance. First, we find that recalled and retold stories written about temporally distant events tend to contain more commonsense knowledge (|β| = 1.10, p < 0.001). We found no other significant associations with TIMESINCEEVENT.

On the other hand, the proposed measures uncover differences between the initially recalled and later retold stories that mirror the differences found between recalled and imagined stories (Table 2). Specifically, retold stories flow significantly more linearly than their initial counterparts in a pairwise comparison (Cohen's |d| = 0.17, p < 0.001; see Figure 3). Our results also indicate that retold stories contain fewer realis events (|β| = 0.09, p = 0.025), and suggest a potential increase in use of commonsense knowledge in the retold stories (|β| = 0.06, p = 0.098).

Using lexicon-based measures, we find that retold stories are significantly higher in scores for cognitive processes (|d| = 0.12, p < 0.001) and positive tone (|d| = 0.07, p = 0.02). Surprisingly, initially recalled stories contain more self references than retold stories (I-words; |d| = 0.10, p < 0.001); higher levels of self reference were found in imagined stories (vs. recalled; Table 2).

Frequency of recall. We find that the more an event is thought or talked about (i.e., higher FREQUENCYOFRECALL), the more linearly its story flows (∆l; |β| = 0.07, p < 0.001), and the fewer realis events (|β| = 0.09, p < 0.001) it contains.

Furthermore, using lexicon-based measures, we find that stories with high FREQUENCYOFRECALL tend to contain more self references (I-words; Pearson's |r| = 0.07, p < 0.001). Conversely, stories that are less frequently recalled are more logical or hierarchical (LIWC's ANALYTIC; Pearson's |r| = 0.09, p < 0.001) and more concrete (Pearson's |r| = 0.05, p = 0.03).

Discussion. Our results suggest that the proposed language and commonsense methods can measure the effects of narrativization over time in recalled memories (Bartlett, 1932; Smorti and Fioretti, 2016). On one hand, temporal distance of events is associated with stories containing more commonsense knowledge and having more linear flow. On the other hand, stories about memories that are rarely thought about or talked about are more concrete and contain more realis events, compared to frequently recalled stories which flow more linearly. This suggests that stories that become more narrativized, either by the passing of time or by being recalled repeatedly, become more similar in some ways to imagined stories.

[9] We use the logarithm of the time elapsed since the event, as subjects may perceive the passage of time logarithmically (Bruss and Rüschendorf, 2009; Zauberman et al., 2009).
[10] Note that TIMESINCEEVENT and FREQUENCYOFRECALL are somewhat correlated (Pearson r = 0.05, p < 0.001), and findings for each variable still hold when controlling for the other.

6 Conclusion

To investigate the use of NLP tools for studying the cognitive traces of recollection versus imagination in stories, we collect and release HIPPOCORPUS, a dataset of imagined and recalled stories. We introduce measures to characterize narrative flow and influence of semantic vs. episodic knowledge in stories. We show that imagined stories have a more linear flow and contain more commonsense knowledge, whereas recalled stories are less connected and contain more specific concrete events. Additionally, we show that our measures can uncover the effect in language of narrativization of memories over time. We hope these findings bring attention to the feasibility of employing statistical natural language processing machinery as tools for exploring human cognition.

Acknowledgments

The authors would like to thank the anonymous reviewers, as well as Elizabeth Clark, Tal August, Lucy Lin, Anna Jafarpour, Diana Tamir, Justine Zhang, Saadia Gabriel, and other members of the Microsoft Research and UW teams for their helpful comments.

References

Frederic Charles Bartlett. 1932. Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.

Felipe De Brigard. 2014. Is memory for remembering? Recollection as a form of episodic hypothetical thinking. Synthese, 191:155–185.

F. Thomas Bruss and Ludger Rüschendorf. 2009. On the perception of time. Gerontology, 56(4):361–370.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3).

Sven-Åke Christianson. 2014. The Handbook of Emotion and Memory: Research and Theory. Psychology Press.

Martin A. Conway, Alan F. Collins, Susan E. Gathercole, and Stephen J. Anderson. 1996. Recollections of true and false autobiographical memories. Journal of Experimental Psychology: General, 125(1).

Martin A. Conway, Christopher W. Pleydell-Pearce, Sharron E. Whitecross, and Helen Sharpe. 2003. Neurophysiological correlates of memory for experienced and imagined events. Neuropsychologia, 41(3):334–340.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

M. Brent Donnellan, Frederick L. Oswald, Brendan M. Baird, and Richard E. Lucas. 2006. The mini-IPIP scales: Tiny-yet-effective measures of the Big Five factors of personality. Psychological Assessment, 18(2):192.

Arthur C. Graesser, Scott P. Robertson, and Patricia A. Anderson. 1981. Incorporating inferences in narrative representations: A study of how and why. Cognitive Psychology, 13(1):1–26.

Melanie A. Greenberg, Camille B. Wortman, and Arthur A. Stone. 1996. Emotional expression and physical health: Revising traumatic memories or fostering self-regulation? Journal of Personality and Social Psychology, 71(3):588–602.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In NAACL-HLT, pages 1–8. Association for Computational Linguistics.

Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Dongyeop Kang, Hiroaki Hayashi, Alan W. Black, and Eduard Hovy. 2019. Linguistic versus latent relations for modeling coherent flow in paragraphs. In EMNLP.

Marlieke T. R. van Kesteren, Dirk J. Ruiter, Guillén Fernández, and Richard N. Henson. 2012. How schema and novelty augment memory formation. Trends in Neurosciences, 35(4).

Walter Kintsch. 1988. The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2):163.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Ken McRae and Michael Jones. 2013. Semantic memory. In Daniel Reisberg, editor, The Oxford Handbook of Cognitive Psychology. Oxford University Press.

Farah Nadeem, Huy Nguyen, Yang Liu, and Mari Ostendorf. 2019. Automated essay scoring with discourse-aware neural models. In Workshop on Innovative Use of NLP for Educational Applications @ ACL.

Keith Oatley. 1999. Why fiction may be twice as true as fact: Fiction as cognitive and emotional simulation. Review of General Psychology, 3(2):101–117.

Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In ACL.

James W. Pennebaker, Roger J. Booth, Ryan L. Boyd, and Martha E. Francis. 2015. Linguistic Inquiry and Word Count: LIWC 2015.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Unpublished.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Unpublished.

Valerie F. Reyna and Charles J. Brainerd. 1995. Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75.

Henry L. Roediger III, J. Derek Jacoby, and Kathleen B. McDermott. 1996. Misinformation effects in recall: Creating false memories through repeated retrieval. Journal of Memory and Language, 35(2):300–318.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.

Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum.

H. Andrew Schwartz, Gregory Park, Maarten Sap, Evan Weingarten, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Jonah Berger, Martin Seligman, and Lyle Ungar. 2015. Extracting human temporal orientation from Facebook language. In NAACL.

Matthew Sims, Jong Ho Park, and David Bamman. 2019. Literary event detection. In ACL.

Andrea Smorti and Chiara Fioretti. 2016. Why narrating changes memory: A contribution to an integrative model of memory and narrative processes. Integrative Psychological and Behavioral Science, 50(2):296–319.

Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H. Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In ACL.

Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54.

Yaacov Trope and Nira Liberman. 2010. Construal-level theory of psychological distance. Psychological Review, 117(2):440.

Endel Tulving. 1972. Episodic and semantic memory. Organization of Memory, 1:381–403.

Endel Tulving and Daniel L. Schacter. 1990. Priming and human memory systems. Science, 247(4940):301–306.

Gal Zauberman, B. Kyu Kim, Selin A. Malkoc, and James R. Bettman. 2009. Discounting time and time discounting: Subjective time perception and intertemporal preferences. Journal of Marketing Research, 46(4):543–556.

(a) Recalled main events    (b) Imagined main events

Figure 4: We extract phrases from the main themes of recalled (left) and imagined (right) stories, using RAKE (Rose et al., 2010); size of words corresponds to frequency in corpus, and color is only for readability.
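The theme phrases in Figure 4 are produced with RAKE (Rose et al., 2010). As a rough illustration of the idea (not the paper's implementation), the sketch below scores candidate phrases RAKE-style: candidates are the runs of content words between stopwords and punctuation, each word is scored by its degree/frequency ratio over those candidates, and a phrase scores the sum of its word scores. The stopword list here is a tiny illustrative stand-in, not the one used for the figure.

```python
import re
from collections import defaultdict

# Tiny stand-in stopword list; a real RAKE setup would use a full list.
STOPWORDS = {"a", "an", "the", "my", "to", "of", "in", "was", "we", "is",
             "her", "his", "and", "it", "for", "on", "at", "i"}

def rake_phrases(text, top_n=3):
    """Return the top_n (phrase, score) pairs under RAKE-style scoring."""
    # Candidate phrases: maximal runs of non-stopword words between
    # stopword / punctuation boundaries.
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in STOPWORDS or tok in ".,;:!?":
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    # Word scores: degree (co-occurrence within candidates, including the
    # word itself) divided by frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # Phrase score: sum of its word scores; longer phrases rank higher.
    scored = [(" ".join(p), sum(word_score[w] for w in p)) for p in phrases]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:top_n]

if __name__ == "__main__":
    print(rake_phrases("My daughter gave birth to her first child at the county hospital."))
```

Because multi-word candidates accumulate the degree of all their co-occurring words, RAKE tends to surface the longer topical phrases ("daughter gave birth") over isolated frequent words, which is what makes it a reasonable fit for extracting story themes.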

A Data Collection

We describe the data collection in further detail, and release our MTurk annotation templates.[11]

A.1 Post-Writing Questionnaire

After each writing stage (recalled, imagined, retold), we ask workers to rate "how impactful, important, or personal" the story was to them (for imagined and recalled stories), "how similar" to their own lives the story felt (imagined only), and "how often [they] think or talk about the events" in the story (recalled only), on a Likert scale from 1–5. Workers also take the four "openness" items from the Mini-IPIP personality questionnaire (Donnellan et al., 2006) as an assessment of overall creativity. Finally, workers optionally report their demographic information (age, gender, race).

A.2 Worker Demographics

Our stories are written by 5,387 unique U.S.-based workers, who were 47% male and 52% female (<1% non-binary, <1% other). Workers were 36 years old on average (s.d. 10 years), and predominantly white (73%, with 10% Black, 6% Hispanic, 5% Asian). We find no significant differences in demographics between the authors of imagined and recalled stories,[12] but authors of imagined stories scored slightly higher on measures of creativity and openness to experience (Cohen's d = 0.08, p = 0.01).

Note that we randomly paired story summaries to workers. We did not attempt to match the demographics of the recalled story's author to the demographics of the imagined author. Future work should investigate whether there are linguistic effects of differing demographics between the two authors.[13]

[11] Available at http://aka.ms/hippocorpus.
[12] We run Chi-squared tests for gender (χ² = 1.01, p = 0.80), for age (χ² = 9.99, p = 0.26), and for race (χ² = 9.99, p = 0.35).
[13] Future work could investigate social distance alongside other types of psychological distances (e.g., physical, temporal), using the framework given by Construal Theory (Trope and Liberman, 2010).

B Episodic vs. Semantic Knowledge

B.1 Realis Events

To detect realis events in our stories, we train a tagger (using BERT-base; Devlin et al., 2019) on the annotated corpus by Sims et al. (2019). This corpus contains 8k realis events annotated by experts in sentences drawn from 100 English books. With development and test F1 scores of 83.7% and 75.8%, respectively, our event tagger slightly outperforms the best performing model in Sims et al. (2019), which reached 73.9% F1. In our analyses, we use our tagger to detect the number of realis event mentions.

B.2 Commonsense Knowledge Matching

We quantify the prevalence of commonsense knowledge in stories, as a proxy for measuring the traces of semantic memory (Tulving and Schacter, 1990). Semantic memory is thought to encode commonsense as well as general semantic knowledge (McRae and Jones, 2013).

We design a commonsense extraction tool that aligns sentences in stories with commonsense tuples, using a heuristic matching algorithm. Given a story, we match possible ATOMIC events to sentences by selecting events that share noun chunks and verb phrases with sentences. For every sentence s_i that matches an event E in ATOMIC, we check surrounding sentences for mentions of commonsense inferences (using the same noun and verb phrase matching strategy); specifically, we check the n_c preceding sentences for matches of causes of E, and the n_e following sentences for event E's effects.

To measure the prevalence of semantic memory in a story, we count the number of sentences that matched ATOMIC knowledge tuples in their surrounding context. We use a context window of size n_c = n_e = 2 to match inferences, and use the spaCy pipeline (Honnibal and Montani, 2017) to extract noun and verb phrases.

C Recollection vs. Imagination

C.1 Linearity with Varying Context Size

Figure 5: Average negative log likelihood (NLL) of sentences conditioned on varying sizes of history of included sentences (0, 5, 10, full history) for recalled (green) and imagined (purple) stories (with 95% confidence intervals). For history sizes > 1, differences are significant when controlling for multiple comparisons (p < 0.001).

Shown in Figure 5, we compare the negative log-likelihood of sentences when conditioned on varying history sizes (using the story summary as context E). As expected, conditioning on longer histories increases the predictability of a sentence. However, this effect is significantly larger for imagined stories, which suggests that imagined stories flow more linearly than recalled stories.

C.2 Robustness of Findings

To confirm the validity of our measures, we report partial correlations between each of our measures, controlling for story length. We find that our realis measure is negatively correlated with our commonsense measures (Pearson r = −0.137, p < 0.001), and positively correlated with our linearity measure (r = 0.111, p < 0.001). Linearity and commonsense were not significantly correlated (r = −0.02, p = 0.21).

Additionally, we confirm that our findings still hold when controlling for other measures and participant random effects. Notably, we find stronger associations between our measures and story type when controlling for other measures, as shown in Table 3. We see a similar trend when additionally controlling for individual variation in workers.

variable         β w/o rand. eff.   β w/ rand. eff.
story length      0.319***           0.159**
∆l (linearity)   -0.454***          -0.642***
realis events     0.147***           0.228***
commonsense      -0.144***          -0.157***

Table 3: Results of multivariate linear regression models (with and without participant random effects), regressing onto story type (0: imagined vs. 1: recalled) as the dependent variable. All effects are significant (**: p < 0.005, ***: p < 0.001).
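The ∆l measure of §3.1 (and the varying-history NLL curves of Figure 5) only require a language model that returns per-token log-probabilities. The sketch below shows the bookkeeping under that assumption: for each sentence, we compute the length-normalized NLL once under the bag context (summary E alone) and once under the chain context (E plus the full sentence history), and take the difference. The `token_logprobs` callable is a hypothetical interface standing in for GPT scoring, and `toy_logprobs` is a toy scorer included only to make the example runnable; neither is the paper's code.

```python
import math
from typing import Callable, List

def delta_l(sentences: List[List[str]], summary: List[str],
            token_logprobs: Callable[[List[str], List[str]], List[float]]) -> List[float]:
    """Per-sentence ∆l = NLL under the bag model minus NLL under the chain model.

    token_logprobs(context, sentence) must return one log-probability per
    token of `sentence` given `context`; in the paper this role is played
    by a GPT language model.
    """
    deltas = []
    history: List[str] = []
    for sent in sentences:
        # Bag model: condition only on the story summary E.
        nll_bag = -sum(token_logprobs(summary, sent)) / len(sent)
        # Chain model: condition on E plus all preceding sentences.
        nll_chain = -sum(token_logprobs(summary + history, sent)) / len(sent)
        # ∆l(s_i) = -(1/|s_i|)[log p(s_i|E) - log p(s_i|E, s_1:i-1)]
        deltas.append(nll_bag - nll_chain)
        history += sent
    return deltas

def toy_logprobs(context: List[str], sentence: List[str]) -> List[float]:
    """Toy stand-in scorer: tokens already seen in the context are more
    probable. Purely illustrative; a neural LM replaces this in practice."""
    return [math.log(0.5 if tok in context else 0.1) for tok in sentence]

if __name__ == "__main__":
    summary = "a trip to the beach".split()
    story = [s.split() for s in
             ["we drove to the beach", "the water was cold", "we drove home"]]
    print(delta_l(story, summary, toy_logprobs))
```

A positive ∆l means the sentence is better predicted with its history than from the theme alone, i.e., the story reads more chain-like (linear) at that point; averaging ∆l over a story gives the linearity score compared across story types in §4.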