
Fabula Entropy Indexing: Objective Measures of Story Coherence

Louis Castricato, Spencer Frazier, Jonathan Balloch, and Mark O. Riedl
Georgia Tech
{lcastric, sfrazier7, balloch}@gatech.edu, riedl@cc.gatech.edu

Abstract

Automated story generation remains a difficult area of research because it lacks strong objective measures. Generated stories may be linguistically sound, but in many cases they lack the coherence required for a compelling, logically-sound story. To address this, we present Fabula Entropy Indexing (FEI), an evaluation method to assess story coherence by measuring the degree to which human participants agree with each other when answering true/false questions about stories. We devise two theoretically grounded measures of reader question-answering entropy: the entropy of world coherence (EWC) and the entropy of transitional coherence (ETC), focusing on global and local coherence, respectively. We evaluate these metrics by testing them on human-written stories and comparing against the same stories that have been corrupted to introduce incoherencies. We show that in these controlled studies, our entropy indices provide a reliable objective measure of story coherence.

1 Introduction

Automated story generation is one of the grand challenges of generative artificial intelligence. Storytelling is a crucial component of the human experience. Humans have always used storytelling to entertain, share experiences, educate, and facilitate social bonding. An intelligent system that is unable to generate a coherent story is limited in its ability to interact with humans in naturalistic ways.

A number of techniques have been explored for story generation, including symbolic planning, case-based reasoning, neural language models, and others. Despite extensive research, automated story generation remains a difficult task. One of the reasons why automated story generation is such a difficult area of research is weak objective validation measures. Traditional automated measures of natural language quality, such as perplexity and n-gram-based methods like BLEU (Papineni et al., 2002), are insufficient in creative generation domains such as story generation. These metrics assume that generated language can only be good if it resembles testing data or a given target story. This precludes the possibility that stories may be good yet completely novel. Indeed, the goal of story generation is usually the construction of novel stories.

In the absence of automated evaluation metrics, the alternative is to use human participant studies. Human participants, typically recruited via crowdsourcing platforms (e.g., Mechanical Turk or Prolific), are asked to read the stories generated by various systems and provide subjective ratings or rankings. Questionnaires may ask participants to rate or rank the overall quality of stories, but may also ask specific questions about features of stories such as fluency or coherence. Coherence is a particularly difficult feature of stories to measure because the term "coherence" can mean different things to different participants.

In this paper, we introduce a technique for objective human participant evaluation, called Fabula Entropy Indexing (FEI). FEI provides a structure for metrics that more objectively measure story coherence based on human question-answering. A fabula is a narratological term referring to the reader's inferred story world in which a story takes place, whether it be similar to the real world or a fantasy or science fiction world. The reader may of course be surprised by certain events, but other events may seem implausible or contradictory, thus disrupting coherence. As they read, humans form cognitive structures to make sense of a story, which in turn can be used to answer simple true/false questions about the story. As such, an incoherent story results in readers making random guesses about the answers to these questions. FEI metrics thus measure the entropy of the answers (how much the answers disagree with each other), which directly correlates with the coherence of the story.

We introduce two such FEI metrics: Entropy of Transitional Coherence (ETC) and Entropy of World Coherence (EWC), measuring (respectively) sequential coherence between events in a story and the internal coherence of the story world: the facts about characters, objects, and locations that distinguish a story. The correlation between human question-answering and these metrics is grounded in narratological theories.

To validate these measures, we test our metrics on human-written stories as well as corrupted versions of those stories. For the corrupted stories, we artificially reduce the coherence by altering elements of the story. We show that FEI metrics evaluate non-corrupted human-written stories as having low entropy and corrupted stories as having higher entropy.
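To make the quantity underlying these indices concrete (this illustration is ours; the formal definitions of ETC and EWC appear later in the paper), consider a single true/false question answered by a pool of readers, and let $p$ be the fraction who answer "true". The Shannon entropy of the answers is

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p),$$

which is 0 bits under unanimous agreement ($p = 0$ or $p = 1$) and reaches its maximum of 1 bit when answers are evenly split ($p = 0.5$), as expected when readers guess at random.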

2 Background and Related Work

2.1 Automated Story Generation

Early story and plot generation systems relied on symbolic planning (Meehan, 1976; Lebowitz, 1987; Cavazza et al., 2003; Porteous and Cavazza, 2009; Riedl and Young, 2010; Ware and Young, 2011) or case-based reasoning (Pérez y Pérez and Sharples, 2001; Peinado and Gervás, 2005; Turner, 2014). An increasingly common machine learning approach to story generation is to use neural language models (Roemmele, 2016; Khalifa et al., 2017; Clark et al., 2018; Martin et al., 2018). These techniques have improved with the adoption of Transformer-based models such as GPT-2 (Radford et al., 2019), and GPT-2 and similar neural language models are considered highly fluent from a grammatical standpoint. In these systems, a neural language model learns to approximate the distribution $P_\theta(tok_n \mid tok_{n-1}, \ldots, tok_1)$; the events of a story sampled from this distribution, however, may fail to cohere with each other. Studies of human reading comprehension (Trabasso and Van Den Broek, 1985; Graesser et al., 1991, 1994) show that humans comprehend stories by tracking the relations between events. Reader comprehension studies suggest that readers rely on the tracking of at least four types of relations between events: (1) causal consequence, (2) goal hierarchies, (3) goal initiation, and (4) character intentions. The perceived coherence of a story is a function of the reader being able to comprehend how events relate to each other causally or how they follow characters' pursuits of implicit goals.

To control the generation and achieve greater coherence, a high-level plot outline can either be generated or given as an input to a language model (Fan et al., 2018; Peng et al., 2018; Rashkin et al., 2020; Brahman and Chaturvedi, 2020). These techniques can produce more coherent stories when their guidance forces different parts of the story to appear related or to follow a pattern acceptable to humans.

Tambwekar et al. (2018) attempt to train a neural language model to perform goal-based generation. They fine-tune a neural language model with a policy-gradient reinforcement learning technique that rewards the language model for generating events progressively closer to the goal event.

2.2 Story Generator Evaluation

Traditional automated measures of natural language quality such as perplexity or n-gram comparisons (e.g., BLEU) are generally considered insufficient for evaluating story generation systems. Perplexity is the measure of how well a model captures the patterns in an underlying dataset. Implicit in the notion of perplexity is the belief that the quality of generated language is a function of how closely it resembles the underlying data.
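For reference, a standard formulation of perplexity over a token sequence $tok_1, \ldots, tok_N$ is

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{n=1}^{N} \log P_\theta(tok_n \mid tok_1, \ldots, tok_{n-1})\right);$$

this notation is ours rather than the paper's. Lower perplexity rewards text that closely matches the reference distribution, which is precisely why it can penalize stories that are novel yet coherent.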

We have used the Fabula Entropy Indexing method described in this paper to evaluate an automated story generation system (under review).

References

Mieke Bal and Christine Van Boheemen. 2009. Narratology: Introduction to the Theory of Narrative. University of Toronto Press.

Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. arXiv preprint arXiv:2010.06822.

Marc Cavazza, Olivier Martin, Fred Charles, Steven J. Mead, and Xavier Marichal. 2003. Interacting with virtual agents in mixed reality interactive storytelling. In International Workshop on Intelligent Virtual Agents, pages 231–235. Springer.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover's similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748–2760.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260.

William Cook. 2011. PLOTTO: The Master Book of All Plots. Tin House Books.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.

Art Graesser, Kathy L. Lang, and Richard M. Roberts. 1991. Question answering in the context of stories. Journal of Experimental Psychology: General, 120(3):254–277.

Art Graesser, Murray Singer, and Tom Trabasso. 1994. Constructing inferences during narrative text comprehension. Psychological Review, 101(3):371–395.

Arthur C. Graesser, Danielle S. McNamara, and Max M. Louwerse. 2003. What do readers need to learn in order to process coherence relations in narrative and expository text. Rethinking Reading Comprehension, 82:98.

Ahmed Khalifa, Gabriella A. B. Barros, and Julian Togelius. 2017. DeepTingle. arXiv preprint arXiv:1705.03557.

Kevin H. Knuth. 2004. Measuring questions: Relevance and its relation to entropy. AIP Conference Proceedings.

Michael Lebowitz. 1987. Planning stories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, pages 234–242.

Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. 2013. Story generation with crowdsourced plot graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27.

Jean M. Mandler and Nancy S. Johnson. 1977. Remembrance of things parsed: Story structure and recall. Cognitive Psychology, 9(1):111–151.

Lara Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark Riedl. 2018. Event representations for automated story generation with deep neural nets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

James Richard Meehan. 1976. The Metanovel: Writing Stories by Computer. Technical report, Yale University Department of Computer Science, New Haven, CT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Federico Peinado and Pablo Gervás. 2005. Creativity issues in plot generation. In Workshop on Computational Creativity, Working Notes, 19th International Joint Conference on AI, pages 45–52.

Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. 2018. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pages 43–49.

Rafael Pérez y Pérez and Mike Sharples. 2001. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental & Theoretical Artificial Intelligence, 13(2):119–139.

Julie Porteous and Marc Cavazza. 2009. Controlling narrative generation with planning trajectories: The role of constraints. In Joint International Conference on Interactive Digital Storytelling, pages 234–245. Springer.

Gerald Prince. 2003. A Dictionary of Narratology. University of Nebraska Press.

Christopher Purdy, X. Wang, Larry He, and Mark O. Riedl. 2018. Predicting generated story quality with quantitative measures. In AIIDE.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. arXiv preprint arXiv:2004.14967.

Mark O. Riedl and Robert Michael Young. 2010. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research, 39:217–268.

Melissa Roemmele. 2016. Writing stories with help from recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

Marie-Laure Ryan. 1991. Possible Worlds, Artificial Intelligence, and Narrative Theory. Indiana University Press.

Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, Brent Harrison, and Mark O. Riedl. 2018. Controllable neural story plot generation via reinforcement learning. arXiv preprint arXiv:1809.10736.

Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, Brent Harrison, and Mark O. Riedl. 2019. Controllable neural story plot generation via reward shaping. In IJCAI, pages 5982–5988.

Tom Trabasso and Paul Van Den Broek. 1985. Causal thinking and the representation of narrative events. Journal of Memory and Language, 24(5):612–630.

Tom Trabasso et al. 1982. Causal cohesion and story coherence.

Scott R. Turner. 2014. The Creative Process: A Computer Model of Storytelling and Creativity. Psychology Press.

Stephen Ware and R. Young. 2011. CPOCL: A narrative planner supporting conflict. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 6.

A Appendices

A.1 Alteration Templates

(Additional clarifying examples were given to participants when they requested them during task completion.)

The [Adjective1] Object/Entity/Event -> The [Adjective2] Object/Entity/Event

The [Adjective1] Object/Entity/Event -> The not [Adjective1] Object/Entity/Event

Object/Entity/Event is [Adverb1] [Adjective1] -> Object/Entity/Event is [Adverb1] [Adjective2]

Object/Entity/Event is [Adverb1] [Adjective1] -> Object/Entity/Event is [Adverb2] [Adjective1]

Object/Entity/Event [Adverb1] [Verb] -> Object/Entity/Event [Adverb2] [Verb]

These are just a small sample of templates, given the complex nature of certain sentences. You can make alterations beyond this, but adhere to the rules above.
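As a concrete illustration of how a corruption might be applied programmatically, the following minimal sketch implements the negation template above. This is our own example rather than the authors' pipeline; the adjective lexicon and function name are hypothetical stand-ins.

ADJECTIVES = {"tall", "awful", "old", "hairy"}  # hypothetical stand-in lexicon


def negate_adjective(line: str) -> str:
    """Apply 'The [Adjective1] X -> The not [Adjective1] X' to one story line."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip('.,!?"') in ADJECTIVES:
            # Insert the negation immediately before the first known adjective.
            return " ".join(tokens[:i] + ["not"] + tokens[i:])
    return line  # no known adjective: leave the line uncorrupted


print(negate_adjective("The tall knight entered the castle."))
# -> The not tall knight entered the castle.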

A.2 Question Templates: EWC

In the context of this narrative setting, is [Adverb/Adjective] plausible? (e.g., an "otherworldly" dog showing up in a short story about World War 2, where you might otherwise describe a "stray" dog. Note: this may not be a constraint for all readers; those answering questions will only assess based on their belief about the world.)

Prior to this line, did you imagine [Adverb/Adjective] was a possible descriptor for Object/Entity/Event?

After this line containing [Adverb/Adjective], do you hold the belief this is a possible descriptor, or do you reject it?

Because of [Adverb/Adjective], does Line N contradict information in another line?

Because of [Adverb/Adjective], does this indicate emotional valence (extreme sentiment) toward an Object/Entity/Event?

In the line with [Adverb/Adjective], does this alter Author or Entity sentiment toward Object/Event?

Because of [Adverb/Adjective], does this change your sentiment toward some Entity/Object/Event?

Does [Adverb/Adjective] contradict an assertion on Line N?

Could [Adverb/Adjective] be removed and the story world would remain unchanged?

Without [Adverb/Adjective] on Line N, Line N+1 would not have happened.

A.3 Question Templates: ETC

Does Entity A's perception of Entity B change?

Do all Entities in Line N observe or gain awareness of Events in Line N+1?

Do the Events in Line N+1 contradict Events in Line N?

Does Entity A's sentiment/emotion change between Line N-1 and Line N?

Does Object A still retain State S?

Does Object A change possession in Line N+1?

Is Object A in Line N+1 necessary for Events in Line N to occur?

Is there a change in context or location between these lines?

Is knowledge of Object A necessary for understanding the following line?

Does Line N have causal dependencies established in Line N-1?

Could Line N-1 occur before Line N?

A.4 Selected Questions

Does "awful" contradict an assertion on line 1?

Could "shaped" in line 4 be removed and the story world would remain unchanged?

Because of "tall", does line 9 contradict information in another line?

Could lines 1 and 5 both be removed and have no effect on the story?

Is there a change in context or location between lines 2 and 5?

Do the events in line 3 contradict events in line 2?
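Given participants' true/false answers to questions such as these, the entropy indices reduce to an aggregation of per-question answer entropies. The following is a minimal sketch of that computation (our own code and function names; the paper's exact ETC and EWC formulations appear in the sections omitted from this excerpt).

from collections import Counter
from math import log2


def answer_entropy(answers):
    """Shannon entropy (bits) of a list of True/False answers to one question."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mean_entropy_index(answers_by_question):
    """Average per-question entropy over all questions asked about a story."""
    entropies = [answer_entropy(a) for a in answers_by_question]
    return sum(entropies) / len(entropies)


# A coherent story yields high reader agreement (low entropy); a corrupted
# story pushes readers toward random guessing (entropy near 1 bit).
coherent = [[True] * 9 + [False], [False] * 10]
corrupted = [[True] * 5 + [False] * 5, [True] * 6 + [False] * 4]
print(mean_entropy_index(coherent))   # ~0.23 bits
print(mean_entropy_index(corrupted))  # ~0.99 bits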
