Fabula Entropy Indexing: Objective Measures of Story Coherence
Louis Castricato, Spencer Frazier, Jonathan Balloch, and Mark O. Riedl
Georgia Tech
{lcastric, sfrazier7, balloch}@gatech.edu, {riedl}@cc.gatech.edu
arXiv:2104.07472v1 [cs.CL] 23 Mar 2021

Abstract

Automated story generation remains a difficult area of research because it lacks strong objective measures. Generated stories may be linguistically sound, but in many cases lack the narrative coherence required for a compelling, logically sound story. To address this, we present Fabula Entropy Indexing (FEI), an evaluation method to assess story coherence by measuring the degree to which human participants agree with each other when answering true/false questions about stories. We devise two theoretically grounded measures of reader question-answering entropy, the entropy of world coherence (EWC) and the entropy of transitional coherence (ETC), focusing on global and local coherence, respectively. We evaluate these metrics by testing them on human-written stories and comparing against the same stories that have been corrupted to introduce incoherencies. We show that in these controlled studies, our entropy indices provide a reliable objective measure of story coherence.

1 Introduction

Automated story generation is one of the grand challenges of generative artificial intelligence. AI storytelling is a crucial component of the human experience. Humans have always used storytelling to entertain, share experiences, educate, and facilitate social bonding. An intelligent system that is unable to generate a coherent story is limited in its ability to interact with humans in naturalistic ways.

A number of techniques have been explored for story generation; these include symbolic planning, case-based reasoning, neural language models, and others. Despite extensive research, automated story generation remains a difficult task.

One reason automated story generation is such a difficult area of research is its weak objective validation measures. Traditional automated measures of natural language quality, namely perplexity and n-gram-based methods such as BLEU (Papineni et al., 2002), are insufficient in creative generation domains such as story generation. These metrics assume that generated language can only be good if it resembles testing data or a given target story. This precludes the possibility that stories may be good yet be completely novel. Indeed, the goal of story generation is usually the construction of novel stories.

In the absence of automated evaluation metrics, the alternative is to use human participant studies. Human participants, typically recruited via crowdsourcing platforms (e.g., Mechanical Turk or Prolific), are asked to read the stories generated by various systems and provide subjective ratings or rankings. Questionnaires may ask participants to rate or rank the overall quality of stories, but may also ask specific questions about features of these stories such as fluency or coherence. Coherence is a particularly difficult feature of stories to measure because the term “coherence” can mean different things to different participants.

In this paper, we introduce a technique for objective human participant evaluation, called Fabula Entropy Indexing (FEI). FEI provides a structure for metrics that more objectively measure story coherence based on human question-answering. A fabula is a narratological term referring to the reader’s inferred story world that a story takes place in, whether it be similar to the real world or a fantasy or science fiction world. The reader may of course be surprised by certain events, but other events may seem implausible or contradictory, thus disrupting coherence. As they read, humans form cognitive structures to make sense of a story, which in turn can be used to answer simple true/false questions about the story. As such, an incoherent story results in readers making random guesses about the answers to these questions. FEI metrics thus measure the entropy of the answers, that is, how much the answers disagree with each other, so lower entropy indicates a more coherent story.

We introduce two such FEI metrics: Entropy of Transitional Coherence (ETC) and Entropy of World Coherence (EWC), measuring (respectively) sequential coherence between events in a story, and the internal coherence of the story world: the facts about characters, objects, and locations that distinguish a story. The correlation between human question-answering and these metrics is grounded in narratological¹ theories.

¹ Narratology is the study of stories and storytelling.

To validate the measure, we test our metrics on human-written stories as well as corrupted versions of those stories. For the corrupted stories, we artificially reduce the coherence by altering elements of the story. We show that FEI metrics evaluate non-corrupted human-written stories as having low entropy and corrupted stories as having higher entropy.

2 Background and Related Work

2.1 Automated Story Generation

Early story and plot generation systems relied on symbolic planning (Meehan, 1976; Lebowitz, 1987; Cavazza et al., 2003; Porteous and Cavazza, 2009; Riedl and Young, 2010; Ware and Young, 2011) or case-based reasoning (Pérez y Pérez and Sharples, 2001; Peinado and Gervás, 2005; Turner, 2014). An increasingly common machine learning approach to story generation is to use neural language models (Roemmele, 2016; Khalifa et al., 2017; Clark et al., 2018; Martin et al., 2018). These techniques have improved with the adoption of Transformer-based models, such as GPT-2 (Radford et al., 2019). GPT-2 and similar neural language models are considered highly fluent from a grammatical standpoint.

In these systems, a neural language model learns to approximate the distribution $P_\theta(\mathrm{tok}_n \mid \mathrm{tok}_{<n})$, where $\theta$ denotes the parameters that approximate the pattern of an underlying dataset. Stories are produced by providing an initial context sequence, then iteratively generating additional tokens by sampling from the distribution. When the language model is trained on a corpus of stories, subsets of the generated text tend to also be a story.

One reason story generation is challenging is the strong requirement that stories be coherent. Coherence can refer to readability/fluency. However, stories also require plot coherence, which is how well the elements of a plot cohere with each other. Studies of human reading comprehension (Trabasso and Van Den Broek, 1985; Graesser et al., 1991, 1994) show that humans comprehend stories by tracking the relations between events. Reader comprehension studies suggest that readers rely on the tracking of at least four types of relations between events: (1) causal consequence, (2) goal hierarchies, (3) goal initiation, and (4) character intentions. The perceived coherence of a story is a function of the reader being able to comprehend how events correlate to each other causally or how they follow characters’ pursuits of implicit goals.

To control the generation and achieve greater coherence, a high-level plot outline can either be generated or given as an input to a language model (Fan et al., 2018; Peng et al., 2018; Rashkin et al., 2020; Brahman and Chaturvedi, 2020). These techniques can produce more coherent stories when their guidance forces different parts of the story to appear related or to follow a pattern acceptable to humans.

Tambwekar et al. (2018) attempt to train a neural language model to perform goal-based generation. They fine-tune a neural language model with a policy-gradient reinforcement learning technique that rewards the language model for generating events progressively closer to the goal event.

2.2 Story Generator Evaluation

Traditional automated measures of natural language quality such as perplexity or n-gram comparisons (e.g., BLEU) are generally considered insufficient for evaluating story generation systems. Perplexity measures how well a model captures the patterns in an underlying dataset. Implicit in the notion of perplexity is the belief that the quality of a model is tied to its ability to reconstruct its own data. However, in automated story generation, stories that are very dissimilar to training and testing data can also be “good”. Likewise, BLEU (and related techniques such as ROUGE and sentence mover techniques (Clark et al., 2019)) measures a language model’s ability to produce n-grams in a specific target sentence, whereas a good story may not resemble a given target story and yet still be coherent.

The gold standard for evaluation of automated story generation systems is to use human participant studies. Many systems are evaluated with subjective questionnaires in which human participants either rate generated stories on a scale or rank pairs of stories. Often a single question is asked about overall quality. Other subjective questions focusing on different story attributes, such as coherence, may be asked as well. Asking questions about coherence is tricky, as participants may have different notions of what coherence might mean, from grammatical notions of coherence to logical story structure.

Purdy et al. (2018) introduced a set of subjective questions for human participant studies about global coherence, local consistency, grammaticality, and overall story quality. Algorithms to predict how humans would answer these questions were also introduced. The goal of this work was [...] to our proposed measure of coherence; our technique is mathematically grounded and not tied to any particular way of generating stories.

3 Preliminaries

In this section we review narratological definitions that will be relevant to understanding how to measure the Fabula Entropy Indices.

Definition 3.1. A narrative is the recounting of a sequence of events that have a continuant subject and constitute a whole (Prince, 2003). An event describes some change in the state of the world. A “continuant subject” means there is some relationship between the events: the narrative is about something and is not a random list of unrelated events.
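
To make the answer-entropy idea from the Introduction concrete, the following Python sketch computes the Shannon entropy of readers' true/false answers to a single question. It is only an illustration under simple assumptions: the function name `answer_entropy` and the example answer lists are hypothetical, and the sketch is plain binary Shannon entropy rather than the EWC or ETC indices themselves.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy, in bits, of true/false answers to one question.

    Hypothetical helper for illustration; not the paper's EWC or ETC index.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(bool(a) for a in answers)
    total = len(answers)
    # Sum p * log2(1/p) over the answer values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Made-up answer sets from ten readers (illustrative only).
unanimous = [True] * 10               # readers fully agree
split = [True] * 5 + [False] * 5      # readers are effectively guessing

print(answer_entropy(unanimous))
print(answer_entropy(split))
```

Unanimous answers give zero entropy, while an even split gives the maximum of one bit, which matches the intuition that readers of an incoherent story are reduced to guessing.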