Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373 341

Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online)
E-mail: [email protected]

Review: A survey of script learning∗

Yi HAN1, Linbo QIAO‡1, Jianming ZHENG2, Hefeng WU3, Dongsheng LI1, Xiangke LIAO1
1Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410000, China
2Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410000, China
3School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Received July 13, 2020; Revision accepted Nov. 11, 2020; Crosschecked Jan. 8, 2021

Abstract: Script is the structured knowledge representation of prototypical real-life event sequences. Learning the commonsense knowledge inside the script can be helpful for machines in understanding natural language and drawing commonsensible inferences. Script learning is an interesting and promising research direction, in which a trained script learning system can process narrative texts to capture script knowledge and draw inferences. However, there are currently no survey articles on script learning, so we are providing this comprehensive survey to deeply investigate the standard framework and the major research topics on script learning. This research field contains three main topics: event representations, script learning models, and evaluation approaches. For each topic, we systematically summarize and categorize the existing script learning systems, and carefully analyze and compare the advantages and disadvantages of the representative systems. We also discuss the current state of the research and possible future directions.

Key words: Script learning; Natural language processing; Commonsense knowledge modeling; Event reasoning
https://doi.org/10.1631/FITEE.2000347
CLC number: TP391

‡ Corresponding author
* Project supported by the National Natural Science Foundation of China (No. 61806216)
ORCID: Yi HAN, https://orcid.org/0000-0001-9176-8178; Linbo QIAO, https://orcid.org/0000-0002-8285-2738
© Zhejiang University Press 2021

1 Introduction

Understanding is one of the most important components of a human-level artificial intelligence (AI) system. However, developing such a system is far from trivial, because natural language comes with its own complexity and inherent ambiguities. To understand natural language, machines need to comprehend not only literal meaning but also commonsense knowledge about the real world.

Commonsense knowledge includes certain internal logic and objective laws based on the development of events, which can be implicitly learned and comprehended by humans in long-term daily-life activities. After learning such knowledge, humans can use it to understand implicit information and address the ambiguities of natural language. However, machines cannot perform such daily-life activities, and therefore lack commonsense knowledge, which is one of the major obstacles to understanding natural language and drawing human-like inferences.

Some researchers attempt to alleviate such bottlenecks from the perspective of how humans remember and understand various real-life events. According to many psychological experiments, script knowledge is an essential component of human cognition and memory systems (Bower et al., 1979; Schank, 1983; Tulving, 1983; Terry, 2006). From the perspective of cognitive psychology, humans use schemas to organize and understand the world. A schema is a mental framework that consists of a general template for the knowledge shared by all instances, and a number of slots that can take on different values for a specific instance.
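The template-plus-slots structure described above can be sketched as a tiny data structure (a toy illustration of the concept; the template and slot names are our own, not taken from the cited works):

```python
# Toy sketch of a schema: a general event template shared by all
# instances, plus slots whose values differ per instance.
restaurant_template = ["enter", "order", "eat", "pay", "leave"]

def instantiate(template, slots):
    """Ground the shared event template with instance-specific slot values."""
    actor = slots["actor"]
    return [f"{verb}({actor})" for verb in template]

# The same template, two different instantiations.
john_visit = instantiate(restaurant_template, {"actor": "John"})
mary_visit = instantiate(restaurant_template, {"actor": "Mary"})
assert john_visit[0] == "enter(John)" and mary_visit[-1] == "leave(Mary)"
```

Both instantiations share one general template and differ only in their slot values, which is what allows a stored schema to be reused across specific experiences.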
A script is a special case of a schema that describes stereotypical activities with a sequence of events organized in temporal order (Rumelhart, 1980). When a person first encounters a prototypical activity, his/her episodic memory will store the events and the scenarios in which they occurred in the form of a script (Tulving, 1983). Psychological research has indicated that the activity stored in episodic memory can be activated by some particular external event (Terry, 2006). So, the next time a similar or relevant event happens, the person's episodic memory will activate the corresponding old script knowledge. Based on the script knowledge, the person will soon understand the actions and the participants of the scenario, and subsequently form expectations for possible upcoming events. These expectations deeply affect the ability of humans to comprehend specific scenarios and draw reasonable inferences.

Since script knowledge is critical for humans to remember and comprehend different scenarios, machines may also benefit from learning such knowledge. It is this intention that motivates researchers to introduce script learning to encode script knowledge for machines.

1.1 Definition of script

A script is a structural knowledge representation that captures the relationships between prototypical event sequences and their participants in a given scenario (Schank and Abelson, 1977). In other words, the script is a sequence of prototypical events organized in temporal order. These events contain stereotypical human activities, such as eating in a restaurant, cooking dinner, and making coffee. For example, Fig. 1 shows two toy examples: the "visiting restaurant" script and the "making coffee" script.

Fig. 1 Scripts for visiting a restaurant and making coffee. Each script includes a sequence of events for a prototypical scenario

1.2 Objective of script learning

The objective of script learning is to encode the commonsense knowledge contained in prototypical events (also called "script knowledge") in a machine-readable format. The machine can then use this format to draw event-related inferences, such as inferring missing events, predicting subsequent events, determining the order of the event chain, and distinguishing similar events.

A trained script learning system can process texts that describe daily-life events and capture commonsense knowledge involving the actions and the entities that participate in them. Based on this, it can also draw event-related inferences that are reasonable but not explicitly stated in the texts.

The commonsense knowledge provided by the script can help machines understand the practical meanings of natural language and draw human-like inferences. It will also play a crucial role in a wide range of AI tasks, especially many natural language processing (NLP) tasks, such as event prediction, event extraction, ambiguity resolution, discourse understanding, intention recognition, question answering, and coreference resolution.

1.3 Framework of script learning

Note that this survey concentrates mainly on broad-domain script knowledge learned from narrative event chains, which can be used to compare and infer possible upcoming events in a specific context, rather than on works attempting to construct scenario-specific event schemas. Therefore, the framework of script learning systems with which we are concerned can be illustrated by the basic process and a simple example in Fig. 2.

[Fig. 2 appears here. It illustrates the four stages with a running example: raw narrative text ("John walked into his favorite restaurant and sat on a chair. Later, he read the menu and ordered a meal. After 10 minutes, he got the food and ate it. After a while, he made payment and ...") is turned into structured events such as Walk(John, restaurant), Seat(John, chair), Read(John, menu), Order(John, meal), Get(John, food), Eat(John, food), and Make(John, payment); the model then selects the subsequent event Leave(John, restaurant) from a candidate set that also contains distractors such as Take(John, dog, walk), Swim(John, pool), and Sleep(John, bed).]

Fig. 2 Framework of script learning which contains four stages. In the extraction stage, events are extracted from raw texts. In the representation stage, events are represented by a structured form. In the model stage, the structured representations are encoded and matched by script learning models, and the matching result is outputted. In the evaluation stage, some evaluation approaches with specific tasks and corresponding metrics are employed to test the performance of the models
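The four stages in Fig. 2 can be sketched end to end as a toy pipeline (a minimal sketch with hand-coded stand-ins: the "extractor" returns pre-parsed triples and the "model" is a naive argument-overlap score, not a real parser or trained script model):

```python
def extract(text):
    # Extraction stage (stand-in): a real system would run NLP tools
    # (a dependency parser, a coreference resolver) over the raw text.
    return [("John", "walk", "restaurant"), ("John", "order", "meal"),
            ("John", "eat", "food"), ("John", "make", "payment")]

def represent(triples):
    # Representation stage: structure each event as predicate(subject, object).
    return [(verb, subj, obj) for subj, verb, obj in triples]

def score(chain, candidate):
    # Model stage (stand-in): score a candidate event by how many of its
    # arguments also appear in the observed chain.
    chain_args = {arg for _, subj, obj in chain for arg in (subj, obj)}
    return sum(arg in chain_args for arg in candidate[1:])

def evaluate(chain, candidates, gold):
    # Evaluation stage: check whether the model ranks the gold
    # subsequent event highest among the candidates.
    best = max(candidates, key=lambda c: score(chain, c))
    return best == gold

chain = represent(extract("John walked into his favorite restaurant ..."))
candidates = [("leave", "John", "restaurant"),
              ("swim", "John", "pool"),
              ("sleep", "John", "bed")]
print(evaluate(chain, candidates, ("leave", "John", "restaurant")))  # True
```

Each function mirrors one stage of the framework: its input is the previous stage's output, and the final stage compares the prediction with the gold result.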

According to the standard script learning settings, our framework can be roughly divided into four stages: the extraction stage, representation stage, model stage, and evaluation stage.

Script knowledge is often mentioned implicitly in the unstructured narrative texts that describe daily-life activities. The first task of script learning is employing NLP tools to detect and extract events from the raw text (i.e., the extraction stage in Fig. 2).

The output from the previous stage is then represented in a specific structure or neural network to form representations of events (i.e., the representation stage in Fig. 2). In this stage, events extracted by NLP tools are regarded as the input, and structured representations of events as the output.

After that, the script learning models encode the collected event representations, and measure the relationship between them using event matching algorithms (i.e., the model stage in Fig. 2). For this stage, the input is the structured event representations, and the output is the matching results. Notably, similarity or coherence scores are usually adopted as the matching metrics.

Finally, some evaluation approaches with specific tasks and corresponding metrics are employed to test the quality of event representations and the performance of script learning models (i.e., the evaluation stage in Fig. 2). For this stage, the input is the event matching results, and the output is the performance, based on comparisons between prediction results and golden results.

Specifically, in Fig. 2, the extracted events are represented by the structure predicate(subject, object), while the evaluation is to judge whether the model can correctly predict the subsequent event from a candidate set, given the event chain that has occurred.

1.4 Organization of the survey

We can conclude from the above framework that script learning primarily focuses on five aspects: the corpus used as a training dataset, the NLP tools for detecting and extracting events, the representation form of script events, the models for encoding and matching events, and the evaluation approaches for estimating the performance of the models.

There is no standard corpus for script learning, and a majority of related works extract events from news articles by themselves. Therefore, we do not discuss corpora in detail in the main part of the survey, but introduce such literature in the future research directions described in Section 6.2. In addition, event detection and extraction belong to other independent NLP research directions, and there are many off-the-shelf open-source NLP tools that can be used conveniently. Consequently, in this survey, we do not consider event detection and extraction either. In summary, we focus mainly on three aspects, i.e., event representations, script learning models, and evaluation approaches.

The rest of the survey is organized primarily based on these research topics. Specifically, Section 2 briefly reviews the development history of script learning; Section 3 introduces several typical event representation structures; Section 4 introduces representative script learning models in seven categories; Section 5 introduces common evaluation approaches and corresponding metrics; Section 6 concludes the survey and discusses some possible future research directions. The organization of these sections and some related specific items are shown in Fig. 3.

Although script learning is quite a promising and significant research direction, there are currently no survey articles on the subject, to the best of our knowledge. The contributions of this survey can be summarized as follows:

1. We are the first to provide a comprehensive survey that deeply investigates the standard framework and major research topics in script learning.

2. We thoroughly summarize the development process and the technique taxonomy of script learning systems.

3. We carefully analyze and compare the most representative works on script learning and discuss some promising research directions.

2 Historical overview

In this section, we provide a chronological timeline of script learning research. Based on modeling methods, the script learning development timeline can be roughly divided into three stages: incipient rule-based approaches (1975–2008), early count-based approaches (2008–2014), and recent deep learning based approaches (2014–). Although there are different ways of dividing this timeline, we will follow the above division to briefly review the script learning development process, because it best fits the chronological timeline.

[Fig. 3 appears here. It diagrams the research topics covered by the survey: ① Introduction; ② Historical overview; ③ Event representations (protagonist, relational, multi-argument, and fine-grained property representations); ④ Script learning models (rule-based, count-based, deep learning based, pair-based, chain-based, graph-based, and external knowledge models); ⑤ Evaluation approaches (NC, ANC, MCNC, MCNS, MCNE, and SCT); ⑥ Conclusions and future directions.]

Fig. 3 Organization of the survey. We discuss mainly three research topics, i.e., event representations, script learning models, and evaluation approaches. These topics and their related specific items are introduced in Sections 3–5

2.1 Rule-based approaches (1975–2008)

Although script learning has attracted wide attention in recent years, the use of script-related concepts in AI research dates back to the 1970s, in particular, to the seminal works on the schema by Rumelhart (1980), the frame by Minsky (1975), the semantic frame by Fillmore (1976), and scripts by Schank and Abelson (1977). These works established the core concepts of script learning.

The incipient script theory was primarily introduced to construct story understanding and generation systems. The works in this stage often employed quite complex rules (mostly handwritten) to process events and used a set of separate models to treat different scenarios. Although these works showed some abilities in script understanding and inference, their rule-based methods do not scale to complex domains, because they are characteristically domain-specific and time-consuming to construct manually.

2.2 Count-based approaches (2008–2014)

The trend of script learning was revived by count-based methods, which can automatically learn broad-domain script knowledge by using statistical approaches to obtain co-occurrence counts from a large text corpus.

Count-based script learning stems from the unsupervised framework proposed by Chambers and Jurafsky (2008). They ran a co-reference system and a dependency parser to automatically extract event sequences from raw text. Pairwise mutual information (PMI) scores were employed to measure the relationship between event pairs. They also pioneered the use of <verb, dependency> pairs to represent events and proposed the narrative cloze (NC) test to evaluate script learning models.

Several researchers followed their model framework and enhanced it in different ways, such as using a skip bi-gram probability model to replace PMI (Jans et al., 2012), or proposing the multi-argument event representation v(es, eo, ep) to replace the <verb, dependency> pair (Pichotta and Mooney, 2014).

In this period, there was also a set of works that attempted to encode script knowledge by constructing an event graph, which can express denser and broader connections among events. As a general rule, the nodes of the graph refer to the events, and the edges refer to the relationships between events. Different models build script graphs differently: Regneri et al. (2010) constructed a temporal script graph for a specific scenario by learning from crowd-sourced data; Orr et al. (2014) employed the hidden Markov model (HMM) to construct transition script graphs; Glavaš and Šnajder (2015) constructed event graphs using a three-stage pipeline to extract the anchors, arguments, and relationships between event pairs from the text.

However, count-based methods determine the probability of an event by calculating co-occurrence counts, and learn only shallow statistical information rather than deep semantic properties. There are two main shortcomings: impoverished representations and sparsity issues. These methods treat events as one fixed unit, consequently failing to consider the compositional nature of an event and suffering from sparsity issues, which seriously restricts the inference ability of script learning systems.

2.3 Deep learning based approaches (2014–)

The emergence of deep learning technology has brought NLP into a new era, and modern script learning systems widely introduce neural networks and embeddings to counter the above shortcomings. Embeddings mitigate the sparsity issue and provide a useful approach to composing predicates with their arguments. Deep learning based models use word embeddings to represent event components (i.e., a predicate and its arguments) and obtain the embedding of the entire event by composing the embeddings of its components via a composition neural network. The relationship between events can then be measured based on vector distances in the embedding space.

As for the initial experimental works, Modi and Titov (2014a, 2014b) first introduced embedding methods to represent events; Rudinger et al. (2015) tentatively trained a log-bilinear neural language model (Mnih and Hinton, 2007) and attained a remarkable improvement on NC tests; Granroth-Wilding and Clark (2016) employed word2vec, the vector-learning system of Mikolov et al. (2013), to learn embeddings.

The aforementioned works showed that deep learning based models were considerably more effective for script learning tasks. How to design a capable neural network structure then became the main challenge at this stage, and more advanced mechanisms and more complex structures were introduced by follow-up works to make further improvements.

As for more advanced mechanisms, Pichotta and Mooney (2016a) first used a recurrent neural network (RNN) based method, long short-term memory (LSTM), to capture the order information of an event chain and express interactions between long-term sequences. Hu et al. (2017) applied a contextual hierarchical LSTM (CH-LSTM) that can capture both the word sequence of a single event and the temporal order of the event chain.

As for more complex structures, Wang ZQ et al. (2017) incorporated the advantages of both LSTM temporal order learning and traditional event pair coherence learning, using an LSTM with a dynamic memory network (Weston et al., 2015). Lv et al. (2019) employed a self-attention mechanism (Lin ZH et al., 2017) to extract event segments, which have richer semantic information than individual events and introduce less noise than event chains.

Following the recent trend of using the pre-trained language model BERT (Devlin et al., 2019), a few novel works have introduced BERT to provide the basic word embeddings in script learning and obtained significant improvements. For instance, Zheng et al. (2020) replaced the traditional GloVe embeddings (Pennington et al., 2014) with BERT, and proposed a unified framework to integrate raw-text training, intra-event training, and inter-event training. Li ZY et al. (2019) introduced the transferable BERT (TransBERT) training framework, which can transfer not only general knowledge from large-scale unlabeled data but also specific knowledge from various semantically related supervised tasks.
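The deep learning based models above share one recipe: embed each event component, compose the component embeddings into an event vector, and compare events by distance in the embedding space. A minimal sketch with hand-made three-dimensional "embeddings" (the vectors are invented for illustration; real systems learn them with word2vec, GloVe, or BERT, and use a learned composition network instead of averaging):

```python
import math

# Invented component "embeddings" for illustration only.
emb = {
    "order": [0.9, 0.1, 0.2], "eat":  [0.8, 0.2, 0.1],
    "swim":  [0.0, 0.9, 0.4], "john": [0.3, 0.3, 0.3],
    "meal":  [0.7, 0.0, 0.2], "food": [0.75, 0.05, 0.15],
    "pool":  [0.1, 0.8, 0.5],
}

def compose(components):
    # A (very) simple composition: average the component embeddings.
    vecs = [emb[c] for c in components]
    return [sum(x) / len(vecs) for x in zip(*vecs)]

def cosine(u, v):
    # Cosine similarity: the usual vector-distance measure for events.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

order_meal = compose(["order", "john", "meal"])
eat_food   = compose(["eat", "john", "food"])
swim_pool  = compose(["swim", "john", "pool"])

# Related events end up closer in the embedding space.
assert cosine(order_meal, eat_food) > cosine(order_meal, swim_pool)
```

Because every event is a dense vector rather than an atomic symbol, unseen argument combinations still receive sensible similarity scores, which is exactly how these models mitigate the sparsity issue of count-based methods.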
Graph-based works also introduced neural networks and embeddings into their models during this period. For example, Zhao et al. (2017) embedded an abstract causality network into a vector space and applied it in a neural network to make stock market movement predictions; Li ZY et al. (2018) constructed the scaled graph neural network (SGNN) to model event interactions and learn event representations.

Recently, some notable works have attempted to enhance script learning models by injecting additional information, such as fine-grained event properties (Lee IT and Goldwasser, 2018), nuanced multi-relations (Lee IT and Goldwasser, 2019), and additional commonsense knowledge (Ding et al., 2019b).

3 Event representations

At the linguistic level, an event and its components can be verbalized by free-form narrative texts. We refer to the lexical realization of a script event as an event description, which consists of verb predicates and some noun phrases. The verb predicates describe the actions, while the noun phrases describe the related entities. We refer to the verb predicate and its associated noun phrases as event components, which can be extracted from event descriptions by NLP tools. For example, the text "Tom brought the book to Mary" is a description of the bringing book event, containing four components: a predicate "brought" and three noun phrases "Tom," "book," and "Mary." These components can be combined with a specific structure to form the representation of the event, such as bring(Tom, book, Mary).

The main aim of event representation is to represent the event with an appropriate structure, which can include the essential components of the event and capture the implicit commonsense knowledge hiding behind the text description. Event representation is very important for script learning, because it specifies what we mean by an "event," and because it is the basic operational element of the subsequent script modeling process.

In addition to the differences in representation structure, different models differ greatly in representation form. Some models use discrete representations (i.e., treating the whole structured event representation as an unbreakable atomic unit), while others use distributed representations (i.e., treating the event as a dense vector of real values). In this section, however, we focus mainly on the various representation structures; the relevant features of the different representation forms will be discussed in detail from Section 4.2 onward.

3.1 Protagonist representation

Chambers and Jurafsky (2008) first represented a free-form event description as a structured <verb, dependency> pair, which we refer to as a protagonist representation. They assumed that although a script has several participants, there is a central actor, called the protagonist, who mainly characterizes the whole event chain. They extracted narrative chains from texts, each of which involves a sequence of events that share a common protagonist.

They represented events as <verb, dependency> pairs, where "verb" refers to the predicate verb describing the event, and "dependency" refers to the typed grammatical dependency relationship between the verb and the protagonist, such as "subject," "object," or "prepositional." A narrative chain contains a series of events, so it can be represented as a chain of <verb, dependency> pairs, all related to a single protagonist.

For example, the narrative text "Jessie killed a man, she ran away and got arrested by the police in the street" produces a chain for the protagonist Jessie: (kill, subject), (run, subject), (arrest, object).

Chambers and Jurafsky (2008) made a preliminary attempt to use a structured event representation, and their protagonist representation has been adopted by various subsequent works (Chambers and Jurafsky, 2009; Jans et al., 2012; Rudinger et al., 2015). However, it still has some flaws.

First, it solely concentrates on organizing event chains around a central participator, the protagonist. Thus, a script that contains multiple participants will redundantly produce many chains. For example, the narrative text "Mary emailed Jim and he responded to her immediately" yields two chains: a chain for Mary, (email, subject) (respond, object), and a chain for Jim, (email, object) (respond, subject).

Second, it treats subject-verb and verb-object separately. Thus, it lacks coherence between subject and object; in other words, it does not express interactions between different <verb, dependency> pairs produced from the same event. Take the above narrative as an example. There is no connection between (email, subject) of Mary and (email, object) of Jim, even though they are both yielded from the same event "Mary emailed Jim."

Third, it considers only the <verb, dependency> pairs rather than other arguments; however, sometimes the meaning of the verb changes drastically with different arguments. For example, consider the following events: "Jim performed a music" and "Jim performed a surgery." The same verb, "perform," has quite different meanings in these two events, but when using the protagonist representation, they will obtain the same representation, namely (perform, subject). Furthermore, even if the meanings of the verbs are the same, the meanings of the events will change with different arguments. For example, when using the protagonist representation, the event "go to the church" will be the same as the event "go to sleep," because they both yield (go, subject).

3.2 Relational representation

Balasubramanian et al. (2013) addressed these flaws by representing an event as a relational triple, inspired by the information extraction system OLLIE (Mausam et al., 2012). We refer to <Arg1, Relation, Arg2> as a relational representation, where Arg1 and Arg2 are the subject and object of the event description, respectively, and Relation is the relation phrase between Arg1 and Arg2. For instance, the event "He cited a new study that was released by UCLA in 2008" produces three tuples:

(1) (He, cited, a new study);
(2) (A new study, was released by, UCLA);
(3) (A new study, was released in, 2008).

A relational triple provides a more specific representation method, which aids in maintaining coherence between subject and object. Balasubramanian et al. (2013) used it to induce open-domain event schemas, which effectively overcame the coherence-lacking issue of the event schemas proposed by Chambers and Jurafsky (2009).

3.3 Multi-argument representation

Pichotta and Mooney (2014) proposed a similar approach, called multi-argument representation. They represented an event as a relational atom v(es, eo, ep), where v is the verb, and es, eo, and ep are arguments of v, standing for the subject relation, object relation, and prepositional relation to the verb, respectively. Any argument except v may be null. For example, the text "Tom brought the book to Mary" yields the representation bring(Tom, book, Mary).

A multi-argument representation produces only a single sequence for a document. Given the same text as above, "Mary emailed Jim and he responded to her immediately," it yields only one chain: email(Mary, Jim, -), respond(Jim, Mary, -).

A multi-argument representation can involve multiple entities and express a more specific meaning of an event. It can also capture pair-wise entity relations between events. In the example above, it can directly model the fact that Jim responded to Mary after Mary emailed him as one sequence chain, while the <verb, dependency> pairs solely capture two discrete events.

Many subsequent works, such as Pichotta and Mooney (2016a), Modi (2016), Wang ZQ et al. (2017), and Lv et al. (2019), adopted this representation form. Pichotta and Mooney (2016a) made a slight improvement to the multi-argument representation by adding a preposition dimension, representing an event as a 5-tuple (v, es, eo, ep, p), where p is the preposition. So, the above text "Tom brought the book to Mary" generates the representation (bring, Tom, book, Mary, to).
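The three structures introduced so far can be written down side by side (a schematic sketch; the field names follow the notation in the text, applied to "Tom brought the book to Mary"):

```python
from collections import namedtuple

# 3.1 Protagonist representation: one <verb, dependency> pair per participant.
ProtagonistEvent = namedtuple("ProtagonistEvent", "verb dependency")
tom_view  = ProtagonistEvent("bring", "subject")        # chain for Tom
mary_view = ProtagonistEvent("bring", "prepositional")  # chain for Mary

# 3.2 Relational representation: an <Arg1, Relation, Arg2> triple.
RelationalEvent = namedtuple("RelationalEvent", "arg1 relation arg2")
triple = RelationalEvent("Tom", "brought to", "Mary")

# 3.3 Multi-argument representation, in the 5-tuple (v, es, eo, ep, p)
# variant that adds a preposition dimension.
MultiArgEvent = namedtuple("MultiArgEvent", "v es eo ep p")
event = MultiArgEvent("bring", "Tom", "book", "Mary", "to")

# The protagonist form drops the object entirely, while the
# multi-argument form keeps all participants in one structure.
assert "book" not in tom_view
assert event.eo == "book"
```

Laying the structures out this way makes the trade-off concrete: each step to the right keeps more arguments per event, at the cost of a sparser, more specific representation.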

3.4 Fine-grained property representation

Lee IT and Goldwasser (2018) held the view that fine-grained event properties, such as argument, sentiment, animacy, time, and location, can potentially enhance the event representation, so they suggested a fine-grained property representation. They represented an event as a 6-tuple <tok(e), subj(e), obj(e), prep(e), f_1(e), f_2(e)>, which contains four basic components and two fine-grained properties.

The basic components refer to the former 4-tuple <tok(e), subj(e), obj(e), prep(e)>, where tok(e) stands for the <verb, dependency> pair, which is the same as the protagonist representation. The other three elements (i.e., subj(e), obj(e), and prep(e)) are the same as the (e_s, e_o, e_p) of the multi-argument representation v(e_s, e_o, e_p).

The two additional properties f_1(e) and f_2(e) refer to the sentence-level sentiment and the animacy information, respectively. The sentence-level sentiment captures the overall tone of the event, consisting of three polarity labels (i.e., negative, neutral, and positive). The animacy information also consists of three types (i.e., animate, inanimate, and unknown), depending on whether the protagonist is a living entity.

For instance, given the narrative text "Jenny went to a restaurant and ordered a salad," the fine-grained property representation for the protagonist Jenny will be <(go, subj), jenny, -, restaurant, neutral, animate>, <(order, subj), jenny, salad, -, neutral, animate>.

3.5 More detailed information

Although similar event representation structures are used, different works may process details in different ways.

Some works extract only the headword of entity mentions, while other works extract the entire mention span. For example, consider the following event: "Jenny went into her favorite restaurant." The former works may represent it as go(Jenny, restaurant), while the latter works may see it as go(Jenny, her_favorite_restaurant). The latter works can capture more nuanced information, but run a greater risk of sparsity.

When it comes to predicates, some works consider only verbs, while other works also consider predicate adjectives, because some adjectives may play an important role in predicting the next event. Consider two events: "Jerry was angry" and "Jerry punched her son." The predicate adjective "angry" is crucial information for predicting the follow-up event correctly. These works represent the predicate adjective as an argument of the verb be or become, e.g., "Jerry was angry" → be(Jerry, angry).

Some works further consider more complicated predicates, such as negation labels, particles, and clausal complements (xcomp) (Lee IT and Goldwasser, 2019). With negation labels, the event "She does not like the dinner" results in the predicate "not_like" rather than "like" alone. Also, because some verbs, such as "go" and "have," carry little meaning on their own, it is necessary to include the xcomp. In that case, the event "She went shopping yesterday" yields the predicate "go_shop" rather than "go" alone.

4 Script learning models

The script learning model is the core component of the script learning system, and undertakes the main tasks of encoding and matching events. Generally, models take the representations of the structured events as the input and produce event-matching results as the output. A trained script learning model can encode script knowledge involving the actions and the entities that participate in the actions, and calculate the matching results based on the encoded script knowledge and the event-matching algorithm. In the next step, the matching results can be applied to specific tasks to draw inferences that are commonsensible but not explicitly stated. In most cases, the matching results refer to similarity or coherence scores, which can somewhat measure the relationship between events.

There are various ways to classify the existing script learning models. Recalling from Section 2, based on modeling methods, they can be roughly divided into three categories: rule-based models, count-based models, and deep learning based models. Rule-based models use complicated handcrafted rules to model script knowledge of specific domains; count-based models use statistical counting approaches to automatically learn broad-domain script knowledge from a large text corpus; deep learning based models introduce neural networks and embeddings to capture richer script knowledge and overcome the sparsity issues.
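To make the 6-tuple of Section 3.4 concrete, the Jenny example can be encoded as follows. This is a minimal Python sketch with illustrative class and field names, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative sketch of the 6-tuple event representation of
# Lee IT and Goldwasser (2018); names are assumptions for illustration.
@dataclass
class FineGrainedEvent:
    tok: Tuple[str, str]      # <verb, dependency> pair, e.g., ("go", "subj")
    subj: Optional[str]
    obj: Optional[str]
    prep: Optional[str]
    sentiment: str            # "negative" | "neutral" | "positive"
    animacy: str              # "animate" | "inanimate" | "unknown"

# "Jenny went to a restaurant and ordered a salad."
events = [
    FineGrainedEvent(("go", "subj"), "jenny", None, "restaurant", "neutral", "animate"),
    FineGrainedEvent(("order", "subj"), "jenny", "salad", None, "neutral", "animate"),
]
```

The first four fields mirror the multi-argument representation, while the last two carry the sentence-level sentiment and animacy properties.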

From the perspective of the different structures used to handle scripts, they can also be divided into three other categories: event pair based models, event chain based models, and event graph based models. Pair-based models focus on calculating associations between pairs of events by count- or vector-based methods; chain-based models capture the order information and long-term context information of the full event chain by RNN-based approaches; graph-based models extract script knowledge by constructing graph structures that can express denser and broader connections among events.

In this section, we discuss some representative models in detail, according to all of these six categories. Finally, we also introduce some other notable works outside the above categories that attempt to enhance traditional script learning models by injecting external knowledge.

Notably, because these categories are divided from various perspectives, there is inevitably some overlap between them. In addition, each script learning model may involve multiple categories, so when discussing specific categories below, we may emphasize some particular aspects.

4.1 Rule-based models

The handcrafted rule-based script learning models are the earliest use of script-related concepts in AI research. The seminal works include schemas by Rumelhart (1980), frames by Minsky (1975), semantic frames by Fillmore (1976), and scripts by Schank and Abelson (1977). Although these works established the core concept of script learning and made many beneficial attempts, they still have many obvious demerits and limitations. Below, we briefly introduce and discuss these incipient works.

Schema (Rumelhart, 1980) is a broad concept of cognitive psychology and an important theoretical source of the script. A schema is a mental framework employed to organize and understand the world, and consists of a general template and some slots. The general template represents an abstract class containing the knowledge shared by all instances, while the slots can be filled in with different values to form a specific instance of this class. For example, a restaurant scenario has its own schema, which includes a sequence of events that take place in a restaurant, such as entering the restaurant, ordering food, eating food, paying the bill, and leaving the restaurant. When we go to a restaurant, we can activate the restaurant schema that is stored in our memory, and anticipate the upcoming events.

A frame (Minsky, 1975) is a variation of the schema, and is a data structure used in AI to represent knowledge about stereotyped situations. The frame has a hierarchical structure containing four different levels: surface syntactic frames, surface semantic frames, thematic frames, and narrative frames. Similar to the schema, the top levels (thematic frames and narrative frames) are fixed general templates with slots for different situations. In contrast, the bottom levels (surface syntactic frames and surface semantic frames) are specific instances that fill in the slots with different values.

A semantic frame (Fillmore, 1976) can be roughly seen as a lexical-level Minsky frame, which was proposed to capture the abstract description of an individual activity and all of its possible roles. A semantic frame includes a set of predicates that can describe the main concept of the activity and the corresponding semantic roles, which can describe the different entities involved in the activity. The same activity can be expressed via different surface realizations, with different predicates or sentence constructions. For example, consider the following three sentences:

(1) Microsoft purchased the Canadian company Maluuba.

(2) AI company Maluuba was bought by Microsoft on January 9, 2017.

(3) Microsoft purchased Maluuba for 30 million dollars.

Although different in surface forms, the core semantic meaning of them is similar, namely a commerce purchase activity. The purpose of the semantic frame is to capture the core semantic meaning across all the surface forms, by abstracting out the information about the main activity. The semantic frame also assigns proper semantic roles to the entities involved in the activity (e.g., Microsoft is the buyer role, and Maluuba is the role of the goods). The semantic roles can help understand the meaning of the text by answering questions like "who did what to whom and by what means?" They are also useful for a lot of NLP tasks, such as question answering (Shen and Lapata, 2007), document summarization (Khan et al., 2015), and plagiarism detection (Osman et al., 2012).
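The abstraction can be made concrete with a small sketch: the three surface sentences above map to one frame instance sharing the same core roles. The frame and role names below follow FrameNet-style conventions and are illustrative assumptions, not taken from the survey:

```python
# Illustrative FrameNet-style frame instances for the three sentences above;
# "Commerce_buy", "buyer", and "goods" are assumed names for illustration.
frame_instances = [
    {"frame": "Commerce_buy", "buyer": "Microsoft", "goods": "Maluuba"},   # (1)
    {"frame": "Commerce_buy", "buyer": "Microsoft", "goods": "Maluuba",
     "time": "January 9, 2017"},                                           # (2)
    {"frame": "Commerce_buy", "buyer": "Microsoft", "goods": "Maluuba",
     "money": "30 million dollars"},                                       # (3)
]

# Despite different surface forms, all three share one core frame and roles.
core = {(f["frame"], f["buyer"], f["goods"]) for f in frame_instances}
assert core == {("Commerce_buy", "Microsoft", "Maluuba")}
```

Optional roles (time, money) fill extra slots without changing the core activity.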

The concept of script was first firmly proposed by Schank and Abelson (1977) to explain how commonsense knowledge about daily human activities can be used in language processing. They defined scripts as structured knowledge representations capturing the relationships between prototypical event sequences and their participants in a given scenario. In short, a script is a sequence of prototypical events organized in temporal order. These events contain stereotypical human activities, such as eating in restaurants, cooking dinner, and making coffee.

To compare the script with the concepts mentioned before, we can see the script as a specific schema or frame that focuses on stereotypical events in a particular scenario. We can also treat the script as multiple semantic frames, because it models a sequence of events rather than an individual activity.

The incipient script theory was primarily introduced to construct story understanding and generation systems (Schank and Abelson, 1977; Cullingford, 1978; DeJong, 1979; Schank, 1990). These script systems employ mainly complex handcrafted rules for modeling the relations between events, and use a set of separate models to treat different scenarios. Because the rules are designed by humans, rule-based systems are limited by the bounded experience of their designers, and are extremely time-consuming to build. Although they could be applied in certain fields, their rule-based methods were characteristically limited to a few specific domains and did not scale to complex ones.

Some potential end-to-end connectionist methods were subsequently proposed to model script knowledge, such as DYNASTY (DYNAmic STory understanding sYstem) (Lee G et al., 1992) and DISCERN (DIstributed SCript processing and Episodic memoRy Network) (Miikkulainen, 1992, 1993). These works have some abilities in script understanding and inference, but more rules require more computing power, and more application scenarios require more training data. Due to the weak computational power and insufficient data at that time, their abilities were limited and these works did not achieve good generalization (Mueller, 1998; Gordon, 2001).

Nevertheless, these works are of great value to the subsequent works. From the structure and operation process of DISCERN, we can even see the rudiments of modern script learning systems. The input of DISCERN is a short narrative text about a stereotypical event, and a "lexicon" module is used to map the input words into distributed representations. Then a "sentence parser" module processes each input sentence word by word to form the representation of the sentence. After that, a "story parser" module composes all the sentence representations to produce the representation of the story. Finally, the story representation is input into a "story generator" module and a "sentence generator" module to generate corresponding sentences about the story. The corresponding sentences can be used to solve story-related problems such as retrieving information, answering questions, and generating paraphrases. All the modules are first trained separately, and subsequently trained jointly to fine-tune the whole system.

The idea of employing distributed representations and combining the component representations to produce representations of the whole is similar to that of many modern deep learning based models. These deep learning based models widely employ distributed representations, and use embeddings to represent words, sentences, and events. They also create representations of a sentence by composing the words in the sentence, and create representations of an event by combining its components. In addition, the joint training and fine-tuning methods used by DISCERN are generally used by recent deep learning models. With improvements in computing power, the sophistication of model structures, and the surge of data volume, the ability of these deep learning based models to process and generalize is far greater than that of DISCERN. We will discuss these models in detail in Sections 4.3–4.5.

4.2 Count-based models

Count-based script learning models greatly reduce the defects of rule-based models by automatically learning broad-domain script knowledge from a large text corpus. The key idea of count-based models is to measure the relationship of event pairs using statistical counting approaches.

Chambers and Jurafsky (2008) were the first to introduce a count-based approach for learning statistical script knowledge. They adopted pointwise mutual information (PMI) (Church and Hanks, 1990) to calculate the relationship scores over all the event pairs that occurred in the narrative chains.
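This counting-and-scoring idea can be sketched roughly as follows. The code is illustrative, not the authors' implementation, and the toy narrative chains are invented:

```python
import math
from collections import Counter
from itertools import combinations

# Rough sketch of PMI-based event-pair scoring in the spirit of
# Chambers and Jurafsky (2008). Events are <verb, dependency> pairs.
chains = [
    [("enter", "subj"), ("order", "subj"), ("eat", "subj"), ("pay", "subj")],
    [("enter", "subj"), ("order", "subj"), ("pay", "subj"), ("leave", "subj")],
]

pair_counts, event_counts = Counter(), Counter()
for chain in chains:
    event_counts.update(chain)
    for a, b in combinations(chain, 2):
        pair_counts[frozenset((a, b))] += 1   # symmetric count: order ignored

def pmi(a, b):
    p_ab = pair_counts[frozenset((a, b))] / sum(pair_counts.values())
    p_a = event_counts[a] / sum(event_counts.values())
    p_b = event_counts[b] / sum(event_counts.values())
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

# A candidate is scored against a context chain by summing pair-wise PMIs,
# and the highest-scoring candidate is chosen as the prediction.
context = [("enter", "subj"), ("order", "subj")]
candidates = [("pay", "subj"), ("leave", "subj")]
best = max(candidates, key=lambda e: sum(pmi(c, e) for c in context))
```

Note that the co-occurrence count here is symmetric, which is exactly the property the ordered PMI of Jans et al. (2012), discussed below, revises.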

As reviewed in Section 3.1, Chambers and Jurafsky (2008) used protagonists to represent events. They extracted narrative chains from the corpus; a narrative chain is a script-like event sequence that shares a common protagonist, represented as a series of <verb, dependency> pairs, all related to the common protagonist. They collected all the <verb, dependency> pairs that shared the same entity as a subject, object, or preposition, to form a narrative event chain.

Chambers and Jurafsky (2008) followed the assumption that verbs sharing coreferring arguments are semantically connected. Thus, argument-sharing verbs are more likely to participate in the same narrative chain than verbs that do not share arguments. Therefore, they measured the relationships between event pairs based on how often they shared grammatical arguments, and how often two predicate verbs occurred in the same narrative chain.

PMI was adopted to calculate the specific scores of the relations. Given an occurring narrative chain C = (c_1, c_2, ..., c_n), which contains n events, and a candidate event e as the possible new event, every event is represented as a <verb, dependency> pair. The PMI of C and e is the sum of the PMIs between the event e and each of the occurring events c_i in C, which can be formulated as

PMI(C, e) = \sum_{i=1}^{n} PMI(c_i, e) = \sum_{i=1}^{n} \log \frac{P(c_i, e)}{P(c_i) P(e)}.    (1)

The numerator is defined by (taking c_1 as an example)

P(c_1, e) = \frac{C(c_1, e)}{\sum_{c_i} \sum_{e_j} C(c_i, e_j)},    (2)

where C(c_1, e) can be obtained by counting the co-occurrence times of the event pair (c_1, e) in the training data, regardless of the order.

The authors hypothesized that the most likely new event had the highest PMI score, so they did the above calculations for every candidate event in the training corpus and then chose the one with the highest PMI score as the prediction. The result can be obtained by maximizing the following expression:

\max_{j} \sum_{i=1}^{n} PMI(c_i, e_j),    (3)

where e_j is the jth candidate event in the training corpus.

The work of Chambers and Jurafsky (2008) revived the trend of script learning, but their unsupervised framework still has some shortcomings. The framework ignores the temporal order, which may affect the performance, and has an impoverished event representation, which cannot deal with multiple-argument events. Several researchers followed their framework and enhanced it in different aspects.

The first essential extension was made by Jans et al. (2012). They introduced novel skip n-gram counting methods, together with an ordered PMI model and a bigram probability model, to replace the original PMI.

Following the insight that semantically related events do not have to be strictly adjacent, Jans et al. (2012) used skip-grams rather than regular n-grams to collect statistical data for the training model. The skip-gram method can reduce the sparsity and increase the size of the training data.

Particularly, the N-skip bigram strategy finds all event pairs that occur with 0 to N events intervening between them in an event chain. Fig. 4 illustrates an example that contains event pairs collected by 0-, 1-, and 2-skip bigrams.

After obtaining the statistical data, the authors introduced two novel score functions to explicitly capture the temporal order information of event pairs. The first one is the ordered PMI (OP) score, a variation of the PMI, which explicitly takes into account the temporal order of event pairs in the chain. OP assumes that, in addition to the events occurring before the candidate event, the events occurring after the candidate event are still considered. So, given an insertion point m, where the new event should be added, we can calculate the OP score as follows:

Fig. 4 Event pairs collected by N-skip bigrams. 0-skip bigrams collect three event pairs that occur adjacently. 1-skip bigrams collect five event pairs, including the three 0-skip bigrams plus two 1-skip bigrams whose event pairs occur with one event intervening. 2-skip bigrams collect six event pairs, including the three 0-skip bigrams and two 1-skip bigrams mentioned above, plus a 2-skip bigram whose event pair occurs with two events intervening

OP(C, e) = \sum_{i=1}^{m} OP(c_i, e) + \sum_{i=m+1}^{n} OP(e, c_i) = \sum_{i=1}^{m} \log \frac{P(c_i, e)}{P(c_i) P(e)} + \sum_{i=m+1}^{n} \log \frac{P(e, c_i)}{P(e) P(c_i)},    (4)

P(c_1, e) = \frac{C(c_1, e)}{\sum_{c_i} \sum_{e_j} C(c_i, e_j)}.    (5)

Here, OP treats the co-occurrence count of two events, C(c_i, e), as an asymmetric count (i.e., considering its order), whereas the PMI of Chambers and Jurafsky (2008) treats it as symmetric (i.e., regardless of the order).

The second score is the bigram probability (BP) score, which employs conditional probabilities P(e|c_i) instead of P(c_i, e) to compute the relational scores of the candidate event. The BP score is formulated as follows:

BP(C, e) = \sum_{i=1}^{m} BP(c_i, e) + \sum_{i=m+1}^{n} BP(e, c_i) = \sum_{i=1}^{m} \log P(e|c_i) + \sum_{i=m+1}^{n} \log P(c_i|e),    (6)

where the conditional probabilities are estimated from the ordered co-occurrence counts C(c_i, e) and C(e, c_i).

The authors compared these two novel scores with the PMI and observed that the 2-skip strategy plus the bigram probability function achieved the best empirical performance.

Another enhancement was made by Pichotta and Mooney (2014), who proposed the multi-argument event representation v(e_s, e_o, e_p) to replace the impoverished <verb, dependency> representation. As explained in detail in Section 3.3, the multi-argument representation directly models the interactions between entities and consequently captures richer semantic meanings.

The authors used the same 2-skip bigram probability function as that in Jans et al. (2012), and made some further changes to make it applicable to their multi-argument representation. The previous works measured the relationships of event pairs by simply counting the co-occurrence times of each pair, but that is not sufficient for the multi-argument representation.

For example, if there are two co-occurring event pairs whose events are identical except for the entities they involve, we want to count two co-occurrences of the entity-abstracted event pair, for all distinct entities X, Y, and Z. However, if we keep the entities as they are and calculate the raw co-occurrence counts, we will count only one occurrence for each of the two concrete pairs.

Pichotta and Mooney (2014) therefore reformed the counting algorithm. They were motivated by the observation that the essential factor in capturing the relationship between two events was their overlapping entities. In short, the central idea of their algorithm is that when counting the co-occurrence times of an event pair, the count is increased according to the number of the pair's overlapping entities. This algorithm involves simple calculations and can capture pair-wise entity relationships between events.

They employed two evaluation tasks, predicting the verb with all arguments and predicting simpler <verb, dependency> pairs. The empirical results showed that modeling multi-argument events, instead of modeling <verb, dependency> pairs, could provide more power for both tasks.
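The skip-bigram collection of Jans et al. (2012) and the ordered counts behind the OP and BP scores can be sketched as follows, using the event chain of Fig. 4 (illustrative code, not the authors' implementation):

```python
from collections import Counter

# The event chain of Fig. 4.
chain = [("saw", "subj"), ("smiled", "subj"), ("kissed", "obj"), ("blushed", "subj")]

def skip_bigrams(chain, n):
    """All ordered event pairs with at most n events intervening."""
    return [(chain[i], chain[j])
            for i in range(len(chain))
            for j in range(i + 1, min(i + n + 2, len(chain)))]

assert len(skip_bigrams(chain, 0)) == 3   # adjacent pairs only
assert len(skip_bigrams(chain, 1)) == 5   # three 0-skip + two 1-skip pairs
assert len(skip_bigrams(chain, 2)) == 6   # plus one 2-skip pair

# Ordered counts C(a, b), incremented only when a precedes b; these feed
# the asymmetric OP score and the conditional probabilities of BP.
counts = Counter(skip_bigrams(chain, 2))

def bp(c, e):
    # P(e | c) estimated from the ordered co-occurrence counts
    denom = sum(v for (a, _), v in counts.items() if a == c)
    return counts[(c, e)] / denom if denom else 0.0
```

The pair totals match the three, five, and six pairs enumerated in Fig. 4.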
event and its components (i.e., predicate and its Although both the representation and the per- arguments). The embedding of the entire event is formance were improved, the essential computing computed by composing the word embeddings of its methods of the enhanced works were still based on components, via a composition neural network. The the count, which had common sparsity issues. event-matching works are based mainly on calculat- Count-based models use discrete event represen- ing conference scores or similarity scores of event em- tations, treat the whole structured representation of beddings, which can somewhat measure the seman- the event as an unbreakable atomic unit, and re- tic correlation between events. Both the parameters gard the components of event as fixed parts of the of the compositional process for computing the event unit. The matching probability of an event pair is embeddings and the parameters of the matching pro- estimated by calculating the co-occurrence counts cess for computing the similarities are automatically of two fixed event tuples. However, the underly- learned from the texts. ing symbolic nature of tuple matching greatly lim- its the flexibility and generalization ability. Those Similarity or coherence score works fail to express the between Event matching algorithm individual components of an event. For example, Event given two events cook(John, spaghetti, dinner) and embeddings prepare(Mary, pasta, dinner), since “cook” and “pre- Compositional neural network Compositional neural network pare” and “spaghetti” and “pasta” are semantically very similar, these two events should be semanti- Word embeddings cally similar. However, count-based models would not take the similarity between these components Text Cook(John, spaghetti, dinner) Prepare(Mary, pasta, dinner) into account, unless both events often occur in simi- words Event 1 Event 2 lar context. Fig. 5 Process of deep learning based models. 
Learning and exploiting embeddings to represent words is beneficial for a range of NLP tasks, such as information extraction (Laender et al., 2002), modeling word meaning in context (Erk and Padó, 2008), word similarity analysis (Radinsky et al., 2011), word sense disambiguation (Navigli, 2009), and word and spelling correction (Jones and Martin, 1997). The distributional hypothesis (Harris, 1954), namely that words occurring in the same contexts tend to have similar meanings, is the foundation of word embeddings. Word embeddings are learned so that they can predict context words. Consequently, they encode some degree of the semantic and syntactic properties of words: words that share similar contexts should be close to each other in the embedding vector space.

There are some works that focus on learning embeddings of phrases, using a compositional model to compose the embeddings of the individual words in the phrase (Baroni and Zamparelli, 2010; Socher et al., 2012). Motivated by them, Modi and Titov (2014a, 2014b) suggested that the embedding of an event can also be learned using a compositional neural network to compose the embeddings of the individual components of the event.

In their works, the event was no longer treated as an unbreakable unit. Instead, it consisted of separable components, comprising a predicate and its arguments. All of the components were represented as embeddings, which were learned by predicting prototypical event orderings. Because the embeddings lie in the same vector space with the same dimensionality, we can easily obtain the embedding of the entire event by composing the embeddings of its components via a compositional neural network.

Similar to word embeddings, event embeddings can encode the semantic and syntactic properties of an event: similar events should be close to each other in the embedding vector space, while dissimilar events should be far away from each other. With these properties, event embeddings can be used to measure the relationship between almost any pair of events by calculating the vector distance between them, no matter whether they have occurred together.

In addition, due to the continuous nature of real values, the embedding representation pushes the script learning system to learn development patterns instead of exact sequences of events. This in turn results in smooth modeling, allowing the script learning system to predict unobserved events.

Although the above improvements effectively mitigated the sparsity issue of count-based models, Modi and Titov (2014a, 2014b) chose the event ordering task to learn embeddings and evaluate models, rather than the narrative cloze (NC) task (Chambers and Jurafsky, 2008), which was regarded as the predominant evaluation method at that time.

In fact, the NC task is essentially a language modeling task. For language models, the goal is to predict the center word by referring to the context words, while for script learning models, the goal is to predict the missing event by referring to the context events, and all of the events are made up of words. This observation strongly suggests the essential correlation between neural language models and script learning models. Following this, Rudinger et al. (2015) did exploratory work training a log-bilinear language model (LBL) (Mnih and Hinton, 2007) to resolve script learning tasks. Their model achieved a significantly better result than count-based models in the NC task, which led them to conclude that for script learning, either neural language models were a more effective approach or the NC task was not a suitable evaluation task. The event embeddings were also learned during the training process, but merely regarded as a byproduct. The authors still considered the event as an unbreakable unit and directly learned representation vectors for events, so their event representations are distributed in external form but discrete in essence.

The alternative conclusions of Rudinger et al. (2015) were resolved by the subsequent works of Modi (2016) and Granroth-Wilding and Clark (2016). The performance of their neural network models overshadowed the previous count-based models, which indicated that neural network based methods are certainly more effective for script learning tasks. In addition, two novel and more appropriate evaluation tasks, the adversarial narrative cloze (ANC) task and the multiple choice narrative cloze (MCNC) task, were proposed to refine the NC test. We will discuss the details of NC, ANC, and MCNC in Section 5.

Modi (2016) proposed a probabilistic compositional model that had the same compositional neural network as that in Modi and Titov (2014b) to produce event embeddings, and a neural network based probabilistic model to encode the event sequence and predict the missing event. Word2vec, the vector-learning system of Mikolov et al. (2013), was also introduced to learn word embeddings.

Specifically, given an event sequence with a missing event (e_1 → e_2 → ... → e_{k−1} → ? → e_{k+1} → ... → e_n), the model is trained by predicting the missing event e_k. During training, instead of predicting the whole event at one time, the model predicts the event incrementally, using the context events and the previous predictions. As shown in Fig. 6, the model first predicts the verbal predicate of e_k using the embedding of the context events. Then it predicts the position of the protagonist using the embedding of the context events and the embedding of the verbal predicate predicted in the last step. In this way, the other arguments are predicted one by one. The authors considered this to be a more natural way of predicting events, because the verbal predicate information predicted before will influence the next possible arguments.

Notably, the model uses the embedding of the previous prediction to make the next prediction, instead of using the gold embedding. In other words, if the previous prediction is wrong, the model will use the wrong embedding rather than the correct one stored in the training dataset to make the next prediction. The purpose of this design is to make the model more robust to noise and able to partially recover from mispredictions during the testing period. The authors also proposed a new testing task, the ANC task, to overcome the demerits of the standard NC task.

Similar to Modi (2016), Granroth-Wilding and Clark (2016) employed a compositional neural network and word2vec. In addition, they introduced a Siamese neural network to estimate the coherence score of an event pair. They evaluated their model using the MCNC task, a novel development of the NC test. Their experimental results showed that both the neural network and word2vec could yield further empirical improvements.

The above works not only prove the significant advantages of neural networks and embeddings in learning script knowledge, but also construct the basic framework of deep learning based script learning models. However, they focus only on learning associations between event pairs and ignore the temporal order of the event chain, so they are incapable of capturing order information and long-term context information. The following works introduce more advanced mechanisms, such as recurrent neural networks and attentional neural networks, to model the full event chain. We will describe the details of these models in Section 4.5.

4.4 Pair-based models

Event pair based models refer to models that focus on calculating associations between pairs of events. Because these models concern only the relations between event pairs, they are also known as weak-order models. Pair-based models were the dominant method in the literature before 2017 (Chambers and Jurafsky, 2008; Jans et al., 2012; Pichotta and Mooney, 2014; Granroth-Wilding and Clark, 2016).

The core idea of pair-based models is calculating pair-wise event scores by various approaches. These scores can somewhat measure the semantic correlation and compatibility between two events, namely, how strongly they are expected to appear in the same chain.
wrong embedding rather than the correct one stored 4.4 Pair-based models in the training dataset to make the next prediction. The purpose of this design is to make the model Event pair-based models refer to models that more robust to noise and partially recover from mis- focus on calculating associations between pairs of predictions during the testing period. The authors events. Because these models concern only the re- also proposed a new testing task, the ANC task, to lations between event pairs, they are also known as overcome the demerits of the standard NC task. weak-order models. Pair-based models are the domi- Similar to Modi (2016), Granroth-Wilding and nant method in the literature before 2017 (Chambers Clark (2016) employed a compositional neural net- and Jurafsky, 2008; Jans et al., 2012; Pichotta and work and word2vec. In addition, they introduced Mooney, 2014; Granroth-Wilding and Clark, 2016). a Siamese neural network to estimate the coherence The core idea of pair-based models is calculat- scores of two event pairs. They evaluated their model ing the pair-wise event scores by various approaches. using the MCNC task, a novel development of the These scores can somewhat measure the semantic


Fig. 6 Process of incrementally predicting the next event. First, the verbal predicate is predicted using the context embedding. Then, the protagonist position is predicted using the context embedding and the predicted verbal predicate embedding. In this way, the other arguments are predicted one by one using the context events and the previous predictions

These approaches include traditional count-based methods, such as the PMI (Chambers and Jurafsky, 2008) or skip bigram probabilities (Jans et al., 2012), and vector-based methods, such as the Siamese network (Granroth-Wilding and Clark, 2016) or vector distance.

In a harder situation, when measuring the relations of event pairs that have subtle differences in surface realizations, the event embeddings produced by widely used compositional neural networks do not work well. Such networks usually combine event components by concatenating or adding their word embeddings, and then applying a parameterized function to map the summed vector into the event embedding space. However, due to the overlap of words and the additive nature of the parameters, it is hard to encode subtle differences in an event's surface realizations.

For example, the event pair "John threw bomb" and "John threw football" is semantically far apart, because the two events indicate entirely different scenarios (i.e., terrorism and sports). However, because their subject and predicate are the same, their embeddings, produced by compositional neural networks, are likely to be close in the vector space, which incorrectly indicates that they are closely related. On the contrary, the event pair "John threw bomb" and "John attacked embassy" is semantically close. Even though these two events have very different surface forms, they do not share similar word vectors, so their embeddings may be distinct, which incorrectly indicates that they have a distant relation.

To distinguish the relations of such event pairs, we need to understand deeper scenario-level semantics, of which the interactions between the predicate and its arguments are the key points. As the above example shows, the interactions of "football" or "bomb" with the predicate "threw" are what determine the precise semantics of the scenario. Weber et al. (2018) proposed a tensor-based composition model to combine the subject, predicate, and object, and then to produce the final event embedding. Tensor-based composition models can capture multiplicative interactions between event components and therefore represent deeper scenario-level semantics. With this ability, the event embeddings are sensitive to even a small change of a single component, which is effective for multiple event-related tasks.

When resolving standard event prediction tasks (i.e., predicting which candidate event is most likely to occur given a context event sequence), pair-based models need to compute the correlation scores between a candidate event and a sequence of context events. Generally, they compute the scores between the candidate event and each individual event in the chain, and then treat the average value of these scores as the final correlation score between the candidate event and the whole event chain. Finally, the candidate with the highest score is chosen as the final prediction. We can see this process in Fig. 7.

Although the relationships between the candidate event and each event in the chain have been considered, pair-based models are still limited to event pairs. They ignore the temporal order of events, and thus lack the ability to capture order information and long-term context information. As a result, improbable predictions may be produced. For example, given a context event, pair-based models may predict a scenario-inconsistent event as the next event, simply because the two events often co-occur or have a high coherence score (Granroth-Wilding and Clark, 2016). This flaw is countered by the event chain based models proposed in subsequent works.

Regardless of the above flaw, pair-based models still have their respective strengths compared with chain-based models. On one hand, they obviously inject less noise when modeling event chains with flexible order. On the other hand, because they are simpler, they are less likely to suffer from over-fitting issues and are less costly to employ.
For these rea- indicates that they have a distant relation. sons, many recent script learning models have chosen To distinguish the relations of such event pairs, to combine the advantages of pair- and chain-based we need to understand deeper scenario-level seman- approaches (Wang ZQ et al., 2017; Lv et al., 2019). tics, of which the interactions between the predi- 4.5 Chain-based models cate and its arguments are the key points. As the above example shows, the interactions of “football” Event chain based models are introduced to or “bomb” with the predicate “threw” are what de- overcome the above-mentioned flaws. They use termine the precise semantics of the scenario. Weber RNN-based approaches to model the full event chain et al. (2018) proposed a tensor-based composition and consider the temporal order information. The model to combine the subject, predicate, and ob- basis of chain-based models is the RNN, which is ject, and then to produce the final event embedding. widely used for language models to express inter- Tensor-based composition models can capture multi- actions between long-term sequences via intermedi- plicative interactions between event components and ate hidden states. The basic process of chain-based therefore represent deeper scenario-level semantics. models is shown in Fig. 8, in which an RNN provides With this ability, the event embeddings are sensitive embeddings with order information, and the event to even a small change of a single component, which matching works are based on these embeddings. Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373 357
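Returning briefly to the tensor-based composition of Weber et al. (2018) described in Section 4.4, the core multiplicative interaction can be sketched in a few lines. This is not the authors' exact model: the dimensions, the randomly initialized parameters, and the helper name `compose_event` are illustrative assumptions; only the bilinear predicate-argument interaction is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # word/argument embedding dimension (illustrative)
k = 8  # dimension of the composed event embedding (illustrative)

# Hypothetical learned parameters: one interaction matrix per output unit,
# plus a standard additive (concatenation-based) term.
T = rng.normal(size=(k, d, d))
W = rng.normal(size=(k, 3 * d))

def compose_event(subj, pred, obj):
    """Tensor-based composition: each output unit k mixes a bilinear
    interaction pred^T T[k] obj with an additive term over all components."""
    bilinear = np.einsum('i,kij,j->k', pred, T, obj)
    additive = W @ np.concatenate([subj, pred, obj])
    return np.tanh(bilinear + additive)

subj, pred = rng.normal(size=d), rng.normal(size=d)
bomb, football = rng.normal(size=d), rng.normal(size=d)

# Changing a single argument changes the embedding multiplicatively, so
# "threw bomb" and "threw football" need not stay close in vector space.
e1 = compose_event(subj, pred, bomb)
e2 = compose_event(subj, pred, football)
```

A purely additive composition would map the two events to nearby points because they share two of three components; the bilinear term lets the object reshape the whole embedding.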

Fig. 7 Process of resolving the event prediction task by pair-based methods. First, calculate the scores between the candidate event and each individual event in the chain. Then, calculate the average value of these scores as the final score between the candidate event and the whole event chain. Finally, choose the candidate with the highest score as the final prediction
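The averaging procedure in Fig. 7 reduces to a few lines. In this sketch, cosine similarity over toy vectors is an assumed stand-in for the real pairwise scorers (PMI, skip bigram probabilities, or a trained Siamese network); the helper names are hypothetical.

```python
import numpy as np

def pair_score(a, b):
    # Stand-in pairwise scorer (cosine similarity); real pair-based
    # systems use PMI, skip bigrams, or a Siamese network instead.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_next(context_events, candidates):
    """Score each candidate against every context event, average the
    pairwise scores, and return the index of the best candidate."""
    final_scores = [np.mean([pair_score(c, e) for e in context_events])
                    for c in candidates]
    return int(np.argmax(final_scores))

# Toy context chain and two candidates; candidate 1 points the same way
# as the context events, so it wins the averaged score.
context = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([1.0, 0.2])]
candidates = [np.array([0.0, 1.0]), np.array([1.0, 0.1])]
print(predict_next(context, candidates))  # prints 1
```

Note that the average is insensitive to the order of `context_events`, which is exactly the limitation that chain-based models address.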

Nevertheless, basic RNNs suffer from the vanishing and exploding gradient problem. During the learning period, the gradient signal tends to either approach zero or diverge as it propagates through long time steps. So, LSTM (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Chung et al., 2014) were devised to overcome this problem. They employ more complicated hidden units, which can provide encoded long-range propagation of events without losing long-term historical information.

A characteristic of script learning is that some events will be highly predictive of events far ahead in the chain, while some events are only locally predictive. Following this idea, Pichotta and Mooney (2016a) first adopted the LSTM model for the task of script learning. Their model represents an event with five components (i.e., the 5-tuple (v, es, eo, ep, p), involving a verb predicate, subject, direct object, prepositional relation, and preposition). At each time step, one event component is inputted into the LSTM model. After inputting the entire event chain, the model outputs the prediction of an additional event, again one component per time step. Their work produced more accurate results, but it required handcrafted features to represent events and linguistic preprocessing to extract these features. In addition, it cannot predict events other than those in a given candidate set.

Two notable improved studies were conducted to overcome these shortcomings. The first one is by Pichotta and Mooney (2016b), in which the raw texts describing the previous event chain were directly used to predict the texts describing the missing event. The sentence-level RNN encoder-decoder model of Kiros et al. (2015) was employed to produce text predictions. Their word-level model trained on raw text was compared with an identical event-level model trained on structured event representations. The experimental results showed that, on the task of event prediction, there was only a marginal difference between the word-level model and the event-level model.

The second improved work was done by Hu et al. (2017). They developed an end-to-end LSTM model called CH-LSTM. Similar to Pichotta and Mooney (2016b), they took the raw texts describing the previous event chain as the input, and directly generated texts describing the possible next event.

CH-LSTM captures two-level structures of the event chain. At the first level, a single event is encoded using word embeddings as the input, and they are combined to produce the event embedding as the output. At the second level, the event chain is encoded using event embeddings as the input, and the last hidden vector is outputted as the embedding of the entire previous event chain. CH-LSTM uses the event chain embedding to make a non-targeted

Fig. 8 Process of chain-based models. Inputting each event embedding inside an event chain into a recurrent neural network (RNN) can provide embeddings with order information via an intermediate hidden state. The event-matching process based on these embeddings can capture long-term context information
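The chain-based process in Fig. 8 can be sketched minimally as follows. The plain RNN cell, the random (untrained) parameters, and the bilinear matcher are illustrative assumptions rather than any specific published model; real systems use LSTM or GRU cells with learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 6  # illustrative event-embedding and hidden-state sizes

# Hypothetical RNN cell parameters; in a real model these are learned.
W_xh = rng.normal(scale=0.5, size=(h, d))
W_hh = rng.normal(scale=0.5, size=(h, h))
W_out = rng.normal(scale=0.5, size=(h, d))

def encode_chain(event_embeddings):
    """Feed the chain through the RNN; the final hidden state carries
    order-aware context information about the whole chain."""
    state = np.zeros(h)
    for e in event_embeddings:
        state = np.tanh(W_xh @ e + W_hh @ state)
    return state

def match(chain, candidate):
    # Event matching on top of the order-aware encoding (a simple
    # bilinear score standing in for a learned matcher).
    return float(encode_chain(chain) @ (W_out @ candidate))

chain = [rng.normal(size=d) for _ in range(5)]
score = match(chain, rng.normal(size=d))

# Unlike the pair-based average, the encoding depends on event order:
assert not np.allclose(encode_chain(chain), encode_chain(chain[::-1]))
```

The final assertion is the crux: reversing the chain changes the hidden state, so the model can, in principle, exploit temporal order.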

next event prediction, which can generate new events that do not even exist in the training set.

Although the aforementioned works showed that LSTM captured significantly more event chain order information and made substantial experimental improvements compared with the previous methods, it may still suffer from the over-fitting problem, because an event chain in a script has a flexible order. To solve this issue, Wang ZQ et al. (2017) proposed a novel dynamic memory network model to integrate the advantages of chain-based temporal order learning and pair-based coherence learning. Fig. 9 shows an overview of their model.

As we mentioned at the end of Section 4.4, pair-based models are more practical when modeling event chains with flexible order, so Wang ZQ et al. (2017) used pair-based methods to measure the relationship between pairs of an occurring context event and a subsequent candidate event. To inject the order information, they represented events as the hidden states of an LSTM, and used them to calculate the relation scores of event pairs.

In addition, considering the fact that the significance of different events varies for inferring a subsequent event, it is not appropriate to give an equal weight to each event as in the previous methods. Therefore, a dynamic memory network (Weston et al., 2015) was used to automatically induce different weights for each event in the inferring task.

The work of Wang ZQ et al. (2017) inspires us to resolve script-related tasks by combining the advantages of pair- and chain-based models. Following this inspiration, we can exploit the event segment, which is a concept between individual events and event chains, to make further improvements.

An event segment refers to a part of a script that involves several individual events in the same event chain. As Fig. 10 shows, given a script, for example, the script of restaurant visiting, there are various event segments within the chain, such as the continuous event segment <"X read menu" "X order food" "X eat food"> or the discontinuous event segment <"X read menu" "X order food" "X make payment">.

We can observe that both the individual events and the event segments are conducive to making more accurate subsequent event predictions. In particular, the individual event "X make payment" has a significant effect on predicting the subsequent event "X leave restaurant." Meanwhile, the event segment <"X read menu" "X order food" "X eat food"> has a strong semantic relation with the subsequent event "X leave restaurant." The event segment has a unique advantage; namely, it has richer semantic


Fig. 9 Integrating the advantages of chain-based temporal order learning and pair-based coherence learning. For chain-based methods, the event embeddings of the event chain are first inputted into an LSTM to encode order information. For pair-based methods, the hidden states of the LSTM are used to measure the relation between a context event and a subsequent candidate event
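The unequal-weighting idea behind the dynamic memory network can be sketched as a simple attention-weighted average over pairwise scores. The toy hidden states, the dot-product scorer, and the helper names below are illustrative assumptions, not the exact model of Wang ZQ et al. (2017).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attentive_relation_score(hidden_states, candidate):
    """Weight each context event's pairwise score by attention, so more
    informative events contribute more than in a plain average."""
    pair_scores = np.array([h @ candidate for h in hidden_states])
    weights = softmax(pair_scores)  # induced per-event importance
    return float(weights @ pair_scores)

states = [np.array([0.2, 0.1]), np.array([0.9, 0.8])]  # toy LSTM states
candidate = np.array([1.0, 1.0])
score = attentive_relation_score(states, candidate)

# The attentive score exceeds the plain average (here 1.0) because the
# more strongly related second event dominates the weighting.
assert score > np.mean([s @ candidate for s in states])
```

Replacing the softmax weights with uniform weights recovers exactly the pair-based averaging of Fig. 7, which makes the relationship between the two model families explicit.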

information than individual events and introduces less noise than event chains, whereas previous approaches did not fully use them.

Fig. 10 Event segments of restaurant visiting. An event segment refers to several individual events in the same event chain. There are various event segments within an event chain, including continuous event segments and discontinuous event segments

Lv et al. (2019) designed the SAM-Net model to fill this gap. SAM-Net uses an LSTM to capture the order information of the event chain, and puts its hidden states into a self-attention mechanism (Lin ZH et al., 2017) to extract diverse event segments. SAM-Net can combine the information of individual events and the information of event segments. The former refers to the relations between a subsequent event and an individual event, while the latter refers to the relations between a subsequent event and an event segment.

Similar to Wang ZQ et al. (2017), SAM-Net considers that different events or event segments may have different semantic relations with the subsequent event. So, two attention mechanisms (Luong et al., 2015) were used to assign different corresponding weights to each individual event and event segment. The prediction of the subsequent event is based on the combination of these two attention mechanisms.

Results on the standard MCNC task showed that SAM-Net achieves apparently better results than the previous models, and the idea of exploiting event segments to draw more accurate inferences is a promising direction. Nevertheless, there is still no effective method for extracting event segments. The self-attention mechanism used by SAM-Net can somewhat achieve this aim, but it is not accurate enough. Therefore, follow-up improvement research is anticipated.

4.6 Graph-based models

There is also a line of research that tries to express the relations between events and extract script knowledge by constructing a graph-based organization of events. As a general rule, the nodes of the graph refer to events, and the edges refer to the relations between events; some edges may also have weights that measure the degree of the relationship or the transition probability between two events.

The graph structure is, to some extent, clearer and more intuitive from the human interpretability perspective. It is also closer to the organizational form of event development in reality. As in the real world, given any event, there are always many possible subsequent events, so there are also many different event chains occurring in the same scenario. When linking all of these events and event chains together according to their associations, a graph structure is formed. Compared with pair- and chain-based models, graph-based models can express denser and broader connections among events, which contain richer script knowledge. In addition, once scripts are represented in a graph structure, a variety of existing graph-based algorithms can be used to solve script-related tasks.

The early work of Regneri et al. (2010) constructed the temporal script graph for specific scenarios from crowdsourced data. The data contains many different descriptions focusing on a single scenario. The scenario-specific paraphrase and temporal ordering information were extracted using a multiple sequence alignment (MSA) algorithm and connected to form a script graph.

Orr et al. (2014) employed the hidden Markov model (HMM) to construct a transition event graph. Their model clustered event descriptions into different types and learned an HMM over the sequence of event types. The structure and parameters were learned with the expectation-maximization (EM) algorithm, and script inferences were drawn according to the state transition probabilities.

Glavaš and Šnajder (2015) designed an end-to-end system with a three-stage pipeline, which can extract anchors, arguments, and relations from natural language text, and link them to construct the event graph.

Here we discuss two more recent papers in detail. Zhao et al. (2017) modeled cause-effect relations between events in a graph structure. They proposed an abstract causality network on top of the specific events to reveal high-level general causality rules. General causality patterns can be helpful in predicting future events correctly and in reducing the sparsity issue caused by the causalities between specific events. The abstract causality network can discover general causalities on top of the specific causalities. For instance, the specific causality "a massive 8.9 magnitude earthquake hit northeast Japan on Friday → A large number of houses collapsed" can be abstracted to a more concise and general one: Earthquake → Houses collapsed.

The authors extracted event pairs with a causality relation from text snippets (in this paper, news headlines), and represented them by a set of verbs and nouns (e.g., "Hudson killed people"). The specific causality network is constructed by regarding each specific event as a node, and the causality between specific event pairs as an edge. The abstract causality network is constructed on top of the specific causality network. Particularly, the nouns of specific events are generalized to their hypernyms in WordNet (Miller, 1995) (e.g., the hypernym of the noun "chips" is "dishes"), and the verbs of specific events are generalized to their high-level classes in VerbNet (Schuler, 2005) (e.g., the verb "kill" belongs to the class "murder-42.1"). The nodes of the abstract causality network are created by replacing the original events with their generalization results, and the edges are also generated according to the specific ones. For example, if there is an edge between the specific event "Hudson killed people" and the specific event "Hudson was sent to prison," then there will also be an edge between their corresponding abstract events, namely, the abstract event "murder-42.1, people" and the abstract event "send, prison". After constructing the abstract causality network, the event prediction task can be formulated as a link prediction task on the network. Given the cause (effect) event, the model will produce a list of corresponding effect (cause) events ranked by possibility.

Another notable graph-based model was introduced by Li ZY et al. (2018). They constructed a narrative event evolutionary graph (NEEG) to express connections among events, and a scaled graph neural network (SGNN) to model event interactions and learn event representations.

As shown on the left of Fig. 11, NEEG uses a predicate-dependency pair, such as (eat, subj), to stand for an event (i.e., a node of the event graph), and regards each event bigram as the relation between events (i.e., an edge of the event graph). The edges also have weights that measure the transition probability between events and can be calculated from the training data.

The task of inferring subsequent events can be solved by SGNN. Traditionally, a gated graph neural network (GGNN) is used to model the interactions among events. GGNN needs to take the whole graph as the input, so it cannot effectively operate on a large-scale graph. To remedy this issue, SGNN borrows the idea of divide and conquer, and inputs a subgraph instead of the whole graph. As shown on the right of Fig. 11, the subgraph comprises only the nodes of context events (red nodes) and a subsequent candidate event (green node), rather than all of the possible corresponding events. A graph network with a series of GGNNs is used to learn the representations of events. A Siamese network (Granroth-Wilding and Clark, 2016) is employed to compute the relatedness scores of the context and candidate events. The experimental results showed that their graph-based script learning model achieves high performance on standard MCNC tasks.

Fig. 11 Brief framework of NEEG and SGNN. NEEG regards predicate-dependency pairs as nodes, and regards event bigrams as edges. SGNN uses a subgraph instead of the whole graph as the input, and uses a series of GGNNs and a Siamese network to encode and match events. References to color refer to the online version of this figure

4.7 External knowledge models

At the end of this section, we discuss some recent works that attempt to enhance traditional script learning models by injecting external knowledge, such as additional commonsense knowledge, fine-grained properties of entities, or nuanced relations between events. Although these models do not strictly belong to one of the above-mentioned specific categories, they are enlightening and potentially useful, and some of them also achieve quite good performance.

Some fine-grained commonsense knowledge, such as the purpose or mental state of event participants, can be practical for understanding deep semantic meanings and generating better event representations. For instance, the event "John threw bomb" and the event "Jerry attacked embassy" have the same intent (i.e., bloodshed). This intent information can help map the embeddings of these two events into neighboring regions of the vector space, even though they have very different surface forms. As for emotion information, consider the following events: "John threw bomb" and "John threw football." The former event may elicit feelings of anger or dread, while the latter may elicit feelings of joy. Based on this information, their embeddings should be far apart in the vector space, even though they are quite similar in surface form.

However, Ding et al. (2019b) found that the event representations extracted from raw text lack this commonsense knowledge, which is an important reason why it is difficult for script learning models to discriminate between event pairs with subtle differences in their surface forms. Hence, they used two commonsense knowledge corpora, namely Event2Mind (Rashkin et al., 2018) and ATOMIC (Sap et al., 2019), as the training dataset. In addition to an event description, multiple additional inference dimensions, produced by crowdsourcing, are appended to each event in these two corpora. These inference dimensions contain richer commonsense knowledge, such as the cause and effect of an event, and the intents and mental states of the participants.

The authors leveraged intent and sentiment knowledge to help models produce more accurate embeddings and inferences of events. Specifically, they used a neural tensor network to learn baseline event embeddings and defined a corresponding loss function to incorporate intent and sentiment information. Subsequently, the model jointly trained the combination of the loss functions on the event, intent, and sentiment, using training data with intent and emotion labels. Experimental results showed that external commonsense knowledge could enhance the quality of event representations and improve the performance of downstream applications.

Lee IT and Goldwasser (2018) proposed a feature-enriched script learning model, featured event embedding learning (FEEL), and injected fine-grained event properties to enhance event embeddings. Following the idea that fine-grained event properties and relevant contextual information, such as argument, sentiment, animacy, time, or location, can potentially enhance event representations, the authors injected two additional event properties, sentence-level sentiment and animacy information of the protagonist, into the script learning model.

The sentence-level sentiment captures the overall tone of the event, which can impact the probability of specific events. For example, the event "Jack likes the food" indicates a positive sentiment, increasing the possibility of "He tips the waiter" being the next event. Animacy information of the protagonist is also an important factor that affects future events, because some events can be done only by living entities, and some events' meanings change greatly with different types of animacy (e.g., "Jack is blue" and "the sky is blue").

FEEL represents an event by the 6-tuple <tok(e), sub(e), obj(e), prep(e), f1(e), f2(e)>, where f1(e) and f2(e) refer to the sentence-level sentiment and animacy information, respectively. A hierarchical multitask model was used to jointly learn intra- and inter-event objectives. The intra-event objective (or local objective) expresses the connections between different components inside an individual event, namely, allowing tok(e), sub(e), obj(e), prep(e), f1(e), and f2(e) to share information with each other. The inter-event objective (or contextual objective) expresses the relationship between two different events. The model was evaluated over three narrative tasks, including the traditional MCNC task and its two variants (MCNS and MCNE). The experimental results showed that these two properties could contribute improvements.

In their later work, Lee IT and Goldwasser (2019) made further improvements by introducing fine-grained multi-relations between events, such as Reason, Cause, and Contrast.

Previous models usually drew inferences based on vector similarity scores between event embeddings. However, when making commonsensible inferences, there are always many reasonable choices, so using event similarity alone is too coarse to support some relevant inferences. For instance, given the event "Jenny went to her favorite restaurant," the highly related possible events include "She was very hungry," "She ordered a meal," and many other choices. However, if we have more clues, for example, given that the type of relation between these two events is Reason, analogous to asking "Why did Jenny go there," then the event "She was very hungry" will obviously be a more reasonable choice. In the same way, if the type of relation is Temporal, analogous to asking "What did she do after that," clearly, the event "She ordered a meal" will be a better choice.

Considering that nuanced types of relationships between events can make commonsense inferences more accurate, this paper presents a model to encode multiple nuanced relations, rather than only the typical temporal or co-occurrence relations. As Fig. 12 shows, the authors extracted relation triplets (e_h, r, e_t) from the corpus, where each triplet consists of two related events (e_h, e_t) and the relation type between them (r), such as (e_h = "Jenny went into her favorite restaurant," r = Reason, e_t = "She was very hungry"). All of e_h, e_t, and r are represented by embeddings, which can be jointly learned by the model.

The relation embeddings cover 11 types of relations, including a commonly used temporal relation and a co-occurrence relation, plus nine additional discourse relations. These discourse relations are introduced from an annotated corpus, Penn Discourse Tree Bank-2.0 (PDTB) (Prasad et al., 2008). The relation embeddings are learned by translation-based embedding models, which regard relations as translations in the embedding space. This paper adopted two translation-based embedding models, TransE (Bordes et al., 2013) and TransR (Lin YK et al., 2015), to jointly learn embeddings for events and relations. TransE embeds events and their relations in the same vector space.
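The two translation-based relatedness scores, the TransE form ||e_h − e_t + r||_p of Eq. (8) and the TransR form of Eq. (9), can be sketched directly. The helper names `transe_score` and `transr_score` are hypothetical, and the toy vectors are chosen so that the tail event sits exactly one translation away from the head event.

```python
import numpy as np

def transe_score(e_h, e_t, r, p=2):
    """TransE-style dissimilarity ||e_h - e_t + r||_p (Eq. (8));
    a lower score means a stronger relatedness."""
    return float(np.linalg.norm(e_h - e_t + r, ord=p))

def transr_score(e_h, e_t, r, M_r, p=2):
    """TransR-style score (Eq. (9)): events are first projected into the
    relation space by the relation-specific matrix M_r."""
    return float(np.linalg.norm(e_h @ M_r - e_t @ M_r + r, ord=p))

e_h = np.array([1.0, 0.0])  # head event embedding
r = np.array([0.5, 0.5])    # relation embedding (e.g., Reason)
e_t = e_h + r               # tail event exactly one translation away

assert transe_score(e_h, e_t, r) == 0.0  # perfect relatedness
# With an identity projection, TransR reduces to TransE on this example.
assert transr_score(e_h, e_t, r, np.eye(2)) == 0.0
```

Separating the projection matrix `M_r` per relation is what lets TransR handle 1-to-N and N-to-N relations that a single shared space cannot.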
tions in the same vector space, so that the distance Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373 363

Fig. 12 Multi-relational script learning. Relation triplets (e_h, r, e_t) are extracted from the corpus. Each triplet consists of two related events (e_h, e_t) and the relation type between them (r). All of e_h, e_t, and r are represented by embeddings, which can be jointly learned by the model

In this shared space, the distance between event embeddings (||e_h − e_t||) can directly reflect their relation (r). The relatedness score between events is formulated as

f_trans(t) = f_TransE(e_h, e_t, r) = ||e_h − e_t + r||_p,  (8)

where e_h, e_t, r ∈ R^(d_r) are the embeddings from the event composition network. This formulation is a dissimilarity measure, with a lower score meaning a stronger relatedness.

Considering that relations and events are two completely different categories, it may not be appropriate to embed them into the same space. Moreover, TransE has limitations in dealing with reflexive, 1-to-N, N-to-1, or N-to-N relations (Wang Z et al., 2014), so the authors also adapted TransR to encode relations. TransR separates the embedding space of relations (r ∈ R^(d_r)) from the embedding space of events (e_h, e_t ∈ R^(d_e)). When performing operations, event embeddings can be mapped into the relation embedding space by multiplying a relation-specific parameter matrix (M_r ∈ R^(d_e×d_r)). The relatedness score between events is then formulated as

f_trans(t) = f_TransR(e_h, e_t, r) = ||e_h M_r − e_t M_r + r||_p.  (9)

The model of Lee IT and Goldwasser (2019) was evaluated using several tasks, including three cloze tasks (MCNC, MCNS, and MCNE), three relation-specific tasks, and a related downstream task. The results showed that the learned embeddings could capture relation-specific information and improve performance on the downstream task.

5 Evaluation approaches

Essentially, the aim of a script learning system is to encode script knowledge and use it to draw reasonable inferences, so in the final process of script learning, some evaluation approaches are needed to test whether the model can effectively encode the script knowledge and exploit it.

In this section, we introduce some representative evaluation approaches with specific tasks and corresponding metrics. The evaluation approaches can be employed to measure the capabilities of a script learning model by experimentally testing its performance on the tasks.

5.1 Narrative cloze test

The cloze test (Taylor, 1953) is a classic way to evaluate the language understanding ability of humans, by deleting a random word from a sentence and having a human attempt to fill in the blank. Chambers and Jurafsky (2008) extended this approach to script learning and proposed the narrative cloze (NC) test. The NC test evaluates the inference ability of script learning models by holding out a single event from the event chain and letting the model predict the missing event given the remaining events. For example, given an event chain with a missing event: (walk, subject) → (sit, subject) → (?) → (serve, object) → (eat, subject) → (leave, subject), the model is asked to predict the missing event (?) using the information of the remaining events. Particularly, it needs to compute the probability of each event in its entire event vocabulary and return a prediction list of likely events ranked by probability. The smaller the rank number of the correct event is, the more accurate the prediction will be. Chambers and Jurafsky (2008) used the average rank of all correct events as the evaluation metric. Obviously, a better performing model will have a lower average rank.

However, this metric is partially irrational; that is, it tends to punish a model with a long event vocabulary. If a large model with a long vocabulary and a small model with a short vocabulary both predict the correct event at the bottom of the list, then the large model will be penalized more than the small one, because the guess list of the large model is much longer.
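The average-rank metric just described, together with the Recall@N and accuracy-style refinements proposed by Jans et al. (2012) and Pichotta and Mooney (2014) and discussed next, reduce to a few lines each. The helper names and toy ranks below are illustrative.

```python
def average_rank(ranks):
    # Chambers and Jurafsky (2008): mean rank of the correct event in
    # the prediction list; lower is better.
    return sum(ranks) / len(ranks)

def recall_at_n(ranks, n):
    # Jans et al. (2012): fraction of cloze instances whose correct
    # event appears among the top-n predictions; higher is better.
    return sum(1 for r in ranks if r <= n) / len(ranks)

def mc_accuracy(predictions, gold):
    # Multiple-choice style: ratio of correct choices to total tests.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

ranks = [1, 3, 12, 40]  # rank of the correct event in four cloze instances
assert average_rank(ranks) == 14.0
assert recall_at_n(ranks, 5) == 0.5
assert mc_accuracy([0, 3, 2], [0, 1, 2]) == 2 / 3
```

The toy ranks make the vocabulary-size bias visible: one instance at rank 40 pulls the average rank from 5.3 up to 14, while Recall@5 is unaffected by how far below the cutoff the misses fall.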

Therefore, Jans et al. (2012) suggested Recall@N as a more reliable metric. It evaluates the percentage of missing events that fall within the top-N predictions of the model. Specifically, if the rank number of the correct event is smaller than N, meaning that the first N events in the prediction list contain the correct one, then one score will be given. Recall@N is the average score over all predictions, namely, the percentage of prediction lists with the correct event among their first N events. Contrary to the average rank metric, a better performing model will have a higher Recall@N.

Despite some relief, Recall@N still penalizes semantically reasonable predictions and solely rewards events with exactly the same surface form. Therefore, Pichotta and Mooney (2014) introduced accuracy as a more robust metric, because it does not regard the event as an unbreakable atomic unit, and takes into account all of its components. Specifically, one point will be given if the model's top guess includes each component of the correct event. Then the score will be divided by the total number of possible points to yield a value between 0 and 1. More simply put, accuracy refers to the percentage of event components that are correctly predicted.

The NC test constructed the basic form of mainstream script learning evaluation methods, and it has been adopted by various subsequent works (Jans et al., 2012; Pichotta and Mooney, 2014, 2016a, 2016b; Rudinger et al., 2015). Despite its popularity, it still has two issues:

First, for any given event chain, there may be multiple subsequent choices that are possible and equally correct, but only one specific right answer. The NC test penalizes wrong answers equally, even if some of them are semantically plausible, which is somewhat arbitrary.

Second, it requires searching the entire event vocabulary; however, when considering complex event structures, such as multi-argument events, the vocabulary becomes extremely large, which leads to a heavy computational burden.

5.2 Adversarial narrative cloze task

Modi (2016) attempted to address these issues by proposing the adversarial narrative cloze (ANC) task. This research pointed out that it is more realistic to evaluate script learning models in a way that gives credit for predicting semantically plausible alternatives as well.

In the ANC task, there are two event chains: one is the correct chain, and the other is the same but with one event replaced by a random event. The models need to discriminate between the real and corrupted event chains. Its form is as follows:

Correct chain: (walk, subject) → (sit, subject) → (eat, subject) → (leave, subject).
Incorrect chain: (walk, subject) → (sit, subject) → (eat, subject) → (swim, subject).

5.3 Multiple choice narrative cloze task

Granroth-Wilding and Clark (2016) also refined the NC test and proposed the multiple choice narrative cloze (MCNC) task, in which a multi-choice prediction rather than a ranking list is used. Particularly, a series of contextual events are given, and the model should choose the most likely next event from a set of optional candidates. There is only one correct subsequent event in the candidate set, while the other events are randomly sampled from the corpus with their protagonist replaced by the protagonist of the current chain. Its form is as follows:

Event chain: walk(X, restaurant) → sit(X) → order(X, food) → __.
Optional choices:
C1: play(X, tennis).
C2: climb(X, mountain).
C3: ride(X, bicycle).
C4: eat(X, food).
C5: have(X, shower).

MCNC is a better fit for evaluating multi-argument events, because it can evaluate models' inference abilities without searching the entire event vocabulary. The way to calculate the accuracy of the MCNC task is also simpler and more intuitive, i.e., directly calculating the ratio of correct predictions to the total number of test instances. In addition, in principle, the MCNC task can be completed by humans. Hence, task results from humans would be a valuable baseline for comparison with the model results.

5.4 Multiple choice narrative sequences task

The MCNC task has substituted for the NC task in most subsequent related works (Wang ZQ et al., 2017; Li ZY et al., 2018; Weber et al., 2018; Lee IT and Goldwasser, 2019; Lv et al., 2019). However, it does not capture the flow of the entire narrative chain. Therefore, it is hard to draw an inference over long event sequences. Lee IT and Goldwasser (2018) proposed two novel multiple-choice variants generalized from the MCNC task, to evaluate a model's ability to draw long-sequence inferences instead of only predicting one event.

The first one is multiple choice narrative sequences (MCNS), which samples options for all the events on the chain, except for the first event used as the starting point. Its form is as follows: walk(X, restaurant) → __ → __ → __ → __. The model needs to make a multiple choice for every blank.

5.5 Multiple choice narrative explanation task

Another variant is multiple choice narrative explanation (MCNE), which gives both the first event and the final event and predicts all the intermediate events. Its form is as follows: walk(X, restaurant) → __ → __ → __ → leave(X, restaurant).

Note that regardless of the specific evaluation approach, the basic goal is to evaluate the ability of the script learning model to encode script knowledge and draw reasonable inferences. Therefore, models with excellent evaluation performance should have learned rich commonsense knowledge and understood deep semantic meaning. Nevertheless, some researchers (Pichotta and Mooney, 2014; Mostafazadeh et al., 2016; Chambers, 2017) found an issue: some models achieve high scores by optimizing performance on the task itself, instead of on the learning process (for example, in the NC test).

5.6 Story cloze test

Mostafazadeh et al. (2016) proposed the story cloze test (SCT). In the SCT, four sentences are given as context, and models need to choose one correct ending of the story from two alternative sentences. For example:

"Karen was assigned a roommate in her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating."

Given the above context, the model needs to choose between two alternative sentences, "Karen became good friends with her roommate" (right ending) and "Karen hated her roommate" (wrong ending).

Although the purposes of SCT and NC are related, there are some slight differences between the story on which SCT focuses and the script on which NC focuses. To some extent, the story is a kind of complex script that consists of more objects and multiple relations. SCT also calls for stronger inferential capability and understanding of deeper-level story semantics, rather than shallow literal or statistical information, because the model is asked to predict an entire sentence instead of an individual event. The experimental results also proved this. Mostafazadeh et al. (2016) chose some of the state-of-the-art script learning methods at that time for evaluation, but the experimental results showed that they all struggled to achieve a high score on SCT, and most of them were only slightly better than random or constant selection. This indicated that they had only learned shallow language knowledge instead of deeper understanding.

Recently, there have been a variety of works (Li Q et al., 2018; Zhou et al., 2019; Li ZY et al., 2019) attempting to resolve SCT, and some of them achieved high scores.
For example, the three-stage transfer- task, directly giving the prediction of the ranked able BERT training framework proposed by Li ZY list in accordance with the event’s corpus frequency et al. (2019) achieved rather good results with accu- (e.g., always predicting common events “X said” or racy greater than 90%. “X went”) was shown to be an extremely strong baseline. 6 Conclusions and future directions 5.6 Story cloze test In this section, we conclude the main contents To alleviate the above issues, Mostafazadeh of this survey and briefly analyze the current script et al. (2016) introduced the story cloze test (SCT), learning situation. In addition, we discuss some pos- an open task, to help models make a more com- sible directions for future research, because we be- prehensive and deeper evaluation. They collected a lieve script learning research still has great potential, corpus full of commonsense short stories written in and there are many novel directions that need to be five sentences. When performing an evaluation task, explored in the future. 366 Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373
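The cloze-style metrics reviewed in Section 5 are simple to state precisely in code. The sketch below is our own illustration (the function names, toy events, and scores are hypothetical and not taken from any surveyed system): recall_at_n follows the metric of Jans et al. (2012), component_accuracy follows Pichotta and Mooney (2014), and mcnc_accuracy follows Granroth-Wilding and Clark (2016).

```python
# Illustrative (hypothetical) implementations of three cloze-style
# evaluation metrics; event encodings below are toy placeholders.

def recall_at_n(ranked_lists, gold_events, n):
    """Recall@N: fraction of test cases whose gold event appears
    among the model's top-n ranked predictions."""
    hits = sum(gold in ranked[:n] for ranked, gold in zip(ranked_lists, gold_events))
    return hits / len(gold_events)

def component_accuracy(predicted_event, gold_event):
    """Component-level accuracy: one point per correctly predicted
    event component (verb, subject, object, ...), divided by the
    total number of components of the gold event."""
    matches = sum(p == g for p, g in zip(predicted_event, gold_event))
    return matches / len(gold_event)

def mcnc_accuracy(candidate_scores, gold_indices):
    """MCNC accuracy: fraction of multiple-choice questions where the
    highest-scoring candidate is the correct one."""
    picked = [scores.index(max(scores)) for scores in candidate_scores]
    return sum(p == g for p, g in zip(picked, gold_indices)) / len(gold_indices)

if __name__ == "__main__":
    # Two NC test cases; the gold event must rank in the top 2.
    print(recall_at_n([["eat", "leave"], ["sleep", "run"]], ["leave", "walk"], n=2))  # 0.5
    # Top guess matches 2 of 3 components of the gold event walk(X, restaurant).
    print(component_accuracy(("walk", "X", "park"), ("walk", "X", "restaurant")))
    # One MCNC question with five candidates; candidate index 3 is gold.
    print(mcnc_accuracy([[0.1, 0.2, 0.15, 0.4, 0.05]], [3]))  # 1.0
```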

6.1 Conclusions

This survey tried to provide a comprehensive review and introduce some representative script learning works. We first briefly introduced some basic script learning knowledge, including its definition, aim, significance, process, focus, and development timeline. Then we discussed in detail three main research topics: event representations, script learning models, and evaluation approaches. For each one, we introduced some typical and representative works, tried to compare their advantages and disadvantages, and summarized their development process. In the end, we discussed some possible directions for future research.

In summary, script learning is a relatively small but growing research direction with potential. It aims at encoding the commonsense knowledge of real life to help machines understand natural language and draw commonsensible inferences. Script learning systems have a long history and profound psychological foundation, and are also crucial components in constructing AI systems that are designed to achieve human-level performance.

Despite its significance, it still has some drawbacks and limitations, such as lacking a standard and high-quality corpus, lacking systematic evaluation frameworks, and lacking effective approaches for extracting and encoding the most important information of events. Or, more broadly, it lacks a deeper understanding of the mechanisms of human cognition and comprehension of commonsense knowledge. This knowledge is considered self-evident. On one hand, it seems that everyone naturally knows it, and therefore there is no need to explain it. On the other hand, it seems that we are unable to present appropriate and effective proof at present. Hence, there is still much work that needs to be done in the future.

For now, event chain based plus deep learning based models are the most popular script learning approaches. The trend in this direction is to further improve them by introducing more advanced mechanisms and more complex structures. Most of these works extracted events from the New York Times (NYK) portion of the Gigaword corpus as a dataset and evaluated their models by the MCNC test.

Furthermore, very recently, some other notable works have attempted to enhance traditional script learning models by constructing event graphs or injecting external knowledge. Some new evaluation frameworks, such as SCT, MCNS, and MCNE, and several novel corpora, such as InScript, Event2Mind, and ATOMIC, were also proposed. These innovative works are very enlightening and have potential, and some of them have achieved quite good performance. We are looking forward to more revolutionary improvements shortly.

Finally, we list some representative script learning works, as well as their dataset, event representations, models, and evaluation approaches, in Table 1. Because these works have quite different experimental setups, it is not likely that a fair comparison can be conducted among them. Consequently, we do not list the specific experimental results and detailed processes, but only the main methods they used.

Table 1 A comparison of representative script learning works (columns: Reference, Dataset, Representation, Model, Evaluation). The compared works are Chambers and Jurafsky (2008); Jans et al. (2012); Balasubramanian et al. (2013); Pichotta and Mooney (2014); Modi and Titov (2014a, 2014b); Rudinger et al. (2015); Modi (2016); Granroth-Wilding and Clark (2016); Pichotta and Mooney (2016a, 2016b); Modi et al. (2017); Hu et al. (2017); Wang ZQ et al. (2017); Zhao et al. (2017); Li Q et al. (2018); Li ZY et al. (2018); Lee IT and Goldwasser (2018, 2019); Ding et al. (2019b); Lv et al. (2019); Li ZY et al. (2019). NYK: New York Times corpus; NYKG: NYK portion of the Gigaword corpus. PR: protagonist representation; DCR: discrete representation; DTR: distributed representation; MAR: multi-argument representation; RTR: raw text representation; FGPR: fine-grained property representation. CB: count-based; PB: pair-based; GB: graph-based; DLB: deep learning based; EK: external knowledge. NC: narrative cloze; ANC: adversarial narrative cloze; MCNC: multiple choice narrative cloze; SCT: story cloze test; MCNS: multiple choice narrative sequences; MCNE: multiple choice narrative explanation.

6.2 Future directions

6.2.1 Establishing a standard corpus and evaluation system

There is no standard corpus or evaluation system for script learning. The majority of related works use NLP tools to automatically extract events from the news articles in the New York Times (NYK) portion of the Gigaword corpus, and the NC or MCNC test is used to evaluate the extraction results. However, this method has the following defects: the types of chosen raw texts can vary widely, the quality of extraction results relies heavily on NLP tools, and the content of texts may lack detailed information about quite common and mundane normal life events. In addition, Mostafazadeh et al. (2016) and Chambers (2017) found that automatically generating event chains for evaluation may not test relevant script knowledge. Some script learning models that achieve high test scores tend to capture frequent event patterns instead of script knowledge. If there is no effective evaluation system, we cannot accurately judge whether a model can really work. Therefore, the establishment of a standardized corpus and evaluation system is particularly important for the development of script learning.

There are a few other works (Regneri et al., 2010; Modi et al., 2017; Ding et al., 2019b; Lee IT and Goldwasser, 2019) that attempt to solve this issue using a crowdsourced corpus, such as OMICS (Gupta and Kochenderfer, 2004), PDTB (Prasad et al., 2008), InScript (Modi et al., 2016),
Event2Mind (Rashkin et al., 2018), or ATOMIC (Sap et al., 2019). These crowdsourced corpora have advantages such as focusing on specific prototypical scenarios, containing rich script knowledge, having structure representations, and having clear annotated information. Specifically, the InScript corpus (Modi et al., 2016) was designed for script learning: all of the relevant verbs and noun phrases were annotated with event types and participant types, and it contained many subtle actions describing a particular scenario. These advantages can help models effectively process data and learn script knowledge, as well as avoid many of the problems caused by unstructured texts (Chambers, 2017).

However, this is far from a trivial issue. On one hand, the crowdsourced corpora require domain expertise, professional knowledge, and plenty of manual annotation and quality control. Thus, they are small in size but prohibitive in construction cost. On the other hand, automatically understanding unstructured natural texts is a vital goal and a necessary capability of advanced AI systems. Therefore, collecting a standard corpus that has both quantity and quality and establishing an effective evaluation system are important directions for future research.

6.2.2 Constructing the event graph

As we discussed in Section 4.6, the graph structure is clearer and more intuitive from the human interpretability perspective, and to some extent, it is also closer to the organizational form of event development in reality. It can express denser and broader connections among events, compared with the event pair or event chain. In addition, once scripts are represented in a graph structure, a variety of off-the-shelf graph-based algorithms can be used to solve event-related tasks. However, there are only a few models that represent a script as a graph structure and use graph-based approaches to solve event-related problems, so it is a research direction that deserves more attention in the future.

For example, the event logic graph (ELG) proposed by Ding et al. (2019a) is a potential event-related knowledge base that aims to discover the evolutionary patterns and logic of events. ELG integrates event-related knowledge into conventional knowledge graph techniques, which have been progressing fast recently, and organizes event evolutionary patterns into a commonsense knowledge base. The experimental results also show that ELG is effective for script-related tasks.

6.2.3 Using pre-trained language models

Recent works have shown that the two-stage framework that includes a pre-trained language model and a fine-tuning operation can bring great improvements to many NLP tasks. Specifically, a language model can be pre-trained on large-scale unsupervised corpora, such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), or GPT (Radford et al., 2019), and fine-tuned on target tasks, such as natural language inference (NLI) or reading comprehension. The pre-trained language models can learn universal language representations, which can lead to better generalization performance and speed up convergence for downstream NLP tasks. The pre-trained language models can also avoid the heavy work and complex process of training a model from scratch (Qiu et al., 2020).

It is natural to think that script learning tasks may also benefit from applying this framework. In addition, these novel pre-trained models can replace the embedding learning systems that have been widely used by script learning models for a long time, such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or DeepWalk (Perozzi et al., 2014). This direction is of great research value but there are only a few existing works. Here we briefly introduce two representative works.

Li ZY et al. (2019) designed a transferable BERT (TransBERT) training framework to solve the SCT task. In addition to a pre-trained BERT model, they introduced three semantically related supervised tasks (including NLI, sentiment classification, and next action prediction) to further pre-train the BERT model. Zheng et al. (2020) proposed a unified fine-tuning architecture (UniFA), which is a scalable and stackable framework consisting of four key components, i.e., BERT-base-uncased, raw-text, intra-event, and inter-event components, connected by a multi-step fine-tuning method. With the merits of cascaded training, multi-step fine-tuning can contribute to minimizing loss and also avoid being trapped in local optima. They also employed a scenario-level variational auto-encoder (S-VAE) to implicitly represent scenario-level knowledge. In particular, S-VAE regards the event chain as a dialog-generation process and further introduces a stochastic latent variable to guide the event representation.

6.2.4 Learning script by reinforcement learning

The script learning system is somewhat naturally similar to the reinforcement learning framework. The process of generating correct events one by one through an event chain is similar to the situation wherein an agent traverses through the space of events (Kaelbling et al., 1996; Li JW et al., 2016; Sutton and Barto, 2018).

In the reinforcement learning setup, an agent searches over all the possible states of the environment to achieve the highest rewards. The action of the agent refers to changing the state, and a corresponding reward will be given to the agent according to whether the action is good or bad. The policy of the agent refers to a sequence of actions taken by it, and the optimal policy is the policy that achieves the highest rewards.

In the script learning setup, each discrete state can represent an event. The action of the agent may refer to transferring to another event, based on current and previous states. If the agent transfers to a correct state (i.e., predicting the correct next event), then it will obtain a reward. The optimal policy of the agent may represent the prediction of the most likely event chain.

Reinforcement learning can help with many script learning tasks that are hard to solve now, for example, the event segment extraction task. As we mentioned in Section 4.5, Lv et al. (2019) used a self-attention mechanism to solve this task, but this approach is not accurate enough. We may design a reinforcement learning framework in which an agent traverses the event chain to extract the most valuable event segments. The state refers to whether the corresponding event is part of the extracted event segment, and the alternative action refers to changing the state and moving to the next event, or keeping the state and moving to the next event. After traversing the whole event chain, the final extracted event segment will be used to predict the next event, and a corresponding reward will be given based on the accuracy of the prediction.

6.2.5 Injecting external fine-grained knowledge

As we discussed in Section 4.7, some recent works attempt to enhance traditional script learning models by injecting external knowledge. They chose fine-grained event properties, such as intent, sentiment, and animacy of event participants, or nuanced relations between events, such as Reason, Temporal, and Contrast.

These works indicate that fine-grained event properties and relevant contextual information, such as argument, sentiment, animacy, time, or location, can potentially enhance script learning models, so other script learning models may also be improved by injecting external fine-grained knowledge.

This knowledge, however, is not necessarily the more the better. The experimental results of Lee IT and Goldwasser (2018) showed that although each property individually improves the performance, including too many properties might hurt it. Therefore, which and how many properties to choose remain open questions that need further study.

6.2.6 Building an interpretable system

Although deep learning technology has developed rapidly and reached impressive performance in many fields recently, its deep and complex nonlinear architecture makes the decision-making process highly non-transparent.

Deep learning models are known to be black boxes, so it is hard to explain the reasons behind their decisions. As for deep learning based script learning systems, the lack of interpretability exists mainly in two aspects: first, the output of the script learning system (i.e., the event prediction results) is hardly explainable to system users; second, the operation mechanism of how machines make a decision (i.e., the event encoding and matching algorithm) can hardly be investigated by system designers. Without letting users know why specific results are provided, the system may be less effective in persuading users to accept the results, which may further decrease the system's trustworthiness. Without letting designers know the mechanism behind the algorithm, it will be difficult for them to improve the system effectively.

In a broader sense, explainable artificial intelligence (XAI) (Arrieta et al., 2020) has recently sparked wider attention from the general AI
community. Many of the recent advances in ma- Chambers N, Jurafsky D, 2008. Unsupervised learning of th chine learning have been trying to open the black narrative event chains. Proc 46 Annual Meeting of box. For instance, Koh and Liang (2017) introduced the Association for Computational Linguistics, p.789- 797. a framework to analyze deep learning models based Chambers N, Jurafsky D, 2009. Unsupervised learning of on influence analyses; Pei et al. (2017) proposed a narrative schemas and their participants. Proc Joint th white-box testing mechanism to explore the nature Conf of the 47 Annual Meeting of the ACL and the 4th Int Joint Conf on Natural Language Processing of of deep learning models. The interpretability-related the AFNLP, p.602-610. works will help us understand the exact meaning of https://doi.org/10.5555/1690219.1690231 each feature in a neural network and how they in- Chung J, Gulcehre C, Cho K, et al., 2014. Empirical eval- teract to produce the final outputs. However, the uation of gated recurrent neural networks on sequence modeling. https://arxiv.org/abs/1412.3555 research in interpretable script learning systems is Church KW, Hanks P, 1990. Word association norms, mutual still in the initial stage, and there is much to be fur- information, and lexicography. Comput Ling, 16(1):22- ther studied. 29. https://doi.org/10.5555/89086.89095 Cullingford RE, 1978. Script Application: Computer Un- derstanding of Newspaper Stories. PhD Thesis, Yale Contributors University, New Haven, CT, USA. Yi HAN drafted the manuscript. Linbo QIAO, DeJong GF, 1979. Skimming Stories in Real Time: an Experiment in Integrated Understanding. PhD Thesis, Jianming ZHENG, and Hefeng WU helped organize the Yale University, New Haven, CT, USA. manuscript. Dongsheng LI and Xiangke LIAO led the prepa- Devlin J, Chang MW, Lee K, et al., 2019. BERT: pre-training ration of the manuscript. 
Yi HAN and Linbo QIAO revised of deep bidirectional transformers for language under- and finalized the paper. standing. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.4171-4186. Compliance with ethics guidelines https://doi.org/10.18653/v1/n19-1423 Yi HAN, Linbo QIAO, Jianming ZHENG, Hefeng WU, Ding X, Li ZY, Liu T, et al., 2019a. ELG: an event logic graph. https://arxiv.org/abs/1907.08015 Dongsheng LI, and Xiangke LIAO declare that they have no Ding X, Liao K, Liu T, et al., 2019b. Event representation conflict of interest. learning enhanced with external commonsense knowl- edge. Proc Conf on Empirical Methods in Natural References Language Processing and the 9th Int Joint Conf on Arrieta AB, Díaz-Rodríguez N, Del Ser J, et al., 2020. Ex- Natural Language Processing, p.4896-4905. plainable artificial intelligence (XAI): concepts, tax- https://doi.org/10.18653/v1/D19-1495 onomies, opportunities and challenges toward respon- Erk K, Padó S, 2008. A structured vector space model sible AI. Inform Fus, 58:82-115. for word meaning in context. Proc Conf on Empirical https://doi.org/10.1016/j.inffus.2019.12.012 Methods in Natural Language Processing, p.897-906. Balasubramanian N, Soderland S, Mausam, et al., 2013. Gen- https://doi.org/10.5555/1613715.1613831 erating coherent event schemas at scale. Proc Conf Fillmore CJ, 1976. Frame semantics and the nature of on Empirical Methods in Natural Language Processing, language. AnnNYAcadSci, 280(1):20-32. p.1721-1731. https://doi.org/10.1111/j.1749-6632.1976.tb25467.x Baroni M, Zamparelli R, 2010. Nouns are vectors, adjectives Glavaš G, Šnajder J, 2015. Construction and evaluation of are matrices: representing adjective-noun constructions event graphs. Nat Lang Eng, 21(4):607-652. in semantic space. Proc Conf on Empirical Methods in https://doi.org/10.1017/S1351324914000060 Natural Language Processing, p.1183-1193. Gordon AS, 2001. 
Browsing image collections with represen- https://doi.org/10.5555/1870658.1870773 tations of common-sense activities. JAmSocInform Bengio Y, Ducharme R, Vincent P, et al., 2003. A neu- Sci Technol, 52(11):925-929. ral probabilistic language model. JMachLearnRes, https://doi.org/10.1002/asi.1143 3:1137-1155. https://doi.org/10.5555/944919.944966 Granroth-Wilding M, Clark S, 2016. What happens next? Bordes A, Usunier N, Garcia-Durán A, et al., 2013. Trans- Event prediction using a compositional neural network lating embeddings for modeling multi-relational data. model. Proc 30th AAAI Conf on Artificial Intelligence, th Proc 26 Int Conf on Neural Information Processing p.2727-2733. https://doi.org/10.5555/3016100.3016283 Systems, p.2787-2795. Gupta R, Kochenderfer MJ, 2004. Common sense data https://doi.org/10.5555/2999792.2999923 acquisition for indoor mobile robots. Proc 19th National Bower GH, Black JB, Turner TJ, 1979. Scripts in memory Conf on Artifical Intelligence, p.605-610. for text. Cogn Psychol, 11(2):177-220. https://doi.org/10.5555/1597148.1597246 https://doi.org/10.1016/0010-0285(79)90009-4 Harris ZS, 1954. Distributional structure. Word, 10(2- Chambers N, 2017. Behind the scenes of an evolving event 3):146-162. cloze test. Proc 2nd Workshop on Linking Models Hochreiter S, Schmidhuber J, 1997. Long short-term mem- of Lexical, Sentential and Discourse-Level Semantics, ory. Neur Comput, 9(8):1735-1780. p.41-45. https://doi.org/10.18653/v1/w17-0905 https://doi.org/10.1162/neco.1997.9.8.1735 Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373 371

Hu LM, Li JZ, Nie LQ, et al., 2017. What happens next? Lin ZH, Feng MW, Dos Santos CN, et al., 2017. A structured Future subevent prediction using contextual hierarchical self-attentive sentence embedding. Proc 5th Int Conf LSTM. Proc 31st AAAI Conf on Artificial Intelligence, on Learning Representations. p.3450-3456. https://doi.org/10.5555/3298023.3298070 Luong T, Pham H, Manning CD, 2015. Effective approaches Jans B, Bethard S, Vulic, et al., 2012. Skip N-grams and to attention-based neural . Proc ranking functions for predicting script events. Proc Conf on Empirical Methods in Natural Language Pro- th 13 Conf of the European Chapter of the Association cessing, p.1412-1421. for Computational Linguistics, p.336-344. https://doi.org/10.18653/v1/d15-1166 Jones MP, Martin JH, 1997. Contextual spelling correction Lv SW, Qian WH, Huang LT, et al., 2019. SAM-Net: inte- 5th using . Proc Conf on Ap- grating event-level and chain-level attentions to predict plied Natural Language Processing, p.166-173. what happens next. Proc AAAI Conf on Artificial In- https://doi.org/10.3115/974557.974582 telligence, p.6802-6809. Kaelbling LP, Littman ML, Moore AW, 1996. Reinforcement https://doi.org/10.1609/aaai.v33i01.33016802 learning: a survey. JArtifIntellRes, 4:237-285. Mausam, Schmitz M, Bart R, et al., 2012. Open language https://doi.org/10.1613/jair.301 learning for information extraction. Proc Joint Conf on Khan A, Salim N, Kumar YJ, 2015. A framework for multi- Empirical Methods in Natural Language Processing and document abstractive summarization based on semantic Computational Natural Language Learning, p.523-534. role labelling. Appl Soft Comput, 30:737-747. https://doi.org/10.5555/2390948.2391009 https://doi.org/10.1016/j.asoc.2015.01.070 McCann B, Bradbury J, Xiong CM, et al., 2017. Learned Kiros R, Zhu YK, Salakhutdinov R, et al., 2015. Skip-thought in translation: contextualized word vectors. Proc 31st vectors. 
Proc 28th Int Conf on Neural Information Int Conf on Neural Information Processing Systems, Processing Systems, p.3294-3302. p.6297-6308. https://doi.org/10.5555/3295222.3295377 https://doi.org/10.5555/2969442.2969607 Koh PW, Liang P, 2017. Understanding black-box predic- Miikkulainen R, 1992. Discern: a distributed neural network tions via influence functions. Proc 34th Int Conf on model of script processing and memory. University Machine Learning, p.1885-1894. Twente, Connectionism and Natural Language Process- Laender AHF, Ribeiro-Neto BA, Da Silva AS, et al., 2002. ing, p.115-124. A brief survey of web data extraction tools. ACM Miikkulainen R, 1993. Subsymbolic Natural Language Pro- SIGMOD Rec, 31(2):84-93. cessing: an Integrated Model of Scripts, Lexicon, and https://doi.org/10.1145/565117.565137 Memory. MIT Press, Cambridge, USA. Lee G, Flowers M, Dyer MG, 1992. Learning distributed Mikolov T, Chen K, Corrado G, et al., 2013. Efficient representations of conceptual knowledge and their ap- estimation of word representations in vector space. plication to script-based story processing. In: Sharkey https://arxiv.org/abs/1301.3781 N (Ed.), Connectionist Natural Language Processing. Miller GA, 1995. WordNet: a lexical database for English. Springer, Dordrecht, p.215-247. Commun ACM, 38(11):39-41. https://doi.org/10.1007/978-94-011-2624-3_11 https://doi.org/10.1145/219717.219748 Lee IT, Goldwasser D, 2018. FEEL: featured event em- Minsky M, 1975. A framework for representing knowledge. nd bedding learning. Proc 32 AAAI Conf on Artificial In: Winston PH (Ed.), The Psychology of Computer Intelligence. Vision. McGraw-Hill Book, New York, USA. Lee IT, Goldwasser D, 2019. Multi-relational script learning Mnih A, Hinton G, 2007. Three new graphical models for 57th for discourse relations. Proc Annual Meeting of statistical language modelling. Proc 24th Int Conf on the Association for Computational Linguistics, p.4214- Machine Learning, p.641-648. 4226. 
https://doi.org/10.18653/v1/p19-1413 https://doi.org/10.1145/1273496.1273577 Li JW, Monroe W, Ritter A, et al., 2016. Deep rein- Modi A, 2016. Event embeddings for semantic script model- forcement learning for dialogue generation. Proc Conf ing. Proc 20th SIGNLL Conf on Computational Natural on Empirical Methods in Natural Language Processing, Language Learning, p.75-83. p.1192-1202. https://doi.org/10.18653/v1/D16-1127 https://doi.org/10.18653/v1/k16-1008 Li Q, Li ZW, Wei JM, et al., 2018. A multi-attention based Modi A, Titov I, 2014a. Inducing neural models of script neural network with external knowledge for story ending th th knowledge. Proc 18 Conf on Computational Natural predicting task. Proc 27 Int Conf on Computational Language Learning, p.49-57. Linguistics, p.1754-1762. https://doi.org/10.3115/v1/w14-1606 Li ZY, Ding X, Liu T, 2018. Constructing narrative event evolutionary graph for script event prediction. Proc Modi A, Titov I, 2014b. Learning semantic script knowledge 2nd 27th Int Joint Conf on Artificial Intelligence, p.4201- with event embeddings. Proc Int Conf on Learning 4207. https://doi.org/10.5555/3304222.3304354 Representations. Li ZY, Ding X, Liu T, 2019. Story ending prediction by Modi A, Anikina T, Ostermann S, et al., 2016. InScript: transferable BERT. Proc 28th Int Joint Conf on Artifi- narrative texts annotated with script information. Proc th cial Intelligence, p.1800-1806. 10 Int Conf on Language Resources and Evaluation. https://doi.org/10.24963/ijcai.2019/249 Modi A, Titov I, Demberg V, et al., 2017. Modeling se- Lin YK, Liu ZY, Sun MS, et al., 2015. Learning entity and mantic expectation: using script knowledge for referent relation embeddings for knowledge graph completion. prediction. Trans Assoc Comput Ling, 5(2):31-44. Pro 29th AAAI Conf on Artificial Intelligence. https://doi.org/10.1162/tacl_a_00044 372 Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373

Mostafazadeh N, Chambers N, He XD, et al., 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.839-849. https://doi.org/10.18653/v1/n16-1098

Mueller ET, 1998. Natural Language Processing with ThoughtTreasure. Signiform, New York, USA.

Navigli R, 2009. Word sense disambiguation: a survey. ACM Comput Surv, 41(2):10. https://doi.org/10.1145/1459352.1459355

Orr JW, Tadepalli P, Doppa JR, et al., 2014. Learning scripts as hidden Markov models. Proc 28th AAAI Conf on Artificial Intelligence, p.1565-1571. https://doi.org/10.5555/2892753.2892770

Osman AH, Salim N, Binwahlan MS, et al., 2012. Plagiarism detection scheme based on semantic role labeling. Proc Int Conf on Information Retrieval & Knowledge Management, p.30-33. https://doi.org/10.1109/InfRKM.2012.6204978

Pei KX, Cao YZ, Yang JF, et al., 2017. DeepXplore: automated whitebox testing of deep learning systems. Proc 26th Symp on Operating Systems Principles, p.1-18. https://doi.org/10.1145/3132747.3132785

Pennington J, Socher R, Manning C, 2014. GloVe: global vectors for word representation. Proc Conf on Empirical Methods in Natural Language Processing, p.1532-1543. https://doi.org/10.3115/v1/d14-1162

Perozzi B, Al-Rfou R, Skiena S, 2014. DeepWalk: online learning of social representations. Proc 20th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.701-710. https://doi.org/10.1145/2623330.2623732

Peters M, Neumann M, Iyyer M, et al., 2018. Deep contextualized word representations. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.2227-2237. https://doi.org/10.18653/v1/n18-1202

Pichotta K, Mooney R, 2014. Statistical script learning with multi-argument events. Proc 14th Conf of the European Chapter of the Association for Computational Linguistics, p.220-229. https://doi.org/10.3115/v1/e14-1024

Pichotta K, Mooney RJ, 2016a. Learning statistical scripts with LSTM recurrent neural networks. Proc 30th AAAI Conf on Artificial Intelligence, p.2800-2806. https://doi.org/10.5555/3016100.3016293

Pichotta K, Mooney RJ, 2016b. Using sentence-level LSTM language models for script inference. Proc 54th Annual Meeting of the Association for Computational Linguistics, p.279-289. https://doi.org/10.18653/v1/p16-1027

Prasad R, Dinesh N, Lee A, et al., 2008. The Penn Discourse TreeBank 2.0. Proc 6th Int Conf on Language Resources and Evaluation, p.2961-2968.

Qiu XP, Sun TX, Xu YG, et al., 2020. Pre-trained models for natural language processing: a survey. https://arxiv.org/abs/2003.08271

Radford A, Narasimhan K, Salimans T, et al., 2019. Improving language understanding by generative pre-training. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.4171-4186.

Radinsky K, Agichtein E, Gabrilovich E, et al., 2011. A word at a time: computing word relatedness using temporal semantic analysis. Proc 20th Int Conf on World Wide Web, p.337-346. https://doi.org/10.1145/1963405.1963455

Rashkin H, Sap M, Allaway E, et al., 2018. Event2Mind: commonsense inference on events, intents, and reactions. Proc 56th Annual Meeting of the Association for Computational Linguistics, p.463-473. https://doi.org/10.18653/v1/P18-1043

Regneri M, Koller A, Pinkal M, 2010. Learning script knowledge with web experiments. Proc 48th Annual Meeting of the Association for Computational Linguistics, p.979-988. https://doi.org/10.5555/1858681.1858781

Rudinger R, Rastogi P, Ferraro F, et al., 2015. Script induction as language modeling. Proc Conf on Empirical Methods in Natural Language Processing, p.1681-1686. https://doi.org/10.18653/v1/d15-1195

Rumelhart DE, 1980. Schemata: the building blocks of cognition. In: Spiro RJ (Ed.), Theoretical Issues in Reading Comprehension. Erlbaum, Hillsdale, p.33-58.

Sap M, Le Bras R, Allaway E, et al., 2019. ATOMIC: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf on Artificial Intelligence, p.3027-3035. https://doi.org/10.1609/aaai.v33i01.33013027

Schank RC, 1983. Dynamic Memory: a Theory of Reminding and Learning in Computers and People. Cambridge University Press, New York, USA.

Schank RC, 1990. Tell Me a Story: a New Look at Real and Artificial Memory. Charles Scribner, New York, USA.

Schank RC, Abelson RP, 1977. Scripts, Plans, Goals and Understanding: an Inquiry into Human Knowledge Structures. L. Erlbaum, Hillsdale, USA.

Schuler KK, 2005. VerbNet: a Broad-Coverage, Comprehensive Verb Lexicon. PhD Thesis, University of Pennsylvania, Pennsylvania, USA.

Shen D, Lapata M, 2007. Using semantic roles to improve question answering. Proc Joint Conf on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.12-21.

Socher R, Huval B, Manning CD, et al., 2012. Semantic compositionality through recursive matrix-vector spaces. Proc Joint Conf on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.1201-1211. https://doi.org/10.5555/2390948.2391084

Sutton RS, Barto AG, 2018. Reinforcement Learning: an Introduction (2nd Ed.). MIT Press, Cambridge, USA.

Taylor WL, 1953. “Cloze procedure”: a new tool for measuring readability. J Mass Commun Q, 30(4):415-433. https://doi.org/10.1177/107769905303000401

Terry WS, 2006. Learning and Memory: Basic Principles, Processes, and Procedures. Allyn and Bacon, Boston, USA.

Tulving E, 1983. Elements of Episodic Memory. Oxford University Press, New York, USA.

Wang Z, Zhang JW, Feng JL, et al., 2014. Knowledge graph embedding by translating on hyperplanes. Proc 28th AAAI Conf on Artificial Intelligence, p.1112-1119. https://doi.org/10.5555/2893873.2894046

Wang ZQ, Zhang Y, Chang CY, 2017. Integrating order information and event relation for script event prediction. Proc Conf on Empirical Methods in Natural Language Processing, p.57-67. https://doi.org/10.18653/v1/d17-1006

Weber N, Balasubramanian N, Chambers N, 2018. Event representations with tensor-based compositions. Proc 32nd AAAI Conf on Artificial Intelligence, p.4946-4953.

Weston J, Chopra S, Bordes A, 2015. Memory networks. https://arxiv.org/abs/1410.3916

Zhao SD, Wang Q, Massung S, et al., 2017. Constructing and embedding abstract event causality networks from text snippets. Proc 10th ACM Int Conf on Web Search and Data Mining, p.335-344. https://doi.org/10.1145/3018661.3018707

Zheng JM, Cai F, Chen HH, 2020. Incorporating scenario knowledge into a unified fine-tuning architecture for event representation. Proc 43rd Int ACM SIGIR Conf on Research and Development in Information Retrieval, p.249-258. https://doi.org/10.1145/3397271.3401173

Zhou MT, Huang ML, Zhu XY, 2019. Story ending selection by finding hints from pairwise candidate endings. IEEE/ACM Trans Audio Speech Lang Process, 27(4):719-729. https://doi.org/10.1109/TASLP.2019.2893499

Yi HAN, first author of this invited paper, received his BS and MS degrees from the National University of Defense Technology (NUDT), Changsha, China in 2016 and 2018, respectively, and is currently a PhD candidate at the College of Computer Science, NUDT. His research interests include natural language processing, event extraction, and few-shot learning.

Linbo QIAO, corresponding author of this invited paper, received his BS, MS, and PhD degrees in computer science and technology from NUDT, Changsha, China in 2010, 2012, and 2017, respectively. He is now an assistant research fellow at the National Lab for Parallel and Distributed Processing, NUDT. He worked as a research assistant at the Chinese University of Hong Kong from May to Oct. 2014. His research interests include structured sparse learning, online and distributed optimization, and deep learning for graph and graphical models.

Jianming ZHENG received his BS and MS degrees from NUDT, China in 2016 and 2018, respectively, and is currently a PhD candidate at the School of System Engineering, NUDT. His research interests include semantic representation, few-shot learning, and their applications in information retrieval. He has published several papers in SIGIR, WWW, COLING, IPM, Cognitive Computation, etc.

Hefeng WU received his BS degree in Computer Science and Technology and his PhD degree in Computer Application Technology from Sun Yat-sen University, China in 2008 and 2013, respectively. He is currently a full research scientist at the School of Data and Computer Science, Sun Yat-sen University. His research interests include computer vision, multimedia, and machine learning. He has served as a reviewer for many academic journals and conferences, including TIP, TCYB, TSMC, TCSVT, PR, CVPR, ICCV, NeurIPS, and ICML.

Dongsheng LI received his BS degree (with honors) and PhD degree (with honors) in computer science from the College of Computer Science, NUDT, Changsha, China in 1999 and 2005, respectively. He was awarded the prize of National Excellent Doctoral Dissertation by the Ministry of Education of China in 2008, and the National Science Fund for Distinguished Young Scholars in 2020. He is now a full professor at the National Lab for Parallel and Distributed Processing, NUDT, and a corresponding expert of Frontiers of Information Technology & Electronic Engineering. His research interests include parallel and distributed computing, cloud computing, and large-scale data management.

Xiangke LIAO received his BS degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China in 1985, and his MS degree from NUDT, Changsha, China in 1988. He is currently a full professor at NUDT and an academician of the Chinese Academy of Engineering. His research interests include parallel and distributed computing, high-performance computer systems, operating systems, cloud computing, and networked embedded systems.