Han et al. / Front Inform Technol Electron Eng 2021 22(3):341-373 341
Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online); E-mail: [email protected]

Review: A survey of script learning*
Yi HAN1, Linbo QIAO‡1, Jianming ZHENG2, Hefeng WU3, Dongsheng LI1, Xiangke LIAO1
1Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410000, China
2Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410000, China
3School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Received July 13, 2020; Revision accepted Nov. 11, 2020; Crosschecked Jan. 8, 2021
Abstract: A script is a structured knowledge representation of prototypical real-life event sequences. Learning the commonsense knowledge inside scripts can help machines understand natural language and draw commonsensible inferences. Script learning is an interesting and promising research direction, in which a trained script learning system can process narrative texts to capture script knowledge and draw inferences. However, there are currently no survey articles on script learning, so we are providing this comprehensive survey to deeply investigate the standard framework and the major research topics in script learning. This research field contains three main topics: event representations, script learning models, and evaluation approaches. For each topic, we systematically summarize and categorize the existing script learning systems, and carefully analyze and compare the advantages and disadvantages of the representative systems. We also discuss the current state of the research and possible future directions.
Key words: Script learning; Natural language processing; Commonsense knowledge modeling; Event reasoning
https://doi.org/10.1631/FITEE.2000347
CLC number: TP391
‡ Corresponding author
* Project supported by the National Natural Science Foundation of China (No. 61806216)
ORCID: Yi HAN, https://orcid.org/0000-0001-9176-8178; Linbo QIAO, https://orcid.org/0000-0002-8285-2738
© Zhejiang University Press 2021

1 Introduction

Understanding is one of the most important components of a human-level artificial intelligence (AI) system. However, developing such a system is far from trivial, because natural language comes with its own complexity and inherent ambiguities. To understand natural language, machines need to comprehend not only literal meaning but also commonsense knowledge about the real world.

Commonsense knowledge includes certain internal logic and objective laws based on the development of events, which can be implicitly learned and comprehended by humans in long-term daily-life activities. After learning such knowledge, they can use it to help understand the implicit information and address the ambiguities of natural language. However, machines cannot perform such daily-life activities, and therefore lack commonsense knowledge, which is one of the major obstacles to understanding natural language and drawing human-like inferences.

Some researchers attempt to alleviate such bottlenecks from the perspective of how humans remember and understand various real-life events. According to many psychological experiments, script knowledge is an essential component of human cognition and memory systems (Bower et al., 1979; Schank, 1983; Tulving, 1983; Terry, 2006). From the perspective of cognitive psychology, humans use schemas to organize and understand the world. A schema is a mental framework that consists of a general template for the knowledge shared by all instances, and a number of slots that can take on different values for a specific instance.
A script is a special case of a schema that describes stereotypical activities with a sequence of events organized in temporal order (Rumelhart, 1980). When a person first meets a prototypical activity, his/her episodic memory will store the events and the scenarios in which they occurred in the form of a script (Tulving, 1983). Psychological research has indicated that the activity stored in episodic memory can be activated by some particular external event (Terry, 2006). So, the next time a similar or relevant event happens, the person's episodic memory will activate the old corresponding script knowledge. Based on the script knowledge, the person will soon understand the actions and the participants of the scenario, and subsequently form expectations for possible upcoming events. These expectations can deeply affect the ability of humans to comprehend specific scenarios and draw reasonable inferences.

Since script knowledge is critical for humans to remember and comprehend different scenarios, machines may also gain benefits by learning such knowledge. It is this intention that motivates researchers to introduce script learning to encode script knowledge for machines. A trained script learning system can process texts that describe daily-life events and capture commonsense knowledge involving the actions and the entities that participate in them. Based on this, it can also draw event-related inferences that are reasonable but not explicitly stated in the texts.

1.1 Definition of script

A script is a structural knowledge representation that captures the relationships between prototypical event sequences and their participants in a given scenario (Schank and Abelson, 1977). In other words, the script is a sequence of prototypical events organized in temporal order. These events contain stereotypical human activities, such as eating in a restaurant, cooking dinner, and making coffee. For example, Fig. 1 shows two toy examples of the "visiting restaurant" script and the "making coffee" script.

Fig. 1 Scripts for visiting a restaurant and making coffee. Each script includes a sequence of events for a prototypical scenario

1.2 Objective of script learning

The objective of script learning is encoding the commonsense knowledge contained in prototypical events (also called "script knowledge") in a machine-readable format. Then the machine can use the format to draw event-related inferences, such as inferring missing events, predicting subsequent events, determining the order of the event chain, and distinguishing similar events.

The commonsense knowledge provided by the script can help machines understand the practical meanings of natural language and draw human-like inferences. It will also play a crucial role in a wide range of AI tasks, especially many natural language processing (NLP) tasks, such as event prediction, event extraction, ambiguity resolution, discourse understanding, intention recognition, question answering, and coreference resolution.

1.3 Framework of script learning

Note that this survey concentrates mainly on broad-domain script knowledge learned from narrative event chains, which can be used to compare and infer possible upcoming events in a specific context, rather than on works attempting to construct scenario-specific event schemas. Therefore, the framework of the script learning systems with which we are concerned can be illustrated by the basic process and a simple example in Fig. 2.
[Fig. 2 example: from the raw narrative "John walked into his favorite restaurant and sat on a chair. Later, he read the menu and ordered a meal. After 10 minutes, he got the food and ate it. After a while, he made payment and ...", NLP tools extract the event chain Walk(John, restaurant), Seat(John, chair), Read(John, menu), Order(John, meal), Get(John, food), Eat(John, food), Make(John, payment). The model then matches this chain against candidate subsequent events such as Leave(John, restaurant), Pay(John, waitress), Take(John, dog, walk), Swim(John, pool), and Sleep(John, bed), and outputs Leave(John, restaurant) as the predicted subsequent event.]
Fig. 2 Framework of script learning which contains four stages. In the extraction stage, events are extracted from raw texts. In the representation stage, events are represented by a structured form. In the model stage, the structured representations are encoded and matched by script learning models, and the matching result is outputted. In the evaluation stage, some evaluation approaches with specific tasks and corresponding metrics are employed to test the performance of the models
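The four-stage process in Fig. 2 can be sketched end to end as a toy pipeline. All function names, the canned extraction result, and the argument-overlap scoring rule below are hypothetical illustrations, not part of any system surveyed:

```python
# Toy sketch of the four-stage script learning pipeline (Fig. 2).
# All names and the scoring rule are hypothetical illustrations.

def extract_events(text):
    """Extraction stage: a real system would run parsing and coreference;
    here the result is canned for the restaurant narrative."""
    return [("walk", "John", "restaurant"), ("order", "John", "meal"),
            ("eat", "John", "food")]

def represent(event):
    """Representation stage: structure an event as predicate(subject, object)."""
    predicate, subject, obj = event
    return f"{predicate}({subject}, {obj})"

def score(chain, candidate):
    """Model stage: a toy matching metric counting shared arguments
    between the candidate and the context chain."""
    context_args = {arg for (_, s, o) in chain for arg in (s, o)}
    _, s, o = candidate
    return sum(arg in context_args for arg in (s, o))

def predict(chain, candidates):
    """Pick the candidate with the highest matching score; the evaluation
    stage would compare this prediction against the gold answer."""
    return max(candidates, key=lambda c: score(chain, c))

chain = extract_events("John walked into his favorite restaurant ...")
candidates = [("pay", "John", "bill"), ("swim", "Mary", "pool")]
print(represent(predict(chain, candidates)))  # pay(John, bill)
```

Real systems replace each stage with far stronger components (dependency parsers, neural encoders, learned matching functions), but the stage boundaries and their inputs/outputs are the same.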
According to the standard script learning settings, our framework can be roughly divided into four stages: the extraction stage, representation stage, model stage, and evaluation stage.

Script knowledge is often mentioned implicitly in the unstructured narrative texts that describe daily-life activities. The first task of script learning is employing NLP tools to detect and extract events from the raw text (i.e., the extraction stage in Fig. 2).

The output from the previous stage is then represented in a specific structure or neural network to form representations of events (i.e., the representation stage in Fig. 2). In this stage, events extracted by NLP tools are regarded as the input, and structured representations of events as the output.

After that, the script learning models will encode the collected event representations, and measure the relationship between them using event matching algorithms (i.e., the model stage in Fig. 2). For this stage, the input is the structured event representations, and the output is the matching results. Notably, similarity or coherence scores are usually adopted as the matching metrics.

Finally, some evaluation approaches with specific tasks and corresponding metrics are employed to test the quality of the event representations and the performance of the script learning models (i.e., the evaluation stage in Fig. 2). For this stage, the input is the event matching results, and the output is performance, based on comparisons between prediction results and gold-standard results.

Specifically, in Fig. 2, the extracted events are represented by the structure predicate(subject, object), while the evaluation is to judge whether the model can correctly predict the subsequent event from a candidate set, given the event chain that has occurred.

1.4 Organization of the survey

We can conclude from the above framework that script learning primarily focuses on five aspects: the text corpus used as a training dataset, the NLP tools for detecting and extracting events, the representation form of script events, the models for encoding and matching events, and the evaluation approaches for estimating the performance of the models.

There is no standard corpus for script learning, and a majority of related works extract events from news articles by themselves. Therefore, we do not discuss corpora in detail in the main part of the survey, but introduce such literature in the future research directions described in Section 6.2. In addition, event detection and extraction belong to other independent NLP research directions, and there are many off-the-shelf open-source NLP tools that can be used conveniently. Consequently, in this survey, we do not consider event detection and extraction either. In summary, we focus mainly on three aspects, i.e., event representations, script learning models, and evaluation approaches.

The rest of the survey is also organized primarily based on these research topics. Specifically, Section 2 briefly reviews the development history of script learning; Section 3 introduces several typical event representation structures; Section 4 introduces representative script learning models from seven categories; Section 5 introduces some common evaluation approaches and corresponding metrics; Section 6 concludes the survey and discusses some possible future research directions. The organization of these sections and some related specific items are shown in Fig. 3.

Although script learning is a promising and significant research direction, there are currently no survey articles on the subject, to the best of our knowledge. The contributions of this survey can be summarized as follows:

1. We are the first to provide a comprehensive survey that deeply investigates the standard framework and major research topics in script learning.

2. We thoroughly summarize the development process and the technique taxonomy of script learning systems.

3. We carefully analyze and compare the most representative works on script learning and discuss some promising research directions.

2 Historical overview

In this section, we provide a chronological timeline of script learning research. Based on modeling methods, the script learning development timeline can be roughly divided into three stages: incipient rule-based approaches (1975–2008), early count-based approaches (2008–2014), and recent deep learning based approaches (2014–). Although there are different ways of dividing this timeline, we follow the above division to briefly review the script learning development process, because it best fits the chronological timeline.
[Fig. 3 lists the organization: 1. Introduction; 2. Historical overview; 3. Event representations (protagonist, relational, multi-argument, and fine-grained property representations); 4. Script learning models (rule-based, count-based, deep learning based, pair-based, chain-based, graph-based, and external knowledge models); 5. Evaluation approaches (NC, ANC, MCNC, MCNS, MCNE, SCT); 6. Conclusions and future directions]

Fig. 3 Organization of the survey. We discuss mainly three research topics, i.e., event representations, script learning models, and evaluation approaches. These topics and their related specific items are introduced in Sections 3–5

2.1 Rule-based approaches (1975–2008)

Although script learning has attracted wide attention in recent years, the use of script-related concepts in AI research dates back to the 1970s, in particular, the seminal works on schemas by Rumelhart (1980), on frames by Minsky (1975), on semantic frames by Fillmore (1976), and on scripts by Schank and Abelson (1977). These works established the core concepts of script learning.

The incipient script theory was primarily introduced to construct story understanding and generation systems. The works in this stage often employed quite complex rules (mostly handwritten) to process events and used a set of separate models to treat different scenarios. Although these works showed some abilities in script understanding and inferring, their rule-based methods do not scale to complex domains because they are characteristically domain-specific and time-consuming to construct manually.
2.2 Count-based approaches (2008–2014)

The trend of script learning was revived by count-based methods, which can automatically learn broad-domain script knowledge using statistical approaches to obtain co-occurrence counts from a large text corpus.

Count-based script learning stems from the unsupervised framework proposed by Chambers and Jurafsky (2008). They ran a co-reference system and a dependency parser to automatically extract event sequences from raw text. Pointwise mutual information (PMI) scores were employed to measure the relationship between event pairs. They also pioneered the use of

2.3 Deep learning based approaches (2014–)

The emergence of deep learning technology has brought NLP to a new era; modern script learning systems widely introduce neural networks and embeddings to counter the above shortcomings. Embeddings mitigate the sparsity issue and provide a useful way to compose predicates with their arguments. Deep learning based models use word embeddings to represent event components (i.e., a predicate and its arguments) and obtain the embedding of the entire event by composing the embeddings of its components via a composition neural network. The relationship between events can also
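The compositional encoding just described (composing the word embeddings of a predicate and its arguments into one event embedding) can be sketched as follows; the vocabulary, dimensionality, and the simple sum-and-squash composition are illustrative stand-ins for a trained composition network:

```python
# Toy sketch: compose word embeddings of event components into an event
# embedding, then compare two events by cosine similarity. Vocabulary,
# dimensionality, and the sum-and-squash composition are illustrative.
import math
import random

random.seed(0)
DIM = 8
vocab = ["order", "John", "meal", "eat", "food"]
emb = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in vocab}

def compose_event(predicate, subj, obj):
    """Sum the component embeddings and squash elementwise -- a stand-in
    for a learned composition neural network."""
    parts = [emb[predicate], emb[subj], emb[obj]]
    return [math.tanh(sum(p[i] for p in parts)) for i in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

e1 = compose_event("order", "John", "meal")
e2 = compose_event("eat", "John", "food")
print(round(cosine(e1, e2), 3))  # a relatedness score in [-1, 1]
```

In a real model, the random embeddings and the composition parameters would be trained jointly so that related events end up close in the embedding space.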
Jurafsky, 2009; Jans et al., 2012; Rudinger et al., 2015). However, it still has some flaws.

First, it solely concentrates on organizing event chains around a central participant, the protagonist. Thus, a script that contains multiple participants will redundantly produce many chains. For example, the narrative text "Mary emailed Jim and he responded to her immediately" yields two chains: a chain for Mary, (email, subject)→(respond, object), and a chain for Jim, (email, object)→(respond, subject).

Second, it treats subject-verb and verb-object separately. Thus, it lacks coherence between subject and object; in other words, it does not express interactions between different

3.2 Relational representation

(1) (He, cited, a new study);
(2) (A new study, was released by, UCLA);
(3) (A new study, was released in, 2008).

A relational triple provides a more specific representation method, which aids in maintaining coherence between subject and object. Balasubramanian et al. (2013) used it to induce open-domain event schemas, which effectively overcame the coherence-lacking issue of the event schemas proposed by Chambers and Jurafsky (2009).

3.3 Multi-argument representation
play an important role in predicting the next event. Consider two events: "Jenny was angry" and "Jenny punched her son." The predicate adjective "angry" is crucial information for predicting the follow-up event correctly. These works represent the predicate adjective as an argument to the verb be or become, e.g., "Jenny was angry" → be(Jenny, angry).

3.4 Fine-grained property representation

Lee IT and Goldwasser (2018) held the view that fine-grained event properties, such as argument, sentiment, animacy, time, and location, can potentially enhance the event representation, so they suggested a fine-grained property representation. They represented an event as a 6-tuple
From the perspective of the different structures used to handle scripts, they can also be divided into three other categories: event pair based models, event chain based models, and event graph based models. Pair-based models focus on calculating associations between pairs of events by count- or vector-based methods; chain-based models capture the order information and long-term context information of the full event chain by RNN-based approaches; graph-based models extract script knowledge by constructing graph structures that can express denser and broader connections among events.

In this section, we discuss some representative models in detail, according to all of these six categories. Finally, we also introduce some other notable works outside the above categories that attempt to enhance traditional script learning models by injecting external knowledge. Notably, because these categories are divided from various perspectives, there is inevitably some overlap between them. In addition, each script learning model may involve multiple categories, so when discussing specific categories below, we may emphasize some particular aspects.

4.1 Rule-based models

The handcrafted rule-based script learning models are the earliest use of script-related concepts in AI research. The seminal works include schemas by Rumelhart (1980), frames by Minsky (1975), semantic frames by Fillmore (1976), and scripts by Schank and Abelson (1977). Although these works established the core concepts of script learning and made many beneficial attempts, they still have many obvious demerits and limitations. Below, we briefly introduce and discuss these incipient works.

A schema (Rumelhart, 1980) is a broad concept of cognitive psychology and an important theoretical source of the script. A schema is a mental framework employed to organize and understand the world, and consists of a general template and some slots. The general template represents an abstract class containing the knowledge shared by all instances, while the slots can be filled in with different values to form a specific instance of this class. For example, a restaurant scenario has its own schema, which includes a sequence of events that take place in a restaurant, such as entering the restaurant, ordering food, eating food, paying the bill, and leaving the restaurant. When we go to a restaurant, we can activate the restaurant schema that is stored in our memory, and anticipate the upcoming events.

A frame (Minsky, 1975) is a variation of the schema, and is a data structure used in AI to represent knowledge about stereotyped situations. The frame has a hierarchical structure containing four different levels: syntactic surface frames, semantic surface frames, thematic frames, and narrative frames. Similar to the schema, the top levels (thematic frames and narrative frames) are fixed general templates with slots for different situations. In contrast, the bottom levels (surface syntactic frames and surface semantic frames) are specific instances that fill in slots with different values.

A semantic frame (Fillmore, 1976) can be roughly seen as a lexical-level Minsky frame, which was proposed to capture the abstract description of an individual activity and all of its possible roles. A semantic frame includes a set of predicates that can describe the main concept of the activity and the corresponding semantic roles, which can describe the different entities involved in the activity. The same activity can be expressed via different surface realizations, with different predicates or sentence constructions. For example, consider the following three sentences:

(1) Microsoft purchased the Canadian company Maluuba.
(2) AI company Maluuba was bought by Microsoft on January 9, 2017.
(3) Microsoft purchased Maluuba for 30 million dollars.

Although different in surface forms, their core semantic meaning is similar, namely a commerce purchase activity. The purpose of the semantic frame is to capture the core semantic meaning across all the surface forms, by abstracting out the information about the main activity. The semantic frame also assigns proper semantic roles to the entities involved in the activity (e.g., Microsoft is the buyer role, and Maluuba is the role of the goods). The semantic roles can help understand the meaning of the text by answering questions like "who did what to whom and by what means?" They are also useful for a lot of NLP tasks, such as question answering (Shen and Lapata, 2007), document summarization (Khan et al., 2015), and plagiarism detection (Osman et al., 2012).
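A semantic frame of this kind can be approximated as a template with role slots. The sketch below maps the three surface sentences above onto one hypothetical commerce-purchase frame; the frame name and role inventory are loosely FrameNet-style assumptions, not an exact reproduction of Fillmore's formalism:

```python
# Sketch: three surface realizations filling one semantic frame. The frame
# name and role inventory are hypothetical (loosely FrameNet-style) and not
# an exact reproduction of Fillmore's formalism.

FRAME = "Commerce_purchase"

def annotate(buyer=None, goods=None, money=None, time=None):
    """Fill the frame's role slots for one sentence; absent roles stay None."""
    return {"frame": FRAME, "Buyer": buyer, "Goods": goods,
            "Money": money, "Time": time}

# (1) Microsoft purchased the Canadian company Maluuba.
s1 = annotate(buyer="Microsoft", goods="Maluuba")
# (2) AI company Maluuba was bought by Microsoft on January 9, 2017.
s2 = annotate(buyer="Microsoft", goods="Maluuba", time="January 9, 2017")
# (3) Microsoft purchased Maluuba for 30 million dollars.
s3 = annotate(buyer="Microsoft", goods="Maluuba", money="30 million dollars")

# Despite different surface forms, all three share the same core frame
# and the same Buyer/Goods role fillers.
assert s1["frame"] == s2["frame"] == s3["frame"]
print(s1["Buyer"], s1["Goods"])  # Microsoft Maluuba
```

The abstraction step a real semantic-role labeler performs is exactly the mapping from varying surface syntax into such a fixed role inventory.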
The concept of script was first firmly proposed by Schank and Abelson (1977) to explain how commonsense knowledge about daily human activities can be used in language processing. They defined scripts as structured knowledge representations capturing the relationships between prototypical event sequences and their participants in a given scenario. In short, a script is a sequence of prototypical events organized in temporal order. These events contain stereotypical human activities, such as eating in restaurants, cooking dinner, and making coffee.

To compare the script with the concepts mentioned before, we can see the script as a specific schema or frame that focuses on stereotypical events in a particular scenario. We can also treat the script as multiple semantic frames, because it models a sequence of events rather than an individual activity.

The incipient script theory was primarily introduced to construct story understanding and generation systems (Schank and Abelson, 1977; Cullingford, 1978; DeJong, 1979; Schank, 1990). These script systems employ mainly complex handcrafted rules for modeling relations between events and use a set of separate models to treat different scenarios. Because rules are designed by humans, rule-based systems are limited by the bounded experience of their designers, and are extremely time-consuming. Although they could be applied in certain fields, their rule-based methods were characteristically limited to a few specific domains and did not scale to complex ones.

Some potential end-to-end connectionist methods were subsequently proposed to model script knowledge, such as DYNASTY (DYNAmic STory understanding sYstem) (Lee G et al., 1992) and DISCERN (DIstributed SCript processing and Episodic memoRy Network) (Miikkulainen, 1992, 1993). These works have some abilities in script understanding and inferring, but more rules require more computing power and more application scenarios require more training data. Due to the weak computational power and insufficient data at that time, their abilities were limited and these works did not achieve good generalization (Mueller, 1998; Gordon, 2001).

Nevertheless, these works are of great value to the subsequent works. From the structure and operation process of DISCERN, we can even see the rudiments of modern script learning systems. The input of DISCERN is a short narrative text about a stereotypical event, and a "lexicon" module is used to map the input words into distributed representations. Then a "sentence parser" module processes each input sentence word by word to form the representation of the sentence. After that, a "story parser" module composes all the sentence representations to produce the representation of the story. Finally, the story representation is inputted into a "story generator" module and a "sentence generator" module to generate corresponding sentences about the story. The corresponding sentences can be used to solve story-related problems such as retrieving information, answering questions, and generating paraphrases. All the modules are trained separately, and subsequently trained jointly to fine-tune the whole system.

The idea of employing distributed representations and combining component representations to produce representations of the whole is similar to many modern deep learning based models. These deep learning based models widely employ distributed representations, and use embeddings to represent words, sentences, and events. They also create representations of a sentence by composing the words in the sentence, and create representations of an event by combining its components. In addition, the joint training and fine-tuning methods used by DISCERN are generally used by recent deep learning models. With improvements in computing power, the sophistication of model structures, and the surge in data volume, the processing and generalization abilities of these deep learning based models are far greater than those of DISCERN. We will discuss these models in detail in Sections 4.3–4.5.

4.2 Count-based models

Count-based script learning models greatly reduce the defects of rule-based models by automatically learning broad-domain script knowledge from a large text corpus. The key idea of count-based models is to measure the relationship of event pairs using statistical counting approaches.

Chambers and Jurafsky (2008) were the first to introduce a count-based approach for learning statistical script knowledge. They adopted PMI (Church and Hanks, 1990) to calculate the relationship scores over all the event pairs that occurred in the narrative chains.
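As a toy illustration of this idea, the sketch below estimates unordered co-occurrence counts from a tiny hand-made set of chains, computes PMI scores, and predicts the candidate event whose summed PMI with the context chain is highest. The corpus and the smoothing-free probability estimates are illustrative only:

```python
# Toy count-based model: estimate co-occurrence statistics from a tiny
# hand-made "corpus" of event chains, score event pairs with PMI, and
# predict the candidate maximizing the summed PMI with the context chain.
import math
from collections import Counter
from itertools import combinations

chains = [
    ["order", "eat", "pay", "leave"],
    ["order", "eat", "pay"],
    ["order", "swim"],
]

pair_counts = Counter()
event_counts = Counter()
for chain in chains:
    event_counts.update(chain)
    for a, b in combinations(chain, 2):          # unordered co-occurrence
        pair_counts[frozenset((a, b))] += 1

total_pairs = sum(pair_counts.values())
total_events = sum(event_counts.values())

def pmi(a, b):
    p_ab = pair_counts[frozenset((a, b))] / total_pairs
    if p_ab == 0:
        return float("-inf")                     # never co-occurred
    return math.log(p_ab / ((event_counts[a] / total_events)
                            * (event_counts[b] / total_events)))

def predict(context, candidates):
    """Choose the candidate c maximizing sum_i pmi(c_i, c) over the chain."""
    return max(candidates, key=lambda c: sum(pmi(e, c) for e in context))

print(predict(["order", "eat"], ["pay", "swim"]))  # pay
```

With realistic corpora, millions of extracted chains stand in for the three toy ones, and smoothing or count thresholds tame the sparse-pair problem that the `-inf` branch crudely represents here.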
As reviewed in Section 3.1, Chambers and Jurafsky (2008) used protagonists to represent events. They extracted the narrative chain from the corpus, which is a script-like event sequence that shares a common protagonist, and represented it as a series of

The numerator is defined by (taking c_1 as an example)

P(c_1, e) = C(c_1, e) / Σ_{c_i} Σ_{e_j} C(c_i, e_j),   (2)

where C(c_1, e) can be obtained by counting the co-occurrence times of the event pair (c_1, e) in the training data, regardless of the order.

The authors hypothesized that the most likely new event had the highest PMI score, so they performed the above calculations for every candidate event in the training corpus and then chose the one with the highest PMI score as the prediction. The result can be obtained by maximizing the following expression:

max_e Σ_{i=1}^{n} pmi(c_i, e),

where c_1, ..., c_n are the events in the chain and e is a candidate event.

collected by 0-, 1-, and 2-skip bigrams.

After obtaining the statistical data, the authors introduced two novel score functions to explicitly capture the temporal order information of event pairs. The first one is the ordered PMI (OP) score, a variation of the PMI, which explicitly takes into account the temporal order of event pairs in the chain. OP assumes that, in addition to the events occurring before the candidate event, the events occurring after the candidate event are still considered. So, given an insertion point, m, where the new event should be added, we can calculate the OP score as
Event chain: (Saw, subject), (Smiled, subject), (Kissed, object), (Blushed, subject)

Zero-skip bigrams (three pairs, each occurring adjacently):
(Saw, subject), (Smiled, subject); (Smiled, subject), (Kissed, object); (Kissed, object), (Blushed, subject)

One-skip bigrams (five pairs: the three zero-skip pairs plus two pairs occurring with one event intervening):
(Saw, subject), (Kissed, object); (Smiled, subject), (Blushed, subject)

Two-skip bigrams (six pairs: the five one-skip pairs plus one pair occurring with two events intervening):
(Saw, subject), (Blushed, subject)
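The N-skip bigram collection shown above can be sketched in a few lines; the pair format mirrors the (verb, dependency) events of the example chain:

```python
# Sketch of N-skip bigram collection over an event chain: ordered pairs of
# events at most N positions apart, matching the Fig. 4 example counts.
from itertools import combinations

def skip_bigrams(chain, n):
    """All ordered pairs (chain[i], chain[j]) with 0 < j - i <= n + 1."""
    return [(chain[i], chain[j])
            for i, j in combinations(range(len(chain)), 2)
            if j - i <= n + 1]

chain = [("saw", "subject"), ("smiled", "subject"),
         ("kissed", "object"), ("blushed", "subject")]
print(len(skip_bigrams(chain, 0)))  # 3 adjacent pairs
print(len(skip_bigrams(chain, 1)))  # 5 pairs (3 adjacent + 2 one-skip)
print(len(skip_bigrams(chain, 2)))  # 6 pairs (5 above + 1 two-skip)
```

Allowing skips trades some noise for many more training pairs, which is exactly the motivation for comparing 0-, 1-, and 2-skip statistics.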
Fig. 4 Event pairs collected by N-skip bigrams. 0-skip bigrams collect three event pairs that occur adjacently. 1-skip bigrams collect five event pairs, including the three 0-skip bigrams plus two 1-skip bigrams whose event pairs occur with one event intervening. 2-skip bigrams collect six event pairs, including the three 0-skip bigrams and two 1-skip bigrams mentioned above, plus a 2-skip bigram whose event pair occurs with two events intervening

follows:

OP(C, e) = Σ_{i=1}^{m} OP(c_i, e) + Σ_{i=m+1}^{n} OP(e, c_i).   (4)

plus the bigram probability function achieved the best empirical performance.

Another enhancement was made by Pichotta and Mooney (2014), who proposed the multi-argument event representation v(e_s, e_o, e_p) to replace the impoverished
Fig. 6 Process of incrementally predicting the next event. First, the verbal predicate is predicted using the context embedding. Then, the protagonist position is predicted using the context embedding and the predicted verbal predicate embedding. In this way, the other arguments are predicted one by one using the context events and the previous predictions

correlation and compatibility between two events, namely, how strongly they are expected to appear in the same chain. These approaches include traditional count-based methods, such as PMI (Chambers and Jurafsky, 2008) or skip bigram probabilities (Jans et al., 2012), and vector-based methods, such as the Siamese network (Granroth-Wilding and Clark, 2016) or vector distance.

In a harder situation, when measuring the relations of event pairs that have subtle differences in surface realizations, the event embeddings produced by widely used compositional neural networks do not work well. These networks usually combine event components by concatenating or adding their word embeddings, and then apply a parameterized function to map the summed vector into the event embedding space. However, due to the overlap of words and the additive nature of the parameters, it is hard to encode subtle differences in an event's surface realizations.

For example, the event pair "John threw bomb"

will be effective for multiple event-related tasks.

When resolving standard event prediction tasks (i.e., predicting which candidate event is most likely to occur given a context event sequence), pair-based models need to compute the correlation scores between a candidate event and a sequence of context events. Generally, they compute the scores between the candidate event and each individual event in the chain, and then treat the average value of these scores as the final correlation score between the candidate event and the whole event chain. Finally, the candidate with the highest score is chosen as the final prediction. We can see this process in Fig. 7.

Although the relationships between the candidate event and each event in the chain have been considered, pair-based models are still limited to event pairs. They ignore the temporal order of events, and thus lack the ability to capture order information and long-term context information. As a result, improbable predictions may be produced by them. For example, given event
Fig. 7 Process of resolving the event prediction task by pair-based methods. First, calculate the scores between the candidate event and each individual event in the chain. Then, calculate the average value of these scores as the final score between the candidate event and the whole event chain. Finally, choose the candidate with the highest score as the final prediction
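The averaging procedure of Fig. 7 amounts to a few lines of code. The sketch below is illustrative only; `pair_score` stands in for any pairwise measure, such as PMI, a skip-bigram probability, or a Siamese-network score, and the function names are our own.

```python
def predict_next(context, candidates, pair_score):
    """Pair-based prediction (Fig. 7): average the pairwise scores
    between a candidate and every context event, then return the
    candidate with the highest average."""
    def avg_score(cand):
        return sum(pair_score(e, cand) for e in context) / len(context)
    return max(candidates, key=avg_score)

# toy pairwise scores standing in for PMI / skip-bigram probabilities
toy = {("sit", "eat"): 0.6, ("order", "eat"): 0.9,
       ("sit", "swim"): 0.2, ("order", "swim"): 0.1}
score = lambda a, b: toy.get((a, b), 0.0)
assert predict_next(["sit", "order"], ["eat", "swim"], score) == "eat"
```

Note that `avg_score` treats the context as an unordered bag of events, which is exactly the limitation the text discusses next.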
Nevertheless, basic RNNs suffer from a vanishing and exploding gradient problem. During the learning period, the gradient signal tends to either approach zero or diverge as it propagates through long time steps. So, LSTM (Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU) (Chung et al., 2014) were devised to overcome this problem. They employ more complicated hidden units, which can provide encoded long-range propagation of events without losing long-term historical information.

A characteristic of script learning is that some events will be highly predictive of events far ahead in the chain, while some events are only locally predictive. Following this idea, Pichotta and Mooney (2016a) first adopted the LSTM model for the task of script learning. Their model represents an event with five components (i.e., the 5-tuple (v, e_s, e_o, e_p, p), involving a verb predicate, subject, direct object, prepositional relation, and preposition). At each time step, one event component is inputted into the LSTM model. After inputting the entire event chain, the model will output the prediction of an additional event, which is also a component at each time step. Their work produced more accurate results, but it required handcrafted features to represent events and linguistic preprocessing to extract these features. In addition, it cannot predict additional events other than a given candidate set.

Two notable improved studies were conducted to overcome these shortcomings. The first one is by Pichotta and Mooney (2016b), in which the raw texts describing the previous event chain were directly used to predict the texts describing the missing event. The sentence-level RNN encoder-decoder model of Kiros et al. (2015) was employed to produce text predictions. Their word-level model trained by raw text was compared with an identical event-level model trained by structured event representations. The experimental results of the evaluation showed that, on the task of event prediction, there was only a marginal difference between the word-level model and the event-level model.

The second improved work was done by Hu et al. (2017). They developed an end-to-end LSTM model called CH-LSTM. Similar to Pichotta and Mooney (2016b), they took the raw texts describing the previous event chain as the input, and directly generated texts describing the possible next event. CH-LSTM captures two-level structures of the event chain. At the first level, a single event is encoded using word embeddings as the input, and they are combined to produce an event embedding as the output. At the second level, the event chain is encoded using event embeddings as the input, and the last hidden vector is outputted as the embedding of the entire previous event chain. CH-LSTM uses the event chain embedding to make a non-targeted
Fig. 8 Process of chain-based models. Inputting each event embedding inside an event chain into a recurrent neural network (RNN) can provide embeddings with order information via an intermediate hidden state. The event-matching process based on these embeddings can capture long-term context information
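The order sensitivity that Fig. 8 attributes to chain-based models can be illustrated with a deliberately tiny scalar recurrence (a toy sketch; real chain-based models use vector-valued LSTM/GRU cells with learned weight matrices, and the weights below are arbitrary).

```python
import math

def rnn_encode(events, w_h=0.5, w_x=1.0):
    """Minimal scalar RNN cell applied over a sequence of 1-d "event
    embeddings": h_t = tanh(w_h * h_{t-1} + w_x * x_t).  The final
    hidden state summarizes the whole chain in order."""
    h = 0.0
    for x in events:
        h = math.tanh(w_h * h + w_x * x)
    return h

# the same three events in a different order yield a different encoding,
# unlike the order-blind averaging used by pair-based methods
a = rnn_encode([0.2, 0.9, -0.4])
b = rnn_encode([-0.4, 0.9, 0.2])
assert a != b
assert -1.0 < a < 1.0   # tanh keeps the state bounded
```

Because the hidden state is threaded through every step, information from early events can still influence the final encoding, which is what gives chain-based models their long-term context.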
(unknown) next event prediction, which can generate new events that do not even exist in the training set.

Although the aforementioned works showed that LSTM captured significantly more event chain order information and made substantial experimental improvements compared to the previous methods, it may still suffer from the over-fitting problem, because an event chain in a script has a flexible order. To solve this issue, Wang ZQ et al. (2017) proposed a novel dynamic memory network model to integrate the advantages of chain-based temporal order learning and pair-based coherence learning. Fig. 9 shows an overview of their model.

As we mentioned at the end of Section 4.4, pair-based models are more practical when modeling event chains with flexible order, so Wang ZQ et al. (2017) used pair-based methods to measure the relationship between pairs of an occurring context event and a subsequent candidate event. To inject the order information, they represented events as the hidden states of LSTM, and used them to calculate the relation scores of event pairs.

In addition, considering the fact that the significance of different events varies for inferring a subsequent event, it is not appropriate to give an equal weight to each event as in the previous methods. Therefore, a dynamic memory network (Weston et al., 2015) was used to automatically induce different weights for each event in the inferring task.

The work of Wang ZQ et al. (2017) inspires us to resolve the script-related task by combining the advantages of pair- and chain-based models. Following this inspiration, we can exploit the event segment, which is a concept between individual events and event chains, to make further improvements.

An event segment refers to a part of a script that involves several individual events in the same event chain. As Fig. 10 shows, given a script, for example, the script of restaurant visiting, there are various event segments within the chain, such as the continuous event segment <"X read menu" "X order food" "X eat food"> or the discontinuous event segment <"X read menu" "X order food" "X make payment">.

We can observe that both the individual events and the event segments are conducive to making more accurate subsequent event predictions. In particular, the individual event "X make payment" has a significant effect on predicting the subsequent event "X leave restaurant." Meanwhile, the event segment <"X read menu" "X order food" "X eat food"> has a strong semantic relation with the subsequent event "X leave restaurant." The event segment has a unique advantage; namely, it has richer semantic
Fig. 9 Integrating the advantages of chain-based temporal order learning and pair-based coherence learning. For chain-based methods, the event embeddings of the event chain are first inputted into an LSTM to encode order information. For pair-based methods, the hidden states of LSTM are used to measure the relation between a pair of context event and subsequent candidate event
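The event segments described above are simply ordered, possibly discontinuous, subsequences of an event chain, so enumerating all fixed-length segments is straightforward. This is an illustrative sketch of the concept only, not how SAM-Net actually extracts segments (it uses self-attention); the helper name `event_segments` is our own.

```python
from itertools import combinations

def event_segments(chain, length):
    """Enumerate event segments: ordered (possibly discontinuous)
    subsequences of the chain with the given length."""
    return [tuple(chain[i] for i in idx)
            for idx in combinations(range(len(chain)), length)]

chain = ["read menu", "order food", "eat food", "make payment"]
segs = event_segments(chain, 3)
assert ("read menu", "order food", "eat food") in segs       # continuous
assert ("read menu", "order food", "make payment") in segs   # discontinuous
assert len(segs) == 4   # C(4, 3) segments of length 3
```

The combinatorial growth of this enumeration (C(n, k) segments per length k) is one reason a learned extraction mechanism is preferable to exhaustive search for longer chains.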
Fig. 10 Event segments of restaurant visiting. An event segment refers to several individual events in the same event chain. There are various event segments within an event chain, including a continuous event segment and a discontinuous event segment

information than individual events and introduces less noise than event chains, whereas previous approaches did not fully use them.

Lv et al. (2019) designed the SAM-Net model to fill in this gap. SAM-Net uses an LSTM to capture the order information of the event chain, and puts its hidden states into a self-attention mechanism (Lin ZH et al., 2017) to extract diverse event segments. SAM-Net can combine the information of individual events and the information of event segments. The former refers to the relations between a subsequent event and an individual event, while the latter refers to the relations between a subsequent event and an event segment.

Similar to Wang ZQ et al. (2017), SAM-Net considers that different events or event segments may have different semantic relations with the subsequent event. So, two attention mechanisms (Luong et al., 2015) were used to assign different corresponding weights to each individual event and event segment. The prediction of the subsequent event is based on the combination of these two attention mechanisms.

The standard MCNC task showed that SAM-Net apparently achieves better results compared to the previous models, and the idea of exploiting an event segment to draw more accurate inferences is a potential direction. Nevertheless, there is still no effective method for extracting event segments. The self-attention mechanism used by SAM-Net can somewhat achieve this aim, but it is not accurate enough. Therefore, follow-up improvement research is anticipated.

4.6 Graph-based models

There is also a line of research that tries to express the relation between events and extract script knowledge by constructing a graph-based organization of events. As a general rule, the nodes of the graph refer to the events, and the edges refer to the relations between events; some edges may also have weights to measure the degree of the relationship or the transition probability between two events.

The graph structure is clearer and more intuitive from the human interpretability perspective to some extent. It is also closer to the organizational form of event development in reality. As in the real world, given any events, there are always many possible subsequent events, so there are also many different event chains occurring in the same scenario. When linking all of these events and event chains together according to their associations, a graph structure will be formed. Compared with pair- and chain-based models, graph-based models can express denser and broader connections among events, which contain richer script knowledge. In addition, once scripts are represented in a graph structure, a variety of existing graph-based algorithms can be used to solve script-related tasks.

The early work of Regneri et al. (2010) constructed the temporal script graph for specific scenarios from crowdsourced data. The data contains many different descriptions focusing on a single scenario. The scenario-specific paraphrase and temporal ordering information were extracted using a multiple sequence alignment (MSA) algorithm and connected to form a script graph.

Orr et al. (2014) employed the hidden Markov model (HMM) to construct a transition event graph. Their model clustered event descriptions into different types and learned an HMM over the sequence of event types. The structure and parameters were learned based on the expectation-maximization (EM) algorithm, and the script inferences were drawn according to the state transition probability.

Glavaš and Šnajder (2015) designed an end-to-end system with a three-stage pipeline, which can extract anchor, argument, and relation from natural language text, and link them to construct the event graph.

Here we discuss two more recent papers in detail. Zhao et al. (2017) modeled cause-effect relations between events into a graph structure. They proposed an abstract causality network on top of the specific events to reveal high-level general causality rules. General causality patterns can be helpful in predicting future events correctly and reducing the sparsity issue caused by the causalities between specific events. The abstract causality network can discover general causalities on top of the specific causalities. For instance, the specific causality "a massive 8.9 magnitude earthquake hit northeast Japan on Friday → A large number of houses collapsed" can be abstracted to a more concise and general one: Earthquake → Houses collapsed.

The authors extracted event pairs with a causality relation from text snippets (in this paper, news headlines), and represented them by a set of verbs and nouns (e.g., "Hudson killed people"). The specific causality network is constructed by regarding each specific event as a node, and the causality between specific event pairs as an edge. The abstract causality network is constructed on top of the specific causality network. Particularly, the nouns of specific events are generalized to their hypernyms in WordNet (Miller, 1995) (e.g., the hypernym of the noun "chips" is "dishes"), and the verbs of specific events are generalized to their high-level classes in VerbNet (Schuler, 2005) (e.g., the verb "kill" belongs to the class "murder-42.1"). The nodes of the abstract causality network are created by replacing the original ones with their generalization results, and the edges are also generated according to the specific ones. For example, if there is an edge between the specific event "Hudson killed people" and the specific event "Hudson was sent to prison," then there will also be an edge between their corresponding abstract events, namely, the abstract event "murder-42.1, people" and the abstract event "send, prison". After constructing the abstract causality network, the event prediction task can be formulated as a link prediction task on the net. Given the cause (effect) event, the model will produce a list of corresponding effect (cause) events ranked by possibility.

Another notable graph-based model was introduced by Li ZY et al. (2018). They constructed a narrative event evolutionary graph (NEEG) to express connections among events, and a scaled graph neural network (SGNN) to model event interactions and learn event representations. As shown on the left of Fig. 11, NEEG uses the

Fig. 11 Brief framework of NEEG and SGNN. NEEG regards

4.7 External knowledge models

At the end of this section, we will discuss some recent works that attempt to enhance traditional script learning models by injecting external knowledge. Event2Mind (Rashkin et al., 2018) and ATOMIC (Sap et al., 2019), as the training dataset. In addition to an event description, there are multiple additional inference dimensions, produced by crowdsourcing, which are appended to each event in these two corpora. These inference dimensions contain
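The generalization step of Zhao et al. (2017) can be sketched with toy lookup tables standing in for WordNet hypernyms and VerbNet classes (illustrative only; the real system queries those lexicons, and the helper names here are our own).

```python
# Toy generalization maps standing in for WordNet hypernyms and
# VerbNet classes (illustrative; real systems query the lexicons).
HYPERNYM = {"chips": "dishes", "people": "people", "prison": "prison"}
VERB_CLASS = {"kill": "murder-42.1", "send": "send"}

def abstract_event(verb, noun):
    """Map a specific (verb, noun) event to its abstract form."""
    return (VERB_CLASS.get(verb, verb), HYPERNYM.get(noun, noun))

def abstract_edges(specific_edges):
    """Lift causality edges between specific events to edges
    between their abstract counterparts."""
    return {(abstract_event(*cause), abstract_event(*effect))
            for cause, effect in specific_edges}

# "Hudson killed people" -> "Hudson was sent to prison"
edges = {(("kill", "people"), ("send", "prison"))}
assert abstract_edges(edges) == {(("murder-42.1", "people"),
                                  ("send", "prison"))}
```

Because many specific events collapse onto one abstract node, the abstract network is denser than the specific one, which is exactly how it mitigates the sparsity of specific causalities.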
Fig. 12 Multi-relational script learning. Relation triplets (e_h, r, e_t) are extracted from the corpus. A triplet consists of two related events (e_h, e_t) and the relation type between them (r). All of e_h, e_t, and r are represented by embeddings, which can be jointly learned by the model

between event embeddings (|e_h − e_t|) can directly reflect their relations (r). The relatedness score between events is formulated as

f_trans(t) = f_TransE(e_h, e_t, r) = ||e_h − e_t + r||_p,    (8)

where e_h, e_t, r ∈ R^{d_r} are the embeddings from the event composition network. This formulation is a dissimilarity measure, with a lower score meaning a stronger relatedness.

Considering that relation and event are two completely different categories, it may not be appropriate to embed them into the same space. TransE has limitations in dealing with reflexive, 1-to-N, N-to-1, or N-to-N relations (Wang Z et al., 2014), so the authors also adapted TransR to encode relations. TransR separates the embedding space of relations (r ∈ R^{d_r}) and the embedding space of events (e_h, e_t ∈ R^{d_e}). When performing operations, event embeddings can be mapped into the relation embedding space by multiplying a relation-specific parameter matrix (M_r ∈ R^{d_e×d_r}). The relatedness score between events is formulated as

f_trans(t) = f_TransR(e_h, e_t, r) = ||e_h M_r − e_t M_r + r||_p.    (9)

The model of Lee IT and Goldwasser (2019) was evaluated using several tasks, including three cloze tasks (MCNC, MCNS, MCNE), three relation-specific tasks, and a related downstream task. The results showed that the learned embedding could capture relation-specific information and improve performance for the downstream task.

5 Evaluation approaches

Essentially, the aim of the script learning system is to encode script knowledge and use it to draw reasonable inferences, so in the final process of script learning, some evaluation approaches are needed to test whether the model can effectively encode the script knowledge and exploit it.

In this section, we will introduce some representative evaluation approaches with specific tasks and corresponding metrics. The evaluation approaches can be employed to measure the capabilities of the script learning model by experimentally testing its performance on the task.

5.1 Narrative cloze test

The cloze test (Taylor, 1953) is a classic way to evaluate the language understanding ability of humans, by deleting a random word from a sentence and having a human attempt to fill in the blank. Chambers and Jurafsky (2008) extended this approach to script learning and proposed the narrative cloze (NC) test. The NC test evaluates the inference ability of script learning models by holding out a single event from the event chain and letting the model predict the missing event given the remaining events. For example, given an event chain with a missing event: (walk, subject) → (sit, subject) → (?) → (serve, object) → (eat, subject) → (leave, subject), the model is asked to predict the missing event (?) using the information of the remaining events. Particularly, it needs to compute the probability of each event in its entire event vocabulary and return a prediction list of likely events ranked by probability. The smaller the rank number of the correct event is, the more accurate the prediction will be. Chambers and Jurafsky (2008) used the average rank of all correct events as an evaluation metric. Obviously, a better performing model will have a lower average rank.

However, this metric is partially irrational; that is, it tends to punish the model with a long event vocabulary. If a large model with a long vocabulary and a small model with a short vocabulary both predict the correct event at the bottom of the list, then the large model will be more penalized than the small one, because the guess list of the large model is much longer.
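The average-rank metric just described, and the top-N recall refinement discussed next, can be sketched as follows (an illustrative sketch with 1-based ranks; the function names are our own):

```python
def average_rank(ranked_lists, gold_events):
    """NC metric: average 1-based rank of the gold event in each
    model prediction list (lower is better)."""
    return sum(preds.index(g) + 1
               for preds, g in zip(ranked_lists, gold_events)) / len(gold_events)

def recall_at_n(ranked_lists, gold_events, n):
    """Fraction of test cases whose gold event appears among the
    top-N predictions (higher is better)."""
    hits = sum(g in preds[:n] for preds, g in zip(ranked_lists, gold_events))
    return hits / len(gold_events)

preds = [["said", "eat", "leave"], ["eat", "said", "leave"]]
gold = ["eat", "eat"]
assert average_rank(preds, gold) == 1.5      # ranks 2 and 1
assert recall_at_n(preds, gold, 1) == 0.5    # only the second list hits
assert recall_at_n(preds, gold, 2) == 1.0
```

The sketch makes the vocabulary-size bias visible: `average_rank` grows with the length of the prediction list, whereas the top-N recall only looks at a fixed-size prefix.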
Therefore, Jans et al. (2012) suggested Recall@N as a more reliable metric. It evaluates the percentage of missing events that fall under the top-N predictions of the model. Specifically, if the rank number of the correct event is smaller than N, meaning that the first N events in the prediction list contain the correct one, then one score will be given. Recall@N is the average score of all predictions, namely, the percentage of prediction lists with correct events in their first N events. Contrary to the average rank metric, a better performing model will have a higher Recall@N.

Despite some relief, Recall@N still penalizes semantically reasonable predictions and solely awards events with exactly the same surface form. Therefore, Pichotta and Mooney (2014) introduced accuracy as a more robust metric, because it does not regard the event as an unbreakable atomic unit, and takes into account all of its components. Specifically, one point will be given if the model's top guess includes each component of the correct event. Then the score will be divided by the total number of possible points to yield a value between 0 and 1. More simply put, accuracy refers to the percentage of event components that are correctly predicted.

The NC test constructed the basic form of mainstream script learning evaluation methods, and it has been adopted by various subsequent works (Jans et al., 2012; Pichotta and Mooney, 2014, 2016a, 2016b; Rudinger et al., 2015). Despite the popularity, it still has two issues:

First, for any given event chain, there may be multiple subsequent choices that are possible and equally correct, but only one specific right answer. The NC test will penalize wrong answers equally, even if some of them are semantically plausible, which is somewhat arbitrary.

Second, it requires searching the entire event vocabulary; however, when considering complex event structures, such as multi-argument events, the vocabulary becomes extremely large, which leads to a burden of large computation.

5.2 Adversarial narrative cloze task

Modi (2016) attempted to address these issues by proposing the adversarial narrative cloze (ANC) task. This research pointed out that it is more realistic to evaluate script learning models that give credit for predicting semantically plausible alternatives as well.

In the ANC task, there are two event chains: one is the correct chain, and the other is the same but with one event replaced by a random event. The models need to discriminate between real and corrupted event chains. Its form is as follows:

Correct chain: (walk, subject) → (sit, subject) → (eat, subject) → (leave, subject).
Incorrect chain: (walk, subject) → (sit, subject) → (eat, subject) → (swim, subject).

5.3 Multiple choice narrative cloze task

Granroth-Wilding and Clark (2016) also refined the NC test and proposed the multiple choice narrative cloze (MCNC) task, in which a multi-choice prediction rather than a ranking list is used. Particularly, a series of contextual events are given, and the model should choose the most likely next event from a set of optional candidates. There is only one correct subsequent event in the candidate set, while the other events are randomly sampled from the corpus with their protagonist replaced by the protagonist of the current chain. Its form is as follows:

Event chain: walk(X, restaurant) → sit(X) → order(X, food) → __.
Optional choices:
C1: play(X, tennis).
C2: climb(X, mountain).
C3: ride(X, bicycle).
C4: eat(X, food).
C5: have(X, shower).

MCNC is a better fit for evaluating multi-argument events, because it can evaluate models' inference abilities without searching the entire event vocabulary. The way to calculate the accuracy of the MCNC task is also simpler and more intuitive, i.e., directly calculating the ratio of correct predictions to the total number of tests. In addition, in principle, the MCNC task can be completed by humans. Hence, task results from humans would be a valuable baseline for comparison with the model results.

5.4 Multiple choice narrative sequences task

The MCNC task has been substituted for the NC task by most subsequent related works (Wang ZQ et al., 2017; Li ZY et al., 2018; Weber et al., 2018; Lee IT and Goldwasser, 2019; Lv et al., 2019). However, it does not capture the flow of the entire narrative chain. Therefore, it is hard to draw an inference over long event sequences. Lee IT and Goldwasser (2018) proposed two novel multiple-choice variants generalized from the MCNC task, to evaluate a model's ability to draw long sequence inferences instead of only predicting one event.

The first one is multiple choice narrative sequences (MCNS), which samples options for all the events on the chain, except for the first event used as the starting point. Its form is as follows: walk(X, restaurant) → __ → __ → __ → __. The model needs to make a multiple choice for every blank.

5.5 Multiple choice narrative explanation task

Another variant is multiple choice narrative explanation (MCNE), which gives both the first event and the final event and predicts all the intermediate events. Its form is as follows: walk(X, restaurant) → __ → __ → __ → leave(X, restaurant).

Note that regardless of the specific evaluation approach, the basic goal is to evaluate the ability of the script learning model to encode script knowledge and draw reasonable inferences. Therefore, the models with excellent evaluation performance should have learned rich commonsense knowledge and understood deep semantic meaning. Nevertheless, some researchers (Pichotta and Mooney, 2014; Mostafazadeh et al., 2016; Chambers, 2017) found an issue: some models are achieving high scores by optimizing performance on the task itself, instead of the learning process. For example, in the NC task, directly giving the prediction of the ranked list in accordance with the event's corpus frequency (e.g., always predicting common events "X said" or "X went") was shown to be an extremely strong baseline.

5.6 Story cloze test

To alleviate the above issues, Mostafazadeh et al. (2016) introduced the story cloze test (SCT), an open task, to help models make a more comprehensive and deeper evaluation. They collected a corpus full of commonsense short stories written in five sentences. When performing an evaluation task, four sentences are given as context, and models need to choose one correct ending of the story from two alternative sentences. For example:

"Karen was assigned a roommate in her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating."

Given the above context, the model needs to choose between two alternative sentences, "Karen became good friends with her roommate" (right ending) and "Karen hated her roommate" (wrong ending).

Although the purposes of SCT and NC are related, there are some slight differences between the story on which SCT focuses and the script on which NC focuses. To some extent, the story is a kind of complex script that consists of more objects and multiple relations. SCT also calls for stronger inferential capability and understanding of the deeper-level story semantics, rather than shallow literal or statistical information, because the model is asked to predict an entire sentence instead of an individual event. The experimental results also proved this. Mostafazadeh et al. (2016) chose some of the state-of-the-art script learning methods at that time for evaluation, but the experimental results showed that they all struggled to achieve a high score on SCT, and most of them were only slightly better than random or constant selection. This indicated that they had only learned shallow language knowledge instead of deeper understanding.

Recently, there have been a variety of works (Li Q et al., 2018; Zhou et al., 2019; Li ZY et al., 2019) attempting to resolve SCT, and some of them achieved high scores. For example, the three-stage transferable BERT training framework proposed by Li ZY et al. (2019) achieved rather good results with accuracy greater than 90%.

6 Conclusions and future directions

In this section, we conclude the main contents of this survey and briefly analyze the current script learning situation. In addition, we discuss some possible directions for future research, because we believe script learning research still has great potential, and there are many novel directions that need to be explored in the future.
6.1 Conclusions

This survey tried to provide a comprehensive review and to introduce some representative script learning works. We first briefly introduced some basic script learning knowledge, including its definition, aim, significance, process, focus, and development timeline. Then we discussed in detail three main research topics: event representations, script learning models, and evaluation approaches. For each one, we introduced some typical and representative works, compared their advantages and disadvantages, and summarized their development process. In the end, we discussed some possible directions for future research.

In summary, script learning is a relatively small but growing research direction with potential. It aims at encoding the commonsense knowledge of real life to help machines understand natural language and draw commonsensible inferences. Script learning systems have a long history and a profound psychological foundation, and are also crucial components in constructing AI systems that are designed to achieve human-level performance.

Despite its significance, script learning still has some drawbacks and limitations, such as the lack of a standard and high-quality corpus, the lack of systematic evaluation frameworks, and the lack of effective approaches for extracting and encoding the most important information of events. More broadly, it lacks a deeper understanding of the mechanisms of human cognition and comprehension of commonsense knowledge. This knowledge is considered self-evident: on one hand, it seems that everyone naturally knows it, and therefore there is no need to explain it; on the other hand, it seems that we are unable to present appropriate and effective proof of it at present. Hence, there is still much work to be done in the future.

For now, event chain based plus deep learning based models are the most popular script learning approaches. The trend in this direction is to further improve them by introducing more advanced mechanisms and more complex structures. Most of these works extracted events from the New York Times (NYK) portion of the Gigaword corpus as a dataset and evaluated their models by the MCNC test.

Furthermore, very recently, some other notable works have attempted to enhance traditional script learning models by constructing event graphs or injecting external knowledge. Some new evaluation frameworks, such as SCT, MCNS, and MCNG, and several novel corpora, such as InScript, Event2Mind, and ATOMIC, were also proposed. These innovative works are very enlightening and promising, and some of them have achieved quite good performance. We look forward to more revolutionary improvements in the near future.

Finally, we list some representative script learning works, as well as their datasets, event representations, models, and evaluation approaches, in Table 1. Because these works have quite different experimental setups, a fair comparison among them is hardly feasible. Consequently, we do not list the specific experimental results and detailed processes, but only the main methods they used.

[Table 1 A comparison of representative script learning works. Each row gives the reference, dataset, event representation, model, and evaluation approach of one system, covering the works from Chambers and Jurafsky (2008) to Li ZY et al. (2019). Abbreviations: NYK: New York Times corpus; NYKG: NYK portion of the Gigaword corpus; PR: protagonist representation; DCR: discrete representation; DTR: distributed representation; MAR: multi-argument representation; RTR: raw text representation; FGPR: fine-grained property representation; CB: count-based; PB: pair-based; GB: graph-based; DLB: deep learning based; EK: external knowledge; NC: narrative cloze; ANC: adversarial narrative cloze; MCNC: multiple choice narrative cloze; SCT: story cloze test; MCNS: multiple choice narrative sequences; MCNE: multiple choice narrative explanation]

6.2 Future directions

6.2.1 Establishing a standard corpus and evaluation system

There is no standard corpus or evaluation system for script learning. The majority of related works use NLP tools to automatically extract events from the news articles in the New York Times (NYK) portion of the Gigaword corpus, and the NC or MCNC test is used to evaluate the results. However, this method has the following defects: the types of chosen raw texts can vary widely, the quality of the extraction results relies heavily on the NLP tools, and the content of the texts may lack detailed information about quite common and mundane events of normal life. In addition, Mostafazadeh et al. (2016) and Chambers (2017) found that automatically generated event chains used for evaluation may not test relevant script knowledge: some script learning models that achieve high test scores tend to capture frequent event patterns instead of script knowledge. If there is no effective evaluation system, we cannot accurately judge whether a model really works. Therefore, the establishment of a standardized corpus and evaluation system is particularly important for the development of script learning.
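To make this pitfall concrete, the following minimal sketch (the event names and chains are invented) scores MCNC candidates by simple event co-occurrence counts, in the spirit of the count-based (CB) models surveyed earlier. A scorer like this can pick the right answer for frequent patterns while encoding no real script knowledge.

```python
from collections import Counter
from itertools import combinations

# Toy corpus of extracted event chains (verb-argument events flattened to strings).
chains = [
    ["enter_restaurant", "order_food", "eat_food", "pay_bill", "leave_restaurant"],
    ["enter_restaurant", "order_food", "eat_food", "pay_bill", "leave_restaurant"],
    ["board_plane", "take_seat", "fasten_belt", "land_plane"],
]

# Count co-occurrences of ordered event pairs within each chain.
pair_counts = Counter()
for chain in chains:
    for a, b in combinations(chain, 2):
        pair_counts[(a, b)] += 1

def score(context, candidate):
    """Sum of co-occurrence counts between the context events and the candidate."""
    return sum(pair_counts[(e, candidate)] for e in context)

def mcnc_choose(context, candidates):
    """MCNC: pick the candidate next event with the highest score."""
    return max(candidates, key=lambda c: score(context, c))

context = ["enter_restaurant", "order_food", "eat_food"]
candidates = ["pay_bill", "fasten_belt", "board_plane"]
print(mcnc_choose(context, candidates))  # the frequent in-script event wins
```

The sketch succeeds here purely because "pay_bill" co-occurs often with the context events, which is exactly why high MCNC scores alone do not prove that a model has captured script knowledge.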
There are a few other works (Regneri et al., 2010; Modi et al., 2017; Ding et al., 2019b; Lee IT and Goldwasser, 2019) that attempt to solve this issue using a crowdsourced corpus, such as OMICS (Gupta and Kochenderfer, 2004), PDTB (Prasad et al., 2008), InScript (Modi et al., 2016), Event2Mind (Rashkin et al., 2018), or ATOMIC (Sap et al., 2019). These crowdsourced corpora have advantages such as focusing on specific prototypical scenarios, containing rich script knowledge, having structured representations, and having clearly annotated information. Specifically, the InScript corpus (Modi et al., 2016) was designed for script learning: all of the relevant verbs and noun phrases are annotated with event types and participant types, and it contains many subtle actions describing a particular scenario. These advantages can help models effectively process data and learn script knowledge, as well as avoid many of the problems caused by unstructured texts (Chambers, 2017).

However, this is far from a trivial issue. On one hand, the crowdsourced corpora require domain expertise, professional knowledge, and plenty of manual annotation and quality control. Thus, they are small in size but prohibitive in construction cost. On the other hand, automatically understanding unstructured natural texts is a vital goal and a necessary capability of advanced AI systems. Therefore, collecting a standard corpus that has both quantity and quality and establishing an effective evaluation system are important directions for future research.

6.2.2 Constructing the event graph

As we discussed in Section 4.6, the graph structure is clearer and more intuitive from the human interpretability perspective, and to some extent, it is also closer to the organizational form of event development in reality. It can express denser and broader connections among events than the event pair or the event chain. In addition, once scripts are represented in a graph structure, a variety of off-the-shelf graph-based algorithms can be used to solve event-related tasks. However, only a few models represent a script as a graph structure and use graph-based approaches to solve event-related problems, so this is a research direction that deserves more attention in the future.

For example, the event logic graph (ELG) proposed by Ding et al. (2019a) is a potential event-related knowledge base that aims to discover the evolutionary patterns and logic of events. ELG integrates event-related knowledge into conventional knowledge graph techniques, which have been progressing fast recently, and organizes event evolutionary patterns into a commonsense knowledge base. The experimental results also show that ELG is effective for script-related tasks.
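To make the graph idea concrete, here is a minimal sketch (with invented events) of an event evolution graph in the spirit of the narrative event evolutionary graph of Li ZY et al. (2018): nodes are events, directed edges carry transition counts estimated from event chains, and the next event is predicted by the strongest outgoing edge.

```python
from collections import defaultdict

# Toy event chains; each adjacent pair contributes one directed edge observation.
chains = [
    ["order_food", "eat_food", "pay_bill"],
    ["order_food", "eat_food", "leave_tip"],
    ["order_food", "eat_food", "pay_bill"],
]

graph = defaultdict(lambda: defaultdict(int))  # graph[u][v] = transition count
for chain in chains:
    for u, v in zip(chain, chain[1:]):
        graph[u][v] += 1

def predict_next(event):
    """Follow the most probable outgoing edge; counts normalize to P(v | u)."""
    successors = graph[event]
    if not successors:
        return None
    total = sum(successors.values())
    return max(successors, key=lambda v: successors[v] / total)

print(predict_next("eat_food"))  # "pay_bill" (2 of 3 observed transitions)
```

A real system would of course use learned event embeddings and a graph neural network rather than raw counts, but the data structure and the prediction-as-graph-walk view are the same.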
6.2.3 Using pre-trained language models

Recent works have shown that the two-stage framework consisting of a pre-trained language model and a fine-tuning operation can bring great improvements to many NLP tasks. Specifically, a language model, such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), or GPT (Radford et al., 2019), can be pre-trained on large-scale unsupervised corpora and then fine-tuned on target tasks, such as natural language inference (NLI) or reading comprehension. The pre-trained language models can learn universal language representations, which can lead to better generalization performance and speed up convergence for downstream NLP tasks. The pre-trained language models can also avoid the heavy work and complex process of training a model from scratch (Qiu et al., 2020).

It is natural to think that script learning tasks may also benefit from applying this framework. In addition, these novel pre-trained models can replace the embedding learning systems that have been widely used by script learning models for a long time, such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or DeepWalk (Perozzi et al., 2014). This direction is of great research value, but there are only a few existing works. Here we briefly introduce two representative works.

Li ZY et al. (2019) designed a transferable BERT (TransBERT) training framework to solve an SCT task. In addition to a pre-trained BERT model, they introduced three semantically related supervised tasks (including NLI, sentiment classification, and next action prediction) to further pre-train the BERT model. Zheng et al. (2020) proposed a unified fine-tuning architecture (UniFA), which is a scalable and stackable framework consisting of four key components, i.e., BERT-base-uncased, raw-text, intra-event, and inter-event components, connected by a multi-step fine-tuning method. With the merits of cascaded training, multi-step fine-tuning can contribute to minimizing loss and also avoid being trapped in local optima. They also employed a scenario-level variational auto-encoder (S-VAE) to implicitly represent scenario-level knowledge. In particular, S-VAE regards the event chain as a dialog-generation process and further introduces a stochastic latent variable to guide the event representation.
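The cascaded (multi-step) fine-tuning idea behind TransBERT and UniFA can be sketched schematically. This is not BERT: the "model" below is a toy linear scorer and the two stage datasets are invented; the point is only the control flow, in which each stage starts from the weights produced by the previous stage instead of from scratch.

```python
# Schematic of cascaded fine-tuning: one shared parameter vector is trained on a
# sequence of tasks, each stage initialized with the previous stage's weights.

def train_stage(weights, data, lr=0.5, epochs=20):
    """One fine-tuning stage: perceptron-style updates on (features, +1/-1) pairs."""
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            if margin <= 1:  # misclassified or within the margin: update
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
    return weights

# Stage 1: a semantically related supervised task; Stage 2: the target (SCT-like) task.
intermediate_task = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
target_task       = [([1.0, 0.2], 1), ([0.1, 1.0], -1)]

weights = [0.0, 0.0]  # stand-in for "pre-trained" initialization
for stage_data in (intermediate_task, target_task):  # the cascade
    weights = train_stage(weights, stage_data)

predict = lambda x: 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1
print(predict([1.0, 0.1]))  # → 1
```

In the real frameworks each stage fine-tunes all transformer parameters on a full supervised corpus, but the staging logic (and the reason intermediate tasks help: they move the shared weights toward the target task before target data are seen) is the same.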
6.2.4 Learning script by reinforcement learning

The script learning system is somewhat naturally similar to the reinforcement learning framework. The process of generating correct events one by one through an event chain is similar to the situation wherein an agent traverses through the space of events (Kaelbling et al., 1996; Li JW et al., 2016; Sutton and Barto, 2018).

In the reinforcement learning setup, an agent searches over the possible states of the environment to achieve the highest rewards. An action of the agent refers to changing the state, and a corresponding reward is given to the agent according to whether the action is good or bad. The policy of the agent refers to a sequence of actions taken by it, and the optimal policy is the one that achieves the highest rewards.

In the script learning setup, each discrete state can represent an event. An action of the agent may refer to transferring to another event, based on the current and previous states. If the agent transfers to a correct state (i.e., predicting the correct next event), then it obtains a reward. The optimal policy of the agent may represent the prediction of the most likely event chain.
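Under this mapping (states are events, an action moves to another event, and a reward is given for predicting the correct next event), a tabular Q-learning sketch over a single toy chain might look as follows; the events, reward scheme, and hyperparameters are invented for illustration.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy environment: states are events; the "correct" action follows the chain.
chain = ["enter", "order", "eat", "pay", "leave"]
actions = list(set(chain))        # an action = the event the agent moves to

Q = defaultdict(float)            # Q[(state, action)] -> estimated reward
alpha, epsilon = 0.5, 0.2

for _ in range(500):              # episodes of traversing the chain
    for i, state in enumerate(chain[:-1]):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        reward = 1.0 if action == chain[i + 1] else 0.0  # correct next event?
        Q[(state, action)] += alpha * (reward - Q[(state, action)])

policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in chain[:-1]}
print(policy)  # the greedy policy recovers the chain order
```

Real script chains are not deterministic, so the reward would come from prediction accuracy on held-out data and the state would summarize the traversed context, but the state-action-reward skeleton is as above.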
Reinforcement learning can help with many script learning tasks that are hard to solve now, for example, the event segment extraction task. As we mentioned in Section 4.5, Lv et al. (2019) used a self-attention mechanism to solve this task, but this approach is not accurate enough. We may design a reinforcement learning framework in which an agent traverses the event chain to extract the most valuable event segments. The state refers to whether the corresponding event is part of the extracted event segment, and the alternative actions refer to changing the state and moving to the next event, or keeping the state and moving to the next event. After traversing the whole event chain, the final extracted event segment will be used to predict the next event, and a corresponding reward will be given based on the accuracy of the prediction.

6.2.5 Injecting external fine-grained knowledge

As we discussed in Section 4.7, some recent works attempt to enhance traditional script learning models by injecting external knowledge. They chose fine-grained event properties, such as the intent, sentiment, and animacy of event participants, or nuanced relations between events, such as Reason, Temporal, and Contrast.

These works indicate that fine-grained event properties and relevant contextual information, such as arguments, sentiment, animacy, time, or location, can potentially enhance script learning models, so other script learning models may also be improved by injecting external fine-grained knowledge.

For this knowledge, however, more is not necessarily better. The experimental results of Lee IT and Goldwasser (2018) showed that although each property individually improves the performance, including too many properties might hurt it. Therefore, the questions of which and how many properties are appropriate to choose need further study.

6.2.6 Building an interpretable system

Although deep learning technology has developed rapidly and reached impressive performance in many fields recently, its deep and complex nonlinear architecture makes the decision-making process highly non-transparent.

Deep learning models are known to be black boxes, so it is hard to explain the reasons behind their decisions. As for deep learning based script learning systems, the lack of interpretability exists mainly in two aspects: first, the output of the script learning system (i.e., the event prediction results) is hardly explainable to system users; second, the operation mechanism of how the machine makes a decision (i.e., the event encoding and matching algorithm) can hardly be investigated by system designers. Without letting users know why specific results are provided, the system may be less effective in persuading users to accept the results, which may further decrease the system's trustworthiness. Without letting designers know the mechanism behind the algorithm, it will be difficult for them to improve the system effectively.

In a broader sense, explainable artificial intelligence (XAI) (Arrieta et al., 2020) has recently sparked wider attention from the general AI community.
Many of the recent advances in machine learning have been trying to open the black box. For instance, Koh and Liang (2017) introduced a framework to analyze deep learning models based on influence analyses; Pei et al. (2017) proposed a white-box testing mechanism to explore the nature of deep learning models. The interpretability-related works will help us understand the exact meaning of each feature in a neural network and how these features interact to produce the final outputs. However, the research on interpretable script learning systems is still in the initial stage, and there is much to be further studied.

Contributors
Yi HAN drafted the manuscript. Linbo QIAO, Jianming ZHENG, and Hefeng WU helped organize the manuscript. Dongsheng LI and Xiangke LIAO led the preparation of the manuscript. Yi HAN and Linbo QIAO revised and finalized the paper.

Compliance with ethics guidelines
Yi HAN, Linbo QIAO, Jianming ZHENG, Hefeng WU, Dongsheng LI, and Xiangke LIAO declare that they have no conflict of interest.

References
Arrieta AB, Díaz-Rodríguez N, Del Ser J, et al., 2020. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform Fus, 58:82-115. https://doi.org/10.1016/j.inffus.2019.12.012
Balasubramanian N, Soderland S, Mausam, et al., 2013. Generating coherent event schemas at scale. Proc Conf on Empirical Methods in Natural Language Processing, p.1721-1731.
Baroni M, Zamparelli R, 2010. Nouns are vectors, adjectives are matrices: representing adjective-noun constructions in semantic space. Proc Conf on Empirical Methods in Natural Language Processing, p.1183-1193. https://doi.org/10.5555/1870658.1870773
Bengio Y, Ducharme R, Vincent P, et al., 2003. A neural probabilistic language model. J Mach Learn Res, 3:1137-1155. https://doi.org/10.5555/944919.944966
Bordes A, Usunier N, Garcia-Durán A, et al., 2013. Translating embeddings for modeling multi-relational data. Proc 26th Int Conf on Neural Information Processing Systems, p.2787-2795. https://doi.org/10.5555/2999792.2999923
Bower GH, Black JB, Turner TJ, 1979. Scripts in memory for text. Cogn Psychol, 11(2):177-220. https://doi.org/10.1016/0010-0285(79)90009-4
Chambers N, 2017. Behind the scenes of an evolving event cloze test. Proc 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics, p.41-45. https://doi.org/10.18653/v1/w17-0905
Chambers N, Jurafsky D, 2008. Unsupervised learning of narrative event chains. Proc 46th Annual Meeting of the Association for Computational Linguistics, p.789-797.
Chambers N, Jurafsky D, 2009. Unsupervised learning of narrative schemas and their participants. Proc Joint Conf of the 47th Annual Meeting of the ACL and the 4th Int Joint Conf on Natural Language Processing of the AFNLP, p.602-610. https://doi.org/10.5555/1690219.1690231
Chung J, Gulcehre C, Cho K, et al., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. https://arxiv.org/abs/1412.3555
Church KW, Hanks P, 1990. Word association norms, mutual information, and lexicography. Comput Ling, 16(1):22-29. https://doi.org/10.5555/89086.89095
Cullingford RE, 1978. Script Application: Computer Understanding of Newspaper Stories. PhD Thesis, Yale University, New Haven, CT, USA.
DeJong GF, 1979. Skimming Stories in Real Time: an Experiment in Integrated Understanding. PhD Thesis, Yale University, New Haven, CT, USA.
Devlin J, Chang MW, Lee K, et al., 2019. BERT: pre-training of deep bidirectional transformers for language understanding. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.4171-4186. https://doi.org/10.18653/v1/n19-1423
Ding X, Li ZY, Liu T, et al., 2019a. ELG: an event logic graph. https://arxiv.org/abs/1907.08015
Ding X, Liao K, Liu T, et al., 2019b. Event representation learning enhanced with external commonsense knowledge. Proc Conf on Empirical Methods in Natural Language Processing and the 9th Int Joint Conf on Natural Language Processing, p.4896-4905. https://doi.org/10.18653/v1/D19-1495
Erk K, Padó S, 2008. A structured vector space model for word meaning in context. Proc Conf on Empirical Methods in Natural Language Processing, p.897-906. https://doi.org/10.5555/1613715.1613831
Fillmore CJ, 1976. Frame semantics and the nature of language. Ann NY Acad Sci, 280(1):20-32. https://doi.org/10.1111/j.1749-6632.1976.tb25467.x
Glavaš G, Šnajder J, 2015. Construction and evaluation of event graphs. Nat Lang Eng, 21(4):607-652. https://doi.org/10.1017/S1351324914000060
Gordon AS, 2001. Browsing image collections with representations of common-sense activities. J Am Soc Inform Sci Technol, 52(11):925-929. https://doi.org/10.1002/asi.1143
Granroth-Wilding M, Clark S, 2016. What happens next? Event prediction using a compositional neural network model. Proc 30th AAAI Conf on Artificial Intelligence, p.2727-2733. https://doi.org/10.5555/3016100.3016283
Gupta R, Kochenderfer MJ, 2004. Common sense data acquisition for indoor mobile robots. Proc 19th National Conf on Artificial Intelligence, p.605-610. https://doi.org/10.5555/1597148.1597246
Harris ZS, 1954. Distributional structure. Word, 10(2-3):146-162.
Hochreiter S, Schmidhuber J, 1997. Long short-term memory. Neur Comput, 9(8):1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hu LM, Li JZ, Nie LQ, et al., 2017. What happens next? Future subevent prediction using contextual hierarchical LSTM. Proc 31st AAAI Conf on Artificial Intelligence, p.3450-3456. https://doi.org/10.5555/3298023.3298070
Jans B, Bethard S, Vulić I, et al., 2012. Skip N-grams and ranking functions for predicting script events. Proc 13th Conf of the European Chapter of the Association for Computational Linguistics, p.336-344.
Jones MP, Martin JH, 1997. Contextual spelling correction using latent semantic analysis. Proc 5th Conf on Applied Natural Language Processing, p.166-173. https://doi.org/10.3115/974557.974582
Kaelbling LP, Littman ML, Moore AW, 1996. Reinforcement learning: a survey. J Artif Intell Res, 4:237-285. https://doi.org/10.1613/jair.301
Khan A, Salim N, Kumar YJ, 2015. A framework for multi-document abstractive summarization based on semantic role labelling. Appl Soft Comput, 30:737-747. https://doi.org/10.1016/j.asoc.2015.01.070
Kiros R, Zhu YK, Salakhutdinov R, et al., 2015. Skip-thought vectors. Proc 28th Int Conf on Neural Information Processing Systems, p.3294-3302. https://doi.org/10.5555/2969442.2969607
Koh PW, Liang P, 2017. Understanding black-box predictions via influence functions. Proc 34th Int Conf on Machine Learning, p.1885-1894.
Laender AHF, Ribeiro-Neto BA, Da Silva AS, et al., 2002. A brief survey of web data extraction tools. ACM SIGMOD Rec, 31(2):84-93. https://doi.org/10.1145/565117.565137
Lee G, Flowers M, Dyer MG, 1992. Learning distributed representations of conceptual knowledge and their application to script-based story processing. In: Sharkey N (Ed.), Connectionist Natural Language Processing. Springer, Dordrecht, p.215-247. https://doi.org/10.1007/978-94-011-2624-3_11
Lee IT, Goldwasser D, 2018. FEEL: featured event embedding learning. Proc 32nd AAAI Conf on Artificial Intelligence.
Lee IT, Goldwasser D, 2019. Multi-relational script learning for discourse relations. Proc 57th Annual Meeting of the Association for Computational Linguistics, p.4214-4226. https://doi.org/10.18653/v1/p19-1413
Li JW, Monroe W, Ritter A, et al., 2016. Deep reinforcement learning for dialogue generation. Proc Conf on Empirical Methods in Natural Language Processing, p.1192-1202. https://doi.org/10.18653/v1/D16-1127
Li Q, Li ZW, Wei JM, et al., 2018. A multi-attention based neural network with external knowledge for story ending predicting task. Proc 27th Int Conf on Computational Linguistics, p.1754-1762.
Li ZY, Ding X, Liu T, 2018. Constructing narrative event evolutionary graph for script event prediction. Proc 27th Int Joint Conf on Artificial Intelligence, p.4201-4207. https://doi.org/10.5555/3304222.3304354
Li ZY, Ding X, Liu T, 2019. Story ending prediction by transferable BERT. Proc 28th Int Joint Conf on Artificial Intelligence, p.1800-1806. https://doi.org/10.24963/ijcai.2019/249
Lin YK, Liu ZY, Sun MS, et al., 2015. Learning entity and relation embeddings for knowledge graph completion. Proc 29th AAAI Conf on Artificial Intelligence.
Lin ZH, Feng MW, Dos Santos CN, et al., 2017. A structured self-attentive sentence embedding. Proc 5th Int Conf on Learning Representations.
Luong T, Pham H, Manning CD, 2015. Effective approaches to attention-based neural machine translation. Proc Conf on Empirical Methods in Natural Language Processing, p.1412-1421. https://doi.org/10.18653/v1/d15-1166
Lv SW, Qian WH, Huang LT, et al., 2019. SAM-Net: integrating event-level and chain-level attentions to predict what happens next. Proc AAAI Conf on Artificial Intelligence, p.6802-6809. https://doi.org/10.1609/aaai.v33i01.33016802
Mausam, Schmitz M, Bart R, et al., 2012. Open language learning for information extraction. Proc Joint Conf on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.523-534. https://doi.org/10.5555/2390948.2391009
McCann B, Bradbury J, Xiong CM, et al., 2017. Learned in translation: contextualized word vectors. Proc 31st Int Conf on Neural Information Processing Systems, p.6297-6308. https://doi.org/10.5555/3295222.3295377
Miikkulainen R, 1992. Discern: a distributed neural network model of script processing and memory. University of Twente, Connectionism and Natural Language Processing, p.115-124.
Miikkulainen R, 1993. Subsymbolic Natural Language Processing: an Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, USA.
Mikolov T, Chen K, Corrado G, et al., 2013. Efficient estimation of word representations in vector space. https://arxiv.org/abs/1301.3781
Miller GA, 1995. WordNet: a lexical database for English. Commun ACM, 38(11):39-41. https://doi.org/10.1145/219717.219748
Minsky M, 1975. A framework for representing knowledge. In: Winston PH (Ed.), The Psychology of Computer Vision. McGraw-Hill Book, New York, USA.
Mnih A, Hinton G, 2007. Three new graphical models for statistical language modelling. Proc 24th Int Conf on Machine Learning, p.641-648. https://doi.org/10.1145/1273496.1273577
Modi A, 2016. Event embeddings for semantic script modeling. Proc 20th SIGNLL Conf on Computational Natural Language Learning, p.75-83. https://doi.org/10.18653/v1/k16-1008
Modi A, Titov I, 2014a. Inducing neural models of script knowledge. Proc 18th Conf on Computational Natural Language Learning, p.49-57. https://doi.org/10.3115/v1/w14-1606
Modi A, Titov I, 2014b. Learning semantic script knowledge with event embeddings. Proc 2nd Int Conf on Learning Representations.
Modi A, Anikina T, Ostermann S, et al., 2016. InScript: narrative texts annotated with script information. Proc 10th Int Conf on Language Resources and Evaluation.
Modi A, Titov I, Demberg V, et al., 2017. Modeling semantic expectation: using script knowledge for referent prediction. Trans Assoc Comput Ling, 5(2):31-44. https://doi.org/10.1162/tacl_a_00044
Mostafazadeh N, Chambers N, He XD, et al., 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.839-849. https://doi.org/10.18653/v1/n16-1098
Mueller ET, 1998. Natural Language Processing with ThoughtTreasure. Signiform, New York, USA.
Navigli R, 2009. Word sense disambiguation: a survey. ACM Comput Surv, 41(2):10. https://doi.org/10.1145/1459352.1459355
Orr JW, Tadepalli P, Doppa JR, et al., 2014. Learning scripts as hidden Markov models. Proc 28th AAAI Conf on Artificial Intelligence, p.1565-1571. https://doi.org/10.5555/2892753.2892770
Osman AH, Salim N, Binwahlan MS, et al., 2012. Plagiarism detection scheme based on semantic role labeling. Proc Int Conf on Information Retrieval & Knowledge Management, p.30-33. https://doi.org/10.1109/InfRKM.2012.6204978
Pei KX, Cao YZ, Yang JF, et al., 2017. DeepXplore: automated whitebox testing of deep learning systems. Proc 26th Symp on Operating Systems Principles, p.1-18. https://doi.org/10.1145/3132747.3132785
Pennington J, Socher R, Manning C, 2014. GloVe: global vectors for word representation. Proc Conf on Empirical Methods in Natural Language Processing, p.1532-1543. https://doi.org/10.3115/v1/d14-1162
Perozzi B, Al-Rfou R, Skiena S, 2014. DeepWalk: online learning of social representations. Proc 20th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.701-710. https://doi.org/10.1145/2623330.2623732
Peters M, Neumann M, Iyyer M, et al., 2018. Deep contextualized word representations. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.2227-2237. https://doi.org/10.18653/v1/n18-1202
Pichotta K, Mooney R, 2014. Statistical script learning with multi-argument events. Proc 14th Conf of the European Chapter of the Association for Computational Linguistics, p.220-229. https://doi.org/10.3115/v1/e14-1024
Pichotta K, Mooney RJ, 2016a. Learning statistical scripts with LSTM recurrent neural networks. Proc 30th AAAI Conf on Artificial Intelligence, p.2800-2806. https://doi.org/10.5555/3016100.3016293
Pichotta K, Mooney RJ, 2016b. Using sentence-level LSTM language models for script inference. Proc 54th Annual Meeting of the Association for Computational Linguistics, p.279-289. https://doi.org/10.18653/v1/p16-1027
Prasad R, Dinesh N, Lee A, et al., 2008. The Penn Discourse Treebank 2.0. Proc 6th Int Conf on Language Resources and Evaluation, p.2961-2968.
Qiu XP, Sun TX, Xu YG, et al., 2020. Pre-trained models for natural language processing: a survey. https://arxiv.org/abs/2003.08271
Radford A, Narasimhan K, Salimans T, et al., 2019. Improving language understanding by generative pre-training. Proc Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p.4171-4186.
Radinsky K, Agichtein E, Gabrilovich E, et al., 2011. A word at a time: computing word relatedness using temporal semantic analysis. Proc 20th Int Conf on World Wide Web, p.337-346. https://doi.org/10.1145/1963405.1963455
Rashkin H, Sap M, Allaway E, et al., 2018. Event2Mind: commonsense inference on events, intents, and reactions. Proc 56th Annual Meeting of the Association for Computational Linguistics, p.463-473. https://doi.org/10.18653/v1/P18-1043
Regneri M, Koller A, Pinkal M, 2010. Learning script knowledge with web experiments. Proc 48th Annual Meeting of the Association for Computational Linguistics, p.979-988. https://doi.org/10.5555/1858681.1858781
Rudinger R, Rastogi P, Ferraro F, et al., 2015. Script induction as language modeling. Proc Conf on Empirical Methods in Natural Language Processing, p.1681-1686. https://doi.org/10.18653/v1/d15-1195
Rumelhart DE, 1980. Schemata: the building blocks of cognition. In: Spiro RJ (Ed.), Theoretical Issues in Reading Comprehension. Erlbaum, Hillsdale, p.33-58.
Sap M, Le Bras R, Allaway E, et al., 2019. ATOMIC: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf on Artificial Intelligence, p.3027-3035. https://doi.org/10.1609/aaai.v33i01.33013027
Schank RC, 1983. Dynamic Memory: a Theory of Reminding and Learning in Computers and People. Cambridge University Press, New York, USA.
Schank RC, 1990. Tell Me a Story: a New Look at Real and Artificial Memory. Charles Scribner, New York, USA.
Schank RC, Abelson RP, 1977. Scripts, Plans, Goals and Understanding: an Inquiry into Human Knowledge Structures. L. Erlbaum, Hillsdale, USA.
Schuler KK, 2005. VerbNet: a Broad-Coverage, Comprehensive Verb Lexicon. PhD Thesis, University of Pennsylvania, Pennsylvania, USA.
Shen D, Lapata M, 2007. Using semantic roles to improve question answering. Proc Joint Conf on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.12-21.
Socher R, Huval B, Manning CD, et al., 2012. Semantic compositionality through recursive matrix-vector spaces. Proc Joint Conf on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.1201-1211. https://doi.org/10.5555/2390948.2391084
Sutton RS, Barto AG, 2018. Reinforcement Learning: an Introduction (2nd Ed.). MIT Press, Cambridge, USA.
Taylor WL, 1953. "Cloze procedure": a new tool for measuring readability. J Mass Commun Q, 30(4):415-433. https://doi.org/10.1177/107769905303000401
Terry WS, 2006. Learning and Memory: Basic Principles, Processes, and Procedures. Allyn and Bacon, Boston, USA.
Tulving E, 1983. Elements of Episodic Memory. Oxford University Press, New York, USA.
Wang Z, Zhang JW, Feng JL, et al., 2014. Knowledge graph embedding by translating on hyperplanes. Proc 28th AAAI Conf on Artificial Intelligence, p.1112-1119. https://doi.org/10.5555/2893873.2894046
Wang ZQ, Zhang Y, Chang CY, 2017. Integrating order information and event relation for script event prediction. Proc Conf on Empirical Methods in Natural Language Processing, p.57-67. https://doi.org/10.18653/v1/d17-1006
Weber N, Balasubramanian N, Chambers N, 2018. Event representations with tensor-based compositions. Proc 32nd AAAI Conf on Artificial Intelligence, p.4946-4953.
Weston J, Chopra S, Bordes A, 2015. Memory networks. https://arxiv.org/abs/1410.3916
Zhao SD, Wang Q, Massung S, et al., 2017. Constructing and embedding abstract event causality networks from text snippets. Proc 10th ACM Int Conf on Web Search and Data Mining, p.335-344. https://doi.org/10.1145/3018661.3018707
Zheng JM, Cai F, Chen HH, 2020. Incorporating scenario knowledge into a unified fine-tuning architecture for event representation. Proc 43rd Int ACM SIGIR Conf on Research and Development in Information Retrieval, p.249-258. https://doi.org/10.1145/3397271.3401173
Zhou MT, Huang ML, Zhu XY, 2019. Story ending selection by finding hints from pairwise candidate endings. IEEE/ACM Trans Audio Speech Lang Process, 27(4):719-729. https://doi.org/10.1109/TASLP.2019.2893499

Yi HAN, first author of this invited paper, received his BS and MS degrees from the National University of Defense Technology (NUDT), Changsha, China in 2016 and 2018, respectively, and is currently a PhD candidate at the College of Computer Science, NUDT. His research interests include natural language processing, event extraction, and few-shot learning.

Linbo QIAO, corresponding author of this invited paper, received his BS, MS, and PhD degrees in computer science and technology from NUDT, Changsha, China in 2010, 2012, and 2017, respectively. Now, he is an assistant research fellow at the National Lab for Parallel and Distributed Processing, NUDT. He worked as a research assistant at the Chinese University of Hong Kong from May to Oct. 2014. His research interests include structured sparse learning, online and distributed optimization, and deep learning for graphs and graphical models.

Jianming ZHENG received the BS and MS degrees from NUDT, China in 2016 and 2018, respectively, and is currently a PhD candidate at the School of System Engineering, NUDT. His research interests include semantics representation, few-shot learning, and its applications in information retrieval. He has several papers published in SIGIR, WWW, COLING, IPM, Cognitive Computation, etc.

Hefeng WU received the BS degree in Computer Science and Technology and the PhD degree in Computer Application Technology from Sun Yat-sen University, China in 2008 and 2013, respectively. He is currently a full research scientist with the School of Data and Computer Science, Sun Yat-sen University. His research interests include computer vision, multimedia, and machine learning. He has served as a reviewer for many academic journals and conferences, including TIP, TCYB, TSMC, TCSVT, PR, CVPR, ICCV, NeurIPS, and ICML.

Dongsheng LI received his BS degree (with honors) and PhD degree (with honors) in computer science from the College of Computer Science, NUDT, Changsha, China in 1999 and 2005, respectively. He was awarded the prize of National Excellent Doctoral Dissertation by the Ministry of Education of China in 2008, and the National Science Fund for Distinguished Young Scholars in 2020. He is now a full professor at the National Lab for Parallel and Distributed Processing, NUDT. He is a corresponding expert of Frontiers of Information Technology & Electronic Engineering. His research interests include parallel and distributed computing, cloud computing, and large-scale data management.

Xiangke LIAO received his BS degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China in 1985, and his MS degree from NUDT, Changsha, China in 1988. He is currently a full professor of NUDT, and an academician of the Chinese Academy of Engineering. His research interests include parallel and distributed computing, high-performance computer systems, operating systems, cloud computing, and networked embedded systems.