Constructing a Dictionary Describing Feature Changes of Arguments in Event Sentences

Tetsuaki Nakamura † Daisuke Kawahara †‡ [email protected] [email protected]

† Graduate School of Informatics, Kyoto University
‡ JST PRESTO

Proceedings of the 4th Workshop on Events: Definition, Detection, Coreference, and Representation, pages 46–50, San Diego, California, June 17, 2016. © 2016 Association for Computational Linguistics

Abstract

Common sense knowledge plays an essential role in natural language understanding, human-machine communication, and so forth. In this paper, we acquire knowledge of events as common sense knowledge because dictionaries of such knowledge may be useful for recognizing implication relations in texts, inferring human activities and their planning, and so on. As event knowledge, we focus on feature changes of arguments (hereafter, FCAs) in event sentences. To construct a dictionary of FCAs, we propose a framework for acquiring such knowledge based on both the automatic approach and the collective intelligence approach, to exploit the merits of both. We acquired FCAs in event sentences through crowdsourcing and conducted a subjective evaluation to validate whether the FCAs were adequately acquired. The evaluation showed that we were able to capture FCAs in event sentences reasonably well.

1 Introduction

Common sense knowledge plays an essential role in natural language understanding, human-machine communication, and so forth. There are two approaches to acquiring such knowledge: the automatic acquisition approach and the manual acquisition approach. The automatic acquisition approach uses techniques and pattern matching methods. This approach is useful when the amount of data to be acquired is extremely large. However, the acquired knowledge might be of low quality. The manual acquisition approach may use a fully manual technique (as in the very early stages of artificial intelligence studies) or collective intelligence (e.g., crowdsourcing and games with a purpose). This approach is useful for gathering subjective information, such as emotion information, which is difficult to acquire automatically.

In this paper, we acquire knowledge of events as common sense knowledge because dictionaries of such knowledge may be useful for recognizing implication relations in texts, inferring human activities and their planning, and so on. As event knowledge, we focus on feature changes of arguments (hereafter, FCAs) in event sentences, because several studies of infant cognitive development (Massey and Gelman, 1988; Baillargeon et al., 1989; Spelke et al., 1995) report that even infants use information about the basic features of participants in events to understand the events. Some trials suggest that dictionaries of FCAs are useful resources for deep understanding of texts (Rahman and Ng, 2012; Goyal et al., 2013). Rahman and Ng (2012) use relationships between event causalities and polarities (positive / negative) of participants in the events as features for anaphora resolution. Goyal et al. (2013) use a dictionary of various binary effects (success(+/-), motivation(+/-), and so on) on characters caused by events (that is, verbs) for automatic story generation.

To construct a dictionary of FCAs, we propose a framework for acquiring such knowledge based on both an automatic approach and a manual approach that exploits the merits of both. That is, we acquire seed knowledge by using collective intelligence and expand the knowledge automatically.

Figure 1: Example entry in the proposed dictionary. Event sentences are associated with FCAs.

2 Related Work

Many studies have aimed at automatically acquiring relationships between events (Chambers and Jurafsky, 2009; Chambers and Jurafsky, 2010; Vanderwende, 2005; Shibata et al., 2014). However, these studies do not focus on the motivation of events or the effects caused by the events. Although we can determine which events occur after an event by using the dictionaries constructed in such studies, we cannot understand why the events occur. For example, although we may learn from the dictionary that the event “a girl cries” happens after another event “a girl gets injured,” we cannot understand why the former event occurs. To understand the motivation of the former event, we have to know that “the girl feels pain.” Of course, the problem can be solved by treating the phenomenon “the girl feels pain” as an event and describing it in the dictionary. However, such a policy requires infinite descriptions. It is preferable to form and maintain the dictionary with a controlled granularity.

When participants in events are animate beings, events may influence their emotions. Many studies developing software to process human emotions exploit models proposed by psychologists (Hasegawa et al., 2013; Tokuhisa et al., 2008; Tokuhisa et al., 2009; Vu et al., 2014). In these studies, Ekman’s Big Six Model (Ekman, 1992) and Plutchik’s wheel of emotions (Plutchik, 1980) are used to automatically extract emotion knowledge from large corpora. However, since these studies do not focus on what happens after various changes of emotions are evoked, they are not able to understand complex phenomena involving such changes.

Some manually constructed dictionaries that retain high quality are available. For example, ConceptNet [1], built in the Open Mind Common Sense project (Singh, 2002), and a large-scale knowledge database built in the project (Lenat, 1995) are available. The database built in the VerbCorner project (Hartshorne et al., 2014), which expands VerbNet (Kipper et al., 2000) through crowdsourcing, is also available. However, these studies do not focus on the granularity of knowledge.

To control the granularity, we focus on the use of basic level features of arguments in event sentences. Since both animate beings and inanimate objects can be arguments in sentences, we assume not only physical features but also mental features. The decision criteria for these features are described in Section 3.2.

[1] http://conceptnet5.media.mit.edu/

3 Proposal

3.1 Architecture

To construct knowledge that keeps a controlled granularity and can be used to understand the motivation of events, we focus on the feature changes of arguments (FCAs) in event sentences. Our purpose is to construct a dictionary that has the architecture shown in Figure 1. As shown in the figure, the dictionary describes relationships between events and FCAs in the event sentences.

3.2 Features

As described in Section 2, we focus on the use of basic level features of arguments in event sentences. We assume that the 16 features listed in Table 1 are associated with arguments in event sentences. As physical basic level features, we use the eight physical features listed in Table 1, chosen with attention to the traditional sense categories (“sight”, “hearing”, “smell”, “taste”, and “touch”). As mental basic level features, we use the basic emotions of Plutchik’s model (Plutchik, 1980) (the eight mental features listed in Table 1), because the model assumes that other emotions are derived systematically from the basic emotions.
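The entry structure described in Sections 3.1 and 3.2 can be sketched as a small data structure. This is only an illustrative sketch: the feature and degree names come from Table 1, while the class names (`ArgumentFCA`, `DictionaryEntry`), the field layout, and the example entry are assumptions made here, not the authors' format.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Feature and degree names as listed in Table 1 of the paper.
PHYSICAL_FEATURES: Tuple[str, ...] = (
    "form", "color", "touch", "smell", "sound", "taste", "location", "temperature",
)
MENTAL_FEATURES: Tuple[str, ...] = (
    "joy", "fear", "trust", "surprise", "anger", "disgust", "sadness", "expectation",
)
PHYSICAL_DEGREES: Tuple[str, ...] = ("changed", "unchanged", "irrelevant")
MENTAL_DEGREES: Tuple[str, ...] = ("increased", "decreased", "unchanged", "irrelevant")


def allowed_degrees(feature: str) -> Tuple[str, ...]:
    """Return the degrees that Table 1 permits for a feature."""
    if feature in PHYSICAL_FEATURES:
        return PHYSICAL_DEGREES
    if feature in MENTAL_FEATURES:
        return MENTAL_DEGREES
    raise ValueError(f"unknown feature: {feature}")


@dataclass
class ArgumentFCA:
    """Feature changes of one argument (hypothetical layout)."""
    argument: str
    changes: Dict[str, str] = field(default_factory=dict)

    def set_change(self, feature: str, degree: str) -> None:
        # Reject degrees that Table 1 does not define for this feature.
        if degree not in allowed_degrees(feature):
            raise ValueError(f"{degree!r} is not a valid degree for {feature!r}")
        self.changes[feature] = degree


@dataclass
class DictionaryEntry:
    """One event sentence associated with the FCAs of its arguments."""
    sentence: str
    arguments: Dict[str, ArgumentFCA] = field(default_factory=dict)


# Illustrative entry only; the label is an example, not data from the paper's dictionary.
child = ArgumentFCA("a child")
child.set_change("fear", "increased")
entry = DictionaryEntry("a child gets a fever", {"a child": child})
```

Validating degrees per category keeps the three-option physical scale and the four-option mental scale from being mixed up in stored entries.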

Category   Features                                                            Degrees
physical   form, color, touch, smell, sound, taste, location, temperature     changed, unchanged, irrelevant
mental     joy, fear, trust, surprise, anger, disgust, sadness, expectation   increased, decreased, unchanged, irrelevant

Table 1: Features of arguments.

3.3 Construction Method

We are planning to construct the dictionary described in Section 3.1 based on both automatic procedures and collective intelligence, as follows:
(PHASE 1) Frequent event sentences are selected automatically.
(PHASE 2) FCAs of the selected sentences are acquired as seed knowledge through crowdsourcing, because our dictionary is designed to deal with subjective information such as emotions.
(PHASE 3) Knowledge is automatically expanded by using the seed knowledge.

The sequence of procedures described above is illustrated in Figure 2. We intend to expand the size of our dictionary based on this loop.

Figure 2: Expansion of our dictionary. We used automatic procedures and crowdsourcing to expand the size of the dictionary.

4 Experiment

We conducted an experiment to validate whether FCAs in event sentences are adequately acquired by using crowdsourcing.

4.1 Data

We created the event sentences presented to crowdsourcing workers based on the “Kyoto University Web Document Leads Corpus (KWDLC) (Hangyo et al., 2012)” [2] and the “Kyoto University Case Frames (KUCF) (Kawahara and Kurohashi, 2006)” [3]. The KWDLC is a Japanese text corpus that comprises 5,000 documents (15,000 sentences) with annotations of morphology, named entities, dependencies, predicate-argument structures including zero anaphora, and coreferences. The KUCF is a database of case frames automatically constructed from 1.6 billion Japanese sentences taken from Web pages. The KUCF has about 40,000 predicates, with 13 case frames on average for each predicate.

The sentence creation procedure is as follows.
(STEP 1) The 200 most frequent verbs were extracted from the KWDLC.
(STEP 2) Verbs that are very abstract (e.g., “do”) were omitted. As a result, 167 verbs remained.
(STEP 3) The two most frequent arguments were used as the representative arguments of each case frame (meaning) of each verb.
(STEP 4) Sentences were created by combining the verbs with their representative arguments. As a result, 1,935 Japanese sentences were created.
(STEP 5) Sentences that were difficult to understand were pruned through crowdsourcing [4]. The crowdsourcing workers were asked to answer whether they could understand the presented sentences. Each sentence was judged by 10 workers. In total, 244 people participated in the task. For each sentence, we estimated the probability that “yes” was selected by using the method proposed by Whitehill et al. (2009) and omitted sentences with probabilities less than 0.9.

As a result, 857 event sentences remained, containing 148 verbs (types). Since each of these sentences had two or three arguments, 1,768 arguments (tokens) were obtained.

4.2 Method

FCAs in event sentences were acquired through a crowdsourcing task. In the task, workers were asked to answer questions about feature changes of arguments in the presented event sentences, as shown in Figure 3. In each task, a worker was given a set consisting of an event sentence and five questions about the FCAs of one argument in the sentence. The argument was randomly selected from the 1,768 arguments described above. Although only five of the 16 features in Table 1 were asked about in each task, FCAs of all the features for each argument were acquired in total. In total, 2,910 people (tokens) participated in this experiment. Each worker could take part in a maximum of five tasks, and each feature was judged by 10 workers.

[2] http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?DDLC
[3] http://www.gsk.or.jp/en/catalog/gsk2008-b/
[4] http://crowdsourcing.yahoo.co.jp/
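The pruning rule in STEP 5 of Section 4.1 (omit sentences whose estimated probability of “yes” is below 0.9) can be sketched as follows. This is a minimal sketch under stated assumptions: the paper estimates the probabilities with the method of Whitehill et al. (2009), which is not reproduced here. The helper `naive_p_yes` is a plain fraction of “yes” votes used only as a stand-in, and all function names and example values are hypothetical.

```python
from typing import Dict, List, Sequence


def naive_p_yes(votes: Sequence[bool]) -> float:
    """Stand-in estimate: fraction of workers who answered "yes".

    Assumption: this is NOT the Whitehill et al. (2009) estimator used in
    the paper; it ignores question difficulty and worker skill.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) / len(votes)


def prune_sentences(p_yes: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Keep sentences whose estimated P("yes") is not less than the threshold."""
    return [sentence for sentence, p in p_yes.items() if p >= threshold]


# Hypothetical probabilities for three made-up sentence identifiers.
estimated = {"sentence_a": 0.95, "sentence_b": 0.90, "sentence_c": 0.89}
kept = prune_sentences(estimated)
```

With these made-up inputs, only `sentence_c` falls below the threshold and is dropped.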

[Figure 3 example task. Instruction: “Think the situations before and after the presented event. Then, select the answers below.” Event sentence: “A person uses an express train”. Question: “Is the color of the “person” change?” Options: “yes”, “no (or irrelevant)”.]

Figure 3: Layout example of the task to acquire FCAs (physical features). Each task is composed of an event sentence and five questions about FCAs in the sentence. In the case of mental features, four options (increased, decreased, unchanged, and irrelevant) are used.

We used the method proposed by Whitehill et al. (2009) to estimate the probability of each option for each feature. These probabilities are estimated by jointly incorporating the difficulty of each question and the answering skill of each worker, according to the agreement among the collected judgments. Therefore, the estimated probabilities are more reliable than the results of a simple counting method such as majority voting. The probabilities were used as elements of vectors expressing the FCAs (hereafter, FCA vectors). Moreover, we assigned ten workers to each question to better ensure reliability, since it is reported that annotation results provided by more than six non-expert workers are close to those provided by expert labelers (Snow et al., 2008).

4.3 Results

For example, for “a child” in “a child gets a fever,” the estimated probabilities that the label “increased” was selected for “temperature” and for “fear” were 0.99. To validate whether FCAs in event sentences were adequately acquired by using crowdsourcing, we conducted a subjective evaluation as follows:
(STEP 1) Ten arguments with salient elements in their FCA vectors were selected from the 1,768 arguments (hereafter, query arguments: QAs).
(STEP 2) For each QA, a list of arguments with impressions similar to the QA was obtained (candidate arguments: CAs). Each list was composed of the five most similar arguments. In this paper, “arguments with similar impressions” means arguments that have similar FCA vectors. We used the cosine similarity between FCA vectors as the similarity measure.
(STEP 3) For each list, five judges (people engaged in studies of natural language processing) were asked to answer whether each CA was similar to the corresponding QA (similar = 2 points, a little similar = 1 point, different = 0 points).

As a result, the average score of the CAs was 1.24. (For QAs with salient physical FCAs, the average score was 1.08. For QAs with salient mental FCAs, the average score was 1.40.) This result suggests that FCAs were adequately acquired through crowdsourcing. In particular, the result for QAs with salient mental FCAs was better than that for QAs with salient physical FCAs. We speculate that this is due to the granularity of the values (that is, “whether features are changed (three options: changed, unchanged, irrelevant)” vs. “how features are changed (four options: increased, decreased, unchanged, irrelevant)”).

5 Conclusion

In this paper, we focused on feature changes of arguments (FCAs) in event sentences as knowledge of events and acquired the FCAs as common sense knowledge through crowdsourcing. The subjective evaluation conducted to validate whether the FCAs were adequately acquired through crowdsourcing showed that we were able to capture the FCAs reasonably well.

As the next step, we must consider an automatic method that expands the size of our dictionary by using seed knowledge, as described in Section 3.3. For the expansion, we are planning to analyze the knowledge acquired in this study, because we anticipate that some FCAs can be shared and others cannot. That is, we think that there are at least three types of FCAs: (1) FCAs that mainly depend on the verbs themselves (e.g., the disgust that “girl” feels in “A boy complains to a girl”), (2) FCAs that mainly depend on combinations of arguments and verbs (e.g., the color of “water” in “He pours his orange juice into the water”), and (3) FCAs that mainly depend on contexts (e.g., the joy that “girl” feels in “A girl sees a boy”; the girl’s emotion depends on whether she likes him).

As future work, we are also planning to construct a large-scale network in which the structures described in Figure 1 are regarded as units and related units are linked to each other.

Note that the patent for the architecture described in this paper is pending at the time of writing.
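The retrieval in STEP 2 of Section 4.3 (ranking arguments by the cosine similarity of their FCA vectors and keeping the five most similar) can be sketched as below. This is an illustrative sketch only: the paper does not specify how FCA vectors are laid out, so the three-element toy vectors, the argument names, and the function names are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(
    query: str, vectors: Dict[str, Sequence[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k arguments most similar to the query, best first."""
    scores = [
        (name, cosine_similarity(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy FCA vectors with made-up probabilities (hypothetical layout and values).
toy_vectors = {
    "child": [0.99, 0.01, 0.00],
    "patient": [0.90, 0.10, 0.00],
    "train": [0.00, 0.10, 0.90],
}
candidates = most_similar("child", toy_vectors, k=2)
```

In this toy setting, "patient" ranks above "train" as a candidate for the query "child" because its vector points in nearly the same direction.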

References

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The winograd schema

Renee Baillargeon,