Semantic Role Labeling for Process Recognition Questions

Samuel Louvan+, Chetan Naik+, Veronica Lynn+, Ankit Arun+, Niranjan Balasubramanian+, Peter Clark∗

+Stony Brook University, ∗Allen Institute for AI, {slouvan, cnaik, velynn, aarun, niranjan}@cs.stonybrook.edu, [email protected]

Abstract

We consider a 4th grade level question answering task. We focus on a subset involving recognizing instances of physical, biological, and other natural processes. Many processes involve similar entities and are hard to distinguish using simple bag-of-words representations alone.

Simple semantic roles such as Input, Result, and Enabler can often capture the most critical bits of information about processes. Our QA system scores answers by aligning semantic roles in the question against the roles in the knowledge. Empirical evaluation shows that manually generated roles provide a 12% relative improvement in accuracy over a simpler bag-of-words representation. However, automatic role identification is noisy and does not provide gains, even with distant supervision and domain adaptation modifications that account for the limited training data. In addition, we conducted an error analysis of the QA system when using the manual roles. We find representational gaps, i.e., cases where critical information does not fit into any of the current roles, as well as entailment issues that motivate deeper reasoning beyond simple role-based alignment as future work.

1 Introduction

Grade-level science exams are an excellent benchmark for developing knowledge-driven question answering systems (Clark et al., 2013). These exams test students' understanding of a wide variety of concepts including physical, biological, and other natural processes. Some questions test the ability to recognize a process given a description of one of its instances.
Here is an example: A puddle is drying in the sun. This is an example of (A) evaporation (B) condensation (C) melting (D) sublimation.

This work explores a knowledge-driven approach to answering such questions. In particular, we investigate a light-weight semantic role based representation to answer 4th grade science questions that test process recognition. Many processes are similar to each other and, not surprisingly, are often described in similar language. For instance, evaporation and condensation are both phase changes involving liquids and gases. Distinguishing between the two requires an understanding of the different roles that liquids and gases play in each process. This type of knowledge is naturally expressed via semantic roles.

We design a light-weight representation that applies to many processes. In general, a process has an input – the artifact that undergoes the process, an output – the artifact that results from the process, and an enabler – an artifact or event that helps the process. In addition, we also include key actions that denote the main events in the process. For example, in evaporation the typical input is a liquid and the output is a gaseous form of the input substance. The enabler is some form of heat source such as the sun.¹

¹This limited representation is incomplete and semantically inaccurate in some cases. For instance, we ignore sub-events and any sequence or order information between them. We chose this representation as it covers a majority of the questions at the 4th grade level and is also more amenable to automatic extraction.

Given this semantic role based representation, recognizing a process instance (i.e., answering a question) becomes a task of assessing how well the roles of the instance in the question align with the typical roles of the candidate answer processes. Our preliminary experiments show that manually generated semantic representations can provide more than a 12% relative improvement in QA performance over a simple bag-of-words representation. A scalable solution, however, requires automatic extraction. We investigated an off-the-shelf semantic role labeling system, MATE (Björkelund et al., 2009), to extract these representations and evaluated their performance in QA.

SRL systems such as MATE are supervised systems that require large amounts of training data. Unfortunately, existing resources such as FrameNet do not cover all of our target scientific processes. We manually created a small amount of training data using sentences that involve the processes mentioned in the questions. To account for the limited amount of training data, we explored distant supervision and domain adaptation. We find that domain adaptation yields a modest improvement, while distant supervision is unreliable. Overall, we find that automatic extraction is still quite noisy and, as a result, does not provide improvements over simple bag-of-words representations.

An error analysis shows the knowledge gaps and deficiencies in the current representation, as well as the key linguistic phenomena that lead to errors in automatic SRL, all of which present interesting avenues for future work.

2 Representing Processes via Semantic Roles

A portion of the grade science exams tests the ability to recognize instances of physical, biological, and other natural processes. The questions present a short description of an instance and multiple process names as the answer choices. The examples below illustrate this:

1) As water vapor rises in the atmosphere, it cools and changes back to liquid. Tiny drops of liquid form clouds in this process called (A) condensation (B) evaporation (C) precipitation (D) run-off.

2) The process of change from an egg to an adult butterfly is an example of: (A) vibrations (B) metamorphosis.

3) A new aluminum can made from used cans is an example of: (A) recycling (B) reducing (C) reusing (D) repairing.

The descriptions of the instances are often short. They mention the main entities involved and the change brought about by the process. At the 4th grade level, the questions do not involve deep knowledge about the sub-events or their sequential order. Rather, the questions test for shallower knowledge about the entities undergoing change, the resulting artifacts, and the main characteristic action describing the process. This knowledge is naturally expressed via semantic roles. Accordingly, we design a simple representation that encodes information about each process via the following roles (illustrated in the sketch after the list):

1. Input – The main input to the process, or the object undergoing the process. e.g., water is an input to the evaporation process.
2. Result – The artifact that results from the process, or the change that results from the process. e.g., water vapor is a result of evaporation.
3. Trigger – The main action, expressed as a verb or its nominalization, indicating the occurrence of the process. e.g., converted is a trigger for evaporation.
4. Enabler – The artifact, condition, or action that enables the process to happen. e.g., the sun is a heat source that is an enabler for evaporation.
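To make this concrete, here is a minimal sketch of how one entry of such a role-based representation could be stored in memory. The variable names and the dictionary layout are our own illustration, not code from the paper:

```python
# One row of the process knowledge table: role name -> text span
# extracted from a single sentence about the process.
evaporation_row = {
    "input": "water",          # the object undergoing the process
    "result": "water vapor",   # what the process produces
    "trigger": "converted",    # the main action word
    "enabler": "sun",          # the heat source enabling the process
}

# The table itself: one list of rows per process, accumulated from
# definitional sentences and from sentences describing instances.
knowledge_table = {
    "evaporation": [evaporation_row],
}
```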
Our goal is to build aggregate knowledge about processes from multiple sentences. In effect, we wish to create a table whose columns are the semantic roles and whose rows are role fillers obtained from sentences mentioning the process. We gather knowledge about processes both from definitional sentences and from sentences that describe instances of the processes.

Definitional sentences present type information about the roles. For instance, the input to evaporation is typically a liquid. During question answering, we have to check the type compatibility of the roles in the question against those in the definitional sentence. Having instance-level information in the process knowledge can help in situations where type resolution fails. On the other hand, definitional sentences provide better coverage of information about roles compared to instance sentences, which can sometimes omit roles that are obvious.

2.1 MATE System for SRL

There have been various SRL approaches, differing in the resources used (PropBank, FrameNet) as well as in the techniques and features employed. Nevertheless, in general, the workflow of an SRL system consists of two parts, namely argument identification and argument classification. In this work, we use the MATE SRL system (Björkelund et al., 2009). We use MATE in our setting because the code is publicly available, it provides automatic predicate identification, and it was one of the highest performing systems on the English dataset of the CoNLL'09 SRL shared task.

MATE accepts sentences in CoNLL'09 format and processes them through several pipeline stages, namely predicate identification, argument identification, and argument classification. The features used in the pipeline are based on syntactic information obtained from a POS tagger and a dependency parser, such as the position of the argument with respect to the predicate, the dependency path between the arguments and the predicate, the set of POS tags of the predicate's children, etc. Please refer to (Björkelund et al., 2009) for details.

The modifications we make to MATE are related to predicate identification and domain adaptation. For predicate identification, in case the classifier fails to identify a predicate, we force it to output one based on the sorted confidence scores of the predicate candidates. For domain adaptation, we adopt the simple feature augmentation approach of (Daumé III, 2009).

Structured prediction problems such as SRL require substantial amounts of training data. SRL systems such as MATE typically train on resources such as FrameNet, which contains more than 190,000 sentences. Semantic roles are not reliably identified with syntactic patterns alone. A pattern such as [verb] to [X] could suggest that X is a result or an enabler depending on the process at hand. For instance, change to [X] indicates a resulting state, whereas adapts to [X] does not. This suggests that lexical information (e.g., the type of verb) is quite critical. Unless all the lexical variations have been observed in the training data, generalization is likely to suffer. We explore two ideas to address this issue.

2.2 Domain Adaptation

A straightforward approach to training the SRL system is to combine all process sentences into one pool and learn a single model. As an alternate approach, we can learn an SRL model for every process using only the sentences that describe that process. This enables learning from a much smaller but highly relevant set of sentences. Note that this is possible in our setting because we know beforehand which process each sentence is describing. Rather than picking one approach over the other, we can combine their strengths using domain adaptation ideas (Daumé III, 2009).

The sentences that describe the target process can be viewed as target domain data, and the sentences describing all other processes can be viewed as source domain data. Following (Daumé III, 2009), we utilize a simple approach for domain adaptation. The key idea is to take the existing features and create a new feature vector that contains three versions of the original features: a source-specific version, a target-specific version, and a general version. Instances from the source domain contain only the general and source-specific versions, and instances from the target domain contain only the general and target-specific versions. Formally, if f is the set of features used, then we use the following mappings to create new feature vectors:

Φs(f) = ⟨f, f, 0⟩
Φt(f) = ⟨f, 0, f⟩

where Φs is applied to source sentences and Φt is applied to target sentences. This mapping enables the learning algorithm to do domain adaptation by learning two sets of weights that reflect the utility of a feature across all domains as well as within the target domain. This simple transformation has been shown to be quite effective for domain adaptation (Daumé III, 2009).
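For sparse, named features, the two mappings amount to copying each feature into a shared slot plus one domain-specific slot. Below is a minimal sketch, assuming features are represented as name-to-value dictionaries; the prefix naming is our own, not the paper's:

```python
def augment_source(features):
    """Phi_s(f) = <f, f, 0>: a general copy plus a source-specific copy.
    The target-specific slot is implicitly zero (simply absent)."""
    out = {}
    for name, value in features.items():
        out["general:" + name] = value
        out["source:" + name] = value
    return out

def augment_target(features):
    """Phi_t(f) = <f, 0, f>: a general copy plus a target-specific copy."""
    out = {}
    for name, value in features.items():
        out["general:" + name] = value
        out["target:" + name] = value
    return out

# Sentences describing the target process go through augment_target;
# sentences about all other processes go through augment_source. The
# learner then assigns separate weights to "general:" features (useful
# across all processes) and "target:" features (useful only in-domain).
```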
2.3 Distant Supervision

Distant supervision (Mintz et al., 2009) is an increasingly popular method for relation extraction, typically used in settings with a large amount of unlabeled data and an existing source of labeled knowledge. The key assumption of distant supervision is that "if two entities participate in a relation, any sentence that contains those two entities might express that relation" (Mintz et al., 2009).

We make a similar assumption and use distant supervision to obtain more labeled data. We use the Waterloo corpus² as our unlabeled data source. We search this corpus to find sentences related to each process and automatically annotate them for semantic roles. We use the following procedure (the second step is sketched in code after the list):

• Collect the most frequent role fillers for each process. Create a query that finds sentences containing the role fillers.
• For each retrieved sentence, identify the location of each role filler. Label a candidate role filler only if there is an up-path ↑ (dependency path) from the candidate role filler nodes to the trigger node. If there is no such path, skip the sentence.
• We also use simple hand-crafted lexical patterns to identify Input, Enabler, and Result. For example, Enablers are often characterized by preceding words such as by, through, with, because. Results usually appear after words such as causes, into, produce.

²This is a large collection of Web documents, about 280 GB, collected by Charlie Clark at Univ. of Waterloo.
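A minimal sketch of the dependency-path check in the second step, assuming a CoNLL-style parse where each (1-based) token index maps to its head index, with 0 for the root. The helper names are illustrative, not the paper's code:

```python
def has_up_path(idx, trigger_idx, heads):
    """Follow head links upward from token `idx`; True if the walk
    reaches the trigger token before hitting the root or a cycle."""
    seen = set()
    while idx != 0 and idx not in seen:
        if idx == trigger_idx:
            return True
        seen.add(idx)
        idx = heads[idx]
    return False

def distant_label(tokens, heads, trigger_idx, role_fillers):
    """Annotate one retrieved sentence. `role_fillers` maps a role name
    to the filler word expected for this process. Returns None (skip
    the sentence) when no filler is connected to the trigger."""
    labels = {}
    for role, filler in role_fillers.items():
        if filler in tokens:
            idx = tokens.index(filler) + 1  # 1-based CoNLL indexing
            if has_up_path(idx, trigger_idx, heads):
                labels[role] = filler
    return labels or None
```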
3 Using Semantic Roles

In this section, we describe our approach to answering process recognition questions using the semantic role based representation of processes. The questions describe a specific instance of a process or a prototypical occurrence of a process. The QA task is then to choose the correct process that is being described. Intuitively, we can score each process based on how well the roles of the instance described in the question align with the roles of the process that we have gathered from other sentences describing it. Higher alignment suggests better evidence. In particular, we use the following procedure to answer questions (steps 4–6 are sketched in code after the list):

1. Convert the question into k statements (A_1, ..., A_k), one for each answer option.
2. Identify the roles in each question statement by processing it through MATE.
3. For each statement A_i, collect the corresponding process rows (S_1, ..., S_l) from the process knowledge table.
4. For each row in the table, compute an alignment score by checking for textual entailment of the corresponding roles:

alignment(A, S) = Σ_{role_i ∈ R} entails(role_i(A), role_i(S))

where R = {Input, Result, Enabler, Trigger}, and entails(x, y) is a textual entailment score that reflects how well the text x entails the text y, or the other way around. It measures the coverage of the statement role, accounting for synonyms and hypernyms using WordNet.
5. Compute the mean of the top 5 alignment scores and use it as the final score for each process.
6. Return the top scoring process as the answer.
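The scoring steps (4–6) can be summarized in a short sketch. The entailment scorer is left as a pluggable callable, since the paper only describes it as WordNet-backed coverage; the function and variable names below are our own:

```python
ROLES = ("input", "result", "trigger", "enabler")

def alignment(answer_roles, table_row, entails):
    """Step 4: sum of per-role entailment scores between a question
    statement's roles and one row of the process knowledge table."""
    return sum(entails(answer_roles.get(r, ""), table_row.get(r, ""))
               for r in ROLES)

def answer_question(statements, knowledge_table, entails, top_k=5):
    """Steps 5-6: score each candidate process by the mean of its
    top-k row alignments and return the best-scoring one.
    `statements` maps a candidate process name to the role dictionary
    extracted from the corresponding question statement."""
    scores = {}
    for process, answer_roles in statements.items():
        row_scores = sorted(
            (alignment(answer_roles, row, entails)
             for row in knowledge_table.get(process, [])),
            reverse=True)
        top = row_scores[:top_k]
        scores[process] = sum(top) / len(top) if top else 0.0
    return max(scores, key=scores.get)
```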

4 Evaluation

4.1 Process Recognition Collection

Our data set contains 141 process identification questions, which were selected manually out of 4650 questions from Help Teaching³, a collection of tests and worksheets for parents and educators, and 195 questions from the 4th-grade New York Regents Science Exams collection, similar to the one used in (Clark et al., 2013). We selected multiple-choice questions where at least two of the answer choices, including the correct answer, were processes.

³http://www.helpteaching.com

For each question, we identified all processes given as answer choices and collected definition sentences for each. In total, we collected 948 definition sentences covering 183 processes. These sentences were obtained from a variety of sources such as Barron's Study Guides and various web resources including Wikipedia and WordNet.

Each question and definition sentence was annotated following a set of guidelines to identify the roles expressed by the sentence. A sentence could contain multiple (or no) values for a single role, but text spans are not allowed to overlap – that is, the same word or phrase cannot be used for multiple roles. Disagreements were resolved by a second annotator.

4.2 Semantic Role Labeling

First, we compare the performance of SRL using MATE under the following configurations:

• Standard – This configuration uses sentences from all processes combined together for training and does not distinguish at test time whether the sentence is describing a particular process.
• Per-Process – This uses only the target process sentences for training. This setting requires that we have some seed set of sentences for every process.
• Domain Adaptation – This uses both target process sentences and other sentences, using the domain adaptation technique described in Section 2.2.
• Distant Supervision – This setting uses the sentences obtained via distant supervision for training. However, it trains only on the sentences that describe the target process and does not use sentences describing the other processes. This requires that we have some seed knowledge for each process, but not necessarily sentence-level annotations.

We use 5-fold cross validation to test each configuration. Table 1 shows the results.

Method               Precision  Recall  F1
Standard             0.4323     0.3325  0.3758
Per Process          0.4225     0.2556  0.3185
Distant Supervision  0.5614     0.2642  0.3594
Dom. Adaptation      0.4386     0.3351  0.3799

Table 1: Semantic Role Labeling Performance. Bold face entries indicate the best performance.

The results show that MATE is not highly effective here, with a best F1 of around 0.38. This is substantially lower than the state-of-the-art performance of MATE on standard datasets such as CoNLL, where its performance is around 0.80 F1. We believe that the limited amount of training data is a key factor in the performance differences. Training on target domain sentences using the Per-Process model results in lower performance, whereas Standard, Distant Supervision, and Domain Adaptation, all with increased amounts of data, result in better performance. We find that Domain Adaptation provides minor improvements over the Standard model. Interestingly, Distant Supervision does not improve F1 but achieves the best overall precision. We hypothesize that this is in part due to the change in the distribution of the roles. Many sentences annotated via distant supervision do not contain all the roles. Thus, the prior probability of observing any particular role decreases, potentially causing the learner to be more conservative about predicting roles.

4.3 QA

Our central hypothesis is that using a semantic-role based representation helps in process recognition questions. We compare manually generated semantic roles (a product of the annotation process) against automatically derived semantic roles for question answering. We use the process described in Section 3 for answering questions.

Table 2 shows the QA accuracy when using semantic roles generated by the different configurations. As a baseline, we use a simple bag-of-words system that uses the same entailment-based scoring but without the semantic roles (BOW). Manual SRL is a system which uses manually assigned semantic roles as its representation. We find that using manual roles by itself for alignment yields a 9% relative improvement in accuracy. BOW and SRL perform well on slightly different subsets of questions. As a result, combining SRL-based scoring with BOW-based scoring yields more than a 12% improvement. While access to semantic roles provides gains, a variety of other issues also need to be addressed to further improve QA performance (Section 4.4).

Method               Accuracy
BOW                  63.12
Manual SRL           67.38
BOW+Manual SRL       70.92
Standard             55.32
Per Process          46.80
Domain Adaptation    55.32
Distant Supervision  51.77
BOW + Standard       65.24

Table 2: Question Answering Performance. Bold face indicates the best performance in each block.

Automatically obtained semantic roles using the Standard formulation are worse than BOW by itself. However, in conjunction with BOW, Standard SRL provides minor gains over using BOW alone. For the most part, the QA performance of the other automatic SRL variants tracks the SRL extraction accuracy. This suggests that improving automatic SRL performance is likely to provide improvements in QA.
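The paper reports that combining the two scorers helps but does not spell out the combination in this section; a simple equal-weight interpolation such as the sketch below is one plausible reading, and is our assumption rather than the authors' method:

```python
def combined_score(bow_score, srl_score, w=0.5):
    """Interpolate the BOW score and the SRL alignment score for one
    answer choice. The weight w (and the interpolation itself) is an
    assumption; the paper only states that the two are combined."""
    return w * bow_score + (1 - w) * srl_score
```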
4.4 Error Analysis

Automatic SRL Failures

A qualitative analysis of the SRL errors showed two main issues that arise out of data sparsity. Much of the error arises from failures to identify the proper predicate. It turns out that the dominant pattern is one where the predicate is a verb directly attached to the ROOT. Other syntactic patterns are infrequently observed and are not very reliable, and automatic SRL fails in these cases. Similarly, the dependency labels of the children of a node are also strong features for predicate prediction, but this feature again fails except in cases of the most dominant dependencies.

A closer investigation suggests that much of the variation in these patterns arises from functional phrases such as "is a process by which". Simplifying the sentences by eliminating these constructions can lead to improved performance.

QA Failures

Even with manually assigned roles, the QA system failed on nearly 30% of the questions. The following are the main error categories.

• Knowledge Representation Issues (37%) – Some questions require the ability to recognize the order in which sub-events happen in a process, e.g., What is the third step of the water cycle?. In some cases, the question only specifies one role, such as the result, e.g., What process ends with a new and improved plant?. Success in these cases depends on the ability to do effective textual entailment. In other cases, certain critical information is not covered by our semantic representation. Example: Rotation is the spinning of a planet on its axis. Here, 'on its axis' is the critical information that differentiates rotation from revolution, but unfortunately it is not covered by our semantic roles.
• Entailment Issues (32%) – Textual entailment is noisy and provides inconsistent scores even in some simple cases. For example, the text-hypothesis pair ('all plant and animal species', 'all living things') has an entailment score of 0.2856, while the pair ('all plant and animal species', 'plants') has a score of 1. Some textual entailment cases are hard problems which themselves require deeper knowledge and reasoning. For example, one question requires reasoning that 'energy traveling from the sun to Earth' is related to 'energy transferred through space'. Similarly, another question requires that we know 'rub your hands together very quickly' is related to 'friction'.
• Scoring Issues (31%) – In many cases, the scoring function we use is unable to distinguish between strong evidence that comes from one role and many weak partial evidences from multiple roles. Favoring cases with many weak partial evidences requires learning a threshold. We plan to investigate a learning-based approach for combining the partial evidences rather than an ad-hoc scoring function.

5 Conclusions

In this work we focused on a semantic role based representation of knowledge about processes. We find that a small set of semantic roles can be used to build an effective representation for process recognition questions. Unfortunately, automatic SRL systems require significant amounts of training data. Out-of-the-box application of a standard SRL system trained on our limited labeled data turns out to be quite noisy and does not yield benefits for QA. Prior work explored semi-supervised and unsupervised approaches for addressing the data sparsity issues (Fürstenau and Lapata, 2009; Lang and Lapata, 2010; Lang and Lapata, 2011). We explored a domain adaptation technique and a simple distant supervision approach. While domain adaptation yielded minor gains, we were unable to benefit from the simpler distant supervision approach. This is in part due to differences in the percentage of roles in the distant supervision sentences and the target sentences.

Error analysis on the manual role-based QA shows representational gaps. Also, many questions require deeper reasoning that goes beyond simple textual entailment. Our findings suggest the following avenues for future work: 1) address representational gaps with a mix of pre-specified general roles and automatically discovered process-specific roles, 2) introduce additional structure within the roles to facilitate deeper reasoning, and 3) rather than relying on automatic interpretation of a handful of sentences, compose knowledge by explicitly searching for sentences that express roles in expected ways.

6 Acknowledgements

This work is funded in part by a Fulbright PhD Fellowship and by the Allen Institute for Artificial Intelligence. The findings and conclusions expressed herein are those of the authors alone.

References

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 43–48. Association for Computational Linguistics.

Peter Clark, Philip Harrison, and Niranjan Balasubramanian. 2013. A study of the knowledge base requirements for passing an elementary science test. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 37–42. ACM.

Hal Daumé III. 2009. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815.

Hagen Fürstenau and Mirella Lapata. 2009. Graph alignment for semi-supervised semantic role labeling. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 11–20, Singapore.

Joel Lang and Mirella Lapata. 2010. Unsupervised induction of semantic roles. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 939–947, Los Angeles, California. Association for Computational Linguistics.
Joel Lang and Mirella Lapata. 2011. Unsupervised semantic role induction with graph partitioning. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1320–1331, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011. Association for Computational Linguistics.