
Automatic Focus Annotation: Bringing Formal Pragmatics Alive in Analyzing the Information Structure of Authentic Data


Ramon Ziai and Detmar Meurers
Collaborative Research Center 833, University of Tübingen
{rziai,dm}@sfs.uni-tuebingen.de

Abstract

Analyzing language in context, both from a theoretical and from a computational perspective, is receiving increased interest. Complementing the research in linguistics on discourse and information structure, in computational linguistics identifying such concepts was also shown to improve the performance of certain applications, for example, Short Answer Assessment systems (Ziai and Meurers, 2014).

Building on the research that established detailed annotation guidelines for manual annotation of information structural concepts for written (Dipper et al., 2007; Ziai and Meurers, 2014) and spoken language data (Calhoun et al., 2010), this paper presents the first approach automating the analysis of focus in authentic written data. Our classification approach combines a range of lexical, syntactic, and semantic features to achieve an accuracy of 78.1% for identifying focus.

1 Introduction

The interpretation of language is well known to depend on context. Both in theoretical and computational linguistics, discourse and information structure of sentences are thus receiving increased interest: attention has shifted from the analysis of isolated sentences to the question how sentences are structured in discourse and how information is packaged in sentences analyzed in context.

As a consequence, a rich landscape of approaches to discourse and information structure has been developed (Kruijff-Korbayová and Steedman, 2003). Among these perspectives, the Focus-Background dichotomy provides a particularly valuable structuring of the information in a sentence in relation to the discourse. (1) is an example question-answer pair from Krifka and Musan (2012, p. 4) where the focus in the answer is marked by brackets.

(1) Q: What did John show Mary?
    A: John showed Mary [the PICtures]F.

In the answer in (1), the NP the pictures is focused and hence indicates that there are alternative things that John could show Mary. It is commonly assumed that focus here typically indicates the presence of alternatives (Krifka and Musan, 2012, p. 8), making it a semantic notion. Depending on the language, different devices are used to mark focus, such as prosodic focus marking or different syntactic constructions (e.g., clefts). In this paper, we adopt a notion of focus based on alternatives, as advanced by Rooth (1992) and more recently Krifka and Musan (2012), who define focus as indicating "the presence of alternatives that are relevant for the interpretation of linguistic expressions" (Krifka and Musan, 2012, p. 7). Formal pragmatics has tied the notion of alternatives to an explicit relationship between questions and answers called Question-Answer Congruence (Stechow, 1991), where the idea is that an answer is congruent to a question if both evoke the same set of alternatives. Questions can thus be seen as a way of making alternatives explicit in the discourse, an idea also taken up by the Question-Under-Discussion (QUD) approach (Roberts, 2012) to discourse organization.

Complementing the theoretical linguistic approaches, in the last decade corpus-based approaches started exploring which information structural notions can reliably be annotated in what kind of language data. While the information status (Given-New) dimension can be annotated successfully (Riester et al., 2010; Nissim et al., 2004) and even automated (Hempelmann et al., 2005; Nissim, 2006; Cahill and Riester, 2012), the inter-annotator agreement results for Focus-Background (Ritz et al., 2008; Calhoun et al., 2010) show that it is difficult to obtain high levels of agreement, especially due to disagreement about the extent or size of the focused unit.

More recently, Ziai and Meurers (2014) showed that for data collected in task contexts including explicit questions, such as answers to reading comprehension questions, reliable focus annotation is possible. In addition, an option for externally validating focus annotation was established by showing that such focus annotation improves the performance of Short Answer Assessment (SAA) systems. Focus enables the system to zoom in on the part of the answer addressing the question instead of considering all parts of the answer as equal.

In this paper, we want to build on this strand of research and develop an approach for automatically identifying focus in authentic data including explicit question contexts. In contrast to Calhoun (2007) and Sridhar et al. (2008), who make use of prosodic properties to tackle the identification of focus for content words in spoken language data, we target the analysis of written texts.

We start in section 2 by discussing relevant related work before introducing the gold standard focus annotation we are using as foundation of our work in section 3. Section 4 then presents the different types of features used for predicting which tokens form a part of the focus. In section 5 we employ a supervised machine learning setup to evaluate the perspective and the specific features in terms of their ability to predict the gold standard focus labeling. Building on these intermediate results and the analysis thereof in section 6, in section 7 we then present two additional feature groups which lead to our final focus detection model. Finally, section 8 explores options for extrinsically showing the value of the automatic focus annotation for the automatic meaning assessment of short answers. It confirms that focus analysis pays off when aiming to generalize assessment to previously unseen data and contexts.

2 Previous Approaches

There is only a very small number of approaches dealing with automatically labeling information structural concepts.¹ Approaches related to detecting focus automatically center almost exclusively on the 'kontrast' notion in the English Switchboard corpus (Calhoun et al., 2010). We therefore focus on the Switchboard-based approaches here.

The availability of the annotated Switchboard corpus (Calhoun et al., 2005, 2010) sparked interest in information-structural categories and enabled several researchers to publish studies on detecting focus. This is especially true for the Speech Processing community, and indeed many approaches described below are intended to improve computational speech applications in some way, by detecting prominence through a combination of various linguistic factors. Moreover, with the exception of Badino and Clark (2008), all approaches use prosodic or acoustic features.

All approaches listed below tackle the task of detecting 'kontrast' (as focus is called in the Switchboard annotation) automatically on various subsets of the corpus using different features and classification approaches. For each approach, we therefore report the features and classifier used, the data set size as reported by the authors, the (often very high) majority baseline for a binary distinction between 'kontrast' and background, and the best accuracy obtained. If available in the original description of the approach, we also report the accuracy obtained without acoustic and prosodic features.

Calhoun (2007) investigated how focus can be predicted through what she calls "prominence structure". The essential claim is that a "focus is more likely if a word is more prominent than expected given its syntactic, semantic and discourse properties". The classification experiment is based on 9,289 words with a 60% majority baseline for the 'background' class. Calhoun (2007) reports 77.7% for a combination of prosodic, syntactic and semantic features in a logistic regression model. Without the prosodic and acoustic features, the accuracy obtained is at 74.8%. There is no information on a separation between training and test set, likely due to the setup of the study being geared towards determining relevant factors in predicting focus, not building a focus prediction model for a real application case. Relatedly, the approach uses only gold-standard annotation already available in the corpus as the basis for features, not automatic annotation.

Sridhar et al. (2008) use lexical, acoustic and part-of-speech features in trying to detect pitch accent, givenness and focus. Concerning focus, the work attempts to extend Calhoun (2007)'s analysis to "understand what prosodic and acoustic differences exist between the focus classes and background items in conversational speech".

¹ For a broader perspective of computational approaches in connection with information structure, see Stede (2012).

14,555 words of the Switchboard corpus are used in total, but filtered for evaluation later to balance the skewed distribution between 'kontrast' and 'background'. With the thus obtained random baseline of 50%, Sridhar et al. (2008) obtain 73% accuracy when using all features, which drops only slightly to 72.95% when using only parts of speech. They use a decision tree classifier to combine the features in 10-fold cross-validation for training and testing.

Badino and Clark (2008) aim to model contrast both for its role in analyzing discourse and information structure, and for its potential in speech applications. They use a combination of lexical, syntactic and semantic features in an SVM classifier. No acoustic or prosodic features are employed in the model. In selecting the training and testing data, they filter out many 'kontrast' instances, such as those triggered across sentence boundaries, those above the word level, and those not sharing the same broad part of speech with the trigger word. The resulting data set has 8,602 instances, of which 96.8% are 'background'. Badino and Clark (2008) experiment with different kernel settings for the SVM and obtain the best result of 97.19% using a second-order polynomial kernel and leave-one-out testing.

In contrast to all approaches above, we target the analysis of written texts, for which prosodic and acoustic information is not available, so we must rely on lexis, syntax and semantics exclusively. Also, the vast majority of the approaches discussed make direct use of the manually annotated information in the corpus they use in order to derive their features. While this is a viable approach when the aim is to determine the relevant factors for focus detection, it does not represent a real-life case, where annotated data is often unavailable. In our focus detection model, we only use automatically determined annotation as the basis for our features for predicting focus.

Since our approach also makes use of question properties, it is worth mentioning that there are a number of approaches on Answer Typing as a step in Question Answering (QA) in order to constrain the search space of possible candidate answers and improve accuracy. While earlier approaches such as Li and Roth (2002) used a fixed set of answer types for classifying factoid questions, other approaches such as Pinchak and Lin (2006) avoid assigning pre-determined classes to questions and instead favor a more data-driven label set. In more recent work, Lally et al. (2012) use a sophisticated combination of deep parsing, lexical clues and broader question labels to analyze questions.

3 Data

The present work is based on the German CREG corpus (Ott et al., 2012). CREG contains responses by American learners of German to comprehension questions on reading texts. Each response is rated by two teaching assistants with regard to whether it answers the question or not. While many responses contain ungrammatical language, the explicit questions in CREG generally make it possible to interpret responses. More importantly for our work, they can be seen as Questions Under Discussion and thus form an ideal foundation for focus annotation in authentic data.

As a starting point for the automatic detection of focus, we used the CREG-ExpertFocus data set (De Kuthy et al., 2016) containing 3,187 student answers and 990 target answers (26,980 words in total). It was created using the incremental annotation scheme described in Ziai and Meurers (2014), where annotators first look at the surface question form, then determine the set of alternatives, and finally mark instances of the alternative set in answers. De Kuthy et al. (2016) report substantial agreement in CREG-ExpertFocus (κ ≥ .7) and provide an adjudicated gold standard, which thus presents a high-quality basis for training our focus detection classifier.

4 Focus Detection Model

As described in section 3 above, focus was marked in a span-based way in the data set used: each instance of focus starts at a specific word and ends at another word. Since in principle any part of speech can be focused, we cannot constrain ourselves to a pre-defined set of markables for automatic classification. We therefore conceptualized the task of automatic focus detection on a per-word level: for each word in an answer, as identified by the OpenNLP tokenizer and sentence segmenter², the classifier needs to decide whether it is an instance of focus or background. Besides the choice of classification algorithm, the crucial question naturally is the choice of linguistic features, which we turn to next.

² http://opennlp.apache.org
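Concretely, the span-based gold annotation can be projected to word-level labels along the following lines. This is a minimal illustrative sketch in Python, not the authors' released pipeline; the token-offset encoding of the spans is our assumption.

    # Minimal sketch (not the original code): turning span-based focus annotation
    # into the per-word labels used for classification. Focus spans are assumed
    # to be given as (start, end) token offsets, end-exclusive.

    def spans_to_word_labels(tokens, focus_spans):
        """Label each token as 'focus' or 'background' given focus spans."""
        labels = ["background"] * len(tokens)
        for start, end in focus_spans:
            for i in range(start, min(end, len(tokens))):
                labels[i] = "focus"
        return list(zip(tokens, labels))

    # Example corresponding to (1): "John showed Mary the pictures ."
    tokens = ["John", "showed", "Mary", "the", "pictures", "."]
    print(spans_to_word_labels(tokens, [(3, 5)]))
    # [('John', 'background'), ..., ('the', 'focus'), ('pictures', 'focus'), ('.', 'background')]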

4.1 Features

Various types of linguistic information on different linguistic levels can in principle be relevant for focus identification, from morphology to semantics. We start by exploring five groups of features, which are outlined below. In section 7, we discuss two more groups designed to address specific problems observed with the initial model.

Syntactic answer properties (SynAns) A word's part-of-speech and syntactic function are relevant general indicators with respect to focus: since we are dealing with meaning alternatives, the meaning of e.g. a noun is more likely to denote an alternative than a grammatical function word such as a complementizer or article. Similarly, a word in certain dependency relations is potentially a stronger indicator for a focused alternative in a sentence than a word in others. We therefore included two features in our model: the word's part-of-speech tag in the STTS tag set (Schiller et al., 1995) determined using TreeTagger (Schmid, 1994), and the dependency relation to the word's head in the Hamburg dependency scheme (Foth et al., 2014, p. 2327) determined using MaltParser (Nivre et al., 2007).

Question properties The question constitutes the direct context for the answer and dictates its information structure and the information requirements to fulfill. In particular, the type of wh-word (if present) of a question is a useful indicator of the type of required information: a who-question, such as 'Who rang the doorbell?', will typically be answered with a noun phrase, such as 'the milkman'. We identified surface question forms such as who, what, how etc. using a regular expression approach developed by Rudzewitz (2015) and included them as features. Related to question forms, we also extracted the question word's dependency relation to its head, analogous to the answer feature described above.

Surface givenness As a rough and robust approximation to information status, we add a boolean feature indicating the presence of the current word in the question. We use the lemmatized form of the word as determined by TreeTagger (Schmid, 1994).

Positional properties Where a word occurs in the answer or the question can be relevant for its information structural status. It has been observed since Halliday (1967) that given material tends to occur earlier in sentences (here: answers), while new or focused content tends to occur later. We encode this observation in three different features: the position of the word in the answer (normalized by sentence length), the distance from the finite verb (in words), and the position of the word in the question (if it is given).

Conjunction features To explicitly tie answer properties to question properties, we explored different combinations of the features described above. Specifically, we encoded the current word's POS depending on the question form, and the current word's POS depending on the wh-word's POS. To constrain the feature space and get rid of unnecessary distinctions, we converted the answer word's POS to a coarse-grained version before computing these features, which collapses all variants of determiners, pronouns, adjectives/adverbs, prepositions, nouns and verbs into one label, respectively.³

5 Intrinsic Evaluation

5.1 Setup

To employ the features described above in an actual classifier, we trained a logistic regression model using the WEKA toolkit (Hall et al., 2009). We also experimented with other classification algorithms such as SVMs, but found that they did not offer superior performance for this task. The data set used consists of all expert focus annotation available (3,187 student answers, see section 3), with the exception of the answers occurring in the extrinsic evaluation test set we use in section 8, which leaves a total of 2,240 student answers with corresponding target answers and questions. We used 10-fold cross-validation on this data set to experiment and select the optimal model for focus detection.

³ For a list (in German) of the full tag set, see http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html
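The following sketch illustrates how such per-word feature dictionaries can feed a logistic regression classifier. It is our approximation in Python with scikit-learn; the original model was trained with WEKA, and the feature values, question-form regex, and toy words below are simplified stand-ins for the groups described above.

    # Hedged illustration of the classification setup: the paper trains logistic
    # regression in WEKA; scikit-learn is used here as a stand-in. Feature names,
    # the toy instances and the question-form regex are simplified assumptions.
    import re
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def question_form(question):
        """Crude surface question-form detector (stand-in for Rudzewitz 2015)."""
        match = re.search(r"\b(wer|was|wann|wo|warum|wie|welche\w*)\b", question.lower())
        return match.group(1) if match else "none"

    def word_features(i, words, question_lemmas, q_form):
        """Feature dictionary for the i-th answer word."""
        w = words[i]
        return {
            "pos": w["pos"],                              # SynAns: POS tag
            "deprel": w["deprel"],                        # SynAns: dependency relation
            "question_form": q_form,                      # question property
            "given": w["lemma"] in question_lemmas,       # surface givenness
            "rel_position": i / len(words),               # positional property
            "pos+qform": w["pos"] + "|" + q_form,         # conjunction feature
        }

    # Toy answer to a 'was'-question; POS and dependency labels are illustrative only.
    q_form = question_form("Was zeigte John Mary?")
    question_lemmas = {"was", "zeigen", "john", "mary"}
    words = [
        {"pos": "NE", "deprel": "SUBJ", "lemma": "john"},
        {"pos": "VVFIN", "deprel": "ROOT", "lemma": "zeigen"},
        {"pos": "NN", "deprel": "OBJA", "lemma": "bild"},
    ]
    X = [word_features(i, words, question_lemmas, q_form) for i in range(len(words))]
    y = ["background", "background", "focus"]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    # In the paper, model selection used 10-fold cross-validation over all 26,980 words.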

5.2 Results

Table 1 lists the accuracies⁴ obtained for our different feature groups, as well as three baselines: a POS baseline, following Sridhar et al. (2008), a baseline that only includes the simple givenness feature, and the majority baseline. The majority class is focus, occurring in 58.1% of the 26,980 cases (individual words).

                               Accuracy for
    Feature set             focus    backgr.   both
    Majority baseline       100%      0%       58.1%
    Givenness baseline       81.5%   42.5%     65.1%
    POS baseline             89.2%   39.6%     68.4%
    SynAns                   82.8%   50.3%     69.2%
    + Question               83.8%   53.1%     70.9%
    + Given                  84.8%   62.0%     74.8%
    + Position               84.9%   66.5%     77.2%
    + Conjunction            85.2%   66.7%     77.4%

    Table 1: Initial focus detection model

⁴ We show per-class and overall accuracies; the former is also known as recall or true positive rate.

We can see that each feature group incrementally adds to the final model's performance, with particularly noticeable boosts coming from the givenness and positional features. Another clear observation is that the classifier is much better at detecting focus than background, possibly also due to the skewedness of the data set. Note that performance on background increases also with the addition of the 'Question' feature set, indicating the close relation between the set of alternatives introduced by the question and the focus selecting from that set, even though our approximation to computationally determining alternatives in questions is basic. It is also clear that the information intrinsic in the answers, as encoded in the 'SynAns' and 'Position' feature sets, already provides significant performance benefits, suggesting that a classifier trained only on these features could be trained and applied to settings where no explicit questions are available.

6 Qualitative Analysis

In order to help explain the gap between automatic and manual focus annotation, let us take a step back from quantitative evaluation and examine a few characteristic examples in more detail.

    Q: Warum sollte man Dresden besuchen?
       'Why should one visit Dresden?'
    A: 'One should visit Dresden because it has much to offer.'

    Figure 1: Focus with a faulty gap in between

Figure 1 shows a case where a why-question is answered with an embedded 'weil' (because) clause. The classifier successfully marked 'weil' and the end of the clause as focus, but left out the pronoun 'es' (it) in the middle, presumably because pronouns are given and often not focused in other answers. We did experiment with using a sequence classification approach in order to remedy such problems, but it performed worse overall than the logistic regression model we presented in section 4. We therefore suggest that in such cases, a global constraint stating that why-questions are typically answered with a full clause would be a more promising approach, combining knowledge learned bottom-up from data with top-down linguistic insight.

    Q: Aus welchen drei Organen besteht eine Aktiengesellschaft?
       'Which three institutions does a corporation consist of?'
    A: 'A corporation consists of the general assembly, the supervisory board and the steering committee.'

    Figure 2: Focus with a faulty outlier (and a faulty gap)

In Figure 2, we can see two different problems. One is again a faulty gap, namely the omission of the conjunction 'und' (and). The other is the focus marking of the word 'AG' (corporation) in the beginning of the sentence: since the question asks for an enumeration of the institutions that form a corporation, marking 'AG' as focused is erroneous. This problem likely occurs often with nouns because the classifier has learned that content words are often focused. Moreover, the surface givenness feature does not encode that 'AG' is in fact an abbreviation of 'Aktiengesellschaft' and therefore given. It would thus be beneficial to extend our analysis of givenness beyond surface identity, a direction we explore in the next section.

    Q: Welche Sehenswürdigkeiten gibt es in der Stadt?
       'Which places of interest are in the city?'
    A: 'The city exists the Dresden Zwinger, the Frauenkirche, the Semperoper, the Royal Palace.'

    Figure 3: Enumeration with correct focus

Finally, Figure 3 presents a case where an enumeration is marked correctly, including the conjunctive punctuation in between, showing that cases of longer foci are indeed within reach for a word-by-word focus classifier.

7 Extending the Model

Based on our analysis of problematic cases outlined in the previous section, we explored two different avenues for improving our focus detection model, which we describe below.

7.1 Distributional Givenness

We have seen in section 5.2 that surface-based givenness is helpful in predicting focus. However, it clearly has limitations, as for example synonymy cannot be captured on the surface. We also exemplified one such limitation in Figure 2. In order to overcome these limitations, we implemented an approach based on distributional semantics. This avenue is motivated by the fact that Ziai et al. (2016) have shown Givenness modeled as distributional similarity to be helpful for SAA at least in some cases. We used the word vector model they derived from the DeWAC corpus (Baroni et al., 2009) using word2vec's continuous bag-of-words training algorithm with hierarchical softmax (Mikolov et al., 2013). The model has a vocabulary of 1,825,306 words and uses 400 dimensions for each.

Having equipped ourselves with a word vector model, the question arises how to use it in focus detection in such a way that it complements the positive impact that surface-based givenness already demonstrates. Rather than using an empirically determined (and hence data-dependent) threshold for determining givenness as done by Ziai et al. (2016), we here use raw cosine similarities⁵ as features and let the classifier assign appropriate weights to them during training. Concretely, we calculate the maximum, minimum and average cosine between the answer word and the question words. As a fourth feature, we calculate the cosine between the answer word and the additive question word vector, which is the sum of the individual question word vectors.

⁵ We normalize cosine similarity as cosine distance to obtain positive values between 0 and 2: dist = 1 - sim.

7.2 Constituency-based Features

Another source of evidence we wanted to exploit is constituency-based syntactic annotation. So far, we have worked with part-of-speech tags and dependency relations as far as syntactic representation is concerned. However, while discontinuous focus is possible, focus as operationalized in the scheme by Ziai and Meurers (2014) most often marks an adjacent group of words, a tendency that our word-based classifier did not always follow, as exemplified by the cases in Figures 1 and 2. Such groups very often correspond to a syntactic phrase, so constituent membership is likely indicative in predicting the focus status of an individual word. Similarly, the topological field (Höhle, 1986) identifying the major section of a sentence in relation to the clausal main verb is potentially relevant for a word's focus status.

Cheung and Penn (2009) present a parsing model that demonstrates good performance in determining both topological fields and phrase structure for German. The model is trained on the TüBa-D/Z treebank (Telljohann et al., 2004), whose rich syntactic model encodes topological fields as nodes in the syntax tree itself. Following Cheung and Penn (2009), we trained an updated version of their model using the current version of the Berkeley Parser (Petrov and Klein, 2007) and release 10 of the TüBa-D/Z.⁶ Based on the new parsing model, we integrated two new features into our focus detection model: the direct parent constituent node of a word and the nearest topological field node of a word.

⁶ http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
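To make the four distributional features of section 7.1 concrete, here is a small numerical sketch. It is our illustration, not the released pipeline; the toy vectors stand in for the 400-dimensional DeWAC word2vec model, which would typically be loaded with a library such as gensim.

    # Hedged sketch of the section 7.1 features: max/min/average cosine between an
    # answer word and the question words, plus the cosine to the additive question
    # vector. Uses numpy only; the toy vectors below are made up.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def givenness_features(answer_vec, question_vecs):
        """Four distributional givenness features for one answer word."""
        sims = [cosine(answer_vec, q) for q in question_vecs]
        additive = np.sum(question_vecs, axis=0)   # sum of question word vectors
        return {
            "cos_max": max(sims),
            "cos_min": min(sims),
            "cos_avg": sum(sims) / len(sims),
            "cos_additive": cosine(answer_vec, additive),
        }

    # Toy 3-dimensional vectors instead of the 400-dimensional DeWAC model.
    answer_word = np.array([0.2, 0.7, 0.1])
    question_words = np.array([[0.1, 0.8, 0.0], [0.9, 0.1, 0.3]])
    print(givenness_features(answer_word, question_words))

Following footnote 5, the similarities could additionally be rescaled as dist = 1 - sim to obtain values between 0 and 2.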

7.3 Final Results

Table 2 shows the impact of the new feature groups discussed above.

                                 Accuracy for
    Feature set               focus    backgr.   both
    Majority baseline         100%      0%       58.1%
    Givenness baseline         81.5%   42.5%     65.1%
    POS baseline               89.2%   39.6%     68.4%
    Initial model (sec. 5.2)   85.2%   66.7%     77.4%
    + dist. Givenness          84.7%   68.0%     77.7%
    + constituency             84.8%   68.7%     78.1%

    Table 2: Final focus detection performance

While the improvements may seem modest quantitatively, they show that the added features are well-motivated and do make an impact. Overall, it is especially apparent that the key to better performance is reducing the number of false positives in this data set: while the accuracy for focus stays roughly the same, the one for background improves steadily with each feature set addition.

8 Extrinsic Evaluation

Complementing the intrinsic evaluation above, in this section we demonstrate how focus can be successfully used to improve performance in an authentic CL task, namely Short Answer Assessment (SAA).

8.1 Setup

It has been pointed out that evaluating the annotation of a theoretical linguistic notion only intrinsically is problematic because there is no non-theoretical grounding involved (Riezler, 2014). Therefore, besides a comparison to the gold standard, we also evaluated the resulting annotation in a larger computational task, the automatic meaning assessment of short answers to reading comprehension questions. Here the goal is to decide, given a question (Q) and a correct target answer (TA), whether the student answer (SA) actually answers the question or not. An example from Meurers et al. (2011) is shown in Figure 4.

    Figure 4: Short Answer Assessment example

We used the freely available CoMiC system (Comparing Meaning in Context, Meurers et al. 2011) as a testbed for our experiment. CoMiC is an alignment-based system operating in three stages:

1. Annotating linguistic units (words, chunks and dependencies) in student and target answer on various levels of abstraction

2. Finding alignments of linguistic units between student and target answer based on annotation (see Figure 4)

3. Classifying the student answer based on number and type of alignments (see Table 3), using a supervised machine learning setup

    Feature                 Description
    1. Keyword Overlap      Percent of dependency heads aligned (relative to target)
    2./3. Token Overlap     Percent of aligned target/student tokens
    4./5. Chunk Overlap     Percent of aligned target/student chunks (as identified by OpenNLP, http://opennlp.apache.org/)
    6./7. Triple Overlap    Percent of aligned target/student dependency triples
    8. Token Match          Percent of token alignments that were token-identical
    9. Similarity Match     Percent of token alignments resolved using PMI-IR (Turney, 2001)
    10. Type Match          Percent of token alignments resolved using the GermaNet hierarchy (Hamp and Feldweg, 1997)
    11. Lemma Match         Percent of token alignments that were lemma-resolved
    12. Synonym Match       Percent of token alignments sharing the same GermaNet synset
    13. Variety of Match    Number of kinds of token-level alignments (0-5; features 8-12)

    Table 3: Standard features in the CoMiC system

In stage 2, CoMiC integrates a simplistic approach to givenness, excluding all words from alignment that are mentioned in the question. We transferred the underlying method to the notion of focus and implemented a component that excludes all non-focused words from alignment, resulting in alignments between focused parts of answers only. The hypothesis is that the alignment of focused elements in answers adds information about the quality of the answer with respect to the question, leading to a higher answer classification accuracy.

We experimented with two different settings involving the standard CoMiC system and a focus-augmented variant: i) using standard CoMiC with the givenness filter by itself as a baseline, and ii) augmenting standard CoMiC by additionally producing a focus version of each classification feature in Table 3. In each case, we used WEKA's k-nearest-neighbor implementation for CoMiC, following positive results by Rudzewitz (2016).

We use two test sets randomly selected from the CREG-5K data set (Ziai et al., 2016), one based on an 'unseen answers' and one based on an 'unseen questions' test scenario, following the methodology of Dzikovska et al. (2013): in 'unseen answers', the test set can contain answers to the same questions already part of the training set (but not the answers themselves), whereas in 'unseen questions' both questions and answers are new in the test set. In order to arrive at a fair and generalizable testing setup, we removed all answers from the CREG-5K training set that also occur in the CREG-ExpertFocus set used to train our focus detection classifier. This ensures that neither the focus classifier nor CoMiC have seen any of the test set answers before. The resulting smaller training set contains 1606 student answers, while the test sets contain 1002 (unseen answers) and 1121 (unseen questions), respectively.
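To illustrate the focus filter applied in stage 2, the following sketch restricts alignment to focused words and recomputes a token overlap feature. It is our illustration, not the actual CoMiC implementation; the data structures and the lemma-identity alignment are simplifying assumptions.

    # Hedged sketch of the focus filter described above: only words predicted as
    # focused are eligible for alignment, and overlap features are computed over
    # the remaining alignments. Alignment is reduced to lemma identity here;
    # CoMiC uses several alignment types (cf. Table 3).

    def focus_filter(words, focus_labels):
        """Keep only the words the focus classifier labeled as 'focus'."""
        return [w for w, label in zip(words, focus_labels) if label == "focus"]

    def token_overlap(target_words, student_words):
        """Percent of target tokens that can be aligned to a student token."""
        student_lemmas = {w["lemma"] for w in student_words}
        aligned = [w for w in target_words if w["lemma"] in student_lemmas]
        return len(aligned) / len(target_words) if target_words else 0.0

    # Usage: compute the standard feature and its focus-restricted counterpart.
    # focus_ta, focus_sa = focus_filter(ta, ta_labels), focus_filter(sa, sa_labels)
    # features["token_overlap"] = token_overlap(ta, sa)
    # features["token_overlap_focus"] = token_overlap(focus_ta, focus_sa)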

8.2 Results

Table 4 summarizes the results for the different CoMiC variants and test sets in terms of accuracy in classifying answers as correct vs. incorrect. 'Standard CoMiC' refers to the standard CoMiC system and '+Focus' refers to the augmented system using both feature versions. For reference on what is possible with focus information, we provide the results of the oracle experiment by De Kuthy et al. (2016), even though the test setup and data setup are slightly different. In addition to our two test sets introduced above, we tested the systems on the training set using 10-fold cross-validation. We also provide the majority baseline of the respective data set along with the majority class.

    Test set            Instances   Majority baseline   CoMiC   +Focus
    leave-one-out*        3187      51.0% (correct)     83.2%   85.6%
    10-fold CV            1606      54.4% (correct)     83.2%   82.3%
    Unseen answers        1002      51.3% (correct)     80.6%   80.5%
    Unseen questions      1121      51.1% (incorrect)   77.4%   78.4%

    * Oracle experiment reported by De Kuthy et al. (2016) on CREG-ExpertFocus

    Table 4: CoMiC results on different test sets using standard and focus-augmented features

One can see that in general, the focus classifier seems to introduce too much noise to positively impact classification results. The standard CoMiC system outperforms the focus-augmented version for the cross-validation case and the 'unseen answers' set. This is in contrast to the experiments reported by De Kuthy et al. (2016) using manual focus information, where the augmented system clearly outperforms all other variants. This shows that while focus information is clearly useful in Short Answer Assessment, it needs to be reliable enough to be of actual benefit. Recall also that the way we use focus information in CoMiC implies a strong commitment: only focused words are aligned and included in feature extraction, which does not produce the desired result if the focus information is not accurate. A possible way of remedying this situation would be to use focus as an extra feature or less strict modifier of existing features. There is thus room for improvement both in the automatic detection of focus and its use in extrinsic tasks.

However, one result stands out encouragingly: in the 'unseen questions' case, the focus-augmented version beats standard CoMiC, if only by a relatively small margin. This shows that even automatically determined information structural properties provide benefits when more concrete information, in the form of previously seen answers to the same questions, is not available. Our classifier thus successfully transfers general knowledge about focus to new question material.

9 Conclusion

We presented the first automatic focus detection approach for written data, and the first such approach for German. The approach uses a rich feature set including abstractions to grammatical notions (parts of speech, dependencies), aspects captured by a topological field model of German, an approximation of Givenness, and the relation between material in the answer and that of the question word.

Using a word-by-word classification approach that takes into account both syntactic and semantic properties of answer and question words, we achieve an accuracy of 78.1% on a data set of 26,980 words in 10-fold cross-validation. The focus detection pipeline developed for the experiment is freely available to other researchers.

Complementing the intrinsic evaluation, we provide an extrinsic evaluation of the approach as part of a larger CL task, the automatic content assessment of answers to reading comprehension questions. We show that while automatic focus detection does not yet improve content assessment for answers similar to the ones previously seen, it does provide a benefit in test cases where the questions and answers are completely new, i.e., where the system needs to generalize beyond the specific cases and contexts previously seen.

Contextualizing our work, one can see two different strands of research in the automatic analysis of focus. In comparison to Calhoun (2007) and follow-up approaches, who mainly concentrate on linking prosodic prominence to focus in dialogues, we do not limit our analysis to content words, but analyze every word of an utterance. This is made feasible due to the explicit task context we have in the form of answers to reading comprehension questions. We believe this nicely illustrates two avenues for obtaining relevant evidence on information structure: On the one hand, there is evidence obtained bottom-up through the data, such as the rich information on prominence in the spoken language corpus used by Calhoun (2007). On the other hand, there is top-down evidence from the task context, which sets up expectations about what is to be addressed for the current question under discussion. Following the QUD research strand, the approach presented in this paper could be scaled up beyond explicit question-answer pairs: De Kuthy et al. (2018) spell out an explicit analysis of text in terms of QUDs and show that it is possible to annotate explicit QUDs with high inter-annotator agreement. Combined with an automated approach to question generation, it could thus be possible to recover implicit QUDs from text and subsequently apply our current approach to any text, based on an independently established, general formal pragmatic analysis.

Finally, the qualitative analysis we exemplified is promising in terms of obtaining valuable insights to be addressed in future work. For example, the analysis identified faulty gaps in focus marking. In future work, integrating insights from theoretical linguistic approaches to focus and the notion of focus projection established there (cf., e.g., De Kuthy and Meurers 2012) could provide more guidance for ensuring contiguity of focus domains.

Acknowledgements

We would like to thank Kordula De Kuthy and the anonymous reviewers for detailed and helpful comments on different versions of this paper. This work has been funded by the Deutsche Forschungsgemeinschaft through Collaborative Research Center 833.

References

Leonardo Badino and Robert A. J. Clark. 2008. Automatic labeling of contrastive word pairs from spontaneous spoken English. In Proceedings of the 2008 IEEE Spoken Language Technology Workshop, pages 101-104.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3):209-226.

Aoife Cahill and Arndt Riester. 2012. Automatically acquiring fine-grained information status distinctions in German. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 232-236.

Sasha Calhoun. 2007. Predicting focus through prominence structure. In Proceedings of Interspeech. Antwerp, Belgium.

Sasha Calhoun, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation 44:387-419.

Sasha Calhoun, Malvina Nissim, Mark Steedman, and Jason Brenier. 2005. A framework for annotating information structure in discourse. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 45-52. Ann Arbor, Michigan.

Jackie Chi Kit Cheung and Gerald Penn. 2009. Topological field parsing of German. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 64-72.

Kordula De Kuthy and Detmar Meurers. 2012. Focus projection between theory and evidence. In Sam Featherston and Britta Stolterfoht, editors, Empirical Approaches to Linguistic Theory – Studies in Meaning and Structure, pages 207-240. De Gruyter.

Kordula De Kuthy, Nils Reiter, and Arndt Riester. 2018. QUD-based annotation of discourse structure and information structure: Tool and evaluation. In Proceedings of the 11th Language Resources and Evaluation Conference. Miyazaki, Japan.

Kordula De Kuthy, Ramon Ziai, and Detmar Meurers. 2016. Focus annotation of task-based data: a comparison of expert and crowd-sourced annotation in a reading comprehension corpus. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference (LREC), pages 3928-3934. Portorož, Slovenia.

Stefanie Dipper, Michael Götze, and Stavros Skopeteas, editors. 2007. Information Structure in Cross-Linguistic Corpora: Annotation Guidelines for Phonology, Morphology, Syntax, Semantics and Information Structure, volume 7 of Interdisciplinary Studies on Information Structure. Universitätsverlag Potsdam, Potsdam, Germany.

Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, and Hoa Trang Dang. 2013. SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 263-274. Atlanta, Georgia.

Kilian A. Foth, Arne Köhn, Niels Beuck, and Wolfgang Menzel. 2014. Because size does matter: The Hamburg dependency treebank. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2326-2333. Reykjavik, Iceland.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations 11:10-18.

Michael Halliday. 1967. Notes on transitivity and theme in English, Part 1 and 2. Journal of Linguistics 3:37-81, 199-244.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet – a lexical-semantic net for German. In Proceedings of the ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Madrid.

Christian F. Hempelmann, David Dufty, Philip M. McCarthy, Arthur C. Graesser, Zhiqiang Cai, and Danielle S. McNamara. 2005. Using LSA to automatically identify givenness and newness of noun phrases in written discourse. In Proceedings of the 27th Annual Meeting of the Cognitive Science Society, pages 941-949. Stresa, Italy.

Tilman N. Höhle. 1986. Der Begriff 'Mittelfeld'. Anmerkungen über die Theorie der topologischen Felder. In A. Schöne, editor, Kontroversen alte und neue. Akten des VII. Internationalen Germanistenkongresses Göttingen 1985, Bd. 3, pages 329-340. Niemeyer, Tübingen.

Manfred Krifka and Renate Musan. 2012. Information structure: overview and linguistic issues. In Manfred Krifka and Renate Musan, editors, The Expression of Information Structure, pages 1-43. De Gruyter Mouton, Berlin/Boston.

Ivana Kruijff-Korbayová and Mark Steedman. 2003. Discourse and information structure. Journal of Logic, Language and Information (Introduction to the Special Issue) 12(3):249-259.

Adam Lally, John M. Prager, Michael C. McCord, Branimir K. Boguraev, Siddharth Patwardhan, James Fan, Paul Fodor, and Jennifer Chu-Carroll. 2012. Question analysis: How Watson reads a clue. IBM Journal of Research and Development 56(3/4):2:1-14.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pages 1-7. Taipei, Taiwan.

Detmar Meurers, Ramon Ziai, Niels Ott, and Janina Kopp. 2011. Evaluating answers to reading comprehension questions in context: Results for German and the role of information structure. In Proceedings of the TextInfer 2011 Workshop on Textual Entailment, pages 1-9. Edinburgh.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Malvina Nissim. 2006. Learning information status of discourse entities. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Sydney, Australia.

Malvina Nissim, Shipra Dingare, Jean Carletta, and Mark Steedman. 2004. An annotation scheme for information status in dialogue. In Proceedings of the 4th Conference on Language Resources and Evaluation. Lisbon, Portugal.

Joakim Nivre, Jens Nilsson, Johan Hall, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(1):1-41.

Niels Ott, Ramon Ziai, and Detmar Meurers. 2012. Creation and analysis of a reading comprehension exercise corpus: Towards evaluating meaning in context. In Thomas Schmidt and Kai Wörner, editors, Multilingual Corpora and Multilingual Corpus Analysis, Hamburg Studies in Multilingualism (HSM), pages 47-69. Benjamins, Amsterdam.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404-411. Rochester, New York.

Christopher Pinchak and Dekang Lin. 2006. A probabilistic answer type model. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 393-400.

Arndt Riester, David Lorenz, and Nina Seemann. 2010. A recursive annotation scheme for referential information status. In Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta.

Stefan Riezler. 2014. On the problem of theoretical terms in empirical computational linguistics. Computational Linguistics 40(1):235-245.

Julia Ritz, Stefanie Dipper, and Michael Götze. 2008. Annotation of information structure: An evaluation across different types of texts. In Proceedings of the 6th International Conference on Language Resources and Evaluation, pages 2137-2142. Marrakech, Morocco.

Craige Roberts. 2012. Information structure in discourse: Towards an integrated formal theory of pragmatics. Semantics and Pragmatics 5(6):1-69.

Mats Rooth. 1992. A theory of focus interpretation. Natural Language Semantics 1(1):75-116.

Björn Rudzewitz. 2015. Alignment Weighting for Short Answer Assessment. Bachelor's thesis, University of Tübingen.

Björn Rudzewitz. 2016. Exploring the intersection of short answer assessment, authorship attribution, and plagiarism detection. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 235-241. San Diego, CA.

Anne Schiller, Simone Teufel, and Christine Thielen. 1995. The Stuttgart-Tübingen Tagset (STTS). Technical report, Universität Stuttgart and Universität Tübingen, Germany.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44-49. Manchester, UK.

Vivek Kumar Rangarajan Sridhar, Ani Nenkova, Shrikanth Narayanan, and Dan Jurafsky. 2008. Detecting prominence in conversational speech: pitch accent, givenness and focus. In Proceedings of Speech Prosody, pages 380-388. Campinas, Brazil.

Arnim von Stechow. 1991. Focusing and backgrounding operators. In W. Abraham, editor, Discourse Particles, pages 37-84. John Benjamins, Amsterdam/Philadelphia.

Manfred Stede. 2012. Computation and modeling of information structure. In Manfred Krifka and Renate Musan, editors, The Expression of Information Structure, pages 363-408. De Gruyter Mouton, Berlin/Boston.

Heike Telljohann, Erhard Hinrichs, and Sandra Kübler. 2004. The TüBa-D/Z treebank: Annotating German with a context-free backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). Lisbon.

Peter Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), pages 491-502. Freiburg, Germany.

Ramon Ziai, Kordula De Kuthy, and Detmar Meurers. 2016. Approximating Givenness in content assessment through distributional semantics. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM), pages 209-218. ACL, Berlin, Germany.

Ramon Ziai and Detmar Meurers. 2014. Focus annotation in reading comprehension data. In Proceedings of the 8th Linguistic Annotation Workshop (LAW VIII, 2014), pages 159-168. Dublin, Ireland.