On the Relevance of Syntactic and Discourse Features for Author Profiling and Identification

Juan Soler-Company
UPF
Carrer de Roc Boronat 138
Barcelona, 08018, Spain
[email protected]

Leo Wanner
UPF and ICREA
Carrer de Roc Boronat 138
Barcelona, 08018, Spain
[email protected]

Abstract

The majority of approaches to author profiling and author identification focus mainly on lexical features, i.e., on the content of a text. We argue that syntactic dependency and discourse features play a significantly more prominent role than they were given in the past. We show that they achieve state-of-the-art performance in author and gender identification on a literary corpus while keeping the feature set small: the used feature set is composed of only 188 features and still outperforms the winner of the PAN 2014 shared task on author verification in the literary genre.

1 Introduction

Author profiling and author identification are two tasks in the context of the automatic derivation of author-related information from textual material. In the case of author profiling, demographic author information such as gender or age is to be derived; in the case of author identification, the goal is to predict the author of a text, selected from a pool of potential candidates. The basic assumption underlying author profiling is that, as a result of being exposed to similar influences, authors who share demographic traits also share linguistic patterns in their writings. The assumption underlying author identification is that the style of an author is unique enough to be characterized accurately and to be distinguishable from the style of other authors. State-of-the-art approaches commonly use large amounts of lexical features to address both tasks. We show that with a small number of features, most of them syntactic or discourse-based, we outperform the best models in the PAN 2014 author verification shared task (Stamatatos et al., 2014) on a literary genre dataset and achieve state-of-the-art performance in author and gender identification on a different literary corpus.

In the next section, we briefly review the related work. In Section 3, we describe the experimental setup and the features that are used in the experiments. Section 4 presents the experiments and their discussion. Finally, in Section 5, we draw some conclusions and sketch the future line of our research in this area.

2 Related Work

Author identification in the context of the literary genre has attracted attention beyond NLP research circles, e.g., due to the work by Aljumily (2015), who addressed the allegations that Shakespeare did not write some of his best plays, using clustering techniques with function word frequencies, word n-grams and character n-grams. Another example of this type of work is (Gamon, 2004), where the author classifies the writings of the Brontë sisters using as features the sentence length, the number of nominal/adjectival/adverbial phrases, function word frequencies, part-of-speech (PoS) trigrams, constituency patterns, semantic information and n-gram frequencies. In the field of author profiling, several works have addressed gender identification specifically. Schler et al. (2006) and Koppel et al. (2002) extract function words, PoS tags and the 1,000 words with the highest information gain. Sarawgi et al. (2011) use long-distance syntactic patterns based on probabilistic context-free grammars, token-level language models and character-level language models.

In what follows, we focus on the identification of the author profiling trait 'gender' and on author identification as such. For both, feature engineering is crucial, and for both the tendency is to use word/character n-grams and/or function word and stop word frequencies (Mosteller and Wallace, 1963; Aljumily, 2015; Gamon, 2004; Argamon et al., 2009), PoS tags (Koppel et al., 2002; Mukherjee and Liu, 2010), or patterns derived from context-free grammars; see, e.g., (Raghavan et al., 2010; Sarawgi et al., 2011; Gamon, 2004).

When syntactic features are mentioned, often function words and punctuation marks are meant; see, e.g., (Amuchi et al., 2012; Abbasi and Chen, 2005; Cheng et al., 2009). However, it is well known from linguistics and philology that deeper syntactic features, such as sentence structure, the frequency of specific phrasal and syntactic dependency patterns, and discourse structure are relevant characteristics of the writing style of an author (Crystal and Davy, 1969; DiMarco and Hirst, 1993; Burstein et al., 2003).

3 Experimental Setup

State-of-the-art techniques for author profiling / identification usually draw upon large quantities of features; e.g., Burger et al. (2011) use more than 15 million features, and Argamon et al. (2009) and Mukherjee and Liu (2010) more than 1,000. This limits their application in practice. Our goal is to demonstrate that the use of syntactic dependency and discourse features allows us to reduce the total number of features to fewer than 200 and still achieve competitive performance with a standard classification technique. For this purpose, we use Support Vector Machines (SVMs) with a linear kernel in four different experiments. Let us now introduce these features and the data on which the trained models have been tested.

3.1 Feature Set

We extracted 188 surface-oriented, syntactic dependency, and discourse structure features for our experiments. The surface-oriented features are few, since syntactic and discourse structure features are assumed to reflect the unconscious stylistic choices of the authors better than surface-oriented features do.

For feature extraction, Python and its Natural Language Toolkit, a dependency parser (Bohnet, 2010), and a discourse parser (Surdeanu et al., 2015) are used.

The feature set is composed of six subgroups of features:

Character-based features are composed of the ratios between upper case characters, periods, commas, parentheses, exclamation marks, colons, number digits, semicolons, hyphens and quotation marks, and the total number of characters in a text.

Word-based features are composed of the mean number of characters per word, vocabulary richness, acronyms, stopwords, first person pronouns, the usage of words composed of two or three characters, the standard deviation of word length, and the difference between the longest and shortest words.

Sentence-based features are composed of the mean number of words per sentence, the standard deviation of words per sentence, and the difference between the maximum and minimum number of words per sentence in a text.

Dictionary-based features consist of the ratios of discourse markers, interjections, abbreviations, curse words, and polar words (positive and negative words in the polarity dictionaries described in (Hu and Liu, 2004)) with respect to the total number of words in a text.
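As an illustration, the character-, word- and sentence-based subgroups can be computed along the following lines (a minimal sketch; the regex-based tokenization, the feature names and the helper function are illustrative choices, not our exact implementation):

# Minimal sketch of the surface-oriented feature subgroups. Tokenization and
# feature names are illustrative assumptions, not the exact implementation.
import re
import statistics

def surface_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lens = [len(w) for w in words]
    sent_lens = [len(s.split()) for s in sentences]
    n_chars = len(text)
    return {
        # character-based: ratios of character classes to the text length
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "comma_ratio": text.count(",") / n_chars,
        "period_ratio": text.count(".") / n_chars,
        # word-based: word-length statistics and vocabulary richness
        "mean_chars_per_word": statistics.mean(word_lens),
        "word_len_std": statistics.pstdev(word_lens),
        "word_len_range": max(word_lens) - min(word_lens),
        "vocab_richness": len({w.lower() for w in words}) / len(words),
        # sentence-based: words-per-sentence statistics
        "mean_words_per_sent": statistics.mean(sent_lens),
        "sent_len_std": statistics.pstdev(sent_lens),
        "sent_len_range": max(sent_lens) - min(sent_lens),
    }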
Syntactic features. Three types of syntactic features are distinguished:

1. Part-of-speech features are given by the relative frequency of each PoS tag in a text (we use the Penn Treebank tagset: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), the relative frequency of comparative/superlative adjectives and adverbs, and the relative frequency of the present and past tenses. In addition to the fine-grained Penn Treebank tags, we introduce general grammatical categories (such as 'verb', 'noun', etc.) and calculate their frequencies.

2. Dependency features reflect the occurrence of syntactic dependency relations in the dependency trees of the text. The dependency tagset used by the parser is described in (Surdeanu et al., 2008). We extract the frequency of each individual dependency relation per sentence, the percentage of modifier relations used per tree, the frequency of adverbial dependencies (they give information on manner, direction, purpose, etc.), the ratio of modal verbs with respect to the total number of verbs, and the percentage of verbs that appear in complex tenses, referred to as "verb chains" (VCs).

3. Tree features measure the tree width, the tree depth and the ramification factor. Tree depth is defined as the maximum number of nodes between the root and a leaf node; the width is the maximum number of siblings at any of the levels of the tree; and the ramification factor is the mean number of children per level. In other words, the tree features characterize the complexity of the dependency structure of the sentences. These measures are also applied to subordinate and coordinate clauses.
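The tree features can be made concrete with the following sketch, which assumes a dependency tree encoded as a list of head indices (0 = root) and approximates width and ramification factor by the number of nodes per level:

# Sketch of the tree features; the head-index encoding and the per-level
# approximation of width/ramification are assumptions of this illustration.
from collections import Counter
import statistics

def tree_features(heads):
    def level(i):
        # distance of the token at 1-based position i+1 from the root
        d, h = 1, heads[i]
        while h != 0:
            d, h = d + 1, heads[h - 1]
        return d
    levels = [level(i) for i in range(len(heads))]
    per_level = Counter(levels)
    depth = max(levels)                                 # longest root-to-leaf path
    width = max(per_level.values())                     # widest level of the tree
    ramification = statistics.mean(per_level.values())  # mean nodes per level
    return depth, width, ramification

# e.g., "She quickly left" with 'left' as root: heads = [3, 3, 0]
print(tree_features([3, 3, 0]))  # (2, 2, 1.5)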

Discourse features characterize the discourse structure of a text. To obtain the discourse structure, we use Surdeanu et al. (2015)'s discourse parser, which receives a raw text as input, divides it into Elementary Discourse Units (EDUs) and links them via discourse relations that follow Rhetorical Structure Theory (Mann and Thompson, 1988). We compute the frequency of each discourse relation per EDU (dividing the number of occurrences of each discourse relation by the number of EDUs per text) and additionally take into account the shape of the discourse trees by extracting their depth, width and ramification factor.

3.2 Datasets

We use two datasets. The first dataset is a corpus of chapters (henceforth referred to as "LiteraryDataset") extracted from novels downloaded from the "Project Gutenberg" website (https://www.gutenberg.org/). Novels from 18 different authors were selected. Three novels per author were downloaded and divided by chapter, labeled by the gender and name of the author, as well as by the book they correspond to. All of the authors are British and lived in roughly the same time period. Half of the authors are male and half female. (The 18 selected authors are: Virginia Woolf, Arthur Conan Doyle, Anne Brontë, Charlotte Brontë, Lewis Carroll, Agatha Christie, William Makepeace Thackeray, Oscar Wilde, Maria Edgeworth, Elisabeth Gaskell, Bram Stoker, James Joyce, Jane Austen, Charles Dickens, H.G. Wells, Robert Louis Stevenson, Mary Anne Evans (known as George Eliot) and Margaret Oliphant.) The dataset is composed of 1,793 instances.

The second dataset is publicly available (http://pan.webis.de/clef14/pan14-web/author-identification.html) and was used in the 2014 PAN author verification task (Stamatatos et al., 2014). It contains groups of literary texts that are written by the same author and a text whose author is unknown (henceforth, "PANLiterary").

3.3 Experiments

As already mentioned above, we carried out four experiments; the first three of them on the LiteraryDataset, and the last one on the PANLiterary dataset. The LiteraryDataset experiments targeted gender identification, author identification, and the identification of which of the 54 books a given chapter belongs to, respectively. The PANLiterary experiment dealt with author verification, analogously to the corresponding PAN 2014 shared task.

4 Experiment Results and Discussion

4.1 Gender Identification

The gender identification experiment is cast as a supervised binary classification problem. Table 1 shows in the column 'Accuracy Gen' the performance of the SVM with each feature group separately, as well as with the full set and with some feature combinations. The performance of the majority class classifier (MajClassBaseline) and of four different baselines, where the 300 most frequent token n-grams (2-5 grams were considered) are used as classification features, are also shown for comparison.

Used Features        Accuracy Gen   Accuracy Auth
Complete Set         90.18%         88.34%
Char (C)             67.65%         37.76%
Word (W)             61.79%         38.54%
Sent (S)             60.35%         17.12%
Dict (Dt)            60.62%         17.90%
Discourse (Dc)       69.99%         42.61%
Syntactic (Sy)       88.94%         82.82%
C+W+S+Dt+Dc          80.76%         69.72%
C+W+S+Dt+Sy          89.96%         87.17%
Sy+Dc                89.35%         83.88%
C+W+S+Dt             73.89%         42.55%
MajClassBaseline     53.54%         9.93%
2GramBaseline        79.25%         75.24%
3GramBaseline        75.53%         62.63%
4GramBaseline        72.39%         39.65%
5GramBaseline        65.81%         26.94%

Table 1: Results of the Gender and Author Identification Experiments
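The classification setup itself is standard; a minimal sketch, assuming scikit-learn as the SVM implementation (the toolkit choice and the feature-scaling step are assumptions of the sketch):

# Linear-kernel SVM with 10-fold cross-validation over the 188-dimensional
# feature vectors; scikit-learn and feature scaling are assumptions here.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def evaluate(X, y):
    """X: (n_chapters, 188) feature matrix; y: class labels."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    return cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()

# toy call with random data, just to show the interface
X, y = np.random.rand(40, 188), np.array([0, 1] * 20)
print(evaluate(X, y))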
The n-gram baselines outperform the SVM trained on any individual feature group except the syntactic features, which means that syntactic features are crucial for the characterization of the writing style of both genders. Using only this group of features, the model obtains an accuracy of 88.94%, which is very close to its performance with the complete feature set. When discourse features are added, the accuracy further increases.

4.2 Author Identification

The second experiment classifies the texts from the LiteraryDataset by their authors. It is an 18-class classification problem, which is considerably more challenging. Table 1 (column 'Accuracy Auth') shows the performance of our model with 10-fold cross-validation when using the full set of features and different feature combinations.

The results of the 10-fold author identification experiment show that syntactic dependency features are also the most effective for the characterization of the writing style of the authors. The model with the full set of features obtains 88.34% accuracy, which outperforms the n-gram baselines. The high accuracy of syntactic dependency features compared to the other feature sets proves again that dependency syntax is a very powerful profiling tool that has not been used to its full potential in the field.

Analyzing the confusion matrix of the experiment, some interesting conclusions can be drawn; due to the lack of space, let us focus on only a few of them. For instance, the novels by Elisabeth Gaskell are confused with the novels by Mary Anne Evans, Jane Austen and Margaret Oliphant. This is likely because not only do all of these authors share the gender, but Austen is also considered to be one of the main influences on Gaskell. Even though Agatha Christie is predicted correctly most of the time, when she is confused with another author, it is with Arthur Conan Doyle. This may not be surprising, since Arthur Conan Doyle and, more specifically, the books about Sherlock Holmes greatly influenced her writing, resulting in many detective novels with Detective Poirot as protagonist (Christie's personification of Sherlock Holmes). Other mispredictions (such as the confusion of Bram Stoker with Elisabeth Gaskell) require a deeper analysis and possibly also highlight the need for more training material.

4.3 Source Book Identification

To further prove the profiling potential of syntactic and discourse features, we carried out an additional experiment. The goal was to identify which of the 54 books a given chapter is from, making use of syntactic and discourse features only. Using the same method and 10-fold cross-validation, an accuracy of 83.01% was achieved. The interesting part of this experiment is the error analysis. "Silas Marner", written by Mary Anne Evans (known as George Eliot), is one of the books that created the highest confusion; it is often confused with "Mill on the Floss", written by the same author. "Kidnapped" by Robert Louis Stevenson, which is very different from the other considered books by the same author, is confused with "Treasure Island", also by Stevenson, and "Great Expectations" by Charles Dickens. "Pride and Prejudice" by Jane Austen is confused with "Sense and Sensibility", also by her. The majority of confusions are between books by the same author, which proves our point further: syntactic and discourse structures constitute very powerful, underused profiling features (recall that for this experiment, we used only syntactic and discourse features; none of the features was content- or surface-oriented). When the full set of features was used, the accuracy improved to 91.41%. In that case, the main sources of confusion were between "Agnes Grey" and "The Tenant of Wildfell Hall", both by Anne Brontë, and between "Silas Marner" and "Mill on the Floss", both by G. Eliot.

4.4 PAN Author Verification

The literary dataset in the PAN 2014 shared task on author verification contains pairs of text instances where one text is written by a specific author, and the goal is to determine whether the other instance is also written by the same author. Note that the task of author verification is different from the task of author identification. To apply our model in this context, we compute the feature values for each pair of known-anonymous instances and subtract the feature values of the known instance from the features of the anonymous one; the feature values are normalized. As a result, a feature difference vector for each pair is computed. The vector is labeled so as to indicate whether both instances were written by the same author or not.
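A minimal sketch of this pairing step follows; the min-max normalization is an assumption (the description above only states that the feature values are normalized):

# Build one training/test instance for the verification classifier from a
# known-unknown pair of feature vectors. Min-max normalization with training
# statistics (lo, hi) is an assumption of this sketch.
import numpy as np

def difference_vector(known, unknown, lo, hi):
    known_n = (known - lo) / (hi - lo)
    unknown_n = (unknown - lo) / (hi - lo)
    return unknown_n - known_n  # labeled same-author / different-author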
The task performance measure is computed by multiplying the area under the ROC curve (AUC) and the "c@1" score, a metric that takes unpredicted instances into account. In our case, the classifier outputs a prediction for each test instance, such that the c@1 score is equivalent to accuracy. Table 2 shows the performance of our model compared to the winner and the second-ranked system of the English literary text section of the shared task (cf. (Modaresi and Gross, 2014) and (Zamani et al., 2014) for details).
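Expressed as code, the measure amounts to the following (a sketch; function and argument names are illustrative):

# Final score of the shared task: AUC multiplied by c@1. Since our classifier
# answers every test pair, c@1 reduces to plain accuracy, as noted above.
from sklearn.metrics import roc_auc_score, accuracy_score

def pan_final_score(y_true, y_score, y_pred):
    auc = roc_auc_score(y_true, y_score)     # y_score: same-author confidence
    c_at_1 = accuracy_score(y_true, y_pred)  # no unanswered instances
    return auc * c_at_1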

Approach             Final Score   AUC     c@1
Our Model            0.671         0.866   0.775
Modaresi & Gross     0.508         0.711   0.715
Zamani et al.        0.476         0.733   0.650
META-CLASSIFIER      0.472         0.732   0.645
BASELINE             0.202         0.453   0.445

Table 2: Performance of our model compared to other participants on the "PANLiterary" dataset

Our model outperforms the task baseline as well as the best-performing approach of the shared task, the META-CLASSIFIER (MC), by a large margin. The task baseline is the best-performing language-independent approach of the PAN 2013 shared task. MC is an ensemble of all systems that participated in the task, in that it uses for its decision the averaged probability scores of all of them.

4.5 Feature Analysis

Table 3 displays the 20 features with the highest information gain, ordered top-down (upper being the highest), for each of the presented experiments; see the note below the table for the feature abbreviations. Syntactic features prove again to be relevant in all the experiments. The table shows that there are features that work well for the majority of the experiments. This includes, e.g., the usage of verb chains (VC), syntactic objects (OBJ), commas, predicative complements of control verbs (OPRD), or adjective modifiers (AMOD). It is interesting to note that the Elaboration discourse relation is distinctive in the first two experiments, while the Contrast relation becomes relevant to gender and book identification. These features are not helpful in the PANLiterary experiment, where discourse patterns were not found in the small dataset. The discourse tree width and the subordinate clause width are distinctive in the author identification experiment, while they are not in the other experiments. This is likely because they can serve as indicators of the structural complexity of a text and thus of the idiosyncrasy of the writing style of an individual, as are punctuation marks such as periods and commas, which are typical stylistic features. Discourse markers, words with positive sentiment, first person plural pronouns, Wh-adverbs and modal verbs are distinctive features in the gender identification experiment. The fact that the usage of positive words is only relevant in the gender identification experiment could be caused by differences in the expressiveness/emotiveness of the writings of men and women. Punctuation marks become very distinctive in the book identification experiment, where the usage of colons, semicolons, parentheses, commas, periods, exclamation marks and quotation marks is among the most relevant features. Syntactic shape features are distinctive in the author identification and PANLiterary experiments, while not as impactful in the rest of the experiments.

Author             Gender             Book            PANLiterary
pronouns           AMOD               semicolons      quotations
VC                 discourse markers  colons          charsperword
AMOD               pronouns           VB              firstS
commas             firstP             PRP             commas
PRD                VC                 MD              hyphens
discourse width    ADV                OBJ             NNP
P                  MD                 acronyms        subordinate depth
TO                 Elaboration        VC              DT
Elaboration        TO                 IM              CC
present verbs      OPRD               sentence STD    determiners
subordinate width  PRT                parentheses     PRP
quotations         Contrast           commas          discourse markers
OBJ                PRP                periods         VC
CC                 Manner-means       stopwords       VBZ
sentence STD       RBR                OPRD            CONJ
nouns              positive words     AMOD            firstP
OPRD               OBJ                Contrast        PUT
PRP$               WRB                exclamations    LOC-OPRD
HMOD               present verbs      PRP$            coordinate width
periods            sentence range     quotations      adverbs

Table 3: 20 features with the highest information gain in all the experiments

Note: Features starting with a capital are discourse relations; 'sentence range' is defined as the difference between the minimum and maximum number of words per sentence. 'STD': standard deviation; 'firstP': first person plural pronouns; 'AMOD': adjective/adverbial modifier f(requency); 'VC': verb chain f; 'PRD': predicative complement f; 'ADV': general adverbial f; 'P': punctuation f; 'MD': modal verb f; 'TO': particle "to" f; 'OPRD': predicative complement of raising/control verb f; 'PRT': particle dependent on the verb f; 'OBJ': object f; 'PRP': adverbial of purpose or reason f; 'CC': coordinating conjunction f; 'RBR': comparative adverb f; 'PRP$': possessive pronoun f; 'WRB': Wh-adverb f; 'HMOD': dependent on the head of a hyphenated word f; 'NNP': singular proper noun f; 'DT': determiner f; 'VBZ': 3rd person singular present verb f; 'CONJ': second conjunct (dependent on conjunction) f; 'PUT': complement of the verb "put" f; 'LOC-OPRD': non-atomic dependency that combines a locative adverbial and a predicative complement of a control verb f.
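The ranking underlying Table 3 can be reproduced in spirit with a few lines; mutual information as computed by scikit-learn is used here as a stand-in for information gain (the exact tool we used is not named above):

# Rank features by an estimate of information gain with respect to the class
# label; mutual_info_classif is a stand-in for the exact computation.
from sklearn.feature_selection import mutual_info_classif

def top_features(X, y, names, k=20):
    gains = mutual_info_classif(X, y)
    return sorted(zip(names, gains), key=lambda p: -p[1])[:k]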
5 Conclusions and Future Work

We have shown that syntactic dependency and discourse features, which have largely been neglected in state-of-the-art proposals so far, play a significant role in the tasks of gender and author identification and of author verification. With more than 88% accuracy in both gender and author identification within the literary genre, our models that use them beat competitive baselines. In the future, we plan to experiment with further features and with other traits of author profiling.

References

Ahmed Abbasi and Hsinchun Chen. 2005. Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems, 20(5):67–75.

Refat Aljumily. 2015. Hierarchical and non-hierarchical linear and non-linear clustering methods to "Shakespeare authorship question". Social Sciences, 4(3):758–799.

Faith Amuchi, Ameer Al-Nemrat, Mamoun Alazab, and Robert Layton. 2012. Identifying cyber predators through forensic authorship analysis of chat logs. In Cybercrime and Trustworthy Computing Workshop (CTC), 2012 Third, pages 28–37, Ballarat, Australia, October. IEEE Computer Society.

Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97, Beijing, China, August. Association for Computational Linguistics.

John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1301–1309, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Jill Burstein, David Marcu, and Kevin Knight. 2003. Finding the write stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1):32–39.

Na Cheng, Xiaoling Chen, R. Chandramouli, and K. P. Subbalakshmi. 2009. Gender identification from e-mails. In Computational Intelligence and Data Mining, 2009. CIDM '09. IEEE Symposium on, pages 154–158, Nashville, TN, USA, March. IEEE Computer Society.

David Crystal and Derek Davy. 1969. Investigating English Style. Indiana University Press, Bloomington.

Chrysanne DiMarco and Graeme Hirst. 1993. A computational theory of goal-directed style in syntax. Computational Linguistics, 19(3):451–499.

Michael Gamon. 2004. Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of Coling 2004, pages 611–617, Geneva, Switzerland, August. COLING.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, Seattle, WA, USA, August. ACM.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text – Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Pashutan Modaresi and Philipp Gross. 2014. A language independent author verifier using fuzzy c-means clustering. In CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, pages 877–897, Sheffield, UK, September. CEUR-WS.org.

Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58(302):275–309.

Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217, Cambridge, MA, October. Association for Computational Linguistics.

Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers, pages 38–42, Uppsala, Sweden, July. Association for Computational Linguistics.

Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 78–86, Portland, Oregon, USA, June. Association for Computational Linguistics.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 199–205, Palo Alto, California, March. AAAI.

Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Martin Potthast, Benno Stein, Patrick Juola, Miguel A. Sanchez-Perez, and Alberto Barrón-Cedeño. 2014. Overview of the author identification task at PAN 2014. In CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, pages 877–897, Sheffield, UK, September. CEUR-WS.org.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177, Manchester, UK, August. Association for Computational Linguistics.

Mihai Surdeanu, Tom Hicks, and Marco Antonio Valenzuela-Escárcega. 2015. Two practical Rhetorical Structure Theory parsers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 1–5, Denver, Colorado, June. Association for Computational Linguistics.

Hamed Zamani, Hossein Nasr Esfahani, Pariya Babaie, Samira Abnar, Mostafa Dehghani, and Azadeh Shakery. 2014. Authorship identification using dynamic selection of features from probabilistic feature set. In CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, pages 877–897, Sheffield, UK, September. CEUR-WS.org.
