ACL 2016

The 54th Annual Meeting of the Association for Computational Linguistics

Proceedings of the 3rd Workshop on Argument Mining

August 12, 2016, Berlin, Germany

© 2016 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-945626-17-3

Preface

This third edition of the Workshop on Argument Mining builds on the success of the first and second workshops held at ACL 2014 and NAACL 2015, with an increasing maturity in the work reported. The breadth of papers in the programme this year attests to the range of techniques, the diverse domains and the varied goals that are encompassed in argument (or argumentation) mining.

The focus of argument mining is to tackle the problem of automatic identification of arguments, their internal structure and their interconnections. The papers collected here provide a rich exploration of the nature of argumentative structure that can be automatically identified, from identification of the presence of argument, through evidence relationships and types of evidence relationships, argument types and premise types, to highly demanding tasks such as enthymeme reconstruction.

One of the facets that makes argument mining such an exciting and demanding problem is that purely statistical approaches very rapidly reach performance maxima, so more knowledge-intensive, linguistically aware and structurally constrained approaches are required as well. Combinations of statistical robustness and structural priors hold particular promise, with early results reported in several of the papers here.

As a very new area, argument mining is also working ab initio on challenges such as data availability, annotation standards, corpus definition and publication, as well as quantification, validation and evaluation of results. Again, several papers here are tackling these community-oriented, practical – but vitally important – problems. We are also very pleased to introduce for the first time a special track focusing on an ‘Unshared Task’ to bootstrap the process of shared data provision for the community. The contributions to this track will lead to a detailed panel discussion with a goal of establishing some initial momentum to what will hopefully become a regular part of the Argument Mining workshop series.

This year also sees a special track on Debating Technologies reflecting the thread of work in the area that focuses on applications of the techniques in solving real problems in man-machine communication, driven in part by commercial R&D and by IBM’s Debating Technology team in particular.

We were delighted with the quantity and quality of submissions, and as a result have developed a packed programme. The workshop attracted 31 submissions in total, of which 13 were accepted as full papers, four as short papers and a further three as contributions to the Unshared Task Panel. As the area continues to grow with an increasing number of groups turning their attention to the problems presented by argument mining, we look forward to seeing further growth in the workshop and the community that it supports.

CAR Dundee, June 2016


Organizers:
Chris Reed, University of Dundee (Chair)
Kevin Ashley, University of Pittsburgh
Claire Cardie, Cornell University
Nancy Green, University of N.C. Greensboro
Iryna Gurevych, Technische Universität Darmstadt
Diane Litman, University of Pittsburgh
Georgios Petasis, N.C.S.R. Demokritos
Noam Slonim, IBM Research, Israel
Vern Walker, Hofstra University

Program Committee:
Stergos Afantenos, IRIT Toulouse
Carlos Alzate, IBM Research, Ireland
Kevin Ashley, University of Pittsburgh
Katarzyna Budzynska, Polish National Academy of Sciences
Elena Cabrio, University of Nice
Claire Cardie, Cornell University
Matthias Grabmair, University of Pittsburgh
Nancy Green, University of N.C. Greensboro
Iryna Gurevych, Technische Universität Darmstadt
Ivan Habernal, Technische Universität Darmstadt
Graeme Hirst, University of Toronto
Ed Hovy, CMU
Vangelis Karkaletsis, N.C.S.R. Demokritos
Mitesh Khapra, IBM Research, India
Valia Kordoni, Humboldt-Universität zu Berlin
Jonas Kuhn, Stuttgart University
John Lawrence, University of Dundee
João Leite, FCT-UNL – Universidade Nova de Lisboa
Ran Levy, IBM Research, Israel
Beishui Liao, Zhejiang University
Maria Liakata, University of Warwick
Diane Litman, University of Pittsburgh
Bernardo Magnini, FBK Trento
Robert Mercer, University of Western Ontario
Marie-Francine Moens, Katholieke Universiteit Leuven
Huy Nguyen, University of Pittsburgh
Smaranda Muresan, Columbia University
Fabio Paglieri, CNR Italy
Alexis Palmer, Saarland University
Joonsuk Park, Cornell University
Simon Parsons, King's College London
Georgios Petasis, N.C.S.R. Demokritos
Craig Pfeifer, MITRE
Chris Reed, University of Dundee
Ariel Rosenfeld, Bar-Ilan University
Patrick Saint-Dizier, IRIT Toulouse
Christian Schunn, University of Pittsburgh
Jodi Schneider, University of Pittsburgh
Noam Slonim, IBM Research, Israel
Christian Stab, Technische Universität Darmstadt
Manfred Stede, Universität Potsdam
Benno Stein, Universität Weimar
Henning Wachsmuth, Universität Weimar
Marilyn Walker, University of California, Santa Cruz
Vern Walker, Hofstra University
Serena Villata, INRIA Sophia-Antipolis Méditerranée
Lu Wang, Northeastern University
Adam Wyner, University of Aberdeen

Table of Contents

“What Is Your Evidence?” A Study of Controversial Topics on Social Media Aseel Addawood and Masooda Bashir ...... 1

Summarizing Multi-Party Argumentative Conversations in Reader Comment on News Emma Barker and Robert Gaizauskas ...... 12

Argumentative texts and clause types Maria Becker, Alexis Palmer and Anette Frank ...... 21

Contextual stance classification of opinions: A step towards enthymeme reconstruction in online reviews Pavithra Rajendran, Danushka Bollegala and Simon Parsons ...... 31

The CASS Technique for Evaluating the Performance of Argument Mining Rory Duthie, John Lawrence, Katarzyna Budzynska and Chris Reed ...... 40

Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms Jaromir Savelka and Kevin D. Ashley ...... 50

Scrutable Feature Sets for Stance Classification Angrosh Mandya, Advaith Siddharthan and Adam Wyner...... 60

Argumentation: Content, Structure, and Relationship with Essay Quality Beata Beigman Klebanov, Christian Stab, Jill Burstein, Yi Song, Binod Gyawali and Iryna Gurevych ...... 70

Neural Attention Model for Classification of Sentences that Support Promoting/Suppressing Relationship Yuta Koreeda, Toshihiko Yanase, Kohsuke Yanai, Misa Sato and Yoshiki Niwa ...... 76

Towards Feasible Guidelines for the Annotation of Argument Schemes Elena Musi, Debanjan Ghosh and Smaranda Muresan ...... 82

Identifying Argument Components through TextRank Georgios Petasis and Vangelis Karkaletsis ...... 94

Rhetorical structure and argumentation structure in monologue text Andreas Peldszus and Manfred Stede ...... 103

Recognizing the Absence of Opposing Arguments in Persuasive Essays Christian Stab and Iryna Gurevych ...... 113

Expert Stance Graphs for Computational Argumentation Orith Toledo-Ronen, Roy Bar-Haim and Noam Slonim ...... 119

Fill the Gap! Analyzing Implicit Premises between Claims from Online Debates Filip Boltuzic and Jan Šnajder ...... 124

Summarising the points made in online political debates Charlie Egan, Advaith Siddharthan and Adam Wyner ...... 134

What to Do with an Airport? Mining Arguments in the German Online Participation Project Tempelhofer Feld Matthias Liebeck, Katharina Esau and Stefan Conrad ...... 144

Unshared task: (Dis)agreement in online debates Maria Skeppstedt, Magnus Sahlgren, Carita Paradis and Andreas Kerren ...... 154

Unshared Task: Perspective Based Local Agreement and Disagreement in Online Debate Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo and Piek Vossen ...... 160

Unshared Task: A Preliminary Study of Disputation Behavior in Online Debating Forum Zhongyu Wei, Yandi Xia, Chen Li, Yang Liu, Zachary Stallbohm, Yi Li and Yang Jin ...... 166

Workshop Program

Friday, August 12, 2016

09:00–09:10 Welcome

09:10–10:30 Session I

09:10–09:30 “What Is Your Evidence?” A Study of Controversial Topics on Social Media Aseel Addawood and Masooda Bashir

09:30–09:50 Summarizing Multi-Party Argumentative Conversations in Reader Comment on News Emma Barker and Robert Gaizauskas

09:50–10:10 Argumentative texts and clause types Maria Becker, Alexis Palmer and Anette Frank

10:10–10:30 Contextual stance classification of opinions: A step towards enthymeme reconstruction in online reviews Pavithra Rajendran, Danushka Bollegala and Simon Parsons

10:30–11:00 Coffee break

11:00–12:30 Session II

11:00–11:20 The CASS Technique for Evaluating the Performance of Argument Mining Rory Duthie, John Lawrence, Katarzyna Budzynska and Chris Reed

11:20–11:40 Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms Jaromir Savelka and Kevin D. Ashley

11:40–12:00 Scrutable Feature Sets for Stance Classification Angrosh Mandya, Advaith Siddharthan and Adam Wyner

12:00–12:15 Argumentation: Content, Structure, and Relationship with Essay Quality Beata Beigman Klebanov, Christian Stab, Jill Burstein, Yi Song, Binod Gyawali and Iryna Gurevych

12:15–12:30 Neural Attention Model for Classification of Sentences that Support Promoting/Suppressing Relationship Yuta Koreeda, Toshihiko Yanase, Kohsuke Yanai, Misa Sato and Yoshiki Niwa

Friday, August 12, 2016 (continued)

12:30–14:00 Lunch

14:00–15:30 Session III

14:00–14:20 Towards Feasible Guidelines for the Annotation of Argument Schemes Elena Musi, Debanjan Ghosh and Smaranda Muresan

14:20–14:40 Identifying Argument Components through TextRank Georgios Petasis and Vangelis Karkaletsis

14:40–15:00 Rhetorical structure and argumentation structure in monologue text Andreas Peldszus and Manfred Stede

15:00–15:15 Recognizing the Absence of Opposing Arguments in Persuasive Essays Christian Stab and Iryna Gurevych

15:15–15:30 Expert Stance Graphs for Computational Argumentation Orith Toledo-Ronen, Roy Bar-Haim and Noam Slonim

15:30–16:00 Coffee break

Friday, August 12, 2016 (continued)

16:00–17:30 Session IV - Debating Technologies and the Unshared Task Panel

16:00–16:20 Fill the Gap! Analyzing Implicit Premises between Claims from Online Debates Filip Boltuzic and Jan Šnajder

16:20–16:40 Summarising the points made in online political debates Charlie Egan, Advaith Siddharthan and Adam Wyner

16:40–17:00 What to Do with an Airport? Mining Arguments in the German Online Participation Project Tempelhofer Feld Matthias Liebeck, Katharina Esau and Stefan Conrad

17:00–17:25 Panel Discussion: Unshared Task

Unshared task: (Dis)agreement in online debates Maria Skeppstedt, Magnus Sahlgren, Carita Paradis and Andreas Kerren

Unshared Task: Perspective Based Local Agreement and Disagreement in Online Debate Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo and Piek Vossen

Unshared Task: A Preliminary Study of Disputation Behavior in Online Debating Forum Zhongyu Wei, Yandi Xia, Chen Li, Yang Liu, Zachary Stallbohm, Yi Li and Yang Jin

17:25–17:30 Closing Remarks

17:30 Close


“What is Your Evidence?” A Study of Controversial Topics on Social Media

Aseel A. Addawood and Masooda N. Bashir
School of Information Sciences
University of Illinois at Urbana-Champaign
Aaddaw2, [email protected]

Abstract

In recent years, social media has revolutionized how people communicate and share information. One function of social media, besides connecting with friends, is sharing opinions with others. Microblogging sites, like Twitter, have often provided an online forum for social activism. When users debate about controversial topics on social media, they typically share different types of evidence to support their claims. Classifying these types of evidence can provide an estimate for how adequately the arguments have been supported. We first introduce a manually built gold standard dataset of 3000 tweets related to the recent FBI and Apple encryption debate. We develop a framework for automatically classifying six evidence types typically used on Twitter to discuss the debate. Our findings show that a Support Vector Machine (SVM) classifier trained with n-gram and additional features is capable of capturing the different forms of representing evidence on Twitter, and exhibits significant improvements over the unigram baseline, achieving a macro-averaged F1 of 82.8%.

1 Introduction

Social media has grown dramatically over the last decade. Researchers have now turned to social media, via online posts, as a source of information to explain many aspects of the human experience (Gruzd & Goertzen, 2013). Due to the textual nature of online users' self-disclosure of their opinions and views, social media platforms present a unique opportunity for further analysis of shared content and how controversial topics are argued.

On social media sites, especially on Twitter, user text contains arguments with inappropriate or missing justifications—a rhetorical habit we do not usually encounter in professional writing. One way to handle such faulty arguments is to simply disregard them and focus on extracting arguments containing proper support (Villalba and Saint-Dizier, 2012; Cabrio and Villata, 2012). However, sometimes what seems like missing evidence is actually just an unfamiliar or different type of evidence. Thus, recognizing the appropriate type of evidence can be useful in assessing the viability of users' supporting information, and in turn, the strength of their whole argument.

One difficulty of processing social media text is the fact that it is written in an informal format. It does not follow any guidelines or rules for the expression of opinions. This has led to many messages containing improper syntax or spelling, which presents a significant challenge to attempts at extracting meaning from social media content. Nonetheless, we believe processing such corpora is of great importance to the argumentation-mining field of study. Therefore, the motivation for this study is to facilitate online users' search for information concerning controversial topics. Social media users are often faced with information overload about any given topic, and understanding positions and arguments in online debates can potentially help users formulate stronger opinions on controversial issues and foster personal and group decision-making (Freeley and Steinberg, 2013).

Continuous growth of online data has led to large amounts of information becoming available for others to explore and understand. Several automatic techniques have allowed us to determine different viewpoints expressed in social media text, e.g., sentiment analysis and opinion mining. However, these techniques struggle to identify complex relationships between concepts in the text. Analyzing argumentation from a computational linguistics point of view has led very recently to a new field called argumentation mining (Green et al., 2014).


It formulates how humans disagree, debate, and form a consensus. This new field focuses on identifying and extracting argumentative structures in documents. This type of approach and the reasoning it supports is used widely in the fields of logic, AI, and text processing (Mochales and Ieven, 2009). The general consensus among researchers is that an argument is defined as containing a claim, which is a statement of the position for which the claimant is arguing. The claim is supported with premises that function as evidence to support the claim, which then appears as a conclusion or a proposition (Walton, Reed, & Macagno, 2008; Toulmin, 2003).

One of the major obstacles in developing argumentation mining techniques is the shortage of high-quality annotated data. An important source of data for applying argumentation techniques is the web, particularly social media. Online newspapers, blogs, product reviews, etc. provide a heterogeneous and growing flow of information where arguments can be analyzed. To date, much of the argumentation mining research has been limited and has focused on specific domains such as news articles, parliamentary records, journal articles, and legal documents (Ashley and Walker, 2013; Hachey and Grover, 2005; Reed and Rowe, 2004). Only a few studies have explored arguments on social media, a relatively under-investigated domain. Some examples of social media platforms that have been subjected to argumentation mining include Amazon online product reviews (Wyner, Schneider, Atkinson, & Bench-Capon, 2012) and tweets related to local riot events (Llewellyn, Grover, Oberlander, & Klein, 2014).

In this study, we describe a novel and unique benchmark data set achieved through a simple argument model, and elaborate on the associated annotation process. Unlike the classical Toulmin model (Toulmin, 2003), we search for a simple and robust argument structure comprising only two components: a claim and associated supporting evidence. Previous research has shown that a claim can be supported using different types of evidence (Rieke and Sillars, 1984). The annotation proposed in this paper is based on the type of evidence one uses to support a particular position on a given debate. We identify six types, which are detailed in the methods section (Section 3). To demonstrate these types, we collected data regarding the recent Apple/FBI encryption debate on Twitter between January 1 and March 31, 2016. We believe that understanding online users' views on this topic will help scholars, law enforcement officials, technologists, and policy makers gain a better understanding of online users' views about encryption.

In the remainder of the paper, Section 2 discusses related work, Section 3 describes the data and corresponding features, Section 4 presents the experimental results, and Section 5 concludes the paper and proposes future directions.

2 Related Work

2.1 Argumentation mining

Argumentation mining is the study of identifying the argument structure of a given text. Argumentation mining has two phases. The first consists of argument annotation and the second consists of argumentation analysis. Many studies have focused on the first phase of annotating argumentative discourse. Reed and Rowe (2004) presented Araucaria, a tool for argumentation diagramming that supports both convergent and linked arguments, missing premises (enthymemes), and refutations. They also released the AraucariaDB corpus, which has been used for experiments in the argumentation mining field. Similarly, Schneider et al. (2013) annotated Wikipedia talk pages about deletion using Walton's 17 schemes (Walton, 2008). Rosenthal and McKeown (2012) annotated opinionated claims, in which the author expresses a belief they think should be adopted by others; two annotators labeled sentences as claims without any context. Habernal, Eckle-Kohler & Gurevych (2014) developed another well-annotated corpus to model arguments following a variant of the Toulmin model. This dataset includes 990 instances of web documents collected from blogs, forums, and news outlets, 524 of which are labeled as argumentative. A final smaller corpus of 345 examples was annotated with finer-grained tags. No experimental results were reported on this corpus.

As for the second phase, Stab and Gurevych (2014b) classified argumentative sentences into four categories (none, major claim, claim, premise) using their previously annotated corpus (Stab and Gurevych, 2014a) and reached a 0.72 macro-F1 score.


Park and Cardie (2014) classified propositions into three classes (unverifiable, verifiable non-experimental, and verifiable experimental) and ignored non-argumentative text. Using a multi-class SVM and a wide range of features (n-grams, POS, sentiment clue words, tense, person) they achieved a 0.69 macro-F1.

The IBM Haifa Research Group (Rinott et al., 2015) developed something similar to our research; they developed a data set using plain text in Wikipedia pages. The purpose of this corpus was to collect context-dependent claims and evidence, where the latter refers to facts (i.e., premises) that are relevant to a given topic. They classified evidence into three types (study, expert, anecdotal). Our work is different in that it includes more diverse types of evidence that reflect social media trends, while the IBM group's study was limited to looking into plain text in Wikipedia pages.

2.2 Social Media As A Data Source For Argumentation Mining

As stated previously, there are only a few studies that have used social media data as a source for argumentation mining. Llewellyn et al. (2014) experimented with classifying tweets into several argumentative categories, specifically claims and counter-claims (with and without evidence), and used verification inquiries previously annotated by Procter, Vis, and Voss (2013). They used unigrams, punctuation, and POS as features in three classifiers. Schneider and Wyner (2012) focused on online product reviews and developed a number of argumentation schemes—inspired by Walton et al. (2008)—based on manual inspection of their corpus.

By identifying the most popular types of evidence used in social media, specifically on Twitter, our research differs from the previously mentioned studies because we are providing a social media annotated corpus. Moreover, the annotation is based on the different types of premises and evidence used frequently in social media settings.

3 Data

This study uses Twitter as its main source of data. Crimson Hexagon (Etlinger & Amand, 2012), a public social media analytics company, was used to collect every public post from January 1, 2016 through March 31, 2016. Crimson Hexagon houses all public Twitter data going back to 2009. The search criterion for this study was a tweet containing the word "encryption" anywhere in its text. The sample only included tweets from accounts that set English as their language; this was filtered in when requesting the data. However, some users set their account language to English but constructed some tweets in a different language. Thus, forty accounts were removed manually, leaving 531,593 tweets in our dataset. Although most Twitter accounts are managed by humans, there are other accounts managed by automated agents called social bots or Sybil accounts. These accounts do not represent real human opinions. In order to ensure that tweets from such accounts did not enter our data set, in the annotation procedure we ran each Twitter user through the Truthy BotOrNot algorithm (Davis et al., 2016). This cleaned the data further and excluded any user with a 50% or greater probability of being a bot. Overall, 946 (24%) bot accounts were removed.

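To make the filtering step concrete, here is a minimal sketch under stated assumptions: tweets have already been exported from Crimson Hexagon as simple records, and `bot_probability` is a hypothetical stand-in for querying the Truthy BotOrNot service; neither name comes from the paper.

```python
# Sketch of the account-level filtering described above: keep English tweets
# that mention "encryption" and drop authors whose estimated bot probability
# is 0.5 or higher (the 0.5 threshold mirrors the paper's description).

def filter_tweets(tweets, bot_probability):
    """tweets: iterable of dicts with 'text', 'lang' and 'user' keys (assumed layout).
    bot_probability: callable mapping a user handle to a score in [0, 1]
    (a stand-in for the Truthy BotOrNot service used in the paper)."""
    kept = []
    scores = {}  # cache one score per user so each account is checked once
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue
        if "encryption" not in tweet.get("text", "").lower():
            continue
        user = tweet["user"]
        if user not in scores:
            scores[user] = bot_probability(user)
        if scores[user] >= 0.5:   # likely bot: exclude the whole account
            continue
        kept.append(tweet)
    return kept
```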

4 Methodology

4.1 Coding Scheme

In order to perform argument extraction from a social media platform, we followed a two-step approach. The first step was to identify sentences containing an argument. The second step was to identify the evidence type found in the tweets classified as argumentative. These two steps were performed in conjunction with each other. Annotators were asked to annotate each tweet as either having an argument or not having an argument. Then they were instructed to annotate a tweet based on the type of evidence used in the tweet. Figure 1 shows the flow of annotation.

Figure 1: Flow chart for annotation

After considerable observation of the data, a draft coding scheme was developed for the most used types of evidence. In order to verify the applicability and accuracy of the draft coding scheme, two annotators conducted an initial trial on 50 randomized tweets to test the coding scheme. After some adjustments were made to the scheme, a second trial was conducted consisting of 25 randomized tweets that two different annotators annotated. The resulting analysis and discussion led to a final revision of the coding scheme and modification of the associated documentation (annotation guideline). After finalizing the annotation scheme, two annotators annotated a new set of 3000 tweets. The tweets were coded into one of the following evidence types.

News media account (NEWS) refers to sharing a story from any news media account. Since Twitter does not allow tweets to have more than 140 characters, users tend to communicate their opinions by sharing links to other resources. Twitter users will post links from official news accounts to share breaking news or stories posted online and add their own opinions. For example:

Please who don't understand encryption or technology should not be allow to legislate it. There should be a test... https://t.co/I5zkvK9sZf

Expert opinion (EXPERT) refers to sharing someone else's opinion about the debate, specifically someone who has more experience and knowledge of the topic than the user. The example below shows a tweet that shares a quotation from a security expert.

RT @ItIsAMovement "Without strong encryption, you will be spied on systematically by lots of people" - Whitfield Diffie

Blog post (BLOG) refers to the use of a link to a blog post reacting to the debate. The example below shows a tweet with a link to a blog post. In this tweet, the user is sharing a link to her own blog post.

I care about #encryption and you should too. Learn more about how it works from @Mozilla at https://t.co/RTFiuTQXyQ

Picture (PICTURE) refers to a user sharing a picture related to the debate that may or may not support his/her point of view. For example, the tweet below shows a post containing the picture shown in Figure 2.

RT @ErrataRob No, morons, if encryption were being used, you'd find the messages, but you wouldn't be able to read them

Figure 2: An example of sharing a picture as evidence

Other (OTHER) refers to other types of evidence that do not fall under the previous annotation categories. Even though we observed Twitter data in order to categorize different, discrete types of evidence, we were also expecting to discover new types while annotating. Some new types we found while annotating include audio, books, campaigns, petitions, codes, slides, other social media references, and text files.

No evidence (NO EVIDENCE) refers to users sharing their opinions about the debate without having any evidence to support their claim. The example below shows an argumentative tweet from a user who is in favor of encryption. However, he/she does not provide any evidence for his/her stance.

I hope people ban encryption. Then all their money and CC's can be stolen and they'll feel better knowing terrorists can't keep secrets.

Non Argument (NONARG) refers to a tweet that does not contain an argument. For example, the following tweet asks a question instead of presenting an argument.

RT @cissp_googling what does encryption look like


Another NONARG situation is when a user shares a link to a news article without posting any opinions about it. For example, the following tweet does not present an argument or share an opinion about the debate; it only shares the title of the news article, "Tech giants back Apple against FBI's 'dangerous' encryption demand," and a link to the article.

Tech giants back Apple against FBI's 'dangerous' encryption demand #encryption https://t.co/4CUushsVmW

Retweets are also considered NONARG because simply selecting "retweet" does not take enough effort to be considered an argument. Moreover, just because a user retweets something does not mean we know exactly how they feel about it; they could agree with it, or they could just think it was interesting and want to share it with their followers. The only exception would be if a user retweeted something that was very clearly an opinion or argument. For example, someone retweeting Edward Snowden speaking out against encryption backdoors would be marked as an argument. By contrast, a user retweeting a CNN news story about Apple and the FBI would be marked as NONARG.

Annotation discussion. While annotating the data, we observed other types of evidence that did not appear in the last section. We assumed users would use these types of evidence in argumentation. However, we found that users mostly use these types in a non-argumentative manner, namely as a means of forwarding information. The first such evidence type was "scientific paper," which refers to sharing a link to scientific research that was published in a conference or a journal. Here is an example:

A Worldwide Survey of Encryption Products. By Bruce Schneier, Kathleen Seidel & Saranya Vijayakumar #Cryptography https://t.co/wmAuvu6oUb

The second such evidence type was "video," which refers to a user sharing a link to a video related to the debate. For example, the tweet below is a post with a link to a video explaining encryption.

An explanation of how a 2048-bit RSA encryption key is created https://t.co/JjBWym3poh

4.2 Annotation results

The results of the annotation are shown in Table 1 and Table 2. The inter-coder reliability was 18% and 26% for the two tasks, respectively, yielding a 70% inter-annotator observed agreement for both tasks. The unweighted Cohen's Kappa score was 0.67 and 0.79, respectively, for the two tasks.

Argumentation classification    Class distribution
Argument (ARG)                  1,271
Non argument (NONARG)           1,729
Total                           3,000

Table 1: Argumentation classification distribution over tweets

Evidence type           Class distribution
No evidence             630
News media accounts     318
Blog post               293
Picture                 12
Expert opinion          11
Other                   7
Total                   1,271

Table 2: Evidence type distribution over tweets

5 Experimental Evaluation

We developed an approach to classify tweets into each of the six major types of evidence used in Twitter arguments.

5.1 Preprocessing

Due to the character limit, Twitter users tend to use colloquialisms, slang, and abbreviations in their tweets. They also often make spelling and grammar errors in their posts. Before discussing feature selection, we will briefly discuss how we compensated for these issues in data preprocessing. We first replaced all abbreviations with their proper word or phrase counterparts (e.g., 2night => tonight) and replaced repeated characters with a single character (e.g., haaaapy => happy). In addition, we lowercased all letters (e.g., ENCRYPTION => encryption), and removed all URLs and mentions of other users after initially recording these features.
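A minimal sketch of this preprocessing (abbreviation expansion, collapsing repeated characters, lowercasing, and stripping URLs and user mentions). The abbreviation dictionary and the exact collapsing rule are illustrative assumptions, not the authors' actual resources.

```python
import re

# Illustrative abbreviation map (assumed); the paper's full replacement list is not given here.
ABBREVIATIONS = {"2night": "tonight", "b4": "before", "gr8": "great"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(\w)\1{2,}")   # a character repeated three or more times

def preprocess(tweet_text):
    text = tweet_text
    # URLs and @-mentions are recorded as Twitter-specific features elsewhere,
    # then stripped from the text before n-gram extraction.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Lowercase everything (e.g., ENCRYPTION -> encryption).
    text = text.lower()
    # Collapse long character runs to one character (approximating haaaapy -> happy).
    text = REPEAT_RE.sub(r"\1", text)
    # Expand abbreviations token by token (e.g., 2night -> tonight).
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

# Example: prints "tonight we talk encryption!!! hapy to share"
print(preprocess("2night we talk ENCRYPTION!!! haaaapy to share https://t.co/x @someone"))
```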


5.2 Features

We propose a set of features to characterize each type of evidence in our collection. Some of these features are specific to the Twitter platform. However, others are more generic and could be applied to other forums of argumentation. Many features follow previous work (Castillo, Mendoza, & Poblete, 2011; Agichtein, Castillo, Donato, Gionis, & Mishne, 2008). The full list of features appears in Appendix A. Below, we identify four types of features based on their scope: Basic, Psychometric, Linguistic, and Twitter-specific.

Basic Features refer to N-gram features, which rely on the word count (TF) for each given unigram or bigram that appears in the tweet.

Psychometric Features refer to dictionary-based features. They are derived from Linguistic Inquiry and Word Count (LIWC). LIWC is a text analysis software originally developed within the context of Pennebaker's work on emotional writing (Pennebaker & Francis, 1996; Pennebaker, 1997). LIWC produces statistics on eighty-one different text features in five categories. These include psychological processes such as emotional and social cognition, and personal concerns such as occupational, financial, or medical worries. In addition, they include personal core drives and needs such as power and achievement.

Linguistic Features encompass four types of features. The first is grammatical features, which refer to percentages of words that are pronouns, articles, prepositions, verbs, adverbs, and other parts of speech or punctuation. The second type is LIWC summary variables. The newest version of LIWC includes four new summary variables (analytical thinking, clout, authenticity, and emotional tone), which resemble "person-type" or personality measures. The LIWC webpage ("Interpreting LIWC Output", 2016) describes the four summary variables as follows. Analytical thinking "captures the degree to which people use words that suggest formal, logical, and hierarchical thinking patterns." Clout "refers to the relative social status, confidence, or leadership that people display through their writing or talking." Authenticity "is when people reveal themselves in an authentic or honest way," usually by becoming "more personal, humble, and vulnerable." Lastly, with emotional tone, "although LIWC includes both positive emotion and negative emotion dimensions, the tone variable puts the two dimensions into a single summary variable." The third type is sentiment features. We first experimented with the Wilson, Wiebe & Hoffmann (2005) subjectivity clue lexicon to identify sentiment features. However, we decided to use the sentiment labels provided by the LIWC sentiment lexicon, which we found provides more accurate results than we would have had otherwise. For the final type, subjectivity features, we did use the Wilson et al. (2005) subjectivity clue lexicon to identify the subjectivity type of tweets.

Twitter-Specific Features refer to characteristics unique to the Twitter platform, such as the length of a message and whether the text contains exclamation points or question marks. In addition, these features encompass the number of followers, the number of people followed ("friends" on Twitter), and the number of tweets the user has authored in the past. Also included is the presence or absence of URLs, mentions of other users, hashtags, and official account verification. We also considered a binary feature for tweets that share a URL, as well as the title of the URL shared (i.e., the article title).

6 Experimental results

Our first goal was to determine whether a tweet contains an argument. We used a binary classification task in which each tweet was classified as either argumentative or not argumentative. Some previous research skipped this step (Feng and Hirst, 2011), while others used different types of classifiers to achieve a high level of accuracy (Reed and Moens, 2008; Palau and Moens, 2009). In this study, we chose to classify tweets as either containing an argument or not. Our results confirm previous research showing that users do not frequently utilize Twitter as a debating platform (Smith, Zhu, Lerman & Kozareva, 2013). Most individuals use Twitter as a venue to spread information instead of using it as a platform through which to have conversations about controversial issues. People seem to be more interested in spreading information and links to webpages than in debating issues.

As a first step, we compared classifiers that have frequently been used in related work: Naïve Bayes (NB) as used in Teufel and Moens (2002), Support Vector Machines (SVM) as used in Liakata et al. (2012), and Decision Trees (J48) as used in Castillo, Mendoza, & Poblete (2011). We used the Weka data mining software (Hall et al., 2009) for all approaches.
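As an illustration of how the Basic n-gram features could be combined with Twitter-specific features in code, here is a sketch using scikit-learn rather than the Weka toolchain actually used; the psychometric (LIWC) and subjectivity-lexicon features are omitted because those resources are external, and the field names on the tweet records are assumptions.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

def twitter_specific_features(tweet):
    """Hand-crafted features for one tweet dict; LIWC/psychometric scores would
    be merged in here as well (not computed in this sketch)."""
    text = tweet["text"]
    return {
        "is_retweet": float(text.startswith("RT ")),
        "has_url": float("http" in text),
        "has_mention": float("@" in text),
        "has_hashtag": float("#" in text),
        "num_followers": float(tweet.get("followers", 0)),
        "word_count": float(len(text.split())),
        "has_qmark": float("?" in text),
        "has_exclam": float("!" in text),
    }

def build_feature_matrix(tweets, preprocessed_texts):
    # Basic features: unigram and bigram term frequencies.
    ngrams = CountVectorizer(ngram_range=(1, 2))
    X_basic = ngrams.fit_transform(preprocessed_texts)
    # Twitter-specific features as a sparse block, then concatenated column-wise.
    dv = DictVectorizer()
    X_twitter = dv.fit_transform([twitter_specific_features(t) for t in tweets])
    return hstack([X_basic, X_twitter])
```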


Feature Set     Decision tree (Pre. / Rec. / F1)   SVM (Pre. / Rec. / F1)    NB (Pre. / Rec. / F1)
UNI (Base)      72.5 / 69.4 / 66.3                 81 / 78.5 / 77.3          69.7 / 67.3 / 63.9
All features    87.3 / 87.3 / 87.2                 89.2 / 89.2 / 89.2        79.3 / 79.3 / 84.7

Table 3: Summary of the argument classification results in %

Before training, all features were ranked according to their information gain observed in the training set. Features with information gain less than zero were excluded. All results were subject to 10-fold cross-validation. Since, for the most part, our data sets were unbalanced, we used the Synthetic Minority Oversampling Technique (SMOTE) approach (Chawla, Bowyer, Hall & Kegelmeyer, 2002). SMOTE is one of the most renowned approaches to solving the problem of unbalanced data. Its main function is to create new minority class examples by interpolating several minority class instances that lie together. After that, we randomized the data to overcome the problem of overfitting the training data.

Argument classification. Regarding our first goal of classifying tweets as argumentative or non-argumentative, Table 3 shows a summary of the classification results. The best overall performance was achieved using SVM, which resulted in an 89.2% F1 score for all features, compared to the basic-features unigram model. We can see there is a significant improvement over just using the baseline model.

Evidence type classification. Our second goal was evidence type classification. Results across the training techniques were comparable; the best results were again achieved by using SVM, which resulted in a 78.6% F1 score. Table 4 shows a summary of the classification results. The best overall performance was achieved by combining all features.

In Table 5, we report Precision, Recall, and F1 scores with respect to the three most-used evidence types, employing one-vs-all classification problems for evaluation purposes. We chose the most-used evidence types since the other types were too small and could have led to biased sample data. The results show that the SVM classifier achieved a macro-averaged F1 score of 82.8%. As the table shows, the baseline outperformed the Linguistic and Psychometric features, which was not expected. However, the Basic features (N-gram) had very comparable results to those from combining all features. In other words, the combined features captured the characteristics of each class. This shows that we can distinguish between classes using a concise set of features with equal performance.
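The experiments above were run in Weka; the sketch below shows a roughly equivalent setup in scikit-learn/imbalanced-learn terms, with mutual information standing in for Weka's information-gain ranking and an illustrative value for the number of retained features.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline          # applies SMOTE only inside training folds
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def evaluate(X, y):
    """X: combined feature matrix; y: labels (argument/non-argument or evidence types)."""
    pipeline = Pipeline([
        # Keep the most informative features; mutual information is a stand-in for the
        # information-gain ranking used in the paper (k is illustrative and must not
        # exceed the number of features).
        ("select", SelectKBest(mutual_info_classif, k=1000)),
        # Oversample minority classes with SMOTE; k_neighbors may need lowering
        # for very small classes such as PICTURE or EXPERT.
        ("smote", SMOTE(random_state=0)),
        ("svm", LinearSVC()),
    ])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1_macro")
    return scores.mean()
```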

Feature Set     Decision tree (Pre. / Rec. / F1)   SVM (Pre. / Rec. / F1)    NB (Pre. / Rec. / F1)
UNI (Base)      59.1 / 61.1 / 56.3                 63.7 / 62.1 / 56.5        27.8 / 31.6 / 19.4
All features    76.8 / 77 / 76.9                   78.5 / 79.5 / 78.6        62.4 / 59.4 / 52.5

Table 4: Summary of the evidence type classification results in %

Feature Set                 NEWS vs. All (Pre. / Rec. / F1)   BLOG vs. All (Pre. / Rec. / F1)   NO EVIDENCE vs. All (Pre. / Rec. / F1)   Macro Average F1
UNI (Base)                  76.8 / 74 / 73.9                  67.3 / 64.4 / 63.5                78.5 / 68.7 / 65.6                       67.6
Basic Features              84.2 / 81.3 / 81.3                85.2 / 83 / 82.9                  80.1 / 75.5 / 74.4                       79.5
Psychometric Features       62 / 61.7 / 57.9                  64.6 / 63.7 / 63.5                59.2 / 58.9 / 58.6                       60
Linguistic Features         65 / 65.3 / 64.2                  69.1 / 69 / 69                    63.1 / 62.6 / 62.4                       65.2
Twitter-Specific Features   65.7 / 65.2 / 65                  63.7 / 63.6 / 63.6                68.7 / 68.1 / 67.9                       65.5
All features                84.4 / 84 / 84.1                  86 / 85.2 / 85.2                  79.3 / 79.3 / 79.3                       82.8

Table 5: Summary of evidence type classification results using one-vs-all in %
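The one-vs-all evaluation behind Table 5 can be reproduced in spirit with the sketch below: the labels are binarised for each of the three frequent evidence types, a classifier is cross-validated for each binary problem, and the per-type F1 scores are macro-averaged. The classifier choice and cross-validation setup here are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def one_vs_all_scores(X, labels, target_types=("NEWS", "BLOG", "NO EVIDENCE")):
    """Binarise the evidence-type labels one class at a time and report
    precision/recall/F1 for the positive class, plus the macro-averaged F1."""
    labels = np.asarray(labels)
    f1_scores = []
    for etype in target_types:
        y = (labels == etype).astype(int)              # this type vs. everything else
        pred = cross_val_predict(LinearSVC(), X, y, cv=10)
        p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
        print(f"{etype} vs. All: P={p:.3f} R={r:.3f} F1={f1:.3f}")
        f1_scores.append(f1)
    return float(np.mean(f1_scores))                   # macro-averaged F1 over the types
```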


6.1 Feature Analysis

The most informative features for the evidence type classification are shown in Table 6. Different features work for each class. For example, Twitter-specific features such as title, word count, and WPS are good indicators of the NEWS evidence type. One explanation for this is that people often include the title of a news article in the tweet with the URL, thereby engaging the aforementioned Twitter-specific features more fully. Another example is that linguistic features like grammar and sentiments are essential for the BLOG evidence type. The word "wrote," especially, appears often to refer to someone else's writing, as in the case of a blog. The use of the BLOG evidence type also seemed to correlate with emotional tone and negative emotions, which is a combination of positive and negative sentiment. This may suggest that users have strong negative opinions toward blog posts.

Feature Set            Features
NEWS vs. All           Word count, title, personal pronoun, common adverbs, WPS, "iphone", "nsa director"
BLOG vs. All           Emotional Tone, 1st person singular, negation, colon, conjunction, "wrote", negative emotions, "blog"
NO EVIDENCE vs. All    Title, 1st person singular, colon, impersonal pronouns, discrepancies, insight, differentiation (cognitive processes), period, adverb, positive emotion

Table 6: Most informative features for combined features for evidence type classification

Concerning the NO EVIDENCE type, a combination of linguistic features and psychometric features best describes the classification type. Furthermore, in contrast with blogs, users not using any evidence tend to express more positive emotions. That may imply that they are more confident about their opinions. There are, however, mutual features used in both the BLOG and NO EVIDENCE types, such as 1st person singular and colon. One explanation for this is that since blog posts are often written in a less formal, less evidence-based manner than news articles, they are comparable to tweets that lack sufficient argumentative support. One further shared feature is that "title" appears frequently in both the NEWS and NO EVIDENCE types. One explanation for this is that "title" has a high positive value in NEWS, which often involves highlighting the title of an article, while it has a high negative value in NO EVIDENCE since this type does not contain any titles of articles.

As Table 5 shows, "all features" outperforms the other stand-alone feature sets and "basic features," although "basic features" has a better performance than the other stand-alone features. Table 7 shows the most informative features for the argumentation classification task using the combined features and unigram features. We can see that first person singular is the strongest indication of arguments on Twitter, since the easiest way for users to express their opinions is by saying "I …".

Feature set    Features
Unigram        I'm, surveillance, love, I've, I'd, privacy, I'll, hope, wait, obama
All            1st person singular, RT, personal pronouns, URL, function words, user mention, followers, auxiliary verbs, verb, analytic

Table 7: Most informative features for argumentation classification

7 Conclusions and future work

In this paper, we have presented a novel task for automatically classifying argumentation on social media for users discussing controversial topics like the recent FBI and Apple encryption debate. We classified six types of evidence people use in their tweets to support their arguments. This classification can help predict how arguments are supported. We have built a gold standard data set of 3000 tweets from the recent encryption debate. We find that Support Vector Machine (SVM) classifiers trained with n-grams and other features capture the different types of evidence used in social media and demonstrate significant improvement over the unigram baseline, achieving a macro-averaged F1 score of 82.8%.

One consideration for future work is classifying the stance of tweets by using machine learning techniques to understand a user's viewpoint and opinions about a debate. Another consideration for


future work is to explore other evidence types that may not be presented in our data.

Acknowledgments

We would like to thank Andrew Marturano and Amy Chen for assisting us in the annotation process.

References

Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008, February). Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 183-194). ACM.

Ashley, K. D., & Walker, V. R. (2013). From Information Retrieval (IR) to Argument Retrieval (AR) for Legal Cases: Report on a Baseline Study.

Cabrio, E., & Villata, S. (2012, July). Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2 (pp. 208-212). Association for Computational Linguistics.

Castillo, C., Mendoza, M., & Poblete, B. (2011, March). Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web (pp. 675-684). ACM.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 321-357.

Davis, C. A., Varol, O., Ferrara, E., Flammini, A., & Menczer, F. (2016, April). BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 273-274). International World Wide Web Conferences Steering Committee.

Etlinger, S., & Amand, W. (2012, February). Crimson Hexagon [Program documentation]. Retrieved April, 2016, from http://www.crimsonhexagon.com/wp-content/uploads/2012/02/CrimsonHexagon_Altimeter_Webinar_111611.pdf

Feng, V. W., & Hirst, G. (2011, June). Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (pp. 987-996). Association for Computational Linguistics.

Freeley, A., & Steinberg, D. (2013). Argumentation and debate. Cengage Learning.

Green, N., Ashley, K., Litman, D., Reed, C., & Walker, V. (2014). Proceedings of the First Workshop on Argumentation Mining. Baltimore, MD: Association for Computational Linguistics.

Gruzd, A., & Goertzen, M. (2013, January). Wired academia: Why social science scholars are using social media. In System Sciences (HICSS), 2013 46th Hawaii International Conference on (pp. 3332-3341). IEEE.

Habernal, I., Eckle-Kohler, J., & Gurevych, I. (2014, July). Argumentation mining on the web from information seeking perspective. In ArgNLP.

Hachey, B., & Grover, C. (2005, March). Sequence modelling for sentence classification in a legal summarisation system. In Proceedings of the 2005 ACM Symposium on Applied Computing (pp. 292-296). ACM.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.

Interpreting LIWC Output. Retrieved April 17, 2016, from http://liwc.wpengine.com/interpreting-liwc-output/

Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7), 991-1000.

Llewellyn, C., Grover, C., Oberlander, J., & Klein, E. (2014). Re-using an argument corpus to aid in the curation of social media collections. In LREC (pp. 462-468).

Mochales, R., & Ieven, A. (2009, June). Creating an argumentation corpus: Do theories apply to real arguments? A case study on the legal argumentation of the ECHR. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (pp. 21-30). ACM.

Palau, R. M., & Moens, M. F. (2009, June). Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (pp. 98-107). ACM.

Park, J., & Cardie, C. (2014, June). Identifying appropriate support for propositions in online user comments. In Proceedings of the First Workshop on Argumentation Mining (pp. 29-38).

Pennebaker, J. W. (1997). Writing about emotional experiences as a therapeutic process. Psychological Science, 8(3), 162-166.


Pennebaker, J. W., & Francis, M. E. (1996). Cognitive, emotional, and language processes in disclosure. Cognition & Emotion, 10(6), 601-626.

Procter, R., Vis, F., & Voss, A. (2013). Reading the riots on Twitter: Methodological innovation for the analysis of big data. International Journal of Social Research Methodology, 16(3), 197-214.

Reed, C., & Rowe, G. (2004). Araucaria: Software for argument analysis, diagramming and representation. International Journal on Artificial Intelligence Tools, 13(04), 961-979.

Rieke, R. D., & Sillars, M. O. (1984). Argumentation and the decision making process. Addison-Wesley Longman.

Rinott, R., Dankin, L., Alzate, C., Khapra, M. M., Aharoni, E., & Slonim, N. (2015, September). Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal (pp. 17-21).

Rosenthal, S., & McKeown, K. (2012, September). Detecting opinionated claims in online discussions. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on (pp. 30-37). IEEE.

Schneider, J., & Wyner, A. (2012). Identifying consumers' arguments in text. In SWAIE (pp. 31-42).

Schneider, J., Samp, K., Passant, A., & Decker, S. (2013, February). Arguments about deletion: How experience improves the acceptability of arguments in ad-hoc online task groups. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 1069-1080). ACM.

Smith, L. M., Zhu, L., Lerman, K., & Kozareva, Z. (2013, September). The role of social media in the discussion of controversial topics. In Social Computing (SocialCom), 2013 International Conference on (pp. 236-243). IEEE.

Stab, C., & Gurevych, I. (2014a). Annotating argument components and relations in persuasive essays. In COLING (pp. 1501-1510).

Stab, C., & Gurevych, I. (2014b). Identifying argumentative discourse structures in persuasive essays. In EMNLP (pp. 46-56).

Teufel, S., & Kan, M. Y. (2011). Robust argumentative zoning for sensemaking in scholarly documents (pp. 154-170). Springer Berlin Heidelberg.

Teufel, S., & Moens, M. (2002). Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28(4), 409-445.

Toulmin, S. E. (2003). The uses of argument. Cambridge University Press.

Villalba, M. P. G., & Saint-Dizier, P. (2012). Some facets of argument mining for opinion analysis. COMMA, 245, 23-34.

Walton, D., Reed, C., & Macagno, F. (2008). Argumentation schemes. Cambridge University Press.

Wilson, T., Wiebe, J., & Hoffmann, P. (2005, October). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 347-354). Association for Computational Linguistics.

Wyner, A., Schneider, J., Atkinson, K., & Bench-Capon, T. J. (2012). Semi-automated argumentative analysis of online product reviews. COMMA, 245, 43-50.


Appendix A. Feature types used in our Model

Basic Features
- Unigram: Word count for each single word that appears in the tweet
- Bigram: Word count for each two-word sequence that appears in the tweet

Psychometric Features
- Perceptual process: Percentage of words that refer to multiple sensory and perceptual dimensions associated with the five senses
- Biological process: Percentage of words related to body, health, sexual and ingestion
- Core Drives and Needs: Percentage of words related to personal drives such as power, achievement, reward and risk
- Cognitive Processes: Percentage of words related to causation, discrepancy, tentative, certainty, inhibition and inclusive
- Personal Concerns: Percentage of words related to work, leisure, money, death, home and religion
- Social Words: Percentage of words that are related to family and friends
- Analytical Thinking: Percentage of words that capture the degree to which people use words that suggest formal, logical, and hierarchical thinking patterns
- Clout: Percentage of words related to the relative social status, confidence, or leadership that people display through their writing or talking
- Authenticity: Percentage of words that reveal people in an authentic or honest way; they are more personal, humble, and vulnerable
- Emotional Tone: Percentage of words related to the emotional tone of the writer, which is a combination of both positive emotion and negative emotion dimensions
- Informal Speech: Percentage of words related to informal language markers such as assents, fillers and swear words

Linguistic Features
- Time Orientation: Percentage of words that refer to past focus, present focus and future focus
- Grammatical: Percentage of words that refer to personal pronouns, impersonal pronouns, articles, prepositions, auxiliary verbs, common adverbs, punctuation
- Positive emotion: Percentage of positive words in a sentence
- Negative emotion: Percentage of negative words in a sentence
- Subjectivity type: Subjectivity type derived from the Wilson et al. (2005) lexicon
- Punctuation: Percentage of punctuation in text, including periods, commas, colons, semicolons, etc.

Twitter-specific Features
- RT: 1.0 if the tweet is a retweet
- Title: 1.0 if the tweet contains the title of the article
- Mention: 1.0 if the tweet contains a mention of another user ('@')
- Verified account: 1.0 if the author has a 'verified' account
- URL: 1.0 if the tweet contains a link to a URL
- Followers: Number of people this author is following at posting time
- Following: Number of people following this author at posting time
- Posts: Total number of the user's posts
- Hashtag: 1.0 if the tweet contains a hashtag ('#')
- WC: Word count of the tweet
- Words>6 letters: Count of words with more than six letters
- WPS: Count of words per sentence
- QMark: Percentage of words containing question marks
- Exclam: Percentage of words containing exclamation marks

Summarizing Multi-Party Argumentative Conversations in Reader Comment on News

Emma Barker and Robert Gaizauskas
Department of Computer Science, University of Sheffield
{e.barker, r.gaizauskas}@sheffield.ac.uk

Abstract

Existing approaches to summarizing multi-party argumentative conversations in reader comment are extractive and fail to capture the argumentative nature of these conversations. Work on argument mining proposes schemes for identifying argument elements and relations in text but has not yet addressed how summaries might be generated from a global analysis of a conversation based on these schemes. In this paper we: (1) propose an issue-centred scheme for analysing and graphically representing argument in reader comment discussion in on-line news, and (2) show how summaries capturing the argumentative nature of reader comment can be generated from our graphical representation.

1 Introduction

A very common feature of on-line news is a reader comment facility that lets readers comment on news articles and on previous readers' comments. What emerges are multi-party conversations, typically argumentative, in which, for example, readers question, reject, extend, offer evidence for, explore the consequences of points made or reported in the original article or in earlier commenters' comments. See, e.g. The Guardian on-line.

One problem with such conversations is that they can rapidly grow to hundreds or thousands of comments. Few readers have the patience to wade through this much content, a task made all the more difficult by the lack of explicit topical structure. A potential solution is to develop methods to summarize comment automatically, allowing readers to gain an overview of the conversation.

Various researchers have already proposed methods to automatically generate summaries of reader comment (Khabiri et al., 2011; Ma et al., 2012; Llewellyn et al., 2014). These authors adopt broadly similar approaches: first reader comments are topically clustered, then comments within clusters are ranked and finally one or more top-ranked comments are selected from each cluster, yielding an extractive summary. A major drawback of such summaries is that they fail to capture the essential argument-oriented nature of these multi-way conversations, since single comments taken from clusters do not reflect the argumentative structure of the conversation. I.e. such summaries do not identify the issues about which commenters are arguing, the alternative viewpoints commenters take on the issues or key evidence supporting one viewpoint or another, which a truly informative summary must do.

By contrast, researchers working on argument mining from social media, including reader comment and on-line debate, have articulated various schemes defining argument elements and relations in argumentative discourse (e.g. Ghosh et al. (2014), Habernal et al. (2014), Swanson et al. (2015)). If such elements and relations could be automatically extracted then they could potentially serve as a basis for generating a summary that better reflects the argumentative content of reader comment. Indeed, several of these authors have cited summarization as a motivating application for their work. However, to the best of our knowledge none have proposed how, given a formal analysis of a conversation in response to a news article, one might produce a summary of that conversation. This is a non-trivial issue.
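For concreteness, here is a minimal sketch of the cluster-rank-select extractive pipeline described above (topical clustering of comments, ranking within each cluster, selection of the top-ranked comment). The TF-IDF representation, k-means clustering and centroid-similarity ranking are illustrative stand-ins, not the components of the cited systems.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_comment_summary(comments, n_clusters=5):
    """comments: list of comment strings. Cluster them by topic, rank each
    cluster's comments by similarity to the cluster centroid, and return the
    top comment per cluster as an extractive summary."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(comments)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    summary = []
    for c in range(n_clusters):
        idx = [i for i, lab in enumerate(km.labels_) if lab == c]
        if not idx:
            continue
        sims = cosine_similarity(X[idx], km.cluster_centers_[c].reshape(1, -1)).ravel()
        summary.append(comments[idx[int(sims.argmax())]])
    return summary
```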

12 Proceedings of the 3rd Workshop on Argument Mining, pages 12–20, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics capture argument in reader comments and news Issues Implicitly related to notion of viewpoint and we show, via an example, how an analysis is that of issue. We can think of an issue as a ques- using this framework may be graphically repre- tion or problem to which there are two or more sented. Secondly (Section 3), we make propos- contending answers. The space of possible an- als for how summaries that capture the argument- swers is the set of related but opposed viewpoints oriented character of reader comment conversa- expressed in the comment set. I.e. an issue is that tions could be derived from the graphical repre- which a viewpoint is a viewpoint on. sentation of the argument structure of a set of com- Issues may be expressed in various ways, e.g. ments and the article, as presented in Section 2. (1) via a “whether or not”-type expression, e.g. whether or not to lower the drinking age; (2) via 2 A Framework for Characterising a yes-no question, e.g. Should Britain leave the Argument in Comment on News EU?; (3) via a “which X?”-type expression when 2.1 Issues, Viewpoints and Assertions there are more than two alternatives, e.g. Which was the best film of 2015?. However, issues are From an idealised perspective, commenters ad- rarely explicitly articulated in reader comments or issues viewpoints dress , hold on issues and make in the initial news article. Rather, as the dialogue assertions , which serve many purposes including evolves, a set of assertions made by commenters directly expressing a viewpoint and providing ev- may indicate a space of alternative, opposed view- idence for an assertion (or viewpoint). Of course points, and an issue can then be recognised. reader comments may also have other functions, Sub-issues frequently emerge within the discus- e.g. expressing emotions or making jokes, but here sion of an issue, i.e. issues have a recursive nature. we are primarily interested in their argumentative When evidence proposed as support for a view- content. We expand on these terms as follows1: point on an issue is contended, the two contending Assertions A comment typically comprises one comments, which may in turn attract further com- or more assertions – propositions that the com- ments, become viewpoints on a new issue, sub- menter puts forward and believes to be true. Each ordinate to the first. Sub-sub-issues may arise be- assertion has a particular role in the local dis- low sub-issues and so on. course. We find relations between assertions made within a comment, between assertions made in dif- 2.2 A Graphical Representation ferent comments and between assertions in com- In the previous sub-section we defined the key ments and assertions in the article. Key relations concepts in our approach to analysing argument between assertions include: rationale (one pro- in comment on news. To demonstrate how they vides evidence to support another); agree/disagree can be used to analyse a particular news article (one agrees or disagrees with another). plus comment set we propose a graphical repre- Viewpoints Disagreement or contention be- sentation of the argument structure, with indices tween comments is a pervasive feature of reader that anchor the representation in textual elements. comment and news. 
When an assertion made by A graphical approach is well-suited to the task of one comment is contradicted by or contends i) an identifying structural relations between elements assertion expressed in another comment, or ii) an in a scheme, particularly when some of the ele- assertion reported in or entailed by something re- ments are abstractions not themselves directly rep- ported in the news article, each opposed assertion resented in the text (as is widely recognised in expresses a viewpoint. It follows that whether or the argumentation community (Reed et al., 2007; not an assertion expresses a viewpoint is an emer- Conklin and Begeman, 1988)). gent property of the discourse and only relative to We introduce our graphical representation via the local discourse; it is not an inherent feature of an example. Figure 1 shows an extract from a the assertion itself. Guardian news article about the controversy sur- 1We would like to thank one our reviewers for pointing rounding a town council’s decision to reduce the out close similarities between the framework we describe frequency of bin collection, and 11 (of 248) com- here and the IBIS framework of Kunz and Rittel (1970). In ments posted in response to the article. Figure 2 particular they share the ideas that issues are questions, are key primitive elements in a theory of argumentative discourse shows a partial depiction of the issues, viewpoints and emerge dynamically and recursively in argument. and rationales and argumentative structure in this

[S0] Rubbish? [S1] Bury council votes to collect wheelie bins just once every three weeks [S2] Locals fear the new move will lead to an increase in fly-tipping and attract foxes and vermin, but the council insists it will make the borough more environmentally friendly. [S3] Is it just a desperate cost cutting measure? . . . [S4] A council in Greater Manchester is to be the first in England to start collecting wheelie bins only once every three weeks, scrapping the current fortnightly collection. [S5] The controversial decision was unanimously passed by councillors in Bury on Wednesday night, despite fears fly tipping would increase. [S6] One councillor who voted for the motion accused her opponents of "scaremongering" after they warned rubbish would pile up and attract vermin. [S7] Another argued the money saved could be spent on more social workers. [S8] It affects the grey bins used for general household waste which can't be recycled . . . [S9] The Labour-run council claims the move is part of a strategy to turn Bury into a "zero waste borough", boost recycling and save money on landfill fees . . . [S10] Many residents feel it is simply a desperate cost saving measure, after the town hall was told to make more than £32m of cuts over the next two years . . .

Id 1, Poster A: I can't see how it won't attract rats and other vermin. I know some difficult decisions have to be made with cuts to funding, but this seems like a very poorly thought out idea.
Id 2, Poster B, reply 2→1: Plenty of people use compost bins and have no trouble with rats or foxes.
Id 3, Poster C, reply 3→2: If they are well-designed and well-managed – which is very easily accomplished. If 75% of this borough composted their waste at home then they could have their bins collected every six weeks. It's amazing what doesn't need to be put into landfill.
Id 4, Poster D, reply 4→1: It won't attract vermin if the rubbish is all in the bins. Is Bury going to provide larger bins for families or provide bins for kitchen and garden waste to cut down the amount that goes to landfill? Many people won't fill the bins in 3 weeks - even when there was 5 of us here, we would have just about managed.
Id 5, Poster E, reply 5→1: Expect Bury to be knee deep in rubbish by Christmas, it's a lame brained Labour idea and before long it'll be once a month collections. I'm not sure what the rubbish collectors will be doing if there are any. We are moving back to the Middle Ages, expect plague and pestilence.
Id 6, Poster F: Are they completely crazy? What do they want a new Plague?
Id 7, Poster G, reply 7→6: Interesting how you suggest that someone else is completely crazy, and then talk about a new plague.
Id 8, Poster H, reply 8→7: Do you think this is a good idea? We struggle with fortnightly collection. This is tantamount to a dereliction of duty. What are taxpayers paying for? I doubt anyone knew of this before casting their vote.
Id 9, Poster I, reply 9→8: I think it is an excellent idea. We have fortnightly collection, and the bin is usually half full or less [family of 5].. Since 38 of the 51 council seats are held by Labour, it seems that people did vote for this. Does any party offer weekly collections?
Id 10, Poster G, reply 10→8: I don't think it's a good idea. But.. it won't cause a plague epidemic.
Id 11, Poster G, reply 11→9: I live by myself, so my bin is going to be smaller.. but I probably have more bin-space-per-person. And I recycle everything I can possibly recycle, and make sure nothing slips through the net. Yet I almost fill my bin with food waste and the odd bit of unrecyclable packaging in a fortnight.. How are you keeping your bin so empty?

Figure 1: Part of a news article (top) and comments responding to it (bottom). Comments are taken from two threads in sequence but some intermediate comments have been omitted. Full article and comments at: http://gu.com/p/4v2pb/sbl.

Nodes in the graph represent issues, viewpoints or assertions. Issues are distinguished by italics, e.g. Is reducing bin collection to once every 3 weeks a good idea? Nodes inside dashed boxes are implicit parts of the argument, i.e. are not directly expressed in the comments or article but are implied by them and allow other nodes which are explicit to be integrated into the overall argument structure. Nodes are labelled with abstract glosses of content explicitly or implicitly mentioned in the article, with repeated content represented only once. For nodes that are expressed or signalled in the news article or comments, a list of article sentence [sn] and comment [cn] indices is given, grounding the argument graph in the text.

Relations between nodes are indicated by directed edges in the graph. Orange edges indicate that the assertion at the tail of the arrow is a viewpoint on the issue at the head, e.g. [less frequent collection] "will attract vermin" or "will not attract vermin". Blue edges indicate that the assertion at the tail provides a rationale for the assertion at the head.
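To make the graph structure concrete, the following is a minimal sketch of how a small fragment of Figure 2 could be encoded in memory. This is our own illustration, not part of the paper's proposal: the class and field names are invented for this example, and the index lists are copied from the node labels in Figure 2.

```python
# A minimal, hypothetical encoding of the argument graph described above.
# Node kinds ("issue", "viewpoint", "rationale"), edge kinds ("viewpoint-on",
# "rationale-for") and the index lists ([s5], [c3], ...) mirror the scheme,
# but all class and field names here are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    node_id: str
    kind: str                      # "issue" | "viewpoint" | "rationale"
    gloss: str                     # abstract gloss of the node content
    indices: List[str] = field(default_factory=list)  # e.g. ["s2", "c1", "c6"]
    implicit: bool = False         # dashed-box nodes: implied, not stated

@dataclass
class Edge:
    source: str                    # tail node id
    target: str                    # head node id
    relation: str                  # "viewpoint-on" | "rationale-for"

# A four-node fragment of the graph in Figure 2, encoded by hand.
nodes = [
    Node("i1", "issue", "Is reducing bin collection to once every 3 weeks a good idea?", ["c8"]),
    Node("v1", "viewpoint", "No, it is a bad idea", ["s2", "c1", "c6", "c8", "c10"]),
    Node("v2", "viewpoint", "Yes, it is a good idea", ["s1", "s5", "s6", "c9"]),
    Node("r1", "rationale", "Will attract vermin", ["s2", "s6", "c1", "c5"]),
]
edges = [
    Edge("v1", "i1", "viewpoint-on"),
    Edge("v2", "i1", "viewpoint-on"),
    Edge("r1", "v1", "rationale-for"),
]
```

In the full graph, a contended rationale such as "Will attract vermin" also serves as a viewpoint on a sub-issue; the fragment above keeps only its rationale role for simplicity.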

To create such a graph manually is a laborious process of iterative refinement: (1) Look for textual units expressing contention in the article. When spotted, formulate initial glosses of opposed viewpoints. (2) Read the comments in turn, by thread, looking for textual units expressing contention between comments or between comments and the article. When spotted, formulate/refine glosses for opposed viewpoints (will attract vermin / won't attract vermin) and propose a potential issue. (3) Group similar and related comments together – this requires re-reading earlier content to assess similarity and may result in refining earlier glosses of viewpoints and issues. (4) As rationale relations are recognised, add these in beneath viewpoint or rationale nodes already in the graph. Implicit assertions or viewpoints may need to be added to capture the structure of the argument (e.g. households will produce more rubbish in 3 weeks than can fit in the bin). This should only be done when necessary to integrate parts of the argument that are explicit (i.e. not all implicit assertions need to be made explicit).

Figure 2: Argument graph for the article and comment subset shown in Figure 1. (The graph itself is not reproduced here; its nodes carry textual glosses and index lists, and a key distinguishes issues in italics, viewpoints on issues, rationales/grounds, and implicit/suppressed nodes.)

3 Generating Summaries from Argument Graphs

Argument graphs show the issues raised, viewpoints taken and rationales given across a (possibly very large) set of comments. Such a graph reveals the argumentative structure of the comment set in relation to the article, something that should be reflected in any argument-oriented summary. But clearly it contains far too much information to appear in a summary. How, then, can we use the graph structure and the information it contains to produce a summary?

The argument graph itself could be used as a visual presentation mechanism for the argumentative content of the comments and article. The user could, e.g., be allowed to expand or collapse nodes on demand, starting from the top-most issue(s) and viewpoints. While this is a promising line of work, here we focus on how to generate a textual summary from an argument graph. In the next sub-section we discuss features of the graph one could take into account to generate a summary. Then we present two heuristic algorithms for generating summaries that exploit these features and illustrate the output of one of them by using it (manually) to produce a summary from an extended version of the argument graph presented in Figure 2. Finally, we compare that summary with a human summary of the same article and comments produced in a less prescriptive way.

3.1 Features for Summarization

Given an argument graph as presented in Section 2.2, a summarization algorithm could exploit both quantitative features of nodes in the graph and structural relations between nodes.

Features of issue nodes to use include:

Issue Index Count: Number of comments or article sentences explicitly mentioning the issue.

Issue Id | Textual Gloss of Issue Node | Issue Index Count | Max Issue Depth | Sub-Node Count | Sub-Issue Count | Total Index Count
1 | Is reducing bin collection to once every 3 weeks a good idea? | 1 | 7 | 31 | 7 | 58
2 | Will less frequent bin collection attract vermin? | 0 | 6 | 21 | 5 | 38
3 | Will less frequent collection lead to rubbish piling up? | 0 | 4 | 15 | 3 | 24
4 | Will 3 weeks of rubbish fit in the bin? | 0 | 4 | 13 | 2 | 21
5 | Will reducing collection encourage recycling? | 0 | 3 | 10 | 2 | 13
6 | Can people recycle/compost more rubbish than they do? | 0 | 2 | 7 | 1 | 11
7 | Are vermin attracted by the type of rubbish in the black/grey bin? | 0 | 1 | 2 | 0 | 5
8 | Do people in flats have composting facilities? | 0 | 1 | 2 | 0 | 4

Table 1: Issues in Bin Collection Article and First 30 Comments.

Id | Issue | Textual Gloss of Viewpoint on Issue | VPt Index Count | Total Index Count | Evid. Count | Total Evid. Count | Max VPt Depth | Evid. Nodes Contd
1 | 1 | No it is a bad idea | 6 | 21 | 3 | 7 | 4 | 4
2 | 1 | Yes it is a good idea | 4 | 14 | 3 | 7 | 4 | 2
3 | 2 | Will attract vermin | 4 | 12 | 2 | 4 | 3 | 3
4 | 2 | Will not attract vermin | 3 | 19 | 3 | 10 | 5 | 4
5 | 3 | Will lead to rubbish piling up | 2 | 5 | 1 | 2 | 2 | 1
6 | 3 | Will not lead to rubbish piling up | 1 | 12 | 1 | 6 | 4 | 2
7 | 4 | 3 weeks of rubbish can fit in the bin | 2 | 11 | 3 | 5 | 3 | 1
8 | 4 | 3 weeks of rubbish cannot fit in the bin | 0 | 3 | 1 | 1 | 1 | 0
9 | 5 | Reducing collection will encourage recycling | 1 | 6 | 2 | 3 | 2 | 1
10 | 5 | Reducing collection will not encourage recycling | 0 | 5 | 1 | 3 | 2 | 2
11 | 6 | People don't recycle/compost everything they can | 3 | 4 | 1 | 2 | 1 | 0
12 | 6 | People recycle/compost everything they can already | 1 | 5 | 2 | 2 | 1 | 1
13 | 7 | Black/grey bins contain the type of rubbish that attracts vermin | 3 | 3 | 0 | 0 | 0 | 0
14 | 7 | Black/grey bins contain packaging that does not attract vermin | 2 | 2 | 0 | 0 | 0 | 0
15 | 8 | People in flats don't have composting facilities | 3 | 3 | 0 | 0 | 0 | 0
16 | 8 | People in flats have composting facilities | 1 | 1 | 0 | 0 | 0 | 0

Table 2: Viewpoints (VPts) in Bin Collection Article and First 30 Comments.

Maximum Issue Depth: Count of all levels below an issue.

Sub-Node Count: Count of all viewpoint and rationale nodes below the issue node, including points of contention.

Sub-Issue Count: Count of all issues below a given issue.

Total Index Count: Count of all indices on the issue itself and all sub-nodes.

Maximum Issue Depth, together with the Sub-Node Count, gives a broad indication of the structural character of the argument. A shallow depth score of, say, 2, combined with a high number of nodes, suggests that there are lots of different reasons given for supporting or not supporting something, but no complex case given to explain or justify the support for these positions. Sub-Issue Count is indicative of the degree of contention on an issue, and Total Index Count indicates the volume of explicit comment relating to an issue, i.e. it is an indication of the number of participants in the conversation saying things related to the issue. Table 1, which is meant to be indicative only, shows these counts for a version of the argument graph of Figure 2 extended to include 30 comments (two full threads).

Features of viewpoint nodes to use include:

Viewpoint Index Count: Count of the indices of comments or article sentences that directly support the viewpoint.

Viewpoint Total Index Count: Count of all indices on the viewpoint, both direct and indirect; i.e. indices on the viewpoint and indices on all rationale nodes supporting the high-level viewpoint (excludes indices on nodes contending any of the lower-level rationales).

Evidence Count: Count of the number of rationale nodes directly below a viewpoint.

Total Evidence Count: Count of all nodes playing a role in supporting the viewpoint.

Maximum Viewpoint Depth: Count of the levels of rationale given below the viewpoint.

Evidence Nodes Contended: Count of the number of direct contentions to rationales supporting a viewpoint.
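These counts are straightforward to compute once a graph is encoded. The sketch below, which reuses the illustrative Node/Edge encoding from the earlier fragment, shows one possible implementation of a few of them; the function names are ours and the code is indicative only.

```python
# Illustrative computation of a few of the node-level counts defined above,
# over the Node/Edge encoding sketched earlier. Function names are ours.
from collections import defaultdict

def build_children(nodes, edges):
    """Map each node id to the ids of nodes whose edges point at it."""
    children = defaultdict(list)
    for e in edges:
        children[e.target].append(e.source)
    return children

def sub_nodes(node_id, children):
    """Ids of all nodes anywhere below the given node."""
    below = []
    for child in children[node_id]:
        below.append(child)
        below.extend(sub_nodes(child, children))
    return below

def max_depth(node_id, children):
    """Maximum Issue/Viewpoint Depth: number of levels below the node."""
    if not children[node_id]:
        return 0
    return 1 + max(max_depth(c, children) for c in children[node_id])

def total_index_count(node_id, nodes_by_id, children):
    """Total Index Count: indices on the node itself and on all sub-nodes."""
    ids = [node_id] + sub_nodes(node_id, children)
    return sum(len(nodes_by_id[i].indices) for i in ids)
```

On the four-node fragment, max_depth gives 2 for the root issue and total_index_count gives 14; run over the full 30-comment graph, counts of this kind would populate tables like Tables 1 and 2 (modulo the exclusions in the Viewpoint Total Index Count definition, which this sketch does not implement).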

Viewpoint Index Count shows the strength of direct support for a viewpoint. Viewpoint Total Index Count provides an indication of both direct support for the viewpoint and support for the supporting arguments. Together, Evidence Count, Total Evidence Count and Maximum Viewpoint Depth indicate the structural complexity and detail of the supporting case. Evidence Nodes Contended indicates the degree to which rationales for a viewpoint are contended by counter-arguments. Table 2 shows these counts for the extended version of the argument graph shown in Figure 2. Tables 1 and 2 define measures relating to the position and popularity of issue and viewpoint nodes in an argument graph. The same measures can be calculated for evidence nodes.

3.2 Algorithms for Summarization

The counts specified in the previous section, together with the structure of the argument graph, can be used in many different ways to generate summaries. Here we mention just two as an indication of the space of possibilities. [2]

Simple Issue-oriented Summarizer. One simple baseline is to list the issues discussed, up to the summary length limit, ordered by whichever quantitative measure for issues is felt to best indicate significance. Choosing an ordering measure like Total Index Count places value on the number of commenters discussing the issue; choosing Sub-Node Count favours more elaborated arguments, and Sub-Issue Count favours issues that give rise to more contention. In the example shown in Table 1 the various measures all correlate closely, so the choice of which measure to use is arbitrary; however, this need not always be the case.

Simple Argument-Oriented Summarizer. The simple issue-oriented approach ignores information about viewpoints and rationales and about which sub-issues relate to which specific dominating issues. A more interesting approach to summarization should take this into account. One way to do this is shown in Algorithm 1, which outlines the logic for selecting the content for inclusion in an argument-oriented summary of one issue. Given an argument graph and an issue, the algorithm starts by including the issue in the summary; then, for each viewpoint on the issue, it adds that viewpoint in an order defined over some feature of viewpoints (e.g. Total Index Count). As each viewpoint is added, evidence nodes for the viewpoint are added, ordered by some node feature (e.g. Evidence Nodes Contended), provided their count on this feature exceeds a threshold.

Algorithm 1: Single Issue Summarizer
Require: An argument graph G; issue I in G; comparison functions f_S(., .), f_E(., .) for ordering viewpoint and rationale nodes respectively; a measure Threshold_E on evidence nodes
 1: Summary <- []
 2: Summary += I
 3: S <- list of the viewpoints on I ordered by f_S
 4: for each viewpoint s in S do
 5:     Summary += s
 6:     R_s <- list of rationales for s ordered by f_E
 7:     for each rationale node r_s in R_s do
 8:         if f_E(Threshold_E, r_s) then
 9:             Summary += r_s
10:         end if
11:     end for
12: end for

Algorithm 1 only summarizes a single issue. It could be used to generate a high-level summary of an argument graph by calling it with the root node. Or, it could be extended to cover more of the graph in various ways. For example, after line 9, when the decision to add an evidence node r_s to the summary has been made, r_s could be checked to see if it is a viewpoint on an issue, i.e. if it has been contended. If yes, it could be added to a list SubI of sub-issues to report in the summary. After line 12, SubI could be sorted by some measure of importance and, possibly, thresholded or truncated to shorten it. The algorithm could then be called recursively on each of the issues in SubI, or SubI could be returned and added to an agenda maintained by a higher-level controlling algorithm, which calls Algorithm 1 iteratively on each of the issues in its agenda. Of course the summary must not exceed its length limit.

[2] One important technical observation should be made. In the example above the argument graph is a connected graph in which there is a unique issue node (the root issue): (1) to which all other nodes are connected, either via viewpoint or rationale relations or via issues arising from contention of a rationale node otherwise related to the root issue, and (2) none of whose viewpoint nodes are rationales for other nodes in the argument. In Figure 2, e.g., while no issue node has a parent, the issue Is reducing bin collection to once every 3 weeks a good idea? is unique in that none of its viewpoint nodes is a rationale for any other node in the graph. In the general case, comment sets can give rise to multiple, unrelated root issues. We do not discuss such cases here, i.e. we assume the graph to be summarised is a connected graph with a single root issue. However, the algorithms discussed below could easily be extended to accommodate the more general case, e.g. by distributing the total summary length between each root-dominated sub-graph, possibly allowing "larger" sub-graphs, as determined by one of the measures above, a greater proportion of overall summary length.
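For concreteness, the following is a direct transcription of Algorithm 1 into Python over the same illustrative encoding (reusing build_children from the earlier sketch). The ordering keys and the threshold test stand in for the paper's f_S, f_E and Threshold_E and are deliberately simple; this is a sketch of the selection logic, not the authors' implementation.

```python
# A sketch of Algorithm 1 (Single Issue Summarizer) over the Node/Edge
# encoding used earlier. The ordering keys and the threshold stand in for
# the paper's f_S, f_E and Threshold_E; they are illustrative choices only.
def viewpoints_on(issue_id, nodes_by_id, children):
    return [nodes_by_id[c] for c in children[issue_id]
            if nodes_by_id[c].kind == "viewpoint"]

def rationales_for(node_id, nodes_by_id, children):
    return [nodes_by_id[c] for c in children[node_id]
            if nodes_by_id[c].kind == "rationale"]

def single_issue_summary(issue_id, nodes_by_id, children,
                         f_s=lambda n: len(n.indices),   # order viewpoints
                         f_e=lambda n: len(n.indices),   # order rationales
                         threshold_e=1):
    issue = nodes_by_id[issue_id]
    summary = [issue]                                     # lines 1-2
    vps = sorted(viewpoints_on(issue_id, nodes_by_id, children),
                 key=f_s, reverse=True)                   # line 3
    for vp in vps:                                        # line 4
        summary.append(vp)                                # line 5
        rats = sorted(rationales_for(vp.node_id, nodes_by_id, children),
                      key=f_e, reverse=True)              # line 6
        for r in rats:                                    # line 7
            if f_e(r) >= threshold_e:                     # line 8
                summary.append(r)                         # line 9
    return [n.gloss for n in summary]

# Example call on the four-node fragment:
# single_issue_summary("i1", {n.node_id: n for n in nodes},
#                      build_children(nodes, edges))
```

Called on the root issue of the four-node fragment, this returns the glosses of the issue, the two viewpoints ordered by index count, and the single rationale, i.e. exactly the kind of content Algorithm 1 selects.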

Summary 1 (97 words): Many commenters were unhappy with the less frequent collections; some were struggling already with the fortnightly collection and were concerned with vermin or overflowing bins. A few commenters, however, countered that black bins were for non-food waste that would not attract vermin. Other commenters thought fewer collections were manageable if people recycled their food waste, garden waste, and any other recycleable materials. Few commenters, however, pointed out the lack of composting facilities for those living in some areas or flats. The council should provide more and services in these areas to encourage more people to recycle.

Summary 2 (112 words): The central issue discussed was whether reducing bin collection to once every 3 weeks is a good idea. Some argued it was a bad idea because it would lead to vermin being attracted and to an increase in fly tipping. Others argued it was a good idea as it would save money that could be spent on other council services and would make the borough more environmentally friendly. Whether the proposal would lead to vermin being attracted was debated. Some argued they would because the proposal would lead to rubbish piling up in the streets. Others argued it would not as the proposal would not lead to rubbish piling up in the streets.

Figure 3: A human authored summary (Summary 1) and a potential automatic summary (Summary 2) of the first 30 comments on the Bury Bin Collection Article.

The algorithm only selects the content for inclusion in a summary and ignores details of how it is to be realised. A more or less mechanical surface realisation process could be used to generate a summary like that shown in Figure 3, Summary 2. In this summary for the extended argument graph underlying Tables 1 and 2, we assume the root issue has been summarised (sentences 1-3) and that one further issue (2) has also been chosen for inclusion using the sort of extension to the algorithm described in the last paragraph (sentences 4-6).

3.3 Comparison with Human Summaries

Figure 3 shows a human-authored summary of the first 30 comments on the bin collection article, created as part of a corpus of gold standard reader comment summaries (SENSEI Project, 2016). Annotators created the gold standard summaries using a novel 3-stage method: (1) each comment in the source set is annotated with a label (i.e. a mini-summary of the main points in the comment); (2) related labels are sorted into groups that the annotator believes will be helpful for writing an overview summary, and a group label is produced to indicate common content in the group; (3) based on the analysis and annotations created in stages (1) and (2), an overview summary is written, which should identify the main points raised in the discussion, different views, areas of consensus, the proportion of comments addressing a topic or sharing a view, and strong feelings shown.

The human summary sentences shown above correspond very closely to elements in the graphical representation of the same 30 comments and, while the summary addresses only a subset of the graph nodes, it does not introduce any additional content. This is a promising, if weak, form of validation as it suggests that summaries read off our argument graph using the algorithm of the last section are very similar to those produced by humans, given only modest direction. Further comparison of our gold standard human summaries with graphical representations of the same source texts might provide additional insights into how to refine algorithms for summary content selection.

4 Related Work

In recent years various authors have begun work on argument mining in on-line discussion forums (e.g., Cabrio and Villata (2012); Boltužić and Šnajder (2014); Swanson et al. (2015)) and reader comment on news (e.g., Sobhani et al. (2015); Carstens and Toni (2015); Sardianos et al. (2015)). While sharing some features, such as allowing multiple participants to exchange views, make claims and supply supporting arguments, these two sources of argumentative discourse also exhibit notable differences. For instance, in on-line discussion forums such as debatepedia.org or convinceme.net, debates are topically organized or tagged with key words, e.g. climate change, and a debate is typically framed by a starting motion or question and an example of a supporting or counter statement (similar to our notion of issue and viewpoint). In reader comment this structured information is missing and the debate is framed solely by a document (the article), with issues, as we define them, rarely explicitly signalled in the article or comments. Thus, the task of structuring the debate by discovering the issues, which our framework addresses, is a challenge of particular importance for reader comment.

Many authors propose models of argumentation and associated annotation schemes, e.g. Ghosh et al. (2014), Swanson et al. (2015), Carstens and Toni (2015). These models/schemes specify a set of argumentative elements and relations between them and, as noted by Peldszus and Stede (2013), approaches to argument mining typically address the subtasks of identifying, classifying and relating argumentative discourse units (ADUs) according to the types of ADU and argumentative relation specified in whatever model/scheme has been adopted. Our framework too relies upon defining and operationalising the identification of similar argument elements and relations (viewpoints and rationales in our case). However, with the exception of Kunz and Rittel (1970), we are not aware of any argumentation model that puts the notion of issue, in the form of a question, at the centre of the model and organises argument elements and relations around it.

Aside from differences in the text type addressed (reader comment rather than on-line debate) and the prominence given to the notion of issue in our analytical framework, our principal difference to other work in argument mining is the task we focus on: summarization. Some authors have cited summarization as a motivating end-user task, e.g. Swanson et al. (2015) and Misra et al. (2015). However, both these works aim at summarising an argument on a single topic like "gun control" across multiple dialogues and do not address the summarization of single, multi-party argumentative conversations that may address multiple issues, such as those found in reader comments. To the best of our knowledge no one has addressed the form that an end-user overview summary of reader comment might take or how it might be generated from the abstract representation of an argument, as we do in this paper.

5 Discussion and Future Work

In this paper we have defined notions of issue, viewpoint and assertion as part of a framework for analysing argumentative conversations such as those that appear in response to news articles in on-line news. We introduced a graphical representation for representing these argument elements and the relations between them, such as viewpoint-on, holding between viewpoints and issues, and rationale-for, holding between assertions and viewpoints/other assertions. We also discussed how an argument graph can be used to generate summaries of argumentative conversations, proposing features that could be extracted from an argument graph to assist in selecting content to be summarised and sketching two basic summarization algorithms suggestive of the space of possible algorithms that could be developed.

We are fully aware that our analytical framework, graphical representation and proposals for summarization algorithms are theoretical preliminaries and, while grounded in extensive observation and analysis of data, need to be implemented and empirically evaluated to be validated. This forms the core of our current and future work. Specifically, we need to further develop, implement and evaluate (1) methods for reliably extracting an argument graph from news articles and comments and (2) summarization algorithms of the sort outlined above. Building argument graphs is the greater of these challenges and is perhaps best approached by factoring it into sub-tasks, such as candidate assertion detection, argumentative relation detection and issue identification. Candidate assertion detection involves segmenting the text into clauses that could play a role in the argument. Argumentative relation detection involves identifying various relations between candidate assertions, such as identity, disagreement or contradiction, and evidence or support. Issue identification involves detecting a disagreement or contradiction relation between assertions and establishing sufficient supporting argumentation for the opposed assertions and/or repetition across multiple participants in the conversation to deem them an issue. Building components to carry out these sub-tasks is likely to require the creation of annotated resources for training and testing. Existing supervised learning techniques can then be brought to bear. As well as implementing our proposals, further work should be carried out to refine and validate our analytical framework, e.g. by getting multiple analysts to generate argument graphs for a corpus of comment sets.

While these challenges are substantial, we believe the proposals made in this paper provide a realistic framework to progress work on summarization of multi-party argumentative conversations.

Acknowledgements. The authors would like to thank the European Commission for supporting this work, carried out as part of the FP7 SENSEI project, grant reference: FP7-ICT-610916.

References

Filip Boltužić and Jan Šnajder. 2014. Back up your stance: Recognizing arguments in online discussions. In Proceedings of the First Workshop on Argumentation Mining, pages 49–58.

Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 208–212.

Lucas Carstens and Francesca Toni. 2015. Towards relation based argumentation mining. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 29–34, Denver, CO, June. Association for Computational Linguistics.

Jeff Conklin and Michael L. Begeman. 1988. gIBIS: A hypertext tool for exploratory policy discussion. ACM Trans. Inf. Syst., 6(4):303–331.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48.

Ivan Habernal, Judith Eckle-Kohler, and Iryna Gurevych. 2014. Argumentation mining on the web from information seeking perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, pages 26–39.

Elham Khabiri, James Caverlee, and Chiao-Fang Hsu. 2011. Summarizing user-contributed comments. In Proceedings of The Fifth International AAAI Conference on Weblogs and Social Media (ICWSM-11), pages 534–537.

Werner Kunz and Horst W. J. Rittel. 1970. Issues as elements of information systems. Working Paper No. 131, Institute of Urban and Regional Development, Univ. of California, Berkeley, Calif.

Clare Llewellyn, Claire Grover, and Jon Oberlander. 2014. Summarizing newspaper comments. In Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, pages 599–602.

Zongyang Ma, Aixin Sun, Quan Yuan, and Gao Cong. 2012. Topic-driven reader comments summarization. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 265–274.

Amita Misra, Pranav Anand, Jean E. Fox Tree, and Marilyn Walker. 2015. Using summarization to discover argument facets in online idealogical dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 430–440.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Chris Reed, Douglas Walton, and Fabrizio Macagno. 2007. Argument diagramming in logic, law and artificial intelligence. Knowledge Engineering Review, 22(1):87–109.

Christos Sardianos, Ioannis Manousos Katakis, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument extraction from news. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 56–66.

SENSEI Project. 2016. The SENSEI corpus of human summaries of reader comment conversations in on-line news. Available from: nlp.shef.ac.uk/sensei/.

Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 67–77.

Reid Swanson, Brian Ecker, and Marilyn Walker. 2015. Argument mining: Extracting arguments from online dialogue. In Proceedings of the SIGDIAL 2015 Conference, pages 217–226.

Argumentative texts and clause types

Maria Becker, Alexis Palmer, and Anette Frank
Leibniz ScienceCampus Empirical Linguistics and Computational Language Modeling
Department of Computational Linguistics, Heidelberg University
Institute of German Language, Mannheim
{mbecker,palmer,frank}@cl.uni-heidelberg.de

Abstract

Argumentative texts have been thoroughly analyzed for their argumentative structure, and recent efforts aim at their automatic classification. This work investigates linguistic properties of argumentative texts and text passages in terms of their semantic clause types. We annotate argumentative texts with Situation Entity (SE) classes, which combine notions from lexical aspect (states, events) with genericity and habituality of clauses. We analyse the correlation of SE classes with argumentative text genres, components of argument structures, and some functions of those components. Our analysis reveals interesting relations between the distribution of SE types and the argumentative text genre, compared to other genres like fiction or report. We also see tendencies in the correlations between argument components (such as premises and conclusions) and SE types, as well as between argumentative functions (such as support and rebuttal) and SE types. The observed tendencies can be deployed for automatic recognition and fine-grained classification of argumentative text passages.

1 Introduction

In this study we annotate a corpus of short argumentative texts with semantic clause types drawn from work in theoretical linguistics on modes of discourse. The aim is to better understand the linguistic characteristics of argumentative text passages. Our study suggests that these clause types, as linguistic features of argumentative text passages, could be useful for automatic argument mining – for identifying argumentative regions of text, for identifying premises and conclusions/claims, or for classifying the argumentative functions served by premises.

Specifically, this is an empirical investigation of the semantic types of clauses found in argumentative text passages, using the inventory of clause types developed by Smith (2003) and extended in later work (Palmer et al., 2007; Friedrich and Palmer, 2014). Situation entity (SE) types describe how clauses behave in discourse, and as such they are aspectual rather than ontological categories. Individual clauses of text evoke different types of situations (for example, states, events, generics, or habituals), and the situations evoked in a text passage are linked to the text type of the passage. For more detail see Section 2. Furthermore, SE types are recognizable (and annotatable) through a combination of linguistic features of the clause and its main verb, and models have recently been released for their automatic classification (Friedrich et al., 2016).

Our approach is the first we know of to link clause type to argumentative structure, although features of the verb have been widely used in previous work for classifying argumentative vs. non-argumentative sentences. For example, Moens et al. (2007) include verb lemmas and modal auxiliaries as features, and Florou et al. (2013) find that, for Greek web texts related to public policy issues, tense and mood features of verbal constructions are helpful for determining the role of the sentences within argumentative structures.

Our analysis is performed on German texts. Taking the argumentative microtext corpus (Peldszus and Stede, 2015a) as a set of prototypical argumentative text passages, we annotated each clause with its SE type (Section 3). [1] In this way we are able to investigate which SE types are most prevalent in argumentative texts and, further, to link the clause

[1] The segmentation of microtexts into clauses is discussed in Section 3.2.

21 Proceedings of the 3rd Workshop on Argument Mining, pages 21–30, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics type (via SE label) to the argumentation graphs seen growing interest in computational linguistics provided with the microtext corpus (Section 4). We (Siegel and McKeown, 2000; Zarcone and Lenci, additionally provide SE annotations for a number 2008; Herbelot and Copestake, 2009; Reiter and of non-argumentative (or at least only partially ar- Frank, 2010; Costa and Branco, 2012; Nedoluzhko, gumentative) German texts, in order to contrast SE 2013; Friedrich and Pinkal, 2015, for example). type distributions across these text types. This annotation case study addresses the follow- 2.1 Situation entity types ing questions: We directly follow Friedrich and Palmer (2014) for the inventory of SE types, described below. 1. Do argumentative text passages differ from The inventory of SE types starts with states and non-argumentative text passages with respect events, including a subtype of events for attribu- to clause type? tional statements. REPORT-type clauses such as (3) 2. Do particular clause types correlate with par- do not necessarily refer to an actual event of speak- ticular elements in the argumentation graphs? ing but rather indicate a source of information. 3. Do particular clause types correlate with par- ticular functions of argument components? 1.S TATE (S) Armin has brown eyes. 2.E VENT (E) Bonnie ate three tacos. Recently, systems have been developed for au- 3. REPORT (R) The agency said applications tomatically deriving full argumentative structures had increased. from text. Peldszus and Stede (2015b) present a system for automatic argument structure prediction, Phenomena such as modality, negation, future the first for the microtext corpus. The linguistic fea- tense, and conditionality, when coupled with an tures used by the system include discourse connec- EVENT-type clause, cause a coercion to STATE. tives, lemmas and parts-of-speech, verb morphol- In brief, such phenomena refer to actual or poten- ogy, and dependency patterns. Stab and Gurevych tial states of the world rather than actual events. (2016) develop an end-to-end argument structure Several examples appear below. parser for persuasive texts. The parser performs the E S: Carlo should get the job. entire task pipeline, including segmentation of texts • → into argumentative vs. non-argumentative compo- E S: Darya did not answer. • → nents, using sequence labeling at the token level E S: If he wins the race, . . . • → in order to accurately model sentences containing multiple argument components. The features used An important distinction for argumentative texts, vary with the particular task, with structural, syn- as we will see, is between generic sentences and tactic, lexical, and lexico-syntactic, as well as dis- generalizing sentences, or habituals. While the course connective, features. former predicate over classes or kinds, the latter None of these systems have investigated seman- describe regularly-occurring events, such as habits tic features from the perspective of the clause. of individuals. We propose, based on the outcomes from our anno- 4. GENERIC SENTENCE (GEN): Birds can fly. / tation and analysis, that SE types are worth explor- Scientific papers make arguments. ing as features for argument mining. 5.G ENERALIZING SENTENCE (GS): 2 Theoretical background Fei travels to India every year. 
The phrase situation entity refers to the fact that The next category of SE types is broadly referred clauses of text evoke situations within a discourse. to as Abstract Entities. This type of clause presents For example, the previous sentence describes two semantic content in a manner that draws attention situations: (i) the meaning of situation entity, and to its epistemic status. We focus primarily on a (ii) what clauses of text do, in general. The second small subset of constructions - factive and proposi- situation is embedded as part of the first. Notions tional predicates with clausal complements. Of related to SE type have been widely studied in the- course a wide range of linguistic constructions oretical linguistics (Vendler, 1957; Verkuyl, 1972; can be used to convey such information, and to Dowty, 1979; Smith, 1991; Asher, 1993; Carl- address them all would require a comprehensive son and Pelletier, 1995, among others) and have treament of subjective language. In the examples

22 below, the matrix clause is in both cases a STATE, as Methods or Results. Going back to discourse and the embedded clause (in italics) is both an modes, Smith’s claim, supported by manual tex- EVENT and either a FACT, PROPOSITION, or RE- tual analysis, is that different modes show different SEMBLANCE-type situation entity. Note that in the characteristic distributions of SE types. RESEMBLANCE case, it is unclear whether or not Previous work shows that this claim is supported the embedded event took place. at the level of genre (Palmer and Friedrich, 2014), taking genre as an approximation of discourse 6. FACT (F): Georg knows that Reza won the mode. This approximation is problematic, though, competition. because texts of any genre are in fact composed 7. PROPOSITION (P): Georg thinks that Reza of multiple text passages which instantiate differ- won the competition. ent discourse modes. In this study we focus down 8. RESEMBLANCE (RES): Reza looks like he to the level of the text passage, offering the first won the competition. empirical analysis of argumentative text passages The first sentence of this section, for example, from the perspective of clause types. The analy- would be segmented and labeled as shown below. sis looks at German texts, both contrasting purely argumentative texts with mixed texts and, further, (i) The phrase situation entity refers to the fact analyzing the correlations between SE annotations [S] and argument structure graphs. (ii) that clauses of text evoke situations within a discourse. [F, GEN] 3 Data The first segment is a STATE, and the embedded In this section we describe the corpus of texts we clause is both a FACT, by virtue of being the annotate with SE types and the process used for complement clause following the fact that, and annotating the texts. Section 3.3 presents the first a GENERIC SENTENCE, by virtue of predicating step of the analysis, comparing the distribution of over a class of entities (clauses of text). SE types in the argumentative microtexts to those Finally, the labels QUESTION and IMPERATIVE for texts from other genres. are added to allow complete annotation of texts. Although they fulfill important and varied rhetori- 3.1 Argumentative microtext corpus cal functions, neither of these clause types directly Our data includes the whole argumentative micro- evoke situations. text corpus (Peldszus and Stede, 2015a) which con- sists of 112 German texts.2 Each microtext is a 9.Q UESTION (Q): Why do you torment me so? short, dense argument written in response to a ques- 10.I MPERATIVE (IMP): Listen to this. tion on some potentially controversial topic (e.g. 2.2 Discourse modes ”Should intelligence services be regulated more The inventory of SE types is motivated by theoreti- tightly by parliament?”). The writers were asked cal work (Smith, 2003; Smith, 2005) which aims to include a direct statement of their main claim to determine which specific linguistic features of as well as at least one objection to that claim. The text passages allow human readers to distinguish, texts, each of which contains roughly 5 argumen- e.g. narrative passages from argumentative pas- tative segments (and no irrelevant segments), were sages. Smith identifies five modes of discourse: first written in German and professionally trans- Narrative, Description, Report, Information, and lated into English. Argument/Commentary. 
Each mode is linked to An example text from the corpus appears in Ta- linguistic characteristics of the clauses which com- ble 1, segmented into argumentative discourse pose the passages. units (ADUs) as in the original corpus version. The set of SE types outlined above allows a com- The texts in the corpus are manually annotated plete, comprehensive description of these five dis- according to a scheme based on Freeman’s theory course modes. The discourse modes approach is of the macro-structure of argumentation (Freeman, related to that of Argumentative Zoning (Teufel, 1991; Freeman, 2011) for representing text-level 2000; O’Seaghdha and Teufel, 2014), in which argumentation structure. This scheme represents linguistic features of scientific texts are used to dis- 2https://github.com/peldszus/ tinguish genre-specific types of text passages, such arg-microtexts

SE | German / English
GS | Die Geheimdienste müssen dringend stärker vom Parlament kontrolliert werden, / Intelligence services must urgently be regulated more tightly by parliament;
S | das sollte jedem nach den Enthüllungen von Edward Snowden klar sein. / this should be clear to everyone after the disclosures of Edward Snowden.
S | Die betreffen zwar vor allem die britischen und amerikanischen Geheimdienste, / Granted, those concern primarily the British and American intelligence services,
GEN | aber mit denen arbeiten die deutschen Dienste bekanntermaßen eng zusammen. / but the German services evidently do collaborate with them closely.
GS | Deren Werkzeuge, Daten und Knowhow wird schon lange zu unserer Überwachung genutzt. / Their tools, data and expertise have been used to keep us under surveillance for a long time.

Table 1: Sample microtext (micro b005), both German and English versions, with SE labels. an argument, consisting of one conclusion and supports the conclusion (first segment), while the one or more premises, as a “hypothetical exchange” third segment undercuts segment two. This under- (Peldszus and Stede, 2013a) between the propo- cutting move itself is then undercut by the fourth nent and the opponent, who respectively defend segment (a proponent node), which is in turn sup- or question a specific claim. Note that Peldszus and ported by segment five. Stede (2013a) use the term “argument” to describe the complex of premises put forward in favor of 3.2 Annotation process a conclusion, while premises and conclusions are The granularity of situation-evoking clauses is dif- propositions expressed in the text. ferent from that of ADUs, requiring that the texts The microtexts are segmented into “elementary be re-segmented prior to SE annotation. Table 3 units of argumentation” (sometimes called ADUs illustrates a microtext with more SE segments than as above), which consist of either the conclusion ADUs. For segmentation we use DiscourseSeg- or a single premise. Each ADU corresponds to a menter (Sidarenka et al., 2015), a python pack- structural element in an argument graph (Figure 1). age offering both rule-based and machine-learning In these texts, conclusions are only associated with based discourse segmenters.3 After preprocessing the proponent, but premises can be associated with texts with the Mate Tools,4 we use DiscourseSeg- either the proponent or the opponent. menter’s rule-based segmenter (edseg), which Round nodes in the graph are proponent nodes employs German-specific rules to determine the (premises or conclusions) while square ones are boundaries of elementary discourse units in texts. opponent nodes (premises only). In addition, sev- Because DiscourseSegmenter occasionally over- eral different supporting and attacking moves (also split segments, we did a small amount of post- called “argumentative functions” of a segment processing. On average, one ADU contains 1.16 (Peldszus and Stede, 2013b)) are labelled and rep- SE segments. Table 2 shows the number of seg- resented by the arcs connecting the nodes. The ments of each type, as well as the distribution of most frequent argumentative functions are: both argument components and SE types over the segments. support: a premise supporting a conclusion • In addition to labeling each segment with its rebuttal: a premise which attacks a conclu- • SE type, we also annotate three important verb- sion or premise by challenging the acceptabil- or clause-level features: (a) lexical aspect (dy- ity of the proposition being attacked namic/stative) of the main verb, (b) whether the undercut: a premise which attacks by chal- • main referent of the clause is generic or non- lenging the acceptability of an inference be- 3https://github.com/WladimirSidorenko/ tween two propositions DiscourseSegmenter 4http://www.ims.uni-stuttgart.de/ The argument graph for our sample text appears forschung/ressourcen/werkzeuge/matetools. in Figure 1. Here we see that the second segment html

Figure 1: Argument graph for sample microtext (micro b005), with SE labels added.

#texts = 112
#ADUs = 576: concl./premise 112 / 464; prop./opp. 451 / 125
#SE Segs = 668: state 212 (0.32); event 57 (0.09); generic 320 (0.48); generalizing 67 (0.10); question 10 (0.01); resemblance 2 (0.00)

Table 2: Microtext corpus used for this analysis (percentages in brackets). ADUs are the argument component segments in the original data; SE Segs are situation entity segments.

German / English (+SE labels)

CONCL. Alternative Heilpraktiken sollten genauso wie ärztliche Behandlungen bezuschusst werden, (GEN) / Alternative treatments should be subsidized in the same way as conventional treatments,

PREMISE weil beide Wege zur Verhinderung, Minderung oder Heilung einer Krankheit führen könnten. (GEN) / since both methods can lead to the prevention, mitigation or cure of an illness.

PREMISE Zudem müsste es im Sinne der Krankenkassen liegen, (GEN) Naturheilpraktiken als ärztliche Behandlung anzuerkennen, (GEN) / Besides it should be in the interest of the health insurers to recognize alternative medicine as treatment,

PREMISE da eine Chance der Genesung besteht. (ST) / since there is a chance of recovery.

PREMISE Es spielt dabei doch keine Rolle, (GS) dass die Behandelnden nicht den Arzte-Status”¨ tragen. (GEN) It doesn’t matter after all that those\\ who administer the treatment don’t have ’doctor status’. \\ Table 3: Sample microtext (micro b010), both German and English versions, with SE labels. generic, and (c) whether the clause is habitual (a annotators and one expert annotator. Following pattern of occurrences), episodic (a fixed number of Mavridou et al. (2015), with a modified and trans- occurrences), or static (an attribute, characteristic, lated version of an existing SE annotation man- or quality). Using these feature values in a deci- ual,5 student annotators were trained on a set of sion tree has been shown to improve human agree- longer texts from different genres, automatically ment on the SE type annotation task (Friedrich and segmented as described above: fiction (47 seg- Palmer, 2014). ments), reports (42 segments), TED talks (50 seg-

Annotators and annotator training. Each text 5www.coli.uni-saarland.de/projects/ was annotated by two trained (but novice) student sitent/page.php?id=resources

25 ments), and commentary (127 segments). Inter-annotator agreement. We compute agree- ment separately for SE type and for the three fea- tures introduced above, as shown in Table 4 (re- ported as Cohen’s Kappa). Numbers in brackets represent IAA for the training texts.

level | Observed Agreement | Chance Agreement | Cohen's Kappa
Main Referent | 0.62 (0.76) | 0.50 (0.57) | 0.23 (0.45)
Aspect | 0.84 (0.91) | 0.50 (0.52) | 0.69 (0.81)
Habituality | 0.68 (0.70) | 0.26 (0.27) | 0.57 (0.58)
Situation Entity | 0.52 (0.61) | 0.21 (0.22) | 0.40 (0.50)

[Figure 2: Distribution of SE types in all genres.]

Table 4: Inter-annotator agreement on microtexts (and training texts).
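The Kappa values in Table 4 follow from the observed and chance agreement via Cohen's formula kappa = (p_o - p_e) / (1 - p_e). The quick check below is our own illustration, not part of the paper:

```python
# Sanity check of Table 4: Cohen's kappa from observed (p_o) and chance (p_e)
# agreement, kappa = (p_o - p_e) / (1 - p_e). Values are from the table above.
rows = {
    "Main Referent":    (0.62, 0.50),
    "Aspect":           (0.84, 0.50),
    "Habituality":      (0.68, 0.26),
    "Situation Entity": (0.52, 0.21),
}
for level, (p_o, p_e) in rows.items():
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"{level}: kappa = {kappa:.2f}")
# Prints approximately 0.24, 0.68, 0.57 and 0.39, matching the reported
# 0.23, 0.69, 0.57 and 0.40 up to rounding of the published p_o and p_e.
```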

The table shows that the IAA for our training set is (slightly) better than the IAA we gained for the microtexts. The best results we obtained in the category Aspect (0.69 K), the worst in the category Main Referent (0.23 K). Error Analysis. As reported by several studies (Peldszus and Stede, 2013b), the annotation of ar- Figure 3: Distribution of SE types in microtexts. gumentative texts is a difficult task for humans. Our annotators often disagreed about the gener- predominant SE types. We investigate this claim icity of the Main Referent (binary: generic/non- by comparing the distribution of SE types in the generic). The SE-labels that caused most dis- microtext subcorpus to those in the training texts agreement among the annotators are STATE vs. (see Figure 2). The microtexts (Figure 3), which GENERIC SENTENCE (39% of all disagreements can be described as ‘purely’ argumentative texts, regarding SE-type), followed by STATE vs. GEN- are characterized by a high proportion of generic ERALIZING SENTENCE (22%). and generalizing sentences and very few events, This first annotation round suggests that we might while reports and talks, for example, contain a high benefit from: proportion of states. Fiction is characterized by 1. adapting the SE annotation scheme for a high number of events. The genre commentary, (purely) argumentative texts; and which contains many argumentative passages but 2. training annotators specifically to deal with is not as ‘purely’ argumentative as the microtexts, (purely) argumentative texts. is most likely to be comparable to the microtexts. This finding suggests that SE types could be helpful Given the relatively low agreement, for the anal- for modeling argumentative regions of text. ysis we produce a gold standard annotation by re- annotating all segments for which the two student 4 Analysis annotators disagree about the SE type label. This third annotation is done by an expert annotator (one Extraction of argument graphs. As a next step, of the authors). For the segments which needed re- we look at the correspondences between SE type annotation, the expert annotator agreed with one of labels and various aspects of the argument graph the two student annotators regarding SE label 87% components (ADUs): of the time. conclusion vs. premise • 3.3 Distributions proponent premise vs. opponent premise • One key claim of Smith (2003) is that text pas- support vs. attack by rebuttal vs. attack by • sages in different discourse modes have different undercut

Figure 4: Correlations between SE types and argument components.

Figure 6: Correlations between SE types and proponent/opponent argument components.

For this analysis, we match the ADU segments extracted from the argument graphs to the situation segments produced by our own segmentation routine. Because the segments of the microtext corpus (ADUs) sometimes contain several situation segments, it can happen that several SE labels are assigned to one ADU. The results of these analyses are presented below.

4.1 Correlations between SE types and argument components

The mapping of the two sets of annotations reveals interesting correlations between SE types and argument components. Figure 4 shows the overall distribution of SE types over specific argument components. First, conclusions are almost exclusively either GENERIC SENTENCES (48%) or STATES (44%), while premises also consist of GENERALIZING SENTENCES (12%) and EVENTS (10%). An example of a generic conclusion is given below:

GEN: Nicht jeder sollte verpflichtet sein, den Rundfunkbeitrag zu bezahlen. (Translation: Not everyone should be obliged to pay the TV & radio licence.)

Figure 5: Correlations between SE types and conclusions/premises.

Figure 5, where all premises are contrasted with all conclusions, shows this difference even more clearly. The STATE label is more frequent for conclusions (44%) than for premises (29%). On the other hand, premises contain 10% EVENT labels, while there are only 3% EVENT labels in the set of conclusions.

The SE types also correspond to the distinction between premises that support a conclusion ("pro" label) and premises that attack a conclusion ("opp" label), as Figure 6 indicates. First, proponent ADUs contain more GENERIC SENTENCE labels (52%) than opponent units (42%). Furthermore, the STATE and EVENT labels are more frequent within opponent units (32% and 14%), while they make up only 29% and 8% in proponent units. To illustrate this, below we show an extract of an argument consisting of a stative opponent premise attacking a conclusion:

Die Krankenkassen sollten Behandlungen beim Natur- oder Heilpraktiker nicht zahlen, es sei denn der versprochene Effekt und dessen medizinischer Nutzen sind handfest nachgewiesen.
(Translation: Health insurance companies should not cover treatment in complementary medicine unless the promised effect and its medical benefit have been concretely proven.)

Of course, the corpus on which these investigations are carried out is quite small, and the phenomena we observe can be interpreted solely as tendencies. Nevertheless, we hypothesize that the prevalence of the GENERIC SENTENCE and GENERALIZING SENTENCE labels, as well as the absence (or rareness) of the EVENT label, are indicators for

argumentative text passages. Furthermore, such prevalences (or also rarenesses) can be interpreted as indicators for particular argument components such as proponent premises or conclusions.

Figure 7: Correlations between SE types and argumentative functions.

Figure 8: Correlations between SE types and argumentative functions with regard to pro/opp.

4.2 Correlations between SE types and argumentative functions

There are also some interesting correlations between SE types and the argumentative functions of premises. In this study we focus on the three most frequent functions in our data, namely support, rebuttal and undercut. Table 5 gives an overview of their distribution within the microtexts.

function    pro/opp      n
support     proponent    250
support     opponent     13
rebut       proponent    12
rebut       opponent     96
undercut    proponent    51
undercut    opponent     12

Table 5: Distribution of argumentative functions within the microtexts.

It is worth mentioning that the rebuttals in the microtext corpus seem to differ slightly from all of the other segments in terms of the frequency of the label GENERIC SENTENCE. While this is by far the most frequent label in the microtexts (overall frequency of 48%), here it has only a frequency of 42%, followed by the label STATE with a frequency of 32% (see Figure 7). Rebuttals are thus less strongly biased toward GENERIC SENTENCES than other argumentative functions. In contrast, supporting and undercutting moves are characterized by an above-average number of GENERIC SENTENCES (51% and 56%).

To give a better idea of how rebuttals of different SE types look, below we show two examples, one a STATE and the other a GENERIC SENTENCE:

Die Krankenkassen sollten Behandlungen beim Natur- oder Heilpraktiker nicht zahlen, es sei denn der versprochene Effekt und dessen medizinischer Nutzen sind handfest nachgewiesen. (STATE)
(Translation: Health insurance companies should not cover treatment in complementary medicine unless the promised effect and its medical benefit have been concretely proven.)

Ja, seinen Müll immer ordentlich zu trennen ist nervig und mühselig. [...] Aber immer noch wird in Deutschland viel zu viel Müll produziert. (GENERIC SENTENCE)
(Translation: Yes, it's annoying and cumbersome to separate your rubbish properly all the time. [...] But still Germany produces way too much rubbish.)

Finally, Figure 8 shows the distribution of SE types among the different argumentative functions, separated into proponent and opponent premises. According to this distribution, support moves, for example, can be distinguished by SE type into premises supporting the opponent and those supporting the proponent. While proponent support premises contain clearly more GENERIC SENTENCES (52%) than STATES (29%), the number of GENS and STATES within opponent support premises is almost equal (32% and 31%).

The following is a generic premise that supports the proponent, contrasted with a STATE premise supporting the opponent:

Mit der BA-Arbeit kann man jedoch die Interessen und die Fachkenntnisse besonders gut zeigen. Schließlich ist man nicht in jedem Fach sehr gut. (GENERIC SENTENCE)
(Translation: With a BA dissertation one can, however, demonstrate interests and subject matter expertise particularly well. After all one doesn't excel in every subject.)

Es bleibt jedoch fragwürdig, ob die tatsächliche Durchführung derartiger Kontrollen auch gleichzeitig zur stärkeren Einhaltung von Gesetzen führt, denn letztendlich liegt diese Entscheidung in den Händen des jeweiligen Regierungschefs. (STATE)
(Translation: Yet it remains questionable whether the actual implementation of such supervision would at the same time lead to a stronger observance of laws, as ultimately this decision is in the hands of the respective government leaders.)

These results suggest that SE types could be helpful even for a finer-grained analysis of argumentative functions. Nonetheless we would reiterate here that our results are based on a small dataset only and need to be confirmed by further experiments. Therefore, annotations of larger texts with a mixture of argumentative and non-argumentative passages are already underway.

5 Conclusion

This paper presents an annotation study whose goal is to dig into the usefulness of semantic clause types (in the form of situation entity labels) for mining argumentative passages and modeling argumentative regions of text. This is the first study to label argumentative texts (microtexts, in this case) with situation entity types at the clause level. We have explored the correlation of SE classes with argumentative text genres, particularly in comparison to non-argumentative texts. We have also looked at the correlations between SE types and both specific argument components (premise vs. conclusion, proponent vs. opponent) and specific argumentative functions (support, rebuttal, undercut).

We do our analysis on German texts, but we fully expect that the results should carry over to other languages, as the link between SE type distribution and discourse mode has been shown to hold cross-linguistically. Due to the small dataset, our results can be interpreted solely as tendencies which have to be confirmed by more extensive studies in the future. Nonetheless there is some evidence that the observed tendencies can be deployed for automatic recognition and fine-grained classification of argumentative text passages.

In addition to the ongoing annotation work, which will give us more data to analyze, we intend to cross-match SE types with the newly-available discourse structure annotations for the microtext corpus (Stede et al., 2016). We would additionally explore the role of modal verbs within this intersection of SE type and argument structure status. The end goal of this investigation, of course, is to deploy automatically-labeled SE types as features for argument mining.

Acknowledgments

We want to thank Michael Staniek for building the annotation tool and preprocessing the texts. We thank Sabrina Effenberger and Rebekka Sons for the annotations and their helpful feedback on the annotation manual. We would also like to thank the reviewers for their insightful comments and suggestions on the paper. This research is funded by the Leibniz ScienceCampus Empirical Linguistics and Computational Language Modeling, supported by Leibniz Association grant no. SAS-2015-IDS-LWC and by the Ministry of Science, Research, and Art of Baden-Württemberg.

References

Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.

Gregory N. Carlson and Francis Jeffry Pelletier, editors. 1995. The Generic Book. University of Chicago Press.

Francisco Costa and António Branco. 2012. Aspectual type and temporal relation classification. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 266–275. Association for Computational Linguistics.

David Dowty. 1979. Word Meaning and Montague Grammar. Reidel.

Eirini Florou, Stasinos Konstantopoulos, Antonis Kukurikos, and Pythagoras Karampiperis. 2013. Argument extraction for supporting public policy formulation. In Proceedings of the 7th Workshop on

Language Technology for Cultural Heritage, Social Sciences, and Humanities.

James B. Freeman. 1991. Dialectics and the Macrostructure of Argument. Foris.

James B. Freeman. 2011. Argument Structure: Representation and Theory, volume 18 of Argumentation Library. Springer.

Annemarie Friedrich and Alexis Palmer. 2014. Situation entity annotation. In Proceedings of the Linguistic Annotation Workshop VIII.

Annemarie Friedrich and Manfred Pinkal. 2015. Discourse-sensitive automatic identification of generic expressions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July.

Annemarie Friedrich, Alexis Palmer, and Manfred Pinkal. 2016. Situation entity types: automatic classification of clause-level aspect. In Proceedings of ACL 2016.

Aurélie Herbelot and Ann Copestake. 2009. Annotating genericity: How do humans decide? (A case study in ontology extraction). Studies in Generative Grammar 101, page 103.

Kleio-Isidora Mavridou, Annemarie Friedrich, Melissa Peate Sorensen, Alexis Palmer, and Manfred Pinkal. 2015. Linking discourse modes and situation entities in a cross-linguistic corpus study. In Proceedings of the EMNLP Workshop LSDSem 2015: Linking Models of Lexical, Sentential and Discourse-level Semantics.

Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of ICAIL 2007.

Anna Nedoluzhko. 2013. Generic noun phrases and annotation of coreference and bridging relations in the Prague Dependency Treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 103–111.

Diarmuid O'Seaghdha and Simone Teufel. 2014. Unsupervised learning of rhetorical structure with un-topic models. In Proceedings of COLING.

Alexis Palmer and Annemarie Friedrich. 2014. Genre distinctions and discourse modes: Text types differ in their situation type distributions. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and NLP.

Alexis Palmer, Elias Ponvert, Jason Baldridge, and Carlota Smith. 2007. A sequencing model for situation entity classification. In Proceedings of ACL.

Andreas Peldszus and Manfred Stede. 2013a. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Andreas Peldszus and Manfred Stede. 2013b. Ranking the annotators: An agreement study on argumentation structure. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 196–204, Sofia, Bulgaria, August. Association for Computational Linguistics.

Andreas Peldszus and Manfred Stede. 2015a. An annotated corpus of argumentative microtexts. In Proceedings of the First European Conference on Argumentation: Argumentation and Reasoned Action.

Andreas Peldszus and Manfred Stede. 2015b. Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of EMNLP.

Nils Reiter and Anette Frank. 2010. Identifying generic noun phrases. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 40–49, Uppsala, Sweden, July. Association for Computational Linguistics.

Uladzimir Sidarenka, Andreas Peldszus, and Manfred Stede. 2015. Discourse segmentation of German texts. Journal for Language Technology and Computational Linguistics, 30(1):71–98.

Eric V. Siegel and Kathleen R. McKeown. 2000. Learning methods to combine linguistic indicators: Improving aspectual classification and revealing linguistic insights. Computational Linguistics, 26(4):595–628.

Carlota S. Smith. 1991. The Parameter of Aspect. Kluwer.

Carlota S. Smith. 2003. Modes of Discourse: The Local Structure of Texts, volume 103. Cambridge University Press.

Carlota S. Smith. 2005. Aspectual entities and tense in discourse. In Aspectual Inquiries, pages 223–237. Springer.

Christian Stab and Iryna Gurevych. 2016. Parsing argumentation structures in persuasive essays. arXiv preprint, under review, April.

Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher, and Jérémy Perret. 2016. Parallel discourse annotations on a corpus of short texts. In Proceedings of LREC.

Simone Teufel. 2000. Argumentative zoning: Information extraction from scientific text. Ph.D. thesis, University of Edinburgh.

Zeno Vendler. 1957. Linguistics in Philosophy, chapter Verbs and Times, pages 97–121. Cornell University Press, Ithaca, New York.

H. J. Verkuyl. 1972. On the Compositional Nature of the Aspects. Reidel.

Alessandra Zarcone and Alessandro Lenci. 2008. Computational models of event type classification in context. In Proceedings of LREC 2008.

Contextual stance classification of opinions: A step towards enthymeme reconstruction in online reviews

Pavithra Rajendran, University of Liverpool
Danushka Bollegala, University of Liverpool
Simon Parsons, King's College London

Abstract

Enthymemes, which are arguments with missing premises, are common in natural language text. They pose a challenge for the field of argument mining, which aims to extract arguments from such text. If we can detect whether a premise is missing in an argument, then we can either fill the missing premise from similar/related arguments, or discard such enthymemes altogether and focus on complete arguments. In this paper, we draw a connection between explicit vs. implicit opinion classification in reviews, and detecting arguments from enthymemes. For this purpose, we train a binary classifier to detect explicit vs. implicit opinions using a manually labelled dataset. Experimental results show that the proposed method can discriminate explicit opinions from implicit ones, thereby providing an encouraging first step towards enthymeme detection in natural language texts.

1 Introduction

Argumentation has become an area of increasing study in artificial intelligence (Rahwan and Simari, 2009). Drawing on work from philosophy, which attempts to provide a realistic account of human reasoning (Toulmin, 1958; van Eemeren et al., 1996; Walton and Krabbe, 1995), researchers in artificial intelligence have developed computational models of this form of reasoning. A relatively new sub-field of argumentation is argument mining (Peldszus and Stede, 2013), which deals with the identification of arguments in text, with an eye to extracting these arguments for later processing, possibly using the tools developed in other areas of argumentation.

Examining arguments that are found in natural language texts quickly leads to the recognition that many such arguments are incomplete (Lippi and Torroni, 2015a). That is, if you consider an argument to be a set of premises and a conclusion that follows from those premises, one or more of these elements can be missing in natural language texts. A premise is a statement that indicates support or reason for a conclusion. In the case where a premise is missing, such incomplete arguments are known as enthymemes (Walton, 2008). One classic example is given below.

Major premise: All humans are mortal (unstated).
Minor premise: Socrates is human (stated).
Conclusion: Therefore, Socrates is mortal (stated).

According to Walton, enthymemes can be completed with the help of common knowledge, echoing the idea from Aristotle that the missing premises in enthymemes can be left implicit in most settings if they represent familiar facts that will be known to those who encounter the enthymemes. Structured models from computational argumentation, which contain structures that mimic the syllogisms and argument schemes of philosophical argumentation, will struggle to cope with enthymemes unless we can somehow provide the unstated information.

Several authors have already grappled with the problem of handling enthymemes and have represented shared common knowledge as a solution to reconstruct these enthymemes (Walton, 2008; Black and Hunter, 2012; Amgoud and Prade, 2012; Hosseini et al., 2014).

In this paper, we argue that there exists a close relationship between detecting whether a particular statement conveys an explicit or an implicit opinion, and whether there is a premise that supports the conclusion (resulting in an argument) or

not (resulting in an enthymeme). For example, consider the following two statements S1 and S2:

S1 = I am extremely disappointed with the room.

S2 = The room is small.

Both S1 and S2 express a negative sentiment towards the room aspect of this hotel. In S1, the stance of the reviewer (whether the reviewer is in favour of or against the hotel) is explicitly stated by the phrase extremely disappointed. Consequently, we refer to S1 as an explicitly opinionated statement about the room. However, to interpret S2 as a negative opinion we must possess the knowledge that being small is often considered as negative with respect to hotel rooms, whereas being small could be positive with respect to some other entity such as a mobile phone. The stance of the reviewer is only implicitly conveyed in S2. Consequently, we refer to S2 as an implicitly opinionated statement about the room. Given the conclusion that this reviewer did not like this room (possibly explicitly indicated by a low rating given to the hotel), the explicitly opinionated statement S1 would provide a premise forming an argument, whereas the implicitly opinionated statement S2 would only form an enthymeme. Thus:

Argument
Major premise: I am extremely disappointed with the room.
Conclusion: The reviewer is not in favour of the hotel.

whereas:

Enthymeme
Major premise: A small room is considered bad (unstated).
Minor premise: The room is small.
Conclusion: The reviewer is not in favour of the hotel.

Figure 1: The relationship between explicit/implicit opinions and arguments/enthymemes.

Our proposal for enthymeme detection via opinion classification is illustrated in Figure 1, and consists of the following two steps. This assumes a separate process to extract the ("predefined") conclusion, for example from the rating that the hotel is given.

Step-1 Opinion structure extraction
a. Extract statements that express opinions with the help of local sentiment (positive or negative) and discard the rest of the statements.
b. Perform an aspect-level analysis to obtain the aspects present in each statement; those statements that include an aspect are considered and the rest of the statements are discarded.
c. Classify the stance of statements as being explicit or implicit.

Step-2 Premise extraction
a. Explicit opinions paired with the predefined conclusions can give us complete arguments.
b. Implicit opinions paired with the predefined conclusions can either become arguments or enthymemes. Enthymemes require additional premises to complete the argument.
c. Common knowledge can then be used to complete the argument.

This process uses both opinion mining and stance detection to extract arguments, but it still leaves us with enthymemes. Under some circumstances, it may be possible to combine explicit and implicit premises to complete enthymemes.

To see how this works, let us revisit our previous example. The explicitly opinionated statement "I am extremely disappointed with the room" can be used to complete an argument that has premise "the rooms are small and dirty", which was extracted from the review, and a conclusion that "The hotel is not favored" which comes from the fact that the review has a low rating.
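The following is a minimal sketch of Step 1 as described above, assuming hypothetical helper components (a local sentiment detector, an aspect extractor and a trained explicit/implicit stance classifier). It illustrates the control flow only and is not the authors' implementation.

```python
def extract_opinion_structures(statements, sentiment, aspects, stance_clf):
    """Step 1: keep opinionated, aspect-bearing statements and label their stance.

    `sentiment`, `aspects` and `stance_clf` are hypothetical callables: a polarity
    detector, an aspect extractor and a trained explicit/implicit classifier.
    """
    opinions = []
    for s in statements:
        # 1(a): keep only statements with a positive or negative local sentiment
        if sentiment(s) not in ("positive", "negative"):
            continue
        # 1(b): keep only statements that mention at least one aspect
        found = aspects(s)
        if not found:
            continue
        # 1(c): classify the stance of the statement as explicit or implicit
        opinions.append((s, found, stance_clf(s)))
    return opinions
```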

Argument
Major premise: I am extremely disappointed with the room.
Minor premise: The rooms are small and dirty.
Conclusion: The reviewer is not in favour of the hotel.

While developing this approach is our long-term goal, this paper has a much more limited focus. In particular we consider Step 1(c), and study the classification of opinions into those with explicit stance and those with implicit stance.

We focus on user reviews such as product reviews on Amazon.com, or hotel reviews on TripAdvisor.com. Such data has been extensively researched for sentiment classification tasks (Hu and Liu, 2004; Lazaridou et al., 2013). We build on this work, in particular, aspect-based approaches. In these approaches, sentiment classification is based around the detection of terms that denote aspects of the item being reviewed — the battery in the case of reviews of portable electronics, the room and the pool in the case of hotel reviews — and whether the sentiment expressed about these aspects is positive or negative.

Our contributions in this paper are as follows:

• As described above, we propose a two-step framework that identifies opinion structures in aspect-based statements which help in detecting enthymemes and reconstructing them.
• We manually annotated a dataset classifying opinionated statements to indicate whether the author's stance is explicitly or implicitly indicated.
• We use a supervised approach with an SVM classifier to automatically identify the opinion structures as explicit and implicit opinions, using n-grams, part of speech (POS) tags, SentiWordNet scores and noun-adjective patterns as features.

2 Related work

Argument mining is a relatively new area in the field of computational argumentation. It seeks to automatically identify arguments from natural language texts, often online texts, with the aim of helping to summarise or otherwise help in processing such texts. It is a task which, like many natural language processing tasks, varies greatly from domain to domain. A major part of the challenge lies in defining what we mean by an argument in unstructured texts found online. It is very difficult to extract properly formed arguments in online discussions, and the absence of proper annotated corpora for automatic identification of these arguments is problematic.

According to Lippi and Torroni (2015a), who have made a survey of the various works carried out in argument mining so far with an emphasis on the different machine learning approaches used, the two main approaches in argument mining relate to the extraction of abstract arguments (Cabrio and Villata, 2012; Yaglikci and Torroni, 2014) and structured arguments.

Much recent work in extracting structured arguments has concentrated on extracting arguments pertaining to a specific domain such as online debates (Boltužić and Šnajder, 2014), user comments on blogs and forums (Ghosh et al., 2014; Park and Cardie, 2014), Twitter datasets (Llewellyn et al., 2014) and online product reviews (Wyner et al., 2012; Garcia-Villalba and Saint-Dizier, 2012). Each of these works targets the kind of arguments that can be detected from a specific domain.

Ghosh et al. (2014) analyse target-callout pairs among user comments, which are further annotated as stance/rationale callouts. Boltužić and Šnajder (2014) identify argument structures that they propose can help in stance classification. Our focus is not to identify the stance but to use the stance and the context of the relevant opinion to help in detecting and reconstructing enthymemes present in a specific domain of online reviews.

Lippi and Torroni (2015b) address the domain-dependency of previous work by identifying claims that are domain-independent, focussing on rhetoric structures and not on the contextual information present in the claim. Habernal et al. (2014) consider the context-independent problem using two different argument schemes and argue that the best scheme to use varies depending upon the data and problem to be solved. In this paper, we address a domain-dependent problem of identifying premises with the help of stance classification. We think that claim identification will not solve this problem, as online reviews are rich in descriptive texts that are mostly premises leading to a conclusion as to whether a product/service is good or bad.

There are a few papers that have concentrated on identifying enthymemes. Feng and

33 Hirst (2011) classify argumentation schemes using ble (predefined) conclusions for the hotel reviews, explicit premises and conclusion on the Araucaria and these were: dataset, which they propose to use to reconstruct Conclusion 1 The reviewer is in favor of an as- enthymemes. Similar to (2011), Walton (2010) in- pect of the hotel or the hotel itself. vestigated how argumentation schemes can help in addressing enthymemes present in health product Conclusion 2 The reviewer is against an aspect of advertisements. Amgoud et al. (2015) propose a the hotel or the hotel itself. formal language approach to construct arguments We then annotated each of the 784 opinions with from natural language texts that are mostly en- one of these conclusions. This was done to make thymemes. Their work is related to mined argu- the annotation procedure easier, since each opin- ments from texts that can be represented using ion related to the conclusion forms either a com- a logical language and our work could be use- plete argument or an enthymeme. During the an- ful for evaluating (Amgoud et al., 2015) on a real notation process, each opinion was annotated as dataset. Unlike the above, our approach classifies either explicit or implicit based on the stance def- stances which can identify enthymemes and im- initions given above. The annotation was carried plicit premises that are present in online reviews. out by a single person and the ambiguity in the an- Research in opinion mining has started to un- notation process was reduced by setting out what derstand the argumentative nature of opinionated kind of statements constitute explicit opinions and texts (Wachsmuth et al., 2014a; Vincent and Win- how these differ from implicit opinions. These are terstein, 2014). This growing interest to sum- as follows: marise what people write in online reviews and not just to identify the opinions is much of the motiva- General expressive cues Statements that explic- tion for our paper. itly express the reviewer’s views about the hotel or aspects of the hotel. Example indi- 3 Method cators are disappointed, recommend, great.

3.1 Manual Annotation of Stance in Opinions Specific expressive cues Statements that point to We started with the ArguAna corpus of hotel re- conclusions being drawn but where the rea- views from TripAdvisor.com (Wachsmuth et al., soning is specific to a particular domain and 2014b) and manually separated those statements varies from domain to domain. Examples that contained an aspect and those that did not. are “small size batteries” and “rooms are This process could potentially be carried out au- small”. Both represent different contextual tomatically using opinion mining tools, but since notions, where the former suggests a posi- this information was available in the corpus, we tive conclusion about the battery and the lat- decided to use it directly. We found that many of ter suggests a negative conclusion about the the individual statements in the corpus directly re- room. Such premises need additional sup- fer to certain aspects of the hotel or directly to the port. hotel itself. These were the statements we used for our study. The rest were discarded.1 Event-based cues Statements that describe a situ- Each statement in the corpus was previ- ation or an incident faced by the reviewer and ously annotated as positive, negative or objec- needs further explanation to understand what tive (Wachsmuth et al., 2014b). Statements with a the reviewer is trying to imply. positive or negative sentiment were more opinion- Each statement in the first category (general ex- oriented and hence we discarded the statements pressive) is annotated as an explicit opinion and that were annotated as objective. A total of 180 re- those that match either of the last two categories views then gave us 784 opinions. Before we anno- (specific expressive, event-based) were annotated tated the statements, we needed to define the possi- as non-explicit opinions. The non-explicit opin- 1The remaining statements could potentially be used, but ions were further annotated as having a neutral or it would require much deeper analysis in order to construct implicit stance. We found that there were state- arguments that are relevant to the hotels. The criteria for our ments that were both in favor of and against the ho- current work is to collect simpler argument structures that can be reconstructed easily, and so we postpone the study of the tel and we annotated such ambiguous statements rest of the data from the reviews for future work. as being neutral.

Explicit stance:
- i would not choose this hotel again.
- great location close to public transport and chinatown.
- best service ever

Implicit stance:
- the combination of two side jets and one fixed head led to one finding the entire bathroom flooded upon exiting the shower.
- the pool was ludicrously small for such a large property, the sun loungers started to free up in the late afternoon.
- the rooms are pretentious and boring.

Table 1: A few examples of statements from the hotel data that are annotated with explicit and implicit stances.

From the manually annotated data, 130 state- service” will work to complete the argument be- ments were explicit, 90 were neutral and the rest cause the aspect front desk staff of the specific ex- were implicit. In our experiments, we focussed on plicit opinion A2 matches that of the implicit state- extracting the explicit opinions and implicit opin- ment A1. However, if the implicit statement was ions and thus ignored the neutral ones. Table 1 about another aspect (say the room cleaning ser- shows examples of statements annotated as ex- vice), then A2 woud not match the aspect, whereas plicit and implicit. the more general statement A3 would. As shown in Figure 1, explicit opinions with Having sketched our overall approach to argu- their appropriate conclusions can form complete ment extraction and enthymeme completion, we arguments. This is not the case for implicit opin- turn to the main contribution of the paper — an ions. Implicit opinions with their appropriate con- exploration of stance classification on hotel review clusions may form complete arguments or they data, to demonstrate that Step 1(c) of our process may require additional premises to entail the con- is possible. clusion. In this latter case, the implicit opinion and conclusion form an enthymeme. As discussed 3.2 Learning a Stance Classifier above, we may be able to use related explicit opin- Since we wish to distinguish between explicit and ions to complete enthymemes. When we look to implicit stances, we can consider the task as a bi- do this, we find that the explicit opinions in our nary classification problem. In this section we de- dataset fall into two categories: scribe the features that we considered as input to a range of classifiers that we used on the problem. General These explicit opinions are about an as- pect category, which in general, can be re- Section 4 describes the choice of classifiers that lated to several sub-level aspects within the we used. category. The following are a set of features that we used.

Specific These explicit opinions are about a spe- Baseline As a baseline comparison, statements cific aspect and hence can only be related to containing words from a list of selected cues that particular aspect. such as excellent, great, worst etc. are pre- dicted as explicit and those that do not con- To illustrate the difference between the two kinds tain words present in the cue list are predicted of explicit claim, let us consider three examples as implicit. The criteria followed here is that given below. the statement should contain atleast one cue A1 : “The front desk staffs completely ig- word to be predicted as explicit. The ten most • nored our complaints and did nothing to important cue words were considered. make our stay better”. (implicit) N-grams (Uni, Bi) Unigrams (each word) and A2 : “The front desk staff are worst”. (spe- bigrams (successive pair of words). • cific explicit) Part of Speech (POS) The NLTK2 tagger helps A3 : “I am disappointed with the overall cus- in tagging each word with its respective part • tomer service!” (general explicit) of speech tag and we use the most common tags (noun, verb and adjective) present in the In this case, both the specific opinion A2: “The explicit opinions as features. front desk staff are worst”, and the general opinion A3: “I am disappointed with the overall customer 2Natural Language Toolkit, www.nltk.org

35 Classifier Explicit Implicit Bayes and Linear SVM — using the basic uni- Logistic Regression 0.44 0.86 grams and bigrams as features and determine the MultinomialNB 0.27 0.85 Linear SVM 0.75 0.90 best classifier for our task. Table 2 gives the 5 cross-fold validation F1-score results with the lin- Table 2: F1-scores of 5-fold cross validation results per- ear SVM classifier performing best. We used the formed with different classifiers. The bold figures are the highest in each column. scikit-learn GridSearchcv function to perform an evaluative search on our data to get the best regu- larization parameter value for the linear SVM clas- Part of Speech (POS Bi) As for POS, but we sifier. This was C=10. consider the adjacent pairs of part of speech tags as a feature. 4.2 Training data Having picked the classifier, the second experi- We used the Senti- SentiWordNet score (senti) ment was to find the best mix of data to train on. WordNet (Baccianella et al., 2010) lexical re- This is an important step to take when we have source to assign scores for each word based data that is as unbalanced, in terms of the num- on three sentiments i.e positive, negative and ber elements of each type of data we are classify- objective respectively. The positive, negative ing, as the data we have here. The manually an- and objective scores sum up to 1. We use the notated statements were divided into two sets — individual lemmatized words in a statement training set and test set. We collected 30 explicit as an input and obtain the scores for each of and 150 implicit opinions as the test set. These them. For each lemmatized word, we obtain were not used in training. We gathered the re- the difference between their positive and neg- maining 100 explicit opinions and created a train- ative score. We add up the computed scores ing set using these statements and a variable-sized for all the words present in a statement and set of implicit opinions. For each such training set, average it which gives the overall statement we ran a 5 fold cross-validation and also tested it score as a feature. against the test set that we had created. We use Noun-Adjective patterns Both the statements in the linear SVM classifier to train and test the data general expression cues and specific expres- with the basic features (unigrams and bigrams re- sions cues contain combinations of noun and spectively). The mean F1-scores for the cross- adjective pairs. For every noun present in the validation on different train sets and the F1-scores text, each combination of adjective was con- on the test set for both explicit and implicit opin- sidered as a noun-adjective pair feature. ions are shown in Figure 2. The plot also contains the false positive rate for the test set with respect In addition to these features, each token is paired to different training sets. with its term frequency, defined as: 4.3 Features number of occurrences of a token (1) Given the results of the second experiment, we can total number of tokens identify the best size of training set, in terms of the number of explicit and implicit opinions. Consid- Thus rather than a statement containing several in- ering Figure 2, we see that a training set contain- stances of a common term (like “the”), it will con- ing 100 explicit and 250 implicit opinions will be tain a single instance, plus the term frequency. sufficient. With this mix, the false positive rate 4 Experiments is close to minimum, and the performance on the test set is close to maximum. 
We then carried out a Having covered the features we considered, this third experiment to find the best set of features to section describes the experimental setup and the identify the stances. To do this we ran a 5 fold results we obtained. We used the scikit-learn cross-validation on the training set using the all toolkit library to conduct three experiments. the features described in Section 3.2 — in other words we expanded the feature set from just uni- 4.1 Classifier grams and bigrams — using both individual fea- The first experiment was to train different classi- tures and sets of features. We also performed the fiers — Logistic Regression, Multinomial Naive same experiment using these different features on

36 Figure 2: The results of the experiment to identify the best mix of explicit and implicit stances for training. The training set contained 100 explicit stances and as many implicit stances as indicated on the x-axis. The graph shows the cross-validation F1 scores for the training sets, and the corresponding F1 scores obtained on the test set. False positive rates for the test set with respect to each training set are also plotted. the test set. tures in the data. The classifier is based on the following decision function: 4.4 Results Table 3 contains the results for the third exper- y(x) = w>x + b (2) iment. The best performance results are high- lighted — the highest values in each of the first where w is the weight vector and b is the bias four columns (classification accuracy) are in bold, value. Support vectors represent those weight as is the lowest value in the final column (false weight vectors that are non-zero, and we can use positive rate). We see that the basic features, un- these to obtain the most important features. Ta- igrams and bigrams, give good results for both ble 4 gives the most important 10 features identi- the cross-validation of the training set and for the fied in this way for both explicit and implicit opin- test set. We also see that while the sentiment of ions. each statement was useful in determining whether 5 Conclusion a statement is an opinion (and thus the statement is included in our data), sentiment does not help in In this paper, we focus on a specific domain distinguishing the explicit stance from the implicit of online reviews and propose an approach that stance which is why there is no improvement with can help in enthymemes detection and reconstruc- the SentiWordNet scores as features. This is be- tion. Online reviews contain aspect-based state- cause both positive and negative statements can be ments that can be considered as stances repre- either implicit or explicit. In contrast, the special senting for/against views of the reviewer about features that include the noun-adjective patterns the aspects present in the product or service and along with unigrams and bigrams gave the best the product/service itself. The proposed approach performance for the test set, and also produced the is a two-step approach that detects the type of lowest false positive rate. stances based on the contextual features, which can then be converted into explicit premises, and 4.5 Top 10 features these premises with missing information repre- The linear SVM classifier gives the best perfor- sents enthymemes. We also propose a solution us- mance results and thus we use the weights of the ing the available data to represent common knowl- classifier for identifying the most important fea- edge that can fill in the missing information to

37 Features Training set Test set F1 Score F1 Score False positive rate Explicit Implicit Explicit Implicit Baseline 0.73 0.88 0.67 0.92 0.41 Uni 0.74 0.90 0.65 0.92 0.40 Uni +Bi 0.75 0.90 0.70 0.94 0.30 Uni + Bi + POS 0.74 0.90 0.68 0.93 0.33 Uni + Bi + POS + POS Bi 0.72 0.89 0.71 0.94 0.26 Uni + Bi + POS + POS Bi + Senti 0.66 0.89 0.68 0.94 0.3 Uni + Bi + POS + Senti 0.73 0.90 0.70 0.94 0.31 Uni + Bi + Noun-Adj patterns 0.77 0.90 0.72 0.94 0.26

Table 3: The results of the experiment to identify the best feature set. The table gives the F1 scores for training set and test set using different sets of features. False positive rate on the test set is also listed. All results were obtained using the Linear SVM classifier except the baseline classifier. The bold numbers are the highest classification rates in each column, or the lowest false positive rate for the column, as appropriate.

Explicit Weight Implicit Weight sentence level, we may be misclassifying some excellent 4.18 Adj + Noun -1.26 sentences as expressing implicit opinions when location 3.43 the hotel -1.25 great 2.55 nice star -1.23 they include both implicit and explicit opinions. experience 2.02 fairly -1.08 To refine the classification, we need to examine recommend 1.91 hotel the -1.03 was excellent 1.84 helpful + Noun -0.96 sub-sentential clauses of the sentences in the re- hotel 1.61 location but -0.95 views to identify if any of them express explicit service 1.48 hotel with -0.94 opinions. If no explicit opinions are expressed in extremely 1.45 advice stay -0.94 was great 1.43 hotel stars -0.94 any of the sub-sentential clauses, then the whole sentence can be correctly classified as a implicit Table 4: List of the 10 most important features present in opinion, and along with the predefined conclusion explicit and implicit stances with their weights will become an enthymeme. The second of the steps towards enthymeme reconstruction is to look complete the arguments. The first-step requires to use related explicit opinions to complete en- automatic detection of the stance types — explicit thymemes, as discussed in Section 3.1. Here the and implicit, which we have covered in this pa- distinction between general and specific opinions per. We use a supervised approach to classify becomes important, since explicit general opin- the stances using a linear SVM classifier, the best ions might be combined with any implicit opin- performance results on the test set with a macro- ion about an aspect in the same aspect category, averaged F1-scores of 0.72 and 0.94 for explicit while explicit specific opinions can only be com- and implicit stances respectively. These identi- bined with implicit opinions that relate to the same fied implicit stances are then explicit premises of aspect. Effective combination of explicit general either complete arguments or enthymemes. (If opinions with related implicit opinions requires they are premises of complete arguments, there are a detailed model which expresses what “related” other, additional premises.) The identified explicit means for the relevant domain. We expect the de- stances can then represent common knowledge in- velopment of this model to be as time-consuming formation for the implicit premises, thus becom- as all work formalising a domain. Another issue in ing explicit premises to fill in the gap present in enthymeme reconstruction is evaluating the output the respective enthymemes. of the process. Identifying whether a given en- thymeme has been successfully turned into a com- plete argument is a highly subjective task, which 6 Future work will likely require careful human evaluation. Per- The next steps in this work take us closer to the au- forming this at a suitable scale will be challenging. tomatic reconstruction of enthymemes. The first of these steps is to look to refine our identifica- tion of explicit premises (and thus complete ar- References guments, circumventing the need for enthymeme Leila Amgoud and Henri Prade. 2012. Can AI mod- reconstruction). The idea here is that we believe els capture natural language argumentation? In that since we are currently looking only at the IJCINI’12, pages 19–32.

38 Leila Amgoud, Philippe Besnard, and Anthony Hunter. Clare Llewellyn, Claire Grover, Jon Oberlander, and 2015. Representing and reasoning about arguments Ewan Klein. 2014. Re-using an argument corpus mined from texts and dialogues. In ECSQARU’15, to aid in the curation of social media collections. In pages 60–71. LREC’14, pages 462–468.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebas- Joonsuk Park and Claire Cardie. 2014. Identifying tiani. 2010. Sentiwordnet 3.0: An enhanced lexical appropriate support for propositions in online user resource for sentiment analysis and opinion mining. comments. In ACL’14, pages 29–38. In LREC’10, pages 2200–2204. Andreas Peldszus and Manfred Stede. 2013. From ar- Elizabeth Black and Anthony Hunter. 2012. A gument diagrams to argumentation mining in texts: relevance-theoretic framework for constructing and A survey. In IJCINI’13, volume 7, pages 1–31. deconstructing enthymemes. J. Log. Comput., 22(1):55–78. Iyad Rahwan and Guillermo R. Simari, editors. 2009. Argumentation in Artificial Intelligence. Springer Filip Boltuziˇ c´ and Jan Snajder.ˇ 2014. Back up your Verlag, Berlin, Germany. stance: Recognizing arguments in online discus- Stephen Toulmin. 1958. The Uses of Argument. Cam- sions. In ACL’14, pages 49–58. bridge University Press, Cambridge, England. Elena Cabrio and Serena Villata. 2012. Combin- Frans H. van Eemeren, Rob Grootendorst, Francisca S. ing textual entailment and argumentation theory for Henkemans, J. Anthony Blair, Ralph H. Johnson, supporting online debates interactions. In ACL’12, Erik C. W. Krabbe, Christian Plantin, Douglas N. pages 208–212. Walton, Charles A. Willard, John Woods, and David Vanessa Wei Feng and Graeme Hirst. 2011. Classify- Zarefsky. 1996. Fundamentals of Argumentation ing arguments by scheme. In ACL’11, pages 987– Theory: A Handbook of Historical Backgrounds and 996. Contemporary Developments. Lawrence Erlbaum Associates, Mahwah, NJ. Maria Paz Garcia-Villalba and Patrick Saint-Dizier. 2012. A framework to extract arguments in opinion Marc Vincent and Gregoire´ Winterstein. 2014. Ar- texts. In IJCINI’12, volume 6, pages 62–87. gumentative insights from an opinion classification task on a French corpus. In New Frontiers in Artifi- Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, cial Intelligence, pages 125–140. Mark Aakhus, and Matthew Mitsui. 2014. Analyz- Henning Wachsmuth, Martin Trenkmann, Benno Stein, ing argumentative discourse units in online interac- and Gregor Engels. 2014a. Modeling review tions. In ACL’14, pages 39–48. argumentation for robust sentiment analysis. In Ivan Habernal, Judith Eckle-Kohler, and Iryna ICCL’14, pages 553–564. Gurevych. 2014. Argumentation mining on the web Henning Wachsmuth, Martin Trenkmann, Benno Stein, from information seeking perspective. In Proceed- Gregor Engels, and Tsvetomira Palakarska. 2014b. ings of the Workshop on Frontiers and Connections A review corpus for argumentation analysis. In IC- between Argumentation Theory and Natural Lan- CLITP’14, pages 115–127. guage Processing, pages 26–39. Douglas N. Walton and Erik C. W. Krabbe. 1995. Seyed Ali Hosseini, Sanjay Modgil, and Odinaldo Ro- Commitment in Dialogue: Basic Concepts of Inter- drigues. 2014. Enthymeme construction in dia- personal Reasoning. State University of New York logues using shared knowledge. In COMMA’14, Press, Albany, NY, USA. pages 325–332. Douglas N. Walton. 2008. The three bases for the Minqing Hu and Bing Liu. 2004. Mining opinion fea- enthymeme: A dialogical theory. J. Applied Logic, tures in customer reviews. In AAAI’04, pages 755– 6(3):361–379. 760. Douglas Walton. 2010. The structure of argumentation Angeliki Lazaridou, Ivan Titov, and Caroline in health product messages. Argument & Computa- Sporleder. 2013. A Bayesian model for joint tion, 1(3):179–198. 
unsupervised induction of sentiment, aspect and discourse representations. In ACL’13, pages Adam Wyner, Jodi Schneider, Katie Atkinson, and 1630–1639. Trevor J. M. Bench-Capon. 2012. Semi-automated argumentative analysis of online product reviews. In Marco Lippi and Paolo Torroni. 2015a. Argu- COMMA’12, pages 43–50. ment mining: A machine learning perspective. In TAFA’15, pages 163–176. Nefise Yaglikci and Paolo Torroni. 2014. Microde- bates app for Android: A tool for participating in ar- Marco Lippi and Paolo Torroni. 2015b. Context- gumentative online debates using a handheld device. independent claim detection for argument mining. In ICTAI’14, pages 792–799. In IJCAI’15, pages 185–191.

The CASS Technique for Evaluating the Performance of Argument Mining

Rory Duthie1, John Lawrence1, Katarzyna Budzynska1,2, and Chris Reed1

1Centre for Argument Technology, University of Dundee, Scotland 2Institute of Philosophy and Sociology, Polish Academy of Sciences, Poland

Abstract Commonly to find the agreement of manual annotators or the effectiveness of an automatic Argument mining integrates many distinct solution, two scores are given, Cohen’s kappa computational linguistics tasks, and as a (Cohen, 1960), which takes into account the ob- result, reporting agreement between anno- served agreement between two annotators and the tators or between automated output and chance agreement, giving an overall kappa value gold standard is particularly challenging. for agreement, and F1 score (Rijsbergen, 1979), More worrying for the field, agreement which is the harmonic mean of the precision and and performance are also reported in a recall of an algorithm. The way in which these wide variety of different ways, making scores are utilised can over penalise differences in comparison between approaches difficult. argumentative structures. In particular, if used in- To solve this problem, we propose the correctly, Cohen’s kappa can penalise doubly (pe- CASS technique for combining metrics nalise for segmentation and penalise segmentation covering different parts of the argument in argumentative structures) if not split into sepa- mining task. CASS delivers a justified rate tasks or penalise too harshly when annotations method of integrating results yielding con- have only slight differences, again if the calcula- fusion matrices from which CASS-κ and tion is not split by argumentative structure. When CASS-F1 scores can be calculated. using the F1 score the same problems arise with- out split calculations. 1 Introduction To combat these issues this paper introduces To calculate the agreement, or similarity, between two advances: first, the definition of an overall two different argumentative structures is an im- score, the Combined Argument Similarity Score portant and commonly occurring task in argument (CASS), which incorporates a separate segmen- mining. For example, measures of similarity are tation score, propositional content relation score required to determine the efficacy of annotation and dialogical content relation score; and second, guidelines via inter-annotator agreement, and to the deployment of an automatic system of compar- compare test analyses against a gold standard, ative statistics for calculating the agreement be- whether these test analyses are produced by stu- tween annotations over the two steps needed to dents, or automated argument mining techniques ultimately perform argument mining: manual an- (cf. (Moens, 2013; Peldszus and Stede, 2013)). notations compared with manual annotations (cor- To find the the similarity of automatic and man- pora compared with corpora) and automatic anno- ually segmented texts and what impact these seg- tations evaluated against a gold standard (automat- ments have on agreement between annotations for ically created argument structures compared with an overall argument structure, is a complex task. a manually annotated corpus). Similar to these problems is the task of evaluat- ing the argumentative structure of annotations us- 2 Related Work ing pre-segmented text. Despite the relative ease of manually analysing these situations, arguments Creating the CASS technique and an automatic with long relations can easily make this task com- system to calculate it, is based on theories estab- plex. lished in linguistics and computational linguistics.


In (Afantenos et al., 2012), a discourse graph is distance between nodes is taken for each annota- considered and split into discourse units and rela- tion with each node distance as a fraction. The dis- tions, to calculate agreement using F1 score. This tance is added, then multiplied by the overall num- gives what is described as a “brutal estimation” ber of edges giving a normalised score for both which gives an underestimation of the agreement. annotations, not considering the direction or types To combat this it is suggested that reasoning over of relations or any unconnected propositions. The the structures is needed. harmonic mean is then taken to provide the agree- In (Artstein and Poesio, 2008) a survey is given ment between the annotations. of agreement values in computational linguistics. Results are also provided when considering re- Different measures of the statistics both Cohen’s lation types for weighted average and nodes with kappa and Krippendorff’s alpha (Krippendorff, distance less than six for inter-annotator agree- 2007) along with other variations are considered ment on propositional content nodes for a pre- for different tasks. On the task of segmentation, it segmented text. is noted that kappa in any form does not account If we consider the papers submitted to the 2nd for near misses (where a boundary missed by a workshop on argumentation mining, we can see word or two words) and that instead other mea- there is an inconsistency in the area when cal- sures (see Section 4) should be considered. On culating inter-annotator agreement and overall ar- the topic of relations and discourse entities, again gument mining results. To calculate the agree- kappa in its various forms and alpha are consid- ment between annotators, three papers used Co- ered. For both relations and discourse entities the hen’s kappa (cf. (Bilu et al., 2015; Carstens and kappa score is low overall because partial agree- Toni, 2015; Sobhani et al., 2015)), three papers ment is not considered. Instead the idea of a par- used inter-annotator agreement as a percentage (cf. tial agreement coefficient is introduced as being (Green, 2015; Nguyen and Litman, 2015; Kiesel applicable. et al., 2015)), two used precision and recall (cf. In (Habernal and Gurevych, 2016), Krippen- (Sardianos et al., 2015; Oraby et al., 2015)) and three others used different methods (cf. (Kirschner dorff’s unitized alpha (αU ) is proposed as an eval- uation method, to take into account both labels and et al., 2015; Yanase et al., 2015; Reisert et al., boundaries of segments by reducing the task to a 2015)). To calculate the results of argument min- ing, four papers used accuracy (cf. (Bilu et al., token level. The αU is calculated over a contin- uous series of documents removing the need for 2015; Kiesel et al., 2015; Nguyen and Litman, averaging on a document level, but is dependent 2015; Yanase et al., 2015)) and five papers used on the ordering of documents where the error rate precision, recall and F1 score (cf. (Lawrence and of ordering is low. Reed, 2015; Sobhani et al., 2015; Park et al., 2015; Nguyen and Litman, 2015; Peldszus and Stede, Finally, in (Kirschner et al., 2015) meth- 2015)) with one paper using a macro-averaged F1. ods for calculating inter-annotator agreement are What is required in the area of argument mining is specified: adapted percentage agreement (APA), a coherent model to give results for both annotator weighted average and a graph based technique agreement but also the results of argument mining. 
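For reference, the quantities that CASS integrates (precision, recall, F1 and Cohen's kappa) can all be read off a confusion matrix. The following generic sketch shows the derivation for the binary case; it is standard bookkeeping, not the CASS implementation itself.

```python
def scores(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1 and Cohen's kappa from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement: sum over classes of the product of marginal proportions.
    chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (observed - chance) / (1 - chance)
    return precision, recall, f1, kappa

# Illustrative counts only.
print(scores(tp=40, fp=10, fn=5, tn=45))
```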
(see also Section 3.2). In the area of text summarization, Recall- APA takes the total number of agreed annota- Oriented Understudy for Gisting Evaluation tions and divides it by the total number of anno- (ROUGE) (Lin, 2004) was created exactly for the tations, on a sentence level of argument but not purpose of having a coherent measure to allow corrected for chance. Chance is taken into ac- systems in the Document Understanding Confer- count, when performing the weighted average. A ence (DUC) to be evaluated. In creating the CASS weight is provided for the distance between related technique we aim to emulate ROUGE, and provide propositions when the distance is not greater than consistency in the area of argument mining. six. Meaning any relation with a distance greater than six is discounted. This is justified with only 3 Foundation 5% of relations having a distance greater than two. Chance is accounted for by using this weighted av- 3.1 Representing Argument erage for multi-annotator kappa and F1 score. Fi- Arguments in argument mining can be represented nally, a graph based approach is defined, where the in many forms which is particularly important for

Scheme for Argumentative Microtexts. In (Peldszus and Stede, 2013) an annotation scheme was defined which was incorporated in a corpus of 122 argumentative microtexts (Peldszus and Stede, 2015). In this annotation scheme an argument is defined as a non-empty premise, that is, a premise which holds some form of relation which supports a conclusion. Graphically this is represented by proposition nodes and support relations, with support relations represented as an arrow between the node and its conclusion.

The scheme defined builds on and extends the work of (Freeman, 2011). Support relations are defined in the most basic way, as an argument in the form of premise and conclusion. This is accompanied by attack relations, where rebutting is defined for when an argument is attacked directly and undercutting for when a premise is attacked. Counter attacks then allow rebuttals of an attack support, the undercutting of an attack support and a counter consideration argument. Each microtext is pre-segmented to avoid bias from annotators segmenting text in their own style, with rules defined in the scheme which allow annotators to change the segmentation.

Internet Argument Corpus (IAC). Argument data is also represented using quote-response pairs (QR pairs) in the IAC (Walker et al., 2012). The IAC provides 390,704 individual posts automatically extracted from an Internet forum. Each post is related to a response which is provided through a tree structure of all the posts on the forum. QR pairs work with a pre-defined segmentation which can allow annotators to identify relations between a quote (post) and a response. Relations can be on a number of levels, from the most basic of these, agree and disagree, to the more complex, sarcasm, where an annotator decides if a response is of a sarcastic manner using their own intuition, and where a formal definition or annotation is near impossible without being present during the vocalisation of the point.

Argument Interchange Format (AIF). Argument data can also be represented according to the AIF (Chesñevar et al., 2006) implemented in the AIFdb database (Lawrence et al., 2012), available at http://www.aifdb.org. The AIF was developed as a means of describing argument networks that would provide a flexible, yet semantically rich, specification of argumentation structures. Central to the AIF core ontology are two types of nodes: Information- (I-) nodes (propositional contents) and Scheme (S-) nodes (relations between contents). I-nodes represent propositional information contained in an argument, such as a conclusion, premise etc. A subset of I-nodes refers to propositional reports specifically about discourse events: these are L-nodes (locutions).

S-nodes capture the application of schemes of three categories: argumentative, illocutionary and dialogical. Amongst argumentative patterns there are inferences or reasoning (RA-nodes), conflict (CA-nodes) and rephrase (MA-nodes). Dialogical transitions (TA-nodes) are schemes of interaction or protocol of a given dialogue game which determine possible relations between locutions. Illocutionary schemes are patterns of communicative intentions which speakers use to introduce propositional contents; they are based on the illocutionary forces defined in (Searle, 1969; Searle and Vanderveken, 1985). Illocutionary connections (YA-nodes) can be anchored (associated, assigned) either in locutions or in transitions. In the first case (see e.g. asserting, challenging, questioning), the locution provides enough information to reconstruct illocutionary force and content. Illocutionary connections are anchored in a transition when we need to know what a locution is a response to in order to understand an illocution or its content.

AIFdb Corpora allows for operation with either an individual NodeSet, or any grouping of NodeSets captured in a corpus. By integrating closely with the OVA+ (Online Visualisation of Argument) analysis tool (Janier et al., 2014), AIFdb Corpora allows for the rapid creation of large corpora compliant with AIF. AIFdb provides the largest publicly available dataset comprising multiple corpora of analysed argumentation; in addition, AIF works as an interlingua facilitating translation from other representation languages, with both the IAC and Microtext corpora available in AIF format, for example. For both of these reasons, we have used AIF for our examples here (although the CASS technique itself is largely independent of annotation scheme).
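For readers implementing against this representation, the following is a minimal sketch of how an AIF-style argument map could be held in memory. It is an illustration only: the dataclass and field names are assumptions made for this example and do not reproduce the actual AIFdb schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    node_id: str
    node_type: str  # "I", "L", "RA", "CA", "MA", "YA" or "TA", as described above
    text: str = ""  # propositional content for I-/L-nodes, scheme name for S-nodes

@dataclass
class Edge:
    source: str  # node_id of the source node
    target: str  # node_id of the target node

@dataclass
class ArgumentMap:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

# Two premises each supporting a conclusion through its own RA-node.
amap = ArgumentMap(
    nodes=[Node("1", "I", "Premise one"), Node("2", "I", "Premise two"),
           Node("3", "I", "Conclusion"),
           Node("4", "RA", "Default Inference"), Node("5", "RA", "Default Inference")],
    edges=[Edge("1", "4"), Edge("4", "3"), Edge("2", "5"), Edge("5", "3")],
)
```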

Figure 1: Segmentation boundaries and mass for first and second annotators.

3.2 Comparing Analysis

Calculating the inter-annotator agreement of manual analysis can be problematic when using traditional methods such as Cohen's kappa. In (Kirschner et al., 2015, p.3), the authors highlight this challenge: "as soon as the annotation of one entity depends on the annotation of another entity, or some entities have a higher overall probability for a specific annotation than others, the measures may yield misleadingly high or low values. (...) Therefore, many researchers still report raw percentage agreement without chance correction."

In the comparative statistics module we look to extend the solution in (Kirschner et al., 2015) in seven ways, by: (i) calculating the segmentation differences between two annotations; (ii) calculating propositional content relations using confusion matrices, accounting for all the nodes within an argument map and accounting for a differing segmentation; (iii) calculating dialogical content relations (if they are contained in an argument map) using confusion matrices, accounting for all the nodes within an argument map and accounting for a differing segmentation; (iv) defining the CASS technique to allow calculation scores to be combined; (v) allowing the use of any metric which employs a confusion matrix for the CASS technique, to give consistency to the area of argument mining; (vi) providing results not just for inter-annotator agreement, but also for the comparison of manually annotated corpora against corpora automatically created by argument mining; (vii) allowing the comparison of analyses given in different annotation schemes but migrated to AIF (e.g. comparing text annotated in IAC to the annotation scheme from the Microtext corpus).

4 Comparative Statistics: Segmentation

Comparative statistics can provide for a number of cases with two main motivations: evaluation of automatic annotations against manual gold standards, and comparison of multiple manual annotations. The calculation is given between two separate annotations A1 and A2 available in two separate corpora in AIFdb. (Throughout this paper A is used to denote an annotation, l a locution, p a propositional content node, ta a transition node and ra a propositional content relation.)

To account for a differing segmentation, and so as not to doubly penalise the argument structure, the agreement calculation involves smaller sub-calculations which can give an overview of the full agreement between annotators. Segmentation agreement considers the number of possible segments on which two annotators agree. A segmentation which differs between annotations can have a substantial effect on argument structure, such as the assignment of relations between propositions. An example is provided in Figure 1, where segmentation is given for a first annotator (S1) and a second annotator (S2). In this case the two annotations give segments of very similar mass (the number of words in a segment); however, more boundaries are placed in S2 when compared to S1, with a difference in granularity and a boundary misplaced by a word.

Three techniques are provided to tackle this problem, each recognising that a near miss (two segments that differ by a small margin, e.g. a word) should not be as heavily penalised as a full miss on the placement of segment boundaries. Performing the same calculation with F1 score or Cohen's kappa would result in a heavily penalised segmentation.

The Pk statistic (Beeferman et al., 1999) involves sliding a window of length k (where k is half the average segment size) over the segmented text. For each position of the window, the words at each end of the window are taken and the segment in which they lie is considered.

The WindowDiff statistic (Pevzner and Hearst, 2002) takes into account situations in which Pk fails. In Pk, false negatives are penalised more than false positives, and thus the agreement value can be unfair; WindowDiff remedies this by taking into account the number of reference boundaries and comparing this to the number of hypothesised boundaries.

The segmentation similarity statistic (S) (Fournier and Inkpen, 2012) again takes into account perceived failings of the Pk and WindowDiff statistics. Where both WindowDiff and Pk use fixed-size windows, which can adversely affect the outcome of an agreement calculation, S instead considers a minimum edit distance, scaled to the overall segmentation size. This edit distance allows near misses to be penalised, but not to the same degree as a full miss.
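Both Pk and WindowDiff have readily available implementations; the short sketch below (which is not part of the CASS module itself) shows how they could be computed over two boundary-encoded segmentations. The example segmentations and the choice of k are purely illustrative, and the S statistic is implemented separately in the third-party segeval package.

```python
# Segmentations are given as boundary strings: '1' marks a token that ends a segment.
from nltk.metrics.segmentation import pk, windowdiff

ref = "0100010000"  # reference segmentation (e.g. first annotator, S1)
hyp = "0101000100"  # hypothesis segmentation (e.g. second annotator, S2)

# k is conventionally set to half the average reference segment length.
k = max(2, round(len(ref) / (ref.count("1") + 1) / 2))

print("Pk        :", pk(ref, hyp, k=k))        # window-based probability of disagreement
print("WindowDiff:", windowdiff(ref, hyp, k))  # compares boundary counts within each window
```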

5 Comparative Statistics: Propositional Relations

To compare relations it is important to calculate the agreement between each of the individual items which are annotated within an argument analysis. By providing calculations for individual items in an annotation we take into account that segmentations may differ, but do not penalise on this basis.

In the case of analyses with a differing segmentation, we use a guaranteed matching formula. This formula makes use of the Levenshtein distance (Levenshtein, 1966), where each locution or proposition in an annotation is compared with every locution or proposition in a second annotation. The Levenshtein distance for each comparison is taken and normalised; this is extended by using the position of words within the annotations taken from the original text. The position of words in the original text is important to correctly match propositions and locutions, and therefore a proposition or locution which does not have matching positional words cannot be a match. In this situation the Levenshtein distance is increased (the stored similarity moved to zero) to account for a non-match. Each calculation taken is then stored in a matrix. The matrix is then traversed to find the smallest distance (the highest value between zero and one), selecting that pair of locutions or propositions. This is continued until all nodes are matched or there are no matches which can be made, thus giving a Pareto optimal solution: a solution for which any match between propositions and locutions makes those individual matches consistent without making any other match worse, and vice-versa.

An agreement calculation is given for all propositional content relations (support and attack relations). This calculation is based on the location of support and attack nodes within an analysis and the nodes to which they are connected. For a full agreement between annotators, a support or attack node must be connected between two propositions pi and pj, with these propositions being a match in A1 and A2. A support or attack node also has full agreement when one annotation is more fine grained but holds the same propositional content as the other annotation. For example, if annotation A1 contains a support node which begins its relation in pbc and gives a relation between pbc and pa, then this is the same as if A2 had a support node with two separate propositions, pb and pc, related to pa. This notion is extended when considering Figure 1 and Figure 2.

Figure 2: Propositional Content Relations for an annotation from AIFdb with first and second annotators. (Numbered nodes represent propositions in the overall text and arrows represent support relations.)

The differing segmentation in Figure 1 has an effect on the comparison between propositions. When considering propositions, non-identical propositions lead to a near zero similarity on support relations between these annotations. This is, however, an unintuitive approach to take, as the overall argumentative structure is penalised doubly (if we consider the segmentation and argumentative structure as different tasks) by the differing segmentation.

This is demonstrated in Figure 2, where the two annotators agree that there is a convergent argument between nodes four and five in annotation 1 and nodes five and seven in annotation 2. Extending this is proposition two of both annotations, where in annotation 1, proposition two is connected to four and five by a convergent argument, yet in annotation 2, proposition two is a separate support relation. In the first instance of a convergent argument, splitting the segmentation calculation from the propositional relation calculation gives a fair representation of the argument structure without penalising for segmentation doubly.

In the second instance of a convergent argument and a separate support relation, there is a slight disagreement between the annotators. Despite both annotators agreeing that proposition two connects to the same node (proposition six in annotation 1 and proposition eight in annotation 2 being the same node), a disagreement is shown because of the connection type if we consider Cohen's kappa or F1 score purely.
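The guaranteed matching step described in this section lends itself to a compact implementation. The sketch below is only an illustration of the idea (normalised Levenshtein similarity, positionally disjoint pairs forced to a non-match, greedy selection of the best remaining pair); the dictionary-based proposition format is an assumption made for the example rather than the AIFdb representation.

```python
from nltk import edit_distance

def similarity(a, b):
    # Normalised Levenshtein similarity in [0, 1]; 1 means identical text.
    longest = max(len(a["text"]), len(b["text"]), 1)
    return 1.0 - edit_distance(a["text"], b["text"]) / longest

def match_propositions(ann1, ann2):
    # ann1/ann2: lists of {"text": str, "span": (start, end)} propositions.
    scores = {}
    for i, a in enumerate(ann1):
        for j, b in enumerate(ann2):
            overlap = not (a["span"][1] <= b["span"][0] or b["span"][1] <= a["span"][0])
            scores[(i, j)] = similarity(a, b) if overlap else 0.0  # no shared position: non-match
    matches, used1, used2 = [], set(), set()
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > 0.0 and i not in used1 and j not in used2:
            matches.append((i, j, s))
            used1.add(i)
            used2.add(j)
    return matches
```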

Figure 3: Full AIF IAT diagram for the annotation from the first annotator.

Two options are available when calculating the similarity for this situation: either a similarity of zero is given, or two separate calculations could be used, with agreement on a premise–conclusion basis but no agreement on the type of argument, thus giving a penalty.

To provide a confusion matrix, all the possible node pairs to which a propositional content relation could be connected have to be considered. Any node pairs which both annotators have not connected are counted, and all nodes which are matched are counted, giving the observed agreement. All node pairs which the annotators do not agree upon are also counted.

6 Comparative Statistics: Dialogical Relations

Dialogical relations consider only the dialogue of an argument, with the intentions of the speaker noted. A differing segmentation in various analyses can lead to low kappa or F1 scores; by splitting dialogical relations into a separate calculation, the double penalty assigned by segmentation is removed. When comparing dialogical relations we again use the Levenshtein distance as described in Section 5.

A calculation is provided for illocutionary connections (YA-nodes) anchored in TAs or in locutions. This calculation involves multiple categories, meaning a multiple category confusion matrix, due to the large number of possible YA-node types which can be chosen by annotators. An agreement is observed when both annotators select the same illocutionary connections: when A1 contains a YA-node which is anchored in li and A2 contains the same YA anchored in li, then an agreement is observed. This also holds for TAs. The overall calculation then involves a confusion matrix where disagreements are observed when YA-nodes do not match. If we consider Figures 3 and 4 we can see that between both annotations there is a difference of four YA-nodes.

A second calculation for YA-nodes, checking the agreement on the propositional content nodes in which they anchor and on where they are anchored (locution or TA), is also given. This calculation involves a multiple category confusion matrix. An example of when agreement is observed is when A1 contains pj anchored in a YA and the YA anchored in li, and in A2 the same structure with pj and li is observed with the same YA-node. The multi-category confusion matrix is calculated with disagreements observed when propositions and locutions do not match. When considering Figures 3 and 4 we see an example of agreement between the annotators on propositions 1, 2 and 3. Proposition 4 in Figure 3 and proposition 5 in Figure 4 also match, and the same holds for proposition 5 in Figure 3 and proposition 7 in Figure 4. Disagreements are then observed with propositions 4 and 6 in Figure 4.

Three separate calculations are also given for TA-nodes. The first concerns the position of a TA-node within locutions. Agreement is observed when A1 contains a TA which is anchored in li and anchors lj and A2 contains the same TA anchored in li and anchoring lj. For the final calculation all possible locution pairs are considered, to give values for agreements on TA placement, agreements on non-TA placement and disagreements on TA placement. In the examples in Figures 3 and 4 there is agreement between the annotators on TA 1, TA 2 and TA 3 in Figure 3 and TA 6 in Figure 4.
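As an illustration of the first YA-node calculation, the sketch below builds the multi-category confusion matrix from two annotations, assumed here to be simple mappings from locution identifiers to the YA type chosen by each annotator; any confusion-matrix metric (kappa, F1, and so on) can then be computed from it, as the CASS technique allows. The data format and label names are invented for the example.

```python
from collections import Counter

def ya_confusion(ann1, ann2):
    # Rows are annotator 1's YA types, columns are annotator 2's.
    labels = sorted(set(ann1.values()) | set(ann2.values()))
    counts = Counter((ann1[l], ann2[l]) for l in ann1 if l in ann2)
    return labels, [[counts[(r, c)] for c in labels] for r in labels]

a1 = {"L1": "Asserting", "L2": "Asserting", "L3": "Challenging"}
a2 = {"L1": "Asserting", "L2": "Questioning", "L3": "Challenging"}
labels, matrix = ya_confusion(a1, a2)
observed_agreement = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))
```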

Figure 4: Full AIF IAT diagram for the annotation from the second annotator. Differences from Figure 3 are highlighted.

A second calculation is then given for pairs of propositional content nodes and TA-nodes. When pj is anchored in tai for A1 and the same structure is observed in A2 for the same propositional content node, then there is agreement between the annotators. The overall confusion matrix is calculated by considering all pairs of TA-nodes and propositions and all disagreements between annotators. A third and final calculation is given for TA-nodes anchoring propositional content relations: for A1, if rai is anchored finally in tai and, in A2, raj is anchored finally in taj, then agreement is observed. The overall confusion matrix is calculated by considering all possible pairs of TAs and propositional content relations. In Figures 3 and 4 agreement is observed only on inference 1 in Figure 3 and inference 4 in Figure 4. This provides a small penalty between the annotations for the added inference 2 in Figure 4, where earlier in Section 5 no penalty was given.

7 Aggregating into the CASS technique

Sections 4, 5 and 6 provide calculations for segmentation, propositional content relations and dialogical content relations. We have defined CASS, which incorporates all of these calculation figures to provide a single figure for the agreement between annotators, or between a manual analysis and an automatic one, using both propositional content relations and dialogical content relations.

M = (∑P + ∑D) / n        (1)

CASS = 2(M × S) / (M + S)        (2)

In equation 1 the arithmetic mean, M, is the sum of all propositional content calculations, P, plus the sum of all dialogical content calculations, D, over the total number of calculations made, n. We use this figure along with the segmentation similarity score to perform the harmonic mean and provide an overall agreement figure that is normalised and takes into account any penalties for segmentation errors. Equation 2 gives the CASS technique as the arithmetic mean, M, combined with the segmentation similarity, S.

The CASS technique allows for any consistent combination of scores to be used as either the propositional content calculations or the dialogical calculations. That is to say that the CASS technique is not solely dependent on Cohen's kappa or F1 score; any other overall measure can be substituted instead. For the purpose of this example we will use the Cohen's kappa metric, as both annotations were annotated manually. We also use the S statistic for segmentation similarity as it handles the errors in the Pk and WindowDiff statistics more effectively.

We sum both kappa scores, giving an arithmetic mean, M, of 0.43. The S score, 0.95, is then combined with M in equation 2 to give an overall CASS of 0.59; this gives a fair representation of the overall agreement between the two annotators. In Table 1 the CASS technique is compared with Cohen's kappa and F1 score, where both scores do not take into account the slight difference in argument structure and therefore penalise it.
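Equations (1) and (2) translate directly into code. In the sketch below the individual propositional and dialogical kappa values are invented (only their mean of 0.43 and the S score of 0.95 are given in the text), but with those inputs the function reproduces the reported CASS score of roughly 0.59.

```python
def cass(prop_scores, dial_scores, s):
    scores = list(prop_scores) + list(dial_scores)
    m = sum(scores) / len(scores)  # equation (1): arithmetic mean M
    return 2 * m * s / (m + s)     # equation (2): harmonic mean of M and S

# Illustrative component kappas averaging 0.43, with segmentation similarity S = 0.95.
print(round(cass([0.42], [0.44], s=0.95), 2))  # -> 0.59
```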

Method        Overall Score
Cohen's κ     0.44
CASS-κ        0.59
F1 score      0.66
CASS-F1       0.74

Table 1: Scores are provided for Cohen's kappa and F1 score, for both segmentation and structure, and for CASS with S for segmentation and both kappa and F1 for structure.

Figure 5: Screenshot of the comparative statistics module within Argument Analytics.

7.1 Extending Relation Comparisons

The CASS technique and comparative statistics module cater for the creation of confusion matrices for each calculation, allowing for the adaptation of the overall results. This allows kappa, accuracy, precision, recall and F1 score all to be calculated, but other metrics can also be considered for evaluating automatic analyses when using the CASS technique. Balanced accuracy (Brodersen et al., 2010) allows the evaluation of imbalanced datasets: when one class is much larger than the other, balanced accuracy takes this into account and lowers the score appropriately. Informedness (Powers, 2011) gives the probability that an automatic system is making an informed decision when performing classification. A select set of metrics are part of the comparative statistics module, although no metric is ruled out, allowing any metric employing a confusion matrix to be used with the CASS technique.
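For the binary case the two alternative metrics mentioned above reduce to simple expressions over the confusion matrix: balanced accuracy is the mean of the true positive and true negative rates, and informedness is sensitivity plus specificity minus one. The sketch below is illustrative only (multi-class generalisations exist but are not shown), and the example counts are invented.

```python
def balanced_accuracy(tp, fp, fn, tn):
    tpr = tp / (tp + fn)  # true positive rate (recall on the positive class)
    tnr = tn / (tn + fp)  # true negative rate
    return (tpr + tnr) / 2

def informedness(tp, fp, fn, tn):
    return tp / (tp + fn) + tn / (tn + fp) - 1

# On a heavily imbalanced matrix, plain accuracy would be 0.91 while balanced
# accuracy is only about 0.63, reflecting the poor minority-class recall.
print(balanced_accuracy(tp=3, fp=1, fn=8, tn=88), informedness(tp=3, fp=1, fn=8, tn=88))
```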

7.2 Deployment

Comparative statistics (see Figure 5) is part of the Argument Analytics suite, which is to be publicly accessible at http://analytics.arg.tech/. It provides a suite of techniques for analysing sets of AIF data, with components ranging from the detailed statistics required for discourse analysis or argument mining, to graphic visual representations, offering insights in a way that is accessible to a general audience. Modules are available for: viewing simple statistical data, which provides both an overview of the argument structure and frequencies of patterns such as argumentation schemes; dialogical data highlighting the behaviour of participants in the dialogue; and real-time data allowing for the graphical representation of an argument structure developing over time.

8 Conclusions

Despite the widespread use of Cohen's kappa and F1 score in reporting agreement and performance, they present two key problems when applied to argument mining. First, they do not effectively handle errors of segmentation (or unitization); and second, they are not sensitive to the variety of structural facets of argumentation. These two problems lead to kappa and F1 underestimating the performance or agreement of argument annotation.

The CASS technique allows for the integration of results for segmentation with those for structural annotation, yielding coherent confusion matrices from which new CASS-κ and CASS-F1 scores can be derived. CASS is straightforward to implement, and we have shown that it can be included in web-based analytics for quickly calculating agreement or performance between online datasets. CASS offers an opportunity for increasing coherence within the community, aiding it to emulate the academic success of other subfields of computational linguistics such as summarization; and its subsequent deployment offers a simple way of applying it to future community efforts such as shared tasks and competitions.

Acknowledgments

We would like to acknowledge that the work reported in this paper has been supported in part by EPSRC in the UK under grants EP/M506497/1, EP/N014871/1 and EP/K037293/1.

References

Stergos Afantenos, Nicholas Asher, Farah Benamara, Myriam Bras, Cécile Fabre, Lydia-Mai Ho-Dac, Anne Le Draoulec, Philippe Muller, Marie-Paule Péry-Woodley, Laurent Prévot, et al. 2012. An empirical resource for discovering cognitive principles of discourse organisation: the annodis corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA).

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1-3):177–210.

Yonatan Bilu, Daniel Hershcovich, and Noam Slonim. 2015. Automatic claim negation: Why, how and when. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 84–93.

K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann. 2010. The balanced accuracy and its posterior distribution. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3121–3124.

Lucas Carstens and Francesca Toni. 2015. Towards relation based argumentation mining. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 29–34.

Carlos Chesñevar, Sanjay Modgil, Iyad Rahwan, Chris Reed, Guillermo Simari, Matthew South, Gerard Vreeswijk, Steven Willmott, et al. 2006. Towards an Argument Interchange Format. The Knowledge Engineering Review, 21(04):293–316.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Chris Fournier and Diana Inkpen. 2012. Segmentation similarity and agreement. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 152–161. Association for Computational Linguistics.

James B. Freeman. 2011. Argument Structure: Representation and Theory. Springer.

Nancy L. Green. 2015. Identifying argumentation schemes in genetics research articles. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 12–21.

Ivan Habernal and Iryna Gurevych. 2016. Argumentation mining in user-generated web discourse. Computational Linguistics.

Mathilde Janier, John Lawrence, and Chris Reed. 2014. OVA+: An argument analysis interface. In S. Parsons, N. Oren, C. Reed, and F. Cerutti, editors, Proceedings of the Fifth International Conference on Computational Models of Argument (COMMA 2014), pages 463–464, Pitlochry. IOS Press.

Johannes Kiesel, Khalid Al-Khatib, Matthias Hagen, and Benno Stein. 2015. A shared task on argumentation mining in newspaper editorials. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 35–38.

Christian Kirschner, Judith Eckle-Kohler, and Iryna Gurevych. 2015. Linking the thoughts: Analysis of argumentation structures in scientific publications. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 1–11.

Klaus Krippendorff. 2007. Computing Krippendorff's alpha reliability. Departmental Papers (ASC), page 43.

John Lawrence and Chris Reed. 2015. Combining argument mining techniques. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 127–136.

John Lawrence, Floris Bex, Chris Reed, and Mark Snaith. 2012. AIFdb: Infrastructure for the Argument Web. In Proceedings of the Fourth International Conference on Computational Models of Argument (COMMA 2012), pages 515–516.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8.

Marie-Francine Moens. 2013. Argumentation mining: Where are we now, where do we want to be and how do we get there? In FIRE '13: Proceedings of the 5th 2013 Forum on Information Retrieval Evaluation.

Huy V. Nguyen and Diane J. Litman. 2015. Extracting argument and domain words for identifying argument components in texts. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 22–28.

Shereen Oraby, Lena Reed, Ryan Compton, Ellen Riloff, Marilyn Walker, and Steve Whittaker. 2015. And that's a fact: Distinguishing factual and emotional argumentation in online dialogue. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 116–126.

Joonsuk Park, Arzoo Katiyar, and Bishan Yang. 2015. Conditional random fields for identifying appropriate types of support for propositions in online user comments. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 39–44.

Toshihiko Yanase, Toshinori Miyoshi, Kohsuke Yanai, Misa Sato, Makoto Iwayama, Yoshiki Niwa, Paul Reisert, and Kentaro Inui. 2015. Learning sentence ordering for opinion generation of debate. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 94–103.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: a survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Andreas Peldszus and Manfred Stede. 2015. An annotated corpus of argumentative microtexts. Proceedings of the First Conference on Argumentation, Lisbon, Portugal, June. To appear.

Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.

D. M. W. Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2:27–63.

Paul Reisert, Naoya Inoue, Naoaki Okazaki, and Kentaro Inui. 2015. A computational approach for generating Toulmin model argumentation. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 45–55.

C. J. Van Rijsbergen. 1979. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition.

Christos Sardianos, Ioannis Manousos Katakis, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument extraction from news. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 56–66.

John R. Searle and Daniel Vanderveken. 1985. Foundations of Illocutionary Logic. Cambridge University Press.

John R. Searle. 1969. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press.

Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the Second Workshop on Argumentation Mining. Association for Computational Linguistics, pages 67–77.

Marilyn A Walker, Jean E Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In LREC, pages 812–817.

Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms

Jaromír Šavelka, University of Pittsburgh, 4200 Fifth Avenue, Pittsburgh, PA 15260, USA, [email protected]
Kevin D. Ashley, University of Pittsburgh, 4200 Fifth Avenue, Pittsburgh, PA 15260, USA, [email protected]

Abstract

Legal argumentation often centers on the interpretation and understanding of terminology. Statutory texts are known for a frequent use of vague terms that are difficult to understand. Arguments about the meaning of statutory terms are an inseparable part of applying statutory law to a specific factual context. In this work we investigate the possibility of supporting this type of argumentation by automatic extraction of sentences that deal with the meaning of a term. We focus on case law because court decisions often contain sentences elaborating on the meaning of one or more terms. We show that human annotators can reasonably agree on the usefulness of a sentence for an argument about the meaning (interpretive usefulness) of a specific statutory term (kappa > 0.66). We specify a list of features that could be used to predict the interpretive usefulness of a sentence automatically. We work with off-the-shelf classification algorithms to confirm the hypothesis (accuracy > 0.69).

1 Introduction

Statutory law is written law enacted by an official legislative body. A single statute is usually concerned with a specific area of regulation. It consists of provisions which express the individual legal rules (e.g., rights, prohibitions, duties).

Understanding statutory provisions is difficult because the abstract rules they express must account for diverse situations, even those not yet encountered. The legislators use vague (Endicott, 2000), open textured (Hart, 1994) terms, abstract standards (Endicott, 2014), principles, and values (Daci, 2010) in order to deal with this uncertainty. When there are doubts about the meaning of the provision they may be removed by interpretation (MacCormick and Summers, 1991). Even a single word may be crucial for the understanding of the provision as applied in a particular context.

Let us consider the example rule: "No vehicles in the park." (The example comes from the classic 1958 Hart-Fuller debate over the interpretation of rules.) While it is clear that automobiles or trucks are not allowed in the park, it may be unclear if the prohibition extends to bicycles. In order to decide if a bicycle is allowed in the park it is necessary to interpret the term 'vehicle'.

The interpretation involves an investigation of how the term has been referred to, explained, interpreted or applied in the past. This is an important step that enables a user to then construct arguments in support of or against particular interpretations. Searching through a database of statutory law, court decisions, or law review articles one may stumble upon sentences such as these:

i. Any mechanical device used for transportation of people or goods is a vehicle.
ii. A golf cart is to be considered a vehicle.
iii. To secure a tranquil environment in the park no vehicles are allowed.
iv. The park where no vehicles are allowed was closed during the last month.
v. The rule states: "No vehicles in the park."

Some of the sentences are useful for the interpretation of the term 'vehicle' from the example provision (i. and ii.). Some of them look like they may be useful (iii.) but the rest appears to have very little (iv.) if any (v.) value. Going through the sentences manually is labor intensive. The large number of useless sentences is not the only problem. Perhaps even more problematic is the large redundancy of the sentences.


In this paper we investigate if it is possible to retrieve the set of useful sentences automatically. Specifically, we test the hypothesis that by using a set of automatically generated linguistic features about/in the sentence it is possible to evaluate how useful the sentence is for an interpretation of the term from a specific statutory provision.

In Section 2 we describe the new statutory term interpretation corpus that we created for this work. Section 3 describes the tentative set of the features for the evaluation of the sentences' interpretive usefulness. In Section 4 we confirm our hypothesis by presenting and evaluating a rudimentary version of the system (using stock ML algorithms) capable of determining how useful a sentence is for term interpretation.

2 Statutory Term Interpretation Data

Court decisions apply statutory provisions to specific cases. To apply a provision correctly a judge usually needs to clarify the meaning of one or more terms. This makes court decisions an ideal source of sentences that possibly interpret statutory terms. Legislative history and legal commentaries tentatively appear to be promising sources as well. We will investigate the usefulness of these types of documents in future work. Here we focus on sentences from court decisions only.

In order to create the corpus we selected three terms from different provisions of the United States Code, which is the official collection of the federal statutes of the United States (available at https://www.law.cornell.edu/uscode/text/). The selected terms were 'independent economic value' from 18 U.S. Code § 1839(3)(B), an 'identifying particular' from 5 U.S. Code § 552a(a)(4), and 'common business purpose' from 29 U.S. Code § 203(r)(1). We specifically selected terms that are vague and come from different areas of regulation. We are aware that the number of terms we work with is low. We did not specify additional terms because the cost of subsequent labeling is high. Three terms are sufficient for the purpose of this paper.

For each term we have collected a small set of sentences by extracting all the sentences mentioning the term from the top 20 court decisions retrieved from Court Listener (https://www.courtlistener.com/; the search query was formulated as a phrase search for the term and was limited to the 120 federal jurisdictions; the corpus corresponds to the state of the Court Listener database on February 16, 2016, which is the last time we updated the corpus). The focus on the top 20 decisions only reflected the high cost of the labeling. In total we assembled a small corpus of 243 sentences.

Two expert annotators, each with a law degree, classified the sentences into four categories according to their usefulness for the interpretation of the corresponding term:

1. high value - This category is reserved for sentences the goal of which is to elaborate on the meaning of the term. By definition, these sentences are those the user is looking for.
2. certain value - Sentences that provide grounds to draw some (even modest) conclusions about the meaning of the term. Some of these sentences may turn out to be very useful.
3. potential value - Sentences that provide additional information beyond what is known from the provision the term comes from. Most of the sentences from this category are not useful.
4. no value - This category is used for sentences that do not provide any additional information over what is known from the provision. By definition, these sentences are not useful for the interpretation of the term.

For future work we plan to extend the corpus. Eventually, we would like the system to assign a sentence with a score from a continuous interval. Since we cannot ask the human annotators to do the same, we discretized the interval into the four categories for the purpose of the evaluation. There was no time limit imposed on the annotation process.

         # HV           # CV           # PV             # NV
# HV     19 (1/4/14)    1 (0/0/1)      1 (0/1/0)        0 (0/0/0)
# CV     15 (1/6/8)     12 (2/0/10)    9 (1/4/4)        1 (0/1/0)
# PV     2 (0/0/2)      27 (11/1/15)   105 (29/36/40)   11 (0/3/8)
# NV     0 (0/0/0)      0 (0/0/0)      4 (2/2/0)        36 (5/13/18)

Table 1: Confusion matrix of the labels assigned by the two annotators (HV: high value, CV: certain value, PV: potential value, NV: no value; the first number is the total count and the numbers in the brackets are the counts for the individual terms: 'independent economic value'/'identifying particular'/'common business purpose').

Term                 # HV   # CV   # PV   # NV   # Total
Ind. economic val.   2      5      40     5      52
Identifying part.    6      8      40     17     71
C. business purp.    20     26     51     23     120
Total                28     39     131    45     243

Table 2: Distribution of sentences with respect to their interpretive value (HV: high value, CV: certain value, PV: potential value, NV: no value).

Table 1 shows the confusion matrix of the labels as assigned by the two expert annotators. The average inter-annotator agreement was 0.75 with weighted kappa at 0.66. For the 'independent economic value' the agreement was 0.71 and the kappa 0.51, for the 'identifying particular' 0.75 and 0.67, and for the 'common business purpose' 0.75 and 0.68 respectively. The lower agreement in the case of the 'independent economic value' could be explained by the fact that this term was the first the annotators were dealing with. Although we provided a detailed explanation of the annotation task, we did not provide the annotators with an opportunity to practice before they started with the annotation. The practice could be helpful and we plan to use it in future additions to the corpus.

After the annotation was finished the annotators met and discussed the sentences for which their labels differed. In the end they were supposed to agree on consensus labels for all of those sentences. For example, the following sentence from the 'identifying particular' part of the corpus was assigned with different labels:

Here, the district court found that the duty titles were not numbers, symbols, or other identifying particulars.

One of the reviewers opted for the 'certain value' label while the other one picked the 'high value' label. In the end the reviewers agreed that the goal of the sentence is not to elaborate on the meaning of the 'identifying particular' and that it provides grounds to conclude that, e.g., duty titles are not identifying particulars. Therefore, the 'certain value' label is more appropriate.

Table 2 reports counts for the consensus labels. The most frequent label (53.9%) is the 'potential value.' The least frequent (11.5%) is the 'high value' label. The distribution varies slightly for the different terms.

3 Features for Predicting Interpretive Usefulness of Sentences

For testing the hypothesis we came up with a tentative list of features that could be helpful in predicting the interpretive usefulness of a sentence. We reserve the refinement of this list for future work. In addition, many features were generated with very simple models, which leaves space for significant improvements. We briefly describe each of the features in the following subsections.

3.1 Source

This category models the relation between the source of the term of interest (i.e., the statutory provision it comes from) and the source of the term as used in the retrieved sentence. To automatically generate this feature we used a legal citation extractor (https://github.com/unitedstates/citation). Each sentence can be assigned with one of the following labels:

1. Same provision: This label is predicted if we detect a citation of the provision the term of interest comes from in any of the 10 sentences preceding or following the sentence mentioning the term of interest.
2. Same section: We predict this label if we detect a citation of a provision from the same section of the United States Code in the window of 10 sentences around the sentence mentioning the term of interest.
3. Different section: This label is predicted if we detect any other citation to the United States Code anywhere in the decision's text.
4. Different jurisdiction: We predict this label if we are not able to detect any citation to the United States Code.

The distribution of the labels in this category is summarized in the top left corner of Table 3. We can see that the distribution wildly differs across the terms we work with. For the 'independent economic value' the 'different jurisdiction' (DJR) label is clearly dominant, whereas for the 'common business purpose' we predict the 'same provision' (SPR) almost exclusively.

                     Source                   Semantic Similarity      Structural Placement
                     SPR   SSC  DSC  DJR      SAM   SIM  REL  DIF      STS  CIT  QEX  HD  FT
Ind. economic val.   9     0    0    43       37    1    14   0        9    29   11   0   3
Identifying part.    39    28   0    4        67    0    0    4        29   33   5    0   4
C. business purp.    118   0    0    2        118   2    0    0        65   29   24   2   0
Total                166   28   0    49       224   3    14   4        103  91   40   2   7

                     Syntactic Importance     Rhetorical Role
                     DOM   IMP  NOT           STL  APL  APA  STF  INL  EXP  RES  HLD  OTH
Ind. economic val.   5     25   22            23   13   0    3    3    2    7    1    0
Identifying part.    3     21   47            32   7    1    6    9    5    6    5    0
C. business purp.    22    64   34            32   27   1    8    23   14   6    5    4
Total                30    110  103           87   47   2    17   35   21   19   11   4

                     Attribution                       Assignment/Contrast        Feature
                     JUD   LEG  PTY  WIT  EXP  NA      ASC  TSC  TSA  TNA  NA     AF   TF
Ind. economic val.   20    25   7    0    0    52      0    0    0    0    37     0    15
Identifying part.    36    32   3    0    0    15      49   0    0    7    28     0    43
C. business purp.    87    25   7    0    1    107     8    0    3    2    98     11   11
Total                143   82   17   0    1    177     57   0    3    9    163    11   69

Table 3: The table shows distribution of the features generated for the prediction of sentences' interpretive usefulness. Source: Same provision (SPR), same section (SSC), different section (DSC), different jurisdiction (DJR). Semantic similarity: same (SAM), similar (SIM), related (REL), different (DIF). Structural placement: quoted expression (QEX), citation (CIT), heading (HD), footnote (FT), standard sentence (STS). Syntactic importance: dominant (DOM), important (IMP), not important (NOT). Rhetorical role: application of law to factual context (APL), applicability assessment (APA), statement of fact (STF), interpretation of law (INL), statement of law (STL), general explanation or elaboration (EXP), reasoning statement (RES), holding (HLD), other (OTH). Attribution: legislator (LEG), party to the dispute (PTY), witness (WIT), expert (EXP), judge (JUD). Assignment/Contrast: another term is a specific case of the term of interest (ASC), the term of interest is a specific case of another term (TSC), the term of interest is the same as another term (TSA), the term of interest is not the same as another term (TNA), no assignment (NA). Feature assignment: the term of interest is a feature of another term (TF), another term is a feature of the term of interest (AF), no feature assignment (NA).

As an example let us consider the following sentence retrieved from one of the decisions:

The full text of § 1839(3)(B) is: "[...]". [...] Every firm other than the original equipment manufacturer and RAPCO had to pay dearly to devise, test, and win approval of similar parts; the details unknown to the rivals, and not discoverable with tape measures, had considerable "independent economic value ... from not being generally known".

Here we detect the citation to the same provision in the sentence mentioning the term of interest. We predict the 'same source' label.

3.2 Semantic Similarity

This category is auxiliary to the 'source' discussed in the preceding subsection. Here we model the semantic relationship between the term of interest as used in the statutory provision and in the retrieved sentence. Essentially, we ask if the meaning of the terms is the same and, if not, how much the meanings differ. We partially model this feature based on the label in the 'source' category as well as on the cosine similarity between the bag-of-words (TFIDF) representations of the source provision and the retrieved sentence. Each sentence can be assigned with one of the following labels:

1. Same: We predict this label if the 'same provision' label was predicted in the source category.
2. Similar: We predict this label if the cosine similarity is higher than 0.5.
3. Related: We predict this label if the cosine similarity is between 0.25 and 0.5.
4. Different: We predict this label if the cosine similarity is lower than 0.25.

By definition this feature is useful only in case the 'same provision' label is not predicted in the 'source' category.
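Under the stated thresholds, the semantic similarity label can be sketched as below. The provision and sentence strings are invented, the preprocessing is a plain TfidfVectorizer with default settings, and the way the 'same provision' information is passed in is an assumption, so this is an approximation of the feature rather than the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity_label(provision, sentence, same_provision=False):
    if same_provision:          # falls back on the 'source' feature
        return "same"
    tfidf = TfidfVectorizer().fit_transform([provision, sentence])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    if sim > 0.5:
        return "similar"
    if sim >= 0.25:
        return "related"
    return "different"

provision = "the information derives independent economic value from not being generally known"
sentence = "the process has independent economic value because it is not generally known"
print(semantic_similarity_label(provision, sentence))
```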

The distribution of the labels in this category can be seen in the middle component of the top row in Table 3. As we have predicted the 'same' label in most of the cases, this feature did not prove very helpful in our experiments (see Section 4). We plan to refine the notion of this feature in future work. For example, we would like to use a more sophisticated representation of the term of interest such as word2vec.

The two following examples show sentences that use the same term with a different meaning:

[...] the information derives independent economic value, actual or potential, from not being generally known to, and not being readily ascertainable through proper means by, the public;

[...] posted in the establishment in a prominent position where it can be readily examined by the public;

The first sentence mentions the term 'public' for the purpose of the trade secret protection. The term refers to customers, competitors and the general group of experts on a specific topic. The second sentence uses the term to refer to the general 'public.'

3.3 Syntactic Importance

In this category we are interested in how dominant the term is in the retrieved sentence. To model the feature we use syntactic parsing (Chen and Manning, 2014). Specifically, we base our decision on the ratio of the tokens that are deeper in the tree structure (further from the root) than the tokens standing for the term of interest, divided by the count of all the tokens. Each sentence can be assigned with one of the following labels:

1. Dominant: We predict this label if the ratio is greater than 0.5.
2. Important: This label is predicted if the ratio is less than 0.5 but greater than 0.2.
3. Not important: We predict this label if the ratio is less than 0.2.

The distribution of the labels in this category is summarized in the left section of the middle row in Table 3. We labeled most sentences as either 'important' or 'not important' (in around the same proportion). Only a small number of sentences were labeled with the 'dominant' label.

As an example let us consider the following sentence with its syntactic tree shown in Figure 1:

The park where no vehicles are allowed was closed during the last month.

Figure 1: Syntactic tree for the example sentence.

The syntactic tree contains only one token which is deeper in the structure than 'vehicle' (the term of interest). Therefore, the ratio is 1/13 and this sentence is labeled as 'not important.'

3.4 Structural Placement

This category describes the place of the retrieved sentence and the term of interest in the structure of the document it comes from. To model this feature we use simple pattern matching. Each sentence can be assigned with one of the following labels:

1. Quoted expression: We predict this label for a sentence that contains the term of interest in a sequence of characters enclosed by double or single quotes, if the sequence starts with a lower case letter.
2. Citation: This label is predicted if all the conditions for the 'quoted expression' label are met except that the starting character of the sequence is in upper case.
3. Heading: This label is predicted if we detect an alphanumeric numbering token at the beginning of the retrieved sentence.
4. Footnote: We predict this label for a sentence that starts a line with digits enclosed in square brackets.
5. Standard sentence: We predict this label if none of the patterns for other labels matches.

The distribution of the labels in this category is shown in the top right corner of Table 3. Almost all the sentences were labeled as the 'standard sentence', the 'citation', or the 'quoted expression.' Only a very small number of sentences were recognized as the 'heading' or the 'footnote.'
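The placement rules can be rendered as simple regular expressions. The patterns below are rough approximations written for this illustration (straight quotes only, simple numbering and footnote markers); the authors' exact patterns are not specified in the text.

```python
import re

def structural_placement(sentence, term):
    m = re.search(r'["\']([^"\']+)["\']', sentence)
    if m and term in m.group(1):
        # quoted sequence containing the term: 'citation' if it starts upper case
        return "citation" if m.group(1)[:1].isupper() else "quoted expression"
    if re.match(r'^\s*(?:[A-Z]|\d+)[.)]\s', sentence):  # alphanumeric numbering token
        return "heading"
    if re.match(r'^\s*\[\d+\]', sentence):              # e.g. "[5] ..." footnote marker
        return "footnote"
    return "standard sentence"

print(structural_placement("A. Related Activities and Common Business Purpose", "common business purpose"))
print(structural_placement("[5] However, in view of the requirement of the Act ...", "common business purpose"))
```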

Two examples below show a heading and a footnote correctly recognized in the retrieved sentences:

A. Related Activities and Common Business Purpose

[5] [...] However, in view of the 'common business purpose' requirement of the Act, we think [...]

3.5 Rhetorical Role

In this category we are interested in the rhetorical role that the retrieved sentence has in the document it comes from. Although some more sophisticated approaches to automatic generation of this feature have been proposed (Saravanan and Ravindran, 2010; Ravindran and others, 2008; Grabmair et al., 2015), we model it as a simple sentence classification task. We used a bag-of-words (TFIDF weights) representation as features and manually assigned labels for training. Each sentence can be assigned with one of the following labels:

1. Application of law to factual context
2. Applicability assessment
3. Statement of fact
4. Interpretation of law
5. Statement of law
6. General explanation or elaboration
7. Reasoning statement
8. Holding
9. Other

The distribution of the labels in this category is shown in the right part of the middle row in Table 3. Most of the sentences were labeled as the 'statement of law,' the 'application of law,' or the 'interpretation of law.'

3.6 Attribution

This category models who has uttered the retrieved sentence. For the purpose of this paper we rely on pattern matching, with the assumption that the judge utters the sentence if none of the patterns matches. Each sentence can be assigned with one of the following labels:

1. Legislator: We predict this label if we detect a citation to US statutory law followed by a pattern corresponding to a citation as described in the earlier category.
2. Party to the dispute: We predict this category if we detect a mention of the party (either its name or its role, such as plaintiff) followed by one of a specifically prepared list of verbs such as 'contend', 'claim', etc.
3. Witness: This label is predicted if we match the word 'witness' followed by one of the verbs from the same set as in the case of the preceding label.
4. Expert: This label is predicted in the same way as the 'witness' label, but instead of the word 'witness' we match 'expert'.
5. Judge: We predict this label if none of the patterns for other labels matches.

The distribution of the labels in this category is shown in the bottom left corner of Table 3. We were able to recognize a reasonable number of the 'legislator' labels, but apart from that we almost always used the catch-all 'judge' label.

The following example shows a sentence for which we predict the 'party to the dispute' label:

In support of his contention that Gold Star Chili and Caruso's Ristorante constitute an enterprise, plaintiff alleges that Caruso's Ristorante and Gold Star Chili were engaged in the related business activity [...].

3.7 Assignment/Contrast

Here we are interested in whether the term of interest in the retrieved sentence is said to be (or not to be) some other term. To model this category we use pattern matching on the verb phrase of which the term of interest is part (if there is such a phrase in the sentence). Each sentence can be assigned with one of the following labels:

1. Another term is a specific case of the term of interest: This label is predicted if one of a specified set of verbs (e.g., may be, can be) is preceded by a noun and followed by the term of interest within a verb phrase.
2. The term of interest is a specific case of another term: In the case of this label we proceed in the same way as in the case of the preceding label, but the noun and the term of interest are swapped.

3. The term of interest is the same as another term: In the case of this label we use a different set of verbs (e.g., is, equals) and we do not care about the order of the term of interest and the noun.
4. The term of interest is not the same as another term: We proceed in the same way as in the case of the preceding label, but we also require a negation token to occur (e.g., not).
5. No assignment: We predict this label if none of the patterns for other labels matches.

The distribution of the labels in this category is shown in the middle part of the bottom row in Table 3. A certain amount of the 'another term is a specific case of the term of interest' label was predicted in the 'identifying particular' part of the data set. For the rest of the dataset the catch-all 'no assignment' label was used in most of the cases.

The following example shows a sentence that we labeled with the 'the term of interest is the same as another term' label:

The Fifth Circuit has held that the profit motive is a common business purpose if shared.

3.8 Feature Assignment

In this category we analyze if the term of interest in the retrieved sentence is said to be a feature of another term (or vice versa). We model this category by pattern matching on the verb phrase of which the term of interest is part. Each sentence can be assigned with one of the following labels:

1. The term of interest is a feature of another term: This label is predicted if one of a specified set of verbs (e.g., have) is followed by the term of interest within a verb phrase.
2. Another term is a feature of the term of interest: This label is predicted if the term of interest precedes one of the verbs.
3. No feature assignment: We predict this label if none of the patterns for other labels matches.

The distribution of the labels in this category is shown in the bottom right corner of Table 3. The 'no feature assignment' label was predicted in approximately 2/3 of the cases and the 'term of interest is a feature of another term' in the rest.

The following example shows a sentence that we labeled with the 'the term of interest is a feature of another term' label:

However, Reiser concedes in its brief that the process has independent economic value.

Here, the independent economic value is said to be an attribute of the process.

4 Predicting Usefulness of Sentences for Interpretation of the Terms of Interest

We work with the dataset described in Section 2. The goal is to classify the sentences into the four categories reflecting their usefulness for the interpretation of the terms of interest. As features we use the categories described in Section 3.

The experiment starts with a random division of the sentences into a training set (2/3) and a test set. The resulting training set consists of 162 sentences while there are 81 sentences in the test set. As classification models we train a Naïve Bayes, an SVM (with linear kernel and L2 regularization), and a Random Forest (with 10 estimators and Gini impurity as the measure of the quality of a split) using the scikit-learn library (Pedregosa et al., 2011). We use a simple classifier always predicting the most frequent label as the baseline.

Because our data set is small and the division into the training and test set influences the performance, we repeat the experiment 100 times.

56 report the mean results of 10-fold cross validation question answering problem. An extension in- on the training set and evaluation on the test set as troducing interactivity was proposed by Lin et al. well as the standard deviations in Table 4. (2010). All the three classifiers outperform the most fre- A number of interesting applications deal with quent class baseline. However, due to the large similar tasks in different domains. Sauper and variance of the results from the 100 runs the im- Barzilay (2009) proposed an approach to auto- provement is statistically significant (α = .05) matic generation of Wikipedia articles. Demner- only for the Random Forest which is the best per- Fushman and Lin (2006) described an extractive forming classifier overall. With the accuracy of summarization system for clinical QA. Wang et al. .696 on the test set the agreement of the Random (2010) presented a system for recommending rel- Forest classifier with the consensus labels is quite evant information to the users of Internet forums close to the inter-annotator agreement between the and blogs. Yu et al. (2011) mine important prod- two human expert annotators (.746). uct aspects from online consumer reviews. We also tested which features are the most im- portant for the predictions with the Random For- 6 Discussion and Future Work est. We ran the 100-batches of the experiments The results of the experiments are promising. leaving out one feature in each batch. The results They confirm the hypothesis even though we used reported in Table 5 show that the source and the extremely simplistic (sometimes clearly inade- syntactic importance were the most important. quate) approaches to generate the features auto- matically. We have every reason to expect that im- 5 Related Work provements in the quality of the feature generation Because argumentation plays an essential role in will improve the quality of the interpretive useful- law, the extraction of arguments from legal texts ness assessment. We would like to investigate this has been an active area of research for some assumption in future work. time. Mochales and Moens detect arguments con- It is also worth mentioning that we used only sisting of premises and conclusions and, using simple off-the-shelf classification algorithms that different techniques, they organize the individ- we did not tweak or optimize for the task. As in ual arguments extracted from the decisions of the the case of the features, improvements in the al- European Court of Human Rights into an over- gorithms we use would most likely lead to an im- all structure (Moens et al., 2007; Mochales and provement in the quality of the interpretive useful- Ieven, 2009; Mochales-Palau and Moens, 2007; ness assessment. We plan to focus on this aspect Mochales and Moens, 2011). In their work on in future work. vaccine injury decisions Walker, Ashley, Grab- The analysis of the importance of the individ- mair and other researchers focus on extraction of ual features for the success in our task showed that evidential reasoning (Walker et al., 2011; Ashley contribution of some of the features was quite lim- and Walker, 2013; Grabmair et al., 2015). Brun- ited. We would caution against the conclusion that inghaus and Ashley (2005) and Kubosawa et al. those features are not useful. It may very well be (2012) extract case factors that could be used in the case that our simplistic techniques for the au- arguing about an outcome of the case. 
In addi- tomatic generation of those features did not model tion, argumentation mining has been applied in a them adequately. As already mentioned, we plan study of diverse areas such as parliamentary de- on improving the means by which the features are bates (Hirst et al., 2014) or public participation in generated in future work. rulemaking (Park et al., 2015). We are well aware of the limitations of the work The task we deal with is close to the tra- stemming from the small size of the corpus. This ditional NLP task of query-focused summariza- is largely due to the fact that getting the labels is tion of multiple documents as described in Gupta very expensive. Since the nature of this work is (2010). Fisher and Roark (2006) presented a sys- exploratory in the sense of showing that the task tem based on supervised sentence ranking. Daume´ is (a) interesting and (b) can be automatized, we and Marcu (2006) tackled the situation in which could not afford a corpus of more adequate size. the retrieved pool of documents is large. Schiff- However, since the results of the experiments are man and McKeown (2007) cast the task into a promising we plan to extend the corpus.
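For concreteness, the experimental protocol of Section 4 can be sketched in a few lines of scikit-learn code. The snippet below is only an illustration, not the authors' implementation: it assumes the automatically generated features and consensus labels are already available as arrays X and y (hypothetical names), and the exact Naive Bayes variant is a guess, since the paper does not specify one.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC

    def run_once(X, y, seed):
        # Random division into a training set (2/3) and a test set (1/3).
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                                  random_state=seed)
        models = {
            "most frequent": DummyClassifier(strategy="most_frequent"),
            "naive bayes": GaussianNB(),          # NB variant is an assumption
            "svm (linear, L2)": LinearSVC(penalty="l2"),
            "random forest": RandomForestClassifier(n_estimators=10,
                                                    criterion="gini",
                                                    random_state=seed),
        }
        results = {}
        for name, clf in models.items():
            cv = cross_val_score(clf, X_tr, y_tr, cv=10).mean()  # 10-fold CV on train
            test = clf.fit(X_tr, y_tr).score(X_te, y_te)         # test-set accuracy
            results[name] = (cv, test)
        return results

    def repeated_experiment(X, y, runs=100):
        # Repeat the random split 100 times and aggregate, as for Tables 4 and 5.
        collected = {}
        for seed in range(runs):
            for name, scores in run_once(X, y, seed).items():
                collected.setdefault(name, []).append(scores)
        return {name: (np.mean(vals, axis=0), np.std(vals, axis=0))
                for name, vals in collected.items()}

Reporting the means and standard deviations over the 100 runs, as in Table 4, follows directly from the returned dictionary; the feature-ablation results in Table 5 are obtained by rerunning the same loop with one feature group removed at a time.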

This work is meant as the first step towards a fully functional and well described framework supporting argumentation about the meaning of statutory terms. Apart from facilitating easier access to law for lawyers, it is our goal to lower the barrier for public officials and other users who need to work with legal texts. In addition, we believe such a framework could support dialogue between lawyers and experts from other fields. There could be a great impact on legal education as well.

7 Conclusion

We investigated the possibility of automatic extraction of case law sentences that deal with the meaning of statutory terms. We showed that human annotators can reasonably agree on the interpretive usefulness of a sentence for argumentation about the meaning of a specific statutory term. We specified a list of features that could be useful for a prediction of the interpretive usefulness of a sentence. We used stock classification algorithms to confirm the hypothesis that, by using a set of automatically generated linguistic features about the sentence, it is possible to evaluate how useful the sentence is for an argumentation about the meaning of a term from a specific statutory provision.

References

Kevin D Ashley and Vern R Walker. 2013. From information retrieval (IR) to argument retrieval (AR) for legal cases: Report on a baseline study.

Stefanie Brüninghaus and Kevin D Ashley. 2005. Generating legal arguments and predictions from case texts. In Proceedings of the 10th International Conference on Artificial Intelligence and Law, pages 65–74. ACM.

Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Timothy Endicott. 2000. Vagueness in Law. Oxford University Press.

Timothy Endicott. 2014. Law and Language. The Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/. Accessed: 2016-02-03.

Seeger Fisher and Brian Roark. 2006. Query-focused summarization by supervised sentence ranking and skewed word distributions. In Proceedings of the Document Understanding Conference, DUC-2006, New York, USA. Citeseer.

Matthias Grabmair, Kevin D Ashley, Ran Chen, Preethi Sureshkumar, Chen Wang, Eric Nyberg, and Vern R Walker. 2015. Introducing luima: an experiment in legal conceptual retrieval of vaccine injury decisions using a uima type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, pages 69–78. ACM.

Vishal Gupta and Gurpreet Singh Lehal. 2010. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3):258–268.

Herbert L. Hart. 1994. The Concept of Law. Clarendon Press, 2nd edition.

Graeme Hirst, Vanessa Wei Feng, Christopher Cochrane, and Nona Naderi. 2014. Argumentation, ideology, and issue framing in parliamentary discourse. In ArgNLP.

Shumpei Kubosawa, Youwei Lu, Shogo Okada, and Katsumi Nitta. 2012. Argument analysis. In Legal Knowledge and Information Systems: JURIX 2012, the Twenty-fifth Annual Conference, volume 250, page 61. IOS Press.

Jimmy Lin, Nitin Madnani, and Bonnie J Dorr. 2010. Putting the user in the loop: interactive maximal marginal relevance for query-focused summarization.
In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 305–308. Association for Computational Linguistics.

Jordan Daci. 2010. Legal principles, legal values and legal norms: are they the same or different? Academicus International Scientific Journal, 02:109–115.

Hal Daumé III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 305–312. Association for Computational Linguistics.

Dina Demner-Fushman and Jimmy Lin. 2006. Answer extraction, semantic clustering, and extractive summarization for clinical question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 841–848. Association for Computational Linguistics.

D. Neil MacCormick and Robert S. Summers. 1991. Interpreting Statutes. Dartmouth.

Raquel Mochales and Aagje Ieven. 2009. Creating an argumentation corpus: do theories apply to real arguments? A case study on the legal argumentation of the ECHR. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, pages 21–30. ACM.

Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.

Raquel Mochales-Palau and M Moens. 2007. Study on sentence relations in the automatic detection of argumentation in legal cases. Frontiers in Artificial Intelligence and Applications, 165:89.

Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, pages 225–230. ACM.

Joonsuk Park, Cheryl Blake, and Claire Cardie. 2015. Toward machine-assisted participation in erulemaking: an argumentation model of evaluability. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, pages 206–210. ACM.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Balaraman Ravindran et al. 2008. Automatic identification of rhetorical roles using conditional random fields for legal document summarization.

M Saravanan and Balaraman Ravindran. 2010. Identification of rhetorical roles for segmentation and summarization of a legal judgment. Artificial Intelligence and Law, 18(1):45–76.

Christina Sauper and Regina Barzilay. 2009. Automatically generating Wikipedia articles: A structure-aware approach. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 208–216. Association for Computational Linguistics.

Barry Schiffman, Kathleen McKeown, Ralph Grishman, and James Allan. 2007. Question answering using integrated information retrieval and information extraction. In HLT-NAACL, pages 532–539.

Vern R Walker, Nathaniel Carie, Courtney C DeWitt, and Eric Lesh. 2011. A framework for the extraction and modeling of fact-finding reasoning from legal decisions: lessons from the vaccine/injury project corpus. Artificial Intelligence and Law, 19(4):291–331.

Jia Wang, Qing Li, Yuanzhu Peter Chen, and Zhangxi Lin. 2010. Recommendation in internet forums and blogs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 257–265. Association for Computational Linguistics.

Jianxing Yu, Zheng-Jun Zha, Meng Wang, and Tat-Seng Chua. 2011. Aspect ranking: Identifying important product aspects from online consumer reviews. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1496–1505, Stroudsburg, PA, USA. Association for Computational Linguistics.

Scrutable Feature Sets for Stance Classification

Angrosh Mandya, Advaith Siddharthan and Adam Wyner
Computing Science, University of Aberdeen, UK
[email protected]  [email protected]  [email protected]

Abstract has to follow the reasoning quite closely to de- cide which of the posts in Table 1 argues for This paper describes and evaluates a or against abortion. novel feature set for stance classifica- In Table 1, the posts are monologic (in- tion of argumentative texts; i.e. de- dependent of each other), but even with the ciding whether a post by a user is availability of dialogic structure connecting for or against the issue being de- posts, both humans and classifiers experience bated. We model the debate both difficulties in stance classification (Anand et as attitude bearing features, including al., 2011), in part because posts that contain a set of automatically acquired ‘topic rebuttal arguments do not provide clear ev- terms’ associated with a Distributional idence that they are arguing for or against Lexical Model (DLM) that captures the main issue being debated. Stance classi- the writer’s attitude towards the topic fication is considered particularly challenging term, and as dependency features that however when the posts are monologic since represent the points being made in the the lack of dialogic structure means all features debate. The stance of the text towards for classification have to be extracted from the the issue being debated is then learnt text itself. Indeed studies to classify such in- in a supervised framework as a func- dependent posts have previously found it diffi- tion of these features. The main ad- cult to even beat a unigram classifier baseline; vantage of our feature set is that it is for example, Somasundaran and Wiebe (2010) scrutable: The reasons for a classifica- achieved only a 1.5% increase in accuracy from tion can be explained to a human user the use of more sophisticated features such as in natural language. We also report opinion and arguing expressions over a simple that our method outperforms previous unigram model. approaches to stance classification as In this paper, we propose a new feature set well as a range of baselines based on for stance classification of independent posts sentiment analysis and topic-sentiment that, unlike previous work, captures two key analysis. characteristics of such debates; namely, writ- ers express their attitudes towards a range of 1 Introduction topics associated with the issue being debated In recent years, stance classification for online and also argue by making logical points. We debates has received increasing research inter- model the debate using a combination of the est (Somasundaran and Wiebe, 2010; Anand following features. et al., 2011; Walker et al., 2012; Ranade et • topic-stance features – a set of automati- al., 2013; Sridhar et al., 2014). Given a post cally extracted ‘topic terms’ (for abortion rights, belonging to a two-sided debate on an issue these would include, for example, ‘fetus’, ‘baby’, (e.g. abortion rights; see Table 1), the task ‘woman’ and ‘life’), where each topic term is asso- ciated with a distributional lexical model (DLM) is classify the post as for or against the issue. that captures the writer’s stance towards that The argumentative nature of such posts makes topic. stance classification difficult; for example, one • stance bearing terminology – words related


tions in the MPQA corpus (Wilson and Wiebe, For abortion rights 2005), barely outperforming a unigram base- If women (not men) are solely burdened by line that achieved 62.5%. They also reported pregnancy, they must have a choice. Men performance using the sentiment lexicon alone are dominant in their ability to impregnate a woman, but carry no responsibilities after- of only 55.0% and made the point that senti- ward. If woman carry the entire burden of ment features alone were not useful for stance. pregnancy, they must have a choice. More recently, Hasan and Ng (2014) have Against abortion rights focused on identifying reasons for supporting Life is an individual right, not a privilege, for or opposing an issue under debate, using a unborn humans [...] The right to life does not corpus that provides information about post depend, and must not be contingent, on the pleasure of anyone else, not even a parent or sequence, and with manually annotated rea- sovereign [...] sons. The authors experiment with different features such as n-grams, dependency-based Table 1: Samples from posts arguing for and features, frame-semantic features, quotation against abortion rights features and positional features for stance clas- sification of reasons. Nguyen and Litman by adjectival modifiers (amod) and the noun com- (2015) proposed a feature reduction method pound (nn) relations that carry stance bearing based on the semi-supervised derivation of lex- language. ical signals of argumentative and domain con- • logical point features – features of the form tent. Specifically, the method involved post- subject-verb-object (SVO) extracted from the de- pendency parse that capture basic points being processing a topic-model to extract argument made. words (lexical signals of argumentative con- • unigrams and dependency features – back- tent) and domain words (terminologies in ar- off features, useful for classifying short posts lack- gument topics). ing other features. A larger number of studies have focused The contributions of this paper are two fold. on the use of dialogic structure for stance Using the features listed above, we learn the classification. Anand et al. (2011) worked stance of the debate towards the issue in a su- with debates that have rebuttal links between pervised setting, demonstrating better classi- posts. With respect to stance classification, fication performance than previous work. Sec- they achieved accuracies ranging from 54% to ond, we argue that our feature set lends it- 69% using such contextual features. Walker self to human scrutable stance classification, et al. (2012) focused on capturing the di- through features that are human readable. alogic structure between posts in terms of The paper is organised as follows. In §2, we agreement relations between speakers. They discuss related work on stance classification. showed that such a representation improves re- In §3, we describe our methods to model on- sults as against the use of contextual features line debates and in §4, we present and discuss alone, achieving accuracies ranging from 57% the results achieved in this study. In §5, we to 64%. Several others have modelled dialogic present our conclusions. structure in more sophisticated ways, report- ing further improvements from such strategies 2 Related work (Ranade et al., 2013; Sridhar et al., 2014, for Somasundaran and Wiebe (2010) developed a example). 
balanced corpus (with half the posts for and For the related task of opinion mining, the other half against) of political and ide- dependency parse based features have been ological debates and carried out experiments shown to be useful. Joshi and Penstein-Ros´e on stance classification pertaining to four de- (2009) transformed dependency triples into bates on abortion rights, creation, gay rights ‘composite backoff features’ to show that they and gun rights. They achieved an overall generalise better than regular dependency fea- accuracy of 63.9% using a sentiment lexicon tures. The composite backoff features replaces as well as an ngrams-based lexicon of argu- either head term or modifier term with its POS ing phrases derived from the manual annota- tag in a dependency relation to result in two

61 types of features for each relation. Greene and 3.1 Distributional Lexical Model of Resnik (2009) focused on ‘syntactic packaging’ Topic of ideas to identify implicit sentiment. The au- Dependency grammar allows us to identify thors proposed the concept of observable prox- syntactically related words in a sentence, ies for underlying semantics (OPUS) which in- by modelling the syntactic structure of a volves identifying a set of relevant terms using sentence using binary asymmetrical relations relative frequency ratio. These terms are used (De Marneffe and Manning, 2008). We use to identify all relations with these terms in the these relations to build a Distributional Lexi- dependency graph, which are further used to cal Model (DLM), excluding stop words such define the feature set. Paul and Girju (2010) as determiners and conjunctions to obtain a presented a two-stage approach to summarise set of content words connected to the topic multiple contrasting viewpoints in opinionated term through syntax. The DLM is constructed text. In the first stage they used the topic- in three steps: aspect model (TAM) for jointly modelling top- Step 1. identify topic terms ti in the sentence; ics and viewpoints in the text. Amongst other Step 2. for each ti, identify all content words wj in features such as bag-of-words, negation and a dependency relation with ti. polarity, the TAM model also used the com- Step 3. for each wj , identify all content words wk in a dependency relation with wj ; i.e., iden- posite backoff features proposed by (Joshi and tify words that are within two dependency Penstein-Ros´e,2009). relations of the topic term. In summary, many studies on stance clas- In order to derive the topic terms, we sification have focused on the use of dialogic used Mallet (McCallum, 2002), which im- structure between posts (Anand et al., 2011; plements topic modelling using Latent Dirich- Walker et al., 2012; Ranade et al., 2013; Srid- let Allocation (LDA) (Blei et al., 2003). Given har et al., 2014), but there has been less a set of documents, Mallet produces a set of work on exploring feature sets for monologic likely topics where each topic is a distribution posts, though a large body of such work ex- over the vocabulary of the document set such ists for the related task of opinion mining. that the higher probability words contribute We are unaware of any attention paid to the more towards defining the topic. We config- scrutability of classifiers, though users might ured Mallet to produce 10 set of likely top- well be interested in why a post has been clas- ics for the collection of posts for a given politi- sified in a certain manner. To address these cal debate, and used the default setting of the gaps, we consider again the task of stance top 19 words for each topic. As we required classification from monologic posts, using the our topic words to be nouns, we filtered the dataset created by Somasundaran and Wiebe 190 words by part of speech. After further (2010). We focus on modelling of the patterns removing repetitions of words in different top- within a post rather than connections between ics, this resulted in 96, 105, 135, 105 and 110 posts, and aim to design a competitive classi- distinct topic terms for the political debates fier whose decisions can be explained to a user. on abortion rights, creation, gay rights, god and gun rights, respectively. Examples of such 3 Methods topic terms created for the domain of abortion rights are shown in Table 2. 
As described earlier, the goals of this paper are For the sentence and dependency parse two fold: (1) to develop a classifier for stance shown in Fig. 1 (with punctuation and word classification; and (2) employ the results of positions removed for simplicity), there are classification to create human readable expla- three topic terms: ‘fetus’, ‘woman’ and ‘preg- nations of the reasons for classification. Ac- nancy’, and the 3-steps above generate the fol- cordingly, we focus on the following features lowing DLMs: which lend themselves to human readable ex- planation, as discussed later: (a) topic-based fetus: ‘causes’; ‘sickness’; ‘discomfort’; distributional lexical models; (b) stance bear- ‘pain’; ‘woman’ ing relations; (c) points represented as subject- woman: ‘causes’; ‘sickness’; ‘discomfort’; ‘pain’; ‘pregnancy’; ‘labor’ verb-object triplets. pregnancy: ‘causes’; ‘woman’; ‘labor’
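The three-step DLM construction can be written down compactly. The sketch below is an illustration only and rests on a few assumptions: dependency relations are taken as already available (head, dependent) word pairs such as those in Fig. 1, relations are treated as undirected, and the small stop-word list and function names are invented for the example. The authors' exact handling of relation direction may differ, so the output can deviate slightly from the listing above.

    STOPWORDS = {"the", "a", "an", "and", "to", "her", "his", "during", "of"}

    def build_dlm(dep_pairs, topic_terms):
        # Map each topic term to the content words within two dependency relations.
        adj = {}
        for head, dep in dep_pairs:
            if head in STOPWORDS or dep in STOPWORDS:
                continue                              # exclude non-content words
            adj.setdefault(head, set()).add(dep)
            adj.setdefault(dep, set()).add(head)
        dlm = {}
        for term in topic_terms:                      # Step 1: topic terms
            one_hop = set(adj.get(term, set()))       # Step 2: directly related words
            two_hop = set()
            for word in one_hop:                      # Step 3: one relation further
                two_hop |= adj.get(word, set())
            dlm[term] = (one_hop | two_hop) - {term}
        return dlm

    # (head, dependent) pairs for the sentence in Fig. 1, simplified:
    pairs = [("causes", "fetus"), ("causes", "sickness"), ("causes", "discomfort"),
             ("causes", "pain"), ("pain", "extreme"), ("causes", "woman"),
             ("woman", "pregnancy"), ("woman", "labor"), ("pregnancy", "labor")]
    print(build_dlm(pairs, ["fetus", "woman", "pregnancy"]))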

62 The fetus causes sickness discomfort and ex- shown in Fig. 3, where the labelled arc indi- treme pain to a woman during her pregnancy and labor. cates the sentence in which the relation ap- pears, and the direction of the arrow indicates det(fetus, the) whether the topic term precedes or follows the nsubj(causes, fetus) dobj(causes, sickness) related word. As seen, a topic word can be dobj(causes, discomfort) connected to different terms in the graph, e.g. conj and(sickness, discomfort) dobj(causes, and) pain and causes are connected to fetus in sen- conj and(sickness, and) tences 1 and 2. amod(pain, extreme) dobj(causes, pain) conj and(sickness, pain) 3.2 Stance-bearing terminology det(woman, a) prep to(causes, woman) We also consider words connected by adjecti- poss(pregnancy, her) prep during(woman, pregnancy) val modifier (amod) and noun compound mod- prep during(woman, labor) ifier (nn) relations from the dependency graph conj and(pregnancy, labor) as features for the classifier. Given the po- litical debate on abortion rights, phrases such Figure 1: Dependency Parse (simplified to re- as ‘individual rights’, ‘personal choices’, ‘per- move punctuation and word positions) sonal decision’ and ‘unwanted children’ are used primarily in posts arguing for abortion rights. Similarly, phrases such as ‘human life’, These features facilitate scrutability because ‘unborn child’, ‘innocent child’ and ‘distinct we can explain a classification of a post as DNA’ provide good indicators that the posts ‘for abortion rights’ with a sentence such as is arguing against abortion rights. In the ex- “This post is classified as being in favour of ample in Fig. 1, the feature ‘extreme-pain’ is abortion rights because it associates words such extracted in this manner. These features could as ‘causes’, ‘sickness’, ‘discomfort’, ‘pain’ and be used in an explanation in a sentence such ‘woman’ with the term ‘fetus’.” Note that as “This post is classified as being in favour of in practice only a few features will select for abortion rights because it contains subjective a particular stance, and this example (which phrases such as ‘extreme pain’.” uses all the word pairs) is just for illustration. The process of deriving the model for the 3.3 Modelling argumentative points topic term fetus from a dependency tree is graphically shown in Fig. 2. As seen, the word We also extract features aimed at modelling causes (shown in thin dotted lines) is identi- elementary points made in a debate. We do fied in Step 2 and the other words discomfort, this in a limited manner by defining a point sickness, pain and woman are obtained in Step simply as a subject-verb-object triple from the 3 (shown in thick dotted lines). Non-content dependency parse. More sophisticated defini- words are excluded from the model. tions would not necessarily result in useful fea- This method is aimed at identifying stance tures for classification. For the sentence in Fig. bearing words associated with topic terms in 1, the following points are extracted to be used argumentative posts. The resulting graph for as features: the post arguing for abortion in Table 4 is fetus-causes-sickness fetus-causes-discomfort Abortion Topic Terms fetus-causes-and life; human; conception; embryo; choice; sex; fetus-causes-pain vote; position; birth; rape; war; church; act; Non-content words are excluded from the evil; fetus; person; body; womb; brain; baby; analysis. 
This analysis could be used to con- sperm; egg; cell; logic; people; argument; god; reason; law; woman; pregnancy; children; fam- struct explanations such as “This post is clas- ily; abortion; murder; sified as being in favour of abortion rights be- cause it makes points such as ‘fetus causes Table 2: Examples of topic terms produced by sickness’, ‘fetus causes discomfort’ and ‘fetus Mallet for the domain of abortion rights causes pain’.”
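To make the two remaining feature extractors (the amod/nn stance-bearing pairs of Section 3.2 and the subject-verb-object points of Section 3.3) concrete, the sketch below operates on labelled dependency triples of the form (relation, head, dependent), as in Fig. 1. The triple format, function names and example calls are illustrative assumptions rather than the authors' code.

    def stance_bearing_pairs(triples):
        # Section 3.2: word pairs connected by amod or nn, e.g. ('extreme', 'pain').
        return [(dep, head) for rel, head, dep in triples if rel in ("amod", "nn")]

    def svo_points(triples):
        # Section 3.3: subject-verb-object 'points' such as fetus-causes-pain.
        subjects = {head: dep for rel, head, dep in triples if rel == "nsubj"}
        objects = {}
        for rel, head, dep in triples:
            if rel == "dobj":
                objects.setdefault(head, []).append(dep)
        return [subjects[verb] + "-" + verb + "-" + obj
                for verb, objs in objects.items() if verb in subjects
                for obj in objs]

    triples = [("det", "fetus", "the"), ("nsubj", "causes", "fetus"),
               ("dobj", "causes", "sickness"), ("dobj", "causes", "discomfort"),
               ("amod", "pain", "extreme"), ("dobj", "causes", "pain"),
               ("prep_to", "causes", "woman")]
    print(stance_bearing_pairs(triples))  # [('extreme', 'pain')]
    print(svo_points(triples))            # fetus-causes-sickness, -discomfort, -pain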

Figure 2: Deriving related words for ‘fetus’ from the dependency graph.

(a) Post arguing for abortion rights (b) Post arguing against abortion rights

Figure 3: DLM models for the two posts in Table 4

3.4 Baselines a positive sentiment is expressed in sentences In addition to the features proposed above, we arguing against abortion. Our second baseline experimented with a variety of baselines for is to therefore model the stance of a post us- comparison. ing features that indicate the sentiment of the writer towards key topics related to the issue 3.4.1 Sentiment model being debated. Our first baseline involved treating stance For example, let us consider the two sen- (‘for’ or ‘against’) as sentiment (‘positive’ or tences that argue for abortion in Table 3. ‘negative’). For this purpose, we used the Using topic modelling, we can identify topic Stanford sentiment tool 1 (Socher et al., 2013) terms such as ‘fetus’ and ‘woman’ in sen- to obtain sentence-level sentiment labels and tence 1. Further, using sentiment analysis the provide these as features for stance classifica- sentence can be identified to be negative. By tion of posts. tagging this sentiment to the topic terms con- tained in the sentence, we can associate a neg- 3.4.2 Topic-sentiment model ative sentiment with topic terms ‘fetus’ and However, we do not expect a direct equiva- ‘human’. Similarly for sentence 2, a negative lence between sentiment and stance; for ex- sentiment can be associated with topic terms ample, in Table 3, a negative sentiment is ex- such as ‘fetus’, ‘woman’ and ‘pregnancy’. pressed in sentences arguing for abortion and This model has the advantage over the sen- 1http://nlp.stanford.edu:8080/sentiment/ timent analysis baseline that sentiment is asso-

64 Sentences arguing for abortion rights For abortion rights 1. A fetus is no more a human than an acorn is The fetus causes physical pain; the woman has a tree. a right to self-defense. The fetus causes sick- 2. The fetus causes sickness, discomfort, and and ness, discomfort, and extreme pain to a woman extreme pain to a woman during her preg- during her pregnancy and labor. It is, there- nancy and labor. fore, justifiable for a woman to pursue an abor- tion in self-defense. Sentences arguing against abortion rights 3. A fetus is uniquely capable of becoming a Against abortion rights person; deserves rights, it is unquestionable Human life and a right to life begin at concep- that the fetus, at whatever stage of develop- tion; abortion is murder. Human life is con- ment, will inevitably develop the traits of a tinuum of growth that starts at conception, full-grown human person. not at birth. The person, therefore, begins at 4. This is why extending a right to life is of ut- conception. Killing the fetus, thus, destroys a most importance; the future of the unborn de- growing person and can be considered murder. pends on it. Table 4: Example posts for and against abor- Table 3: Example sentences arguing for and tion rights against abortion rights ciated with topic terms such as ‘fetus’, rather than the wider issue (abortion) being debated; here, a negative sentiment expressed towards a fetus is not a negative sentiment expressed towards abortion. Applying the topic-sentiment model to the sentences in Table 3 arguing against abortion, we can associate a positive sentiment for topic terms such as ‘fetus’, ‘person’, ‘stage’, ‘devel- opment’ and ‘human’ in sentence 3, and for topic terms ‘life’ and ‘unborn’ in sentence 4. We used Mallet as described in §3.1 to derive the topic terms. For an example of a topic- sentiment model, see Fig. 4, which shows the model obtained for the posts in Table 4.
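The topic-sentiment baseline can be sketched as follows. In the snippet, sentence_sentiment() is only a placeholder for the sentence-level label obtained from the Stanford sentiment tool, and the function names and whitespace tokenisation are illustrative assumptions.

    def sentence_sentiment(sentence):
        # Placeholder: in the paper this label comes from the Stanford sentiment
        # tool (Socher et al., 2013); here every sentence is simply called negative.
        return "negative"

    def topic_sentiment_features(post_sentences, topic_terms):
        # Attach the sentence-level sentiment to each topic term the sentence mentions.
        features = []
        for sentence in post_sentences:
            label = sentence_sentiment(sentence)
            tokens = set(sentence.lower().split())
            for term in topic_terms:
                if term in tokens:
                    features.append(term + "=" + label)
        return features

    post = ["The fetus causes sickness, discomfort, and extreme pain to a woman "
            "during her pregnancy and labor."]
    print(topic_sentiment_features(post, ["fetus", "woman", "pregnancy"]))
    # ['fetus=negative', 'woman=negative', 'pregnancy=negative']

The resulting term=sentiment pairs are what the classifier sees for baseline B2, in contrast to the bare sentence-level sentiment labels used by baseline B1.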

3.4.3 Unigram model Figure 4: Topic-sentiment model for the two We used a third baseline feature set containing posts in Table 4 all unigrams. The more realistic assumption here (compared to equating stance with senti- the dependency graph. We also evaluated this ment) is that writers use different vocabularies against a baseline feature set which makes use to argue for or against an issue, and there- of all word-pairs obtained from the depen- fore a model can be learnt that predicts the dency graph. likelihood of a class based solely on the words 4 Evaluation used in the post. As mentioned earlier, previ- ous studies have struggled to outperform such We used the dataset created by Somasun- a unigram model (Somasundaran and Wiebe, daran and Wiebe (2010) containing monologic 2010). posts about five issues: abortion, creation, gay rights, god and gun rights. Somasundaran and 3.4.4 Full dependency model Wiebe (2010) reported results on a balanced Our proposed feature set for stance classi- subset of the corpus with equal numbers of fication using a distributional lexical model, posts for and against each issue. We adopted stance bearing terminology and points was de- the same methodology as them to create a bal- signed to be scrutable, but therefore made use anced subset and evaluated on our balanced of only a subset of word-pair features from dataset containing 4870 posts in total, with

Feature set           Abortion  Creation  Gay     Existence  Gun     Average
                      rights*             rights  of God     rights
Baselines
  B1                  51.46     52.56     50.67   53.52      49.48   51.53
  B2                  57.72     56.64     60.89   61.23      61.77   59.65
  B3                  77.74     77.50     76.52   76.33      83.95   78.40
  B4                  87.10     86.13     86.94   85.12      87.71   86.60
Topic DLM
  D1                  73.37     67.48     77.87   66.34      70.98   71.20
SVO, amod and nn
  S1                  70.45     72.14     72.25   67.20      72.18   70.84
Combined Models
  C1 (D1+S1)          77.35     76.22     78.48   78.06      76.10   77.24
  C2 (D1+S1+B3)       84.06     82.86     82.81   83.38      88.39   84.30
  C3 (D1+S1+B3+B4)    89.40     87.99     90.18   88.05      93.51   89.26

*Development set

Table 5: Results of supervised learning experiments using Naive Bayes Multinomial model

1030, 856, 1478, 920 and 586 posts for do- B4 Dependency features composed of all word mains of abortion rights, creation, gay rights, pairs connected by a dependency relation. god and gun rights, respectively. We devel- 2. Distributional Lexical Models (DLM): oped our ideas by manual examination of the D1 Topic based features resulting from DLM discussed in §3.1. abortion rights debate, leaving the other four debates unseen. We report results for both the 3. SVO, amod and nn relations based model: S1 The subject-verb-object (SVO) triplets, also development set and the four unseen test sets. broken up into SV and VO pairs, and the word pairs obtained from the amod and nn 4.1 Classifier and Evaluation Metric relations in the dependency parse. We conducted experiments using Multinomial 4. Combined Models: Naive Bayes classifier implemented in the C1 D1+S1 - combining topic based features with SVO triplets and word-pairs from Weka toolkit (Hall et al., 2009). The Multi- amod and nn relations. nomial Naive Bayes model has been previ- C2 D1+S1+B3 - combining topic based fea- ously shown to perform better on text classi- tures with SVO, word-pairs from amod and nn relations, and unigrams. fication tasks with increasing vocabulary size, C3 D1+S1+B3+B4 - combining topic based taking into account word frequencies (McCal- features with SVO, word-pairs from amod lum et al., 1998), and this was also our ex- and nn relations, unigrams and dependency features. perience. For feature sets produced by each model described in the Methods section, we 4.3 Results and analysis used the FilteredAttributeEval method avail- Performance of various models: The 10- able in Weka for feature selection, retaining all fold cross validation results for Multinomial features with a score greater than zero. Fea- Naive Bayes for different models are reported ture counts were normalised by tf*idf. The in Table 5. performance of the classifier is reported using As seen in Table 5, the baseline B4 using the accuracy metric, which is most appropri- all relations from the dependency parse per- ate for a balanced dataset. forms significantly better compared to other 4.2 Compared Models models that focus on selecting specific features for stance classification. The features that § Our discussion in 3 results in the following we introduce (D1 and S1) become competi- different models for stance classification. We tive only when combined with one or more present in the next section, the results of our baseline models. C3, the best performing experiments. model, combines the unigram and dependency 1. Baseline Models: baselines with topic DLMs, SVO points, and B1 Sentence level sentiment features. B2 Topic-sentiment features. stance bearing amod and nn relations, and B3 Unigram features. outperforms previously published approaches

66 to stance classification described in §2 by a beat. We additionally find that dependency substantial margin. features B4 provide an even stronger baseline. With respect to scrutability, the features in C1, as described earlier in this paper, are eas- Comparison with other systems: Our ily explained in natural language. C2, the first experiments are directly comparable to Soma- competitive system, extends C1 with unigram sundaran and Wiebe (2010) as we report re- features. These can be easily included in an sults on the same dataset. Our best scor- explanation; for example, “This post is clas- ing system achieves an overall accuracy of sified as being in favour of abortion rights be- 89.26%, in comparison to their overall accu- cause it contains words such as ‘extreme’ and racy of 63.63%, a statistically significant in- p < . ‘pain’.”. C3, which is the best performing clas- crease ( 0 0001; z-test for difference in pro- sifier, also uses arbitrary dependency features portions). Further, our system performs bet- that are harder to use in explanations. How- ter for each of the debate issues investigated. ever, even when using C3, the classification While not directly comparable, our results decision for the vast majority of posts can be also compare well to studies in dialogic stance explained using features from C2. Table 6 ex- classification. For example, Anand et al. plores the coverage of different features in the (2011) achieved a maximum of 69% accuracy dataset, following feature selection. using contextual features based on LIWC, and Walker et al. (2012) obtained a highest of 64% Feature set Coverage using information related to agreement rela- Baselines tions between speakers. Ranade et al. (2013) B1 100.00% B2 32.90% achieved 70.3% by focusing on capturing users’ B3 74.41% intent and Sentiwordnet scores. More recently, B4 75.10% Hasan and Ng (2014) achieved an overall ac- Topic DLM D1 37.64% curacy of 66.25% for four domains including SVO, amod and nn abortion and gay rights, using features based S1 40.56% on dependency parse, frame-semantics, quota- Combined Models C1 (D1+S1) 54.40% tions and position information. Their accu- C2 (D1+S1+B3) 80.45% racy for abortion and gay rights was 66.3% C3 (D1+S1+B3+B4) 86.58% and 65.7%, respectively. Our approach, unlike these, focuses on a finegrained modelling of the Table 6: Percentage of posts containing at lexical context of important topic terms, and least one feature for each feature set (following on dependency relations that relate to points feature selection) and stance bearing phrases. Our results show that this is indeed beneficial. Poverty of sentiment-based models: While we expected our baseline model B1 that 4.4 Human readable explanations uses an off the shelf sentiment classifier to per- While previous work in stance classification form poorly on this task (see example in §3.4.2 has primarily focused on the classifier, this is for reasoning), we were slightly surprised by a topic where scrutability is of interest. A user the poor performance of the topic-sentiment might want to know why a post has been clas- models (B2). Clearly there is more to stance sified in a certain way, and a good response classification than sentiment, and more effort can build trust in the system. The features into modelling the range of lexical associations we have introduced in this paper lend them- with topic terms pays off for the distributional selves to the generation of explanations. Table lexical models. 
The unigram model (B3) per- 7 shows some example posts (selected to be formed better than the topic-sentiment models short due to space constraints), the features (B2) and the off-the-shelf sentiment analysis (after feature selection) present in the posts, tool (B1). This supports the results of So- and the generated explanations. The points masundaran and Wiebe (2010), who similarly are generated from the SVO by including all found that sentiment features did not prove premodifiers of the subject, verb and object helpful, while unigram features were hard to in the sentence. The explanation sentence is

67 Post: Abortion is the woman’s choice, not the father’s The Father should be told that the woman is having an abortion but until he carries and gives birth to his own baby then it is not his choice to tell the woman that she has to keep and give a painful birth to this fetus. Points derived using SVO information: [‘Abortion is the woman choice’; ‘it is not his choice’]; Unigrams: [’fetus’; ’woman’; ’choice’]; No other features present Classified Stance: For Abortion rights Explanation: This post has been classified as being in favour of abortion rights because it makes points such as ‘abortion is the woman’s choice’ and ‘it is not his choice’, and uses vocabulary such as ‘fetus’, ‘woman’ and ‘choice’. Post: A dog is not a person. Therefore, it does not have rights. Positive feelings about dogs should have no bearing on the discussion. A fetus is not a person. Negative feelings about the metaphysically independent status of women should have no bearing on the discussion. Points derived using SVO information: [‘a fetus is not a person’; ‘a dog is not a person’]; Unigrams: [‘fetus’; ‘independent’; ‘bearing’]; No other features present Classified Stance: For Abortion rights Explanation: This post has been classified as being in favour of abortion rights because it makes points such as ‘a fetus is not a person’ and ‘a dog is not a person’, and uses vocabulary such as ‘fetus’, ‘independent’, and ‘bearing’.

Post: God exists in the unborn as in the born Unigrams: [‘unborn’]; No other features present Classified Stance: Against Abortion rights Explanation: This post has been classified as being against abortion rights because it uses vocabulary such as ‘unborn’. Post: Any abortions should not be aloud if you are stupid enough to get pregnant when you do not want a baby or selfish enough not to want to look after it when you find out it may have an illness then it is your own fault why should the life of an innocent unborn child be killed because of your mistake amod features: [‘unborn child’; ‘innocent child’]; Unigrams: [‘baby’; ‘unborn’; ‘killed’]; No other features present Classified Stance: Against Abortion rights Explanation: This post has been classified as being against abortion rights because it uses vocabulary such as ‘baby’, ‘unborn’ and ‘killed’ and subjective phrases such as ‘unborn child’ and ‘innocent child’.

Table 7: Examples of explanations generated for stance classification based on a very simple template that takes as cation is a staging post to more in-depth ar- input a list for each feature type, and popu- gumentation mining. Our ultimate goal is to lates slots based on which features are present. model a richer argumentative framework in- cluding the support and rebuttal of claims, 5 Conclusions and future work and the changing of opinion by users in on- In this paper, we presented a new feature line debates. set for stance classification in online debates, designed to be scrutable by human users as Acknowledgements well as capable of achieving high accuracy This work was supported by the Economic of classification. We showed that our pro- and Social Research Council [Grant number posed model significantly outperforms other ES/M001628/1]. approaches based on sentiment analysis and topic-sentiment analysis. We believe our mod- els capture some of the subtleties of argumen- References tation in text, by breaking down the stance to- Pranav Anand, Marilyn Walker, Rob Abbott, Jean wards the debated issue into expressed stances E Fox Tree, Robeson Bowmani, and Michael Mi- towards a variety of related topics, as well as nor. 2011. Cats rule and dogs drool!: Classify- modelling, albeit in a simple way, the notion of ing stance in online debate. In Proceedings of a point. However, this is just a first step; we do the 2nd workshop on computational approaches to subjectivity and sentiment analysis, pages 1– not yet model the sequence of points or topic- 9. Association for Computational Linguistics. stance changes in the post, or dialogic struc- ture connecting posts. Finally, stance classifi- David M Blei, Andrew Y Ng, and Michael I Jordan.

2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Marie-Catherine De Marneffe and Christopher D Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8. Association for Computational Linguistics.

Stephan Greene and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 503–511. Association for Computational Linguistics.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762, Doha, Qatar, October. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 116–124. Association for Computational Linguistics.

Dhanya Sridhar, Lise Getoor, and Marilyn Walker. 2014. Collective stance classification of posts in online debate forums. In Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media, pages 109–117. Association for Computational Linguistics.

Marilyn A Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 592–596. Association for Computational Linguistics.

Mahesh Joshi and Carolyn Penstein-Rosé. 2009. Generalizing dependency features for opinion mining. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 313–316. Association for Computational Linguistics.

Theresa Wilson and Janyce Wiebe. 2005. Annotating attributions and private states. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 53–60. Association for Computational Linguistics.

Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

Huy V Nguyen and Diane J Litman. 2015. Extracting argument and domain words for identifying argument components in texts. NAACL HLT 2015, page 22.

Michael Paul and Roxana Girju. 2010. A two-dimensional topic-aspect model for discovering multi-faceted topics. Urbana, 51:61801.

Sarvesh Ranade, Rajeev Sangal, and Radhika Mamidi. 2013. Stance classification in online debates by recognizing users' intentions. In Proceedings of the SIGDIAL 2013 Conference, pages 61–69. Association for Computational Linguistics.

Argumentation: Content, Structure, and Relationship with Essay Quality

Beata Beigman Klebanov1, Christian Stab2, Jill Burstein1, Yi Song1, Binod Gyawali1, Iryna Gurevych2,3
1 Educational Testing Service, 660 Rosedale Rd, Princeton NJ, 08541, USA
{bbeigmanklebanov,jburstein,ysong,bgyawali}@ets.org
2 Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
3 Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research
www.ukp.tu-darmstadt.de

Abstract is it in fact a good argument? When choosing a provider for trash collection, how relevant is the In this paper, we investigate the rela- color of the trucks? In contrast, in (3) the argu- tionship between argumentation structures mentative structure is not very explicit, yet the ar- and (a) argument content, and (b) the gument itself, if the reader is willing to engage, is holistic quality of an argumentative essay. actually more pertinent to the case, content-wise. Our results suggest that structure-based Example (4) has both the structure and the content. approaches hold promise for automated evaluation of argumentative writing. (1)“ The mayor is stupid. People should not have voted for him. His policy will fail. The 1 Introduction new provider uses ugly trucks.” With the advent of the Common Core Stan- (2)“ The mayor’s policy of switching to a new dards for Education,1 argumentation, and, more trash collector service is flawed because he specifically, argumentative writing, is receiving failed to consider the ugly color of the trucks increased attention, along with a demand for used by the new provider.” argumentation-aware Automated Writing Evalu- ation (AWE) systems. However, current AWE (3)“ The mayor is stupid. The switch is a bad systems typically do not consider argumentation policy. The new collector uses old and (Lim and Kahng, 2012), and employ features polluting trucks.” that address grammar, mechanics, discourse struc- ture, syntactic and lexical richness (Burstein et al., (4)“ The mayor’s policy of switching to a new 2013). Developments in Computational Argumen- trash collector service is flawed because he tation (CA) could bridge this gap. failed to consider the negative environmental Recently, progress has been made towards a effect of the old and air-polluting trucks used more detailed understanding of argumentation in by the new provider.” essays (Song et al., 2014; Stab and Gurevych, 2014; Persing and Ng, 2015; Ong et al., 2014). Song et al. (2014) took the content approach, An important distinction emerging from the rele- annotating essays for arguments that are pertinent vant work is that between argumentative structure to the argumentation scheme (Walton et al., 2008; and argumentative content. Facility with the argu- Walton, 1996) presented in the prompt. Thus, a mentation structure underlies the contrast between critique raising undesirable side effects (examples (1) and (2) below: In (1), claims are made without 3 and 4) is appropriate for a prompt where a policy support; relationships between claims are not ex- is proposed, while the critique in (1) and (2) is not. plicit; there is intervening irrelevant material. In The authors show, using the annotations, that rais- (2), the argumentative structure is clear – there is a ing pertinent critiques correlates with holistic es- critical claim supported by a specific reason. Yet, say scores. They build a content-heavy automated model; the model, however, does not generalize 1www.corestandards.org

70 Proceedings of the 3rd Workshop on Argument Mining, pages 70–75, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics well across prompts, since different prompts use scores. They report that rule-based approaches for different argumentation schemes and contexts. identifying argument components can be effective We take the structure-based approach that is in- for ranking but not rating. However, they used dependent of particular content and thus has better a very small data set. In contrast, we study the generalization potential. We study its relationship relationship between content-based and structure- with the content-based approach and with overall based approaches and investigate whether argu- essay quality. Our contributions are the answers to mentation structures correlate with holistic quality the following research questions: of essays using a large public data set. In the literature on the development of argu- 1. whether the use of good argumentation struc- mentation skills, an emphasis is made on both the ture correlates with essay quality; structure, namely, the need to support one’s po- 2. while structure and content are conceptually sition with reasons and evidence (Ferretti et al., distinct, they might in reality go together. We 2000), and on the content, namely, on evaluating therefore evaluate the ability of the structure- the effectiveness of arguments. For example, in a based system to deal with content-based an- study by Goldstein et al. (2009), middle-schoolers notations of argumentation. compared more and less effective rebuttals to the same original argument. 2 Related Work 3 Argumentation Structure Parser Existing work in CA focuses on argumentation mining in various genres. Moens et al. (2007) For identifying argumentation structures in essays, identify argumentative sentences in newspapers, we employ the system by Stab and Gurevych parliamentary records, court reports and online (2016) as an off-the-shelf argument structure discussions. Mochales-Palau and Moens (2009) parser. The parser performs the following steps: identify argumentation structures including claims Segmentation: Separates argumentative from and premises in court cases. Other approaches fo- non-argumentative text units; identifies the boun- cus on online comments and recognize argument daries of argument components at token-level. components (Habernal and Gurevych, 2015), jus- Classification: Classifies each argument compo- tifications (Biran and Rambow, 2011) or differ- nent as Claim, Major Claim or Premise. ent types of claims (Kwon et al., 2007). Work in Linking: Identifies links between argument com- the context of the IBM Debater project deals with ponents by classifying ordered pairs of compo- identifying claims and evidence in Wikipedia arti- nents in the same paragraph as either linked or not. cles (Rinott et al., 2015; Aharoni et al., 2014). Tree generation: Finds tree structures (or forests) Peldszus and Stede (2015) identify argumenta- in each paragraph which optimize the results of the tion structures in microtexts (similar to essays). the previous analysis steps. They rely on several base classifiers and mini- Stance recognition: Classifies each argument mum spanning trees to recognize argumentative component as either for or against in order to dis- tree structures. 
Stab and Gurevych (2016) ex- criminate between supporting or opposing argu- tract argument structures from essays by recog- ment components and argumentative support and nizing argument components and jointly model- attack relations respectively. ing their types and relations between them. Both 4 Experiment 1: Content vs Structure approaches focus on the structure and neglect the content of arguments. Persing and Ng (2015) an- 4.1 Data notate argument strength, which is related to con- We use data from Song et al. (2014) – essays writ- tent, yet what it is that makes an argument strong ten for a college-level exam requiring test-takers has not been made explicit in the rubric and the an- to criticize an argument presented in the prompt. notations are essay-level. Song et al. (2014) follow Each sentence in each essay is classified as generic the content-based approach, annotating essay sen- (does not raise a critical question appropriate for tences for raising topic-specific critical questions the argument in the prompt) or non-generic (raises (Walton et al., 2008). an apt critical question); about 40% of sentences Ong et al. (2014) report on correlations between are non-generic. Data sizes are shown in Table 1. argument component types and holistic essay

Prompt   Train: #Essays   Train: #Sentences   Train: Non-generic   Test: #Essays   Test: #Sentences
A        260              4,431               42%                  40              758
B        260              4,976               41%                  40              758

Table 1: Data description for Experiment 1.

4.2 Selection of Structural Elements

We use the training data to gain a better understanding of the relationship between structural and content aspects of argumentation. Each selection is evaluated using kappa against the Song et al. (2014) generic vs non-generic annotation.

Our first hypothesis is that any sentence where the parser detected an argument component (any claim or premise) could contain argument-relevant (non-generic) content. This approach yields kappa of 0.24 (prompt A) and 0.23 (prompt B).

We observed that the linking step in the parser's output identified many cases of singleton claims – namely, claims not supported by an elaboration. For example, "The county is following wrong assumptions in the attempt to improve safety" is an isolated claim. This sentence is classified as "generic", since no specific scheme-related critique is being raised. Removing unsupported claims yields kappas of 0.28 (A) and 0.26 (B).

Next, we observed that even sentences that contain claims that are supported are often treated as "generic". Test-takers often precede a specific critique with one or more claims that set the stage for the main critique. For example, in the following 3-sentence sequence, only the last is marked as raising a critical question: "If this claim is valid we would need to know the numbers. The whole argument in contingent on the reported accidents. Less reported accidents does not mean less accidents." The parser classified these as Major Claim, Claim, and Premise, respectively. Our next hypothesis is that it is the premises, rather than the claims, that are likely to contain specific argumentative content. We predict that only sentences containing a premise would be "non-generic." This yields a substantial improvement in agreement, reaching kappas of 0.34 (A) and 0.33 (B).

Looking at the overall pattern of structure-based vs content-based predictions, we note that the structure-based prediction over-generates: the ratio of false-positives to false-negatives is 2.9 (A) and 3.1 (B). That is, argumentative structure without argumentative content is about 3 times more common than the reverse. False positives include sentences that are too general ("Numbers are needed to compare the history of the roads") as well as sentences that have an argumentative form, but fail to make a germane argument ("If accidents are happening near a known bar, drivers might be under the influence of alcohol").

Out of all the false-negatives, 30% were cases where the argument parser predicted no argumentative structures at all (no claims of any type and no premises). Such sentences might not have a clear argumentative form but are understood as making a critique in the context. For example, "What was it 3 or 4 years ago?" and "Has the population gone up or down?" look like fact-seeking questions in terms of structure, but are interpreted in the context as questioning the causal mechanism presented in the prompt. Overall, in 9% of all non-generic sentences the argument parser detected no claims or premises.

4.3 Evaluation

Table 2 shows the evaluation of the structure-based predictions (classifying all sentences with a Premise as non-generic) on test data, in comparison with the published results of Song et al. (2014), who used content-heavy features (such as word ngrams in the current, preceding, and subsequent sentence). The results clearly show that while the structure-based prediction is inferior to the content-based one when the test data are essays responding to the same prompt as the training data, the off-the-shelf structure-based prediction is on par with content-based prediction on the cross-prompt evaluation. Thus, when the content is expected to shift, falling back to structure-based prediction is potentially a reasonable strategy.

System                       Train   Test   κ
Song et al. (2014)           A       A      .410
Song et al. (2014)           B       B      .478
Song et al. (2014)           A       B      .285
Song et al. (2014)           B       A      .217
Structure-based (Premises)   –       A      .265
Structure-based (Premises)   –       B      .247

Table 2: Evaluation of content-based (Song et al., 2014) and structure-based prediction on content-based annotations.
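The chance-corrected agreement used throughout this section can be reproduced with standard tooling; the snippet below is a minimal sketch (scikit-learn, with made-up toy labels) of scoring the rule "a sentence is non-generic iff the parser found a Premise in it" against the content-based gold annotation.

```python
# Minimal sketch: kappa between the structure-based rule and the
# content-based gold labels; the label lists here are toy placeholders.
from sklearn.metrics import cohen_kappa_score

# Gold content-based labels per sentence: 1 = non-generic, 0 = generic.
gold = [1, 0, 0, 1, 1, 0, 0, 1]

# Parser output per sentence: does the sentence contain a Premise component?
contains_premise = [True, False, True, True, False, False, False, True]

# Structure-based prediction: non-generic iff a Premise was detected.
pred = [int(p) for p in contains_premise]

print("kappa:", round(cohen_kappa_score(gold, pred), 3))
```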

5 Experiment 2: Argumentation Structure and Essay Quality

Using argumentation structure and putting forward a germane argument are distinct, not only theoretically, but also empirically, as suggested by the results of Experiment 1. In this section, we evaluate to what extent the use of argumentation structures correlates with overall essay quality.

5.1 Data

We use a publicly available set of essays written for the TOEFL test in an argue-for-an-opinion-on-an-issue genre (Blanchard et al., 2013). Although this data was originally used for native language identification experiments, coarse-grained holistic scores (3-grade scale) are provided as part of the LDC distribution. Essays were written by non-native speakers of English; we believe this increases the likelihood that fluency with argumentation structures would be predictive of the score. We sampled 6,074 essays for training and 2,023 for testing, both across 8 different prompts. In terms of the distribution of holistic scores in the training data, 54.5% received the middle score, 11% the low score, and 34.5% the high score.

5.2 Features for essay scoring

Our set of features has the following essay-level aggregates: the numbers of any argument components, major claims, claims, premises, supporting and attacking premises, arguments against, arguments for, and the average number of premises per claim. Using the training data, we found that 90% Winsorization followed by a log transformation improved the correlation with scores for all features. The correlations range from 0.08 (major claims) to 0.39 (argument components).

5.3 Evaluation

To evaluate whether the use of argumentation structures correlates with holistic scores, we estimated a linear regression model using the nine argument features on the training data and evaluated on the test data. We use Cohen's kappa, as well as Pearson's correlation and quadratically weighted kappa, the latter two being standard measures in the essay scoring literature (Shermis and Burstein, 2013). Row "Arg" in Table 3 shows the results; argument structures have a moderate positive relationship with holistic scores.

More extensive use of argumentation structures is thus correlated with overall quality of an argumentative essay. However, argumentative fluency specifically is difficult to disentangle from fluency in language production in general, manifested through the sheer length of the essay. In a timed test, a more fluent writer will be able to write more. To examine whether fluency in argumentation structures can explain additional variance in scores beyond that explained by general fluency (as approximated through the number of words in an essay), we estimated a length-only linear regression model as well as a model that uses all 9 argument structure features in addition to length. As shown in Table 3, the addition of argumentation structures yields a small improvement across all measures over the length-only model.

Model       κ      r      qwk
Arg         .195   .389   .344
Len         .365   .605   .518
Arg + Len   .389   .614   .540

Table 3: Prediction of holistic scores using argument structure features (Arg), length (Len), and argument structure features and length (Arg+Len). "qwk" stands for quadratically weighted kappa.
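The feature preparation and evaluation protocol of Sections 5.2–5.3 can be sketched as follows. This is an illustrative reconstruction under stated assumptions (in particular, "90% Winsorization" is read here as clipping the top 10% of each feature, and the data are random stand-ins), not the authors' actual code.

```python
# Sketch of Sections 5.2-5.3: Winsorize + log-transform essay-level counts,
# fit a linear regression, and score with Pearson r and quadratic kappa.
import numpy as np
from scipy.stats import pearsonr
from scipy.stats.mstats import winsorize
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_train, n_test, n_features = 6074, 2023, 9      # sizes reported in the paper
X_train = rng.poisson(5, size=(n_train, n_features)).astype(float)
X_test = rng.poisson(5, size=(n_test, n_features)).astype(float)
y_train = rng.integers(1, 4, size=n_train)       # 3-grade holistic scores
y_test = rng.integers(1, 4, size=n_test)

def transform(X):
    # Assumed reading of "90% Winsorization": clip the upper 10% of each
    # feature, then apply log(1 + x) to reduce skew. (For brevity the clipping
    # is refit per set; a faithful pipeline would reuse training thresholds.)
    cols = [np.asarray(winsorize(col, limits=(0.0, 0.1))) for col in X.T]
    return np.log1p(np.column_stack(cols))

model = LinearRegression().fit(transform(X_train), y_train)
pred = model.predict(transform(X_test))
pred_rounded = np.clip(np.rint(pred), 1, 3).astype(int)

print("Pearson r:", round(pearsonr(pred, y_test)[0], 3))
print("QWK:", round(cohen_kappa_score(y_test, pred_rounded, weights="quadratic"), 3))
```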
6 Conclusion & Future Work

In this paper, we set out to investigate the relationship between argumentation structures, argument content, and the quality of the essay. Our experiments suggest that (a) more extensive use of argumentation structures is predictive of better quality of argumentative writing, beyond overall fluency in language production; and (b) structure-based detection of argumentation is a possible fallback strategy to approximate argumentative content if an automated argument detection system is to generalize to new prompts. The two findings together suggest that the structure-based approach is a promising avenue for research in argumentation-aware automated writing evaluation.

In future work, we intend to improve the structure-based approach by identifying characteristics of argument components that are too general and so cannot be taken as evidence of germane, case-specific argumentation on the student's part (claims like "More information is needed"), as well as study properties of seemingly non-argumentative sentences that nevertheless have a potential for argumentative use in context (such as asking fact-seeking questions). We believe this would allow pushing the envelope of structure-based analysis towards identification of arguments that have a higher likelihood of being effective.

Acknowledgments

The work of Christian Stab and Iryna Gurevych has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806 and by the German Federal Ministry of Education and Research (BMBF) as a part of the Software Campus project AWS under grant No. 01 S12054.

References

Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund, and Noam Slonim. 2014. A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68, Baltimore, Maryland.
Or Biran and Owen Rambow. 2011. Identifying justifications in written dialogs by classifying text as argumentative. International Journal of Semantic Computing, 5(4):363–381.
Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15.
Jill Burstein, Joel Tetreault, and Nitin Madnani. 2013. The e-rater automated essay scoring system. In M. Shermis and J. Burstein, editors, Handbook of Automated Essay Evaluation: Current Applications and Future Directions. New York: Routledge.
Ralph Ferretti, Charles MacArthur, and Nancy Dowdy. 2000. The effects of an elaborated goal on the persuasive writing of students with learning disabilities and their normally achieving peers. Journal of Educational Psychology, 93:694–702.
Marion Goldstein, Amanda Crowell, and Deanna Kuhn. 2009. What constitutes skilled argumentation and how does it develop? Informal Logic, 29:379–395.
Ivan Habernal and Iryna Gurevych. 2015. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 2127–2137, Lisbon, Portugal.
Namhee Kwon, Liang Zhou, Eduard Hovy, and Stuart W. Shulman. 2007. Identifying and classifying subjective claims. In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains, pages 76–81, Philadelphia, PA, USA.
Hyojung Lim and Jimin Kahng. 2012. Review of Criterion®. Language Learning & Technology, 16(2):38–45.
Raquel Mochales-Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL '09, pages 98–107, Barcelona, Spain.
Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, ICAIL '07, pages 225–230, Stanford, CA, USA.
Nathan Ong, Diane Litman, and Alexandra Brusilovsky. 2014. Ontology-based argument mining and automatic essay scoring. In Proceedings of the First Workshop on Argumentation Mining, pages 24–28, Baltimore, MA, USA.
Andreas Peldszus and Manfred Stede. 2015. Joint prediction in MST-style discourse parsing for argumentation mining. In Conference on Empirical Methods in Natural Language Processing, EMNLP '15, (to appear), Lisbon, Portugal.
Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL '15, pages 543–552, Beijing, China.
Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence – an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 440–450, Lisbon, Portugal.
Mark D. Shermis and Jill Burstein. 2013. Handbook of Automated Essay Evaluation: Current Applications and Future Directions. Routledge Chapman & Hall.
Yi Song, Michael Heilman, Beata Beigman Klebanov, and Paul Deane. 2014. Applying argumentation schemes for essay scoring. In Proceedings of the First Workshop on Argumentation Mining, pages 69–78, Baltimore, MA, USA.
Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of the 25th International Conference on Computational Linguistics, COLING '14, pages 1501–1510, Dublin, Ireland, August.
Christian Stab and Iryna Gurevych. 2016. Parsing argumentation structures in persuasive essays. arXiv preprint arXiv:1604.07370.

Douglas Walton, Chris Reed, and Fabrizio Macagno. 2008. Argumentation schemes. New York, NY: Cambridge University Press.

Douglas Walton. 1996. Argumentation schemes for presumptive reasoning. Mahwah, NJ: Lawrence Erlbaum.

Neural Attention Model for Classification of Sentences that Support Promoting/Suppressing Relationship

Yuta Koreeda, Toshihiko Yanase, Kohsuke Yanai, Misa Sato, Yoshiki Niwa
Research & Development Group, Hitachi, Ltd.
{yuta.koreeda.pb, toshihiko.yanase.gm, kohsuke.yanai.cs, misa.sato.mw, yoshiki.niwa.tx}@hitachi.com

Abstract

Evidences that support a claim "a subject phrase promotes or suppresses a value" help in making a rational decision. We aim to construct a model that can classify if a particular evidence supports a claim of a promoting/suppressing relationship given an arbitrary subject–value pair. In this paper, we propose a recurrent neural network (RNN) with an attention model to classify such evidences. We incorporated a word embedding technique in the attention model such that our method generalizes to never-encountered subject and value phrases. Benchmarks showed that the method outperforms conventional methods in evidence classification tasks.

1 Introduction

With the recent trend of big data and electronic records, it is getting increasingly important to collect evidences that support a claim, which usually comes along with a decision, for rational decision making. Argument mining can be utilized for this purpose because an argument itself is an opinion of the author that supports the claim, and an argument usually consists of evidences that support the claim. Identification of a claim has been rigorously studied in argument mining, including extraction of arguments (Levy et al., 2014; Boltužić and Šnajder, 2014; Sardianos et al., 2015; Nguyen and Litman, 2015) and classification of claims (Sobhani et al., 2015).

Our goal is to achieve classification of positive and negative effects of a subject in a form "a subject phrase S promotes/suppresses a value V." For example, given a subject S = gambling, a value V = crime and a text X = "casino increases theft", we can say that X supports a claim of the "gambling (S) promotes crime (V)" relationship. Such a technique is important because it allows extracting both sides of an opinion to be used in decision making (Sato et al., 2015).

We take a deep learning approach for this evidence classification, which has started to outperform conventional methods in many linguistic tasks (Collobert et al., 2011; Shen et al., 2014; Luong et al., 2015). Our work is based on a neural attention model, which had promising results in a translation task (Bahdanau et al., 2015) and in a sentiment classification task (Zichao et al., 2016). The neural attention model achieved these by focusing on important phrases; e.g., when V is economy and X is "Gambling boosts the city's revenue.", the attention layer focuses near the phrase "boosts the city's revenue".

The neural attention model was previously applied to aspect-based sentiment analysis (ABSA) (Yanase et al., 2016), which has some similarity to the evidence classification in that it classifies sentiment polarities towards a subject S given an aspect (corresponding to V) (Pontiki et al., 2015). A limitation of (Yanase et al., 2016) was that the learned attention layer is tightly attached to each S or V and does not generalize to never-encountered subjects/values. This means that it requires manually labeled data for all possible subjects and values, which is not practicable. Instead, when we train a model to classify an evidence that supports a claim of a relationship between, for example, gambling and crime, we want the same learned model to work for other S and V pairs such as smoking and health. In other words, we want the model to learn how to classify evidences that support a relationship of S and V, rather than learning the relationship itself.

In this paper, we propose a neural attention model that can learn to focus on important phrases from text even when S and V are never encountered, allowing the neural attention model to be applied to the evidence classification. We extend the neural attention model by modeling the attention layer using a distributed representation of words, in which similar words are treated in a similar manner. We also report benchmarks of the method against previous works in both neural and lexicon-based approaches. We show that the method can effectively generalize to an evidence classification task with never-encountered phrases.
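As an illustration of the task just defined, the following sketch shows the expected input/output contract of such a classifier on the paper's running example; the function is a placeholder standing in for the trained model, not the authors' implementation.

```python
# Placeholder signature for the evidence classifier described above: given
# subject S, value V and text X, return y in [0, 1], where values near 1.0
# mean X supports "S promotes V" and values near 0.0 mean "S suppresses V".
def classify_evidence(subject: str, value: str, text: str) -> float:
    raise NotImplementedError  # stands in for the trained attention model

# Running example from the paper (S = gambling, V = crime):
#   classify_evidence("gambling", "crime", "casino increases theft")
# should return a value close to 1.0 (a promoting relationship).
```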


[Figure 1: Structure of the proposed bi-directional RNN with word embedding-based attention layer (word embedding → forward/backward RNN → subject and value attention → concatenation and sum → fully connected layer → polarity y). Colored units are updated during training.]

Subject S             Value V (# of promoting / suppressing / total labels)
Training data
national lottery      economy (88 / 57 / 145), regressive tax (4 / 1 / 5)
sale of human organ   moral (0 / 6 / 6)
generic drug          cost (32 / 87 / 119), poverty (0 / 1 / 1)
cannabis              economy (61 / 7 / 68), medicine (215 / 68 / 283)
tourism               economy (142 / 11 / 153), corruption (10 / 3 / 13)
Test data
smoking               income (36 / 33 / 69), disease (158 / 1 / 159)
violent video game    crime (36 / 7 / 43), moral (7 / 14 / 21)

Table 1: Subject phrases and value phrases in the dataset.

2 Neural Attention Model

Given a subject phrase S, a value phrase V, and a text X, our model aims to classify whether X supports "S promotes V" or "S suppresses V". A text X is a sequence of word tokens, and the classification result is output as a real value y ∈ [0.0, 1.0] that denotes the promoting/suppressing polarity; i.e., X has a higher chance of supporting the promoting claim if y is nearer to 1.0 and the suppressing claim if it is nearer to 0.0.

Our method is shown in Figure 1. First of all, we apply skip-gram-based word embedding (Mikolov et al., 2013) to each token in X and obtain a varying-length sequence of distributed representations X = x_0, x_1, ..., x_T, where T is the number of tokens in the sentence. This is to allow words with similar meaning to be treated in a similar manner. We also apply word embedding to S and V to obtain x_s and x_v, respectively. This is a core idea for making the attention model generalize to first-encountered words. In case there exists more than one word in S or V, we take an average of the word embedding vectors.

Next, the word vector sequence X is input to a recurrent neural network (RNN) to encode contextual information into each token. The RNN calculates an output vector for each x_t at token position t. We use a bi-directional RNN (BiRNN) (Schuster and Paliwal, 1997) to consider both forward context and backward context. A forward RNN processes tokens from head to tail to obtain a forward RNN-encoded vector →u_t, and a backward RNN processes tokens from tail to head to obtain a backward RNN-encoded vector ←u_t. The output vector is u_t = →u_t || ←u_t, where || is the concatenation of vectors. We tested the method with long short-term memory (LSTM) (Sak et al., 2014) and gated recurrent units (GRUs) (Cho et al., 2014) as implementations of the RNN units.

Lastly, we filter tokens with S and V to determine the importance of each token and to extract information about the interactions of S and V. In the attention layer, the attention weight s_t ∈ R at each token t is calculated using the subject phrase vector x_s. We model attention with Equation (1), in which W_s is a parameter that is updated alongside the RNN during training:

    s_t = x_s^T W_s u_t    (1)

Then, we take the softmax over all tokens in a sentence for normalization:

    s̃_t = exp(s_t) / Σ_j exp(s_j)    (2)

The attention ṽ_t ∈ R for the value vector is calculated likewise, using a parameter W_v. s̃_t and ṽ_t are used as the weights of each token u_t to obtain the sentence feature vector z:

    z = Σ_t (s̃_t u_t || ṽ_t u_t)    (3)

Finally, the polarity y of the claim is calculated by two-layered fully-connected perceptrons with logistic sigmoid functions.

The model is trained by backpropagation using cross entropy as the loss and AdaGrad as the optimizer (Duchi et al., 2011). During training, the parameters of the fully-connected layers, the RNN, W_s, and W_v are updated. Note that x_s and x_v are not updated, unlike in (Yanase et al., 2016). Dropout (Srivastava et al., 2014) is applied to the input and output of the RNN, and the gradient norm is clipped to 5.0 to improve stability.
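To make Equations (1)–(3) concrete, the following NumPy sketch implements a single forward pass of the word-embedding-based attention layer on top of precomputed BiRNN outputs; dimensions, parameter names and the random inputs are illustrative assumptions, and the RNN itself is stubbed out rather than reproduced from the paper's TensorFlow implementation.

```python
# Forward-pass sketch of Equations (1)-(3): subject/value attention over
# BiRNN outputs u_t, producing the sentence feature vector z.
import numpy as np

rng = np.random.default_rng(0)
T, d_emb, d_rnn = 5, 300, 64          # tokens, embedding size, RNN state size
U = rng.normal(size=(T, 2 * d_rnn))   # stand-in for BiRNN outputs u_t (forward || backward)
x_s = rng.normal(size=d_emb)          # averaged embedding of the subject phrase S
x_v = rng.normal(size=d_emb)          # averaged embedding of the value phrase V
W_s = rng.normal(size=(d_emb, 2 * d_rnn)) * 0.01   # trained jointly with the RNN
W_v = rng.normal(size=(d_emb, 2 * d_rnn)) * 0.01

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

s = U @ (W_s.T @ x_s)                 # Eq. (1): s_t = x_s^T W_s u_t, for all t at once
v = U @ (W_v.T @ x_v)                 # value attention, computed likewise with W_v
s_tilde, v_tilde = softmax(s), softmax(v)   # Eq. (2): normalize over the tokens

# Eq. (3): z = sum_t (s~_t u_t || v~_t u_t), a 4*d_rnn-dimensional sentence vector
z = np.concatenate([(s_tilde[:, None] * U).sum(axis=0),
                    (v_tilde[:, None] * U).sum(axis=0)])
print(z.shape)   # (256,)
```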

3 Experiments

The purpose of this experiment was to test whether the proposed RNN with word embedding-based attention model could perform well in an evidence classification task. We benchmarked our method against the RNN without an attention model and conventional lexicon-based classification methods.

3.1 Dataset

We chose seven subject phrases and one or two value phrases for each subject phrase (a total of 13 pairs), as shown in Table 1. For each pair of S and V, we extracted sentences having both S and V within two adjacent sentences from Annotated English Gigaword (Napoles et al., 2012). From candidates of 7,000 sentences, we manually extracted and labeled 1,085 self-contained sentences that support a promoting/suppressing relationship. We allowed sentences in which S and V did not appear. We chose five subject phrases as training data and the other two as test data. Notice that only a fraction of the test data had overlapping value phrases with the training data.

3.2 Metrics

We compared the methods in terms of the area under a precision-recall curve (AUC-PR) because it represents a method's performance well even when data are skewed (Davis and Goadrich, 2006). The area under a curve is obtained by first calculating precision-recall for every possible threshold (the precision-recall curve) and integrating the curve with the trapezoidal rule. We took the average of the AUC-PR for when the promoting claim and when the suppressing claim was taken as positive, because it was a binary classification task. We calculated the area under a receiver operating characteristic curve (AUC-ROC) in a similar manner as a reference.

Since we formulated the learning algorithm in a regression-like manner, we chose the cutoff value with the best macro-precision within the training dataset to obtain predicted labels. This was used to calculate the macro-precision, the accuracy and the McNemar's test, which are reported for reference.

3.3 Baselines

Baselines in this experiment were as follows.

Bag-of-Words (BoW): A dictionary of all words in the training/test texts, S and V was used. The word-count vector was concatenated with one-hot (or n-hot, in the case of a phrase) vectors of S and V and used as a feature for a classifier.

Bag-of-Means (BoM): The average word embedding (Mikolov et al., 2013) was used as a feature for a classifier.

BiRNN without attention layer: This was the same as our method except that it took an average of the BiRNN output and concatenated it with the word vectors of S and V to be fed into the perceptron; i.e., z = x_s || x_v || Σ_t u_t.

We tested BoW and BoM with a linear support vector machine (LSVM) and random forest (RF), and BoW with multinomial naïve Bayes (NB). We carried out 5-fold cross validation within the training dataset, treating each subject phrase S as a fold, to determine the best performing hyperparameters and classifiers. The best performing classifier for BoM was RF with 27 estimators. The best performing classifier for BoW was NB with α = 0.38, with no consideration of prior probabilities.
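The averaged AUC-PR described in Section 3.2 can be computed with standard tooling; the sketch below (scikit-learn, with random stand-in scores) averages the area under the precision-recall curve obtained once with the promoting claim and once with the suppressing claim treated as the positive class.

```python
# Sketch of the evaluation in Section 3.2: average AUC-PR over both
# polarities taken as positive; scores and labels are random placeholders.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)        # 1 = promoting, 0 = suppressing
y_score = rng.random(200)                    # model output y in [0, 1]

def auc_pr(labels, scores):
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)            # trapezoidal integration

pr_promoting = auc_pr(y_true, y_score)             # promoting claim as positive
pr_suppressing = auc_pr(1 - y_true, 1 - y_score)   # suppressing claim as positive
print("average AUC-PR:", round((pr_promoting + pr_suppressing) / 2, 3))
```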
3.4 System setting

We tuned the hyperparameters for our method and the BiRNN in the same manner. The best settings are shown in Table 2.

Parameter         BiRNN     BiRNN+ATT
Dropout rate      0.7       0.5
Learning rate     0.00075   0.0017
RNN model         GRU       LSTM
RNN state size    128       64
Mini-batch size   16        32
Training epochs   6         17

Table 2: Hyperparameters of BiRNN and BiRNN+ATT (our method).

For the BoM, BiRNN and BiRNN+ATT, we used pretrained word embeddings of three-hundred-dimensional vectors trained with the Google News Corpus (the model retrieved from https://code.google.com/archive/p/word2vec/). We pretrained the BiRNN and the BiRNN+ATT with the Stanford Sentiment Treebank (Socher et al., 2013) by stacking a logistic regression layer on top of a token-wise average pooling of u_t and by predicting the sentiment polarity of phrases. For the BiRNN and BiRNN+ATT, the maximum token size was 40, and tokens that overflowed were dropped. BiRNN and BiRNN+ATT were implemented with TensorFlow (Abadi et al., 2015).

3.5 Results

The average AUC-PRs and reference metrics are shown in Table 3. BiRNN+ATT performed significantly better than the baselines, with p = 0.016 (BiRNN), p = 1.1 × 10⁻¹⁵ (BoM) and p = 0.010 (BoW), respectively (McNemar's test). The BiRNN without attention layer was no better than BoW (p = 0.41, McNemar's test). Precision-recall curves of the baselines and our method are shown in Figure 2.

             Accuracy   Average AUC-PR   AUC-ROC   Macro prec.
BiRNN+ATT    0.59       0.64             0.62      0.51
BiRNN        0.57       0.59             0.54      0.45
BoM          0.58       0.57             0.49      0.22
BoW          0.56       0.61             0.56      0.42

Table 3: Performance of the classifiers. The best result for each metric is shown in bold.

[Figure 2: Precision-recall curves for the classifications of the evidences. (a) Promoting-claim as positive: BiRNN+ATT (AUC=0.88), BiRNN (AUC=0.85), BoW (AUC=0.86), BoM (AUC=0.81). (b) Suppressing-claim as positive: BiRNN+ATT (AUC=0.30), BiRNN (AUC=0.28), BoW (AUC=0.26), BoM (AUC=0.34).]

4 Discussion

By extending the neural attention model using a distributed representation of words, we were capable of applying the neural attention model to the evidence classification task with never-encountered words. The results implied that it learned how to classify evidences that support a relationship of S and V, rather than the relationship itself.

The attention layer selects which part of the sentence the model uses for classification with the magnitudes of ŝ_t and v̂_t for each token. We visualize the magnitudes of ŝ_t and v̂_t on sentences extracted from the test dataset in Table 4.

# Text (each sentence is shown once with subject attention and once with value attention; highlighting not reproducible in plain text)
1. "Smoking costs some 22 000 Czech citizens their lives every year though the tobacco industry earns huge profits for the nation" (S row, V row)
2. "For the nation the health costs of smoking far outweigh the economic benefits of a thriving tobacco industry the commentary said" (S row, V row)

Table 4: Visualization of attention in test data with S = smoking and V = income. Highlights show ŝ_t and v̂_t. An underlined word had the smallest cosine distance to S and V, respectively.

We observed that the attention layers react to the target phrases' synonyms and their qualifiers. For

79 example, the value income reacted to the word [Boltui and najder2014] Filip Boltui and Jan najder. profit in Table 4, #1. The classification result 2014. Back up your Stance: Recognizing Argu- and ground truth were both promoting. General- ments in Online Discussions. In Proceedings of the First Workshop on Argumentation Mining, pages ization to similar words was observed for other 49–58, Baltimore, Maryland, June. Association for words such as Marijuana ( = cannabis) Computational Linguistics. S and murder ( = crime). This implies that V [Cho et al.2014] Kyunghyun Cho, Bart van Merrien- the attention layers learned to focus on important boer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi phrases, which was the reason why the proposed Bougares, Holger Schwenk, and Yoshua Bengio. method outperformed conventional BiRNN with- 2014. Learning Phrase Representations using RNN out an attention layer. EncoderDecoder for Statistical Machine Transla- tion. In Proceedings of the 2014 Conference on The method failed in Table 4, #2 in which the Empirical Methods in Natural Language Processing ground truth was suppressing and the method pre- (EMNLP), pages 1724–1734, Doha, Qatar, October. dicted promoting. The method shortsightedly fo- Association for Computational Linguistics. cused on the word benefits and failed to com- [Collobert et al.2011] Ronan Collobert, Jason Weston, prehend longer context. As a future work, we Lon Bottou, Michael Karlen, Koray Kavukcuoglu, will incorporate techniques that allow our model and Pavel Kuksa. 2011. Natural Language Process- to cope with a longer sequence of words. ing (Almost) from Scratch. J. Mach. Learn. Res., 12:2493–2537, November. 5 Conclusion [Davis and Goadrich2006] Jesse Davis and Mark Goad- rich. 2006. The Relationship Between Precision- We proposed a RNN with a word embedding- Recall and ROC Curves. In Proceedings of the based attention model for classification of evi- 23rd International Conference on Machine Learn- ing, ICML ’06, pages 233–240, New York, NY, dences. Our method outperformed the RNN with- USA. ACM. out an attention model and other conventional methods in benchmarks. The attention layers [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for learned to focus on important phrases even if Online Learning and Stochastic Optimization. J. words were never encountered, implying that our Mach. Learn. Res., 12:2121–2159, July. method learned how to classify evidences that sup- [Levy et al.2014] Ran Levy, Yonatan Bilu, Daniel Her- port a claim of a relationship of subject and value shcovich, Ehud Aharoni, and Noam Slonim. 2014. phrases, rather than the relationship itself. Context Dependent Claim Detection. In Proceed- ings of COLING 2014, the 25th International Con- ference on Computational Linguistics: Technical References Papers, pages 1489–1500, Dublin, Ireland, August. Dublin City University and Association for Compu- [Abadi et al.2015] Martn Abadi, Ashish Agarwal, Paul tational Linguistics. Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, [Luong et al.2015] Thang Luong, Hieu Pham, and Matthieu Devin, Sanjay Ghemawat, Ian Goodfel- Christopher D. Manning. 2015. Effective Ap- low, Andrew Harp, Geoffrey Irving, Michael Is- proaches to Attention-based Neural Machine Trans- ard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, lation. 
In Proceedings of the 2015 Conference on Manjunath Kudlur, Josh Levenberg, Dan Man, Rajat Empirical Methods in Natural Language Process- Monga, Sherry Moore, Derek Murray, Chris Olah, ing, pages 1412–1421, Lisbon, Portugal, September. Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Association for Computational Linguistics. Sutskever, Kunal Talwar, Paul Tucker, Vincent Van- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, houcke, Vijay Vasudevan, Fernanda Vigas, Oriol Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Vinyals, Pete Warden, Martin Wattenberg, Martin Distributed Representations of Words and Phrases Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. Ten- and their Compositionality. In C. J. C. Burges, sorFlow: Large-Scale Machine Learning on Hetero- L. Bottou, M. Welling, Z. Ghahramani, and K. Q. geneous Systems. Software available from tensor- Weinberger, editors, Advances in Neural Informa- flow.org. tion Processing Systems 26, pages 3111–3119. Cur- ran Associates, Inc. [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine [Napoles et al.2012] Courtney Napoles, Matthew Translation by Jointly Learning to Align and Trans- Gormley, and Benjamin Van Durme. 2012. An- late. In Proceedings of the International Confer- notated Gigaword. In Proceedings of the Joint ence on Learning Representations (ICLR 2015), San Workshop on Automatic Knowledge Base Con- Juan, Puerto Rico, May. struction and Web-scale Knowledge Extraction,

80 AKBC-WEKEX ’12, pages 95–100, Strouds- [Socher et al.2013] Richard Socher, John Bauer, burg, PA, USA. Association for Computational Christopher D. Manning, and Ng Andrew Y. 2013. Linguistics. Parsing with Compositional Vector Grammars. In Proceedings of the 51st Annual Meeting of the [Nguyen and Litman2015] Huy Nguyen and Diane Lit- Association for Computational Linguistics (Volume man. 2015. Extracting Argument and Domain 1: Long Papers), pages 455–465, Sofia, Bulgaria, Words for Identifying Argument Components in August. Association for Computational Linguistics. Texts. In Proceedings of the 2nd Workshop on Argu- mentation Mining, pages 22–28, Denver, CO, June. [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hin- Association for Computational Linguistics. ton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to [Pontiki et al.2015] Maria Pontiki, Dimitris Galanis, Prevent Neural Networks from Overfitting. Journal Haris Papageorgiou, Suresh Manandhar, and Ion of Machine Learning Research, 15:1929–1958. Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect Based Sentiment Analysis. In Proceedings [Yanase et al.2016] Toshihiko Yanase, Kohsuke Yanai, of the 9th International Workshop on Semantic Eval- Misa Sato, Toshinori Miyoshi, and Yoshiki Niwa. uation (SemEval 2015), pages 486–495, Denver, 2016. bunji at SemEval-2016 Task 5: Neural and Colorado, June. Association for Computational Lin- Syntactic Models of Entity-Attribute Relationship guistics. for Aspect-based Sentiment Analysis. In Proceed- ings of the 10th International Workshop on Seman- [Sak et al.2014] Haim Sak, Andrew Senior, and Fra- tic Evaluation (SemEval-2016), pages 294–300, San noise Beaufays. 2014. Long Short-Term Mem- Diego, California, June. Association for Computa- ory Based Recurrent Neural Network Architec- tional Linguistics. tures for Large Vocabulary Speech Recognition. [Zichao et al.2016] Yang Zichao, Diyi Yang, Chris arXiv:1402.1128 [cs, stat], February. arXiv: Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 1402.1128. 2016. Hierarchical Attention Networks for Docu- ment Classification. In 15th Annual Conference of [Sardianos et al.2015] Christos Sardianos, Ioan- the North American Chapter of the Association for nis Manousos Katakis, Georgios Petasis, and Computational Linguistics: Human Language Tech- Vangelis Karkaletsis. 2015. Argument Extraction nologies, June. from News. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 56–66, Denver, CO, June. Association for Computational Linguistics.

[Sato et al.2015] Misa Sato, Kohsuke Yanai, Toshinori Miyoshi, Toshihiko Yanase, Makoto Iwayama, Qinghua Sun, and Yoshiki Niwa. 2015. End-to-end Argument Generation System in Debating. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 109–114, Beijing, China, July. Association for Computational Linguistics and The Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/P15-4019

[Schuster and Paliwal1997] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.

[Shen et al.2014] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, pages 373–374, New York, NY, USA. ACM.

[Sobhani et al.2015] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From Argumentation Mining to Stance Classification. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 67–77, Denver, CO, June. Association for Computational Linguistics.

Towards Feasible Guidelines for the Annotation of Argument Schemes

Elena Musi†, Debanjan Ghosh*† and Smaranda Muresan†
†Center of Computational Learning Systems, Columbia University
*School of Communication and Information, Rutgers University
[email protected], [email protected], [email protected]

Abstract some premises in support of certain conclusions, with the aim of negotiating different opinions and The annotation of argument schemes rep- reaching consensus (Van Eemeren et al., 2013). resents an important step for argumenta- The automatic identification and evaluation of ar- tion mining. General guidelines for the guments require three main stages: 1) the identi- annotation of argument schemes, applica- fication, segmentation and classification of argu- ble to any topic, are still missing due to mentative discourse units (ADUs), 2) the identifi- the lack of a suitable taxonomy in Argu- cation and classification of the relations between mentation Theory and the need for highly ADUs (Peldszus and Stede, 2013a), and 3) the trained expert annotators. We present a set identification of argument schemes, namely the of guidelines for the annotation of argu- implicit and explicit inferential relations within ment schemes, taking as a framework the and across ADUs (Macagno, 2014). Argumentum Model of Topics (Rigotti and Although considerable steps have been taken Morasso, 2010; Rigotti, 2009). We show towards the first two stages (Teufel and Moens, that this approach can contribute to solv- 2002; Stab and Gurevych, 2014; Cabrio and Vil- ing the theoretical problems, since it of- lata, 2012; Ghosh et al., 2014; Aharoni et al., fers a hierarchical and finite taxonomy of 2014; Rosenthal and McKeown, 2012; Biran and argument schemes as well as systematic, Rambow, 2011; Llewellyn et al., 2014), the third linguistically-informed criteria to distin- stage still constitutes a major challenge because guish various types of argument schemes. large corpora systematically annotated with argu- We describe a pilot annotation study of ment schemes are lacking. As noticed by Palau 30 persuasive essays using multiple min- and Moens (2009), this is due to the proliferation imally trained non-expert annotators .Our in Argumentation Theory of different taxonomies findings from the confusion matrixes pin- of argument schemes based on weak distinctive point problematic parts of the guidelines criteria, which makes it difficult to develop inter- and the underlying annotation of claims subjective guidelines for annotation. In the Arau- and premises. We conduct a second anno- caria dataset (Reed and Rowe, 2004), for example, tation with refined guidelines and trained two argument scheme sets other than Walton’s are annotators on the 10 essays which received used as annotation protocols (Katzav and Reed, the lowest agreement initially. A signif- 2004; Pollock, 1995). icant improvement of the inter-annotator To overcome this problem, the most success- agreement shows that the annotation of ar- fully applied strategy has been to pre-select from gument schemes requires highly trained existing larger typologies, such as that of Walton annotators and an accurate annotation of et al. (2008), a subset of argument schemes which argumentative components (premises and is most frequent in a particular text genre, domain claims). or context (Green, 2015; Feng and Hirst, 2011; Song et al., 2014; Schneider et al., 2013) and pro- 1 Introduction vide annotators with critical questions as a means Argumentation is a type of discourse in which to identify the appropriate scheme. Such a bot- various participants make arguments, presenting tom up approach allows one to improve the identi-


fication conditions for a set of argument schemes study. Our findings from measuring the inter- (Walton, 2012), but it is hardly generalizable since annotator agreement (IAA) support previous find- it is restricted to specific argumentative contexts. ings that annotation of argument schemes would Moreover, while critical questions constitute use- require highly trained annotators (Section 4). We ful tools to evaluate the soundness of arguments also performed an analysis of confusion matrices (Song et al., 2014), they are far less suitable as to see which argument schemes were more dif- a means to identify the presence of arguments: ficult to identify, and which parts of the guide- adopting a normative approach, annotators would lines might need refinement (Section 4). Another conflate the notion of “making an argument” with finding of this study is that the identification of that of “making a sound argument”, while defea- argument schemes constitutes a means to refine sibility should not be considered as an identifica- the annotation of premises and claims (Section tion condition for the mere retrieval of arguments 5). We refined the guidelines and tested them in texts. through the annotation of the 10 essays which re- ceived the lowest inter-annotator agreement using We hypothesize that the Argumentum Model of 2 trained non-expert annotators and 1 expert an- Topics (Rigotti and Morasso, 2010; Rigotti, 2009), notator (Section 6). The results show an improve- an enthymematic approach for the study of the in- ment in the inter-annotator agreement. The con- ferential configuration of arguments, has the po- fusion matrix suggests that the frequency of non- tential to enhance the recognition of argument argumentative relations between premises/claims, schemes. Unlike other approaches (Van Eemeren claims/major claims highly affects disagreement. and Grootendorst, 1992; Walton et al., 2008; Kien- The guidelines and the annotated files are available pointner, 1987), it offers a taxonomic hierarchy of at: https://github.com/elenamusi/ argument schemes based on criteria which are dis- argscheme_aclworkshop2016. tinctive and mutually exclusive and which appeal to semantic properties of the state of affairs ex- pressed by premises/claims, and not to the logi- cal forms (deductive, inductive, abductive) of ar- 2 Theoretical Background and guments, whose boundaries are still debated (Sec- Framework tion 2). However, even if these semantic proper- ties are linguistically encoded, and hence poten- As Jacobs (2000, 264) puts it, “arguments are tially measurable, they might call for some back- fundamentally linguistic entities that express [...] ground knowledge in frame semantics to be iden- propositions where those propositions stand in tified as well as for quite specific analytic skills. particular inferential relations to one other”. These Moreover, the cognitive load requested by the an- inferential relations, namely argument schemes, notation of argument schemes is higher than that are textually implicit and have to be reconstructed needed for the annotation of the argumentative dis- by the participants of a critical discussion in or- course structure (e.g., argument components such der to reach agreement or disagreement. In every- as claims and premises, and argument relations day life this happens quite intuitively on the basis such as support/attack). 
As stated by Peldszus and of common ground knowledge: everyone would Stede (2013b) with regard to the annotation of ar- agree that “The sky is blue” does not constitute a gument structure in short texts, the inter-annotator premise for the assertion “We cannot make brown- agreement among minimally trained annotators is ies”, while the sentence “We ran out of chocolate” bound to be low due to different personal commit- does because chocolate is an essential ingredient ments as well as interpretative skills of the texts. of brownies. However, to classify the relation be- We wanted to test whether this conclusion is valid tween the above given premise-claim pair as an in- for our annotation task. stance of reasoning from the formal cause consti- We conducted a pilot annotation study using tutes a task which lies outside common encyclope- 9 minimally trained non-expert annotators. As dic knowledge. In light of this, a set of guidelines a corpus we used 30 short persuasive essays al- about the explicit and implicit components needed ready annotated as to premises, claims and sup- to recognize different types of argument schemes port/attack relations (Stab and Gurevych, 2014). between given pairs of premises and claims has Section 3 presents the set of guidelines and our been provided.

83 2.1 The Structure of Argument Schemes following the Argumentum Model of Topics Unlike other contemporary approaches, the Argu- mentum Model of Topics (AMT) does not “con- ceive of argument schemes as the whole bear- ing structures that connect the premises to the standpoint or conclusion in a piece of real ar- gumentation” (Rigotti and Morasso, 2010, 483), but as an inference licensed by the combination of both material and procedural premises. Pro- cedural premises are abstract rules of reasoning needed to bridge premises to claims. They in- clude both a broad relation (after which argu- Figure 2: Adopted taxonomy of argument schemes ment schemes are named), which tells us why premises and claims are argumentatively related 2.2 A Semantically Motivated Taxonomy of in a frame, and an inferential rule of the implica- Argument Schemes tive type (“if...then”), which further specifies the reasoning at work in drawing a claim from cer- In this paper, the adopted taxonomy of argument tain premises. Contextual information, necessary schemes is a simplified version of that elaborated to apply abstract rules to a real piece of argumen- by exponents of the Argumentum Model of Top- tation, is provided by material premises which in- ics (Rigotti, 2006; Palmieri, 2014). According to clude the premise textually expressed and some the AMT, argument schemes are organized in hi- common ground knowledge about the world. If we erarchical clusters based on principles relying on consider again the pair of sentences “[We cannot frame semantics and pragmatics. As seen in Fig- make brownies]CLAIM”. “[We ran out of choco- ure 2, there are three main levels. late]PREMISE”, the argument scheme connecting At the top level, argument schemes are distin- them is structured as given in Figure 1. At a struc- guished into three groups depending on the type tural level, the inferential rule works as a major of relations linking the State of Affairs (SoA) ex- premise that, combined with the conjunction of pressed by the premise to that expressed by the the material premises, allows one to draw the con- claim: clusion. Among the premises non-textually ex- Intrinsic argument schemes: the SoA ex- • pressed, while common ground knowledge is per pressed by the premise and that expressed by definition accessible to annotators, the inferential the claim are linked by an ontological rela- rule at work has to be consciously reconstructed. tion since they belong to the same semantic frame, understood as a unitarian scene featur- ing a set of participants (Fillmore and Baker, 2010). This entails that the two SoAs take place simultaneously in the real world or that the existence of one affects the existence of the other. Extrinsic argument schemes: the SoA ex- • pressed by the premise and that expressed by the claim belong to different frames and are connected by semantic relations that are not ontological. This means that the existence of one SoA is independent from the existence of ! the other SoA.

Figure 1: Inferential configuration of argument according Complex argument schemes: the relation be- • to Argumentum Model of Topics (AMT) tween the SoAs expressed by the premise and

84 the claim is not semantic or ontological, but The annotators involved in the project were nine pragmatic. In other words, what guarantees graduate students with no specific background in the support of the claim is reference to an ex- Linguistics or Argumentation. Three different an- pert or an authority. notators have been assigned to the annotation of each essay. The task consisted in annotating the The middle level refers to the different types “support” relations between premise-claim, claim- of ontological, semantic and pragmatic relations major claim, and premise-premise with one of the which further specify the top level classes. Each middle level argumentation schemes given in Fig- middle level argument scheme is defined by mak- ure 2 or NoArgument. For the identification of the ing reference to semantic or pragmatic proper- middle level argument schemes, annotators were ties of the propositions constituting the premises provided with an heuristic procedure and asked and the conclusion. For example, the scheme Ex- to look for linguistic clues as a further confirma- trinsic:Practical Evaluation is defined as follows: tion for their choices. We included the label of “the proposition functioning as premise is an eval- NoArgument to account for potential cases where uation, namely a judgment about something being premises/claims in support of claims/major claims ‘good’ or ‘bad’. The claim expresses a recommen- do not actually instantiate any inferential path and dation/ advice about stopping/continuing/setting cannot, hence, be considered proper arguments. up an action”. For example, in the following pair of clauses: The low level further specifies the middle level “[This, therefore, makes museums a place to en- schemes. For example, the Intrinsic:Causal argu- tertain in people leisure time]PREMISE. [People ment scheme is further specified following the so- should perceive the value of museums in enhanc- called Aristotelian causes (efficient cause, formal ing their own knowledge]CLAIM ”, the clause an- 1 cause, material cause and final cause) . In the an- notated as premise simply does not underpin at notation protocol, this low level has not been con- all the clause annotated as claim. As to the “at- sidered since we hypothesize that it will be dif- tacks” relations, which indicate that a statement ficult for annotators to reliably make such fine- rebuts another statement, they have not been con- grained distinctions, based on results from simi- sidered as targets of the annotation since they do lar studies using Walton’s taxonomy of argument not directly instantiate an argument scheme link- schemes (Song et al., 2014; Palau and Moens, ing the spans of texts annotated as premise/claim 2009). and claim/major claim, but a complex refutatory move pointing to the defeasibility of the rebut- 3 Annotation study ted statement itself or to that of the premises sup- The annotation study has been designed on top of porting it. Annotators have independently read the annotation performed by Stab and Gurevych the guidelines and proceeded with the annotation (2014). In their study, annotators were asked to without any formal training. identify and annotate through the open source an- The guidelines contain the description of the notation tool Brat 2 the argumentative components key notions of argument, premise, claim and ar- (premise, claim, major claim), the stance charac- gument schemes’ components as well as the AMT terizing claims (for/against) and the argumentative taxonomy. 
Detailed instructions about how to pro- relations connecting pairs of argumentative com- ceed in the annotation of argument schemes and ponents (supports/against) in 90 short persuasive rules were provided as well. The main stages of essays. We selected 30 essays as a sample for our analysis annotators were asked to go through are pilot annotation (11 relations for each essay in av- the following: erage). The text genre of short persuasive essays Identification of the middle level argument • is not bound to the discussion of a specific issue, scheme linking premises-claims or claims- which would prompt the presence of arguments of major claims pairs or recognition of the lack the same type, but enables the presence of the en- of argumentation in doubtful cases (e.g., In- tire spectrum of argument schemes. trinsic:Definitional, Intrinsic:Causal, Intrin- 1The model presents low level argument schemes for sic:Mereological, for a total of 9 choices in- other middle level argument schemes which are not visual- cluding NoArgument, Figure 2) ized in the Figure 2 2http://brat.nlplab.org/ Identification of the inferential rule at work •

(e.g., Figure 1). We present these two stages of the annotation process in the next two subsections.

3.1 Identification of the Middle Level Argument Schemes

In order to recognize the middle level types of argument schemes, the annotators were asked to browse a set of given identification questions for argument schemes (see Appendix), to choose the question which best matches the pair of argumentative components linked by a “support” relation, and to check if the argumentative components contain linguistic features listed as typical of an argument scheme (see Appendix).

The explanation of the annotation procedure has been backed up by examples. For instance, given the premise-claim pair “[Due to the increasing number of petrol stations, the competition in this field is more and more fierce]PREMISE, thus [the cost of petrol could be lower in the future]CLAIM”, the annotators were shown which argument scheme was appropriate:

• Intrinsic:Definitional: Does the sentence “Due to the increasing number of petrol stations, the competition in this field is more and more fierce” express a definitional property of the predicate “be lower” attributed to the cost of petrol? NO
Other linguistic clues: the premise and the claim usually share the grammatical subject. The verb which appears in the claim expresses a state rather than an action.

• Intrinsic:Mereological: Is the fact that “Due to the increasing number of petrol stations, the competition in this field is more and more fierce” or an entity of that sentence (e.g., “the competition”) an example/a series of examples/a part of the fact that “the cost of petrol could be lower in the future”? NO
Other linguistic clues: the premise is frequently signaled by the constructions “for example”, “as an example”, “x proves that”.

• Intrinsic:Causal: Is the fact that “Due to the increasing number of petrol stations, the competition in this field is more and more fierce” a cause/effect of the fact that “the cost of petrol could be lower in the future” or is it a means to obtain it? YES
Other linguistic clues: the claim frequently contains a modal verb or a modal construction (“must”, “can”, “it is clear/it is necessary”). In the given example, the claim contains the modal verb “could”.

As far as linguistic clues are concerned, they have been collected from existing literature about linguistic indicators (Rocci, 2012; Miecznikowski and Musi, 2015; Van Eemeren et al., 2007) and from a preliminary analysis of the considered sample. Annotators have been explicitly warned that the given linguistic indicators, due to their highly polysemous and context sensitive nature, do not represent decisive pointers to the presence of specific argument schemes, but have to be conceived as supplementary measures.

In presence of difficulties to identify a specific argument scheme applying the given set of identification questions, annotators were instructed to embed the pair of argumentative components under the hypothetical construction “If it is true that [premise/claim], is it then true that [claim/major claim]?” and evaluate its soundness. This simple test was meant to help the annotators check whether an inferential relation connecting the argumentative components is possibly there.

If a premise-claim pair failed the test, annotators were asked to choose the label NoArgument and explain why argumentation is not there. In the opposite case, they were told to annotate the pair under analysis as Ambiguous and try to identify the top level class of argument schemes applying the following round of identification questions:

• Intrinsic argument schemes: Can the state of affairs expressed in the premise and the state of affairs expressed in the claim take place simultaneously in the real world, or does the realization of one affect the realization of the other one? If yes, it is an instance of intrinsic argument schemes.

• Extrinsic argument schemes: Are the existence of the state of affairs expressed in the premise and that expressed in the claim not simultaneous and independent of each other? If yes, it is an instance of extrinsic argument schemes.

• Complex argument scheme: Is the premise a discourse/statement expressed by an expert/an authority/an institution and does the

claim coincide with the content of that discourse? If yes, it is an instance of the complex argument scheme (authority).

Example: Let us consider the example below of a premise supporting a claim.

“[Knowledge from experience seems a little different from information contained in books]CLAIM. [To cite an example, it is common in books that water boils at 100 Celcius degree. However, the result is not always the same in reality because it also depends on the height, the purity of the water, and even the measuring tool]PREMISE”

To determine whether there is an argument scheme, the annotators could ask themselves: “If it is true that [it is common in books that water boils at 100 Celcius degree. However, the result is not always the same in reality because it also depends on the height, the purity of the water, and even the measuring tool], is it then true that [knowledge from experience seems a little different from information contained in books]?” As the answer is yes, this premise-claim pair is an instance of argument schemes.

When the top level class of argument schemes is concerned, the SoAs expressed by the claim and the premise are simultaneously realized, since the premise constitutes an example which shows that what is stated in the claim corresponds to reality. Thus this is an Intrinsic scheme. More specifically, it is an Intrinsic:Mereological scheme (following the questions and the linguistic cues), since a process of induction from an exemplary case to a generalization is at work.

3.2 Identification of the Inferential Rule

The last step of the annotation process consisted in the identification of the inferential rule at work for those pairs in which annotators were able to identify a middle level argument scheme. Annotators were provided with representative rules for each argument scheme (see Appendix), such as the following two for the Intrinsic:Mereological argument scheme: “if all parts share a property, then the whole will inherit this property”; “if a part of x has a positive value, also x has a positive value”.

They were asked either to write down one of the given inferential rules corresponding to the argument scheme or to formulate a rule on their own if they thought that the provided ones were not fitting. Our hypothesis was that when writing down inferential rules the annotators are forced to control the appropriateness of the chosen argument scheme.

4 Evaluation

In order to evaluate the reliability of the annotations we measured the inter-annotator agreement (IAA) using Fleiss’ κ to account for multiple annotators (Fleiss, 1971). When considering the middle level annotation schemes, the IAA is κ=0.1, which shows only slight agreement (Landis and Koch, 1977). This finding supports the hypothesis that for annotating argument schemes the IAA is low when using minimally trained non-expert annotators. We also measured the IAA between the top level arguments (Intrinsic, Extrinsic, Complex, NoArgument), but did not find any significant difference in the Fleiss’ κ score.

Table 1 presents some descriptive statistics about the annotations. Out of 302 argumentative relations to be annotated, for 30 cases (10%) all three annotators agree, while for 179 cases (59%) at least two out of the three annotators agree. When all three annotators agree, the distribution of the argument schemes is: 7 Intrinsic:Causal, 9 Intrinsic:Mereological, 1 Intrinsic:Definitional, 6 Extrinsic:Practical Evaluation and 7 NoArgument. When at least two out of the three annotators agree, the distribution of the argument schemes (majority voting) is: 60 (33.5%) Intrinsic:Causal, 46 (25.7%) Intrinsic:Mereological, 16 (8.9%) Intrinsic:Definitional, 28 (15.6%) Extrinsic:Practical Evaluation, 3 (1%) Extrinsic:Alternatives, 3 (1%) Extrinsic:Opposition and 23 (12.8%) NoArgument.

When considering the 3 top level argument schemes plus NoArgument, out of 302 argumentative relations to be annotated, for 260 instances (86%) at least two annotators agreed. The distribution of majority voting labels in these cases is: 185 (71%) are Intrinsic, 52 (20%) are Extrinsic, and 23 (8.8%) are NoArgument.

One goal of this pilot study was to determine whether confusion exists among particular argument schemes, with the aim to improve the guidelines. Table 2 shows the confusion matrix between two argument schemes for all annotator pairs. This confusion matrix is a symmetric one, so we provided only the upper triangular matrix. A detailed discussion is presented in the next section.
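As a concrete illustration of the agreement measure used above, the following is a minimal sketch of how Fleiss’ κ (Fleiss, 1971) can be computed for several annotators rating the same relations. The rating counts in the example are invented for illustration only and are not the study’s data.

```python
# Minimal sketch: Fleiss' kappa for items rated by several annotators into categories.
# table[i][j] = number of annotators who assigned category j to item i
# (e.g. rows are "supports" relations, columns are middle level schemes plus NoArgument).

def fleiss_kappa(table):
    n_items = len(table)
    n_raters = sum(table[0])  # assumes every item received the same number of ratings
    n_categories = len(table[0])
    # proportion of all assignments that fall into each category
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_categories)]
    # per-item observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in table]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

if __name__ == "__main__":
    # three annotators, three illustrative categories per relation
    ratings = [
        [3, 0, 0],   # all three annotators chose the first scheme
        [1, 2, 0],
        [1, 1, 1],
        [0, 0, 3],
    ]
    print(round(fleiss_kappa(ratings), 3))
```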

Argument Schemes   # of Agreeing Annotators   # of Instances
Middle             all 3                      30
                   2 or more                  179
Top                all 3                      77
                   2 or more                  260

Table 1: Descriptive statistics about the annotations.

5 Discussion of the Results

As shown in the previous section, the argument schemes which received the highest IAA were Intrinsic:Mereological and Intrinsic:Causal among the Intrinsic argument schemes, and Extrinsic:Practical Evaluation for the Extrinsic argument schemes. Going through the examples in which all three annotators agreed, our impression is that both the presence of scheme specific linguistic clues and the suitability of inferential rules already offered in the guidelines enhanced the annotators’ choices. As to Intrinsic:Mereological relations, the frequent presence of constructions such as “for example”, “for instance”, compatible only with that specific argument scheme, has plausibly fostered its reliable recognition.

In the case of Intrinsic:Causal argument schemes, the cited linguistic clues in the guidelines have turned out to be not relevant: modal verbs are not present in the claims/major claims of the pairs annotated as Intrinsic:Causal by the majority of annotators. On the other hand, all these examples are instances of inferential rules from the cause to the effect. This suggests that the cause-effect inferential relation is considered as the prototypical type of causal argument schemes.

Only one instance of the Intrinsic:Definitional argument scheme was recognized by all three annotators. Notions such as that of stative predicates as identifiable linguistic clues in the guidelines were probably not informative for every annotator, as shown by the confusion among the Intrinsic:Definitional and Intrinsic:Causal argument schemes (Table 2).

For Intrinsic:Mereological and Intrinsic:Causal argument schemes a set of inferential rules was already proposed in the guidelines, as opposed to just one rule given for Intrinsic:Definitional. This has probably helped the annotators to check the soundness of the chosen scheme in these cases.

As to the Extrinsic:Practical Evaluation argument scheme, the recurrent feature which seems to be at the basis of agreement is the presence of a clear evaluation in the premise.

Table 2 shows that, among the three more frequent argument schemes, Extrinsic:Practical Evaluation was the one confused the most with another specific argument scheme, namely Intrinsic:Causal. From the analysis of the ambiguous cases, two plausible reasons for the confusion have emerged: 1) the presence of the modal verb “should” has been cited in the guidelines among the linguistic clues of both argument schemes, and 2) the Extrinsic:Practical Evaluation argument scheme shares with the causal argument scheme of the final type the reference to intentionality and, in general, to the frame of human action where consequences of various choices are taken into account. For example, the premise/claim pair “[this kind of ads will have a negative effect to our children]PREMISE. [Advertising alcohol, cigarettes, goods and services with adult content should be prohibited]CLAIM”, which is an instance of the Extrinsic:Practical Evaluation argument scheme, has been confused with the Intrinsic:Causal argument scheme licensing the inferential rule “if an action does not allow to achieve the goal, it should not be undertaken”. In order to improve the annotation, ambiguous cases of this type will have to be discussed during the training process (see Section 6).

Since the Extrinsic:Practical Evaluation argument scheme is by far the most frequent Extrinsic argument scheme in our sample, improving its identification promises to highly affect the IAA regarding top level Extrinsic vs. Intrinsic argument schemes.

The Intrinsic:Causal argument scheme appears to be frequently confused also with the Intrinsic:Mereological argument scheme and vice versa. This happened mainly in the presence of Mereological argument schemes drawing a generalization from an exemplary case (rhetorical induction), such as the following: “[in Vietnam, many cultural costumes and natural scenes, namely drum performance and bay, are being encouraged to preserve and funded by the tourism ministry.]PREMISE [Through tourism industry, many cultural values have been preserved and natural environments have been protected]CLAIM”. Some annotators misconceived the SoA expressed by the premise as an effect of the SoA expressed by the claim. This behavior suggests that the distinction between propositions expressing generalizations and those expressing states of affairs which can be located in space and time was not clear enough in the guidelines.

As to the label NoArgument, a qualitative analysis of the occurrences showing disagreement has revealed that annotators tried to identify an argumentation scheme by default even when there was none, unless the propositional content of the connected argumentative components was evidently unrelated.

Table 2: Confusion Matrix on 30 essays (3 minimally trained non-expert annotators).

6 Annotation with trained annotators

After the initial study, we improved the guidelines by keeping only scheme-specific linguistic clues, providing more inferential rules for each argument scheme, stressing the distinction between the Extrinsic:Practical Evaluation and Intrinsic:Causal argument schemes as well as between Intrinsic:Causal and Intrinsic:Mereological, and explicitly stating that some “supports” relations in the corpus are not argumentative (some examples have been provided). In order to further test the improvement of the annotation guidelines, we trained 2 non-expert annotators and 1 expert annotator on the set of essays which had received the lowest agreement (κ=-0.01, which indicated poor agreement). The non-expert annotators went through a two hour training session during which they were asked to annotate 2 essays and received continuous feedback on misunderstandings and/or doubts. The results of the annotation show a shift in the IAA from κ=-0.01 to κ=0.311 (“fair agreement”) among all the three annotators (including the expert).

The IAA among the two non-expert annotators was similar, κ=0.307. In order to map the disagreement space we have calculated the confusion matrix. Table 3 shows that in this reduced sample the percentage of relations annotated as NoArgument is higher compared to the overall sample. Looking at the notes made by the annotators, four main reasons for the non argumentative nature of the relations pop up. First, among the claims-major claims pairs the propositional content of the claim frequently rephrases that of the major claim, as in the pair “[The artist must be given freedom]MAJOR CLAIM [There should not be any restriction on artists’ work]CLAIM”. In these cases, the presence of a “supports” relation is considered as a redundancy justified as a stylistic strategy for achieving consensus on a certain stance; the claim, however, does not work as a linguistic entity as an argument. Second, the clause annotated as premise happened to work as an argument only if combined with another clause. This happens because in the original dataset of Stab and Gurevych (2014) the annotation of premises and claims was done at the clause level. As recently pointed out by Stede et al. (2016), the mismatch between ADUs, which tend to encompass multiple clauses, and EDUs (elementary discourse units) constitutes one of the major difficulties to overcome in the investigation of the existing intersections between argumentative and discourse relations. Third, the relation between two argumentative components would have been argumentative if reversed, or if a different claim would have been chosen. Fourth, the clause annotated as premise does not underpin the clause annotated as claim, but constitutes instead a counterargument.

Although the agreement in the recognition of NoArgument cases has consistently improved with highly trained annotators (non-expert as well as expert), it still remains a matter of particular confusion. The most frequent label chosen instead of NoArgument is that of the Intrinsic:Causal argument scheme. This is probably due to the implicative nature of the proposed test (“if the premise is true, then the claim is true”), which invites a causal interpretation.

Table 3: Confusion Matrix on a set of 10 essays (highly trained annotators: 2 non-experts and 1 expert).

7 Conclusion and Future Work

We presented a novel set of guidelines for the annotation of argument schemes based on the Argumentum Model of Topics. This framework is advantageous since it offers a finite hierarchical taxonomy of argument schemes based on linguistic criteria which are highly distinctive and applicable to every context. We have conducted a pilot annotation study with 30 short persuasive essays in order to test the informativeness of the guidelines with minimally trained non-expert annotators. The low inter-annotator agreement confirms the difficulties underlined by previous studies for minimally trained annotators to recognize argument schemes. From the qualitative analysis of the confusion matrixes it has emerged that: 1) linguistic indicators of argument schemes constitute useful clues for the annotators only if they can be specific to one argument scheme, otherwise they are a source of confusion; 2) the reconstruction of inferential rules is highly relevant to enhancing annotators’ choices; and 3) among the argument schemes the subtype “Efficient cause” is the easiest to identify. We have improved the guidelines according to these results and tested them on a reduced sample of 10 essays with 2 trained non-expert annotators and one expert annotator. The interannotator agreement has significantly improved (fair agreement). The confusion matrix suggests that the frequency of non argumentative or ambiguous relations is the main cause of disagreement. For future work, we plan to test the annotation guidelines again in a corpus with higher accuracy as to the annotation of argumentative components (premises/claims). A methodological result of the study is that identifying argument schemes constitutes an important tool to verify the presence of argumentative components and support relations.

Acknowledgements

This paper is based on work supported by the SNFS Early Post Doc Grant for the project “From argumentation mining to semantics in context: the role of evidential discourse relations as indicators of argumentative strategies” and partially by the DARPA-DEFT program. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense, the SNFS, or the U.S. Government. We would like to thank the annotators for their work and the anonymous reviewers for their valuable feedback.


References

Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund, and Noam Slonim. 2014. A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68.

Or Biran and Owen Rambow. 2011. Identifying justifications in written dialogs. In Semantic Computing (ICSC), 2011 Fifth IEEE International Conference on, pages 162–168. IEEE.

Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, pages 208–212, Stroudsburg, PA, USA. Association for Computational Linguistics.

Vanessa Wei Feng and Graeme Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 987–996. Association for Computational Linguistics.

Charles J Fillmore and Collin Baker. 2010. A frames approach to semantic analysis. The Oxford handbook of linguistic analysis, pages 313–339.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48.

Nancy L Green. 2015. Identifying argumentation schemes in genetics research articles. NAACL HLT 2015, page 12.

Scott Jacobs. 2000. Rhetoric and dialectic from the standpoint of normative pragmatics. Argumentation, 14(3):261–286.

Joel Katzav and Chris A Reed. 2004. On argumentation schemes and the natural classification of arguments. Argumentation, 18(2):239–259.

Manfred Kienpointner. 1987. Towards a typology of argumentative schemes. Argumentation: Across the lines of discipline, 3:275–87.

J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Clare Llewellyn, Claire Grover, Jon Oberlander, and Ewan Klein. 2014. Re-using an argument corpus to aid in the curation of social media collections. In LREC, pages 462–468.

Fabrizio Macagno. 2014. Argumentation schemes and topical relations. In G. Gobber and A. Rocci (eds.), Language, reason and education, pages 185–216.

Johanna Miecznikowski and Elena Musi. 2015. Verbs of appearance and argument schemes: Italian sembrare as an argumentative indicator. In Reflections on Theoretical Issues in Argumentation Theory, pages 259–278. Springer.

Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th international conference on artificial intelligence and law, pages 98–107. ACM.

Rudi Palmieri. 2014. Corporate argumentation in takeover bids, volume 8. John Benjamins Publishing Company.

Andreas Peldszus and Manfred Stede. 2013a. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Andreas Peldszus and Manfred Stede. 2013b. Ranking the annotators: An agreement study on argumentation structure. In Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, pages 196–204.

John L Pollock. 1995. Cognitive carpentry: A blueprint for how to build a person. MIT Press.

Chris Reed and Glenn Rowe. 2004. Araucaria: Software for argument analysis, diagramming and representation. International Journal on Artificial Intelligence Tools, 13(04):961–979.

Eddo Rigotti and Sara Greco Morasso. 2010. Comparing the argumentum model of topics to other contemporary approaches to argument schemes: the procedural and material components. Argumentation, 24(4):489–512.

Eddo Rigotti. 2006. Relevance of context-bound loci to topical potential in the argumentation stage. Argumentation, 20(4):519–540.

Eddo Rigotti. 2009. Whether and how classical topics can be revived within contemporary argumentation theory. In Pondering on problems of argumentation, pages 157–178. Springer.

Andrea Rocci. 2012. Modality and argumentative discourse relations: a study of the Italian necessity modal dovere. Journal of Pragmatics, 44(15):2129–2149.

Sara Rosenthal and Kathleen McKeown. 2012. Detecting opinionated claims in online discussions. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, pages 30–37. IEEE.

Jodi Schneider, Krystian Samp, Alexandre Passant, and Stefan Decker. 2013. Arguments about deletion: How experience improves the acceptability of arguments in ad-hoc online task groups. In Proceedings of the 2013 conference on Computer supported cooperative work, pages 1069–1080. ACM.

Yi Song, Michael Heilman, Beata Beigman Klebanov, and Paul Deane. 2014. Applying argumentation schemes for essay scoring.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In EMNLP, pages 46–56.

Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher, and Jérémie Perret. 2016. Parallel discourse annotations on a corpus of short texts.

Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational linguistics, 28(4):409–445.

Frans H Van Eemeren and Rob Grootendorst. 1992. Argumentation, communication, and fallacies: A pragma-dialectical perspective. Lawrence Erlbaum Associates, Inc.

Frans H Van Eemeren, Peter Houtlosser, and AF Snoeck Henkemans. 2007. Argumentative indicators in discourse: A pragma-dialectical study, volume 12. Springer Science & Business Media.

Frans H Van Eemeren, Rob Grootendorst, Ralph H Johnson, Christian Plantin, and Charles A Willard. 2013. Fundamentals of argumentation theory: A handbook of historical backgrounds and contemporary developments. Routledge.

Douglas Walton, Christopher Reed, and Fabrizio Macagno. 2008. Argumentation schemes. Cambridge University Press.

Douglas Walton. 2012. Using argumentation schemes for argument extraction: A bottom-up method. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 6(3):33–61.

A Appendix

We report in what follows the “cheat sheet” located at the end of the annotation guidelines, which contains i) an identification question, ii) a set of linguistic clues and iii) inferential relations for each middle level argument scheme. The complete guidelines will be made available.

1. Intrinsic Definition:
Does x express a definitional property of the predicate attributed to the grammatical subject in y?
Other clues: the premise and the claim usually share the grammatical subject. The verb which appears in the claim expresses a state (be + noun or be + adjective, consider) rather than an action.
Inferential rule:
• “if x shows typical traits of a class of entities (e.g. positive actions, beneficial decisions), then it is an instance of that class”

2. Intrinsic Mereological:
Is “the fact that x” or an entity cited in x an example/a series of examples/a part of “the fact that y”?
Other clues: the premise is frequently signaled by the constructions for example, as an example, for instance, x proves that. In the cases in which induction is at work the premise coincides with the description of a situation that is frequently located in the past.
Inferential rules:
• “if all parts share a property, then the whole will inherit this property”
• “if a part of x has a positive value, also x has a positive value”
• “if something holds/may hold/held for an exemplary case x, it holds/may hold/will hold for all the cases of the same type”
• “if something holds/may hold/held for a sample of cases of the type x, it holds/may hold/will hold for every case of the type x”

3. Intrinsic Causal:
Is x a cause/effect of y or is it a means to obtain y?
Other clues: the claim frequently contains a modal verb or a modal construction (must, can, it is clear/it is necessary).
Inferential rules:
• “if the cause is the case, the effect is the case”
• “if the effect is the case, the cause is probably the case”
• “if a quality characterizes the cause, then such quality characterizes the effect too”
• “if the realization of the goal necessitates the means x, x must be adopted”
• “if an action does not allow to achieve the goal, it should not be undertaken”
• “if somebody has the means to achieve a certain goal, he will achieve that goal”

4. Extrinsic Analogy:
Do x and/or y compare situations happened in different circumstances but similar in some respects?
Other clues: the premise and/or the claim usually contain comparative conjunctions/constructions (e.g. as, like, in a similar vein).
Inferential rules:
• “if the state of affairs x shows a set of features which are also present in the state of affairs y and z holds for x, then z holds for y too”
• “if two events x and y are similar and event x had the consequence z, probably also y will have the consequence z”
• “if two situations x and y are similar in a substantial way and action z was right in the situation x, action z will be right also in the situation y”

5. Extrinsic Opposition:
Does the occurrence of the state of affairs x exclude the occurrence of the state of affairs y? Or does the premise contain entities/events which are opposite with respect to entities/events expressed in the claim?
Other clues: the claim sometimes contains modals which express impossibility (it is impossible that, it cannot be that), but it is not always the case.
Inferential rules:
• “if two states of affairs/entities x, y are one the opposite of the other, the occurrence of x excludes the occurrence of y”
• “if two states of affairs x, y are one the opposite of the other, they entail opposite consequences”

6. Extrinsic Alternatives:
Is/are the state of affairs expressed by x an alternative(s) to the one expressed in y?
Other clues: the claim frequently contains necessity modals (must, have to). The premise states that all possible other alternatives are excluded.
Inferential rules:
• “if all the alternatives to x are excluded, then x is unavoidable”
• “if among a set of alternatives only one is reasonable, it has to be undertaken”

7. Extrinsic Practical Evaluation/Termination and setting up:
Does x express an evaluation and does y express a recommendation about stopping/continuing/setting up that action?
Other clues: the claim usually contains the modal verb should.
Inferential rules:
• “if something is of important value, it should not be terminated”
• “if something has a positive value, it should be supported/continued/promoted/maintained”
• “if something has positive effects, it should be supported/continued/promoted/maintained”
• “if something has a negative effect, it should be terminated”

8. Complex Authority:
Is the premise a discourse/statement expressed by an expert/institution/authority in the field and does the claim coincide with the content of that discourse?
Other clues: the authority to which the writer appeals is usually introduced by according to, as shown by, as clarified/explained/declared by.
Inferential rules:
• “if the institution/expert/authority in the field states that proposition x is true, then x is true”
• “if the institution/expert/authority in the field states event x will occur, then x will probably occur”
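The cheat sheet above is essentially structured data (a question, clues and candidate rules per scheme), so it lends itself to being encoded for annotation tooling, for example to surface the identification question and clues inside an interface such as brat. The sketch below is one possible encoding; the field names, the dictionary layout and the helper function are assumptions made for illustration and are not part of the guidelines.

```python
# Sketch: one cheat-sheet entry per middle level scheme, so an annotation tool can show
# the identification question, the linguistic clues and the candidate inferential rules.
CHEAT_SHEET = {
    "Intrinsic:Causal": {
        "question": "Is x a cause/effect of y or is it a means to obtain y?",
        "clues": ["must", "can", "it is clear", "it is necessary"],
        "rules": [
            "if the cause is the case, the effect is the case",
            "if the effect is the case, the cause is probably the case",
        ],
    },
    "Complex:Authority": {
        "question": ("Is the premise a statement by an expert/institution/authority "
                     "whose content coincides with the claim?"),
        "clues": ["according to", "as shown by", "as declared by"],
        "rules": ["if the authority states that proposition x is true, then x is true"],
    },
}

def prompt_for(scheme):
    """Build the text an annotation interface could display for a candidate scheme."""
    entry = CHEAT_SHEET[scheme]
    return f"{scheme}\nQ: {entry['question']}\nClues: {', '.join(entry['clues'])}"

print(prompt_for("Intrinsic:Causal"))
```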

Identifying Argument Components through TextRank

Georgios Petasis and Vangelis Karkaletsis
Software and Knowledge Engineering Laboratory
Institute of Informatics & Telecommunications
National Center for Scientific Research (N.C.S.R.) “Demokritos”
Athens, Greece.
{petasis,vangelis}@iit.demokritos.gr

Abstract

In this paper we examine the application of an unsupervised extractive summarisation algorithm, TextRank, on a different task, the identification of argumentative components. Our main motivation is to examine whether there is any potential overlap between extractive summarisation and argument mining, and whether approaches used in summarisation (which typically model a document as a whole) can have a positive effect on tasks of argument mining. Evaluation has been performed on two corpora containing user posts from an on-line debating forum and persuasive essays. Evaluation results suggest that graph-based approaches and approaches targeting extractive summarisation can have a positive effect on tasks related to argument mining.

1 Introduction

Argumentation is a branch of philosophy that studies the act or process of forming reasons and of drawing conclusions in the context of a discussion, dialogue, or conversation. Being an important element of human communication, its use is very frequent in texts, as a means to convey meaning to the reader. As a result, argumentation has attracted significant research focus from many disciplines, ranging from philosophy to artificial intelligence. Central to argumentation is the notion of argument, which according to (Besnard and Hunter, 2008) is a set of assumptions (i.e. information from which conclusions can be drawn), together with a conclusion that can be obtained by one or more reasoning steps (i.e. steps of deduction). The conclusion of the argument is often called the claim, or equivalently the consequent or the conclusion of the argument. The assumptions are called the support, or equivalently the premises of the argument, which provide the reason (or equivalently the justification) for the claim of the argument. The process of extracting conclusions/claims along with their supporting premises, both of which compose an argument, is known as argument mining (Goudas et al., 2015; Goudas et al., 2014) and constitutes an emerging research field.

Several approaches have already been presented for addressing various subtasks of argument mining, including the identification of argumentative sentences (i.e. sentences that contain argumentation components such as claims and premises), argumentation components, relations between such components, and resources for supporting argument mining, like discourse indicators and other expressions indicating the presence of argumentative components. Proposed methods mostly relate to supervised machine learning exploiting a plethora of features (Goudas et al., 2015), including the combination of several techniques, such as the work presented in (Lawrence and Reed, 2015).

One of the difficulties associated to argument mining relates to the fact that the identification of argument components usually depends on the context in which they appear. The locality of this context can vary significantly, based not only on the domain, but possibly even on personal writing style. On one hand, discourse indicators, markers and phrases can provide strong and localised contextual information, but their use is not very frequent (Lawrence and Reed, 2015). On the other hand, the local context of a phrase may indicate that the phrase is a fact, suggesting low or no argumentativeness at all, while at the same time, the same phrase may contradict another phrase several sentences before or after the phrase in question, constituting the phrase under

question an argumentative component (Carstens and Toni, 2015). While it is quite easy to handle local context through suitable representations and learning techniques, complexity may increase significantly when a broader context is required, especially when relations exist among various parts of a document.

In this paper we want to examine approaches that are able to handle interactions and relations that are not local, especially the ones that can model a document as a whole. An example of a task where documents are modelled in their entirety is document summarisation (Giannakopoulos et al., 2015). Extractive summarisation typically examines the importance of each sentence with respect to the rest of the sentences in a document, in order to select a small set of sentences that are more “representative” for a given document. A typical extractive summarisation system is expected to select sentences that contain a lot of information in a compact form, and capture the different pieces of information that are expressed in a document. The main idea behind this paper is to examine whether there is any potential overlap between these sentences that summarise a document, and argumentation components that can exist in a document. Assuming that in a document the author expresses one or more claims, which can be potentially justified through a series of premises or support statements, it will be interesting to examine whether any of these argumentation components will be assessed as significant enough to be included in an automatically generated summary. Will a summarisation algorithm capture at least the claims, and characterise them as important enough to be included in the produced summary?

In order to examine if there is any overlap between extractive summarisation and argument mining (at least the identification of sentences that contain some argumentation components, such as claims), we wanted to avoid any influence from the documents and the thematic domains under examination. Ruling out supervised approaches, we examined summarisation algorithms that are either unsupervised, or can be trained in different domains than the ones they will be applied on. Finally, we opted for an unsupervised algorithm, TextRank (Mihalcea and Tarau, 2004), a graph-based ranking model, which can be applied on extractive summarisation by exploiting “similarity” among sentences, based on their content overlap.

We conducted our study on two corpora in English. The first one is a corpus of user generated content, compiled by Hasan and Ng (2014) from online debate forums on four topics: “abortion”, “gay rights”, “marijuana”, and “Obama”. The second corpus, compiled by Stab and Gurevych (2014), contains 90 persuasive essays on various topics. Initial results are promising, suggesting that there is an overlap between extractive summarisation and argumentation component identification, and that the ranking of sentences from TextRank can help in tasks related to argument mining, possibly as a feature in cooperation with an argumentation mining approach.

The rest of the paper is organised as follows: Section 2 presents a brief overview of approaches related to argument mining, while Section 3 presents our approach on applying TextRank for identifying sentences that contain argumentation components. Section 4 presents the experimental setting and evaluation results, with Section 5 concluding this paper and proposing some directions for further research.

2 Related Work

A plethora of approaches related to argument mining consider the identification of sentences containing argument components or not as a key step of the whole process. Usually labeled as “argumentative” sentences, these approaches model the process of identifying argumentation components as a two-class classification problem. In this category can be classified approaches like (Goudas et al., 2015; Goudas et al., 2014; Rooney et al., 2012), where supervised machine learning has been employed in order to classify sentences into argumentative and non-argumentative ones.

However, there are approaches which try to solve the argument mining problem in a completely different way. Lawrence et al. (2014) combined a machine learning algorithm to extract propositions from philosophical text, with a topic model to determine argument structure, without considering whether a piece of text is part of an argument. Hence, the machine learning algorithm was used in order to define the boundaries and afterwards classify each word as the beginning or end of a proposition. Once the identification of the beginning and the ending of the argument propositions has finished, the text is marked from each starting point till the next ending word.

Another interesting approach was proposed by Graves et al. (2014), who explored potential sources of claims in scientific articles based on their title. They suggested that if titles contain a tensed verb, then it is most likely (actually almost certain) to announce the argument claim. In contrast, when titles do not contain tensed verbs, they have varied announcements. According to their analysis, they have identified three basic types in which articles can be classified according to genre, purpose and structure. If the title has verbs then the claim is repeated in the abstract, introduction and discussion, whereas if the title does not have verbs, then the claim does not appear in the title or introduction but appears in the abstract and discussion sections.

Another field of argument mining that has recently attracted the attention of the research community is the field of argument mining from online discourses. As in most cases of argument mining, the lack of annotated corpora is a limiting factor. In this direction, Hasan and Ng (2014), Houngbo and Mercer (2014), Aharoni et al. (2014), Green (2014), Stab and Gurevych (2014), and Kirschner et al. (2015) focused on providing corpora spanning from online posts to scientific publications that could be widely used for the evaluation of argument mining techniques. In this context, Boltužić and Šnajder (2014) collected comments from online discussions about two specific topics and created a manually annotated corpus for argument mining. In addition, they used a supervised model to match user-created comments to a set of predefined topic-based arguments, which can be either attacked or supported in the comment. In order to achieve this, they used textual entailment features, semantic text similarity features, and one “stance alignment” feature.

One step further, Trevisan et al. (2014) described an approach for the analysis of German public discourses, exploring semi-automated argument identification by exploiting discourse analysis. They focused on identifying conclusive connectors, substantially adverbs (i.e. “hence”, “thus”, “therefore”), using a multi-level annotation. Their approach consists of three steps, which are performed iteratively (manual discourse linguistic argumentation analysis, semi-automatic text mining (PoS-tagging and linguistic multi-level annotation) and data merge), and their results show the argument-conclusion relationship is most often indicated by the conjunction because, followed by “since”, “therefore” and “so”.

Ghosh et al. (2014) attempted to identify the argumentative segments of texts in online threads. Expert annotators have been trained to recognise argumentative features in full-length threads. The annotation task consisted of three subtasks: In the first subtask, annotators had to identify “Argumentative Discourse Units” (ADUs) along with their starting and ending points. Secondly, they had to classify the ADUs according to the “Pragmatic Argumentation Theory” (PAT) into “Callouts” and “Targets”. As a final step, they indicated the link between the “Callouts” and “Targets”. In addition, a hierarchical clustering technique has been proposed that assesses how difficult it is to identify individual text segments as “Callouts”.

Levy et al. (2014) defined the task of automatic claim detection in a given context and outlined a preliminary solution, aiming to automatically pinpoint context dependent claims (CDCs) within topic-related documents. Their supervised learning approach relies on a cascade of classifiers. It assumes that the articles examined are a relatively small set of relevant free-text articles, provided either manually or through automatic retrieval methods. More specifically, the first step of their approach is to identify sentences containing CDCs in each article. As a second step a classifier is used in order to identify the exact boundaries of the CDCs in sentences identified as containing CDCs. As a final step, each CDC is ranked in order to isolate the most relevant CDCs to the corresponding topic.

Finally, Carstens and Toni (2015) focus on extracting argumentative relations, instead of identifying the actual argumentation components. Despite the fact that few details are provided and their approach seems to be concentrated in pairs of sentences, the presented approach is similar to the approach presented in this paper in the sense that both concentrate on relations as the primary starting point for performing argument mining.

3 Extractive Summarisation and Argumentative Component Identification

3.1 The TextRank Algorithm

TextRank is a graph-based ranking model, “which can be used for a variety of natural language

processing applications, where knowledge drawn from an entire document is used in making local ranking/selection decisions” (Mihalcea and Tarau, 2004). The main idea behind TextRank is to extract a graph from the text of a document, using textual fragments as vertices. What constitutes a vertex depends on the task the algorithm is applied on. For example, for the task of keyword extraction vertices can be words, while for summarisation the vertices can be whole sentences. Once the vertices have been defined, edges can be added between two vertices according to the “similarity” among the text units represented by the vertices. Again, “similarity” depends on the task. As a last step, an iterative graph-based ranking algorithm (a slightly modified version of the PageRank algorithm (Brin and Page, 1998)) is applied, in order to score vertices and associate a value (score) to each vertex. These values attached to each vertex are used for the ranking/selection decisions.

In the case of (extractive) summarisation, TextRank can be used in order to extract a set of sentences from a document, which can be used to form a summary of the document (either through post-processing of the extracted set of sentences, or by using the set of sentences directly as the summary). In such a case, the following steps are applied:

• The text of a document is tokenised into words and sentences.
• The text is converted into a graph, with each sentence becoming a vertex of the graph (as the goal is to rank entire sentences).
• Connections (edges) between sentences are established, based on a “similarity” relation. The edges are weighted by the “similarity” score between the two connected vertices.
• The ranking algorithm is applied on the graph, in order to score each sentence.
• Sentences are sorted in reversed order of their score, and the top ranked sentences are selected for inclusion into the summary.

The notion of “similarity” in TextRank is defined as the overlap between two sentences, which can be simply determined as the number of common words between the two sentences. Formally, given two sentences Si and Sj, of sizes N and M respectively, with each sentence being represented by a set of words W such that Si = {W1^i, W2^i, ..., WN^i} and Sj = {W1^j, W2^j, ..., WM^j}, the similarity between Si and Sj can be defined as (Mihalcea and Tarau, 2004):

Similarity(Si, Sj) = |{Wk : Wk ∈ Si & Wk ∈ Sj}| / (log(|Si|) + log(|Sj|))

In our experiments we have used a slightly modified similarity measure which employs TF-IDF (Manning et al., 2008), as implemented by the open-source TextRank implementation that can be found at (Bohde, 2012).

3.2 Argumentative Component Identification

The main focus of this paper is to evaluate whether there is any overlap between argument mining and automatic summarisation. An automatic summarisation algorithm such as TextRank is expected to rank highly sentences that are “recommended” by the rest of the sentences in a document, where “recommendation” suggests that sentences address similar concepts, and the sentences recommended by other sentences are likely to be more informative (Mihalcea and Tarau, 2004), thus more suitable for summarising a document. In a similar manner, a claim is expected to share similar concepts with other text fragments that either support or attack the claim. At the same time, these fragments are related to the claim with relations such as “support” or “attack”. Thus, there seems to exist some overlap between how arguments are expressed and how TextRank selects and scores sentences. In the work presented in this paper we will try to exploit this similarity, in order to use TextRank for identifying sentences that contain argumentation components.

In its initial form for the summarisation task, TextRank segments text into sentences, and uses sentences as the text unit to model the given text into a graph. In order to apply TextRank for claim identification, we assume that an argument component can be contained within a single sentence, essentially ignoring components that are expressed in more than one sentence. A component can also be expressed as a fragment smaller than a sentence: in this case we want to identify whether a sentence contains a component or not. As a result, we define the task of component identification as the identification of sentences that contain an argumentation component.

In order to identify the sentences that contain argumentation components, we tokenise a document into tokens and we identify sentences. Then we apply TextRank, and we extract a small number (one or two sentences) from the top scored sentences.
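To make the ranking scheme concrete, the following is a minimal, dependency-free sketch of TextRank over sentences using the plain content-overlap similarity given above, rather than the TF-IDF variant of the Bohde (2012) implementation actually used in the paper. The sentence splitter, the fixed iteration count and the helper names are simplifications introduced here, not part of the original system.

```python
import math
import re

def split_sentences(text):
    """Very rough sentence splitter; a real system would use a proper tokeniser."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))

def similarity(s_i, s_j):
    """Content-overlap similarity from Mihalcea and Tarau (2004)."""
    w_i, w_j = words(s_i), words(s_j)
    if len(w_i) < 2 or len(w_j) < 2:        # avoid log(1) = 0 in the denominator
        return 0.0
    return len(w_i & w_j) / (math.log(len(w_i)) + math.log(len(w_j)))

def textrank(sents, d=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph; one score per sentence."""
    n = len(sents)
    weights = [[similarity(sents[i], sents[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) or 1.0 for row in weights]   # guard against isolated vertices
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(weights[j][i] * scores[j] / out_sum[j] for j in range(n))
                  for i in range(n)]
    return scores

def top_sentences(text, k=1):
    """Return the k highest scoring sentences of a document."""
    sents = split_sentences(text)
    ranked = sorted(zip(textrank(sents), sents), reverse=True)
    return [s for _, s in ranked[:k]]
```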
tence, essentially ignoring components that are ex- Sentences are sorted in reversed order of their • pressed in more than one sentences. A component score, and the top ranked sentences are se- can also be expressed as a fragment smaller than a lected for inclusion into the summary. sentence: in this case we want to identify whether The notion of “similarity” in TextRank is defined a sentence contains a component or not. As a re- as the overlap between two sentences, which can sult, we define the task of component identifica- be simply determined as the number of common tion as the identification of sentences that contain words between the two sentences. Formally, given an argumentation component. two sentences Si and Sj, of sizes N and M respec- In order to identify the sentences that contain tively, with each sentence being represented by a argumentation components, we tokenise a docu- i i i set of words W such as Si = W1,W2, ..., WN and ment into tokens and we identify sentences. Then j j j Sj = W1 ,W2 , ..., WM , the similarity between Si we apply TextRank, and we extract a small num- and Sj can be defined as (Mihalcea and Tarau, ber (one or two sentences) from the top scored sen-

97 tences. If the document contains an argumentation main reason for the author to be in favour or component, we expect the sentence containing the against the domain. In order to examine whether component to be included in the small set of sen- there is an overlap between argument mining and tences extracted by TextRank. summarisation, we have applied TextRank on each post, and we have examined whether the single, 4 Empirical Evaluation top ranked sentence by TextRank, contains the segment marked as the reason. In case the segment In order to evaluate our hypothesis, that there is is contained in the top ranked sentence returned by a potential overlap between automatic summarisa- TextRank, the post is classified as correctly identi- tion (as represented by extractive approaches such fied. If the reason segment is not contained in the as TextRank) and argument mining (at least claim returned sentence, the post is characterised as an identification), we have applied TextRank on two error. Evaluation results are reported through ac- corpora written in English. The first corpus has curacy (proportion of true results among the total been compiled from online debate forums, con- number of cases examined). taining user posts concerning four thematic do- mains (Hasan and Ng, 2014), while the second Finally, two experiments were performed, with corpus contains 90 persuasive essays on various the only difference being the number of sentences topics (Stab and Gurevych, 2014). selected from TextRank to form the summary. During the first experiment (labelled as E1), only a single sentence was selected (the top-ranked sen- tence as determined by TextRank), while during 4.1 Experimental Setup the second experiment (labelled as E2) we have The first corpus that has been used in our study selected the two top-ranked sentences. has been compiled and manually annotated as de- The main motivation for selecting the corpus scribed in (Hasan and Ng, 2014). User gener- compiled by Hasan and Ng (2014) was the fact ated content has been collected from an online 1 that most of its documents have been manually an- debate forum . Debate posts from four popular notated with a single claim, which was associated domains were collected: “abortion”, “gay rights”, with a text fragment that most of the times is con- “marijuana”, and “Obama”. These posts are ei- tained within a sentence. Having a single sentence ther in favour or against the domain, depending on as a target constitutes the evaluation of an ap- whether the author of the post supports or opposes proach such as the one presented in this paper eas- abortion, gay rights, the legalisation of marijuana, ier, as the single sentence that represents the main or Obama respectively. The posts were manually claim of the document can be compared to the top- examined, in order to identify the reasons for the ranked sentence by the extractive summarisation stance (in favour or against) of each post. A set algorithm. A corpus that has similar properties, in of 56 reasons were identified for the four domains, the sense that there is a “major” claim represented which were subsequently used for annotating the by a text fragment that is contained within a sen- posts: for each post, segments that correspond to tence, has been compiled by Stab and Gurevych any of these reasons were manually annotated. (2014). 
This corpus contains 90 documents that We have processed the aforementioned corpus, are persuasive essays, and have been manually an- and we have removed the posts where the anno- notated with an annotation scheme that includes a tated segments span more than a single sentence, “major” claim for each document, a series of ar- keeping only the posts where the annotated seg- guments that support or attack the major claim, ments are contained within a single sentence. The and series or premises that underpin the valid- resulting number of posts for each domain are ity of an argument. Despite being a smaller cor- shown in Table 1. The TextRank implementation pus than the first corpus used for evaluation, hav- 2 used in the evaluation has been written in Python , ing only 1675 sentences, that fact that it contains and is publicly available through (Bohde, 2012). only 90 documents suggests that its documents are Each post is associated with one (and in some slightly larger than the posts of the first document cases more than one) segment that expresses the by (Hasan and Ng, 2014). The average length of a 1http://www.createdebate.com/ persuasive essay is 18.61 sentences in this second 2http://www.python.org/ evaluation corpus, which is larger than the aver-
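The containment-based scoring described above can be sketched as follows. The representation of a post as a (text, reason segment) pair and the summarise callable are assumptions made for illustration, not the actual evaluation code; the top_sentences helper from the TextRank sketch above could be passed in as that callable.

```python
def evaluate(posts, summarise, k=1):
    """posts: list of (text, reason_segment) pairs.
    summarise(text, k): returns the k sentences extracted for the post.
    A post is correct when the annotated reason segment is contained in an extracted sentence."""
    correct = sum(
        any(reason in sentence for sentence in summarise(text, k))
        for text, reason in posts
    )
    return correct / len(posts)   # accuracy: proportion of correctly handled posts

# Illustrative usage, assuming the top_sentences sketch shown earlier:
# accuracy_e1 = evaluate(posts, top_sentences, k=1)
# accuracy_e2 = evaluate(posts, top_sentences, k=2)
```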

                                      “abortion”   “gay rights”   “marijuana”   “Obama”   all domains
Number of posts                            398           403            352         298          1451
Mean post size (in sentences)            11.15          8.22           6.52        6.50          8.25
Mean argumentative sentences number       1.80          1.65           1.97        1.39          1.71

Table 1: Number of posts per domain in the first corpus used for evaluation (Hasan and Ng, 2014).

Domain          Total Posts   E1 Correct Posts   E1 Accuracy   E2 Correct Posts   E2 Accuracy
“abortion”              398                150          0.37                226          0.57
“gay rights”            403                160          0.40                235          0.58
“marijuana”             352                168          0.47                239          0.68
“Obama”                 298                159          0.53                208          0.70
all domains            1451                631          0.44                919          0.63

Table 2: Baseline results (for experiments E1 and E2) – first evaluation corpus (Hasan and Ng, 2014).

As a result, the second corpus that will be used for evaluation (Stab and Gurevych, 2014) provides the opportunity to evaluate TextRank on larger documents, where the selection of the sentence that represents the “major” claim is potentially more difficult, as the set of potential candidate sentences is larger. Finally, there is a single “major” claim for each persuasive essay, and the mean number of all claims (including the “major” claim) is 5.64 per persuasive essay.

4.2 Baseline

As a baseline approach, a simple random sentence extractor has been used. The sentences contained in each document (post for the first and essay for the second evaluation corpus respectively) were randomly shuffled by using the Fisher-Yates shuffling algorithm (Fisher and Yates, 1963). Then we extract a small number (the first or the two first sentences) from the sentences as randomly shuffled, simulating how we apply TextRank for identifying the sentences that contain argumentation components. The results obtained from this random shuffle baseline are shown in Table 2 for the first evaluation corpus (Hasan and Ng, 2014), while the results for the second evaluation corpus (Stab and Gurevych, 2014) are presented in Table 3.

Experiment         Total   Correct Essays   Accuracy
E1                    90                3       0.03
E2                    90               10       0.11
E1 (all claims)       90               30       0.33
E2 (all claims)       90               48       0.53

Table 3: Baseline results (for experiments E1 and E2) – second evaluation corpus (Stab and Gurevych, 2014).

4.3 Evaluation Results

As described in the experimental setting, we have performed two experiments. During the first experiment (E1) we have generated a “summary” of a single sentence (the top-ranked sentence by TextRank), while for the second experiment (E2) we have selected the two top-ranked sentences as the generated “summary”. In both experiments, each post is characterised as correct if the reason segment is contained in the extracted “summary”; otherwise the post is characterised as an error. The evaluation results are shown in Table 4 for experiment E1 and Table 5 for experiment E2.

As can be seen from Tables 4 and 5, TextRank has achieved better performance (as measured by accuracy) than our baseline in both experiments, E1 and E2. For experiment E1, accuracy has increased from 0.44 (of the baseline) to 0.51, while in experiment E2, accuracy has increased from 0.63 to 0.71, when considering all four domains. In addition, TextRank has achieved better performance for all individual domains than the baseline, which randomly selects sentences. Another factor is document size: the mean size of posts (measured as the number of contained sentences) seems to vary between the four domains, ranging from 6.5 sentences for domains “Obama” and “marijuana” to 11 sentences for domain “abortion”. TextRank has exhibited better performance than the baseline even for the domains with larger posts, such as “abortion”.
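A possible reading of the random baseline of Section 4.2 in code form is shown below. Python's random.shuffle already implements Fisher-Yates; the explicit loop is given only to mirror the description, and the function names are assumptions introduced here rather than the authors' implementation.

```python
import random

def fisher_yates_shuffle(items, rng=None):
    """In-place Fisher-Yates shuffle (random.shuffle implements the same algorithm)."""
    rng = rng or random.Random()
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)                 # pick an index in [0, i]
        items[i], items[j] = items[j], items[i]
    return items

def random_baseline(sentences, k=1):
    """Return k sentences of a document after random shuffling, mirroring the baseline."""
    return fisher_yates_shuffle(list(sentences))[:k]
```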

Domain          Total Posts   Correct Posts   Accuracy   Accuracy (Baseline)
“abortion”              398             171       0.43                  0.37
“gay rights”            403             201       0.50                  0.40
“marijuana”             352             199       0.56                  0.47
“Obama”                 298             175       0.59                  0.53
all domains            1451             746       0.51                  0.44

Table 4: Evaluation Results (for experiment E1) – first evaluation corpus (Hasan and Ng, 2014).

Domain          Total Posts   Correct Posts   Accuracy   Accuracy (Baseline)
“abortion”              398             267       0.67                  0.57
“gay rights”            403             269       0.67                  0.58
“marijuana”             352             273       0.78                  0.68
“Obama”                 298             223       0.75                  0.70
all domains            1451            1032       0.71                  0.63

Table 5: Evaluation Results (for experiment E2) – first evaluation corpus (Hasan and Ng, 2014).

Experiment         Total Essays   Correct Essays   Accuracy   Accuracy (Baseline)
E1                           90               14       0.16                  0.03
E2                           90               26       0.29                  0.11
E1 (all claims)              90               47       0.52                  0.33
E2 (all claims)              90               67       0.74                  0.53

Table 6: Evaluation Results – second evaluation corpus (Stab and Gurevych, 2014).

Of course, as the size of documents increases, the task of selecting one or two sentences becomes more difficult, and this is evident by the drop in performance (for both TextRank and the baseline) for domains “abortion” and “gay rights” when compared to the rest of the domains.

Results are similar for the second evaluation corpus of persuasive essays, as is shown in Table 6. Again TextRank has achieved better performance than the baseline for both experiments, E1 and E2. The overall performance of both TextRank and the baseline is lower than on the first corpus, mainly due to the increased size of persuasive essays compared to posts (having an average size of 18.61 and 8.25 sentences respectively). For the second corpus an additional experiment has been performed, which expands the set of claims that have to be identified from only the “major” claim to all the claims (including the “major” one) in an essay. This experiment (labelled as “E1 (all claims)” and “E2 (all claims)” in Table 6) examines whether the top-ranked sentence by TextRank is a claim (experiment “E1 (all claims)”), or whether the first two sentences as ranked by TextRank contain a claim (experiment “E2 (all claims)”). As expected, the performance of both TextRank and the baseline has been increased, as this is an easier task. The mean number of all claims (including the “major” claim) is 5.64 per persuasive essay.

Regarding the overall performance of the summarisation algorithm and its use for identifying a sentence containing an argumentation component, TextRank has managed to achieve a noticeable increase in performance over the baseline, despite the fact that it is an unsupervised algorithm, requiring no training or any form of adaptation to the domain. This suggests that an algorithm that models a document as a whole can provide positive information for argument mining, even if the algorithm has been designed for a different task, as is the case for the TextRank variation used, which targets extractive summarisation. In addition, the evaluation results suggest that there is some overlap between argument mining and summarisation, leading to the conclusion that there are potential benefits for approaches performing argument mining through the synergy with approaches that perform document summarisation.

5 Conclusions

In this paper we have applied an unsupervised algorithm for extractive summarisation, TextRank, to a task that relates to argument mining: the identification of sentences that contain an argumentation component. Motivated by the need to better address relations and interactions that are not local within a document, we have applied a graph-based algorithm which models a whole document, with sentences as its basic text unit. Evaluation has been performed on two English corpora. The first corpus contains user posts from an on-line debating forum, manually annotated with the reasons each post author uses to declare their stance, in favour or against, towards a specific topic. The second corpus contains 90 persuasive essays, manually annotated with claims and premises, along with a "major" claim for each essay. Evaluation results suggest that graph-based approaches and approaches targeting extractive summarisation can have a positive effect on argument mining tasks.

Regarding directions for further research, there are several axes that can be explored. Our evaluation results suggest that TextRank achieved better performance than the baseline for documents between 6 and 11 sentences, and it would be interesting to evaluate its performance further on longer documents. At the same time, the performance of TextRank depends on how "similarity" between its text units is defined; alternative "similarity" measures can be considered, even supervised ones that measure distance according to information obtained from a domain or for a specific task. An external knowledge base could also be explored, providing distances closer to semantic similarity. Finally, a third dimension is to examine alternative extractive summarisation algorithms, in order to clarify further whether other summarisation algorithms can have a positive impact on argument mining, similar to the results achieved by TextRank.

References

Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund, and Noam Slonim. 2014. A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68, Baltimore, Maryland, June. Association for Computational Linguistics.

Philippe Besnard and Anthony Hunter. 2008. Elements of argumentation, volume 47. MIT Press, Cambridge.

Josh Bohde. 2012. Document Summarization using TextRank. http://joshbohde.com/blog/document-summarization. [Online; accessed 12-May-2016].

Filip Boltužić and Jan Šnajder. 2014. Back up your stance: Recognizing arguments in online discussions. In Proceedings of the First Workshop on Argumentation Mining, pages 49–58, Baltimore, Maryland, June. Association for Computational Linguistics.

Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April.

Lucas Carstens and Francesca Toni. 2015. Towards relation based argumentation mining. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 29–34, Denver, CO, June. Association for Computational Linguistics.

R.A. Fisher and F. Yates. 1963. Statistical tables for biological, agricultural, and medical research. Hafner Pub. Co.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48, Baltimore, Maryland, June. Association for Computational Linguistics.

George Giannakopoulos, Jeff Kubina, Ft Meade, John M Conroy, MD Bowie, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. Multiling 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 270.

Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Aristidis Likas, Konstantinos Blekas, and Dimitris Kalles, editors, Artificial Intelligence: Methods and Applications, volume 8445 of Lecture Notes in Computer Science, pages 287–299. Springer.

Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument extraction from news, blogs, and the social web. International Journal on Artificial Intelligence Tools, 24(05):1540024.

Heather Graves, Roger Graves, Robert Mercer, and Mahzereen Akter. 2014. Titles that announce argumentative claims in biomedical research articles. In Proceedings of the First Workshop on Argumentation Mining, pages 98–99, Baltimore, Maryland, June. Association for Computational Linguistics.

Nancy Green. 2014. Towards creation of a corpus for argumentation mining the biomedical genetics research literature. In Proceedings of the First Workshop on Argumentation Mining, pages 11–18, Baltimore, Maryland, June. Association for Computational Linguistics.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762, Doha, Qatar, October. Association for Computational Linguistics.

Hospice Houngbo and Robert Mercer. 2014. An automated method to build a corpus of rhetorically-classified sentences in biomedical texts. In Proceedings of the First Workshop on Argumentation Mining, pages 19–23, Baltimore, Maryland, June. Association for Computational Linguistics.

Christian Kirschner, Judith Eckle-Kohler, and Iryna Gurevych. 2015. Linking the thoughts: Analysis of argumentation structures in scientific publications. In Proceedings of the 2nd Workshop on Argumen- tation Mining, pages 1–11, Denver, CO, June. Asso- ciation for Computational Linguistics.

John Lawrence and Chris Reed. 2015. Combining argument mining techniques. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 127–136, Denver, CO, June. Association for Com- putational Linguistics.

John Lawrence, Chris Reed, Colin Allen, Simon McAl- ister, and Andrew Ravenscroft. 2014. Mining argu- ments from 19th century philosophical texts using topic based modelling. In Proceedings of the First Workshop on Argumentation Mining, pages 79–87, Baltimore, Maryland, June. Association for Compu- tational Linguistics.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context de- pendent claim detection. In Proceedings of COL- ING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500. Dublin City University and Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July. Association for Computational Linguistics.

Niall Rooney, Hui Wang, and Fiona Browne. 2012. Applying kernel methods to argumentation mining. In G. Michael Youngblood and Philip M. McCarthy, editors, Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference, Marco Island, Florida, May 23-25, 2012. AAAI Press.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Junichi Tsujii and Jan Hajic, editors, Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1501–1510, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

Bianka Trevisan, Eva Dickmeis, Eva-Maria Jakobs, and Thomas Niehr. 2014. Indicators of argument-conclusion relationships. An approach for argumentation mining in German discourses. In Proceedings of the First Workshop on Argumentation Mining, pages 104–105, Baltimore, Maryland, June. Association for Computational Linguistics.

Rhetorical structure and argumentation structure in monologue text

Andreas Peldszus and Manfred Stede
Applied Computational Linguistics, UFS Cognitive Science, University of Potsdam
[email protected]  [email protected]

Abstract

On the basis of a new corpus of short "microtexts" with parallel manual annotations, we study the mapping from discourse structure (in terms of Rhetorical Structure Theory, RST) to argumentation structure. We first perform a qualitative analysis and discuss our findings on correspondence patterns. Then we report on experiments with deriving argumentation structure from the (gold) RST trees, where we compare a tree transformation model, an aligner based on subgraph matching, and a more complex "evidence graph" model.

1 Introduction

Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) was designed to represent the structure of a text in terms of coherence relations holding between adjacent text spans, where the same set of relations is used to join the "elementary discourse units" (EDUs) and, recursively, the larger spans. The result is a tree structure that spans the text completely; there are no "gaps" in the analysis, and there are no crossing edges. The relations are defined largely in terms of speaker intentions, so that the analysis is meant to capture the "plan" the author devised to influence his or her audience. The developers of RST did not explicitly target one particular text type or discourse mode (instructive, argumentative, descriptive, narrative, expository), but when we assume that the text is argumentative, the very nature of the RST approach suggests that it might in fact capture the underlying argumentation quite well. Systems for automatic RST parsing have been built since the early 2000s, with recent approaches including (Ji and Eisenstein, 2014) and (Joty et al., 2015). Hence, a potentially useful architecture for argumentation mining could involve an RST parser as an early step that accomplishes a good share of the overall task. How feasible this is has not been determined so far, though.

On the theoretical side, different opinions have been voiced in the literature on the role of RST trees for argumentation analysis; we summarize the situation below in Section 2. All these opinions were based on the experiences their authors had made with manually applying RST and with analyzing argumentation, but they were not based on systematic empirical evidence. In contrast, in this paper we use a new resource that we recently released (Stede et al., 2016), which offers annotations of both RST and argumentation structure on a corpus of 112 short texts. Our previous paper presented a first rough analysis of the correlations between RST and argumentation. The present paper builds on those preliminary results and makes two contributions:

- We provide a qualitative analysis that examines the commonalities and differences between the two levels of representation in the corpus, and seeks explanations for them.

- We report on experiments in automatically mapping RST trees to argumentation structures, for now on the basis of the manually annotated "gold" RST trees.

Following the discussion of related work in Section 2, Section 3 gives a brief introduction to the corpus and the annotation schemes that are used for argumentation and for RST. Then, Section 4 presents our qualitative (comparative) analysis, and Section 5 the results of our experiments on automatic analysis. Finally, Section 6 relates these two endeavours and draws conclusions.


2 Related work

In this section, we summarize the positions that have so far been taken in the literature on the status of RST analyses for argumentation.

The view that performing an RST analysis essentially subsumes the task of determining argumentation structure was advanced by Azar (1999), who argued that RST's nucleus-satellite distinction is crucial for distinguishing the two roles in a dialectical argumentative relationship, and that, in particular, five RST relations should be regarded as providing argumentative support for different types of claims: Motivation for calls for action; Antithesis and Concession for increasing positive regard toward a stance; Evidence for forming a belief; Justify for readiness to accept a statement. Azar illustrated that idea with a few short sample texts that he analyzed in terms of RST trees using these relations. In one of these examples, however, Azar made the move of combining two non-adjacent text segments into a single node in the RST tree (representing the central claim), which is in conflict with a basic principle of RST. This indicates that Azar borrowed certain aspects from RST but ignored others. In our earlier work (Peldszus and Stede, 2013), we posited that the underlying phenomenon of non-adjacency creates a problem for RST-argumentation mapping in general, i.e., it is not limited to discontinuous claims: both Support and Attack moves can be directed to material that occurs in non-adjacent segments.

A small portion of RST's ideas was incorporated into the annotation of argumentation performed by Kirschner et al. (2015) on student essays. The authors used standard argumentative Support and Attack relations, and to these added the coherence relations Sequence and Detail for capturing specific argumentative moves; the relation definitions are inspired by those used in RST.

Green (2010) proposed a "hybrid" tree representation called ArgRST, which combines RST's nuclearity principle and some of its relation definitions with additional annotations capturing aspects of argumentation: the analyst can add implicit statements to the tree (enthymemes in the argumentation), and in parallel to RST relations, the links between segments can also be labeled with relations from the scheme of Toulmin (1958) and with those proposed by Walton et al. (2008). Also, the representation allows for noncontiguous premises and conclusions. More recently, Green (2015) argued that the hybrid representation does not readily carry over to a different text genre (biomedical research articles), and she concluded that RST and argumentation structure operate on two levels that are subject to different motivations and constraints, and thus should be kept distinct.

We also subscribe to the view that (at least for many text genres) distinguishing rhetorical structure and argumentation structure is important for capturing the different aspects of a text's coherence on the one hand, and its pragmatic function on the other. Also, we wish to emphasize the conflict between segment adjacency (a central feature of RST's account of coherence) and non-adjacency (a pervasive phenomenon in the argumentative function of portions of text). Still, it remains to be seen to what extent an RST analysis can in principle support an argumentation analysis, e.g. in a pipeline architecture; shedding light on this question is our goal for this paper.

3 The corpus

Below we provide a very brief description of the data and annotations that we provided in (Stede et al., 2016); for more details, see that paper. Notice that the layers of annotation had been produced independently by different people, thus inviting a post-hoc comparison, which we will perform in the next sections. For reasons of space, we do not give further details on RST here; the interested reader should consult (Mann and Thompson, 1988) or (Taboada and Mann, 2006).

3.1 Data

The argumentative microtext corpus (Peldszus and Stede, 2016) is a freely available collection of 112 short texts that were collected from human subjects, originally in German. Subjects received a prompt on an issue of public debate, usually in the form of a yes/no question (e.g., Should shopping malls be open on Sundays?), and they were asked to provide their answer to the question along with arguments in support. They were encouraged to also mention potential objections. The target length suggested to the subjects was five sentences. After the texts were collected, they were professionally translated to English, so that the corpus is now available in two languages. An example of an English text is:

Health insurance companies should naturally cover alternative medical treatments. Not all practices and approaches that are lumped together under this term may have been proven in clinical trials, yet it's precisely their positive effect when accompanying conventional 'western' medical therapies that's been demonstrated as beneficial. Besides, many general practitioners offer such counselling and treatments in parallel anyway - and who would want to question their broad expertise?

In (Stede et al., 2016), two new annotation layers are introduced for the corpus: discourse structure in terms of RST, and in terms of Segmented Discourse Representation Theory (Asher and Lascarides, 2003). Importantly, these two as well as the argumentation annotation use an identical segmentation into elementary discourse units (EDUs).

3.2 Argumentation structure representation

The annotation of argumentation structure follows the scheme outlined in (Peldszus and Stede, 2013), which in turn is based on the work of Freeman (1991). It posits that the argumentative text has a central claim (henceforth: CC), which the author can back up with statements that are in a Support relation to it; this is a transitive relation, leading to "serial support" in Freeman's terms. A statement can also have multiple Supports; these can be independent (each Support works on its own) or linked (only the combination of two statements provides the Support). Also, the scheme distinguishes between "standard" and "example" support, whose function originates from providing an illustration or anecdotal evidence.

When the text mentions a potential objection, this segment is labeled as bearing the role of "opponent's voice"; this goes back to Freeman's insight that any argumentation, even if monological, is inherently dialectical. The segment will be in an Attack relation to another one (which represents the proponent's voice), and the scheme distinguishes between Rebut (denying the validity of a claim) and Undercut (denying the relevance of a premise for a claim). When the author proceeds to refute the attack, the attacking segment itself is subject to a Rebut or Undercut relation.

The building blocks of such an analysis are Argumentative Discourse Units, which often are larger than EDUs: multiple discourse segments can play a common argumentative role. In such cases, the EDUs are linked together by a meta-relation called Join. The argumentation and RST analyses of the sample text are shown in Figure 1.

[Figure 1: Example ARG and RST structure]

4 Matching RST and argumentation: Qualitative analysis

When introducing the corpus (Stede et al., 2016), we provided figures on how edges in the RST tree map to edges in the argumentation graph (which can be calculated straightforwardly, because both representations build on the same segmentation). We found that, ignoring the labels, 60% of the edges are common to both structures. As can be expected, argumentative Support mostly corresponds to Reason or Justification; however, 39% of the Supports do not have a corresponding RST edge. Furthermore, 72% of all Rebuts and 33% of Undercuts do not have a corresponding RST edge. Thus, the correspondences between the layers are certainly not trivial. In order to understand the mismatches, we undertook a qualitative analysis that focuses on the three central notions of the argumentation: the central claim and its mapping to RST nuclearity, the Support relations, and the configurations of Attack/Counterattack.
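Because both layers share one segmentation, the unlabelled edge overlap mentioned above can be computed with a few lines. The sketch below is illustrative only (not the released tooling) and assumes each analysis is given as a set of (source EDU, target EDU, label) triples; the overlap is normalised here by the number of ARG edges, which is one of several possible choices.

```python
def unlabelled_edges(triples):
    """Reduce (source, target, label) triples to undirected EDU pairs."""
    return {frozenset((src, tgt)) for src, tgt, _ in triples}

def edge_overlap(rst_triples, arg_triples):
    """Share of ARG edges that also occur, ignoring labels and
    direction, among the RST edges over the same segmentation."""
    rst = unlabelled_edges(rst_triples)
    arg = unlabelled_edges(arg_triples)
    return len(rst & arg) / len(arg) if arg else 0.0
```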

4.1 CC and Nuclearity

Recall that Azar (1999) already pointed out the importance of RST's notion of 'nucleus' for representing argumentation. To operationalize the analogy, it is important to make use of the "strong nuclearity principle" (Marcu, 2000), according to which the most important segment(s) of a text can be found by following the RST tree from its root down the nucleus links to the leaf nodes. If there are only mononuclear relations along the way, there is a single most important segment (henceforth: RSTnuc); otherwise, there are multiple ones. A natural first question therefore is whether the RSTnuc segment corresponds to the CC in argumentation. We found that for 95 texts, i.e., the vast majority of the 112 texts (85%), this is the case. Considering the goal of RST analysis, which is to capture the main intention of the writer, this is the expected default case.

But what happens in the 17 mismatches? In five cases, RSTnuc and ARGcc are indeed disjoint. Four of these are due to the thesis being stated early in the text and once again (as a paraphrase) later on. It is thus left to the annotator to decide which formulation s/he considers more apt to play the central role of the text, and these decisions happen to have led to different results in the four texts. In the fifth, the thesis is not explicitly stated; here, too, there are two plausible options for choosing the most important segment of the text.

In 12 texts, RSTnuc and ARGcc overlap, which can be due to two reasons. (i) In five texts, ARGcc consists of two EDUs, with the RSTnuc being one of them. This is due to an RST relation that is not argumentatively relevant (mostly Condition). (ii) Seven texts show the reverse situation: a multinuclear RST relation induces more than one RSTnuc. In three of these cases, this seems due to an unclear text; the author's position remains somewhat ambiguous, and the RST annotator considered different statements as equally important. The ARG annotation, on the other hand, was committed to making a decision on the CC (as stated in the guidelines). In the remaining four cases, we find minor differences in interpretation, where the RST decision might well be influenced by surface features, in particular the presence of coordinating conjunctions, which suggest a parallel structure for a coherence-oriented analysis. ARG analysis, on the other hand, encourages the annotator to abstract from linguistic realization and to consider the underlying pragmatic relationships.

4.2 Support

Of the 261¹ ARG-Supports, 132 have a corresponding edge in the associated RST tree, with a label that is clearly compatible with Support: Reason, Justify, Evidence, Motivation, or Cause. And of the 112 texts, 26 have only such canonical Supports (and three texts do not have Supports at all). Together these are 23% of the texts, so that 77% contain non-canonical Support. This calls for closer investigation, and we found two groups:

¹ This number diverges by 25 from that given by Stede et al. (2016), because for technical reasons they excluded from their statistics 10 texts that have discontinuous segments.

(i) 12 Support relations have a corresponding RST edge that is labeled with an "unexpected" relation: Elaboration, Background, Result, Interpretation, Antithesis, Concession, or a multinuclear relation. These are instances of the dichotomy between accounting for the local coherence versus the underlying argumentation; in fact, this corresponds to a discussion that originated shortly after the introduction of RST and pointed out the potential conflict between an "informational" versus an "intentional" analysis (Moore and Pollack, 1992).

(ii) 117 Support relations do not have a corresponding edge in the RST tree. The reasons can be subclassified as follows, with the observed frequency given in parentheses. (These attributes can combine, so the numbers add up to more than 117.)

- The RST segment participates in a multinuclear relation (List, Conjunction, Joint), or in the pseudo-relation Same-Unit. Hence it can be reached directly by following the respective edges. (70)
- Relation disagreement: the RST annotator did not see a Support-like relation, but used something else (most often Background or Elaboration). (21)
- Transitivity mismatch: ARG and RST annotations do not agree on serial versus joint support, i.e., whether a segment supports a claim directly or only indirectly. (16)
- Grain size: in a segment, RST uses a non-argumentative relation such as Condition, so that the nuclearity assignment does not match that of the segmentation in the ARG-Join relation. (9)
- Consequence of the different nuclearity structures we mentioned in the previous subsection. (5)
- Different or same reason: RST and ARG annotators differed in whether two segments constitute the same Reason/Support, or separate ones. (4)
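The strong nuclearity principle used in Section 4.1 amounts to a simple tree descent. The sketch below assumes a minimal RST node type (not the corpus format) in which each internal node records which of its children are nuclei.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RSTNode:
    edu_id: Optional[int] = None                 # set for leaves only
    children: List["RSTNode"] = field(default_factory=list)
    nucleus_children: List[int] = field(default_factory=list)  # indices into children

def rst_nuclei(node):
    """Follow nucleus links from the root down to the leaves
    (strong nuclearity, Marcu 2000). Mononuclear relations along the
    way yield a single RSTnuc; multinuclear relations can yield several."""
    if not node.children:
        return [node.edu_id]
    nuclei = []
    for i in node.nucleus_children:
        nuclei.extend(rst_nuclei(node.children[i]))
    return nuclei
```

Comparing rst_nuclei(root) with the annotated ARGcc gives the match, overlap, and mismatch counts discussed above.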

4.3 Attack

Finally, we study what RST constellations correspond to attack configurations in the ARG tree. For the time being, we do not distinguish Rebut from Undercut.² We discuss the cases in increasing order of complexity and give the number of texts where the instance occurs (which is almost identical to the number of instances).

² Intuitively, we see no evidence that this distinction is reflected in RST, but it needs to be determined quantitatively.

1. The text does not have any attacks in ARG. (16)

2. A single attack node in ARG, or a joined pair; these are leaf nodes. This is the situation where an attack is not being countered: the author considers his other Supports to implicitly outweigh the attack. (24) Variant: the attack is not a leaf but supported by another opponent-voice node. (7) We treat these together, and of the 31, 24 have a "canonical" RST counterpart: the attacking segment is also a leaf node, and it is connected via one of the RST relations Antithesis, Contrast, Concession. The remaining 7 have a "non-canonical" RST counterpart: the opponent voice is not reflected in the RST tree, or a local attachment of an attacking subordinate segment leads to a non-canonical relation.

3. Similar to (2), but instead of one there are two separate attack nodes in ARG. In all of these cases, the RST tree combines the two attacks in a Conjunction relation. (7)

4. The attack is being countered: an opponent-voice segment has both an outgoing and an incoming attack. For illustration, consider Figure 1 above (the "incoming" attack of node 2 there is an undercut). In general, there are three structural subclasses.

(i) Both attack and counterattack are individual segments (36), as is the case in Fig. 1. The structures can be straightforwardly compared to their RST correspondents as follows:
   - Canonical-a: the counterattack corresponds to a backward Concession/Antithesis, and the whole is the satellite of a canonical support relation (Reason, Justify, ...). (22) This is shown in Fig. 1.
   - Canonical-b: likewise, but the whole participates first in some multinuclear relation (List, Joint), which in turn is the satellite of a canonical support. (6)
   - Non-canonical: the RST annotator did not see the argumentative function as most important for capturing local coherence. (8)

(ii) Slightly more complex: the counterattack has more than one segment. (16)
   - Canonical: the counterattack subtree gets some RST analysis, and the overall construction is as described in the previous category. (13)
   - Non-canonical: reason as in (i). (3)

(iii) More complex: the attack has more than one segment. (8)
   - Canonical: the overall construction is as described above. (6)
   - Non-canonical: the Support corresponds to Interpretation/Elaboration. (2)
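The "canonical" check used in this analysis can be approximated as follows. This is only an illustrative sketch of the comparison (the actual annotation comparison may differ in detail); edges are again assumed to be (source, target, relation) triples over EDUs.

```python
CONTRASTIVE_RST = {"antithesis", "contrast", "concession"}

def has_canonical_counterpart(attack_edge, rst_triples):
    """An ARG attack (rebut/undercut) counts as canonical here if the
    same pair of segments is linked in the RST tree by a contrastive
    relation (Antithesis, Contrast or Concession)."""
    src, tgt, _ = attack_edge
    return any({src, tgt} == {a, b} and rel.lower() in CONTRASTIVE_RST
               for a, b, rel in rst_triples)
```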

4.4 Summary

We found a large proportion of CC, Support and Attack configurations to correspond to "canonical" configurations in RST trees, i.e., subtrees that intuitively reflect the argumentative functions (under the definitions of the RST relations). While so far we looked at the correspondence only in the direction ARG → RST, this result still suggests that an automatic mapping from RST to ARG trees can be feasible; this will be the topic of the next section. Furthermore, a central purpose of the manual analysis was to determine the reasons for mismatches, which can inform theoretical considerations on the relationship between RST and argumentation. For reasons of space, we cannot go into detail, but our central observation is that RST analysis is subject to a tension between accounting for the local coherence or for the global one using underlying intentions, i.e., the argumentation. As we noted earlier, this has been discussed in the RST community early on, but it has never been resolved. The issue is likely to be much more pronounced in longer texts than in the microtexts we are studying here. In principle, the specific RST annotation guidelines could ask annotators to clearly prefer one or the other perspective; this would shift the original goal of the theory, but probably would do better justice to the data.

Considering the option of annotating an "argumentation-oriented" RST tree, the question arises to what extent it can be theoretically adequate. Of central importance is the correspondence between RSTnuc and ARGcc; we found that for all the mismatches in the corpus, it is possible to construct a plausible alternative RST tree such that the two are identical or at least overlapping (when the granularities of the analyses don't match exactly). Another issue is the presence of crossing edges, which occur in seven ARG graphs in the corpus. Since this is likely to occur more often in longer texts, it remains a fundamental issue; we will return to it at the end.

5 Deriving argumentation structure from rhetorical structure automatically

In order to automatically map between RST and argumentation, it is very helpful to have both layers in the same technical format. To that end, our joint work with colleagues in Toulouse supplied a common dependency structure representation (Stede et al., 2016). In the following, we use that version of the corpus. For illustration, see Figure 2 for the dependency conversion of the example text.

[Figure 2: Dependency conversion example. (a) RST: relations reason, reason, concession, joint over segments 1–5. (b) ARG: relations support, rebut, undercut, link over segments 1–5.]

5.1 Models

We have implemented three different models: a simple heuristic tree transformation serves as a baseline, against which we compare two data-driven models. All models and their parameters are described in the following subsections.

In our study, we follow the experimental setup of (Peldszus and Stede, 2015). We use the same train-test splits, resulting from 10 iterations of 5-fold cross validation, and adopt their evaluation procedure, where the correctness of predicted structures is assessed in four subtasks:

- attachment (at): Given a pair of EDUs, are they connected? [yes, no]
- central claim (cc): Given an EDU, is it the central claim of the text? [yes, no]
- role (ro): Given an EDU, is it in the [proponent]'s or the [opponent]'s voice?
- function (fu): Given an EDU, what is its argumentative function? Here, we use the fine-grained relation set available in the data: [support, example, rebut, undercut, link, join]

Note that the argumentative role of each segment is not explicitly coded in the structures we predict below, but is inferred from the chain of supporting (role preserving) and attacking (role switching) relations from the central claim (by definition in the proponent's voice) to the segment of interest.

5.1.1 Heuristic baseline

The baseline model (BL) produces an argumentation structure that is isomorphic to the RST tree. RST relations are mapped to argumentative functions, based on the most frequently aligning class as reported in (Stede et al., 2016); see Figure 3. For the two relations marked with an asterisk, no direct edge alignments could be found, and thus we assigned them to the class of the non-argumentative join relation. The argumentative example and link relations were not frequent enough to be captured in this mapping.

Figure 3: Mapping of RST relations to ARG relations, used in the heuristic baseline.
  support:  background, cause, evidence, justify, list, motivation, reason, restatement, result
  rebut:    antithesis, contrast, unless
  undercut: concession
  join:     circumstance, condition, conjunction, disjunction, e-elaboration, elaboration, evaluation-s, evaluation-n, interpretation*, joint, means, preparation, purpose, sameunit, solutionhood*

We expect this baseline to be not an easy one to beat. It will predict the central claim correctly already for 85% of the texts, due to the correspondence described in Section 4.1. Also, as we saw above, 60% of the unlabelled edges should be mappable. Finally, the argumentative role is covered quite well, too: the chain of supporting and attacking relations determining the role is likely to be correct on an EDU basis, if the relation mapping is correct, and even if attachment is wrongly predicted.
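As a rough sketch of this baseline (not the authors' implementation), the transformation amounts to relabelling the edges of the RST dependency tree with the Figure 3 mapping; only an excerpt of the mapping is shown, and the tree is assumed to be a list of (source, target, rst_relation) triples.

```python
# Excerpt of the RST -> ARG mapping from Figure 3; the remaining
# relations would be added analogously.
RST_TO_ARG = {
    "reason": "support", "evidence": "support", "justify": "support",
    "antithesis": "rebut", "contrast": "rebut",
    "concession": "undercut",
    "condition": "join", "conjunction": "join", "elaboration": "join",
}

def heuristic_baseline(rst_edges):
    """Produce an ARG structure isomorphic to the RST dependency tree by
    mapping each RST relation to its most frequently aligned ARG function.
    Relations missing from the excerpt default to 'join' in this sketch."""
    return [(src, tgt, RST_TO_ARG.get(rel.lower(), "join"))
            for src, tgt, rel in rst_edges]
```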

5.1.2 Naive aligner

Our naive aligner model (A) learns the probability of subgraphs in the RST structure mapping to subgraphs of the argumentative structure.

For training, this model applies a subgraph alignment algorithm yielding connected components with n nodes occurring in the undirected, unlabelled version of both the RST and the argumentative structures. It extracts the directed, labelled subgraphs for these common components for both structures and learns the probability of mapping one to the other over the whole training corpus.

For prediction, all possible subgraphs of size n in the input RST tree are extracted. If one maps to an argumentation subgraph according to the mapping learned on the training corpus, the corresponding argumentation subgraph is added to an intermediary multi-graph. After all candidate subgraphs have been collected, all equal edges are combined and their individual probabilities accumulated. Finally, a tree structure is decoded from the intermediary graph using the minimum spanning tree (MST) algorithm (Chu and Liu, 1965; Edmonds, 1967).

The model can be instantiated with different subgraph sizes n. Choosing n = 2 only learns a direct mapping between RST and ARG edges. Choosing larger n can reveal larger structural patterns, including edges that cannot be directly aligned. Most importantly, the model can be trained with more than one subgraph size n: for example, model A-234 simultaneously extracts subgraphs of the sizes n = [2, 3, 4], so that the edge probabilities of differently large subgraphs add up.

The collected edges of all candidate subgraphs do not necessarily connect all possible nodes. In this case, no spanning tree can be derived. We thus initialize the intermediary multi-graph as a total graph with low-scored default edges of the type unknown. These should only be selected by the MST algorithm when there is no other evidence for connecting otherwise unconnected subgraphs. The number of predicted unknown edges thus serves as an indicator of the coverage of the learnt model. In evaluation, unknown edges are interpreted as the majority relation type, i.e. as support.

Finally, we added an optional root constraint (+r) to the model: it forbids outgoing edges from the node corresponding to the RST central nucleus, and therefore effectively enforces the ARG structure to have the same root as the RST tree.

5.1.3 Evidence graph model

We implemented a variant of the evidence graph model (EG) of (Peldszus and Stede, 2015). In this model, four base classifiers are trained for the four levels of the task (cc, ro, fu and at). For each possible edge, the predictions of these base classifiers are combined into one single edge score. Again, MST decoding is used to select the globally optimal tree structure.

The combined edge score reflects the probability of attachment, the probability of not being the central claim (similar to the root constraint in the alignment model), the probability of a role switch between the connected nodes, and the probability of the corresponding edge type. Jointly predicting these different levels has been shown to be superior to the single prediction of the base classifiers.

Our model differs from the original one in two respects: first, our model is trained on the new version of the corpus, featuring a finer segmentation into EDUs, and it considers the full relation set (in contrast to the reduced relation set of just Support and Attack). Second and more importantly, our base classifiers are trained exclusively on a new feature set reflecting aspects of the input RST tree, and do not use any linguistic features.

Figure 4: Segment feature sets.
  base features incl. 2-node subgraph features:
  - absolute and relative position of the segment in the text
  - binary feature whether it is the first or the last segment
  - binary feature whether it has incoming/outgoing edges
  - number of incoming/outgoing edges
  - binary feature for each type of incoming/outgoing edge
  3-node subgraph features:
  - all relation chains of length 2 involving this segment
  4-node subgraph features:
  - all relation chains of length 3 involving this segment

Figure 5: Segment-pair features.
  - direction of the potential link (forward or backward)
  - distance between the segments
  - whether there is an RST relation between the segments
  - type of the RST relation between the segments, or None
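Both the aligner and the evidence graph model share the decoding step: accumulate edge scores, keep the graph connected with low-scored unknown edges, and extract a maximum spanning arborescence. A minimal version of that step, using networkx's implementation of the Chu-Liu/Edmonds algorithm, could look as follows (the scores and labels are assumed to be given; this is not the authors' code).

```python
import networkx as nx

def decode_tree(n_segments, scored_edges):
    """scored_edges: dict mapping (src, tgt) -> (score, label).
    Builds a dense candidate graph and extracts the maximum spanning
    arborescence (Chu and Liu, 1965; Edmonds, 1967)."""
    g = nx.DiGraph()
    g.add_nodes_from(range(n_segments))
    for (src, tgt), (score, label) in scored_edges.items():
        g.add_edge(src, tgt, weight=score, label=label)
    # low-scored fallback edges of type 'unknown' keep the graph connected
    for src in range(n_segments):
        for tgt in range(n_segments):
            if src != tgt and not g.has_edge(src, tgt):
                g.add_edge(src, tgt, weight=1e-6, label="unknown")
    mst = nx.maximum_spanning_arborescence(g, attr="weight")
    return [(u, v, g[u][v]["label"]) for u, v in mst.edges()]
```

The number of "unknown" edges surviving in the decoded tree then serves as the coverage indicator described above.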

The segment features are shown in Figure 4. We distinguish three feature groups: base features including edges (EG-2), base features plus 3-node subgraph features (EG-23), and the latter plus 4-node subgraph features (EG-234). Base classifiers for the cc, ro, and fu levels are trained on segment features. The at-level base classifier is trained on segment features for the source and the target node, as well as on relational features, shown in Figure 5.

As in the original model, the base classifiers perform an inner cross-validation on the training data to optimize the hyperparameters of the log-linear SGD classifier (Pedregosa et al., 2011). We do not optimize the weighting of the base classifiers for score combination here, because we had shown in the original experiments that an equal weighting yields competitive results (Peldszus and Stede, 2015).

5.2 Results on gold RST trees

Scores are reported as averages over the 50 train-test splits, with macro-averaged F1 as the metric. For significance testing, we apply the Wilcoxon signed-rank test on the macro-averaged F1 scores and assume a significance level of α = 0.01. The evaluation results are shown in Table 1.

model       cc    ro    fu    at    unknown
BL          .861  .896  .338  .649
A-2         .578  .599  .314  .650  10.6%
A-23        .787  .744  .398  .707  7.5%
A-234       .797  .755  .416  .719  7.0%
A-2345      .794  .762  .424  .721  6.8%
A-2+r       .861  .681  .385  .682  13.9%
A-23+r      .861  .783  .420  .716  11.3%
A-234+r     .861  .794  .434  .723  10.8%
A-2345+r    .861  .800  .443  .725  10.7%
EG-bc-2     .899  .768  .526  .747
EG-bc-23    .907  .845  .525  .749
EG-bc-234   .906  .847  .526  .750
EG-2        .918  .843  .522  .744
EG-23       .919  .869  .526  .755
EG-234      .918  .868  .530  .754

Table 1: Evaluation scores of all models on the gold RST input trees, reported as macro-avg. F1.

All alignment models including at least subgraphs of size n=3 (A-23*) improve over the baseline (BL) in predicting the relation type (fu) and the attachment (at). Considering larger subgraphs helps even more, and it decreases the rate of unknown edges.³ On the role level, the baseline is unbeaten. For central claim identification, the alignment model performs poorly. Adding the root constraint yields exactly the baseline prediction for the central claim, but also improves the results on all other levels, at the cost of an increased rate of unknown edges. The clear improvement over the baseline for the relation type (fu) indicates that the probability estimates of the alignment models capture the relations better than the heuristic mapping to the most frequently aligning class in the baseline. Furthermore, extraction of larger subgraphs gradually increases the result on both the fu and the at level, showing us that there are subgraph regularities to be learnt which are not captured when assuming isomorphic trees.

³ Note that when testing the A-234 model on training data, only very few unknown edges are predicted (less than 1%), which indicates that more data might help to fully cover all of them.

For the evidence graph model, we first investigate the performance of the base classifiers (EG-bc-*) before we discuss the results of the decoder. The difference between the three feature sets is most important here. Comparing the classifier that only uses the basic feature set (EG-bc-2) against the one with extra features for 3-node subgraphs (EG-bc-23), we find the greatest improvement on the argumentative role level, with an extra +7.7 points macro F1 score. Central claim identification also profits, with a minor gain of +0.8 points. Interestingly, the local models for function and attachment are not affected by the richer feature sets. Extending the features even to 4-node subgraphs (EG-bc-234) does not further improve the results on any level.

The evidence graph decoding models (EG-*) combine the predictions of the base classifiers into one globally optimal structure. The model using the base classifiers with the smallest feature set (EG-2) already outperforms the best alignment model on all levels significantly and beats the baseline on all levels but argumentative role. We attribute this improvement to three aspects of the model: first, the learning procedure of the base classifiers is superior to that of the alignment model. Second, the base classifiers not only learn regularities between RST and ARG but also positional properties of the target structures. Finally, the joint prediction of the different levels in the evidence graph model helps to compensate for weaknesses of the local models by enforcing constraints in the combination of the individual predictions: com-
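The significance test used here can be reproduced with scipy. The sketch below is only illustrative and assumes the per-split macro-F1 scores of two models are available as paired arrays.

```python
from scipy.stats import wilcoxon

def significantly_different(scores_a, scores_b, alpha=0.01):
    """Wilcoxon signed-rank test on paired macro-F1 scores from the
    same 50 train-test splits (alpha = 0.01 as in the paper)."""
    statistic, p_value = wilcoxon(scores_a, scores_b)
    return p_value < alpha, p_value
```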

paring the base classifiers' predictions (EG-bc-2) with the decoded predictions (EG-2), we observe a boost of +7.5 points macro F1 on the role level and a small boost of +1.9 points for central claim through joint prediction.

Adding features for larger subgraphs further improves the results: EG-23 beats EG-2 on all levels, but the improvement is significant only for role and attachment. EG-234, though, differs from EG-23 only marginally and on no level significantly. Note that the gain from joint prediction is less strong with better base classifiers, but still valuable, with +2.4 points on the role level and +1.2 points for central claim.

In conclusion, the baseline model remained unbeaten on the level of argumentative role. This was already expected, as the sequence of contrastive relations in the RST tree is very likely to map to a correct sequence of proponent and opponent role assignments. On all other levels, the best results for mapping gold RST trees to fine-grained argumentation structures are achieved by the EG-23(4) model.

6 Summary and Outlook

We presented the first empirical study on the relationship between discourse structure (here in terms of Rhetorical Structure Theory) and argumentation structure. In the qualitative analysis, we found a large proportion of "canonical" correspondences between RST subtrees and the central notions of argumentation, with the remaining mismatches being due to an inherent ambiguity of RST analysis (informational versus intentional) and to more technical aspects of granularity (multinuclear relations). By using annotation guidelines that "drive" the annotator toward capturing underlying argumentation, the correspondence could be considerably higher. There remain problems with non-adjacency in the ARG structure, however. These are likely to increase when texts are larger than our microtexts.

For mapping the gold RST trees to ARG structure, we compared three mapping mechanisms: a heuristic baseline, transforming RST trees to isomorphic trees with corresponding argumentative relations; a simple aligner, extracting matching subgraph pairs from the corpus and applying them to unseen structures; and one fairly elaborate evidence graph model, which trains four classifiers and combines their predictions for decoding globally optimal structures. The latter achieved promising results (with the exception of prima facie low numbers for argumentative function, but recall that we are using a much larger tag set than all the related work). This confirms the conclusion from the qualitative study, and it invites the next step, which is to use our mapping procedure on the predictions of state-of-the-art RST parsers.

References

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.

M. Azar. 1999. Argumentative text as rhetorical structure: An application of Rhetorical Structure Theory. Argumentation, 13:97–114.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Jack Edmonds. 1967. Optimum Branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

James B. Freeman. 1991. Dialectics and the Macrostructure of Argument. Foris, Berlin.

Nancy Green. 2010. Representation of argumentation in text with Rhetorical Structure Theory. Argumentation, 24:181–196.

Nancy Green. 2015. Annotating evidence-based argumentation in biomedical text. In Proceedings of the IEEE Workshop on Biomedical and Health Informatics.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, June. Association for Computational Linguistics.

Shafiq Joty, Giuseppe Carenini, and Raymond Ng. 2015. Codra: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.

Christian Kirschner, Judith Eckle-Kohler, and Iryna Gurevych. 2015. Linking the thoughts: Analysis of argumentation structures in scientific publications. In Proceedings of the 2015 NAACL-HLT Conference, June. Association for Computational Linguistics.

William Mann and Sandra Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. TEXT, 8:243–281.

Daniel Marcu. 2000. The theory and practice of discourse parsing and summarization. MIT Press, Cambridge/MA.

Johanna Moore and Martha Pollack. 1992. A problem for RST: The need for multi-level discourse analysis. Computational Linguistics, 18(4):537–544.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to automatic argument mining: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Andreas Peldszus and Manfred Stede. 2015. Joint prediction in mst-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 938–948, Lisbon, Portugal, September. Association for Computational Linguistics.

Andreas Peldszus and Manfred Stede. 2016. An annotated corpus of argumentative microtexts. In Argumentation and Reasoned Action: Proceedings of the 1st European Conference on Argumentation, Lisbon 2015 / Vol. 2, pages 801–816, London. College Publications.

Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher, and Jérémy Perret. 2016. Parallel discourse annotations on a corpus of short texts. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portorož.

Maite Taboada and William Mann. 2006. Rhetorical Structure Theory: Looking back and moving ahead. Discourse Studies, 8(4):423–459.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press, Cambridge.

D. Walton, C. Reed, and F. Macagno. 2008. Argumentation schemes. Cambridge University Press, Cambridge.

Recognizing the Absence of Opposing Arguments in Persuasive Essays

Christian Stab† and Iryna Gurevych†‡
†Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
‡Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research
www.ukp.tu-darmstadt.de

Abstract

In this paper, we introduce an approach for recognizing the absence of opposing arguments in persuasive essays. We model this task as a binary document classification and show that adversative transitions in combination with unigrams and syntactic production rules significantly outperform a challenging heuristic baseline. Our approach yields an accuracy of 75.6% and 84% of human performance in a persuasive essay corpus with various topics.

1 Introduction

Developing well-reasoned arguments is an important ability and constitutes an important part of education programs (Davies, 2009). A frequent mistake when writing argumentative texts is to consider only arguments supporting one's own standpoint and to ignore opposing arguments (Wolfe and Britt, 2009). This tendency to ignore opposing arguments is known as myside bias or confirmation bias (Stanovich et al., 2013). It has been shown that guiding students to include opposing arguments in their writings significantly improves the argumentation quality, the precision of claims and the elaboration of reasons (Wolfe and Britt, 2009). Therefore, it is likely that a system which automatically recognizes the absence of opposing arguments can effectively guide students to improve their argumentation. For the same reason, the writing standards of the Common Core Standard¹ require that students are able to clarify the relation between their own standpoint and opposing arguments on a controversial topic.

Existing structural approaches to argument analysis like the argumentation structure parser presented by Stab and Gurevych (2016) or the approach introduced by Peldszus and Stede (2015a) recognize the internal microstructure of arguments. Although these approaches can be exploited for identifying opposing arguments, they require several consecutive analysis steps like separating argumentative from non-argumentative text units (Moens et al., 2007), recognizing the boundaries of argument components (Goudas et al., 2014) and classifying individual arguments as support or oppose (Somasundaran and Wiebe, 2009). Certainly, an advantage of structural approaches is that they recognize the position of opposing arguments in text. However, knowing the position of opposing arguments is only relevant for positive feedback to the author and irrelevant for negative feedback, i.e. pointing out that opposing arguments are missing. Therefore, it is reasonable to model the recognition of missing opposing arguments as a document classification task.

The contributions of this paper are the following: first, we introduce a corpus for detecting the absence of opposing arguments that we derive from argument structure annotated essays. Second, we propose a novel model and a new feature set for detecting the absence of opposing arguments in persuasive essays. We show that our model significantly outperforms a strong heuristic baseline and an existing structural approach. Third, we show that our model achieves 84% of human performance.

¹ www.corestandards.org

2 Related Work

Existing approaches in computational argumentation focus primarily on the identification of arguments, their components (e.g. claims and premises) (Rinott et al., 2015; Levy et al., 2014) and structures (Mochales-Palau and Moens, 2011; Stab and Gurevych, 2014b).

113 Proceedings of the 3rd Workshop on Argument Mining, pages 113–118, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics are few approaches which distinguish between and class distribution for detecting the absence of supporting and opposing arguments. opposing arguments at the document-level. Each Peldszus and Stede (2015b) use lexical, con- essay in this corpus is annotated with argumen- textual and syntactic features to classify argu- tation structures that allow to derive document- ment components as support or oppose. They level annotations. The argumentation structures experiment with pro/contra columns of a Ger- include arguments supporting or opposing the au- man newspaper and German microtexts. Sim- thor’s stance. Accordingly, we consider an essay ilarly, their minimum spanning tree (MST) ap- as negative if it solely includes supporting argu- proach identifies the structure of arguments and ments and as positive if it includes at least one op- recognizes if an argument component belongs to posing argument. Note that the manual identifica- the proponent or opponent (Peldszus and Stede, tion of opposing arguments is a subtask of the ar- 2015a). However, both approaches presuppose gumentation structure identification. Both require that the components of an argument are already that the annotators identify the author’s stance, known. Thus, they omit important analysis steps the individual arguments and if an argument sup- and cannot be applied directly for recognizing ports or opposes the author’s stance. Thus, deriv- the absence of opposing arguments. Stab and ing document-level annotations from argumenta- Gurevych (2016) present an argumentation struc- tion structures is a valid approach since the deci- ture parser that includes all required steps for iden- sions of the annotators in both tasks are equivalent. tifying argument structures and supporting and op- posing arguments. First, they separate argumenta- 3.1 Inter-Annotator Agreement tive from non-argumentative text units using con- To verify that the derived document-level anno- ditional random fields (CRF). Second, they jointly tations are reliable, we compare the annotations model the argument component types and argu- derived from the argumentation structure annota- mentative relations using integer linear program- tions of three independent annotators. In particu- ming (ILP) and finally they distinguish between lar, we determine the inter-annotator agreement on supporting and opposing arguments. We employ a subset of 80 essays. The comparison shows an this parser as a structural approach and compare it observed agreement of 90%. We obtain substan- to our document classification approach for recog- tial chance-corrected agreement scores of Fleiss’ nizing the absence of opposing arguments in per- κ = .786 (Fleiss, 1971) and Krippendorff’s α = suasive essays. .787 (Krippendorff, 2004). Thus, we conclude that Another related area is stance recognition that the derived annotations are reliable since they are aims at identifying the author’s stance on a con- only slightly below the “good reliability thresh- troversy by labeling a document as either “for” or old” proposed by Krippendorff (2004). “against” (Somasundaran and Wiebe, 2009; Hasan and Ng, 2014). Consequently, stance recognition 3.2 Statistics systems are designed to identify the predominant Table 1 shows an overview of the corpus. It in- stance of a text instead of recognizing the presence cludes 402 essays. 
On average each essay includes of less conspicuous opposing arguments. 18 sentences and 366 tokens. Other approaches on argumentation in essays Tokens 147,271 focus on thesis clarity (Persing and Ng, 2013), ar- Sentences 7,116 gumentation schemes (Song et al., 2014) or argu- Documents 402 mentation strength (Persing and Ng, 2015). We Negative 251 (62.4%) Positive 151 (37.6%) are not aware of any approach that focuses on rec- ognizing the absence of opposing arguments. Table 1: Size and class distribution of the corpus.

3.2 Statistics

Table 1 shows an overview of the corpus. It includes 402 essays. On average, each essay includes 18 sentences and 366 tokens.

Table 1: Size and class distribution of the corpus.
  Tokens       147,271
  Sentences      7,116
  Documents        402
  Negative         251 (62.4%)
  Positive         151 (37.6%)

The class distribution is skewed towards negative essays. The corpus includes 251 (62.4%) essays that do not include opposing arguments and 151 (37.6%) positive essays. For encouraging future research, the corpus is freely available.2

2 https://www.ukp.tu-darmstadt.de/data

4 Approach

We consider the recognition of opposing arguments as a binary document classification task. Due to the size of the corpus and to prevent errors in model assessment stemming from a particular data splitting (Krstajic et al., 2014), we employ a stratified and repeated 5-fold cross-validation setup. We report the average evaluation scores and the standard deviation over the 100 folds resulting from 20 iterations. For model selection, we randomly sample 10% of the training set of each run as a development set. We report accuracy, macro precision, macro recall and macro F1 scores as described by Sokolova and Lapalme (2009, p. 430).3 We employ the Wilcoxon signed-rank test on macro F1 scores for significance testing (significance level = .005).

We preprocess the essays using several models from the DKPro framework (Eckart de Castilho and Gurevych, 2014). For tokenization, sentence and paragraph splitting, we employ the LanguageTool segmenter4 and check for line breaks. We lemmatize each token using the mate tools lemmatizer (Bohnet et al., 2013) and apply the Stanford parser (Klein and Manning, 2003) for constituency and dependency parsing. Finally, we use a PDTB parser (Lin et al., 2014) and a sentiment analyzer (Socher et al., 2013) for identifying discourse relations and sentence-level sentiment scores. As a learner, we choose a support vector machine (SVM) (Cortes and Vapnik, 1995) with a polynomial kernel implemented in Weka (Hall et al., 2009). For extracting features, we use the DKPro TC framework (Daxenberger et al., 2014).

4.1 Features

We experiment with the following features:

Unigrams (uni): In order to capture the lexical characteristics of an essay, we extract binary and case-sensitive unigrams.

Dependency triples (dep): The binary dependency features include triples consisting of the lemmatized governor, the lemmatized dependent and the dependency type.

Production rules (pr): We employ binary production rules extracted from the constituent parse trees (Lin et al., 2009) that occur at least five times.

Adversative transitions (adv): We assume that opposing arguments are frequently signaled by lexical indicators. We use 47 adversative transitional phrases that are compiled as a learning resource5 and grouped into the following categories: concession (18), conflict (12), dismissal (9), emphasis (5) and replacement (3). For each of the five categories, we add two binary features set to true if a phrase of the category is present in the surrounding paragraphs (introduction or conclusion) or in a body paragraph.6 Note that we consider lowercase and uppercase versions of these features, which results in a total of 20 binary features.

Sentiment features (sent): We average the five sentiment scores of all essay sentences for determining the global sentiment of an essay. In addition, we count the number of negative sentences and define a binary feature indicating the presence of a negative sentence.

Discourse relations (dis): The binary discourse features include the type of the discourse relation and indicate if the relation is implicit or explicit. For instance, "Contrast imp" indicates an implicit contrast relation. Note that we only consider the discourse relations of body paragraphs, since the introduction frequently includes a description of the controversy which is not relevant to the author's argumentation and whose discourse relations could be misleading for the learner.

3 Since the macro F1 score assigns equal weight to classes, it is well-suited for evaluating experiments with skewed data.
4 www.languagetool.org
5 www.msu.edu/~jdowell/135/transw.html
6 We identify paragraphs by checking for line breaks and consider the first paragraph as introduction, the last as conclusion and all remaining ones as body paragraphs.
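The adversative-transition features described above lend themselves to a compact implementation. The sketch below (our own, not the authors' code) shows one way to derive the 20 binary features; the phrase lists are illustrative placeholders rather than the actual 47-phrase resource, paragraph detection by line breaks follows footnote 6, and the essay is assumed to have at least an introduction and a conclusion paragraph.

ADVERSATIVE = {                       # hypothetical example phrases per category
    "concession": ["admittedly", "even though"],
    "conflict": ["however", "on the other hand"],
    "dismissal": ["either way", "in any case"],
    "emphasis": ["more importantly"],
    "replacement": ["instead", "rather"],
}

def split_paragraphs(essay_text):
    paragraphs = [p for p in essay_text.split("\n") if p.strip()]
    surrounding = paragraphs[0] + " " + paragraphs[-1]   # introduction and conclusion
    body = " ".join(paragraphs[1:-1])                    # all remaining paragraphs
    return surrounding, body

def adversative_features(essay_text):
    surrounding, body = split_paragraphs(essay_text)
    features = {}
    for category, phrases in ADVERSATIVE.items():
        for scope_name, scope_text in (("surrounding", surrounding), ("body", body)):
            # lowercase and capitalized variants: 5 categories x 2 scopes x 2 = 20 features
            features[f"adv_{category}_{scope_name}_lower"] = any(
                phrase in scope_text.lower() for phrase in phrases)
            features[f"adv_{category}_{scope_name}_upper"] = any(
                phrase.capitalize() in scope_text for phrase in phrases)
    return features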
4.2 Baselines

For model assessment, we use the following two baselines: First, we employ a majority baseline that classifies each essay as negative (not including opposing arguments). Second, we employ a rule-based heuristic baseline that classifies an essay as positive if it includes the case-sensitive term "Admittedly" or the phrase "argue that", which often indicate the presence of opposing arguments.7

7 We recognized these indicators by ranking n-grams using information gain.

4.3 Results

In order to select a model and to analyze our features, we conduct feature ablation tests (lower part of Table 2) and evaluate our system with individual features.

The adversative transitions and unigrams are the most informative features. Both show the best individual performance and a significant decrease if removed from the entire feature set. Thus, we conclude that lexical indicators are the most predictive features in our feature set. The sentiment and discourse features do not perform well. Individually they do not achieve better results than the majority baseline, and the accuracy increases slightly when removing them from the entire feature set. By experimenting with various feature combinations, we found that combining unigrams, production rules and adversative transitions yields the best results (SVM uni+pr+adv).

Table 2: Results of the best performing model on the test data and selected results of the model selection experiments on the development data († significant improvement over Baseline Heuristic; ‡ significant difference compared to SVM all features; * determined on a subset of 80 essays).

                       Accuracy     Macro F1     Precision    Recall       F1 Negative  F1 Positive
Model assessment on test data
Human Upper Bound*     .900±.010    .894±.011    .895±.011    .892±.014    .865±.016    .921±.008
Baseline Majority      .624±.001    .384±.000    .312±.001    .500±.000    .769±.001    .000±.000
Baseline Heuristic     .711±.039    .679±.050    .715±.059    .646±.045    .797±.027    .497±.083
SVM uni+pr+adv†        .756±.044    .734±.048    .747±.049    .721±.050    .814±.034    .639±.075
Model selection and feature ablation on development data
SVM all w/o uni        .733±.060    .708±.087    .768±.110    .660±.073    .817±.038    .496±.151
SVM all w/o dep‡       .765±.077    .745±.087    .762±.092    .731±.086    .822±.059    .649±.125
SVM all w/o pr         .760±.062    .738±.082    .781±.097    .701±.074    .830±.042    .583±.138
SVM all w/o adv        .736±.066    .709±.090    .756±.108    .670±.079    .816±.044    .524±.151
SVM all w/o sent‡      .756±.064    .733±.085    .778±.100    .696±.076    .828±.043    .572±.146
SVM all w/o dis        .757±.061    .734±.082    .780±.097    .696±.075    .829±.041    .571±.143
SVM uni+pr+adv         .770±.071    .750±.081    .767±.086    .735±.080    .825±.055    .656±.118
SVM all features       .755±.064    .732±.086    .776±.102    .695±.077    .827±.044    .569±.149

For model assessment, we evaluate the best performing model on our test data and compare it to the baselines (upper part of Table 2). The heuristic baseline considerably outperforms the majority baseline and achieves an accuracy of 71.1%. Our best system significantly outperforms this challenging baseline with respect to all evaluation measures. It achieves an accuracy of 75.6% and a macro F1 score of .734. We determine the human upper bound by comparing pairs of annotators and averaging the results over the 80 independently annotated essays (cf. Section 3). Compared to the upper bound, our system achieves 14.4% less accuracy and 84% of human performance.

We compare our system to an argumentation structure parser that recognizes opposing components on a designated 80:20 train-test split (Stab and Gurevych, 2016). We consider essays with predicted opposing arguments as positive, and as negative if the parser does not recognize an opposing argument. This yields a macro F1 score of .648. Our document-level approach, with a macro F1 score of .710, considerably outperforms the component-based approach. Thus, we can confirm our assumption that modeling the task as document classification outperforms structural approaches.

4.4 Error Analysis

To analyze frequent errors of our system, we manually investigate essays that are misclassified in all 100 runs of the repeated cross-validation experiment on the development set. In total, 29 positive essays are consistently misclassified as negative. As the reason for these errors, we found that the opposing arguments in these essays lack lexical indicators. In addition, we found 14 negative essays which are always misclassified as positive. Among these essays, we observe that the majority includes opposition indicators (e.g. "but") which are used in another sense (e.g. expansion). Therefore, the investigation of both false negatives and false positives shows that most errors are due to misleading lexical signals. Consequently, word-sense disambiguation for identifying senses or the integration of domain and world knowledge in the absence of lexical signals could further improve the results.

5 Conclusion

We introduced the novel task of recognizing the absence of opposing arguments in persuasive essays. In contrast to existing structural approaches, we model this task as a document classification task which does not presuppose several complex analysis steps. The analysis of several features showed that adversative transitions and unigrams are the most indicative features for this task.

We showed that our best model significantly outperforms a strong heuristic baseline, yields a promising accuracy of 75.6%, outperforms a structural approach, and achieves 84% of human performance. For future work, we plan to integrate the system in writing environments and to investigate its effectiveness for fostering argumentation skills.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806 and by the German Federal Ministry of Education and Research (BMBF) as a part of the Software Campus project AWS under grant No. 01|S12054. We thank Anshul Tak for his valuable contributions.

References

Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, and Jan Hajič. 2013. Joint morphological and syntactic analysis for richly inflected languages. Transactions of the Association for Computational Linguistics, 1:415–428.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Peter Davies. 2009. Improving the quality of students' arguments through 'assessment for learning'. Journal of Social Science Education (JSSE), 8(2):94–104.

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych, and Torsten Zesch. 2014. DKPro TC: A Java-based framework for supervised learning experiments on textual data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL '14, pages 61–66, Baltimore, MD, USA.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Artificial Intelligence: Methods and Applications, volume 8445 of Lecture Notes in Computer Science, pages 287–299. Springer International Publishing.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP '14, pages 751–762, Doha, Qatar.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1, ACL '03, pages 423–430, Sapporo, Japan.

Klaus Krippendorff. 2004. Content Analysis: An Introduction to its Methodology. Sage, 2nd edition.

Damjan Krstajic, Ljubomir J. Buturovic, David E. Leahy, and Simon Thomas. 2014. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics, 6(10):1–15.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of the 25th International Conference on Computational Linguistics, COLING '14, pages 1489–1500, Dublin, Ireland.

Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. 2009. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 343–351, Stroudsburg, PA, USA.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2014. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, 20(2):151–184.

Raquel Mochales-Palau and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.

Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, ICAIL '07, pages 225–230, Stanford, CA, USA.

Andreas Peldszus and Manfred Stede. 2015a. Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 938–948, Lisbon, Portugal.

Andreas Peldszus and Manfred Stede. 2015b. Towards detecting counter-considerations in text. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 104–109, Denver, CO.

Isaac Persing and Vincent Ng. 2013. Modeling thesis clarity in student essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL '13, pages 260–269, Sofia, Bulgaria.

Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL '15, pages 543–552, Beijing, China.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 440–450, Lisbon, Portugal.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP '13, pages 1631–1642, Seattle, WA, USA.

Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437.

Swapna Somasundaran and Janyce Wiebe. 2009. Recognizing stances in online debates. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, ACL '09, pages 226–234, Suntec, Singapore.

Yi Song, Michael Heilman, Beata Beigman Klebanov, and Paul Deane. 2014. Applying argumentation schemes for essay scoring. In Proceedings of the First Workshop on Argumentation Mining, pages 69–78, Baltimore, MD, USA.

Christian Stab and Iryna Gurevych. 2014a. Annotating argument components and relations in persuasive essays. In Proceedings of the 25th International Conference on Computational Linguistics, COLING '14, pages 1501–1510, Dublin, Ireland.

Christian Stab and Iryna Gurevych. 2014b. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP '14, pages 46–56, Doha, Qatar.

Christian Stab and Iryna Gurevych. 2016. Parsing argumentation structures in persuasive essays. arXiv preprint arXiv:1604.07370.

Keith E. Stanovich, Richard F. West, and Maggie E. Toplak. 2013. Myside bias, rational thinking, and intelligence. Current Directions in Psychological Science, 22(4):259–264.

Christopher R. Wolfe and M. Anne Britt. 2009. Argumentation schema and the myside bias in written argumentation. Written Communication, 26(2):183–209.

Expert Stance Graphs for Computational Argumentation

Orith Toledo-Ronen, Roy Bar-Haim, Noam Slonim
IBM Research - Haifa
{oritht,roybar,noams}@il.ibm.com

Abstract

We describe the construction of an Expert Stance Graph, a novel, large-scale knowledge resource that encodes the stance of more than 100,000 experts towards a variety of controversial topics. We suggest that this graph may be valuable for various fundamental tasks in computational argumentation. Experts and topics in our graph are Wikipedia entries. Both automatic and semi-automatic methods for building the graph are explored, and manual assessment validates the high accuracy of the resulting graph.

1 Introduction

Background knowledge plays an important role in many NLP tasks. However, computational argumentation is one area where little work has been done on developing specialized knowledge resources.

In this work we introduce a novel knowledge resource that may support various tasks related to argumentation mining and debating technologies. This large-scale resource, termed Expert Stance Graph, is built from Wikipedia, and provides background knowledge on the stance of experts towards debatable topics.

As a motivating example, consider the following stance classification setting, where the polarity of the following expert opinion on Atheism (Pro or Con) should be determined:

  Dawkins sums up his argument and states, "The temptation (to attribute the appearance of a design to actual design itself) is a false one, because the designer hypothesis immediately raises the larger problem of who designed the designer. The whole problem we started out with was the problem of explaining statistical improbability. It is obviously no solution to postulate something even more improbable." (Dawkins, 2006, p. 158)

Inferring the stance directly from the above text is a difficult and complex task. However, this complexity may be circumvented by utilizing background knowledge about (Richard) Dawkins, who is a well-known atheist. Dawkins' page in Wikipedia1 includes various types of evidence for his stance towards atheism:

1. Categories: Dawkins belongs to the following Wikipedia categories: Antitheists, Atheism activists, Atheist feminists and Critics of religions2.
2. Article text: The article text contains statements such as "Dawkins is a noted atheist" and "Dawkins is an outspoken atheist".
3. Infobox: Dawkins has a known-for relation with "criticism of religion".

2 Expert Stance Graphs

The Expert Stance Graph (ESG) is a directed bipartite graph comprising two types of nodes: (a) concept nodes, which represent debatable topics such as Atheism, Abortion, Gun control and Same-sex marriage, and (b) expert nodes, representing persons whose stance towards one or more of the concepts can be inferred from Wikipedia. Stance is represented as labeled directed edges from an expert to a concept, e.g. Richard Dawkins --Pro--> Atheism. Each concept and each expert have their own article in Wikipedia. We use the term Experts inclusively to refer to academics, writers, religious figures, politicians, activists, and so on.

1 https://en.wikipedia.org/wiki/Richard_Dawkins
2 Inferring a Pro stance for Atheism from Critics of religions depends on knowing the contrast relation between Atheism and Religion.
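As a concrete illustration of the structure just described, the following is a minimal sketch of one possible in-memory representation of an ESG as a labeled bipartite edge set; the class name and the example entry are ours and are not part of the released resource.

from collections import defaultdict

class ExpertStanceGraph:
    def __init__(self):
        self.edges = {}                      # (expert, concept) -> "Pro" | "Con"
        self.by_concept = defaultdict(list)  # concept -> [(expert, stance), ...]

    def add_stance(self, expert, concept, stance):
        self.edges[(expert, concept)] = stance
        self.by_concept[concept].append((expert, stance))

    def experts_for(self, concept, stance=None):
        pairs = self.by_concept.get(concept, [])
        return [e for e, s in pairs if stance is None or s == stance]

esg = ExpertStanceGraph()
esg.add_stance("Richard Dawkins", "Atheism", "Pro")
print(esg.experts_for("Atheism", stance="Pro"))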

3 Applications

Expert opinions are highly valuable for making persuasive arguments, and expert evidence (premise) is a commonly used type of argumentation scheme (Walton et al., 2008). Rinott et al. (2015) describe a method for automatic evidence detection in Wikipedia articles. Three common evidence types are explored: Study, Expert, and Anecdotal. The proposed method uses type-specific features for detecting evidence. For instance, in the case of expert evidence, a lexicon of words describing persons and organizations with relevant expertise is used.

The process of incorporating expert opinions on a given topic into an argument involves several steps. First, we need to retrieve from our corpus articles that contain expert opinions related to the given topic. Second, the exact boundaries of these opinions should be identified. Finally, the stance of the expert opinion towards the topic (Pro or Con) should be determined, to ensure it matches the stance of the argument we are making. Each of these steps is a challenging task by itself.

The expert stance graph may facilitate each of the above subtasks. If an expert E is known to be a supporter or an opponent of some topic T, then the Wikipedia page of E is likely to contain relevant opinions on T. Furthermore, a mention of E can be a useful feature for identifying relevant expert opinions for T in a given article.

Finally, perhaps the most important use of the graph for expert evidence is stance classification. Previous work on stance classification has shown that it can be much improved by utilizing external information beyond the text itself. For example, posts by the same author on the same topic are expected to have the same stance (Thomas et al., 2006; Hasan and Ng, 2013). Similarly, as shown in the previous example, external knowledge on expert stance towards a topic can improve stance classification of expert opinions.

4 Building the Graph

We consider two complementary settings for building the graph: (a) Offline, in which the set of concepts is predefined and minimal human supervision is allowed, and (b) Online, where our goal is to find ad-hoc Pro and Con experts for an unseen concept, in a fully-automatic fashion. For both settings, our approach is based on Wikipedia categories and lists, which have several advantages: (a) they provide easy access to large collections of experts, (b) their stance classification is relatively easy, and (c) their hierarchical structure can be exploited.

4.1 Concepts

Offline construction of the graph starts with deriving the set of concepts. We started with Wikipedia's list of controversial issues3, which contains about 1,000 Wikipedia entries, grouped into several top-level categories. We manually selected a subset of 12 categories4 and filtered out the remaining 3 categories.

One of the authors selected from the remaining list concepts that represent a two-sided debate (Meaning of life, for instance, is a controversial topic but does not represent a two-sided debate). Persons and locations were filtered out as well. This list was expanded manually by identifying relevant concepts in Wikipedia article titles that contain the words "Debate" or "Controversy". Finally, two annotators assessed the resulting list according to the above guidelines. Concepts that were rejected by both annotators were removed. The final list contained 201 concepts.

4.2 Candidate Expert Categories

Next, we search relevant Wikipedia categories and lists for each concept. The process starts with creating search terms. The concept itself is a search term, as well as any lexical derivation of the concept that represents a person (e.g. Atheism → Atheist), which we term person derivations. Person derivations are found using WordNet (Miller, 1995; Fellbaum, 1998): we look for lexical derivations of the concept that have "person" as a direct or inherited hypernym.

We then find all Wikipedia categories and lists5 that contain the search terms. For example, given the search terms atheism and atheist, some of the categories found are Atheism activists, American atheists, List of atheist authors, Converts to Christianity from atheism or agnosticism, and Critics of atheism.

3 https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues
4 The selected categories were Politics and economics, History, Religion, Science, biology and health, Sexuality, Entertainment, Environment, Law and order, Philosophy, Psychiatry, Technology, and Sports. The excluded categories were Linguistics, Media and culture, and People.
5 Lists are Wikipedia pages whose title begins with "List of".
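The person-derivation step can be illustrated with NLTK's WordNet interface. The sketch below is our own approximation of the described procedure, not the authors' implementation; it collects derivationally related forms of a concept word whose synsets have "person" as a direct or inherited hypernym.

from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def person_derivations(concept_word):
    person = wn.synset("person.n.01")
    results = set()
    for synset in wn.synsets(concept_word):
        for lemma in synset.lemmas():
            for derived in lemma.derivationally_related_forms():
                derived_synset = derived.synset()
                # walk up the hypernym hierarchy of the derived form
                hypernym_closure = set(derived_synset.closure(lambda s: s.hypernyms()))
                if person in hypernym_closure:
                    results.add(derived.name())
    return results

print(person_derivations("atheism"))   # expected to include "atheist"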

The set of categories is further expanded with subcategories of the categories found in the previous step. This step adds more relevant categories that do not contain the search terms, such as Antitheists for Atheism. To avoid topic drifting, we only add one level of subcategories.

Next, the persons associated with each category6 are identified by considering outgoing links from the category page which are of type "Person", based on DBpedia's rdf:type property for the page (Lehmann et al., 2014). Categories with fewer than five persons are discarded. We also removed three concepts for which the number of categories was too large: Christianity, Catholicism, and Religion. The resulting set included 4,603 categories containing 121,995 persons. Categories were found for 132 of the 198 concepts.

6 For convenience, we will refer in the following to categories and lists collectively as "categories".
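The person-filtering step could, for instance, be approximated with a SPARQL query against the public DBpedia endpoint, as in the sketch below. The use of dct:subject for category membership and the example category URI are our assumptions; the paper only states that DBpedia's rdf:type information is used.

import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def persons_in_category(category_uri):
    query = f"""
    SELECT DISTINCT ?member WHERE {{
        ?member <http://purl.org/dc/terms/subject> <{category_uri}> .
        ?member a <http://dbpedia.org/ontology/Person> .
    }}"""
    response = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query, "format": "application/sparql-results+json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [b["member"]["value"] for b in bindings]

# hypothetical example category
members = persons_in_category("http://dbpedia.org/resource/Category:Atheism_activists")
print(len(members))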
4.3 Category Stance Annotation

Finally, category names are manually annotated for stance. The annotation process has two stages: first, determine whether the category explicitly defines membership in a group of persons. For instance, Swedish women's rights activists and Feminist bloggers meet this criterion, but Feminism and history does not. We apply this test since we observed that it is much easier to predict with confidence the stance of persons in these categories. Categories that do not pass this filter are marked as Irrelevant. Otherwise, the annotators proceed to the second stage, where they are asked to determine the stance of the persons in the given category towards the given concept, based on the category name. Possible labels are:

1. Pro: supporting the concept.
2. Con: opposing the concept.
3. None: the stance towards the concept cannot be determined based on the category name.

For instance, for the concept Communism we will have British communists and Canadian Trotskyists classified as Pro, Moldovan anti-communists classified as Con, and Western writers about Soviet Russia classified as None. Annotators may also consider direct parent categories for determining stance. In the previous example, knowing that Canadian communists is a parent category of Canadian Trotskyists may help classifying the latter as Pro for Communism.

The categories were labeled by a team of six annotators, with each category labeled by two annotators. The overall agreement was 0.92, and the average inter-annotator Cohen's kappa coefficient was 0.79, which corresponds to substantial agreement (Landis and Koch, 1977). Cases of disagreement were labeled by a third annotator and were assigned the majority label. Category annotation was completed rather quickly - about 260 categories were annotated per hour. The total number of annotation hours invested in this task was 37.

The resulting ESG is composed of all experts in the categories labeled as Pro and Con. A total of 104,236 experts were found for 114 out of the 132 concepts, and for 31 concepts both Pro and Con experts were found. The number of concepts, categories and experts for each stance is given in Table 1. As shown in the table, the vast majority of categories and experts found are Pro. Overall, our method efficiently constructs a very large ESG, while only requiring a small amount of human annotation time.7

Table 1: Statistics on the Expert Stance Graph.
  Polarity   Concepts   Categories   Experts
  Pro        105        3,221        93,570
  Con        40         272          10,666

4.4 Category Stance Classification

The offline list of concepts we started with is unlikely to be complete. Therefore, we would like to be able to find on-the-fly Pro and Con experts also for new, unseen concepts. This requires the development of a stance classifier for categories. We randomly split the 198 concepts into two equal-size subsets and used one subset for development and the other for testing. As a result, the 132 concepts for which categories were found are split into a development set, containing 69 concepts and their associated 2,069 categories, and a test set, containing 63 concepts and 2,534 categories. The development set was used for developing a simple rule-based classifier.

The logic of the rule-based classifier is summarized in Algorithm 1.

7 The IBM Debating Technologies group in IBM Research has already released several data resources, found here: https://www.research.ibm.com/haifa/dept/vst/mlta_data.shtml. We aim to release the resource presented in this paper as well, as soon as we obtain the required licenses.

Algorithm 1: Category stance classification
  Input: category CAT; concept C; person derivation PD for the concept
  Output: stance classification of CAT into PRO/CON/NONE
  if CAT =~ critics of C then
      return CON
  else if CAT =~ anti|former|... PD then
      return CON
  else if CAT =~ PD dissident|... then
      return CON
  else if CAT =~ PD then
      return PRO
  else if CAT =~ anti|former|... C PERSON then
      return CON
  else if CAT =~ C PERSON then
      return PRO
  else
      return NONE

In Algorithm 1, "=~" denotes pattern matching, and PERSON is any hyponym of the word "person" in WordNet, e.g. activist, provider, and writer. "..." denotes omission of some lexical alternatives.

The algorithm is first applied to the category itself, and if it fails to make a Pro or Con prediction (i.e. returns None), it is applied to its direct parent categories, and the classification is made based on the majority of their Pro and Con predictions.
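Algorithm 1 translates naturally into regular-expression matching. The sketch below follows the rule order of the algorithm; the word lists standing in for the omitted lexical alternatives (NEGATORS, PERSON_WORDS) are illustrative placeholders, not the actual lists used by the authors.

import re

NEGATORS = r"(anti|former|ex)"                                 # hypothetical expansion of "anti|former|..."
PERSON_WORDS = r"(activists?|writers?|critics?|supporters?)"   # hyponyms of "person"

def classify_category(category, concept, person_derivation):
    cat = category.lower()
    c = re.escape(concept.lower())
    pd = re.escape(person_derivation.lower())
    if re.search(rf"\bcritics of {c}\b", cat):
        return "CON"
    if re.search(rf"\b{NEGATORS}[- ]?{pd}", cat):
        return "CON"
    if re.search(rf"\b{pd}s? dissidents?\b", cat):
        return "CON"
    if re.search(rf"\b{pd}", cat):
        return "PRO"
    if re.search(rf"\b{NEGATORS}[- ]?{c} {PERSON_WORDS}\b", cat):
        return "CON"
    if re.search(rf"\b{c} {PERSON_WORDS}\b", cat):
        return "PRO"
    return "NONE"

print(classify_category("Atheism activists", "atheism", "atheist"))        # PRO
print(classify_category("Critics of atheism", "atheism", "atheist"))       # CON
print(classify_category("Feminism and history", "feminism", "feminist"))   # NONE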
Table 2 shows the performance of the classifier on the test set, with respect to both categories and experts. Expert-level evaluation is done by labeling all the experts in each category with the category label. The following measures are reported for the Pro and Con classes: number of predictions, number of correct predictions, number of labeled instances for the class, precision (P) and recall (R). Overall, the classifier achieves high precision for Pro and Con, both at the category and at the expert level, while covering most of the labeled instances. Yet, the coverage of the classifier is incomplete. As an example of its limitations, consider the categories American pro-choice activists and American pro-life activists, which are Pro and Con abortion, respectively. Their stance cannot be determined from the category itself according to our rules, because they do not contain the concept Abortion, and both were added as subcategories of Abortion in the United States, a category that does not have a clear stance (and indeed has both Pro and Con subcategories).

Table 2: Category stance classification results.
              Predicted   Correct   Labeled   P      R
Categories
  Pro         1,298       1,182     1,738     91.1   68.0
  Con         144         140       186       97.2   75.3
Experts
  Pro         28,693      25,754    41,701    89.8   61.8
  Con         4,113       4,016     6,912     97.6   58.1

5 Expert-Level Assessment

So far we assumed that experts' stances can be predicted precisely from their category names. In this section we put this assumption to the test. We sampled 200 Pro experts and 200 Con experts from the test set. The polarity of the experts was derived from the manual labeling of their category. For each sampled instance, we first randomly selected one of the concepts in the test set, and then randomly picked an expert with the requested polarity. If the concept did not have any experts with that polarity, the above procedure was repeated until such an expert was found.

We then asked three human annotators to determine the stance of the experts towards their associated concept (Pro/Con/None), based on any information found on their Wikipedia page, and considered the majority label. As with the previous task, the annotators achieved substantial agreement (average kappa of 0.65). We evaluated the expert stance inferred from the category labeling by both the manual annotation and the rule-based classifier against these 400 labeled experts. The results are summarized in Table 3.

Table 3: Expert stance assessment.
                     Predicted   Correct   Labeled   P      R
Manual Annotation
  Pro                200         181       189       90.5   95.8
  Con                200         173       178       86.5   97.2
Classifier
  Pro                76          60        189       78.9   31.7
  Con                87          81        178       93.1   45.5

For the manual annotation, we see that the category name indeed predicts the expert's stance with high precision. In most of the misclassification cases, the annotators could not determine the stance from the expert's web page. This discrepancy is partially due to the fact that the expert's page shows categories containing the expert, but does not display lists and parent categories containing the expert, which are available for category-based stance annotation. The precision of the classifier on this sample is also quite good (better for Con), but while we are able to identify a substantial part of the experts, recall still leaves much room for improvement.

6 Conclusion and Future Work

We introduced Expert Stance Graphs, a novel, large-scale knowledge resource that has many potential use cases in computational argumentation. We presented an offline method for constructing the graph with minimal supervision, as well as a fully-automated method for finding experts for unseen concepts. Both methods show promising results.

In future work we plan to improve coverage by considering additional sources of information, such as the text of the expert's page in Wikipedia. We will also apply the graph in different tasks related to the detection and stance classification of expert evidence.

We also plan to enrich the graph with additional types of knowledge, which may be utilized to predict missing stance edges. Semantic relations between concepts, such as contrast (e.g. Atheism vs. Religion), may support such inferences, as experts are expected to have opposite stances towards contrasting concepts. Another possible extension of the graph is influence links between experts, which may indicate similar stances for these experts. Influence information is available from Wikipedia infoboxes.

Finally, we would like to apply collaborative filtering techniques to predict missing expert-concept stance relations. This is based on the intuition that experts who tend to have the same (or opposite) stances on a set of topics are likely to follow a similar pattern on topics for which we only have partial stance information.

References

Richard Dawkins. 2006. The God Delusion. Bantam Press.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech and Communication. MIT Press.

Kazi Saidul Hasan and Vincent Ng. 2013. Stance classification of ideological debates: Data, models, features, and constraints. In Proceedings of IJCNLP.

J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2014. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal.

G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, pages 39–41.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of EMNLP.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of EMNLP.

Douglas Walton, Christopher Reed, and Fabrizio Macagno. 2008. Argumentation Schemes. Cambridge University Press.

Fill the Gap! Analyzing Implicit Premises between Claims from Online Debates

Filip Boltužić and Jan Šnajder
University of Zagreb, Faculty of Electrical Engineering and Computing
Text Analysis and Knowledge Engineering Lab
Unska 3, 10000 Zagreb, Croatia
{filip.boltuzic,jan.snajder}@fer.hr

Abstract

Identifying the main claims occurring across texts is important for large-scale argumentation mining from social media. However, the claims that users make are often unclear and build on implicit knowledge, effectively introducing a gap between the claims. In this work, we study the problem of matching user claims to predefined main claims, using implicit premises to fill the gap. We build a dataset with implicit premises and analyze how human annotators fill the gaps. We then experiment with computational claim matching models that utilize these premises. We show that using manually-compiled premises improves similarity-based claim matching and that premises generalize to unseen user claims.

1 Introduction

Argumentation mining aims to extract and analyze argumentation expressed in natural language texts. It is an emerging field at the confluence of natural language processing (NLP) and computational argumentation; see (Moens, 2014; Lippi and Torroni, 2016) for a comprehensive overview.

Initial work on argumentation mining has focused on well-structured, edited text, such as legal text (Walton, 2005) or scientific publications (Jiménez-Aleixandre and Erduran, 2007). Recently, the focus has also shifted to argumentation mining from social media texts, such as online debates (Cabrio and Villata, 2012; Habernal et al., 2014; Boltužić and Šnajder, 2014), discussions on regulations (Park and Cardie, 2014), product reviews (Ghosh et al., 2014), blogs (Goudas et al., 2014), and tweets (Llewellyn et al., 2014; Bosc et al., 2016). Mining arguments from social media can uncover valuable insights into peoples' opinions; in this context, it can be thought of as a sophisticated opinion mining technique – one that seeks to uncover the reasons for opinions and patterns of reasoning. The potential applications of social media mining are numerous, especially when done on a large scale.

In comparison to argumentation mining from edited texts, there are additional challenges involved in mining arguments from social media. First, social media texts are more noisy than edited texts, which makes them less amenable to NLP techniques. Secondly, users in general are not trained in argumentation, hence the claims they make will often be unclear, ambiguous, vague, or simply poorly worded. Finally, the arguments will often lack a proper structure. This is especially true for short texts, such as microblogging posts, which mostly consist of a single claim.

When analyzing short and noisy arguments on a large scale, it becomes crucial to identify identical but differently expressed claims across texts. For example, summarizing and analyzing arguments on a controversial topic presupposes that we can identify and aggregate identical claims. This task has been addressed in the literature under the name of argument recognition (Boltužić and Šnajder, 2014), reason classification (Hasan and Ng, 2014), argument facet similarity (Swanson et al., 2015; Misra et al., 2015), and argument tagging (Sobhani et al., 2015). The task can be decomposed into two subtasks: (1) identifying the main claims for a topic and (2) matching each claim expressed in text to claims identified as the main claims. The focus of this paper is on the latter.

The difficulty of the claim matching task arises from the existence of a gap between the user's claim and the main claim. Many factors contribute to the gap: linguistic variation, implied common-sense knowledge, or implicit premises from the beliefs and value judgments of the person making the claim; the latter two effectively make the argument an enthymeme.


In Table 1, we give an example from the dataset of Hasan and Ng (2014). Here, a user claim from an online debate was manually matched to a claim previously identified as one of the main claims on the topic of marijuana legalization. Without additional premises, the user claim does not entail the main claim, but the gap may be closed by including the three listed premises.

Table 1: User claim, the matching main claim, and the implicit premises filling the gap.
  User claim: Now it is not taxed, and those who sell it are usually criminals of some sort.
  Main claim: Legalized marijuana can be controlled and regulated by the government.
  Premise 1: If something is not taxed, criminals sell it.
  Premise 2: Criminals should be stopped from selling things.
  Premise 3: Things that are taxed are controlled and regulated by the government.

Previous annotation studies (Boltužić and Šnajder, 2014; Hasan and Ng, 2014; Sobhani et al., 2015) demonstrate that humans have little difficulty in matching two claims, suggesting that they are capable of filling the premise gap. However, current machine learning-based approaches to claim matching do not account for the problem of implicit premises. These approaches utilize linguistic features or rely on textual similarity and textual entailment features. From an argumentation perspective, however, these are shallow features and their capacity to bridge the gap opened by implicit premises is limited. Furthermore, existing approaches lack the explanatory power to explain why (under what premises) one claim can be matched to the other. Yet, the ability to provide such explanations is important for apprehending arguments.

In this paper, we address the problem of claim matching in the presence of gaps arising due to implicit premises. From an NLP perspective, this is a daunting task, which significantly surpasses the current state of the art. As a first step towards a better understanding of the task, we analyze the gap between user claims and main claims from both a data and a computational perspective. We conduct two studies. The first is an annotation study, in which we analyze the gap, both qualitatively and quantitatively, in terms of how people fill it. In the second study, we focus on computational models for claim matching with implicit premises, and gain preliminary insights into how such models could benefit from the use of implicit premises.

To the best of our knowledge, this is the first work that focuses on the problem of implicit premises in argumentation mining. Besides reporting on the experimental results of the two studies, we also describe and release a new dataset with human-provided implicit premises. We believe our results may contribute to a better understanding of the premise gap between claims.

The remainder of the paper is structured as follows. In the next section, we briefly review the related work on argumentation mining. In Section 3 we describe the creation of the implicit premises dataset. We describe the results of the two studies in Section 4 and Section 5, respectively. We conclude and discuss future work in Section 6.

2 Related Work

Work related to ours comes from two broad strands of research: argumentation mining and computational argumentation. Within argumentation mining, a significant effort has been devoted to the extraction of argumentative structure from text, e.g., (Walton, 2012; Mochales and Moens, 2011; Stab and Gurevych, 2014; Habernal and Gurevych, 2016). One way to approach this problem is to classify the text fragments into argumentation schemes – templates for typical arguments. Feng and Hirst (2011) note that identifying the particular argumentation scheme that an argument is using could help in reconstructing its implicit premises. As a first step towards this goal, they develop a model to classify text fragments into the five most frequently used Walton's schemes (Walton et al., 2008), reaching 80–95% pairwise classification accuracy on the Araucaria dataset.

Recovering argumentative structure from social media text comes with additional challenges due to the noisiness of the text and the lack of argumentative structure. However, if the documents are sufficiently long, argumentative structure could in principle be recovered. In a recent study on social media texts, Habernal and Gurevych (2016) showed that (a slightly modified) Toulmin's argumentation model may be suitable for short documents, such as article comments or forum posts. Using sequence labeling, they identify the claim, premise, backing, rebuttal, and refutation components, achieving a token-level F1-score of 0.25.

Unlike the work cited above, in this work we do not consider argumentative structure.

Rather, we focus on short (mostly single-sentence) claims and the task of matching a pair of claims. The task of claim matching has been tackled by Boltužić and Šnajder (2014) and Hasan and Ng (2014). The former frame the task as a supervised multi-label problem, using textual similarity- and entailment-based features. The features are designed to compare the user comments against the textual representation of main claims, allowing for a certain degree of topic independence. In contrast, Hasan and Ng frame the problem as a (joint learning) supervised classification task with lexical features, effectively making their model topic-specific.

Both approaches above are supervised and require a predefined set of main claims. Given a large-enough collection of user posts, there seem to be at least two ways in which main claims can be identified. First, they can be extracted manually. Boltužić and Šnajder (2014) use the main claims already identified as such on an online debating platform, while Hasan and Ng (2014) asked annotators to group the user comments and identify the main claims. The alternative is to use unsupervised machine learning and induce the main claims automatically. A middle-ground solution, proposed by Sobhani et al. (2015), is to first cluster the claims and then manually map the clusters to main claims. In this work, we assume that the main claims have been identified using any of the above methods.

Claim matching is related to the well-established NLP problems of textual entailment (TE) and semantic textual similarity (STS), both often tackled as shared tasks (Dagan et al., 2006; Agirre et al., 2012). Boltužić and Šnajder (2014) explore using outputs from STS and TE in solving the claim matching problem. Cabrio and Villata (2012) use TE to determine support/attack relations between claims. Boltužić and Šnajder (2015) consider the notion of argument similarity between two claims. Similarly, Swanson et al. (2015) and Misra et al. (2015) consider argument facet similarity.

The problem of implicit information has also been tackled in the computational argumentation community. Work closest to ours is that of Wyner et al. (2010), who address the task of inferring implicit premises from user discussions. They annotate implicit premises in Attempto Controlled English (Fuchs et al., 2008), define propositional logic axioms with annotated premises, and extract and explain policy stances in discussions. In our work, we focus on the NLP approach and work with implicit premises in textual form.

The starting point of our study is the dataset of Hasan and Ng (2014). The dataset contains user posts from a two-side online debate platform on four topics: "Marijuana" (MA), "Gay rights" (GR), "Abortion" (AB), and "Obama" (OB). Each post is assigned a stance label (pro or con), provided by the author of the post. Furthermore, each post is split up into sentences and each sentence is manually labeled with a single claim from a predefined set of main claims, different for each topic. Note that all sentences in the dataset are matched against exactly one main claim. Hasan and Ng (2014) report substantial levels of inter-annotator agreement (between 0.61 and 0.67, depending on the topic).

Our annotation task extends this dataset. We formulate the task as a "fill-the-gap" task. Given a pair of previously matched claims (a user claim and a main claim), we ask the annotators to provide the premises that bridge the gap between the two claims. No further instructions were given to the annotators; we hoped that they would resort to common-sense reasoning and effectively reconstruct the deductive steps needed to entail the main claim from the user claim. The annotators were also free to abstain from filling the gap if they felt that the claims cannot be matched; we refer to such pairs as Non-matching. If no implicit premises are required to bridge the gap (the two claims are paraphrases of each other), then the claim pair is annotated as Directly linked.

We hired three annotators to annotate each pair of claims. The order of claim pairs was randomized for each annotator. We annotated 125 claim pairs for each topic, yielding a total of 500 gap-filling premise sets. Table 2 summarizes the dataset statistics. An excerpt from the dataset is given in Table 3. We make the dataset freely available.1

Table 2: Dataset summary.
  Topic             # claim pairs   # main claims
  Marijuana (MA)    125             10
  Gay rights (GR)   125             9
  Abortion (AB)     125             12
  Obama (OB)        125             16

1 Available under the CC BY-SA-NC license from http://takelab.fer.hr/argpremises
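When working with the released dataset, a claim-pair record could be represented roughly as follows; the field names are hypothetical and do not reflect the dataset's actual file format, and the example values are taken from Table 1.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ClaimPairAnnotation:
    topic: str                                          # MA, GR, AB or OB
    user_claim: str
    main_claim: str
    premises: List[str] = field(default_factory=list)   # empty for Directly linked pairs
    non_matching: bool = False                          # annotator abstained from filling the gap

example = ClaimPairAnnotation(
    topic="MA",
    user_claim="Now it is not taxed, and those who sell it are usually criminals of some sort.",
    main_claim="Legalized marijuana can be controlled and regulated by the government.",
    premises=["If something is not taxed, criminals sell it.",
              "Criminals should be stopped from selling things.",
              "Things that are taxed are controlled and regulated by the government."])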

Table 3: Examples of annotated claim pairs.
  User claim: Obama supports the Bush tax cuts. He did not try to end them in any way.
  Main claim: Obama destroyed our economy.
  Annotation: P1: Obama continued with the Bush tax cuts. P2: The Bush tax cuts destroyed our economy.

  User claim: What if the child is born and there is so many difficulties that the child will not be able to succeed in life?
  Main claim: A fetus is not a human yet, so it's okay to abort.
  Annotation: Non-matching

  User claim: Technically speaking, a fetus is not a human yet.
  Main claim: A fetus is not a human yet, so it's okay to abort.
  Annotation: Directly linked

4 Study I: Implicit Premises

The aim of the first study is to analyze how people fill the gap between the user's claim and the corresponding main claim. We focus on three research questions. The first concerns the variability of the gap: to what extent do different people fill the gap in different ways, and to what extent do the gaps differ across topics. Secondly, we wish to characterize the gap in terms of the types of premises used to fill it. The third question is how the gap relates to the more general (but less precise) notion of textual similarity between claims, which has been used for claim matching in prior work.

4.1 Setup and Assumptions

To answer the above questions, we analyze and compare the gap-filling premise sets in the dataset of implicit premises from Section 3. We note that, by doing so, we inherit the setup used by Hasan and Ng (2014). This seems to raise three issues.

First, the main claim to which the user claim has been matched need not be the correct one. In such cases, it would obviously be nonsensical to attempt to fill the gap. We remedy this by asking our annotators to abstain from filling the gap if they felt the two claims do not match. Moreover, considering that the agreement on the claim matching task on this dataset was substantial (Hasan and Ng, 2014), we expect this to rarely be the case.

The second issue concerns the granularity of the main claims. Boltužić and Šnajder (2015) note that the level of claim granularity is to a certain extent arbitrary. We speculate that, on average, the more general the main claims are, the fewer the number of main claims for a given topic and the bigger the gaps between the user-provided and main claims.

Finally, we note that each gap was not filled by the same person who identified the main claim, which in turn is not the original author of the claim. Therefore, it may well be that the original author would have chosen a different main claim, and that she would commit to a different set of premises than those ascribed to her by our annotators.

Considering the above, we acknowledge that we cannot analyze the genuine implicit premises of the claim's author. However, under the assumption that the main claim has been correctly identified, there is a gap that can be filled with sensible premises. Depending on how appropriate the chosen main claim was, this gap will be larger or smaller.

4.2 Variability in Gap Filling

We are interested in gauging the variability of gap filling across the annotators and topics. To this end, we calculate the following quantitative parameters: the average number of premises, the average number of words in premises, and the proportion of non-matched claim pairs.

Table 4: Gap-filling parameters for the three annotators.
                     A1      A2      A3      Avg.
  Avg. # premises    3.6     2.6     2.0     2.7 ± 0.7
  Avg. # words       26.7    23.7    18.6    23.0 ± 3.4
  Non-matching (%)   1.2     3.6     14.5    6.4 ± 5.8

Table 4 shows that there is substantial variance in these parameters for the three annotators. The average number of premises per gap is 2.7 and the average number of words per gap is about 23, yielding an average length of about 9 words per premise. We also computed the word overlap between the three annotators: 8.51, 7.67, and 5.93 for annotator pairs A1-A2, A1-A3, and A2-A3, respectively. This indicates that, on average, the premise sets overlap in just 32% of the words. The annotators A1 and A2 have a higher word overlap and use more words to fill the gap. Also, A1 and A2 managed to fill the gap for more cases than A3, who much more often desisted from filling the gap. An example where A1 used more premises than A3 is shown in Table 5.
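As an aside, parameters of this kind are easy to compute; the sketch below assumes a hypothetical premise_sets structure (premise_sets[annotator][pair_id] is a list of premise strings, empty for non-matching pairs) and uses a Jaccard-style word overlap, since the exact overlap measure is not spelled out here.

def gap_parameters(premise_sets):
    stats = {}
    for annotator, gaps in premise_sets.items():
        filled = [p for p in gaps.values() if p]
        n_premises = [len(p) for p in filled]
        n_words = [sum(len(prem.split()) for prem in p) for p in filled]
        stats[annotator] = {
            "avg_premises": sum(n_premises) / len(filled),
            "avg_words": sum(n_words) / len(filled),
            "non_matching_pct": 100 * (len(gaps) - len(filled)) / len(gaps),
        }
    return stats

def word_overlap(premises_a, premises_b):
    """Jaccard-style overlap of the word sets of two gap-filling premise sets."""
    words_a = set(" ".join(premises_a).lower().split())
    words_b = set(" ".join(premises_b).lower().split())
    return len(words_a & words_b) / len(words_a | words_b)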

Table 6 shows the gap-filling parameters across topics. Here the picture is more balanced. The least number of premises and the least number of words per gap are used for the AB topic. The GR topic contained the most (about 7%) claim pairs for which the annotators desisted from filling the gap.

Table 5: User claim, the matching main claim, and the implicit premise(s) filling the gap provided by two different annotators.
  User claim: It would be loads of empathy and joy for about 6 hours, then irrational, stimulant-induced paranoia. If we can expect the former to bring about peace on Earth, the latter would surely bring about WWIII.
  Main claim: Legalization of marijuana causes crime.
  A1 Premise 1: Marijuana is a stimulant.
  A1 Premise 2: The use of marijuana induces paranoia.
  A1 Premise 3: Paranoia causes war.
  A1 Premise 4: War causes aggression.
  A1 Premise 5: Aggression is a crime.
  A1 Premise 6: "WWIII" stands for the Third World War.
  A3 Premise 1: Marijuana leads to irrational paranoia which can lead to commiting a crime.

Table 6: Gap-filling parameters for the four topics.
                     MA      GR      AB      OB      Avg.
  Avg. # premises    2.8     2.8     2.5     2.8     2.7 ± 0.1
  Avg. # words       23.6    24.9    19.1    23.4    22.8 ± 2.2
  Non-matching (%)   5.9     6.8     4.6     4.3     5.4 ± 1.0

4.3 Gap Characterization

We next make a preliminary inquiry into the nature of the gap. To this end, we characterize the gap in terms of the individual premises that are used to fill it. At this point we do not look at the relations between the premises (the argumentative structure); we leave this for future work.

Our analysis is based on a simple ad-hoc typology of premises, organized along three dimensions: premise type (fact, value, or policy), complexity (atomic, implication, or complex), and acceptance (universal or claim-specific). The intuition behind the latter is that some premises convey general truths or widely accepted beliefs, while others are specific to the claim being made, and embraced only by the supporters of the claim in question.

We (the two authors) manually classified 50 premises from the MA topic into the above categories and averaged the proportions. The kappa agreement is 0.42, 0.62, and 0.53 for the premise type, complexity, and acceptance, respectively. Factual premises account for the large majority (85%) of cases, value premises for 9%, and policy premises for 6%. Most of the gap-filling premises are atomic (77%), while implication and other complex types constitute 16% and 7% of cases, respectively. In terms of acceptance, premises are well-balanced: universal and claim-specific premises account for 62% and 38% of cases, respectively.

We suspect that the kind of analysis we did above might be relevant for determining the overall strength of an argument (Park and Cardie, 2014). An interesting avenue for future work would be to carry out a more systematic analysis of premise acceptance using the complete dataset, dissected across claims and topics, and possibly based on surveying a larger group of people.

4.4 Semantic Similarity between Claims

Previous work addressed claim matching as a semantic textual similarity task (Swanson et al., 2015; Misra et al., 2015; Boltužić and Šnajder, 2015). It is therefore worth investigating how the notion of semantic similarity relates to the gap between two claims. We hypothesize that the textual similarity between two claims will be negatively affected by the size of the gap. Thus, even though the claims are matching, if the gap is too big, the similarity will not be high enough to indicate the match.

To verify this, we compare the semantic similarity score between each pair of claims against its gap size, characterized by the number of premises required to fill the gap, averaged across the three annotators. To obtain a reliable estimate of semantic similarity between claims, instead of computing the similarity automatically, we rely on human-annotated similarity judgments. We set up a crowdsourcing task and asked the workers to judge the similarity between 846 claim pairs for the MA topic. The task was formulated as the question "Are two claims talking about the same thing?", and judgments were made on a scale from 1 ("not similar") to 6 ("very similar"). Each pair of claims received five judgments, which we averaged to obtain the gold similarity score. The average standard deviation is 1.2, indicating good agreement.

The Pearson correlation coefficient between the similarity score and the number of premises filling the gap for annotators A1, A2, and A3 is −0.30, −0.28, and −0.14, respectively. The correlation between the similarity score and the number of premises averaged across the annotators is −0.22 (p < 0.0001). We conclude that there is a statistically significant, albeit weak, negative relationship between semantic similarity and gap size.
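The correlation computation itself is straightforward; the sketch below shows it with SciPy on placeholder arrays (the real inputs would be the 846 crowd-judged similarity scores and the corresponding averaged gap sizes).

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
gap_size = rng.integers(0, 6, size=846).astype(float)           # premises per claim pair (toy)
similarity = 4.0 - 0.3 * gap_size + rng.normal(0, 1.2, 846)     # 1-6 crowd similarity (toy)

r, p = pearsonr(similarity, gap_size)
print(f"Pearson r = {r:.2f}, p = {p:.4g}")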

5 Study II: Claim Matching Model

In this section we focus on claim matching with implicit premises. In the previous section, we demonstrated that the degree of similarity between matched claims varies and is negatively correlated with the number of gap-filling premises. This result directly suggests that the similarity scores for matched claims could be increased by reducing the size of the gap. Furthermore, we expect that the size of the gap can be effectively reduced by including premises in the similarity computation.

Motivated by these insights, we conduct a preliminary study on the use of implicit premises in claim matching. The study is also motivated by our long-term goal to develop efficient models for recognizing main claims in social media texts. Given a user's claim, the task is to find the main claim from a predefined set of claims to which the user's claim matches best. We address three research questions: (1) whether and how the use of implicit premises improves claim matching, (2) how well the implicit premises generalize, and (3) whether the implicit premises could be retrieved automatically.

5.1 Experimental Setup

The claim matching task can be approached in a supervised or unsupervised manner. We focus here on the latter, based on semantic similarity between the claims and the premises. We think unsupervised claim matching provides a more straightforward and explicit way of incorporating the implicit premises. Furthermore, the unsupervised approach better corresponds to the very idea of argumentation, where claims and premises are compared to each other and combined to derive other claims.

Dataset. We use the implicit premise dataset from Section 3, consisting of 125 claim pairs for each of the four topics. We use the gap-filling premise sets from annotator A1, who on average has provided the largest number of implicit premises. We refer to this dataset as the development set. In addition, we sample an additional test set consisting of 125 pairs for each topic from the dataset of Hasan and Ng (2014); for claim pairs from this set we have no implicit premises.

Semantic similarity. We adopt the distributional semantics approach (Turney and Pantel, 2010) to computing semantic textual similarity. We rely on distributed representations based on the neural network skip-gram model of Mikolov et al. (2013a).² We represent the texts of the claims and the premises by summing up the distributional vectors of their individual words, as the semantic composition of short phrases via simple vector addition has been shown to work well (Mikolov et al., 2013b). We measure claim similarity using the cosine distance between two vectors.

Inspired by (Cabrio et al., 2013; Boltužić and Šnajder, 2014), we also attempted to model claim matching using textual entailment. However, our results, obtained using the Excitement Open Platform (Padó et al., 2015), were considerably worse than those of the distributional similarity models, hence we do not consider them further in this paper.

Baselines. We employ two baselines. The first is an unsupervised baseline, which simply computes the similarity between the user claim and main claim vectors without using the implicit premises. Each user claim is matched to the most similar main claim. The other is a supervised baseline, which uses a support vector machine (SVM) classifier with an RBF kernel, trained on the user comments, to predict the label corresponding to the main claim. We train and evaluate the model using a nested 5×3 cross-validation, separately for each topic. The hyperparameters C and γ are optimized using grid search. We use the well-known LibSVM implementation (Chang and Lin, 2011).

Premise sets and combination with claims. To obtain a single combined representation of a premise set, we simply concatenate the premises together before computing the distributional vector representation. We do the same when combining the premises with either of the claims. This is exemplified in Table 7. In what follows, we denote the user claim, the main claim, and the gap-filling premise set with Ui, Mj, and Pij, respectively.

² We use the pre-trained vectors available at https://code.google.com/p/word2vec/
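A minimal sketch of the unsupervised similarity computation described above is given below. It assumes the pre-trained Google News word2vec vectors referenced in footnote 2 are available locally; the use of gensim for loading them and the whitespace tokenization are our assumptions, not details specified in the paper.

```python
# Sketch of the unsupervised claim matcher: sum word vectors, compare by cosine.
# Assumes the pre-trained GoogleNews word2vec binary has been downloaded;
# gensim is used here only for convenience.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)

def embed(text):
    """Sum the vectors of all in-vocabulary words (simple additive composition)."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.sum([vectors[w] for w in words], axis=0) if words else np.zeros(300)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def match(user_claim, main_claims, premises=None):
    """Return the main claim most similar to the (optionally premise-augmented) user claim."""
    query = user_claim if premises is None else user_claim + " " + " ".join(premises)
    q_vec = embed(query)
    return max(main_claims, key=lambda m: cosine(q_vec, embed(m)))
```

Passing a premise set through the `premises` argument corresponds to augmenting the user claim before matching; leaving it out gives the unsupervised baseline.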

Table 7: Combination of premise sets and claims.

  Type     Text content
  Ui       Marijuana has so many benefits for sick people.
  Mj       Marijuana is used as a medicine for its positive effects.
  Pij      Marijuana helps sick people. Sick people use marijuana.
  Ui+Pij   Marijuana has so many benefits for sick people. Marijuana helps sick people. Sick people use marijuana.
  Mj+Pij   Marijuana is used as a medicine for its positive effects. Marijuana helps sick people. Sick people use marijuana.

Table 8: Performance of claim matching baselines and oracle performance of the claim matching models utilizing implicit premises from annotator A1 (macro-averaged F1-score).

  Model          MA     GR     AB     OB     Avg.
  Ui ↔ Mj        7.39   12.52  24.59  10.87  13.84
  Ui ↔ Mj (S)    35.26  27.81  33.30  20.92  29.32
  Ui+Pij ↔ Mj    22.73  46.03  47.22  21.41  34.35
  Ui ↔ Mj+Pij    48.05  28.23  49.34  64.11  47.43

5.2 Matching with Implicit Premises

To answer the first research question – whether using premise sets can help in matching claims – we use gold-annotated premise sets and combine these with either the main claim or the user claim. The main idea is that, by combining the premises with a claim, we encode the information conveyed by the premises into the claim, hopefully making the two claims more similar at the textual level.

We consider four models: the unsupervised baseline, denoted "Ui ↔ Mj"; the supervised baseline, denoted "Ui ↔ Mj (S)"; the model in which the premises are combined with the user claim, denoted "Ui+Pij ↔ Mj"; and the model in which the premises are combined with the main claim, denoted "Ui ↔ Mj+Pij". The latter two predict the main claim as the one that maximizes the similarity between two claims, after one of the claims is combined with the premises. The Ui+Pij ↔ Mj model considers all pairs of the user claim Ui and the gold-annotated premise sets Pi* for that user claim. In contrast, the Ui ↔ Mj+Pij model considers all pairings of the main claim Mj and the gold-annotated premise sets P*j for that main claim. In effect, this model tries to fill the gap using different premise sets linked to the given main claim. In this oracle setup, we always use the gold-annotated premise set for the main claim.

In Table 8, we show the claim matching results in terms of the macro-averaged F1-score on the development set. The results demonstrate that using the implicit premises helps in selecting the most similar main claim, as the models with added implicit premises outperform the unsupervised baseline by 20.5 and 33.6 points of F1-score. Furthermore, the model that combines the premises with the main claim considerably outperforms the two baselines and the model that combines the premises with the user claim. An exception is the GR topic, on which the latter model works best.
Our analysis revealed this to be due to the presence of very general (i.e., lexically non-discriminative) premises in some of the premise sets (e.g., "Straight people have the right to marry"), which makes the corresponding main claim more similar to user claims. Another interesting observation is the very good performance on the OB topic. This is because only one of the 16 main claims contains the word Obama, also making it more similar to user claims. However, after the premise sets get combined with all the main claims, this difference diminishes and the matching performance improves.

We obtained the above results using premises compiled by annotator A1. To see how model performance is influenced by the differences in premise sets, we re-run the same experiment with the best-performing Ui ↔ Mj+Pij model, this time using the premises compiled by annotators A2 and A3. Although we obtained a lower macro-averaged F1-score (33.97 for A2 and 32.91 for A3), the model still outperforms both baselines. On the other hand, this suggests that the performance very much depends on the quality of the premises.

The claim matching problem bears resemblance to query matching in information retrieval. A common way to address the lexical gap between queries and documents is to perform query expansion (Voorhees, 1994). We hypothesize that human-compiled premises are more useful for claim matching than standard query expansion. To verify this, we replicate the setups Ui+Pij ↔ Mj and Ui ↔ Mj+Pij, but instead of premise sets use (1) WordNet synsets and (2) the top k distributionally most similar words (using the word vectors from Section 5.1 and k ∈ {1, 3, 5, 7, 9}) to expand the user or the main claim. We obtained no improvement over the baselines, suggesting that the lexical information in the premises is indeed specific.

5.3 Premise Generalization

From a practical perspective, we are interested in to what extent the premises generalize, i.e., whether it is possible to reuse the premises compiled for the main claims with different user claims. We choose the best-performing model from the previous section (Ui ↔ Mj+Pij), and apply this model and the baseline models to the test set. This means that the model uses the premise sets Pij for pairs of claims Ui and Mj from the training set, and the hope is that the same premise sets will be useful for unseen user claims Uk.

Table 9: Performance of claim matching baselines and the models utilizing the implicit premises on the test set (macro-averaged F1-score).

  Model          MA     GR     AB     OB     Avg.
  Uk ↔ Mj        9.60   19.68  27.70  12.39  17.35
  Uk ↔ Mj (S)    29.01  29.39  21.09  18.22  24.43
  Uk ↔ Mj+Pij    30.63  23.00  32.72  23.87  27.55

Results are shown in Table 9. The model again outperforms the baselines, except on the GR topic. The performance improvement varies across topics: the average improvement over the unsupervised and supervised baselines is 10.2 and 3.12 points of F1-score, respectively. This result suggests that the premises that fill the gap generalize to a certain extent, and thus can be reused for unseen user claims.

5.4 Premise Retrieval

In a realistic setting, we would not have the implicit premises for each main claim at our disposal, but would try to generate or retrieve them automatically. As a preliminary investigation of the feasibility of this option, we turn to our third research question: could the implicit premises be retrieved automatically?

To retrieve the premise set P and then perform claim matching, we use a simple heuristic: given a user claim as input, we choose the N premises most similar to the user claim, and then combine them with the user claim. We next compute the similarity between the premise-augmented claim vector and all the main claims. If the average similarity to the main claims has increased, we increment N and repeat the procedure; otherwise we stop. The main idea is to retrieve as many premises as needed to bring the user claim "closer" to the main claims. We run this with N ranging from 1 to 5. In cases when combining the user claim with additional premises makes the claim less similar to the main claims, no combination takes place.
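The retrieval heuristic just described can be sketched as follows. The `embed` and `cosine` helpers are the ones from the similarity sketch in Section 5.1; the loop follows the description above, but the concrete implementation details (function names, tie handling) are assumptions.

```python
# Sketch of the premise-retrieval heuristic: add the N premises most similar to
# the user claim as long as the average similarity to the main claims increases.
# `premise_pool` is the within-topic or cross-topic pool of candidate premises.
def retrieve_and_match(user_claim, main_claims, premise_pool, max_n=5):
    base_vec = embed(user_claim)
    ranked = sorted(premise_pool,
                    key=lambda p: cosine(base_vec, embed(p)), reverse=True)
    main_vecs = [embed(m) for m in main_claims]

    def avg_sim(vec):
        return sum(cosine(vec, mv) for mv in main_vecs) / len(main_vecs)

    best_vec, best_avg = base_vec, avg_sim(base_vec)
    for n in range(1, max_n + 1):
        candidate = embed(user_claim + " " + " ".join(ranked[:n]))
        if avg_sim(candidate) <= best_avg:
            break                              # adding more premises no longer helps
        best_vec, best_avg = candidate, avg_sim(candidate)

    # match against the main claims with the (possibly augmented) claim vector
    return max(range(len(main_claims)),
               key=lambda j: cosine(best_vec, main_vecs[j]))
```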
We consider two setups: one in which the pool of premises to retrieve from comes from the topic in question (within-topic), and the other in which the premises from all four topics are considered (cross-topic). Results are shown in Table 10. We evaluate on both the development set and the test set, as well as with within-topic (WT) and cross-topic (XT) premise retrieval. The results suggest that our simple method for within-topic premise retrieval improves claim matching over the baseline for all topics except the OB topic. On the other hand, results on the test set indicate that the model does not generalize well, as it does not outperform the baseline.

Table 10: Performance of the claim matching model with premise retrieval on the dev. set (upper part) and test set (lower part); macro-avg. F1-score.

  Model              MA     GR     AB     OB     Avg.
  Ui ↔ Mj            7.39   12.52  24.59  10.87  13.84
  Ui+P ↔ Mj (WT)     8.95   19.54  29.32  7.30   16.28
  Ui+P ↔ Mj (XT)     8.56   19.01  28.73  7.07   15.84
  Uk ↔ Mj            9.60   19.68  27.70  12.39  17.35
  Uk ↔ Mj (XT)       5.69   17.75  15.38  12.43  12.82

6 Conclusion

We addressed the problem of matching user claims to main claims. Implicit premises introduce a gap between two claims. This gap is easily filled by humans, but difficult to bridge for natural language processing methods.

In the first study, we compiled a dataset of implicit premises between matched claims from online debates. We showed that there is considerable variation in the way human annotators fill the gaps with premises, and that they use premises of various types. We also showed that the similarity between claims, as judged by humans, negatively correlates with the size of the gap, expressed as the number of premises needed to fill it.

In the second study, we experimented with computational models for claim matching. We showed that using gap-filling premises effectively reduces the similarity gap between claims and improves claim matching performance. We also showed that premise sets generalize to a certain extent, i.e., we can improve claim matching on unseen user claims. Finally, we made a preliminary attempt to retrieve the gap-filling premises automatically.

This paper is a preliminary study of implicit premises and their relevance for argumentation mining. For future work, we want to further study the types of implicit premises, as well as the relationships between them. We also intend to experiment with more sophisticated premise retrieval models.

Acknowledgments. We thank the reviewers for their many insightful comments and suggestions.

References

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.

Filip Boltužić and Jan Šnajder. 2014. Back up your stance: Recognizing arguments in online discussions. In Proceedings of the First Workshop on Argumentation Mining, pages 49–58. Association for Computational Linguistics.

Filip Boltužić and Jan Šnajder. 2015. Identifying prominent arguments in online debates using semantic textual similarity. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 110–115. Association for Computational Linguistics.

Tom Bosc, Elena Cabrio, and Serena Villata. 2016. DART: A dataset of arguments and their relations on Twitter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association.

Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers – Volume 2, pages 208–212. Association for Computational Linguistics.

Elena Cabrio, Serena Villata, and Fabien Gandon. 2013. A support framework for argumentative discussions management in the web. In The Semantic Web: Semantics and Big Data, pages 412–426. Springer.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190. Springer.

Vanessa Wei Feng and Graeme Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pages 987–996. Association for Computational Linguistics.

Norbert E. Fuchs, Kaarel Kaljurand, and Tobias Kuhn. 2008. Attempto Controlled English for knowledge representation. In Reasoning Web, pages 104–124. Springer.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48. Association for Computational Linguistics.

Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence, pages 287–299. Springer.

Ivan Habernal and Iryna Gurevych. 2016. Argumentation mining in user-generated web discourse. arXiv preprint arXiv:1601.02403.

Ivan Habernal, Judith Eckle-Kohler, and Iryna Gurevych. 2014. Argumentation mining on the web from information seeking perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762. Association for Computational Linguistics.

María Pilar Jiménez-Aleixandre and Sibel Erduran. 2007. Argumentation in science education: An overview. In Argumentation in Science Education, pages 3–27. Springer.

Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT), 16(2):10.

Clare Llewellyn, Claire Grover, Jon Oberlander, and Ewan Klein. 2014. Re-using an argument corpus to aid in the curation of social media collections. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pages 462–468. European Language Resources Association.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of ICLR, Scottsdale, AZ, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Amita Misra, Pranav Anand, Jean E. Fox Tree, and Marilyn A. Walker. 2015. Using summarization to discover argument facets in online idealogical dialog. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL, pages 430–440. Association for Computational Linguistics.

Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.

Marie-Francine Moens. 2014. Argumentation mining: Where are we now, where do we want to be and how do we get there? In Post-proceedings of the forum for information retrieval evaluation (FIRE 2013).

Sebastian Padó, Tae-Gil Noh, Asher Stern, Rui Wang, and Roberto Zanoli. 2015. Design and realization of a modular architecture for textual entailment. Natural Language Engineering, 21(02):167–200.

Joonsuk Park and Claire Cardie. 2014. Identifying appropriate support for propositions in online user comments. In Proceedings of the First Workshop on Argumentation Mining, pages 29–38. Association for Computational Linguistics.

Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 67–77. Association for Computational Linguistics.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1501–1510.

Reid Swanson, Brian Ecker, and Marilyn Walker. 2015. Argument mining: Extracting arguments from online dialogue. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2015), pages 217–227. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Ellen M Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 61–69. Springer-Verlag New York, Inc.

Douglas Walton, Christopher Reed, and Fabrizio Macagno. 2008. Argumentation schemes. Cambridge University Press.

Douglas Walton. 2005. Argumentation methods for artificial intelligence in law. Springer.

Douglas Walton. 2012. Using argumentation schemes for argument extraction: A bottom-up method. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 6(3):33–61.

Adam Wyner, Tom van Engers, and Anthony Hunter. 2010. Working on the argument pipeline: Through flow issues between natural language argument, instantiated arguments, and argumentation frameworks. In Proceedings of the Workshop on Computational Models of Natural Argument.

Summarising the points made in online political debates

Charlie Egan, Advaith Siddharthan, Adam Wyner
Computing Science, University of Aberdeen
[email protected] · [email protected] · [email protected]

Abstract

Online communities host growing numbers of discussions amongst large groups of participants on all manner of topics. This user-generated content contains millions of statements of opinions and ideas. We propose an abstractive approach to summarize such argumentative discussions, making key content accessible through 'point' extraction, where a point is a verb and its syntactic arguments. Our approach uses both dependency parse information and verb case frames to identify and extract valid points, and generates an abstractive summary that discusses the key points being made in the debate. We performed a human evaluation of our approach using a corpus of online political debates and report significant improvements over a high-performing extractive summarizer.

1 Introduction

People increasingly engage in and contribute to online discussions and debates about topics in a range of areas, e.g. film, politics, consumer items, and science. Participants may make points and counterpoints, agreeing and disagreeing with others. These online argumentative discussions are an untapped resource of ideas. A high-level, summarised view of a discussion, grouping information and presenting points and counter-points, would be useful and interesting: retailers could analyse product reviews; consumers could zero in on what products to buy; and social scientists could gain insight on social trends. Yet, due to the size and complexity of the discussions and the limitations of summarisers based on sentence extraction, much of the useful information in discussions is inaccessible.

In this paper, we propose a fully automatic and domain-neutral unsupervised approach to abstractive summarisation which makes the key content of such discussions accessible. At the core of our approach is the notion of a 'point' – a short statement derived from a verb and its syntactic arguments. Points (and counter-points) from across the corpus are analysed and clustered to derive a summary of the discussion. To evaluate our approach, we used a corpus of political debates (Walker et al., 2012), then compared summaries generated by our tool against a high-performing extractive summariser (Nenkova et al., 2006). We report that our summariser improves significantly on this extractive baseline.

2 Related Work

Text summarisation is a well-established task in the field of NLP, with most systems based on sentence selection and scoring, with possibly some post-editing to shorten or fuse sentences (Nenkova and McKeown, 2011). The vast majority of systems have been developed for the news domain or on structured texts such as science (Teufel and Moens, 2002). In related work on mailing list data, one approach clustered messages into subtopics and used centring to select sentences for an extractive summary (Newman and Blitzer, 2003). The concept of recurring and related subtopics has been highlighted (Zhou and Hovy, 2006) as being of greater importance to discussion summarisation than to the summarisation of newswire data. In 'opinion summarisation', sentences have also been grouped based on the feature discussed, to generate a summary of all the reviews for a product that minimised repetition (Hu and Liu, 2004). There has also been interest in the summarisation of subjective content in discussions (Hu and Liu, 2004; Lloret et al., 2009; Galley et al., 2004).

In addition to summarisation, our work is concerned with argumentation, which for our purposes relates to expressions for or against some textual content. Galley et al. (2004) used adjacency pairs to target utterances that had been classified as being an agreement or disagreement. Others have investigated arguments raised in online discussion (Boltuzic and Šnajder, 2015; Cabrio and Villata, 2012; Ghosh et al., 2014). A prominent example of argument extraction applies supervised learning methods to automatically identify claims and counter-claims from a Wikipedia corpus (Levy et al., 2014).

In this paper, we explore the intersection of text summarisation and argument. We implement a novel summarisation approach that extracts and organises information as points rather than sentences. It generates structured summaries of argumentative discussions based on relationships between points, such as counterpoints or co-occurring points.

3 Methods

Our summariser is based on three components. The first robustly identifies and extracts points from text, providing data for subsequent analysis. Given plain text from a discussion, we obtain (a) a pattern or signature that could be used to link points – regardless of their exact phrasing – and (b) a short readable extract that could be used to present the point to readers in a summary. A second component performs a number of refinements on the list of points, such as removing meaningless points. The third component builds on these extracted points by connecting them in different ways (e.g., as point and counterpoint, or co-occurring points) to model the discussion. From this, it formulates a structured summary that we show to be useful to readers.

3.1 Point Extraction

We use the notion of a 'point' as the basis for our analysis – broadly speaking, it is a verb and its syntactic arguments. Points encapsulate both a human-readable 'extract' from the text as well as a pattern representing the core components that can be used to match and compare points. Extracts and patterns are stored as attributes in a key-value structure that represents a point.

Consider this sentence from a debate about abortion: "I don't think so, an unborn child (however old) is not yet a human." Other sentences may also relate to this idea that a child is not human until it is born; e.g. "So you say: children are not complete humans until birth?" Both discuss the point represented by the grammatically indexed pattern child.subject be.verb human.object. Note that we are at this stage not concerned with the stance towards the point being discussed; we will return to this later. To facilitate readability, the extracted points are associated with an 'extract' from the source sentence; in these instances, "an unborn child is not yet a human" and "children are not complete humans until birth?" Generation of points bears a passing resemblance to Text Simplification (Siddharthan, 2014), but is focused on generating a single short sentence starting from a verb, rather than splitting sentences into shorter ones.

Points and extracts are derived from a dependency parse graph structure (De Marneffe et al., 2014). Consider the sentence "A fetus has rights.": in its dependency parse, 'fetus' is the nominal subject and 'rights' the direct object of the verb 'has', with a determiner and sentence-final punctuation also attached. [The original paper shows this parse as a dependency arc diagram.] Here, the nominal subject and direct object relations form the pattern, and relations are followed recursively to generate the extract. To solve the general case, we must select dependency relations to include in the pattern and then decide which should be followed from the verb to include in the extract.
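The pattern-and-extract idea can be illustrated with a short sketch. spaCy is used here purely for convenience; the paper itself works with Universal Stanford Dependencies, so the dependency labels and the restriction to a single subject/object argument pair below are simplifying assumptions, not the authors' implementation.

```python
# Illustrative extraction of a point pattern (verb plus subject/object arguments)
# and a readable extract from a dependency parse.
# Requires the spaCy model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def point_patterns(sentence):
    """Yield (pattern, extract) pairs, e.g. ('fetus.nsubj have.verb right.dobj', ...)."""
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ not in ("VERB", "AUX"):
            continue
        subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        obj = [c for c in token.children if c.dep_ in ("dobj", "attr")]
        if subj and obj:
            pattern = (f"{subj[0].lemma_}.nsubj "
                       f"{token.lemma_}.verb "
                       f"{obj[0].lemma_}.dobj")
            extract = " ".join(t.text for t in sorted(
                list(subj[0].subtree) + [token] + list(obj[0].subtree),
                key=lambda t: t.i))
            yield pattern, extract

for pattern, extract in point_patterns("A fetus has rights."):
    print(pattern, "->", extract)
    # e.g. fetus.nsubj have.verb right.dobj -> A fetus has rights
```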
3.1.1 Using Verb Frames

We seek to include only those verb dependencies that are required by syntax or are optional but important to the core idea. While this often means using only subject and object relations, this is not always the case. Some dependencies, like adverbial modifiers or parataxis, which do not introduce information relevant to the point's core idea, are universally excluded.

For the rest, we identify valid verb frames using FrameNet, available as part of VerbNet, an XML verb lexicon (Schuler, 2005; Fillmore et al., 2002). Represented in VerbNet's 274 'classes' are 4402 member verbs. For each of these verb classes, a wide range of attributes are listed. FrameNet frames are one such attribute; these describe the verb's syntactic arguments and the semantic role of each in that frame. An example frame for the verb 'murder' is shown in Figure 1. Here we see that the verb takes two Noun Phrase arguments, an Agent ('murderer') and a Patient ('victim').

[Figure 1: VerbNet frame for 'murder', illustrated with the example sentence "Brutus murdered Julius Cesar."]

We use a frame's syntactic information to determine the dependencies to include in the pattern for a given verb. This use of frames for generation has parallels to methods used in abstractive summarisation for generating noun phrase references to named entities (Siddharthan and McKeown, 2005; Siddharthan et al., 2011). To create a 'verb index', we parsed the VerbNet catalogue into a key-value structure where each verb was the key and the list of allowed frames the value. Points were extracted by querying the dependency parse relative to information from the verb's frames.

With an index of verbs and their frames, all the information required to identify points in parses is available. However, as frames are not inherently queryable with respect to a dependency graph structure, queries for each type of frame were written. While frames in different categories encode additional semantic information, many frames share the same basic syntax. Common frames such as NounPhrase Verb NounPhrase cover a high percentage of all frames in the index. We have implemented a means of querying dependency parses for 17 of the more common patterns, which cover 96% of all frames in the index. To do this, we used the dependency parses for the example sentences listed in frames to identify the correct mappings. The remaining 4% of frames, as well as frames not covered by FrameNet, were matched using a 'Generic Frame' and a new query that could be run against any dependency graph to extract subjects, objects and open clausal complements.

3.1.2 Human Readable Extracts

Our approach to generating human-readable extracts for a point can be summarised as follows: recursively follow dependencies from the verb to allowed dependents until the sub-tree is fully explored. Nodes in the graph that are related to the verb, or (recursively) any of its dependents, are returned as part of the extract for the point. However, to keep points succinct the following dependency relations are excluded: adverbial clause modifiers, clausal subjects, clausal complements, generic dependencies and parataxis. Generic dependencies occur when the parser is unable to determine the dependency type. These either arise from errors or long-distance dependencies and are rarely useful in extracts. The other blacklisted dependencies are clausal in nature and tend to connect points, rather than define them. The returned tokens in this recursive search, presented in the original order, provide us with a sub-sentence extract for the point pattern.

3.2 Point Curation

To better cluster extracted points into distinct ideas, we curated points. We merged subject pronouns such as 'I' or 'she' under a single 'person' subject, as these were found to be used interchangeably and not to reference particular people in the text; for example, points such as she.nsubj have.verb abortion.dobj and I.nsubj have.verb abortion.dobj were merged under a new pattern: PERSON.nsubj have.verb abortion.dobj. Homogenising points in this way means we can continue to rely on a cluster's size as a measure of importance in the summarisation task.
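A minimal sketch of this pronoun-merging step is shown below; the list of pronouns collapsed to PERSON is an assumption, as the paper does not enumerate it.

```python
# Sketch of the pronoun-merging step in point curation: subject pronouns are
# collapsed to a single PERSON subject so that cluster sizes remain meaningful.
# The exact pronoun list used by the authors is not given; this one is assumed.
PERSON_PRONOUNS = {"i", "you", "he", "she", "we", "they"}

def normalise_pattern(pattern):
    """Map e.g. 'she.nsubj have.verb abortion.dobj' to 'PERSON.nsubj have.verb abortion.dobj'."""
    subject, *rest = pattern.split(" ")
    word, relation = subject.split(".")
    if word.lower() in PERSON_PRONOUNS:
        subject = f"PERSON.{relation}"
    return " ".join([subject] + rest)

assert (normalise_pattern("she.nsubj have.verb abortion.dobj")
        == "PERSON.nsubj have.verb abortion.dobj")
```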

A number of points are also removed using a series of 'blacklists'. Based on points extracted from the Abortion debate (1,151 posts, ~155,000 words), which we used for development, these blacklists define generic point patterns that were judged to be either of little interest or problematic in other ways. For example, patterns such as it.nsubj have.verb rights.dobj contain referential pronouns that are hard to resolve. We excluded points with the following subjects: it_PRP, that_DT, this_DT, which_WDT, what_WDT. We also excluded a set of verbs with a PERSON subject; certain phrases such as "I think" or "I object" are very common, but relate to attribution or higher argumentation steps rather than point content. Other common cue phrases such as "make the claim" were also removed.

3.3 Summary Generation

Our goal is the abstractive summarisation of argumentative texts. Extracted points have 'patterns' that enable new comparisons not possible with sentence selection approaches to summarisation, for instance, the analysis of counterpoints. This section describes the process of generating an abstractive summary.

3.3.1 Extract Generation

Extract Filtering: In a cluster of points with the same pattern there are a range of extracts that could be selected. There is much variation in extract quality caused by poor parses, punctuation or extract generation. We implemented a set of rules that prevent a poor quality extract from being presented. Predominantly, points were prevented from being presented based on the presence of certain substring patterns tested with regular expressions. Exclusion patterns included two or more consecutive words in block capitals, repeated words or a mid-word case change. Following on from these basic tests, there were more complex exclusion patterns based on the dependencies obtained from re-parsing the extract. Poor quality extracts often contained (on re-parsing) clausal or generic dependencies, or multiple instances of conjunction. Such extracts were excluded.

Extract Selection: Point extracts were organised in clusters sharing a common pattern of verb arguments. Such clusters contained all the point extracts for the same point pattern and thus all the available linguistic realisations expressing the cluster's core idea. Even after filtering out some extracts, as described above, there was still much variation in the quality of the extracts. Take this example cluster of generated extracts about the Genesis creation narrative:

  "The world was created in six days."
  "The world was created in exactly 6 days."
  "Is there that the world could have been created in six days."
  "The world was created by God in seven days."
  "The world was created in 6 days."
  "But, was the world created in six days."
  "How the world was created in six days."

All of these passed the 'Extract Filtering' stage. Now an extract must be selected to represent the cluster. In this instance our approach selected the fifth point, "The world was created in 6 days." Selecting the best extract was performed every time a cluster was selected for use in a summary. Selections were made using a length-weighted, bigram model of the extract words in the cluster, in order to select a succinct extract that was representative of the entire cluster.

Extract Presentation: Our point extraction approach works by selecting the relevant components in a string for a given point, using the dependency graph. While this has a key advantage in creating shorter content units, it also means that extracts are often poorly formatted for presentation when viewed in isolation (not capitalised, leading commas, etc.). To overcome such issues we post-edit the selected extracts to ensure the following properties: the first character is capitalised; the extract ends in a period; commas are followed but not preceded by a space; contractions are applied where possible; and consecutive punctuation marks are condensed or removed. Certain determiners, adverbs and conjunctions (because, that, therefore) are also removed from the start of extracts. With these adjustments, extracts can typically be presented as short sentences.
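A sketch covering a subset of these post-editing rules is given below; the exact rule set, ordering and contraction handling used by the authors may differ.

```python
# Sketch of some of the extract post-editing rules (capitalisation, final period,
# comma spacing, collapsing runs of punctuation, removing leading cue words).
import re

def tidy_extract(text):
    text = text.strip()
    text = re.sub(r"([.,!?])\1+", r"\1", text)     # condense repeated punctuation
    text = re.sub(r"\s+([.,!?])", r"\1", text)     # no space before punctuation
    text = re.sub(r",(?=\S)", ", ", text)          # one space after commas
    text = re.sub(r"^(because|that|therefore)\s+", "", text, flags=re.IGNORECASE)
    text = re.sub(r"[,;:]+$", "", text)            # drop dangling trailing commas
    text = text[:1].upper() + text[1:]
    if not text.endswith((".", "!", "?")):
        text += "."
    return text

print(tidy_extract("because the world was created in 6 days ,,"))
# -> "The world was created in 6 days."
```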

3.3.2 Content Selection

A cluster's inclusion in a particular summary section is a function of the number of points in the cluster. This is based on the idea that larger clusters are of greater importance (as the point is made more often). Frequency is commonly used to order content in summarisation research for this reason; however, in argumentative texts it could result in the suppression of minority viewpoints. Identifying such views might be an interesting challenge, but is out of scope for this paper.

Our summaries are organised as sections to highlight various aspects of the debate (see below). To avoid larger clusters being repeatedly selected for each summary section, a list of used patterns and extracts is maintained. When a point is used in a summary section it is 'spent' and added to a list of used patterns and extracts. The point pattern, string, lemmas and subject-verb-object triple are added to this list. Any point that matches any element in this list of used identifiers cannot be used later in the summary.

3.3.3 Summary Sections

A summary could be generated just by listing the most frequent points in the discussion. However, we were interested in generating more structured summaries that group points in different ways, i.e. counterpoints and co-occurring points.

Counter Points: This analysis was intended to highlight areas of disagreement in the discussion. Counterpoints were matched on one of two possible criteria: the presence of either negation or an antonym. Potential antonym-derived counterpoints, for a given point, were generated using its pattern and a list of antonyms. Antonyms were sourced from WordNet (Miller, 1995). Taking woman.nsubj have.verb right.dobj as an example pattern, the following potential counterpoint patterns are generated:

• man.nsubj have.verb right.dobj
• woman.nsubj lack.verb right.dobj
• woman.nsubj have.verb left.dobj

Where there were many pattern words with antonym matches, multiple potential counterpoint patterns were generated. Such hypothesised antonym patterns were rejected if the pattern did not occur in the debate. From the example above, only the first generated pattern, man.nsubj have.verb right.dobj, appeared in the debate.

Negation terminology was not commonly part of the point pattern; for example, the woman.nsubj have.verb right.dobj cluster could include both "A woman has the right" and "A woman does not have the right" as extracts. To identify negated forms within clusters, we instead pattern matched for negated words in the point extracts. First the cluster was split into two groups, extracts with negation terminology and those without. The Cartesian product of these two groups gave all pairs of negated and non-negated extracts. For each pair a string difference was computed, which was used to identify a pair for use in the summary. Point-counterpoint pairs were selected for the summary based on the average cluster size for the point and counterpoint patterns. In the summary section with the heading "people disagree on these points", only the extract for the point is displayed, not the counterpoint.
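To make the antonym-based counterpoint generation concrete, the sketch below derives candidate patterns by swapping in WordNet antonyms; the function names are ours, and candidates would still have to be filtered against the patterns actually observed in the debate, as described above.

```python
# Sketch of antonym-derived counterpoint candidates for a point pattern.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def antonyms(word):
    ants = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                ants.add(antonym.name())
    return ants

def counterpoint_candidates(pattern):
    """E.g. 'woman.nsubj have.verb right.dobj' -> patterns with one word swapped."""
    parts = [p.split(".") for p in pattern.split(" ")]
    for i, (word, relation) in enumerate(parts):
        for antonym in antonyms(word):
            swapped = list(parts)
            swapped[i] = (antonym, relation)
            yield " ".join(f"{w}.{r}" for w, r in swapped)

for candidate in counterpoint_candidates("woman.nsubj have.verb right.dobj"):
    print(candidate)   # includes man.nsubj have.verb right.dobj, among others
```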

Co-occurring Points: As well as counterpoints we were also interested in presenting associated points, i.e. those frequently raised in conjunction with one another. To identify co-occurring points, each post in the discussion was first represented as a list of the points it made. Taking all pairwise combinations of the points made in a post, for all posts, we generated a list of all co-occurring point pairs. The most common pairs of patterns were selected for use in the summary. Co-occurring pairs were rejected if they were too similar – patterns must differ by more than one component. For example, woman.nsubj have.verb choice.dobj could not be displayed as co-occurring with woman.nsubj have.verb rights.dobj, but could be with fetus.nsubj have.verb rights.dobj.

Additional Summary Sections: First, points from the largest (previously) unused clusters were selected. Then we organised points by topic terms, defined here as commonly used nouns. The subjects and objects in all points were tallied to give a ranking of topic terms. Using these common topics, points containing them were selected and displayed in a dedicated section for that topic.
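As an illustration of the topic-term ranking used for these sections, the sketch below tallies the subject and object components of point patterns; the input patterns are invented examples, not corpus data.

```python
# Sketch of topic-term ranking: tally the subject and object words of all
# extracted point patterns and treat the most frequent ones as topic terms.
from collections import Counter

def topic_terms(patterns, top_k=3):
    counts = Counter()
    for pattern in patterns:
        subject, _verb, obj = pattern.split(" ")
        counts[subject.split(".")[0]] += 1
        counts[obj.split(".")[0]] += 1
    return [term for term, _ in counts.most_common(top_k)]

patterns = ["people.nsubj kill.verb people.dobj",
            "gun.nsubj kill.verb people.dobj",
            "militia.nsubj be.verb people.dobj"]
print(topic_terms(patterns))   # ['people', 'gun', 'militia']
```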

[Figure 2: Examples of Layout vs Plain Summaries (gun-ownership debate). The Layout summary organises the extracted points under section headings such as "People disagree on these points:", "Commonly occurring points made in the discussion were:", "Users that talk about X often talk about Y.", "Common points made in the discussion linking terms were:", "Points for commonly discussed topics:" (Gun, People, Government), "Points about multiple topics:", and "People ask questions like:". The Plain summary presents the same kind of point extracts (e.g., "People kill people.", "Guns don't kill people.", "Do guns make you safer?") as a flat list without headings.]

Most large clusters have a pattern with three components. Points with longer patterns are less common but often offer more developed extracts (e.g. "The human life cycle begins at conception."). Longer points were selected based on the number of components in the pattern. An alternative to selecting points with a longer pattern is to instead select points that mention more than one important topic word. Extracts were sorted on the number of topic words they include. Extracts were selected from the top 100 to complete this section using the extract selection process.

As a final idea for a summary section we included a list of questions that had been asked a number of times. Questions were much less commonly repeated and this section was therefore more an illustration than a summary.

4 Evaluation

Studies were carried out using five political debates from the Internet Argument Corpus (Walker et al., 2012): creation, gay rights, the existence of god, gun ownership and healthcare (the sixth, abortion rights, was used as a development set). This corpus was extracted from the online debate site 4forums (www.4forums.com/political) and is a large collection of unscripted argumentative dialogues based on 390,000 posts.

25 study participants who had the 'Masters' qualification were recruited using Amazon Mechanical Turk. Each comparison required a participant to read a stock summary and exactly one of the other two (plain and layout) in a random order. Figure 2 provides examples of these, which are also defined below.

Figure 3: Counts of participant responses when comparing Plain & Stock.

Figure 4: Counts of participant responses when comparing Layout & Stock.

• Stock: A summary generated using an implementation of the state-of-the-art sentence extraction approach described by Nenkova et al. (2006).
• Plain: A collection of point extracts with the same unstructured style and length as the Stock summary.
• Layout: A summary that adds explanatory text introducing different sections of points.

The Stock summaries were controlled to be the same length as the other summary in the comparison, for a fair comparison. Participants were asked to compare the two summaries on the following factors:

• Content Interest / Informativeness: the summary presents varied and interesting content.
• Readability: the summary contents make sense, work without context, aren't repetitive, and are easy to read.
• Punctuation & Presentation: the summary contents are correctly formatted as sentences, with correct punctuation, capital letters and sentence case.
• Organisation: related points occur near one another.

Finally, they were asked to give an overall rating and justify their response using free text. Nine independent ratings were obtained for each pair of summaries for each of the five debates using a balanced design.

4.1 Results

The study made comparisons between two pairs of summary types: Plain vs. Stock and Layout vs. Stock.

All five comparison factors presented in Figure 3 show a preference for our Plain summaries. These counts are aggregated from all Plain vs. Stock comparisons for all five political debates. Each histogram represents 45 responses for a question comparing the two summaries on that factor. The results were tested using Sign tests for each comparison factor, with 'better' and 'much better' aggregated and 'same' results excluded. The family significance level was set at α = 0.05; with m = 5 hypotheses and the Bonferroni correction (α/m), this gives an individual significance threshold of 0.05/5 = 0.01. 'Overall', 'content', 'readability', 'punctuation' and 'organisation' were all found to show a significant difference (p < 0.0001 for each); i.e. even unstructured summaries with content at the point level were overwhelmingly preferred to state-of-the-art sentence selection.
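The per-factor significance test can be reproduced along the following lines; the counts in the example are placeholders rather than the study's actual response counts, and scipy's exact binomial test is used here as the sign test.

```python
# Sketch of the per-factor sign test with Bonferroni correction described above.
# `prefer_a` is the number of responses favouring the point-based summary,
# `prefer_b` those favouring Stock ("same" responses are excluded beforehand).
from scipy.stats import binomtest

def sign_test(prefer_a, prefer_b, num_hypotheses=5, family_alpha=0.05):
    threshold = family_alpha / num_hypotheses     # Bonferroni: 0.05 / 5 = 0.01
    result = binomtest(prefer_a, prefer_a + prefer_b, p=0.5)
    return result.pvalue, result.pvalue < threshold

p, significant = sign_test(prefer_a=38, prefer_b=4)   # placeholder counts
print(f"p = {p:.5f}, significant at corrected threshold: {significant}")
```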

Similarly, when Layout was compared against Stock on the same factors, we observed an even stronger preference for the Layout summaries (see Figure 4). To test the increased preference for Layout vs. Stock, compared to Plain, we used a one-sided Fisher's Exact Test. Taking the 45 responses for 'overall' ratings from both comparisons, the test was performed on the contingency table (Table 1). The p-value was found to be significant (p = 0.008); i.e. the structuring of points into sections with descriptions is preferred to the flat representation.

Table 1: Plain & Layout vs Stock responses contingency table.

                   Plain (A) vs. Stock (B)   Layout (A) vs. Stock (B)   Row Total
  A much better    13                        27                         40
  A better         19                        14                         33
  A same           8                         1                          9
  B better         3                         1                          4
  B much better    2                         2                          4
  Column Total     45                        45                         90

4.2 Discussion

The quantitative results above show a preference for both Plain and Layout point-based summaries compared to Stock. We had also solicited free textual feedback; these comments are summarised here.

Multiple comments made reference to Plain summaries having fewer questions, less surplus information and more content. Comments also described the content as being "proper English" and using "complete sentences". Comments also suggested some participants believed the summaries had been written by a human. References were also made to higher-level properties of both summaries such as "logical flow", "relies on fallacy", "explains the reasoning", as well as factual correctness. In summary, participants acknowledged the succinctness, variety and informativeness of the Plain summaries. This shows points can form good summaries, even without structuring into sections or explaining the links.

For comments left about preferences for Layout summaries, references to organisation doubled with respect to preferences for Plain. Readability and the idea of assimilating information were also common factors cited in justifications. Interestingly, only one comment made a direct reference to 'categories' (sections) of the summary. We had expected more references to summary sections. Fewer comments in this comparison referenced human authors; sections perhaps hint at a more mechanised approach.

5 Conclusions and Future Work

We have implemented a method for extracting meaningful content units called points, then grouping points into discussion summaries. We evaluated our approach in a comparison against summaries generated by a statistical sentence extraction tool. The comparison results were very positive, with both our summary types performing significantly better. This indicates that our approach is a viable foundation for discussion summarisation. Moreover, the summaries structure the points, for instance by whether points are countered, or whether they link different topic terms. We see this project as a step forward in the process of better understanding online discussion.

For future work, we think the approach's general methods can be applied to tasks beyond summarisation in political debate, product reviews, and other areas. It would be attractive to have a web application that would take some discussion corpus as input and generate a summary, with an interface that could support exploration and filtering of summaries based on the user's interest, for example, using a discussion graph built from point noun component nodes connected by verb edges. Currently the approach models discussions as a flat list of posts, without reply/response annotations. Using hierarchical discussion threads opens up interesting opportunities for Argument Mining using point extraction as a basis. A new summary section that listed points commonly made in response to other points in other posts would be a valuable addition. There is also potential for further work on summary presentation. Comments by participants in the evaluation also suggested that it would be useful to present the frequencies for points to highlight their importance, and to be able to click on points in an interactive manner to see them in the context of the posts.

Acknowledgements

This work was supported by the Economic and Social Research Council [Grant number ES/M001628/1].

References

[Boltuzic and Šnajder2015] Filip Boltuzic and Jan Šnajder. 2015. Identifying prominent arguments in online debates using semantic textual similarity. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 110–115.

[Cabrio and Villata2012] Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers – Volume 2, pages 208–212. Association for Computational Linguistics.

[De Marneffe et al.2014] Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In LREC, volume 14, pages 4585–4592.

[Fillmore et al.2002] Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato. 2002. The FrameNet database and software tools. In LREC.

[Galley et al.2004] Michel Galley, Kathleen McKeown, Julia Hirschberg, and Elizabeth Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 669. Association for Computational Linguistics.

[Ghosh et al.2014] Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48.

[Hu and Liu2004] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.

[Levy et al.2014] Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 1489–1500.

[Lloret et al.2009] Elena Lloret, Alexandra Balahur, Manuel Palomar, and Andrés Montoyo. 2009. Towards building a competitive opinion summarization system: challenges and keys. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 72–77. Association for Computational Linguistics.

[Miller1995] George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November.

[Nenkova and McKeown2011] Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2-3):103–233.

[Nenkova et al.2006] Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 573–580. ACM.

[Newman and Blitzer2003] Paula S. Newman and John C. Blitzer. 2003. Summarizing archived discussions: a beginning. In Proceedings of the 8th international conference on Intelligent user interfaces, pages 273–276. ACM.

[Schuler2005] Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

[Siddharthan and McKeown2005] Advaith Siddharthan and Kathleen McKeown. 2005. Improving multilingual summarization: using redundancy in the input to correct MT errors. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 33–40. Association for Computational Linguistics.

[Siddharthan et al.2011] Advaith Siddharthan, Ani Nenkova, and Kathleen McKeown. 2011. Information status distinctions and referring expressions: An empirical study of references to people in news summaries. Computational Linguistics, 37(4):811–842.

[Siddharthan2014] Advaith Siddharthan. 2014. A survey of research on text simplification. ITL – International Journal of Applied Linguistics, 165(2):259–298.

[Teufel and Moens2002] Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445.

[Walker et al.2012] Marilyn A. Walker, Jean E. Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 812–817.

[Zhou and Hovy2006] Liang Zhou and Eduard H. Hovy. 2006. On the summarization of dynamically introduced information: Online discussions and blogs. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, page 237.

What to Do with an Airport? Mining Arguments in the German Online Participation Project Tempelhofer Feld

Matthias Liebeck¹  Katharina Esau²  Stefan Conrad¹
¹Institute of Computer Science, Heinrich Heine University Düsseldorf, Germany
{liebeck,conrad}@cs.uni-duesseldorf.de
²Institute of Social Sciences, Heinrich Heine University Düsseldorf, Germany
[email protected]

Abstract

This paper focuses on the automated extraction of argument components from user content in the German online participation project Tempelhofer Feld. We adapt existing argumentation models into a new model for decision-oriented online participation. Our model consists of three categories: major positions, claims, and premises. We create a new German corpus for argument mining by annotating our dataset with our model. Afterwards, we focus on the two classification tasks of identifying argumentative sentences and predicting argument components in sentences. We achieve macro-averaged F1 measures of 69.77% and 68.5%, respectively.

1 Introduction

In the last few years in Germany, more and more cities are offering their citizens an internet-based way to participate in drafting ideas for urban planning or in local political issues. The administrations utilize websites to gather the opinions of their citizens and to include them in their decision making. For example, the German town Ludwigshafen has an elevated highway that is damaged and has to be demolished. Experts created four variants for a replacement and Ludwigshafen¹ asked its citizens to discuss them and to gather arguments for and against each variant, which were considered in the final political decision. Other cities such as Lörrach² tap into ideas of their citizens for a sustainable urban development, and cities such as Darmstadt³ and Bonn⁴ also gather proposals in participatory budgeting. In general, these platforms are accompanied by offline events to inform residents and to allow for discussions with citizens who cannot or do not want to participate online. In the following, the term online participation refers to the involvement of citizens in relevant political or administrative decisions.

A participation process usually revolves around a specific subject area that is determined by the organizer. In a city, the administration might aim to collect ideas to improve a certain situation (e.g., how it can beautify a park). Aside from politics, companies or institutions can use online participation for policy drafting, for example, in universities (Escher et al., 2016).

By contrast, there are also platforms whose purpose is to report defects (e.g., a road in need of repair or a street lamp that needs replacing), which we do not regard further because they are only used for reporting issues and do not encourage discussions between citizens. In the scope of this paper, we focus only on the subset of online participation projects that aim to gather options for actions or decisions (e.g., "We should build an opera." or "Should we close the golf course or the soccer field?"). Given a large number of suggestions and comments from citizens, we want to automatically identify options for actions and decisions, extract reasons for or against them (e.g., "This would improve the cultural offerings of our city.") and detect users' stances (e.g., "I totally agree!").

As far as we know, it is rather rare in a municipal administration that such participation processes can be attended to by full-time employees, because they have other responsibilities as well.

¹https://www.ludwigshafen-diskutiert.de
²https://gestalten.loerrach.de
³https://da-bei.darmstadt.de/discuss/Buergerhaushalt2014
⁴https://bonn-macht-mit.de

If a process is well received by the general public, it might attract hundreds of suggestions and thousands of comments. The manual analysis of this data is time consuming and could take months. For budgetary reasons, it might also not be possible to outsource the analysis. It is therefore possible that an online participation process was a success and a large number of text contributions was created, but that not all of the content can be taken into account. To avoid that huge amounts of text content become unprocessable, it is necessary to utilize automated techniques to ensure timely processing. To the best of our knowledge, the automated extraction of argument components in the form of mining decision options and pro and contra arguments from German online participation projects in a political context is a research gap that we try to fill.

The remainder of the paper is structured as follows: Section 2 describes related work in argument mining. Section 3 explains our data, our annotation model and the annotation process. Our argument mining approach is described in section 4. We conclude and outline future work in section 5.

2 Related Work

Argumentation mining is an evolving research topic that deals with the automatic extraction of argument components from text. Most research focuses on English text, but there is also research for German (Houy et al., 2013; Habernal et al., 2014; Eckle-Kohler et al., 2015) and for the Greek language (Goudas et al., 2014).

Previous research spans a variety of domains, such as the legal domain (Palau and Moens, 2009; Houy et al., 2013), eRulemaking (Park and Cardie, 2014), student essays (Stab and Gurevych, 2014b), news (Eckle-Kohler et al., 2015), and web content (Goudas et al., 2014; Habernal and Gurevych, 2015). Most of the papers share common tasks, such as separating text into argumentative and non-argumentative parts, classifying argumentative text into argument components and identifying relations between them. Currently, there is no argument model that most researchers agree upon. The chosen argument model often depends on the tasks and the application domain. However, most of the recent research agrees that the two argument components claim and premise are usually part of the chosen argument models.

Most of the researched domains offer a high text quality. For instance, in the news domain, the text content is usually editorially reviewed before publishing. Since our text content is from the web, it partially lacks proper spelling or grammar and is sometimes difficult to understand. Nevertheless, it is important to develop methods for processing web content because everyone's opinion should be considered, especially in a political context.

Another characteristic of our application domain is the presence of discourse between different users. In an online participation platform, users often write comments that refer to other people's suggestions or justifications. This differs from other domains, such as newspaper articles and student essays, where text content is rather monologic.

To evaluate the performance of an argumentation mining system, datasets are manually annotated (which results in a gold standard), for instance with argument components. More recent publications (Stab and Gurevych, 2014a; Park and Cardie, 2014; Habernal et al., 2014; Eckle-Kohler et al., 2015) report inter-annotator agreement values that quantify how well multiple annotators agree on their annotations. Due to different available inter-annotator agreement measures and different annotation lengths (tokens, sentences or freely assignable spans), there is currently no standardized single measure for inter-annotator agreement in the argumentation mining community. As a result, we report multiple values to ensure better comparability in the future. A detailed overview of annotation studies can be found in (Habernal and Gurevych, 2016).

There has been previous research on automatically mining people's opinions in the context of political decisions, referred to as policy making (Florou et al., 2013) and as eRulemaking (Park and Cardie, 2014; Park et al., 2015a), both of which relate to our application domain of online participation.

Florou et al. (2013) crawled Greek web pages and social media. The authors aim to extract arguments that are in support of or in opposition to public policies. As a first step, they automatically classify text segments as argumentative or non-argumentative, although they do not describe what they consider as argumentative and they do not refer to argumentation theory. In our approach, we use text content from a specific platform (instead of crawling multiple sources); we define three different argument components and their practical use; we relate to existing argumentation theory; and we further distinguish argument components in argumentative sentences.
Park and Cardie (2014) focus on English comments in the eRulemaking website Regulation Room. In their approach, they propose a model for eRulemaking that aims at verifiability by classifying propositions as unverifiable, verifiable experiential, and verifiable non-experiential. With their best feature set, they achieve a macro-averaged F1 of 68.99% with a support vector machine. Park et al. (2015b) discuss the results of conditional random fields as a machine learning approach. In our approach, we aim at identifying components and leave the issue of evaluability up to experts or to the wisdom of the crowd.

3 Data

This section discusses the data from the online participation project Tempelhofer Feld, presents our argumentation model, and describes our annotation process.

3.1 Background

The Tempelhofer Feld project⁵ is an online participation project that focuses on the former airport Berlin-Tempelhof (THF) and its future use. Air traffic at the airport ceased in 2008. Until today, the 300 hectare area of the airport is mostly open space, which can be used for recreation.

In 2014, the ThF-Gesetz (ThF law) entered into force. It protects the large open space of the field, which is unique in Berlin, and limits structural changes, for example by prohibiting the construction of new buildings on the field.⁶ The participation process was commissioned by Berlin's Senate Department for Urban Development and the Environment.

The project aims to collect ideas that improve the field for visitors while adhering to the ThF law, like setting up drinking fountains.

3.2 Discussion platform

The Tempelhofer Feld project uses the open-source policy drafting and discussion platform Adhocracy⁷. In Adhocracy, users can create proposals, which are text-based ideas or suggestions that contain a title and text content. To encourage discussions, users can comment on proposals and respond to previous comments, which results in a tree-structured discussion per proposal. Adhocracy provides a voting system to upvote and downvote content. Therefore, users with limited time can follow the wisdom of the crowd by sorting proposals by their votes.

In the Tempelhofer Feld online participation process, users can register anonymously. Their voting behavior is public (it is possible to see which content was upvoted or downvoted by a specific user) and their text content is licensed under the Creative Commons License, which makes it attractive for academic use.

The official submission phase for proposals ran from November 2014 until the end of March 2015. Afterwards, the proposals were condensed in offline events between May 2015 and July 2015. Until 2015-07-07, the users proposed 340 ideas and wrote 1389 comments. The comments vary in length. On average, they contain 3.56 sentences (σ = 3.36) and 58.7 tokens (σ = 65.7).

Each proposal is tagged with one out of seven categories. We excluded two categories because they mostly contain meta-discussions or serve as a "doesn't-fit-anywhere-else" category. This leaves the remaining five categories: Bewirtschaftung (cultivation), Erinnerung (memory), Freizeit (leisure), Mitmachen (participate), and Natur (nature).

The excluded categories are still important for the participation project, but, for the time being, we focus on proposals that contain ideas or suggestions that can potentially be realized. We do not judge whether it makes sense to realize the proposal or not. If someone wants to construct a roof over the whole area or wants to scatter blue pudding on the lawn, we leave it up to the other users to judge the proposal by voting and commenting on reasons for or against the realization, which we want to automatically extract.

We observed that the users occasionally did not use the platform correctly, by replying to a comment and referring to another previous comment.

It is worth mentioning that the participation process is not legally binding and that the most upvoted proposals do not become realized automatically. Although the participation process is encouraged by the politicians, the final decision as to which proposals will be realized is still up to them.

⁵https://tempelhofer-feld.berlin.de
⁶There are a few exceptions, like lighting, sanitary facilities, seating, and waste bins.
⁷https://github.com/liqd/adhocracy
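The per-comment statistics above (average sentence and token counts with their standard deviations) can be reproduced with a few lines of Python. The sketch below is an illustration rather than the authors' code: it assumes the comments are available as plain strings and uses NLTK as a stand-in for the OpenNLP sentence splitter used later in the paper.

```python
# Minimal sketch (not the authors' code): mean and standard deviation of
# sentences and tokens per comment. Requires the NLTK 'punkt' models
# (nltk.download("punkt")). NLTK stands in for OpenNLP here.
import statistics
from nltk.tokenize import sent_tokenize, word_tokenize

def comment_statistics(comments):
    """Return (mean, stdev) pairs for sentences and tokens per comment."""
    sentence_counts = [len(sent_tokenize(c, language="german")) for c in comments]
    token_counts = [len(word_tokenize(c, language="german")) for c in comments]
    return (
        (statistics.mean(sentence_counts), statistics.pstdev(sentence_counts)),
        (statistics.mean(token_counts), statistics.pstdev(token_counts)),
    )

if __name__ == "__main__":
    demo = ["Ich stimme zu! Das Feld sollte frei bleiben.",
            "Wir brauchen mehr Sitzgelegenheiten, weil es kaum Schatten gibt."]
    (sent_mean, sent_sd), (tok_mean, tok_sd) = comment_statistics(demo)
    print(f"sentences per comment: {sent_mean:.2f} (sigma = {sent_sd:.2f})")
    print(f"tokens per comment: {tok_mean:.2f} (sigma = {tok_sd:.2f})")
```

Whether the paper's σ values are population or sample standard deviations is not stated; the sketch uses the population variant.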
3.3 Argumentation Model

We have a practical point of view on text content in political online participation processes: To allow politicians to include the opinions expressed in the platform into their decision making, we need to extract three different components: (i) what do people want to be built or decided upon, (ii) how do people argue for and against these ideas, and (iii) how many people in the discussion say that they agree or disagree with them.

First, we tried to apply existing argumentation models that are commonly used in argument mining to our dataset, namely Toulmin's model (Toulmin, 1958) and the claim-premise model (based on (Freeman, 1991)). We quickly realized that attacks on logical conclusions are rather rare, that users frequently express their wishes and participate by providing reasons for and against other suggestions, and that we have to consider this behavior in the choice of an argumentation model.

Toulmin differentiates between six argument components: claim, ground/data, warrant, backing, qualifier and rebuttal. The model revolves around the claim, the statement of the argument which has to be proven or, in Toulmin's words, "whose merits we are seeking to establish" (Toulmin, 2003, p. 90). Grounds are the data that support the claim and serve "as a foundation for the claim" (Toulmin, 2003, p. 90). A ground is connected to the claim by a warrant, which justifies why the ground supports the claim. A warrant can be supported by a backing which establishes "authority" (Toulmin, 2003, p. 96) as to why the warrant is to be accepted. A qualifier specifies the degree of certainty or the "degree of force" (Toulmin, 2003, p. 93) of the claim, in respect of the ground. Rebuttals are conditions which "might be capable of defeating" (Toulmin, 2003, p. 94) the claim. With Toulmin's model, we come to the same conclusion as Habernal et al. (2014) that it is difficult to apply the model to online-generated discussions, especially when the users argue on a level where most of Toulmin's categories can only be applied very rarely.

The commonly used claim-premise model (Palau and Moens, 2009; Peldszus and Stede, 2013; Stab and Gurevych, 2014a; Eckle-Kohler et al., 2015) consists of the two components claim and premise. A claim "is the central component of an argument" (Stab and Gurevych, 2014a), whose merit is to be established. Premises are reasons that either support or attack a claim. According to Stab and Gurevych (2014a), a claim "should not be accepted by readers without additional support." Palau and Moens (2009) describe a claim as "an idea which is either true or false" and Stab and Gurevych (2014a) as a "controversial statement that is either true or false."

We share the opinion of Habernal et al. (2014) that there is no one-size-fits-all argumentation theory for web data and follow the recommendation that the argumentation model should be chosen for the task at hand. In our participation project, we are primarily interested in mining suggestions. This differs from the common focus on mining claims as the central component, because the definition of a claim stated above does not apply to our dataset: suggestions cannot be classified as true or false and they can be accepted without additional support, although justifications are commonly provided by the users.

We adapted the claim-premise family and its modification for persuasive essays in Stab and Gurevych (2014a) to a three-part model for modeling arguments in online participation processes: (i) major positions, (ii) claims, and (iii) premises.

Major positions are options for actions or decisions that occur in the discussion (e.g., "We should build a playground with a sandbox." or "The opening hours of the museum must be at least two hours longer."). They are most often someone's vision of something new or of a policy change. If another user suggests a modified version by changing some details, the new suggestion is a new major position (e.g., "We should build a playground with swings."). In our practical view, major positions are unique suggestions from citizens that politicians can decide on.

A claim is a pro or contra stance towards a major position (e.g., "Yes, we should definitely do that!"). In our model, claims are text passages in which users express their personal positionings (e.g., "I dislike your suggestion."). For a politician, the text content of a claim in our definition does not serve as a basis for decision making because claims do not contain justifications upon which decisions can be backed up. The purpose behind mining these claims is a conversion into two numbers that indicate how many citizens are for or against a suggestion.

The term premise is defined as a reason that attacks or supports a major position, a claim or another premise.
Premises are used to make an argumentation comprehensible for others, by reasoning why a suggestion or a decision should be realized or why it should be avoided (e.g., "This would allow us to save money."). We use the term premise in the same way as the claim-premise model and as Toulmin with grounds.

We do not evaluate if a reason is valid. We only determine the user's intent: If an annotator perceives that a user is providing a reason, we annotate it as such. Otherwise, we would have to evaluate each statement on a semantic level. For example, if a user argues that a suggestion violates a building law, the annotators would need to check this statement. A verification of all reasons for correctness would require too much expertise from annotators or a very large knowledge database. In our application domain, we leave the evaluation up to human experts who advise politicians.

Our argumentation model is illustrated in Figure 1.

[Figure 1: Our argumentation model for political online participation. Premises attack or support major positions, claims, or other premises; claims express a pro or contra stance towards a major position.]

If a sentence contains only one argument component, we annotate the whole sentence. If there is more than one argument component in a sentence, we annotate the different components separately, like in the following example: [claim: We don't need a new playground] [premise: because we already have one.]

Depending on the writing style of a user, a thought or idea might be expressed in more than one sentence. In such a case, we combine all successive sentences of the same thought into a group: [major position: The city should build a public bath. It should contain a 50 meter pool and be flooded with daylight.] The boundaries of these groups are based on the content and, therefore, are subjective. Thus, in our evaluation, we first focus on a sentence-based classification of argument components and consider the identification of the group boundaries as future work. Freeman (1991, p. 106) uses the term linked for premises that consist of multiple statements, each of which does not separately constitute a support, but together produce "a relevant reason".

Major positions are very similar to the concept of policy claims which "advocate a course of action" and are about "deciding what to do" (Schiappa and Nordin, 2013, p. 101).

3.4 Annotation

We developed annotation guidelines and refined them over the course of multiple iterations. Our dataset was annotated by three annotators, of which two are authors of this publication. The third annotator was instructed after the annotation guidelines were developed.

OpenNLP⁸ was used to split each proposal and comment into individual sentences. Errors were manually corrected. We also removed embedded images that occur sporadically because we focus on text content. Afterwards, we used the brat rapid annotation tool (Stenetorp et al., 2012) for the annotation of the dataset. The text content also contains non-argumentative sentences which we did not annotate. These include salutations, valedictions, meta-discussions (for instance, comments about the participation process), and comprehension questions.

⁸https://opennlp.apache.org

In our annotation process, we further divide claims into pro claims and contra claims by classifying the most dominant positioning, based on the content and the wording, in order to report a simplified "level of agreement / disagreement" in preparation for a future user behavior study. More observations of our annotation process are detailed in section 6.

3.4.1 Inter-Annotator Agreement

Before annotating the data set, we took a subset of 8 proposals with 74 comments to measure the inter-annotator agreement, consisting of 261 sentences and 4.1k tokens. The subset was randomly drawn from 67 proposals that have between 5 and 40 comments.
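The paper computes its agreement scores with the Java library DKPro Agreement. Purely as an illustration of two of the token-based measures reported below (observed agreement and Fleiss' kappa), the following Python sketch shows how they can be computed for three annotators. The toy labels and the pairwise definition of observed agreement are assumptions, not taken from the paper.

```python
# Hedged sketch (the paper uses DKPro Agreement): token-level observed
# agreement and Fleiss' kappa, given one label per annotator per token.
from collections import Counter
from itertools import combinations

def observed_agreement(labels_per_token):
    """Average pairwise agreement over all tokens."""
    total, agreeing = 0, 0
    for labels in labels_per_token:
        for a, b in combinations(labels, 2):
            total += 1
            agreeing += (a == b)
    return agreeing / total

def fleiss_kappa(labels_per_token):
    """Fleiss' kappa for a fixed number of annotators per token."""
    categories = {lab for labels in labels_per_token for lab in labels}
    n = len(labels_per_token[0])   # annotators per token
    N = len(labels_per_token)      # number of tokens
    category_counts = Counter()
    P_bar = 0.0
    for labels in labels_per_token:
        counts = Counter(labels)
        category_counts.update(counts)
        # per-token agreement: (sum_j n_ij^2 - n) / (n (n - 1))
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= N
    P_e = sum((category_counts[c] / (N * n)) ** 2 for c in categories)
    return (P_bar - P_e) / (1 - P_e)

if __name__ == "__main__":
    # hypothetical token-level labels from three annotators
    tokens = [("premise", "premise", "premise"),
              ("claim", "premise", "claim"),
              ("none", "none", "none")]
    print(observed_agreement(tokens), fleiss_kappa(tokens))
```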
As in recent research (Stab and Gurevych, 2014a; Park and Cardie, 2014; Habernal et al., 2014; Eckle-Kohler et al., 2015), we also report inter-annotator agreement (IAA) values to quantify the consensus of our annotators and to make our annotation study more comparable. As there is currently no standardized single measure in the argumentation mining community, we report multiple IAA values. We use DKPro Agreement (Meyer et al., 2014) to report our inter-annotator agreement values. Table 1 summarizes our IAA values for three scenarios: (i) joint measures over all categories, (ii) category-specific values, and (iii) argumentative vs. non-argumentative units.

Since we asked the annotators to assign labels to freely assignable spans, we use Krippendorff's unitized alpha αu (Krippendorff, 2004). We have to keep in mind that several comments only contain one sentence and are, therefore, much easier to annotate. An average over IAA values from all comments would be biased. Hence, we follow the approach proposed in (Habernal and Gurevych, 2016) to concatenate all text content into a single document and measure a single Krippendorff's αu value instead of averaging αu for each document.

We also report the token-based observed agreement Ao,t and the token-based Fleiss' kappa κt (Fleiss, 1971). The token-based distribution of the annotations of all three annotators is as follows: 1278 non-argumentative tokens and 11220 argumentative tokens (3214 major positions, 730 claims pro, 583 claims contra, 6693 premises).

We do not report a sentence-based inter-annotator agreement because more than one annotation per sentence is possible (e.g., a claim followed by a premise in a subordinate clause) and IAA measures are for single-label annotation only.

The measures of αu = 0.924 for argumentative versus non-argumentative spans and the joint measure for all categories of αu = 0.78 indicate a reliable agreement between our three annotators. Therefore, we should be able to provide good annotations for automated classification tasks.

                   Ao,t   κt     αu
all                76.4   62.6   78.0
major positions    89.3   71.9   79.8
claims pro         96.3   66.1   59.0
claims contra      95.6   52.3   57.2
premises           80.9   61.5   80.1
AU / non-AU        90.7   49.1   92.4

Table 1: Inter-annotator agreement scores in percentages: Ao,t token-based observed agreement, κt token-based Fleiss' kappa, and αu Krippendorff's unitized alpha.

3.4.2 Corpus

For our corpus, we randomly drew 72 proposals that each contain at least one major position. These proposals were commented with 575 comments. In total, our annotated dataset consists of 2433 sentences and 40177 tokens. We annotated 2170 argumentative spans. They comprise 548 major positions, 378 claims (282 pro claims and 96 contra claims), and 1244 premises. Our annotated corpus consists of 4646 (11.6%) non-argumentative and 35531 (88.4%) argumentative tokens. This indicates that the text content is highly argumentative. Exactly 88 (3.6%) of the sentences were annotated with more than one argument component.

We plan to release our dataset along with our annotations under an open-source license to allow reproducibility.

4 Evaluation

This section discusses our initial approach to automatically identify argumentative sentences and to classify argument components.

4.1 Preprocessing

First, we tokenize all sentences in our dataset with OpenNLP and use Mate Tools (Björkelund et al., 2010) for POS-tagging and dependency parsing.

4.2 Features

For our classification problems, we evaluate different features and their combinations. They can be divided into three groups: (i) n-grams, (ii) grammatical distributions, and (iii) structural features. N-grams are an obvious choice to capture the text content because several words are used repeatedly in different argument components, like "agree" or "disagree" in the case of claims. We use unigrams and bigrams as binary features.

Grammatical Distributions: Based on our observations, we identified that users use different tenses and sentence structures for our three categories. For instance, claims are often stated in the present tense (e.g., "I agree!"). Therefore, we use an L2-normalized POS-tag distribution of the STTS tags (Schiller et al., 1999) and an L2-normalized distribution of the dependencies in the TIGER annotation scheme (Albert et al., 2003).

Structural Features: We also capture multiple structural features: token count, percentage of comma tokens in the sentence, percentage of dot tokens in the sentence, and the last token of a sentence as a one-hot encoding ('.', '!', '?', 'OTHER'). Furthermore, we use the index of the sentence, since we have noticed that users often start their comment with a pro or contra claim. Moreover, we use the number of links in a sentence as a feature.

Feature Set                           AU / non-AU              Argument Components
                                      SVM     RF      k-NN     SVM     RF      k-NN
Unigram                               65.99   68.13   61.00    64.40   59.41   40.30
Unigram, lowercased                   66.69   64.53   62.26    65.32   53.35   38.25
Bigram                                41.79   50.48   16.25    46.62   50.42   11.51
Grammatical                           55.88   52.24   48.52    59.54   47.89   46.81
Unigram + Grammatical                 69.77   58.39   64.87    68.50   57.13   35.90
Unigram + Grammatical + Structural    67.50   61.14   54.07    65.99   59.46   47.27

Table 2: Macro-averaged F1 scores for the two classification problems: (i) classifying sentences as argumentative and non-argumentative, (ii) classifying sentences as major positions, claims, and premises.

4.3 Results

We report results for two classification problems. Subtask A is the classification of sentences as argumentative or non-argumentative, and in subtask B we automatically classify argument components in sentences with exactly one annotated argument component. Macro-averaged F1 was chosen as the evaluation metric. For each subtask, we randomly split the respective annotated sentences into an 80% training set and a 20% test set.

Different feature combinations were evaluated with three classifiers: support vector machine (SVM) with an RBF kernel, random forest (RF), and k-nearest neighbor (k-NN). We use scikit-learn (Pedregosa et al., 2011) as machine learning library. The required parameters for our classifiers (SVM: penalty term C and γ for the kernel function; random forest: number of trees, maximal depth of the trees, and multiple parameters regarding splits; k-NN: number of neighbors k and weight functions) were estimated by a grid search with 10-fold cross-validation on the training set.

The results of both subtasks are listed in Table 2. k-NN almost always achieved the worst results in comparison with the other two classifiers. The results of bigrams as features are worse than the results of unigrams. Lowercasing words has different effects, depending on the classifier: The results of unigrams improve for SVMs but decline for random forests and k-NN. The addition of the structural features also had different effects on the classifiers, depending on the subtask. Additionally, we experimented with words lemmatized by Mate Tools (combined with IWNLP (Liebeck and Conrad, 2015)), but our results were slightly lower. In our future work, we will work on better ways to incorporate lemmatization into our classification tasks.

4.3.1 Subtask A

For identifying argumentative sentences, the best result of 69.77% was achieved by a support vector machine with unigrams and grammatical features. It is interesting to see that unigrams work better with the random forest classifier than with an SVM, but, with the additional grammatical features, the SVM outperforms the random forest. The training set for subtask A contains 1667 argumentative and 280 non-argumentative sentences, whereas the test set comprises 411 argumentative and 75 non-argumentative sentences.

4.3.2 Subtask B

For the classification of argument components, we do not further differentiate between pro and contra claims because both of them occur more rarely than major positions and premises. Therefore, we have grouped pro and contra claims. The training set for subtask B contains 1592 sentences (951 premises, 399 major positions, and 242 claims), whereas the test set comprises 398 sentences (219 premises, 110 major positions, and 69 claims).

The best result for subtask B, with a macro-averaged F1 score of 68.5%, was again achieved by a support vector machine as classifier with unigrams and grammatical features. In subtask B, the gap between the results of the k-NN classifier and the results of the other two classifiers is much larger than in subtask A.
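To make the feature groups of Section 4.2 concrete, the following is a minimal sketch, not the authors' implementation: POS tags from Mate Tools are assumed to be precomputed and passed in as plain strings, and the structural feature list is simplified (for instance, links are detected by a simple "http" substring check).

```python
# Hedged sketch of the three feature groups (n-grams, grammatical
# distributions, structural features). Not the authors' code.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def ngram_features(sentences):
    """Binary unigram and bigram indicators (lowercasing is evaluated separately in the paper)."""
    vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, lowercase=False)
    return vectorizer.fit_transform(sentences), vectorizer

def tag_distribution(tags_per_sentence, tagset):
    """L2-normalized distribution of tags (e.g., STTS POS tags) per sentence."""
    index = {tag: i for i, tag in enumerate(tagset)}
    matrix = np.zeros((len(tags_per_sentence), len(tagset)))
    for row, tags in enumerate(tags_per_sentence):
        for tag in tags:
            if tag in index:
                matrix[row, index[tag]] += 1
    return normalize(matrix, norm="l2")

def structural_features(sentences, sentence_indices):
    """Token count, comma/dot percentage, final-token one-hot, sentence index, link count."""
    rows = []
    for sentence, idx in zip(sentences, sentence_indices):
        tokens = sentence.split()
        n = max(len(tokens), 1)
        last = tokens[-1][-1] if tokens else ""
        one_hot = [last == ".", last == "!", last == "?", last not in ".!?"]
        rows.append([len(tokens),
                     sum(t == "," for t in tokens) / n,
                     sum(t == "." for t in tokens) / n,
                     *[float(v) for v in one_hot],
                     idx,
                     sum("http" in t for t in tokens)])
    return np.array(rows)
```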

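The evaluation setup of Section 4.3 can be approximated with scikit-learn as follows. The concrete parameter grids are illustrative assumptions, since the paper names the tuned parameters but not their candidate values.

```python
# Hedged sketch of the evaluation setup: 80/20 split, three classifiers,
# parameters chosen by grid search with 10-fold cross-validation on the
# training split, macro-averaged F1 as the metric. Grids are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate(features, labels, seed=0):
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=seed)
    candidates = {
        "SVM": (SVC(kernel="rbf"),
                {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}),
        "RF": (RandomForestClassifier(random_state=seed),
               {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]}),
        "k-NN": (KNeighborsClassifier(),
                 {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}),
    }
    results = {}
    for name, (classifier, grid) in candidates.items():
        search = GridSearchCV(classifier, grid, scoring="f1_macro", cv=10)
        search.fit(X_train, y_train)
        predictions = search.best_estimator_.predict(X_test)
        results[name] = f1_score(y_test, predictions, average="macro")
    return results
```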
                   Predicted
                   MP     C      P      Total
Actual   MP        63     4      43     110
         C         9      48     12     69
         P         27     20     172    219
         Total     99     72     227    398

Table 3: Confusion matrix for our best result of identifying argument components with a support vector machine and "unigram + grammatical" as features.

In order to better understand our results, we report the confusion matrix for the best classifier in Table 3. The confusion matrix shows that the classification of premises works well and that major positions are often misclassified as premises. In our future work, we will try to find better semantic features to differentiate major positions from premises.

We initially tried to solve subtask B as a four-class problem, but our features do not yet allow for a good distinction between pro and contra claims with our small training set for claims. In our future work, we will treat their distinction as a further classification task and will integrate more polarity features.

5 Conclusion and Future Work

In this paper we have presented a new corpus for German argumentation mining and a modified argumentation scheme for online participation. We described the background of our data set, our annotation process, and our automated classification approach for the two classification tasks of identifying argumentative sentences and identifying argument components. We evaluated different feature combinations and multiple classifiers. Our initial results for argument mining in the field of German online participation are promising. The best results of 69.77% in subtask A and 68.5% in subtask B were both achieved by a support vector machine with unigrams and grammatical features.

While working with our dataset, we realized that citizens argue not only with rational reasons and that they are not always objective. They often express their positive and negative emotions and use humor to convince other participants or just to avoid conflicts. This makes an automatic approach more difficult.

In our future work, we want to experiment with additional features to further increase our classification results. We will identify specific emotions in the argumentation among citizens. We will try to find humor as a predictor for enjoyment and sociability.

So far, we have only worked on a sentence level. We would like to automatically detect tokens that form a group, based on the content. For this, we could use the token-based BIO scheme used in Goudas et al. (2014) and Habernal and Gurevych (2016), which divides tokens into beginning (B), inner (I), and other (O) tokens of an argument component. This would also allow us to find more than one argument component in a sentence.

Furthermore, we will work on the distinction of claims into pro and contra claims. Additionally, we aim to identify more freely available corpora for online participation to which we can apply our model for a comparative study.

6 Observations

Background knowledge: Some proposals and comments require background knowledge in order to fully comprehend them. For an automated approach, this is much more difficult, especially if existing buildings on the field or city districts are referred to by name.

Edge annotation: We chose not to annotate outgoing edges in our corpus. In a single-label approach, ambiguity might occur because a premise might support one claim and attack another one. We tried an approach with multiple outgoing edges, but it became very difficult to evaluate every possible edge in discussions with more than 30 comments and multiple major positions. In order to avoid an incomplete edge annotation, we completely omitted the annotation of edges for the time being.

Contextual differentiation: During the annotation, we noticed some situations where it became difficult to decide which argument component is the best fit. For instance, "Vertical vegetable gardens are an enrichment for our perception." contains a slight positioning, but in the context of the comment, the sentence was used as a reason and, therefore, annotated as a premise.

Acknowledgments

This work was funded by the PhD program Online Participation, supported by the North Rhine-Westphalian funding scheme Fortschrittskollegs. The authors want to thank the anonymous reviewers for their suggestions and comments.
References

Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, Oliver Plaehn, Cordula Preis, Marcus Pussel, Marco Rower, Bettina Schrader, Anne Schwartz, Smith George, and Hans Uszkoreit. 2003. TIGER Annotationsschema. Technical report, Universität Potsdam, Universität Saarbrücken, Universität Stuttgart.

Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues. 2010. A High-Performance Syntactic and Semantic Dependency Parser. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, COLING '10, pages 33–36. Association for Computational Linguistics.

Judith Eckle-Kohler, Roland Kluge, and Iryna Gurevych. 2015. On the Role of Discourse Markers for Discriminating Claims and Premises in Argumentative Discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 2236–2242.

Tobias Escher, Dennis Friess, Katharina Esau, Jost Sieweke, Ulf Tranow, Simon Dischner, Philipp Hagemeister, and Martin Mauve. 2016. Online Deliberation in Academia: Evaluating the Quality and Legitimacy of Co-Operatively Developed University Regulations. Policy & Internet, (in press).

Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76(5):378–382.

Eirini Florou, Stasinos Konstantopoulos, Antonis Koukourikos, and Pythagoras Karampiperis. 2013. Argument extraction for supporting public policy formulation. In Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 49–54. Association for Computational Linguistics.

James Freeman. 1991. Dialectics and the Macrostructure of Arguments: A Theory of Argument Structure, volume 10 of Pragmatics and Discourse Analysis Series. de Gruyter.

Theodosios Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument Extraction from News, Blogs, and Social Media. In Artificial Intelligence: Methods and Applications, pages 287–299.

Ivan Habernal and Iryna Gurevych. 2015. Exploiting Debate Portals for Semi-supervised Argumentation Mining in User-Generated Web Discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2127–2137. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016. Argumentation Mining in User-Generated Web Discourse. Computational Linguistics, (in press).

Ivan Habernal, Judith Eckle-Kohler, and Iryna Gurevych. 2014. Argumentation Mining on the Web from Information Seeking Perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, pages 26–39. CEUR-WS.

Constantin Houy, Tim Niesen, Peter Fettke, and Peter Loos. 2013. Towards Automated Identification and Analysis of Argumentation Structures in the Decision Corpus of the German Federal Constitutional Court. In 7th IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST). IEEE Computer Society.

Klaus Krippendorff. 2004. Content Analysis: An Introduction to Its Methodology. Sage Publications, second edition.

Matthias Liebeck and Stefan Conrad. 2015. IWNLP: Inverse Wiktionary for Natural Language Processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 414–418. Association for Computational Linguistics.

Christian Meyer, Margot Mieskes, Christian Stab, and Iryna Gurevych. 2014. DKPro Agreement: An Open-Source Java Library for Measuring Inter-Rater Agreement. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 105–109.

Raquel Palau and Marie-Francine Moens. 2009. Argumentation Mining: The Detection, Classification and Structure of Arguments in Text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL '09, pages 98–107. ACM.

Joonsuk Park and Claire Cardie. 2014. Identifying Appropriate Support for Propositions in Online User Comments. In Proceedings of the First Workshop on Argumentation Mining, pages 29–38. Association for Computational Linguistics.

Joonsuk Park, Cheryl Blake, and Claire Cardie. 2015a. Toward Machine-assisted Participation in eRulemaking: An Argumentation Model of Evaluability. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, ICAIL '15, pages 206–210. ACM.

Joonsuk Park, Arzoo Katiyar, and Bishan Yang. 2015b. Conditional Random Fields for Identifying Appropriate Types of Support for Propositions in Online User Comments. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 39–44. Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Andreas Peldszus and Manfred Stede. 2013. Ranking the annotators: An agreement study on argumentation structure. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 196–204. Association for Computational Linguistics.

Edward Schiappa and John Nordin. 2013. Argumentation: Keeping Faith with Reason. Pearson Education.

Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS (kleines und großes Tagset). Technical report, Universität Stuttgart, Universität Tübingen.

Christian Stab and Iryna Gurevych. 2014a. Annotating Argument Components and Relations in Persuasive Essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1501–1510.

Christian Stab and Iryna Gurevych. 2014b. Identifying Argumentative Discourse Structures in Persuasive Essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 46–56.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 102–107. Association for Computational Linguistics.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Stephen Toulmin. 2003. The Uses of Argument, Updated Edition. Cambridge University Press.

Unshared task: (Dis)agreement in online debates

Maria Skeppstedt¹,²  Magnus Sahlgren¹  Carita Paradis³  Andreas Kerren²
¹Gavagai AB, Stockholm, Sweden
{maria, mange}@gavagai.se
²Computer Science Department, Linnaeus University, Växjö, Sweden
[email protected]
³Centre for Languages and Literature, Lund University, Lund, Sweden
[email protected]

Abstract

Topic-independent expressions for conveying agreement and disagreement were annotated in a corpus of web forum debates, in order to evaluate a classifier trained to detect these two categories. Among the 175 expressions annotated in the evaluation set, 163 were unique, which shows that there is large variation in expressions used. This variation might be one of the reasons why the task of automatically detecting the categories was difficult. F-scores of 0.44 and 0.37 were achieved by a classifier trained on 2,000 debate sentences for detecting sentence-level agreement and disagreement.

1 Introduction

Argumentation mining involves the task of automatically extracting an author's argumentation for taking a specific stance. This includes, e.g., extracting premises and conclusions, or the relationship between arguments, such as argument and counter-argument (Green et al., 2014; Habernal and Gurevych, 2015). In a corpus containing dialog, e.g., different types of web fora or discussion pages, the argumentation often involves a reaction to arguments given by previous authors in the discussion thread. The author might, for instance, give a counter-argument to an argument appearing earlier in the thread, or an argument supporting the stance of a previous author. A sub-task of detecting the argument structure of a dialogic corpus is, therefore, to detect when the author conveys agreement or disagreement with other authors. The aim of this study was to investigate this sub-task, i.e., to automatically detect posts in a dialogic corpus that contain agreement or disagreement.

2 Previous research

Dis/agreement has been the focus of conversational analysis (Mori, 1999), and is linked to Speech Act Theory (Searle, 1976). The categories have been annotated and detected in transcribed speech, e.g., in meeting discussions (Hillard et al., 2003; Galley et al., 2004; Hahn et al., 2006), congressional floor-debates (Thomas et al., 2006), and broadcast conversations (Germesin and Wilson, 2009).

Online discussions in the form of Wikipedia Talk have been annotated for dis/agreement (Andreas et al., 2012), for positive/negative attitude towards other contributors (Ferschke, 2014), and for subclasses of positive/negative alignment, e.g., explicit agreement/disagreement, praise/thanking, and criticism/insult (Bender et al., 2011).

For online debate fora, there is a corpus of posts with a scalar judgment for their level of dis/agreement with a previous post (Walker et al., 2012). Misra and Walker (2013) used frequently occurring uni/bi/trigrams from the non-neutral posts in this corpus for creating a lexicon of topic-independent expressions for dis/agreement. This lexicon was then used for selecting features for training a topic-independent classifier. The approach resulted in an accuracy of 0.66 (an improvement of 0.6 points compared to standard feature selection) for distinguishing the classes agreement/disagreement, when evaluating the classifier on debate topics not included in the training data.

Despite this usefulness of the lexicon for creating a topic-independent dis/agreement classifier, there are, to the best of our knowledge, no debate forum corpora annotated with the focus of topic-independent expressions of dis/agreement. Here, the first step towards creating such a resource was, therefore, taken.


3 Method

The study was conducted on discussions from a debate forum. The data originates from createdebate.com, which is a debate forum that hosts debates on a variety of topics. The data used as evaluation set was provided for task Variant A in the 3rd Workshop on Argument Mining, and consists of 27 manually collected discussion threads.¹ The debates start with a question, e.g., "Should the age for drinking be lowered?", which users then debate, either by posting an independent post, or by supporting/disputing/clarifying a previous post.

The same division into topic-specific/topic-independent means for conveying dis/agreement as previously used by Misra and Walker (2013) was adopted. Instead of using it for creating a lexical resource, it was, however, used as a guideline for annotation. A preliminary analysis of posts tagged as support/dispute in 8 discussion threads showed that typical topic-specific strategies for conveying dis/agreement were reformulations/expansions/elaborations of what was stated in a previous post. A new argument for or against the initial debate question could, however, also be given, without references to the content of the previous post. Topic-independent means for conveying dis/agreement were typically either explicit statements such as "I (dis)agree", "NO way!", or critical follow-up questions, "A: Alcohol should be forbidden. B: Should it then also be illegal with cell phones?". All means of conveying dis/agreement independent of debate topic were, however, included in the task, e.g., as exemplified by Bender et al. (2011): topic-independent explicit dis/agreement, (sarcastic) praise/thanking, positive reference, doubt, criticism/insult, dismissing.

The preliminary analysis also showed that the support/dispute tagging provided in the unshared task data would not suffice for distinguishing agreement from disagreement, as there were posts tagged as support that consisted mainly of expressions of disagreement.

3.1 Annotation of task data (evaluation set)

All instances in which agreement or disagreement were conveyed using topic-independent expressions were annotated in the unshared task data set. The annotation was performed by marking a relevant scope of text, in the form of the longest possible chunk that was still a topic-independent expression conveying dis/agreement. For instance, in Figure 1, "fighting a war" is specific to the topic of the debate, whereas the annotated chunk, "is a good thing?", is topic-independent and could be used for expressing disagreement in other cases.

[Figure 1: Two of the chunks in the unshared task data that were annotated as dis/agreement.]

The annotation was performed by one annotator, with Brat as the annotation tool (Stenetorp et al., 2012).

3.2 Annotation/classification of training set

Identifying and annotating relevant chunks in running text is a time-consuming task, which also requires a large amount of attention from the annotator. Classification of individual sentences is, however, an easier task, and to classify a limited corpus of 2,000 sentences is feasible in a relatively short amount of time. For creating a larger (but still relatively limited) training set of discussion sentences conveying dis/agreement, the chunk annotation task was reformulated as a text classification task, and individual sentences were manually classified according to the categories agreement, disagreement or neutral. As for the previous annotation set-up, sentences containing topic-independent expressions for conveying the two categories of interest were classified as containing agreement or disagreement.

The 2,300 most popular threads, i.e., those containing the largest number of posts, were downloaded from the createdebate.com website (excluding threads present in the evaluation data). The posts are provided with author tagging that states what posts are disputing or clarifying previous posts. Among posts for which no such tag was attached (the other posts), and among posts tagged as disputing a previous post, 2,000 first sentences were randomly selected for manual classification. Only first sentences of posts were included to make it possible to classify each individual sentence without context, since it is likely that their agreement/disagreement classification is less dependent on the context of the post. For sentence segmentation, the standard functionality in NLTK (Bird, 2002) was used.

¹https://github.com/UKPLab/argmin2016-unshared-task
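As a small illustration of the training-set construction described above (an assumption-laden sketch, not the authors' code), the first sentence of each downloaded post can be extracted with NLTK's standard sentence splitter:

```python
# Minimal sketch: first sentence of each post, using NLTK's standard
# sentence segmentation. Requires the 'punkt' models (nltk.download("punkt")).
from nltk.tokenize import sent_tokenize

def first_sentences(posts):
    """Return the first sentence of each post, for manual classification."""
    firsts = []
    for post in posts:
        sentences = sent_tokenize(post)
        if sentences:
            firsts.append(sentences[0])
    return firsts

if __name__ == "__main__":
    demo = ["I totally disagree. Lowering the age solves nothing.",
            "Good point! I had not thought of that."]
    print(first_sentences(demo))
```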

3.3 Training a classifier

As the final step, linear support vector machines were trained to perform the binary text classification tasks of detecting sentences containing agreement and disagreement. The LinearSVC class included in Scikit-learn (Pedregosa et al., 2011) was trained with uni/bi/trigrams as features, with the requirement that a uni/bi/trigram had occurred at least twice in the training data to be included. The n best features were selected by the built-in χ²-based feature selection, and suitable values of n and the support vector machine penalty parameter C were determined by 10-fold cross-validation. The text was not transformed into lower-case, as the use of case is one possible way of expressing or emphasising dis/agreement, e.g., 'NO way!'. The settings that achieved the best results were used for training a model on the entire training data set, which was then evaluated on the data provided for the unshared task. The annotations in the unshared task data were transformed into an evaluation set by transforming the text chunk annotations into sentence-level classifications of whether a sentence contained agreement or disagreement. Two versions of the classifiers were trained, one in which neutral sentences were included and one with the same set-up as used by Misra and Walker (2013), i.e., to train a classifier to distinguish agreement from disagreement and thereby not including neutral sentences.

4 Results and discussion

# of chunks annotated in total: 175 (163 unique)
# agreement: 43        # disagreement: 132

Table 1: Statistics of unshared task annotated data.

                        Disputed   Other   Total
# agreement             36         73      109
# disagreement          420        92      512
# sentences in total    1,000      1,000   2,000

Table 2: The training data statistics show the number of sentences annotated as agreement and disagreement, extracted from posts tagged as disputing a previous post or as other. # sentences in total is the total number of annotated sentences. The corpus also included 57 sentences for which it could not be determined without context whether disagreement or agreement was expressed. These were classified as neutral. The 25 sentences that contained both agreement and disagreement were classified as belonging to the agreement category.

Statistics of the annotated data (Tables 1, 2) show that expressions for disagreement are more frequently occurring than expressions for agreement. This is most likely explained by the typical style used in debate fora, in which debating often is conducted by disputing other debaters, but it could also be due to a more frequent use of topic-independent expressions for this category.

A large variation in the expressions used was observed during annotation. This observation is supported by the data, as 163 unique expressions were annotated. This shows that the approach used by Misra and Walker (2013), i.e., to classify frequently occurring n-grams, is not sufficient for creating a high-coverage lexicon of expressions, and it also indicates that automatic detection of these expressions might be a difficult task.

The most important features used by the classifiers (Figure 2) are topic-independent, which indicates that the aim to create topic-independent classifiers was reached. Among less important features, there were, however, also topic-specific expressions, which shows that the trained classifiers were not entirely topic-independent.

The classifier results are shown in Table 3. For the training set, an F-score of around 0.47 was obtained for agreement and around 0.55 for disagreement. Results were, however, substantially lower for disagreement on the evaluation set. This decrease in results could be explained by overfitting to the training data, and by uncertainty of the results due to the small evaluation set. There might, however, also be a difference between what is considered as an expression of disagreement when it occurs in the first sentence of a post (which was the case for the training data) and when it occurs somewhere else in the text (which was the case for many sentences in the evaluation data).

To distinguish agreement from disagreement was an easier task, resulting in F-scores of 0.60 for agreement and 0.92 for disagreement on the training set, and F-scores of 0.55 and 0.81, respectively, on the evaluation set. The recall for agreement was, however, low also for this task, probably due to the few occurrences of this class in the training data.
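The classifier of Section 3.3 can be approximated with a scikit-learn pipeline along the following lines. This is a hedged sketch rather than the authors' code: the candidate values for n and C are illustrative, and the cross-validation scoring function is an assumption, since the paper does not state which metric guided the tuning.

```python
# Hedged sketch: uni/bi/trigram counts occurring at least twice, chi2-based
# selection of the n best features, and a LinearSVC whose C (and n) are tuned
# by 10-fold cross-validation. Text is deliberately not lowercased.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_classifier():
    pipeline = Pipeline([
        ("ngrams", CountVectorizer(ngram_range=(1, 3), min_df=2, lowercase=False)),
        ("select", SelectKBest(chi2)),
        ("svm", LinearSVC()),
    ])
    grid = {
        "select__k": [500, 1000, 2000, "all"],   # candidate values are illustrative
        "svm__C": [0.01, 0.1, 1, 10],
    }
    return GridSearchCV(pipeline, grid, cv=10, scoring="f1_macro")

# Usage: search = build_classifier(); search.fit(train_sentences, train_labels);
# predictions = search.best_estimator_.predict(evaluation_sentences)
```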

Figure 2: The most important features for detecting agreement (green) and disagreement (red). Font size corresponds to the importance of the feature, and negative features (in black) are underlined.

Agreement features: ? admit agree-that agree-with are-right as-well be-it but-in but-it But-no but-there but-with clarified correct decent don-agree doubt easier figured good-points guess-you Hear hear however idea-as is-correct it-is-the lol love misunderstood my-argument myself nice of-an ok okay on-here people-can point points puts right round said supported they-would this-idea to-keep True true-that upvote ur Well what-you-said win yeah yes Yes your-point Yup

Disagreement features: ?2 Actually agree all-and anything argument arguments-you bad because-if bother bullshit But choice claim disagree disputing don-believe-in Dude evidence flawed foolish fuck generalization half how ignorant Ignoring in-hell Is-it is-so Is-that it-does lead like-to-see lying many No NO no-but Nope nothing obviously of-evidence on-it once peacefully permission-to point pointless proof should-be So sorry stop stupid think-so think-that-you understand Well-thats What what what-is which-should Why yes You you-have-the you-know you-saying yourself
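A feature listing in the spirit of Figure 2 can be produced from the weights of a fitted linear model. The sketch below assumes the pipeline from the previous sketch (step names "ngrams", "select", "svm", scikit-learn 1.0 or later) and a binary agreement-versus-rest setup, so it is an illustration rather than the authors' exact procedure.

```python
# Hedged sketch: rank selected n-gram features by LinearSVC weights.
# For a binary problem, positive weights favour the class encoded as 1;
# the sign convention therefore depends on the label ordering.
import numpy as np

def top_features(fitted_pipeline, k=30):
    names = np.array(fitted_pipeline.named_steps["ngrams"].get_feature_names_out())
    mask = fitted_pipeline.named_steps["select"].get_support()
    weights = fitted_pipeline.named_steps["svm"].coef_[0]
    selected = names[mask]                      # features surviving chi2 selection
    order = np.argsort(weights)
    positive = list(zip(selected[order[-k:]][::-1], weights[order[-k:]][::-1]))
    negative = list(zip(selected[order[:k]], weights[order[:k]]))
    return positive, negative
```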

                              Including neutral sentences      Agreement vs. disagreement (no neutral sent.)
                              Precision        Recall          Precision        Recall
Training set    agreement     0.46             0.47            0.64             0.56
(10-fold)       disagreement  0.54             0.56            0.91             0.93
Evaluation set  agreement     0.45 ± 0.15      0.44 ± 0.15     0.70 ± 0.17      0.46 ± 0.15
                disagreement  0.29 ± 0.06      0.50 ± 0.09     0.84 ± 0.06      0.93 ± 0.04

Table 3: Machine learning results obtained on the corpus annotated in this study.

Previous machine learning approaches were generally more successful. In Wikipedia Talk, F-scores of 0.69 and 0.53 were achieved for detecting positive and negative attitudes (Ferschke, 2014), and F-scores of 0.61 and 0.84 for detecting explicit agreement/disagreement (Opitz and Zirn, 2013). In other types of online debates, F-scores of 0.65 and 0.77 have been achieved for detecting dis/agreement (Yin et al., 2012), and an F-score of 0.75 for detecting disagreement (Allen et al., 2014). Including a neutral category, however, has resulted in agreement/disagreement F-scores of 0.23/0.46 for Wikipedia Talk and 0.26/0.57 for debate forums (Rosenthal and McKeown, 2015). Not all of these previous studies are, however, directly comparable, e.g., since more narrowly or broadly defined categories were used and/or larger training data sets or external lexical resources.

The next step includes an expansion of the training and evaluation sets, as well as involving a second annotator to measure inter-annotator agreement and to create a gold standard. Without this measure of reliability, the annotated corpus cannot be considered complete. However, as a snapshot of its current status, the annotations have been made publicly available.² Future work also includes studying to what extent a topic-independent classifier detects dis/agreement in general. If dis/agreement is frequently conveyed by means specific to the topic of the debate, relations between the content of the debate posts need to be modelled, to be able to analyse reformulations/expansions/elaborations of previous posts.

5 Conclusion

To be able to train a topic-independent classifier for detecting dis/agreement in online debate fora, a corpus annotated for topic-independent expressions of dis/agreement is a useful resource. Here, the first step towards creating such a resource was taken. A debate forum corpus consisting of 27 discussion threads was annotated for topic-independent expressions conveying dis/agreement. Among the 175 annotated expressions (43 for agreement and 132 for disagreement), 163 were unique, which shows that there is a large variation in expressions used.

This variation might be one of the reasons why the task of detecting dis/agreement was difficult. 10-fold cross-validation on an additional set of 2,000 randomly selected sentences annotated for sentence-level dis/agreement resulted in a precision of 0.46 and a recall of 0.47 for agreement and a precision of 0.54 and a recall of 0.56 for disagreement. Results for disagreement, however, decreased when the model was applied on held-out data (precision 0.29, recall 0.50). Better results were achieved for the task of distinguishing agreement from disagreement, i.e., not including neutral sentences, but recall for the more infrequently occurring category agreement was still low.

²http://bit.ly/1Ux8o7q

Acknowledgements

This work was funded by the StaViCTA project, framework grant "the Digitized Society – Past, Present, and Future" with No. 2012-5659 from the Swedish Research Council (Vetenskapsrådet).

References

Kelsey Allen, Giuseppe Carenini, and Raymond T. Ng. 2014. Detecting disagreement in conversations using pseudo-monologic rhetorical structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1169–1180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jacob Andreas, Sara Rosenthal, and Kathleen McKeown. 2012. Annotating agreement and disagreement in threaded discussion. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 818–822.

Emily M. Bender, Jonathan T. Morgan, Meghan Oxley, Mark Zachry, Brian Hutchinson, Alex Marin, Bin Zhang, and Mari Ostendorf. 2011. Annotating social acts: Authority claims and alignment moves in Wikipedia talk pages. In Proceedings of the Workshop on Languages in Social Media, LSM '11, pages 48–57, Stroudsburg, PA, USA. Association for Computational Linguistics.

Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Stroudsburg, PA, USA. Association for Computational Linguistics.

Oliver Ferschke. 2014. The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia. Dissertation, Technische Universität Darmstadt, July.

Michel Galley, Kathleen McKeown, Julia Hirschberg, and Elizabeth Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sebastian Germesin and Theresa Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of the 2009 International Conference on Multimodal Interfaces, ICMI-MLMI '09, pages 7–14, New York, NY, USA. ACM.

Nancy Green, Kevin Ashley, Diane Litman, Chris Reed, and Vern Walker, editors. 2014. Proceedings of the First Workshop on Argumentation Mining. Association for Computational Linguistics, Stroudsburg, PA, USA.

Ivan Habernal and Iryna Gurevych. 2015. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2127–2137, Stroudsburg, PA, September. Association for Computational Linguistics.

Sangyun Hahn, Richard Ladner, and Mari Ostendorf. 2006. Agreement/disagreement classification: exploiting unlabeled data using contrast classifiers. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 53–56, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dustin Hillard, Mari Ostendorf, and Elisabeth Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, Stroudsburg, PA, USA. Association for Computational Linguistics.

Amita Misra and Marilyn Walker. 2013. Topic independent identification of agreement and disagreement in social media dialogue. In Proceedings of the SIGDIAL 2013 Conference, pages 41–50, Stroudsburg, PA, USA, August. Association for Computational Linguistics.

Junko Mori. 1999. Negotiating agreement and disagreement in Japanese: connective expressions and turn construction. J. Benjamins Pub. Co, Amsterdam.

Bernd Opitz and Cäcilia Zirn. 2013. Bootstrapping an unsupervised approach for classifying agreement and disagreement. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA). Linköping University Electronic Press, Linköping.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Sara Rosenthal and Kathleen McKeown. 2015. I couldn't agree more: The role of conversational structure in agreement and disagreement detection in online discussions. In SIGDIAL 2015 Conference, pages 168–177, Stroudsburg, PA, USA. Association for Computational Linguistics.

John R. Searle. 1976. A Classification of Illocutionary Acts. Language in Society, 5(1):1–23.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: a web-based tool for NLP-assisted text annotation. In EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 327–335, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marilyn A. Walker, Pranav Anand, Jean E. Fox Tree, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 23–25.

Jie Yin, Paul Thomas, Nalin Narang, and Cecile Paris. 2012. Unifying local and global agreement and disagreement classification in online debates. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA ’12, pages 61–69, Stroudsburg, PA, USA. Association for Computational Linguistics.

Unshared Task at the 3rd Workshop on Argument Mining: Perspective Based Local Agreement and Disagreement in Online Debate

Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo and Piek Vossen
Vrije Universiteit Amsterdam
Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
{c.m.van.son, t.caselli, antske.fokkens, isa.maks, r.morantevallejo, lora.aroyo, piek.vossen}@vu.nl

Abstract

This paper proposes a new task in argument mining in online debates. The task includes three annotation steps that result in fine-grained annotations of agreement and disagreement at a propositional level. We report on the results of a pilot annotation task on identifying sentences that are directly addressed in the comment.

1 Introduction

Online debate (in its broadest sense) takes an increasingly prominent place in current society. It is at the same time a reflection and a shaping factor of the different beliefs, opinions and perspectives that exist in a certain community. Online debate characterizes itself by the dynamic interaction between its participants: they attack or support each other's stances by confirming or disputing their statements and arguments, questioning their relevance to the debate or introducing new arguments that are believed to overrule them. In fact, as Peldszus and Stede (2013, p. 4) point out, all argumentative text is of dialectic nature: "an argument always refers to an explicitly mentioned or at least supposed opponent, as for instance in the rebutting of possible objections." Therefore, these (implicit) interactions between participants should be given a central role when performing argument mining.

In recent years, several studies have addressed the annotation and automatic classification of agreement and disagreement in online debates. The main difference between them is the annotation unit they have targeted, i.e. the textual units that are in (dis)agreement. Some studies focused on global (dis)agreement, i.e. the overall stance towards the main debate topic (Somasundaran and Wiebe, 2010). Other studies focused on local (dis)agreement, comparing pairs of posts (Walker et al., 2012), segments (Wang and Cardie, 2014) or sentences (Andreas et al., 2012). Yin et al. (2012) propose a framework that unifies local and global (dis)agreement classification.

This paper describes an argument mining task for the Unshared Task of the 2016 ACL Workshop on Argument Mining,¹ where participants propose a task with a corresponding annotation model (scheme) and conduct an annotation experiment given a corpus of various argumentative raw texts. Our task focuses on local (dis)agreement. In contrast to previous approaches, we propose micro-propositions as annotation targets, which are defined as the smallest meaningful statements embedded in larger expressions. As such, the annotations are not only more informative on exactly what is (dis)agreed upon, but they also account for the fact that two texts (or even two sentences) can contain both agreement and disagreement on different statements. The micro-propositions that we use as a basis have the advantage that they are simple statements that can easily be compared across texts, whereas overall propositions can be very complex. On the other hand, creating gold-standard annotations of micro-propositions is time consuming for long texts. We therefore propose an (optional) additional annotation step which identifies relevant portions of text. This results in a three-step annotation procedure: 1) identifying relevant text, 2) identifying micro-propositions and 3) detecting disagreement. We report on a pilot study for the first subtask.

We selected a combination of two data sets provided by the organizers: i) Editorial articles extracted from Room for Debate from the N.Y. Times website (Variant C), each of which has a debate title (e.g. Birth Control on Demand), a debate description (e.g. Should it be provided by the government to reduce teen pregnancies?) and an article title describing the author's stance (e.g. Publicly Funded Birth Control Is Crucial); and ii) Discussions (i.e. collections of comments from different users) about these editorial articles (Variant D).

¹ http://argmining2016.arg.tech/index.php/home/call-for-papers/unshared-task

The remainder of this paper is structured as follows. Section 2 introduces the theoretical framework the task is based on. The annotation task is described in Section 3. Section 4 discusses the results of an annotation experiment, and we conclude and present future work in Section 5.

2 Perspective Framework

We consider any (argumentative) text to be a collection of propositions (statements) associated with some perspective values. In our framework (van Son et al., 2016), a perspective is described as a relation between the source of a statement (i.e. the author or, in the case of quotations, another entity introduced in the text) and a target in that statement (i.e. an entity, event or proposition) that is characterized by means of multiple perspective values expressing the attitude of the source towards the target. For instance, the commitment of a source towards the factual status of a targeted event or proposition is represented by a combination of three perspective values expressing polarity (AFFIRMATIVE or NEGATIVE), certainty (CERTAIN, PROBABLE, POSSIBLE) and time (FUTURE, NON-FUTURE). Other perspective dimensions, such as sentiment, are modeled in the same way with different sets of values.

Our assumption is that participants in an online debate interact with each other by attacking or supporting the perspective values of any of the propositions in a previous text. In this framework, we define agreement as a correspondence between one or more perspective values of a proposition attributed to one source and those attributed to another source; disagreement, on the other hand, is defined as a divergence between them. For example, consider the following pair of segments, one from an editorial article and the other from a comment in the context of Teens Hooked on Screens:

ARTICLE: The bullies have moved from the playground to the mobile screen, and there is no escaping harassment that essentially lives in your pocket.

COMMENT: Ms. Tynes: The bullies haven't moved from the playground to the screen.

This is a clear example of disagreement on the perspective values of a proposition present both in the editorial article and in the comment. As represented in Figure 1, the article's author commits to the factual status of the proposition, whereas the commenter denies it. In this example, the disagreement concerns the whole proposition ("no moving took place at all"). However, we assume that (dis)agreement can also target specific arguments within a proposition (i.e. hypothetically, someone could argue that it is not the bullies that moved from the playground to the screen, but someone else). We call these smallest meaningful propositional units in a text micro-propositions.

[Figure 1: Representation of disagreement in the perspective framework. The article's author and the commenter attach opposing perspective values (AFFIRMATIVE vs. NEGATIVE, both CERTAIN and NON-FUTURE) to the proposition built around the predicate "moved" with the arguments "the bullies", "from the playground" and "to the screen".]

3 Task Definition

Based on the perspective framework, we propose a task that aims at determining whether authors agree or disagree on the perspective values associated with the propositions contained in debate texts. Rather than trying to model the full debate comprehensively, we propose to start from the smallest statements made (i.e. micro-propositions) and derive more overall positions from the perspectives on these statements. This requires a detailed analysis of the texts. We optimize the annotation process by dividing the task into three subtasks described below: (1) Related Sentence Identification, (2) Proposition Identification, and (3) Agreement Classification.
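The paper does not prescribe a data format for perspectives. Purely as an illustration of Section 2, a minimal sketch of how a perspective and its comparison across sources could be represented is given below; all field and function names are hypothetical, and the comparison rule is deliberately simplified to "all three values must match".

```python
# Illustrative sketch only: not the authors' implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Perspective:
    source: str      # e.g. "article_author" or "commenter"
    target: str      # the micro-proposition, e.g. "the bullies moved"
    polarity: str    # AFFIRMATIVE | NEGATIVE
    certainty: str   # CERTAIN | PROBABLE | POSSIBLE
    time: str        # FUTURE | NON-FUTURE

def compare(a: Perspective, b: Perspective) -> str:
    """Label two perspectives on the same micro-proposition (simplified rule)."""
    if a.target != b.target:
        return "IRRELEVANT"  # not about the same micro-proposition
    same = (a.polarity, a.certainty, a.time) == (b.polarity, b.certainty, b.time)
    return "AGREEMENT" if same else "DISAGREEMENT"

author = Perspective("article_author", "the bullies moved",
                     "AFFIRMATIVE", "CERTAIN", "NON-FUTURE")
commenter = Perspective("commenter", "the bullies moved",
                        "NEGATIVE", "CERTAIN", "NON-FUTURE")
print(compare(author, commenter))  # DISAGREEMENT
```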

3.1 Task 1: Related Sentence Identification

In an online debate, people often do not respond to each and every statement made in previous texts, but instead tend to support or attack only one or a few of them. The aim of the first task is to identify those sentences in the editorial article that are COMMENTED UPON in the comment. A sentence is defined to be COMMENTED UPON if:

• the comment repeats or rephrases (part of) a statement made in the sentence;
• the comment attacks or supports (part of) a statement made in the sentence.

The main purpose of this task is to eliminate the parts of the editorial article that are irrelevant for (dis)agreement annotation. In the data set we use in this paper, the average number of sentences in the editorial articles is 19 (in the comments, the number of sentences ranges from 1 to 16). Without this first task, all propositions of the article, including the irrelevant ones, would have to be identified and annotated for (dis)agreement, which is neither efficient nor beneficial for the attention span of the annotators. With other data consisting of short texts, however, this subtask may be skipped.

Deciding whether a statement is COMMENTED UPON may require some reasoning, which makes the task inherently subjective. Instead of developing overdetailed annotation guidelines simply to improve inter-annotator agreement, we adopt the view of Aroyo and Welty (2014) that annotator disagreement can be an indicator for language ambiguity and semantic similarity of target annotations. We considered using crowdsourcing for this task, which is particularly useful when harnessing disagreement to gain insight into the data and task. However, platforms like CrowdFlower and MTurk are not suitable for annotation of long texts, and eliminating context was not an option in our view. Therefore, the task is currently designed to be performed by a team of expert annotators, and we will experiment with different thresholds to decide which annotations should be preserved for Tasks 2 and 3. In the future, we might experiment with alternative crowdsourcing platforms.

3.2 Task 2: Proposition Identification

A sentence can contain many propositions. For instance, the article sentence discussed earlier in this paper (repeated below) contains three propositions centered around the predicates marked in bold:²

ARTICLE: The bullies have moved from the playground to the mobile screen, and there is no escaping harassment that essentially lives in your pocket.

² We do not consider is to express a meaningful proposition in this sentence.

In Task 2 we annotate the (micro-)propositions in the sentences that have been annotated as being COMMENTED UPON. We first identify predicates that form the core of the proposition (e.g. moved, escaping and lives). Next, we relate them to their arguments and adjuncts. For the first predicate, moved, for example, we obtain the following micro-propositions:

• the bullies moved
• moved from the playground
• moved to the screen

In this task, we annotate linguistic units. Though we will experiment with obtaining crowd annotations for this task, we may need expert annotators for creating the gold standard. We expect to be able to identify micro-propositions automatically with high accuracy; a possible automation of this step is sketched below.
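As a rough illustration only (not the authors' pipeline), predicate–argument pairs of the kind listed above could be derived from a dependency parse, for example with spaCy; the model name and the extraction rules are assumptions, and real parser output will vary.

```python
# Hedged sketch: deriving candidate micro-propositions from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

def micro_propositions(sentence: str):
    props = []
    for token in nlp(sentence):
        if token.pos_ != "VERB":
            continue
        for child in token.children:
            # predicate + subject, e.g. "The bullies moved"
            if child.dep_ in ("nsubj", "nsubjpass"):
                props.append(" ".join(t.text for t in child.subtree) + " " + token.text)
            # predicate + prepositional adjunct, e.g. "moved from the playground"
            if child.dep_ == "prep":
                props.append(token.text + " " + " ".join(t.text for t in child.subtree))
    return props

print(micro_propositions("The bullies have moved from the playground to the mobile screen."))
# Approximate output: ['The bullies moved', 'moved from the playground',
#                      'moved to the mobile screen']
```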

3.3 Task 3: Agreement Classification

The final goal of the task is to identify the specific micro-propositions in the editorial article that are commented upon in a certain comment, and to determine whether the commenter agrees or disagrees with the author of the article on the perspective values of these micro-propositions. Thus, the final step concerns classifying the relation between the comment and the micro-propositions in terms of agreement and disagreement. For example, there is disagreement between the author and the commenter about the factual status of moving. We aim to obtain this information by asking the crowd to compare micro-propositions in the original text to those in the comment.

Even though most irrelevant micro-propositions have been eliminated in Task 1, we need an IRRELEVANT tag to mark any remaining micro-propositions for which (dis)agreement cannot be determined (e.g. all those obtained for escaping and lives in our example).

3.4 Interaction between Subtasks

The first two subtasks are primarily used to provide the necessary input for the third subtask. The relation between the second and third task is clear: in the second task, we create the units of comparison, and in the third task we annotate the actual (dis)agreement. Similarly, the first task directly provides the input for the second task. The relation between the first and third task is more complex. In order to establish whether a comment comments upon a specific sentence, we need to determine if there is any (dis)agreement with the sentence in question. A natural question may be how this can be done if this information is only made explicit in subtask 3, or why we need to carry out subtasks 2 and 3 if we already established (dis)agreement in subtask 1. The main difference lies in the level of specificity of the two tasks. In subtask 1, annotators are asked if a comment addresses a given sentence in any way. Subtask 3 dives deeper into the interpretation by asking for each micro-proposition in the sentence whether the commenter agrees or disagrees with it.

There may be cases where one of the subtasks assumes that there is (dis)agreement and the other that there is no relation. We use the following strategies to deal with this. When no (dis)agreement is found on a detailed level, subtask 3 provides an option to indicate that there is no relation between a micro-proposition and the comment (the IRRELEVANT tag). This captures cases that were wrongly annotated in subtask 1. If subtask 1 misses a case of (dis)agreement, this cannot be corrected in subtask 3. We can, however, maximize recall in the first subtask by using multiple annotators and a low threshold for selecting sentences (e.g. requiring only one annotator to indicate whether the sentence is commented upon). We will elaborate on this in Section 4.

4 Task 1: Pilot annotation

This section reports on a pilot annotation experiment targeted at the first subtask. Five expert annotators were asked to identify those sentences in the editorial article that were COMMENTED UPON in the comment. A set of eight editorial articles (152 unique sentences, including titles) and a total of 62 comments were provided. In total, this came down to 1,186 sentences to be annotated. We used the Content Annotation Tool (CAT) (Lenzi et al., 2012) for the annotations.

The experiment was performed in two rounds. First, simple instructions were given to the annotators to explore the data and task. For the second round, the instructions were refined by adding two simple rules: exclude titles (they are part of the meta-data), and include cases where a proposition is simply 'mentioned' rather than functioning as part of the argumentation. For example, the fact that the closing of Sweet Briar College is repeated in the comment below without its factual status being questioned most likely means that there is an agreement about it, so we do need to annotate it:

ARTICLE: Despite a beautiful campus, dedicated faculty, loyal alumnae and a significant endowment, Sweet Briar College is closing after 114 years.

COMMENT: Anyway, there's something ineffably sad to me about Sweet Briar's closing.

Figure 2 shows the distribution of the annotations in both rounds. Only the sentences that were annotated by at least one annotator (29% in Round 2) are included in the graph. We explained in Section 3.1 that identifying whether a sentence is commented upon or not is an inherently subjective task. We analyze the distribution of annotations, because distributions are more insightful for tasks where disagreement is expected than measurements of inter-annotator agreement.

[Figure 2: Distribution of annotations. Bar chart of the number of annotated sentences per number of annotators (1–5) that marked them, for Round 1 and Round 2.]

A deeper analysis of the annotated data and the annotation distribution shows the different degrees of connectivity between the annotated sentences and the comments. The sentences that were annotated by 4 or 5 annotators clearly were strongly and unambiguously related to the comment. For example, the following sentence was annotated by all of the annotators:

ARTICLE: But allowing children and teens to regulate their behavior like adults gives them room to naturally modify their own habits.

COMMENT: I empathize with your argument that allows children to regulate their behavior like adults.

In addition, the above sentence is one example of those that were annotated as being COMMENTED UPON in multiple comments. What these sentences seem to have in common is that they express an important argument or a concluding statement in the editorial article.
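As a concrete illustration of the threshold strategy from Section 3.4 (and of how a distribution like the one in Figure 2 translates into data selection), the following sketch selects COMMENTED UPON sentences at different thresholds; the sentence ids and annotator names are made up for the example.

```python
# Illustrative sketch, not the authors' code: threshold-based sentence selection
# from multiple annotators' judgements for a given comment.
from typing import Dict, Set

def select_sentences(annotations: Dict[str, Set[str]], threshold: int) -> Set[str]:
    """Keep sentences marked as COMMENTED UPON by at least `threshold` annotators."""
    return {sid for sid, annotators in annotations.items()
            if len(annotators) >= threshold}

annotations = {
    "s1": {"a1", "a2", "a3", "a4", "a5"},  # unambiguously related
    "s2": {"a2"},                          # only one annotator marked it
    "s3": set(),                           # never marked
}
high_recall = select_sentences(annotations, threshold=1)   # {'s1', 's2'}
high_quality = select_sentences(annotations, threshold=5)  # {'s1'}
```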

In the above case, the author of the article uses this argument in an online debate about Teens Hooked on Screens to argue why you should not limit your teen's screen time. Comparing this example to one where only a minority of the annotators agreed (i.e. 2 out of 5), a difference can be noticed in the amount of inference that is required to understand a relation between the sentence and the comment (i.e. the article sentence specifies how access to birth control is a win-win for young women):

ARTICLE: Giving poor young women easy access to birth control is about exactly that - control.

COMMENT: This is a rational argument for how access to birth control is a win-win for young women, their partners, and the taxpaying public who might otherwise foot the welfare bill.

The choice to annotate the sentence as being COMMENTED UPON or not depends on the question: how strong or obvious is the inference? The answer is ambiguous by nature and seems to partly depend on the annotator, given the number of total annotated sentences ranging from 123 to 212 (indicating that some annotators are more likely to annotate inference relations than others). Partly, however, it depends on the specific instance, indicated by the fact that all annotators had annotated multiple relations between sentences and comments that none of the others did.

The sentences that were not annotated at all (by none of the annotators and for none of the comments) typically included (personal) anecdotes or other background information to support or introduce the main arguments in the article. For example, the following four subsequent sentences introduce and illustrate the statements about the freeing powers of single-sex education that follow:

ARTICLE: Years ago, during a classroom visit, I observed a small group of black and Latino high school boys sitting at their desks looking into handheld mirrors. They were tasked with answering the question, "What do you see?" One boy said, "I see an ugly face." Another said, "I see a big nose."

A major advantage of asking multiple annotators is that we can use different thresholds for selecting data. If we want to create a high-quality set of clearly related sentences and comments, we can use only those sentences annotated by all. As suggested in Section 3, we can also select all sentences annotated by one or more persons to aim for high recall. Nevertheless, this will not guarantee that no sentences are missed. Our results show that each additional annotator led to more candidate sentences, indicating that five annotators may be too few and new sentences would be added by a sixth annotator. If we want to find out how many relevant micro-propositions we miss, we can address this through a study where we apply the last two subtasks on complete texts and verify how many (dis)agreement pairs are missed in subtask 1.

5 Conclusion

We described a new task for argument mining based on our perspective framework and provided the results of a pilot annotation experiment aimed at identifying the sentences of an editorial article that are COMMENTED UPON in a comment. Although a functional classification of statements was not part of our original goal, looking at argumentative texts from an interactive point of view did prove to shed new light on this more traditional argument mining task. Statements that are repeated, rephrased, attacked or supported by other debate participants seem to be the ones that are (at least perceived as) the main arguments of the text, especially when commented upon by multiple users. In contrast, statements that are not commented upon are likely to provide background information to support or introduce these arguments.

We argued that annotator disagreement is not so much undesirable as it is insightful in tasks like this, and reported on the distribution of the annotations. In our case, annotator disagreement appeared to be an indicator for the amount of inference that is needed to understand the relation between the sentence and the comment.

In the future, we plan to further experiment with the other two defined subtasks using a combination of expert annotation, semi-automatic approaches (textual similarity and entailment, generation of propositional relations) and crowdsourcing. Furthermore, we will include comment–comment relations (where one comment is a response to another) next to article–comment relations. The annotations and code for the experiment described in this paper are publicly available.³

³ github.com/ChantalvanSon/UnsharedTask-ArgumentMining-2016

References

Jacob Andreas, Sara Rosenthal, and Kathleen McKeown. 2012. Annotating agreement and disagreement in threaded discussion. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 818–822, Istanbul, Turkey.

L. Aroyo and C. Welty. 2014. The three sides of CrowdTruth. Human Computation, 1:31–34.

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT Annotation Tool. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 333–338, Istanbul, Turkey, May.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 116–124, Los Angeles, California, June. Association for Computational Linguistics.

Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo, and Piek Vossen. 2016. GRaSP: A multilayered annotation scheme for perspectives. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.

Marilyn A. Walker, Jean E. Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 812–817, Istanbul, Turkey.

Lu Wang and Claire Cardie. 2014. Improving agreement and disagreement identification in online discussions with a socially-tuned sentiment lexicon. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (ACL 2014), Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Jie Yin, Paul Thomas, Nalin Narang, and Cecile Paris. 2012. Unifying local and global agreement and disagreement classification in online debates. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, pages 61–69. Association for Computational Linguistics.

A Preliminary Study of Disputation Behavior in Online Debating Forum

Zhongyu Wei¹,², Yandi Xia¹, Chen Li¹, Yang Liu¹, Zachary Stallbohm¹, Yi Li¹ and Yang Jin¹
¹Computer Science Department, The University of Texas at Dallas, Richardson, Texas 75080, USA
²School of Data Science, Fudan University, Shanghai, P.R. China
{zywei, yandixia, chenli, yangl, stallbohm, yili, yangjin}@hlt.utdallas.edu

Abstract

In this paper, we propose a task for quality evaluation of disputing arguments. In order to understand disputation behavior, we propose three sub-tasks: detecting the disagreement hierarchy, the refutation method and the argumentation strategy, respectively. We first manually labeled a real dataset collected from an online debating forum. The dataset includes 45 disputing argument pairs. The annotation scheme was developed by three NLP researchers via annotating all the argument pairs in the dataset. Two undergraduate students were then trained to annotate the same dataset. We report annotation results from both groups. Then, another, larger dataset was annotated, and we show an analysis of the correlation between disputing quality and different disputation behaviors.

[Figure 1: A disputation example from createdebate.com (the debating topic is "Should the Gorilla have died?").]

1 Introduction

With the popularity of online debating forums such as idebate¹, convinceme² and createdebate³, researchers have been paying increasing attention to analyzing debating content, including identification of arguing expressions in online debate (Trabelsi and Zaïane, 2014), recognition of stance in ideological online debates (Somasundaran and Wiebe, 2010; Hasan and Ng, 2014; Ranade et al., 2013b), and debate summarization (Ranade et al., 2013a). However, there is still little research on quality evaluation of debating content.

Tan et al. (2016) and Wei and Liu (2016) studied the persuasiveness of comments in the sub-reddit change my view of Reddit.com. They evaluated the effectiveness of different features for the prediction of highly voted comments in terms of delta score and karma score, respectively. Although they considered some argumentation-related features, such features are merely based on lexical similarity, without modeling persuasion behavior.

In this paper, we focus on a particular action in the online debating forum, i.e., disputation. Within a debate, disputation happens when a user disagrees with a specific comment. Figure 1 gives a disputation example from the online debating forum createdebate. It presents an original argument and an argument disputing it. Our study aims to evaluate the quality of a disputing comment given its original argument and the discussed topic. In order to have a deep understanding of disputation, we analyze disputation behavior via three sub-tasks: disagreement hierarchy identification, refutation method identification and argumentation strategy identification.

We first manually labeled a small amount of data collected from createdebate.com. It includes 8 debate threads related to different topics. We

¹ http://idebate.org/
² http://convinceme.net
³ http://www.createdebate.com/

166 Proceedings of the 3rd Workshop on Argument Mining, pages 166–171, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics extracted all the 45 disputing pairs from these Table 1: Examples for disagreement hierarchy threads. Each pair contains two arguments and the original argument second one disputes the first one. Three NLP re- I strongly feel age for smoking and drinking should not be lowered down as it can disturb the hormonal balance of the body! searchers (the first three authors of the paper) first disputing argument developed a rough version of annotation scheme DH-LV1: Irrelevance Wat???? You are an idiot! I would definitely give you a down vote! and they annotated all the argument pairs. Based DH-LV2: Contradiction on the annotation feedback and discussioins, they I do not think this correct, it is impossible to be accepted. DH-LV3: Target Losing Argument modified the scheme. Two native English speak- So this age 21 thing is really stupid cause like i said minors still get hold to alcoholic beverages. (Age limit is non-sense because teen-age ers are then trained to annotate the same dataset. can always have alcohol.) Further, we asked one annotator with better perfor- DH-LV4: Refutation Getting involved in a war will also hurt your body as drinking and smok- mance in previous step to annotate a larger set of ing, but the age limit is 18 instead of 21. data. We then analyze the correlation between dis- puting quality and different disputation behaviors. c) DH-LV3: Target Losing Argument. The We will introduce annotation schema in Section 2 disagreement is contradiction plus reasoning and then report the annotation result in Section 3. and/or evidence. However, it aims at some- We conclude the paper in Section 4. thing slightly different from the original argu- ment. 2 Annotation Schema d) DH-LV4: Refutation. Refutation is a counter- Our annotation is performed on a pair of argu- argument quoting content from the original ar- ments from opposite sides of a specific topic. In gument. The quoting can be either explicit or each pair, the second argument disputes the first implicit. one. Any of them can hold the “supportive” stance to the discussed topic. We define four annota- tion tasks: disagreement hierarchy (DH), refuta- 2.2 Refutation Method tion method (RM), argumentation strategy (AS) When a disputing comment is labeled as refuta- and disputing quality (DQ). The first three are pro- tion, we will further identify its refutation method. posed to understand the disputation behavior. In This sub-task is proposed to indicate what aspect the disputing comment, DH indicates how the dis- of the original argument is attacked by the disput- agreement is expressed, RM describes which part ing one. Three categories are given for this sub- of the original argument is attacked, and AS shows task according to the theory of refutation methods how the argument is formed. proposed by Freeley and Steinberg (2013). Exam- ples for disputing comments using different refu- 2.1 Disagreement Hierarchy tation methods are shown in Table 2. In order to identify how users express their dis- a) . Refutation is performed agreement to the opposite argument, we borrowed RM-F: refute fallacy by attacking the fallacy of the original argu- the disagreement hierarchy from Paul Graham4. ment. 
This usually happens when the target of We modified the original version of the theory by the attack is the correctness of the claim itself combining some similar categories and proposed in the original argument. a four-level hierarchy. The definition of different types of DH is shown below. Examples of disput- b) RM-R: refute reasoning. Refutation is per- ing comments with different disagreement hierar- formed by attacking the reasoning process chies are shown in Table 1. demonstrated in the original argument. c) RM-E: refute evidence. Refutation is per- a) DH-LV1: Irrelevance. The disagreement formed by attacking the correctness of the evi- barely considers the content of the original ar- dence given in the original argument. gument. b) DH-LV2: Contradiction. The disagreement 2.3 Argumentation Strategy simply states the opposing case, with little or no supporting evidence. To dispute the original argument, the users will form their own argument. Argumentation strate- 4http://paulgraham.com/disagree.html gies have been studies in both The Toulmin Model

167 Table 2: Examples for refutation methods (OA: Table 3: Examples for argumentation strategy original argument; DA: disputing argument) Generalization Look at alan turing; government data collection on him and his homo- RM-F: refute fallacy sexual tendencies led to his suicide. OA: Humans are not animal’s and dont say that we evolved from mon- Analogy keys because we did not What has worked for drug decriminalization in the Netherlands should DA: dont say that we evolved from monkeys because we did not work in the United States. http://en.wikipedia.org/wiki/Human evolution And there’s a long list of Sign references and further reading down there. Where there’s fire, there’s smoke. RM-R: refute reasoning Cause OA: There is supposed to be equal protection under the law. If we give Beer causes drunkenness, or that drunkenness can be caused by beer. some couples benefits for being together we need to give it to the rest. Authority DA: Talking about the 14th Amendment’s Equal Protection Clause? As stated by Wikipedia: human is evolved from animal. That was talking about slavery. Principle RM-E: refute evidence OA: Dont say that we evolved from monkeys because we did not As it says, ”there is a will, there is a way”. http://en.wikipedia.org/wiki/Human evolution And there’s a long list of references and further reading down there. DA: Evolution is fake God made you retard learn it. Also Wikipedia is sooooooooo wrong random people put stuff in there and the creator 2.4 Debating quality evaluation does not even care Yah Fools We are also interested in the general quality of the disputing comment. We use three categories: bad 5 of Argumentation and the work of Walton et debate, reasonable debate and good debate. The al. (2008). In our research, we employ the classi- label should be assigned based on the content of fication version from Toulmin because it is much the disputing argument instead of annotators’ per- simpler. Six categories are used to indicate the ar- sonal preference to the topic. gumentation strategy used in the disputing argu- ment. Note that this label should be given based a) Bad debate. The disagreement is irrelevant or on user’s intention instead of the quality of the ar- simply states its attitude without any support; gument. For example, users might choose inap- the support or reasoning or fallacy is not rea- propriate evidence to support the disputing claim. sonable. We will still treat it as generalization. Examples of b) Reasonable debate. The disagreement is com- arguments with different argumentation strategies plete including contradiction and related sup- are shown in Table 3. portive evidence or reasoning. However, the argument might be attacked easily. a) Generalization. Argument by generalization assumes that a number of examples can be ap- c) Good debate. The disagreement contains con- plied more generally. tradiction and related supportive evidence or reasoning. Besides, this argument is good and b) Analogy. Argument by analogy examines al- ternative examples in order to prove that what persuasive to some extent. is true in one case is true in the other. 3 Annotation Result c) Sign. Argument by sign asserts that two or more things are so closely related that the pres- The annotation is performed on the variantA 6 ence or absence of one indicates the presence dataset provided by the 3rd workshop on ar- or absence of the other. gumentation mining collected from createde- bate.com. In such forum, each debating thread is d) Cause. 
Argument by cause attempts to estab- lish a cause and effect relationship between two about a particular topic and users can initialize a events. comment with a specific stance. Besides starting a comment, users can also reply to a comment with e) Authority. Argument by authority relies on the an intention of supporting, disputing or clarifying. testimony and reasoning of a credible source. We first work on one subset of the data, namely f) Principle. Argument by principle locates a dev to develop our annotation scheme and analyze principle that is widely regarded as valid and the annotation performance of two laymen annota- shows that a situation exists in which this prin- tors. The statistics of the original dev set are given ciple applies. in Table 4. As we can see, more than half of the g) Other. When no above-mentioned argumenta- comments are disputing ones. We extract all dis- tion strategy is identified, we label it as other. puting comments together with their original com- ment to form argument pairs as the first batch of 5http://www-rohan.sdsu.edu/˜digger/ 305/toulmin_model.htm 6Please contact authors for the annotated dataset.

168 Table 4: Statistics of the dev dataset in VariantA background of the discussion related to this topic. from createdebate.com We first look at the label distribution on all the Thread # 8 four annotation tasks based on experts’ opinion avg comment # 10.25 on batch-1. The annotation scheme changes dur- avg initial comment # 3.00 ing the annotation process via discussion, we thus avg disputation comment # 5.63 avg support comment # 1.38 are not able to provide agreement between ex- avg clarify comment # 0.25 perts. For the three disputation behavior annota- unique user # 6.7 tion tasks, experts finalize the label after discus- avg length of initial comments 87.16 avg length of disputation comments 67.02 sion. For the disputing quality evaluation, experts avg length of comments 69.24 agree on the label for bad debate but had different opinions about good and reasonable ones, since these are subjective. Therefore, for general qual- ity annotation we take the majority. Table 5 shows the detail of the annotation results. For disagree- ment hierarchy, 36 out 45 (80%) disputation are refutation. For refutation method, 20 (44%) dis- puting comments refute fallacy directly, while 7 (18%) and 9 (20%) refute evidence and reason- ing respectively. For argumentation strategy, 20 (44%) disputing comments do not use any speci- fied methods. Generalization is the most popular one while no sign and principle are found. For the Figure 2: Workflow of the annotation task disputing quality, more than half of the comments are labeled as reasonable. Only 10 (22%) are la- beled as good. our experiment dataset batch-1, 8 threads and 45 We then analyze the annotation result for pairs of arguments in total. two laymen annotators using experts’ opinion as We also analyze the relationship between dif- ground truth on batch-1. Generally speaking, the ferent disputation behaviors and the quality of the disputation behavior annotation is difficult for lay- disputing argument. To make this correlation anal- men. With only half an hour training, the perfor- ysis more convincible and also to motivate follow- mance of both annotators is not very good for la- up research for disputation analysis in the online beling the four tasks. For disagreement hierarchy, forum, we collected another batch of annotation annotators seem to have problems to distinguish on the larger dataset batch-2 from another two target losing argument and refutation. Annotator- sub-sets of variantA (i.e., test and crowdsourc- 1 mis-labels too many instances as ttarget losing ing). This batch contains 20 new topics including argument while annotator-2 gives only 1 such an- 93 pairs of disputing arguments. The correlation notation. The lowest accuracy comes from refu- analysis is then performed on the combination of tation method identification. This is because the batch-1 and batch-2. task requires deep understanding and analysis of 3.1 Annotation Result of Expert and Layman argument. For disputing quality evaluation, it is on Batch-1 easier for annotators to identify the bad argument. Distinguishing good and reasonable disputing is Three NLP researchers work together to define the much more difficult. This is because the differ- annotation scheme via annotating all the argument ence between them is very subjective. pairs in batch-1. Two undergraduate students are then hired to annotate the same set of data given 3.2 Correlation of Disputation Behavior and two days to finish all the annotation task. 
A half Disputing Quality an hour training session is used for introducing the annotation scheme and demonstrating the annota- With the same strategy, we further construct and tion process via two samples. The work flow of annotate the second batch of experiment dataset the annotation is shown in Figure 2. Annotators batch-2. Annotator-1 worked for this. Before are given the entire thread of the debating to have a the annotation, we review the error annotation

Table 5: Annotation results

(Batch-1: expert labels plus two layman annotators evaluated against the expert; Batch-2: Annotator-1 only.)

Task | Label          | Expert #  | Ann.-1 #, P, R, F-1          | Ann.-2 #, P, R, F-1          | Batch-2 Ann.-1 #
DH   | DH-LV1         | 2 (1%)    | 3, 0.667, 1.000, 0.800       | 1, 1.000, 0.500, 0.667       | 4 (4%)
DH   | DH-LV2         | 1 (2%)    | 2, 0.500, 1.000, 0.667       | 2, 0.500, 1.000, 0.667       | 5 (5%)
DH   | DH-LV3         | 6 (13%)   | 12, 0.417, 0.833, 0.556      | 1, 0.000, 0.000, 0.000       | 12 (13%)
DH   | DH-LV4         | 36 (80%)  | 27, 1.000, 0.750, 0.857      | 41, 0.829, 0.944, 0.883      | 72 (77%)
RM   | RM-E           | 7 (18%)   | 2, 1.000, 0.286, 0.444       | 6, 0.333, 0.286, 0.308       | 4 (4%)
RM   | RM-R           | 9 (20%)   | 19, 0.263, 0.556, 0.357      | 11, 0.364, 0.444, 0.400      | 27 (29%)
RM   | RM-F           | 20 (44%)  | 6, 1.000, 0.300, 0.462       | 24, 0.500, 0.600, 0.545      | 41 (44%)
AS   | generalization | 9 (20%)   | 6, 0.667, 0.667, 0.667       | 6, 0.167, 0.167, 0.167       | 5 (5%)
AS   | analogy        | 5 (11%)   | 4, 0.500, 0.400, 0.444       | 7, 0.714, 1.000, 0.833       | 8 (9%)
AS   | sign           | 0 (0%)    | 4, 0.000, 0.000, 0.000       | 0, 0.000, 0.000, 0.000       | 0 (0%)
AS   | cause          | 6 (13%)   | 12, 0.500, 0.667, 0.571      | 10, 0.600, 0.667, 0.632      | 33 (35%)
AS   | authority      | 5 (11%)   | 3, 0.667, 0.400, 0.500       | 7, 0.714, 1.000, 0.833       | 7 (8%)
AS   | principle      | 0 (0%)    | 0, 0.000, 0.000, 0.000       | 0, 0.000, 0.000, 0.000       | 0 (0%)
AS   | other          | 20 (44%)  | 16, 0.813, 0.650, 0.722      | 15, 0.800, 0.600, 0.686      | 40 (43%)
DQ   | bad            | 12 (27%)  | 21, 0.523, 0.917, 0.667      | 11, 0.818, 0.750, 0.783      | 19 (20%)
DQ   | reasonable     | 23 (51%)  | 21, 0.667, 0.609, 0.636      | 9, 0.667, 0.261, 0.375       | 58 (62%)
DQ   | good           | 10 (22%)  | 3, 0.000, 0.000, 0.000       | 25, 0.280, 0.700, 0.400      | 16 (17%)
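Table 5's per-label precision, recall and F-1 compare each layman annotator against the expert labels. A minimal sketch of how such figures could be computed is given below; the label lists are made-up placeholders, not the annotated data.

```python
# Illustrative sketch only: per-label agreement of a layman annotator with the expert.
from sklearn.metrics import precision_recall_fscore_support

expert    = ["DH-LV4", "DH-LV4", "DH-LV3", "DH-LV1", "DH-LV4"]  # placeholder data
annotator = ["DH-LV4", "DH-LV3", "DH-LV3", "DH-LV1", "DH-LV4"]

labels = ["DH-LV1", "DH-LV2", "DH-LV3", "DH-LV4"]
p, r, f, _ = precision_recall_fscore_support(
    expert, annotator, labels=labels, zero_division=0)
for lab, pi, ri, fi in zip(labels, p, r, f):
    print(f"{lab}: precision={pi:.3f} recall={ri:.3f} F-1={fi:.3f}")
```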


with the annotator to enhance his understanding of the annotation task. The annotation result of batch-2 can be seen in Table 5. We then report the correlation result between disputation behaviors and the disputing quality of the arguments on the combination of batch-1 and batch-2.

For the correlation analysis, we report the label distribution in terms of disputing quality for arguments with different disputation labels. Considering that the difference between a "good disputing" and a "reasonable disputing" is hard to decide, we treat both reasonable and good as reasonable to form a binary setting. Figure 3 shows the correlation between disputation behaviors and disputing quality. As we can see, all the arguments labeled as DH-irrelevance and DH-contradiction are bad ones, and 91.7% of DH-refutation arguments are reasonable. For argumentation strategy, analogy (100%), cause (89.7%) and authority (91.7%) are good indicators for reasonable arguments.

[Figure 3: The correlation between disputation behaviors and disputing quality (binary setting) on batch-1+batch-2.]

3.3 Discussion

We identified two major reasons for annotation errors after result analysis on batch-1. First, some categories within sub-tasks are difficult to distinguish in nature (e.g. target losing argument and refutation). Second, some disputing comments contain multiple claims and premises. This makes it difficult to identify the essential claim of the disputation. We believe we can improve the annotation performance in future work by: a) extending the time for the training session and picking some representative samples for demonstration; b) modifying the annotation scheme to avoid the ambiguity between categories; c) preprocessing the disputing comment to identify the essential argument for better annotation.

4 Conclusion

In this paper, we analyzed the disputation action in online debate. Four sub-tasks were proposed, including disagreement hierarchy identification, refutation method identification, argumentation strategy identification and disputing quality evaluation. We labeled a set of disputing argument pairs extracted from a real dataset collected from createdebate.com and showed annotation results.

Acknowledgments

The work is partially supported by DARPA Contract No. FA8750-13-2-0041 and AFOSR award No. FA9550-15-1-0346.

References

Austin Freeley and David Steinberg. 2013. Argumentation and debate. Cengage Learning.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In EMNLP, pages 751–762.

Sarvesh Ranade, Jayant Gupta, Vasudeva Varma, and Radhika Mamidi. 2013a. Online debate summarization using topic directed sentiment analysis. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM.

Sarvesh Ranade, Rajeev Sangal, and Radhika Mamidi. 2013b. Stance classification in online debates by recognizing users' intentions. In SIGDIAL, pages 61–69.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 116–124. Association for Computational Linguistics.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. arXiv preprint arXiv:1602.01103.

Amine Trabelsi and Osmar R. Zaïane. 2014. Finding arguing expressions of divergent viewpoints in online debates. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL, pages 35–43.

Douglas Walton, Christopher Reed, and Fabrizio Macagno. 2008. Argumentation schemes. Cambridge University Press.

Zhongyu Wei and Yang Liu. 2016. Is this post persuasive? Ranking argumentative comments in online forum. In ACL.


Author Index

Addawood, Aseel, 1
Aroyo, Lora, 160
Ashley, Kevin D., 50
Bar-Haim, Roy, 119
Barker, Emma, 12
Bashir, Masooda, 1
Becker, Maria, 21
Beigman Klebanov, Beata, 70
Bollegala, Danushka, 31
Boltuzic, Filip, 124
Budzynska, Katarzyna, 40
Burstein, Jill, 70
Caselli, Tommaso, 160
Conrad, Stefan, 144
Duthie, Rory, 40
Egan, Charlie, 134
Esau, Katharina, 144
Fokkens, Antske, 160
Frank, Anette, 21
Gaizauskas, Robert, 12
Ghosh, Debanjan, 82
Gurevych, Iryna, 70, 113
Gyawali, Binod, 70
Jin, Yang, 166
Karkaletsis, Vangelis, 94
Kerren, Andreas, 154
Koreeda, Yuta, 76
Lawrence, John, 40
Li, Chen, 166
Li, Yi, 166
Liebeck, Matthias, 144
Liu, Yang, 166
Maks, Isa, 160
Mandya, Angrosh, 60
Morante, Roser, 160
Muresan, Smaranda, 82
Musi, Elena, 82
Niwa, Yoshiki, 76
Palmer, Alexis, 21
Paradis, Carita, 154
Parsons, Simon, 31
Peldszus, Andreas, 103
Petasis, Georgios, 94
Rajendran, Pavithra, 31
Reed, Chris, 40
Sahlgren, Magnus, 154
Sato, Misa, 76
Savelka, Jaromir, 50
Siddharthan, Advaith, 60, 134
Skeppstedt, Maria, 154
Slonim, Noam, 119
Šnajder, Jan, 124
Song, Yi, 70
Stab, Christian, 70, 113
Stallbohm, Zachary, 166
Stede, Manfred, 103
Toledo-Ronen, Orith, 119
van Son, Chantal, 160
Vossen, Piek, 160
Wei, Zhongyu, 166
Wyner, Adam, 60, 134
Xia, Yandi, 166
Yanai, Kohsuke, 76
Yanase, Toshihiko, 76
