Automatic Argument Quality Assessment - New Datasets and Methods

Assaf Toledo∗, Shai Gretz∗, Edo Cohen-Karlik∗, Roni Friedman∗, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov and Noam Slonim
IBM Research

∗ These authors equally contributed to this work.

arXiv:1909.01007v1 [cs.CL] 3 Sep 2019

Abstract

We explore the task of automatic assessment of argument quality. To that end, we actively collected 6.3k arguments, more than a factor of five compared to previously examined data. Each argument was explicitly and carefully annotated for its quality. In addition, 14k pairs of arguments were annotated independently, identifying the higher quality argument in each pair. In spite of the inherent subjective nature of the task, both annotation schemes led to surprisingly consistent results. We release the labeled datasets to the community. Furthermore, we suggest neural methods based on a recently released language model, for argument ranking as well as for argument-pair classification. In the former task, our results are comparable to state-of-the-art; in the latter task our results significantly outperform earlier methods.

1 Introduction

Computational argumentation has been receiving growing interest in the NLP community in recent years (Reed, 2016). With this field rapidly expanding, various methods have been developed for sub-tasks such as argument detection (Lippi and Torroni, 2016; Levy et al., 2014; Rinott et al., 2015), stance detection (Bar-Haim et al., 2017) and argument clustering (Reimers et al., 2019).

Recently, IBM introduced Project Debater, the first AI system able to debate humans on complex topics. The system participated in a live debate against a world champion debater, and was able to mine arguments, use them for composing a speech supporting its side of the debate, and also rebut its human competitor.[1] The underlying technology is intended to enhance decision-making.

[1] For more details: https://www.research.ibm.com/artificial-intelligence/project-debater/live/

More recently, IBM also introduced Speech by Crowd, a service which supports the collection of free-text arguments from large audiences on debatable topics to generate meaningful narratives. A real-world use-case of Speech by Crowd is in the field of civic engagement, where the aim is to exploit the wisdom of the crowd to enhance decision making on various topics. There are already several public organizations and commercial companies in this domain, e.g., Decide Madrid[2] and Zencity.[3] As part of the development of Speech by Crowd, 6.3k arguments were collected from contributors of various levels, and are released as part of this work.

[2] https://decide.madrid.es
[3] https://zencity.io

An important sub-task of such a service is the automatic assessment of argument quality, which has already shown its importance for prospective applications such as automated decision making (Bench-Capon et al., 2009), argument search (Wachsmuth et al., 2017b), and writing support (Stab and Gurevych, 2014). Identifying argument quality in the context of Speech by Crowd allows for the top-quality arguments to surface out of many contributions.

Assessing argument quality has driven practitioners in a plethora of fields for centuries, from philosophers (Aristotle et al., 1991), through academic debaters, to argumentation scholars (Walton et al., 2008). An inherent difficulty in this domain is the presumably subjective nature of the task. Wachsmuth et al. (2017a) proposed a taxonomy of quantifiable dimensions of argument quality, comprised of high-level dimensions such as cogency and effectiveness, and sub-dimensions such as relevance and clarity, that together enable the assignment of a holistic quality score to an argument.

Habernal and Gurevych (2016b) and Simpson and Gurevych (2018) take a relative approach and treat the problem as relation classification. They focus on convincingness – a primary dimension of quality – and determine it by comparing pairs of arguments with similar stance. In this view, the convincingness of an individual argument is a derivative of its relative convincingness: arguments that are judged as more convincing when compared to others are attributed higher scores. These works explore the labeling and automatic assessment of argument convincingness using two datasets introduced by Habernal and Gurevych (2016b): UKPConvArgRank (henceforth, UKPRank) and UKPConvArgAll, which contain 1k and 16k arguments and argument-pairs, respectively.

Gleize et al. (2019) also take a relative approach to argument quality, focusing on ranking convincingness of evidence. Their solution is based on a Siamese neural network, which outperforms the results achieved in Simpson and Gurevych (2018) on the UKP datasets, as well as several baselines on their own dataset, IBM-ConvEnv.[4]

[4] As this work is relatively recent and was published after our submission, we were not able to compare to it.

Here, we extend earlier work in several ways: (1) introducing a large dataset of actively collected arguments, carefully annotated for quality; (2) suggesting a method for argument-pair classification, which outperforms state-of-the-art accuracy on available datasets; (3) suggesting a method for individual argument ranking, which achieves results comparable to the state of the art.

Our data was collected actively, via a dedicated user interface. This is in contrast to previous datasets, which were sampled from online debate portals. We believe that our approach to data collection is more controlled and reduces noise in the data, thus making it easier to utilize in the context of learning algorithms (see Section 7). Moreover, we applied various cleansing methods to ensure the high quality of the contributed data and the annotations, as detailed in Section 3.

We packaged our data in the following datasets, which are released to the research community:[5]

• IBM-ArgQ-6.3kArgs - the full dataset, comprised of all 6.3k arguments that were collected and annotated with an individual quality score in the range [0, 1].

• IBM-ArgQ-14kPairs - 14k argument pairs annotated with a relative quality label, indicating which argument is of higher quality.

• IBM-ArgQ-5.3kArgs - the subset of 5.3k arguments from IBM-ArgQ-6.3kArgs that passed our cleansing process. This set is used in the argument ranking experiments in Section 9.2. Henceforth: IBMRank.

• IBM-ArgQ-9.1kPairs - the subset of 9.1k argument pairs from IBM-ArgQ-14kPairs that passed our cleansing process, used in the argument-pair classification experiments in Section 9.1. Henceforth: IBMPairs.

[5] https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#ArgumentQuality

The dataset IBMRank differs from UKPRank in a number of ways. Firstly, IBMRank includes 5.3k arguments, which makes it more than 5 times larger than UKPRank. Secondly, the arguments in IBMRank were collected actively from contributors. Thirdly, IBMRank includes explicit quality labeling of all individual arguments, which is absent from earlier data, enabling us to explore the potential of training quality-prediction methods on top of such labels, presumably easier to collect.

Finally, with the abundance of technologies such as automated personal assistants, we envision automated argument quality assessment expanding to applications that include oral communication. Such use-cases pose new challenges, overlooked by prior work, that mainly focused on written arguments. As an initial attempt to address these issues, in the newly contributed data we guided annotators to assess the quality of an argument within the context of using the argument as-is to generate a persuasive speech on the topic. Correspondingly, we expect these data to reflect additional quality dimensions – e.g., a quality premium on efficiently phrased arguments, and low tolerance to blunt mistakes such as typos that may lead to poorly stated arguments.

2 Argument Collection

As part of the development of Speech by Crowd, online and on-site experiments have been conducted, enabling us to test the ability of the service to generate a narrative based on collected arguments. Arguments were collected from two main sources: (1) debate club members, including all levels, from novices to experts; and (2) a broad audience of people attending the experiments.

For the purpose of collecting arguments, we first selected 11 well known controversial concepts, common in the debate world, such as Social Media, Doping in Sports and Flu Vaccination. Using debate jargon, each concept is used to phrase two "motions", by proposing two specific and opposing policies or views towards that concept. For example, for the concept Autonomous Cars, we suggested the motions We should promote Autonomous Cars and We should limit Autonomous Cars.[6] The full list of motions appears in Table 1 with the number of arguments collected for each.[7]

[6] Habernal and Gurevych (2016b) uses the term topic for what we refer to as motion.
[7] In Table 1, vvg stands for violent video games.

Motion | #Args
Flu vaccination should be mandatory | 204
Flu vaccination should not be mandatory | 174
Gambling should be banned | 342
Gambling should not be banned | 382
Online shopping brings more harm than good | 198
Online shopping brings more good than harm | 215
Social media brings more harm than good | 879
Social media brings more good than harm | 686
We should adopt cryptocurrency | 172
We should abandon cryptocurrency | 160
We should adopt vegetarianism | 221
We should abandon vegetarianism | 179
We should ban the sale of vvg to minors | 275
We should allow the sale of vvg to minors | 240
We should ban fossil fuels | 146
We should not ban fossil fuels | 116
We should legalize doping in sport | 212
We should ban doping in sport | 215
We should limit autonomous cars | 313
We should promote autonomous cars | 480
We should support information privacy laws | 355
We should discourage information privacy laws | 93

Table 1: Motion list and statistics on data collection.

Guidelines. Contributors were invited to a dedicated user interface in which they were guided to contribute arguments per concept, using the following concise instructions:

You can submit as many arguments as you like, both pro and con, using original language and no personal information (i.e. information about an identifiable person).

In addition, to exemplify the type of arguments that we expect to receive, contributors were shown an example of one argument related to the motion, provided by a professional debater. The arguments collected had to have 8-36 words, aimed at obtaining efficiently phrased arguments (longer/shorter arguments were rejected by the UI). In total, we collected 6,257 arguments.

3 Argument Quality Labeling

We explored two approaches to labeling argument quality: (a) labeling individual arguments (absolute approach): each individual argument is directly labeled for its quality; and (b) labeling argument pairs (relative approach): each argument pair is labeled for which of the two arguments is of higher quality. In this section we describe the pros and cons of each approach as well as the associated labeling process.

Approaches to Argument Quality Labeling. The effort in labeling individual arguments scales linearly with the number of arguments, compared to the quadratic scaling of labeling pairs (within the same motion); thus, it is clearly more feasible when considering a large number of arguments. However, the task of determining the quality of arguments in isolation is presumably more challenging; it requires evaluating the quality of an argument without a clear reference point (except for the motion text). This is where the relative approach has its strength, as it frames the labeling task in a specific context of two competing arguments, and is expected to yield higher inter-annotator agreement. Indeed, a comparative approach is widely used in many NLP applications, e.g. in Chen et al. (2013) for assessing reading difficulty of documents and in Aranberri et al. (2017) for machine translation. In light of these considerations, here we decided to investigate and compare both approaches. We used the Figure Eight platform,[8] with a relatively large number of 15-17 annotators per instance, to improve the reliability of the collected annotations.

[8] http://figure-eight.com/

3.1 Labeling Individual Arguments

The goal of this task is to assign a quality score to each individual argument. Annotators were presented with the following binary question per argument:

Disregarding your own opinion on the topic, would you recommend a friend preparing a speech supporting/contesting the topic to use this argument as is in the speech? (yes/no)

All arguments that were collected as described in Section 2 were labeled in this task. We model the quality of each individual argument as a real value in the range [0, 1], by calculating the fraction of 'yes' answers. To ensure that the annotators carefully read each argument, the labeling of each argument started with a test question about the stance of the argument towards the concept (pro or con). The annotators' performance on these test questions was used in the quality control process described in Section 4, and also in determining which pairs of arguments to label.

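To make the scoring scheme concrete, the following minimal sketch shows how the [0, 1] quality score could be derived as the fraction of 'yes' answers; it is our own illustration, not the released pipeline, and the record fields ('arg_id', 'answer') as well as the function name are hypothetical:

```python
from collections import defaultdict

def individual_quality_scores(judgments):
    """Compute per-argument quality scores as the fraction of 'yes' answers.

    `judgments` is assumed to be an iterable of dicts with hypothetical keys
    'arg_id', 'annotator_id' and 'answer' ('yes'/'no') for the quality question.
    """
    yes_counts = defaultdict(int)
    totals = defaultdict(int)
    for j in judgments:
        totals[j["arg_id"]] += 1
        if j["answer"] == "yes":
            yes_counts[j["arg_id"]] += 1
    # Quality score in [0, 1]: fraction of valid 'yes' judgments per argument.
    return {arg_id: yes_counts[arg_id] / totals[arg_id] for arg_id in totals}


# Usage example with two toy arguments.
toy = [
    {"arg_id": "a1", "annotator_id": "w1", "answer": "yes"},
    {"arg_id": "a1", "annotator_id": "w2", "answer": "no"},
    {"arg_id": "a2", "annotator_id": "w1", "answer": "yes"},
]
print(individual_quality_scores(toy))  # {'a1': 0.5, 'a2': 1.0}
```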

3.2 Labeling Argument Pairs

In this task, annotators were presented with a pair of arguments, having the same stance towards the concept (to reduce bias due to the annotator's opinion), and were asked the following:

Which of the two arguments would have been preferred by most people to support/contest the topic?

Table 2 presents an example of such an argument pair, in which the annotators unanimously preferred the first argument.

Argument 1: Children emulate the media they consume and so will be more violent if you don't ban them from violent video games
Argument 2: These are less fun and more harmful games but specifically violent games are played in violent groups and exclude softer souls

Table 2: An example of an argument pair for the motion We should ban the sale of violent video games to minors. The first argument was unanimously preferred by all annotators.

As mentioned, annotating all pairs in a large collection of arguments is often not feasible. Thus, we focused our attention on pairs that are presumably most valuable to train a learning algorithm. Specifically, we annotated 14k randomly selected pairs that satisfy the following criteria:

1. At least 80% of the annotators agreed on the stance of each argument, aiming to focus on clearly stated arguments.

2. The individual quality scores in each pair differ by at least 0.2, aiming for pairs with a relatively high chance of a clear winner.

3. The length of both arguments, as measured by number of tokens, differs by ≤ 20%, aiming to focus the task on dimensions beyond argument length.

4 Quality Control

To monitor and ensure the quality of collected annotations, we employed the following analyses:

Kappa Analysis –

1. Pairwise Cohen's kappa (κ) (Cohen, 1960) is calculated for each pair of annotators that share at least 50 common argument/argument-pair judgments, and based only on those common judgments.

2. Annotator-κ is obtained by averaging all pairwise κ for this annotator as calculated in Step 1, and if and only if this annotator had ≥ 5 pairwise κ values estimated. This is used to ignore annotators as described later.

3. Averaging all Annotator-κ, calculated in Step 2, results in Task-Average-κ.[9]

[9] It is noted that some annotators remain without valid Annotator-κ and cannot be filtered out based on their κ. Similarly, those annotators do not contribute to the Task-Average-κ. However, in both annotation tasks, those annotators contributed only 0.01-0.03 of the judgments collected.

Test Questions Analysis – Hidden embedded test questions, based on ground truth, are often valuable for monitoring crowd work. In our setup, at least one fifth of the judgments provided by each annotator are on test questions. When annotators fail a test question, they are alerted. Thus, beyond monitoring annotator quality, test questions also provide annotators feedback on task expectations. In addition, an annotator that fails more than a pre-specified fraction (e.g., 20%) of the test questions is removed from the task, and his judgments are ignored.

High Prior Analysis – An annotator that always answers 'yes' to a particular question should obviously be ignored; more generally, we discarded the judgments contributed by annotators with a relatively high prior to answer positively on the presented questions.

Note, if an annotator is discarded due to failure in any of the above analyses, he is further discarded from the estimation of Annotator-κ and Task-Average-κ.

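As an illustration of the kappa-based part of this quality control, the sketch below computes pairwise Cohen's kappa over shared judgments, Annotator-κ and Task-Average-κ; it is not the authors' code, the input layout (a dict of per-annotator judgments) is an assumption, and scikit-learn's cohen_kappa_score is used as one possible kappa implementation:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

MIN_SHARED = 50   # minimum common judgments per annotator pair
MIN_PAIRS = 5     # minimum pairwise kappa estimates per annotator

def annotator_and_task_kappa(judgments):
    """judgments: dict mapping annotator_id -> {item_id: label}."""
    pairwise = {a: [] for a in judgments}
    for a, b in combinations(judgments, 2):
        shared = sorted(set(judgments[a]) & set(judgments[b]))
        if len(shared) < MIN_SHARED:
            continue
        k = cohen_kappa_score([judgments[a][i] for i in shared],
                              [judgments[b][i] for i in shared])
        pairwise[a].append(k)
        pairwise[b].append(k)
    # Annotator-kappa: mean of pairwise kappas, kept only if enough estimates exist.
    annotator_kappa = {a: sum(ks) / len(ks)
                       for a, ks in pairwise.items() if len(ks) >= MIN_PAIRS}
    # Task-Average-kappa: mean over all valid Annotator-kappa values.
    task_avg = (sum(annotator_kappa.values()) / len(annotator_kappa)
                if annotator_kappa else float("nan"))
    return annotator_kappa, task_avg
```
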
5 Data Cleansing

5.1 Cleansing of Individual Arguments Judgments

To enhance the quality of the collected data, we discard judgments by annotators who (1) failed ≥ 20% of the test-questions;[10] and/or (2) obtained Annotator-κ ≤ 0.35 in the stance judgment task; and/or (3) answered 'yes' for ≥ 80% of the quality judgment questions. Finally, we discarded arguments that were left with less than 7 valid judgments. This process left us with 5.3k arguments, each with 11.4 valid annotations on average. The Task-Average-κ was 0.69 on the stance question and 0.1 on the quality question. We refer to the full, unfiltered, set as IBM-ArgQ-6.3kArgs, and to the filtered set as IBM-ArgQ-5.3kArgs (IBMRank).

[10] Since quality judgments are relatively subjective, we focused the test questions on the stance question.

For completeness, we also attempted to utilize an alternative data cleansing tool, MACE (Hovy et al., 2013). We ran MACE with a threshold k, keeping the top k percent of arguments according to their entropy. We then re-calculated Task-Average-κ on the resulting dataset. We ran MACE with k=0.95, as used in Habernal and Gurevych (2016b), and with k=0.85, as this results in a dataset similar in size to IBMRank. The resulting Task-Average-κ is 0.08 and 0.09, respectively, lower than our reported 0.1. We thus maintain our approach to data cleansing as described above.

The low average κ of 0.1 for quality judgments is expected due to the subjective nature of the task, but nonetheless requires further attention. Based on the following observations, we argue that the labels inferred from these annotations are still meaningful and valuable: (1) the high Task-Average-κ on the stance task conveys that the annotators carefully read the arguments before providing their judgments; (2) we report high agreement of the individual quality labels with the argument-pair annotations, for which much better κ values were obtained (see Section 6.1); (3) we demonstrate that the collected labels can be successfully used by a neural network to predict argument ranking (see Section 9.2), suggesting these labels carry a real signal related to arguments' properties.

5.2 Cleansing of Argument Pair Labeling

To enhance the quality of the collected pairwise data, we discard judgments by annotators who (1) failed ≥ 30% of the test-questions; and/or (2) obtained Annotator-κ ≤ 0.15 in this task. Here, the test questions were directly addressing the (relative) quality judgment of pairs, and not the stance of the arguments. In initial annotation rounds the test questions were created based on the previously collected individual argument labels - considering pairs in which the difference in individual quality scores was ≥ 0.6.[11] In following annotation rounds, the test questions were defined based on pairs for which ≥ 90% of the annotators agreed on the winning argument. Following this process we were left with an average of 15.9 valid annotations for each pair, and with a Task-Average-κ of 0.42 on the quality judgments – a relatively high value for such a subjective task. As an additional cleansing step, for training the learning algorithms, we considered only pairs for which ≥ 70% of the annotators agreed on the winner, leaving us with a total of 9.1k pairs for training and evaluation. We refer to the full, unfiltered, set as IBM-ArgQ-14kPairs, and to the filtered set as IBM-ArgQ-9.1kPairs.

[11] Note that annotators have the option of contesting problematic test questions, and thus unfitting ones were disabled during the task by our team.

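The annotator-level filters of Sections 5.1 and 5.2 reduce to a few thresholds; the following sketch applies them to hypothetical per-annotator statistics (the field names are ours, and only the thresholds are taken from the text):

```python
def keep_annotator(stats, task="individual"):
    """Decide whether an annotator's judgments are retained.

    `stats` is a hypothetical dict with keys:
      'test_fail_rate'  - fraction of failed test questions,
      'annotator_kappa' - Annotator-kappa (None if not estimable),
      'yes_rate'        - fraction of 'yes' answers on the quality question
                          (only used for the individual-argument task).
    """
    if task == "individual":
        if stats["test_fail_rate"] >= 0.20:
            return False
        if stats["annotator_kappa"] is not None and stats["annotator_kappa"] <= 0.35:
            return False
        if stats["yes_rate"] >= 0.80:
            return False
    else:  # argument-pair task
        if stats["test_fail_rate"] >= 0.30:
            return False
        if stats["annotator_kappa"] is not None and stats["annotator_kappa"] <= 0.15:
            return False
    return True
```
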
6 Data Consistency

6.1 Consistency of Labeling Tasks

Provided with both individual and pairwise quality labeling, we estimated the consistency of these two approaches. For each pair of arguments, we define the expected winning argument as the one with the higher individual argument score, and compare that to the actual winning argument, namely the argument preferred by most annotators when considering the pair directly. Overall, in 75% of the pairs the actual winner was the expected one. Moreover, when focusing on pairs in which the individual argument scores differ by > 0.5, this agreement reaches 84.3% of pairs.

6.2 Reproducibility Evaluation

An important property of a valid annotation is its reproducibility. For this purpose, a random sample of 500 argument pairs from the IBMPairs dataset was relabeled by the crowd. This relabeling took place a few months after the main annotation tasks, with the exact task and data cleansing methods that were employed originally. For measuring correlation, the following A score was defined: the fraction of valid annotations selecting "argument A" in an argument pair (A, B) as having higher quality, out of the total number of valid annotations. Pearson's correlation coefficient between the A score in the initial and secondary annotation of the defined sample was 0.81.

A similar process was followed with the individual argument quality labeling. Instead of relabeling, we split the existing annotations into two even groups. We chose only individual arguments for which at least 14 valid annotations remained after data cleansing (1,154 such arguments). This resulted in two sets of labels for the same data, each based on at least 7 annotations. Pearson's correlation coefficient between the quality scores of the two sets was 0.53. We then divided the quality score, which ranges between 0 and 1, into 10 equal bins. The bin frequency counts between the two sets are displayed in the heatmap in Figure 1.
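Both consistency measures described in this section are simple to compute; the sketch below is our own illustration, with hypothetical data structures rather than the released file format:

```python
from scipy.stats import pearsonr

def winner_agreement(individual_scores, pair_winners, min_diff=0.0):
    """Section 6.1: fraction of pairs whose annotator-preferred (actual) winner
    matches the expected winner, i.e. the argument with the higher individual score.

    individual_scores: dict arg_id -> quality score in [0, 1]
    pair_winners: dict (arg_a, arg_b) -> arg_id preferred by most annotators
    min_diff: only count pairs whose individual scores differ by more than this
              value (e.g., 0.5 reproduces the stricter setting reported above).
    """
    hits, total = 0, 0
    for (a, b), actual in pair_winners.items():
        if abs(individual_scores[a] - individual_scores[b]) <= min_diff:
            continue
        expected = a if individual_scores[a] > individual_scores[b] else b
        total += 1
        hits += int(expected == actual)
    return hits / total if total else float("nan")

def reproducibility_correlation(a_scores_round1, a_scores_round2):
    """Section 6.2: Pearson correlation between the A scores (fraction of valid
    annotations preferring argument A) of the initial and repeated labeling."""
    r, _ = pearsonr(a_scores_round1, a_scores_round2)
    return r
```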

[Figure 1: Counts of quality score bins between two equally sized sets of annotators.]

6.3 Transitivity Evaluation

Following Habernal and Gurevych (2016b), we further examined to what extent our labeled pairs satisfy transitivity. Specifically, a triplet of arguments (A, B, C) in which A is preferred over B, and B is preferred over C, is considered transitive if and only if A is also preferred over C. We examined all 892 argument triplets for which all pair-wise combinations were labeled, and found that transitivity holds in 96.2% of the triplets, further strengthening the validity of our data.
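The transitivity check can be expressed directly in code; in the sketch below, the mapping from each labeled pair to its majority winner is a hypothetical structure of ours, not the released data format:

```python
from itertools import combinations, permutations

def transitivity_rate(winners):
    """winners: dict mapping frozenset({x, y}) of a labeled pair to the id of
    the argument preferred by most annotators. Returns the fraction of fully
    labeled triplets that respect transitivity (A > B and B > C implies A > C)."""
    def prefers(x, y):
        return winners[frozenset((x, y))] == x

    args = sorted({a for pair in winners for a in pair})
    transitive, total = 0, 0
    for triple in combinations(args, 3):
        # Consider only triplets for which all three pairs were labeled.
        if not all(frozenset(p) in winners for p in combinations(triple, 2)):
            continue
        total += 1
        # Transitive iff some chain A > B > C with A > C exists; with strict
        # majority winners this fails only for preference cycles.
        ok = any(prefers(a, b) and prefers(b, c) and prefers(a, c)
                 for a, b, c in permutations(triple))
        transitive += int(ok)
    return transitive / total if total else float("nan")
```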

7 Comparison of IBMRank and UKPRank

A distinctive feature of our IBMRank dataset is that it was collected actively, via a dedicated user interface with clear instructions and enforced length limitations. Correspondingly, we end up with cleaner texts, which are also more homogeneous in terms of length, compared to UKPRank, which relies on arguments collected from debate portals.

Text Cleanliness. We counted tokens representing a malformed span of text in IBMRank and UKPRank. These are HTML markup tags, links, excessive punctuation,[12] and tokens not found in the GloVe vocabulary (Pennington et al., 2014). Our findings show that 94.78% of IBMRank arguments contain no malformed text, 4.38% include one such token, and 0.71% include two such tokens. In the case of UKPRank, only 62.36% of the arguments are free of malformed text, 17.59% include one such token, and 20.05% include two or more tokens of malformed text.

[12] Sequences of three or more punctuation characters, e.g. "?!?!?!"

Text Length. As depicted in Figure 2, the arguments in IBMRank are substantially more homogeneous in their length compared to UKPRank. A potential drawback of the length limitation is that it possibly prevents a learning system from being able to model long arguments correctly. However, by imposing this restriction we expect our quality labeling to be less biased by argument length, holding greater potential to reveal other properties that contribute to argument quality. We confirmed this intuition with respect to the argument pair labeling, as described in Section 9.1.

[Figure 2: Histograms of argument length in IBMRank and UKPRank. X-axis: length (token count). Y-axis: the number of arguments at that length.]

Data Size and Individual Argument Labeling. Finally, IBMRank covers 5,298 arguments, compared to 1,052 in UKPRank. In addition, in UKPRank no individual labeling is provided, and individual quality scores are inferred from pairs labeling. In contrast, for IBMRank each argument is individually labeled for quality, and we explicitly demonstrate the consistency of this individual labeling with the provided pairwise labeling.

8 Methods

In this section we describe neural methods for predicting the individual score and the pair-wise classification of arguments. We devise two methods corresponding to the two newly introduced datasets. Our methods are based upon a powerful language representation model named Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), which achieves state-of-the-art results on a wide range of tasks in NLP (Wang et al., 2018; Rajpurkar et al., 2016, 2018). BERT has been extensively trained over large corpora to perform two tasks: (1) Masked Language Model - randomly replace words with a predefined token, [MASK], and predict the missing word. (2) Next Sentence Prediction - given a pair of sentences A and B, predict whether sentence B follows sentence A. Due to its bidirectional nature, BERT achieves remarkable results when fine-tuned to different tasks without the need for specific modifications per task. For further details refer to Devlin et al. (2018).

8.1 Argument-Pair Classification

We fine-tune BERT's Base Uncased English pre-trained model for a binary classification task.[13] The fine-tuning process is initialized with the weights of the general purpose pre-trained model, and a task-specific weight matrix W_out ∈ R^(768×2) is added on top of the 12-layer base network. Following standard practice with BERT, given a pair of arguments A and B, we feed the network with the sequence '[CLS]A[SEP]B'. The [SEP] token indicates to the network that the input is to be treated as a pair, and [CLS] is a token which is used to obtain a contextual embedding for the entire sequence. The network is trained for 3 epochs with a learning rate of 2e-5. We refer to this model as Arg-Classifier.

[13] Initial experiments with BERT's Large model showed only minor improvements, so for the purpose of the experiments detailed in Section 9 we used the Base model.
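The paper does not include code, so the following is only a sketch of this kind of pair-classification fine-tuning, using the Hugging Face transformers library as one possible implementation (an assumption on our side); the tokenizer builds the [CLS]/[SEP] pair encoding described above, and the label convention is hypothetical:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(arg_pairs, labels):
    """arg_pairs: list of (argument_A, argument_B) strings;
    labels: 1 if argument A is the higher-quality one, else 0
    (a convention assumed here for illustration)."""
    # Encode each pair with the [CLS]/[SEP] scheme described above.
    batch = tokenizer([a for a, _ in arg_pairs], [b for _, b in arg_pairs],
                      padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Running such a loop for 3 epochs at the learning rate above mirrors the hyper-parameters reported in the text.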

8.2 Argument Ranking

To train a model that outputs a score in [0, 1], we obtain contextual embeddings from the fine-tuned Arg-Classifier model. We concatenate the last 4 layers of the model output to obtain an embedding vector of size 4 × 768 = 3072. The embedding vectors are used as input to a neural network with a single output and one hidden layer with 300 neurons.

In order for the network to output values in [0, 1], we use a sigmoid activation, σ_sigmoid(x) = 1 / (1 + e^(-x)). Denoting the weight matrices W_1 ∈ R^(3072×300) and W_2 ∈ R^(300×1), the regressor model f_R is a 2-layered neural network with σ_relu(x) = max{0, x} activation.[14] f_R can be written as:

f_R(x) = σ_sigmoid(W_2^T σ_relu(W_1^T x))

where x ∈ R^3072 is the embedding vector representing an argument. We refer to this regression model as Arg-Ranker.

[14] We omit bias terms for readability.
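A minimal PyTorch sketch of this regression head follows; it is our illustration rather than the released implementation, the 3072-dimensional inputs are assumed to be produced elsewhere from the fine-tuned Arg-Classifier encoder, bias terms are kept (the equation above omits them), and the MSE objective in the usage note is an assumption, since the training loss is not stated:

```python
import torch
from torch import nn

class ArgRankerHead(nn.Module):
    """Two-layer regressor over concatenated BERT embeddings (4 x 768 = 3072)."""

    def __init__(self, in_dim: int = 3072, hidden: int = 300):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden)   # W1 (bias kept, unlike the equation)
        self.out = nn.Linear(hidden, 1)           # W2
        # f_R(x) = sigmoid(W2^T relu(W1^T x)), producing a score in [0, 1].

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.out(torch.relu(self.hidden(x)))).squeeze(-1)

# Usage: x is a batch of argument embeddings; one could train with MSE against
# the crowd-derived quality scores (an assumed choice of loss).
ranker = ArgRankerHead()
scores = ranker(torch.randn(8, 3072))   # tensor of 8 scores in (0, 1)
```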

9 Experiments

9.1 Argument-Pair Classification

In this section we evaluate the methods described in Section 8. First, we evaluate the accuracy of Arg-Classifier on our IBMPairs dataset and on UKPConvArgStrict (henceforth, UKPStrict), the filtered argument-pairs dataset of Habernal and Gurevych (2016b), in k-fold cross-validation.[15] We calculate accuracy and ROC area under curve (AUC) for each fold, and report the weighted averages over all folds. We also evaluate Simpson and Gurevych (2018)'s GPPL median heuristic method with GloVe + ling features in cross-validation on our IBMPairs dataset. For completeness, we quote Simpson and Gurevych (2018)'s figures of GPPL opt. and GPC on UKPStrict.[16] We add a simple baseline classifying arguments based on their token count (Arg-Length).

[15] 22 and 32 folds, respectively.
[16] We were unable to reproduce the results reported in Simpson and Gurevych (2018) by running the GPPL opt. and GPC algorithms on the UKPStrict dataset. We have approached the authors and reported the issue, which was not solved by the time this paper was published, and hence we only quote the figures as reported there.

IBMPairs | Arg-Length | Arg-Classifier | GPPL
Acc.     | .55        | .80            | .71
AUC      | .59        | .86            | .78

UKPStrict | Arg-Length | Arg-Classifier | GPPL | GPPL opt. | GPC
Acc.      | .76        | .83            | .79  | .80       | .81
AUC       | .78        | .89            | .87  | .87       | .89

Table 3: Accuracy and AUC on IBMPairs and UKPStrict.

As can be seen in Table 3, Arg-Classifier improves on the GPPL method on both datasets (p ≪ .01 using a two-tailed Wilcoxon signed-rank test).[17] We note that Arg-Classifier's accuracy on the UKPStrict set is higher than all methods tested on this dataset in Habernal and Gurevych (2016b); Simpson and Gurevych (2018). Interestingly, all methods reach higher accuracy on UKPStrict compared to IBMPairs, presumably indicating that the data in IBMPairs is more challenging to classify. With regards to Arg-Length, we can see that it is inaccurate on IBMPairs but achieves a respectable result on UKPStrict. This is in agreement with Habernal and Gurevych (2016a), who analyzed the reasons that annotators provided for their labeling. In most cases the reason indicated preference for arguments with more information – which is what longer arguments tend to be better at. This further strengthens the value of creating IBMPairs and IBMRank as much more homogeneous datasets in terms of argument length.

[17] The results per fold in both tasks are included in the supplementary material.

9.2 Argument Ranking

We proceed to evaluate the Arg-Ranker on the IBMRank and UKPRank datasets in k-fold cross-validation, and report weighted correlation measures. We also evaluate the Arg-Ranker by feeding it vanilla BERT embeddings, instead of the fine-tuned embeddings generated by the Arg-Classifier model. We refer to this version as Arg-Ranker-base. In both the Arg-Ranker and Arg-Ranker-base evaluations we report the mean of 3 runs.[18]

[18] The GPPL regressor of Simpson and Gurevych (2018) relies on pair-wise (relative) labeling of arguments and as a result it cannot be used for predicting the individual (absolute) labeling of arguments, as in IBMRank.

  | IBMRank: Arg-Ranker-base | IBMRank: Arg-Ranker | UKPRank: Arg-Ranker-base | UKPRank: Arg-Ranker | UKPRank: GPPL
r | .41 | .42 | .44 | .49 | .45
ρ | .38 | .41 | .57 | .59 | .65

Table 4: Pearson's (r) and Spearman's (ρ) correlation of Arg-Ranker-base, Arg-Ranker and GPPL on the IBMRank and UKPRank datasets.

As can be seen in Table 4, on the UKPRank dataset, Arg-Ranker is slightly better than GPPL for Pearson's correlation, but slightly worse for Spearman's correlation. Additionally, using direct BERT embeddings provides worse correlation[19] than using the Arg-Classifier embeddings for both datasets, justifying the use of the latter. Finally, similarly to the findings in the argument-pair classification task, the IBMRank dataset is harder to predict.[20]

[19] Significantly for the IBMRank data on both measures, and for the UKPRank on Pearson's correlation, p ≪ .05.
[20] For the experiments on IBMRank, we included by mistake a small fraction of arguments which actually should have been filtered. The effect on the results is minimal.

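For reference, the reported measures can be computed with standard scikit-learn and scipy routines; the sketch below is our own illustration, and weighting folds by their size is an assumption, since the exact weighting scheme is not spelled out:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, roc_auc_score

def weighted_fold_metrics(folds):
    """folds: list of (y_true, y_score) per fold, where y_true holds binary
    pair labels and y_score the predicted probability of the first argument
    winning. Returns fold-size-weighted accuracy and ROC AUC."""
    accs, aucs, sizes = [], [], []
    for y_true, y_score in folds:
        y_pred = (np.asarray(y_score) >= 0.5).astype(int)
        accs.append(accuracy_score(y_true, y_pred))
        aucs.append(roc_auc_score(y_true, y_score))
        sizes.append(len(y_true))
    return np.average(accs, weights=sizes), np.average(aucs, weights=sizes)

def ranking_correlations(gold_scores, predicted_scores):
    """Pearson's r and Spearman's rho between gold and predicted quality scores."""
    r, _ = pearsonr(gold_scores, predicted_scores)
    rho, _ = spearmanr(gold_scores, predicted_scores)
    return r, rho
```
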
10 Error Analysis

We present a qualitative analysis of examples that the Arg-Classifier and Arg-Ranker models did not predict correctly. For each of the argument-pair and ranking tasks, we analyzed 50-100 arguments from three motions on which the performance of the respective model was poor. For each motion we selected the arguments on which the model was most confident in the wrong direction.

A prominent insight from this analysis, common to both models, is that the model tends to fail when the argument's persuasiveness outweighs its delivery quality (such as bad phrasing or typos). An example of this is shown in row 1 of Table 5. In this case, Argument 2 is labeled as having a higher quality, even though it contains multiple typos, and thus is typical of arguments that the model was trained to avoid selecting.

Another phenomenon that both our models fail to address is arguments that are off-topic, too provocative or not grounded. An example of this, from the argument-pair task, is shown in row 2 - Argument 2 is presumably considered harsh by annotators, even though it is fine in terms of grammatical structure and impact on the topic. These types of arguments are becoming more important to recognize, especially in the "fake-news" era. We leave dealing with them for future work.

Finally, we also notice that certain arguments were consistently preferred by annotators, regardless of the quality of the opposing argument. This pattern, relevant only to the Arg-Classifier model, is shown in row 3.

Motion: We should ban fossil fuels | Type: Impact over delivery
Argument 1: the only way to provide any space for energy alternatives to enter the market is by artificially decreasing the power of fossil fuels through a ban.
Argument 2: fossil fuels are bad for the environment, they have so2 in them that is the thing that maks acid rain and it is today harming the environment and will only be wors.

Motion: Flu vaccination should not be mandatory | Type: Provocative or not grounded
Argument 1: the only responsible persons for kids are their parents. if they dont think that their kids should get the vaccine its their own decision.
Argument 2: the body has an automatic vaccination due to evolution, those who got sick and died are the weakest link and we are better off without them

Motion: We should abandon vegetarianism | Type: Consistent annotator preference
Argument 1: it's harder to get all the things you need for a balanced diet while being vegetarian.
Argument 2: animals deserve less rights than humans, and it is legitimate for humans to prioritize their enjoyment over the suffering of animals.

Table 5: Examples of argument pairs for which there is a high difference between the argument selected by the annotators, marked in bold, and the argument predicted to be of higher quality by the model, marked in italics.

11 Conclusions and Future Work

A significant barrier in developing automatic methods for estimating argument quality is the lack of suitable data. An important contribution of this work is a newly introduced dataset composed of 6.3k carefully annotated arguments, compared to 1k arguments in previously considered data. Another barrier is the inherent subjectivity of the manual task of determining argument quality. To overcome this issue, we employed a relatively large set of crowd annotators to consider each instance, associated with various measures to ensure the quality of the annotations associated with the released data. In addition, while previous work focused on arguments collected from web debate portals, here we collected arguments via a dedicated interface, enforcing length limitations, and providing contributors with clear guidance. Moreover, previous work relied solely on annotating pairs of arguments, and used these annotations to infer the individual ranking of arguments; in contrast, here, we annotated all individual arguments for their quality, and further annotated 14k pairs. This two-fold approach allowed us, for the first time, to explicitly examine the relation between relative (pairwise) annotation and explicit (individual) annotation of argument quality. Our analysis suggests that these two schemes provide relatively consistent results. In addition, these annotation efforts may complement each other. As pairs of arguments with a high difference in individual quality scores appear to agree with argument-pair annotations, one may deduce the latter from the former. Thus, it may be beneficial to dedicate the more expensive pair-wise annotation efforts to pairs in which the difference in individual quality scores is small, reminiscent of active learning (Settles, 2009). In future work we intend to further investigate this approach, as well as explore in more detail the low fraction of cases where these two schemes led to clearly different results.

The second contribution of this work is suggesting neural methods, based on Devlin et al. (2018), for argument ranking as well as for argument-pair classification. In the former task, our results are comparable to state-of-the-art; in the latter task they significantly outperform earlier methods (Habernal and Gurevych, 2016b).

Finally, to the best of our knowledge, current approaches do not deal with argument pairs of relatively similar quality. A natural extension is to develop a ternary-class classification model that will be trained and evaluated on such pairs, as we intend to explore in future work.

Acknowledgements

We thank Tel Aviv University Debating Society, Ben Gurion University Debating Society, Yale Debate Association, HWS Debate Team, Seawolf Debate Program of the University of Alaska, and many other individual debaters.

References

Nora Aranberri, Gorka Labaka, Arantza Díaz de Ilarraza, and Kepa Sarasola. 2017. Ebaluatoia: crowd evaluation for English-Basque machine translation. Language Resources and Evaluation, 51(4):1053-1084.

Aristotle, G.A. Kennedy, and G.A. Kennedy. 1991. On Rhetoric: A Theory of Civic Discourse. Oxford University Press.

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251-261, Valencia, Spain. Association for Computational Linguistics.

T. Bench-Capon, K. Atkinson, and Peter McBurney. 2009. Altruism and agents: an argumentation based approach to designing agent decision mechanisms, pages 1073-1080.

Xi Chen, Paul N. Bennett, Kevyn Collins-Thompson, and Eric Horvitz. 2013. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 193-202, New York, NY, USA. ACM.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, pages 37-46.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. 2019. Are you convinced? Choosing the more convincing evidence with a Siamese network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 967-976, Florence, Italy. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016a. What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in web argumentation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1214-1223, Austin, Texas. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016b. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589-1599. Association for Computational Linguistics.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120-1130, Atlanta, Georgia. Association for Computational Linguistics.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489-1500, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Trans. Internet Technol., 16(2):10:1-10:25.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics.

Chris Reed. 2016. Proceedings of the Third Workshop on Argument Mining (ArgMining2016). Association for Computational Linguistics.

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. CoRR, abs/1906.09821.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440-450, Lisbon, Portugal. Association for Computational Linguistics.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison.

Edwin D. Simpson and Iryna Gurevych. 2018. Finding convincing arguments using scalable Bayesian preference learning. Transactions of the Association for Computational Linguistics, 6:357-371.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In COLING, pages 1501-1510. ACL.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017a. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176-187. Association for Computational Linguistics.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017b. Building an argument search engine for the web. In ArgMining@EMNLP, pages 49-59. Association for Computational Linguistics.

Douglas Walton, Chris Reed, and Fabrizio Macagno. 2008. Argumentation Schemes. Cambridge University Press.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461.