Evidence Type Classification in Randomized Controlled Trials Tobias Mayer, Elena Cabrio, Serena Villata

Evidence Type Classification in Randomized Controlled Trials Tobias Mayer, Elena Cabrio, Serena Villata To cite this version: Tobias Mayer, Elena Cabrio, Serena Villata. Evidence Type Classification in Randomized Controlled Trials. 5th ArgMining@EMNLP 2018, Oct 2018, Brussels, Belgium. hal-01912157 HAL Id: hal-01912157 https://hal.archives-ouvertes.fr/hal-01912157 Submitted on 5 Nov 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Evidence Type Classification in Randomized Controlled Trials Tobias Mayer and Elena Cabrio and Serena Villata Universite´ Coteˆ d’Azur, CNRS, I3S, Inria, France ftmayer,cabrio,[email protected] Abstract To answer this question, we propose to resort Randomized Controlled Trials (RCT) are a on Argument Mining (AM) (Peldszus and Stede, common type of experimental studies in the 2013; Lippi and Torroni, 2016a), defined as “the medical domain for evidence-based decision general task of analyzing discourse on the prag- making. The ability to automatically extract matics level and applying a certain argumentation the arguments proposed therein can be of valu- theory to model and automatically analyze the data able support for clinicians and practitioners in at hand” (Habernal and Gurevych, 2017). Two their daily evidence-based decision making ac- stages are crucial: (1) the detection of argument tivities. Given the peculiarity of the medical domain and the required level of detail, stan- components (e.g., claim, premises) and the iden- dard approaches to argument component de- tification of their textual boundaries, and (2) the tection in argument(ation) mining are not fine- prediction of the relations holding between the ar- grained enough to support such activities. In guments. In the AM framework, we propose a this paper, we introduce a new sub-task of the new task called evidence type classification, as a argument component identification task: evi- sub-task of the argument component identification dence type classification . To address it, we task. The distinction among different kinds of ev- propose a supervised approach and we test it on a set of RCT abstracts on different medical idence is crucial in evidence-based decision mak- topics. ing as different kinds of evidence are associated to different weights in the reasoning process. Such 1 Introduction information need to be extracted from raw text. Evidence-based decision making in medicine has To the best of our knowledge, this is the first the aim to support clinicians and practitioners to approach in AM targeting evidence type classifi- reason upon the arguments in support or against a cation in the medical domain. The main contribu- certain treatment, its effects, and the comparison tions of this paper are: (i) we propose four classes with other related treatments for the same disease. of evidence for RCT (i.e., comparative, signifi- These approaches (e.g., (Hunter and Williams, cance, side-effect, and other), and we annotate a 2012; Craven et al., 2012; Longo and Hederman, new dataset of 169 RCT abstracts with such labels, 2013; Qassas et al., 2015)) consider different kinds and (ii) we experiment with supervised classifiers of data, e.g., Randomized Controlled Trials or over such dataset obtaining satisfactory results. other observational studies, and they usually re- quire transforming the unstructured textual information into structured information as input of the 2 Evidence type classification reasoning framework. This paper proposes a pre- liminary step towards the issue of providing this In (Mayer et al., 2018), as a first step towards the transformation, starting from RCT, i.e., documents extraction of argumentative information from clin- reporting experimental studies in the medical do- ical data, we extended an existing corpus (Trenta main. More precisely, the research question we et al., 2015) on RCT abstracts, with the annota- answer in this paper is: how to distinguish differ- tions of the different argument components (evi- ent kinds of evidence in RCT, so that fine-grained dence, claim, major claim). The structure of RCTs evidence-based decision making activities are sup- should follow the CONSORT policies to ensure ported? a minimum consensus, which makes the studies comparable and ideal for building a corpus1. RCT risk of bias, when it comes to determining the abstracts were retrieved directly from PubMed2 by quality of the evidence. As stated by (Bellomo searching for the disease name and specifying that and Bagshaw, 2006) there are also other aspects it has to be a RCT. This version of the corpus with of the trial quality, which impinge upon the truth- coarse labels contains 927 argument components fulness of the findings. As a step forward, in this (679 evidence and 248 claims) from 159 abstracts work we extend the corpus annotation, specifying comprising 4 different diseases (glaucoma, hyper- four classes of evidence, which are most promi- tension, hepatitis b, diabetes). nent in our data and assist in assessing these com- In particular, an evidence in a RCT is an obser- plex quality dimensions, like reproducibility, gen- vation or measurement in the study (ground truth), eralizability or the estimate of effect: which supports or attacks another argument com- comparative: when there is some kind of component, usually a claim. Those observations comparison between the control and intervention arms prise side effects and the measured outcome of the (Table1, example 2). Supporting the search for intervention and control arm. They are observed similarities in outcomes of different studies, which facts, and therefore credible without further justi- is an important measure for the reproducibility. fications, since this is the ground truth the argumentation is based on. In Example 1, evidence significance: for any sentence stating that the re- are in italic, underlined and surrounded by square sults are statistically significant (Table1, example brackets with subscripts, while claims are in bold. 3). Many comparative sentences also contain statistical information. However, this class can be Example 1: To compare the intraocular pressure- seen more as a measure for the strength of ben- lowering effect of latanoprost with that eficial or potentially harmful outcomes. of dorzolamide when added to timolol. [. ] [The diurnal intraocular pressure side-effect: captures all evidence reporting any reduction was significant in both groups side-effect or adverse drug effect to see if poten- (P < 0:001)]1.[The mean intraocular pres- tial harms outweigh the benefits of an intervention sure reduction from baseline was 32% for the (Table1, example 4). latanoprost plus timolol group and 20% for other: all the evidence that do not fall under the the dorzolamide plus timolol group]2.[The least square estimate of the mean diurnal in- other categories, like non-comparative observa- traocular pressure reduction after 3 months tions, risk factors or limitations of the study (too was -7.06 mm Hg in the latanoprost plus rare occurrences to form new classes). Especially timolol group and -4.44 mm Hg in the dor- the latter can be relevant for the generalizabil- ity of the outcome of a study (Table1, example 5). zolamide plus timolol group (P < 0:001)]3. Drugs administered in both treatment groups were well tolerated. This study clearly Table2 shows the statistics of the obtained dataset. showed that [the additive diurnal intraocu- Three annotators have annotated the data after a lar pressure-lowering effect of latanoprost training phase. Inter Annotator Agreement has is superior to that of dorzolamide in pa- been calculated on 10 abstracts comprising 47 evidence, resulting in a Fleiss’ kappa of 0.88. tients treated with timolol]1. Example 1 shows different reports of the experi- 3 Proposed methods mental outcomes as evidence. Those can be re- In (Mayer et al., 2018), we addressed the argument sults without concrete measurement values (see component detection as a supervised text classifi- evidence 1), or exact measured values (see evi- cation problem: given a collection of sentences, dence 2 and 3). Different measures are annotated each labeled with the presence/absence of an argu- as multiple evidence. The reporting of side effects ment component, the goal is to train a classifier to and negative observations are also considered as detect the argumentative sentences. We retrained evidence. Traditionally evidence-based medicine an existing system, i.e. MARGOT (Lippi and Tor- (EBM) focuses mainly on the study design and roni, 2016b), to detect evidence and claims from 1http://www.consort-statement.org/ clinical data. The methods we used are SubSet 2https://www.ncbi.nlm.nih.gov/pubmed/ Tree Kernels (SSTK) (Collins and Duffy, 2002), 1. Brimonidine provides a sustained long-term ocular As a step forward - after the distinction be- hypotensive effect, is well tolerated, and has a low rate of allergic response. tween argumentative (claims and evidence) and 2. The overall success rates were 87% for the 350-mm2 non-argumentative sentences - we address the task group and 70% for the 500-mm2 group (P = 0:05). of distinguishing the different types of evidence 3. All regimens produced clinically relevant and statistically significant (P < :05) intraocular pressure (see Section2). We cast it as a multi-class classi- reductions from baseline. fication problem. For that we use Support Vector 4. Allergy was seen in 9 % of subjects treated with Machines (SVMs)3 with a linear kernel and dif- brimonidine. 5. Risk of all three outcomes was higher for participants ferent strategies to transform the multi-class into with chronic kidney disease or frailty.

Evidence Type Classification in Randomized Controlled Trials Tobias Mayer, Elena Cabrio, Serena Villata

No-Decision Classification: an Alternative to Testing for Statistical

Comparison of Machine Learning Techniques When Estimating Probability of Impairment

Binary Classification Models with “Uncertain” Predictions

Social Dating: Matching and Clustering

How Will AI Shape the Future of Law?

Gaussian Process Classification Using Privileged Noise

Binary Classification Approach to Ordinal Regression

Analysis of Confirmed Cases of COVID-19

Estimating the Operating Characteristics of Ensemble Methods

Binary Classification and Logistic Regression Models Application to Crash Severity

Benchmarking Machine Learning Models for the Analysis of Genetic Data Using FRESA.CAD Binary Classiﬁcation Benchmarking

Measuring the Effects of Confounders in Medical Supervised Classification