Argument Discovery and Extraction with the Argument Workbench
Adam Wyner, Computing Science, University of Aberdeen, Aberdeen, United Kingdom
Wim Peters, Computer Science, University of Sheffield, Sheffield, United Kingdom
David Price, DebateGraph, United Kingdom

Proceedings of the 2nd Workshop on Argumentation Mining, pages 78–83, Denver, Colorado, June 4, 2015. © 2015 Association for Computational Linguistics

Abstract

The paper discusses the architecture and development of an Argument Workbench, which is an interactive, integrated, modular tool set to extract, reconstruct, and visualise arguments. We consider a corpus with information dispersed across texts, making it essential to conceptually search for argument elements, topics, and terminology. The Argument Workbench is a processing cascade, developed in collaboration with DebateGraph. The tool supports an argument engineer in reconstructing arguments from textual sources, using information processed at one stage as input to a subsequent stage of analysis, and then building an argument graph. We harvest and pre-process comments; highlight argument indicators, speech act and epistemic terminology; model topics; and identify domain terminology. We use conceptual semantic search over the corpus to extract sentences relative to argument and domain terminology. The argument engineer uses the extracts for the construction of arguments in DebateGraph.

1 Introduction

Argumentative text is rich, multidimensional, and fine-grained, consisting of (among others): a range of (explicit and implicit) discourse relations between statements in the corpus, including indicators for conclusions and premises; speech acts and propositional attitudes; contrasting sentiment terminology; and domain terminology. Moreover, linguistic expression is various, given alternative syntactic or lexical forms for related semantic meaning. It is difficult for humans to reconstruct argument from text, let alone for a computer. This is especially the case where arguments are dispersed across unstructured textual corpora. In our view, the most productive scenario is one in which a human argument engineer is maximally assisted in her work by computational means in the form of automated text filtering and annotation. This enables the engineer to focus on text that matters and further explore the argumentation structure on the basis of the added metadata. The Argument WorkBench (AWB) captures this process of incremental refinement and extension of the argument structure, which the engineer then produces as a structured object with a visual representation.

Given the abundance of textual source data available for argumentation analysis, there is a real need for automated filtering and interpretation. Current social media platforms provide an unprecedented source of user-contributed content on almost any topic. Reader-contributed comments to a comment forum, e.g. for a news article, are a source of arguments for and against issues raised in the article, where an argument is a claim with justifications and exceptions. It is difficult to coherently understand the overall, integrated meaning of the comments.

To reconstruct the arguments sensibly and reusably, we build on a prototype Argument Workbench (AWB) (Wyner et al. (2012); Wyner (2015)), which is a semi-automated, interactive, integrated, modular tool set to extract, reconstruct, and visualise arguments. The workbench is a processing cascade, developed in collaboration with an industrial partner, DebateGraph, and used by an Argumentation Engineer, where information processed at one stage gives greater structure for the subsequent stage. In particular, we: harvest and pre-process comments; highlight argument indicators, speech act terminology, and epistemic terminology; model topics; and identify domain terminology and relationships. We use conceptual semantic search over the corpus to extract sentences relative to argument and domain terminology. The argument engineer analyses the output and then inputs extracts into the DebateGraph visualisation tool. The novelty of the work presented in this paper is the addition of terminology (domain topics and key words, speech act, and epistemic) along with the workflow analysis provided by our industrial partner. For this paper, we worked with a corpus of texts bearing on the Scottish Independence vote in 2014; however, the tool is neutral with respect to domain, since the domain terminology is derived using automatic tools.

In this short paper, we briefly outline the AWB workflow, sketch tool components, provide sample query results, discuss related work in the area, and close with a brief discussion.

2 The Argument WorkBench Workflow

The main user of the Argument WorkBench (AWB) is the Argumentation Engineer, an expert in argumentation modeling who uses the Workbench to select and interpret the text material. Although the AWB automates some of the subtasks involved, the ultimate modeler is the argumentation engineer. The AWB distinguishes between the selection and modeling tasks, where selection is computer-assisted and semi-automatic, whereas the modeling is performed manually in DebateGraph (see Figure 1).

The AWB provides a workflow and an associated set of modules that together form a flexible and extendable methodology for the detection of argument in text. Automated techniques provide textually grounded information about the conceptual nature of the domain and the argument structure by means of the detection of argument indicators. This information, in the form of textual metadata, enables the argumentation engineer to filter out potentially interesting text for eventual manual analysis, validation and evaluation.

Figure 1: Overview of the Argument WorkBench Workflow

Figure 1 shows the overall workflow. Document collection is not taken into account. In the first stage, text analysis such as topic, term and named entity extraction provides a first thematic grouping and semantic classification of relevant domain elements. This combination of topics, named entities and terms automatically provides the first version of a domain model, which assists the engineer in the conceptual interpretation and subsequent exploration. The texts filtered in this thematic way can then be filtered further with respect to argument indicators (discourse terminology, speech acts, epistemic terminology) as well as sentiment (positive and negative terminology). At each stage, the Argumentation Engineer is able to query the corpus with respect to the metadata (which we also refer to as the conceptual annotations). This complex filtering of information from across a corpus helps the Argumentation Engineer consolidate her understanding of the argumentative role of information.

3 AWB Components

3.1 Text Analysis

To identify and extract the textual elements from the source material, we use the GATE framework (Cunningham et al. (2002)) for the production of semantic metadata in the form of annotations.

GATE is a framework for language engineering applications, which supports efficient and robust text processing including functionality for both manual and automatic annotation (Cunningham et al. (2002)); it is highly scalable and has been applied in many large text processing projects; it is an open source desktop application written in Java that provides a user interface for professional linguists and text engineers to bring together a wide variety of natural language processing tools and apply them to a set of documents. The tools are concatenated into a pipeline of natural language processing modules. The main modules we are using in our bottom-up and incremental tool development (Wyner and Peters (2011)) perform the following functionalities:

• linguistic pre-processing. Texts are segmented into tokens and sentences; words are assigned Part-of-Speech (POS).

• gazetteer lookup. A gazetteer is a list of words associated with a central concept. In the lookup phase, text in the corpus is matched with terms on the lists, then assigned an annotation.

• annotation assignment through rule-based grammars, where rules take annotations and regular expressions as input and produce annotations as output.

Once a GATE pipeline has been applied, the argument engineer views the annotations in situ or using GATE's ANNIC (ANNotations In Context) corpus indexing and querying tool (see section 4), which enables semantic search for annotation patterns across a distributed corpus.

3.2 Term and Topic Extraction

In the current version of the AWB, we used two automatic approaches to developing terminology, al- [...] to TermRaider, we have used a tool to model topics, identifying clusters of terminology that are taken to statistically "cohere" around a topic; for this, we have used a tool based on Latent Dirichlet Allocation (Blei et al. (2008)). Each word in a topic is used to annotate every sentence in the corpus that contains that word. Thus, with term and topic annotation, the Argumentation Engineer is able to query the corpus for relevant, candidate passages.

3.3 DebateGraph

DebateGraph is a free, cloud-based platform that enables communities of any size to build and share dynamic interactive visualizations of all the ideas, arguments, evidence, options and actions that anyone in the community believes relevant to the issues under consideration, and to ensure that all perspectives are represented transparently,