EACL 2017

15th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Software Demonstrations

April 3-7, 2017
Valencia, Spain

Edited by Anselmo Peñas (Universidad Nacional de Educación a Distancia, Spain)

and André Martins (Unbabel and Instituto de Telecomunicações, Portugal)

© 2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL), 209 N. Eighth Street, Stroudsburg, PA 18360, USA. Tel: +1-570-476-8006, Fax: +1-570-476-0860, [email protected]

ISBN 978-1-945626-36-4

Introduction

This volume contains the contributions accepted for the Software Demonstrations at the EACL 2017 conference, which takes place from the 4th to the 7th of April 2017 in Valencia, Spain.

The primary aim of the Demonstrations program is to present software systems and solutions related to all areas of computational linguistics and natural language processing. The program encourages the early exhibition of research prototypes, but also welcomes interesting mature systems.

The number of submissions for software demonstrations again exceeded the number of available slots, which forced us to make a strict selection. The selection followed predefined criteria of relevance for the EACL 2017 conference: novelty, overall usability, and applicability to more than one linguistic level and to more than one computational linguistics field or task. We would like to thank the general chairs and the local organizers, without whom it would have been impossible to put together such a strong demonstrations program.

We hope you will find these proceedings useful and that the software systems presented here will inspire new ideas and approaches in your future work.

With kind regards,

Demo chairs Anselmo Peñas and André Martins


General Chair (GC): Mirella Lapata (University of Edinburgh)

Co-Chairs (PCs): Phil Blunsom (University of Oxford) Alexander Koller (University of Potsdam)

Local organizing team: Paolo Rosso (Chair), PRHLT, Universitat Politècnica de València Francisco Casacuberta (Co-chair), PRHLT, Universitat Politècnica de València Jon Ander Gómez (Co-chair), PRHLT, Universitat Politècnica de València

Software Demo Chairs: Anselmo Peñas (UNED, Spain), André Martins (Unbabel and Instituto de Telecomunicações, Portugal)

Publication Chairs: Maria Liakata Chris Biemann

Workshop Chairs: Laura Rimell Richard Johansson

Student research workshop faculty advisor: Barbara Plank

Tutorials co-chairs: Alex Klementiev Lucia Specia

Publicity Chair: David Weir

Conference Handbook Chair: Andreas Vlachos


Table of Contents

COVER: Covering the Semantically Tractable Questions Michael Minock ...... 1

Common Round: Application of Language Technologies to Large-Scale Web Debates Hans Uszkoreit, Aleksandra Gabryszak, Leonhard Hennig, Jörg Steffen, Renlong Ai, Stephan Busemann, Jon Dehdari, Josef van Genabith, Georg Heigold, Nils Rethmeier, Raphael Rubino, Sven Schmeier, Philippe Thomas, He Wang and Feiyu Xu ...... 5

A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets Johann-Mattis List ...... 9

WAT-SL: A Customizable Web Annotation Tool for Segment Labeling Johannes Kiesel, Henning Wachsmuth, Khalid Al Khatib and Benno Stein...... 13

TextImager as a Generic Interface to R Tolga Uslu, Wahed Hemati, Alexander Mehler and Daniel Baumartz ...... 17

GraWiTas: a Grammar-based Wikipedia Talk Page Parser Benjamin Cabrera, Laura Steinert and Björn Ross ...... 21

TWINE: A real-time system for TWeet analysis via INformation Extraction Debora Nozza, Fausto Ristagno, Matteo Palmonari, Elisabetta Fersini, Pikakshi Manchanda and Enza Messina ...... 25

Alto: Rapid Prototyping for Parsing and Translation Johannes Gontrum, Jonas Groschwitz, Alexander Koller and Christoph Teichmann ...... 29

CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface Tiberiu Boroș, Ștefan Daniel Dumitrescu and Sonia Pipa ...... 33

An Extensible Framework for Verification of Numerical Claims James Thorne and Andreas Vlachos ...... 37

ADoCS: Automatic Designer of Conference Schedules Diego Fernando Vallejo Huanga, Paulina Adriana Morillo Alcívar and Cesar Ferri Ramírez . . . . 41

A Web Interface for Diachronic Semantic Search in Spanish Pablo Gamallo, Iván Rodríguez-Torres and Marcos Garcia ...... 45

Multilingual CALL Framework for Automatic Language Exercise Generation from Free Text Naiara Perez and Montse Cuadros ...... 49

Audience Segmentation in Social Media Verena Henrich and Alexander Lang ...... 53

The arText prototype: An automatic system for writing specialized texts Iria da Cunha, M. Amor Montané and Luis Hysa ...... 57

QCRI Live Speech Translation System Fahim Dalvi, Yifan Zhang, Sameer Khurana, Nadir Durrani, Hassan Sajjad, Ahmed Abdelali, Hamdy Mubarak, Ahmed Ali and Stephan Vogel ...... 61

Nematus: a Toolkit for Neural Machine Translation Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria Nadejde ...... 65

A tool for extracting sense-disambiguated example sentences through user feedback Beto Boullosa, Richard Eckart de Castilho, Alexander Geyken, Lothar Lemnitzer and Iryna Gurevych ...... 69

Lingmotif: Sentiment Analysis for the Digital Humanities Antonio Moreno-Ortiz ...... 73

RAMBLE ON: Tracing Movements of Popular Historical Figures Stefano Menini, Rachele Sprugnoli, Giovanni Moretti, Enrico Bignotti, Sara Tonelli and Bruno Lepri...... 77

Autobank: a semi-automatic annotation tool for developing deep Minimalist Grammar treebanks John Torr ...... 81

Chatbot with a Discourse Structure-Driven Dialogue Management Boris Galitsky and Dmitry Ilvovsky ...... 87

Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature Erwin Marsi, Pinar Øzturk and Murat V. Ardelan ...... 91

Neoveille, a Web Platform for Neologism Tracking Emmanuel Cartier ...... 95

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit Andrey Kutuzov and Elizaveta Kuzmenko...... 99

InToEventS: An Interactive Toolkit for Discovering and Building Event Schemas Germán Ferrero, Audi Primadhanty and Ariadna Quattoni ...... 104

ICE: Idiom and Collocation Extractor for Research and Education vasanthi vuppuluri, Shahryar Baki, An Nguyen and Rakesh Verma ...... 108

Bib2vec: Embedding-based Search System for Bibliographic Information Takuma Yoneda, Koki Mori, Makoto Miwa and Yutaka Sasaki ...... 112

The SUMMA Platform Prototype Renars Liepins, Ulrich Germann, Guntis Barzdins, Alexandra Birch, Steve Renals, Susanne Weber, Peggy van der Kreeft, Herve Bourlard, João Prieto, Ondrej Klejch, Peter Bell, Alexandros Lazaridis, Alfonso Mendes, Sebastian Riedel, Mariana S. C. Almeida, Pedro Balage, Shay B. Cohen, Tomasz Dwojak, Philip N. Garner, Andreas Giefer, Marcin Junczys-Dowmunt, Hina Imran, David Nogueira, Ahmed Ali, Sebastião Miranda, Andrei Popescu-Belis, Lesly Miculicich Werlen, Nikos Papasarantopoulos, Abiola Obamuyide, Clive Jones, Fahim Dalvi, Andreas Vlachos, Yang Wang, Sibo Tong, Rico Sennrich, Nikolaos Pappas, Shashi Narayan, Marco Damonte, Nadir Durrani, Sameer Khurana, Ahmed Abdelali, Hassan Sajjad, Stephan Vogel, David Sheppey, Chris Hernon and Jeff Mitchell ...... 116

EACL 2017 Software Demonstrations Program

Wednesday 5th April 2017

17.30–19.30 Session 1

COVER: Covering the Semantically Tractable Questions Michael Minock

Common Round: Application of Language Technologies to Large-Scale Web Debates Hans Uszkoreit, Aleksandra Gabryszak, Leonhard Hennig, Jörg Steffen, Renlong Ai, Stephan Busemann, Jon Dehdari, Josef van Genabith, Georg Heigold, Nils Rethmeier, Raphael Rubino, Sven Schmeier, Philippe Thomas, He Wang and Feiyu Xu

A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets Johann-Mattis List

WAT-SL: A Customizable Web Annotation Tool for Segment Labeling Johannes Kiesel, Henning Wachsmuth, Khalid Al Khatib and Benno Stein

TextImager as a Generic Interface to R Tolga Uslu, Wahed Hemati, Alexander Mehler and Daniel Baumartz

GraWiTas: a Grammar-based Wikipedia Talk Page Parser Benjamin Cabrera, Laura Steinert and Björn Ross

TWINE: A real-time system for TWeet analysis via INformation Extraction Debora Nozza, Fausto Ristagno, Matteo Palmonari, Elisabetta Fersini, Pikakshi Manchanda and Enza Messina

Alto: Rapid Prototyping for Parsing and Translation Johannes Gontrum, Jonas Groschwitz, Alexander Koller and Christoph Teichmann

CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface Tiberiu Boroș, Ștefan Daniel Dumitrescu and Sonia Pipa

An Extensible Framework for Verification of Numerical Claims James Thorne and Andreas Vlachos

ADoCS: Automatic Designer of Conference Schedules Diego Fernando Vallejo Huanga, Paulina Adriana Morillo Alcívar and Cesar Ferri Ramírez

Wednesday 5th April 2017 (continued)

A Web Interface for Diachronic Semantic Search in Spanish Pablo Gamallo, Iván Rodríguez-Torres and Marcos Garcia

Multilingual CALL Framework for Automatic Language Exercise Generation from Free Text Naiara Perez and Montse Cuadros

Audience Segmentation in Social Media Verena Henrich and Alexander Lang

Thursday 6th April 2017

17.30–19.30 Session 2

The arText prototype: An automatic system for writing specialized texts Iria da Cunha, M. Amor Montané and Luis Hysa

QCRI Live Speech Translation System Fahim Dalvi, Yifan Zhang, Sameer Khurana, Nadir Durrani, Hassan Sajjad, Ahmed Abdelali, Hamdy Mubarak, Ahmed Ali and Stephan Vogel

Nematus: a Toolkit for Neural Machine Translation Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria Nadejde

A tool for extracting sense-disambiguated example sentences through user feedback Beto Boullosa, Richard Eckart de Castilho, Alexander Geyken, Lothar Lemnitzer and Iryna Gurevych

Lingmotif: Sentiment Analysis for the Digital Humanities Antonio Moreno-Ortiz

RAMBLE ON: Tracing Movements of Popular Historical Figures Stefano Menini, Rachele Sprugnoli, Giovanni Moretti, Enrico Bignotti, Sara Tonelli and Bruno Lepri

Autobank: a semi-automatic annotation tool for developing deep Minimalist Grammar treebanks John Torr

Chatbot with a Discourse Structure-Driven Dialogue Management Boris Galitsky and Dmitry Ilvovsky

Thursday 6th April 2017 (continued)

Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature Erwin Marsi, Pinar Øzturk and Murat V. Ardelan

Neoveille, a Web Platform for Neologism Tracking Emmanuel Cartier

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit Andrey Kutuzov and Elizaveta Kuzmenko

InToEventS: An Interactive Toolkit for Discovering and Building Event Schemas Germán Ferrero, Audi Primadhanty and Ariadna Quattoni

ICE: Idiom and Collocation Extractor for Research and Education vasanthi vuppuluri, Shahryar Baki, An Nguyen and Rakesh Verma

Bib2vec: Embedding-based Search System for Bibliographic Information Takuma Yoneda, Koki Mori, Makoto Miwa and Yutaka Sasaki

The SUMMA Platform Prototype Renars Liepins, Ulrich Germann, Guntis Barzdins, Alexandra Birch, Steve Renals, Susanne Weber, Peggy van der Kreeft, Herve Bourlard, João Prieto, Ondrej Klejch, Peter Bell, Alexandros Lazaridis, Alfonso Mendes, Sebastian Riedel, Mariana S. C. Almeida, Pedro Balage, Shay B. Cohen, Tomasz Dwojak, Philip N. Garner, Andreas Giefer, Marcin Junczys-Dowmunt, Hina Imran, David Nogueira, Ahmed Ali, Sebastião Miranda, Andrei Popescu-Belis, Lesly Miculicich Werlen, Nikos Papasarantopoulos, Abiola Obamuyide, Clive Jones, Fahim Dalvi, Andreas Vlachos, Yang Wang, Sibo Tong, Rico Sennrich, Nikolaos Pappas, Shashi Narayan, Marco Damonte, Nadir Durrani, Sameer Khurana, Ahmed Abdelali, Hassan Sajjad, Stephan Vogel, David Sheppey, Chris Hernon and Jeff Mitchell


COVER: Covering the Semantically Tractable Questions

Michael Minock, KTH Royal Institute of Technology, Stockholm, Sweden; Umeå University, Umeå, Sweden. [email protected], [email protected]

Abstract

In semantic parsing, natural language questions map to meaning representation language (MRL) expressions over some fixed vocabulary of predicates. To do this reliably, one must guarantee that for a wide class of natural language questions (the so-called semantically tractable questions), correct interpretations are always in the mapped set of possibilities. Here we demonstrate the system COVER, which significantly clarifies, revises and extends the notion of semantic tractability. COVER is written in Python and uses NLTK.

1 Introduction

The twentieth century attempts to build and evaluate natural language interfaces to databases are covered, more or less, in (Androutsopoulos et al., 1995; Copestake and Jones, 1990). While recent work has focused on learning approaches, there are less costly alternatives based on only lightly naming database elements (e.g. relations, attributes, values) and reducing question interpretation to graph match (Zhang et al., 1999; Popescu et al., 2003). In (Popescu et al., 2003), the notion of semantically tractable questions was introduced, and it was further developed in (Popescu et al., 2004). The semantically tractable questions are those for which, under strong assumptions, one can guarantee generating a correct interpretation (among others). A focus of the PRECISE work was reducing the problem of mapping tokenized user questions to database elements to a max-flow problem. These ideas were implemented and evaluated in the PRECISE system, which scored 77% coverage over the full GEOQUERY corpus.

Here we demonstrate the system COVER, which extends and refines the basic model as follows: a) we explicitly describe the handling of self-joining queries, aggregate operators, sub-queries, etc.; b) we employ theorem proving for consolidating equivalent queries and including only queries that are information bearing and non-redundant; c) we more cleanly factor the model to better isolate the role of off-the-shelf syntactic parsers. In practice, our extended model leads to improved performance (12% absolute coverage improvement) on the full GEOQUERY corpus.

2 COVER

Figure 1 shows the components of COVER. The only configuration is a simple domain specific lexicon, which itself may be automatically derived from the instance and lexical resources. There are three core processing phases, which generate a set of MRL expressions from the user's question, and two auxiliary processes, which use off-the-shelf syntactic analysis and theorem proving to prune MRL expressions from the constructed set. We cover the three core phases here and the two auxiliary processes in section 3. For more detail see (Minock, 2017).

2.1 Phase 1: Generating All Mappings

A mapping is defined as an assignment of all word positions in a user's question to database elements (relations, attributes, values). Thus a mapping for the six-word question "what are the cities in Ohio?" would consist of six assignments. Since eventually mappings might generate MRL expressions, each assignment is marked by a type to indicate its use as either a focus, a condition or a stop assignment (to be ignored).

In COVER's first phase, candidate mappings for a question are generated. For example in the question "what are the cities in Ohio?", under the correct mapping: 'what' is assigned to City and is marked as a focus type; 'are' and 'the' are marked

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 1–4. © 2017 Association for Computational Linguistics

Figure 1: Overall processing paths of COVER.

as stop assignments and are assigned to the empty element; 'cities' is assigned to the relation City and is marked as condition type; 'in' is assigned the foreign key attribute state from the relation City to the relation State and is marked as a condition type; 'Ohio' is assigned to the value 'Ohio' of attribute name of the relation State and is marked as a condition type. The knowledge to determine such mappings comes from a very simple domain lexicon which assigns phrases to database elements. (For example 'Ohio' in the lexicon is assigned to the database element representing State.name='Ohio'.) Note matching is based on word stems. Thus, for example, 'cities' matches 'city'.

2.2 Phase 2: Determining Valid Mappings

Valid mappings must observe certain constraints. For example the question "what is the capital of New York?" is valid when 'New York' maps to the state, but is not valid when 'New York' maps to the city. "What is the population of the Ohio River?" has no valid mappings because under the given database there is no way to fit the population attribute with the River table. Specifically the following are six properties that valid mappings must observe.

1. There exists a unique focus element. There always exists a unique focus element that is the attribute or relation upon which the question is seeking information. For efficiency this is actually enforced in phase 1.

2. All relation focus markers are on mentioned relations. If a relation is a focus, then it must be mentioned. A relation is mentioned if either it is explicitly named in the question (e.g. 'cities') or if a primary value [1] of the relation (e.g. 'New York' for the relation City) is within the question.

3. All attribute focus markers are on mentioned, rooted attributes. An attribute focus marker (e.g. 'what') must not only explicitly match an attribute (e.g. 'population'), but such an attribute must also be rooted. An attribute is rooted if the relation of the attribute is mentioned.

4. Non-focus attributes satisfy correspondences. Unless they are the focus, attributes must pair with a value (e.g. in "cities with name Springfield"), or, in the case that the attribute is a foreign key (e.g. "cities in the state..."), the attribute must pair with the relation or primary value of the foreign key (e.g. "cities in Ohio").

5. Value elements satisfy correspondences. Values are either primary (e.g. "New York") or must be paired with either an attribute (e.g. "... city with the name of New York ..."), or via ellipsis paired with a relation (e.g. "... the city New York").

6. All mentioned elements are connected. The elements assigned by the mapping must form a connected graph over the underlying database schema.

This leads to the core definition of this work:

Definition 1 (Semantically Tractable Question) For a given question q and lexicon L, q is semantically tractable if there exists at least one valid mapping over q.

2.3 Phase 3: Generating Logical Formulas

Given a set of valid mappings, COVER's third phase is to generate one or more MRL expressions for each valid mapping. To achieve this we unfold the connected graph of valid mappings (see property 6 in section 2.2) into meaningful full

[1] Primary values (Li and Jagadish, 2014) stand, or for the most part stand, for a tuple of a relation. For example 'Springfield' is a primary value of City even though it is not a key value.
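As a rough illustration of the notions above (a hypothetical sketch with invented names, not COVER's actual code), a mapping can be held as typed word-position assignments, and property 6 can be checked as connectivity over a toy schema graph:

```python
from collections import defaultdict, deque

# Toy schema graph over GEOQUERY-like relations; edges follow foreign keys.
SCHEMA_EDGES = [("City", "State"), ("State", "Borders")]

# A mapping assigns each word position to a database element and a type:
# 'focus', 'condition' or 'stop' (ignored). For simplicity each element
# here is just the relation(s) that the assignment touches.
mapping = [
    ("what", "City", "focus"),
    ("are", None, "stop"),
    ("the", None, "stop"),
    ("cities", "City", "condition"),
    ("in", "City->State", "condition"),   # foreign-key attribute
    ("Ohio", "State", "condition"),
]

def touched_relations(mapping):
    """Collect the relations mentioned by non-stop assignments."""
    rels = set()
    for _, elem, typ in mapping:
        if typ == "stop" or elem is None:
            continue
        rels.update(elem.split("->"))
    return rels

def is_connected(mapping):
    """Property 6: the mentioned elements must form a connected graph
    over the underlying database schema."""
    rels = touched_relations(mapping)
    if not rels:
        return False
    adj = defaultdict(set)
    for a, b in SCHEMA_EDGES:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(rels))
    seen, frontier = {start}, deque([start])
    while frontier:  # BFS restricted to the mentioned relations
        node = frontier.popleft()
        for nb in adj[node] & rels:
            if nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return seen == rels

print(is_connected(mapping))  # True: City and State are linked in the schema
```

A mapping touching only City and Borders, by contrast, would fail the check, since those relations are linked only through the unmentioned State.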

graphs. This is complicated in the self-joining case where graphs will have multiple vertices for a given relation. For example the valid mapping for "what states border states that border New York?" maps to two database relations: State and Borders. But the corresponding unfolded graph will be over three instantiations of State and two of Borders. Our algorithm gives a systematic and exhaustive process to compute all reasonable unfolded graphs. And then for each unfolded graph we generate the set of possible attachments of conditions and selections. In the end this gives a set of query objects which may be directly expressed in SQL for application over the database, or as first order logic expressions for satisfiability testing via an automatic theorem prover.

Case                 Coverage        Avg/Med # interpretations
full                 780/880 (89%)   7.59/2
no-equiv reduction   780/880 (89%)   19.65/2

Table 1: Evaluation over GEOQUERY

3 Auxiliary Processes

Figure 1 shows two auxiliary processes that further constrain what are valid mappings as well as which MRL expressions are returned.

Theorem proving is also used to enforce pragmatic constraints. For example we remove queries that do not bear information, or that have redundancies within them that violate Gricean principles. This is principally achieved by determining whether a set-fetching query necessarily generates a single or no tuple where the answer is already in the question. For example a query retrieving "the names of states where the name of the state is New York and the state borders a state" does not bear information. Finally it should be noted that one can add arbitrary domain rules (e.g. states have one and only one capital) to constrain deduction. This would allow more query equivalencies and cases of pragmatics violations to be recognized.
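The query objects of Phase 3 can, as noted, be expressed directly in SQL. A minimal, hypothetical sketch for the running example (the helper name is invented; COVER's real generator also handles unfolded graphs, self-joins and aggregates):

```python
# A toy query object for "what are the cities in Ohio?": a focus to
# select, the relations to draw from, and the attached conditions.

def query_object_to_sql(focus, relations, conditions):
    """focus: 'Relation.attribute' to select; relations: FROM list;
    conditions: WHERE predicates joined by AND."""
    return (f"SELECT {focus} FROM {', '.join(relations)} "
            f"WHERE {' AND '.join(conditions)}")

sql = query_object_to_sql(
    focus="City.name",
    relations=["City", "State"],
    conditions=["City.state = State.name", "State.name = 'Ohio'"],
)
print(sql)
# SELECT City.name FROM City, State WHERE City.state = State.name AND State.name = 'Ohio'
```

The same query object could alternatively be rendered as a first order logic expression for satisfiability testing, as described above.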

3.1 Integrating Syntax

COVER uses an off-the-shelf parser to generate a syntactic attachment relation between word positions. This attachment relation is then used to sharpen the conditions in properties 2, 3, 4 and 5 of valid mappings. In short, correspondences and focus-value matches must be over word positions that are attached in the underlying syntactic parse. This has the effect of reducing the number of valid mappings. For example let us consider "what is the largest city in the smallest state?". If our syntactic analysis component correctly determines that 'largest' does not attach to 'state' and 'smallest' does not attach to 'city', then an erroneous second valid mapping will be excluded. Attachment information is also used to constrain which MRL expressions can be generated from valid mappings.

3.2 Integrating Semantics and Pragmatics

Many of the MRL expressions generated in phase three are semantically equivalent, but syntactically distinct. The second auxiliary process uses a theorem prover to reduce the number of MRL expressions (to the shortest one from each equivalence class) that need to be paraphrased for interactive disambiguation. This is achieved by testing pairwise query containment over all the MRL expressions in the MRL set produced in the third phase.

4 Demonstration

Our demonstration, like PRECISE, is over GEOQUERY, a database on US geography along with 880 natural language questions paired with corresponding logical formulas. The evaluation method is exactly as in PRECISE (in conversation with Ana-Maria Popescu). First we prepare a lexicon over the GEOQUERY domain; then, given the set of 880 natural language/meaning representation pairs, the queries are run through the system, and if the question is semantically tractable and generates one or more formal query expressions, then an expression equivalent to the target MRL expression must be within the generated set. Our experiment shows this to be the case.

Table 1 presents some results. By 'coverage' we mean, like in the PRECISE evaluation, that the answer is in the computed result set. Table 1 shows results for two cases: full has all features turned on; no-equiv reduction shows results when the auxiliary process described in section 3.2 is disengaged. Clearly we are benefiting from the use of a theorem prover, which is used to reduce the size of returned query sets.

At EACL we will run our evaluation on a laptop, showing the complete configuration over GEOQUERY and welcoming audience members to pose interactive questions.
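The equivalence-class reduction of section 3.2 can be sketched as follows; the `equivalent` predicate here is a stand-in (COVER realizes it via pairwise query-containment tests with an automatic theorem prover):

```python
def reduce_equivalents(queries, equivalent):
    """Keep the shortest member of each semantic equivalence class.
    `equivalent` is a pairwise test between two query expressions."""
    representatives = []
    for q in sorted(queries, key=len):  # shortest first
        if not any(equivalent(q, r) for r in representatives):
            representatives.append(q)
    return representatives

# Stand-in equivalence: ignore whitespace and case (illustration only;
# real semantic equivalence requires theorem proving).
norm = lambda q: " ".join(q.lower().split())
queries = [
    "SELECT name FROM state",
    "select   name from STATE",
    "SELECT capital FROM state",
]
print(reduce_equivalents(queries, lambda a, b: norm(a) == norm(b)))
# ['SELECT name FROM state', 'SELECT capital FROM state']
```

Only one representative per class then needs to be paraphrased for interactive disambiguation, which is exactly the effect visible in the Avg/Med interpretation counts of Table 1.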

5 Discussion

To apply COVER, consider our strong, though achievable, assumptions: a) the user's mental model matches the database; b) no word or phrase of a question maps only to incorrect element, type pairs; c) all words in the question are syntactically attached. Under such assumptions, every question that we answer is semantically tractable, and thus a correct interpretation is always included within any non-empty answer. Let us discuss the feasibility of these assumptions.

With respect to assumption a, it is difficult to determine if a user's mental model matches a database, but, in general, natural language interfaces do better when the schema is based on a conceptual (e.g. Entity-Relationship) or domain model, as is the case for GEOQUERY. Still COVER is not yet fully generalized for EER-based databases (e.g. multi-attribute keys, isa hierarchies, union types, etc.). We shall study more corpora (e.g. QALD (Walter et al., 2012)) to see what types of conceptual models are required.

With respect to assumption b, an over-expanded lexicon can lead to spurious interpretations, but that is not so serious as long as the right interpretation still gets generated. Any number of additional noisy entries could be added and our strong assumption b) would remain true. While we are investigating how to automatically generate 'adequate' lexicons (e.g. by adapting domain-independent ontologies, expanding domain-specific lists, or using techniques from automatic paraphrase generation), the question of how the lexicon is acquired and accessed (e.g. querying over an API would require probing for named entities rather than simple hash look-up, etc.) is orthogonal to the contribution of this work.

We make assumption c because we want to guarantee that COVER lives up to the promise of always having within its non-empty result sets the correct interpretation. Still the control points for syntactic analysis are very clearly laid out in COVER. Given that interfaces should be usable by real users, keeping the interpretation set manageable while trying to keep the correct one in the set is useful, especially if a database is particularly ambiguous. Exploring methods to integrate off-the-shelf syntactic parsers to narrow the number of interpretations while not excluding correct interpretations will be future work. Specifically we will evaluate which off-the-shelf parsers and which notions of 'attachment' perform best.

A final question is: is the semantically tractable class fundamental? COVER has generalized the class from the earlier PRECISE work, and nothing seems to block its further extension to questions requiring negation, circular self-joins, etc. Still, we expended considerable effort trying to extend the class to include queries with maximum cardinality conditions (e.g. "What are the states with the most cities?"). This effort foundered on defining a decidable semantics. But we also witnessed many cases where spurious valid mappings were let in while a correct valid mapping was not found ('most' seems to be a particularly sensitive word). Further study is required to determine the natural limit of the semantic tractability question class. How far can we go? Is there a limit? Our intuition says 'yes'. A related question is: is there a hierarchy of semantically tractable classes where the number of interpretations blows up as we extend the number of constructs we are able to handle? Again, our intuition says 'yes'. COVER will be the basis of the future study of these questions.

References

Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases. NLE, 1(1):29–81.

Ann A. Copestake and Karen Sparck Jones. 1990. Natural language interfaces to databases. Knowledge Eng. Review, 5(4):225–249.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. PVLDB, 8(1):73–84.

Michael Minock. 2017. Technical description of COVER 1.0. CoRR.

Ana-Maria Popescu, Oren Etzioni, and Henry A. Kautz. 2003. Towards a theory of natural language interfaces to databases. In IUI-03, January 12-15, 2003, Miami, FL, USA, pages 149–157.

Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern natural language interfaces to databases. In COLING, pages 141–147.

Sebastian Walter, Christina Unger, Philipp Cimiano, and Daniel Bär. 2012. Evaluation of a layered approach to question answering over linked data. In ISWC, Boston, MA, USA, pages 362–374.

Guogen Zhang, Wesley W. Chu, Frank Meng, and Gladys Kong. 1999. Query formulation from high-level concepts for relational databases. In UIDIS, pages 64–75.

Common Round: Application of Language Technologies to Large-Scale Web Debates

Hans Uszkoreit, Aleksandra Gabryszak, Leonhard Hennig, Jörg Steffen, Renlong Ai, Stephan Busemann, Jon Dehdari, Josef van Genabith, Georg Heigold, Nils Rethmeier, Raphael Rubino, Sven Schmeier, Philippe Thomas, He Wang, Feiyu Xu. Deutsches Forschungszentrum für Künstliche Intelligenz, Germany. [email protected]

Abstract

Web debates play an important role in enabling broad participation of constituencies in social, political and economic decision-taking. However, it is challenging to organize, structure, and navigate a vast number of diverse argumentations and comments collected from many participants over a long time period. In this paper we demonstrate Common Round, a next generation platform for large-scale web debates, which provides functions for eliciting the semantic content and structures from the contributions of participants. In particular, Common Round applies language technologies for the extraction of semantic essence from textual input and aggregation of the formulated opinions and arguments. The platform also provides cross-lingual access to debates using machine translation.

1 Introduction

Debating is a very useful approach to individual and collective decision making; it helps the formation of ideas and policies in democracies. Web-based debates allow for reaching a wider audience, therefore bringing more arguments and diverse perspectives on a debate topic compared to face-to-face discussions. The asynchronous mode of web debates also enables participants to explore debate content more thoroughly (Salter et al., 2016). However, these advantages often get lost in a large-scale debate since its content can become unmanageable. Moreover, missing background knowledge on debate topics or language barriers for international topics can prohibit users from a proper understanding of debate content.

The majority of the existing web discussion platforms offer the following participation model: users can formulate a discussion topic, share their opinions on that topic and respond to others in the form of unstructured posts. However, they do not offer effective functionalities for 1) easy access to the argumentative structure of debate content, and 2) quick overviews of the various semantic facets, the polarity and the relevance of the arguments. Some platforms [1] allow users to label posts as pro or con arguments, to cite external sources, to assess debate content or to create structured debates across the web, but do not offer any deeper automatic language technology-based analyses. Argumentation mining research, which could help in automatically structuring debates, has only recently been applied to web debate corpora (Boltuzic and Snajder, 2015; Petasis and Karkaletsis, 2016; Egan et al., 2016).

Our goal is to address these issues by developing a debate platform that:

• Supports debate participants in making substantial and clear contributions
• Facilitates an overview of debate contents
• Associates relevant information available on the web with debate topics
• Connects regional discussions to global deliberations
• Supports advanced participation in deliberations, without sacrificing transparency and usability.

In this paper, we present the Common Round platform, which implements various functions towards these goals, with the following contributions: (a) we describe the architecture of the platform (Section 2), including a model for argumentative dialogues (Section 3), and (b) we present our implementation of several language technologies for supporting collective deliberation (Sections 4–6).

[1] Examples: ProCon.org, OpenPetition.de, Arvina and ArguBlogging (Bex et al., 2013)

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3–7 2017, pages 5–8. © 2017 Association for Computational Linguistics

Figure 1: Architecture of Common Round.

2 Common Round platform – Overview

Figure 1 depicts the platform architecture. Common Round provides a top-level dialogue model that supports users to make decisions and enter labeled contributions at appropriate places in the model. This keeps debates organized and transparent, and at the same time allows users to participate in a free and unrestricted discussion. Furthermore, a predefined coarse-grained debate structure facilitates the development of automatic text analysis algorithms for deliberation support, since it enables the system to integrate domain and context knowledge for better interpretation of user intentions. Within our platform, textual debate content is automatically further structured and enhanced employing argument mining, text analytics, retrieval of external material and machine translation. The results of these analyses are fed back to the front-end to give users a good overview of discussion content, and to enable a structured, semantic search for additional evidences and background material. We will exemplify the features of our platform with four selected topics: Should Cannabis be legalized?, Are German cars better than those from Japan?, Should we use wind power?, and How to address online fake news?

Figure 2: Dialogue model of Common Round.

3 Dialogue model and content assessments

The Common Round platform provides a structure that aims to cover the most essential aspects of argumentative dialogues (Figure 2). The top level defines the semantic categories of the debate questions. Given the categories, the questions themselves can be posted as the next level. There are two types of debate questions: (a) yes-no questions, and (b) multi-proposal questions (Table 1). A question reflects the major claim. The subsequent level is reserved for arguments. An argument can either be a pro- or a con-argument. For the yes-no questions, an argument directly refers to a yes answer to the question, while for multi-proposal questions an argument refers to a single proposal.

  question type   example
  Yes-No          Should Cannabis be legalized?
  Multi-proposal  How to address online fake news?
    Proposal 1    Impose sanctions on fake news authors.
    Proposal 2    Impose sanctions on providers.

Table 1: Yes-no and multi-proposal questions.

At the next level, a user can add evidence in favor of or against an argument. Such evidence could be a textual contribution or a link to external sources. Finally, a user can freely answer to either an evidence or a comment, or just refer to a previously posted answer. The length of the contributions is restricted in order to enforce separated, focused posts instead of

lengthy and noisy information.

Besides the dialogue structure, assessing postings is an important aspect in the Common Round forum. The assessment reflects the community view of a main debate claim, an argument or evidence. The degrees of consent a user can express are based on the common 5-level Likert-type scale (Likert, 1932). Based on the ratings, a first-time viewer of a question can get a quick overview of the state of the discussion.

4 Argument Mining

The dialogue model enables users to distinguish between argumentative and non-argumentative parts of a debate. Users can explore the components of argumentations, i.e. major claims, arguments and evidences. The categorization in pro/con arguments and pro/con evidences easily allows finding posts supporting or refuting the major claim. Furthermore, similar arguments are automatically clustered. To aggregate similar arguments we represent posts with Paragraph2Vec (Le and Mikolov, 2014) and then compute clusters from the cosine similarity between post vectors using agglomerative clustering. In order to keep the clustering unsupervised, we automatically discover the number of clusters using the Silhouette Coefficient (Rousseeuw, 1987). To help users identify the most relevant content within a cluster, the arguments are sorted by their assessed validity. Reading only a few arguments from each cluster enables users to gain a quick overview of different argument facets. For example, in a debate on cannabis legalization, arguments from a medical point of view can be grouped separately from arguments that focus on economic reasons. This functionality is an important contribution to structuring large-scale debates.

5 Text Analytics and Association with External Material

The Common Round platform enriches the contents of debates and posts by extracting information about topics, sentiment, entities and relations. Topic detection helps to find semantically related debates. Sentiment analysis allows determining the emotion level in discussions. Additionally, our platform incorporates a web search to enable finding supporting evidence in external knowledge sources. Table 2 shows an example of the text analytics results and the association with external material.

  Sentence:           Cannabis is a good way to reduce pain
  NER and RE:         substance, disorder; relation: MayTreatDisorder
  Sentiment:          positive
  External Material:  www.scientificamerican.com/article/could-medical-cannabis-break-the-painkiller-epidemic

Table 2: Example of the text analytics results.

The information extraction pipeline currently supports English and German texts. We utilize Stanford CoreNLP tools for segmentation, POS-tagging and dependency parsing of English texts. POS-tagging and dependency parsing of German texts are realized with Mate Tools. The topic detection is realized by the supervised document classification module PCL (Schmeier, 2013). PCL is proven to be very robust and accurate, especially for short texts and unbalanced training data. For named entity recognition, we employ two complementary approaches. We apply Stanford NER models to recognize standard entity types, such as persons, organizations, locations and date/time expressions. For non-standard concept types, we use SProUT (Drozdzynski et al., 2004), which implements a regular expression-like rule formalism and gazetteers for detecting domain-specific concepts in text. Relation extraction is performed by matching dependency parse trees of sentences to a set of automatically learned dependency patterns (Krause et al., 2012). Relation patterns are learned from corpora manually annotated with event type, argument types, and roles, or using distant supervision (Mintz et al., 2009). For sentiment analysis, we apply a lexicon-based approach that additionally makes use of syntactic information in order to handle negation.
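The argument clustering described in Section 4 — post vectors, cosine similarity, agglomerative clustering, and a silhouette-based choice of the number of clusters — can be illustrated with a minimal sketch. This is not the platform's (unpublished) implementation: the tiny 2-d vectors below merely stand in for Paragraph2Vec embeddings, and all function names are ours.

```python
import math
from itertools import combinations

def cosine_dist(u, v):
    """Cosine distance between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def agglomerative(points, k):
    """Average-linkage agglomerative clustering down to k clusters;
    returns one cluster label per point."""
    dist = [[cosine_dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average linkage.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: sum(dist[i][j]
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]])
                   / (len(clusters[ab[0]]) * len(clusters[ab[1]])))
        clusters[a] += clusters.pop(b)
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient (Rousseeuw, 1987); singletons score 0."""
    n, total = len(points), 0.0
    for i in range(n):
        mates = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not mates:
            continue  # singleton cluster contributes 0
        a = sum(cosine_dist(points[i], points[j]) for j in mates) / len(mates)
        b = min(sum(cosine_dist(points[i], points[j])
                    for j in range(n) if labels[j] == c) / labels.count(c)
                for c in set(labels) if c != labels[i])
        total += (b - a) / max(a, b)
    return total / n

def cluster_posts(vectors):
    """Pick the number of clusters that maximizes the silhouette."""
    best_k = max(range(2, len(vectors)),
                 key=lambda k: silhouette(vectors, agglomerative(vectors, k)))
    return agglomerative(vectors, best_k)

# Two obvious argument groups; in the platform these would be post embeddings.
labels = cluster_posts([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)])
```

With these four stand-in vectors, the silhouette is maximized at k = 2, grouping the first two and the last two posts together.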
By identifying, for example, instances of the relation MayTreatDisorder in a discussion about the legalization of cannabis, we can aggregate references to the same or similar medical facts introduced by different debate participants.

6 Cross-lingual Access

In order to support cross-lingual access to debate posts, we developed a new character-based neural machine translation (NMT) engine and tuned it on Common Round domains. Translation is available as a real-time service and translation outputs

are marked. Our NMT is based on an encoder–decoder with attention design (Bahdanau et al., 2014; Johnson et al., 2016), using bidirectional LSTM layers for encoding and unidirectional layers for decoding. We employ a character-based approach to better cope with rich morphology and OOVs in Common Round user posts. We train on English-German data from WMT-16² and use transfer learning on 30K crawled noisy in-domain segments for Common Round domain-adaptation. As our NMT system is new, we compare it with state-of-the-art PBMT: while both perform in the top ranks in WMT-16 (Bojar et al., 2016), NMT is better on Common Round data (NMT drops from 29.1 BLEU on WMT-16 data to 22.9 on the car domain and 23.8 on wind power, against 30.0 to 13.6 and 17.7 for PBMT) and adapts better using the supplementary crawled data (25.8 car and 28.5 wind against 14.1 and 19.3).³

7 Conclusion & Future Work

We presented Common Round, a new type of web-based debate platform supporting large-scale decision making. The dialogue model of the platform supports users in making clear and substantial debate contributions by labeling their posts as pro/con arguments, pro/con evidences, comments and answers. The user input is automatically analyzed and enhanced using language technology such as argument mining, text analytics, retrieval of external material and machine translation. The analysis results are fed back to the user interface as an additional information layer. For future work we suggest a more fine-grained automatic text analysis, for example by detecting claims and premises of arguments as well as types and validity of evidences. Moreover, we will conduct a thorough evaluation of the system.

Acknowledgments

This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects ALL SIDES (01IW14002) and BBDC (01IS14013E), by the German Federal Ministry of Economics and Energy (BMWi) through the projects SDW (01MD15010A) and SD4M (01MD15007B), and by the European Union's Horizon 2020 research and innovation programme through the project QT21 (645452).

References

D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

F. Bex, J. Lawrence, M. Snaith, and C. Reed. 2013. Implementing the argument web. Commun. ACM, 56:66–73.

O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri. 2016. Findings of the 2016 Conference on Machine Translation. In Proc. of WMT.

F. Boltuzic and J. Snajder. 2015. Identifying prominent arguments in online debates using semantic textual similarity. In ArgMining@HLT-NAACL.

W. Drozdzynski, H. Krieger, J. Piskorski, U. Schäfer, and F. Xu. 2004. Shallow processing with unification and typed feature structures — foundations and applications. Künstliche Intelligenz, 1:17–23.

C. Egan, A. Siddharthan, and A. Z. Wyner. 2016. Summarising the points made in online political debates. In ArgMining@ACL.

M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. https://arxiv.org/abs/1611.04558.

S. Krause, H. Li, H. Uszkoreit, and F. Xu. 2012. Large-scale learning of relation-extraction rules with distant supervision from the web. In ISWC.

Q. V. Le and T. Mikolov. 2014. Distributed representations of sentences and documents. In ICML.

R. Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140):1–55.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In IJCNLP.

G. Petasis and V. Karkaletsis. 2016. Identifying argument components through TextRank. In ArgMining@ACL.

P. J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20:53–65.

S. Salter, T. Douglas, and D. Kember. 2016. Comparing face-to-face and asynchronous online communication as mechanisms for critical reflective dialogue. JEAR, 0(0):1–16.

S. Schmeier. 2013. Exploratory Search on Mobile Devices. DFKI-LT Dissertation Series, Saarbrücken, Germany.

² http://www.statmt.org/wmt16/translation-task.html
³ All EN→DE; similar for the reverse direction.

A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets

Johann-Mattis List Max Planck Institute for the Science of Human History Kahlaische Straße 10 07745 Jena [email protected]

Abstract

The paper presents the Etymological DICtionary ediTOR (EDICTOR), a free, interactive, web-based tool designed to aid historical linguists in creating, editing, analysing, and publishing etymological datasets. The EDICTOR offers interactive solutions for important tasks in historical linguistics, including facilitated input and segmentation of phonetic transcriptions, quantitative and qualitative analyses of phonetic and morphological data, enhanced interfaces for cognate class assignment and multiple word alignment, and automated evaluation of regular sound correspondences. As a web-based tool written in JavaScript, the EDICTOR can be used in standard web browsers across all major platforms.

1 Introduction

The amount of large digitally available datasets for various language families is constantly increasing. In order to analyse these data, linguists turn more and more to automatic approaches. Phylogenetic methods from biology are now regularly used to create evolutionary trees of language families (Gray and Atkinson, 2003). Methods for the comparison of biological sequences have been adapted and allow to automatically search for cognate words in multilingual word lists (List, 2014) and to automatically align them (List, 2014). Complex workflows are used to search for deep genealogical signals between established language families (Jäger, 2015).

In contrast to the large arsenal of software for automatic analyses, tools helping to manually prepare, edit, and correct lexical datasets in historical linguistics are extremely rare. This is surprising, since automatic approaches still lag behind expert analyses (List et al., 2017). Tools for data preparation and evaluation would allow experts to directly interact with computational approaches by manually checking and correcting their automatically produced results. Furthermore, since the majority of phylogenetic approaches makes use of manually submitted expert judgments (Gray and Atkinson, 2003), it seems indispensable to have tools which ease these tasks.

2 The EDICTOR Tool

The Etymological DICtionary ediTOR (EDICTOR) is a free, interactive, web-based tool that was specifically designed to serve as an interface between quantitative and qualitative tasks in historical linguistics. Inspired by powerful features of STARLING (Starostin, 2000) and RefLex (Segerer and Flavier, 2015), expanded by innovative new features, and based on a very simple data model that allows for a direct integration with quantitative software packages like LingPy (List and Forkel, 2016), the EDICTOR is a lightweight but powerful toolkit for computer-assisted applications in historical linguistics.

2.1 File Formats and Data Structure

The EDICTOR was designed as a lightweight file-based tool that takes a text file as input, allowing to modify and save it. The input format is a plain tab-separated value (TSV) file, with a header indicating the value of the columns. This format is essentially identical with the format used in LingPy. Although the EDICTOR accepts all regular TSV files as input, its primary target are multi-lingual word lists, that is, datasets in which a given number of concepts has been translated into a certain range of target languages.
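A word list in this TSV format can be read with a few lines of standard Python. The sketch below is illustrative only (the EDICTOR itself is written in JavaScript); the column names follow Figure 2, and the rows are made-up examples:

```python
import csv
import io

# Illustrative multi-lingual word list in the EDICTOR/LingPy TSV format.
TSV = (
    "ID\tDOCULECT\tCONCEPT\tTRANSCRIPTION\n"
    "1\tGerman\tWoldemort\tvaldəmar\n"
    "2\tEnglish\tWoldemort\twɔldəmɔrt\n"
    "10\tGerman\tHarry\tharalt\n"
    "11\tEnglish\tHarry\thæri\n"
)

def read_wordlist(handle):
    """Read a header-first TSV word list into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_wordlist(io.StringIO(TSV))

# Group the transcriptions of each concept across doculects.
by_concept = {}
for row in rows:
    by_concept.setdefault(row["CONCEPT"], []).append(row["TRANSCRIPTION"])
```

Because the header names the columns, any regular TSV file parses the same way, which is what makes the format easy to exchange with quantitative toolkits.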


[Figure 1 groups the EDICTOR panels — Word List, Phonology, Morphology, Partial Cognates, Cognate Sets, Alignments, and Correspondences — into data editing (transcription, phonetic and morphological segmentation, cognate assignment, phonetic alignment) and data analysis (frequency and structural analysis), connected through panel interactions (activation, editing, filtering).]

Figure 1: Basic panel structure of the EDICTOR.

  ID  DOCULECT  CONCEPT    TRANSCRIPTION  ...
  1   German    Woldemort  valdəmar       ...
  2   English   Woldemort  wɔldəmɔrt      ...
  3   Chinese   Woldemort  fu⁵¹ti⁵¹mɔ³⁵   ...
  4   Russian   Woldemort  vladimir       ...
  ...
  10  German    Harry      haralt         ...
  11  English   Harry      hæri           ...
  12  Russian   Harry      gali           ...
  ...

Figure 2: Basic file format in the EDICTOR.

2.2 User Interface

The EDICTOR is divided into different panels which allow to edit or analyse the data in different ways. The core module is the Word List panel, which displays the data in its original form and can be edited and analysed as one knows it from spreadsheet applications. For more complex tasks of data editing and analysis, such as cognate assignment or phonological analysis, additional panels are provided. Specific modes of interaction between the different panels allow for a flexible interaction between different tasks. Using drag-and-drop, users can arrange the panels individually or hide them completely. Figure 1 illustrates how the major panels of the EDICTOR interact with each other.

2.3 Technical Aspects

The EDICTOR application is written in plain JavaScript and was tested in Google Chrome, Firefox, and Safari across different operating systems (Windows, MacOS, ...). For the purpose of offline usage, users can download the source code. For direct online usage, the tool can be accessed via its project website.

3 Data Editing in the EDICTOR

3.1 Editing Word List Data

Editing data in the Word List panel of the EDICTOR is straightforward: values are inserted in textfields which appear when clicking on a given field or when browsing the data using the arrow keys of the keyboard. Additional keyboard shortcuts allow for quick browsing. For specific data types, automatic operations are available which facilitate the input or test what the user inserts. Transcription, for example, supports SAMPA input. The segmentation of phonetic entries into meaningful sound units is also carried out automatically. Sound segments are highlighted with specific background colors based on their underlying sound class, and sounds which are not recognized as valid IPA symbols are highlighted in warning colors (see the illustration in Figure 3). The users can decide themselves in which fields they wish to receive automatic support, and even Chinese input using an automatic Pīnyīn converter is provided.

[Figure 3 illustrates SAMPA-to-IPA conversion and segmentation (wOld@mO:Rt → wɔldəmɔːʁt → w ɔ l d ə m ɔː ʁ t) and the highlighting of unrecognized phonetic symbols.]
Figure 3: Editing word lists in the EDICTOR.
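The paper does not spell out how the automatic segmentation of phonetic entries works internally; as a rough sketch of the idea, a segmenter only needs to attach modifier symbols, such as the length mark, to the preceding base sound. The modifier set below is an illustrative selection, not the EDICTOR's actual inventory:

```python
# Symbols that attach to the preceding base sound (illustrative selection).
MODIFIERS = {"ː", "ʰ", "ʲ", "ʷ"}

def segment(transcription):
    """Split an IPA string into sound segments, keeping modifiers
    (e.g. the length mark) with the preceding base symbol."""
    segments = []
    for char in transcription:
        if segments and char in MODIFIERS:
            segments[-1] += char
        else:
            segments.append(char)
    return segments

segments = segment("wɔldəmɔːʁt")  # the example from Figure 3
```

For the Figure 3 example this yields the nine segments w ɔ l d ə m ɔː ʁ t, with the long vowel kept as one unit.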

3.2 Cognate Assessment

Defining which words in multilingual word lists are cognate is still a notoriously difficult task for machines (List, 2014). Given that the majority of datasets are based on manually edited cognate judgments, it is important to have tools which facilitate this task while at the same time controlling for typical errors. The EDICTOR offers two ways to edit cognate information, the first assuming complete cognacy of the words in their entirety, and the second allowing to assign only specific parts of words to the same cognate set. In order to carry out partial cognate assignment, the data needs to be morphologically segmented in a first stage, for example with help of the Morphology panel of the EDICTOR (see Section 4.2). For both tasks, simple and intuitive interfaces are offered which allow to browse through the data and to assign words to the same cognate set.

[Figure 4 shows an alignment grid for German, English, Russian, and Chinese; columns with unalignable material are marked IGNORE.]
Figure 4: Aligning words in the EDICTOR.

3.3 Phonetic Alignment

Since historical-comparative linguistics is essentially based on sequence comparison (List, 2014), alignment analyses, in which words are arranged in a matrix in such a way that corresponding sounds are placed in the same column, are underlying all cognate sets. Unfortunately they are rarely made explicit in classical etymological dictionaries. In order to increase explicitness, the EDICTOR offers an alignment panel. The alignment panel is essentially realized as a pop-up window showing the sounds of all sequences which belong to the same cognate set. Users can edit the alignments by moving sound segments with the mouse. Columns of the alignment which contain unalignable parts (like suffixes or prefixes) can be explicitly marked as such. In addition to manual alignments, the EDICTOR offers a simple alignment algorithm which can be used to pre-analyse the alignments. Figure 4 shows an example for the alignment of four fictive cognates in the EDICTOR.

4 Data Analysis in the EDICTOR

4.1 Analysing Phonetic Data

Errors are inevitable in large datasets, and this holds also and especially for phonetic transcriptions. Many errors, however, can be easily spotted by applying simple sanity checks to the data. A straightforward way to check the consistency of the phonetic transcriptions in a given dataset is provided in the Phonology panel of the EDICTOR. Here all sound segments which occur in the segmented transcriptions of one language are counted and automatically compared with an internal set of IPA segments. Counting the frequency of segments is very helpful to spot simple typing errors, since segments which occur only one time in the whole data are very likely to be errors. The internal segment inventory adds a structural perspective: if segments are found in the internal inventory, additional phonetic information (manner, place, etc.) is shown; if segments are missing, this is highlighted. The results can be viewed in tabular form and in form of a classical IPA chart.

4.2 Analysing Morphological Data

The majority of words in all languages consist of more than one morpheme. If historically related words differ regarding their morpheme structure, this poses great problems for automatic approaches to sequence comparison, since the algorithms usually compare words in their entirety. German Großvater ‘grandfather’, for example, is composed of two different morphemes, groß ‘large’ and Vater ‘father’. In order to analyse multi-morphemic words historically, it is important to carry out a morphological annotation analysis. In order to ease this task, the Morphology panel of the EDICTOR offers a variety of straightforward operations by which morpheme structure can be annotated and analysed at the same time. The core idea behind all operations is a search for similar words or morphemes in the same language. These colexifications are then listed and displayed in form of a bipartite word family network in which words are linked to morphemes, as illustrated in Figure 5. The morphology analysis in the EDICTOR is no miracle cure for morpheme detection, and morpheme boundaries still need to be annotated by the user. However, the dynamically produced word family networks as well as the explicit listing of words sharing the same subsequence of sounds greatly facilitate this task.
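The sanity check behind the Phonology panel (Section 4.1) boils down to counting segment frequencies and comparing them with a known inventory. A minimal sketch, with an illustrative inventory standing in for the EDICTOR's internal IPA segment set:

```python
from collections import Counter

# Stand-in for an internal IPA segment inventory (illustrative only).
KNOWN_SEGMENTS = {"t", "o", "x", "ə", "r", "d", "a"}

def check_segments(segmented_words):
    """Count all segments of one language and flag likely errors:
    hapax segments and segments missing from the inventory."""
    counts = Counter(seg for word in segmented_words for seg in word)
    hapaxes = {seg for seg, n in counts.items() if n == 1}
    unknown = set(counts) - KNOWN_SEGMENTS
    return counts, hapaxes, unknown

counts, hapaxes, unknown = check_segments([
    ["t", "o", "x", "t", "ə", "r"],
    ["d", "o", "t", "a"],
    ["d", "o", "q"],  # "q" occurs once and is not in the inventory
])
```

Here "q" is reported both as a hapax and as unknown — exactly the kind of probable typing error the panel highlights.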

[Figure 5 shows the words faːtər «the father», ʃtiːf+faːtər «the stepfather», ɡroːs+faːtər «the grandfather», ɡroːs+mʊtər «the grandmother», and ʃviːɡər+faːtər «the father-in-law» linked to the morpheme concepts FATHER, MOTHER, LARGE, NON-BIOLOGICAL, and FATHER-OF-SPOUSE.]

Figure 5: Word family network in the EDICTOR: The morphemes (in red) link the words around German Großvater ‘grandfather’ (in blue).

4.3 Analysing Sound Correspondences

Once cognate sets are identified and aligned, searching for regular sound correspondences in the data is a straightforward task. The Correspondences panel of the EDICTOR allows to analyse sound correspondence patterns across pairs of languages. In addition to a simple frequency count, however, conditioning context can be included in the analysis. Context is modeled as a separate string that provides abstract context symbols for each sound segment of a given word. This means essentially that context is handled as an additional tier of a sequence. This multi-tiered representation is very flexible and also allows to model suprasegmental context, like tone or stress. If users do not provide their own tiers, the EDICTOR employs a default context model which distinguishes consonants in syllable onsets from consonants in syllable offsets.

5 Customising the EDICTOR

The EDICTOR can be configured in multiple ways, be it while editing a dataset or before loading the data. The latter is handled via URL parameters passed to the URL that loads the application. In order to facilitate the customization procedure, a specific panel for customisation allows the users to define their default settings and creates a URL which users can bookmark to have quick access to their preferred settings.

The EDICTOR can be loaded in read-only mode by specifying a “publish” parameter. Additionally, server-side files can be directly loaded when loading the application. This makes it very simple and straightforward to use the EDICTOR to publish raw etymological datasets in a visually appealing format, as can be seen from this exemplary URL: http://edictor.digling.org?file=Tujia.tsv&publish=true&preview=500.

6 Conclusion and Outlook

This paper presented a web-based tool for creating, inspecting, editing, and publishing etymological datasets. Although many aspects of the tool are still experimental, and many problems still need to be solved, I am confident that – even in its current form – the tool will be helpful for those working with etymological datasets. In the future, I will develop the tool further, both by adding more useful features and by increasing its consistency.

Acknowledgements

This research was supported by the DFG research fellowship grant 261553824 Vertical and lateral aspects of Chinese dialect history. I thank Nathan W. Hill, Guillaume Jacques, and Laurent Sagart for testing the prototype.

References

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.

Gerhard Jäger. 2015. Support for linguistic macrofamilies from weighted alignment. PNAS, 112(41):12752–12757.

Johann-Mattis List and Robert Forkel. 2016. LingPy. Max Planck Institute for the Science of Human History, Jena. URL: http://lingpy.org.

Johann-Mattis List, Simon J. Greenhill, and Russell D. Gray. 2017. The potential of automatic word comparison for historical linguistics. PLOS ONE, 12(1):1–18.

Johann-Mattis List. 2014. Sequence comparison in historical linguistics. Düsseldorf University Press, Düsseldorf.

Guillaume Segerer and S. Flavier. 2015. RefLex. Laboratoire DDL, Paris and Lyon. URL: http://reflex.cnrs.fr.

Sergej Anatol’evič Starostin. 2000. STARLING. RGGU, Moscow. URL: http://starling.rinet.ru.

Supplementary Material

Supplements for this paper contain a demo video (https://youtu.be/IyZuf6SmQM4), the application website (http://edictor.digling.org), the source code (v. 0.1, https://zenodo.org/record/48834), and the development version (http://github.com/digling/edictor).

WAT-SL: A Customizable Web Annotation Tool for Segment Labeling

Johannes Kiesel and Henning Wachsmuth and Khalid Al-Khatib and Benno Stein Faculty of Media Bauhaus-Universität Weimar 99423 Weimar, Germany .@uni-weimar.de

Abstract

A frequent type of annotations in text corpora are labeled text segments. General-purpose annotation tools tend to be overly comprehensive, often making the annotation process slower and more error-prone. We present WAT-SL, a new web-based tool that is dedicated to segment labeling and highly customizable to the labeling task at hand. We outline its main features and exemplify how we used it for a crowdsourced corpus with labeled argument units.

1 Introduction

Human-annotated corpora are essential for the development and evaluation of natural language processing methods. However, creating such corpora is expensive and time-consuming. While remote annotation processes on the web such as crowdsourcing provide a way to obtain numerous annotations in short time, the bottleneck often lies in the used annotation tool and the required training of annotators. In particular, most popular annotation tools aim to be general purpose, such as the built-in editors of GATE (Cunningham, 2002) or web-based tools like BRAT (Stenetorp et al., 2012) and WebAnno (Yimam et al., 2014). Their comprehensiveness comes at the cost of higher interface complexity, which often also decreases the readability of the texts to be annotated.

Most of the typical annotation work is very focused, though, requiring only few annotation functionalities. An example is segment labeling, i.e., the assignment of one predefined label to each segment of a given text. Such labels may, e.g., capture clause-level argument units, sentence-level sentiment, or paragraph-level information extraction events. We argue that, for segment labeling, a dedicated tool is favorable in order to speed up the annotator training and the annotation process. While crowdsourcing platforms such as mturk.com or crowdflower.com follow a similar approach, their interfaces are either hardly customizable at all or need to be implemented from scratch.

This paper presents WAT-SL (Web Annotation Tool for Segment Labeling), an open-source web-based annotation tool dedicated to segment labeling.¹ WAT-SL provides all functionalities to efficiently run and manage segment labeling projects. Its self-descriptive annotation interface requires only a web browser, making it particularly convenient for remote annotation processes. The interface can be easily tailored to the requirements of the project using standard web technologies in order to focus on the specific segment labels at hand and to match the layout expectations of the annotators. At the same time, it ensures that the texts to be labeled remain readable during the whole annotation process. This process is server-based and preemptable at any point. The annotator's progress can be constantly monitored, as all relevant interactions of the annotators are logged in a simple key-value based plain text format.

In Section 2, we detail the main functionalities of WAT-SL, and we explain its general usage. Section 3 then outlines how we customized and used WAT-SL ourselves in previous work to label over 35,000 argumentative segments in a corpus with 300 news editorials (Al-Khatib et al., 2016).

2 Segment Labeling with WAT-SL

WAT-SL is a ready-to-use and easily customizable web-based annotation tool that is dedicated to segment labeling and that puts the focus on easy usage for all involved parties: annotators, annotation curators, and annotation project organizers.

¹ WAT-SL is available under an MIT license at: https://github.com/webis-de/wat
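The actual keys of WAT-SL's interaction log are not documented in the paper. Assuming a generic one-event-per-line `key=value` format, the progress monitoring mentioned above could be sketched like this (all key, task, and label names are hypothetical):

```python
# Hypothetical interaction log in a key-value plain-text style;
# WAT-SL's real keys are not documented in the paper.
LOG = """\
time=2017-03-01T10:02 annotator=a1 task=editorial-01 segment=1 label=claim
time=2017-03-01T10:03 annotator=a1 task=editorial-01 segment=2 label=evidence
time=2017-03-01T10:04 annotator=a2 task=editorial-01 segment=1 label=claim
"""

def parse_log(text):
    """One event per line; each line is a space-separated list of key=value pairs."""
    return [dict(pair.split("=", 1) for pair in line.split())
            for line in text.splitlines() if line.strip()]

def progress(events):
    """Labeled segments per (annotator, task), counting each segment once."""
    seen = {}
    for event in events:
        key = (event["annotator"], event["task"])
        seen.setdefault(key, set()).add(event["segment"])
    return {key: len(segments) for key, segments in seen.items()}

counts = progress(parse_log(LOG))
```

Such a plain-text format keeps the log trivially greppable while still supporting per-task progress reports.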


Figure 1: Screenshot of an exemplary task selection page in WAT-SL, listing three assigned tasks.

In WAT-SL, the annotation process is split into tasks, usually corresponding to single texts. Together, these tasks form a project.

2.1 Annotating Segments

The annotation interface is designed with a focus on easy and efficient usage. It can be accessed with any modern web browser. In order to start, annotators require only a login and a password. When logging in, an annotator sees the task selection page that lists all assigned tasks, including the annotator's current progress in terms of the number of labeled segments (Figure 1). Completed tasks are marked green, and the web page automatically scrolls down to the first uncompleted task. This allows annotators to seamlessly interrupt and continue the annotation process.

After selecting a task to work on, the annotator sees the main annotation interface (Figure 2). The design of the interface seeks clarity and self-descriptiveness, following the templates of today's most popular framework for responsive web sites, Bootstrap. As a result, we expect that many annotators will feel familiar with the style.

The central panel of the annotation interface shows the text to be labeled and its title. One design objective was to obtain a non-intrusive annotation interface that remains close to just displaying the text in order to maximize readability. As shown in Figure 2, we decided to indicate segments only by a shaded background and a small button at the end. To ensure a natural text flow, line breaks are possible within segments. The button reveals a menu for selecting the label. When moving the mouse cursor over a label, the label description is displayed to prevent a faulty selection. Once a segment is labeled, its background color changes, and the button displays an abbreviation of the respective label. To assist the annotators in forming a mental model of the annotation interface, the background colors of labeled segments match the label colors in the menu. All labels are saved automatically, avoiding any data loss in case of power outages, connection issues, or similar.

In some cases, texts might be over-segmented, for example due to an automatic segmentation. If this is the case, WAT-SL allows annotators to mark a segment as being continued in the next segment. The interface will then visually connect these segments (cf. the buttons showing "->" in Figure 2).

Finally, the annotation interface includes a text box for leaving comments to the project organizers. To simplify the formulation of comments, each segment is numbered, with the number being shown when the mouse cursor is moved over it.

2.2 Curating Annotations

After an annotation process is completed, a curation phase usually follows where the annotations of different annotators are consolidated into one. The WAT-SL curation interface enables an efficient curation by mimicking the annotation interface with three adjustments (Figure 3): First, segments for which the majority of annotators agreed on a label are pre-labeled accordingly. Second, the menu shows for each label how many annotators chose it. And third, the label description shows (anonymized) which annotator chose the label, so that curators can interpret each label in its context. The curation may be accessed under the same URL as the annotation in order to allow annotators of some tasks to be curators of other tasks.

2.3 Running an Annotation Project

WAT-SL is a platform-independent and easily deployable standalone Java application, with few configurations stored in a simple "key = value" file. Among others, annotators are managed in this file by assigning a login, a password, and a set of tasks to each of them. For each task, the organizer of an annotation project creates a directory (see below). WAT-SL uses the directory name as the task name on all occasions. Once the Java archive file we provide is executed, it reads all configurations and starts a server. The server is immediately ready to accept requests from the annotators.
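For illustration, such a "key = value" project file might look as follows. This is a hypothetical sketch: the paper only states the format, and every key name below is invented.

```
# hypothetical WAT-SL project file (all key names invented for illustration)
annotators = alice, bob
alice.password = s3cret
alice.tasks = editorial-001, editorial-002
bob.password = hunter2
bob.tasks = editorial-002, editorial-003
```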

Figure 2: Screenshot of the annotation interface of WAT-SL, capturing the annotation of one news editorial within the argumentative segment labeling project described in Section 3. In particular, the screenshot illustrates how the annotator selects a label (Testimony) for one segment of the news editorial.

Annotating Segments Only few steps are needed to set up the segment labeling tasks: (1) List the labels and their descriptions in the configuration file, and place the button images in the corresponding directory. (2) Set the title displayed above each text (see Figure 2) in a separate configuration file in the respective task directory, or project-wide in the same file as the labels. (3) Finally, put the texts in the task directories. For easy usage, the required text format is as simple as possible: one segment per line and empty lines for paragraph breaks. Optionally, organizers can add Cascading Style Sheet and JavaScript files to customize the interface.

Figure 3: Screenshot of the curation interface, illustrating how a label (Anecdote) is selected based on counts of all labels the annotators selected for the respective segment (Anecdote (2), No unit (1)).
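A hypothetical task text file in this format (the sentences are invented; the format itself — one segment per line, an empty line for a paragraph break — is the one just described):

```
The government announced its reform plan on Monday.
Critics were quick to respond.

According to one economist, the plan is unlikely to work.
```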

WAT-SL logs all relevant annotator interactions. Whenever an annotator changes a segment label, the new label is immediately sent to the server. This prevents data loss and allows monitoring the progress. In addition to the new label, WAT-SL logs the current date, and the time offset and IP address of the annotator. With these logs, researchers can analyze the annotators' behavior, e.g., to identify hard cases where annotators needed much time or changed the label multiple times.

Curating Annotations To curate a task, an organizer duplicates the annotation task and then copies the annotation logs into the new task directory. The organizer then specifies curators for the curation task analogously to assigning annotators.

Result In addition to the logs, the web interface also allows the organizer to see the final labels without history in a simple "key = value" format, which is useful when distributing the annotations.
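The kind of log analysis described above — spotting hard cases via repeated relabeling — can be sketched in a few lines. The paper only says the logs are simple "key = value" plain-text records, so the record layout and key names below are invented for illustration:

```python
# Hypothetical sketch of analyzing WAT-SL's interaction logs. The record
# layout and key names are assumptions; only the "key = value" idea comes
# from the paper.
from collections import Counter

log_lines = [
    "segment = 3 | label = Testimony",
    "segment = 3 | label = Anecdote",
    "segment = 7 | label = No unit",
    "segment = 3 | label = Testimony",
]

def relabel_counts(lines):
    """Count how often each segment was (re)labeled, to spot hard cases."""
    counts = Counter()
    for line in lines:
        fields = dict(kv.split(" = ", 1) for kv in line.split(" | "))
        counts[fields["segment"]] += 1
    return counts

print(relabel_counts(log_lines))  # segment 3 was labeled three times
```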

3 Case Study: Labeling Argument Units

WAT-SL was already used successfully in the past, namely, for creating the Webis-Editorials-16 corpus with 300 news editorials split into a total of 35,665 segments, each labeled by three annotators and finally curated by the authors (Al-Khatib et al., 2016). In particular, each segment was assigned one of eight labels, where the labels include six types of argument units (e.g., assumption and anecdote), a label for non-argumentative segments, and a label indicating that a unit is continued in the next segment (see the "->" label in Section 2). On average, each editorial contains 957 tokens in 118 segments.

The annotation project of the Webis-Editorials-16 corpus included one task per editorial. The editorial's text had been pre-segmented with a heuristic unit segmentation algorithm before. This algorithm was tuned towards oversegmenting a text in case of doubt, i.e., to avoid false negatives (segments that should have been split further) at the cost of more false positives (segments that need to be merged). Note that WAT-SL allows fixing such false positives using "->".

For annotation, we hired four workers from the crowdsourcing platform upwork.com. Given the segmented editorials, each worker iteratively chose one assigned editorial (see Figure 1), read it completely, and then selected the appropriate label for each segment in the editorial (see Figure 2). This annotation process was repeated for all editorials, with some annotators interrupting their work on an editorial and returning to it later on. All editorials were labeled by three workers, resulting in 106,995 annotations in total. The average time per editorial taken by a worker was ~20 minutes.

To create the final version of the corpus, we curated each editorial using WAT-SL. In particular, we automatically kept all labels with majority agreement, and let one expert decide on the others. Also, difficult cases where annotators tend to disagree were identified in the curation phase.

From the perspective of WAT-SL, the annotation of the Webis-Editorials-16 corpus served as a case study, which provided evidence that the tool is easy to learn and master. From the beginning, the workers used WAT-SL without noteworthy problems, as far as we could see from monitoring the interaction logs. Also, the comment area turned out to be useful, i.e., the workers left several valuable suggestions and questions there.

4 Conclusion and Future Work

This paper has presented WAT-SL, an open-source web annotation tool dedicated to segment labeling. WAT-SL is designed for easy configuration and deployment, while allowing project organizers to tailor its annotation interface to the project needs using standard web technologies. WAT-SL aims for simplicity by featuring a self-explanatory interface that does not distract from the text to be annotated, as well as by always showing the annotation progress at a glance and saving it automatically. The curation of the annotations can be done using exactly the same interface. In case of oversegmented texts, annotators can merge segments. Moreover, all relevant annotator interactions are logged, allowing project organizers to monitor and analyze the annotation process.

Recently, WAT-SL was used successfully on a corpus with 300 news editorials, where four remote annotators labeled 35,000 argumentative segments. In the future, we plan to create more dedicated annotation tools in the spirit of WAT-SL for other annotation types (e.g., relation identification). In particular, we plan to improve the annotator management functionalities of WAT-SL and reuse them for these other tools.

References

Khalid Al-Khatib, Henning Wachsmuth, Johannes Kiesel, Matthias Hagen, and Benno Stein. 2016. A News Editorial Corpus for Mining Argumentation Strategies. In Proceedings of the 26th International Conference on Computational Linguistics, pages 3433–3443.

Hamish Cunningham. 2002. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36(2):223–254.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107.

Seid Muhie Yimam, Chris Biemann, Richard Eckart de Castilho, and Iryna Gurevych. 2014. Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 91–96.
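The curation rule used for the corpus — keep a label automatically when a majority of the three annotators agree, and defer the rest to an expert — can be sketched as follows (a minimal illustration, not the actual WAT-SL code):

```python
# Majority-based pre-labeling as described in Section 3: a segment's label is
# kept automatically if more than half of the annotators chose it; otherwise
# the decision is deferred to an expert (represented by None here).
from collections import Counter

def pre_label(annotator_labels):
    (label, count), = Counter(annotator_labels).most_common(1)
    return label if count > len(annotator_labels) / 2 else None

print(pre_label(["Anecdote", "Anecdote", "No unit"]))   # Anecdote
print(pre_label(["Anecdote", "No unit", "Testimony"]))  # None: expert decides
```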

TextImager as a Generic Interface to R

Tolga Uslu, Wahed Hemati, Alexander Mehler, and Daniel Baumartz

Goethe University of Frankfurt, TextTechnology Lab
{uslu,hemati,mehler}@em.uni-frankfurt.de, [email protected]

Abstract

R is a very powerful framework for statistical modeling. Thus, it is of high importance to integrate R with state-of-the-art tools in NLP. In this paper, we present the functionality and architecture of such an integration by means of TextImager. We use the OpenCPU API to integrate R based on our own R-Server. This allows for communicating with R-packages and combining them with TextImager's NLP-components.

1 Introduction

We introduced TextImager in (Hemati et al., 2016), where we focused on its architecture and the functions of its backend. In this paper, we present the functionality and architecture of R interfaced by means of TextImager. For this purpose, we created a separate panel in TextImager for R-applications. In this panel, we combine state-of-the-art NLP tools embedded into TextImager with the powerful statistics of R (R Development Core Team, 2008). We use the OpenCPU API (Ooms, 2014) to integrate R into TextImager by means of our own R-server. This allows for easily communicating with the built-in R-packages and combining the advantages of both worlds. In the case of topic detection (LDA), for example, the complete text is used as an input string to R. Thanks to TextImager's preprocessor, more information is provided about syntactic words, parts of speech, lemmas, grammatical categories etc., which can improve topic detection. Further, the output of R routines is displayed by means of TextImager's visualizations, all of which are linked to the corresponding input text(s). This allows for unprecedented interaction between text and the statistical results computed by R. For this paper, we sampled several Wikipedia articles to present all features of R integrated into TextImager. This includes articles about four politicians (Angela Merkel, Barack Obama, Recep Tayyip Erdoğan, Donald Trump) and five sportsmen, that is, three basketball players (Michael Jordan, Kobe Bryant and LeBron James) and two soccer players (Thomas Müller and Bastian Schweinsteiger).

2 Related Work

R is used and developed by a large community covering a wide range of statistical packages. The CRAN1 package repository is the main repository of the R project. It currently consists of about 10,000 packages, including packages for NLP.2 The R framework requires basic to versatile skills in programming and scripting. It provides limited visualization and interaction functionalities. Attempts have been undertaken to provide web interfaces for R, for example, Shiny3 and rApache4. Though they provide a variety of functions and visualizations5, these tools are not optimized for statistical NLP: their NLP-related functionalities are rather limited. In order to fill this gap, we introduce TextImager's R package, that is, a web-based tool for NLP utilizing R.

1https://cran.r-project.org/
2https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
3http://shiny.rstudio.com/
4http://rapache.net/
5http://shiny.rstudio.com/gallery/

3 Architecture

TextImager is a UIMA-based (Ferrucci and Lally, 2004) framework that offers a wide range of NLP and visualization tools by means of a user-friendly GUI without requiring programming skills. It consists of two parts: front-end and back-end. The back-end is a modular, expandable, scalable and

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 17–20. © 2017 Association for Computational Linguistics

flexible architecture with parallel and distributed processing capabilities (Hemati et al., 2016). The front-end is a web application that makes NLP processes available in a user-friendly way with responsive and interactive visualizations (Hemati et al., 2016). TextImager already integrates many third-party tools. One of these is R. This section describes the technical integration and utilization of R in TextImager.

3.1 R / OpenCPU

R is a software environment for statistical computing. It compiles and runs on a variety of UNIX and Windows platforms. One of our goals is to provide an easy-to-use interface for R with a focus on NLP. To this end, we use OpenCPU to integrate R into TextImager. OpenCPU provides an HTTP API, which allocates the functionalities of R-packages (Ooms, 2014). The OpenCPU software can be used directly in R; alternatively, it can be installed on a server. We decided for the latter variant. In addition, we used the opencpu.js JavaScript library, which simplifies API use in JavaScript and allows for calling R-functions directly from TextImager. To minimize the communication effort between client and server and to encapsulate R scripts from TextImager's functionality, we created a so-called TextImager-R-package that takes TextImager data, performs all R-based calculations and returns the results. This package serves for converting any TextImager data to meet the input requirements of any R-package. In this way, we can easily add new packages to TextImager without changing the HTTP request code. Because some data and models have a long build time, we used OpenCPU's session feature to keep this data on the server and access it in future sessions so that we do not have to recreate it. This allows the user to quickly execute several methods, even in parallel, without recalculating them each time.

3.2 Data Structure

The data structure of TextImager differs from R's data structure. Therefore, we developed a generic mapping interface that translates the data structure from TextImager to an R-readable format. Depending on the R-package, we send the required data via OpenCPU. This allows for combining each NLP tool with any R-package.

3.3 OpenCPU Output Integration

Visualizing the results of a calculation or sensitivity analysis is an important task. That is why we provide interactive visualizations to make the information available to the user more comprehensible. This allows the user to interact with the text and the output of R, for example, by highlighting any focal sentence in a document by hovering over a word or sentence graph generated by means of R.

3.4 R-packages

This section gives an overview of the R-packages embedded into the pipeline of TextImager.

tm The tm-package (Feinerer and Hornik, 2015) is a text mining R-package containing preprocessing methods for data importing, corpus handling, stopword filtering and more.

lda The lda-package (Chang, 2015) provides an implementation of Latent Dirichlet Allocation algorithms. It is used to automatically identify topics in documents, to get top documents for these topics and to predict document labels. In addition to the tabular output, we developed an interactive visualization which assigns the decisive words to the appropriate topic (see Figure 1). This visualization makes it easy to differentiate between the two topics and see which words classify these topics. In our example, we have recognized words such as players, season and game as one topic (sportsmen) and party, state and politically as a different topic (politics). The parameters of every package can be set at runtime. The utilization and combination of TextImager and R makes it possible to calculate topics not only based on wordforms, but also taking other features into account, like lemmas, POS tags, morphological features and more.

stylo The stylo-package (Eder et al., 2016) provides functionality for stylometric analyses. All parameters of the package can be set through the user interface of TextImager. The package provides multiple unsupervised analyses, mostly based on a most-frequent-word list and contrastive text analysis. In Figure 2 we have calculated a cluster analysis based on our example corpus. We can see that the politicians, basketball players and soccer players have been clustered into their own groups.

Figure 1: Words assigned to their detected topics

Figure 2: Cluster analysis of the documents

LSAfun The LSAfun (Günther et al., 2015) and lsa (Wild, 2015) packages provide functionality for Latent Semantic Analysis. In TextImager, it is used to generate summaries of documents and similarities of sentences.

igraph The igraph-package (Csardi and Nepusz, 2006) provides multiple network analysis tools. We created a document network by linking each word with its five most similar words based on word embeddings. The igraph-package also allows laying out the graph and calculating different centrality measures. In the end, we can export the network in many formats (GraphML, GEXF, JSON, etc.) and edit it with graph editors. We also built an interactive network visualization (see Figure 3) using sigma.js to make it interoperable with TextImager.

Figure 3: Network graph based on word embeddings

tidytext The tidytext package (Silge and Robinson, 2016) provides functionality to create datasets following the tidy data principles (Wickham, 2014). We used it with our tm-based corpus to calculate TF-IDF information of documents. In Figure 4 we see the tabular output with information like tf, idf and tf-idf.

Figure 4: Statistical information of the documents

stringdist The stringdist-package (van der Loo, 2014) implements distance calculation methods like Cosine, Jaccard, OSA and others. We implemented functionality for calculating sentence similarities and provide an interactive visual representation. Each node represents a sentence of the selected document and the links between them represent the similarity of those sentences. The thicker the links, the more similar they are. By interacting with the visualization, corresponding sections of the document panel get highlighted, to see the similar sentences (see Figure 5). The bidirectional interaction functionality enables easy comparability.

Figure 5: Depiction of sentence similarities.

stats We used functions from the R-package stats (R Development Core Team, 2008) to calculate a hierarchical cluster analysis based on the sentence similarities. This allows us to cluster similar sentences and visualize them with an interactive dendrogram. In Figure 6 we selected one of these clusters and the document panel immediately adapts and highlights all the sentences in this cluster.

Figure 6: Similarity-clustered sentences.

An interesting side effect of integrating these tools into TextImager's pipeline is that their output can be concerted in a way to arrive at higher-level text annotations and analyses. In this way, we provide an integration of two heavily expanding areas, that is, NLP and statistical modeling.

4 Future Work

In already ongoing work, we focus on big data as exemplified by Wikipedia. We also extend the number of built-in R-packages in TextImager.

5 Scope of the Software Demonstration

The beta version of TextImager is online. To test the functionalities of R as integrated into TextImager, use the following demo: http://textimager.hucompute.org/index.html?viewport=demo&file=R-Demo..

References

Jonathan Chang. 2015. lda: Collapsed Gibbs Sampling Methods for Topic Models. R package version 1.4.2.

Gabor Csardi and Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal, Complex Systems:1695.

Maciej Eder, Jan Rybicki, and Mike Kestemont. 2016. Stylometry with R: a package for computational text analysis. R Journal, 8(1):107–121.

Ingo Feinerer and Kurt Hornik. 2015. tm: Text Mining Package. R package version 0.6-2.

David Ferrucci and Adam Lally. 2004. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348.

Fritz Günther, Carolin Dudschig, and Barbara Kaup. 2015. LSAfun: An R package for computations based on latent semantic analysis. Behavior Research Methods, 47(4):930–944.

Wahed Hemati, Tolga Uslu, and Alexander Mehler. 2016. TextImager: a distributed UIMA-based system for NLP. In Proceedings of the COLING 2016 System Demonstrations.

Jeroen Ooms. 2014. The OpenCPU system: Towards a universal interface for scientific computing through separation of concerns. arXiv:1406.4806 [stat.CO].

R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Julia Silge and David Robinson. 2016. tidytext: Text mining and analysis using tidy data principles in R. JOSS, 1(3).

M.P.J. van der Loo. 2014. The stringdist package for approximate string matching. The R Journal, 6:111–122.

Hadley Wickham. 2014. Tidy data. Journal of Statistical Software, 59(1):1–23.

Fridolin Wild. 2015. lsa: Latent Semantic Analysis. R package version 0.73.1.
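As an aside on the OpenCPU integration of Section 3.1: under OpenCPU's documented URL scheme, every exported function of an installed R package is reachable over HTTP, and a client POSTs the function's arguments to that URL. A minimal sketch of the convention (the server name is hypothetical; the paper does not give the address of TextImager's own R-server):

```python
# OpenCPU exposes each exported function of an installed R package at
# /ocpu/library/<package>/R/<function> (Ooms, 2014). The server below is
# a placeholder, not TextImager's actual R-server.

def opencpu_endpoint(server: str, package: str, function: str) -> str:
    """Build the URL a client would POST the function's arguments to."""
    return f"{server}/ocpu/library/{package}/R/{function}"

# e.g. calling stats::rnorm over HTTP:
print(opencpu_endpoint("https://r.example.org", "stats", "rnorm"))
# -> https://r.example.org/ocpu/library/stats/R/rnorm
```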

GraWiTas: a Grammar-based Wikipedia Talk Page Parser

Benjamin Cabrera, Laura Steinert, Björn Ross
University of Duisburg-Essen
Lotharstr. 65 and Forsthausweg 2, 47057 Duisburg, Germany
[email protected]

Abstract

Wikipedia offers researchers unique insights into the collaboration and communication patterns of a large self-regulating community of editors. The main medium of direct communication between editors of an article is the article's talk page. However, a talk page file is unstructured and therefore difficult to analyse automatically. A few parsers exist that enable its transformation into a structured data format. However, they are rarely open source, support only a limited subset of the talk page syntax – resulting in the loss of content – and usually support only one export format. Together with this article we offer a very fast, lightweight, open source parser with support for various output formats. In a preliminary evaluation it achieved a high accuracy. The parser uses a grammar-based approach – offering a transparent implementation and easy extensibility.

1 Introduction

Wikipedia is becoming an increasingly important knowledge platform. As the content is created by a self-regulating community of users, the analysis of interactions among users with methods from natural language processing and social network analysis can yield important insights into collaboration patterns inherent to such platforms. For example, Viegas et al. (2007) manually classified talk page posts with regard to the communication type to analyse the coordination among editors. Such insights can be important for research in the area of Computer-Supported Collaborative Learning.

The collaboration patterns among Wikipedia editors are visible on an article's revision history and its talk page. The revision history lists all changes ever made to an article, but it does not contain explicit information about the collaboration between editors. Most discussions between editors take place on the article's talk page, a dedicated file where they can leave comments and discuss potential improvements and revisions.

The talk page is therefore useful for observing explicit coordination between editors. However, most studies on Wikipedia article quality have focused on easily accessible data (Liu and Ram, 2011), whereas talk pages are not easy to use with automated methods. Essentially, talk pages are very loosely structured text files for which the community has defined certain formatting rules. Editors do not always follow these rules in detail, and thus an automated analysis of talk page data requires parsing the file into a structured format by breaking it down into the individual comments and the links among them.

In this article, we introduce an open source parser for article talk pages called GraWiTas that focuses on a good comment detection rate, good usability and a plethora of different output formats. While a few such talk page parsers exist, our core parser is based on the Boost.Spirit C++ library1, which utilises Parsing Expression Grammars (PEG) for specifying the language to parse. This leads to a very fast, easily extensible program for different use cases.

The next section of this paper describes the structure of Wikipedia talk pages. The third section describes the parser we developed. Afterwards, a preliminary evaluation is described. The fifth section gives an overview of related work. Finally, conclusions from our research are given at the end of the paper.

1http://boost-spirit.com/, as seen on Feb. 14, 2017

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 21–24. © 2017 Association for Computational Linguistics

2 Wikipedia Talk Pages

The talk page of a Wikipedia article is the place where editors can discuss issues with the article content and plan future contributions. There is a talk page for every article on Wikipedia. As any other page in Wikipedia, it can be edited by anybody by manipulating the underlying text file – which uses Wiki markup, a dedicated markup syntax. When a user visits the talk page, this syntax file is interpreted by the Wikipedia template engine and turned into the displayed HTML.

Although Wiki markup includes many different formatting (meta-)commands, the commands used on talk pages are fairly basic. Some guidelines as to how to comment on talk pages are defined in the Wikipedia Talk Page Guidelines2. These rules specify, for example, that new topics should be added to the bottom of the page, that a user may not alter or delete another user's post, that a user should sign a post with their username and a timestamp, and that to indicate to which older post they are referring, users should indent their own post. The following snippet gives an example of the talk page syntax:

== Meaning of second paragraph ==
I don't understand the second paragraph ... [[User:U1|U1]] [[User Talk:U1|talk]] 07:24, 2 Dec 2016 (UTC)
:I also find it confusing... [[User:U2|U2]] 17:03, 3 Dec 2016 (UTC)
::Me too... [[User:U3|U3]] [[User Talk:U3|talk]] 19:58, 3 Dec 2016 (UTC)
:LOL, the unit makes no sense [[User:U4|U4]] [[User Talk:U1|talk]] 00:27, 6 Dec 2016 (UTC)
Is the reference to source 2 correct? [[User:U3|U3]] [[User Talk:U3|talk]] 11:41, 4 Dec 2016 (UTC)

The first line marks the beginning of a new discussion topic on the page. The following lines contain comments on this topic. All authors signed their comments with a signature consisting of some variation of a link to their user profile page – in Wiki markup, hyperlinks are wrapped in '[[' and ']]' – and a timestamp. User U2 replies to the first post and thus indents their text by one tab. In Wiki markup this is done with a colon ':'. The third comment, which is a reply to the second one, is indented by two colons, leading to more indentation on the page when it is displayed.

While this structure is easily understood by humans, parsing talk page discussions automatically is far from trivial for a number of reasons. The discussion is not stored in a structured database format, and many editors use slight variations of the agreed-upon syntax when writing their comments. For example, there are multiple types of signatures that mark the end of a comment. Some editors do not indent correctly or use other formatting commands in the Wikipedia arsenal.

In addition, some Wikipedia talk pages contain transcluded content. This means that the content of a different file is pasted into the file at the specified position. Also, pages that become "too large" are archived3, meaning that the original page is moved to a different name and a new, empty page is created in its place.

3 GraWiTas

Our parser – GraWiTas – consists of three components, covering the whole process of retrieving and parsing Wikipedia talk pages. The first two components gather and preprocess the needed data. They differ in the used data source: The crawler component extracts the talk page content of given Wikipedia URLs, while the dump component processes full Wikipedia XML dumps. The actual parsing is done in the core parser component, where all comments from the Wiki markup of a talk page are extracted and exported into one of several output formats.

3.1 Core Parser Component

The main logic of the core parser lies in a system of rules that essentially defines what to consider as a comment on a talk page. Simply spoken, a comment is a piece of text – possibly with an indentation – followed by a signature and some line ending. As already discussed, signatures and indentations are relatively fuzzy concepts, which means that the rules need to define a lot of cases and exceptions. There are also some template elements in the Wikipedia syntax that can change the interpretation of a comment, e.g. outdents4. Grammars are a theoretical concept that matches such a rule system very well. All different cases can be specified in detail and it is easy to build more complex rules out of multiple

2https://en.wikipedia.org/wiki/Wikipedia:Talk_page_guidelines, as seen on Feb. 14, 2017
3https://en.wikipedia.org/wiki/Help:Archiving_a_talk_page, as seen on Feb. 14, 2017
4https://en.wikipedia.org/wiki/Template:Outdent, as seen on Feb. 14, 2017
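To make the rule "a piece of text, possibly indented, followed by a signature" concrete, here is a deliberately simplified Python sketch. The actual GraWiTas grammar is written with Boost.Spirit and covers many signature variants; this sketch assumes a single one (a [[User:...]] link plus a timestamp):

```python
import re

# Simplified illustration of the comment rule, NOT the actual PEG grammar:
# a comment is an optionally ':'-indented line ending in a user link and a
# timestamp.
SIGNATURE = re.compile(
    r"\[\[User:(?P<user>[^|\]]+)\|[^\]]*\]\].*?"               # user page link
    r"(?P<time>\d{2}:\d{2}, \d{1,2} \w{3} \d{4}) \(UTC\)\s*$"  # timestamp
)

def parse_comment(line):
    """Return (indentation level, user, timestamp), or None for non-comments."""
    match = SIGNATURE.search(line)
    if match is None:
        return None
    indent = len(line) - len(line.lstrip(":"))  # one leading ':' = one level
    return indent, match.group("user"), match.group("time")

print(parse_comment(":I also find it confusing... [[User:U2|U2]] "
                    "17:03, 3 Dec 2016 (UTC)"))
# -> (1, 'U2', '17:03, 3 Dec 2016')
```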

[Figure 1, top right: the comment table extracted from the example talk page]

ID | Reply To | User | Comment                                   | Time
1  | -        | U1   | I don't understand the second paragraph.  | 07:24, 2 Dec 2016
2  | 1        | U2   | I also find it confusing...               | 17:03, 3 Dec 2016
3  | 2        | U3   | Me too...                                 | 19:58, 3 Dec 2016
4  | 2        | U4   | LOL, the unit makes no sense.             | 00:27, 6 Dec 2016
5  | -        | U3   | Is the reference to source 2 correct?     | 11:41, 4 Dec 2016

[The remaining panels of Figure 1 show the rendered talk page and the one-mode and two-mode graphs derived from this table.]

Figure 1: Schematic representation of the core parser component: A table (top right) is extracted from a talk page (top left) using a grammar-based rule set. It can then be transformed into various output formats, e.g. a one-mode (user- / comment-) graph (bottom left and right) or a two-mode graph (bottom center).

smaller ones. In particular we use Parsing Expression Grammars (Ford, 2004), which are very efficiently implemented in the Boost.Spirit library for C++. The library allows developers to essentially write the grammar in C++ syntax, which is then compiled – note the possibility for compile-time optimisation – to a highly optimised parser. Beside its efficient implementation, another advantage of using grammars is extensibility. Whenever we encountered mismatched or incorrectly parsed comments, we added another case to our grammar and no real application logic had to be written.

After extracting the comments (with username, date, ...) from the raw talk page, the core parser has various output formats, including lists of the comments (formats: CSV, JSON, human-readable) and user as well as comment networks (formats: GML, GraphML, Graphviz) (cf. Figure 1).

3.2 Wiki Markup Crawler Component

The crawler component is responsible for obtaining the Wiki markup of a talk page using the Wikipedia website as a source. The program expects a text file containing the URLs to the Wikipedia articles as input. It then connects to the server to retrieve the HTML source code, from which we extract the relevant Wiki markup and which we finally feed to the core parser to get the structured talk page data for each article in the list of URLs.

Our crawler is also able to fetch archived pages and includes their content in the downloaded markup file. Finally, it also fetches transcluded content by searching for the respective Wiki-markup commands. If the transcluded content belongs to the namespace Talk – i.e. the name of the transcluded content starts with Talk: – it is part of a talk page and is therefore included. All other transcluded content is not included, as it is unnecessary for the talk page analysis.

The crawler component should be used for small to medium-scale studies that rely on up-to-date talk page data. Setting up the crawler is very intuitive, but the need to contact the server may make it unfeasible to work with very large Wikipedia data.

3.3 Wiki Markup Dump Component

To be able to work with all data at once – without connecting to their server too often – Wikipedia offers a download of full database dumps that include e.g. all articles, all talk pages and all user pages [5]. For GraWiTas we also provide a program that is able to read such large XML dumps efficiently and feed talk pages to our core parser. Users can select if they want to parse all talk pages in the dump or only some particular ones characterised by a list of article titles.

[5] https://en.wikipedia.org/wiki/Wikipedia:Database_download, as seen on Feb. 14, 2017
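The extraction step described above can be illustrated with a much-simplified stand-in for one grammar rule. The real core parser expresses such rules as a Parsing Expression Grammar compiled via Boost.Spirit in C++; the Python regex below is only a rough analogue, and the signed-comment format it assumes is a simplified, hypothetical version of Wikipedia's signature syntax:

```python
import re

# Simplified, hypothetical signed-comment rule: comment text, a user link,
# then a timestamp (as produced by Wikipedia's "~~~~" signature markup).
# The real GraWiTas parser expresses such rules as a PEG in Boost.Spirit.
SIGNED_COMMENT = re.compile(
    r"(?P<text>.*?)\s*"
    r"\[\[User:(?P<user>[^|\]]+)(?:\|[^\]]*)?\]\]\s*"
    r"(?P<date>\d{2}:\d{2}, \d+ \w+ \d{4} \(UTC\))",
    re.DOTALL,
)

def extract_comments(markup):
    """Return one (user, date, text) triple per signed comment."""
    return [
        (m.group("user"), m.group("date"), m.group("text").strip())
        for m in SIGNED_COMMENT.finditer(markup)
    ]
```

From such triples, the user and comment networks mentioned above are then a matter of emitting nodes and edges in GML, GraphML or Graphviz syntax.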

The dump component can be used for large-scale studies that look at all talk pages of Wikipedia at once. Compared to the crawler, obtaining the Wiki markup is faster. However, this comes at the price that users have to download and maintain the large XML dump files.

4 Evaluation

We ran a very small evaluation study to assess the speed and accuracy of our core parser. For analysing the speed, we generated large artificial talk pages from smaller existing ones and measured the time our parser took. Using an Intel(R) Core(TM) i7 M 620 @ 2.67GHz processor, we obtained parsing times under 50ms for file sizes smaller than 1MB (~2000 comments). For files up to 10MB (~20000 comments) it took around 300ms. This leads to an overall parsing speed of around 30MB/s for average-sized comments. However, real-world talk pages very rarely even pass the 1MB mark.

To evaluate accuracy, we picked a random article with a talk page and verified the extracted comments manually. Hereby, an accuracy of 0.97 was achieved. Although such a small evaluation is of course far from comprehensive, it shows that our parser is on the one hand fast enough to parse large numbers of talk pages and on the other hand yields a high enough accuracy for real-world applications.

5 Related Work

There have been previous attempts at parsing Wikipedia talk pages. Massa (2011) focused on user talk pages, where the conventions are different from article talk pages. Ferschke et al. (2012) rely on indentation for extracting relationships between comments, but use the revision history to identify the authors and comment creation dates. However, they do not offer any information on whether archived, transcluded and outdented content is handled correctly, and their implementation has not been made public. Laniado et al. (2011) infer the structure of discussions from indentation, similar to our approach. They use user signatures to infer comment metadata (author and date). Their parser is available on Github [6]. However, it does not handle transcluded talk page content, nor the outdent template. Their parser outputs a CSV file with custom fields.

6 Conclusion

The possibility of parsing Wikipedia talk pages into networks provides many new avenues for NLP researchers interested in collaboration patterns. A usable parser is a prerequisite for this type of research. Unlike previous implementations, our parser correctly handles many quirks of Wikipedia talk page syntax, from archived and transcluded talk page contents to outdented comments. It can produce output in a number of standardised formats, including GML as well as a custom text file format. Finally, it requires very little initial setup. The GraWiTas source code is publicly available [7], together with a web app demonstrating the core parser component.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group "User-Centred Social Media".

References

Oliver Ferschke, Iryna Gurevych, and Yevgen Chebotar. 2012. Behind the article: Recognizing dialog acts in Wikipedia talk pages. In Proc. of EACL'12, pages 777–786, Stroudsburg, PA, USA. ACL.

Bryan Ford. 2004. Parsing expression grammars: A recognition-based syntactic foundation. SIGPLAN Not., 39(1):111–122.

David Laniado, Riccardo Tasso, Yana Volkovich, and Andreas Kaltenbrunner. 2011. When the Wikipedians talk: Network and tree structure of Wikipedia discussion pages. In Proc. of ICWSM, Barcelona, Spain, July 17-21, 2011.

Jun Liu and Sudha Ram. 2011. Who does what: Collaboration patterns in the Wikipedia and their impact on article quality. ACM Trans. Manag. Inform. Syst., 2(2).

Paolo Massa. 2011. Social networks of Wikipedia. In HT'11, Proc. of ACM Hypertext 2011, Eindhoven, The Netherlands, June 6-9, 2011, pages 221–230.

Fernanda B. Viegas, Martin Wattenberg, Jesse Kriss, and Frank Van Ham. 2007. Talk before you type: Coordination in Wikipedia. In Proc. of HICSS'07. IEEE.

[6] https://github.com/sdivad/WikiTalkParser, as seen on Feb. 14, 2017
[7] https://github.com/ace7k3/grawitas

TWINE: A real-time system for TWeet analysis via INformation Extraction

Debora Nozza, Fausto Ristagno, Matteo Palmonari, Elisabetta Fersini, Pikakshi Manchanda, Enza Messina
University of Milano-Bicocca / Milano
{debora.nozza, palmonari, fersini, pikakshi.manchanda, messina}@disco.unimib.it, [email protected]

Abstract

In recent years, the amount of user-generated content shared on the Web has significantly increased, especially in social media environments, e.g. Twitter, Facebook, Google+. This large quantity of data has generated the need for reactive and sophisticated systems for capturing and understanding the underlying information enclosed in them. In this paper we present TWINE, a real-time system for the big data analysis and exploration of information extracted from Twitter streams. The proposed system, based on a Named Entity Recognition and Linking pipeline and a multi-dimensional spatial geo-localization, is managed by a scalable and flexible architecture for an interactive visualization of micropost stream insights. The demo is available at http://twine-mind.cloudapp.net/streaming [1,2].

1 Introduction

The emergence of social media has provided new sources of information and an immediate communication medium for people from all walks of life (Kumar et al., 2014). In particular, Twitter is a popular microblogging service that is particularly focused on the speed and ease of publication. Every day, nearly 300 million active users share over 500 million posts [3], so-called tweets, principally using mobile devices.

Twitter has several advantages compared to traditional information channels, i.e. tweets are created in real-time, have a broad coverage over a wide variety of topics and include several useful pieces of embedded information, e.g. time, user profile and geo-coordinates if present.

Mining and extracting relevant information from this huge amount of microblog posts is an active research topic, generally called Information Extraction (IE). One of the key subtasks of IE is Named Entity Recognition and Linking (NEEL), aimed to first identify and classify named entities such as people, locations, organisations and products, and then to link the recognized entity mentions to a Knowledge Base (KB) (Derczynski et al., 2015).

Although several Information Extraction models have been proposed for dealing with microblog contents (Bontcheva et al., 2013; Derczynski et al., 2015), only few of them focused on the combination of these techniques with a big data architecture and user interface in order to perform and explore real-time analysis of social media content streams. Moreover, the majority of these research studies are event-centric, in particular focusing on the tasks of situational awareness and event detection (Kumar et al., 2011; Leban et al., 2014; Sheth et al., 2014; Zhang et al., 2016).

In this paper we propose TWINE, a system that visualizes and efficiently performs real-time big data analytics on user-driven tweets via Information Extraction methods. TWINE allows the user to:

- perform real-time monitoring of tweets related to their topics of interest, with unrestricted keywords;

- explore the information extracted by semantic-based analysis of large amounts of tweets, i.e. (i) recognition of named entities and the information of the correspondent

[1] At the moment, the application is deployed on an Azure client service with traffic and storage limits given by the provider.
[2] The TWINE system requires Twitter authentication; if you do not want to use your Twitter account you can try the demo at http://twine-mind.cloudapp.net/streaming-demo.
[3] http://www.internetlivestats.com/

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 25–28. © 2017 Association for Computational Linguistics

Figure 1: TWINE system overview.

KB resources, (ii) multi-dimensional spatial geo-tagging for each tweet, including the geo-localization of the named entities identified as locations, and (iii) two semantic-driven interactive visualization interfaces.

The following section will present the details of the architecture for supporting real-time tweet analysis and the description of the conceived graphical user interface.

2 TWINE system

The proposed system TWINE, acronym for TWeet analysis via INformation Extraction, is a real-time system for the analysis and exploration of information extracted from Twitter data. Figure 1 outlines its macro steps coupled with corresponding examples.

Given a set of keywords provided by the user (e.g. "Italy") as input query, the system fetches the stream of all the tweets (text and tweet author information) matching the keywords using Twitter APIs. Next, each tweet text is processed by the NEEL pipeline. This step provides as output a set of recognized named entities linked to the correspondent KB resource (e.g. Tuscany - http://dbpedia.org/page/Tuscany). After this elaboration, the system retrieves all the additional relevant information needed for the exploration: from the KB we extract the resource data, i.e. image, text description, type and coordinates if the entity is a location, and from Twitter we extract the tweet author account's profile location, which is resolved with a georeferencing system.

This information is subsequently stored in a database that incrementally enriches the information generated by the precedent phases. Then, the TWINE web interface fetches the stored data from the DB for populating two different interactive visualisation interfaces.

Figure 2: TWINE system architecture.

2.1 System Architecture

The proposed system is implemented using a centralized system architecture, as shown in Figure 2. The main requirement was to develop a system able to process large incoming data streams in real-time.

In TWINE, all the aforementioned processes are triggered by the user from the client and elaborated on the server-side, i.e. the streaming fetching phase, the NEEL processing, the KB resource retrieval, the geo-codification of the locations and the database storing. With this design implementation all the computations are performed on the server. This improves the independence from the client technical specifications, preventing different problems such as slow loading, high processor usage and even freezing.

The system architecture, presented in Figure 2, is composed of several independent modules:

External Services. The system makes use of Twitter APIs for fetching the stream of tweets given an input query, a SPARQL endpoint over the DBpedia data set for the retrieval of the KB resource information, and a georeferencing system, OpenStreetMap [4], to obtain the geographic coordinates from the tweet author account's profile location.

NEEL pipeline. This module applies the NEEL pipeline proposed by Caliano et al. (2016) to the tweets.

Message Broker system. This module is necessary to build pipelines for processing streaming data in real time, in such a way that components can exchange data reliably.

[4] https://www.openstreetmap.org/
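As a concrete illustration of this data flow, the record stored per tweet might be assembled as below. This is only a sketch: `kb_lookup` and `geocode` stand in for the DBpedia SPARQL endpoint and the OpenStreetMap georeferencing service, and all field names are hypothetical rather than TWINE's actual schema.

```python
def enrich(tweet, entities, kb_lookup, geocode):
    """Build the record stored per tweet (illustrative field names).

    `entities` is the NEEL output as (mention, KB URI) pairs; `kb_lookup`
    and `geocode` are hypothetical stand-ins for the external services.
    """
    return {
        "text": tweet["text"],
        "entities": [
            # KB data: image, description, type, coordinates if a location
            {"mention": m, "kb": kb_lookup(uri)}
            for m, uri in entities
        ],
        # resolve the free-text profile location to coordinates
        "profile_geo": geocode(tweet["user_location"]),
    }

record = enrich(
    {"text": "Visiting Tuscany!", "user_location": "Milano"},
    [("Tuscany", "http://dbpedia.org/resource/Tuscany")],
    kb_lookup=lambda uri: {"uri": uri, "type": "Location"},
    geocode=lambda loc: (45.46, 9.19),   # hypothetical coordinates
)
```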

Figure 3: TWINE Map View snapshot.

Figure 4: TWINE List View snapshot.

The Apache Kafka platform [5] permits us to store and process the data in a fault-tolerant way and to ignore the latency due to the Information Extraction processing.

Database. All the source and processed data are stored in a NoSQL database. In particular, we chose a MongoDB [6] database because of its flexibility, horizontal scalability and its representation format, which is particularly suitable for storing Twitter contents.

Frontend host and API. The presence of these two server-side modules is motivated by the need to make the TWINE user interface independent of its functionalities. In this way, we improve the modularity and flexibility of the entire system.

[5] https://kafka.apache.org/
[6] http://www.mongodb.org/

2.2 User Interface

TWINE provides two different visualisations of the extracted information: the Map View, which shows the different geo-tags associated with tweets in addition to the NEEL output, and the List View, which better emphasizes the relation between the text and its named entities.

The Map View (Figure 3) provides in the top panel a textual search bar where users can insert keywords related to their topic of interest (e.g. italy, milan, rome, venice). The user can also, from left to right, start and stop the stream fetching process, clear the current results, change View and apply semantic filters related to the geo-localization and KB resource characteristics, i.e. type and classification confidence score.

Then, in the left-hand panel the user can read the content of each fetched tweet (text, user information and recognized named entities) and directly open it in the Twitter platform.
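Returning to the Message Broker module of Section 2.1, its decoupling role can be mimicked in-process with a thread-safe queue: the fetching stage produces tweets at full speed while a slower Information Extraction stage consumes them independently. This is only an illustrative sketch, not TWINE's actual Kafka-based implementation; `str.upper` stands in for the NEEL processing.

```python
import queue
import threading

broker = queue.Queue()   # in-process stand-in for the message broker

def fetcher(tweets):
    for t in tweets:
        broker.put(t)    # producer is never blocked by the slow consumer
    broker.put(None)     # sentinel marks the end of the stream

def ie_worker(results):
    while (t := broker.get()) is not None:
        results.append(t.upper())   # stand-in for NEEL + enrichment

results = []
worker = threading.Thread(target=ie_worker, args=(results,))
worker.start()
fetcher(["tweet one", "tweet two"])
worker.join()            # results now holds the processed stream
```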

The center panel can be further divided into two sub-panels: the top one shows the information about the Knowledge Base resources related to the linked named entities present in the tweets (image, textual description, type as symbol and the classification confidence score), and the bottom one provides the list of the recognized named entities for which no correspondence exists in the KB, i.e. NIL entities.

These two panels, the one that reports the tweets and the one with the recognized and linked KB resources, are responsive. For example, by clicking on the entity Italy in the middle panel, only tweets containing a mention of the entity Italy will be shown in the left panel. Respectively, by clicking on a tweet, the center panel will show only the related entities.

In the right-hand panel, the user can visualize the geo-tags extracted from the tweets: (i) the original geo-location where the post was emitted (green marker), (ii) the user-defined location from the user account's profile (blue marker) and (iii) the geo-location of the named entities extracted from the tweets, if the corresponding KB resource has latitude-longitude coordinates (red marker).

Finally, a text field is present at the top of the first two panels to filter the tweets and KB resources that match specific keywords.

The List View is reported in Figure 4. Differently from the Map View, the focus is on the link between the words, i.e. recognized named entities, and the corresponding KB resources. In the reported example, this visualisation is more intuitive for catching the meaning of Dolomites and Gnocchi thanks to a direct connection between the named entities and the snippet and the image of the associated KB resources.

3 Conclusion

We introduced TWINE, a system that provides an efficient real-time data analytics platform on streams of social media contents. The system is supported by a scalable and modular architecture and by an intuitive and interactive user interface.

As future work, we intend to implement a distributed solution in order to manage huge quantities of data faster and more easily. Additionally, the current integrated modules will be improved: the NEEL pipeline will be replaced by a multi-lingual and more accurate method, and the web interface will include more insights such as the user network information, a heatmap visualization and a time control filter.

References

Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A Greenwood, Diana Maynard, and Niraj Aswani. 2013. TwitIE: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 83–90.

Davide Caliano, Elisabetta Fersini, Pikakshi Manchanda, Matteo Palmonari, and Enza Messina. 2016. UNIMIB: Entity linking in tweets using Jaro-Winkler distance, popularity and coherence. In Proceedings of the 6th International Workshop on Making Sense of Microposts (#Microposts).

Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.

Shamanth Kumar, Geoffrey Barbier, Mohammad Ali Abbasi, and Huan Liu. 2011. TweetTracker: An analysis tool for humanitarian and disaster relief. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media.

Shamanth Kumar, Fred Morstatter, and Huan Liu. 2014. Twitter data analytics. Springer.

Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on , pages 107–110.

Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Chen Lu, Hemant Purohit, Gary Alan Smith, and Wenbo Wang. 2014. Twitris: A system for collective social intelligence. In Encyclopedia of Social Network Analysis and Mining, pages 2240–2253. Springer.

Xiubo Zhang, Stephen Kelly, and Khurshid Ahmad. 2016. The Slandail monitor: Real-time processing and visualisation of social media data for emergency management. In Proceedings of the 11th International Conference on Availability, Reliability and Security, pages 786–791.

Alto: Rapid Prototyping for Parsing and Translation

Johannes Gontrum Jonas Groschwitz∗ Alexander Koller∗ Christoph Teichmann∗ University of Potsdam, Potsdam, Germany / ∗ Saarland University, Saarbrücken, Germany [email protected] / ∗ {groschwitz|koller|teichmann}@coli.uni-saarland.de

Abstract

We present Alto, a rapid prototyping tool for new grammar formalisms. Alto implements generic but efficient algorithms for parsing, translation, and training for a range of monolingual and synchronous grammar formalisms. It can easily be extended to new formalisms, which makes all of these algorithms immediately available for the new formalism.

1 Introduction

Whenever a new grammar formalism for natural language is developed, there is a prototyping phase in which a number of standard algorithms for the new formalism must be worked out and implemented. For monolingual grammar formalisms, such as (probabilistic) context-free grammar or tree-adjoining grammar, these include algorithms for chart parsing and parameter estimation. For synchronous grammar formalisms, we also want to decode inputs into outputs, and binarizing grammars becomes nontrivial. Implementing these algorithms requires considerable thought and effort for each new grammar formalism, and can lead to faulty or inefficient prototypes. At the same time, there is a clear sense that these algorithms work basically the same across many different grammar formalisms, and change only in specific details.

In this demo, we address this situation by introducing Alto, the Algebraic Language Toolkit. Alto is based on Interpreted Regular Tree Grammars (IRTGs; Koller and Kuhlmann (2011)), which separate the derivation process (described by probabilistic regular tree grammars) from the interpretation of a derivation tree into a value of the language. In this way, IRTGs can capture a wide variety of monolingual and synchronous grammar formalisms (see Fig. 1 for some examples). By selecting an appropriate algebra in which the values of the language are constructed, IRTGs can describe languages of objects that are not strings, including string-to-tree and tree-to-tree mappings, which have been used in machine translation, and synchronous hyperedge replacement grammars, which are being used in semantic parsing.

One advantage of IRTGs is that a variety of algorithms, including the ones listed above, can be expressed generically in terms of operations on regular tree grammars. These algorithms apply identically to all IRTG grammars, and Alto offers optimized implementations. Only an algebra-specific decomposition operation is needed for each new algebra. Thus prototyping for a new grammar formalism amounts to implementing an appropriate algebra that captures the new interpretation operations. All algorithms in Alto then become directly available for the new formalism, yielding an efficient prototype at a much reduced implementation effort.

Alto is open source and regularly adds new features. It is available via its website: https://bitbucket.org/tclup/alto.

2 An example grammar

Let us look at an example to illustrate the Alto workflow. We will work with a synchronous string-to-graph grammar, which Alto's GUI displays as in Fig. 2. The first and second column describe a weighted regular tree grammar (wRTG, Comon et al. (2007)), which specifies how to rewrite nonterminal symbols such as S and NP recursively in order to produce derivation trees. For instance, the tree shown in the leftmost panel of Fig. 3 can be derived using this grammar, starting with the start symbol S, and is assigned a weight (= probability) of 0.24. These derivation trees

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 29–32. © 2017 Association for Computational Linguistics

Formalism                               Reference                     Algebra(s)
Context-Free Grammars (CFGs)            (Hopcroft and Ullman, 1979)   String
Hyperedge Replacement Grammars (HRGs)   (Chiang et al., 2013)         Graph
Tree Substitution Grammars              (Sima'an et al., 1994)        String / Tree
Tree-Adjoining Grammars (TAGs)          (Joshi et al., 1975)          TAG string / TAG tree
Synchronous CFGs                        (Chiang, 2007)                String / String
Synchronous HRGs                        (Chiang et al., 2013)         String / Graph
String-to-Tree Transducers              (Galley et al., 2004)         String / Tree
Tree-to-Tree Transducers                (Graehl et al., 2008)         Tree / Tree

Figure 1: Some grammar formalisms that Alto can work with.

serve as abstract syntactic representations, along the lines of derivation trees in TAG.

Next, notice the column "english" in Fig. 2. This column describes how to interpret the derivation tree in an interpretation called "english". It first specifies a tree homomorphism, which maps derivation trees into terms over some algebra by applying certain rewrite rules bottom-up. For instance, the "boy" node in the example derivation tree is mapped to the term t1 = *(the, boy). The entire subtree then maps to a term of the form *(?1, *(wants, *(to, ?2))), as specified by the row for "wants2" in the grammar, where ?1 is replaced by t1 and ?2 is replaced by the analogous term for "go". The result is shown at the bottom of the middle panel in Fig. 3. Finally, this term is evaluated in the underlying algebra; in this case, a simple string algebra, which interprets the symbol * as string concatenation. Thus the term in the middle panel evaluates to the string "the boy wants to go", shown in the "value" field.

The example IRTG also contains a column called "semantics". This column describes a second interpretation of the derivation tree, this time into an algebra of graphs. Because the graph algebra is more complex than the string algebra, the function symbols look more complicated. However, the general approach is exactly the same as before: the grammar specifies how to map the derivation tree into a term (bottom of the rightmost panel in Fig. 3), and then this term is evaluated in the respective algebra (here, the graph shown at the top of the rightmost panel).

Thus the example grammar is a synchronous grammar which describes a relation between strings and graphs. When we parse an input string w, we compute a parse chart that describes all grammatically correct derivation trees that interpret to this input string. We do this by computing a decomposition grammar for w, which describes all terms over the string algebra that evaluate to w; this step is algebra-specific. From this, we calculate a regular tree grammar for all derivation trees that the homomorphism maps into such a term, and then intersect it with the wRTG. These operations can be phrased in terms of generic operations on RTGs; implementing these efficiently is a challenge which we have tackled in Alto. We can compute the best derivation tree from the chart, and map it into an output graph. Similarly, we can also decode an input graph into an output string.

3 Algorithms in Alto

Alto can read IRTG grammars and corpora from files, and implements a number of core algorithms, including: automatic binarization of monolingual and synchronous grammars (Büchse et al., 2013); computation of parse charts for given input objects; computing the best derivation tree; computing the k-best derivation trees, along the lines of Huang and Chiang (2005); and decoding the best derivation tree(s) into output interpretations. Alto supports PCFG-style probability models with both maximum likelihood and expectation maximization estimation. Log-linear probability models are also available, and can be trained with maximum likelihood estimation. All of these functions are available through command-line tools, a Java API, and a GUI, seen in Fig. 2 and 3.

We have invested considerable effort into making these algorithms efficient enough for practical use. In particular, many algorithms for wRTGs in Alto are implemented in a lazy fashion, i.e. the rules of the wRTG are only calculated by need; see e.g. Groschwitz et al. (2015; 2016). Obviously, Alto cannot be as efficient for well-established tasks like PCFG parsing as a parser that was implemented and optimized for this specific grammar formalism.

Figure 2: An example IRTG with an English and a semantic interpretation (Alto screenshot).

Figure 3: A derivation tree with interpreted values (Alto screenshot).

Nonetheless, Alto is fast enough for practical use with treebank-scale grammars, and for less mainstream grammar formalisms it can be faster than specialized implementations for these formalisms. For instance, Alto is the fastest published parser for Hyperedge Replacement Grammars (Groschwitz et al., 2015). Alto contains multiple algorithms for computing the intersection and inverse homomorphism of RTGs, and a user can choose the combination that works best for their particular grammar formalism (Groschwitz et al., 2016).

The most recent version adds further performance improvements through the use of a number of pruning techniques, including coarse-to-fine parsing (Charniak et al., 2006). With these, Section 23 of the WSJ corpus can be parsed in a couple of minutes.

4 Extending Alto

As explained above, Alto can capture any grammar formalism whose derivation trees can be described with a wRTG, by interpreting these into different algebras. For instance, the difference between Context-Free and Tree-Adjoining Grammars in Alto is that CFGs use the simple string algebra outlined in Section 2, whereas for TAG we use a special "TAG string algebra" which defines string wrapping operations (Koller and Kuhlmann, 2012). All algorithms mentioned in Section 3 are generic and do not make any assumptions about which algebras are being used. As explained above, the only algebra-specific step is to compute decomposition grammars for input objects.

In order to implement a new algebra, a user of Alto simply derives a class from the abstract base class Algebra, which amounts to specifying the possible values of the algebra (as a Java class) and implementing the operations of the algebra as Java methods. If Alto is also to parse objects from this algebra, the class needs to implement a method for computing decomposition grammars for the algebra's values. Alto comes with a number of algebras built in, including string algebras for Context-Free and Tree-Adjoining grammars as well as tree, set, and graph algebras. All of these can be used in parsing. By parsing sets in a set-to-string grammar, for example, Alto can generate referring expressions (Engonopoulos and Koller, 2014).
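In Python terms (purely illustrative; Alto's actual base class is the Java class Algebra, and all names here are assumptions), the extension pattern looks roughly like this. The `decompose` method enumerates the terms that evaluate to a given value, which is the one algebra-specific ingredient parsing needs.

```python
from abc import ABC, abstractmethod

class Algebra(ABC):
    """Sketch of the extension point: subclass and implement the operations."""
    @abstractmethod
    def evaluate(self, op, args):
        """Apply operation `op` of the algebra to already-evaluated args."""

class StringAlgebra(Algebra):
    def evaluate(self, op, args):
        if op == "*":                  # binary concatenation with a space
            return args[0] + " " + args[1]
        return op                      # nullary: constants denote themselves

    def decompose(self, value):
        """Enumerate the "*"-terms that evaluate to `value`; needed only if
        objects of this algebra are to be parsed."""
        words = value.split()
        if len(words) == 1:
            yield words[0]
        for i in range(1, len(words)):
            for left in self.decompose(" ".join(words[:i])):
                for right in self.decompose(" ".join(words[i:])):
                    yield ("*", left, right)
```

Note that `decompose` enumerates every bracketing of the input, which is exactly why decomposition is represented compactly as a grammar in Alto rather than as an explicit list.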

Finally, Alto has a flexible system of input and output codecs, which can map grammars and algebra values to string representations and vice versa. A key use of these codecs is reading grammars in native input format and converting them into IRTGs. Users can provide their own codecs to maximize interoperability with existing code.

5 Conclusion and Future Work

Alto is a flexible tool for the rapid implementation of new formalisms. This flexibility is based on a division of concerns between the generation and the interpretation of grammatical derivations. We hope that the research community will use Alto on newly developed formalisms and on novel combinations of existing algebras. Further, the toolkit's focus on balancing generality with efficiency supports research using larger datasets and grammars.

In the future, we will implement algorithms for the automatic induction of IRTG grammars from corpora, e.g. string-to-graph corpora, such as the AMRBank, for semantic parsing (Banarescu et al., 2013). This will simplify the prototyping process for new formalisms even further, by making large-scale grammars for them available more quickly. Furthermore, we will explore ways of incorporating neural methods into Alto, e.g. in terms of supertagging (Lewis et al., 2016).

Acknowledgements

This work was supported by DFG grant KO 2916/2-1.

References

L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop.

M. Büchse, A. Koller, and H. Vogler. 2013. Generic binarization for parsing and translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

E. Charniak, M. Johnson, M. Elsner, J. Austerweil, D. Ellis, I. Haxton, C. Hill, R. Shrivaths, J. Moore, M. Pozar, and T. Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL.

D. Chiang, J. Andreas, D. Bauer, K. M. Hermann, B. Jones, and K. Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, M. Tommasi, and C. Löding. 2007. Tree Automata Techniques and Applications. Published online: http://tata.gforge.inria.fr/.

N. Engonopoulos and A. Koller. 2014. Generating effective referring expressions using charts. In Proceedings of INLG and SIGDIAL 2014.

M. Galley, M. Hopkins, K. Knight, and D. Marcu. 2004. What's in a translation rule? In HLT-NAACL 2004: Main Proceedings.

J. Graehl, K. Knight, and J. May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.

J. Groschwitz, A. Koller, and C. Teichmann. 2015. Graph parsing with s-graph grammars. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

J. Groschwitz, A. Koller, and M. Johnson. 2016. Efficient techniques for parsing with tree automata. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

J. E. Hopcroft and J. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts.

L. Huang and D. Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology.

A. K. Joshi, L. S. Levy, and M. Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1):136–163.

A. Koller and M. Kuhlmann. 2011. A generalized view on parsing and translation. In Proceedings of the 12th International Conference on Parsing Technologies.

A. Koller and M. Kuhlmann. 2012. Decomposing TAG algorithms using simple algebraizations. In Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11).

M. Lewis, K. Lee, and L. Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of NAACL-HLT.

K. Sima'an, R. Bod, S. Krauwer, and R. Scha. 1994. Efficient disambiguation by means of stochastic tree substitution grammars. In Proceedings of the International Conference on New Methods in Language Processing.

CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface

Tiberiu Boros, Stefan Daniel Dumitrescu and Sonia Pipa
Research Center for Artificial Intelligence, Romanian Academy, Bucharest, Romania
[email protected], [email protected], [email protected]

Abstract

Voice-enabled human-computer interfaces (HCI) that integrate automatic speech recognition, text-to-speech synthesis and natural language understanding have become a commodity, introduced by the immersion of smart phones and other gadgets in our daily lives. Smart assistants are able to respond to simple queries (similar to text-based question-answering systems), perform simple tasks (call a number, reject a call, etc.) and help organize appointments. With this paper we introduce a newly created process automation platform that enables the user to control applications and home appliances and to query the system for information using a natural voice interface. We offer an overview of the technologies that enabled us to construct our system and we present different usage scenarios in home and office environments.

1 Introduction

Major mobile developers currently include some form of personal assistant in their operating systems, which enables users to control their device using natural voice queries (see Google Now in Android, Siri in Apple's iOS, Microsoft's Cortana, etc.). While these assistants offer powerful integration with the device, they are restricted to that device only: the user (a) is able to control the device, (b) has access to information from simple queries (QA systems included by major competitors in the mobile OS world are able to understand and automatically summarize answers to queries such as "how is the weather today?", "who is the president of the United States?" or "who was Michael Jackson?") and (c) can organize his/her agenda according to the documents stored in his/her own cloud-hosted storage (Google is able to parse and obtain information from plane tickets and bookings and automatically provides calendar entries as well as general tips such as "your plane leaves tomorrow at 11 AM and you should be at the airport before 10 AM due to traffic").

We introduce our natural-voice assistive system for process automation, which enables control of external devices/gadgets as well as allowing the user to interact with the system in the form of information queries, much like a personal assistant existing now on mobile devices. The user interacts with the system using his/her voice; the system understands and acts accordingly by controlling external devices or by responding to the user with the requested information. All the work presented in this paper was done during the implementation of the Assistive Natural-language, Voice-controlled System for Intelligent Buildings (ANVSIB) national project. We cover aspects related to system architecture, challenges involved in individual tasks, performance figures of the sub-modules and technical decisions related to the development of a working prototype. The tools and technologies are divided into three main topics: (a) automatic speech recognition (ASR); (b) text-to-speech synthesis (TTS); and (c) integration with home automation services.

2 System architecture

From a logical point of view, the system is divided into three components (each presented in this paper): ASR, TTS, and integration with external services using natural language understanding (NLU). From a procedural point of view, the system has two distinct entities: one entity acts as an endpoint (client) and is implemented as an Android application that is responsible for (a) acquisition of

speech data from the user, processing and recognition (all using external services), as well as (b) the presentation of data to the user. The second entity acts as a server and is responsible for receiving text input from the endpoint, identifying a scenario from a limited set, extracting parameters, performing the necessary operations and returning information to the endpoint in the form of text, images and sound.

Because we wanted to keep our system as scalable as possible, automatic speech recognition and text-to-speech synthesis operations are carried out on external servers. The endpoint can be configured to access both Cassandra's own ASR and TTS services (described here) and the ASR and TTS servers provided by the Google Speech API.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 33–36. © 2017 Association for Computational Linguistics.

2.1 Cassandra's TTS synthesis system

Text-to-speech synthesis refers to the conversion of any arbitrary (unrestricted) text into an audio signal. The unrestricted requirement makes this task difficult and, while state-of-the-art systems produce remarkable results in terms of intelligibility and naturalness, the recipe for producing synthetic voices which are indistinguishable from natural ones has not yet been found. This limitation is caused by the fact that natural language understanding still poses serious challenges for machines and by the fact that the surface form of the text does not provide sufficient cues for the prosodic realization of a spontaneous and expressive voice (Taylor, 2009).

TTS synthesis involves two major steps: (a) extraction of features from text and (b) conversion of symbolic representations into actual speech. The text processing step (step a) is usually composed of low-level text-processing tasks such as part-of-speech tagging, lemmatization, chunking, letter-to-sound conversion, syllabification, etc. The signal processing task (step b) consists of selecting an optimal set of speech parameters (given the features provided by step a) and generating an acoustic signal that best fits these parameters. Text processing is performed by our in-house natural language processing pipeline called Modular Language Processing for Lightweight Applications (MLPLA) (Zafiu et al., 2015). Most of the individual modules have been thoroughly described in our previous work (Boros, 2013; Boros and Dumitrescu, 2015), so we only briefly list them here: the part-of-speech tagger is a neurally inspired approach (Boros et al., 2013b) achieving 98.21% accuracy on the "1984" novel by G. Orwell using morphosyntactic descriptors (MSDs) (Erjavec, 2004) in the tag-set; all the lexical processing modules were described in (Boros, 2013). For syllabification we used the onset-nucleus-coda (ONC) tagging strategy proposed in (Bartlett et al., 2009), and chunking is performed using a POS-based grammar described in (Ion, 2007).

Our TTS implements both unit-selection (Boros et al., 2013a) and statistical parametric speech synthesis. For Cassandra we chose to use parametric synthesis with our implementation of the STRAIGHT filter (Kawahara et al., 1999). Our TTS system primarily supports English, German, French and Romanian and can be accessed for demonstration purposes from our web portal (link will be provided after evaluation). For other languages Cassandra uses Google TTS, which provides statistical parametric speech synthesis for a large number of languages.

2.2 Cassandra's speech recognition

Cassandra's Automatic Speech Recognition (ASR) service was created by our partners, the Speech and Dialogue Research Laboratory (http://speed.pub.ro), and is a scalable and extensible on-line Speech-to-Text (S2T) solution. A demo version of the S2T system resides in the cloud or in SpeeD's IT infrastructure and can be accessed on-line using the Web API or a proprietary protocol. Client applications can be developed using any technology that is able to communicate through TCP/IP sockets with the server application. The server application can communicate with several clients, serving them either simultaneously or sequentially.

The speech-to-text system can be configured to transcribe different types of speech, from a number of domains and languages. It can be configured to instantiate multiple speech recognition engines (S2T Transcribers), each of these engines being responsible for transcribing speech from a specific domain (for example, TV news in Romanian, medical-related speech in Romanian, country names in English, etc.). The speech recognition engines are based on the open-source CMU Sphinx speech recognition toolkit. The ASR systems with small vocabulary and grammar language models were evaluated in depth in (Cucu

et al., 2015), while the ASR system with a large vocabulary and a statistical language model was evaluated in depth in (Cucu et al., 2014). The first ones have word error rates between 0 and 5%, while the large-vocabulary ASR has a word error rate of about 16%.

2.3 Natural language processing

The core of Cassandra is driven by a natural language processing system that is responsible for receiving a text and, after analysis and parameter extraction, acting based on a predefined scenario. A scenario is the equivalent of a frame in frame-based dialogue systems. At any time, the end-user is able to alter Cassandra's configuration by creating scenarios or editing existing ones. A scenario is defined by its name and three components:

Component 1 contains a list of example sentences with parameters preceded by a special character '$' which is used to identify them (e.g. "Set the temperature to $value degrees.").

Component 2 tells the system what actions to take, depending on the scenario. Actions are described in a JSON with the following structure: (a) gadget: a unique identifier telling the system what module must be used for processing (e.g. a light, an A/C unit, the security system, etc.); (b) gadget parameters: a free-form JSON structure that tells the system what parameters to pass on to the gadget (e.g. turn the lights on=1 or off=0, set the A/C to X degrees, etc.). The values of these parameters can be either constants, predefined system variables or actual values extracted from the text. A gadget can return a text response.

Component 3 is used for feedback. After processing, this JSON is returned to the endpoint which initiated the session and is used to relay information back to the user. It is also a JSON with two attributes: (a) friendly response: a text response that, if present, will be synthesized and played back to the user; the friendly response can include system variables, parameters extracted from the text or the response returned by the gadget; (b) launch intent: a structure that informs the endpoint that it must launch an external Android Intent with a given set of parameters; Cassandra comes with a predefined set of Intents for image preview, audio playback and video playback.

Scenario identification and parameter extraction are performed in three sequential steps: (a) the scenario is identified using Long Short-Term Memory neural networks and word embeddings; (b) parameter start/stop markers are then added using a deep neural network classifier trained on a window of 4 words; (c) next, parameter types are added using a window of six words, in which the parameters are ignored. Theoretically, parameter identification (c) and boundary detection (b) could be carried out simultaneously. However, we found that splitting this task in two steps works significantly better, mainly because it mitigates the data sparsity issue. An interesting observation is that using word embeddings in the scenario identification step allowed the system to identify the query "It's too hot" as an air conditioning activity, though the training data only contained examples such as "Set AC to $value", "Set the temperature to $value degrees", etc.

Because we wanted to keep the system as language-independent as possible, the only external resource required for language adaptation is a word embeddings file extracted using word2vec (Mikolov et al., 2014). The process is simple: given a large enough corpus (e.g. a Wikipedia dump), one can obtain good embeddings by running the word2vec tool on the tokenized text.

3 Standard gadgets and usage scenarios

At submission time we had already implemented four demo scenarios:

Scenario 1 is a standard query system, in which the user can ask Cassandra various questions (for example, "how is the weather") to which the system will respond using a Knowledge Base (KB). The KB was built using Wikipedia for Romanian and English.

Scenario 2 is a home automation system that allows the user to control home appliances using a natural voice interface. There are several predefined tasks such as multimedia, lighting, climate and security system control. Communication with these devices is performed through KNX (EN 50090, ISO/IEC 14543), a purpose-built standardized network communication protocol that is simple and scalable.

Scenario 3 is a set of hard-coded short questions/answers that increase Cassandra's appeal. For example, this Q/A set contains a game called "shrink". It enables the system to perform word associations in a manner similar to what a psychologist would ask a patient to do. This particular game proved to be very appealing to the users in

our test group, primarily because it made the system seem more "life-like".

Scenario 4 is a business-oriented demo. We integrated Cassandra with a Document Management System (DMS) and created an application that filters documents based on associated meta-data. The interaction is described using a large JSON in which all parameters are optional, missing parameters being ignored by the system. By doing so we are able to respond to queries such as "Give me all invoices", "Give me all invoices issued to Acme Computers", "Give me all invoices issued to Acme Computers that are due in two weeks", etc. We find this scenario particularly interesting from a commercial point of view, especially because it gives users access to the DMS without forcing them to master any specific search skills. In a similar way, an application could be built to give users access to databases and enable them to construct complex queries and reports with no SQL knowledge whatsoever: "Give me a list of customers that bought tablets over the last 6 months and group them by age".

4 Conclusions and future development

Cassandra is an open-source, freely available personal assistant and, to our knowledge, the only system that is extensible to several languages, not only English, with minimal effort. We intend to further develop this system and extend the basic set of applications to suit the most common usage scenarios, as well as to offer more complex NLP-powered business scenarios that integrate with various existing software and hardware implementations. A necessary next step in the near future is to create an open-source repository that will enable the creation of a community of developers around it.

Acknowledgements: This work was supported by UEFISCDI, under grant PN-II-PT-PCCA-2013-4-0789, project Assistive Natural-language, Voice-controlled System for Intelligent Buildings (2013-2017).

References

Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2009. On the syllabification of phonemes. In Proceedings of Human Language Technologies: North American Chapter of the Association for Computational Linguistics, pages 308–316. Association for Computational Linguistics.

Tiberiu Boros. 2013. A unified lexical processing framework based on the margin infused relaxed algorithm. A case study on the Romanian language. In RANLP, pages 91–97.

Tiberiu Boros and Stefan Daniel Dumitrescu. 2015. Robust deep-learning models for text-to-speech synthesis support on embedded devices. In Proceedings of the 7th International Conference on Management of Computational and Collective Intelligence in Digital EcoSystems, pages 98–102. ACM.

Tiberiu Boros, Radu Ion, and Stefan Daniel Dumitrescu. 2013a. The RACAI text-to-speech synthesis system.

Tiberiu Boros, Radu Ion, and Dan Tufis. 2013b. Large tagset labeling using feed forward neural networks. Case study on Romanian language. In ACL (1), pages 692–700.

Horia Cucu, Andi Buzo, Lucian Petrică, Dragoș Burileanu, and Corneliu Burileanu. 2014. Recent improvements of the SpeeD Romanian LVCSR system. In Communications (COMM), 2014 10th International Conference on, pages 1–4. IEEE.

Horia Cucu, Andi Buzo, and Corneliu Burileanu. 2015. The SpeeD grammar-based ASR system for the Romanian language. Romanian Journal of Information Science and Technology, 18(1):33–53.

Tomaz Erjavec. 2004. MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC.

Radu Ion. 2007. Word sense disambiguation methods applied to English and Romanian. PhD thesis, Romanian Academy, Bucharest.

Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigné. 1999. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3):187–207.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2014. word2vec.

Paul Taylor. 2009. Text-to-Speech Synthesis. Cambridge University Press.

Adrian Zafiu, Tiberiu Boros, and Stefan Daniel Dumitrescu. 2015. Modular language processing for lightweight applications. In Proceedings of the Language & Technology Conference.

An Extensible Framework for Verification of Numerical Claims

James Thorne, Department of Computer Science, University of Sheffield, UK, [email protected]
Andreas Vlachos, Department of Computer Science, University of Sheffield, UK, [email protected]

Abstract

In this paper we present our automated fact checking system demonstration, which we developed in order to participate in the Fast and Furious Fact Check challenge. We focused on simple numerical claims such as "population of Germany in 2015 was 80 million", which comprised a quarter of the test instances in the challenge, achieving 68% accuracy. Our system extends previous work on semantic parsing and claim identification to handle temporal expressions and knowledge bases consisting of multiple tables, while relying solely on automatically generated training data. We demonstrate the extensible nature of our system by evaluating it on relations used in previous work. We make our system publicly available so that it can be used and extended by the community.1

1 https://github.com/sheffieldnlp/numerical-fact-checking-eacl2017

Input: Around 80 million people were inhabitants of Germany in 2015.
Data Source: data/worldbank wdi.csv
Property: population(Germany, 2015)
Value: 81413145
Absolute Percentage Error: 1.7%
Verdict: TRUE

Figure 1: Fact checking a claim by matching it to an entry in the knowledge base.

1 Introduction

Fact checking is the task of assessing the truthfulness of spoken or written language. We are motivated by calls to provide tools to support journalists with resources to verify content at source (Cohen et al., 2011) or upon distribution. Manual verification can be too slow to verify information given the speed at which claims travel on social networks (Hassan et al., 2015a).

In the context of natural language processing research, the task of automated fact checking was discussed by Vlachos and Riedel (2014). Given a claim, a system for this task must determine what information is needed to support or refute the claim, retrieve the information from a knowledge base (KB) and then compute a deduction to assign a verdict. For example, for the claim of Figure 1 a system needs to recognize the named entity (Germany), the statistical property (population) and the year, link them to appropriate elements in a KB, and deduce the truthfulness of the claim using the absolute percentage error.

We contrast this task with rumour detection (Qazvinian et al., 2011), a similar prediction task based on language subjectivity and growth of readership through a social network. While these are important factors to consider, a sentence can be true or false regardless of whether it is a rumour (Lukasik et al., 2016).

Existing fact checking systems are capable of detecting fact-check-worthy claims in text (Hassan et al., 2015b), returning semantically similar textual claims (Walenz et al., 2014), and scoring the truth of triples on a knowledge graph through semantic distance (Ciampaglia et al., 2015). However, none of these is suitable for fact checking a claim made in natural language against a database. Previous works appropriate for this task operate on a limited domain and are not able to incorporate temporal information when checking time-dependent claims (Vlachos and Riedel, 2015).

In this paper we introduce our fact checking tool, describe its architecture and design decisions, evaluate its accuracy and discuss future work. We

highlight the ease of incorporating new information sources to fact check, which may be unavailable during training. To validate the extensibility of the system, we complete an additional evaluation of the system using claims taken from Vlachos and Riedel (2015). We make the source code publicly available to the community.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 37–40. © 2017 Association for Computational Linguistics.
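The deduction step shown in Figure 1 reduces to comparing the claimed value against the KB value by absolute percentage error. A minimal Python sketch of that check, using the values from Figure 1; the 10% threshold is an illustrative assumption, since the paper leaves the threshold user-defined:

```python
def absolute_percentage_error(claimed: float, reference: float) -> float:
    """|claimed - reference| / |reference|: the error measure used for the verdict."""
    return abs(claimed - reference) / abs(reference)


def verdict(claimed: float, reference: float, threshold: float = 0.10) -> bool:
    """Label the claim TRUE if the claimed value is within the APE threshold."""
    return absolute_percentage_error(claimed, reference) <= threshold


# Figure 1: claim "around 80 million" vs. KB value population(Germany, 2015) = 81413145
ape = absolute_percentage_error(80_000_000, 81_413_145)
print(f"APE: {ape:.1%}")
print("TRUE" if verdict(80_000_000, 81_413_145) else "FALSE")
```

With the Figure 1 values the error is about 1.7%, well inside the threshold, so the claim is labeled TRUE.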

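The other core step, described in Sections 2 and 3, is relation matching: a binary classifier scores each candidate Entity-Predicate-Value tuple against the claim and retains those with a non-negative score. The sketch below is a deliberately simplified stand-in: the actual system trains a scikit-learn logistic regression on distantly supervised data, whereas here the feature map and the weight vector THETA are hand-set for illustration, and the word-overlap feature mirrors the boolean indicator mentioned in Section 3.2. The candidate tuples and values are made up for the example.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a claim or predicate name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def features(claim: str, predicate: str) -> list:
    """Toy feature map phi(r, c): a bias feature plus an indicator for whether
    the predicate name shares any token with the claim."""
    overlap = 1.0 if tokens(claim) & tokens(predicate) else 0.0
    return [1.0, overlap]


# Hand-set weights standing in for the learned theta; the real system
# fits these with logistic regression on distantly supervised examples.
THETA = [-0.5, 1.0]


def retained(claim, candidates):
    """Keep tuples r_i whose score theta . phi(r_i, c) is >= 0, as in Section 3.1."""
    keep = []
    for entity, predicate, value in candidates:
        score = sum(w * f for w, f in zip(THETA, features(claim, predicate)))
        if score >= 0:
            keep.append((entity, predicate, value))
    return keep


claim = "Around 80 million people were inhabitants of Germany in 2015."
candidates = [("Germany", "population:2015", 81413145),
              ("Germany", "co2_emissions_kt", 719883)]  # second tuple is hypothetical
print(retained(claim, candidates))
```

Here only the population tuple survives the filter: its predicate shares the token "2015" with the claim, while the hypothetical co2_emissions_kt predicate shares nothing, so its score falls below zero.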
2 Design Considerations

We developed our fact-checking approach in the context of the HeroX challenge2 – a competition organised by the fact checking organization FullFact3. The types of claims the presented system can fact check were restricted to those which require looking up a value in a KB, similar to the one in Figure 1. To learn a model to perform the KB look-up (essentially a semantic parsing task), we extend the work of Vlachos and Riedel (2015), who used distant supervision (Mintz et al., 2009) to generate training data, obviating the need for manual labeling. In particular, we extend it to handle simple temporal expressions in order to fact check time-dependent claims appropriately, i.e. population in 2015. While the recently proposed semantic parser of Pasupat and Liang (2015) is also able to handle temporal expressions, it makes the assumption that the table against which the claim needs to be interpreted is known, which is unrealistic in the context of fact checking.

2 http://herox.com/factcheck
3 https://fullfact.org

Furthermore, the system we propose can predict relations from the KB on which the semantic parser has not been trained, a paradigm referred to as zero-shot learning (Larochelle et al., 2008). We achieve this by learning a binary classifier that assesses how well the claim "matches" each relation in the KB. Finally, another consideration in our design is algorithmic accountability (Diakopoulos, 2016), so that the predictions and the decision process used by the system are interpretable by a human.

3 System Overview

Given an unverified statement, the objective of this system is to identify a KB entry to support or refute the claim. Our KB consists of a set of un-normalised tables that have been translated into simple Entity-Predicate-Value triples through a simple set of rules. In what follows, we first describe the fact checking process used during testing (Section 3.1) and then how the relation matching module is trained and the features used for this purpose (Section 3.2).

Figure 2: Relation matching step and filtering.

3.1 Fact Checking

In our implementation, fact checking is a three-step process, illustrated in Figure 2. Firstly, we link named entities in the claim to entities in our KB and retrieve a set of tuples involving the entities found. Secondly, these entries are filtered in a relation matching step. Using the text in the claim and the predicate as features, we classify whether each tuple is relevant. And finally, the values in the matched triples from the KB are compared to the value in the statement to deduce the verdict as follows: if there is at least one value with absolute percentage error lower than a threshold defined by the user, the claim is labeled true, otherwise false.

We model relation matching as a binary classification task using logistic regression implemented in scikit-learn, predicting whether a predicate is a match to the given input claim. The aim of this step is to retain tuples for which the predicate in the Entity-Predicate-Value tuple can be described by the surface forms present in the

claim (positive class) and discard the remainder (negative class). We chose not to model this as a multi-class classification task (one class per predicate) to improve the extensibility of the system. A multi-class classifier requires training instances for every class, and thus would not be applicable to predicates not seen during training. Instead, our aim is to predict the compatibility of a predicate w.r.t. the input claim.

For each of the candidate tuples ri ∈ R, a feature vector is generated using lexical and syntactic features present in the claim and relation: φ(ri, c). This feature vector is input to a logistic regression binary classifier and we retain all tuples T where θᵀφ(ri, c) ≥ 0.

3.2 Training Data Generation

The training data for relation matching is generated using distant supervision and the Bing search engine. We first read the table and apply a set of simple rules to extract (subject, predicate, object) tuples, and for each named entity and numeric value we generate a query containing the entity name and the predicate. For example, the entry (Germany, Population:2015, 81413145) is converted to the query "Germany" Population 2015. The queries are then executed on the Bing search engine and the top 50 web-page results are retained. We extract the text from the webpages using a script built around the BeautifulSoup package.4 This text is parsed and annotated with co-reference chains using the Stanford CoreNLP pipeline (Manning et al., 2014).

4 https://www.crummy.com/software/BeautifulSoup/

Each sentence containing a mention of an entity and a number is used to generate a training example. The examples are labeled as follows. If the absolute percentage error between the value in the KB and the number extracted from the text is below a threshold (an adjustable hyperparameter), the training instance is marked as a positive instance. Sentences which contain a number outside of this threshold are marked as negative instances. We make an exception for numbers tagged as dates, where an exact match is required.

For each claim, the feature generation function φ(r, c) outputs lexical and syntactic features from the claim and a set of custom indicator variables. Additionally, we include a bias feature for every unique predicate in the KB. For our lexical and syntactic features, we consider the words in the span between the entity and the number as well as the dependency path between the entity and the number. To generalise to unseen relations, we include the ability to add custom indicator functions; for our demonstration we include a simple boolean feature indicating whether the intersection of the words in the predicate name and the sentence is not empty.

4 Evaluation

The system was field-tested in the HeroX fact checking challenge: 40 general-domain claims chosen by journalists. Given the design of our system, we were restricted to claims that can only be answered by a KB look-up (11 claims in total), returning the correct truth assessment for 7.5 of them according to the human judges. The half-correct one was due to not providing a fully correct explanation. To fact check a statement, the user only needs to enter the claim into a text box. The only requirement is to provide the system with an appropriate KB. Thus we ensured that the system can readily incorporate new tables taken from encyclopedic sources such as Wikipedia and the World Bank. In our system, this step is achieved by simply importing a CSV file and running a script to generate the new instances to train the relation matching classifier.

Analysis of our entry to this competition showed that two errors were caused by incorrect initial source data and one partial error was caused by recalling a correct property but making an incorrect deduction. Of the numerical claims that we did not attempt, we observed that many required looking up multiple entries and performing a more complex deduction step, which was beyond the scope of this project.

We further validate the system by evaluating its ability to make veracity assessments on simple numerical claims from the data set collected by Vlachos and Riedel (2015). Of the 4,255 claims about numerical properties of countries and geographical areas in this data set, our KB contained information to fact check 3,418. The system presented recalled KB entries for 3,045 claims (89.1%). We observed that the system was consistently unable to fact check two properties (undernourishment and renewable freshwater per capita). Analysis of these failure cases revealed too great a lexical difference

between the test claims and the training data our system generated; the claims in the test cases were comparative in nature (e.g., country X has a higher rate of undernourishment than country Y), whereas the training data generated using the method described in Section 3.2 are absolute claims.

A high number of false positive matches were generated; e.g., for a claim about population, other irrelevant properties were also recalled. For the 3,045 matched claims, 17,770 properties were matched from the KB with a score greater than or equal to the logistic score of the correct property. This means that for every claim there were, on average, 5.85 incorrect properties also extracted from the KB. In our case, this did not yield false positive assignment of truth labels to the claims, because the absolute percentage error between the incorrectly retrieved properties and the claimed value was outside the threshold we defined; thus these incorrectly retrieved properties never resulted in a true verdict, allowing the correct one to determine the verdict.

5 Conclusions and Future Work

The core capability of the system demonstration we presented is to fact check natural language claims against relations stored in a KB. Although the range of claims is limited, the system is a field-tested prototype and has been evaluated on a published data set (Vlachos and Riedel, 2015) and on real-world claims presented as part of the HeroX fact checking challenge. In future work, we will extend the semantic parsing technique used and apply our system to more complex claim types. Additionally, further work is required to reduce the number of candidate relations recalled from the KB. While this was not an issue in our case, we believe that ameliorating this issue will enhance the ability of the system to assign a correct truth label where there exist properties with similar numerical values.

Acknowledgements

This research is supported by a Microsoft Azure for Research grant. Andreas Vlachos is supported by the EU H2020 SUMMA project (grant agreement number 688139).

References

Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M. Rocha, Johan Bollen, Filippo Menczer, and Alessandro Flammini. 2015. Computational fact checking from knowledge networks. PLoS ONE, 10(6):e0128193.

Sarah Cohen, Chengkai Li, Jun Yang, and Cong Yu. 2011. Computational journalism: A call to arms to database researchers. In CIDR, volume 2011, pages 148–151.

Nicholas Diakopoulos. 2016. Accountability in algorithmic decision making. Communications of the ACM, 59(2):56–62.

Naeemul Hassan, Bill Adair, James T. Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015a. The quest to automate fact-checking.

Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015b. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1835–1838. ACM.

Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. 2008. Zero-data learning of new tasks. In AAAI, volume 1, page 3.

Michal Lukasik, Kalina Bontcheva, Trevor Cohn, Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2016. Using Gaussian processes for rumour stance classification in social media. arXiv preprint arXiv:1609.01962.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL System Demonstrations, pages 55–60.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, pages 1003–1011.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of EMNLP, pages 1589–1599.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. ACL 2014, page 18.

Andreas Vlachos and Sebastian Riedel. 2015. Identification and verification of simple claims about statistical properties. In Proceedings of EMNLP.

Brett Walenz, You Will Wu, Seokhyun Alex Song, Emre Sonmez, Eric Wu, Kevin Wu, Pankaj K. Agarwal, Jun Yang, Naeemul Hassan, Afroza Sultana, et al. 2014. Finding, monitoring, and checking claims computationally based on structured data. In Computation+Journalism Symposium.

ADoCS: Automatic Designer of Conference Schedules

Diego Vallejo1, Paulina Morillo2, César Ferri3
1Universidad San Francisco de Quito, Department of Mathematics, Quito, Ecuador, [email protected]
2Universidad Politécnica Salesiana, Research Group IDEIAGEOCA, Quito, Ecuador, [email protected]
3Universitat Politècnica de València, DSIC, València, Spain, [email protected]

Abstract

Distributing papers into sessions in scientific conferences is a task consisting in grouping papers with common topics while considering the size restrictions imposed by the conference schedule. This problem can be seen as a semi-supervised clustering of scientific papers based on their features. This paper presents a web tool called ADoCS that solves the problem of configuring conference schedules by an automatic clustering of articles by similarity, using a new algorithm that considers size constraints.

1 Introduction

Cluster analysis has the objective of dividing data objects into groups, so that objects within the same group are very similar to each other and different from objects in other groups (Tan et al., 2005). Semi-supervised clustering methods try to increase the performance of unsupervised clustering algorithms by using limited amounts of supervision in the form of labelled data or constraints (Basu et al., 2004). These constraints are usually restrictions of size or relations of belonging of objects to the clusters. These membership restrictions have been incorporated into the clustering process by different works (Zhu et al., 2010; Zhang et al., 2014; Ganganath et al., 2014; Grossi et al., 2015), showing that in this context semi-supervised clustering methods obtain groupings that satisfy the initial restrictions.

On the other hand, in generic form, document clustering should be conceived as the partitioning of a document collection into several groups according to their content (Hu et al., 2008). A scientific article is a research paper published in specialised journals and conferences. Conferences are usually formed of various sessions of fixed size where the authors present their selected papers. These sessions are usually thematic and are arranged by the conference chair in manual and tedious work, especially when the number of papers is high. Organising the sessions of a conference can be seen as a problem of document clustering with size constraints.

In this work we present a web application called ADoCS for the automatic configuration of sessions in scientific conferences. The system applies a new semi-supervised clustering algorithm for grouping documents with size constraints. Recently, a similar approach has been described by (Škvorc et al., 2016). In that case the authors also use information from reviews to build the groups; however, this information is not always available.

2 Methodology

In this section we summarise the semi-supervised clustering algorithm of the ADoCS system.

The information about the papers in the conference is uploaded to the system by means of a simple csv file.
This csv represents a paper per row, and each row must contain, at least, three columns: Title, Keywords and Abstract. For the text preprocessing, NLP and information retrieval techniques are applied to obtain a dissimilarity matrix. We used a classical scheme for data preprocessing in documents: tokenization, stop-word removal and stemming (Jha, 2015).

To structure the dissimilarity matrix of titles and keywords, the Jaccard coefficient is applied, since these two elements usually have a small number of tokens. In the case of abstracts, a vector model with a cosine similarity index on a TF-IDF weighting matrix is used. The ADoCS web tool has these as default settings, although the parameters can be adjusted directly by the user.
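The dissimilarity scheme just described can be sketched in a few lines. This is a simplified Python illustration (whitespace tokenization, no stemming or stop-word removal), not the ADoCS implementation, which is written in R:

```python
import math

def jaccard_dissimilarity(a, b):
    """1 - (intersection / union) over token sets; suits short fields
    such as titles and keyword lists."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def tfidf_cosine_dissimilarity(docs):
    """Pairwise 1 - cosine over TF-IDF vectors; suits longer fields
    such as abstracts."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}
    vecs = []
    for doc in tokenized:
        tf = {t: doc.count(t) for t in set(doc)}
        vecs.append([tf.get(t, 0) * idf[t] for t in vocab])
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[1.0 - cos(u, v) for v in vecs] for u in vecs]
```

A weighted average of the per-field dissimilarities then yields the single distance matrix the clustering step consumes.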

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 41–44. © 2017 Association for Computational Linguistics

The bag-of-words, or vector model, representation derived from the texts configures a Euclidean space where several distances can be applied in order to estimate similarities between elements. Nevertheless, in some cases we need to average several criteria and unify them into a single metric that quantifies dissimilarities, from which we can derive a distance matrix. This is the case we address here, since for each paper we have three different features: title, abstract and keywords. In these situations we cannot directly apply clustering methods based on centroids, such as K-Means, since there is no Euclidean space defined for the elements. One way to solve this type of problem is to apply algorithms that use only the dissimilarity or distance matrix as input to the clustering process. ADoCS works with a new algorithm called CSCLP (Clustering algorithm with Size Constraints and Linear Programming) (Vallejo, 2016), which uses only two inputs: the size constraints of the sessions and the dissimilarity/distance matrix.

Clustering algorithms obtain better results if a proper method for selecting initial points is used (Fayyad et al., 1998). For this reason, the initial points in our clustering algorithm are chosen using a popular method, the Buckshot algorithm (Cutting et al., 1992). In CSCLP, the initial points are used as pairwise constraints (cannot-link constraints, in semi-supervised clustering terminology) for the formation of the clusters, and binary integer linear programming (BILP) determines the membership and assignment of the instances to the clusters, satisfying the size constraints of the sessions. In this way, the original clustering problem with size constraints becomes an optimisation problem. Details of the algorithm can be found in (Vallejo, 2016).

3 ADoCS Tool

In this section we include a description of the ADoCS tool. A web version of the tool can be found at https://ceferra.shinyapps.io/ADoCS.

On the left part of the web interface we find a panel where we can upload a csv file containing the information of the papers to be clustered. In the panel, there are several controls for configuring features of this csv file: concretely, the separator of fields (comma by default) and how literals are parsed (single quote by default). Additionally, we find three control bars (Title, Keywords and Abstract) with values between 0 and 1 that establish the weight of each of these factors in the computation of distances between papers. By default, the three bars are set to 0.33 to indicate that the three factors have the same weight in computing the distances. The values of the weights are normalised so that they always sum to 1. The user can also configure whether the TF-IDF transformation is applied or not, as well as the metric employed to compute the distance between elements. These controls are responsive, i.e., when the user modifies one of the values, the distance matrix is recomputed, along with all the components that depend on this matrix.

Once the file is correctly uploaded, the application enables the function tabs that give access to the functionality of the web system. There are five application tabs:

• Papers: This tab contains information about the dataset. We include here the list of papers; for each paper, we show the number, Title, Keywords and Abstract. In order to improve the visualisation, a check box can be employed to show additional information about the papers.

• Dendrogram: In this part, a dendrogram generated from the distance matrix is shown. The distance between papers is computed considering the weights selected by the user and the methodology detailed in Section 2.

• MDS: In this tab, a Multidimensional Scaling algorithm is employed over the distance matrix to generate a 2D plot of the similarity of papers. Once the clusters are arranged, the membership of the papers to each cluster is denoted by colour.

• Wordmap: This application tab includes a word map representation showing the most popular terms extracted from the abstracts of the papers in the dataset.

• Schedule: In this part, the user can configure the number and size of the sessions and execute the CSCLP algorithm, described in Section 2, to build the groups according to the similarity between papers.
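The optimisation that CSCLP hands to a BILP solver can be illustrated with a tiny brute-force equivalent: assign each paper to a seed so that every cluster receives exactly its requested size at minimal total distance. This is an illustrative toy (exponential search, function names are ours), not the CSCLP implementation, and it omits the cannot-link constraints derived from the Buckshot seeds:

```python
from itertools import permutations

def assign_with_sizes(dist, seeds, sizes):
    """Exhaustively solve the size-constrained assignment that CSCLP poses
    as a BILP: label each of the n items with a cluster index so that
    cluster k gets exactly sizes[k] items and the total item-to-seed
    distance is minimal. Brute force; only for tiny illustrations."""
    n = len(dist)
    # Multiset of cluster labels respecting the requested sizes.
    base = [k for k, s in enumerate(sizes) for _ in range(s)]
    best, best_cost = None, float("inf")
    for labels in set(permutations(base)):
        cost = sum(dist[i][seeds[labels[i]]] for i in range(n))
        if cost < best_cost:
            best, best_cost = list(labels), cost
    return best, best_cost
```

In the real algorithm this search is replaced by an integer linear program, which scales to realistic conference sizes.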

The typical use of this tool starts by loading a csv file with the information of the papers. After the data is in the system, we can use the first tab, Papers, to know the number of papers in the conference. We can also observe specific information for all the papers (title, keywords, abstract or other information included in the file). The tabs Dendrogram, MDS and Wordmap can be used for exploring the similarities between papers and for discovering the most common terms in the papers of the conference. Finally, we can execute the semi-supervised algorithm to configure the sessions in the Schedule tab. In this tab, we can introduce the number of desired sessions as well as the distribution of the sizes of the sessions. For instance, if we have 40 papers and we want 5 sessions of 5 papers and 5 sessions of 3 papers, we introduce "10" in the Number of Sessions text box and "5,5,5,5,5,3,3,3,3,3" in the Size of Sessions text box. When the user pushes the compute button, if the sum of the number of papers in the distribution is correct (i.e., equal to the total number of papers in the conference), the algorithm of Section 2 is executed in order to find clusters that satisfy the size restrictions expressed by the session distribution. As a result of the algorithm, a list of the papers with the assigned cluster is shown in the tab. Finally, we can download the assignments as a csv file by pushing the Save csv button.

Figure 1: Example of the WordMap of the abstracts of the papers of the Twenty-Eighth Conference on Artificial Intelligence (AAAI-2014)

• proxy: The proxy package (Meyer and Buchta, 2016) allows computing distances between instances. The package contains implementations of popular distances, such as Euclidean, Manhattan or Minkowski.
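The consistency check performed before running the algorithm (a matching number of sessions, and sizes summing to the number of papers, as in the 40-paper example above) can be sketched as follows; this helper is illustrative, not part of ADoCS:

```python
def parse_session_plan(n_sessions_text, sizes_text, total_papers):
    """Validate the Number of Sessions and Size of Sessions text boxes:
    the size list must have exactly n_sessions entries and its entries
    must sum to the total number of papers in the conference."""
    n_sessions = int(n_sessions_text)
    sizes = [int(s) for s in sizes_text.split(",")]
    if len(sizes) != n_sessions:
        raise ValueError("size list length does not match the number of sessions")
    if sum(sizes) != total_papers:
        raise ValueError("session sizes must sum to the total number of papers")
    return sizes
```

Only when this check passes is the size-constrained clustering executed.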

4 Technology

The ADoCS tool has been implemented entirely in R (R Core Team, 2015), a free software language for statistical computing. There is a plethora of libraries that makes R a powerful environment for developing software related to data analysis in general, and computational linguistics in particular. Specifically, we have employed the following R libraries:

• tm: The tm package (Meyer et al., 2008) contains a set of functions for text mining in R. In this project we have used the functionalities related to text transformations (stemming, stop-words, TF-IDF, ...) with the English corpora.

• wordcloud: The wordcloud package (Fellows, 2014) includes functions to show the most common terms in a vocabulary in the form of the popular wordcloud plot. An example of this plot for the AAAI-2014 conference is included in Figure 1.

• Rglpk: The Rglpk package (Theussl and Hornik, 2016) is an R interface to the GNU Linear Programming Kit. GLPK is a popular open source kit for solving linear programming (LP), mixed integer linear programming (MILP) and other related optimisation problems.

The ADoCS source code and some datasets about conferences to test the tool can be found at https://github.com/dievalhu/ADoCS.

The graphical user interface has been developed by means of the Shiny package (Chang et al., 2016). This package constitutes a framework for building graphical applications in R. Shiny is a powerful package for converting basic R scripts into interactive web applications without requiring web programming knowledge. As a result, we obtain an interactive web application that can be executed locally using a web browser, or can be uploaded to a Shiny server where the

tool is available for general use. In this case, we have uploaded the application to the free server http://www.shinyapps.io/.

5 Conclusions and Future Work

Arranging papers to create an appropriate conference schedule, with sessions containing papers with common topics, is a tedious task, especially when the number of papers is high. Machine learning offers techniques that can automatise this task with the help of NLP methods for extracting features from the papers. In this context, organising a conference schedule can be seen as semi-supervised clustering. In this paper we have presented the ADoCS system, a web application that is able to create a set of clusters according to the similarity of the documents analysed. The groups are formed following the size distribution configured by the user. Although the application is initially focused on grouping conference papers, other related tasks in clustering documents with restrictions could be addressed thanks to the versatility of the interface (different metrics, TF-IDF transformation).

As future work, we are interested in developing conceptual clustering methods to extract topics from the created clusters.

Acknowledgments

This work has been partially supported by the EU (FEDER) and Spanish MINECO grant TIN2015-69175-C4-1-R, LOBASS, by MINECO in Spain (PCIN-2013-037) and by Generalitat Valenciana PROMETEOII/2015/013.

References

Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. 2004. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD conference, pages 59–68. ACM.

Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson, 2016. shiny: Web Application Framework for R. R package version 0.14.

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/gather: a cluster-based approach to browsing large document collections. Proceedings of the 15th annual international ACM SIGIR conference, pages 318–392.

Usama Fayyad, Cory Reina, and Paul S. Bradley. 1998. Initialization of iterative refinement clustering algorithms. Proceedings of ACM SIGKDD, pages 194–198.

Ian Fellows, 2014. wordcloud: Word Clouds. R package version 2.5.

Nuwan Ganganath, Chi-Tsun Cheng, and Chi K. Tse. 2014. Data clustering with cluster size constraints using a modified k-means algorithm. Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), International Conference, IEEE, pages 158–161.

Valerio Grossi, Anna Monreale, Mirco Nanni, Dino Pedreschi, and Franco Turini. 2015. Clustering formulation using constraint optimization. Software Engineering and Formal Methods, pages 93–107.

Guobiao Hu, Shuigeng Zhou, Jihong Guan, and Xiaohua Hu. 2008. Towards effective document clustering: a constrained k-means based approach. Information Processing & Management, 44(4):1397–1409.

Monica Jha. 2015. Document clustering using k-medoids. International Journal on Advanced Computer Theory and Engineering (IJACTE), 4(1):2319–2526.

David Meyer and Christian Buchta, 2016. proxy: Distance and Similarity Measures. R package version 0.4-16.

David Meyer, Kurt Hornik, and Ingo Feinerer. 2008. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54.

R Core Team, 2015. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Tadej Škvorc, Nada Lavrač, and Marko Robnik-Šikonja. 2016. Co-Bidding Graphs for Constrained Paper Clustering. In 5th Symposium on Languages, Applications and Technologies (SLATE'16), volume 51, pages 1–13.

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to data mining. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Stefan Theussl and Kurt Hornik, 2016. Rglpk: R/GNU Linear Programming Kit Interface. R package version 0.6-2.

Diego F. Vallejo. 2016. Clustering de documentos con restricciones de tamaño. Master's thesis, Higher Technical School of Computer Engineering, Polytechnic University of Valencia - UPV.

Shaohong Zhang, Hau-San Wong, and Dongqing Xie. 2014. Semi-supervised clustering with pairwise and size constraints. International Joint Conference on Neural Networks (IJCNN), IEEE, pages 2450–2457.

Shunzhi Zhu, Dingding Wang, and Tao Li. 2010. Data clustering with size constraints. Knowledge-Based Systems, 23(8):883–889.

A Web Interface for Diachronic Semantic Search in Spanish

Pablo Gamallo, Iván Rodríguez-Torres
Universidade de Santiago de Compostela, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), 15782 Santiago de Compostela, Galiza
[email protected], [email protected]

Marcos Garcia
Universidade da Coruña, LyS Group, Faculty of Philology, 15701 A Coruña, Galiza
[email protected]

Abstract

This article describes a semantic system based on distributional models obtained from a chronologically structured language resource, namely Google Books Syntactic Ngrams. The models were created using dependency-based contexts and a strategy for reducing the vector space, which consists in selecting the more informative and relevant word contexts. The system allows linguists to analyse the meaning change of Spanish words in the written language across time.

1 Introduction

Semantic changes of words are as common as other linguistic changes such as morphological or phonological ones. For instance, the word 'artificial' originally meant 'built by man' (having a positive polarity), but this word has recently changed its meaning when used in contrast with 'natural', acquiring in this context a negative polarity.

Search methods based on frequency (such as the Google Trends or Google Books Ngram Viewer search engines) are useful for finding words whose use significantly increases (or decreases) in a specific period of time, but these systems do not allow language experts to detect semantic changes such as the one mentioned above.

Taking the above into account, this demonstration paper describes a distributional-based system aimed at visualizing semantic changes of words across historical Spanish texts. The system relies on yearly distributional language models, learned from the Spanish Google Books Ngrams Corpus, so as to obtain word vectors for each year from 1900 to 2009. We compared the pairwise similarity of word vectors for each year and inserted this information in a non-structured database. Then, a web interface was designed to provide researchers with a tool, called Diachronic Explorer, to search for semantic changes. Diachronic Explorer offers different advanced searching techniques and different visualizations of the results.

Even if there exist similar distributional approaches for languages such as English, German, or French, to the best of our knowledge our proposal is the first work using this technique for Spanish. In addition, we provide the Hispanic community with a useful tool to do linguistic research. The system is open-source1 and includes a web interface2 which will be used in the Software Demonstration.

2 The Diachronic Explorer

The Diachronic Explorer relies on a set of distributional semantic models for the Spanish language, one for each year from 1900 to 2009. This semantic resource is stored in a NoSQL database which feeds a web server used for searching and visualizing lexical changes in dozens of thousands of Spanish words over time.

2.1 Architecture

Figure 1 shows the architecture of Diachronic Explorer. It takes as input a large corpus of dependency-based ngrams to build 110 distributional models, one per year from 1900 to 2009. Then, Cosine similarity is computed for all words in each model in order to generate CSV files consisting of pairs of similar words. Each word is associated with its N (where N = 20) most similar words according to the Cosine coefficient. These files are stored in MongoDB to be accessed by web

1 https://github.com/citiususc/explorador-diacronico
2 https://tec.citius.usc.es/explorador-diacronico/index.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 45–48. © 2017 Association for Computational Linguistics

Figure 1: Architecture of Diachronic Explorer

services which generate different types of word searching.

2.2 Yearly Models from Syntactic-Ngrams

To build the distributional semantic models over time, we made use of the dataset based on the Google Books Ngrams Corpus, which is described in detail in Michel et al. (2011). The whole project contains over 500 billion words (45B in Spanish), the majority of the content being published after 1900. More precisely, we used the Spanish syntactic n-grams from Version 2012/07/01.3 Lin et al. (2012) and Goldberg and Orwant (2013) describe the method to extract syntactic ngrams. Each syntactic ngram is accompanied by a corpus-level occurrence count, as well as a time series of counts over the years. The temporal dimension allows inspection of how the meaning of a word evolves over time by looking at the contexts the word appears in within different time periods.

We transformed the syntactic ngrams into distributional 'word-context' matrices by using a filtering-based strategy that selects relevant contexts (Gamallo, 2016; Gamallo and Bordag, 2011). A matrix was generated for each year, where each word is represented as a context vector. The final distributional-based resource consists of 110 matrices (one per year from 1900 to 2009). The size of the whole resource is 2.8G, with 25M per matrix on average. Then, the Cosine similarity between word vectors was calculated and, for each word, the 20 most similar ones were selected by year. In total, a data structure with more than 300 million pairs of similar words was generated, giving rise to 110 CSV files (i.e., one file per year).

3 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

2.3 Data Storage

We use two ways of storing data. First, we store the distributional models and the similarity pairs in temporal CSV files. These temporal files are generated by offline processes which can be executed periodically as the linguistic input (ngrams) is updated or enlarged with new language sources. Second, the similarity files are imported into a database to make data access more efficient and easier.

The database system that we chose was MongoDB, which is a NoSQL DB. This type of database system fits the task well for two main reasons:

• Our data do not follow a relational model, so we do not need some of the features that make relational databases powerful, for example, reference keys or table organization.

• NoSQL databases scale really well. Unlike relational databases they implement automatic sharding, which enables us to share and distribute data among servers in the most efficient way possible. So, as the data increase it will be easy to bring up a new instance without any additional engineering.

Among the different types of NoSQL databases, we chose MongoDB because it is document oriented, which fits very well with our data structure.

2.4 Web Interface

The Diachronic Explorer is provided with a web interface that offers different ways of making a diachronic search. Here, we describe just two types of search:


Figure 2: Similar words to cáncer (cancer) using a simple search in 1900 (a) and 2009 (b)
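The per-year neighbour computation behind figures like Figure 2 can be sketched as follows; this is a simplified stand-in (dense dicts instead of the system's matrices, a tiny N), not the Diachronic Explorer's code:

```python
import math

def top_similar(vectors, n=20):
    """For each word, rank all other words by cosine similarity of their
    context vectors and keep the n best -- the per-year computation that
    feeds the similarity files (simplified sketch)."""
    def cosine(u, v):
        shared = set(u) & set(v)
        dot = sum(u[c] * v[c] for c in shared)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    out = {}
    for w, vec in vectors.items():
        scored = [(other, cosine(vec, ovec))
                  for other, ovec in vectors.items() if other != w]
        scored.sort(key=lambda p: -p[1])
        out[w] = scored[:n]
    return out
```

Running this once per yearly matrix yields the word/neighbour pairs that the interface later retrieves by year.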

Simple: The user enters a target word and selects a specific period of time (e.g., from 1920 to 1939), and the system returns the 20 most similar terms (on average) within the selected period of time. The user can select a word cloud image to visualize the output. Figure 2 shows the similar words to cáncer ('cancer') returned by a simple search in two different periods: 1900 (a) and 2009 (b). Notice that this word is similar to peste and tuberculosis at the beginning of the 20th century, while nowadays it is more similar to technical words, such as tumor or carcinoma.

Track record: As in the simple search, the user enters a word and a time period. However, this type of search returns the specific similarity scores for each year of the period, instead of the similarity average. In addition, the system allows the user to select any word from the set of all words (and not just the top 20) semantically related to the target one in the searched time period.

3 Conclusions

In this abstract we described a system, Diachronic Explorer, that allows linguists to analyse the meaning change of Spanish words in the written language across time. The system is based on distributional models obtained from a chronologically structured language resource, namely Google Books Syntactic Ngrams. The models were created using dependency-based contexts and a strategy for reducing the vector space, which consists in selecting the more informative and relevant word contexts.

As far as we know, our system is the first attempt to build diachronic distributional models for the Spanish language. Besides, it uses NoSQL storage technology to scale easily as new data are processed, and provides an interface enabling useful types of word searching across time. Other similar works for English are reported in different papers (Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014; Kulkarni et al., 2015).

An interesting direction of research could involve using the system infrastructure to handle other types of language variety, namely diatopic variation. Distributional models can be built from text corpora organized not only with diachronic information, but also with dialectal features. For instance, we could adapt the system to search meaning changes across different diatopic varieties: Spanish from Argentina, Mexico, Spain, Bolivia, and so on. The structure of our system is generic enough to deal with any type of variety, not only that derived from the historical axis.

Acknowledgments

This work has received financial support from a 2016 BBVA Foundation Grant for Researchers and Cultural Creators, TelePares (MINECO,

ref: FFI2014-51978-C2-1-R), a Juan de la Cierva formación Grant (Ref: FJCI-2014-22853), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund (ERDF).

References

Pablo Gamallo and Stefan Bordag. 2011. Is singular value decomposition useful for word similarity extraction? Language Resources and Evaluation, 45(2):95–119.

Pablo Gamallo. 2016. Comparing Explicit and Predictive Distributional Semantic Models Endowed with Syntactic Contexts. Language Resources and Evaluation, first online: 13 May 2016. doi:10.1007/s10579-016-9357-4.

Yoav Goldberg and Jon Orwant. 2013. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pages 241–247, Atlanta, Georgia.

Kristina Gulordava and Marco Baroni. 2011. A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, Scotland. ACL.

Adam Jatowt and Kevin Duh. 2014. A Framework for Analyzing Semantic Change of Words Across Time. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2014), pages 229–238, Piscataway, NJ. IEEE Press.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 61–65, Baltimore, Maryland. ACL.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International World Wide Web Conference (WWW 2015), pages 625–635, Florence, Italy. ACM.

Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic Annotations for the Google Books Ngram Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 169–174. ACL.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182.

Derry Tanti Wijaya and Reyyan Yeniterzi. 2011. Understanding Semantic Change of Words over Centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web (DETECT 2011), pages 35–40, Glasgow, Scotland. ACM.

Multilingual CALL Framework for Automatic Language Exercise Generation from Free Text

Naiara Perez and Montse Cuadros
Vicomtech-IK4 research center
Mikeletegi 57, Donostia-San Sebastián, Spain
{nperez, mcuadros}@vicomtech.org

Abstract

This paper describes a web-based application to design and answer exercises for language learning. It is available in Basque, Spanish, English, and French. Based on open-source Natural Language Processing (NLP) technology such as word embedding models and word sense disambiguation, the system enables users to create automatically and in real time three types of exercises, namely Fill-in-the-Gaps, Multiple Choice, and Shuffled Sentences questionnaires. These are generated from texts of the users' own choice, so they can train their language skills with content of their particular interest.

1 Introduction

This paper describes a web-based computer-assisted language learning (CALL) framework for the automatic generation and evaluation of exercises¹. The aim of this application is mainly to allow agents in the language learning sector, teachers and learners alike, to create questionnaires from texts of their own interest with little effort. To do so, the application includes state-of-the-art open-source NLP technology, namely part-of-speech tagging, word sense disambiguation, and word embedding models. Its main features are the following:

• Multilingual. The platform supports training Spanish, Basque, English and French skills. Interface messages appear in the chosen language too.

• Real-time generation, easy to use.

• Three formats. Users can design and answer exercises of three types: a) Fill-in-the-Gaps (FG): learners must fill in the gaps in a text with the correct words, based on some clues given or just the context provided by the text; b) Multiple Choice (MC): learners must choose the correct answers from a set of words given to fill in the gaps in a text; and c) Shuffled Sentences (SS): learners must order a set of given words to formulate grammatical sentences.

• Highly configurable. Each exercise format offers a variety of settings that users can control. The input texts from which exercises are built are always given by the users. Users also choose the pedagogical target of the exercises based on language-specific part-of-speech (PoS) tags and morphological features. Other settings include the type of clues in FG mode and the number of distractors in MC. Furthermore, users can select the exercise items themselves or let the system do it automatically.

• Exportable. The exercises can be downloaded in Moodle's CLOZE² syntax to import them into Moodle quizzes, a platform used extensively by teaching institutions all over the world.

• Evaluation. The questionnaires designed can be answered in the same application, which reports the percentage of correct answers upon submission. Correct answers are shown in green and incorrect ones in red, so the learner can try to guess again.

¹The application is at edunlp.vicomtech.org. The web page asks for credentials to log in; the username is "vicomtech" and the password "edunlp". You will also need your own account, which you can obtain by writing an e-mail to the authors of the paper.
²https://docs.moodle.org/23/en/Embedded_Answers_(Cloze)_question_type
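For reference, the Exportable feature targets Moodle's embedded-answers (CLOZE) question syntax, which encodes each gap inline in the question text. The helper below is our own illustration of that public syntax, not the demo system's actual exporter; the function names are invented:

```python
def fg_gap(answer, weight=1):
    """Render a Fill-in-the-Gaps item as a CLOZE short-answer field."""
    return "{%d:SHORTANSWER:=%s}" % (weight, answer)

def mc_gap(answer, distractors, weight=1):
    """Render a Multiple Choice item: '=' marks the correct option,
    '~' separates it from the distractors."""
    options = "=" + answer + "".join("~" + d for d in distractors)
    return "{%d:MULTICHOICE:%s}" % (weight, options)

print("The Eiffel Tower is in %s." % mc_gap("Paris", ["London", "Berlin"]))
# The Eiffel Tower is in {1:MULTICHOICE:=Paris~London~Berlin}.
```

A file of such strings can be imported into Moodle as an embedded-answers question.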

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 49–52. © 2017 Association for Computational Linguistics

2 Related Work

There exist countless tools, both web-based and desktop software, that facilitate the creation of teaching material for general purposes, some of the best known being Moodle³, Hot Potatoes⁴, ClassTools.net⁵, and ClassMarker⁶. However, the focus of such tools is on enabling users to adapt the pedagogical content they are interested in, whatever it is, to certain exercise formats (i.e., quizzes, open clozes, crosswords, drag-and-drops, and so on). That is, these tools offer no support for assessing the contents' pedagogical suitability, nor for other exercise-dependent tasks, such as building clues for quizzes or distractors for multiple-choice exercises.

In the domain of language learning in particular, exercise authoring very often implies a lot of word-list curation, searching for texts that contain certain linguistic patterns or expressions, retrieving definitions, and similar tasks. The availability of resources for language teachers that simplify these processes, such as on-line dictionaries or teaching-oriented lexical databases (e.g., English Profile⁷), depends on the target language. To the best of the authors' knowledge, there does not exist at the moment an exercise generation and evaluation framework specific to learning Basque, Spanish, English, and French that not only automates formatting the content given by the user for several exercises but also incorporates natural language processing (NLP) techniques to ease the authoring process. Volodina et al. (2014) describe a similar framework named Lärka. Lärka designs multiple-choice exercises in Swedish for linguistics or Swedish learners. The questions are based on controlled corpora; that is, the users cannot choose the texts they will be working on.

3 Workflow description

All the exercise formats mentioned share a common building process. Users must choose a language, provide a text in that language, and choose a pedagogical target from the options given. Pedagogical targets are based on PoS tags and morphological features. Different possible targets have been implemented for each language, depending on the language's characteristics and the richness of the parsing models available. For instance, exercises in Spanish can target the subjunctive conjugation or the definite/indefinite articles. In English, one can target, for example, past and present participle tenses. Once the initial configuration has been set, the workflow continues as follows:

3.1 Extraction of candidate items

The text given is segmented, tokenized and tagged with IXA pipes (Agerri et al., 2014), using the latest models provided with the tools. The tokens with a PoS tag or morphological feature selected by the user as pedagogical target are chosen as candidate items. In the case of the SS format, item candidates are sentences containing tokens with the relevant PoS tags or morphological features. If the text does not contain any candidate item, the application alerts the user that it is not possible to build an exercise with the configuration given.

3.2 Final item selection

The system has two ways of getting the final items from the candidates identified: the user can choose whether to select them or let the application do it randomly. In the latter case, the user can set an upper bound on the number of items generated. The system never yields two contiguous items, since that would substantially increase the difficulty of answering them.

3.3 Exercise generation

Once the final items have been chosen, the actual questionnaire must be designed. This depends on the format chosen by the user and the specific settings available for that format:

Fill-in-the-Gaps (FG). The system substitutes the items chosen with gaps that have to be filled in with the appropriate words. Users can choose to show, for all the languages, at least three types of clues to help learners do the exercise: the lemmas of the correct words, their definition, or a word bank of all the correct words (and how many times they occur). For Spanish and English, the system is also capable of prompting the morphological features of the words to be guessed, in addition to the lemma. This feature is interesting for training singulars and plurals, grammatical gender, verbal tenses, and so on. Moreover, depending on the pedagogical target chosen, the system automatically disables certain types of clues. It would not make much sense, for instance, to give the lemmas of the correct words if the pedagogical target were prepositions, given that prepositions cannot be lemmatized. The user can also choose not to give any help.

To generate clues based on lemmas and/or morphological features, the system turns to the linguistic annotation given by the IXA pipes during candidate extraction. The annotation contains all the information necessary for each token in the text.

Retrieving definitions requires additional processing, since a word's definition depends on the context it appears in. That is, choosing the correct definition of a word in the text provided by the user translates to disambiguating the word. The application relies on the Babelfy (Moro et al., 2014) and BabelNet (Navigli and Ponzetto, 2010) APIs in order to do so. It passes the whole text as the context and asks Babelfy to assign a single sense to the words chosen as targets of the exercise. Then, it retrieves from BabelNet the definitions associated with those senses. Babelfy is not always able to assign a sense to a word; when this happens, the application returns the lemma of the word as its clue.

Multiple Choice (MC). Again, the exercise consists of gaps in the text to be filled with the correct words. In this case, the learner is given a set of words from which to choose an answer. This set of words contains the right answer and some incorrect words called "distractors". When this exercise format is chosen, the system automatically generates as many distractors as specified by the user. This is achieved by consulting, for each correct answer, a word embedding model and retrieving the most similar words. The models are word2vec models (Mikolov et al., 2013a; Mikolov et al., 2013b) trained on Leipzig University's corpora⁸ with the Gensim library for Python⁹. Thus, distractors tend to be words that appear often in contexts similar to the right answer, but that are not semantically or grammatically correct in the gap. Distractors are then transformed to the same case as the correct word and finally shuffled for their visualization.

Shuffled Sentences (SS). This exercise consists in ordering a set of given words to formulate a grammatical sentence. In this case, the system substitutes the sentences chosen by gaps and shows each sentence shuffled as a lead.

3.4 Evaluation

The user can answer the questionnaire they have designed and get it assessed by the system. For all the exercise formats, the evaluation consists in comparing the correct answers with the input received from the user. An answer is right only when it is identical to the correct answer.

4 The demonstrator

The application has a clean interface and is easy to follow. An exercise can be designed, answered and evaluated by visiting fewer than four pages:

The home page. In the home page, the user chooses a language –Spanish, Basque, English or French– and an exercise format –FG, MC, or SS. It leads to the configuration page of the exercise chosen.

The configuration page. All the configuration pages share a common structure. A text field occupies the top of the page. This is where the user introduces the text they want to work with. Below the text field are the configuration options, presented as radio-button lists. All the exercise formats require that users choose a pedagogical target. Then come the format-specific settings, the only section of the page that varies.

For an FG exercise, two properties must be set: the type of clue and how the clues must be visualized. They can be shown below the gapped text with a reference to the gap they belong to, or as description boxes of the gaps (i.e., "tooltips").

In an MC exercise, users must choose the number of distractors they want the system to create.

For the SS mode, users can set an upper limit on the sentences that will be selected as candidates.

Finally, the user can choose whether to let the system select the items of the exercise or choose them themselves among the candidates that the system generates. In the former case, the system asks the user how many items it should create at most, and takes the user directly to the "Answer and evaluate" page. In the latter case, the user is taken to the "Choose the items" page.

If the text does not contain any token that meets the pedagogical target, the system asks the user to provide a different text or change the configuration set. That is, the application lets users know with a single click whether the texts they choose are suitable for the pedagogical objective they have set, without having to read the text thoroughly beforehand.

³https://moodle.org/
⁴https://hotpot.uvic.ca/
⁵https://www.classtools.net/
⁶https://www.classmarker.com/
⁷http://www.englishprofile.org/
⁸http://corpora.uni-leipzig.de/
⁹http://radimrehurek.com/gensim/
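The item-selection rule above, by which the system never yields two contiguous items, amounts to a greedy filter over candidate token positions. The sketch below is an illustrative reconstruction, not the demo's code (which additionally samples items at random up to the user's bound); all names are invented:

```python
def select_items(candidate_positions, max_items):
    """Keep candidate token positions greedily, skipping any position adjacent
    to one already selected, up to the user's upper bound on item count."""
    selected = []
    for pos in sorted(candidate_positions):
        if selected and pos - selected[-1] <= 1:
            continue  # two contiguous gaps would make the exercise too hard
        selected.append(pos)
        if len(selected) == max_items:
            break
    return selected

# Tokens 3 and 4 are both candidates; only one of the pair becomes a gap.
print(select_items([3, 4, 7, 8, 12], 3))  # [3, 7, 12]
```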

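The MC distractor strategy, retrieving the nearest neighbours of the correct answer from a word embedding model, can be illustrated with a self-contained sketch. A few hand-made 2-d vectors stand in for the word2vec models that the system loads with Gensim; the vectors and names are invented:

```python
import math

# Toy embedding table standing in for a trained word2vec model (invented values).
vectors = {
    "dog":  [1.0, 0.0],
    "cat":  [0.95, 0.1],
    "wolf": [0.8, 0.3],
    "car":  [0.1, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def distractors(answer, n):
    """Rank the rest of the vocabulary by similarity to the correct answer and
    keep the top n: words frequent in similar contexts, but wrong in this gap."""
    ranked = sorted((w for w in vectors if w != answer),
                    key=lambda w: cosine(vectors[answer], vectors[w]),
                    reverse=True)
    return ranked[:n]

print(distractors("dog", 2))  # ['cat', 'wolf']
```

With a real Gensim model, the corresponding query is roughly model.wv.most_similar("dog", topn=2).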
Choose the items. On this page users choose the items they want to create among the words that meet the pedagogical target they chose. The interface shows the whole text with all the available candidates selected. They can be toggled simply by clicking on them. There are also buttons to select all and remove all the candidates. Once users are satisfied with their selection, they are led to the "Answer and evaluate" page.

Answer and evaluate. This is the page where the final questionnaire is displayed and can be filled in. The appearance varies a little depending on the format chosen. FG exercises consist of the text given and gaps where the correct words should be written. If the tooltip clues option has been enabled, clues appear in description boxes when hovering over the gaps; otherwise, they appear listed below the text. Word bank clues appear boxed to the right of the text. In MC exercises, the choices are radio-button groups listed to the right of the text. As for SS exercises, page-wide gaps replace the sentences chosen, and the shuffled words are given above the gaps.

At the bottom of this page there is a link to download the exercise in Moodle CLOZE syntax. The downloaded file can be imported into Moodle in order to generate a quiz.

Users can fill in the exercise they designed and submit it to get the percentage of correct answers. Furthermore, the answers are colored green or red, depending on whether they are correct or not, respectively. This way learners can try to answer again.

5 Conclusions

We have described a web-based framework for language learning exercise generation. It is available for Spanish, Basque, English, and French. Users can design three types of exercises to train diverse skills in these languages. The framework is highly configurable and lets users choose whether they want to select the exercise items or have the system do it.

As future work, the system should be improved in various ways. To begin with, the current system delegates to the user the task of making the exercises more or less difficult by choosing the items themselves. The application will be endowed with technology that allows users to automatically create exercises that vary in difficulty starting from the same text.

Another aspect that can be improved is the fact that exercise items are created from unigrams. It would be very interesting for the application to be capable of generating multi-word candidates. This would be useful for revising collocations or phrasal verbs, for instance. In the same regard, the applicability of the system would increase if it based item candidate generation not only on PoS tags or morphological features, but on other criteria as well, such as the semantics of the text.

There is also room for improvement in the configurability of the application. Users should be allowed to control two key features: definition clues in FG exercises and MC distractors. Currently, the application imposes the definitions and distractors it generates, instead of presenting them as options for the users to choose from.

Finally, we plan to implement more formats in addition to the available three.

Acknowledgements

This work has been funded by the Spanish Government (project TIN2015-65308-C5-1-R).

References

Rodrigo Agerri, Josu Bermudez, and German Rigau. 2014. IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. In LREC, volume 2014, pages 3823–3828.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeff Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119.

Andrea Moro, Francesco Cecconi, and Roberto Navigli. 2014. Multilingual word sense disambiguation and entity linking for everybody. In Proceedings of the 2014 International Conference on Posters & Demonstrations Track, volume 1272, pages 25–28. CEUR-WS.org.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In ACL, pages 216–225.

Elena Volodina, Ildikó Pilán, Lars Borin, and Therese Lindström Tiedemann. 2014. A flexible language learning platform based on language resources and web services. In LREC, pages 3973–3978.

Audience Segmentation in Social Media

Verena Henrich, Alexander Lang
IBM Germany Research and Development GmbH
Böblingen, Germany
[email protected], [email protected]

Abstract

Understanding the social media audience is becoming increasingly important for social media analysis. This paper presents an approach that detects various audience attributes, including author location, demographics, behavior and interests. It works both for a variety of social media sources and for multiple languages. The approach has been implemented within IBM Watson Analytics for Social Media™ and creates author profiles for more than 300 different analysis domains every day.

1 Understanding the Social Media Audience – Why Bother?

The social media audience shows who is talking about a company's products and services in social media. This is increasingly important for various teams within an organization:

Marketing: Does our latest social media campaign resonate more with men or with women? What are the social media sites where actual users of our products congregate? What other interests do the authors who talk about our products have, so we can create co-selling opportunities?

Sales: Which people are disgruntled with our products or services and consider churning? Can we find social media authors interested in buying our type of product, so we can engage with them?

Product management and product research: Which product features are important specifically for women, or parents? What aspects do actual users of our product highlight—and how does this compare to the competition's product?

Besides commercial scenarios, social media is becoming relevant for the social and political sciences to understand opinions and attitudes towards various topics. Audience insights are key to putting these opinions into the right context.

2 Audience Segmentation with IBM Watson Analytics for Social Media

IBM Watson Analytics for Social Media™ (WASM) is a cloud-based social media analysis tool for line-of-business users from public relations, marketing or product management¹. The tool is domain-independent: users configure topics of interest to analyze, e.g., products, brands, services or politics. Based on the user's topics, WASM retrieves all relevant content from a variety of social media sources (Twitter, Facebook, blogs, forums, reviews, video comments and news) across multiple languages (currently Arabic, English, French, German, Italian, Portuguese and Spanish) and applies various natural language processing steps:

• Dynamic topic modeling to spot ambiguous user topics and suggest variants.

• Aspect-oriented sentiment analysis.

• Detection of spam and advertisements.

• Audience segmentation, including author location, demographics, behavior, interests, and account types.

While the system demo will show all steps, this paper focuses on audience segmentation. Audience segmentation relies on: (i) the author's name and nick name; (ii) the author's biography: a short, optional self-description of the author (see Figure 1), which often includes the author's location; and (iii) the author's content: social media posts (Tweets, blog posts, forum entries, reviews) that contain topics configured by a WASM user.

¹https://watson.analytics.ibmcloud.com
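The segmentation described in this paper relies on matching curated dictionaries and patterns against the author's name, biography and content. The toy Python stand-in below illustrates the idea; it is not IBM's implementation (which is written in AQL), and the cue lists are invented for illustration:

```python
import re

# Invented cue patterns for illustration only; the real WASM dictionaries and
# patterns are far larger and language-specific.
CUES = {
    "parent": re.compile(r"\b(my|our)\s+(\w+\s+)?(children|child|sons?|kids?|daughters?)\b", re.I),
    "married": re.compile(r"\b(married|husband|wife|spouse)\b", re.I),
    "organization": re.compile(r"\b(Inc|Corp|official\s+\w+\s+account)\b", re.I),
}

def segment(biography):
    """Return the audience-segment tags whose cues match the biography text."""
    return sorted(tag for tag, pattern in CUES.items() if pattern.search(biography))

print(segment("Husband and father, proud of my two kids"))  # ['married', 'parent']
```

A production system would add far larger dictionaries, disambiguation (e.g., for first names) and negation handling, as the following sections describe.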

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 53–56. © 2017 Association for Computational Linguistics

Figure 1: Example author biographies.

Our segmentation approach relies only on the 'on-topic' content of an author, not on all of the author's posts—for two reasons. Firstly, many social media sources such as forums do not allow retrieving content based on an author name. Secondly, social media content providers charge per document; it would not be economical to download all content for all authors who talk about a certain topic. We found that the combination of the author's content with the author's name(s) and biography significantly enhances both the precision and recall of audience segmentation.

2.1 Author Location

WASM detects the permanent location of an author, i.e., the city or region where the author lives, by matching the information found in their biography against large, curated lists of city, region and country names. We use population information (preferring the location with the higher population) to disambiguate city names in the absence of any region or country information; e.g., an author with the location Stuttgart is assigned to Stuttgart, Germany, not Stuttgart, AR, USA.

2.2 Author Demographics

WASM identifies three demographic categories:

Marital status: When the author is married, the value is true, otherwise unknown. For the classification we identify keywords (encoded in dictionaries) and patterns in the author's biography and content. Our dictionaries and patterns match, for example, authors describing themselves as married or husband, or authors that use phrases such as my spouse loves running or at my wife's birthday. Thus, WASM tags the authors in Figure 1 as married.

Parental status: When the author has children, the value is true, otherwise unknown. Similarly to the marital status classification, keywords and patterns are matched in the authors' biographies and contents. Example keywords are father or mom; example patterns are my kids or our daughter. WASM tags both authors in Figure 1 as having children.

Gender: Possible values are male, female and unknown. The classification of gender relies on keyword and pattern matching similar to that described previously. In addition, it relies on matching the author name against a built-in database of 150,000 first names that span a variety of cultures. For ambiguous first names such as Andrea (which identifies females in Germany and males in Italy), we take both the language of the author content and the detected author location into account to pick the appropriate gender. WASM classifies the first author in Figure 1 as male, the second author as female.

2.3 Author Behavior

WASM identifies three types of author behavior: Users own a certain product or use a particular service. They identify themselves through phrases such as my new PhoneXY, I've got a PhoneXY or I use ServiceXY. Prospective users are interested in buying a product or service. Example phrases are I'm looking for a PhoneXY or I really need a PhoneXY. Churners have either quit using a product or service, or have a high risk of leaving. They are identified by phrases such as Bye Bye PhoneXY or I sold my PhoneXY.

Author behavior is classified by matching keywords and patterns in the authors' biographies and content—similarly to the demographic analysis described above. This makes it possible to understand: What do actual customers of a certain product or service talk about? What are the key factors why customers churn?

Figure 2 shows an example analysis where the topics of interest were three retailers: a discounter, an organic market and a regular supermarket.² The top of Figure 2 summarizes what is relevant for authors that WASM identified as users, i.e., customers of a specific retailer. The bottom part shows two social media posts from users.

Figure 2: What matters to users of supermarkets.

2.4 Author Interests

WASM defines a three-tiered taxonomy of interests, which is inspired by the IAB Tech Lab Content Taxonomy (Interactive Advertising Bureau, 2016). The first tier represents eight top-level interest categories such as art and entertainment or sports. The second tier comprises about 60 more specific categories. These include music and movies under art and entertainment, and ball sports and martial arts under sports. The third level represents fine-grained interests, e.g., tennis and soccer below ball sports.

The fine-grained interests on the third level are identified in author biographies and content with the help of dictionaries and patterns. From the biography of the first author in Figure 1, we infer that he is interested in music and baseball (music lover and baseball fan). For the second author we infer an interest in machine knitting (I am an avid machine knitter). Note that a simple occurrence of a certain interest, e.g. a sport type, is usually not enough: it has to be qualified in the author content by matching specific patterns. A match in a biography, however, typically qualifies as a bona fide interest—excluding, e.g., negation structures.

Figure 3 visualizes the connections between the three retailers mentioned above and coarse-grained interest categories. This reveals that authors who write about Organic Market CD are uniquely interested in animals and that Discounter AB does not attract authors who are interested in sports. These insights allow targeted advertisements or co-selling opportunities.

Figure 3: Author interests related to retailers.

2.5 Account Type Classification

WASM classifies social media accounts into organizational or personal. This helps customers understand who is really driving the social media conversations around their brands and products: is it themselves (through official accounts), a set of news sites, or is it an 'organic' conversation with lots of personal accounts involved?

Organizational accounts include companies, NGOs, newspapers, universities, and fan sites. Personal accounts are non-organizational accounts from individual authors. Account types are distinguished by cues from the authors' names and biographies. For example, company-indicating expressions such as Inc or Corp and patterns like Official Twitter account of … indicate that the account represents an organization, whereas a phrase like Views are my own points to an actual person. The biographies in Figure 1 contain many hints that both accounts are personal, e.g., agent nouns (director, developer, lover) and personal and possessive pronouns (I, my, our).

3 Implementation

3.1 Text Analysis Stack

The WASM author segmentation components are created using the Annotation Query Language (AQL; Chiticariu et al., 2010). AQL is an SQL-style programming language for extracting structured data from text. Its underlying relational logic supports a clear definition of rules, dictionaries, regular expressions, and text patterns. AQL optimizes the rule execution to achieve higher analysis throughput.

The implemented rules harness linguistic preprocessing steps, which consist of tokenization, part-of-speech tagging and chunking (Abney, 1992; Manning and Schütze, 1999)—the latter also expressed in AQL. Rules in AQL support the combination of tokens, parts of speech, chunks, string constants, dictionaries, and regular expressions. Here is a simplified pattern for identifying whether an author has children:

create view Parent as
  extract pattern <FP.match> <A.match>{0,1} /child|sons?|kids?/
  from FirstPersonPossessive FP, Adjectives A;

It combines a dictionary of first-person possessive pronouns with an optional adjective (identified by its part of speech) and a regular expression matching child words. The pattern matches phrases such as my son and our lovely kids.

3.2 System Architecture

We wanted to provide users with analysis results as quickly as possible. Hence, we created a data streaming architecture that retrieves documents from social media and analyzes them on the fly, while the retrieval is still in progress. Moreover, we wanted to run all analyses for all users on a single, multi-tenant system to keep operating costs low. The data stream that our text analysis components see contains social media content for different users. Hence, we built components that can switch between different analysis processing configurations with minimal overhead.

The text analysis components run as Scala or Java modules within Apache Spark™. We exploit Spark's Streaming API for our data streaming requirements. The data between the components flows through Apache Kafka™. The combination of Spark and Kafka allows for processing scale-out and resilience against component failures.

The multi-tenant text analysis processing is separated into user-specific and language-specific analysis steps. We optimized the user-specific analysis steps (such as detecting products and brands a particular user cares about) to have virtually no switching overhead. The language-specific steps (e.g., sentiment detection or author segmentation) are invariant across customers. To achieve the required processing throughput, we launch one language-specific rule set per processing node and analyze multiple documents in parallel with this rule set.

The analysis pipeline runs on a cluster of 10 virtual machines, each with 16 cores and 32 GB RAM. Our customers run 300+ analyses per day, analyzing between 50,000 and 5,000,000 documents each.

4 Evaluation

Our customers are willing to accept smaller segment sizes (i.e., lower recall) as long as the precision of the segment identification is high. This design goal of our analysis components is reflected in the evaluation results for author behavior and account types on English social media documents that we present here.³

The retail dataset mentioned in previous sections consists of 50,354 documents by 41,209 authors from all social media sources. WASM classifies 1,695 authors as users—without any additional effort by the customer. Compare that with the task of running a survey in the retailer's stores (and its competition's) to get similar insights. We manually annotated author behavior for 500 documents from distinct authors. The evaluated precision is 90.0% at a recall of 58.3%.

The evaluation of account types is based on a random sample of 50,124 Tweets, which comprises 43,193 distinct authors. 36,682 of the authors provide biographies. We use this as the "upper recall bound" for our biography-based approach. Our system assigns an account type for 18,657 authors, which corresponds to a recall of 50.9%. It classifies 16,981 as personal and 1,676 as organizational accounts. We manually annotated 500 of the classified accounts, which results in a precision of 97.4%.

5 Conclusion and Future Work

This paper presented real-life use cases that require understanding audience segments in social media. We described how IBM Watson Analytics for Social Media™ segments social media authors according to their location, demographics, behavior, interests and account types. Our approach shows that the author biography in particular is a rich source of segment information. Furthermore, we show how author segments can be created at a large scale by combining natural language processing and the latest developments in data analytics platforms. In future work, we plan to extend this approach to additional author segments as well as additional languages.

²The retailers' names are anonymized.
³An evaluation of each audience feature for each supported language is beyond the scope of this paper.

References

Steven P. Abney. 1992. Parsing By Chunks, pages 257–278. Springer Netherlands, Dordrecht.

Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An Algebraic Approach to Declarative Information Extraction. In Proceedings of ACL, pages 128–137.

Interactive Advertising Bureau. 2016. IAB Tech Lab Content Taxonomy. Online, accessed 2016-12-23, https://www.iab.com/guidelines/iab-quality-assurance-guidelines-qag-taxonomy/.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

The arText prototype: An automatic system for writing specialized texts

Iria da Cunha
Universidad Nacional de Educación a Distancia (UNED)
Senda del Rey 7, 28040 Madrid, Spain
[email protected]

M. Amor Montané
Universitat Pompeu Fabra
Roc Boronat 138, 08018 Barcelona, Spain
[email protected]

Luis Hysa
Universitat Pompeu Fabra
Roc Boronat 122, 08018 Barcelona, Spain
[email protected]

Abstract

This article describes an automatic system for writing specialized texts in Spanish. The arText prototype is a free online text editor that includes different types of linguistic information. It is designed for a variety of end users and domains, including specialists and university students working in the fields of medicine and tourism, and laypersons writing to the public administration. ArText provides guidance on how to structure a text, prompts users to include all necessary contents in each section, and detects lexical and discourse problems in the text.

1 Introduction

In the field of Natural Language Processing (NLP), various types of linguistic information, including phonological, morphological, lexical, syntactic, semantic and discourse-related features, can be used to develop applications. To date, tools for writing texts have often been designed for general subject areas and included information on orthographic, grammatical and/or lexical aspects of the writing process. NLP researchers have tended not to study systems for structuring and writing specialized texts, although a few researchers have bucked this trend: Kinnunen et al. (2012) developed a system to identify and correct writing problems in English in several domains; Aluisio et al. (2001)'s system helps non-native speakers write scientific publications in English; the Writing Pal (Dai et al., 2011) and Estilector¹ systems help improve academic writing in English and Spanish, respectively; and LanguageTool² is an open source proofreading program for non-specialized texts in several languages. To our knowledge, none of the systems that are currently available have considered the specific characteristics of textual genres in specialized domains, such as medicine, tourism and the public administration.

Writing specialized texts is more challenging than writing general texts (Cabré, 1999). Textual, lexical and discourse features are an essential component of textual genres, such as medical research papers, travel blog posts, or claims submitted to the public administration. Against this backdrop, this article presents a prototype for an automatic system that provides assistance in writing specialized texts in Spanish. The arText system includes textual, lexical and discourse-related information, and is useful for different end users. It provides guidance on how to structure a text, prompts users to include all necessary contents in each section, and detects lexical and discourse problems in the text.

Da Cunha et al. (in press) determined the most frequent textual genres that pose the greatest writing challenges for three groups: specialists and university students in medicine and tourism, and laypersons writing to the public administration. ArText was designed to help these users write the 15 textual genres included in Table 1.

Section 2 describes the characteristics of the system and its modules. Section 3 explains how the system was evaluated, while Section 4 presents conclusions and future lines of research.

2 Description of the System

ArText is a free online text editor that anyone can use, with no registration required. The system was developed in a LINUX environment using an Apache server and a MySQL database. A variety of resources were utilized in the back end (BASH and PHP, with a Laravel Frame-

¹ http://www.estilector.com/index.php.
² https://www.languagetool.org/.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 57–60. © 2017 Association for Computational Linguistics

work) and front end (HTML, CSS and JAVASCRIPT, with JQUERY); Google Analytics is integrated into the site to measure traffic.

Medicine: Research article, Review article, Medical history, Abstract, Bachelor's thesis
Tourism: Informative article, Travel blog post, Report, Rules and regulations, Business plan
Public Administration: Allegation, Cover letter, Letter of complaint, Claim, Application

Table 1: Specialized fields and textual genres included in arText.

Documents can be exported in four formats: PDF, TXT, HTML and ARTEXT. Previously saved documents can be uploaded using the ARTEXT format, and the website includes a detailed user manual and a contact section for comments, questions and suggestions. ArText can be accessed at http://sistema-artext.com/,³ and has been optimized for the Google Chrome browser. To use arText, click on "Start using arText" and pick one of the 15 textual genres mentioned above. This brings you to the text editor, where you can start writing using the three modules integrated into arText: Structure, Contents and Phraseology; Format and Spellchecking; and Lexical and Discourse-based Recommendations.

2.1 Module 1. Structure, Contents and Phraseology

The left-hand column helps users structure and draft documents. Its interactive template includes typical sections, contents and phraseology for each textual genre. This information was extracted from da Cunha and Montané (2016), a corpus-based analysis following van Dijk (1989)'s textual approach. Specifically, users can insert:
- Typical document sections
- Typical contents found in each section
- Phraseology related to each of these contents

The text editor displays the sections which typically appear in a given textual genre. For example, the template for a "claim" to be submitted to the public administration includes the following sections:
- Header
- Addressee
- Introductory clause
- Supporting details
- Request
- Closing

A drop-down menu in the left-hand column provides sample texts for each section, including section titles, where appropriate. For example, the "Supporting details" section includes two different contents:
- Grounds for the claim
- Attachments

When users click on a specific content, arText displays a list of sample phrases that can be incorporated into the final text. For example, "Attachments" includes the following phrases:
- Attached please find [document name].⁴
- The following supporting documents are attached: [list of documents].

Users can click on a stock phrase to include it in the text.

2.2 Module 2. Format and Spellchecking

The toolbar at the top of the screen includes an open source spellchecker (WebSpellChecker Ltd.) and various formatting options, e.g. to change font or font size; insert bullet points, images, tables and links; cut, copy and paste; print; and search. Since online storage is not provided, the user's manual includes instructions for uploading an image to Google Drive and inserting it into a document produced using arText.

2.3 Module 3. Lexical and Discourse-based Recommendations

By clicking on the review button in the right-hand column, users can see a series of lexical and discourse-related recommendations for improving their texts. These recommendations are derived from da Cunha and Montané (2016), which is based on Cabré (1999)'s Communicative Theory of Terminology and Mann and Thompson (1988)'s Rhetorical Structure Theory. The module includes a series of algorithms based on linguistic rules and two NLP tools: the Freeling shallow parser (At-

³ A demo is available at https://canal.uned.es/mmobj/index/id/54433
⁴ Users are instructed to fill in the information indicated in square brackets (e.g. names, dates and numbers).
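Module 1's interactive templates are naturally represented as a nested mapping from sections to contents to stock phrases. The sketch below is a minimal Python illustration of that idea using the "claim" template quoted above; the data structure and function names are hypothetical, not arText's actual implementation (which runs on a PHP/Laravel back end).

```python
# Hypothetical sketch of Module 1's genre template for a "claim":
# sections -> typical contents -> stock phrases. The section and phrase
# data come from the paper; the structure itself is illustrative.

CLAIM_TEMPLATE = {
    "Header": {},
    "Addressee": {},
    "Introductory clause": {},
    "Supporting details": {
        "Grounds for the claim": [],
        "Attachments": [
            "Attached please find [document name].",
            "The following supporting documents are attached: [list of documents].",
        ],
    },
    "Request": {},
    "Closing": {},
}

def stock_phrases(template, section, content):
    """Return the sample phrases offered for one content of one section."""
    return template.get(section, {}).get(content, [])
```

A click on "Attachments" in the left-hand column would then surface `stock_phrases(CLAIM_TEMPLATE, "Supporting details", "Attachments")` for insertion into the text.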

serias et al., 2006) and the DiSeg discourse segmenter (da Cunha et al., 2012).

This module includes 11 main recommendations, all of which are displayed in the right-hand column, when appropriate. A subset of these recommendations is assigned to each textual genre, and all recommendations are adapted to the linguistic characteristics of each genre (da Cunha and Montané, 2016). Recommendations cover the following 11 topics:
1. Spelling out acronyms
2. Using acronyms systematically
3. Providing definitions
4. Using the passive voice
5. Using the 1st person systematically
6. Using subjectivity indicators
7. Repeating words
8. Using long sentences
9. Segmenting long sentences
10. Considering alternative discourse markers
11. Varying discourse markers

By clicking on a given recommendation, users can see a more detailed explanation and suggestions. In some cases, arText also highlights phrases or content in the text editor. For example, one lexical recommendation, "Spelling out acronyms," highlights acronyms that are not spelled out when they first appear in the text (i.e. arText would highlight "COPD" if "chronic obstructive pulmonary disease" did not appear next to this acronym the first time the term was used). Some recommendations also actively engage users in the revision process. For example, the recommendation "Repeating words" shows a list of repeated words; when users click on a word in the right-hand column, all occurrences of this word in the text are highlighted. This recommendation is not displayed for highly specialized textual genres (e.g. research articles and abstracts), since lexical variation is usually avoided in these types of texts (Cabré, 1999).

Some recommendations focus on the discourse level. For example, "Segmenting long sentences" highlights one or more long sentences; users can click to see suggestions for splitting them into shorter sentences. In this case, arText proposes these discourse segments in the right-hand column. The number of words used to determine long sentences differs for each textual genre, following da Cunha and Montané (2016).

Another discourse-level recommendation refers to "Varying discourse markers." In this case, arText displays a list of discourse markers repeated in the text. When users click on one of these markers, all of its occurrences are highlighted, and a list of alternative discourse markers used to express the same relationship (e.g. Cause, Restatement, Contrast and Condition) is displayed in the right-hand column. For instance, for the discourse marker "that is," used to express Restatement, arText suggests the alternatives "in other words," "that is to say," "i.e." and "to put it another way."

3 Evaluation

Real and ad hoc texts were used to test arText's algorithms and linguistic rules and improve the system. Subsequently, the prototype was launched and data-driven and user-driven evaluations were conducted.

The data-driven evaluation was based on a test corpus with 24 texts corresponding to one textual genre from each domain; the corpus comprised eight medical abstracts, eight tourism-related informative articles and eight applications to the public administration. The linguistic characteristics of these texts were manually annotated, and the manual annotation and arText results were compared. Precision and recall were measured for a series of recommendations; the results are presented in Table 2.⁵

Recommendation ID      Precision  Recall
1                      0.76       0.68
2                      0.75       0.94
3                      x          x
4                      1          0.94
5: sing. verbal forms  0.87       0.97
5: pl. verbal forms    1          0.99
6                      -          -
9                      0.74       0.87
10                     1          1

Table 2: Data-driven results.

Recommendation 7 did not apply in the medical subcorpus; in the tourism and public administration subcorpora, 91.67% and 94.70% of detected words, respectively, were repeated in the text. For Recommendation 8, 100% of highlighted sentences in the medicine and tourism subcorpora were long sentences according to the thresholds for abstracts and informative articles; no long sentences appeared in the public administration

⁵ In light of the degree of specialization, Recommendation 4 did not apply for any of the textual genres included in the test corpus. No cases for which Recommendation 6 applies were found in the test corpus.
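The "Spelling out acronyms" check can be approximated with a small heuristic: flag an acronym whose first occurrence has no nearby parenthesized expansion. The sketch below is a toy illustration of the idea described above, not arText's actual rule set (which relies on Freeling and genre-specific linguistic rules).

```python
import re

# Toy approximation of Recommendation 1 ("Spelling out acronyms"):
# flag an acronym if its first occurrence is not accompanied by a
# parenthesized expansion such as "chronic obstructive pulmonary
# disease (COPD)" or "COPD (...)". Purely illustrative.

def unexplained_acronyms(text):
    flagged, seen = [], set()
    for match in re.finditer(r"\b[A-Z]{2,}\b", text):
        acronym = match.group()
        if acronym in seen:
            continue                      # only the first occurrence matters
        seen.add(acronym)
        window = text[max(0, match.start() - 80):match.end() + 80]
        explained = ("(" + acronym + ")") in window or re.search(
            re.escape(acronym) + r"\s*\(", window)
        if not explained:
            flagged.append(acronym)
    return flagged
```

On the paper's own example, "COPD" would be flagged in a text that never expands it, and left alone when "chronic obstructive pulmonary disease (COPD)" appears at first use.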

subcorpus, so this recommendation could not be tested for this genre. No cases of Recommendation 11 were found in the medical and administration subcorpora; in the tourism corpus, 100% of detected discourse markers were repeated in the text, and adequate alternatives were proposed.

The user-driven evaluation aimed to determine how useful arText is. A survey designed using Google Forms focused on accessibility, the usefulness of the three modules and general issues. Three doctors, three tourism professionals, and 25 laypersons completed the survey; all laypersons were between 30 and 50 years old and had both higher education experience and internet skills. In general, respondents found arText to be user-friendly and useful; 100% of them would recommend the system to other people. Respondents found the section on structure to be the most useful module, while the approach to uploading images was considered the system's greatest weakness.

4 Conclusions and Future Work

This paper describes a prototype of an automatic system to assist users in writing specialized texts. The online arText editor helps users draft texts for 15 textual genres in three specialized domains: medicine, tourism and the public administration. It lays out the structure for each section of the document, suggests appropriate contents and stock phrases for each section, and detects typical linguistic errors. This innovative system is the first tool that considers lexical, textual and discourse features for specific textual genres. Moreover, the arText project is based on the idea that academic research can be shared with and used constructively by the general public.

In the future, the results of the data-driven evaluation will be utilized to improve arText's algorithms. A second user-driven evaluation will include a broader population (e.g. students). Finally, arText may be adapted to other textual genres, specialized domains and languages.

Acknowledgments

This article is part of the "Automatic system to help in writing specialized texts in domains relevant to Spanish society" research project, funded by "2015 BBVA Foundation Grants for Researchers and Cultural Creators". It was also supported by a Ramón y Cajal contract (RYC-2014-16935). We would like to thank Josh Goldsmith for the proofreading of the text, and the evaluation survey respondents and the research groups ACTUALing and IULATERM for their support.

References

Sandra M. Aluísio, Iris Barcelos, Jandir Sampaio, and Osvaldo N. Oliveira Jr. 2001. How to learn the many unwritten "rules of the game" of the academic discourse: A hybrid approach based on critiques and cases to support scientific writing. Proceedings of the IEEE International Conference on Advanced Learning Technologies, 2001.

Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 48–55.

M. Teresa Cabré. 1999. La Terminología. Representación y comunicación. Institut for Applied Linguistics, Barcelona.

Iria da Cunha and M. Amor Montané. 2016. Un análisis lingüístico de los géneros textuales del ámbito de la administración con repercusión en la vida de los ciudadanos. Proceedings of the XV Symposium of the Ibero-American Terminology Network (RITerm 2016), pages 61–62.

Iria da Cunha, Eric SanJuan, Juan-Manuel Torres-Moreno, Marina Lloberes, and Irene Castellón. 2012. DiSeg 1.0: The first system for Spanish discourse segmentation. Expert Systems with Applications, 39(2):1671–1678.

Iria da Cunha, M. Amor Montané, and Alba Coll. In press. Detección de géneros textuales que presentan dificultades de redacción: un estudio en los ámbitos de la administración, la medicina y el turismo. Proceedings of the 34th International Conference of the Spanish Association of Applied Linguistics.

Jianmin Dai, Roxanne B. Raine, Rod Roscoe, Zhiqiang Cai, and Danielle S. McNamara. 2011. The Writing-Pal tutoring system: Development and design. Journal of Engineering and Computer Innovations, 2(1):1–11.

Tomi Kinnunen, Henri Leisma, Monika Machunik, Tuomo Kakkonen, and Jean-Luc Lebrun. 2012. SWAN scientific writing assistant: A tool for helping scholars to write reader-friendly manuscripts. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–24.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–288.

Teun A. van Dijk. 1989. La ciencia del texto. Paidós, Barcelona.

QCRI Live Speech Translation System

Fahim Dalvi, Yifan Zhang, Sameer Khurana, Nadir Durrani, Hassan Sajjad,
Ahmed Abdelali, Hamdy Mubarak, Ahmed Ali, Stephan Vogel
Qatar Computing Research Institute
Hamad bin Khalifa University, Doha, Qatar
{faimaduddin, yzhang}@qf.org.qa

Abstract

We present QCRI's Arabic-to-English speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network architecture, while our Machine Translation (MT) system uses both phrase-based and neural frameworks. Although our neural MT system is slower than the phrase-based system, it produces significantly better translations and is memory efficient.¹

Figure 1: Speech translation system in action. The Arabic transcriptions and English translations are shown in real-time as they are decoded.

1 Introduction

We present our Arabic-to-English SLT system, consisting of three modules: a web application, Kaldi-based speech recognition, and a phrase-based/neural MT system. It is trained and optimized for the translation of live talks and lectures into English. We used a Time Delayed Neural Network (TDNN) for our speech recognition system, which has a word error rate of 23%. For our machine translation system, we deployed both the traditional phrase-based Moses and the emerging Neural MT system. The trade-off between efficiency and accuracy (BLEU) barred us from picking only one final system. While the phrase-based system was much faster (translating 24 tokens/second versus 9.5 tokens/second), it was also roughly 5 BLEU points worse (28.6 versus 33.6) compared to our Neural MT system. We therefore leave it up to the user to decide whether they care more about translation quality or speed. The real-time factor for the entire pipeline is 1.18 using phrase-based MT and 1.26 using neural MT.

Our system is also robust to common English code-switching, frequent acronyms, as well as dialectal speech. Both the Arabic transcriptions and the English translations are presented as results to the viewers. The system is built upon modern web technologies, allowing it to run on any browser that has implemented these technologies. Figure 1 presents a screen shot of the interface.

2 System Architecture

The QCRI live speech translation system is primarily composed of three fairly independent modules: the web application, speech recognition, and machine translation. Figure 2 shows the complete work-flow of the system. It mainly involves the following steps: 1) Send audio from a broadcast instance to the ASR server; 2) Receive transcription from the ASR server; 3) Send transcription to the MT server; 4) Receive translation from the MT server; 5) Sync results with the backend system; and 6) Multiple watch instances sync results from the backend. Steps 1-5 are constantly repeated as new audio is received by the system through the broadcast page. Step 6 is also periodically repeated to get the latest results on the watch page. Both the speech recognition and machine translation mod-

¹The demo is available at https://st.qcri.org/demos/livetranslation.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 61–64. © 2017 Association for Computational Linguistics

ules have a standard API that can be used to send and receive information. The web application connects with the API and runs independently of the system used for transcription and translation.

Figure 2: Demo system overview

2.1 Web application

The web application has two major components: the frontend and the backend. The frontend is created using the React Javascript framework to handle dynamic User Interface (UI) changes such as transcription and translation updates. The backend is built using NodeJS and MongoDB to handle sessions, data associated with these sessions, and authentication. The frontend presents the user with three pages: the landing page, the watch page and the broadcast page. The landing page allows the user to either create a new session or work with an existing one. The watch page regularly syncs with the backend to get the latest partial or final transcriptions and translations. The broadcast page is meant for the primary speaker. This page is responsible for recording the audio data and collecting the transcriptions and translations from the ASR and MT systems respectively. Both partial transcriptions and translations are also presented to the speaker as they are being decoded. To avoid very frequent and abrupt changes, the rate of update of partial translations was configured based on a MIN_NEW_WORDS parameter, which defines the minimum number of new words required in the partial transcription to trigger the translation service. Both the partial and the final results are also synced to the backend as they are made available, so that the viewers on the watch page can experience the live translation.

2.2 Speech transcription

We use the speech-to-text transcription system that was built as part of QCRI's submission to the 2016 Arabic Multi-Dialect Broadcast Media Recognition (MGB) Challenge. Key features of the transcription system are given below:

Data: The training data consisted of 1200 hours of transcribed broadcast speech data collected from the Aljazeera news channel. In addition we had 10 hours of development data (Ali et al., 2016). We used data augmentation techniques such as speed and volume perturbation, which increased the size of the training data to three times the original size (Ko et al., 2015).

Speech Lexicon: We used a grapheme-based lexicon (Killer and Schultz, 2003) of size 900k. The lexicon is constructed using the words that occur more than twice in the training transcripts.

Speech Features: Features used to train all the acoustic models are 40-dimensional hi-resolution Mel Frequency Cepstral Coefficients (MFCC hires), extracted for each speech frame, concatenated with 100-dimensional i-vectors per speaker to facilitate speaker adaptation (Saon et al., 2013).

Acoustic Models: We experimented with three acoustic models: Time Delayed Neural Networks (TDNNs) (Peddinti et al., 2015), Long Short-Term Memory Recurrent Neural Networks (LSTM) and Bi-directional LSTM (Sak et al., 2014). Performance of the BLSTM acoustic model in terms of Word Error Rate is better than the TDNN, but the TDNN has a much better real-time factor while decoding. Hence, for the purpose of the speech translation system, we use the TDNN acoustic model. The TDNN model consists of 5 hidden layers, each layer containing 1024 hidden units, and is trained using the Lattice Free Maximum Mutual Information (LF-MMI) modeling framework in Kaldi (Povey et al., 2016). A Word Error Rate comparison of the different acoustic models can be seen in Table 1. For further details, see Khurana and Ali (2016).

Language Model: We built a Kneser-Ney smoothed trigram language model. The vocabulary size is restricted to the 100k most frequent words to improve the decoding speed and the real-time factor of the system. The choice of using a trigram model instead of an RNN as in our offline systems was essential in keeping the decoding speed at a reasonable value.

Decoder Parameters: Beam size for the de-

coder was tuned to give the best real-time factor with a reasonable drop in accuracy. The final value was selected to be 9.0.

Model   %WER
TDNN    23.0
LSTM    20.9
BLSTM   19.3

Table 1: Recognition results for the LF-MMI trained recognition systems. The LM used for decoding is a trigram. Data augmentation is used before training.

2.3 Machine translation

The MT component is served by an API that connects to several translation systems and allows the user to seamlessly switch between them. We had four systems to choose from for our demo, two of which were phrase-based systems, and two of which were neural MT systems trained using Nematus (Sennrich et al., 2016).

PB-Best: This is a competition-grade phrase-based system, also used for our participation in the IWSLT'16 campaign (Durrani et al., 2016). It was trained using all the freely available Arabic-English data with state-of-the-art features such as a large language model, lexical reordering, OSM (Durrani et al., 2011) and NNJM features (Devlin et al., 2014).

PB-Pruned: The PB-Best system is not suitable for real-time translation and has high memory requirements. To increase efficiency, we dropped the OSM and NNJM features, heavily pruned the language model, and used MML-filtering to select a subset of training data. The resulting system was trained on 1.2M sentences, 10 times less than the original data.

NMT-GPU: This is our best system² that we submitted to the IWSLT'16 campaign (Durrani et al., 2016). The advantage of neural models is that their size does not scale linearly with the data, and hence we were able to train using all available data without sacrificing translation speed. This model runs on the GPU.

NMT-CPU: This is the same model as NMT-GPU, but runs on the CPU. We use the AmuNMT (Junczys-Dowmunt et al., 2016) decoder to use our neural models on the CPU. Because of computation constraints, we reduced the beam size from 12 to 5 with a minimal loss of 0.1 BLEU points.

The primary factors in our final decision were 3-fold: overall quality, translation time, and computational constraints. The translation time has to be small for a live translation system. The performance of the four systems on the official IWSLT test sets is shown in Figure 3.

Figure 3: Performance and translation speed of various MT systems

We also computed the translation speed of each of the systems.³ The results shown in Figure 3 depict the significant time gain we achieved using the pruned phrase-based system. However, with a 5 BLEU point difference in translation quality, we decided to compromise and use the slower NMT-CPU in our final demo. We also allow the user to switch to the phrase-based system if translation speed is more important. We did not use NMT-GPU, since it is very costly to put into production with its requirement for a dedicated GPU card. Finally, we added a customized dictionary and translated unknown words by transliterating them in a post-decoding step (Durrani et al., 2014).

2.4 Combining Speech recognition and Machine translation

To evaluate our complete pipeline, we prepared three in-house test sets. The first set was collected from an in-house promotional video, while the other two sets were collected in a quiet office environment.

² Without performing ensembling.
³ PB-Pruned and NMT-CPU were run using a single CPU on our standard demo machine with an Intel(R) Xeon(R) E5-2660 @ 2.20GHz processor. PB-Best was run on another machine with an Intel(R) Xeon(R) E5-2650 @ 2.00GHz processor due to memory constraints. Finally, NMT-GPU was run using an NVidia GeForce GTX TITAN X GPU card.
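The real-time factors reported in this paper (1.18 for the phrase-based pipeline and 1.26 for the neural one) follow the standard definition: processing time divided by the duration of the audio processed, with values near or below 1.0 keeping pace with live speech. A one-line sketch of that definition (not code from the system):

```python
# Real-time factor (RTF): standard definition, illustrative only.

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means the pipeline runs faster than the audio plays."""
    return processing_seconds / audio_seconds

# e.g. a pipeline needing 59 s to process 50 s of audio has RTF 59/50 = 1.18
```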

We analyzed the real-time performance of the entire pipeline, including the lag induced by translation after the transcription is complete. With an average real-time factor of 1.1 for the speech recognition, our system keeps up with normal speech without any significant lag. The distribution of the real-time factor for speech recognition and translation on the in-house test sets is shown in Figure 4.

Figure 4: Real-time analysis for all audio segments in our in-house test sets

3 Conclusion

This paper presents the QCRI live speech translation system for real-world settings such as lectures and talks. Currently, the system works very well for Arabic, including frequent dialectal words, and also supports code-switching for most common acronyms and English words. Our future aim is to improve the system in several ways: by having a tighter integration between the speech recognition and translation components, incorporating more dialectal speech recognition and translation, and by improving punctuation recovery of the speech recognition system, which will help machine translation to produce better translation quality.

References

Ahmed Ali, Peter Bell, James Glass, Yacine Messaoui, Hamdy Mubarak, Steve Renals, and Yifan Zhang. 2016. The MGB-2 challenge: Arabic multi-dialect broadcast media recognition. In SLT.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT'11), Portland, OR, USA.

Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014. Integrating an unsupervised transliteration model into statistical machine translation. In EACL, volume 14, pages 148–153.

Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and Stephan Vogel. 2016. QCRI machine translation systems for IWSLT 16. In Proceedings of the 15th International Workshop on Spoken Language Translation, IWSLT '16, Seattle, WA, USA.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. CoRR, abs/1610.01108.

Sameer Khurana and Ahmed Ali. 2016. QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge. In Spoken Language Technology Workshop (SLT) 2016, IEEE.

Mirjam Killer and Tanja Schultz. 2003. Grapheme based speech recognition. In INTERSPEECH.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In INTERSPEECH.

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In INTERSPEECH, pages 3214–3218.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahrmani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI.

Hasim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH.

George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. 2013. Speaker adaptation of neural network acoustic models using i-vectors. In ASRU, pages 55–59.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany, August. Association for Computational Linguistics.

Nematus: a Toolkit for Neural Machine Translation

Rico Sennrich† Orhan Firat⋆ Kyunghyun Cho‡ Alexandra Birch† Barry Haddow† Julian Hitschler¶ Marcin Junczys-Dowmunt† Samuel Läubli§ Antonio Valerio Miceli Barone† Jozef Mokry† Maria Nădejde†

†University of Edinburgh ⋆Middle East Technical University ‡New York University ¶Heidelberg University §University of Zurich

Abstract 2 Neural Network Architecture

We present Nematus, a toolkit for Neu- Nematus implements an attentional encoder– ral Machine Translation. The toolkit pri- decoder architecture similar to the one described oritizes high translation accuracy, usabil- by Bahdanau et al. (2015), but with several imple- ity, and extensibility. Nematus has been mentation differences. The main differences are as used to build top-performing submissions follows: to shared translation tasks at WMT and We initialize the decoder hidden state with IWSLT, and has been used to train systems • the mean of the source annotation, rather than for production environments. the annotation at the last position of the en- 1 Introduction coder backward RNN. We implement a novel conditional GRU with Neural Machine Translation (NMT) (Bahdanau et • al., 2015; Sutskever et al., 2014) has recently es- attention. tablished itself as a new state-of-the art in machine In the decoder, we use a feedforward hidden 1 • translation. We present Nematus , a new toolkit layer with tanh non-linearity rather than a for Neural Machine Translation. maxout before the softmax layer. Nematus has its roots in the dl4mt-tutorial.2 We found the codebase of the tutorial to be compact, In both encoder and decoder word embed- • simple and easy to extend, while also produc- ding layers, we do not use additional biases. ing high translation quality. These characteristics Compared to Look, Generate, Update de- make it a good starting point for research in NMT. • coder phases in Bahdanau et al. (2015), we Nematus has been extended to include new func- implement Look, Update, Generate which tionality based on recent research, and has been drastically simplifies the decoder implemen- used to build top-performing systems to last year’s tation (see Table 1). shared translation tasks at WMT (Sennrich et al., 2016) and IWSLT (Junczys-Dowmunt and Birch, Optionally, we perform recurrent Bayesian • 2016). dropout (Gal, 2015). 
Nematus is implemented in Python, and based Instead of a single word embedding at each on the Theano framework (Theano Develop- • ment Team, 2016). It implements an attentional source position, our input representations al- encoder–decoder architecture similar to Bahdanau lows multiple features (or “factors”) at each et al. (2015). Our neural network architecture dif- time step, with the final embedding being the fers in some aspect from theirs, and we will dis- concatenation of the embeddings of each fea- cuss differences in more detail. We will also de- ture (Sennrich and Haddow, 2016). scribe additional functionality, aimed to enhance We allow tying of embedding matrices (Press usability and performance, which has been imple- • and Wolf, 2017; Inan et al., 2016). mented in Nematus.

We will here describe some differences in more detail:

¹Available at https://github.com/rsennrich/nematus
²https://github.com/nyu-dl/dl4mt-tutorial

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 65–68. © 2017 Association for Computational Linguistics

Table 1: Decoder phase differences

  RNNSearch (Bahdanau et al., 2015)          Nematus (DL4MT)
  Look:     c_j ← s_{j−1}, C                 Look:     c_j ← s_{j−1}, y_{j−1}, C
  Generate: y_j ← s_{j−1}, y_{j−1}, c_j      Update:   s_j ← s_{j−1}, y_{j−1}, c_j
  Update:   s_j ← s_{j−1}, y_j, c_j          Generate: y_j ← s_j, y_{j−1}, c_j

Given a source sequence (x_1, …, x_{T_x}) of length T_x and a target sequence (y_1, …, y_{T_y}) of length T_y, let h_i be the annotation of the source symbol at position i, obtained by concatenating the forward and backward encoder RNN hidden states, h_i = [→h_i; ←h_i], and let s_j be the decoder hidden state at position j.

decoder initialization  Bahdanau et al. (2015) initialize the decoder hidden state s with the last backward encoder state:

  s_0 = tanh(W_init ←h_1)

with W_init as trained parameters.³ We use the average annotation instead:

  s_0 = tanh(W_init (1/T_x) Σ_{i=1}^{T_x} h_i)

conditional GRU with attention  Nematus implements a novel conditional GRU with attention, cGRU_att. A cGRU_att uses its previous hidden state s_{j−1}, the whole set of source annotations C = {h_1, …, h_{T_x}} and the previously decoded symbol y_{j−1} in order to update its hidden state s_j, which is further used to decode symbol y_j at position j:

  s_j = cGRU_att(s_{j−1}, y_{j−1}, C)

Our conditional GRU layer with attention mechanism, cGRU_att, consists of three components: two GRU state transition blocks and an attention mechanism ATT in between. The first transition block, GRU_1, combines the previously decoded symbol y_{j−1} and the previous hidden state s_{j−1} in order to generate an intermediate representation s′_j with the following formulations:

  s′_j = GRU_1(y_{j−1}, s_{j−1}) = (1 − z′_j) ⊙ s̃′_j + z′_j ⊙ s_{j−1},
  s̃′_j = tanh(W′ E[y_{j−1}] + r′_j ⊙ (U′ s_{j−1})),
  r′_j = σ(W′_r E[y_{j−1}] + U′_r s_{j−1}),
  z′_j = σ(W′_z E[y_{j−1}] + U′_z s_{j−1}),

where E is the target word embedding matrix, s̃′_j is the proposal intermediate representation, and r′_j and z′_j are the reset and update gate activations. In this formulation, W′, U′, W′_r, U′_r, W′_z, U′_z are trained model parameters; σ is the logistic sigmoid activation function.

The attention mechanism, ATT, inputs the entire context set C along with the intermediate hidden state s′_j in order to compute the context vector c_j as follows:

  c_j = ATT(C, s′_j) = Σ_{i=1}^{T_x} α_ij h_i,
  α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_kj),
  e_ij = v_a^T tanh(U_a s′_j + W_a h_i),

where α_ij is the normalized alignment weight between the source symbol at position i and the target symbol at position j, and v_a, U_a, W_a are trained model parameters.

Finally, the second transition block, GRU_2, generates s_j, the hidden state of the cGRU_att, by looking at the intermediate representation s′_j and the context vector c_j with the following formulations:

  s_j = GRU_2(s′_j, c_j) = (1 − z_j) ⊙ s̃_j + z_j ⊙ s′_j,
  s̃_j = tanh(W c_j + r_j ⊙ (U s′_j)),
  r_j = σ(W_r c_j + U_r s′_j),
  z_j = σ(W_z c_j + U_z s′_j),

similarly, with s̃_j being the proposal hidden state, and r_j and z_j being the reset and update gate activations with trained model parameters W, U, W_r, U_r, W_z, U_z.

Note that the two GRU blocks are not individually recurrent; recurrence only occurs at the level of the whole cGRU layer. This way of combining RNN blocks is similar to what is referred to in the literature as deep transition RNNs (Pascanu et al., 2014; Zilly et al., 2016), as opposed to the more common stacked RNNs (Schmidhuber, 1992; El Hihi and Bengio, 1995; Graves, 2013).

deep output  Given s_j, y_{j−1}, and c_j, the output probability p(y_j | s_j, y_{j−1}, c_j) is computed by a softmax activation, using an intermediate representation t_j:

  p(y_j | s_j, y_{j−1}, c_j) = softmax(t_j W_o)
  t_j = tanh(s_j W_t1 + E[y_{j−1}] W_t2 + c_j W_t3)

W_t1, W_t2, W_t3, W_o are trained model parameters.

³All the biases are omitted for simplicity.
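One step of the conditional GRU with attention can be sketched directly from the equations above. This is a toy numpy reimplementation for illustration, not Nematus code: the dimensions are invented, the parameters are random rather than trained, and biases are omitted as in the paper.

```python
# Toy sketch of one cGRU_att step (GRU1 -> ATT -> GRU2), following the paper's
# equations. Dimensions and random parameters are placeholders, not trained values.
import numpy as np

rng = np.random.default_rng(0)
d, e_dim, Tx = 4, 3, 5          # hidden size, embedding size, source length (toy)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(x, h, P):
    """Generic GRU block without biases: x is the input, h the previous state."""
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)
    h_prop = np.tanh(P["W"] @ x + r * (P["U"] @ h))
    return (1 - z) * h_prop + z * h

def params(in_dim):
    # W* matrices read the input (in_dim), U* matrices read the state (d).
    return {k: rng.standard_normal((d, in_dim if k.startswith("W") else d)) * 0.1
            for k in ["W", "U", "Wr", "Ur", "Wz", "Uz"]}

P1, P2 = params(e_dim), params(2 * d)        # GRU1 reads E[y_{j-1}], GRU2 reads c_j
va = rng.standard_normal(d) * 0.1
Ua = rng.standard_normal((d, d)) * 0.1
Wa = rng.standard_normal((d, 2 * d)) * 0.1   # annotations h_i are 2d (bidirectional)
Winit = rng.standard_normal((d, 2 * d)) * 0.1

def cgru_att(s_prev, Ey_prev, C):
    s_hat = gru(Ey_prev, s_prev, P1)                       # GRU1: intermediate s'_j
    energies = np.array([va @ np.tanh(Ua @ s_hat + Wa @ h) for h in C])
    alpha = np.exp(energies - energies.max())
    alpha /= alpha.sum()                                   # ATT: alignment weights
    c = (alpha[:, None] * C).sum(axis=0)                   # context vector c_j
    return gru(c, s_hat, P2), c                            # GRU2: new state s_j

C = rng.standard_normal((Tx, 2 * d))                       # toy source annotations h_i
s0 = np.tanh(Winit @ C.mean(axis=0))                       # decoder init from mean annotation
s1, c1 = cgru_att(s0, rng.standard_normal(e_dim), C)
print(s1.shape, c1.shape)  # (4,) (8,)
```

Note that recurrence happens only at the level of the whole `cgru_att` call: neither `gru` block feeds its own output back to itself, matching the deep-transition reading of the architecture.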

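When decoding with an ensemble, Nematus takes the probability distribution of the ensemble at each time step to be the geometric average of the individual models' distributions. A sketch of that combination step follows; the toy vocabularies and probabilities are invented for illustration.

```python
# Geometric-average ensemble of per-model next-word distributions.
# Toy distributions; real systems work with log-probabilities over full vocabularies.
import math

def geometric_average(distributions):
    """Combine distributions by geometric mean, renormalized to sum to one.
    Equivalent to averaging log-probabilities across models."""
    vocab = distributions[0].keys()
    n = len(distributions)
    combined = {w: math.exp(sum(math.log(d[w]) for d in distributions) / n)
                for w in vocab}
    z = sum(combined.values())
    return {w: p / z for w, p in combined.items()}

# Two toy models scoring a three-word vocabulary at one decoding step.
m1 = {"hello": 0.7, "hi": 0.2, "hey": 0.1}
m2 = {"hello": 0.5, "hi": 0.4, "hey": 0.1}
ens = geometric_average([m1, m2])
print(max(ens, key=ens.get))  # hello
```

Because the geometric mean penalizes words that any single model considers unlikely, ensembling in log-space tends to be more conservative than an arithmetic average.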
3 Training Algorithms

By default, the training objective in Nematus is cross-entropy minimization on a parallel training corpus. Training is performed via stochastic gradient descent, or one of its variants with adaptive learning rate (Adadelta (Zeiler, 2012), RmsProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014)).

Additionally, Nematus supports minimum risk training (MRT) (Shen et al., 2016) to optimize towards an arbitrary, sentence-level loss function. Various MT metrics are supported as loss function, including smoothed sentence-level BLEU (Chen and Cherry, 2014), METEOR (Denkowski and Lavie, 2011), BEER (Stanojevic and Sima'an, 2014), and any interpolation of implemented metrics.

To stabilize training, Nematus supports early stopping based on cross-entropy, or on an arbitrary loss function defined by the user.

4 Usability Features

In addition to the main algorithms to train and decode with an NMT model, Nematus includes features aimed at facilitating experimentation with the models, and their visualisation. Various model parameters are configurable via a command-line interface, and we provide extensive documentation of options, and sample set-ups for training systems.

Nematus provides support for applying single models, as well as using multiple models in an ensemble; the latter is possible even if the model architectures differ, as long as the output vocabulary is the same. At each time step, the probability distribution of the ensemble is the geometric average of the individual models' probability distributions. The toolkit includes scripts for beam search decoding, parallel corpus scoring and n-best-list rescoring.

Nematus includes utilities to visualise the attention weights for a given sentence pair, and to visualise the beam search graph. An example of the latter is shown in Figure 1. Our demonstration will cover how to train a model using the command-line interface, and show various functionalities of Nematus, including decoding and visualisation, with pre-trained models.⁴

Figure 1: Search graph visualisation for DE→EN translation of "Hallo Welt!" with beam size 3.

5 Conclusion

We have presented Nematus, a toolkit for Neural Machine Translation. We have described implementation differences to the architecture by Bahdanau et al. (2015); due to the empirically strong performance of Nematus, we consider these to be of wider interest.

We hope that researchers will find Nematus an accessible and well documented toolkit to support their research. The toolkit is by no means limited to research, and has been used to train MT systems that are currently in production (WIPO, 2016).

Nematus is available under a permissive BSD license.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL) and 688139 (SUMMA).

⁴Pre-trained models for 8 translation directions are available at http://statmt.org/rsennrich/wmt16_systems/

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Boxing Chen and Colin Cherry. 2014. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 362–367, Baltimore, Maryland, USA.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland.

Salah El Hihi and Yoshua Bengio. 1995. Hierarchical Recurrent Neural Networks for Long-Term Dependencies. In NIPS, volume 409.

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. CoRR, abs/1611.01462.

Marcin Junczys-Dowmunt and Alexandra Birch. 2016. The University of Edinburgh's systems submission to the MT task at IWSLT. In The International Workshop on Spoken Language Translation (IWSLT), Seattle, USA.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Razvan Pascanu, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to Construct Deep Recurrent Neural Networks. In International Conference on Learning Representations 2014 (Conference Track).

Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.

Jürgen Schmidhuber. 1992. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers, pages 83–91, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 368–373, Berlin, Germany.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.

Milos Stanojevic and Khalil Sima'an. 2014. BEER: BEtter Evaluation as Ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 414–419, Baltimore, Maryland, USA.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Quebec, Canada.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - rmsprop.

WIPO. 2016. WIPO Develops Cutting-Edge Translation Tool For Patent Documents, Oct. http://www.wipo.int/pressroom/en/articles/2016/article_0014.html.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.

A tool for extracting sense-disambiguated example sentences through user feedback

Beto Boullosa† and Richard Eckart de Castilho† and Alexander Geyken‡ and Lothar Lemnitzer‡ and Iryna Gurevych†

†Ubiquitous Knowledge Processing Lab, Department of Computer Science, Technische Universität Darmstadt, http://www.ukp.tu-darmstadt.de
‡Berlin-Brandenburg Academy of Sciences, http://www.bbaw.de

Abstract

This paper describes an application system aimed at helping lexicographers in the extraction of example sentences for a given headword based on its different senses. The tool uses classification and clustering methods and incorporates user feedback to refine its results.

1 Introduction

Language is subject to constant evolution and change. Hence, lexicographers are always several steps behind the current state of language in discovering new words, new senses of existing words, cataloging them, and illustrating them using good example sentences. To facilitate this work, lexicographers increasingly rely on automatic approaches that allow sifting efficiently through the ever growing body of digitally available text, something that has brought important gains, including time savings and better use of limited financial and personnel resources.

Among the tasks that benefit from the increasing automation in lexicography is the automatic extraction of suitable corpus-based example sentences for the headwords in the dictionary. Our paper describes an innovative system that handles this task by incorporating user feedback into a computer-driven process of sentence extraction based on a combination of unsupervised and supervised machine learning, in contrast to current approaches that do not include user feedback. Our tool allows querying sentences containing a specific lemma, clustering these sentences by topical similarity to initialize a sense classifier, and interactively refining the sense assignments, continually updating the classifier in the background.

In the next section, we contextualize the task of example extraction; section 3 describes our system; section 4 is devoted to evaluation; section 5 summarizes the conclusions and future work.

2 Extraction of Dictionary Examples

Example sentences can help understanding the nuances of the usage of a certain term, especially in the presence of polysemy. This has become rather important in the last decades, with the shift that has occurred, in the field of dictionary making, from a content-centered to a user-centered perspective (Lew, 2015). With the popularization of online dictionaries, space-saving considerations have lost the importance they once held, making it easier to add example sentences to a given headword.

Didakowski et al. (2012) argue that a system for example extraction should ideally act "like a lexicographer", i.e., it should fully understand the examples themselves, something arguably beyond the scope of current NLP technology. Instead, operational criteria must be used to define "good" examples, as seen also in the work of Kilgarriff et al. (2008): criteria like the presence of typical collocations of the target word, characteristics of the sentence itself, and guaranteeing that all senses of the target word are represented by the extracted example sentences.

Several methods to automate the task have been developed, the most popular being GDEX ("Good Dictionary EXamples") (Kilgarriff et al., 2008). GDEX is a rule-based software tool that suggests "good" corpus examples to the lexicographer according to predefined criteria, including sentence length, word frequency and the presence/absence of pronouns or named entities. The goal of GDEX is to reduce the number of corpus examples to be inspected by extracting only the n "best" examples, the default being 100 sentences. It has been used and adapted for languages other than English.

Didakowski et al. (2012) presented an extractor of good examples based on a hand-written, rule-based grammar, determining the quality of a sentence according to its grammatical structure combined with some simpler features as used by GDEX. None of those works, however, focused on differentiating example sentences according to the word senses of the target words.

Cook et al. (2013) used Word Sense Induction (WSI) to identify novel senses for words in a dictionary. They utilized hierarchical LDA (Latent Dirichlet Allocation) (Teh et al., 2006), a variation of the original LDA (Blei et al., 2003) topic model, to identify novel word senses, later combining this approach with GDEX to allow extracting good example sentences according to word senses (Cook et al., 2014). However, they obtained "encouraging rather than conclusive" results, especially due to limitations of the LDA approach in linking identified topics with word senses.

Our work explores and develops a similar approach of using topic modeling and WSI to cluster sentences according to the senses of a target word, but we take a step further, using the initial clusters as seed for a series of interactive classification steps. The training data for each classification step are sentences whose confidence scores, as calculated by the system, exceed a threshold, together with sentences manually labeled by the user. The process leads to a user-driven refinement of the labeling.

3 System description

The computer-assisted, interactive sense disambiguation process supported by our system involves: 1) importing sentences into a searchable index; 2) retrieving sentences containing a specific word (lemma); 3) clustering selected sentences, providing a starting point for interactive classification; 4) training a multi-class classifier from the initial clusters, which trains on the most representative sentences for each cluster and is then used to label the rest of the sentences; a sentence is "representative" if the confidence score calculated by the system exceeds a configurable threshold; 5) refining the classifier by interactively correcting sense assignments for selected sentences.

The system supports multiple users working in so-called projects, which define, among other things, the list of stopwords available for clustering and classification and the location where the sentences should be retrieved from.

A project contains jobs, corresponding to tasks performed over a certain headword (actually defined by its lemma and POS tag). Tasks include searching for initial sentences, clustering and classification. Users can work on many jobs in parallel and in isolation, which allows calculating inter-rater agreement on the sense disambiguation task.

As for the technologies, the tool was developed using Java Wicket and relying on Solr¹, Mallet², DKPro Core (Eckart de Castilho and Gurevych, 2014), Tomcat and MySQL. The next subsections describe the system in more detail.

¹http://lucene.apache.org/solr/
²http://mallet.cs.umass.edu

3.1 Searching

After starting a job, the user goes to the Search page to look for sentences containing the desired lemma. They are shown in a KWIC (Keyword in Context) table with their ids. When satisfied with the results, the user selects the stopwords to use in the next steps and goes to the Clustering phase.

3.2 Clustering

In the clustering page, the selected sentences are automatically divided into clusters corresponding to topics that ideally relate to the senses of the target word. The user manually configures how many clusters the topic modeling generates and controls the hyper-parameters to fine-tune the process. We currently use Mallet's LDA implementation for topic modeling (McCallum, 2002).

The clustering page lists the selected sentences according to the generated clusters. Each cluster is shown in its own column, with a word cloud on top containing the main words related to it, which helps the user to assess the cluster's quality and meaning. The word sizes correspond to their LDA-calculated weights. The user can change hyper-parameters and regenerate clusters as often as desired, before proceeding to the classification step.

3.3 Classification

In the classification step, the user interactively refines the results, giving feedback to the automatic classification in order to improve the labeling of the sentences according to word senses. The initial automatic labels correspond to the results of the clustering phase. The user starts analyzing each sentence and decides if the automatically assigned sense label is appropriate or if a different label needs to be assigned manually.
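The training-data selection described in step 4 of the process above — a sentence is "representative" if its confidence score exceeds a configurable threshold, and manually labeled sentences are always included — can be sketched as follows. The data structures and field names are invented for illustration; the tool's internal representation is not specified in the paper.

```python
# Sketch of selecting training data for one classification iteration:
# keep sentences the system is confident about, plus all manual labels.
# Field names are invented for illustration.

def select_training_data(sentences, threshold):
    training = []
    for s in sentences:
        if s.get("manual_label") is not None:
            # User feedback always wins and is always used for training.
            training.append((s["text"], s["manual_label"]))
        elif s["confidence"] >= threshold:
            # "Representative" sentence: confidence exceeds the threshold.
            training.append((s["text"], s["auto_label"]))
    return training

sentences = [
    {"text": "Galicia is in north-western Spain.", "auto_label": "region_es",
     "confidence": 0.91, "manual_label": None},
    {"text": "Galicia beat their rivals 2-0.", "auto_label": "region_es",
     "confidence": 0.40, "manual_label": "football_club"},
    {"text": "He travelled through Galicia.", "auto_label": "region_eu",
     "confidence": 0.55, "manual_label": None},
]

train = select_training_data(sentences, threshold=0.8)
print(len(train))  # 2: one confident automatic label plus one manual correction
```

Low-confidence, unlabeled sentences are simply left out of the training set and get re-labeled by the refreshed classifier on the next iteration.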

Figure 1: Classification page

The initial default value for a sentence's manual label is "unseen", indicating that the user has not yet evaluated that sentence. Besides the available sense labels, two special labels can be assigned to a sentence: "neither", to indicate that none of the available labels is applicable, and "unknown", meaning that the user does not know how to label it.

The classification page (figure 1; the numbers below correspond to its elements) has a table listing each sentence with its id (1); automatically assigned sense labels in the previous (5) and current (6) iterations; confidence scores (weights) of the sense label in the previous (3) and current (4) iterations; the manually assigned sense label (7); the sentence text (8); and an indicator telling whether the sense label has changed between the previous and current iteration (2). The page also has a widget detailing the different word senses currently available (10). Besides editing the label of a sense, the user can also add new manual senses (11).

After modifying parameters, adding senses and manually labeling sentences, a new classification iteration can be started. The classifier uses the threshold (9) to identify the training data; the confidence scores are calculated by the classifier (although in the initial classification they come from the topic modeling). Furthermore, manually labeled sentences are also used as training data. We use the Naive Bayes algorithm, a classical approach for Word Sense Disambiguation, known for its efficiency (Hristea, 2013). We use the Mallet implementation of Naive Bayes, with the sentence tokens as features.

4 Evaluation

To evaluate the tool, we conducted experiments in two different scenarios: 1) using no assistive features, annotators classified sentences, identifying the word senses on their own; 2) using automatically generated clusters, annotators let the system suggest senses and then manually assigned labels to sentences, helped by the feedback-based multi-class classifier. Every annotator applied the two scenarios to a target word, namely "Galicia"³, with three senses: a) the Spanish autonomous community; b) the region in Central-Eastern Europe; c) a football club in Brazil.

The senses were randomly distributed over 97 sentences in the first scenario (38/46/13 sentences for the respective senses) and 99 in the second (43/41/15). Sentences in the two scenarios did not overlap, and were taken from a larger dataset of manually annotated sentences. The experiments were performed by two non-lexicographers (computer scientists with NLP background). We measured the time taken in every scenario and calculated the accuracy of the final results compared to the manually annotated gold standard.

Table 1: Evaluation results

  User     | Assist | Accuracy | Time (mm:ss)
  Annot. 1 | No     | 0.96     | 8:05
  Annot. 1 | Yes    | 0.95     | 6:05
  Annot. 2 | No     | 0.90     | 10:05
  Annot. 2 | Yes    | 0.90     | 7:55

Results (Table 1) indicated that the time was significantly reduced when working with full assistance, compared to working without assistance.

³Although proper nouns are not present in conventional dictionaries, but rather in onomastic dictionaries, which usually do not make use of examples, this word serves our evaluation purposes well for methodological reasons.
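A minimal Naive Bayes sense classifier over sentence tokens, in the spirit of the Mallet implementation the tool relies on, can be sketched as follows. This is a toy reimplementation with add-one smoothing, for illustration only; it is not the tool's code, and the example sentences and sense names are invented.

```python
# Toy multinomial Naive Bayes over sentence tokens with add-one smoothing.
# Illustrative reimplementation; the tool uses Mallet's Naive Bayes instead.
import math
from collections import Counter, defaultdict

class TokenNaiveBayes:
    def fit(self, sentences, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, y in zip(sentences, labels):
            for tok in text.lower().split():
                self.word_counts[y][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        scores = {}
        for y in self.classes:
            total = sum(self.word_counts[y].values())
            score = math.log(self.prior[y] / sum(self.prior.values()))
            for tok in text.lower().split():
                # Add-one smoothing over the vocabulary.
                p = (self.word_counts[y][tok] + 1) / (total + len(self.vocab))
                score += math.log(p)
            scores[y] = score
        return max(scores, key=scores.get)

nb = TokenNaiveBayes().fit(
    ["the autonomous community in spain",
     "a region in central eastern europe",
     "the club won the football match"],
    ["sense_a", "sense_b", "sense_c"],
)
print(nb.predict("the football club"))  # sense_c
```

A real deployment would also return the per-class posterior as the confidence score that drives the threshold-based training-data selection.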

Although accuracy varied very little, there was a slight loss of quality in the results of the first annotator when using assistance. This might indicate a negative influence of system suggestions on the annotator, or it could be attributable to a more difficult random selection of the samples in certain cases. These effects call for further investigation. However, using the tool outside the evaluation setup, we noted subjective speedups from developing smart strategies to optimize the use of the information provided by the machine (e.g. quickly annotating sentences containing specific context words, or sorting by sense and confidence score). Thus, we expect the automatic assistance to have a larger impact in actual use than the present evaluation can show.

5 Conclusions and future work

We have introduced a novel system for interactive word sense disambiguation in the context of example sentence extraction. Using a combination of unsupervised and supervised methods, corpus processing and user feedback, our approach aims at helping lexicographers to properly assess and refine a list of pre-analyzed example sentences according to the senses of a target word, alleviating the burden of doing this manually. Using various state-of-the-art techniques, including clustering and classification methods for word sense induction, and incorporating user feedback into the process of shaping word senses and associating sentences with them, the tool can be a valuable addition to a lexicographer's toolset. As next steps, we plan to focus on the extraction of "good" examples, adding support for ranking sentences in different sense clusters according to operational criteria as in Didakowski et al. (2012). We also plan to extend the evaluation and to observe the strategies that users develop, in order to discover whether they can inform further improvements to the system.

Acknowledgments

This work has received funding from the European Union's Horizon 2020 research and innovation programme (H2020-EINFRA-2014-2) under grant agreement No. 654021. It reflects only the author's views and the EU is not liable for any use that may be made of the information contained therein. It was further supported by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 01UG1416B (CEDIFOR) and by the German Research Foundation under grant No. EC 503/1-1 and GU 798/21-1 (INCEpTION).

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.

Paul Cook, Jey Han Lau, Michael Rundell, Diana McCarthy, and Timothy Baldwin. 2013. A lexicographic appraisal of an automatic approach for detecting new word-senses. In Proceedings of eLex 2013, pages 49–65, Tallinn, Estonia.

Paul Cook, Michael Rundell, Jey Han Lau, and Timothy Baldwin. 2014. Applying a word-sense induction system to the automatic extraction of diverse dictionary examples. In Proceedings of the XVI EURALEX International Congress, pages 319–328, Bolzano, Italy.

Jörg Didakowski, Lothar Lemnitzer, and Alexander Geyken. 2012. Automatic example sentence extraction for a contemporary german dictionary. In Proceedings of the XV EURALEX International Congress, pages 343–349, Oslo, Norway.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pages 1–11, Dublin, Ireland, August.

Florentina T. Hristea. 2013. The naïve bayes model in the context of word sense disambiguation. In The Naïve Bayes Model for Unsupervised Word Sense Disambiguation: Aspects Concerning Feature Selection, pages 9–16. Springer, Heidelberg, Germany.

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. 2008. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX International Congress, pages 425–432, Barcelona, Spain.

Robert Lew. 2015. Dictionaries and their users. In International Handbook of Modern Lexis and Lexicography. Springer, Heidelberg, Germany.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Lingmotif: Sentiment Analysis for the Digital Humanities

Antonio Moreno-Ortiz
University of Málaga, Spain
[email protected]

Abstract

Lingmotif is a lexicon-based, linguistically-motivated, user-friendly, GUI-enabled, multi-platform Sentiment Analysis desktop application. Lingmotif can perform SA on any type of input text, regardless of its length and topic. The analysis is based on the identification of sentiment-laden words and phrases contained in the application's rich core lexicons, and employs context rules to account for sentiment shifters. It offers easy-to-interpret visual representations of quantitative data (text polarity, sentiment intensity, sentiment profile), as well as a detailed, qualitative analysis of the text in terms of its sentiment. Lingmotif can also take user-provided plugin lexicons in order to account for domain-specific sentiment expression. Lingmotif currently analyzes English and Spanish texts.

1 Introduction

Lingmotif¹ is a lexicon-based Sentiment Analysis (SA) system that employs a set of lexical sources and analyzes context, by means of sentiment shifters, in order to identify sentiment-laden text segments and produce two scores that qualify a text from a SA perspective. In a nutshell, it breaks down a text into its constituent sentences, where sentiment-carrying words and phrases are searched for, identified, and assigned a valence (i.e., a sentiment index). The overall score for a text is computed as a function of the accumulated negative, positive and neutral scores. Specific domains can be accounted for by applying user-provided dictionaries, which can be imported from CSV files and used along with the application's core dictionary.

Lingmotif's SA approach could be loosely characterized as bag-of-words, since sentiment is computed solely based on the presence of certain lexical items. However, Lingmotif is not just a classifier. It also offers a visual representation of the sentiment profile of texts, allows comparing the profiles of multiple documents side by side, and can process ordered document series. Such features are useful in discourse analysis tasks where sentiment changes are relevant, whether within or across texts, such as political speeches and narratives, or to track the evolution of sentiment towards a given topic (in news, for example).

Being focused on the end user, Lingmotif uses a simple, easy-to-use GUI that allows users to select input and options, and launch the analysis (see Figure 1). Results are generated as an HTML/Javascript document, which is saved to a predefined location and automatically sent to the user's default browser for immediate display. Internally, the application generates results as an XML document containing all the relevant data; this XML document is then parsed against one of several available XSL templates, and transformed into the final HTML.

Lingmotif is available for the Mac OS, MS Windows, and Linux platforms. It is free for non-commercial purposes.

¹This research was supported by Spain's MINECO through the funding of project Lingmotif2 (FFI2016-78141-P).

2 Lexicon-based Sentiment Analysis

Lingmotif is a lexicon-based SA system, since it uses a rich set of lexical sources and analyzes context in order to identify sentiment-laden text segments and produce two scores that qualify a text from a SA perspective. In a nutshell, it breaks down a text into its constituent sentences, where sentiment-carrying words and phrases are searched for, identified, and assigned a valence.
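The scoring process just described — split the text into sentences, look up sentiment-carrying items, and accumulate positive and negative valences into text-level scores — can be sketched as follows. The mini-lexicon and the final 0-100 mapping are invented for illustration; Lingmotif's actual metrics (TSS and TSI, described later) are computed differently.

```python
# Sketch of lexicon-based scoring: per-sentence lookup of valenced items
# (integers on Lingmotif's -5..5 scale), accumulated into text-level totals.
# The mini-lexicon and the 0-100 mapping are invented for illustration.
import re

LEXICON = {"excellent": 4, "good": 2, "bad": -2, "horrible": -4}

def score_text(text):
    pos = neg = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for tok in re.findall(r"[a-z']+", sentence.lower()):
            v = LEXICON.get(tok, 0)
            if v > 0:
                pos += v
            elif v < 0:
                neg += -v
    total = pos + neg
    # Map to a 0-100 scale (50 = neutral), roughly analogous to a polarity score.
    return 50.0 if total == 0 else 100.0 * pos / total

print(round(score_text("The plot is excellent. The ending is bad."), 1))  # 66.7
```

A real system would additionally match multiword expressions before single words and apply context rules to the matched segments, as the following sections describe.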


where sentiment-carrying words and phrases are searched for, identified, and assigned a valence. For each language, Lingmotif uses a core lexicon, a set of context rules, and, optionally, one or more plugin lexicons. In the following sections, the most salient aspects of the application's sentiment analysis engine are described. A more thorough description can be found in (Moreno-Ortiz, 2017).

2.1 Core sentiment lexicon

A lexical item in a Lingmotif lexicon can be either a single word or a multiword expression. Each entry is defined by a specification of its form, part of speech, and valence. The valence is an integer from -5 to -2 for negatives and from 2 to 5 for positives. The item's form can be either a literal string or a lemma. For the part-of-speech specification, Lingmotif uses the Penn Treebank tag set. A wildcard (ALL) can be used for cases where all possible parts of speech for that lemma share the same valence. Sentiment disambiguation is currently dealt with using exclusively formal features: part-of-speech tags and multiword expressions. MWEs usually include words that may or may not have the same polarity as the expression. Including such expressions can solve disambiguation in many cases. For example, we can classify the word kill as negative and then include phrases such as kill time with a neutral valence. When this is not possible, the options are to include the item with the more statistically probable polarity, or simply to leave it out when the chances of the item occurring with one polarity or the other are similar.

2.2 Context rules

Context rules are Lingmotif's mechanism for dealing with sentiment shifters. They work by specifying words or phrases that can appear in the immediate vicinity of the identified sentiment word. Basically, we use the same approach as Polanyi and Zaenen (2006); previously implemented systems following this approach are Kennedy and Inkpen (2006), Taboada et al. (2011), and Moreno-Ortiz et al. (2010). We use simple addition or subtraction (of integers on a -5 to 5 scale in our case). When a context rule is matched, the resulting text segment is marked as a single unit and assigned the calculated valence, as specified in the rule. Lingmotif's context rules were compiled by extensive corpus analysis, studying concordances of common polarity words (adjectives, verbs, nouns, and adverbs), and then testing the rules against texts to further improve and refine them.

Figure 1: Lingmotif's GUI

2.3 Plugin lexicons

Topic has been consistently shown to determine the semantic orientation of a text (Aue and Gamon, 2005; Pang and Lee, 2008). Being a general-purpose SA system, Lingmotif provides a flexible mechanism to adapt to specific domains by means of user-provided lexicons. Lexical information contained in plugin lexicons overrides Lingmotif's core lexicon, providing domain-specific sentiment items. They can be created as a CSV file following a simple format, which is then imported into Lingmotif's internal database.

3 Single and multi-document modes

From a classification perspective, it only makes sense to use a large set of texts to be analyzed (i.e., classified). However, since Lingmotif is able to specifically identify and mark the text segments that convey sentiment, we can take advantage of this feature to measure sentiment not only in the text as a whole, but also in subsections of the text, producing a sentiment map of the text and displaying the results in several ways.
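As an illustration of how such a plugin lexicon could interact with the core lexicon, here is a minimal sketch; the CSV column layout (form, Penn Treebank tag, integer valence), the example entries, and all function names are assumptions made for the example, not Lingmotif's actual format or code:

```python
import csv
import io

# Hypothetical sketch of a Lingmotif-style lexicon: entries keyed by
# (form, POS), with integer valences on the -5..-2 / +2..+5 scale and
# ALL as a part-of-speech wildcard. The CSV layout is assumed.
def load_lexicon(csv_text):
    lexicon = {}
    for form, pos, valence in csv.reader(io.StringIO(csv_text)):
        lexicon[(form.lower(), pos)] = int(valence)
    return lexicon

def lookup(word, pos, core, plugins):
    """Plugin lexicons override the core lexicon; ALL matches any POS."""
    for lex in plugins + [core]:
        for key in ((word.lower(), pos), (word.lower(), "ALL")):
            if key in lex:
                return lex[key]
    return 0  # neutral: the item is in no lexicon

core = load_lexicon("great,JJ,4\nbull,NN,-2\n")
finance = load_lexicon("bull,NN,3\n")  # domain plugin: 'bull' turns positive

print(lookup("bull", "NN", core, plugins=[finance]))
```

With the plugin active, the domain-specific valence (3) wins over the core entry (-2); with plugins=[] the core valence applies.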

A single input text can be typed or pasted in the text box area, or text files can be loaded, depending on the selected input type. Loading files allows the user to select one of them and analyze it in single-document mode, or to select the complete set of files.

3.1 Single-document mode

For every text analyzed, either in single- or multi-document mode, Lingmotif produces a number of metrics for each individual text. The two metrics that summarize the text's overall sentiment are the Text Sentiment Score (TSS) and the Text Sentiment Intensity (TSI). Both are displayed by means of visual, animated (Javascript) gauges at the top of the results page. The numeric indexes (on a 0-100 scale) are categorized in ranges, from "extremely negative" to "extremely positive", to make numeric results more intuitively interpretable by the user. Both gauges are also color- and intensity-coded in the red (negative) to green (positive) range (see Figure 2).

Figure 2: Sentiment scores gauges

The next three sections of the results document are shown in Figure 4 below. First are the quantitative data tables, which include common text metrics and a breakdown of the sentiment analysis data for the analyzed text. The final results section shows the input text after processing, where the identified sentiment items are color-coded to represent their polarity. This makes it possible to know exactly what the analyzer found in the text.

Figure 4: Sentiment scores gauges

An option exists (advanced view) to also show specific data for each lexical item (see Figure 5).

Figure 5: Detail in Advanced Mode
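The mapping from a 0-100 index to a labelled range and a red-green colour can be sketched as follows; the seven labels and the equal-width bins are assumptions, since only the two extreme labels are named above:

```python
# Sketch of bucketing a 0-100 Text Sentiment Score into labelled ranges
# and a red-to-green colour, as the gauges do. The intermediate labels
# and equal-width bins are assumptions, not Lingmotif's actual boundaries.
LABELS = [
    "extremely negative", "very negative", "negative", "neutral",
    "positive", "very positive", "extremely positive",
]

def tss_label(score):
    if not 0 <= score <= 100:
        raise ValueError("TSS/TSI are defined on a 0-100 scale")
    return LABELS[min(int(score / 100 * len(LABELS)), len(LABELS) - 1)]

def gauge_color(score):
    """Linear red (negative) to green (positive) interpolation, as RGB."""
    green = round(255 * score / 100)
    return (255 - green, green, 0)

print(tss_label(12), tss_label(50), tss_label(93))
```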

For long texts, Lingmotif will also generate a sentiment profile, which is a visual representation of the text's internal structure and organization in terms of sentiment expression. This Javascript graph is interactive: hovering over the data points will display the lexical items that make up that particular text segment (see Figure 3).

Figure 3: Document sentiment profile

3.2 Multi-document analysis

Multiple input texts can be analyzed in one of several modes (see below). When in multi-document mode, Lingmotif will analyze documents one by one, generating one HTML file for each, although these will not be displayed in the browser, just saved to the output folder. When the analysis is finished, a single results page is displayed. This page is a summary of results, and is different from the single-document results page: the gauges for TSS and TSI are now the average for the analyzed set, and the detailed analysis section contains a quantitative analysis of each of the files in the set. The first column in this table shows the title of the document (file name without extension) as a hyperlink to the HTML file for that particular file. The available multi-document analysis modes are the following:

• Classifier (default): a stacked bar graph and data table are offered, showing classification results based on the documents' TSS category. The graph offers a visualization of results (see Figure 6); both its legend and the graph itself are interactive. A table summarizing the classification results is also offered.

Figure 7: Multi-document analysis in parallel mode (5 documents)

To facilitate the analysis of large sets of documents, they can be loaded from a single text file where each line is assumed to be an individual document. Lingmotif classifies documents according to their TSS, which will always include the neutral category.

Figure 6: Classifier graph

• Series: the set of loaded files is assumed to be in order, chronological (time series) or otherwise. Each data point in the Sentiment Analysis Profile represents one document; the data point is the average TSS for that particular document.

• Parallel: produces a graph with one line for each file (this mode is limited to 15 documents). This is useful to compare sentiment flow in texts side by side (see Figure 7).

• Merge: a convenience option that merges all loaded individual files into one single text.

4 Conclusions

Lingmotif goes beyond what SA classifiers have to offer. It offers automatic identification of sentiment-laden words and phrases, as well as text segments. Its many visual representations of the text's structure from a sentiment perspective make it a valuable tool for research in the Digital Humanities, where such tasks are relevant. The possibility to integrate user-provided lexicons enables it to adapt to any subject domain.

References

A. Aue and M. Gamon. 2005. Customizing sentiment classifiers to new domains: A case study.

A. Kennedy and D. Inkpen. 2006. Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence 22(2):110–125.

A. Moreno-Ortiz. 2017. Lingmotif: A user-focused sentiment analysis tool. Procesamiento del Lenguaje Natural 58:21–29.

A. Moreno-Ortiz, Á. Pérez Pozo, and S. Torres Sánchez. 2010. Sentitext: sistema de análisis de sentimiento para el español. Procesamiento del Lenguaje Natural 45:297–298.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1–135.

L. Polanyi and A. Zaenen. 2006. Contextual valence shifters. In Computing Attitude and Affect in Text: Theory and Applications, volume 20 of The Information Retrieval Series, pages 1–10. Springer, Dordrecht, The Netherlands. J. G. Shanahan, Y. Qu, and J. Wiebe, editors.

M. Taboada, J. Brooks, M. Tofiloski, K. Voll, and M. Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics 37(2):267–307.

76 RAMBLE ON: Tracing Movements of Popular Historical Figures

Stefano Menini1-2, Rachele Sprugnoli1-2, Giovanni Moretti1, Enrico Bignotti2, Sara Tonelli1 and Bruno Lepri1

1Fondazione Bruno Kessler / Via Sommarive 18, Trento, Italy
2University of Trento / Via Sommarive 9, Trento, Italy
{menini,sprugnoli,moretti,satonelli,lepri}@fbk.eu, {enrico.bignotti}@unitn.it

Abstract

We present RAMBLE ON, an application integrating a pipeline for frame-based information extraction and an interface to track and display movement trajectories. The code of the extraction pipeline and a navigator are freely available; moreover, we display in a demonstrator the outcome of a case study carried out on trajectories of notable persons of the XX Century.

1 Introduction

At a time when there were no social media, emails and mobile phones, interactions were strongly shaped by movements across cities and countries. In particular, the movements of eminent figures of the past were the engine of important changes in different domains such as politics, science, and the arts. Therefore, tracing these movements means providing important data for the analysis of culture and society, fostering so-called cultural analytics (Piper, 2016).

This paper presents RAMBLE ON, a novel application that embeds Natural Language Processing (NLP) modules to extract movements from unstructured texts and an interface to interactively explore motion trajectories. In our use case, we focus on biographies of famous historical figures from the first half of the XX Century extracted from the English Wikipedia. A web-based navigator1 related to this use case is meant for scholars without a technical background, supporting them in discovering new cultural migration patterns with respect to different time periods, geographical areas and domains of occupation. We also release the script to generate trajectories and a stand-alone version of the RAMBLE ON navigator2, where users can upload their own set of movements taken from Wikipedia biographies.

2 Related Work

The analysis of human mobility is an important topic in many research fields such as social sciences and history, where structured data taken from census records, parish registers, mobile phones, etc. are employed to quantify travel flows and find recurring patterns of movements (Pooley and Turnbull, 2005; Gonzalez et al., 2008; Cattuto et al., 2010; Jurdak et al., 2015). Other studies on mobility rely on a great amount of manually extracted information (Murray, 2013) or on shallow extraction methods. For example, Gergaud et al. (2016) detect movements in Wikipedia biographies assuming that cities linked in biography pages are locations where the subject lived or spent some time.

However, we believe that, even if the NLP contribution has been quite neglected in cultural analytics studies, language technologies can greatly support this kind of research. For this reason, in RAMBLE ON we combine state-of-the-art Information Extraction and semantic processing tools and display the extracted information through an advanced interactive interface. With respect to previous work, our application allows the extraction of a wide variety of movements, going beyond the birth-to-death migration that is the focus of Schich et al. (2014) or the transfers of deportees to the concentration camps during Nazism as in Russo et al. (2015).

3 Information Extraction

In Figure 1, we show the general NLP workflow behind information extraction in RAMBLE ON. The goal is to obtain, starting from an unstructured

1Available at http://dhlab.fbk.eu/rambleon/
2Both can be downloaded at http://dh.fbk.eu/technologies/rambleon

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 77–80. © 2017 Association for Computational Linguistics

text, a set of destinations together with their coordinates and a date, each representative of the place where a person moved or lived at the given time-point.

Input Data
In our approach, information extraction is performed on Wikipedia biographical pages. In the first step, these pages are cleaned up by removing infoboxes, tables and tags, keeping only the main body as raw text.

Pre-processing
Raw text is processed using PIKES (Corcoglioniti et al., 2015), a suite of tools for extracting frame-oriented knowledge from English texts. PIKES integrates Semafor (Das et al., 2014), a system for Semantic Role Labeling based on FrameNet (Baker et al., 1998), whose output is used to identify predicates related to movements and their arguments, because its high-level organization in semantic frames is a useful way to generalize over predicates. PIKES also includes Stanford CoreNLP (Manning et al., 2014). Its modules for Named Entity Recognition and Classification (NERC), coreference resolution and recognition of time expressions are used to detect, for each text: (i) mentions related to the person who is the subject of the biography; (ii) locations and organizations that can be movement destinations; (iii) dates.

Frame Selection
Starting from the frames related to the Motion frame in FrameNet and a manual analysis of a set of biographies, we identified 45 candidate frames related to direct (e.g. Departing) or indirect (e.g. Residence) movements of people. After a manual evaluation of these 45 frames on a set of biographies annotated with PIKES, we removed 16 of them from the list of candidate frames because of the high number of false positives. These include, for example, Escaping, Getting underway and Touring. Combining the information from the CoreNLP modules in PIKES with the remaining 29 frames listed in Table 1, our application extracts a list of candidate sentences, each containing a date and a movement of the subject together with a destination. These represent the geographical position of a person at a certain time.

Figure 1: Information extraction workflow.

Table 1: List of frames selected for RAMBLE ON: Arriving, Attending, Becoming a member, Being employed, Bringing, Cause motion, Colonization, Come together, Conquering, Cotheme, Departing, Detaining, Education teaching, Fleeing, Inhibit movement, Meet with, Motion, Receiving, Residence, Scrutiny, Self motion, Sending, Speak on topic, State continue, Temporary stay, Transfer, Travel, Used up, Working on.

Georeferencing
To georeference all the destinations mentioned in the candidate sentences, RAMBLE ON uses Nominatim5. Due to errors by the NERC module (e.g., Artaman League annotated as a geographical entity), some destinations can lack coordinates and are thus discarded. Moreover, for each biography, the places and dates of birth and death of the subject are added as taken from DBpedia.

Output Data
Details about the movements extracted from each Wikipedia biography, e.g. date, coordinates and the original snippet of text, are saved in a JSON file, as shown in the example below. This output format accepts additional fields with information about the subject of the biography, e.g. gender or occupation domain, that could be extracted from other resources.

{
  "name": "Maria_Montessori",
  "movements": [{
    "date": 19151100,
    "place": "Italy",
    "latitude": 41.87194,
    "longitude": 12.5673,
    "predicate_id": "t3147",
    "predicate": "returned",
    "resource": "FrameNet",
    "resource_frame": "Arriving",
    "place_frame": "@Goal",
    "snippet": "Montessori's father died in November 1915, and she returned to Italy."
  }]
}

5https://nominatim.openstreetmap.org/
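Downstream tools can consume this format directly. Below is a minimal sketch that orders a person's movements chronologically into a trajectory; the field names are taken from the example record above, but the second movement (and its coordinates) is invented for illustration:

```python
import json

# Toy RAMBLE ON-style record; the first movement mirrors the example
# above, the second is invented so that sorting has something to do.
record = json.loads("""
{"name": "Maria_Montessori",
 "movements": [
   {"date": 19151100, "place": "Italy",
    "latitude": 41.87194, "longitude": 12.5673, "predicate": "returned"},
   {"date": 19130100, "place": "United States",
    "latitude": 39.78373, "longitude": -100.44588, "predicate": "visited"}
 ]}
""")

def trajectory(person):
    """Return (date, place, lat, lon) stops in chronological order.

    Dates appear to encode YYYYMMDD (19151100 = November 1915, day
    unknown), so plain integer comparison sorts them chronologically.
    """
    return [(m["date"], m["place"], m["latitude"], m["longitude"])
            for m in sorted(person["movements"], key=lambda m: m["date"])]

for stop in trajectory(record):
    print(*stop)
```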

4 Ramble On Navigator

Movements extracted with the procedure described in Section 3 are graphically presented on an interactive map that visualizes trajectories between places. The interface, called RAMBLE ON Navigator, is built using web technologies (HTML5, CSS3, Javascript) and open source libraries for data visualization and geographical representation, i.e. d3.js and Leaflet. Through this interface (see Figure 2), it is possible to filter the movements on the basis of the time span or to search for a specific individual. Moreover, if information about nationality and domain of occupation is provided in the JSON files, the Navigator allows the user to further filter the search. Hovering the mouse on a trajectory, the snippet of text from which it was automatically extracted appears on the bottom left. Information about all the movements related to a place is displayed when hovering on a spot on the map. The trajectories have an arrow indicating the route destination and are dashed if the movement described by the snippet started before the selected time span.

The online version of the Navigator shows the output of the case study presented in Section 5, while the stand-alone application also allows the user to upload another set of data.

5 Case Study

We relied on the Pantheon dataset (Yu et al., 2016) to identify a list of notable figures to be used in our case study. We chose Pantheon since it provides a ready-to-use set of people already classified into categories based on their domain of occupation (e.g., Arts, Sports), birth year, nationality and gender. More specifically, we considered 2,407 individuals from Europe and North America living between 1900 and 1955. First, we downloaded the corresponding Wikipedia pages, as published in April 2016, collecting a corpus of more than 7.5 million words. Then we used the workflow described in Section 3 and enriched the output data with the categories taken from Pantheon.

We manually refined the output by removing the sentences wrongly identified as movements (14.02%), for example those not referring to the subject of the biography (e.g., When communist North Korea invaded South Korea in 1950, he sent in U.S. troops). The final dataset resulted in 2,929 sentences from 1,283 biographies, since 1,124 individuals had no associated movements. This may be due either to an actual lack of sentences concerning movements or to errors in the automatic processing, e.g., missed identification of places or dates. Table 2 shows the distribution per domain of the individuals with associated movements. Moreover, we corrected the coordinates of places wrongly georeferenced (6.7%). The extracted movements are evoked by predicates associated to 66 different lemmas (lexical units in FrameNet). The most frequent lemmas (> 100 occurrences) are: return (567), move (556), visit (253), travel (188), attend (182), go (153), live (111), arrive (107).

Table 2: Domain distribution of individuals in our use case. In brackets, the number of individuals with movements.

DOMAIN                  # of individuals   # of movements
Arts                    647 (348)          788
Science & Technology    591 (318)          631
Humanities              502 (276)          709
Institutions            483 (255)          633
Public Figure           69 (30)            55
Sports                  59 (30)            54
Business & Law          38 (13)            22
Exploration             18 (13)            37
TOTAL                   2,407 (1,283)      2,929

6 Conclusion and Future Work

We presented an automatic approach for the extraction and visualisation of motion trajectories, which is easy to extend to different datasets and can provide insights for studies in many fields, e.g., history and sociology.

In the future, we will mainly focus on improving the system coverage. Currently, missing trajectories are mainly due to (i) the presence of predicates not recognized as lexical units in FrameNet, e.g. exile; (ii) the lack of information in the English Wikipedia biography; and (iii) the presence of sentences with complex temporal structures, e.g., Cummings returned to Paris in 1921 and remained there for two years before returning to New York. These issues can be dealt with by adding missing predicates to FrameNet, extending PIKES to other languages, and experimenting with different systems for temporal information processing (Llorens et al., 2010). We also plan to apply the methodology presented in (Aprosio and Tonelli, 2015) to automatically recognize the Wikipedia text passages dealing with biographical information, so as to discard sections containing useless information.
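The lemma counts reported above can be reproduced from the JSON output with a simple tally. The sketch below counts raw predicate strings over two invented toy records in the Section 3 format; the real counts are over FrameNet lexical units, i.e. lemmatized predicates:

```python
from collections import Counter

# Two invented records in the RAMBLE ON output format, just to
# exercise the tally; real input would be the full set of JSON files.
biographies = [
    {"name": "Person_A", "movements": [{"predicate": "returned"},
                                       {"predicate": "moved"}]},
    {"name": "Person_B", "movements": [{"predicate": "returned"}]},
]

# Tally movement-evoking predicates across all biographies. A real run
# would first map each predicate to its FrameNet lexical unit (lemma),
# e.g. returned -> return; here raw strings stand in for lemmas.
counts = Counter(m["predicate"]
                 for bio in biographies
                 for m in bio["movements"])

print(counts.most_common())
```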

Figure 2: Screenshot of Ramble On Navigator.

References

Alessio Palmero Aprosio and Sara Tonelli. 2015. Recognizing Biographical Sections in Wikipedia. In Proceedings of EMNLP 2015.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL '98, pages 86–90.

Ciro Cattuto, Wouter Van den Broeck, Alain Barrat, Vittoria Colizza, Jean-François Pinton, and Alessandro Vespignani. 2010. Dynamics of person-to-person interactions from distributed RFID sensor networks. PloS one, 5(7):e11596.

Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. 2015. Extracting Knowledge from Text with PIKES. In Proceedings of the International Semantic Web Conference (ISWC).

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.

Olivier Gergaud, Morgane Laouenan, Etienne Wasmer, et al. 2016. A brief history of human time: Exploring a database of 'notable people'. Technical report, Sciences Po Department of Economics.

Marta C. Gonzalez, Cesar A. Hidalgo, and Albert-Laszlo Barabasi. 2008. Understanding individual human mobility patterns. Nature, 453(7196):779–782.

Raja Jurdak, Kun Zhao, Jiajun Liu, Maurice AbouJaoude, Mark Cameron, and David Newth. 2015. Understanding human mobility from Twitter. PloS one, 10(7):e0131469.

Hector Llorens, Estela Saquete, and Borja Navarro. 2010. TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2. In Proceedings of the 5th SemEval, pages 284–291.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations), pages 55–60.

Sarah Murray. 2013. Spatial Analysis and Humanities Data: A Case Study from the Grand Tour Travelers Project. In A CESTA Anthology. Stanford.

Andrew Piper. 2016. There will be numbers. CA: Journal of Cultural Analytics, 1(1).

Colin Pooley and Jean Turnbull. 2005. Migration and mobility in Britain since the eighteenth century. Routledge.

Irene Russo, Tommaso Caselli, and Monica Monachini. 2015. Extracting and Visualising Biographical Events from Wikipedia. In Proceedings of the 1st Conference on Biographical Data in a Digital World.

Maximilian Schich, Chaoming Song, Yong-Yeol Ahn, Alexander Mirsky, Mauro Martino, Albert-László Barabási, and Dirk Helbing. 2014. A network framework of cultural history. Science, 345(6196):558–562.

Amy Zhao Yu, Shahar Ronen, Kevin Hu, Tiffany Lu, and César A. Hidalgo. 2016. Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data, 3.

Autobank: a semi-automatic annotation tool for developing deep Minimalist Grammar treebanks

John Torr
School of Informatics
University of Edinburgh
11 Crichton Street, Edinburgh, UK
[email protected]

Abstract

This paper presents Autobank, a prototype tool for constructing a wide-coverage Minimalist Grammar (MG) (Stabler, 1997), and semi-automatically converting the Penn Treebank (PTB) into a deep Minimalist treebank. The front end of the tool is a graphical user interface which facilitates the rapid development of a seed set of MG trees via manual reannotation of PTB preterminals with MG lexical categories. The system then extracts various dependency mappings between the source and target trees, and uses these in concert with a non-statistical MG parser to automatically reannotate the rest of the corpus. Autobank thus enables deep treebank conversions (and subsequent modifications) without the need for complex transduction algorithms accompanied by cascades of ad hoc rules; instead, the locus of human effort falls directly on the task of grammar construction itself.

1 Introduction

Deep parsing techniques, such as CCG parsing, have recently been shown to yield significant benefits for certain NLP applications. However, the construction of new treebanks for training and evaluating parsers using different formalisms is extremely expensive and time-consuming. The Penn Treebank (PTB) (Marcus et al., 1993), for instance, the most commonly used treebank within NLP, took a team of linguists around three years to develop. Its structures were loosely based on Chomsky's Extended Standard Theory (EST) from the 1970s and, nearly half a century on, these look very different from contemporary Chomskyan Minimalist analyses. Considerable theoretical advances have been made during that time, including the discovery of many robust cross-linguistic generalizations. These could prove very useful for NLP applications such as machine translation, particularly with respect to under-resourced languages. Unfortunately, the lack of any Minimalist treebank to date has meant that there has been very little research into statistical Minimalist parsing (though see Hunter and Dyer (2013)).

Given the labour intensity of constructing new treebanks from scratch, computational linguists have developed techniques for converting existing treebanks into different formalisms (e.g. Hockenmaier and Steedman (2002), Chen et al. (2006)). These approaches generally involve two main subtasks: the first is to create a general algorithm to translate the existing trees into the representational format of the target formalism, for example by binarizing and lexicalizing the source trees and, in the case of CCGbank (Hockenmaier and Steedman, 2002), replacing traces of movement with alternative operations such as type-raising and composition; the second task involves coding cascades of ad hoc rules to non-trivially modify and/or enrich the underlying phrase structures, either because the target formalism requires this, or because the researcher disagrees with certain theoretical decisions made by the original treebank's annotators. CCGbank, for instance, replaces many small clauses in the PTB by a two-complement analysis following Steedman (1996).

Autobank is a new approach to semi-automatic treebank conversion. It was designed to avoid the need for coding complex transduction algorithms and cascades of ad hoc rules. Such rules become far less feasible (and difficult for future researchers to modify) when transducing to a very deep formalism such as a Minimalist Grammar (MG) (Stabler, 1997), whose theory of phrase

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 81–86. © 2017 Association for Computational Linguistics

structure1 differs considerably from that of the PTB2. Furthermore, given the many competing analyses for any given construction in the Minimalist literature, no single MG treebank will be universally accepted. It is therefore hoped that Autobank will stimulate wider interest in broad-coverage statistical Minimalist parsing by providing researchers with a relatively quick and straightforward way to engineer their own MGs and treebanks, or to modify existing ones.

Autobank works as follows: the PTB first undergoes an initial preprocessing phase. Next, the researcher builds a seed MG corpus by annotating lexical items on PTB trees with MG categories and then selecting from among a set of candidate parses which are output using these categories by MGParse, an Extended Directional Minimalist Grammar (EDMG) parser (see Torr and Stabler (2016)). The system then extracts various dependency mappings between the source and target trees. Next, a set of candidate parses is generated for the remaining trees in the PTB, and these are scored using the dependency mappings extracted from the seeds. In this way, the source corpus effectively adopts a disambiguation role in lieu of any statistical model. The basic architecture of the system, which was implemented in Python and its Tkinter module, is given in fig 1.

2 Preprocessing

Autobank includes a module for preprocessing the PTB which corrects certain mistakes and adds some additional annotations carried out by various researchers since the treebank's initial release. For example, following Hockenmaier and Steedman (2002), we have corrected cases where verbs were incorrectly labelled with the past tense tag VBD instead of the past participle tag VBN. We also extend the PTB tag set to include person, number and gender3 information on nominals and pronominals, in order to constrain reflexive binding and agreement phenomena in MGbank.

The semantic role labels of PropBank (Palmer et al., 2005) and Nombank (Meyers et al., 2004) have also been added onto PTB non-terminals4, along with the head word and its span. For instance, the AGENT subject NP Jack in the sentence Jack helped her would be annotated with the tag ARG0{helped<1,2>}. We have also added the additional NP structure from Vadas and Curran (2007), the additional structure for hyphenated compounds included in the OntoNotes 5 (Weischedel et al., 2012) version of the PTB, and the additional structure and role labels for coordination phrases recently released by Ficler and Goldberg (2016). For all structure added, any function tags are redistributed accordingly5.

Finally, PropBank includes additional antecedent-trace co-indexing which we have also imported, and some of this implies the need for additional NP structure beyond what Vadas and Curran have provided. For instance, in the phrase the unit of New York-based Lowes Corp that *T* makes kent cigarettes, the original annotation has the unit and of New York-based Lowes Corp as separate sister NP and PP constituents (with an SBAR node sister to both), both of which are co-indexed with the subject trace (*T*) position in PropBank. In such cases we have added an additional NP node resolving the two constituents into a single antecedent NP.

3 The manual annotation phase

Autobank provides a powerful graphical user interface enabling the researcher to construct an (ED)MG by relabelling PTB preterminals with MG categories and selecting the correct MG parse from a set of candidates generated by the parser. The main annotation environment is shown in fig 2. The PTB tree and its MG candidates are respectively displayed on the top and bottom of the screen. Between these are a number of buttons allowing for easy navigation through the PTB, including a regular expression search facility for locating specific construction types by searching both the bracketing and the string. The user can also choose to focus on sentences of a given string length. On the left, the sentence is displayed from top to bottom, each word with a drop-down menu listing all MG categories so far associated with that word's PTB preterminal category. Like

1MGs are a mildly context sensitive and computational interpretation of Chomsky's (1995) Minimalist Program.
2For example, Minimalist trees contain many more null heads and traces of phrasal movement, along with shell and Xbar phrase structures, cartographic clausal and nominal structures, functional heads, head movements, covert movements/Agree operations etc.
3We used the NLTK name database to derive gender on proper nouns.
4Among other things, these crucially distinguish adjuncts from arguments, raising/ECM from subject/object control, and promise-type subject control from object control/ECM.
5We use a modified version of Collins' (1999) head finding rules for this task as well as for the dependency extraction.

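The enriched tag format described in Section 2 — a PropBank/NomBank role plus the head word and its span, as in ARG0{helped<1,2>} — can be reproduced with a one-line formatter. The function below is a hypothetical sketch that mirrors the printed format only, not Autobank's actual code:

```python
# Hypothetical sketch: compose a role annotation of the form
# ROLE{headword<start,end>}, mirroring the ARG0{helped<1,2>} example
# above. This reproduces the surface format only, not Autobank's code.
def role_tag(role, head_word, start, end):
    return "%s{%s<%d,%d>}" % (role, head_word, start, end)

# The AGENT subject NP of "Jack helped her": role ARG0, head word
# "helped" spanning word indices 1-2.
print(role_tag("ARG0", "helped", 1, 2))  # ARG0{helped<1,2>}
```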
Figure 1: Autobank architecture.

Figure 2: The main annotation environment.

CCG, EDMG is a strongly lexicalized and deriva- features up the tree via a simple unification mech- tional formalism, with all subcategorization and anism7. In lieu of any statistical model, this linear order information encoded on lexical items. enables the human annotator to tightly constrain EDMG categories are sequences of features or- the grammar and thus reduce the amount of lo- dered from left to right which are checked and cal and global ambiguity present during manual deleted as the derivation proceeds. For example, and automatic annotation. For example, the cat- the hypothetical category, d= =d v6, could be used egory: v{+TRANS.x}= +case{+ACC.-NULL} to represent a transitive verb that first looks to its =d lv{TRANS.x}, could be used for the null right for an object with d as its first feature (i.e. a causative light verb (so-called little v) that is stan- DP), before selecting a DP subject on its left and dardly assumed in Minimalism to govern a main thus yielding a constituent of category v, i.e. a VP. transitive verb. It specifies that its VP complement MGParse category features can also include must have the property TRANS and that the ob- subcategorization and agreement properties and requirements, and allow for percolation of such 7As in CCG, such unification is limited to atomic property values rather than the sorts of unbounded feature structures 6This category (which is actually used for ditransitives by found in Head-Driven Phrase Structure Grammars; unifica- MGParse) is similar to the CCG category (S NP)/NP. tion here is therefore not re-entrant. \

ject whose case feature it will check (in this case via covert movement) must have the property ACC but must not have the property NULL, and that following selection of an AGENT DP specifier the resulting vP will have the property TRANS. Furthermore, any additional properties carried by the VP complement (such as PRES, 3SG etc.) will be percolated onto the vP (or rather onto its selectee (lv) feature) owing to the x variable.

Both overt and null (i.e. phonetically silent) MG categories can be added to the system and later modified using the menu system at the top of the screen. Any time the user attempts to modify a category, the system will first reparse any trees in the seed set containing that category to ensure that the same Xbar tree can still be generated for the sentence in question following the modification8.

Once the user has selected an MG category for each word in the sentence, clicking parse causes MGParse to return all possible parses using these categories. Parsing without annotating some or all of the words in the sentence is also possible: MGParse will simply try all available MG categories already associated with a word's PTB preterminal in the seed set. Once returned, the trees can be viewed in several formats, including multiple MG derivation tree formats, and Xbar tree and MG (bare phrase structure) derived tree formats. Candidate parses can be viewed side-by-side for comparison, and there is the option to perform a diff on the bracketings, and to eliminate incorrect candidates from consideration. Once the user has identified the correct tree, they can click add to seeds to save it; seeds can be viewed and removed at any point using the native file system.

There will inevitably be occasions when the parser fails to return any parses. In these cases it is useful to build up the derivation step-by-step to identify the point where it fails, and Autobank provides an interface for doing just this (see Fig. 3). Whereas in annotation mode null heads were kept hidden for simplicity, in derivation mode the entire null lexicon is available, along with the overt categories the user selected on the main annotation screen. Other features of the system include a test sentence mode, a corpus stats display, a facility for automatically detecting the best candidate MG parse (to test the performance of the automatic annotator, and speed up annotation), a parser settings menu, and an option for backing up all data.

The extreme succinctness of the (strongly) lexicalized (ED)MG formalism, as discussed in Stabler (2013), means that seed set creation can be a relatively rapid process. Like CCGs, MGs have abstract rule schemas which generalize across categories, thus dramatically reducing the size of the grammar. Taking CCGbank's 1300 categories as an approximate upper bound, a researcher working five days a week on the annotation process and adding 20 MG categories per day to the system should have added enough MG categories to parse all the sentences of the PTB within 3 months.

4 The automatic annotation phase

Once the user has created an initial seed set, they can select auto generate corpus from the corpus menu. This prompts the system to extract a set of dependency mappings and lexical category mappings holding between each seed MG tree and its PTB source tree. To achieve this, the system traverses the PTB and MG trees and extracts a Collins-style dependency tuple for every non-head child of every non-terminal in each tree. The tuples include the head child and non-head child categories, the parent category, any relevant function tags, the directionality of the dependency, and the head child's and non-head child's head words and spans. Where there are multiple tuples in a tree with the same head and non-head word spans, these are grouped together into a chain.

For example, for the sentence the doctor examined Jack, the AGENT subject NP in a PTB-style tree would yield the following dependency: [VP, examined, <2,3>, NP, doctor, <1,2>, S, [ARG0, SUBJ], left]. Many Minimalists assume that AGENT subjects are base-generated inside the verb phrase (Koopman and Sportiche, 1991) in spec-vP, before moving to the surface subject position in spec-TP. Although in its surface position the subject is a dependent of the T(ense) head, it is the lexical verb which is the semantic head of the extended projection (i.e. of the CP clause containing it), and both syntactic and semantic heads are used here when generating the tuples9. Two tuples are therefore extracted for the dependency

8 In general, subcategorization properties and requirements are the only features on an MG category that can be modified without first removing all seed parses containing that category, as they do not affect the Xbar tree's geometry (though they can license or prevent its generation).
9 This also allows the system to capture certain systematic changes in constituency and recognize, for instance, that temporal adjuncts tagged with a TMP label and attaching to VP in the PTB should attach to TP in the MG tree.
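The Collins-style tuple extraction described above can be sketched in a few lines of Python. The tree encoding and the per-node head indices below are invented for illustration (they stand in for the modified Collins head-finding rules the system actually uses), but the sketch reproduces the example tuple for the doctor examined Jack:

```python
# Sketch of Collins-style dependency tuple extraction (illustrative
# simplification; Autobank's real head-finding and tag handling are richer).

def head_word(node):
    """Follow head children down to the lexical head word and its span."""
    while "children" in node:
        node = node["children"][node["head"]]
    return node["word"], node["span"]

def extract_tuples(node, tuples=None):
    """Collect one tuple per non-head child of every non-terminal:
    [head cat, head word, head span, dep cat, dep word, dep span,
     parent cat, function tags, direction of the dependency]."""
    if tuples is None:
        tuples = []
    if "children" not in node:
        return tuples  # leaves yield no tuples
    head = node["children"][node["head"]]
    hw, hs = head_word(head)
    for i, child in enumerate(node["children"]):
        if i == node["head"]:
            continue
        dw, ds = head_word(child)
        direction = "left" if i < node["head"] else "right"
        tuples.append([head["cat"], hw, hs, child["cat"], dw, ds,
                       node["cat"], child.get("tags", []), direction])
    for child in node["children"]:
        extract_tuples(child, tuples)
    return tuples

# "the doctor examined Jack": the subject NP is the non-head child of S.
the  = {"cat": "DT",  "word": "the",      "span": (0, 1)}
doc  = {"cat": "NN",  "word": "doctor",   "span": (1, 2)}
vrb  = {"cat": "VBD", "word": "examined", "span": (2, 3)}
jack = {"cat": "NNP", "word": "Jack",     "span": (3, 4)}
np   = {"cat": "NP", "tags": ["ARG0", "SUBJ"], "children": [the, doc], "head": 1}
vp   = {"cat": "VP", "children": [vrb, jack], "head": 0}
s    = {"cat": "S",  "children": [np, vp], "head": 1}

print(extract_tuples(s)[0])
# ['VP', 'examined', (2, 3), 'NP', 'doctor', (1, 2), 'S', ['ARG0', 'SUBJ'], 'left']
```

The same traversal applied to the MG tree, with both its syntactic and semantic heads, would yield the paired tuples that the mapping step then aligns by span.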

Figure 3: The step-by-step derivation builder.

between the verb and the subject, and together they form a chain representing the latter's deep and surface positions. The system then establishes mappings from PTB tuples/chains to MG tuples/chains which share the same head and non-head spans (this includes instances where the relation between the head and dependent has been reversed).

Each mapping is stored in three forms of varying degrees of abstraction. In the first, all word-specific information (i.e. the head and non-head words and their spans) is deleted; in the second the spans and non-head word are deleted, but (a lemmatized version of) the head word is retained; while in the third the spans are deleted but both the (lemmatized) head and non-head words are retained. The fully reified mapping for our subject dependency, for instance, would be: [VP, examine, NP, doctor, S, [ARG0, SUBJ], left] —> [[T', examine, DP, doctor, TP, left], [v', examine, DP, doctor, vP, left]]. Including both abstract and reified mappings allows the system to recognise not only general phrase structural correspondences, but also more idiosyncratic mappings conditioned by specific lexical items, as in the case of idioms and light verb constructions, for instance.

The system next parses the remaining sentences, selecting a set of potential MG categories for each word in each sentence using the lexical category mappings10. Whenever a dependency mapping is discovered which has previously been seen in the seeds, the MG tree containing it is awarded a point; the candidate with the most points is added to the Auto MGbank. The abstract mapping above, for instance, ensures that for transitive sentences, trees containing the subject trace in spec-vP are preferred. The user can choose to specify the number and maximum string length of the trees that are automatically generated (together with a timeout value for the parser) and can then inspect the results and transfer any good trees into the seed corpus, thereby rapidly expanding it.

5 Conclusion

Autobank is a GUI tool currently being used to semi-automatically construct MGbank, a deep Minimalist version of the PTB. Minimalism is a lively theory, however, and in continual flux. Autobank was therefore designed with reusability in mind, in the hope that other researchers will use it to create alternative Minimalist treebanks, either from scratch or by modifying an existing one, and to stimulate greater interest in computational Minimalism and statistical MG parsing. The system could also potentially be adapted for use with other source and (lexicalised) target formalisms.

10 Note that there will be many MG categories for every PTB category, which will make parsing quite slow for certain sentences. To ameliorate this, once enough seeds have been added, an MG supertagger (see Lewis and Steedman (2014)) will be trained and used to reduce the amount of lexical ambiguity; to improve things further, multiple sentences will be processed in parallel during automatic annotation.
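The three abstraction levels of a mapping and the point-scoring used to pick the best candidate MG tree can be sketched as follows. This is a hypothetical simplification: the tuple fields, the seed contents and the candidate names are invented for illustration and are not Autobank's actual code.

```python
# Sketch of the three mapping abstraction levels and candidate scoring
# (illustrative data structures only).

def mapping_forms(head_cat, head_lemma, dep_cat, dep_lemma,
                  parent, tags, direction):
    """Return the (fully abstract, head-lexicalised, fully reified) forms
    of a PTB-side tuple: spans are always dropped, words dropped stepwise."""
    abstract = (head_cat, dep_cat, parent, tags, direction)
    head_lex = (head_cat, head_lemma, dep_cat, parent, tags, direction)
    reified  = (head_cat, head_lemma, dep_cat, dep_lemma, parent,
                tags, direction)
    return abstract, head_lex, reified

def best_candidate(candidates, seed_mappings):
    """Award one point per mapping previously seen in the seeds; return
    the id of the candidate MG tree with the most points."""
    return max(candidates, key=lambda c: len(c[1] & seed_mappings))[0]

forms = mapping_forms("VP", "examine", "NP", "doctor",
                      "S", ("ARG0", "SUBJ"), "left")
print(forms[0])  # ('VP', 'NP', 'S', ('ARG0', 'SUBJ'), 'left')

# A candidate whose subject trace yields a mapping already seen in the
# seeds beats one without it:
seeds = {forms[0], ("V'", "DP", "vP", (), "left")}
cands = [("no-trace",   {forms[0]}),
         ("with-trace", {forms[0], ("V'", "DP", "vP", (), "left")})]
print(best_candidate(cands, seeds))  # with-trace
```

Storing all three forms as keys lets a lookup fall back from lexicalised to fully abstract mappings, mirroring the trade-off between idiosyncratic and general correspondences described above.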

Acknowledgements

Many thanks to Ed Stabler for providing the initial inspiration for this project, and for his subsequent guidance and encouragement last summer. The work described in this paper was funded by Nuance Communications Inc. and the Engineering and Physical Sciences Research Council. I would also like to thank the reviewers for their comments, and also my supervisors Mark Steedman and Shay Cohen for their help and guidance during the development of the Autobank system and the writing of this paper.

References

John Chen, Srinivas Bangalore, and K. Vijay-Shanker. 2006. Automated extraction of tree-adjoining grammars from treebanks. Natural Language Engineering, 12(3):251–299.

Noam Chomsky. 1995. The Minimalist Program. MIT Press, Cambridge, Massachusetts.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Jessica Ficler and Yoav Goldberg. 2016. Coordination annotation extension in the Penn Tree Bank. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, Volume 1: Long Papers, pages 834–842.

Julia Hockenmaier and Mark Steedman. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1974–1981.

Tim Hunter and Chris Dyer. 2013. Distributions on minimalist grammar derivations. In Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), pages 1–11, Sofia, Bulgaria, August. The Association of Computational Linguistics.

Hilda Koopman and Dominique Sportiche. 1991. The position of subjects. Lingua, 85(2-3):211–258.

Mike Lewis and Mark Steedman. 2014. Improved CCG parsing with semi-supervised supertagging. Transactions of the Association for Computational Linguistics, 2:327–338.

Mitch Marcus, Beatrice Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank project: An interim report. In Proceedings of the HLT-EACL Workshop: Frontiers in Corpus Annotation.

Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Edward Stabler. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics (LACL'96), volume 1328 of Lecture Notes in Computer Science, pages 68–95, New York. Springer.

Edward Stabler. 2013. Two models of minimalist, incremental syntactic analysis. Topics in Cognitive Science, 5:611–633.

Mark Steedman. 1996. Surface Structure and Interpretation. Linguistic Inquiry Monograph 30. MIT Press, Cambridge, MA.

John Torr and Edward P. Stabler. 2016. Coordination in minimalist grammars: Excorporation and across the board (head) movement. In Proceedings of the Twelfth International Workshop on Tree Adjoining Grammar and Related Formalisms (TAG+12), pages 1–17.

David Vadas and James Curran. 2007. Adding noun phrase structure to the Penn Treebank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 240–247, Prague, June. ACL.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Nianwen Xue, Martha Palmer, Jena D. Hwang, Claire Bonial, Jinho Choi, Aous Mansouri, Maha Foster, Abdel aati Hawwary, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, and Ann Houston. 2012. OntoNotes Release 5.0 with OntoNotes DB Tool v0.999 beta.

Chatbot with a Discourse Structure-Driven Dialogue Management

Boris Galitsky (Knowledge Trail Inc., San Jose, CA, USA) [email protected]
Dmitry Ilvovsky (National Research University Higher School of Economics, Moscow, Russia) [email protected]

Abstract

We build a chat bot with iterative content exploration that leads a user through a personalized knowledge acquisition session. The chat bot is designed as an automated customer support or product recommendation agent assisting a user in learning product features, product usability, suitability, troubleshooting and other related tasks. To control the user navigation through content, we extend the notion of a linguistic discourse tree (DT) towards a set of documents with multiple sections covering a topic. For a given paragraph, a DT is built by DT parsers. We then combine DTs for the paragraphs of documents to form what we call an extended DT, which is a basis for interactive content exploration facilitated by the chat bot. To provide cohesive answers, we use a measure of rhetoric agreement between a question and an answer by tree kernel learning of their DTs.

1 Introduction

Modern search engines have become very good at understanding typical, most popular user intents, recognizing the topic of a question and providing relevant links. However, search engines are not necessarily capable of providing an answer that would match the style, personal circumstances, knowledge state, or attitude of the user who formulated a query. This is particularly true for long, complex queries, and for a dialogue-based type of interaction. In a chat bot, the flow query – clarification request – clarification response – candidate answer should be cohesive, not just maintain the topic of a conversation. Moreover, modern search engines and modern chat bots are unable to leverage an immediate, explicit user feedback on what is most interesting and relevant to them.

The chat bot we introduce in this demo paper is inspired by the idea that knowledge exploration should be driven by navigating a single discourse tree (DT) built for the whole corpus of relevant content. We refer to such a tree as an extended discourse tree. Moreover, to select rhetorically cohesive answers, the chat bot should be capable of classifying question-answer pairs as cohesive or not. This can be achieved by learning the pairs of extended discourse trees for the question-answer pairs. A question can have an arbitrary rhetoric structure as long as the subject of this question is clear to its recipient. An answer on its own can have an arbitrary rhetoric structure. However, these structures, the DT of a question and the DT of an answer, should be correlated when this answer is appropriate to this question. We apply a computational measure of how well the logical, rhetoric structure of a request or a question is in agreement with that of a response, or an answer.

Over the last decade, Siri for iPhone and Cortana for Windows Phone have been designed to serve as digital assistants. They analyze input question sentences and return suitable answers for users' queries (Crutzen et al., 2011). However, they assume patterned word sequences as input commands. Moreover, there are previous studies that combine natural language processing techniques with ontology technology to implement computer systems for intellectual conversation. There are chat bot systems including ALICE3, which utilizes an ontology like Cyc, and API.ai and Amazon Lex, platforms for developers to build chat bots. Most of these systems expect the correct formulation of a question, certain domain knowledge and a rigid grammatical structure in user query sentences; however, less rigidly structured sentences can appear in a user's utterance to a chat bot. Developers of chat bot platforms

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 87–90. © 2017 Association for Computational Linguistics

can specify preset Q/A pairs, but are unable to introduce domain-independent rhetoric constraints, which are much more flexible and reusable.

Figure 1: Sample dialogue. User questions and responses are aligned on the left and system responses on the right.

Figure 2: Extended discourse tree for three documents used to navigate to a satisfactory answer.
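The kind of navigation over an extended discourse tree depicted in Figure 2 can be sketched as a small loop: offer the satellites of the current node as topics, descend on acceptance, and backtrack to the next-best matching section otherwise. This is a hypothetical simplification written for illustration; the data structures and function names are invented, not taken from the system itself.

```python
# Sketch of extended-DT navigation (illustrative simplification).

def navigate(sections, choose):
    """sections: DT root nodes ranked by match to the user query, each a
    dict with "text" and optional "satellites" (candidate topics);
    choose(topics) returns the index of an accepted topic, or None to
    decline all topics. Returns the text where the user stopped, or None
    if every section was exhausted (user gives up, new session starts)."""
    for root in sections:                # best-matching section first
        node = root
        while True:
            topics = node.get("satellites", [])
            if not topics:
                return node["text"]      # nothing further to drill into
            picked = choose(topics)
            if picked is None:
                break                    # backtrack to next-best section
            node = topics[picked]        # continue along the chosen edge

# A tiny tree echoing the personal-finance topics from the paper:
dt = [{"text": "credit cards",
       "satellites": [{"text": "balance transfer"},
                      {"text": "canceling your credit card"}]}]
# A user who accepts the first offered clarification topic:
print(navigate(dt, lambda topics: 0))  # balance transfer
```

Declining all topics at the root (choose returning None) moves the session on to the next-best section, matching the backtracking behaviour described in Section 3.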

2 Personalized and Interactive Domain Exploration Scenarios

Most chat bots are designed to imitate human intellectual activity by maintaining a dialogue. The purpose of the chat bot with iterative exploration is to provide the most efficient and effective information access for users. A vast majority of chat bots nowadays, especially those based on deep learning, try to build a plausible sequence of words (Williams, 2012) to serve as an automated response to a user query. Such most plausible responses, sequences of syntactically and semantically compatible words, are not necessarily the most informative. Instead, in this demo we focus on a chat bot that helps a user to navigate to the exact, insightful answer as fast as possible.

For example, if a user formulates her query Can I use one credit card to pay for another, the chat bot attempts to recognize the user intent and the background knowledge about this user to establish a proper context. The chat bot leverages an observation learned from the web that an individual would usually want to pay with one credit card for another to avoid a late payment fee when cash is unavailable, as long as the user does not specify her other circumstances.

To select a suitable answer from a search engine, a user first reads snippets one-by-one and then proceeds to the linked document to consult it in detail. Reading answer #n does not usually help to decide which answer should be consulted next, so a user simply proceeds to answer #n+1. Since answers are sorted by popularity, for a given user there is no better way than to just proceed from top to bottom on the search results page. On the contrary, the chat bot allows a convergence in this answer navigation session, since answer #n+1 is suggested based on additional clarification submitted after answer #n is consulted by the user. The chat bot provides topics of answers for a user to choose from. These topics give the user a chance to assess how her request was understood on the one hand, and to restore the knowledge area associated with her question on the other. In our examples, topics include balance transfer, using funds on a checking account, or canceling your credit card. A user is prompted to select a clarification option, drill into either of these options, or decline all options and request a new set of topics which the chat bot can identify.

3 Navigation with the Extended DT

On the web, information is usually represented in web pages and documents, with a certain section structure. Answering questions, forming topics of candidate answers and attempting to provide an answer based on a user-selected topic are operations which can be represented with the help of a structure that includes the DTs of the texts involved. When a certain portion of text is suggested to a user as an answer, this user might want to drill into something more specific, ascend to a more general level of knowledge, or make a side move to a topic at the same level. These user intents of navigating from one portion of text to another can be represented in most cases as coordinate or subordinate discourse relations between these portions.

Rhetorical Structure Theory (RST), proposed

by (Mann and Thompson, 1988), became popular as a framework for parsing the discourse relations and discourse structure of a text; it represents texts by labeled hierarchical structures such as DTs. A DT expresses the author's flow of thoughts at the level of a paragraph or multiple paragraphs. But DTs become fairly inaccurate when applied to larger text fragments, or documents. To enable a chat bot with a capability to form and navigate a DT for a corpus of documents with sections and hyperlinks, we introduce the notion of an extended DT. To avoid inconsistency we will further refer to the original DTs as conventional. To construct conventional discourse trees we used one of the existing discourse parsers (Joty et al., 2013).

The intuition behind navigation is illustrated in Fig. 2. Three areas in this chart denote three documents, each containing three section headers and section content. For section content, discourse trees are built for the text. A navigation starts with the root node of the section that matches the user query most closely. Then the chat bot attempts to build a set of possible topics, which are the satellites of the root node of the DT. If the user accepts a given topic, the navigation continues along the chosen edge. Otherwise, when no topic covers the user interest, the chat bot backtracks the DT and proceeds to the other section (possibly of another document) which matched the original user query second best. Finally, the chat bot arrives at the DT node where the user is satisfied (shown by a hexagon) or gives up and starts a new exploratory session. Inter-document and inter-section edges for relations such as elaboration play a similar role in knowledge exploration navigation to the internal edges of a conventional discourse tree.

4 Relying on Q/A rhetoric agreement to select the most suitable answers

In the conventional search approach, as a baseline, Q/A match is measured in terms of keyword statistics such as TF*IDF. To improve search relevance, this score is augmented by item popularity, item location or a taxonomy-based score (Galitsky et al., 2013; Galitsky et al., 2014). The feature space includes Q/A pairs as elements, and a separation hyper-plane splits this feature space. We combine DT(Q) with DT(A) into a single tree with the root Q/A pair (Fig. 3). We then classify such pairs into correct (with high agreement) and incorrect (with low agreement). Having multiple answer candidates, we select those with the higher classification score. The classification algorithm is based on tree kernel learning over conventional discourse trees. Their regular nodes are rhetoric relations, and terminal nodes are elementary discourse units (phrases, sentence fragments) which are the subjects of these relations.

Figure 3: Forming the Request-Response pair as an element of a training set.

We selected the Yahoo! Answers set of question-answer pairs with a broad range of topics. Out of the set of 4.4 million user questions we selected 20000 which included more than two sentences. Answers for most of the questions are fairly detailed, so no filtering was applied to answers. There are multiple answers per question and the best one is marked. We consider the pair Question – Best Answer as an element of the positive training set and Question – Other Answer as one of the negative training set. To derive the negative training set, we either randomly selected an answer to a different but somewhat related question, or formed a query from the question and obtained an answer from web search results.

5 Evaluation

We compare the efficiency of information access using the proposed chat bot with a major web search engine such as Google, for queries where both systems have relevant answers. For a search engine, misses are search results preceding the one relevant for a given user. For a chat bot, misses are answers which cause a user to choose other options suggested by the chat bot, or to request other topics.

Topics of questions include personal finance. Twelve users (the authors' colleagues) asked the chat bot 15-20 questions each, reflecting their financial situations, and stopped when they were either satisfied with an answer or dissatisfied and gave up. The same questions were sent to Google, and evaluators had to click on each search result snippet to get the document or a web page and decide on

whether they can be satisfied with it.

Figure 4: Comparing conventional search engine with chat bot in terms of a number of iterations.

The structure of the comparison of search efficiency for the chat bot vs the search engine is shown in Fig. 4. The top portion of arrows shows that all search results (on the left) are used to form the list of topics for clarification. The arrow on the bottom shows that the bottom answer ended up being selected by the chat bot based on two rounds of user feedback and clarifications.

Parameter / search engine                               Web search   Chat bot
Av. time to satisfactory search result, sec                  45.3       58.1
Av. time of unsatisfactory search session
  (giving up and starting a new search), sec                 65.2       60.5
Av. # of iterations to satisfactory search result             5.2        4.4
Av. # of iterations to unsatisfactory search result           7.2        5.6

Table 1: Comparison of the time spent and the number of iterations for the chat bot and Google search in the domain of personal finance.

One can observe (Table 1) that the chat bot's knowledge exploration sessions take longer than the search engine's. Although this might seem to be less beneficial for users, businesses prefer users to stay longer on their sites, as the chance of user acquisition grows. Spending 7% more time on reading chat bot answers is expected to allow a user to better familiarize himself with a domain, especially when these answers follow the selections of this user. The number of steps of an exploration session for the chat bot is 25% lower than for Google search.

6 Conclusion

We conclude that using a chat bot with extended discourse tree-driven navigation is an efficient and fruitful way of information access, in comparison with conventional search engines and chat bots focused on the imitation of human intellectual activity. The command-line demo1 and source code2 are available online. The source code is a sub-project of Apache OpenNLP (https://opennlp.apache.org/). Since the Bing search engine API is actively used for web mining to obtain candidate answers, a user would need to use her own Bing key for commercial use, available at https://datamarket.azure.com/dataset/bing/search.

7 Acknowledgements

This work was supported by Knowledge-Trail Inc., Samur.ai, Sysomos Inc., RFBR grants #16-01-00583 and #16-29-12982, and was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported within the framework of a subsidy by the Russian Academic Excellence Project '5-100'.

References

Rik Crutzen, Gjalt-Jorn Y. Peters, Sarah Dias Portugal, Erwin M. Fisser, and Jorne J. Grolleman. 2011. An artificially intelligent chat agent that answers adolescents' questions related to sex, drugs, and alcohol: an exploratory study. Journal of Adolescent Health, 48(5):514–519.

Boris Galitsky, Dmitry I. Ilvovsky, Sergei O. Kuznetsov, and Fedor Strok. 2013. Matching sets of parse trees for answering multi-sentence questions. In RANLP, pages 285–293.

Boris A. Galitsky, Dmitry Ilvovsky, Sergei O. Kuznetsov, and Fedor Strok. 2014. Finding maximal common sub-parse thickets for multi-sentence search. In Graph Structures for Knowledge Representation and Reasoning, pages 39–57. Springer International Publishing.

Shafiq R. Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. 2013. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In ACL (1), pages 486–496.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text – Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Gwydion M. Williams. 2012. Chatbot detector. New Scientist, 215(2873):28.

1 https://1drv.ms/u/s!AlZzGY7TCKAChBqHsfVCeNZ2V_KL
2 https://github.com/bgalitsky/relevance-based-on-parse-trees
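The rhetoric-agreement classification of Section 4 above can be illustrated with a toy common-subtree count standing in for the learned tree kernel. Everything here is invented for illustration (the tiny trees, the generic EDU leaves, the scoring): real DTs and the actual kernel learning are far richer.

```python
# Toy illustration of Q/A rhetoric agreement: DT(Q) and DT(A) are joined
# under a common root and pairs are compared with a naive subtree count
# (a simplified stand-in for tree kernel learning).

def subtrees(tree):
    """Enumerate all subtrees of a (label, child, ...) tuple tree."""
    found = [tree]
    for child in tree[1:]:
        found.extend(subtrees(child))
    return found

def kernel(t1, t2):
    """Count pairs of identical subtrees between two trees."""
    s2 = subtrees(t2)
    return sum(s2.count(s) for s in subtrees(t1))

def qa_pair(dt_q, dt_a):
    """Combine DT(Q) and DT(A) into a single tree rooted in the Q/A pair."""
    return ("QA-pair", dt_q, dt_a)

# A pair whose answer DT mirrors the question DT vs. one that does not:
good = qa_pair(("Elaboration", ("EDU",), ("EDU",)),
               ("Elaboration", ("EDU",), ("EDU",)))
bad  = qa_pair(("Elaboration", ("EDU",), ("EDU",)),
               ("Contrast", ("EDU",)))
# Against a prototypical correct pair, the structurally parallel pair wins:
proto = qa_pair(("Elaboration", ("EDU",), ("EDU",)),
                ("Elaboration", ("EDU",), ("EDU",)))
print(kernel(good, proto) > kernel(bad, proto))  # True
```

In the system itself this similarity is not computed against a single prototype but learned over the positive and negative Q/A training pairs described in Section 4.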

Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature

Erwin Marsi (1), Pinar Øzturk (1), Murat V. Ardelan (2)
(1) Department of Computer Science, (2) Department of Chemistry
Norwegian University of Science and Technology
{emarsi,pinar,murat.v.ardelan}@ntnu.no

Abstract

We report on a demonstration system for text mining of literature in marine science and related disciplines. It automatically extracts variables (e.g. CO2) involved in events of change/increase/decrease type (e.g. increasing CO2), as well as co-occurrence and causal relations among these events (e.g. increasing CO2 causes a decrease in pH in seawater), resulting in a big knowledge graph. A web-based graphical user interface targeted at marine scientists facilitates searching, browsing and visualising events and their relations in an interactive way.

1 Introduction

Progress in science relies significantly on the premise that – in addition to other methods for gaining knowledge such as experiments and modelling – new knowledge can be inferred by combining existing knowledge found in the literature. Unfortunately such knowledge often remains undiscovered because individual researchers can realistically only read a relatively small part of the literature, typically mostly in the narrow field of their own expertise (Swanson, 1986). Therefore we need software to help researchers manage the ever growing scientific literature and quickly fulfil their specific information needs. Even more so for "big problems" in science, such as climate change, which require a system-level, cross-disciplinary approach.

Text mining of scientific literature has been pioneered in biomedicine and is now finding its way to other disciplines, notably the humanities and social sciences, holding the promise of knowledge discovery from large text collections. Still, multidisciplinary fields such as marine science, climate science and environmental science remain mostly unexplored. Due to significant differences between the conceptual frameworks of biomedicine and other disciplines, simply "porting" the biomedical text mining infrastructure to another domain will not suffice. Moreover, the type of questions to be asked and the answers expected from text mining may be quite different. Theories and models in marine science typically involve changing variables and their complex interactions, which include correlations, causal relations and chains of positive/negative feedback loops, where multicausal events are common. Many marine scientists are thus interested in finding evidence – or counter-evidence – in the literature for events of change and their relations. Here we report on an end-user system, resulting from our ongoing work to automatically extract, relate, query and visualise events of change and their direction of variation.

Our text mining efforts in the marine science domain are guided by a basic conceptual model described in (Marsi et al., 2014). The system presented here covers a subset of this model, namely change events, variables and causal relations. A change is an event in which the value of a variable is changing, but the direction of change is unspecified. There are two specific subtypes of a change event: an increase, in which the direction of change is positive, and a decrease, in which the direction of change is negative. A variable is something mentioned in the text that is changing its value. This is a very broad definition that covers much more than traditional entities (e.g. person, disease or protein) and includes long and complex expressions. A cause is a relation that holds between a pair of events in which a change in one variable causes a change in another variable. An example of a sentence annotated according to this conceptual model is shown in Figure 1.
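The conceptual model of changes, variables and causes can be rendered as a few minimal data types. This is purely illustrative (the class and field names are our own); in the actual system these objects end up as nodes and edges in a property graph.

```python
# Minimal, illustrative rendering of the conceptual model:
# variables, change events with an optional direction, and causal relations.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variable:
    text: str                        # possibly a long, complex expression

@dataclass
class ChangeEvent:
    variable: Variable
    direction: Optional[str] = None  # None = unspecified change,
                                     # "+" = increase, "-" = decrease

@dataclass
class Cause:
    cause: ChangeEvent               # a change in one variable ...
    effect: ChangeEvent              # ... causes a change in another

# "increasing CO2 causes a decrease in pH in seawater":
co2 = ChangeEvent(Variable("CO2"), "+")
ph  = ChangeEvent(Variable("pH in seawater"), "-")
rel = Cause(co2, ph)
print(rel.effect.direction)  # -
```

A plain change event, with direction left as None, covers the case where the text only says a variable is changing without saying which way.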


Figure 1: Example of text annotation according to conceptual model

2 Text mining system

Our text mining system for marine science literature is called Megamouth, inspired by the filter-feeder sharks which filter plankton out of the water. Its overall task is to turn unstructured data (text) into structured data (graph database) adhering to the conceptual model (discarding all other information) through a process of information extraction. The process is essentially a pipeline of processing steps, as briefly described below.

Step 1: Document retrieval involves crawling the websites for a predefined set of journals and extracting the text segments of interest from the HTML code, which includes title, authors, abstract, references, etc. Marine-related articles are selected through a combination of term matching with a manually compiled list of key words and an LDA topic model.

Step 2: Linguistic analysis consists of tokenisation, sentence splitting, lemmatisation, POS tagging and constituency parsing using the Stanford CoreNLP tools (Manning et al., 2014). It provides essential information required in subsequent processing, such as variable extraction by pattern matching against syntactic parse trees.

Step 3: Variable and event extraction is performed simultaneously through tree pattern matching, where manually written patterns are matched against lemmatised constituency trees of sentences to extract events (increase/decrease/change) and their variables. It depends on two tools that are part of CoreNLP: Tregex is a library for matching patterns in trees based on tree relationships and regular expression matches on nodes; Tsurgeon is a closely related library for transforming trees through sequences of tree operations. For more details, see (Marsi and Öztürk, 2016; Marsi and Öztürk, 2015).

Step 4: Generalisation of variables addresses variables that are very long and complex and therefore unlikely to occur more than once. These are generalised (abstracted) by removing non-essential words and/or splitting them into atomic variables. For example, the variable "the annual, Milankovitch and continuum temperature" is split into three parts, one of which is "annual temperature", which is ultimately itself generalised to "temperature". This is accomplished through progressive pruning of a variable's syntactic tree, using a combination of tree pattern matching and tree operations.

Step 5: Relation extraction again uses tree pattern matching with hand-written patterns to extract causal relations between pairs of events, identifying their cause and effect roles.

Step 6: Conversion to graph. All extracted variables, events and relations are subsequently converted to a single huge property graph, which is stored and indexed in a Neo4j graph database1 (Community Edition) to facilitate fast search and retrieval. It contains nodes for variables, generalised variables, event instances, event relations, sentences and articles. It has edges between, e.g., a variable and its generalisations. Properties on nodes/edges hold information such as a sentence's number and character string on sentence nodes, or the character offsets for event instances.

Step 7: Graph post-processing enriches the initial graph in a number of ways using the Cypher graph query language. Event instance nodes are aggregated in event type nodes. Likewise, causal relation instances are aggregated in causal relation types between event types. Furthermore, co-occurrence counts for event pairs occurring in the same sentence are computed and added as co-occurrence relations between their respective event type nodes. Post-processing also includes the addition of metadata and citation information, obtained through the Crossref metadata API, to article nodes in the graph.

The final output is a big knowledge graph (millions of nodes) containing all information extracted from the input text. The graph can be searched in many different ways, depending on interest, using the Cypher graph query language. One possibility is searching for chains of causal relations. The user interface described in the next section offers a more user-friendly way of searching for a certain type of pattern, namely, relations between changing variables.

1 https://neo4j.com/
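In the running system, searching for chains of causal relations is done with Cypher queries over the Neo4j graph. As a rough stand-in, the following sketch enumerates causal chains over a toy in-memory set of aggregated CAUSES edges; the event types are invented for illustration and are not taken from the real knowledge graph:

```python
# Minimal in-memory sketch of "chains of causal relations" search.
# The real system runs Cypher queries over Neo4j; event types here are invented.

def causal_chains(causes, start, max_len=4):
    """Enumerate causal chains starting from `start` via depth-first search."""
    chains = []

    def walk(path):
        for effect in causes.get(path[-1], []):
            if effect in path:        # avoid cycles
                continue
            chains.append(path + [effect])
            if len(path) + 1 < max_len:
                walk(path + [effect])

    walk([start])
    return chains

# Toy aggregated CAUSES edges between event types (hypothetical).
CAUSES = {
    "increase iron": ["increase phytoplankton growth"],
    "increase phytoplankton growth": ["decrease CO2", "increase primary production"],
    "decrease CO2": ["decrease temperature"],
}

for chain in causal_chains(CAUSES, "increase iron"):
    print(" -> ".join(chain))
```

A roughly analogous Cypher query over the real graph might look like MATCH p = (e:EventType {name: 'increase iron'})-[:CAUSES*1..3]->() RETURN p, assuming such node labels and relation types exist.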

3 User interface

Although graph search queries can be written by hand, this takes time, effort and a considerable amount of expertise. In addition, large tables are difficult to read and navigate, lacking an easy way to browse the results, e.g., to look up the source sentences and articles for extracted events. Moreover, users need to have a local installation of all required software and data. The Marine Variable Linker (MVL) is intended to solve these problems. Its main function is to enable non-expert users (marine scientists) to easily search the graph database in an interactive way and to present search results in a browsable and visual way. It is a web application that runs on any modern platform with a browser (Linux, Mac OS, Windows, Android, iOS, etc.). It is a graphical user interface, which relies on familiar input components such as buttons and selection lists to compose queries, and uses interactive tables and graphs to present search results. Hyperlinks are used for navigation and to connect related information, e.g. the web page of the source journal article.

Figure 2: Example of event query composition

Figure 2 shows an example of a search query for events consisting of two search rules: (1) the variable equals iron and (2) the event type is increase. In addition, the search is with specialisations, which means that it includes variables that can be generalised to iron, such as particulate iron or iron in clam mactra lilacea. More rules can be added to narrow down, or widen, the event search. Clicking the search button brings up a new table of matching event types, showing their instance counts, predicates and variables (not shown here). Clicking on any row in this event types table brings up the corresponding event instances table, which shows all the actual mentions of this event in journal articles. Each row in the instances table shows a sentence, year of publication and source.

Once events are defined, one can search for relations between these events, where queries can be composed in a similar fashion as for events. The first kind of relation is cooccurs, which means that two events co-occur in the same sentence. When two events are frequently found together in a single sentence, they tend to be associated in some way, possibly by correlation. The second kind of relation is causes, which means that two events in a sentence are causally related, where one event is the cause and the other the effect. Causality must be explicitly described in the sentence, for example by words such as causes, therefore, leads to, etc.

Relation search results are presented in two ways. The relation types table contains all pairs of event types, specifying their relation, event predicates, event variables and counts. Figure 3 provides an example of an open-ended search query where the cause is an event with variable iron (including its specialisations, direction of change unspecified), whereas the effect is left open (i.e., can be any event). The corresponding relation graph is shown in Figure 4. The nodes are event types, with red triangles for increases, blue triangles for decreases and green diamonds for changes.

Clicking on a row in the table or a node in the graph brings up a corresponding instances table (cf. bottom of Figure 3), showing sentences, years and citations of articles containing the given relation. The events are marked in colour: red for increasing, blue for decreasing and green for changing. Hovering the mouse over the document icon shows a citation for the source article, whereas clicking it opens the article's web page containing the sentence in a new window.

A demo of an MVL instance indexing 75,221 marine-related abstracts from over 30 journals is currently freely accessible on the web.2 Source code for the text mining system and the graphical user interface is freely available.3

2 http://baleen.idi.ntnu.no/demos/megamouth-abs/
3 https://github.com/OC-NTNU

4 Discussion and Future work

Our system still makes many errors. Variables and events are sometimes incorrectly extracted (e.g. the variable more iron in Figure 3 ought to be just iron), often due to syntactic parsing errors, and many are missed altogether (e.g. the decline in the particulate organic carbon quotas in the same figure), because events can be expressed in so many different and complex ways. Yet we feel that even with a fair amount of noise, the current proof-of-concept already offers practical merit in battling the literature deluge. We will continue to work on improving recall and precision, as well as usability aspects of the interface. One aspect under active development is the integration of new and better algorithms for causal relation extraction based on machine learning from manually annotated data. The knowledge graph can be searched efficiently using Cypher queries, which opens up many other interesting opportunities for knowledge discovery. We hope our system will attract interest from marine and climate scientists, raising awareness of the potential of text mining, as progress will crucially depend on building a community similar to that in biomedical text mining.

Figure 3: Example of search results for causal relation types (top) and a selected instance (bottom)

Figure 4: Example of relation graph

Acknowledgments

Financial aid from the European Commission (OCEAN-CERTAIN, FP7-ENV-2013-6.1-1; no:603773) is gratefully acknowledged.

References

C.D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S.J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, pages 55–60.

E. Marsi and P. Öztürk. 2015. Extraction and generalisation of variables from scientific publications. In EMNLP, pages 505–511, Lisbon, Portugal.

E. Marsi and P. Öztürk. 2016. Text mining of related events from natural science literature. In Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD), Montreal, Canada.

E. Marsi, P. Öztürk, E. Aamot, G. Sizov, and M.V. Ardelan. 2014. Towards text mining in climate science: Extraction of quantitative variables and their relations. In Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing, Reykjavik, Iceland.

D. R. Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18.

Neoveille, a Web Platform for Neologism Tracking

Emmanuel Cartier
Université Paris 13 Sorbonne Paris Cité - LIPN - RCLN UMR 7030 CNRS
99 boulevard Jean-Baptiste Clément
93430 Villetaneuse, FRANCE
[email protected]

Abstract

This paper details a software platform designed to track neologisms in seven languages through newspaper monitor corpora. The platform combines state-of-the-art processes to track linguistic changes and a web platform for linguists to create and manage their corpora, accept or reject automatically identified neologisms, describe the accepted neologisms linguistically and follow their lifecycle on the monitor corpora. In the following, after a short state of the art in neologism retrieval, analysis and life-tracking, we describe the overall architecture of the system. The platform can be freely browsed at www.neoveille.org, where a detailed presentation is given. Access to the editing modules is available upon request.

1 Credits

Neoveille is an international project funded by the ANR IDEX specific funding scheme. It gathers seven linguistics and research centers. See the website for details.

2 Introduction

Linguistic change is one of the fundamental properties of language, even if, at least over a short-term period, languages appear to be extremely conservative and reluctant to change. Whereas NLP efforts have mainly focused on synchronic language analysis, research and applications are very sparse on the diachronic side, especially concerning short-term diachrony. But with the availability of big web corpora and the maturity of automatic linguistic analyses, especially those able to process big data while maintaining reasonable quality, it is now possible to monitor language change and track linguistic innovations.

3 Previous Work in Neology and Neology Tracking

Linguistic change has been studied for decades and even centuries in linguistics, and has been dealt with more recently in computational linguistics.

3.1 Linguistic Neology Models

3.1.1 Neologism Categories

Linguistic change was first focused on by the Comparative Grammars School, whose main goal was to study languages diachronically. They mainly based their descriptions and analyses on linguistic forms, describing phonetic, phonological, morphological, syntactic and semantic change on a long-term basis (Geeraerts, 2010). More recently, several attempts have emerged in the field of linguistic change, mainly focusing on lexical units and proposing typologies of neologisms (Schmid, 2015; Sablayrolles, 2016).

3.1.2 Synchrony, Diachrony, Diastraty

A complementary approach is due to (Gevaudan and Koch, 2010), who state that every lexical evolution can be described through three parameters. Two are universal: the semantic parameter (expliciting a continuity of meaning or a discontinuity, in the latter case further described) and the stratic parameter (linking the linguistic structure to its sociological context: borrowings are explained this way). The third one is linked to each specific linguistic formal structure, with four generic matrices: conversion, morphological extension, composition and clipping.
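The three-parameter description above lends itself to a small data structure. The following sketch is purely illustrative: all class and field names, and the example entry, are our own invention, not part of the Neoveille platform or of (Gevaudan and Koch, 2010):

```python
# Illustrative only: a minimal encoding of the three descriptive parameters of a
# lexical evolution (semantic, stratic, formal). Names are invented for this sketch.
from dataclasses import dataclass
from enum import Enum

class FormalMatrix(Enum):          # the four generic formal matrices
    CONVERSION = "conversion"
    MORPHOLOGICAL_EXTENSION = "morphological extension"
    COMPOSITION = "composition"
    CLIPPING = "clipping"

@dataclass
class LexicalEvolution:
    lexeme: str
    semantic_continuity: bool      # semantic parameter: meaning continuous or not
    stratic_source: str            # stratic parameter, e.g. donor register or language
    matrix: FormalMatrix           # formal parameter

# "blog" as a clipping of "weblog": continuous meaning, informal Internet register.
ex = LexicalEvolution("blog", semantic_continuity=True,
                      stratic_source="Internet jargon", matrix=FormalMatrix.CLIPPING)
print(ex.matrix.value)
```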

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 95–98. © 2017 Association for Computational Linguistics

3.1.3 Neologism Life-cycle(s)

Neology is one aspect of linguistic change, alongside necrology and stability. One important task is thus to model the lifecycle of a neologism, from its first occurrence to its potential disappearance or conventionalization. (Traugott and Trousdale, 2013) have proposed three salient states, innovation, propagation and conventionalisation, linking each to several more-or-less obvious properties. With these models in mind, NLP has developed several algorithms to track and study the lifecycle of neologisms.

3.2 Computational Models of Neology

Computational linguistics began to work on linguistic change not long ago, mainly because it needs to have large diachronic electronic corpora at hand. Neology is moreover still considered a secondary topic, as novel lexical units represent less than 5 percent of lexical units in corpora, according to several studies. But linguistic change is the complementary aspect of the synchronic structure. Every lexical unit is subject to time; form and meaning can change, due to diastratic events and situations. The advent of electronic (long- and short-term) diachronic corpora, scientific advances on word-formation, and machine learning techniques able to manage big corpora have permitted the emergence of neology tracking systems. Apart from a better knowledge of language lifecycle(s), these tools permit updating lexicographic resources, computational assets and parsers.

From the CL point of view, the main questions are: how can we automatically track neologisms, categorize them and follow their evolution, from their first appearance to their conventionalisation or disappearance? At best, can we induce neology-formation procedures from examples and therefore predict potential neologisms?

3.3 Existing Neology Tracking Systems

3.3.1 The Exclusion Dictionary Architecture (EDA)

The main achievement of neology tracking consists in systems extracting novel forms from monitor corpora, using lexicographic resources as a reference exclusion dictionary to induce unknown words, what we can call the "exclusion dictionary architecture" (EDA). The first system is due to (Renouf, 1993) for English: a monitor corpus and a reference dictionary from which unknown words can be derived. Further filters then apply to eliminate spelling errors and proper nouns. Subsequent developments all replicate this architecture: OBNEO (Cabré and De Yzaguirre, 1995), NeoCrawler (Kerremans et al., 2012), Logoscope (Gérard et al., 2014) and, more recently, Neoveille (Cartier, 2016).

Four main difficulties arise from this architecture. First, EDA cannot track semantic neologisms, as they use existing lexical units to convey innovative meanings. Second, the design of a reference exclusion dictionary is not that obvious, as it requires the existence of a machine-readable dictionary: this entails specific procedures to apply this architecture to less-resourced languages, and the availability of an up-to-date machine-readable dictionary for better-resourced languages. Third, the EDA architecture is not sufficient in itself: among unknown forms, most are proper nouns, spelling mistakes and other cases derived from corpus boilerplate removal, which entails a post-processing phase to sort these cases apart. Fourth, these systems do not take into account the sociological and diatopic aspects of neologisms, as they limit their corpora to specific domains: an ideal system should be able to extend its monitoring to new corpora and maintain diastratic meta-data to characterize novel forms. To the best of our knowledge, Neoveille (Cartier, 2016) is the only system implementing this aspect.

3.3.2 Semantic Neology Approaches

As for semantic neology, three approaches have recently been proposed, none of them yet exploited in an operational system. The first stems from the idea that meaning change is linked to domain change: every text, and thus its constituent lexical units, is assigned one or more topics; if a lexical unit emerges in a new domain, a change in meaning should have occurred (Gérard et al., 2014). The main drawback of this approach is that it is limited to specific semantic changes (it can tackle neither conventional metaphors appearing in the same domain, nor extension or restriction of meaning) and is mainly limited to nouns.

Another approach is linked to the distributional paradigm: "You shall know a word by the company it keeps" (Firth, 1957). The main idea is to retrieve from a large corpus all the collocates or collostructions of a lexical unit, and classify them according to several metrics.
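The EDA pipeline described in 3.3.1 (an exclusion dictionary, then filters for proper nouns and spelling errors) can be sketched in a few lines. The mini-corpus and reference dictionary below are invented for illustration; this is not the implementation of any of the cited systems:

```python
# Toy sketch of the exclusion dictionary architecture (EDA): candidate
# neologisms are tokens absent from a reference dictionary, with crude
# post-filters for proper nouns and one-off spelling errors.
import re
from collections import Counter

REFERENCE_DICT = {"the", "new", "word", "appears", "in", "this", "corpus",
                  "says", "often"}  # stand-in for a machine-readable dictionary

def candidate_neologisms(sentences, min_freq=2):
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(re.findall(r"[A-Za-z']+", sent)):
            if tok[0].isupper() and i > 0:   # crude proper-noun filter
                continue                     # (sentence-initial capitals kept)
            counts[tok.lower()] += 1
    # frequency threshold screens out one-off typos
    return {w for w, c in counts.items() if w not in REFERENCE_DICT and c >= min_freq}

corpus = ["The word webinar appears often in this corpus",
          "Alice says webinar appears often"]
print(candidate_neologisms(corpus))
```

Real systems replace the toy dictionary with a full lexicographic resource and the heuristics with proper taggers, but the control flow is the same.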

The most salient resulting context words represent the "profile" (Blumenthal, 2009) or "sketch" (Kilgarriff et al., 2004) of a lexical unit for the given synchronic period. The most elaborate system is surely the Sketch Engine, which proposes for every lexical unit its "sketch", i.e. a list of occurrences for any user-defined syntactic schema (for example modifiers, nominal subject, object and indirect object for verbs), sorted by one or several association measures. This system could be improved in two main ways: first, it does not propose complete syntactic schemas for lexical units like verbs (it is limited to either a SUBJ-VERB or a VERB-OBJ relation, but does not propose SUBJ-VERB-OBJ relations); second, it does not propose a clustering of occurrences, whereas distributional semantics could fill the gap and propose distributional classes at any place in the schema, instead of a flat list of occurrences.

A third approach consists in tracking semantic change by applying the second aspect of the distributional hypothesis: lexical units sharing most of their contexts are most likely to be semantically similar. This assumption has been applied to many computational semantic tasks. Applied to semantic change, if you have at your disposal a set of diachronic corpora, you can build the semantic vectors of any lexical unit for several periods, and track the changes from one period to another. First experiments have been proposed by (Hamilton et al., 2016). The main advantage of this approach resides in the fact that it proposes for a given word a list of semantically similar words, among which synonyms and hypernyms, which permits to clearly explicate the meaning of a word. The main drawback of this approach is that it is unable to distinguish the meanings of polysemous units. Another relative drawback lies in the fuzzy notion of similarity, which can yield words that are only slightly similar (analogy), or even opposite words (antonymy). But this approach is clearly of great help to humanly grasp the meaning of a word.

In the Neoveille project, we are currently developing an approach combining the Sketch approach with semantic distributions over the main lexical unit and its arguments.

3.3.3 Tracking the Lifecycle of Neologisms

In our view, neologisms are new form-meaning pairs (lexical units) and thus exist from their first occurrence. Tracking the lifecycle of neologisms requires fixing criteria to identify the main phases: emergence, dissemination, conventionalization (Traugott and Trousdale, 2013). In operational systems, the main tool to follow the life of a neologism is the timeline rendering the absolute or relative frequency of the lexical unit. In Neoveille, these figures are relative to specific diastratic and diatopic parameters, visually enabling to distinguish emergence, spread and conventionalization. These analyses are available for each identified neologism by clicking on the stats icon (see website, last neologisms menu).

4 Neoveille Tracking System Architecture

The Neoveille architecture aims at enabling a complete synergy between the NLP system and expert linguists: expert linguists are not able to monitor the vast amount of textual data, whereas automatic processes can help tackle this amount; in turn, experts can accurately decide whether a word is or is not a neologism. Our current point of view is that linguists must have the last word on what is and is not a neologism, and on the linguistic description; but as knowledge and descriptions grow with time, we will build supervised machine learning techniques able to predict potential neologisms.

The Neoveille web architecture has five main components:

1. A corpora manager: corpora are the main feed for NLP systems, and we propose to linguists a system enabling them to choose their corpora and to add several meta-data to them. The corpora, once defined by the user, are retrieved on a daily basis, indexed and searched for neologisms. Corpora management is available in the restricted area on the left menu;

2. An advanced search engine on the corpora: not only can the corpora be monitored, but the system should also propose a search engine with advanced capabilities: advanced querying, filtering and faceting of results. The Neoveille search engine, available in the restricted area on the left menu, enables to query the corpora in a multifactorial manner, with facets and visual filters;

3. Advanced data analytics explicating the lifecycle of neologisms and their diachronic, diatopic and diastratic parameters: Neoveille provides such a data analytics framework by combining meta-data with text mining. These visual analyses are available for every neologism by clicking the stats icon;

4. A linguistic description component for neologisms: this module, whose microstructure has been set up for several years, can be used to record knowledge about neology in a given language, and could also be used by a supervised machine learning system, as its features include a lot of formal properties. This component is accessible in the restricted area on the left menu;

5. Formal and semantic neologism tracking with state-of-the-art techniques. The formal and semantic neologism components are accessible in the restricted area on the left menu. They work for the seven languages of the project.

5 Conclusion and perspectives

This short presentation has outlined the design and overall architecture of a software platform for linguistic analysis focusing on linguistic change in contemporary monitor corpora. It has several interesting properties:

• real-time tracking, analysis and visualization of linguistic change;

• complete synergy between computational linguistics processing and linguistic experts' intuitions and knowledge, especially the possibility for experts to edit automatic results and the exploitation of linguistic annotations by machine learning processes;

• modularity of the software: corpora management, state-of-the-art search engine including analysis and visualization, neologism mining, neologism linguistic description, lifecycle tracking.

This project, currently focusing on seven languages, is on the path to extending to other languages.

References

P. Blumenthal. 2009. Éléments d'une théorie de la combinatoire des noms. Cahiers de lexicologie, 94 (2009-1), 11-29.

Maria Teresa Cabré and Luis De Yzaguirre. 1995. Stratégie pour la détection semi-automatique des néologismes de presse. TTR : traduction, terminologie, rédaction, 8(2), p. 89-100.

Emmanuel Cartier. 2016. Néoveille, système de repérage et de suivi des néologismes en sept langues. Neologica, 10, Revue internationale de néologie, p. 101-131.

John R. Firth. 1957. Papers in Linguistics 1934–1951. London, Oxford University Press.

Dirk Geeraerts. 2010. Theories of Lexical Semantics. Oxford University Press.

Paul Gevaudan and Peter Koch. 2010. Sémantique cognitive et changement sémantique. Grandes voies et chemins de traverse de la sémantique cognitive, Mémoire de la Société de linguistique de Paris, XVIII, pp. 103-145.

Christophe Gérard, Ingrid Falk, and Delphine Bernhard. 2014. Traitement automatisé de la néologie : pourquoi et comment intégrer l'analyse thématique ? Actes du 4e Congrès mondial de linguistique française (CMLF 2014), Berlin, p. 2627-2646.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. ACL 2016.

Daphné Kerremans, Susanne Stegmayr, and Hans-Jörg Schmid. 2012. The NeoCrawler: identifying and retrieving neologisms from the internet and monitoring on-going change. In Kathryn Allan and Justyna A. Robinson, eds., Current Methods in Historical Semantics, Berlin etc.: de Gruyter Mouton, 59-96.

A. Kilgarriff, P. Rychlý, P. Smrž, and D. Tugwell. 2004. The Sketch Engine. Proceedings of Euralex, pages 105–116, Lorient.

Antoinette Renouf. 1993. Sticking to the text: a corpus linguist's view of language. ASLIB Proceedings, 45(5), p. 131-136.

Jean-François Sablayrolles. 2016. Les néologismes. Collection Que sais-je ?, Presses Universitaires de France.

Hans-Jörg Schmid. 2015. The scope of word-formation research. In Peter O. Müller, Ingeborg Ohnheiser, Susan Olsen and Franz Rainer, eds., Word-Formation. An International Handbook of the Languages of Europe. Vol. I.

Elizabeth Closs Traugott and Graeme Trousdale. 2013. Constructionalization and Constructional Changes.

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit

Andrey Kutuzov
University of Oslo
Oslo, Norway
[email protected]

Elizaveta Kuzmenko
Higher School of Economics
Moscow, Russia
eakuzmenko [email protected]

Abstract

We present WebVectors, a toolkit that facilitates using distributional semantic models in everyday research. Our toolkit has two main features: it allows to build web interfaces to query models using a web browser, and it provides an API to query models automatically. Our system is easy to use and can be tuned according to individual demands. This software can be of use to those who need to work with vector semantic models but do not want to develop their own interfaces, or to those who need to deliver their trained models to a large audience. WebVectors features visualizations for various kinds of semantic queries. At the present moment, web services with Russian, English and Norwegian models, built using WebVectors, are available.

1 Introduction

In this demo we present WebVectors, a free and open-source toolkit1 helping to deploy web services which demonstrate and visualize distributional semantic models (widely known as word embeddings). We show its abilities on the example of the live web service featuring distributional models for English and Norwegian2.

Vector space models, popular in the field of distributional semantics, have recently become a buzzword in natural language processing. In fact, they have been known for decades, and an extensive review of their development can be found in (Turney et al., 2010). Their increased popularity is mostly due to the new prediction-based approaches, which allow training distributional models on large amounts of raw linguistic data very fast. The most established word embedding algorithms in the field are the highly efficient Continuous Skip-Gram and Continuous Bag-of-Words, implemented in the famous word2vec tool (Mikolov et al., 2013b; Baroni et al., 2014), and GloVe, introduced in (Pennington et al., 2014).

Word embeddings represent the meaning of words with dense real-valued vectors derived from word co-occurrences in large text corpora. They can be of use in almost any linguistic task: named entity recognition (Siencnik, 2015), sentiment analysis (Maas et al., 2011), machine translation (Zou et al., 2013; Mikolov et al., 2013a), corpora comparison (Kutuzov and Kuzmenko, 2015), word sense frequency estimation for lexicographers (Iomdin et al., 2016), etc.

Unfortunately, the learning curve to master word embedding methods and to present the results to the general public may be steep, especially for people in the (digital) humanities. Thus, it is important to facilitate research in this field and to provide access to relevant tools for various linguistic communities.

With this in mind, we are developing the WebVectors toolkit. It allows to quickly deploy a stable and robust web service for operations on word embedding models, including querying, visualization and comparison, all available even to users who are not computer-savvy.

WebVectors can be useful in a very common situation: one has trained a distributional semantics model for one's particular corpus or language (tools for this are now widespread and simple to use), but then needs to demonstrate the results to the general public. The toolkit can be installed on any Linux server with a small set of standard tools as prerequisites, and generally works out-of-the-box. The administrator needs only to supply a trained model or models for one's particular language or research goal. The toolkit can be easily adapted for specific needs.

1 https://github.com/akutuzov/webvectors
2 http://ltr.uio.no/semvec

Figure 1: Computing associates for the word 'linguistics' based on the models trained on English Wikipedia and BNC.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 99–103. © 2017 Association for Computational Linguistics

2 Deployment

The toolkit serves as a web interface between distributional semantic models and users. Under the hood it uses the following software:

∙ Gensim library (Řehůřek and Sojka, 2010), which is responsible for actual interaction with models3;

∙ Python Flask framework, responsible for the user interface. It runs either on top of a regular Apache HTTP server or as a standalone service (using a standalone WSGI server).
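In this design, the front end talks to a model daemon that holds the embeddings in memory and answers queries over sockets. As a rough, invented illustration of that pattern only: the two-dimensional toy "model", the ad-hoc query protocol and all names below are ours, not WebVectors code:

```python
# Minimal sketch of a model daemon queried over a socket: a toy in-memory
# "model" answers cosine-similarity queries of the form "word1 word2".
import socket, threading

MODEL = {"linguistics": [1.0, 0.2], "syntax": [0.9, 0.3], "fish": [0.0, 1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def serve(sock):
    conn, _ = sock.accept()
    with conn:
        w1, w2 = conn.recv(1024).decode().split()   # query: "word1 word2"
        conn.sendall(f"{cosine(MODEL[w1], MODEL[w2]):.3f}".encode())

server = socket.socket()
server.bind(("localhost", 0))                       # pick a free port
server.listen(1)
threading.Thread(target=serve, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(("localhost", server.getsockname()[1]))
client.sendall(b"linguistics syntax")
print(client.recv(1024).decode())                   # high similarity expected
client.close()
```

The payoff of this split, as the section explains, is that models are loaded once and stay resident, so each request avoids the expensive load-from-disk step.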

Flask communicates with Gensim (functioning as a daemon with our wrapper) via sockets, send- ing user queries and receiving answers from mod- els. This architecture allows fast simultaneous pro- 3. apply algebraic operations to word vectors: cessing of multiple users querying multiple mod- addition, subtraction, finding average vector els over network. Models themselves are per- for a group of words (results are returned as manently stored in memory, eliminating time- lists of words nearest to the product of the consuming stage of loading them from permanent operation and their corresponding similarity storage every time there is a need to process a values); this can be used for analogical infer- query. ence, widely known as one of the most inter- The setup process is extensively covered by esting features of word embeddings (Mikolov the installation instructions available at https: et al., 2013b); //github.com/akutuzov/webvectors. 4. visualize semantic relations between words. 3 Main features of WebVectors As a user enters a set of words, the ser- vice builds a map of their inter-relations in Once WebVectors is installed, one can interact with the chosen model, and then returns a 2- the loaded model(s) via a web browser. Users are dimensional version of this map, projected able to: from the high-dimensional vector space, us- 1. find semantic associates: words semanti- ing t-SNE (Van der Maaten and Hinton, cally closest to the query word (results are 2008). An example of such visualization is returned as lists of words with correspond- shown in Figure 2; ing similarity values); an illustration of how it looks like in our demo web service is in 5. get the raw vector (array of real values) for Figure 1; the query word.

2. calculate exact semantic similarity between pairs of words (results are returned as cosine One can use part-of-speech filters in all of these similarity values, in the range between -1 and operations. It is important to note that this is pos- 1); sible only if a model was trained on a PoS-tagged 3Can be any distributional model represented as a list of corpus and the tags were added to the resulting vectors for words: word2vec, GloVe, etc. lemmas or tokens. Obviously, the tagger should

Figure 2: Visualizing the positions of several words in the semantic space

Figure 3: Analogical inference with several models
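The first three operations above reduce to elementary linear algebra over the stored vectors. The following minimal sketch uses a toy three-word vocabulary in place of a real model; the function names (cosine, nearest, average) are our own illustrations, not part of the WebVectors API:

```python
import math

# Toy embedding table standing in for a loaded distributional model.
vectors = {
    "king":  [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.4],
    "apple": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    # Feature 2: exact semantic similarity, a value in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def nearest(target, exclude=()):
    # Features 1 and 3: rank vocabulary words by similarity to a vector.
    pairs = [(w, cosine(v, target)) for w, v in vectors.items()
             if w not in exclude]
    return sorted(pairs, key=lambda p: -p[1])

def average(words):
    # Feature 3: the centroid of a group of words.
    vs = [vectors[w] for w in words]
    return [sum(col) / len(vs) for col in zip(*vs)]

print(nearest(vectors["king"], exclude={"king"})[0][0])  # closest associate
print(average(["king", "queen"]))
```

In the real service these operations run against a Gensim model held in memory by the daemon, so no per-query loading cost is paid.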

differentiate between homonyms belonging to different parts of speech. WebVectors can also use an external tagger to detect PoS for query words, if not stated explicitly by the user. By default, Stanford CoreNLP (Manning et al., 2014) is used for morphological processing and lemmatization, but one can easily adapt WebVectors to any other PoS-tagger.

In fact, one can use not only PoS tags, but any other set of labels relevant for a particular research project: time stamps, style markers, etc. WebVectors will provide the users with the possibility to filter the models' output with respect to these tags.

Another feature of the toolkit is the possibility to display results from more than one model simultaneously. If several models are enumerated in the configuration file, the WebVectors daemon loads all of them. At the same time, the user interface allows choosing one of the featured models or several at once. The results (for example, lists of nearest semantic associates) for different models are then presented to the user side-by-side, as in Figure 3. This can be convenient for research related to comparing several distributional semantic models (trained on different corpora or with different hyperparameters).

Last but not least, WebVectors features a simple API that allows querying the service automatically. It is possible to get the list of semantic associates for a given word in a given model or to compute semantic similarity for a word pair. The user performs GET requests to URLs following a specific pattern described in the documentation; in response, a file with the first 10 associates or the semantic similarity score is returned. Two formats are available at the present moment: JSON and tab-separated text files.

4 Live demos

The reference web service running on our code base is at http://ltr.uio.no/semvec. It allows queries to 4 English models trained with the Continuous Skipgram algorithm (Mikolov et al., 2013b): the widely known Google News model published together with the word2vec tool, and the models we trained on Gigaword, the British National Corpus (BNC) and an English Wikipedia dump from September 2016 (we plan to regularly update this last one). Additionally, it features a model trained on the corpus of Norwegian news texts, Norsk aviskorpus (Hofland, 2000). To our knowledge, this is the first neural embedding model trained on the Norwegian news corpus made available online; Al-Rfou et al. (2013) published distributional models for Norwegian, but they were trained on the Wikipedia only and did not use the current state-of-the-art algorithms.

Prior to training, each word token in the training corpora was not only lemmatized, but also augmented with a Universal PoS tag (Petrov et al., 2012) (for example, boot_VERB). Also, some amount of strongly related bigram collocations like 'Saudi::Arabia_PROPN' was extracted, so that they receive their own embeddings after the training. The Google News model already features ngrams, but lacks PoS tags. To make it more comparable with the other models, we assigned each word in this model a PoS tag with Stanford CoreNLP.

Another running installation of our toolkit is the RusVectores service available at http://

rusvectores.org (Kutuzov and Kuzmenko, 2016). It features 4 Continuous Skipgram and Continuous Bag-of-Words models for Russian trained on different corpora: the Russian National Corpus (RNC)4, the RNC concatenated with the Russian Wikipedia dump from November 2016, a corpus of 9 million random Russian web pages collected in 2015, and a Russian news corpus (spanning the time period from September 2013 to November 2016). The corpora were linguistically pre-processed in the same way, lending the models the ability to better handle the rich morphology of Russian. RusVectores is already being employed in academic studies in computational linguistics and digital humanities (Kutuzov and Andreev, 2015; Kirillov and Krizhanovskij, 2016; Loukachevitch and Alekseev, 2016), and several other research projects are in progress as of now.

One can use the aforementioned services as live demos to evaluate the WebVectors toolkit before actually employing it in one's own workflow.

5 Conclusion

The main aim of WebVectors is to quickly deploy web services processing queries to word embedding models, independently of the nature of the underlying training corpora. It allows complex linguistic resources to be made available to a wide audience in almost no time. We continue to add new features aimed at a better understanding of embedding models, including sentence similarities, text classification and analysis of correlations between different models for different languages. We also plan to add models trained using other algorithms, like GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016).

We believe that the presented open source toolkit and the live demos can popularize distributional semantics and computational linguistics among the general public. Services based on it can also promote interest among present and future students and help to make the field more compelling and attractive.

4 http://www.ruscorpora.ru/en/

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Knut Hofland. 2000. A self-expanding corpus based on newspapers on the web. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000).

B. L. Iomdin, A. A. Lopukhina, K. A. Lopukhin, and G. V. Nosyrev. 2016. Word sense frequency of similar polysemous words in different languages. In Computational Linguistics and Intellectual Technologies. Dialogue, pages 214–225.

A. N. Kirillov and A. A. Krizhanovskij. 2016. Model' geometricheskoj struktury sinseta [The model of geometrical structure of a synset]. Trudy Karel'skogo nauchnogo centra Rossijskoj akademii nauk, (8).

Andrey Kutuzov and Igor Andreev. 2015. Texts in, meaning out: neural language models in semantic similarity task for Russian. In Computational Linguistics and Intellectual Technologies. Dialogue, Moscow. RGGU.

Andrey Kutuzov and Elizaveta Kuzmenko. 2015. Comparing neural lexical models of a classic national corpus and a web corpus: The case for Russian. Lecture Notes in Computer Science, 9041:47–58.

Andrey Kutuzov and Elizaveta Kuzmenko. 2016. WebVectors: a toolkit for building web interfaces for vector semantic models (in press).

Natalia Loukachevitch and Aleksei Alekseev. 2016. Gathering information about word similarity from neighbor sentences. In International Conference on Text, Speech, and Dialogue, pages 134–141. Springer.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012). European Language Resources Association (ELRA).

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Scharolta Katharina Siencnik. 2015. Adapting word2vec to named entity recognition. In Nordic Conference of Computational Linguistics NODALIDA 2015, page 239.

Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85.

Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In EMNLP, pages 1393–1398.

InToEventS: An Interactive Toolkit for Discovering and Building Event Schemas

Germán Ferrero
Universidad Nacional de Córdoba
[email protected]

Audi Primadhanty
Universitat Politècnica de Catalunya
[email protected]

Ariadna Quattoni
Xerox Research Centre Europe
[email protected]

Abstract

Event Schema Induction is the task of learning a representation of events (e.g., bombing) and the roles involved in them (e.g., victim and perpetrator). This paper presents InToEventS, an interactive tool for learning these schemas. InToEventS allows users to explore a corpus and discover which kinds of events are present. We show how users can create useful event schemas using two interactive clustering steps.

1 Introduction

An event schema is a structured representation of an event: it defines a set of atomic predicates or facts and a set of role slots that correspond to the typical entities that participate in the event. For example, a bombing event schema could consist of atomic predicates (e.g., detonate, blow up, plant, explode, defuse and destroy) and role slots for a perpetrator (the person who detonates, plants or blows up), an instrument (the object that is planted, detonated or defused) and a target (the object that is destroyed or blown up). Event schema induction is the task of inducing event schemas from a textual corpus. Once the event schemas are defined, slot filling is the task of extracting the instances of the events and their corresponding participants from a document.

Figure 1: Event Schema Definition

In contrast with information extraction systems that are based on atomic relations, event schemas allow for a richer representation of the semantics of a particular domain. But, while there has been a significant amount of work in relation discovery, the task of unsupervised event schema induction has received less attention. Some unsupervised approaches have been proposed (Chambers and Jurafsky, 2011; Cheung et al., 2013; Chambers, 2013; Nguyen et al., 2015). However, they all end up assuming some form of supervision at document level, and the task of inducing event schemas in a scenario where there is no annotated data is still an open problem. This is probably because without some form of supervision we do not even have a clear way of evaluating the quality of the induced event schemas.

In this paper we take a different approach. We argue that there is a need for an interactive event schema induction system. The tool we present enables users to explore a corpus while discovering and defining the set of event schemas that best describes the domain. We believe that such a tool addresses a realistic scenario in which the user does not know in advance the event schemas that he is interested in and needs to explore the corpus to better define his information needs.

The main contribution of our work is to present an interactive event schema induction system that can be used by non-experts to explore a corpus and

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 104–107. © 2017 Association for Computational Linguistics

Figure 2: Event Clustering

easily build event schemas and their corresponding extractors.

The paper is organized as follows. Section 2 describes our event schema representation. Section 3 describes the interactive process for event schema discovery. Section 4 gives a short overview of related work. Finally, Section 5 concludes.

2 Event Schema

Our definition of event schema follows closely that of Chambers and Jurafsky (2011). An event schema consists of two main components:

Event Triggers. These correspond to a set of atomic predicates associated with an event. For example, as shown in Figure 1, for the bombing event schema we might have the atomic predicates explode, hurl, destroy, hit, shatter and injure. Each atomic predicate is represented by a tuple composed of a literal (e.g. explode), a real-valued word vector representation and a distance threshold that defines a ball around the literal in a word vector space representation.

Event Slots. These correspond to the set of participating entities involved in the event. For example, for the bombing event schema we might have the event slots victim, perpetrator, instrument and target. Each slot is represented by a tuple consisting of an entity type (e.g. person, organization or object) and a set of predicates; for example, for the victim slot, the predicates are injured, dies, wounded, fired and killed. Each predicate is in turn represented by a tuple consisting of a literal, a syntactic relation, a word vector and a distance threshold that defines a semantic ball. For example, the injured predicate is represented by the literal injure and the syntactic relation object. This tuple is designed to represent the fact that a victim is a person or organization who has been injured, and whose corresponding word vector representation is inside a given semantic ball.

3 Event Schema Induction

We now describe InToEventS, an interactive system that allows a user to explore a corpus and build event schemas. As in (Chambers and Jurafsky, 2011), the process is divided into two main steps. First, the user discovers the events present in the corpus (that is, event trigger sets) by interactively defining a soft partition of the predicate literals observed in the corpus. Depending on his information needs, he chooses a subset of the clusters that correspond to the events he is interested in representing. In the second step, for each chosen event trigger set, the user completes the event schema and builds slots or semantic roles via an interactive clustering of the syntactic arguments of the atomic predicates in the event trigger set.

3.1 First Step: Event Induction

To build event trigger sets we cluster the predicate literals observed in the corpus. To do this we first need to compute a distance between the predicate literals. Our notion of distance is based on two simple observations: (1) literals that tend to appear nearby in a document usually play a role in the same event description (e.g., This morning

Figure 3: Slots Clustering

a terrorist planted a bomb, thankfully the police defused it before it blew up); and (2) literals with similar meaning are usually describing the same atomic predicates (e.g., destroy and blast).

We first extract all unique verbs and all nouns with a corresponding synset in WordNet labeled as noun.event or noun.act; these are our predicate literals. Then, for each pair of literals we compute their distance, taking into account the probability that they will appear nearby in a document sampled from the corpus and the distance of the corresponding literals in a word embedding vector space1.

Once the distance is computed, we run agglomerative clustering. In the interactive step we let the user explore the resulting dendrogram and choose a distance threshold that will result in an initial partition of the predicate literals into event trigger sets (see Figure 2). In a second step, the user can merge or split the initial clusters. In a third step, the user selects and labels the clusters that he is interested in; these become incomplete event schemas. Finally, using the word embedding representation, the user has the option of expanding each event trigger set by adding predicate literals that are close in the word vector space.

3.2 Second Step: Role Induction

In this step we complete the event schemas of the previous step with corresponding semantic roles or slots. This process is based on a simple idea: let us assume, for instance, a bombing event with triggers {attack, blow up, set off, injure, die, ...}; we can intuitively describe a victim of a bombing event as "someone who dies, is attacked or injured", that is: "PERSON: subject of die, object of attack, object of injured".

Recall that a slot is a set of predicates, each represented with a tuple composed of a literal predicate and a syntactic relation, e.g. kill-subject. Additionally, each slot has an entity type. In a first step, for each predicate in the event trigger set we extract from the corpus all unique tuples of the form predicate-syntactic relation-entity type. The extraction of such tuples uses the universal dependency representation computed by the Stanford CoreNLP parser and named entity classifier.

In a second step, we compute a distance between each pair of tuples that is based on the average word embeddings of the arguments observed in the corpus for a given tuple. For example, to compute a vector for (die, subject, PERSON) we identify all entities of type PERSON in the corpus that are the subject of the verb die and average their word embeddings.

Finally, as we did for event induction, we run agglomerative clustering and offer the user an interactive graphic visualization of the resulting dendrogram (Figure 3). The user can explore different cluster settings and store those that represent the slots that he is interested in.

Once the event schemas have been created, we can use them to annotate documents. Figure 4 shows an example of an annotated document.

1 For experiments we used the pre-trained 300-dimensional GoogleNews model from word2vec.
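The slot-vector computation described above can be sketched in a few lines. The corpus tuples and the two-dimensional embeddings below are invented toy data, and slot_vector is an illustrative name, not a function of InToEventS:

```python
# Toy word embeddings and (predicate, relation, entity type, argument)
# tuples extracted from a hypothetical parsed corpus.
embeddings = {
    "soldier":  [1.0, 0.0],
    "civilian": [0.8, 0.2],
    "bomb":     [0.0, 1.0],
}

corpus_tuples = [
    ("die", "subject", "PERSON", "soldier"),
    ("die", "subject", "PERSON", "civilian"),
    ("plant", "object", "OBJECT", "bomb"),
]

def slot_vector(predicate, relation, entity_type):
    # Average the embeddings of all arguments observed for this
    # predicate-syntactic relation-entity type tuple.
    args = [embeddings[w] for p, r, t, w in corpus_tuples
            if (p, r, t) == (predicate, relation, entity_type)]
    return [sum(col) / len(args) for col in zip(*args)]

print(slot_vector("die", "subject", "PERSON"))  # [0.9, 0.1]
```

The resulting vectors feed the distance computation for the agglomerative clustering of candidate slots.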

106 Figure 4: Slot Filling

4 Previous Work

To the best of our knowledge there is no previous work on interactive workflows for event schema induction. The most closely related work is on interactive relation extraction. Thilo and Alan (2015) presented a web toolkit for exploratory relation extraction that allows users to explore a corpus and build extraction patterns. Ralph and Yifan (2014) presented a system where users can create extractors for predefined entities and relations. Their approach is based on asking the user for seeding example instances, which are then exploited with a semi-supervised learning algorithm. Marjorie et al. (2011) presented a system for interactive relation extraction based on active learning and bootstrapping.

5 Conclusion

We have presented an interactive system for event schema induction; like in (Chambers and Jurafsky, 2011), the workflow is based on reducing the problem to two main clustering steps. Our system lets the user interact with the clustering process in a simple and intuitive manner and explore the corpus to create the schemas that best fit his information needs.

References

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 976–986, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nathanael Chambers. 2013. Event schema induction with a probabilistic entity-driven model. In EMNLP, volume 13, pages 1797–1807.

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. 2013. Probabilistic frame induction. arXiv preprint arXiv:1302.4813.

Freedman Marjorie, Ramshaw Lance, Boschee Elizabeth, Gabbard Ryan, Kratkiewicz Gary, Ward Nicolas, and Weishedel Ralph. 2011. Extreme extraction: machine reading in a week. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).

Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, and Romaric Besançon. 2015. Generative event schema induction with entity disambiguation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-15).

Grishman Ralph and He Yifan. 2014. An information extraction customizer. In Proceedings of Text, Speech and Dialogue.

Michael Thilo and Akbik Alan. 2015. Schnapper: A web toolkit for exploratory relation extraction. In Proceedings of ACL-IJCNLP 2015 System Demonstrations (ACL-15).

ICE: Idiom and Collocation Extractor for Research and Education

Vasanthi Vuppuluri
Verizon Labs
375 W Trimble Rd, San Jose, CA 95131, USA
[email protected]

Shahryar Baki, An Nguyen, Rakesh Verma
Computer Science Dept., University of Houston
Houston, TX 77204, USA
[email protected], [email protected], [email protected]

Abstract

Collocation and idiom extraction are well-known challenges with many potential applications in Natural Language Processing (NLP). Our experimental, open-source software system, called ICE, is a Python package for flexibly extracting collocations and idioms, currently in English. It also has a competitive POS tagger that can be used alone or as part of collocation/idiom extraction. ICE is available free of cost for research and educational uses in two user-friendly formats. This paper gives an overview of ICE and its performance, and briefly describes the research underlying the extraction algorithms.

1 Introduction

Idioms and collocations are special types of phrases in many languages. An idiom is a phrase whose meaning cannot be obtained compositionally, i.e., by combining the meanings of the words that compose it. Collocations are phrases in which there is a semantic association between the component words and some restrictions on which words can be replaced and which cannot. In short, collocations are arbitrarily restricted lexeme combinations such as look into and fully aware.

Many scientists from diverse fields have worked on the challenging tasks of automated collocation and idiom extraction, e.g., see (Garg and Goyal, 2014; Seretan, 2013; Verma and Vuppuluri, 2015; Verma et al., 2016) and the references contained therein, yet there is no multi-purpose, ready-to-use, and flexible system for extracting these phrases. Collocation and its special forms, such as idioms, can be useful in many important tasks, e.g., summarization (Barrera and Verma, 2012), question-answering (Barrera et al., 2011), language translation, topic segmentation, authorial style, and so on. As a result, a tool for these tasks would be very handy.

To fill this void, we introduce a feature-rich system called ICE (short for Idiom and Collocation Extractor), which has two versions: one is flexible and pipelined seamlessly for research purposes as a component of a larger system such as a question answering system, and the second is a web-based tool for educational purposes. ICE has a modular architecture and also includes a POS tagger, which can be used alone or as part of collocation or idiom extraction. An experiment with the CoNLL dataset shows that ICE's POS tagger is competitive against the Stanford POS tagger. For ease of use in research, we provide ICE as a Python package.

For collocation extraction, ICE uses the IR models and techniques introduced by (Verma et al., 2016). These methods include: dictionary search, online dictionaries, a substitution method that compares the Bing hit counts of a phrase against the Bing hit counts of new phrases obtained by substituting the component words of the phrase one at a time to determine the "adherence factor" of the component words in a collocation, and two methods that try to measure the probability of association of the component words, again using hit counts. In (Verma et al., 2016), the authors created a gold-standard dataset of collocations by taking 100 sentences at random from the Wiki50 dataset and manually annotating them for collocations (including idioms) using eight volunteers, who used the Oxford Dictionary of Collocations and the Oxford Dictionary of Idioms. Each sentence was given to two annotators, who were given 25 sentences each for annotation, and their work was checked and corrected afterwards by two other people. In creating this dataset, even with the assistance of dictionaries, human performance

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 108–111. © 2017 Association for Computational Linguistics

varied from an F1-score of about 39% to 70% for the collocation task. A comparison showed that their combination schemes outperformed existing techniques, such as MWEToolkit (Ramisch et al., 2010) and Text-NSP (Banerjee and Pedersen, 2003), with the best method achieving an F1-score of around 40% on the gold-standard dataset, which is within the range of human performance.

For idiom extraction, ICE uses the semantics-based methods introduced by (Verma and Vuppuluri, 2015). The salient feature of these methods is that they use Bing search for the definition of a given phrase and then check the compositionality of the phrase definition against combinations of the words obtained when a "define word" query is issued for each word belonging to the phrase. If there is a difference in meaning, the phrase is considered an idiom. In (Verma and Vuppuluri, 2015), the authors showed that their method outperforms AMALGr (Schneider et al., 2014). Their best method achieved an F1-score of about 95% on the VNC tokens dataset.

Thus, ICE includes extraction methods for idioms and collocations that are state-of-the-art. Other tools exist for collocation extraction, e.g., see (Anagnostou and Weir, 2006), in which four methods including Text-NSP are compared.

Figure 1: Collocation extractor diagram
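The "probability of association" idea behind the hit-count methods can be sketched as a pointwise-mutual-information-style test: a phrase whose observed frequency greatly exceeds what word independence would predict is a collocation candidate. The collection size and hit counts below are invented for illustration and do not come from ICE or Bing:

```python
import math

TOTAL_PAGES = 1_000_000  # assumed size of the indexed collection

# Hypothetical hit counts a search API might return.
hits = {
    "fully": 50_000,
    "aware": 40_000,
    "asleep": 30_000,
    "fully aware": 9_000,   # far more frequent than chance predicts
    "fully asleep": 3,      # roughly what chance co-occurrence yields
}

def independence_score(phrase):
    # log of P(phrase) / product of P(word): large and positive when
    # the words "stick together" more than independence would allow.
    p_phrase = hits[phrase] / TOTAL_PAGES
    p_independent = 1.0
    for word in phrase.split():
        p_independent *= hits[word] / TOTAL_PAGES
    return math.log(p_phrase / p_independent)

print(independence_score("fully aware") > 0)   # treated as a collocation
print(independence_score("fully asleep") > 0)  # treated as ordinary text
```

With these toy numbers, "fully aware" scores positive and "fully asleep" negative, which is the kind of separation the web-search-and-independence component relies on.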

2 ICE - Architecture and Algorithms

As ICE's algorithms are based on Bing search, users must provide a valid user id for the Bing API. ICE receives a list of sentences as input and outputs a list of all collocations and idioms. It first splits the input sentences using the NLTK sentence tokenizer, then generates n-grams and part-of-speech tags. ICE's n-gram generator takes care of punctuation marks and has been shown to be better than NSP's n-gram generator. Finally, the output n-grams are given to the collocation and idiom detection algorithms. Collocation and idiom extraction follows the algorithms given by (Verma et al., 2016)1 and (Verma and Vuppuluri, 2015). For part-of-speech tagging we combined NLTK's regex tagger with NLTK's N-Gram Tagger to obtain better POS tagging performance. We compared our tagger with the Stanford POS tagger (Manning et al., 2014) on the CoNLL dataset2. The accuracy of our tagger is 92.11%, which is slightly higher than 91.19%, the accuracy of the Stanford tagger on the same corpus.

Figure 2: Idiom extractor diagram

Collocation/Idiom Extractor. The collocation extraction technique combines different methods in a pipeline in order to increase precision. Figures 1 and 2 show the collocation and idiom extraction system architectures separately. As shown in the diagrams, there are two methods for identifying idioms (called And and Or) and four different methods for identifying collocations: offline dictionary search, online dictionary search, web search and substitution, and web search and independence.

1 http://www2.cs.uh.edu/~rmverma/paper_216.pdf
2 Available at http://www.cnts.ua.ac.be/conll2003/ner/

For collocations, ICE pipelines the first and second methods, then pipelines them with the third or the fourth method (both options are available
109 in the code). These methods are connected se- the n sets. The Or subsystem considers a phrase quentially. This means that if something is con- as an idiom if at least one word survives one of sidered as a collocation in one component, it will the subtractions (union of difference sets should be added to the list of collocations and will not be be non-empty). For the And, at least one word has given to the next component (yes/no arrows in the to exist that survived every subtraction (intersec- diagram). Table 1 shows the description of each tion of difference sets should be non-empty) component in collocation extractor diagram. Performance. ICE outperforms both Text- The Ngram Extractor receives all sentences and NSP and MWEToolkit. On the gold-standard generates n-grams ranging from bigrams up to 8- dataset, ICE’s F1-score was 40.40%, MWE- grams. It uses NLTK sentence and word tokeniz- Toolkit’s F1-score was 18.31%, and Text-NSP had ers for generating tokens. Then, it combines the 18%. We also compared our idiom extraction generated tokens together taking care of punctua- with AMALGr method (Schneider et al., 2014) on tion to generate the n-grams. their dataset and the highest F1-score achieved by Dictionary Check uses WordNet (Miller, 1995) ICE was 95% compared to 67.42% for AMALGr. as a dictionary and looks up each n-gram to see if For detailed comparison of ICE’s collocation and it exists in WordNet or not (a collocation should idiom extraction algorithm with existing tools, exist in the dictionary). All n-grams that are con- please refer to (Verma et al., 2016) and (Verma sidered as non-collocations are given to the next and Vuppuluri, 2015). component as input. Sample Code. Below is the sample code for The next component is Online Dictionary. It using ICE’s collocation extraction as part of a searches online dictionaries to see if the n-gram bigger system. For idiom extraction you can use exists in any of them or not. 
It uses Bing Search IdiomExtractor class instead of collocationEx- API 3 to search for n-grams in the web. tractor. Web Search and Substitution is the next compo- nent in the pipeline. This method uses Bing Search >> input=["he and Chazz duel API to obtain hit counts for a phrase query. Then with all keys on the line."] each word in the n-gram will be replaced by 5 ran- >>from ICE import dom words (one at the time), and the hit counts CollocationExtractor are obtained. At the end, we will have a list of hit >>extractor = counts. These values will be used to differentiate CollocationExtractor. between collocations and non-collocations. with_collocation_pipeline( The last component in the pipeline of colloca- "T1" , bing_key = "Temp", tion extraction is Web Search and Independence. pos_check = False) The idea of this method is to check whether the >> print(extractor. probability of a phrase exceeds the probability that get_collocations_of_length( we would expect if the words are independent. It input, length = 3)) uses hit counts in order to estimate the probabili- >> ["on the line"] ties. These probabilities are used to differentiate between collocations and non-collocations. Educational Uses. ICE also has a web-based When running the collocation extraction func- interface for demonstration and educational pur- tion, one of the components should be selected out poses. A user can type in a sentence into an input of the third and fourth ones. field and get a list of the idioms or collocations in The Idiom Extractor diagram is relatively sim- the sentence. A screen-shot of the web-based in- pler. Given the input n-gram, it creates n + 1 terface is shown in Figure 3.4 sets. The first contains (stemmed) words in the meaning of the phrase. The next n sets contain 3 Conclusion stemmed word in the meaning of each word in the n-gram. 
Then it applies the set difference opera- ICE is a tool for extracting idioms and colloca- tor to n pairs containing the first set and each of tions, but it also has functions for part of speech

³ http://datamarket.azure.com/dataset/bing/search
⁴ The web interface is accessible through https://shahryarbaki.ddns.net/collocation/
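The Idiom Extractor's Or/And decision over difference sets can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the toy meaning sets are invented for the example, not ICE's actual API.

```python
def idiom_decision(phrase_meaning, word_meanings, mode="or"):
    """Decide whether an n-gram is an idiom from set differences.

    phrase_meaning: set of stemmed words in the meaning of the whole phrase.
    word_meanings:  list of n sets, one per word of the n-gram, each holding
                    the stemmed words in that word's meaning.
    A word "survives" a subtraction if it stays in the phrase meaning after
    removing one word's meaning set.
    """
    differences = [phrase_meaning - meaning for meaning in word_meanings]
    if mode == "or":
        # Union of the difference sets non-empty: at least one word survives
        # at least one of the subtractions.
        return bool(set().union(*differences))
    # "and": intersection non-empty: some word survives every subtraction.
    return bool(set.intersection(*differences))

# Toy example: the meaning of "kick the bucket" shares nothing with the
# meanings of its words, so both subsystems call it an idiom.
phrase = {"die", "perish"}
words = [{"kick", "strike"}, {"the"}, {"bucket", "pail"}]
print(idiom_decision(phrase, words, "or"))   # True
print(idiom_decision(phrase, words, "and"))  # True
```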

Table 1: Components of the Collocation Extraction Subsystem of ICE

    Component                     Description
    Ngram Extractor               Generates bigrams up to 8-grams
    Dictionary Check              Looks up the n-gram in a dictionary (WordNet)
    Online Dictionary             Looks up the n-gram in online dictionaries (Bing)
    Web Search and Substitution   Hit counts for the phrase query and 5 queries generated by
                                  randomly changing the words in the n-gram
    Web Search and Independence   Checks whether the probability of a phrase exceeds the
                                  probability expected if the words were independent
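The Web Search and Independence component in Table 1 amounts to comparing an estimated phrase probability against the product of the individual word probabilities. The sketch below is illustrative only: the function name, the hit counts, and the web-size normalizer are made up, and ICE's real Bing queries and thresholds are not reproduced.

```python
def looks_like_collocation(phrase_hits, word_hits, n_pages, margin=1.0):
    """Flag a phrase whose estimated probability exceeds the probability
    expected if its words occurred independently.

    phrase_hits: hit count for the exact phrase query.
    word_hits:   hit counts for each individual word of the n-gram.
    n_pages:     rough size of the searchable web, used as a normalizer.
    """
    p_phrase = phrase_hits / n_pages
    p_independent = 1.0
    for hits in word_hits:
        p_independent *= hits / n_pages
    return p_phrase > margin * p_independent

# "strong tea" style example: the phrase occurs far more often than chance
# co-occurrence of its (individually frequent) words would predict.
print(looks_like_collocation(2_000_000, [50_000_000, 120_000_000], 1e10))  # True
```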

tagging and n-gram extraction. All the components of ICE are connected as a pipeline, hence every part of the system can be changed without affecting the other parts. The tool is available online at https://github.com/shahryarabaki/ICE as a Python package and also as a website. The software is licensed under the Apache License, Version 2.0.

Figure 3: Screenshot of the online version of ICE

References

Nikolaos K. Anagnostou and George R. S. Weir. 2006. Review of software applications for deriving collocations. In ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, pages 91–100.

Satanjeev Banerjee and Ted Pedersen. 2003. The design, implementation, and use of the ngram statistics package. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381. Springer.

Araly Barrera and Rakesh Verma. 2012. Combining syntax and semantics for automatic extractive single-document summarization. In CICLING, volume LNCS 7182, pages 366–377.

Araly Barrera, Rakesh Verma, and Ryan Vincent. 2011. SemQuest: University of Houston's semantics-based question answering system. In Proceedings of the Fourth Text Analysis Conference, TAC 2011, Gaithersburg, Maryland, USA, November 14-15, 2011.

Chitra Garg and Lalit Goyal. 2014. Automatic extraction of idiom, proverb and its variations from text using statistical approach. An International Journal of Engineering Sciences.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. Multiword expressions in the wild?: the mwetoolkit comes in handy. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 57–60. Association for Computational Linguistics.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics, 2:193–206.

Violeta Seretan. 2013. A multilingual integrated framework for processing lexical collocations. In Computational Linguistics - Applications, pages 87–108. Springer, Netherlands.

Rakesh Verma and Vasanthi Vuppuluri. 2015. A new approach for idiom identification using meanings and the web. In Recent Advances in Natural Language Processing, pages 681–687.

Rakesh Verma, Vasanthi Vuppuluri, An Nguyen, Arjun Mukherjee, Ghita Mammar, Shahryar Baki, and Reed Armstrong. 2016. Mining the web for collocations: IR models of term associations. In International Conference on Intelligent Text Processing and Computational Linguistics.

Bib2vec: Embedding-based Search System for Bibliographic Information

Takuma Yoneda   Koki Mori   Makoto Miwa   Yutaka Sasaki
Department of Advanced Science and Technology
Toyota Technological Institute
2-12-1 Hisakata, Tempaku-ku, Nagoya, Japan
{sd14084, sd15435, makoto-miwa, yutaka.sasaki}@toyota-ti.ac.jp

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 112–115. © 2017 Association for Computational Linguistics

Abstract

We propose a novel embedding model that represents relationships among several elements in bibliographic information with high representation ability and flexibility. Based on this model, we present a novel search system that shows the relationships among the elements in the ACL Anthology Reference Corpus. The evaluation results show that our model can achieve a high prediction ability and produce reasonable search results.

1 Introduction

Modeling relationships among several types of information, such as nodes in information networks, has attracted great interest in natural language processing (NLP) and data mining (DM), since such modeling can uncover hidden information in data. Topic models such as the author-topic model (Rosen-Zvi et al., 2004) have been widely studied to represent relationships among these types of information. These models, however, need considerable effort to incorporate new types and do not scale well in the number of types, since they explicitly model the relationships between types in the generating process.

Word representation models, such as skip-gram and continuous bag-of-words (CBOW) (Mikolov et al., 2013), have had great success in NLP. They have been widely used to represent texts, but recent studies have started to apply these methods to represent other types of information, e.g., authors or papers in citation networks (Tang et al., 2015).

We propose a novel embedding model that represents relationships among several elements in bibliographic information, which is useful for discovering hidden relationships such as authors' interests and similar authors. We built a novel search system that enables searching for authors and words related to other authors based on the model, using the ACL Anthology Reference Corpus (Bird et al., 2008). Based on skip-gram and CBOW, our model embeds vectors not only for words but also for other elements of bibliographic information, such as authors and references, and provides great representation ability and flexibility. The vectors can be used to calculate distances among the elements using similarity measures such as cosine distance and inner products. For example, the distances can be used to find words or authors related to a specific author. Our model can easily incorporate new types without changing the model structure, and it scales well in the number of types.

2 Related works

Most previous work on modeling several elements in bibliographic information is based on topic models such as the author-topic model (Rosen-Zvi et al., 2004). Although these models work fairly well, they have comparably low flexibility and scalability since they explicitly model the generation process. Our model employs word representation-based models instead of topic models.

Some previous work embedded vectors for such elements. Among them, large-scale information network embedding (LINE) (Tang et al., 2015) embeds a vector for each node in an information network. LINE handles a single type of information and prepares a network for each element separately. By contrast, our model simultaneously handles all the types of information.

3 Method

We propose a novel method to represent bibliographic information by embedding vectors for elements, based on skip-gram and CBOW.

3.1 Task definition

We assume the bibliographic data set has the following structure. The data set is composed of the bibliographic information of papers. Each paper consists of several categories. Categories are divided into two groups: a textual category Ψ (e.g., titles and abstracts¹) and non-textual categories Φ (e.g., authors and references). Figure 1 illustrates an example structure of the bibliographic information of a paper. Each category has one or more elements; the textual category usually has many elements, while a non-textual category has only a few (e.g., the authors of a paper are not many).

¹ Note that we have only one textual category since the categories for texts are usually not distinguished in most word representation models.

Figure 1: Example of the bibliographic information of a paper when the target is an element in a non-textual category. The black element is the target and the shaded elements are contexts.

3.2 Proposed model

Our model focuses on a target element and predicts a context element from the target element. We use only the elements in non-textual categories as contexts, to reduce the computational cost. Figure 1 shows the case when we use an element in a non-textual category as a target: for the black-painted target element in category Φ², the shaded elements in the same paper are used as its contexts. When we use elements in the textual category as a target, instead of treating each element as a target, we consider that the textual category has only one element that represents all the elements in the category, like CBOW. Figure 2 exemplifies the case where we consider the averaged vector of the vectors of all the elements in the textual category as a target.

Figure 2: Example when the target is the elements in the textual category

We describe our probabilistic model to predict a context element e_O^j from a target e_I^i in a certain paper. We define two d-dimensional vectors \upsilon_t^i and \omega_t^i to represent an element e_t^i as a target and as a context, respectively. Similarly to the skip-gram model, the probability to predict element e_O^j in the context from input e_I^i is defined as follows:

    p(e_O^j \mid e_I^i) = \frac{\exp(\omega_O^j \cdot \upsilon_I^i + \beta_O^j)}{\sum_{(\omega_s^j, \beta_s^j) \in S^j} \exp(\omega_s^j \cdot \upsilon_I^i + \beta_s^j)}, \quad e_O^j \in \Phi, \ e_I^i \in \Psi \cup \Phi,   (1)

where \beta_s^j denotes a bias corresponding to \omega_s^j, and S^j denotes the pairs of \omega_s^j and \beta_s^j that belong to a category \Phi^j. As mentioned, our model considers that the textual category Ψ has only one averaged vector. This vector \upsilon_{rep}^j can be described as:

    \upsilon_{rep}^j = \frac{1}{n} \sum_{q=1}^{n} \upsilon_q^j, \quad e_q^j \in \Psi.   (2)

Our target loss can be defined as:

    -\sum_{(e_a, e_b) \in D} \log p(e_b \mid e_a),   (3)

where D denotes the set of all correct pairs of elements in the data set. To reduce the cost of the summation in Eq. (1), we applied noise-contrastive estimation (NCE) to minimize the loss (Gutmann and Hyvärinen, 2010).

3.3 Predicting related elements

We predict the top k elements related to a query element by calculating their similarities to the query element. We calculate the similarities using one of three similarity measures: the linear function in Eq. (1), the dot product, and cosine distance.

4 Experiments

4.1 Evaluation settings

We built our data set from the ACL Anthology Reference Corpus version 20160301 (Bird et al., 2008). The statistics of the data set and our model settings are summarized in Table 1. As pre-processing, we deleted commas and periods stuck to the tails of words and removed non-alphabetical words such as numbers

Table 1: Summary of our data set and model

    Category   Type         #Elements (Original)  #Elements (Processed)  Min. Freq.
    text       textual      59,276                10,994                 20
    author     non-textual  17,260                2,609                  5
    reference  non-textual  10,871                10,871                 1
    year       non-textual  16                    16                     1
    paper-id   non-textual  19,475                19,475                 1

and brackets from abstracts and titles. We then lowercased the words and made phrases using the word2phrase tool².

² https://github.com/tmikolov/word2vec

We prepared 5 categories: author, paper-id, reference, year and text. author consists of the list of authors, without distinguishing the order of the authors. paper-id is a unique identifier assigned to each paper, and this mimics the paragraph vector model (Le and Mikolov, 2014). reference includes the paper ids of reference papers in this data set. Although ids in paper-id and reference are shared, we did not assign the same vectors to the ids since they belong to different categories. year is the publication year of the paper. text includes words and phrases in both abstracts and titles, and it belongs to the textual category Ψ, while each other category is treated as a non-textual category Φᵢ. We regard elements as unknown when they appear less often than the minimum frequencies in Table 1.

We split the data set into training and test sets. We prepared 17,475 papers for training and the remaining 2,000 papers for evaluation. For the test set, we regarded the elements that do not appear in the training set as unknown elements. We set the dimension d of the vectors to 300 and show the results with the linear function.

4.2 Evaluation

We automatically built multiple choice questions and evaluated the accuracy of our model. We also compared some results of our model with those of the author-topic model.

Our method models elements in several categories and allows us to estimate relationships among the elements with high flexibility, but this makes the evaluation complex. Since it is tough to evaluate all the possible combinations of inputs and targets, we focused on relationships between authors and other categories. We prepared an evaluation data set that requires estimating an author from the other elements. We removed a (not unknown) author from each paper in the evaluation set and asked the system to predict the removed author considering all the other elements in the paper. Choosing the correct author from all the authors can be insanely difficult, so we prepared 10 selection candidates. In order to evaluate the effectiveness of our model, we compared the accuracy on this data set with that of logistic regression. As a result, our model achieved 74.3% (1,486 / 2,000) accuracy, which was comparable to the 74.1% (1,482 / 2,000) of logistic regression.

Table 2 shows examples of the search results using our model. The leftmost column shows the authors we input to our model. In the rightmost two columns, we manually picked words and authors belonging to a certain topic described in Sim et al. (2015) that can be considered to correspond to the input author. This table shows that our model can predict related words or similar authors favorably well, although the words are inconsistent with those of the author-topic model.

Figure 3 shows a screenshot of our system. The left-hand box shows words related to the query in a word cloud, and the right-hand box shows the closest authors. We can input a query by putting it in the textbox or by clicking one of the authors in the right-hand box, and select a similarity measure by selecting a radio button.

4.3 Discussion

When we train the model, we do not use elements in category Ψ as context. This reduces the computational cost, but it might disturb the accuracy of the embeddings. Furthermore, we use the averaged vector for the textual category Ψ, so we do not consider the importance of each word. Our model might also ignore the inter-dependency among elements since we applied skip-grams. To resolve these problems, we plan to incorporate attention (Ling et al., 2015) so that the model can pay more attention to certain elements that are important for predicting other elements.

We also found that some elements have several aspects. For example, words related to an author spread over several different tasks in NLP. We may be able to model this by embedding multiple vectors (Neelakantan et al., 2014).

5 Conclusions

This paper proposed a novel embedding method that represents several elements in bibliographic

Table 2: Working examples of our model and the author-topic model

    Input Author: Philipp Koehn
      Our Model:           Relevant Words: machine translation, translation, human translators
                           Similar Authors: Hieu Hoang, Alexandra Birch, Qun Liu, Eva Hasler
      Author-Topic Model:  Topic Words: alignment, hmeant, align
                           Topic Authors: Chris Dyer, Hermann Ney

    Input Author: Ryan McDonald
      Our Model:           Relevant Words: dependency parsing, parse, sentense, parser
                           Similar Authors: Keith Hall, Slav Petrov, Joakim Nivre, David Talbot, Jens Nilson
      Author-Topic Model:  Topic Words: extrinsic, hearing
                           Topic Authors: Michael Collins
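The scoring behind these search results can be sketched with NumPy under toy assumptions (random stand-in vectors, d = 4 instead of the 300 used in the paper): Eq. (2) averages the word vectors into one textual target, Eq. (1) applies a softmax over the context category's linear scores, and the cosine measure of Section 3.3 gives an alternative ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # toy dimension; the paper uses 300

# Target vectors (upsilon) of 5 textual elements, plus context vectors
# (omega) and biases (beta) of one non-textual category with 3 candidates.
word_targets = rng.normal(size=(5, d))
author_contexts = rng.normal(size=(3, d))
author_biases = rng.normal(size=3)

# Eq. (2): the textual category is represented by one averaged vector.
v_rep = word_targets.mean(axis=0)

# Eq. (1): softmax over the context category of linear scores omega.v + beta.
scores = author_contexts @ v_rep + author_biases
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Section 3.3: related elements can also be ranked by cosine similarity.
cosine = (author_contexts @ v_rep) / (
    np.linalg.norm(author_contexts, axis=1) * np.linalg.norm(v_rep))

print("p(author | text):", np.round(probs, 3))
print("top author by linear score:", int(scores.argmax()))
print("top author by cosine:", int(cosine.argmax()))
```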

Figure 3: Screenshot of the system with the search results for the query “Ryan McDonald”.

information with high representation ability and flexibility, and presented a system that can search for relationships among the elements in the bibliographic information. Experimental results in Table 2 show that our model can predict related words or similar authors favorably well. We plan to extend our model with other modifications such as incorporating attention and embedding multiple vectors per element. Since this model has high flexibility and scalability, it can be applied not only to papers but also to a variety of bibliographic information in broad fields.

Acknowledgments

We would like to thank the anonymous reviewer for helpful comments and suggestions.

References

Steven Bird, Robert Dale, Bonnie J. Dorr, Bryan R. Gibson, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. In LREC.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML, pages 1188–1196.

Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W Black, Isabel Trancoso, and Chu-Cheng Lin. 2015. Not all contexts are created equal: Better word representations with variable attention. In EMNLP, pages 1367–1372.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP, pages 1059–1069.

Michal Rosen-Zvi, Thomas L. Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In UAI, pages 487–494.

Yanchuan Sim, Bryan R. Routledge, and Noah A. Smith. 2015. A utility model of authors in the scientific community. In EMNLP, pages 1510–1519.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW, pages 1067–1077.
The SUMMA Platform Prototype

Renars Liepins† Ulrich GermannH [email protected] [email protected]

Guntis Barzdins† Alexandra BirchH Steve RenalsH Susanne Weberu · · · Peggy van der Kreefti Hervé BourlardU João PrietoJ Ondrejˇ KlejchH · · · Peter BellH Alexandros LazaridisU Alfonso MendesJ Sebastian Riedels · · · Mariana S. C. AlmeidaJ Pedro BalageJ Shay CohenH Tomasz DwojakH · · · Phil GarnerU Andreas Gieferi Marcin Junczys-DowmuntH Hina Imrani · · · David NogueiraJ Ahmed Ali? Sebastião MirandaJ Andrei Popescu-BelisU · · · Lesly Miculicich WerlenU Nikos PapasarantopoulosH Abiola Obamuyide‡ · · Clive Jonesu Fahim Dalvi? Andreas Vlachos‡ Yang WangU Sibo TongU · · · · Rico SennrichH Nikolaos PappasU Shashi NarayanH Marco DamonteH · · · Nadir Durrani? Sameer Khurana? Ahmed Abdelali? Hassan Sajjad? · · · Stephan Vogel? David Sheppeyu Chris Hernonu Jeff Mitchells · · · †Latvian News Agency HUniversity of Edinburgh iDeutsche Welle uBBC

UIdiap Research Institute JPriberam Informatica S.A. sUniversity College London

‡University of Sheffield ?Qatar Computing Research Institute

Abstract

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

1 Introduction

SUMMA (Scalable Understanding of Multilingual Media)¹ is a three-year Research and Innovation Action (February 2016 through January 2019), supported by the European Union's Horizon 2020 research and innovation programme. SUMMA is developing a highly scalable, integrated web-based platform to automatically monitor an arbitrarily large number of public broadcast and web-based news sources. Two concrete use cases and an envisioned third use case drive the project.

1.1 Monitoring of External News Coverage

BBC Monitoring, a division of the British Broadcasting Corporation (BBC), monitors a broad variety of news sources from all over the world on behalf of the BBC and external customers. About 300² staff journalists and analysts track TV, radio, internet, and social media sources in order to detect trends and changing media behaviour, and to flag breaking news events. A single monitoring journalist typically monitors four TV channels and several online sources simultaneously. This is about the maximum that any person can cope with mentally and physically. Assuming 8-hour shifts, this limits the capacity of BBC Monitoring to monitoring about 400 TV channels at any given time on average. At the same time, BBC Monitoring has access to about 13,600 distinct sources, including some 1,500 TV and 1,350 radio broadcasters. Automating the monitoring process not only allows the BBC to cover a broader

¹ www.summa-project.eu
² To be reduced to 200 by the end of March, 2017.

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 116–119. © 2017 Association for Computational Linguistics

Figure 1 annotations: Journalists access the system through a web-based GUI. Data feed modules and live stream monitoring bring content into the system. RethinkDB stores media content and meta-information. RabbitMQ handles communication between the database and the NLP modules that augment the monitored data. NLP modules for ASR, MT, Named Entity Tagging, etc. run as applications in Docker containers; light-weight node.js wrappers integrate the components into the SUMMA infrastructure. Docker provides containerization of the various components, and Docker Compose orchestrates them all into a working infrastructure.

Figure 1: Architecture of the SUMMA Platform

spectrum of news sources, but also allows journalists to perform deeper analysis by enhancing their ability to search through broadcast media across languages in a way that other monitoring platforms do not support.

1.2 Monitoring Internal News Production

Deutsche Welle is Germany's international public service broadcaster. It provides international news and background information from a German perspective in 30 languages worldwide, 8 of which are used within SUMMA. News production within Deutsche Welle is organized by language and regional departments that operate and create content fairly independently. Interdepartmental collaboration and awareness, however, are important to ensure a broad, international perspective. Multilingual internal monitoring of world-wide news production (including underlying background research) helps to increase awareness of the work between the different language news rooms, decrease latency in reporting, and reduce the cost of news production within the service by allowing adaptation of existing news stories for particular target audiences rather than creating them from scratch.

1.3 Data Journalism

The third use case is data journalism. Measurable data is extracted from the content available in and produced by the SUMMA platform, and graphics are created from such data. The data journalism dashboard will be able to provide, for instance, a graphical overview of trending topics over the past 24 hours or a heatmap of storylines. It can place geolocations of trending stories on a map. Customised dashboards can be used to follow particular storylines. For the internal monitoring use case, it will visualize statistics of content that was reused by other language departments.

2 System Architecture

Figure 1 shows an overview of the SUMMA Platform prototype. The Platform is implemented as an orchestra of independent components that run as individual containers on the Docker platform. This modular architecture gives the project partners a high level of independence in their development. The system comprises the following individual processing components.

2.1 Data Feed Modules and Live Streams

These modules each monitor a specific news source for new content. Once new content is available, it is downloaded and fed into the database via a common REST API. Live streams are automatically segmented into logical segments.

2.2 Database Back-end

RethinkDB³ serves as the database back-end. Once new content is added, RethinkDB issues processing requests to the individual NLP processing modules via RabbitMQ.

³ www.rethinkdb.com
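The hand-off between data feed modules, database, and NLP workers can be illustrated, purely as a sketch, with an in-memory queue standing in for RabbitMQ. Everything here (function names, task names, document fields) is invented for the illustration and is not the platform's real interface.

```python
import json
import queue

work_queue = queue.Queue()   # stands in for a RabbitMQ queue

def ingest(document, store):
    """Data feed: store new content, then request NLP processing for it."""
    store[document["id"]] = document
    work_queue.put(json.dumps({"id": document["id"], "tasks": ["asr", "mt"]}))

def nlp_worker(store):
    """NLP module: consume one request and augment the stored document."""
    request = json.loads(work_queue.get())
    document = store[request["id"]]
    for task in request["tasks"]:
        document.setdefault("annotations", []).append(task + ":done")

store = {}
ingest({"id": "clip-001", "text": "breaking news ..."}, store)
nlp_worker(store)
print(store["clip-001"]["annotations"])   # ['asr:done', 'mt:done']
```

Because the modules communicate only through the store and the queue, each one can be replaced independently, mirroring the containerized design described above.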

Storyline Index: Quick access to current storylines.

Storyline Summary: Multi-item/document summary of news items in the storyline.

Individual news stories within the storyline. Left: frame to play the original video (if applicable); Right: tabbed text box with automatic transcription of the original audio source, automatic translation (plain text), and automatic translation with recognized named entities marked up.

Figure 2: Web-based User Interface of the SUMMA Platform (Storyline View)

2.3 Automatic Speech Recognition

Spoken language from audio and video streams is first processed by automatic speech recognition to turn it into text for further processing. Models are trained on speech from the broadcast domain using the Kaldi toolkit (Povey et al., 2011); speech recognition is performed using the CloudASR platform (Klejch et al., 2015).

2.4 Machine Translation

The lingua franca within SUMMA is English. Machine translation based on neural networks is used to translate content into English automatically. The back-end MT systems are trained with the Nematus toolkit (Sennrich et al., 2017); translation is performed with AmuNMT (Junczys-Dowmunt et al., 2016).

2.5 Entity Tagging and Linking

Depending on the source language, entity tagging and linking is performed either natively or on the English translation. Entities are detected with TurboEntityRecognizer, a named entity recognizer within TurboParser⁴ (Martins et al., 2009). Then, we link the detected mentions to the knowledge base with a system based on our submission to TAC-KBP 2016 (Paikens et al., 2016).

2.6 Topic Recognition and Labeling

This module labels incoming news documents and transcripts with a fine-grained set of topic labels. The labels are learned from a multilingual corpus of nearly 600k documents in 8 of the 9 SUMMA languages (all except Latvian), which were manually annotated by journalists at Deutsche Welle. The document model is a hierarchical attention network with attention at each level of the hierarchy, inspired by Yang et al. (2016), followed by a sigmoid classification layer.

2.7 Deep Semantic Tagging

The system also has a component that performs semantic parsing into Abstract Meaning Representations (Banarescu et al., 2013), with the aim of eventually incorporating them into storyline generation. The parser was developed by Damonte et al. (2017). It is an incremental left-to-right parser that builds an AMR graph structure using a neural network controller. It also includes adaptations to German, Spanish, Italian and Chinese.

2.8 Knowledge Base Construction

This component provides a knowledge base of factual relations between entities, built with a model based on Universal Schemas (Riedel et al., 2013), a low-rank matrix factorization approach. The entity relations are extracted jointly across multiple languages, with entity pairs as rows and a set of structured relations and textual patterns as columns. The relations provide information about how various entities present in news

⁴ https://github.com/andre-martins/TurboParser

documents are connected.

2.9 Storyline Construction and Summarization

Storylines are constructed via online clustering, i.e., by assigning storyline identifiers to incoming documents in a streaming fashion, following the work of Aggarwal and Yu (2006). The resulting storylines are subsequently summarized via an extractive system based on Almeida and Martins (2013).

3 User Interface

Figure 2 shows the current web-based SUMMA Platform user interface in the storyline view. A storyline is a collection of news items concerning a particular "story" and how it develops over time. Details of the layout are explained in the figure annotations.

4 Future Work

The current version of the Platform is a prototype designed to demonstrate the orchestration and interaction of the individual processing components. The look and feel of the page may change significantly over the course of the project, in response to the needs, requirements, and feedback of the use case partners, the BBC and Deutsche Welle.

5 Availability

The public release of the SUMMA Platform as open source software is planned for April 2017.

6 Acknowledgments

This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688139.

References

Charu C. Aggarwal and Philip S. Yu. 2006. A framework for clustering massive text and categorical data streams. In SIAM Int'l. Conf. on Data Mining, pages 479–483. SIAM.

Miguel B. Almeida and Andre F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL, pages 196–206.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Linguistic Annotation Workshop.

Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for abstract meaning representation. In EACL.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. CoRR, abs/1610.01108.

Ondřej Klejch, Ondřej Plátek, Lukáš Žilka, and Filip Jurčíček. 2015. CloudASR: platform and service. In Int'l. Conf. on Text, Speech, and Dialogue, pages 334–341. Springer.

André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In ACL, pages 342–350.

Peteris Paikens, Guntis Barzdins, Afonso Mendes, Daniel Ferreira, Samuel Broscheit, Mariana S. C. Almeida, Sebastião Miranda, David Nogueira, Pedro Balage, and André F. T. Martins. 2016. SUMMA at TAC knowledge base population task 2016. In TAC, Gaithersburg, Maryland, USA.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi speech recognition toolkit. In ASRU.

Sebastian Riedel, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL, June.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In EACL Demonstration Session, Valencia, Spain.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL, San Diego, CA, USA.


Author Index

Abdelali, Ahmed, 61, 116
Ai, Renlong, 5
Al Khatib, Khalid, 13
Ali, Ahmed, 61, 116
Almeida, Mariana S. C., 116
Baki, Shahryar, 108
Balage, Pedro, 116
Barzdins, Guntis, 116
Baumartz, Daniel, 17
Bell, Peter, 116
Bignotti, Enrico, 77
Birch, Alexandra, 65, 116
Boroş, Tiberiu, 33
Boullosa, Beto, 69
Bourlard, Herve, 116
Busemann, Stephan, 5
Cabrera, Benjamin, 21
Cartier, Emmanuel, 95
Cho, Kyunghyun, 65
Cohen, Shay B., 116
Cuadros, Montse, 49
da Cunha, Iria, 57
Dalvi, Fahim, 61, 116
Damonte, Marco, 116
Dehdari, Jon, 5
Dumitrescu, Stefan Daniel, 33
Durrani, Nadir, 61, 116
Dwojak, Tomasz, 116
Eckart de Castilho, Richard, 69
Ferrero, Germán, 104
Ferri Ramírez, Cesar, 41
Fersini, Elisabetta, 25
Firat, Orhan, 65
Gabryszak, Aleksandra, 5
Galitsky, Boris, 87
Gamallo, Pablo, 45
Garcia, Marcos, 45
Garner, Philip N., 116
Germann, Ulrich, 116
Geyken, Alexander, 69
Giefer, Andreas, 116
Gontrum, Johannes, 29
Groschwitz, Jonas, 29
Gurevych, Iryna, 69
Haddow, Barry, 65
Heigold, Georg, 5
Hemati, Wahed, 17
Hennig, Leonhard, 5
Henrich, Verena, 53
Hernon, Chris, 116
Hitschler, Julian, 65
Hysa, Luis, 57
Ilvovsky, Dmitry, 87
Imran, Hina, 116
Jones, Clive, 116
Junczys-Dowmunt, Marcin, 65, 116
Khurana, Sameer, 61, 116
Kiesel, Johannes, 13
Klejch, Ondrej, 116
Koller, Alexander, 29
Kutuzov, Andrey, 99
Kuzmenko, Elizaveta, 99
Lang, Alexander, 53
Läubli, Samuel, 65
Lazaridis, Alexandros, 116
Lemnitzer, Lothar, 69
Lepri, Bruno, 77
Liepins, Renars, 116
List, Johann-Mattis, 9
Manchanda, Pikakshi, 25
Marsi, Erwin, 91
Mehler, Alexander, 17
Mendes, Alfonso, 116
Menini, Stefano, 77
Messina, Enza, 25
Miceli Barone, Antonio Valerio, 65
Miculicich Werlen, Lesly, 116
Minock, Michael, 1
Miranda, Sebastião, 116
Mitchell, Jeff, 116
Miwa, Makoto, 112
Mokry, Jozef, 65
Montané, M. Amor, 57
Moreno-Ortiz, Antonio, 73
Moretti, Giovanni, 77
Mori, Koki, 112
Morillo Alcívar, Paulina Adriana, 41
Mubarak, Hamdy, 61
Nadejde, Maria, 65
Narayan, Shashi, 116
Nguyen, An, 108
Nogueira, David, 116
Nozza, Debora, 25
Obamuyide, Abiola, 116
Øzturk, Pinar, 91
Palmonari, Matteo, 25
Papasarantopoulos, Nikos, 116
Pappas, Nikolaos, 116
Perez, Naiara, 49
Pipa, Sonia, 33
Popescu-Belis, Andrei, 116
Prieto, João, 116
Primadhanty, Audi, 104
Quattoni, Ariadna, 104
Renals, Steve, 116
Rethmeier, Nils, 5
Riedel, Sebastian, 116
Ristagno, Fausto, 25
Rodríguez-Torres, Iván, 45
Ross, Björn, 21
Rubino, Raphael, 5
Sajjad, Hassan, 61, 116
Sasaki, Yutaka, 112
Schmeier, Sven, 5
Sennrich, Rico, 65, 116
Sheppey, David, 116
Sprugnoli, Rachele, 77
Steffen, Jörg, 5
Stein, Benno, 13
Steinert, Laura, 21
Teichmann, Christoph, 29
Thomas, Philippe, 5
Thorne, James, 37
Tonelli, Sara, 77
Tong, Sibo, 116
Torr, John, 81
Uslu, Tolga, 17
Uszkoreit, Hans, 5
V. Ardelan, Murat, 91
Vallejo Huanga, Diego Fernando, 41
van der Kreeft, Peggy, 116
van Genabith, Josef, 5
Verma, Rakesh, 108
Vlachos, Andreas, 37, 116
Vogel, Stephan, 61, 116
Vuppuluri, Vasanthi, 108
Wachsmuth, Henning, 13
Wang, He, 5
Wang, Yang, 116
Weber, Susanne, 116
Xu, Feiyu, 5
Yoneda, Takuma, 112
Zhang, Yifan, 61