Legal Entity Extraction with NER Systems


Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid

Master Thesis
Master in Artificial Intelligence

Author: Ines Badji, [email protected]
Advisor(s): Óscar Corcho and Víctor Rodríguez-Doncel
https://github.com/ibadji/Legal-NER
June, 2018

Acknowledgments

I am very grateful to have been given the opportunity to work with the Ontological Engineering Lab and would like to thank all its members for the warmth, kindness and support they showed me. Particular thanks go to my advisors Óscar Corcho, Víctor Rodríguez-Doncel and Elena Montiel for helping me with this Master Thesis. It was a pleasure collaborating with you and I hope to have the chance to work with you in the future.

Abstract

Named Entity Recognition over texts belonging to the legal domain focuses on categories (legal entities) such as references to specific laws, judgments, names of courts or stages in a legal process. Although there is a rich choice of libraries for implementing NER systems, these libraries are not domain specific and do not work well on text pertaining to the legal domain. Similarly, little attention is given to Spanish, since most research is done on the English language.

The objective of the work presented in this thesis is the identification of legal entities in Spanish and English texts, with a main focus on informal references to legislative documents found in news, Twitter, contracts or journal articles. The work is framed in the H2020 Lynx project, aimed at creating a Legal Knowledge Graph enabling the provision of compliance-related services.

A rule-based approach, applied on top of a combination of Natural Language Processing tools, can be used to recognize references to norms in Spanish and English documents belonging to the legal domain. To recognize mentions in documents of a less formal nature, a number of colloquial variants of the names of public acts or judgments are necessary. By querying Wikidata, DBpedia and BOE, a table of synonyms is produced. These resources have been published along with a small annotated data set taken as gold standard.

Contents

1 Introduction
  1.1 Motivation
  1.2 Objective and approach
  1.3 Hypotheses
  1.4 Methodology
  1.5 Terminology
    1.5.1 Abbreviations
    1.5.2 Definitions
2 State of the Art
  2.1 Literature Overview
  2.2 Technical Overview
    2.2.1 Parameters Affecting Performance
    2.2.2 Implementation methods
    2.2.3 Features
    2.2.4 Processing Steps and Resources
    2.2.5 Tools
  2.3 Recognition of Legal References
3 Algorithms and Services
  3.1 Algorithms
    3.1.1 Patterns and resources of the different codes used
    3.1.2 Algorithms for NER of legal references
  3.2 Services
    3.2.1 Main
    3.2.2 Auxiliary
4 Experimentation and evaluation
  4.1 Implementation details
  4.2 Running the Code
  4.3 Corpus
  4.4 Methodology of evaluation
  4.5 Results
    4.5.1 Results of NER of legal references
    4.5.2 Additional results
  4.6 Publication as services Results
5 Conclusions

List of Figures

1 Sample text with references to norms
2 Twitter Example
3 Entities List that can be extracted for Spanish
4 Entities List that can be extracted for English
5 Flow chart showing the process followed in order to gather the information for the rule creation
6 Regex rule sample
7 PEG rule sample
8 Dbpedia Query
9 Dbpedia Query
10 Wikidata Query
11 SPARQL Query Result
12 Service Use from a user point of view
13 Flowchart of the main components of the Code
14 Nicknames Output
15 Rule Output
16 Final Output, Text Version
17 Highlighted Corpus Focused on Law Entities
18 Map for the Corpus in Fig. 17
19 Highlighted Corpus Focused on Person Organization and Location
20 Map for the Corpus in Fig. 19
21 Organization of github folders
22 Structure of the corpus
23 English: Precision of the rule service
24 English: Recall of the rule service
25 English: F1 score of the rule service
26 English: Precision of the other service
27 English: Recall of the other service
28 English: F1 score of the other service
29 Spanish: Precision of the rule service
30 Spanish: Recall of the rule service
31 Spanish: F1 score of the rule service
32 Spanish: Precision of the other service
33 Spanish: Recall of the other service
34 Spanish: F1 score of the other service
35 NER State of the Art
36 European Union Administrative composition
37 European Union Law referencing
38 Spanish Law
39 Spanish Government: Part 1
40 Spanish Government: Part 2
41 Spain - Law Referencing

List of Tables

1 Example of Legal Entity extraction
2 State of the Art: Summary of the Technical Overview
3 Sample of available NER Tools
4 Named entity recognition: software and projects related to the legal domain
5 Companies working on technology related to legal domain

1 Introduction

1.1 Motivation

Lawyers and other professionals in the legal field find it extremely valuable to have references to legal entities identified and highlighted in legal information systems, possibly with hyperlinks to source documents. Machines, on the other hand, make even more important use of these references, using them to improve search algorithms, anonymize documents, perform data analysis, summarize long texts or provide recommendation services.

Named-entity recognition (NER) is a Natural Language Processing task aimed at identifying references to specific entities in a text. Those entities can be of different types, such as persons, organizations, places, dates, quantities, monetary values or percentages. In the legal domain, references to other entities are also of interest, such as norms, judgments, courts or jurisdictions. These entities are hereinafter referred to as legal entities.
For example, the financial aid information document for postdoctoral studies at UPM [19] states the following:

Las ayudas objeto de esta convocatoria están sometidas a las Bases Reguladoras contenidas en la Resolución Rectoral de 23 de febrero de 2017 para la concesión de ayudas del programa propio de I+D+i de la UPM, a la Ley 39/2015, de 1 de octubre, del Procedimiento Administrativo Común de las Administraciones Públicas, la Ley 35/2006, de 28 de noviembre, del Impuesto sobre la Renta de las Personas Físicas y de modificación parcial de las leyes de los Impuestos sobre Sociedades, sobre la Renta de no Residentes y sobre el Patrimonio, la Ley 38/2003, de 17 de noviembre, General de Subvenciones, Real Decreto 462/2002, de 24 de mayo, sobre indemnizaciones por razón del servicio así como la Resolución de 9 de febrero 2018 por la que se dictan instrucciones sobre comisiones de servicio con derecho a indemnización, y a los Estatutos de la UPM, aprobados por Decreto 74/2010, de 21 de octubre (BOCM del 15 de noviembre).

Fig. 1: Sample text with references to norms.

In the text above, references to different norms have been identified: three laws (leyes, in red colour), one royal decree (Real Decreto, in magenta), one decree of the Autonomous Community of Madrid (in orange), one statute (in pink) and two internal UPM dispositions (Resoluciones Rectorales, in brown). In addition, other legal entities might be found, such as institutions (UPM), official publications (BOCM), etc.

Tab. 1: Example of Legal Entity extraction

Detected Entity                    Entity Category
Ley 39/2015                        Ley Organica
Ley 35/2006                        Ley Organica
Ley 38/2003                        Ley Organica
Real Decreto 462/2002              Real decreto
Resolution de 9 de febrero 2018    Resolution
Decreto 74/2010                    Decreto
BOCM                               Abbreviation

The excerpt above comes from a formal document, and references to norms are expected to appear in their full form
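
The extraction of formal references like those in Tab. 1 can be illustrated with a small regular expression. The following is only a minimal sketch: the pattern `NORM_PATTERN` and the helper `extract_norm_references` are hypothetical names introduced here, not the rules used in the thesis, and the pattern assumes that a formal reference consists of a norm keyword followed by a number/year identifier.

```python
import re

# Hypothetical pattern (an assumption, not the thesis's actual rule):
# a norm keyword followed by a "number/year" identifier, e.g. "Ley 39/2015".
# "Real Decreto" is listed before "Decreto" so the longer keyword wins.
NORM_PATTERN = re.compile(
    r"\b(?P<kind>Real Decreto|Decreto|Ley)\s+(?P<number>\d+)/(?P<year>\d{4})\b"
)


def extract_norm_references(text):
    """Return (matched reference, norm keyword) pairs found in the text."""
    return [(m.group(0), m.group("kind")) for m in NORM_PATTERN.finditer(text)]


# Fragments taken from the sample text in Fig. 1.
sample = (
    "a la Ley 39/2015, de 1 de octubre, la Ley 38/2003, de 17 de noviembre, "
    "Real Decreto 462/2002, de 24 de mayo, y a los Estatutos de la UPM, "
    "aprobados por Decreto 74/2010, de 21 de octubre"
)

for reference, kind in extract_norm_references(sample):
    print(f"{reference} -> {kind}")
```

The sketch labels each match with its keyword rather than with the categories shown in Tab. 1. It does not cover dated references without an identifier, such as the Resolución de 9 de febrero 2018, nor informal names of norms, which would need additional rules or a table of synonyms.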