Legal Entity Extraction with NER Systems
Total Page:16
File Type:pdf, Size:1020Kb
Escuela Tecnica´ Superior de Ingenieros Informaticos´ Universidad Politecnica´ de Madrid Legal Entity Extraction with NER Systems Master Thesis Master In Artificial Intelligence Author: Ines Badji, [email protected] Advisor(s): Oscar´ Corcho and V´ıctor Rodr´ıguez-Doncel https://github.com/ibadji/Legal-NER June, 2018 i Acknowledgments I am very grateful to have been given the opportunity to work with the Ontological Engineering Lab and would like to thank all its members for the warmth, kindness and support they showed me. A particular thanks to my advisors Oscar´ Corcho, V´ıctor Rodr´ıguez-Doncel and Elena Montiel for helping me with this Master Thesis. It was a pleasure collaborating with you and I hope to have the chance to work with youinthefuture. ii iii Abstract Named Entity Recognition over texts belonging to the legal domain focuses on cat- egories (legal entities) like references to specific laws, judgments, name of courts or stages in a legal process. Although there is a rich choice of libraries for implementing NER systems, these late ones are not domain specific and do not work well on text pertaining to the Legal domain. Similarly, little focus is given to Spanish since most research is done on the English language. The objective of the work presented in this thesis is the identification of legal entities in Spanish and English texts, with a main focus on informal references to legislative documents found in news, Twitter, contracts or journal articles. The work is framed in the H2020 Lynx project, aimed at creating a Legal Knowledge Graph enabling the provision of compliance-related services. A Rule Based approach can be used to recognize references to norms in Spanish and English documents belonging to the legal domain applied on top of a combination of Natural Language Processing Tools. To recognize the mentions in documents of a less formal nature, a number of vulgar variants for the names of the public acts or judgments is necessary. By querying on Wikidata, DBpedia and BOE a table of synonyms is produced. These resources have been published along with a small annotated data set taken as gold standard. iv Contents v Contents 1 Introduction................................ 3 1.1Motivation................................. 3 1.2 Objective and approach ......................... 6 1.3 Hypotheses ................................ 8 1.4Methodology............................... 8 1.5 Terminology ................................ 10 1.5.1 Abbreviations........................... 10 1.5.2 Definitions ............................. 10 2 StateoftheArt.............................. 11 2.1LiteratureOverview........................... 11 2.2 Technical Overview ............................ 12 2.2.1 Parameters Affecting Performance ................ 13 2.2.2 Implementation methods ..................... 14 2.2.3 Features.............................. 14 2.2.4 Processing Steps and Resources: ................. 15 2.2.5 Tools................................ 15 2.3 Recognition of Legal References ..................... 17 3 AlgorithmsandServices......................... 23 3.1Algorithms................................. 23 3.1.1 Patternsandresourcesofthedifferentcodesused....... 23 3.1.2 Algorithms for NER of legal references ............. 26 3.2Services.................................. 29 3.2.1 Main................................ 29 3.2.2 Auxiliary ............................. 30 4 Experimentationandevaluation..................... 31 4.1 Implementation details .......................... 31 4.2 Running the Code . ........................... 33 4.3 Corpus.................................. 35 4.4Methodologyofevaluation........................ 37 4.5 Results.................................. 38 4.5.1 Results of NER of legal references ................ 38 4.5.2 Additional results ........................ 41 4.6 Publication as services Results ...................... 42 5 Conclusions................................ 45 vi Contents List of Figures vii List of Figures 1 Sample text with references to norms. .................. 3 2 Twitter Example . ........................... 4 3 Entities List that can be extracted for Spanish ............. 7 4 Entities List that can be extracted for English ............. 7 5 Flow chart showing the processed followed in order to gather the informationfortherulecreation..................... 25 6 Regex rule sample ............................. 26 7 PEG rule sample ............................. 26 8 DbpediaQuery.............................. 27 9 DbpediaQuery.............................. 28 10 WikidataQuery.............................. 28 11 SPARQLQueryResult.......................... 29 12 ServiceUsefromauserpointofview.................. 30 13 FlowchartofthemaincomponentsoftheCode............ 32 14 NicknamesOutput............................ 33 15 RuleOutput................................ 33 16 FinalOutput,TextVersion....................... 33 17 Highlighted Corpus Focused on Law Entities .............. 34 18 MapfortheCorpusinFig.17...................... 34 19 Highlighted Corpus Focused on Person Organization and Location . 35 20 MapfortheCorpusinFig.19...................... 35 21 Organizationofgithubfolders....................... 36 22 Structureofthecorpus.......................... 37 23 English:Precisionoftheruleservice.................. 39 24 English:Recalloftheruleservice.................... 39 25 English:F1scoreoftheruleservice................... 40 26 English:Precisionoftheotherservice.................. 40 27 English:Recalloftheotherservice................... 41 28 English:F1scoreoftheotherservice.................. 41 29 Spanish:Precisionoftheruleservice.................. 42 30 Spanish:Recalloftheruleservice.................... 42 31 Spanish:F1scoreoftheruleservice................... 43 32 Spanish:Precisionoftheotherservice................. 43 33 Spanish:Recalloftheotherservice................... 43 34 Spanish:F1scoreoftheotherservice.................. 44 35 NERStateoftheArt........................... 48 36 European Union Administrative composition .............. 49 37 European Union Law referencing .................... 50 38 SpanishLaw................................ 51 39 SpanishGovernment:Part1 ...................... 52 40 SpanishGovernment:Part2 ...................... 53 41 Spain - Law Referencing ......................... 54 viii List of Figures List of Tables 1 List of Tables 1 Example of Legal Entity extraction ................... 4 2 StateoftheArt:SummaryoftheTechnicalOverview......... 13 3 Sample of available NER Tools. ..................... 15 4 Named entity recognition: software and projects related to the legal domain................................... 18 5 Companiesworkingontechnologyrelatedtolegaldomain....... 19 2 List of Tables 3 1 Introduction 1.1 Motivation Lawyers and professionals of the Legal field find extremely valuable having refer- ences to legal entities identified and highlighted in legal information systems, pos- sibly with hyperlinks to source documents. Machines on the other hand, make an even more important use of these references, using them to improve search algo- rithms, anonymize documents, make data analysis, summarize long texts or provide recommendation services. Named-entity recognition (NER) is a Natural Language Processing task aimed at identifying references to specific entities in a text. Those entities can be of different types such as persons, organizations, places, dates, quantities, monetary values or percentages. In the legal domain, references to other entities are also of interest, such as norms, judgments, courts or jurisdictions. These entities are hereinafter referred as legal entities. For example, in the financial aid information document for postdoctoral studies in UPM [19]the following text is stated: Las ayudas objeto de esta convocatoria estan sometidas a las Bases Reguladoras contenidas en la Resoluci´on Rectoral de 23 de febrero de 2017 para la concesi´on de ayudas del programa propio de I+D+i de la UPM,alaLey 39/2015, de 1 de oc- tubre, del Procedimiento Administrativo Com´un de las Administraciones Publicas, la Ley 35/2006, de 28 de noviembre, del Impuesto sobre la Renta de las Personas Fisicas y de modificaci´on parcial de las leyes de los Impuestos sobre Sociedades, sobre la Renta de no Residentes y sobre el Patrimonio,laLey 3 8/2003, de 17 de noviembre, General de Subvenciones, Real Decreto 462/2002, de 24 de mayo, sobre indemnizaciones por raz´on del servicio as´ı como la Resoluci´on de 9 de febrero 2018 por la que se dictan instrucciones sobre comisiones de servicio con derecho a indemnizaci´on,yalosEstatutos de la UPM, aprobados por Decreto 74/2010, de 21 de octubre (BOCM del 15 de noviembre). Fig. 1: Sample text with references to norms. In the text above, references to different norms have been identified: three laws (leyes, in red colour), one royal decree (Real Decreto, in magenta), one decree of the Autonomous Community of Madrid (in orange), one statute (in pink) and two inter- nal UPM dispositions (Resoluciones Rectorales, in brown). In addition, other legal entities might be found, such as institutions (UPM), official publications BOCM), etc. 4 1 Introduction Tab. 1: Example of Legal Entity extraction Detected Entity Entity Category Ley 39/2015 Ley Organica Ley 39/2006 Ley Organica Ley 38/2003 Ley Organica Real Decreto 462/2002 Real decreto Resolution de 9 de febrero 2018 Resolution Decreto 74/2010 Decreto BOCM Abbreviation The excerpt above comes from a formal document, and references to norms are expected to appear in their full form