D2.4 Ontology population: connecting legal text to ontology concepts and instances

Grant Agreement nº: 690974
Project Acronym: MIREL
Project Title: MIning and REasoning with Legal texts
Website: http://www.mirelproject.eu/
Contractual delivery date: 31/12/2018
Actual delivery date: 31/12/2018
Contributing WP: 2
Dissemination level: Public
Deliverable leader: UL
Contributors: UNITO, CORDOBA, UNIBO, INRIA


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690974


Document History

Version | Date       | Author        | Partner | Description
0.1     | 30/11/2018 | Livio Robaldo | UL      | Initial draft
1.0     | 30/12/2018 | Livio Robaldo | UL      | Final version

Contributors

Partner | Name                 | Role        | Contribution
UL      | Livio Robaldo        | Editor      | Main editor of the document
UNITO   | Luigi Di Caro        | Contributor | Writing of specific sections
CORDOBA | Laura Alonso Alemany | Contributor | Writing of specific sections
UNIBO   | Monica Palmirani     | Contributor | Writing of specific sections
INRIA   | Serena Villata       | Contributor | Writing of specific sections

Disclaimer: The information in this document is provided “as is”, and no guarantee or warranty is given that the information is fit for any particular purpose. MIREL consortium members shall have no liability for damages of any kind, including, without limitation, direct, special, indirect, or consequential damages that may result from the use of these materials, subject to any liability which is mandatory due to applicable law.


Table of Contents

Executive Summary ... 5
1 Introduction ... 6
2 The DAPRECO knowledge base and the Privacy Ontology (PrOnto) ... 7
2.1 The General Data Protection Regulation (GDPR) ... 8
2.2 The Privacy Ontology (PrOnto) ... 9
2.3 The DAPRECO knowledge base ... 11
3 The Eunomos/MenslegiS system and the European Legal Taxonomy Syllabus ... 13
3.1 The Eunomos System ... 13
3.1.1 Statistical NLP procedures used in Eunomos ... 16
3.1.2 Rule-based NLP procedures used in Eunomos ... 20
3.2 The European Legal Taxonomy Syllabus ... 22
3.2.1 The European Legal Taxonomy Syllabus Schema ... 24
3.2.2 The European Legal Taxonomy Syllabus ontology ... 28
4 Legal NERC with ontologies ... 33
4.1 Establish a mapping between LKIF and YAGO ... 33
4.2 Domain and classes to be learned ... 34
4.3 Named entity recognition, classification, and linking trained on the LKIF+YAGO ontology ... 35
4.3.1 Legal NERC via curriculum ... 36
4.3.2 Legal NERC via support vector machines ... 41
5 Conclusions ... 50
References ... 50


Executive Summary

The goal of this deliverable is to report the achieved results and the ongoing work of the MIREL project devoted to connecting legal text to concepts and instances in legal ontologies. The connection may be done manually or via NLP procedures that mine named entities and recurring linguistic patterns from legal documents and connect them to legal ontologies. In fact, recurring patterns usually correspond to relevant concepts, which may be associated with computational constructs such as classes, individuals, or relations belonging to legal ontologies. For this reason, ontologies may also be created (and not only populated) via semi-automatic ontology learning techniques, i.e., by applying automated ontology inference and population techniques, and combining automated methods and human assessment with, e.g., curriculum learning techniques.


1 Introduction

Ontology as a branch of philosophy is described by [Smith, 2008] as “the science of what is, of the kinds of objects, properties, events, processes and relations in every area of reality”. Ontologies that explicitly describe reality, or provide “an explicit specification of a conceptualization” [Gruber, 1993], are also an important area of computer science. The objective in computer science is to provide people, or more typically artificial agents, with structured and navigable knowledge about entities and their inter-relations. Ontologies can help people share and reference knowledge about concepts in general or specialist areas, in one or more languages. Ontologies can also be used for tasks such as semantic search, interoperability between systems, or to facilitate reasoning and problem solving in artificial intelligence.

The peculiarities of the legal domain are many: laws are written in “legalese”, a domain-specific sublanguage that inherits all the expressivity and ambiguity of natural language, with additional terms of its own that are often obscure, subject to changes over time, contextually defined, ill-defined, subject to interpretation to deal with their vagueness, defined in incompatible ways in different legal sources, or difficult to translate into different languages. As [Peller, 1985] stated: “legal discourse can never escape its own textuality”.

For this reason, legal ontologies are an active field of research: they could be useful in a range of different scenarios. Legal ontologies could help legal practitioners and scholars keep up to date with continuous changes in the law and understand legal sub-languages outside their own areas of expertise or jurisdiction. They could help legislators draft legislation with clarity and consistency. Moreover, a multi-jurisdictional legal ontology could help show the inter-relationship between, for instance, national and European Union terms and foster harmonization, a need recognized, e.g., by the Mandelkern Group Report [Mandelkern, 2001] for the EU Commission. The group stressed the need for internal coherence and consistency in the use of EU legal terms, as well as external coherence - consistency in the transposition of legal concepts into national law. Legal ontologies could potentially help translators of EU legislation become aware of how their choice of terms will be interpreted in different jurisdictions, and help drafters of national legislation ensure greater consistency in their use of terminology when transposing European legislation. They could also serve as a useful tool in legal search, automatic translation, automated reasoning, and regulatory compliance verification. Finally, legal ontologies could help users find similar or related legislation, or compare the transposition of laws in national jurisdictions - useful for legal scholars in comparative law, practitioners who deal with cross-border issues, and international financial institutions.

In light of this, concepts in legal ontologies need to be connected with the textual spans referring to them, drawn from legislation, case law, or similar documents. The connection may be done manually, as in the DAPRECO knowledge base, illustrated below in section 2, or via NLP procedures for mining named entities and recurring linguistic patterns from legal documents, as in the Eunomos system, illustrated in section 3, which defines a suite of both statistical and rule-based NLP procedures to connect the legal texts in its knowledge base to the concepts of the European Legal Taxonomy Syllabus Ontology. Finally, since legal texts are full of recurring patterns, which usually correspond to relevant (legal) concepts, legal ontologies may also be created (and not only populated) via semi-automatic ontology learning techniques, i.e., by applying automated ontology inference and population techniques, and combining automated methods and human assessment with, e.g., curriculum learning techniques. The MIREL project allowed the University of Cordoba and INRIA to carry out joint research on this topic; the main results obtained so far are illustrated below in section 4.

2 The DAPRECO knowledge base and the Privacy Ontology (PrOnto)

The DAPRECO knowledge base is the main outcome of the interdisciplinary project bearing the same name, "DAPRECO: DAta Protection REgulation COmpliance"1, and one of the main outcomes of the collaboration between WP1 and WP2 of MIREL, especially because it has been developed through a close collaboration between the University of Luxembourg (coordinator of MIREL) and the University of Bologna (MIREL WP1 leader).

The DAPRECO knowledge base is a repository of rules written in LegalRuleML2, an XML formalism that aims at becoming a standard for representing the semantic/logical content of legal documents. The University of Bologna, the University of Luxembourg, Data61, and others are members of the LegalRuleML technical committee.

The rules represent the provisions of the General Data Protection Regulation (GDPR), the new regulation that is having a significant impact on the European digital market. The DAPRECO knowledge base builds upon the Privacy Ontology (PrOnto) [Palmirani et al., 2018a] [Palmirani et al., 2018b] [Palmirani et al., 2018c], designed to provide a legal knowledge model of the privacy agents, data types, types of processing operations, and rights and obligations involved in the GDPR.

The DAPRECO knowledge base adds further constraints to the ontology in the form of if-then rules, corresponding either to standard first-order logic implications or to deontic statements. If-then rules are formalized in reified Input/Output logic [Robaldo and Sun, 2017], and then codified in LegalRuleML. Reified Input/Output logic is an extension of standard Input/Output logic that integrates reification-based mechanisms coming from the Natural Language Semantics literature. [Robaldo and Sun, 2017] show how reification is a good candidate to effectively formalize, via uniform and simple (flat) representations, complex linguistic/deontic phenomena that may be found in legal texts. To date, the DAPRECO knowledge base is the biggest knowledge base in LegalRuleML and Input/Output logic freely available online3.

1 https://www.fnr.lu/projects/data-protection-regulation-compliance
2 https://www.oasis-open.org/committees/legalruleml
3 https://github.com/dapreco/daprecokb/blob/master/gdpr/rioKB_GDPR.xml


The transition of European data protection rules from the former Directive 95/46/EC (Data Protection Directive) to the recent General Data Protection Regulation (GDPR)4 caused a lot of distress among controllers and processors, due to the stricter requirements imposed on data processing activities and to the high fines, coupled with the inquisitorial powers of supervisory authorities, which they could incur in case of a violation.

Compliance with the GDPR is today a mandatory requirement of any processing of personal data. Moreover, the data protection by design principle5 requires data protection approaches and technologies to be included within processing activities and supporting software. Compliance and software development are only two of the activities that call for a means to efficiently represent the GDPR provisions in a machine-readable format, i.e., to create a suitable model of the GDPR, so that its rules can be read and understood by computer tools that support these activities.

A full model of the GDPR has been built and is currently being refined. Such an ambitious goal builds upon three foundational pillars:

a) an ontology of legal concepts that can be referred to whenever there is the need to reason about the meaning of a piece of law;

b) a formal representation of the semantics of the legal text, with its modalities of obligations, permissions, and entailments, and with the logical structure of its various parts clearly expressed;

c) a representation of legal texts, with their concepts and their semantics, in a machine-readable format.

These three elements are closely intertwined: the ontological concepts in the ontology are referred to in the semantic representation, and therein enriched with the deontic relations proper of the logic; both are translated into a machine-readable, XML-based representation and associated with an XML version of the legal text, of which they represent the interpretation.

2.1 The General Data Protection Regulation (GDPR)

The protection of personal data is a fundamental right in the European Union, enshrined in the Charter of Fundamental Rights of the European Union (Article 8), alongside the right to privacy (Article 7 of the Charter, but also Article 8 of the European Convention on Human Rights). Although often overlapping, the two rights differ significantly. Whereas the right to privacy entitles an individual to prevent the unbidden acquisition or use of his or her own personal data, the right to data protection requires that personal data are collected lawfully and processed in a fair manner, for specified purposes, and for no longer than the time required for those purposes.

4 Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
5 GDPR, Article 25.


A common and minimal framework for the protection of personal data in the European Union had already been laid out in 1995. However, the sociotechnological changes that ensued made those rules inadequate to face the context of the last decades, where data are considered an asset and their processing en masse thwarts the aforementioned fundamental right. For this reason, the European Union recently adopted the new GDPR, applicable since 25 May 2018, which mainly addresses two purposes: to provide a uniform legislation, without the need to implement the European provisions in Member State laws, and to strengthen the rights of the data subject, granting him or her a renewed dominion over his or her personal data.

The GDPR is bringing about a small revolution in the world of online services. The importance of the GDPR is attested by the recent commotion around the quasi-entirety of service providers that have changed their privacy policies, introducing more fine-grained privacy controls for data subjects [Satariano, 2018]. The interest stirred by the GDPR also affects several other stakeholders, including authorities, auditors, and academia, who are devising solutions to evaluate and improve the compliance of services with the GDPR provisions [IT Governance Privacy Team, 2017] [Hintze, 2017].

Under the GDPR, many of the data processing operations carried out by a controller (be it a private enterprise or a public body) or processor are subject to legal duties, duties that need to be integrated in the flow of the controller's activities. The software tools that process personal data are supposed to operate in a way that tries to avoid infringements of the GDPR. And yet, none of the available tools has originally been developed with data protection in mind. Much like security in the past decades, data protection must nowadays crawl its way, often guided by what consultant specialists say, into tools used wherever personal data are at stake - software development and testing, compliance checking, business process modelling, and so on.

Nonetheless, no tool can guarantee actual compliance with the GDPR. Primarily, data protection rules convey complex requirements that are expressed in natural (and, from a legal point of view, rather technical) language and are subject to legal interpretation, which makes them all the more complex to understand. Some standards dealing with domains that partly overlap with data protection, such as privacy and security, suggest best practices whose content has a more straightforward connection with their actual implementation, and they are openly scrutinized and widely adopted to boot. However, perusing standards and the GDPR's text to understand whether certain norms in the former do actually realize what provisions in the latter demand can be a complex and grueling task. Automation in parsing the normative texts of standards and the GDPR may ease the task, but it requires that the GDPR provisions and the text of the standards be formalized in a machine-readable form. Herein the most critical part is the formalization of the legal aspects of the GDPR, which are subject to interpretation.

2.2 The Privacy Ontology (PrOnto)

The first pillar of the GDPR model is the legal text in Akoma Ntoso format [Palmirani and Vitali, 2011], which makes it easy to navigate the document and to reference specific portions of text using ordinary XML parsers. Akoma Ntoso is an XML format used for storing legal documents, not only statute law but also decisions, opinions, doctrine, and the like. The XML structure makes it easy to integrate the document with other documents using normal XML parsers. This component contains the full legal text of the GDPR, meticulously tagged with its structure (that is, its partitioning into chapters, sections, articles, paragraphs, and so on). It is then possible to add inline annotations, such as pointers from linguistic descriptions to ontological concepts stored in an OWL ontology.

The current work in DAPRECO and MIREL is to enrich the Akoma Ntoso annotation with pointers to an evolving ontology for privacy, called the Privacy Ontology6 (PrOnto) [Palmirani et al., 2018a], [Palmirani et al., 2018b], [Palmirani et al., 2018c]. PrOnto was built from scratch, following a more structured approach based on a thorough ontology development methodology called MeLOn. The purpose of PrOnto is not only to model the data protection requirements of the GDPR. Rather, it aims at providing a comprehensive ontological framework to model data protection and privacy in general. The scope of PrOnto is wider than that of the GDPR, in a threefold perspective. First, it accommodates not only the requirements of the GDPR, but also those introduced in Member State legislation. Second, it is not limited to the data protection principles of the EU, but is fit to support legislation from other countries as well. Third, it does not only cover data protection, which concerns the rules on the fair processing of personal data, mainly developed in European legislation, but also privacy, which concerns the right to keep one's personal information private, and is the outcome of a long evolution in U.S. courts. However, it is indeed true that the GDPR represents the core of PrOnto.

To accommodate these requirements, PrOnto follows a modular structure. The ontology is designed in such a way that it models the essential European data protection rules contained in the GDPR, but these rules can be easily extended by attaching additional ontologies (for example, for domain-specific provisions, or for Member State legislation). By modelling the GDPR provisions within the core ontology, the ontological representation matches the ratio of the GDPR. A significant number of provisions allow Member State legislation to override the GDPR provisions, either to increase the protection granted to the data subject or to introduce derogations for specific domains. PrOnto is loyal to this view, as it forms the baseline upon which provisions overriding the GDPR can be plugged.

Figure 1 shows a high-level view of PrOnto, highlighting its partitioning into five main pillars. Data are processed by means of processing activities. These are performed through the interaction of the various agents, who operate under the deontic rules (rights and obligations) laid out by the legislation. Specific purposes constitute the legal basis of the processing. If the processing is carried out within the boundaries of the purpose (i.e., collecting only the data that is required for that purpose, and not maintaining personal data beyond the time that is reasonably necessary), then it is lawful.

6 Currently stored at https://github.com/guerret/lu.uni.dapreco.parser/blob/master/resources/pronto-v8.graphml


Figure 1: Pillars of the PrOnto Ontology
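To make the text-to-ontology coupling concrete, the following is a minimal, hedged sketch of the mechanism just described: a TLCConcept declared in the Akoma Ntoso metadata acts as a proxy for an ontology concept, and inline term elements point to it through refersTo attributes. The fragment and the PrOnto IRI below are illustrative; they are not copied from the actual DAPRECO files.

```python
# Minimal sketch: resolving inline Akoma Ntoso annotations to ontology
# concepts. The fragment and the PrOnto IRI are illustrative only.
import xml.etree.ElementTree as ET

AKN = "{http://docs.oasis-open.org/legaldocml/ns/akn/3.0}"

fragment = """
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act>
    <meta>
      <references source="#editor">
        <TLCConcept eId="personalDataProcessing"
                    href="http://example.org/pronto#PersonalDataProcessing"
                    showAs="processing of personal data"/>
      </references>
    </meta>
    <body>
      <article eId="art_5">
        <paragraph eId="art_5__para_1">
          <content><p>Personal data shall be processed lawfully: the
          <term refersTo="#personalDataProcessing">processing</term>
          must have a legal basis.</p></content>
        </paragraph>
      </article>
    </body>
  </act>
</akomaNtoso>
"""

root = ET.fromstring(fragment)
# Index the TLC references declared in the metadata block ...
concepts = {"#" + c.get("eId"): c.get("href")
            for c in root.iter(AKN + "TLCConcept")}
# ... and resolve every annotated term to its ontology IRI.
for term in root.iter(AKN + "term"):
    print(term.text, "->", concepts[term.get("refersTo")])
```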

The MeLOn methodology, on which the design of the PrOnto ontology is grounded, follows standard principles of minimization, which may be found in the main surveys on computational ontology design and evaluation [Brank, Grobelnik, and Mladenić, 2005] [Bandeira et al., 2016]. As a general rule in ontology engineering, design principles such as minimization and avoiding redundancy are needed to achieve computational efficiency (see [Bandeira et al., 2016], particularly Q9 and Q10 in Table 7). However, the principle of minimization prevents PrOnto from being a knowledge base truly fit for automatic compliance checking. Simply put, a computational ontology in OWL only defines and describes the main concepts involved, as well as the main semantic relations between them, that may be useful to index information in the data protection domain, thus facilitating its navigation and search.
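For illustration, a lightweight OWL module of this kind can be sketched in a few lines with rdflib. The vocabulary below merely echoes the PrOnto pillars of Figure 1 and is an assumption for the example, not the actual PrOnto IRIs.

```python
# A minimal sketch, with rdflib, of a lightweight OWL module: main
# concepts and main semantic relations only, as described above.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

PRONTO = Namespace("http://example.org/pronto#")  # illustrative namespace
g = Graph()
g.bind("pronto", PRONTO)

# Main concepts: one OWL class per pillar.
for cls in ("PersonalData", "PersonalDataProcessing", "Agent", "Purpose"):
    g.add((PRONTO[cls], RDF.type, OWL.Class))

# Main semantic relations between the pillars.
relations = (
    ("processes",   "PersonalDataProcessing", "PersonalData"),
    ("performedBy", "PersonalDataProcessing", "Agent"),
    ("hasPurpose",  "PersonalDataProcessing", "Purpose"),
)
for prop, dom, rng in relations:
    g.add((PRONTO[prop], RDF.type, OWL.ObjectProperty))
    g.add((PRONTO[prop], RDFS.domain, PRONTO[dom]))
    g.add((PRONTO[prop], RDFS.range, PRONTO[rng]))

print(g.serialize(format="turtle"))
```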

2.3 The DAPRECO knowledge base

The PrOnto ontology represents a first step towards IT solutions for indexing information in the EU data protection domain, thus facilitating its navigation and search.


On the other hand, in order to assess compliance we also need fine-grained rules representing the obligations and permissions from the GDPR, as well as further concepts involved in these obligations and permissions - concepts which have not been included in PrOnto in the first place, to comply with the principle of minimization. The DAPRECO project aims at building a knowledge base of such rules on top of the PrOnto ontology. The DAPRECO knowledge base may then be seen as an extension of PrOnto, and is of course compatible with it. The additional concepts, introduced to properly model GDPR obligations and permissions, have been consistently connected with the ones belonging to PrOnto.

An example can be used to clarify how the DAPRECO knowledge base extends PrOnto. The GDPR requires that the processing of personal data be lawful, according to the provision contained in Article 5. In the DAPRECO knowledge base, this obligation is represented as a deontic if-then rule that reads: “if there is a processing of personal data ep, then it is obligatory for ep to be lawful". Article 6 of the GDPR contains the rules on the lawfulness of processing. In short, the processing of personal data is lawful if it has a legal basis among the ones listed in Article 6. In other words, the list says when the above obligation is satisfied, so that we enrich the DAPRECO knowledge base with additional if-then rules, corresponding to standard non-deontic first-order logic entailments, of the form: “if these conditions hold for the processing of personal data ep, then ep is lawful".

The PrOnto ontology does not include any of the above if-then rules, either the deontic or the non-deontic ones. The PersonalDataProcessing class, whose individuals are activities involving the processing of personal data, only includes a boolean attribute lawfulness, codified as an OWL data property of the class. When lawfulness is true for a certain individual ep of the class, then ep is taken to be lawful, and unlawful otherwise. But the ontology does not include further knowledge specifying when the attribute lawfulness is true or false, i.e., the conditions conveyed by the provisions in Article 6. While in theory these provisions, conveying knowledge necessary to assess compliance with the GDPR, could be represented in PrOnto as well, doing so would overly complicate the structure of the ontology itself. The provisions, represented in terms of if-then rules, are therefore included in the DAPRECO knowledge base only. The latter is codified in the LegalRuleML format, which makes it possible to connect the logical representations of the provisions with the ontological concepts in PrOnto as well as with the (Akoma Ntoso representation of the) GDPR articles.

On the other hand, as said above, in the legal domain a further dimension needs to be taken into account while assessing compliance with legislation: norms are likely subject to different (incompatible) legal interpretations. The GDPR became applicable on 25 May 2018. Apart from some doctrinal debates, and a few far-reaching decisions that somehow anticipated the impact of the GDPR, disputes concerning the legal interpretation of its provisions did not take place before that date. Therefore, concrete examples of such legal interpretations are currently very scarce. To account for different legal interpretations, the formulas stored in the DAPRECO knowledge base have been made defeasible, by means of constructs drawn from Circumscriptive Logic [McCarthy, 1980].
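The two rules just described could be sketched in reified Input/Output logic roughly as follows. This is a hedged reconstruction in the spirit of [Robaldo and Sun, 2017]; the predicate names are illustrative and simplified with respect to those actually used in the knowledge base:

\[
\Big(\ \exists e_p\, \big[\textit{PersonalDataProcessing}'(e_p)\big]\ ,\ \ \exists e_l\, \big[\textit{Lawful}'(e_l,\, e_p)\big]\ \Big)\ \in\ O
\]

\[
\exists e_p\, \exists e_b\, \big[\textit{PersonalDataProcessing}'(e_p)\ \wedge\ \textit{LegalBasis}'(e_b,\, e_p)\big]\ \;\rightarrow\;\ \exists e_l\, \big[\textit{Lawful}'(e_l,\, e_p)\big]
\]

Here O is the set of obligation pairs of the Input/Output system (input condition, output obligation), while the second formula is an ordinary first-order entailment. The primed predicates are reified: their first argument denotes an eventuality, so that the state of being lawful can itself be referred to, qualified, or defeated by other rules.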


3 The Eunomos/MenslegiS system and the European Legal Taxonomy Syllabus

The Eunomos system [Boella et al., 2016] is an advanced legal document and knowledge management system, based on legislative XML and ontologies and incorporating state-of-the-art research from legal informatics, that aims at helping law researchers and practitioners manage complex information. Eunomos is being developed as commercial software, part of a wider suite distributed by Nomotika S.R.L.; the commercial version of Eunomos is called MenslegiS7.

The ontology used within the Eunomos system is the European Legal Taxonomy Syllabus [Ajani et al., 2017], a multi-lingual, multi-level legal ontology to help legal professionals involved in transnational activities, or multinational organisations, manage the deep, complex and interconnected knowledge required for understanding laws in different jurisdictions.

This section presents the Eunomos system (subsection 3.1) and the European Legal Taxonomy Syllabus (subsection 3.2), with a particular focus on the IT procedures employed in the former to link legal documents to items (concepts, individuals, and relations) belonging to the latter.

3.1 The Eunomos System

The Eunomos system has been developed with clearly-defined aims and objectives: to support the work of law firms, law scholars, and in-house legal offices in financial institutions and public sector organisations. In other words, it was created to help legal researchers and practitioners manage and monitor legislative information. The system is based on mature technologies in legal informatics - legislative XML and ontologies [Corcho et al., 2005], [Palmirani and Vitali, 2011], [Palmirani, 2011] - combined in an intuitive way that addresses requirements from the commercial sector to access and monitor legal information. Less developed technologies, such as the logical representation of norms and information extraction from legislative text, are not used now but may be in the future.

Eunomos can be employed as in-house software that enables expert users to search, classify, annotate and build legal knowledge and keep up to date with legislative changes. Alternatively, Eunomos can be offered as an online service, so that legislation monitoring is effectively outsourced. The software and related services can be provided to several clients, which means that information and costs are shared. The system, being based on the Legal Taxonomy Syllabus ontology described in the next section, is inherently multilingual and multilevel, so it can be used for different legal systems, using similar legislative XML standards, and even at the EU level, keeping separate ontologies for each system.

The basic idea of Eunomos is to create a stricter coupling between legal knowledge and its legislative sources, associating the concepts of its legal ontology with regulations structured using legislative XML.

7 https://www.augeos.it/EN/DynamicContents/Details?ContentId=MENSLEGIS


The legal document management part of the system is composed of a legal inventory database of norms (about 70,000 Italian national laws in the current version) converted into legislative XML format, with links between related legislation created by automated analysis of in-text references, and with each article semi-automatically classified into legal domains. Most laws are collected from portals by means of Web spiders on a daily basis, but they can also be inserted into the database via a Web interface. Currently the system harvests the Normattiva8 national portal, the portal Arianna9 of Regione Piemonte, and a portal of regulations from the Ministry of Economy. For each piece of legislation, Eunomos stores and time-stamps the original and most up-to-date versions, but nothing prevents including multiple versions of the coordinated text for users, like lawyers or judges, whose primary concern is not only to have up-to-date information on the law. After the texts are converted into legislative XML, cross-references are extracted to build a network of links between norms citing one another. The semi-automated classification of norms is supported by the classification and similarity tools described below in subsection 3.1.1.

8 http://www.normattiva.it
9 http://arianna.consiglioregionale.piemonte.it

Legal concepts can be extracted and modelled using the Legal Taxonomy Syllabus; the ontology is part of the database and is saved as a table that acts as a repository of concepts, which are connected to, but independent from, terms in a many-to-many relationship. The classical RDF subject-predicate-object triples that define the relationships between the concepts are stored in a separate table. Reconstructing transitive relations can be expensive in a relational database, so there is another cache table that stores the complete transitive closure of the ontology (a minimal sketch of such a closure computation is given after the list below). The ontology is well-integrated within the document management system, so that links can be made between concepts, the terms used to express the concepts, and the items of the laws that feature the terms. Vice versa, terms in the text of legislation are annotated with references to the concepts.

Figure 1 shows the components of the system and the flow of documents into the system. The system addresses retrieval and interpretation problems with the following functionalities:

• The problem of the increase in scope, volume and complexity of the law is addressed by creating a large database of laws converted into legislative XML and downloaded automatically from legislative portals, which are annotated and updated regularly;

• The problem of specialisation is addressed by the semi-automated classification of articles, enabling users to view only those sections of legislation that are relevant to their domain of interest;

• The problem of fragmentation of laws is handled by enabling users to view legislation at European, national and regional level from the same Web interface;

• The problem of keeping up with changes in the law is addressed by alert messages sent to users, notifying them that a newly downloaded piece of legislation is relevant to their domain of interest. Where legislation is updated, users can view the consolidated text where available from state portals, as well as the original version. Where previous laws are modified or abrogated implicitly, Eunomos provides a mechanism to annotate the legislation with implicit cross-references (and hyperlinks) to the amending piece of legislation;

• The issue of legal “terms of art” that can vary in meaning in different contexts and over time is addressed with multi-level, updatable, domain-specific ontologies where terms can be aligned with various concepts and definitions; concepts are associated with the specific textual sources by links;

• The issue of vague and imprecise language is addressed with additional information, clarifications and interpretation supplied by knowledge engineers based on thorough legal research;

• The issue of cross-references is addressed by a facility whereby the user can either hover over a cross-reference, so that the referenced article appears in a pop-up text box, or click on a hyperlink to the referenced article to see the text in context.
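As announced above, here is a minimal sketch of the transitive-closure cache: given the stored links of a transitive relation (for instance, concept subsumption), all implied pairs are precomputed once, so that queries need no recursive joins. The relation instances below are invented for the example; the real system materializes the result into a PostgreSQL table rather than computing it in memory.

```python
# Minimal sketch of precomputing the transitive closure of a relation.
def transitive_closure(pairs):
    """pairs: iterable of (child, parent) links; returns all implied links."""
    direct = {}
    for child, parent in pairs:
        direct.setdefault(child, set()).add(parent)
    closure = set()
    for start in direct:
        stack = list(direct[start])
        while stack:                      # depth-first walk up the hierarchy
            anc = stack.pop()
            if (start, anc) not in closure:
                closure.add((start, anc))
                stack.extend(direct.get(anc, ()))
    return closure

links = [("vendor", "party"), ("party", "legal-subject")]
print(sorted(transitive_closure(links)))
# [('party', 'legal-subject'), ('vendor', 'legal-subject'), ('vendor', 'party')]
```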

The Eunomos system resolves the resource bottleneck by decoupling the competences needed to build a large, reliable legal knowledge base for regulatory compliance. We need to overcome, on the one hand, the limitation of manually updating the knowledge bases - this would be highly time-consuming and error-prone - and, on the other hand, the limits of current NLP technologies, which even at their best are unable to carry out the work fully automatically.

Eunomos therefore employs a semi-automatic approach to build and update the knowledge bases, where user-friendly interfaces allow knowledge engineers to enter a massive amount of data without the intervention of experts in the underlying technologies, who are required to modify them only on rare and exceptional occasions. Eunomos employs both a statistical classifier, to classify the documents according to their topics, and rule-based procedures, for entity linking and named entity recognition. Knowledge engineers do not need to have any competence in machine-readable formalisms, NLP, or the other technologies used in the system. During their daily work, knowledge engineers enter new data into the database by correcting the reference links and the document domains automatically suggested to them by the system, and possibly adding further explanations in plain text. The domain classifier is periodically re-trained on the new (enlarged) training sets. Concerning the rule-based procedures, an expert in NLP, by periodically looking at the missing or incorrect linguistic patterns found by these procedures, decides if and how to modify the rules. Nevertheless, a revision of the rules is rarely required, since legal texts are usually full of recurring linguistic patterns and have a limited lexicon, so the current set of rules is already able to find the correct links in the majority of cases.

3.1.1 Statistical NLP procedures used in Eunomos

Statistical NLP in Eunomos is used for text classification and text similarity.

Text classification. Even if the technicalities of the classification process we use are outside the scope of this deliverable, we summarize here our methodology, described in detail in [Boella et al., 2011] and [Boella et al., 2012]. For each new piece of legislation, the classification task is: (1) to find which domains are relevant to the legislation, and (2) to identify which domain each article belongs to. The first task enables targeted email notification messages to be sent to all users interested in the particular domains covered by new legislation. The second task enables users to view, in each piece of new legislation, only the articles relevant to a particular domain.

We use the TULE parser [Lesmo, 2009], which performs a deep analysis of the syntactic structure of the sentences and allows a direct selection of the informative units, i.e., lemmatized nouns. This is a better solution than the more common practice of using WordNet [Fellbaum, 1998] or other top-domain ontologies to eliminate stopwords and lemmatize informative units, as these resources are unable to recognise and lemmatize many legal domain-specific terms.

In the proposed system, we used a set of documents that have been manually annotated with categories, allowing the training of a supervised classification module. The training set is composed of 156 legal texts and 15 categories (or classes). Since statistical methods require sufficiently large textual information for each category, we filtered out the categories with few associated documents, building three datasets with different degrees of filtering (namely S1, S2, and S3); see Table 1. Note that dataset S3 preserves more than 70% of the original data (i.e., 110 documents out of 156), although it contains only 5 out of the 15 total categories. One category (C15 in Table 1) has not been used, since the associated texts do not contain any specific topic.

Although there are plenty of algorithms for text classification, we used the well-known Support Vector Machines (SVM) for this task, since it frequently achieves state-of-the-art accuracy levels [Joachims, 1998] [Cortes and Vapnik, 1995]. SVM makes use of vectorial representations of text documents and works by calculating the hyperplane having the maximum distance with respect to the nearest data examples. More in detail, we used the Sequential Minimal Optimization algorithm (SMO) [Platt, 1999] with a polynomial kernel. The association between texts and category labels has been fed into an external application based on the WEKA toolkit [Hall et al., 2009] and incorporated into Eunomos, creating a model that can be used to classify the new laws inserted on a daily basis into the database by web spiders or users. The WEKA toolkit was used as a framework for the experiments because it supports several algorithms and validation schemes, allowing an efficient and centralized way to evaluate the results of the system.

In addition to the standard bag-of-words approach (where each text is represented as an unordered collection of words), we also wanted to test whether the TULE parser and its additional features increased the accuracy of the classification module with respect to the standard use of WordNet lemmas. This is marked with the label “+TULE” in the results of Table 2. As can be noticed, the use of the additional features improved the accuracy of the classifier.

The evaluation of a classification task can range from very poor to excellent depending on the data. A simple way to estimate the complexity of the input is to compute the separation and compactness of the classes. The Separability Index (SI) [Greene, 2001] measures the average number of documents in a dataset that have a nearest neighbour within the same class, where the nearest neighbour is calculated using Cosine Similarity. Tests on the whole dataset revealed an SI of 66.66%, which indicates a high overlap among the labelled classes. Table 2 shows the SI values for all three datasets.

The SVM classifier achieves an accuracy of 92.72% when trained with the n-fold cross-validation scheme [Greene, 2001] on dataset S3 + TULE (using n = 10, which is common practice in the literature). As shown in Table 2, the classifier achieves lower accuracy levels with datasets S1 and S2, though this was expected given their low SI values. Nevertheless, it is interesting to see that classification on dataset S1 is still acceptable in terms of accuracy despite its very low SI. This is due to the fact that, although there is a large overlap between the dictionaries used in different classes, there are some terms that characterize them properly.
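This pipeline can be approximated as follows with scikit-learn. This is an analogous sketch only: the actual system uses TULE lemmas as features and WEKA's SMO implementation, and the toy documents, labels, and fold count below are placeholders (with a real-sized dataset one would set cv=10 as in the experiments).

```python
# Illustrative analogue of the classification pipeline: TF-IDF features,
# a polynomial-kernel SVM (SMO in WEKA optimises the same model), and a
# Separability Index computed from cosine nearest neighbours.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

docs = ["testo di legge sul credito al consumo", "norme sulla privacy",
        "disposizioni sul credito bancario", "tutela dei dati personali"]
labels = np.array(["credito", "privacy", "credito", "privacy"])

X = TfidfVectorizer().fit_transform(docs)        # bag-of-words + TF-IDF

# Separability Index: share of documents whose nearest neighbour
# (by cosine similarity) carries the same label.
sim = cosine_similarity(X)
np.fill_diagonal(sim, -1.0)                      # ignore self-similarity
si = np.mean(labels[sim.argmax(axis=1)] == labels)
print(f"SI = {si:.2%}")

# Polynomial-kernel SVM evaluated with n-fold cross-validation.
clf = SVC(kernel="poly", degree=1)
scores = cross_val_score(clf, X, labels, cv=2)   # cv=10 with enough data
print(f"accuracy = {scores.mean():.2%}")
```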


Text similarity. Eunomos uses a text similarity algorithm, Cosine Similarity, to find the most similar pieces of legislation in the whole database. Since each piece of legislation contains a large amount of text, documents are indexed with the PostgreSQL internal inverted-index facility in order to enable fast full-text searches and ranking for document similarity. The Cosine Similarity metric uses the Term Frequency-Inverse Document Frequency (TF-IDF) measure to gauge the relative weight to be apportioned to the various keywords in the respective documents. The Cosine Similarity metric is particularly useful for finding similar single-domain legislation.

However, legislation that contains norms on different topics can introduce noise into the comparative process. Therefore, the software has been adapted to include similarity searches at the article level. For each piece of text, Eunomos may generate a list of the most similar texts in the whole database using Cosine Similarity. Where labelled data is not available, Cosine Similarity can also be used to build a training set for a supervised classification module.

Applying Cosine Similarity to search for relevant text is a common practice in general-purpose Information Retrieval tasks [Salton and Buckley, 1988]. In these cases, the only issue is to determine how many texts to select and return. This means choosing an appropriate threshold (or cutoff) to apply to the ordered list of relevant articles created with the Cosine Similarity measure. A naive solution for truncating the list of texts, ordered by their similarity with the input one, is to use a fixed cutoff k. This way, only the first k articles are considered relevant. However, this approach does not take into account the distribution of the ordered similarity values. An alternative approach is to find where the similarity values suffer a significant fall.


This separates the actual similar texts from the rest. A practical way to implement this idea is to analyze the distribution of the ordered values looking at the highest difference (or highest "jump") between adjacent values in the list [Boella et al., 2011].
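A minimal sketch of the "highest jump" heuristic follows; the similarity values are placeholders.

```python
# Truncate the ranked similarity list at the largest drop between
# adjacent values, instead of at a fixed k.
import numpy as np

def jump_cutoff(similarities):
    """similarities: cosine scores sorted in decreasing order."""
    drops = np.diff(similarities)          # negative or zero everywhere
    return int(np.argmin(drops)) + 1       # cut just before the biggest drop

ranked = np.array([0.91, 0.88, 0.85, 0.42, 0.40, 0.11])
k = jump_cutoff(ranked)
print(ranked[:k])                          # [0.91 0.88 0.85] - the biggest
                                           # jump is between 0.85 and 0.42
```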

In our experiments, we made use of the categories associated with already-labelled documents as part of the similarity evaluation process. More in detail, given one document d and a set of similar ones Sd, the evaluation task looks at whether the documents contained in Sd have the same categories as the input document d. Figure 9 shows the accuracy obtained when fixing the cutoff k and when using our document-level automatic estimation of k. This shows the benefit of using a variable and data-dependent approach for estimating the cutoff: the accuracy level reached by this technique is noticeably higher than with the use of fixed cutoffs.

3.1.2 Rule-based NLP procedures used in Eunomos

Eunomos uses a rule-based pattern-matching module both to automatically determine whether a reference is a simple reference or whether it modifies or overrides other legislation, and for entity linking with respect to the European Legal Taxonomy Syllabus. The present subsection illustrates the rule-based technology on the former task; it is easy to see how the same technology can be applied to the latter (simpler) task. The interface of Eunomos enables knowledge engineers to manually override the results of the pattern-matching procedure. The rules are ordered and executed according to a certain priority: as is standard in defeasible logic, priorities are used for overriding general behaviours with more specific ones.

Contrary to the identification of references and ontological concepts, the classification of modificatory provisions features a higher linguistic variation, and rules must deal with ambiguities. For instance, the verb "sopprimere" (to suppress) may be used in legislation to specify either an "abrogazione" (abrogation) or a "sostituzione" (substitution). When the verb is followed by the preposition "da" (by), it usually specifies a substitution, e.g. "Articolo X soppresso da Articolo Y" ("Article X is suppressed by Article Y"). Otherwise, it usually specifies an abrogation. To deal with this ambiguity, the rule-based module includes two rules: a default rule that classifies the modificatory provision as an abrogation, and a higher-priority rule that checks whether the verb is used in a linguistic pattern that denotes a substitution. For ease of understanding, we provide only conceptual representations of the rules in the figures below. Figure 7 shows the conceptual representation of the default rule that classifies the modificatory provision as an abrogation. The rule is triggered when the system finds in the input text a verb with the lemma "sopprimere".


Then, it checks whether there is a verb with lemma "essere" (to be) within the two preceding words, and whether there is a normative reference within the five words preceding the verb with lemma "essere". The normative references, found by the automatic reference detection tool, are substituted with the strings rif1, rif2, etc., and treated as proper nouns by the TULE parser. When the rule in Figure 7 is satisfied, the provision is annotated as "abrogazione", with the normative reference occurring therein identified as "norma".

On the other hand, we add to the system the rule in Figure 8 and assign it a higher priority than the rule in Figure 7, so that it is executed before the latter.

In Figure 8, the checks carried out on the words preceding the keyword "sopprimere" are the same as those in Figure 7. Furthermore, the rule in Figure 8 requires the occurrence of the preposition "da" immediately after the keyword, and a normative reference (which will be annotated as "novella") among the five words following the preposition.
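The two rules can be approximated with plain regular expressions as follows. This is a simplification: the actual module matches over TULE parse trees rather than surface strings, and the patterns below are illustrative.

```python
# Two prioritised rules: the substitution rule ("soppresso da" followed
# by a reference) is tried before the default abrogation rule.
import re

RULES = [  # priority order: first match wins
    ("sostituzione",
     re.compile(r"(?P<norma>rif\d+)\s+(?:è\s+)?soppress\w+\s+da\s+(?P<novella>rif\d+)")),
    ("abrogazione",
     re.compile(r"(?P<norma>rif\d+)\s+(?:è\s+)?soppress\w+")),
]

def classify(sentence):
    for label, pattern in RULES:
        m = pattern.search(sentence)
        if m:
            return label, m.groupdict()
    return None

print(classify("rif1 è soppresso da rif2"))
# ('sostituzione', {'norma': 'rif1', 'novella': 'rif2'})
print(classify("rif1 è soppresso"))
# ('abrogazione', {'norma': 'rif1'})
```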


To evaluate the Eunomos module for extracting legal modifications, we used a dataset composed of 180 files, containing 2,306 modificatory provisions manually annotated by the legal experts of the CIRSFID research center10 of the University of Bologna.

Our system obtains 86.60% recall and 98.56% precision. The match between a provision automatically calculated by the module and the corresponding one stored in the corpus is considered valid only if it matches both the type of the provision (abrogation, substitution, insertion, etc.) and all its arguments, such as "norma" and "novella" in Figure 8. A similar system has been proposed in [Lesmo et al., 2009].

That system also uses the TULE parser, and it has been evaluated on the same corpus of 2,306 modificatory provisions from CIRSFID. [Lesmo et al., 2009] report 71.7% recall and 83.0% precision. It is worth noticing that the system presented here achieves a very high level of precision, close to 100%, because the rules behave as a kind of "filter". In other words, the system uses ad-hoc rules, each of which describes a specific valid pattern. As a consequence, (almost) any provision matching a pattern is precisely classified by the pattern itself. Recall is lower because rules are added one by one, which turns out to be a highly time-consuming task.
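The strict matching criterion used in this evaluation can be sketched as follows; the field names are invented for the example.

```python
# A predicted modification counts as correct only when both its type and
# all of its arguments coincide with the gold annotation.
def matches(predicted, gold):
    return (predicted["type"] == gold["type"]
            and predicted["args"] == gold["args"])

pred = {"type": "sostituzione", "args": {"norma": "rif1", "novella": "rif2"}}
gold = {"type": "sostituzione", "args": {"norma": "rif1", "novella": "rif2"}}
assert matches(pred, gold)
# precision = valid matches / predicted; recall = valid matches / gold
```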

10 http://www.cirsfid.unibo.it/

3.2 The European Legal Taxonomy Syllabus

The European Legal Taxonomy Syllabus is a lightweight ontology grounded on a different vision of legal ontologies, which aims at overcoming the inherent difficulty of defining concepts from the legal domain in an ontology. To understand this difficulty, it is useful to look at the legal process and how legal concepts interact with the social and political environment. Our perspective on laws and legal concepts is largely informed by insights from comparative law. Comparative law has emerged as a thorough approach to investigating and describing laws and legal systems, and the analysis that emanates from the investigation can help explain the meaning of legal terms as used by practitioners.

One of the most influential comparative lawyers is Rodolfo Sacco. Starting from the premise that norms are not "legal flowers without stem or root" [Sacco, 1991, p.27], he identified factors that influence how laws are interpreted in different jurisdictions. First, all legal systems have several legal formants, otherwise known as sources of law - codes and statutes, judicial decisions, legal scholarship and political ideologies: “the civil lawyer may say that this rule comes, in principle, from the code; the common lawyer may say it comes from a particular statute or from judicial decisions; and yet they both will learn their law initially from the books of legal scholars” [Sacco, 1991, p.22]. The importance of these legal formants varies considerably in different jurisdictions and different areas of law - case law is more important in France than in Italy, some areas of English law are subject to more statutes than others - although all these legal formants have some influence, whatever the official model of the law might say. The existence of multiple legal formants creates uncertainty, since they are rarely in complete harmony on a point of law. And yet, this does not usually stop the law from functioning.

Legal formants are not the only factors that influence legal interpretation. There are also invisible factors such as the beliefs or mentality of the interpreters and their social and cultural background. Sacco calls these factors cryptotypes. Such factors rarely need to be articulated in a mono-culture. Comparative law helps reveal hidden cryptotypes when a seemingly equivalent rule is interpreted in different ways in different legal jurisdictions, or when an implicit rule is made explicit in another legal system. Sacco cites as an example the issue of whether an heir can transfer property before possessing it. Belgian interpreters of the Civil Code have deemed such transfers invalid, but the French upheld them. The discrepancy is explained by the fact that while the Code itself does not support such transfers, the old Roman law did, and the custom carried over into French law.

Sacco noted that it is quite common to find that not all legal rules are fully articulated. A synecdoche occurs when only part of a phenomenon is indicated when referring to the whole. He gives the example that the legal definition of contract in French law refers to the will of the parties without mentioning the need for the will to be declared, or that there needs to be a good reason for the parties to declare their will and for the law to respect it. Filling in the gaps requires knowledge of the legal culture and custom. Unwritten rules are passed on from one generation of jurists to another.

Sacco claims that the identification of legal formants, cryptotypes, and synecdoche was found "almost as a by-product" [Sacco, 1991, p.388] of comparing different legal systems, but has led to a more profound understanding of how the law functions than the legal philosophies of Kelsen, Hohfeld and others. The reason is that limiting the study of law to a single legal system leads one to ignore features that appear to be too "obvious" to mention. Such features are not necessarily "obvious" or common to all legal systems, and their discovery reveals the unwritten rules and values that underpin the law in different legal systems. The comparative approach can go beyond the letter of the law to find its true meaning [Sacco, 1991, p.16-17]: “An abstract idea finds concrete expression in a given legal language much as, in biology, a genotype or distinctive set of genes is expressed in the phenotype or outward form of a plant or animal. The jurist of an individual country studies the phenotype. The comparativist must study the genotype of which it is the expression.”

Another important contribution of comparative law is the exploration of the interplay between legal formants, and the awareness that while the law constantly evolves, legal formants rarely move together in sync, so that conflicting valid interpretations are inevitable [Sacco, 1991, p.394]. [Sacco, 1991, p.24] states: “Comparison recognizes that the “legal formants” within a system are not always uniform and therefore contradiction is possible. The principle of non-contradiction, the fetish of municipal lawyers, loses all value in an historical perspective, and the comparative perspective is historical par excellence.”
This is in contrast with Kelsen’s conclusion that contradictory norms within the same legal system cannot exist, in line with a universalist view of law [Kelsen, 1992, p.112]: “If legal cognition encounters legal norms that contradict one another in content, it seeks, by interpreting their meaning, to resolve the contradiction as a mere pseudo-contradiction. If this effort fails, legal cognition disposes of the material to be interpreted, disposes of it as lacking in meaning altogether and therefore as non-existent in the legal sphere qua realm of meaning.”

Sacco has had an immense influence on legal scholarship, most especially through his bottom-up approach to the analysis of legal concepts. For example, [Graziadei, 2004] analyses the gradual acceptance of interest as a return on capital in European countries, achieved through different paths. There is no synonymy relation between terms-concepts such as "frutto civile" (legal fruit) and "income", from civil law and common law respectively, but these systems can achieve functionally similar operational rules thanks to the functioning of the entire taxonomy of national legal concepts. The study shows the influence of changes in religion, economy and philosophy on the evolution of these concepts in different civil and common law jurisdictions, as well as the practical implications of conceptual differences.

In summary, comparative law suggests that there is more to norms than what is in legislation, that analysing different national systems gives a deeper understanding of law, and that differences between national systems can be revealed with a bottom-up approach. This view is an alternative to conventional top-down approaches to legal ontologies, which bear a close resemblance to Kelsen’s doctrine of the unity of law: "As it is the task of natural science to describe its object - reality - in one system of natural laws, so it is the task of jurisprudence to comprehend all human law in one system of norms" [Kelsen, 1941].

The European Legal Taxonomy Syllabus is composed of a legal ontology schema, a web-based legal ontology tool conforming to the ontology schema, and a multi-lingual legal ontology on European consumer law constructed with the tool. All three components were of course built with the vision described above in mind. The next subsection presents the legal ontology schema, describing the motivations behind it that result from the legal analysis. The web-based tool is simply an IT tool used to populate the ontology; for the objectives of this deliverable it is not particularly interesting to illustrate it. The multi-lingual legal ontology on European consumer law constructed with the tool will be presented below in section 3.2.2.

3.2.1 The European Legal Taxonomy Syllabus Schema

The European Legal Taxonomy Syllabus is an ontology framework designed to address the issues raised above. The most important insight from lawyers, which informed our design, was that the meaning of a legal term depends on its context (jurisdiction, domain, legislation, timeframe, interpretation). We designed an ontology schema that aims to make these considerations explicit. Our system attempts to model interpretations beyond the letter of the law, as well as the temporal evolution of concepts, in an intuitive way, allowing users to traverse different definitions and determine which definitions are most relevant to their query.

From the pragmatic need to model European law and national transpositions, the ontology framework must be multi-lingual, multi-jurisdictional and multi-level. This allows links to be made between different national ontologies, so that users can find similar terms in other languages and other jurisdictions, and compare their meaning. The schema has been designed to support the definition of concepts on the basis of a comparative law methodology. Following [Sacco, 1991], we chose to adopt a bottom-up approach to ontology creation, i.e., to compare low-level concepts among different legal systems. In our view, a comparative law methodology ensures a non-superficial understanding of legal terms. Starting from low-level elements rather than abstract or composite concepts fosters evidence-based conceptualisations and generally gives rise to less disagreement among ontology contributors. We also adopt the view of comparative law that legal concepts are influenced by formants other than legislation, and ensure that the ontology provides space for annotations and citations of case law and doctrine.

The main purpose of the European Legal Taxonomy Syllabus tool is to support the work of legal practitioners, scholars and translators in multi-lingual and multi-jurisdictional contexts such as the European Union, helping them share technical knowledge and analyse the law in all its complexity. As a secondary aim, the system can be used to build automated tools, e.g., for information retrieval and translation. Since the ontology framework is primarily designed for human reference, it supports lightweight rather than axiomatic ontologies. In the classification of [Giunchiglia and Zaihrayeu, 2009] we use "(informal) lightweight ontologies": “Note that lightweight ontologies are much easier to be understood and constructed by ordinary users. ... designing a full-fledged ontology (expressed, for example, in OWL-DL) is a difficult and error-prone task even for experienced users ...”. The choice of building a lightweight ontology was motivated by the need to develop a more user-friendly system, thereby enlarging the possible audience of contributors and users and, at the same time, reducing the costs of building an ontology. It was also driven by the consideration that many peculiarities of law, such as interpretation, penumbra, interaction with social values, metaphors and dynamics, are far from having commonly accepted solutions in logic.

Our analysis of the legal domain led us to identify the following features of the ontology schema, which allow the representation of the relevant legal information. Such features are not always straightforward to represent using standard approaches to ontology design.

Terms and concepts: the varying and highly contextualised meaning of legal terms means that there needs to be a structured way to allow more than one meaning per term in a legal ontology. ELTS separates terms and concepts, allowing terms to be mapped to different concepts and concepts to be mapped to more than one term (in the same language, or in different languages in the case of multi-lingual nations and of EU law). Terms can be either single words or multi-word expressions (cf. examples below). We therefore have many-to-many relations between terms and concepts, allowing both synonymy and polysemy. Since we are in a multi-lingual context, a term in our system is structured as the term itself together with the jurisdiction identifier and the relevant language, in order to account for multiple languages in the same jurisdiction. The idea of neatly separating the lexicon from the conceptual level is of course not new (cf. [Buitelaar, 2010]) and is the foundation of several models, including the well-known Lemon lexicon model [McCrae et al., 2011]. Lemon was designed to model lexicons and machine-readable dictionaries relative to ontologies in the OWL standard. It has been successfully used to publish many dictionaries/ontologies as Linked Data, e.g., [Villegas and Bel, 2015], [Ehrmann et al., 2014].

Sources: each concept is linked to its legal source(s), possibly more than one, since a concept can arise from multiple pieces of legislation and also from the interaction of legislation, case law and doctrine. However, listing the sources is not enough: for the sake of clarity, concepts are also associated with a description in natural language. Nevertheless, it is important to identify the legal sources, since they contain important information about scope and purpose.

Domains: in addition to the contextual information of legal sources, concepts are classed in domains, transversal with respect to jurisdictions and levels, in order to organize knowledge and improve search and browsing. Each concept can be associated with more than one domain.

Multi-lingual and multi-jurisdictional: the multi-jurisdictional nature of the EU requires not only a multi-lingual ontological framework associating concepts with terms in different languages, but also a multi-jurisdictional one. The ELTS schema involves separate ontologies for each jurisdiction whose concepts are in turn mapped to terms in relevant languages. Specific relations connect concepts from the ontologies of different jurisdictions, which are separate from the relations within the same ontologies. In particular, the relation “implement”, described in more detail below, connects concepts in the EU ontology with concepts in the national ontologies.

Multi-level: besides being multi-lingual and multi-jurisdictional, the ELTS schema distinguishes between the EU level and the national levels: these constitute separate ontologies. Note that the EU level contains a single ontology, whose concepts can be associated with terms in all the languages of the Member States considered. This, however, is a simplification, since there may be unwanted divergences among different languages even at the EU level. Concepts in the different ontologies at the national level can be associated only with terms in the languages of the nation they belong to.

Ontological relations: due to the holistic nature of the law, legal concepts are better understood in relation to others. Therefore, within each ontology, concepts are linked via general ontological relations such as “is-a” and “part-of”, and by more specific legal relations such as legal “purpose”, which expresses the legal goal (e.g., “consumer protection”) that the legal system aims to achieve with that concept. Relations among concepts may change over time (as discussed below).

Implementation relation: concepts at the EU level can be connected to national-level concepts by an implementation relation, representing how the concept has been transposed into a national legal system. Given the separation of terms and concepts, the term associated with an EU-level concept is not necessarily the same term used to express the same concept in the implementing legislation at the national level. The relation is many-to-many, since a national-level concept may be the fusion of more than one EU-level concept and/or a Member State may express an EU-level concept in multiple ways.

Dynamic nature of meaning: the ontology schema must account for the fact that almost every new piece of legislation introduces definitions of terms that effectively replace prior conceptualisations. The ontology must therefore represent the current legal situation, yet researchers may still need to refer to deprecated conceptualisations for historical purposes or to trace the evolution of terms. This also raises the problem of what happens to the ontological relations associated with the replaced concept, a sort of frame problem. Since we are dealing with a semi-automated context, the proposed solution is that the new concept inherits all the relations of the replaced one, and it is the responsibility of the knowledge engineer to remove the outdated ones and possibly introduce new relations in accordance with authoritative interpretations.

Interpretation: there is a tension in the law between the highly contextual character of meaning, which leads to multiple meanings for one term, each associated with specific sources, and the need to systematize legal knowledge. The legislator can introduce, in new legislation, a new meaning for a term whose utility goes beyond the context of that legislation. Through interpretation, the meaning of the term can then be extended to other concepts denoted by the term in other contexts. The frequent merging of meanings assigned to legal terms that takes place in legal reasoning, or the simultaneous transposition of multiple directives, means that more complex concepts can emerge which do not necessarily replace contextual meanings in all situations. This situation cannot simply be modelled by “is-a” relations, since the concepts resulting from interpretation are neither necessarily more general in meaning, nor necessarily the simple intersection of the more contextualized meanings; rather, what is generalized is the context of use of the concept. Moreover, the original contextual meaning of a term must always be available to the user, and not only the merged one deriving from interpretation.

Conceptual drafts: the ontology must also be able to accommodate the conceptualisation of draft legislation, so that the resulting “draft” ontology can be compared with the current one. Glossaries created to achieve consistency in legal terminology, such as the CFR and ACQP mentioned above, may contain conceptualisations that are yet to be officially accepted. ELTS allows the creation of temporary legal ontologies whose concepts are linked to current legal ontologies until the old concepts are replaced when the draft legislation becomes law.

Given the non-formal character of ELTS, which cannot be defined in a standard ontology language such as OWL, the ELTS ontology schema is provided here as a semi-formal UML representation, shown in Figure 2. The database schema is of course very close to the ontology schema in Figure 2.
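To make the schema features above concrete, the following is a minimal Python sketch of its main entities; all names are illustrative, and the authoritative definition remains the UML diagram in Figure 2.

from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Term:
    """A lexical entry: the string plus its language and jurisdiction."""
    text: str            # e.g. "consumer", "frutto civile"
    language: str        # e.g. "en", "it"
    jurisdiction: str    # e.g. "EU", "IT": one ontology per jurisdiction

@dataclass
class Concept:
    """A jurisdiction-specific concept with its legal sources."""
    concept_id: str
    jurisdiction: str
    description: str                                  # natural-language description
    sources: list[str] = field(default_factory=list)  # legislation, case law, doctrine
    domains: list[str] = field(default_factory=list)  # e.g. ["Consumer law"]

@dataclass
class Relation:
    """Directed relation between two concepts."""
    kind: str                  # "is_a", "part_of", "purpose", "concerns",
                               # "quasi_synonym", "replaced_by", "implement"
    source_id: str
    target_id: str
    substitution_date: Optional[date] = None          # used only by "replaced_by"

# The term-concept mapping is many-to-many, giving both synonymy and polysemy:
lexicon = [
    (Term("consumer", "en", "EU"), "eu:consumer"),
    (Term("consommateur", "fr", "EU"), "eu:consumer"),  # same concept, other language
    (Term("consumer", "en", "UK"), "uk:consumer"),      # same string, other jurisdiction
]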


3.2.2 The European Legal Taxonomy Syllabus ontology

The data in the European Legal Taxonomy Syllabus concerns consumer law. Terms were extracted from a corpus of 24 EC directives and 2 EC regulations. Occurrences of these entries were then detected in the national transposition laws of the English, French, Spanish, Italian and German jurisdictions. The numbers of annotated terms and concepts are provided in Tables 3 and 4.


Law is a highly polysemous domain, and the meaning of terms is highly context-sensitive. Legal terms are often defined differently in different legislation and other legal formants, and each definition can be regarded as a different concept. At the same time, some concepts can be expressed by different terms, particularly when transpositions from European law import literal translations rather than using the terms that have evolved organically in the national jurisdiction. All this means that an accurate ontology of law, particularly in a multi-jurisdictional environment, has to deal with the many possible relations between terms and concepts.

ELTS is structurally different from ontologies that permit only one meaning for every term. In ELTS, terms (lexical entries) are separated from concepts. Moreover, the concepts are associated with quotations from the sources contributing to their definition.

The text of the sources is maintained in a database of documents, so that it is possible to establish a link between the concept and the whole text of the source in which it is defined. In this way, it is possible to show the user the text of legislation and case law enriched with links to occurrences of concepts and highlighting terms related to those concepts (see Figure 4).

In Figure 5, we illustrate how a concept associated with a term is shown to the user. The concept associated with the term "consumer" at the European level is shown in tabular format. For ambiguous terms, a separate table is shown for each meaning. The language, the alternative linguistic realizations associated with the concept, and the domain ("Consumer law") all precede the description of the term in natural language.

As stated above, ELTS allows the insertion of the paragraphs of text where concepts are defined. However, to gain a better understanding of legal concepts, it is often necessary to consider a broader fragment. For example, in the case of "consumer", the definition alone is not enough, and it is necessary to collect multiple paragraphs where consumer protection norms are presented and discussed. References to legislation, notes and attached documents are also optional fields. The implementation and national association fields are discussed below.


To help users understand the inter-relationships between concepts, ELTS contains a number of ontological relations between concepts (see Table 5). Besides classic relations such as the “is_a” relation linking a category to its supercategory, it also includes the “purpose” relation, which links a concept to the legal principle behind it, and “concerns”, which expresses a general relatedness between concepts and is often used to link complex concepts to the basic concepts that contribute to their definitions. Legal practitioners have also deemed it useful to have a similarity relation between concepts, “quasi_synonym”, which is not the same as strict synonymy between terms.

In this way, the European Legal Taxonomy Syllabus enables clear modeling of the similarities and differences in conceptual inter-relations in different jurisdictions. It is important to note that these structures can also change: with the modification of concepts in new legislation at the national or European level, new purposes can be introduced or old ones rendered obsolete. Note that the above ontological relations connect only concepts within a single ontology (the EU ontology or that of a Member State) and are kept distinct from relations across different ontologies such as “implementation” or “translation”.


Furthermore, legal concepts constantly evolve. When new legislation is approved and enacted, it can introduce a number of new definitions that change the meaning of terms defined by old laws, rendering the old conceptualisations obsolete. Normative change [Palmirani and Brighi, 2006] is an open issue in building tools for formal models of legal frameworks. There are two types of normative change: explicit and implicit. In the first case, the new norm explicitly states the abrogation of a specific paragraph of an old law (see, e.g., [Spinosa et al., 2009]).

Alternatively, the new law can define a term in a way that contradicts definitions in previous laws without mentioning those laws explicitly. The same can take place if a judgment, or even an authoritative scholarly work, changes the meaning of a term. In these cases, the new definition implicitly renders the old definition obsolete.

The evolution of concepts is often ignored in the design of ontologies, since for most domains users are usually only interested in the current state of affairs. We believe that it is essential for legal ontologies to provide mechanisms for keeping track of conceptual changes, since deprecated definitions may still be useful, e.g., when looking at cases occurring before the introduction of new legislation, or in order to understand the evolution of terms. Similar concerns lie behind the work of [Palmirani and Brighi, 2006] in modelling time-specific multi-version coordinated legal texts.

We are aware of the difficulties of modeling time in logic and in formal ontology. On the other hand, the expressivity provided by other solutions is beyond what is needed for the legal domain, where it is sufficient to represent the legal concepts that stand at the current time, along with any (deprecated) ones that have been replaced. As such, the European Legal Taxonomy Syllabus introduces a simpler solution, inspired by the way time is dealt with in the databases of data warehouses: we add a new ontological relation called "replaced_by".

The "replaced_by" relation allows new concepts to explicitly replace old ones, while the old concepts are still retained in the system. The semantics of this relation is procedural, in that it is embedded in the way the database deals with the relation when it is added. The system handles this new ontological relation with some particular characteristics:

- A "replaced_by" relation brings with it a new data field not present in the other relations: the substitution date.
- When the user searches the concepts database, replaced concepts are not shown unless the user specifies a certain date in the past. This enables the user to obtain a snapshot of the legal ontology at any particular moment.
- When a new concept is an update of an old one, all the ontological relations linked to the old concept are automatically copied to the new concept. If some of these relations are no longer valid for the new concept, the interface allows manual intervention by the user (a code sketch of this behaviour follows the list).
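A minimal sketch of this procedural behaviour, using illustrative in-memory structures rather than the actual ELTS database:

from datetime import date

concepts = {}    # concept_id -> {"description": ..., "replaced_on": date or None}
relations = []   # {"kind": ..., "source": ..., "target": ...}

def replace_concept(old_id: str, new_id: str, substitution_date: date) -> None:
    """The old concept is kept but stamped with a substitution date, and all
    its relations are copied to the new concept for later manual revision."""
    concepts[old_id]["replaced_on"] = substitution_date
    relations.append({"kind": "replaced_by", "source": old_id, "target": new_id})
    for r in list(relations):
        if r["kind"] != "replaced_by" and old_id in (r["source"], r["target"]):
            relations.append({k: (new_id if v == old_id else v)
                              for k, v in r.items()})

def snapshot(as_of: date) -> dict:
    """Concepts valid at a given date: replaced concepts are hidden unless
    the query date precedes their substitution date."""
    return {cid: c for cid, c in concepts.items()
            if c["replaced_on"] is None or as_of < c["replaced_on"]}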


4 Legal NERC with ontologies

University of Cordoba and INRIA worked together on a Wikipedia-based approach to develop resources for the legal domain. In particular, the work done was devoted to:

(1) establishing a mapping between a legal domain ontology, LKIF [Hoekstra et al., 2007], and a Wikipedia-based ontology, YAGO [Suchanek et al., 2007], and, through that mapping, populating LKIF;
(2) using the mentions of those entities in Wikipedia text to train a specific Named Entity Recognizer and Classifier.

The research work has been primarily published in [Teruel et al., 2017a], [Teruel et al., 2017b], and [Cardellino et al., 2017].

4.1 Establish a mapping between LKIF and YAGO

Many legal ontologies have been proposed in the literature, with different purposes and applied to different sub-domains, e.g., [Ajani et al., 2017], [Hoekstra et al., 2007], [Athan et al., 2015]. However, their manual creation and maintenance is a very time-consuming and challenging task: domain-specific information needs to be created by legal experts to ensure the semantics of regulations is fully captured. Such ontologies have little coverage, because they have a small number of entities or dwell only on concepts, not concrete entities. Moreover, only very few annotated legal corpora exist from which entities can be gathered. All this constitutes an important barrier for Information Extraction from legal text. There is little work on increasing the coverage of legal ontologies. [Bruckschen et al., 2010] describe a legal ontology population approach that applies automatic NER to legal data. [Lenci et al., 2009]'s ontology learning system T2K extracts terms and their relations from Italian legal texts, and is able to identify the classes of the ontology. [Humphreys et al., 2015] extract norm elements (norms, reasons, powers, obligations) from European Directives using dependency parsing and semantic role labeling, taking advantage of the structured format of the Eunomos legal system described above [Boella et al., 2016]. On the other hand, [Boella et al., 2014] exploit POS tags and syntactic relations to classify textual instances as legal concepts. All these approaches rely on a considerable amount of domain knowledge and hand-crafted heuristics to delimit legal concepts and how they are expressed in text. In contrast, this research takes an inexpensive approach, which exploits the information already available in Wikipedia by connecting it with ontologies. A mapping between the WordNet- and Wikipedia-based YAGO ontology and the LKIF ontology for the legal domain is established. In this way, the semantics of LKIF is transferred to Wikipedia entities, populating the LKIF ontology with those entities and their mentions. It must be observed that Wikipedia has been used as a corpus for NERC because it provides a fair amount of naturally occurring text where entities are tagged and linked to an ontology, i.e., the DBpedia ontology [Hahm et al., 2014]. One of the shortcomings of such an approach is that not all entity mentions are tagged, but it is a starting point to learn a first version of a NERC tagger, which can then be used to tag further corpora and alleviate the human annotation task.

4.2 Domain and classes to be learned

The target domain of this research is formally represented by the well-known LKIF ontology [Hoekstra et al., 2007], which provides a model for core legal concepts. In order to transfer the semantics of LKIF to the relevant annotated entities in Wikipedia, a mapping between the classes of LKIF and those of YAGO [Suchanek et al., 2007], a Wikipedia-based principled ontology, has been manually defined. The mapping links a node in one ontology to a node in the other; all children of a mapped YAGO node are connected to LKIF through their most immediate mapped parent, so they are effectively covered by the mapping as well. There are a total of 69 classes in this portion of the LKIF ontology, of which 30 could be mapped to a YAGO node, either as children or as equivalent classes. Two YAGO classes were mapped as parents of an LKIF class, although these are not exploited in this approach. From YAGO, 47 classes were mapped to an LKIF class, with a total of 358 classes considering their children, summing up to 4.5 million mentions. Since curriculum learning requires that concepts be organized in a hierarchy, it was decided not to use the hierarchy provided by the two ontologies themselves, because LKIF is not hierarchical but rather aimed at representing interrelations and mereology. Thus, the hierarchy of concepts displayed in Figure 2 has been developed.
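The effect of the mapping on descendants can be pictured with a short sketch (data structures hypothetical): every descendant of a mapped YAGO node inherits the LKIF class of its closest explicitly mapped ancestor.

def propagate(mapping, yago_children):
    """Extend the manually defined YAGO -> LKIF correspondences in `mapping`
    to all YAGO subclasses of the mapped nodes."""
    full = dict(mapping)
    frontier = list(mapping)
    while frontier:
        node = frontier.pop()
        for child in yago_children.get(node, []):
            if child not in full:      # an explicit mapping wins over inheritance
                full[child] = full[node]
                frontier.append(child)
    return full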

Figure 2: Levels of abstraction of the reference ontology.

The top distinction is between Named Entities and non-Named Entities; within Named Entities, the classes "Person", "Organization", "Document", "Abstraction" and "Act" are distinguished; within those, LKIF classes are distinguished, and within those, YAGO classes. In order to build the training corpus, the spans of text that are anchors for a hyperlink whose URI is one of the mapped entities have been considered as tagged entities. Sentences containing at least one named entity were then extracted. Words within an anchor span belong to the "I" class (Inside); words outside a span belong to the "O" class. The "O" class made up more than 90% of the instances, so non-named-entity words were randomly subsampled so that they constitute at most 50% of the corpus, i.e., so that classifiers would not be too biased. Thus built, the corpus consists of 21 million words. The corpus was divided into three parts: 80% for training, 10% for tuning and 10% for testing. The elements of each part were randomly selected so as to preserve the proportion of each class in the original corpus, with a minimum of one instance of each class appearing in each part. We consider only entities with a Wikipedia page and with more than 3 mentions in Wikipedia.
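A sketch of this corpus construction, assuming each sentence arrives as (token, link-target) pairs; the 50% cap on the "O" class follows the text, the rest is illustrative.

import random

def build_corpus(sentences, mapped_uris, seed=0):
    rng = random.Random(seed)
    tagged = []
    for sent in sentences:
        labels = ["I" if uri in mapped_uris else "O" for _, uri in sent]
        if "I" in labels:                        # keep sentences with >= 1 entity
            tagged.extend(zip([tok for tok, _ in sent], labels))
    inside = [t for t in tagged if t[1] == "I"]
    outside = [t for t in tagged if t[1] == "O"]
    # Subsample the majority "O" class to at most 50% of the corpus.
    return inside + rng.sample(outside, min(len(outside), len(inside)))

# The 80/10/10 split can then be stratified by class, e.g. with
# sklearn.model_selection.train_test_split(..., stratify=labels).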

4.3 Named entity recognition, classification, and linking trained on the LKIF+YAGO ontology

Named Entity Recognition and Classification (NERC) is a cornerstone of Information Extraction (IE). Accurate and specific NERC allows for improved Information Retrieval (IR) and a more informative representation of the contents of documents. It is the basis for the identification and formal representation of propositions, claims and arguments in legal texts, as shown by [Surdeanu et al., 2010]. Information Retrieval and Extraction are key issues in legal practice nowadays, because they allow for an extensive and quick exploitation of jurisprudence. If law practitioners are provided with relevant cases when they are building their arguments for a new case, they are more likely to produce a sounder argumentation. It is also to be expected that cases are resolved more decisively if compelling jurisprudence is provided, even at an early stage in the judicial process. More and more technological solutions are being developed along this line, which shows the feasibility and utility of this line of work. In this context, open-source tools and resources are also important to provide equitable access to the law. In the legal domain, Named Entities are not only names of people, places or organizations, as in general-purpose NERC. Named Entities are also names of laws, of typified procedures, and even of concepts. Named Entities may also be classified differently; for example, countries and organizations are classified as Legal Person, as can be seen in the following example extracted from a judgment of the European Court of Human Rights:


Example 1.1. The [Court]organization is not convinced by the reasoning of the [combined divisions of the Court of Cassation]organization, because it was not indicated in the [judgment]abstraction that [Egitim-Sen]person had carried out [illegal activities]abstraction capable of undermining the unity of the [Republic of Turkey]organization.

Different levels of granularity can be distinguished in NERC. The most fine-grained level, Named Entity Linking (NEL), has received much attention from the community in recent years, mostly because of the availability of knowledge bases and computational resources that make NEL feasible. The task of NEL consists in determining the identity of entities mentioned in text with respect to a knowledge base. Example 1.1 can be tagged as follows:

Example 1.2. The [Court]European_Court_of_Human_Rights is not convinced by the reasoning of the [combined divisions of the Court of Cassation]Yargıtay Hukuk Genel Kurulu, because it was not indicated in the [judgment]Court_of_Cassation’s_judgment_of_22_May_2005 that [Egitim-Sen]Education_and_Science_Workers_Union_(Turkey) had carried out [illegal activities]0 capable of undermining the unity of the [Republic of Turkey]Turkey.

As should be clear from the previous subsections, in the legal domain Named Entities are best represented using ontologies. While this is true of any domain, the need for an ontology representing the underlying semantics of Named Entities is crucial in the legal domain, with its severe requirement of precision, a rich hierarchical structure, and well-founded semantics for some of its sub-domains (see, for example, the Hohfeldian analysis of legal rights [Hohfeld, 1919]). Some ontologies have been created to model the legal domain, with different purposes and applied to different sub-domains, e.g., [Hoekstra et al., 2007], [Athan et al., 2015], and [Ajani et al., 2017]. However, their manual creation and maintenance is a very time-consuming and challenging task: domain-specific information needs to be created by legal experts to ensure the semantics of regulations is fully captured. Therefore, such ontologies have little coverage, because they have a small number of entities or dwell only on abstract concepts. Moreover, only very few annotated legal corpora exist with annotations for entities. All this constitutes an important barrier for Information Extraction from legal text. In the research conducted at the University of Cordoba and INRIA, this issue has been tackled by addressing the following research question: how can legal ontologies with a small number of annotated entities be populated, to support named entity recognition, classification and linking? Specifically, two low-cost, high-coverage legal Named Entity Recognizers, Classifiers and Linkers have been developed and trained on the corpus illustrated in the previous section: one based on Curriculum Learning and one based on Support Vector Machines.

4.3.1 Legal NERC via curriculum learning

Even using Wikipedia, many of the classes have few instances. To address the problems of training with few instances, one of the solutions adopted by the University of Cordoba and INRIA was to apply a learning strategy called curriculum learning [Bengio et al., 2009]. Roughly, curriculum learning is a method that trains a model incrementally, presenting to it increasingly more complex concepts. This should allow finding the most adequate generalizations and avoiding overfitting. However, it has been found that curriculum learning does not produce the expected improvements. On the contrary, reversed curriculum learning, i.e., learning from most specific to most general, produces better results, which helps to indicate that there may be incoherences in the mapping between the ontologies. Curriculum learning (CL) is a continuation method [Allgower and Georg, 2012], i.e., an optimization strategy for minimizing non-convex criteria, such as those of neural network classifiers. The basic idea of this method is to first optimize a smoothed objective, in our case more general concepts, and then gradually consider less smoothing, in our case more specific concepts. The underlying intuition is that this approach reveals the global picture [Bengio et al., 2009]. Curriculum learning has been applied with the following rationale. First, a neural network with randomly set weights is trained to distinguish NE vs. non-NE. Once this classifier has converged, the weights obtained are used as the starting point of a classifier with a similar architecture (in number of layers and number of neurons per layer), but with more specific classes. In our case, the classification divides the examples into the six classes Person, Organization, Document, Abstraction, Act and non-NE. Again, when this classifier converges, its weights are used for the next level of classification, the LKIF concepts, and finally the YAGO classes. Consider the following example: we start with the text "Treaty of Rome"; in the first iteration we train the classifier to recognize it as a NE; the second iteration classifies it as a Document; in the third iteration it falls into the LKIF Treaty class; and finally, in the last iteration, it is linked to the YAGO WordNet class treaty_106773434. In training the neural network, experiments with one, two and three hidden layers were carried out, but a single hidden layer, smaller than the input layer, performed better, so this was set as the architecture for the neural networks. In each iteration of CL, only the output layer is modified to suit the abstraction level of the classes at the corresponding step, leaving the hidden layer and the weights from the input to the hidden layer unchanged. Examples have been represented with a subset of the features proposed by [Finkel et al., 2005] for the Stanford Parser CRF model. For each instance (each word) we used: the current word, the current word's PoS tag, all the character n-grams (1 <= n <= 6) forming the prefixes and suffixes of the word, the previous and next word, the bag of words (up to 4) at left and right, the tags of the surrounding sequence within a symmetric window of 2 words, and the occurrence of the word in a full or partial gazetteer. The final vector characterizing each instance had more than 1.5e6 features, too many to be handled due to memory limitations. In addition, the matrix was largely sparse. As a solution, a simple feature selection technique was applied using a variance threshold: all features with variance less than 2e-4 were filtered out, reducing the number of features to 10,854.
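A minimal sketch of this curriculum, assuming PyTorch; layer sizes and the data loader are placeholders, and only the hand-over of weights between levels mirrors the procedure described above.

import torch
import torch.nn as nn

N_FEATURES, N_HIDDEN = 10854, 512   # feature count from the text; hidden size illustrative
CURRICULUM = [2, 6, 21, 122]        # NE/non-NE -> 6 classes -> LKIF -> YAGO levels

def batches(n_classes, n_batches=100, size=32):
    # Placeholder for the Wikipedia corpus, re-labelled at each level of the
    # hierarchy (random tensors here just to keep the sketch runnable).
    for _ in range(n_batches):
        yield torch.randn(size, N_FEATURES), torch.randint(n_classes, (size,))

hidden = nn.Sequential(nn.Linear(N_FEATURES, N_HIDDEN), nn.ReLU())
for n_classes in CURRICULUM:
    # Only the output layer is replaced at each step; the input-to-hidden
    # weights converged at the previous level are carried over.
    model = nn.Sequential(hidden, nn.Linear(N_HIDDEN, n_classes))
    optimizer = torch.optim.Adam(model.parameters())
    for x, y in batches(n_classes):
        optimizer.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        optimizer.step()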

MIREL- 690974 Page 37 of 55 09/03/2019

D2.4 Ontology population: connecting legal text to ontology concepts and instances

Evaluation. The neural network classifier has been evaluated by comparing batch learning and curriculum learning. As a comparison ground, a linear classifier was also trained, namely a Support Vector Machine (SVM) with a linear kernel, as well as the Stanford CRF Classifier model for NER [Stanford NLP Group, 2016], trained on the corpus with Wikipedia annotations for the LKIF classes. For the Stanford NERC, the same features were used as for the MLP classifiers, except the presence-in-gazetteer features and the PoS tags of surrounding words. Decision trees and Naive Bayes (NB) classifiers were discarded because the cardinality of the classes was too large for those methods. To evaluate the performance, we computed accuracy, precision and recall on a word-to-word basis on the test portion of our corpus. For this particular problem, the performance on the majority class, non-NE, eclipses the performance on the rest. To gain better insight, the macro-averages of precision and recall without the non-NE class are also provided. The macro-average reflects differences in all classes, with less populated classes on an equal footing with more populated ones. The difference in performance between the different classifiers was very small. To assess the statistical significance of the results, a Student's t-test with paired samples was applied, comparing the classifiers: the Wikipedia corpus was divided into five disjoint subcorpora, each of which was divided into train/validation/test; results were compared and p-values obtained for the comparison. In order to evaluate the performance of this approach on legal corpora such as norms or case law, a corpus of judgments of the European Court of Human Rights was manually annotated, identifying NEs that belong to classes in the ontology or to comparable classes that might be added to the ontology. Excerpts from 5 judgments of the ECHR, totalling 19,000 words, were annotated, identifying 1,500 entities totalling 3,650 words. Annotators followed specific guidelines, inspired by the LDC guidelines for the annotation of NEs [Linguistic Data Consortium, 2014]. There were 4 different annotators. The agreement between judges ranged from κ=.4 to κ=.61, without significant differences across levels of granularity. Most of the disagreement between annotators was found in the recognition of NEs, not in their classification. The inter-annotator agreement obtained is not high and does not guarantee reproducible results, but it is useful for a first assessment of performance.

Analysis of results. The results on the test portion of our Wikipedia corpus are reported in Table 1. The table shows overall accuracy, and the average recall and precision across classes other than the non-NE class. It can be seen that the neural network classifiers perform better than both the SVM and the Stanford NER. Differences are noticeable when the non-NE class is not included in the metric, as in the non-weighted average of precision and recall without non-NEs. It can be observed that curriculum learning does not introduce an improvement in accuracy over batch learning in a neural network. As explained above, the paired t-test was applied on five different samples of the corpus to assess whether the difference between classifiers was significant; two out of five comparisons showed no significant difference (at p<0.05), while the other three did.
Therefore, it seems that Curriculum Learning, at least the way we applied it here, does not introduce an improvement.
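The significance test can be reproduced with a paired t-test over the five subcorpora, e.g. with SciPy; the accuracy figures below are invented for illustration.

from scipy import stats

batch_acc      = [0.912, 0.905, 0.918, 0.909, 0.915]   # one value per subcorpus
curriculum_acc = [0.910, 0.907, 0.914, 0.905, 0.916]

t_stat, p_value = stats.ttest_rel(batch_acc, curriculum_acc)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")   # significant if p < 0.05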


Further analysis of the results shows that the MLP classifier performs far better on smaller classes (with fewer instances) than on bigger classes, at all levels of abstraction but most dramatically at the LKIF level, where the F-score for the 20% biggest classes drops to .11 (in contrast with .62 for NERC and .42 for YAGO), while for the smallest classes it stays within the smooth decrease in performance that can be expected from the increase in the number of classes, and thus in the difficulty of classification. These results corroborate an observation already anticipated in the general results, namely that the LKIF level of generalization is not adequate for automated NERC learnt from Wikipedia, because the NERC cannot distinguish the classes defined at that level, that is, in the original LKIF ontology. In contrast, the NERC does a better job at distinguishing YAGO classes, which are natively built from Wikipedia, even if the classification problem is more difficult because of the bigger number of classes.
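The per-class analysis amounts to grouping per-class F-scores by class frequency. A sketch with scikit-learn, where y_true and y_pred stand for gold and predicted labels:

import numpy as np
from sklearn.metrics import f1_score

def f1_by_population(y_true, y_pred):
    classes, counts = np.unique(y_true, return_counts=True)
    f1 = f1_score(y_true, y_pred, labels=classes, average=None, zero_division=0)
    order = np.argsort(counts)               # least populated classes first
    k = max(1, len(classes) // 5)            # 20% of the classes
    return {"smallest_20%": f1[order[:k]].mean(),
            "biggest_20%": f1[order[-k:]].mean()}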

Table 1: Results for the test portion of the Wikipedia corpus. Accuracy figures consider non-NEs, but precision and recall are an average of all classes (macro-average) except the majority class of non-NEs. The results for the NER level for Curriculum Learning


On the other hand, the fact that smaller classes are recognized better than bigger classes indicates that the bigger classes are ill-delimited. It may be that these classes are built as catch-all classes, grouping heterogeneous subclasses. This indicates that curriculum learning might work better learning first from the most concrete classes, then from the more general ones. In Table 2, the performance of curriculum learning in reverse, that is, from the most specific classes to the most general ones, is shown.

Table 2: Comparison of curriculum learning strategies, from most general to most specific (CL) and from most specific to most general (rev CL), with accuracy including the class of non-NEs and macro-average excluding the class of non-NEs.

It can be seen that curriculum learning from most specific to most general provides the best result at the NERC level of abstraction, outperforming the other two neural approaches. However, at the LKIF level, the batch approach performs better. This seems to indicate that, for this particular hierarchy and dataset, curriculum learning is more adequately applied from most specific to most general. Moreover, the YAGO and NERC levels seem to be coherent with each other, while the LKIF level seems disconnected from the other two. Therefore, the chosen level of granularity for legal NERC using our ontology should be either the 6-class level or the YAGO level, depending on the level of granularity that is required. Moreover, the mapping between YAGO and LKIF needs to be further consolidated. The results for the different approaches to NERC trained on Wikipedia, evaluated on the corpus of judgments of the ECHR described above, are shown in Table 3. It can be seen that the drop in performance with respect to the results on Wikipedia is considerable; on the other hand, this annotator has no annotation cost, because the examples are obtained from Wikipedia, so it can be considered a preprocessing step preceding human validation/annotation of legal text.

Table 3: Comparison of different strategies for NERC trained on Wikipedia, as they perform in ECHR judgments.

4.3.2 Legal NERC via support vector machines

Using the corpus described above, a classifier for Named Entity Recognition and Classification based on Support Vector Machines has been trained. The objective of this classifier is to identify, in naturally occurring text, the mentions of Named Entities belonging to the classes of the ontology, and to classify them in the corresponding class, at different levels of granularity. In this research the URI level has not been considered; this is treated in a qualitatively different way by the Named Entity Linking approach detailed below. Different approaches have been applied to exploit our annotated examples. First of all, a linear classifier was trained, namely a Support Vector Machine (SVM) with a linear kernel, as well as the Stanford CRF Classifier model for NERC [Surdeanu et al., 2010], on our corpus with Wikipedia annotations for the LKIF classes. Decision trees and Naive Bayes (NB) classifiers were discarded because the cardinality of the classes was too large for those methods. The Stanford NERC could not handle the level of granularity with most classes, the YAGO level. Moreover, a neural network was (again) learnt, carrying out experiments with one, two and three hidden layers; a single hidden layer, smaller than the input layer, performed better, so this architecture was set. More complex configurations of the neural network were explored, with subsequent levels of granularity, including the Curriculum Learning described above. However, none of these more complex configurations improved performance.

The experiments were also carried out using word embeddings. Some exploration was done using the Google News pre-trained embeddings, which are 3 million dense word vectors of dimension 300, trained on a 100-billion-word corpus. However, it was decided to use self-trained embeddings obtained with Word2Vec's skip-gram algorithm, based solely on the Wikipedia corpus used later for the NERC task. All words with fewer than 5 occurrences were filtered out, leaving roughly 2.5 million unique tokens (meaning that a capitalized word is treated differently from an all-lowercase word), from a corpus of 1 billion raw words. The trained embeddings were of size 200; from them, a matrix is generated where each instance is represented by the vector of the instance word surrounded by a symmetric window of 3 words on each side. Thus, the input vector of the network is of dimension 1400, as it holds the vectors of a 7-word window in total. If the word is near the beginning or the end of a sentence, the vector is padded with zeros. Zeros are also used as padding when no representation of the word (capitalized or not) is found in the Word2Vec model (this representation is sketched in code below). Word embeddings are known to be particularly apt for domain transfer, because they provide some smoothing over the obtained model, preventing overfitting to the training set. Therefore, they were expected to be useful to transfer the models obtained from Wikipedia to other corpora, like the judgments of the ECHR. However, it is also known that embeddings are more adequate the bigger the corpus they are learnt from, and when that corpus belongs to the same domain to which they will be applied. In the present case, the corpus at our disposal, namely Wikipedia, was very big, but it does not belong to the domain to which the embeddings were applied, namely the judgments. Therefore, experiments have been conducted with three kinds of embeddings: embeddings obtained from Wikipedia alone (as described above), embeddings obtained with the same methodology from the judgments alone, and embeddings obtained from a mixed corpus made of the judgments of the ECHR and a similar quantity of text from Wikipedia. To train word embeddings for the judgments of the ECHR, we obtained all cases in English from the ECHR's official site available in November 2016, for a total of 10,735 documents.

The Named Entity Linking task consists in assigning YAGO URIs to the Wikipedia mentions. The total number of entities found in the selected documents is too big (174,913) to train a classifier directly. To overcome this problem, a two-step classification pipeline has been used. Using the NERC provided by the previous step, each mention is first classified with its most specific class in our ontology. For each of these classes, a classifier has been trained to identify the correct YAGO URI for the instance, using only the URIs belonging to the given class. Therefore, several classifiers have been built, each trained with a reduced number of labels. Note that each classifier is trained using only entity mentions, for a total of 48,353 classes, excluding the `O' class. The state-of-the-art tool for NEL is Babelfy (http://babelfy.org/), but the NERC developed in this research could not be compared to it because it has a daily limit of 1,000 queries.
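A sketch of the windowed embedding representation described above, assuming gensim; the commented training line mirrors the stated skip-gram settings, and the padding behaves exactly as described.

import numpy as np
from gensim.models import Word2Vec

DIM, WINDOW = 200, 3    # 7 tokens x 200 dimensions = 1400-dimensional input

# model = Word2Vec(wiki_sentences, vector_size=DIM, sg=1, min_count=5)  # skip-gram

def window_features(sentence, model):
    """One row per token: its vector flanked by the vectors of 3 words on
    each side, zero-padded at sentence boundaries and for OOV words."""
    def vec(i):
        if 0 <= i < len(sentence) and sentence[i] in model.wv:
            return model.wv[sentence[i]]
        return np.zeros(DIM, dtype=np.float32)
    return np.stack([np.concatenate([vec(i + d) for d in range(-WINDOW, WINDOW + 1)])
                     for i in range(len(sentence))])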


The algorithm to train the two-step pipeline is the following:

1. Assign to each mention its ground-truth ontology label.
2. Split the dataset into train/test/validation.
3. For each assigned ontology class:
   3.1. Build new train/test/validation datasets by filtering out mentions not tagged with this class.
   3.2. Train and evaluate a classifier with the new train/test/validation datasets.

The algorithm for classification is:

1. For each instance, assign a NE class to it using a previously trained NERC.
2. Select the classifier assigned to that class, and use it to obtain a YAGO URI prediction for the instance.

A code sketch of this two-step pipeline is given at the end of this discussion. The classifiers learnt for each of the classes were neural network classifiers with a single hidden layer of size 2 * number of classes, with a minimum of 10 and a maximum of 500. Other classifiers cannot handle the high number of classes in this setting; in particular, the Stanford NERC is incapable of handling them. As a comparison ground, two baselines have also been evaluated: a random classifier and a k-nearest-neighbors classifier. For the random baseline, given the LKIF class for the entity (either ground truth or assigned by an automated NERC), the final label is chosen randomly among the YAGO URIs seen for that LKIF class in the training set, weighted by their frequency. The k-nearest-neighbors classifier is trained using the current, previous and following word tokens, which is equivalent to checking the overlap of the terms in the entity. Two types of evaluation have been distinguished: the performance of each classifier using ground-truth ontology classes, and the performance of the complete pipeline, accumulating error from the automated NERC. The individual classifier performance is not related to the other classifiers and is affected only by the YAGO URIs in the same LKIF class. It is calculated using the test set associated with each class, which does not include the `O' class.

Evaluation. To evaluate the performance, accuracy, precision and recall have been computed on a word-to-word basis on the test portion of our Wikipedia corpus, totalling 2 million words, of which half belong to NEs and the other half to non-NEs. Thus, the evaluation consisted in calculating the proportion of words that had been correctly or incorrectly tagged as part of a NE and as belonging to a class of NEs at different levels of granularity. For this particular problem, accuracy does not throw much light on the performance of the classifier, because the performance on the majority class, non-NE, eclipses the performance on the rest. To gain better insight, the metrics of precision and recall are more adequate. Those metrics have been calculated per class, and a simple average without the non-NE class is provided. Besides not being obscured by the huge non-NE class, this average is not weighted by the population of the class (it is thus an equivalent of the macro-average). Therefore, differences in these metrics reflect differences in all classes, with less populated classes on an equal footing with more populated ones. Additionally, the performance of some classifiers has been discriminated between the 20% most populated classes and the 20% least populated classes, to get a global view of the errors. The confusion matrix of classification is shown in Figure 3. The matrix casts classes into bins according to their frequency, to enable the results to be displayed. This evaluation shows how errors are distributed, in order to direct further developments.
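As announced above, a sketch of the two-step pipeline; scikit-learn-style fit/predict interfaces are assumed, and make_classifier is a hypothetical factory for the per-class neural networks.

from collections import defaultdict

def train_linkers(mentions, classes, uris, make_classifier):
    """Step 2 training: one URI classifier per ontology class, trained only
    on the mentions labelled with that class."""
    by_class = defaultdict(list)
    for m, c, u in zip(mentions, classes, uris):
        by_class[c].append((m, u))
    return {c: make_classifier().fit([m for m, _ in data], [u for _, u in data])
            for c, data in by_class.items()}

def link(mention, nerc, linkers):
    """Step 1 assigns the most specific ontology class; step 2 picks a YAGO
    URI among those seen in training for that class."""
    c = nerc.predict([mention])[0]
    return linkers[c].predict([mention])[0]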

Figure 3: Confusion matrix of classification by Neural Networks with handcrafted features in different levels of granularity: NERC (top), LKIF (middle) and YAGO (bottom).
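The frequency binning used for this matrix can be sketched as follows: classes are ranked by how often they occur and collapsed into a small number of bins before the confusion matrix is computed (the bin count is illustrative).

import numpy as np
from sklearn.metrics import confusion_matrix

def binned_confusion(y_true, y_pred, n_bins=10):
    classes, counts = np.unique(y_true, return_counts=True)
    order = np.argsort(-counts)                         # most frequent first
    bin_of = {c: i * n_bins // len(classes)
              for i, c in enumerate(classes[order])}
    yt = [bin_of[c] for c in y_true]
    yp = [bin_of.get(c, n_bins - 1) for c in y_pred]    # unseen labels -> last bin
    return confusion_matrix(yt, yp, labels=list(range(n_bins)))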


Evaluating on Wikipedia has the advantage that the NERC and NEL models have been learnt from Wikipedia itself, so they are working on comparable corpora. However, even if it is useful to detect NEs in Wikipedia itself, it is far more useful for the community to detect NEs in legal corpora such as norms or case law. That is why we have manually annotated a corpus of judgments of the European Court of Human Rights, identifying NEs that belong to classes in our ontology or to comparable classes that might be added to the ontology. This annotated corpus is useful to evaluate the performance of the developed NERC and NEL tools, but it will also be used to train specific NERC and NEL models that might be combined with the Wikipedia ones. More precisely, excerpts from 5 judgments of the ECHR, obtained from the Court website and totalling 19,000 words, have been annotated; 1,500 entities, totalling 3,650 words, have been identified. Annotators followed specific guidelines, inspired by the LDC guidelines for the annotation of NEs [Linguistic Data Consortium, 2014]. Annotators were instructed to classify NEs at the YAGO and URI levels, but no consistent annotation guidelines could be developed for the URI level, which is equivalent to Named Entity Linking, so it has not been used for evaluation yet. There were 4 different annotators, and three judgments were annotated by at least 2 annotators independently, to assess inter-annotator agreement using Cohen's kappa coefficient [Cohen, 1960]. The agreement between judges ranged from κ=.4 to κ=.61, without significant differences across levels of granularity. Most of the disagreement between annotators was found in the recognition of NEs, not in their classification. The classes and subclasses of Document, Organization and Person were the most consistent across annotators, while Act, Abstraction and non-NE accumulated most discrepancies. The inter-annotator agreement obtained for this annotation is not high and does not guarantee reproducible results. We are planning to improve the annotation guidelines, including discussion sessions to unify criteria. A more reliable version of these annotations will then be produced, useful for evaluation and, more importantly, for training domain-specific NERC and NEL. For the time being, these annotations can be used for evaluation, to obtain results that are indicative of the performance of the tools on legal text.

Analysis of the results. The results for NERC on the test portion of the Wikipedia corpus at different levels of abstraction are reported in Table 4, which shows the overall accuracy (taking into consideration the `O' class) and the average recall, precision and F-measure across classes other than the non-NE class. The Stanford NERC could not deal with the number of classes at the YAGO level, so it was not evaluated at that level. A summary of this information is provided in Figure 4, displaying the accuracy and F-measure of the different approaches at different levels of granularity. Results with handcrafted features and with word embeddings obtained from Wikipedia are also shown. At a bird's-eye view, it can be seen that the SVM classifier performs far worse than the rest, and also that word embeddings consistently worsen the performance of the neural network classifier.

MIREL- 690974 Page 45 of 55 09/03/2019

D2.4 Ontology population: connecting legal text to ontology concepts and instances

Table 4: Results for Named Entity Recognition and Classification on the test portion of the Wikipedia corpus, for different approaches, at different levels of granularity. Accuracy figures take into consideration the majority class of non-NEs, but precision and recall are an average of all classes (macro-average) except the majority class of non-NEs.

Figure 4: Results of different approaches to NERC on the Wikipedia test corpus, at different levels of granularity, with accuracy (left) and F-measure (right), as displayed in Table 4.


The Stanford NERC performs worse than the neural network classifier at the NER level, the two perform indistinguishably at the NERC level, and Stanford performs better at the LKIF level. However, it can be observed that the neural network performs better at the YAGO level than at the LKIF level, even though there are 122 classes at the YAGO level vs. 21 classes at the LKIF level. A closer look at performance (Figure 5) reveals that the neural network classifier performs far better on smaller classes (with fewer instances) than on bigger classes, at all levels of abstraction but most dramatically at the LKIF level, where the F-score for the 20% biggest classes drops to .11 (in contrast with .62 for NERC and .42 for YAGO), while for the smallest classes it stays within the smooth decrease in performance that can be expected from the increase in the number of classes, and thus in the difficulty of classification.

Figure 5: F-measure of the Neural Network classifier for NERC at different levels of granularity, discriminating the 20% most populated classes (blue) and 20% least populated classes (red).

These results corroborate an observation already anticipated in the general results, namely that the LKIF level of generalization is not adequate for automated NERC, and that the NERC cannot distinguish the classes defined at that level, that is, in the original LKIF ontology. In contrast, the NERC does a better job at distinguishing YAGO classes, even if the classification problem is more difficult because of the bigger number of classes. On the other hand, the fact that smaller classes are recognized better than bigger classes indicates that the bigger classes are ill-delimited. It may be that these classes are built as catch-all classes, grouping heterogeneous subclasses. Therefore, it seems that the chosen level of granularity for legal NERC using our ontology should be the most fine-grained, because it provides most information without a significant loss in performance, or even with a gain in performance for the most populated classes. Another possibility to improve the performance at the LKIF level would be to revisit the alignment, which is planned for the near future. The results for NERC on the corpus of judgments of the ECHR described above are shown in Table 5 and in Figure 6.

Table 5: Results for Named Entity Recognition and Classification on the corpus of judgments of the ECHR, for different approaches, at different levels of granularity, with models trained only with the documents of the ECHR themselves (divided in training and test) and with models trained with the Wikipedia, combined with embeddings obtained from the Wikipedia, from the ECHR or from both. Accuracy figures take into consideration the majority class of non-NEs, but precision and recall are an average of all classes (macro-average) except the majority class of non-NEs.

Figure 6: Results of different approaches to NERC on the judgments of the ECHR, at different levels of granularity, with accuracy (left) and F-measure (right), as displayed in Table 5. Approaches with different embeddings are distinguished.


The results compare models trained on Wikipedia and applied to the ECHR documents with models trained on and applied to the ECHR corpus itself (divided into training and test splits). They also show models working on different representations of the examples. The variations are handcrafted features and different combinations of embeddings: obtained from Wikipedia alone, from the judgments of the ECHR alone, or from Wikipedia and the ECHR in equal parts.

On the ECHR corpus, models trained on the annotated corpus of ECHR judgments perform significantly better than those trained on Wikipedia, even though the latter are obtained from a much bigger corpus. The differences in performance can be seen more clearly in the F-measure plot in Figure 6 (right). This drop in performance is mainly due to the fact that the variability of entities, and of the ways they are mentioned, is far smaller in the ECHR than in Wikipedia: there are fewer unique entities, and some of them are repeated very often (e.g., "Court", "applicant") or in very predictable ways (e.g., citations of cases as jurisprudence).

For models trained on the annotated corpus of ECHR judgments, word embeddings decrease performance. This is mainly explained by overfitting: word embeddings help prevent overfitting and are beneficial especially for highly variable data or in case of domain change, which is not the situation when the NERC is trained on the ECHR corpus, with its very little variability. It must also be highlighted that there is little difference between word embeddings trained on different inputs, although Wikipedia-trained word embeddings perform better in general; there is no consistent difference between mixed and ECHR-trained embeddings. In contrast, for Wikipedia-trained models, ECHR and mixed (ECHR+Wikipedia) word embeddings improve both precision and recall. This shows that, when we have a domain-specific model, embeddings obtained from a significantly bigger corpus are more beneficial, whereas, when no in-domain information is available, a representation obtained from many unlabeled in-domain examples yields a bigger improvement. For a lengthier discussion of these results, see [Teruel and Cardellino, 2017].

On the other hand, as explained above, NEL could not be evaluated on the corpus of judgments, but only on Wikipedia, because annotation at the level of entities has not been consolidated in the corpus of judgments of the ECHR. Approaches to NEL have therefore been evaluated only on the test portion of the Wikipedia corpus. Results are shown in Table 6. As could be expected from the results for NERC, word embeddings worsened prediction performance. The performance of NEL is quite acceptable when it is applied to ground-truth labels, but it reaches only a 16% F-measure when applied on top of automatic NERC at the YAGO level of classification; the fully automated pipeline for NEL is thus far from satisfactory. Nevertheless, we expect that improvements in YAGO-level classification will have a big impact on NEL. We also plan to substitute the word-based representation of NEs with a string-based representation that allows for better string-overlap heuristics and a customized edit distance for abbreviation heuristics.
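The intended string-based representation can be sketched as follows; the character-overlap measure and the abbreviation test below are illustrative heuristics of the kind planned, not the project's final design:

from difflib import SequenceMatcher

def string_overlap(mention, candidate):
    # Character-level similarity between a mention and a candidate
    # entity name, in [0, 1].
    return SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()

def matches_abbreviation(mention, candidate):
    # Treat the mention as a possible abbreviation of the candidate,
    # e.g. "ECHR" vs. "European Court of Human Rights".
    initials = "".join(w[0] for w in candidate.split() if w[0].isupper())
    return mention.replace(".", "").upper() == initials

def link(mention, candidates):
    # Prefer an abbreviation match; otherwise return the candidate
    # with the highest string overlap.
    for candidate in candidates:
        if matches_abbreviation(mention, candidate):
            return candidate
    return max(candidates, key=lambda c: string_overlap(mention, c))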


Table 6: Results for Named Entity Linking on the test portion of the Wikipedia corpus, for different approaches, including random and k-NN baselines.

5 Conclusions

Legal ontologies are pivotal in legal informatics: they allow legal texts to be indexed with respect to concepts (classes, instances, and semantic relations) in computational ontologies such as the Privacy Ontology (PrOnto) and the European Legal Taxonomy Syllabus illustrated above.

However, since "legal discourse can never escape its own textuality" [Peller, 1985], it is fundamental to link concepts in an ontology to the textual spans of the legal documents referring to them; NLP techniques therefore need to be developed to semi-automatically assist and guide the population, or even the creation, of legal ontologies.

This deliverable described the research carried out in MIREL on the topic. The University of Luxembourg and the University of Bologna have worked together on the DAPRECO knowledge base, the biggest public knowledge base in LegalRuleML available online, which links norms in the General Data Protection Regulation (GDPR) with concepts from the Privacy Ontology (PrOnto). The University of Turin and the University of Luxembourg have worked together on the Eunomos system, which indexes legal texts with respect to the European Legal Taxonomy Syllabus; the MIREL partner Nomotika SRL distributes a commercial version of Eunomos, called MenslegiS. Finally, the University of Cordoba and INRIA carried out joint research on curriculum learning techniques to create legal ontologies by harvesting recurring relevant linguistic patterns from legal texts.

References

[Ajani et al., 2016] Ajani, G., Boella, G., Di Caro, L., Robaldo, L., Humphreys, L., Praduroux, S., Rossi, P., and Violato, A. The European Legal Taxonomy Syllabus: A multi-lingual, multi-level ontology framework to untangle the web of European legal terminology. Applied Ontology, 11(4):325-375, 2016.


[Allgower and Georg, 2012] Allgower, E. L. and Georg, K. Numerical Continuation Methods: An Introduction, volume 13. Springer Science & Business Media, 2012.

[Athan et al., 2015] Athan, T., Governatori, G., Palmirani, M., Paschke, A., and Wyner, A. LegalRuleML: Design principles and foundations. In W. Faber and A. Paschke, editors, The 11th Reasoning Web Summer School, pages 151-188. Springer, Berlin, 2015.

[Bandeira et al., 2016] Bandeira, J., Bittencourt, I. I., Espinheira, P., and Isotani, S. FOCA: A methodology for ontology evaluation. CoRR, 2016.

[Bengio et al., 2009] Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 41-48. ACM, New York, 2009.

[Boella et al., 2011] Boella, G., Di Caro, L., and Humphreys, L. Using classification to support legal knowledge engineers in the Eunomos legal document management system. In Fifth International Workshop on Juris-informatics (JURISIN), 2011.

[Boella et al., 2012] Boella, G., Di Caro, L., Humphreys, L., and Robaldo, L. Using legal ontology to improve classification in the Eunomos legal document and knowledge management system. In Semantic Processing of Legal Texts (SPLeT) at LREC 2012.

[Boella et al., 2014] Boella, G., Di Caro, L., Ruggeri, A., and Robaldo, L. Learning from syntax generalizations for automatic semantic annotation. Journal of Intelligent Information Systems, 43(2):231-246, 2014.

[Boella et al., 2016] Boella, G., Di Caro, L., Humphreys, L., Robaldo, L., Rossi, P., and van der Torre, L. Eunomos, a legal document and knowledge management system for the Web to provide relevant, reliable and up-to-date information on the law. Artificial Intelligence and Law, 24(245), 2016.

[Brank, Grobelnik, and Mladenić, 2005] Brank, J., Grobelnik, M., and Mladenić, D. A survey of ontology evaluation techniques. In Proceedings of the 8th International Multi-conference Information Society, 2005.

[Bruckschen et al., 2010] Bruckschen, M., Northfleet, C., da Silva, D., Bridi, P., Granada, R., Vieira, R., Rao, P., and Sander, T. Named entity recognition in the legal domain for ontology population. In 3rd Workshop on Semantic Processing of Legal Texts (SPLeT 2010).

[Buitelaar, 2010] Buitelaar, P. Ontology-based semantic lexicons: Mapping between terms and object descriptions. In Huang, C.-R., Calzolari, N., Gangemi, A., Lenci, A., Oltramari, A., and Prevot, L., editors, Ontology and the Lexicon, pages 212-223. Cambridge University Press, 2010.

[Cardellino et al., 2017] Cardellino, C., Teruel, M., Alonso Alemany, L., and Villata, S. A low-cost, high-coverage legal named entity recognizer, classifier and linker. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL 2017), 2017.


[Cohen, 1960] Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46, 1960.

[Corcho et al., 2005] Corcho, O., Fernández-López, M., Gómez-Pérez, A., and López-Cima, A. Building legal ontologies with METHONTOLOGY and WebODE. In Law and the Semantic Web, pages 142-157, 2005.

[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[Ehrmann et al., 2014] Ehrmann, M., Cecconi, F., Vannella, D., McCrae, J. P., Cimiano, P., and Navigli, R. A multilingual semantic network as linked data: lemon-BabelNet. In Proceedings of the 3rd Workshop on Linked Data in Linguistics, 2014.

[Fellbaum, 1998] Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[Finkel et al., 2005] Finkel, J. R., Grenager, T., and Manning, C. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 363-370. Association for Computational Linguistics, 2005.

[Giunchiglia and Zaihrayeu, 2009] Giunchiglia, F. and Zaihrayeu, I. Lightweight ontologies. In Encyclopedia of Database Systems, pages 1613-1619. Springer, Berlin, 2009.

[Graziadei, 2004] Graziadei, M. Tuttifrutti. In Birks, P. and Pretto, A., editors, Themes in Comparative Law. Oxford University Press, Oxford, 2004.

[Greene, 2001] Greene, J. Feature subset selection using Thornton's separability index and its applicability to a number of sparse proximity-based classifiers. In Proceedings of the Annual Symposium of the Pattern Recognition Association of South Africa, 2001.

[Gruber, 1993] Gruber, T. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199-220, 1993.

[Hahm et al., 2014] Hahm, Y., Park, J., Lim, K., Kim, Y., Hwang, D., and Choi, K.-S. Named entity corpus construction using Wikipedia and DBpedia ontology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC '14), Reykjavik, Iceland. European Language Resources Association (ELRA), 2014.

[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11:10-18, 2009.

[Hintze, 2017] Hintze, M. Viewing the GDPR through a de-identification lens: A tool for compliance, clarification and consistency. International Data Protection Law, 2017. Forthcoming.


[Hoekstra et al., 2007] Hoekstra, R., Breuker, J., Di Bello, M., and Boer, A. The LKIF core ontology of basic legal concepts. In Proceedings of the Workshop on Legal Ontologies and Artificial Intelligence Techniques (LOAIT 2007).

[Hohfeld, 1919] Hohfeld, W. Fundamental Legal Conceptions. Yale University Press, 1919.

[Humphreys et al., 2015] Humphreys, L., Boella, G., Robaldo, L., Di Caro, L., Cupi, L., Ghanavati, S., Muthuri, R., and van der Torre, L. Classifying and extracting elements of norms for ontology population using semantic role labelling. In Proceedings of the Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, 2015.

[IT Governance Privacy Team, 2017] IT Governance Privacy Team. EU General Data Protection Regulation (GDPR): An Implementation and Compliance Guide. IT Governance Ltd, second edition, 2017.

[Joachims, 1998] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, pages 137-142, 1998.

[Kelsen, 1941] Kelsen, H. The pure theory of law and analytical jurisprudence. Harvard Law Review, 55, 1941.

[Kelsen, 1992] Kelsen, H. An Introduction to the Problems of Legal Theory. Oxford University Press, Oxford, 1992.

[Lenci et al., 2009] Lenci, A., Montemagni, S., Pirrelli, V., and Venturi, G. Ontology learning from Italian legal texts. In Proceedings of the 2009 Conference on Law, Ontologies and the Semantic Web: Channelling the Legal Information Flood, 2009.

[Lesmo, 2009] Lesmo, L. The Turin University Parser at Evalita 2009. In Proceedings of EVALITA 2009.

[Linguistic Data Consortium, 2014] DEFT ERE annotation guidelines: Entities v1.7. http://nlp.cs.rpi.edu/kbp/2014/ereentity.pdf.

[Mandelkern, 2001] Mandelkern Group on Better Regulation: Final Report. Paris: EU, 2001.

[McCarthy, 1980] McCarthy, J. Circumscription: A form of nonmonotonic reasoning. Artificial Intelligence, 13(1-2):27-39, 1980.

[McCrae et al., 2011] McCrae, J., Spohr, D., and Cimiano, P. Linking lexical resources and ontologies on the semantic web with lemon. In Proceedings of the 8th Extended Semantic Web Conference (ESWC '11), Part I, pages 245-259. Springer, Berlin, Heidelberg, 2011.

[Palmirani et al., 2018a] Palmirani, M., Martoni, M., Rossi, A., Bartolini, C., and Robaldo, L. PrOnto: Privacy ontology for legal compliance. In Proceedings of the 18th European Conference on Digital Government (ECDG), 2018. Forthcoming.

[Palmirani et al., 2018b] Palmirani, M., Martoni, M., Rossi, A., Bartolini, C., and Robaldo, L. PrOnto: Privacy ontology for legal reasoning. In Proceedings of the 7th International Conference on Electronic Government and the Information Systems Perspective (EGOVIS), 2018. Forthcoming.

[Palmirani et al., 2018c] Palmirani, M., Martoni, M., Rossi, A., Bartolini, C., and Robaldo, L. PrOnto: Privacy ontology for legal reasoning. In Proceedings of the Internationales Rechtsinformatik Symposion (IRIS), 2018.

[Palmirani and Vitali, 2011] Palmirani, M. and Vitali, F. Akoma Ntoso for legal documents. In Legislative XML for the Semantic Web, pages 75-100, 2011.

[Palmirani and Brighi, 2006] Palmirani, M. and Brighi, R. Time model for managing the dynamic of normative system. In Electronic Government, pages 207-218. Springer, Berlin, 2006.

[Palmirani, 2011] Palmirani, M. Legislative change management with Akoma Ntoso. In Legislative XML for the Semantic Web, pages 101-130, 2011.

[Peller, 1985] Peller, G. The metaphysics of American law. California Law Review, 73:1151-1290, 1985.

[Platt, 1999] Platt, J. Sequential minimal optimization: A fast algorithm for training support vector machines. In Advances in Kernel Methods: Support Vector Learning, pages 185-208. MIT Press, 1999.

[Robaldo and Sun, 2017] Robaldo, L. and Sun, X. Reified Input/Output logic: Combining Input/Output logic and reification to represent norms coming from existing legislation. Journal of Logic and Computation, 27(8), 2017.

[Sacco, 1991] Sacco, R. Legal formants: A dynamic approach to comparative law (Installments I and II). American Journal of Comparative Law, 39, 1991.

[Salton and Buckley, 1988] Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

[Satariano, 2018] Satariano, A. What the G.D.P.R., Europe's tough new data law, means for you, and for the Internet. Online article, May 2018.

[Smith, 2008] Smith, B. Ontology. In The Blackwell Guide to the Philosophy of Computing and Information. Wiley, 2008.

[Spinosa et al., 2009] Spinosa, P., Giardiello, G., Cherubini, M., Marchi, S., Venturi, G., and Montemagni, S. NLP-based metadata extraction for legal text consolidation. In Proceedings of the 12th International Conference on Artificial Intelligence and Law. ACM, 2009.

[Suchanek et al., 2007] Suchanek, F. M., Kasneci, G., and Weikum, G. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), pages 697-706. ACM, New York, 2007.

[Surdeanu et al., 2010] Surdeanu, M., Nallapati, R., and Manning, C. D. Legal claim identification: Information extraction with hierarchically labeled data. In Proceedings of the LREC 2010 Workshop on Semantic Processing of Legal Texts (SPLeT 2010), Malta.


[Stanford NLP Group, 2016] Stanford Named Entity Recognizer (NER). http://nlp.stanford.edu/software/CRF-NER.shtml.

[Teruel et al., 2017a] Cardellino, C., Teruel, M., Alonso Alemany, L., and Villata, S. Legal NERC with ontologies, Wikipedia and curriculum learning. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

[Teruel et al., 2017b] Cardellino, C., Teruel, M., Alonso Alemany, L., and Villata, S. Learning slowly to learn better: Curriculum learning for legal ontology population. In Proceedings of the 30th Florida Artificial Intelligence Research Society Conference (FLAIRS), 2017.

[Teruel and Cardellino, 2017] Cardellino, C. and Teruel, M. In-domain or out-domain word embeddings? A study for legal cases. In Student Session of the European Summer School for Logic, Language and Information (ESSLLI 2017).

[Villegas and Bel, 2015] Villegas, M. and Bel, N. PAROLE/SIMPLE 'lemon' ontology and lexicons. Semantic Web, 6(4):363-369, 2015.
