Esis Titled, “MULTIMODAL LEGAL IN- FORMATION RETRIEVAL” and the Work Presented in It Are My Own

PhD-FSTC-2018-31 The Faculty of University of Bologna Sciences, Technology and Law School Communication DISSERTATION Defence held on 27/04/2018 in Bologna to obtain the degree of DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG EN INFORMATIQUE AND DOTTORE DI RICERCA IN Law, Science and Technology by Kolawole John ADEBAYO Born on 31 January 1986 in Oyo (Nigeria). MULTIMODAL LEGAL INFORMATION RETRIEVAL Dissertation Defence Committee Dr. Leon van der Torre, Dissertation Supervisor Professor, Université du Luxembourg, Luxembourg Dr. Guido Boella, Dissertation Supervisor Professor, Università degli Studi di Torino, Italy Dr. Marie-Francine Moens, Chairman Professor, Katholieke Universiteit, Belgium Dr. Henry Prakken, Vice-Chairman Professor, Universiteit Utrecht, Netherlands Dr. Schweighofer Erich, Member Professor, Professor, Universität Wien, Austria Dr. Monica Palmirani, Discussant Professor, Università di Bologna, Italy Dr. Luigi Di Caro, Discussant Professor, Università degli Studi di Torino, Italy iii Declaration of Authorship I, Kolawole John ADEBAYO, declare that this thesis titled, “MULTIMODAL LEGAL IN- FORMATION RETRIEVAL” and the work presented in it are my own. I confirm that: • This work was done wholly or mainly while in candidature for a research degree at this University. • Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated. • Where I have consulted the published work of others, this is always clearly at- tributed. • Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. • I have acknowledged all main sources of help. • Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself. Signed: Date: v “The limits of my language are the limits of my world.” Ludwig Wittgenstein vii Abstract Kolawole John ADEBAYO MULTIMODAL LEGAL INFORMATION RETRIEVAL The goal of this thesis is to present a multifaceted way of inducing semantic represen- tation from legal documents as well as accessing information in a precise and timely manner. The thesis explored approaches for semantic information retrieval (IR) in the Legal context with a technique that maps specific parts of a text to the relevant concept. This technique relies on text segments, using the Latent Dirichlet Allocation (LDA), a topic modeling algorithm for performing text segmentation, expanding the concept using some Natural Language Processing techniques, and then associating the text segments to the concepts using a semi-supervised Text Similarity technique. This solves two problems, i.e., that of user specificity in formulating query, and information over- load, for querying a large document collection with a set of concepts is more fine-grained since specific information, rather than full documents is retrieved. The second part of the thesis describes our Neural Network Relevance Model for E-Discovery Information Re- trieval. Our algorithm is essentially a feature-rich Ensemble system with different compo- nent Neural Networks extracting different relevance signal. This model has been trained and evaluated on the TREC Legal track 2010 data. The performance of our models across board proves that it capture the semantics and relatedness between query and document which is important to the Legal Information Retrieval domain. Subject: Legal Informatics. Keywords: Convolutional Neural Network, Concept, Concept-based IR, CNN, E-Discovery, Eurovoc, EurLex, Information Retrieval, Document Retrieval, Legal Information Retrieval, Semantic Annotation, Semantic Similarity, Latent Dirichlet Allocation, LDA, Long Short- Term Memory, LSTM, Natural Language Processing, Neural Information Retrieval Neu- ral Networks, Text Segmentation, Topic Modeling viii Il Riassunto L’obiettivo di questa tesi è quello di presentare un modo sfaccettato di accesso alle informazioni da un curpus di documenti legali in modo preciso ed efficiente. Il lavoro inizia con un’esplorazione degli approcci relativi al recupero di informazioni semantiche (Information Retrieval o IR) nel contesto giuridico con una tecnica che mappa alcune parti di un testo a specifici concetti di una ontologia, basandosi su una segmentazione semantica dei testi. Tecniche di elaborazione del linguaggio naturale vengono poi uti- lizzate per associare i segmenti di testo ai concetti utilizzando una tecnica di similarità testuale. Pertanto, interrogando un documento legale di grandi dimensioni con una serie di concetti è possibile recuperare segmenti di testo ad una grana più fine, piuttosto che i documenti completi originari. La tesi si conclude con la descrizione di un classificatore di reti neurali per l’E-Discovery. Questo modello è stato addestrato e valutato sui dati della legal track TREC 2010, ottenendo una performance in grado di dimostrare che le più recenti tecniche di neural computing possono fornire buone soluzioni al recupero di informazioni che vanno dalla gestione dei documenti, di informazioni aziendali e di sce- nario relativi all’E-Discovery. Oggetto: Informatica legale. Parole chiave: Recupero di informazioni, Annotazione semantica, Somiglianza semantica, Allineamento testuale, Risposte automatiche a domande legali, Reti neurali, Eurlex, Eurovoc, Estrazione di keyphrase. ix Acknowledgements I would like to thank the almighty God, the giver of life and the one in whom abso- lute power and grace resides. Many people have made the completion of this doctoral programme a reality. My thanks go to Prof. Monica Palmirani, the coordinator of the LAST-JD programme, and other academic committee members for finding me suitable for this doctoral programme. I thank my supervisors, Prof. Guido Boella, Prof. Leon Van Der Torre, and Dr. Luigi Di Caro for their guidance, advise, countless periods of discussion, and for putting up with my numerous deadline-day paper review request. Guido has been a father figure, and I could not have asked for more. Leon gave me tremendous support throughout the Ph.D. journey. I have participated at many conferences partly due to him. Luigi, a big part of the compliment goes to you! You never stopped reminding me that I could do better. I thank my wife, Oluwatumininu, and beautiful daughters -Joyce and Hillary for the love and understanding even when I am miles and months away. This thesis is only possible because of you! I thank my mum, Modupe, and siblings -Olaide, Adeboyin, Adefemi, Abiola and Oluwatobi for their love and support at all time. I do not forget other family members whom I cannot mention for the sake of space. God keep and bless you all. I thank the friends and colleagues that I have met during the period of the doctoral research, You all have made my stay in the many countries where I have worked memo- rable. Marc Beninati, thanks for being a friend and brother. To Livio Robaldo, thanks for your encouragement, you gave me fire when I needed it the most!,Dina Ferrari, thanks for your usual support. Rohan, thanks for our many useful discussions. To Antonia - my Landlady in Barcelona, and Prof. Roig -my host, you made Barcelona to be eternally engraved in my heart. Lastly, I thank the European Commission without whose scholarship I would not have been a part of this prestigious academic programme. xi Contents Declaration of Authorship iii Abstract vii Acknowledgements ix Contents xi List of Figures xv List of Tables xvii I General Introduction1 1 INTRODUCTION3 1.1 Introduction....................................3 1.2 Legal Information Retrieval...........................4 1.3 Motivation.....................................6 1.4 Problem Statement................................9 1.5 Thesis Question.................................. 13 1.5.1 Research Approach............................ 13 1.5.2 Research Goal............................... 14 1.6 Contribution to Knowledge........................... 14 1.7 Thesis Outline................................... 16 1.8 Publication..................................... 17 1.9 Chapter Summary................................. 18 2 INFORMATION RETRIEVAL AND RETRIEVAL MODELS 19 2.1 What is Information Retrieval?......................... 19 2.2 The Goal of an Information Retrieval System................. 21 2.3 Desiderata of an Information Retrieval System................ 22 2.4 Definition..................................... 25 2.4.1 Document................................. 25 2.4.2 Electronically Stored Information (ESI)................ 26 2.4.3 Collection................................. 26 2.5 The Notion of Relevance............................. 26 xii 2.6 Information Retrieval Models.......................... 28 2.6.1 Boolean Model.............................. 29 2.6.2 Vector Space model............................ 31 Term Weighing Approaches....................... 34 Latent Semantic Indexing........................ 35 2.6.3 Probabilistic Models........................... 36 2.6.4 Language Models............................. 38 2.7 Evaluation of Information Retrieval Systems................. 40 2.7.1 Precision.................................. 41 2.7.2 Recall.................................... 42 2.7.3 F-Measure................................. 42 2.7.4 Mean Average Precision......................... 43 2.7.5 Normalized Discounted Cumulative Gain.............. 43 2.7.6 Accuracy.................................. 44 2.8 Approaches for Improving

Esis Titled, “MULTIMODAL LEGAL IN- FORMATION RETRIEVAL” and the Work Presented in It Are My Own

Domain-Based Sense Disambiguation in Multilingual Structured Data

Role of Natural Language Processing in Information Retrieval; Challenges and Opportunities Khaled M

Topical Sequence Profiling and Cluster Labeling Sequence Flows from Left to Right, Thetopicsfromtopto Withrespecttodesiredtopiclabelproperties

BI SEARCH and TEXT ANALYTICS New Additions to the BI Technology Stack

Auto-Relevancy and Responsiveness Baseline II

Semantic Analysis of Tweets Using LSA and SVD

Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications

Analysis on Semantic Web Information Retrieval Methodologies

Reducing Polysemy in Wordnet

TWO-LAYERED ARCHITECTURE for PEER-TO-PEER CONCEPT SEARCH Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Ul

Falcons Concept Search: a Practical Exploring Engine for Net Ontologies

Semantic Similarity Analysis and Application in Knowledge Graphs