arXiv:2010.12871v1 [cs.CL] 24 Oct 2020

Large Scale Legal Text Classification Using Transformer Models

Zein Shaheen, ITMO University, St. Petersburg, Russia
Gerhard Wohlgenannt, ITMO University, St. Petersburg, Russia
Erwin Filtz, Vienna University of Economics and Business (WU), Vienna, Austria

Abstract: Large multi-label text classification is a challenging Natural Language Processing (NLP) problem that is concerned with text classification for datasets with thousands of labels. We tackle this problem in the legal domain, where datasets such as JRC-Acquis and EURLEX57K, labeled with the EuroVoc vocabulary, were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates in order to reach competitive classification performance, and present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study, and provide reference dataset splits created with an iterative stratification algorithm.

Keywords: multi-label text classification; legal document datasets; transformer models; EuroVoc.

I. INTRODUCTION

Text classification, i.e., the process of assigning one or multiple categories from a set of options to a document [1], is a prominent and well-researched task in Natural Language Processing (NLP) and text mining. Text classification variants include simple binary classification (for example, deciding if a document is spam or not spam), multi-class classification (selection of one from a number of classes), and multi-label classification. In the latter, multiple labels can be assigned to a single document. In Large Multi-Label Text Classification (LMTC), the label space typically comprises thousands of labels, which raises task complexity. The work presented here tackles an LMTC problem in the legal domain.

LMTC tasks often occur when large taxonomies or formal ontologies are used as document labels, for example in the medical domain [2] [3], or when using large open domain taxonomies for labelling, such as annotating Wikipedia with labels [4]. A common feature of many LMTC tasks is that some labels are used frequently, while others are used very rarely (few-shot learning) or are never used (zero-shot learning). This situation is also referred to as a power-law or long-tail frequency distribution of labels, which also characterizes our datasets and which is a setting that is largely unexplored for text classification [3]. Another difficulty often faced in LMTC datasets [3] is long documents, where finding the relevant areas to correctly classify a document is a needle-in-a-haystack situation.

In this work, we focus on LMTC in the legal domain, based on two datasets, the well-known JRC-Acquis dataset [5] and the new EURLEX57K dataset [6]. Both datasets contain legal documents from Eur-Lex [7], the legal database of the European Union (EU). The usage of language in the given documents is highly domain specific, and includes many legal text artifacts such as case numbers. Modern neural NLP algorithms often tackle domain specific text by fine-tuning pretrained language models on the type of text at hand [8]. Both datasets are labelled with terms from the European Union’s multilingual and multidisciplinary thesaurus EuroVoc [9].

The goal of this work is to advance the state of the art in LMTC based on these two datasets, which exhibit many of the characteristics often found in LMTC datasets: power-law label distribution, highly domain specific language, and a large and hierarchically organized set of labels. We apply current NLP transformer models, namely BERT [10], RoBERTa [11], DistilBERT [12], XLNet [13] and M-BERT [10], and combine them with a number of training strategies such as gradual unfreezing, slanted triangular learning rates and language model fine-tuning. In the process, we create new standard dataset splits for JRC-Acquis and EURLEX57K using an iterative stratification approach [14]. Providing a high-quality standardized dataset split is very important, as previous work was typically done on different random splits, which makes results hard to compare [15]. Further, we make use of the semantic relations inside the EuroVoc taxonomy to infer reduced label sets for the datasets. Some of our main evaluation results are the Micro-F1 scores of 0.661 for JRC-Acquis and 0.754 for EURLEX57K, which, to the best of our knowledge, set a new state of the art.

The main findings and contributions of this work are: (i) the experiments with BERT, RoBERTa, DistilBERT, XLNet, M-BERT (trained on three languages), and AWD-LSTM in combination with the training tricks to evaluate and compare the performance of the models, (ii) providing new standardized datasets for further investigation, (iii) ablation studies to measure the impact and benefits of various training strategies, and (iv) leveraging the EuroVoc term hierarchy to generate variants of the datasets for which higher classification performance can be achieved.

The remainder of the paper is organized as follows: After a discussion of related work in Section II, we introduce the EuroVoc vocabulary and the two datasets (Section III), and then present the main methods (AWD-LSTM, BERT, RoBERTa, DistilBERT, XLNet) in Section IV. Section V contains extensive evaluations of the methods on both datasets as well as ablation studies, and after a discussion of results (Section VI) we conclude the paper in Section VII.

II. RELATED WORK

In connection with the JRC-Acquis dataset, Steinberger et al. [16] present the “JRC EuroVoc Indexer JEX”, by the Joint Research Centre (JRC) of the European Commission. The tool categorizes documents using the EuroVoc taxonomy by employing a profile-based ranking task; the authors report an F-score between 0.44 and 0.54 depending on the document language. Boella et al. [17] apply a support vector machine approach to the problem by transforming the multi-label classification problem into a single-label problem. Liu et al. [18] present a new family of Convolutional Neural Network (CNN) models tailored for multi-label text classification. They compare their method to a large number of existing approaches on various datasets; for the EurLex/JRC dataset, however, another method (SLEEC) provided the best results. SLEEC (Sparse Local Embeddings for Extreme Classification) [19] creates local distance preserving embeddings which are able to accurately predict infrequently occurring (tail) labels. The precision results for SLEEC reported in Liu et al. [18] are P@1: 0.78, P@3: 0.64 and P@5: 0.52; however, they use a previous version of the JRC-Acquis dataset with only 15.4K documents.

Chalkidis et al. [6] recently published their work on the new EURLEX57K dataset. The dataset will be described in more detail (incl. dataset statistics) in the next sections. Chalkidis et al. also provide a strong baseline for LMTC on this dataset. Among the tested neural architectures operating on the full documents, they obtain the best results with BiGRUs with label-wise attention. As input representation they use either GloVe [20] embeddings trained on domain text, or ELMo embeddings [21]. The authors investigated using only the first zones of the (long) documents for classification, and show that using only the title and recitals part of each document leads to almost the same performance as considering the full document [6]. This helps to alleviate BERT’s limitation of having a maximum of 512 tokens as input. Using only the first 512 tokens of each document as input, BERT [10] achieves the best performance overall. The work of Chalkidis et al. is inspired by You et al. [22], who experimented with RNN-based methods with self attention on five LMTC datasets (RCV1, Amazon-13K, Wiki-30K, Wiki-500K, and EUR-Lex-4K).

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix ev: <http://eurovoc.europa.eu/> .
    @prefix evs: <http://eurovoc.europa.eu/schema#> .

    <http://eurovoc.europa.eu/100142>
        rdf:type evs:Domain ;
        skos:prefLabel "04 POLITICS"@en .
    <http://eurovoc.europa.eu/100166>
        rdf:type evs:MicroThesaurus ;
        skos:prefLabel "0421 parliament"@en ;
        dcterms:subject ev:100142 ;
        skos:hasTopConcept ev:41 .
    <http://eurovoc.europa.eu/41>
        rdf:type evs:ThesaurusConcept ;
        skos:prefLabel "powers of parliament"@en ;
        skos:inScheme ev:100166 .
    <http://eurovoc.europa.eu/1599>
        rdf:type evs:ThesaurusConcept ;
        skos:prefLabel "legislative period"@en ;
        skos:inScheme ev:100166 ;
        skos:broader ev:41 .

Figure 1. EuroVoc example

A. EuroVoc

The datasets we use for our experiments contain legal documents from the legal information system of the European Union (Eur-Lex) and are classified into a common classification schema, the EuroVoc [9] thesaurus, published and maintained by the Publications Office of the European Union since 1982. The EuroVoc thesaurus was introduced to harmonize the classification of documents in the communications across EU institutions and to enable multilingual search, as the thesaurus provides all its terms in the official languages of the EU member states. It is organized based on the Simple Knowledge Organization System (SKOS) [23], which encodes data using the Resource Description Framework (RDF) [24] and is well-suited to represent hierarchical relations between terms in a thesaurus like EuroVoc. EuroVoc uses SKOS to hierarchically organize its concepts into 21 domains, for instance Law, Trade or Politics, to name a few. Each domain contains multiple microthesauri (127 in total), which in turn have in total around 600 top terms. About 7K terms (also called
