
Large Scale Legal Text Classification Using Transformer Models

Zein Shaheen, ITMO University, St. Petersburg, Russia
Gerhard Wohlgenannt, ITMO University, St. Petersburg, Russia
Erwin Filtz, Vienna University of Economics and Business (WU), Vienna, Austria

arXiv:2010.12871v1 [cs.CL] 24 Oct 2020

Abstract: Large multi-label text classification is a challenging Natural Language Processing (NLP) problem that is concerned with text classification for datasets with thousands of labels. We tackle this problem in the legal domain, where datasets such as JRC-Acquis and EURLEX57K, labeled with the EuroVoc vocabulary, were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates in order to reach competitive classification performance, and present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study, and provide reference dataset splits created with an iterative stratification algorithm.

Keywords: multi-label text classification; legal document datasets; transformer models; EuroVoc.

I. INTRODUCTION

Text classification, i.e., the process of assigning one or multiple categories from a set of options to a document [1], is a prominent and well-researched task in Natural Language Processing (NLP) and text mining. Text classification variants include simple binary classification (for example, deciding whether a document is spam or not spam), multi-class classification (selection of one from a number of classes), and multi-label classification. In the latter, multiple labels can be assigned to a single document. In Large Multi-Label Text Classification (LMTC), the label space typically comprises thousands of labels, which raises task complexity. The work presented here tackles an LMTC problem in the legal domain.

LMTC tasks often occur when large taxonomies or formal ontologies are used as document labels, for example in the medical domain [2] [3], or when using large open domain taxonomies for labelling, such as annotating with labels [4]. A common feature of many LMTC tasks is that some labels are used frequently, while others are used very rarely (few-shot learning) or are never used (zero-shot learning). This situation is also referred to as a power-law or long-tail frequency distribution of labels, which also characterizes our datasets and which is a setting that is largely unexplored for text classification [3]. Another difficulty often faced in LMTC datasets [3] is long documents, where finding the relevant areas to correctly classify documents is a needle-in-a-haystack situation.

In this work, we focus on LMTC in the legal domain, based on two datasets, the well-known JRC-Acquis dataset [5] and the new EURLEX57K dataset [6]. Both datasets contain legal documents from Eur-Lex [7], the legal database of the European Union (EU). The usage of language in the given documents is highly domain specific, and includes many legal text artifacts such as case numbers. Modern neural NLP algorithms often tackle domain specific text by fine-tuning pretrained language models on the type of text at hand [8]. Both datasets are labelled with terms from the European Union's multilingual and multidisciplinary thesaurus EuroVoc [9].

The goal of this work is to advance the state of the art in LMTC based on these two datasets, which exhibit many of the characteristics often found in LMTC datasets: power-law label distribution, highly domain specific language and a large and hierarchically organized set of labels. We apply current NLP transformer models, namely BERT [10], RoBERTa [11], DistilBERT [12], XLNet [13] and M-BERT [10], and combine them with a number of training strategies such as gradual unfreezing, slanted triangular learning rates and language model fine-tuning. In the process, we create new standard dataset splits for JRC-Acquis and EURLEX57K using an iterative stratification approach [14]. Providing a high-quality standardized dataset split is very important, as previous work was typically done on different random splits, which makes results hard to compare [15]. Further, we make use of the semantic relations inside the EuroVoc taxonomy to infer reduced label sets for the datasets. Some of our main evaluation results are the Micro-F1 scores of 0.661 for JRC-Acquis and 0.754 for EURLEX57K, which, to the best of our knowledge, set a new state of the art.

The main findings and contributions of this work are: (i) the experiments with BERT, RoBERTa, DistilBERT, XLNet, M-BERT (trained on three languages), and AWD-LSTM in combination with the training tricks to evaluate and compare the performance of the models, (ii) providing new standardized datasets for further investigation, (iii) ablation studies to measure the impact and benefits of various training strategies, and (iv) leveraging the EuroVoc term hierarchy to generate variants of the datasets for which higher classification performance can be achieved.

The remainder of the paper is organized as follows: After a discussion of related work in Section II, we introduce the EuroVoc vocabulary and the two datasets (Section III), and then present the main methods (AWD-LSTM, BERT, RoBERTa, DistilBERT, XLNet) in Section IV. Section V contains extensive evaluations of the methods on both datasets as well as ablation studies, and after a discussion of results (Section VI) we conclude the paper in Section VII.

II. RELATED WORK
In connection with the JRC-Acquis dataset, Steinberger et al. [16] present the "JRC EuroVoc Indexer JEX", by the Joint Research Centre (JRC) of the European Commission. The tool categorizes documents using the EuroVoc taxonomy by employing a profile-based ranking task; the authors report an F-score between 0.44 and 0.54 depending on the document language. Boella et al. [17] apply a support vector machine approach to the problem by transforming the multi-label classification problem into a single-label problem. Liu et al. [18] present a new family of Convolutional Neural Network (CNN) models tailored for multi-label text classification. They compare their method to a large number of existing approaches on various datasets; for the EurLex/JRC dataset, however, another method (SLEEC) provided the best results. SLEEC (Sparse Local Embeddings for Extreme Classification) [19] creates local distance preserving embeddings which are able to accurately predict infrequently occurring (tail) labels. The precision results for SLEEC reported in Liu et al. [18] are P@1: 0.78, P@3: 0.64 and P@5: 0.52; however, they use a previous version of the JRC-Acquis dataset with only 15.4K documents.

Chalkidis et al. [6] recently published their work on the new EURLEX57K dataset. The dataset will be described in detail (incl. dataset statistics) in the next sections. Chalkidis et al. also provide a strong baseline for LMTC on this dataset. Among the tested neural architectures operating on the full documents, they obtain the best results with BIGRUs with label-wise attention. As input representation they use either GloVe [20] embeddings trained on domain text, or ELMO embeddings [21]. The authors investigated using only the first zones of the (long) documents for classification, and show that using only the title and recitals part of each document leads to almost the same performance as considering the full document [6]. This helps to alleviate BERT's limitation of having a maximum of 512 tokens as input. Using only the first 512 tokens of each document as input, BERT [10] achieves the best performance overall. The work of Chalkidis et al. is inspired by You et al. [22], who experimented with RNN-based methods with self attention on five LMTC datasets (RCV1, Amazon-13K, Wiki-30K, Wiki-500K, and EUR-Lex-4K). Similar work has been done in the medical domain: Mullenbach et al. [2] investigate label-wise attention in LMTC for medical code prediction (on the MIMIC-II and MIMIC-III datasets).

In this work, we experiment with BERT, RoBERTa, DistilBERT, XLNet, M-BERT and AWD-LSTM. We provide ablation studies to measure the impact of various training strategies and heuristics. Moreover, we provide new standardized datasets for further investigation by the research community, and leverage the EuroVoc term hierarchy to generate variants of the datasets.

III. DATASETS AND EUROVOC VOCABULARY

In this section, we first introduce the multilingual EuroVoc thesaurus, which is used to classify legal documents published by the institutions of the European Union. The EuroVoc thesaurus is also used as a classification schema for the documents contained in the two legal datasets we use for our experiments, the JRC-Acquis V3 and EURLEX57K datasets, which are described in this section.

A. EuroVoc

The datasets we use for our experiments contain legal documents from the legal information system of the European Union (Eur-Lex) and are classified into a common classification schema, the EuroVoc [9] thesaurus, published and maintained by the Publications Office of the European Union since 1982. The EuroVoc thesaurus has been introduced to harmonize the classification of documents in the communications across EU institutions and to enable a multilingual search, as the thesaurus provides all its terms in the official languages of the EU member states. It is organized based on the Simple Knowledge Organization System (SKOS) [23], which encodes data using the Resource Description Framework (RDF) [24] and is well-suited to represent hierarchical relations between terms in a thesaurus like EuroVoc. EuroVoc uses SKOS to hierarchically organize its concepts into 21 domains, for instance Law or Politics. Each domain contains multiple microthesauri (127 in total), which in turn have in total around 600 top terms. About 7K terms (also called descriptors, concepts or labels) are assigned to one or multiple microthesauri and connected to top terms using the predicate skos:broader.

All concepts in EuroVoc have a preferred (skos:prefLabel) label and non-preferred (skos:altLabel) label for each language; the label language is indicated with language tags. Figure 1 illustrates, with an example serialized in Turtle (TTL) [25] format, how the terms are organized in the EuroVoc thesaurus. Our example is from the domain 04 POLITICS and we show only the English labels of the concepts. The domain 04 POLITICS has the EuroVoc ID ev:100142 and is of rdf:type evs:Domain. Each domain has microthesauri as the next lower level in the hierarchy. In this example, we can see that an evs:MicroThesaurus named 0421 parliament is assigned to the 04 POLITICS domain using dcterms:subject ev:100142 and is also connected to the next lower level of top terms. The top term powers of parliament (ev:41) is linked to the microthesaurus using skos:inScheme. Finally, the lowest level in this example is the concept legislative period (ev:1599), which is linked to its top term powers of parliament (skos:broader ev:41), and is also directly linked to the microthesaurus 0421 parliament, to which it belongs, using skos:inScheme.

    @prefix rdf: .
    @prefix skos: .
    @prefix dcterms: .
    @prefix ev: .
    @prefix evs: .

    ev:100142 rdf:type evs:Domain ;
        skos:prefLabel "04 POLITICS"@en .

    ev:100166 rdf:type evs:MicroThesaurus ;
        skos:prefLabel "0421 parliament"@en ;
        dcterms:subject ev:100142 ;
        skos:hasTopConcept ev:41 .

    ev:41 rdf:type evs:ThesaurusConcept ;
        skos:prefLabel "powers of parliament"@en ;
        skos:inScheme ev:100166 .

    ev:1599 rdf:type evs:ThesaurusConcept ;
        skos:prefLabel "legislative period"@en ;
        skos:inScheme ev:100166 ;
        skos:broader ev:41 .

Figure 1. EuroVoc example

The legal documents are annotated with multiple EuroVoc classes, typically on the lowest level, which results in a huge number of available classes a document can potentially be classified in. In addition, this also comes with the disadvantage of the power-law distribution of labels, such that some labels are assigned to many documents whereas others are only assigned to a few documents or to no documents at all. The advantages of using a multilingual and multi-domain thesaurus for document classification are manifold. Most importantly, it allows us to reduce the number of potential classes by going up the hierarchy, which does not make classification incorrect but only more general. Reducing the number of labels allows us to compare the efficiency of the model for different label sets, which vary in size and sparsity. Along this line, we use a class reduction method to generate datasets with a reduced number of classes by replacing the original labels with the top terms, microthesauri or domains they belong to. For the top terms dataset, we leverage the skos:broader relations of the original descriptors, for the microthesauri dataset we follow skos:inScheme links to the microthesauri, and the domains dataset is inferred via the dcterms:subject links of the microthesauri. This process creates three additional datasets (top terms, microthesauri, domains) [26]. Furthermore, such a thesaurus would also allow the incorporation of potentially more fine-grained national thesauri of member states, which could be aligned with EuroVoc and therefore enable multilingual search in an extended thesaurus.

B. Legal Text Datasets

In this work we focus on legal documents collected from the Eur-Lex [7] database serving as the official site for retrieving [...], such as [...], international agreements and legislation, and case law of the European Union (EU). Eur-Lex provides the documents in the official languages of the EU member states. As discussed in previous work [26], the documents are well structured and written in domain specific language. Furthermore, legal documents are typically longer than texts often used for text classification tasks, such as the Reuters-21578 dataset containing news articles.

In this paper, we use the English versions of the two legal datasets JRC-Acquis V3 [27] and EURLEX57K [28]. The JRC-Acquis V3 dataset has been compiled by the Joint Research Centre (JRC) of the European Union, with the Acquis Communautaire being the applicable EU law, and contains documents in XML format. Each JRC document is divided into body, signature, annex and descriptors. The EURLEX57K dataset has been prepared by academia [6] and is provided in JSON format structured into several parts, namely the header including title and legal body, recitals (legal background references), the main body (organized in articles) and the attachments (appendices, annexes). Furthermore, and in contrast to JRC-Acquis, the EURLEX57K dataset is already provided with a split into train and test sets.

Table I shows a comparison of the dataset characteristics. EURLEX57K contains almost three times as many documents as the JRC-Acquis V3 dataset, but the documents are comparable in their minimum number of tokens, median and mode of tokens per document. The large difference in the maximum number of tokens per document impacts the standard deviation and the mean number of tokens. The reason for this difference is that JRC-Acquis also includes documents dealing with the budget of the European Union, comprised of many tables. As both datasets originate from the same source, but with different providers, we analyzed the number of documents contained in both datasets and found an overlap of approx. 12%.

TABLE I. DATASET STATISTICS FOR JRC-ACQUIS AND EURLEX57K.

                      JRC-Acquis   EURLEX57K
#Documents            20382        57000
Max #Tokens/Doc       469820       3934
Min #Tokens/Doc       21           119
Mean #Tokens/Doc      2243.43      758.46
StdDev #Tokens/Doc    7075.94      542.86
Median #Tokens/Doc    651.0        544
Mode #Tokens/Doc      275          275

Table II provides an overview of label statistics for both datasets. We created different versions based on the original descriptors (DE), top terms (TT), microthesauri (MT) and domains (DO) and present the numbers for all versions. The maximum number of labels assigned to a single document is similar for both datasets. The average number of labels per document in the original (DE) version is 5.46 (JRC-Acquis) and 5.07 (EURLEX57K). Due to the polyhierarchy in the geography domain, a label can be assigned to multiple Top Terms; therefore the number of Top Term labels is higher than that of the original descriptors.

TABLE II. DATASET STATISTICS: NUMBER OF LABELS PER DOCUMENT.

          JRC-Acquis                  EURLEX57K
Label     DE     TT     MT     DO     DE     TT     MT     DO
Max       24     30     14     10     26     30     15     9
Min       1      1      1      1      1      1      1      1
Mean      5.46   6.04   4.74   3.39   5.07   5.94   4.55   3.24
StdDev    1.73   3.14   1.92   1.17   1.7    3.06   1.82   1.04
Median    6      5      5      3      5      5      4      3
Mode      6      4      4      3      6      4      4      3

Figure 2 visualizes the power-law (long tail) label distribution, where a large portion of EuroVoc descriptors is used rarely (or never) as document annotations. In the JRC-Acquis dataset only 50% of the labels available in EuroVoc are used to classify documents. Only 417 labels are used frequently (used on more than 50 documents) and 3,3147 labels have a frequency between 1 and 50 (few-shot). The numbers for the EURLEX57K dataset are similar [6], with 59.31% of all EuroVoc labels being actually present in EURLEX57K. From those labels, 746 are frequent, 3,362 have a frequency between 1 and 50, and 163 are only in the testing, but not in the training, dataset split (zero-shot). The high number of infrequent labels is a challenge when using supervised learning approaches.

IV. METHODS

In this section we describe the methods used in the LMTC experiments presented in the evaluation section, and the general training process. Furthermore, we discuss important related points such as language model pretraining and fine-tuning, and discriminative learning rates, and other important foundations for the evaluation section like dataset splitting and multilingual training.

A. General Training Strategy and Implementation

In accordance with common NLP practice, as first introduced by Howard and Ruder for text classification [29], we

train our models in two steps: first we fine-tune the language modeling part of the model to the target corpus (JRC-Acquis or EURLEX57K), and then we train the classifier on the training split of the dataset.

Figure 2. Power-law distribution of descriptors in the JRC-Acquis dataset.

The baseline model (AWD-LSTM) and the transformer models are available with pretrained weights, trained with language modelling objectives on large corpora such as Wikitext or Webtext, a process that is computationally very expensive. Fine-tuning allows us to transfer the language modeling capabilities to a new domain [29].

Our implementation makes use of the FastAI library [30], which includes the basic infrastructure to apply training strategies like gradual unfreezing or slanted triangular learning rates (see below). Moreover, for the transformer models, we integrate the Hugging Face transformers package [31] with FastAI. Our implementation, including the evaluation results, is available on GitHub [32]. The repository also includes the reference datasets created with iterative splitting, which can be used by other researchers in order to have a fair comparison of different approaches in the future.

B. Tricks for Performance Improvement (within FastAI)

In their Universal Language Model Fine-tuning for Text Classification (ULMFiT) approach, Howard and Ruder [29] propose a number of training strategies and tricks to improve model performance, which are available within the FastAI library. Firstly, based on the idea that early layers in a deep neural network capture more general and basic features of language, which need little domain adaptation, discriminative fine-tuning applies different learning rates depending on the layer; earlier layers use smaller learning rates compared to later layers. Secondly, slanted triangular learning rates quickly increase the learning rate at the beginning of a training epoch up to the maximal learning rate in order to find a suitable region of the parameter space, and then slowly reduce the learning rate to refine the parameters. And finally, in gradual unfreezing the training process is divided into multiple cycles, where each cycle consists of several training epochs. Training starts after freezing all layers except for the last few layers in cycle one; during later cycles more layers are unfrozen gradually (from last to first layers). The intuition, similar to that behind discriminative fine-tuning, is that when fine-tuning a deep learning model the later layers are more task and domain specific and need more fine-tuning. In the evaluation section, we provide details about our unfreezing strategy (Table IV).

C. Baseline Model

We use AWD-LSTM [33] as a baseline model. Merity et al. [33] investigate different strategies for regularizing word-level LSTM language models, including the weight-dropped LSTM with its recurrent regularization, and they introduce NT-ASGD as a new version of average stochastic gradient descent in AWD-LSTM.

In the ULMFiT approach [29] of FastAI, AWD-LSTM is used as encoder, with extra layers added on top for the classification task.

For any of the models (AWD-LSTM and transformers) we apply the basic method discussed above: a) fine-tune the language model on all documents (ignoring the labels) of the dataset (JRC-Acquis or EURLEX57K), and then b) fine-tune the classifier using the training split of the dataset.

D. Transformer Models

In the experiments we study the performance of BERT, RoBERTa, DistilBERT and XLNet on the given text classification tasks. BERT is an early, and very popular, transformer model; RoBERTa is a modified version of BERT trained on a larger corpus; DistilBERT is a distilled version of BERT and thereby has lower computational cost; and finally, XLNet can be fed with longer input token sequences.

BERT: BERT [10] is a bidirectional language model which aims to learn contextual relations between words using the transformer architecture [34]. We use an official release of the pre-trained models; details about the specific hyperparameters are found in Section V-A.

The input to BERT is either a single text (a sentence or document), or a text pair. The first token of each sequence is the special classification token [CLS], followed by WordPiece tokens of the first text A, then a separator token [SEP], and (optionally) after that WordPiece tokens for the second text B.

In addition to token embeddings, BERT uses positional embeddings to represent the position of tokens in the sequence. For training, BERT applies Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. In MLM, BERT randomly masks 15% of all WordPiece tokens in each sequence and learns to predict these masked tokens. For NSP, BERT is fed in 50% of cases with the actual next sentence B, and in the other cases with a random sentence B from the corpus.

RoBERTa: RoBERTa, introduced by Liu et al. [11], retrains BERT with an improved methodology, much more data, larger batch size and longer training times. In RoBERTa the training strategy of BERT is modified by removing the NSP objective. Further, RoBERTa uses byte pair encoding (BPE) as a tokenization algorithm instead of the WordPiece tokenization in BERT.

DistilBERT: We use a distilled version of BERT released by Sanh et al. [12]. DistilBERT provides a lighter and faster version of BERT, reducing the size of the model by 40% while retaining 97% of its capabilities on language understanding tasks [12]. The distillation process includes training a complete BERT model (the teacher) using the improved methodology proposed by Liu et al. [11]; then DistilBERT (the student) is trained to reproduce the behaviour of the teacher by using cosine embedding loss.

XLNet: The previously discussed transformer-based models are limited to a fixed context length (such as 512 tokens), while legal documents are often long and exceed this context length limit. XLNet [13] includes segment recurrence, introduced in Transformer-XL [35], allowing it to digest longer documents. XLNet follows RoBERTa in removing the NSP objective, while introducing a novel permutation language model objective. In our work with XLNet, we fine-tune the classifier directly without LM fine-tuning (as LM fine-tuning of XLNet was computationally not possible on the hardware available for our experiments).

E. Dataset Splitting

Stratification of classification data aims at splitting the data in a way that in all dataset splits (training, validation, test) the target classes appear in similar proportions. In multi-label text classification stratification becomes harder, because the target is a combination of multiple labels. In random splitting, it is possible that most instances of a specific class end up either in the training or test split (esp. for low frequency classes), and therefore the split can be unrepresentative with respect to the original data set. Moreover, random splitting and different train/validation/test ratios create the problem that results from different approaches are hard to compare [15].

Depending on the dataset, other criteria can be used for dataset splitting. For example, Azarbonyad et al. [36] split JRC-Acquis documents according to the document's year, where older documents could be used in training, and newer ones in testing.

For splitting both JRC-Acquis and EURLEX57K, we use the iterative stratification algorithm proposed by Sechidis et al. [14], i.e. its implementation provided by the scikit-multilearn library [37]. Applying this algorithm leads to a better document split with respect to the target labels, and in turn, helps with generalization of the results and allows for a fair comparison of different approaches. The reference splits of the dataset are available online [32].

In the experiments in Section V we use these dataset splits, but in addition for EURLEX57K also the dataset split of the dataset creators [6], in order to compare to their evaluation results.

F. Multilingual Training

JRC-Acquis is a collection of parallel texts in 22 languages; we make use of this property to train multilingual BERT [38] on an extended version of JRC-Acquis in 3 languages. Multilingual BERT provides support for 104 languages and it is useful for zero-shot learning tasks in which a model is trained using data from one language and then used to make inference on data in other languages.

We extend the English JRC-Acquis dataset with parallel data in German and French. The additional data has the same dataset split as in the English version, i.e. if an English document is in the training set then the German and French versions will be in the same split as well.

V. EVALUATION

This section first discusses the evaluation setup (for example model hyperparameters) and then the evaluation results for JRC-Acquis and EURLEX57K.

A. Evaluation Setup

The evaluation setup includes important aspects such as dataset splits, preprocessing, the specific model architectures and variants, and major hyperparameters used in training.

a) Dataset Splits: The official JRC-Acquis dataset does not include a standard train-validation-test split, and as discussed in Section IV-E a random split exhibits unfavorable characteristics. We apply iterative splitting [14] to ensure that each split has the same label distribution as the original data. We split with an 80%/10%/10% ratio for training/validation/test sets. For EURLEX57K the dataset creators already provide a split and a strong baseline evaluation. We run our models on the given split in order to compare results, and also create our own split with iterative splitting (dataset available in the mentioned GitHub repository [32]).

b) Text Preprocessing: All described models have their own preprocessing included (e.g. WordPiece tokenization in BERT), so we do not apply extra preprocessing to the text.

c) Neural Network Architectures: For AWD-LSTM, we use the standard setup of the pretrained model included in FastAI, which has an input embedding layer with embedding size of 400, followed by three LSTM layers with hidden sizes of 1152 and weight dropout probability of 0.1.

TABLE III. ARCHITECTURE HYPERPARAMETERS OF TRANSFORMER MODELS

Model Name   # Layers   # Heads   Context Length   Is Cased   Batch Size
BERT         12         12        512              False      4
RoBERTa      12         12        512              False      4
DistilBERT   6          12        512              False      4
XLNet        12         12        1024             True       2

TABLE V. GRADUAL UNFREEZING SETTINGS FOR AWD-LSTM

Cycle   Max LR   # Unfrozen Layers   # Iterations
1       2e-1     1                   2
2       1e-2     2                   5
3       1e-3     3                   5
4       5e-3     all                 20
5       1e-4     all                 32
6       1e-4     all                 32

For the transformer models, we start from pretrained models: the uncased BERT model [39], the RoBERTa model [40], DistilBERT [41], and the XLNet model [42].

In Table III, we see that many architectural details are similar for the different model types. The transformer models all have 12 network layers, except DistilBERT with 6 layers, and 12 attention heads. XLNet allows for longer input contexts, but for performance reasons we limited the context to 1024 tokens, and it was necessary to reduce the batch size to 2 to fit the model into GPU memory; we also could not unfreeze the whole pretrained model (see below).

To create the text classifiers, we take the representation of the text generated by the transformer model or AWD-LSTM, and add two fully connected layers of size 1200 and 50, respectively, with a dropout probability of 0.2, and an output layer. We apply batch normalization on the fully connected layers.

d) Gradual Unfreezing: Gradual unfreezing is one of the ULMFiT strategies discussed in Section IV-B, where the neural network layers are grouped, and trained starting with the last group, then incrementally unfrozen and trained further.

TABLE IV. GRADUAL UNFREEZING DETAILS: LEARNING RATES (LR), NUMBER OF EPOCHS (ITERS), AND LAYER GROUPS THAT ARE UNFROZEN.

                            # Unfrozen Layers
Cycle   Max LR   # Iters    BERT / RoBERTa   DistilBERT   XLNet
1       2e-4     12         4                2            4
2       5e-5     12         8                4            6
3       5e-5     12         12               6            8
4       5e-5     36         12               6            8
5       5e-5     36         12               6            8

Except for DistilBERT, which has only 2 layers per layer group, all transformer models have 3 groups of 4 layers used in the unfreezing process. Table IV gives an overview of the training setup for the transformer models. We trained the classifier for 5 cycles, starting in cycle 1 with 4 layers, a LR of 2e-4, and 12 training epochs (Iters). The setup of the other cycles is shown in the table. Overall, we used the same setup for all transformer models with the goal of better comparison between models. (Remark: hand-picking LRs and training epochs might lead to slightly better results.)

Table V shows the main hyperparameters of AWD-LSTM training; we trained the model in 6 cycles, with LRs, epochs per cycle, and unfrozen layers as shown in the table.

e) LM Fine-tuning: For the transformer models we do LM fine-tuning for 5 iterations, with a batch size of 4 and a LR of 5e-5. Transformer fine-tuning is done with a script (footnote 1) provided by Hugging Face. For the AWD-LSTM model we first fine-tune the frozen LM for 2 epochs, and then in cycle two fine-tune the unfrozen model for another 5 epochs.

Footnote 1: https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py

f) Hardware specifications: We trained the models on a single GPU device (NVIDIA GeForce GTX 1080 with 11 GB of GDDR5X memory). For inference, we use an Intel i7-8700K CPU @ 3.70GHz and 16GB RAM.

B. Evaluation Metrics

In the evaluations, in line with Chalkidis et al. [6], we apply the following evaluation metrics: micro-averaged F1, R-Precision@K (RP@K), and Normalized Discounted Cumulative Gain (nDCG@K). Precision@K (P@K) and Recall@K (R@K) are popular measures in LMTC, too, but they unfairly penalize in situations where the number of gold labels is unequal to K, which is the typical situation in our datasets. This problem led to the introduction of more suitable metrics like RP@K and nDCG@K. In the following, we briefly discuss the metrics.

The F1-score is a common metric in information retrieval systems, and it is calculated as the harmonic mean between precision and recall. If we have a label L, Precision, Recall, and F1-score with respect to L are calculated as follows:

\[ \mathrm{Precision}_L = \frac{\mathrm{TruePositives}_L}{\mathrm{TruePositives}_L + \mathrm{FalsePositives}_L} \]

\[ \mathrm{Recall}_L = \frac{\mathrm{TruePositives}_L}{\mathrm{TruePositives}_L + \mathrm{FalseNegatives}_L} \]

\[ F1_L = 2 \cdot \frac{\mathrm{Precision}_L \cdot \mathrm{Recall}_L}{\mathrm{Precision}_L + \mathrm{Recall}_L} \]

Micro-F1 is an extension of the F1-score for multi-label classification tasks, and it treats the entire set of predictions as one vector and then calculates the F1. We use grid search to pick the threshold on the output probabilities of the models that gives the best Micro-F1 score on the validation set. The threshold determines which labels we assign to the documents.

Propensity scores prioritize predicting a few relevant labels over the large number of irrelevant ones [43]. R-Precision@K (RP@K) calculates precision for the top K ranked labels; if the number of ground truth labels for a document is less than K, K is set to this number for this document.

\[ RP@K = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{Rel(n,k)}{\min(K, R_n)} \]

Here N is the number of documents, Rel(n, k) is set to 1 if the k-th retrieved label in the top-K labels of the n-th document is correct, otherwise it is set to 0. R_n is the number of ground truth labels for the n-th document.

Normalized Discounted Cumulative Gain nDCG@K for a list of top K ranked labels measures ranking quality. It is based on the assumption that highly relevant documents are more useful than moderately relevant documents.

\[ nDCG@K = \frac{1}{N} \sum_{n=1}^{N} Z_{K_n} \sum_{k=1}^{K} \frac{2^{Rel(n,k)} - 1}{\log_2(1+k)} \]

N is the number of documents, Rel(n, k) is set to 1 if the k-th retrieved label in the top-K labels of the n-th document is correct, otherwise it is set to 0. Z_{K_n} is a normalization factor to ensure nDCG@K = 1 for a perfect ranking.

C. Evaluation Results

The evaluation results are organized into three subsections: results for the JRC-Acquis dataset, results for the EURLEX57K dataset, and finally results from ablation studies.

1) JRC-Acquis: Table VI presents an overview of the results on the JRC-Acquis dataset for the transformer models and the AWD-LSTM baseline, and initial results from the multilingual model.

The observations here are as follows: Firstly, transformer-based models outperform the LSTM baseline by a large margin. Further, within the transformer models RoBERTa and BERT yield the best results, and their scores are almost the same. As expected, the distilled version of BERT is a bit lower in most metrics like Micro-F1, but the difference is small.

In this set of experiments, XLNet is behind DistilBERT, which we attribute to two main causes: (i) for computational reasons (given the available GPU hardware), we could not fine-tune the LM on XLNet, and in classifier training we could not unfreeze the full model. (ii) We used the same LR on all models; the choice of LR was influenced by a recommendation on BERT learning rates in Devlin et al. [10], and may not be optimal for XLNet. Overall, we could not properly test XLNet due to its high computational requirements, and therefore did not include it in the set of experiments on the EURLEX57K dataset.

[...] created, in addition to the default descriptors dataset, datasets for EuroVoc Top Terms (TT), Micro-Thesauri (MT), and EuroVoc Domains (DO). With the reduced number of classes, classification performance clearly rises, for example from a Micro-F1 of 0.661 (descriptors) to 0.839 (EuroVoc domains). We argue that the results with the inferred labels show that our approach might be well suited for real-world applications in scenarios like automatic legal document classification or keyword/label suggestion; for example, the RP@5 for domains (DO) is at 0.928, so the classification performance (depending on the use case requirements) may be sufficient.

Figure 3. A visualization of RP@K and nDCG@K for all transformer models for JRC-Acquis.

Figure 3 contains a visual representation of RP@K and nDCG@K for the transformer models applied to the JRC-Acquis dataset. We can see how similar the performance of BERT and RoBERTa is for different values of K, and RoBERTa scores are consistently marginally better.

2) EURLEX57K: In this subsection we report the evaluation results on the new EURLEX57K dataset by Chalkidis et al. [6].
In order to compare to the results of the dataset The initial set of experiments with multilingual BERT (M- creators, we ran the experiments on the dataset and dataset split BERT) provides very promising results, on par with (45K training, 6K validation, 6K testing) provided by Chalkidis and BERT. This is remarkable given the fact that we use the et al. [6]. Below, we also show evaluation results on our same amount of global training steps – which means, because dataset split (created with the iterative stratification approach). our multilingual dataset is 3 times larger, that on individual Table VIII gives an overview of results for our transformer documents we train only a 1/3 of the time. We expect even models, and compares them to the strong baselines in existing better results with more training epochs. LM fine-tuning of the work. Chalkidis et al. [6] evaluate various architectures, the M-BERT model was done on the text from all three languages results of the three best models presented here: BERT-BASE, (en, de, fr). BIGRU-LWAN-ELMO and BIGRU-LWAN-L2V. BERT-BASE is a BERT model with an extra classification layer on top, Regarding comparisons to existing baseline results, firstly BIGRU-LWAN combines a BIGRU encoder with Label-Wise because of the problem of different dataset splits (see Sec- Attention Networks (LWAN), and uses either Elmo (ELMO) tion IV-E) results are hard to compare. However, Steinberger or word2vec (L2V) embeddings as inputs. Table VIII shows et al. [16] report an F1-score of 0.48, Esuli et al. [44] report that our models outperform the previous baseline, the best an F1 of 0.589 and Chang et al. [15] do not provide F1, but results are delivered by RoBERTa and DistilBERT. The good only P@5 (62.64) and R@5 (61.59). performance of DistilBERT in these experiments is surprising (We need further future experiments to explain the results For Table VII, we picked one transformer-based method, sufficiently. 
One intuition might be that the random weight namely BERT, and analyzed its performance on the various initialization of the added layers was very suitable.). JRC datasets resulting from class reduction described in Sec- tion III-A. By using inference on the EuroVoc hierarchy, we Overall, the results are much better than for the smaller TABLE VI. COMPARISON BETWEEN DIFFERENT TRANSFORMER MODELS, FINE-TUNEDUSINGTHESAMENUMBEROFITERATIONSON JRC-ACQUIS. BERT RoBERTa XLNet DistilBERT AWD-LSTM Multilingual BERT Micro-F1 0.661 0.659 0.605 0.652 0.493 0.663 RP@1 0.867 0.873 0.845 0.884 0.762 0.873 RP@3 0.784 0.788 0.736 0.78 0.619 0.783 RP@5 0.715 0.716 0.661 0.711 0.548 0.717 RP@10 0.775 0.778 0.733 0.775 0.627 0.777 nDCG@1 0.867 0.873 0.845 0.884 0.762 0.873 nDCG@3 0.803 0.807 0.762 0.805 0.651 0.804 nDCG@5 0.750 0.753 0.703 0.75 0.594 0.752 nDCG@10 0.778 0.781 0.746 0.779 0.630 0.780

TABLE VII. BERT RESULTS FOR JRC-ACQUISWITH class reduction experiments. METHODSAPPLIED, WHICHLEADTO 4 DATASETS:DE(DESCRIPTORS),TT Finally, in Table X, we trained a BERT model on our (TOP-TERMS),MT(MICROTHESAURI,DO(DOMAINS) iterative split of the EURLEX57K dataset in order to provide a DE TT MT DO strong baseline for future work on a standardized and arguably Micro-F1 0.661 0.745 0.778 0.839 improved version of the EURLEX57K dataset. RP@1 0.867 0.922 0.943 0.967 RP@3 0.784 0.838 0.871 0.905 3) Ablation Studies: In this section, we want to study the RP@5 0.715 0.804 0.844 0.928 contributions of various training process components – by RP@10 0.775 0.857 0.908 0.974 excluding some of those components individually (or reducing nDCG@1 0.867 0.922 0.943 0.967 the number of training epochs). We focus on three important nDCG@3 0.803 0.858 0.888 0.919 aspects: (i) the use of Language Model (LM) fine-tuning, (ii) nDCG@5 0.750 0.829 0.864 0.929 gradual unfreezing, (iii) and a reduction of the number of nDCG@10 0.778 0.852 0.896 0.952 training cycles. In Table XI, we compare the evaluation metrics when removing the LM fine-tuning (on the legal target corpus) step JRC dataset, with the best Micro-F1 for JRC being 0.661 before classification model training to the original version (BERT), while for EURLEX57K we reach 0.758 (RoBERTa). including LM fine-tuning (in parenthesis). For all examined Table IX presents the results for BERT on the additional models, we can see a small but consistent improvement of datasets with Top Terms (TT), Micro-Thesauri (MT) and results when using LM fine-tuning. The relative improvement Domains (DO) labels inferred from the EuroVoc taxonomy in the metrics is in the range of 1%–3%. In conclusion, LM (similar to Table VII, which presents the scores of JRC- fine-tuning to the legal text corpus is a crucial step for reaching Acquis). As expected from the general results on the EU- a high classification performance. 
RLEX57 dataset, the values on the derived datasets are better In Table XII, we examine the effect of two factors, the 0.956 than for JRC-Acquis, for example RP@5 is now at for training epochs (Iter.) hyperparameter, and of the use of the the domains (DO). gradual unfreezing technique. Regarding number of epochs, both models benefit from longer training, for BERT the difference is large (about 4% relative improvement in F1- score), while for the simpler DistilBERT model less training appears to be required, after 36 epochs it even provides better accuracy than BERT at this point, and finally only gains a 1.2% improvement from more training epochs. Secondly, we study the effect of Gradual Unfreezing (GU), which for BERT has a large impact, with a relative improvement in F1 of about 6%. In summary, longer training times benefit esp. more complex models like BERT, and gradual unfreezing is a very helpful strategy for optimizing performance.
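The gradual unfreezing setup of Table IV can be written down as a small data structure. The following is a minimal sketch, not the training code used for the experiments: the names `Cycle`, `SCHEDULE`, `NUM_LAYERS` and `trainable_layers` are hypothetical, the shared BERT/RoBERTa column is an assumption read from the flattened table, and the encoder depths (12, 12, 6, 12) are those of the commonly used base checkpoints. It only computes which encoder layers would be trainable in each cycle; actually freezing parameters depends on the training framework.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Cycle:
    """One training cycle from Table IV (hypothetical container)."""
    max_lr: float
    epochs: int
    unfrozen: Dict[str, int]  # number of top encoder layers that are trainable


# Values transcribed from Table IV; BERT and RoBERTa are assumed to share a column.
SCHEDULE: List[Cycle] = [
    Cycle(2e-4, 12, {"bert": 4, "roberta": 4, "distilbert": 2, "xlnet": 4}),
    Cycle(5e-5, 12, {"bert": 8, "roberta": 8, "distilbert": 4, "xlnet": 6}),
    Cycle(5e-5, 12, {"bert": 12, "roberta": 12, "distilbert": 6, "xlnet": 8}),
    Cycle(5e-5, 36, {"bert": 12, "roberta": 12, "distilbert": 6, "xlnet": 8}),
    Cycle(5e-5, 36, {"bert": 12, "roberta": 12, "distilbert": 6, "xlnet": 8}),
]

# Assumption: encoder depth of the base-sized checkpoints.
NUM_LAYERS: Dict[str, int] = {"bert": 12, "roberta": 12, "distilbert": 6, "xlnet": 12}


def trainable_layers(model: str, cycle_index: int) -> List[int]:
    """Return the indices of the encoder layers trained in the given cycle.

    Unfreezing starts from the last (top) layer, so the trainable layers
    are always the final `n` layers of the encoder.
    """
    total = NUM_LAYERS[model]
    n = SCHEDULE[cycle_index].unfrozen[model]
    return list(range(total - n, total))


if __name__ == "__main__":
    for i, cycle in enumerate(SCHEDULE, start=1):
        layers = trainable_layers("bert", i - 1)
        print(f"cycle {i}: lr={cycle.max_lr}, epochs={cycle.epochs}, bert layers={layers}")
```

Under this reading of the table, the first three cycles add up to 36 epochs and all five to 108, which matches the two training lengths compared in Table XII.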

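The evaluation metrics used throughout the results (RP@K, nDCG@K, and Micro-F1 with a threshold chosen by grid search) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code behind the reported numbers: relevance is binary, every document has at least one gold label, and the function names `rp_at_k`, `ndcg_at_k`, `micro_f1` and `best_threshold` are hypothetical.

```python
import math
from typing import List, Sequence, Set


def rp_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[Set[int]], k: int) -> float:
    """R-Precision@K: hits in the top K, divided by min(K, number of gold labels)."""
    total = 0.0
    for labels, truth in zip(ranked, gold):
        hits = sum(1 for lab in labels[:k] if lab in truth)
        total += hits / min(k, len(truth))  # assumes len(truth) >= 1
    return total / len(ranked)


def ndcg_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[Set[int]], k: int) -> float:
    """nDCG@K with binary relevance; the ideal DCG plays the role of 1 / Z_{K_n}."""
    total = 0.0
    for labels, truth in zip(ranked, gold):
        dcg = sum(
            (2 ** int(lab in truth) - 1) / math.log2(1 + pos)
            for pos, lab in enumerate(labels[:k], start=1)
        )
        ideal = sum(1.0 / math.log2(1 + pos) for pos in range(1, min(k, len(truth)) + 1))
        total += dcg / ideal
    return total / len(ranked)


def micro_f1(probs: Sequence[Sequence[float]], gold: Sequence[Sequence[int]],
             threshold: float) -> float:
    """Micro-F1: pool all (document, label) decisions, then compute a single F1."""
    tp = fp = fn = 0
    for p_row, g_row in zip(probs, gold):
        for p, g in zip(p_row, g_row):
            predicted = p >= threshold
            if predicted and g:
                tp += 1
            elif predicted:
                fp += 1
            elif g:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs: Sequence[Sequence[float]], gold: Sequence[Sequence[int]],
                   candidates: List[float]) -> float:
    """Grid search: return the candidate threshold with the highest Micro-F1."""
    return max(candidates, key=lambda t: micro_f1(probs, gold, t))
```

In practice `best_threshold` would be run on validation-set probabilities and the selected value then applied unchanged to the test set, as described for the Micro-F1 evaluation above.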
Figure 4. RP@K and nDCG@K for the transformer models trained on EURLEX57K.

Similar to Figure 3, Figure 4 shows RP@K and nDCG@K for BERT, RoBERTa and DistilBERT depending on the value of K. RoBERTa and DistilBERT are almost identical in their performance; BERT lags behind a little in this set of experiments.

TABLE VIII. RESULTS FOR OUR TRANSFORMER-BASED MODELS ON EURLEX57K, AND STRONG BASELINES FROM CHALKIDIS ET AL.

           Ours                         Chalkidis et al. [6]
           BERT   RoBERTa  DistilBERT   BERT-BASE  BIGRU-LWAN-ELMO  BIGRU-LWAN-L2V
Micro-F1   0.751  0.758    0.754        0.732      0.719            0.709
RP@1       0.912  0.919    0.925        0.922      0.921            0.915
RP@3       0.843  0.85     0.848        -          -                -
RP@5       0.805  0.812    0.807        0.796      0.781            0.770
RP@10      0.852  0.860    0.862        0.856      0.845            0.836
nDCG@1     0.912  0.919    0.925        0.922      0.921            0.915
nDCG@3     0.859  0.866    0.866        -          -                -
nDCG@5     0.828  0.835    0.833        0.823      0.811            0.801
nDCG@10    0.849  0.857    0.858        0.851      0.841            0.832

TABLE IX. BERT RESULTS ON EURLEX57K WITH CLASS REDUCTION METHODS APPLIED, PLUS THE BASELINE RESULTS OF BERT-BASE (DE) FROM CHALKIDIS ET AL. [6].

           DE     TT     MT     DO     DE baseline
Micro-F1   0.751  0.825  0.84   0.883  0.732
RP@1       0.912  0.948  0.959  0.978  0.922
RP@3       0.843  0.896  0.915  0.939  -
RP@5       0.805  0.876  0.902  0.956  0.796
RP@10      0.852  0.909  0.943  0.986  0.856
nDCG@1     0.912  0.948  0.959  0.978  0.922
nDCG@3     0.859  0.907  0.924  0.947  -
nDCG@5     0.828  0.891  0.912  0.955  0.823
nDCG@10    0.849  0.904  0.931  0.97   0.851

TABLE X. BERT RESULTS ON EURLEX57K WITH THE NEW ITERATIVE STRATIFICATION DATASET SPLIT.

Micro-F1  RP@1   RP@5   nDCG@1  nDCG@5
0.760     0.914  0.809  0.914   0.833

TABLE XI. CLASSIFICATION METRICS FOR THE JRC-ACQUIS DATASET, WHEN NOT USING LM FINE-TUNING; IN PARENTHESES THE RESULTS WITH FINE-TUNING (FOR COMPARISON).

           BERT         RoBERTa      DistilBERT
Micro-F1   0.64 (0.66)  0.65 (0.66)  0.61 (0.62)
RP@1       0.86 (0.87)  0.87 (0.87)  0.86 (0.87)
RP@3       0.77 (0.78)  0.77 (0.79)  0.75 (0.76)
RP@5       0.70 (0.72)  0.70 (0.72)  0.67 (0.68)
RP@10      0.76 (0.78)  0.77 (0.78)  0.74 (0.75)
nDCG@1     0.86 (0.87)  0.87 (0.87)  0.86 (0.87)
nDCG@3     0.79 (0.80)  0.79 (0.81)  0.77 (0.78)
nDCG@5     0.74 (0.75)  0.74 (0.75)  0.71 (0.72)
nDCG@10    0.77 (0.72)  0.77 (0.78)  0.75 (0.76)

TABLE XII. ABLATION STUDY: BERT AND DISTILBERT PERFORMANCE ON JRC-ACQUIS REGARDING THE NUMBER OF TRAINING EPOCHS (ITER.) AND THE USE OF GRADUAL UNFREEZING (GU).

Model       # Iter.  Use GU  Prec.  Rec.   Mic.-F1
BERT        36       True    0.678  0.601  0.637
BERT        108      False   0.674  0.575  0.621
BERT        108      True    0.695  0.630  0.661
DistilBERT  36       True    0.696  0.601  0.645
DistilBERT  108      False   0.663  0.583  0.620
DistilBERT  108      True    0.701  0.611  0.653

VI. DISCUSSION

Much of the detailed discussion is already included in the Evaluation Results section (Section V-C), so here we summarize and extend some of the key findings.

In comparing model performance, starting with LSTM versus transformer architectures, the results show that the attention mechanism used in transformers is superior to LSTMs in finding aspects relevant for the classification task in long documents. Within the transformer models, firstly we did not
notice much difference between BERT and RoBERTa, which is not unexpected, as they are technically very similar. Overall, results were a bit better for RoBERTa. DistilBERT delivered surprisingly good results for the EURLEX57K dataset, and has the benefit of lower computational cost. Both for the JRC-Acquis and the EURLEX57K datasets, the results indicate that DistilBERT is better at retrieving the most probable label compared with RoBERTa and BERT. XLNet, on the other hand, requires a lot of computational resources, and we were not able to properly train the model for that reason. Finally, the first set of experiments on multilingual training with M-BERT gave promising results, hence it will be further studied in future work.

The ablation studies showed the positive effects of the training (fine-tuning) strategies that we applied: both LM fine-tuning on the target domain and gradual unfreezing of the network layers (in groups) proved to be crucial in reaching state-of-the-art classification performance.

To compare the computational costs, we calculated inference times for each model on an Intel i7-8700K CPU @ 3.70GHz. DistilBERT provides the lowest run time at 12 ms/example. RoBERTa and BERT (which have an identical architecture) have very similar run times with 17.1 ms/example and 17.3 ms/example, respectively. XLNet, the heaviest model, requires 77 ms/example.

For a fair comparison, we trained all transformer models with the same set of hyperparameters (such as learning rate and number of training epochs). With customized and hand-picked parameters for each training cycle we expect further improvements of scores, which will be studied in future work together with model ensemble approaches and text data augmentation.

VII. CONCLUSIONS

In this work we evaluate current transformer models for natural language processing (NLP) in combination with training strategies like language model (LM) fine-tuning, slanted triangular learning rates and gradual unfreezing in the field of LMTC (large multi-label text classification) on legal text datasets with long-tail label distributions. The datasets contain around 20K documents (JRC-Acquis) and 57K documents (EURLEX57K) and are labeled with EuroVoc descriptors from the 7K terms in the EuroVoc taxonomy. The use of an iterative stratification algorithm for dataset splitting (into training/validation/testing) makes it possible to create standardized splits on the two datasets to enable comparison and reproducibility in future experiments. In the experiments, we provide new state-of-the-art results on both datasets, with a Micro-F1 of 0.661 for JRC-Acquis and 0.754 for EURLEX57K, and even higher scores for new datasets with reduced label sets inferred from the EuroVoc hierarchy (top terms, microthesauri, and domains).

The main contributions are: (i) new state-of-the-art LMTC classification results on both datasets for a problem type that is still largely unexplored [3], (ii) a comparison and interpretation of the performance of the applied models: AWD-LSTM, BERT, RoBERTa, DistilBERT and XLNet, (iii) the creation and provision (on GitHub) of new standardized versions of the two legal text datasets created with an iterative stratification algorithm, (iv) deriving new datasets with reduced label sets via the semantic structure within EuroVoc, and (v) ablation studies that quantify the contributions of individual training strategies and hyperparameters such as gradual unfreezing, number of training epochs and LM fine-tuning in this complex LMTC setting.

There are multiple angles for future work, including potentially deriving higher performance by using hand-picked learning rates and other hyperparameters for each model individually, and further experiments on using models such as multilingual BERT to profit from the availability of parallel corpora. Moreover, experiments with new architectures such as Graph Neural Networks [45] and various data augmentation techniques are candidates to improve classification performance.

ACKNOWLEDGEMENTS

This work was supported by the Government of the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship Program.

REFERENCES

[1] F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, 2002, pp. 1–47.
[2] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein, “Explainable prediction of medical codes from clinical text,” arXiv preprint arXiv:1802.05695, 2018.
[3] A. Rios and R. Kavuluru, “Few-shot and zero-shot multi-label learning for structured label spaces,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, vol. 2018. NIH Public Access, 2018, p. 3132.
[4] P. Ioannis et al., “Lshtc: A benchmark for large-scale text classification,” arXiv preprint arXiv:1503.08581, 2015.
[5] E. Loza Mencía and J. Fürnkranz, Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain. Berlin, Heidelberg: Springer, 2010, pp. 192–215, retrieved: 09, 2020. [Online]. Available: https://doi.org/10.1007/978-3-642-12837-0_11
[6] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, and I. Androutsopoulos, “Large-scale multi-label text classification on EU legislation,” in Proc. 57th Annual Meeting of the ACL. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 6314–6322, retrieved: 09, 2020. [Online]. Available: https://www.aclweb.org/anthology/P19-1636
[7] European Union Law Website. Retrieved: 09, 2020. [Online]. Available: https://eur-lex.europa.eu
[8] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer learning in natural language processing,” in 2019 NAACL: Tutorials, 2019, pp. 15–18.
[9] The European Union’s multilingual and multidisciplinary thesaurus. Retrieved: 09, 2020. [Online]. Available: https://eur-lex.europa.eu/browse/eurovoc.html
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. 2019 NAACL: Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: ACL, Jun. 2019, pp. 4171–4186, retrieved: 09, 2020. [Online]. Available: https://www.aclweb.org/anthology/N19-1423
[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[12] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
[13] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in neural information processing systems, 2019, pp. 5754–5764.
[14] K. Sechidis, G. Tsoumakas, and I. Vlahavas, “On the stratification of multi-label data,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 145–158.
[15] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. Dhillon, “X-bert: extreme multi-label text classification using bidirectional encoder representations from transformers,” arXiv preprint arXiv:1905.02331, 2019.
[16] R. Steinberger, M. Ebrahim, and M. Turchi, “Jrc eurovoc indexer jex-a freely available multi-label categorisation tool,” arXiv preprint arXiv:1309.5223, 2013.
[17] G. Boella et al., “Linking legal open data: breaking the accessibility and language barrier in european legislation and case law,” in Proceedings of the 15th International Conference on Artificial Intelligence and Law, 2015, pp. 171–175.
[18] J. Liu, W.-C. Chang, Y. Wu, and Y. Yang, “Deep learning for extreme multi-label text classification,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 115–124.
[19] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, “Sparse local embeddings for extreme multi-label classification,” in Advances in neural information processing systems, 2015, pp. 730–738.
[20] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
[21] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. 2018 NAACL: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237, retrieved: 09, 2020. [Online]. Available: https://www.aclweb.org/anthology/N18-1202
[22] R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu, “Attentionxml: Extreme multi-label text classification with multi-label attention based recurrent neural networks,” arXiv preprint arXiv:1811.01727, 2018.
[23] SKOS Simple Knowledge Organization System. Retrieved: 09, 2020. [Online]. Available: https://www.w3.org/2004/02/skos/
[24] Resource Description Framework. Retrieved: 09, 2020. [Online]. Available: https://eur-lex.europa.eu/browse/eurovoc.html
[25] RDF 1.1 Turtle. Retrieved: 09, 2020. [Online]. Available: https://www.w3.org/TR/turtle
[26] E. Filtz, S. Kirrane, A. Polleres, and G. Wohlgenannt, “Exploiting eurovoc’s hierarchical structure for classifying legal documents,” in OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”. Springer, 2019, pp. 164–181.
[27] JRC-Acquis. Retrieved: 09, 2020. [Online]. Available: https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis
[28] EURLEX57K dataset. Retrieved: 09, 2020. [Online]. Available: http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/
[29] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” arXiv preprint arXiv:1801.06146, 2018.
[30] Fastai documentation. Retrieved: 09, 2020. [Online]. Available: https://docs.fast.ai/
[31] Huggingface transformers. Retrieved: 09, 2020. [Online]. Available: https://huggingface.co/transformers
[32] Legal Documents, Large Multi-Label Text Classification. Retrieved: 09, 2020. [Online]. Available: https://github.com/zeinsh/Legal-Docs-Large-MLTC
[33] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing lstm language models,” arXiv preprint arXiv:1708.02182, 2017.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[35] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
[36] H. Azarbonyad and M. Marx, “How many labels? determining the number of labels in multi-label text classification,” in International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 2019, pp. 156–163.
[37] Multi-label data stratification. Retrieved: 09, 2020. [Online]. Available: http://scikit.ml/stratification.html#Multi-label-data-stratification
[38] BERT, Multi-Lingual Model. Retrieved: 09, 2020. [Online]. Available: https://github.com/google-research/bert/blob/master/multilingual.md
[39] Huggingface BERT base uncased model. Retrieved: 09, 2020. [Online]. Available: https://huggingface.co/bert-base-uncased
[40] Huggingface RoBERTa base model. Retrieved: 09, 2020. [Online]. Available: https://huggingface.co/roberta-base
[41] Huggingface DistilBERT cased model. Retrieved: 09, 2020. [Online]. Available: https://huggingface.co/distilbert-base-uncased
[42] Huggingface XLNET cased model. Retrieved: 09, 2020. [Online]. Available: https://huggingface.co/xlnet-base-cased
[43] H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 935–944.
[44] A. Esuli, A. Moreo, and F. Sebastiani, “Funnelling: A new ensemble method for heterogeneous transfer learning and its application to cross-lingual text classification,” ACM Transactions on Information Systems (TOIS), vol. 37, no. 3, 2019, pp. 1–30.
[45] A. Pal, M. Selvakumar, and M. Sankarasubbu, “Magnet: Multi-label text classification using attention-based graph neural network,” in Proc. 12th Int. Conf. on Agents and Artificial Intelligence - Volume 2: ICAART, INSTICC. SciTePress, 2020, pp. 494–505.