An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification

Imperial College London
Department of Computing

Author: Clavance Lim
Supervisors: Prof Alessandra Russo, Nuri Cingillioglu

Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London

September 2019

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Aims and objectives
  1.3 Outline
2 Background
  2.1 Overview
    2.1.1 Text classification
    2.1.2 Training, validation and test sets
    2.1.3 Cross validation
    2.1.4 Hyperparameter optimization
    2.1.5 Evaluation metrics
  2.2 Text classification pipeline
  2.3 Feature extraction
    2.3.1 Count vectorizer
    2.3.2 TF-IDF vectorizer
    2.3.3 Word embeddings
  2.4 Classifiers
    2.4.1 Naive Bayes classifier
    2.4.2 Decision tree
    2.4.3 Random forest
    2.4.4 Logistic regression
    2.4.5 Support vector machines
    2.4.6 k-Nearest Neighbours
    2.4.7 Multilayer perceptron
    2.4.8 Convolutional neural networks
    2.4.9 Recurrent neural networks
    2.4.10 Hierarchical attention network
3 The EURLEX Dataset
  3.1 Structure of the EUR-Lex database
  3.2 The EUR-Lex paper
  3.3 Filtering the dataset
    3.3.1 Distribution of the distilled EURLEX dataset
  3.4 Analysis of the distilled EURLEX dataset
  3.5 Dataset visualisation
4 Experiments
  4.1 Overview
  4.2 Preprocessing
  4.3 Machine learning models
  4.4 Summary of results
  4.5 Analysis of results
    4.5.1 Naive Bayes classifier
    4.5.2 Decision tree
    4.5.3 Random forest
    4.5.4 Logistic regression
    4.5.5 k-NN
    4.5.6 Linear SVM
    4.5.7 Non-linear SVM
    4.5.8 MLP
    4.5.9 Preliminary conclusions
  4.6 Deep learning models
  4.7 Summary of results
    4.7.1 General trend
    4.7.2 Sentence embeddings
  4.8 Analysis of results
    4.8.1 MLP
    4.8.2 CNN
    4.8.3 LSTM
    4.8.4 HAN
    4.8.5 Conclusions on deep learning models
  4.9 Choice of best models
5 Related Work
  5.1 Deep learning approaches to NLP
  5.2 Deep learning approaches to text classification
  5.3 Text classification in the legal context
6 Conclusions
  6.1 Summary of contributions
  6.2 Practical implications
  6.3 Challenges
    6.3.1 Lack of labelled data
    6.3.2 Hardware limitations
  6.4 Possible improvements
  6.5 Future work
  6.6 Legal and ethical considerations
Appendices
A Ethics checklist
B Copy of readme for code repository

Abstract

This project provides a comprehensive comparative study of the performance of supervised machine learning models in the natural language processing task of text classification, specifically in the legal context. We distill a dataset of European Union legislation for multi-label classification into one for a single-label, multi-class classification task. We provide visualisations and analysis of the dataset. We then draw a distinction between ‘machine learning’ models, including the Naive Bayes classifier, logistic regression and support vector machines, and more contemporary ‘deep learning’ approaches, such as convolutional neural networks, long short-term memory networks and the hierarchical attention network. We experiment with traditional count-based vectorizers for feature embedding with the machine learning models, and pre-trained word embeddings for the deep learning models. We critically evaluate the performance of each model on its own, and with those in its group, before proposing a final model. Finally, we discuss the potential uses of such a classifier in professional legal practice.

Acknowledgements

I would like to express my utmost gratitude to the following people, without whom this project would not have been possible:

• Professor Alessandra Russo and Nuri Cingillioglu, for their continued guidance, encouragement and time throughout the course of this project.
• My family, for everything.

Chapter 1
Introduction

1.1 Motivation

The application of technology to assist legal professionals with the provision of legal services, a sector known as legal tech, has received tremendous investment and interest in recent years [29]. In particular, with the recent successes of machine learning methods in fields such as computer vision and pattern recognition, expectations have begun to arise that these methods will provide the panacea for the ills of the legal profession, such as repetitive administrative work [66].

The broad aim of this project is thus to apply the latest methods used in machine learning and in natural language processing (NLP) to a dataset in the legal context. More specifically, the goal will be to experiment with and compare the performance of several machine learning and deep learning methods for the task of text classification.

Text classification has many potential uses in the legal domain, particularly for categorising legal documents and cases, which can aid the process of legal research, and for the development of a knowledge management system (for a detailed example of such an implementation, see [5, 6]).

The task is an interesting one from an academic perspective, for several reasons. While text classification as an NLP task in general is well-studied, the specific study of text classification methods in the legal domain has remained relatively under-explored [28, 67].

Applying text classification methods specifically to the legal context is not a trivial problem: the fact that a method has proven useful in classifying texts of a general subject matter does not mean that it will necessarily work equally well in the legal context. This is because the structure of legal language can be distinguished from that of ordinary language in terms of vocabulary, syntax, semantics and other linguistic features [5, 65].
The types of texts used in NLP research tend to be user reviews, where the language used tends to be colloquial or informal (such as the IMDB dataset [45]), posts scraped from Twitter (which are a maximum of 280 characters) and other documents of a much shorter length than a legal judgement or piece of legislation [28].

1.2 Aims and objectives

The broad aim of this project is to present a framework through which a document in the English language with legal subject matter can be classified into one of several predefined classes. Specifically, documents will be drawn from a dataset of European Union legislation, with each document belonging to one of 20 classes.

Concretely, the goal will be to propose a classification model. Given an unseen legal text document of length $n$, $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is an individual token in the document, the model will assign $X$ to one of $k$ classes, where $k = 20$ in our case.

This aim will be achieved by fulfilling the following objectives:

• Extracting relevant sections of legal texts from their raw HTML source obtained from a publicly accessible European Union law repository
• Preprocessing the unstructured data to a structured format
• Analysing the characteristics of the dataset (e.g. distribution of classes)
• Exploring different methods of evaluating the performance of classifiers
• Drawing a distinction between two groups of classifiers, ‘machine learning’ methods and ‘deep learning’ methods, and analysing each group separately, with different methods of feature extraction:
  – Classifiers based on various machine learning methods, with count vectorization and TF-IDF vectorization:
    1. Naive Bayes classifier
    2. Decision tree
    3. Random forest
    4. Logistic regression
    5. Support vector machines (SVMs)
    6. k-nearest neighbours (k-NN)
    7. Multilayer perceptron (MLP)
  – Comparing the performance of classifiers based on deep learning methods, with pre-trained
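
To make the task definition in Section 1.2 concrete, the following is a minimal sketch of a single-label, multi-class text classifier: a document is split into tokens, represented by raw token counts, and assigned to the class with the highest multinomial Naive Bayes score. It is not the implementation used in this project. The class `CountNaiveBayes`, the whitespace `tokenize` function, the toy documents and the two class names are all hypothetical and chosen only for illustration; the project itself works with 20 classes and real legislative text.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


class CountNaiveBayes:
    """Multinomial Naive Bayes over raw token counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log prior + summed log likelihoods for every class."""
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen in training
                score += math.log((counts[token] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Assign the document to the class with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data; the class names are invented for illustration.
train_docs = [
    "regulation on fishing quotas and fishing vessels",
    "fishing licences for vessels in community waters",
    "customs duties on imported goods and tariff quotas",
    "tariff classification of imported goods",
]
train_labels = ["fisheries", "fisheries", "customs", "customs"]

model = CountNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("common customs tariff on imported goods")
print(prediction)  # prints "customs"
```

The same structure carries over to the other count-based classifiers listed above: only the scoring rule changes, while the tokenisation and counting steps stay the same.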