Alma Mater Studiorum – Università di Bologna

PhD Programme (Dottorato di Ricerca) in Ingegneria Elettronica, Informatica e delle Telecomunicazioni – Cycle XXVII

Competition sector: 09/H1 – Scientific disciplinary sector: ING-INF/05

LEARNING METHODS AND ALGORITHMS FOR SEMANTIC TEXT CLASSIFICATION ACROSS MULTIPLE DOMAINS

Presented by: Roberto Pasolini

PhD Coordinator: Prof. Alessandro Vanelli-Coralli
Supervisor: Prof. Gianluca Moro
Co-supervisor: Prof. Claudio Sartori

Final examination year: 2015

Abstract

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in the form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a substantial training set and notable computational effort.

Methods for cross-domain text categorization have been proposed, which leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times.

A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively reusable in a different one.

To everyone who somehow supported me


Contents

Introduction
    Contributions
    Structure of the thesis
    Conventions
    Technical remarks

1 Data and Text Mining
    1.1 Data mining
        1.1.1 Applications
        1.1.2 Machine learning
    1.2 Text mining
        1.2.1 Applications
    1.3 High-level text mining tasks
        1.3.1 Text categorization
        1.3.2 Sentiment analysis
        1.3.3 Text clustering
        1.3.4 Document summarization
    1.4 Natural language processing
        1.4.1 Part-Of-Speech Tagging
        1.4.2 Word Sense Disambiguation
        1.4.3 Other tasks

2 General Techniques and Tools for Text Mining
    2.1 Brief history
    2.2 Bag-of-Words representation
        2.2.1 Cosine similarity
    2.3 Extraction of features
        2.3.1 Lemmas
        2.3.2 Stems
        2.3.3 n-grams and phrases
        2.3.4 Concepts
    2.4 Term selection and weighting
        2.4.1 Basic word filtering
        2.4.2 Feature selection
        2.4.3 Term weighting
    2.5 Extraction of latent semantic information
        2.5.1 …
        2.5.2 Probabilistic models
    2.6 Linguistic and semantic knowledge bases
        2.6.1 WordNet

3 Text Categorization
    3.1 Problem description
    3.2 Variants
        3.2.1 Binary, single-label and multi-label classification
        3.2.2 Hierarchical classification
    3.3 Knowledge engineering approach
    3.4 Machine learning approach
        3.4.1 General setup
        3.4.2 Supervised term selection
    3.5 Common learning algorithms for text
        3.5.1 Naïve Bayes
        3.5.2 Support Vector Machines
        3.5.3 Other methods
    3.6 Nearest centroid classification
    3.7 Hierarchical classification
        3.7.1 Big-bang approach
        3.7.2 Local classifiers
    3.8 Experimental evaluation
        3.8.1 Benchmark datasets
        3.8.2 Evaluation metrics

4 Cross-Domain Text Categorization
    4.1 Problem description
        4.1.1 Formalization
        4.1.2 Motivations
    4.2 State of the art
        4.2.1 Instance transfer
        4.2.2 Feature representation transfer
        4.2.3 Other related works
    4.3 Evaluation
        4.3.1 Common benchmark datasets
    4.4 Iterative refining of category representations
        4.4.1 Rationale
        4.4.2 Base method
        4.4.3 Computational complexity
        4.4.4 Results
        4.4.5 Variant with logistic regression
        4.4.6 Variant with termination by quasi-similarity
        4.4.7 Discussion

5 A Domain-Independent Model for Semantic Relatedness
    5.1 General idea
        5.1.1 Possible applications
    5.2 Related work
    5.3 General working scheme for text categorization
        5.3.1 Semantic knowledge base
        5.3.2 Model training
        5.3.3 Semantic matching algorithm
        5.3.4 Classification
        5.3.5 Computational complexity
        5.3.6 General experiment setup
    5.4 Search of semantic relationships
        5.4.1 Use of WordNet
        5.4.2 Existing approaches
        5.4.3 Hypernyms-based algorithm
        5.4.4 Extended search algorithm
    5.5 Flat categorization
        5.5.1 Example couples and category profiles
        5.5.2 Classification
        5.5.3 Experiment setup
        5.5.4 Experiment results
    5.6 Hierarchical categorization
        5.6.1 Couple labeling and selection
        5.6.2 Representation of categories
        5.6.3 Top-down classification algorithm
        5.6.4 Experiment setup
        5.6.5 Experiment results
    5.7 Discussion

6 Conclusions
    6.1 Ideas for future research

A Network Security through Distributed Data Clustering
    A.1 Distributed data mining
    A.2 Network intrusion detection systems
    A.3 General system model
    A.4 Implementation details
    A.5 Simulation setup
    A.6 Simulation results

Bibliography

Introduction

Nowadays, across many contexts, information of all kinds is produced at fast rates. Many things can be considered "information": the current temperature in a room, the details of a purchase in a store, a post on a social network and so on. From information, especially in large quantities, useful knowledge can be extracted: the details of many sales of a store may give useful indications about products often purchased together, while a collection of public messages about some product or service can reveal what people generally think about it. However, the extraction of useful knowledge from very large amounts of data is not always trivial, especially if performed manually.

Data mining research is dedicated to developing processes for the automated extraction of useful, high-level information hidden within large amounts of data. It has many applications in business, finance, science, society and so on. Data mining is mostly based on machine learning, the study of algorithms that analyze a set of raw data and extract from it a knowledge model which encapsulates recurring patterns and makes it possible to produce more or less accurate predictions on future data. Machine learning algorithms can solve various tasks, such as the classification of items or events according to prominent characteristics, for example classifying reviews of a product as positive or negative.

Ordinary machine learning techniques work on structured data, where every piece of information is well distinguishable and computers can easily handle it. On the other hand, unstructured data also exists, which is easily interpreted by humans in small amounts, but not directly understandable by computers. A prominent part of such data is constituted by free text in natural language, such as English or Italian, in units of different lengths, ranging from short messages on Twitter to publicly available whole books. Such text data, examples of which have been given above, can contain very valuable information, but automatic analysis is necessary to efficiently extract such knowledge in a usable form.

Text mining is a branch of data mining studying techniques for the automatic treatment of free text. This is generally achieved by processing text to represent it in structured forms, which can then be analyzed by usual learning algorithms. A very common form to efficiently represent every document of a large collection is the bag of words, a multiset of relevant words or other related features contained in the text.

In this context, a quite common task is text categorization, consisting in the automatic organization of a possibly very large set of documents into distinct, meaningful categories. Categorization is usually carried out according to the different topics discussed in documents, but the general description also covers tasks like spam filtering in e-mail and the identification of positive and negative reviews of a product. Organization of documents by topics can be useful in contexts like dividing books into sections, separating news articles by categories and organizing Web pages in hierarchical directories. Most methods for automatic text categorization are based on machine learning: given a training set of documents already labeled with their correct categories, they are reduced to bags of words and, using suitable learning algorithms, a knowledge model is inferred from them, which is able to classify further documents within the same categories.
This approach automates the learning process, but requires a substantial training set, which may not be priorly available and may thus require considerable effort to be built. To overcome this problem, at least in some circumstances, methods for cross-domain classification have been proposed, where a knowledge model to classify a set of documents within a target domain is obtained by leveraging a set of pre-labeled documents of a source domain, which is allowed to have some differences from the former. Most methods to perform this task are based on advanced statistical techniques, which transform source data to adapt it to the target or convert data of both domains to a common representation.

Here, a simple method for cross-domain text categorization is proposed, based on explicit document-like representations of categories, similar to those used in some existing methods for standard classification, which are created from the source domain and then iteratively refined to adapt them to the target domain. From experimental evaluations, this approach appears to be effective in finding good representations of the categories of the target domain and consequently in accurately classifying documents within them.

Keeping the idea of explicit category profiles, an improved method to compare them with documents is subsequently introduced, based on the analysis of the semantic relationships holding between their respective relevant words: it is theorized that these can suggest whether and how documents and categories are related, according to implicit rules which are more or less globally valid across any domain. This leads to the construction of a domain-independent model which encapsulates this semantic knowledge. To support the analysis of semantic relationships between words, a couple of alternative algorithms have been developed to identify non-obvious relationships from sequences of primitive links between concepts provided by the WordNet lexical database.

The use of this general knowledge model is tested on text categorization. The proposal is to use a set of pre-labeled documents to extract a knowledge model which encapsulates these rules, then to use this model to classify incoming documents by comparing them to the representations of the relevant categories. Under the assumption of its general validity across multiple domains, a model extracted from a training set should be fairly able to support the classification of documents in an arbitrarily different domain, for which it is only necessary to have bag-of-words-like representations of categories, which can be extracted very efficiently from pre-labeled documents.

This model potentially allows building a classification system where categories can be added and removed at any time just by providing suitable representations, which are very efficiently generated from example documents: this is possible as the knowledge model itself is decoupled from the specific words and categories of its training data. Some basic experiments are performed to evaluate the accuracy of classification with this approach and to investigate to which degree the knowledge model can be applied to a domain different from the one of its training documents.

Contributions

The main contributions presented in the thesis are summarized here. In the sub-points, references to related publications and works currently under review are given.

• Starting from general data mining and machine learning, a discussion of text mining tasks and techniques is given, with focus on text categorization and the machine learning techniques usually employed within it. This also serves to describe techniques which are employed in the subsequently presented work.

• Cross-domain text categorization, where documents of a target domain are classified by leveraging knowledge of a slightly different source domain, is discussed. A novel method based on nearest centroid classification is presented: the source domain is used to extract initial category profiles, which are then iteratively refined according to the most similar documents in the target domain, until a stable configuration is reached. Compared to the state of the art, the method yields better or comparable results with a conceptually simpler algorithm, which can be implemented easily, exposes few parameters to be tuned and runs fast. A couple of variants are separately discussed to further improve robustness with respect to parameters and running times.

– G. Domeniconi, G. Moro, R. Pasolini, C. Sartori. Cross-domain text categorization through iterative refining of target categories representations. 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR), 2014. (awarded as Best Student Paper)

– (undergoing review process) G. Domeniconi, G. Moro, R. Pasolini, C. Sartori. Iterative refining of category profiles for nearest centroid cross-domain text classification. Submitted to Knowledge Discovery, Knowledge Engineering and Knowledge Management (Communications in Computer and Information Science series, Springer).

• Finally, the idea of a domain-independent model is introduced, predicting whether and how documents and category profiles are related from the analysis of the semantic relationships holding between their respective representative words. Text categorization is presented as a concrete application of this model, indicating how it is built and then used.

A couple of non-trivial algorithms are proposed to search indirect semantic relationships from the primitive links given by the WordNet database. Specific methods are proposed for both flat and hierarchical classification, along with results of experiments to evaluate the classification accuracy and the generality of the model.

– G. Domeniconi, G. Moro, R. Pasolini, C. Sartori. Domain- independent text categorization. 2nd Italian Workshop on Ma- chine Learning and Data Mining (MLDM.it) at the XIII Confer- ence of the Italian Association for Artificial Intelligence, 2013. – (undergoing review process) G. Domeniconi, G. Moro, R. Pa- solini, C. Sartori. Towards topic-independent text classification: a novel semantic learning method for hierarchical corpora. Sub- mitted to Journal of Machine Learning Research (Microtome Publishing).

• As an additional work outside of text mining research, a novel method for the identification of malicious traffic in a network, based on distributed data clustering, is described. In the proposed approach, different network nodes gather aggregated statistics about incoming traffic from standard SNMP agents, on which cluster analysis is then run in a distributed fashion: nodes exchange summary information about their respective clusters with neighbors, in order to maintain a shared clustering model, so that each node has knowledge of classes of traffic even if it has not detected them itself.

– (undergoing review process after major revision) W. Cerroni, G. Moro, R. Pasolini, M. Ramilli. Decentralized detection of network attacks through P2P data clustering of SNMP data. Submitted to Computers and Security (Elsevier).

Structure of the thesis

Chapter 1 gives a general introduction to automated text analysis along with its motivations, describing the most prominent examples of high- and low-level tasks usually tackled. In Chapter 2, the most recurring techniques employed in text mining are described in detail. Chapter 3 focuses on text categorization, showing some specific techniques and methods presented in the literature. Chapter 4 goes into detail on cross-domain text categorization, presenting known approaches and exposing the novel method based on iterative refining of category profiles. Chapter 5 introduces the domain-independent knowledge model and its application to text categorization, including a discussion of the search of semantic relationships between words and the presentation of flat and hierarchical variants along with experiments. Finally, Chapter 6 briefly sums up the work and indicates potential directions for further research. In the end, outside of the text mining scope, Appendix A presents the work dealing with network security based on distributed data mining.

Conventions

Throughout the thesis, cited single words (or equivalently, terms) which may appear in text documents are typeset in a sans-serif font. Names of categories of documents, introduced in Chapter 3 and generally denoting topics, are instead distinguished by small capitals.

Technical remarks

All the experimental evaluations which are part of the work presented in the thesis have been performed on virtualized hardware, with the use of up to 6 parallel processing cores. Tests have been run within a dedicated software framework implemented in Java. The following third-party tools and libraries have been extensively used:

• the WEKA software suite for machine learning [44] (http://www.cs.waikato.ac.nz/ml/weka/),

• the Java WordNet Interface (JWI) [37] (http://projects.csail.mit.edu/jwi/).

Where cited, version 3.1 of the WordNet database (described in §2.6.1) has been used.

Chapter 1

Data and Text Mining

Starting from data mining in general, this chapter gives an overview of the automatic handling of text in natural language, including the motivations for this research and the general tasks which are usually tackled.

1.1 Data mining

Information is a key resource throughout many contexts. A part of scientific research is the collection and analysis of data from experiments. Many areas of engineering are similarly based on analyzing available data to find optimal solutions to problems. In the business context, information can be gathered and used to drive strategic decisions and actions. These are only some general examples; many others exist.

A possible problem in this vision is the lack of information, generally meaning that the possessed data is not enough to have a sufficiently accurate knowledge of the phenomenon under analysis. Nonetheless, with the continuous development of processing, storage and communication means, a situation of excess of information can be easily reached. While a limited quantity of information can be feasibly analyzed by human workers, difficulties can be met in extracting useful information from an overly abundant quantity of data.

To this end, the fields of data mining and knowledge discovery are generally focused on the development of methods and techniques for the non-trivial extraction of high-level information from potentially enormous quantities of raw data [39]. Data mining is successfully applied in business, research and many other fields and is used to perform tasks such as discovering patterns in data and partitioning elements of large sets into smaller, meaningful groups according to similar characteristics.

1.1.1 Applications

Data mining techniques in general have a wide range of applications. Here only some examples are given.

In the business context, a great amount of data is or can be gathered from the operations of a company. Taking a chain of stores as an example, data about all purchases from all stores can be gathered, associated where possible with specific customers. This data, suitably stored in data warehouses, can usually be accessed through on-line analytical processing (OLAP) tools to obtain condensed summary information about different aspects of the state of the business, such as sales for each product category or in each store.

Through data mining, it is additionally possible to perform automatic analysis of this data in order to highlight non-obvious patterns within it. A typical task is market basket analysis, where past purchase data is analyzed in order to locate items which are often sold together; another useful application is customer segmentation, where the goal is to delineate distinct groups of identifiable customers with similar habits. All this high-level information can be useful for example when taking marketing decisions regarding discounts and promotional offers. Further examples in business and financial contexts are the analysis of credit applications by financial institutions or the evaluation of risks and the detection of frauds by insurance companies.

Research is also a prominent area where data mining techniques are employed. In general, they can be used to discover recurring patterns in the possibly large quantities of data obtained from observations and experiments in various fields. In genetics, specific techniques for sequence mining are used to discover correlations between DNA sequences and phenomena like susceptibility to some diseases. Another field of application of data mining is sensor networks, used in some environments to monitor conditions such as temperature or air pollution, in order to gather data about them with space and time references. Given the possibly large extension of such networks and their often limited communication capabilities, they are the natural context for the distributed data mining methods cited below.

Data mining is primarily intended to analyze structured data, but techniques for the analysis of data such as images and music (other than text) have also been developed.

1.1.2 Machine learning

While a data mining or knowledge discovery process generally involves different steps, ranging from the collection and processing of the data to be analyzed to the consumption of final results, the key step is obviously the conversion of the great and not easily understandable amount of "raw" data into compact, high-level information which can be easily interpreted. This task is usually based on machine learning techniques, which are generally used to infer knowledge models from data of heterogeneous form and nature: these models can highlight interesting features of the analyzed data and can be used to make predictions regarding data which arrives in the future.

In general, input data is composed of observations or instances, each corresponding to one item or event of a set to be analyzed: these can be, for example, transactions (purchases) of a store. Observations are distinguishable from each other by a set of features, indicating what the properties of each observation are. Seeing this data as a table of a relational database, observations correspond to records (rows), while features correspond to columns.

To be used as input for machine learning algorithms, available data generally needs to undergo some pre-processing steps to obtain a set of observations with convenient features. This can involve joining data from multiple tables of a relational database, in order to obtain in practice a single table where all information is condensed with no or minimal loss. This can lead to a large number of features, which can make the analysis cumbersome (the dimensionality problem): specific methods exist to perform selection of these features and discard those least useful in the analysis. Learning methods are generally able to work only with specific types of data: some of them, in their basic versions, can only deal with discrete values, such as integer numbers or categorical data (elements from a finite set), while others are specialized on continuous ranges of values, which can anyway be discretized in order to be treated by other algorithms.

Different machine learning algorithms exist, differing mainly in the type of knowledge which is extracted. A rather common need is to extract knowledge from known items or past events in order to somehow assign labels to them, or to new items or events, according to some specific important characteristics. This generic description can be applied to two different tasks.

• Classification is the task of labeling new items with classes from a predefined set, which represent categories into which it is convenient to subdivide such items. In general, this process involves building a training set of observations which are already correctly labeled with the same classes; a machine learning algorithm then analyzes this data to find relevant correlations between features and class labels and extracts from them a classifier, which is ideally able to correctly label any observation represented with the same features. As the data provided for training must be priorly labeled, algorithms of this type are said to be supervised. An example application of this general task is insurance fraud detection (indemnity requests are classified as either suspect or not).

• Clustering is the task of partitioning data into groups, called clusters, about which no prior knowledge is given. The goal of a clustering algorithm is, given a set of data, to identify clusters within this set so that each contains observations which are very similar to each other and significantly different from those of other clusters; possibly, new observations can then also be organized into the same clusters. Differently from classification, clustering algorithms are unsupervised, as they analyze data with no priorly assigned group labels. Among the concrete applications described above, customer segmentation is an example of a clustering problem.

For each of these two general tasks, as well as for others, different approaches can be used: which ones are the most convenient usually depends on the nature of the data under analysis. For example, in Section 3.5, some general techniques for classification will be shown, along with considerations on how suitable they are for text categorization.

1.2 Text mining

Data mining has been introduced as the extraction of information from large amounts of data. Such data can come in different forms, each requiring specific techniques to be handled. Machine learning algorithms, as discussed above, are suitable for handling structured data, where single pieces of information are organized according to some model and each of them is easily distinguishable from the others, as usually happens for example in a relational database.

A type of data which in general can't be handled directly by these algorithms is free text written in natural language, such as English or Italian. While a computer can check for equality between two elements from a finite set and can operate with numbers, it cannot trivially perform tasks on portions of text which are relatively simple for people. Examples of such tasks include recognizing the topic treated in the text or the opinion of the writer about it and making a summary of the contents.

Text mining (or text analysis) is the research field which deals with extracting useful, high-level information from text: it can be seen in practice as data mining applied to natural language text. Most text mining-related literature has its roots in data mining, machine learning and related fields, as well as in information retrieval, which generally refers to the automatic retrieval, from a vast data collection, of information relevant to some need. Most research in this field is focused on retrieval from text collections and has produced many of the techniques used in text mining.

As in data mining, a text mining process generally involves a preliminary pre-processing phase, which in this case is needed to translate raw text into a structured form, upon which statistical analysis tools such as machine learning algorithms can be applied. The typical choice is to represent documents as vectors, according to a variety of techniques spawned from information retrieval research, which can then be processed with many known methods. Coverage of these techniques will be given in Chapter 2.

1.2.1 Applications

Text mining techniques can be exploited in different contexts where an abundance of textual data is produced.

A prominent source of such data is the World Wide Web, especially referring to the so-called "Web 2.0", where each user can publish self-generated content: this includes sites such as discussion forums, blogs with reader comments and social networks. Due to the large diffusion and usage of these platforms, the amount of data produced every day by their users is huge, and a substantial part of it consists of textual data.

Texts produced and published by Web users include posts on forums and social networks, product reviews on e-commerce services and reviews of services (such as restaurants and hotels) and creative works (such as movies and video games) on specialized sites. Given its nature, this data can potentially indicate the general opinion of people about a variety of topics, persons and things, making it a highly valuable asset. For example, the automated analysis of a bunch of reviews of a particular product or service may reveal the most relevant strengths and weaknesses experienced by its users.

However, the huge amount of such data requires specialized methods to extract the needed high-level information. Text mining techniques are very useful in this context, enabling the selection of data relevant to some topic and the summarization of its content. A relatively recent and currently very active branch of text mining research is sentiment analysis, which is specialized in understanding the positive or negative polarity of a portion of text expressing an opinion, which may refer to anything (a person, a brand, a political party, etc.).

Another necessity regarding massive collections of text is their automatic organization. This necessity arises for collections of text documents which are meant to be consulted by users, such as books in a library or stories on a news site: these texts are generally grouped into categories, consistent with the topics they treat. When the number of documents is even moderately high, manually grouping them can be a cumbersome task, so many methods have been devised to automate this process.

Other than browsing documents by predefined categories, a user may want to retrieve the documents which are most relevant to a specific query decided by him or her, as commonly happens when using a Web search engine. This is a problem tackled since early research, in the specific branch of information retrieval, which will be discussed later in Section 2.1.

1.3 High-level text mining tasks

Above, a general overview of the possible applications of text mining methods has been given. In the following, a broad distinction is made among the specific problems which are tackled by text mining, along with a quick overview of the related literature. These tasks are here regarded as "high-level" as they can be applied to large collections of documents as a whole.

1.3.1 Text categorization

Text categorization (or classification) generally refers to the organization of a set of text documents into categories of a priorly defined set, often representing topics of discussion such as "science", "arts", "sports" and so on. This general description may refer to many practical problems in a wide range of complexity, such as classifying mail messages as useful or spam, labeling articles of a news site with meaningful topics and indexing web pages and sites in hierarchical directories.

In the vast majority of recent works, automatic text classification is tackled through the use of machine learning algorithms, which use training documents as input to infer classifiers able to coherently label subsequent documents. This approach requires pre-processing documents to represent them as vectors, after defining a suitable set of predictive features [107].

Text classification will be treated in detail throughout the rest of the thesis, starting from Chapter 3 (although with references to general techniques presented in Chapter 2).

1.3.2 Sentiment analysis

Sentiment analysis (or opinion mining) refers to the extraction, from a collection of text, of the general attitude expressed regarding some particular object [91, 74]. This branch of text analysis includes some different cases.

In a conceptually simple case, sentiment analysis means classifying a set of text documents (e.g. reviews of a movie) as expressing a positive or negative (or maybe neutral) opinion. To perform a finer distinction, documents may be labeled with a rating, implying different possible levels of positive or negative polarity, as expressed for example with ratings "from one to five stars". To extract even more useful information, the analysis may identify different aspects of the same object and evaluate the average polarity for each of them (e.g. reviews of a movie may declare that it has an interesting plot, but that it is poorly directed). Another possibility is to understand whether a polarity is present at all, distinguishing subjective from objective texts.

Sentiment analysis methods can in some cases be considered as specific applications of other known tasks: classifying reviews as either positive or negative can be seen for example as a text categorization problem. Anyway, specific techniques are often employed when dealing with opinions, for example the extraction of composite phrases rather than single words and the use of external knowledge indicating the polarity of each of them.

1.3.3 Text clustering

Text clustering refers to the identification of groups (clusters) of similar documents in a potentially large set of them [76, 1]. Similarly to text classification described above, the goal is to organize documents into groups; the difference is that the groups are not priorly given.

As in general data clustering, groups should be created so that each document is largely similar to the others of the same group and different from those of other groups. The techniques used are also equivalent or similar to known ones, in particular hierarchical and partitional (e.g. k-means or variants) clustering.

Clustering results may be used to automatically organize a vast collection into groups of similar documents in order to ease their browsing. In this case, to be more useful in practice, especially if their number is large, clusters should be automatically labeled with some meaningful names or descriptions of the documents in each. This task is a form of text summarization, further discussed below. The organization of documents in clusters may even be used as an indexing method to support other tasks, such as the retrieval of documents relevant to an arbitrary query.

1.3.4 Document summarization

Document summarization generally refers to obtaining a summary of contents from a document or a set thereof, in the form of keyphrases, keywords or sentences [26].

A very widespread approach is summarization by extraction, which consists in giving as summary a set of words, phrases or sentences appearing in the text, picking the ones deemed to be most representative of the contents [43]. Many works select such items by means of machine learning or using other probabilistic models, such as hidden Markov models. Another solution is summarization by abstraction, where new sentences which describe the contents of the text are generated. This potentially allows for more concise and fluently readable summaries, but constitutes a much harder challenge, due to the necessity of generating non-trivial portions of general language. The extraction of the information needed for the summary also requires advanced techniques to understand natural language.

Other than summarization itself, a non-trivial problem discussed in the literature is the evaluation of summarization methods. While for tasks like text categorization the experimental evaluation is simply based on verifying that predictions on a test set match a gold standard and is trivially automated, this method is not easily applicable to summarization. Summaries for a same document (or set thereof) given by different human experts are very likely to differ from each other and it is therefore difficult to define a unique correct solution to be used as a gold standard [77]; manual evaluation of automated summaries can be a solution, but requires considerable effort.

1.4 Natural language processing

Natural language processing (NLP) generally refers to methods and techniques allowing a computer to directly handle text in human language. While this definition also includes tasks like the automatic generation of such text, the focus here is on the reading and analysis of natural language: the general goal is to infer the underlying structure of the read text at different possible levels of detail.

The distinction made here between the techniques described in this section and the text mining tasks discussed above is mainly based on the scope: while tasks like text classification are inherently carried out on whole collections of documents to extract high-level information, techniques ascribable to NLP are generally employed to perform a more fine-grained analysis of text and many of them can be applied on single documents independently of the others. Citing [55], the difference is that text mining is "the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text", while NLP is "the attempt to extract a fuller meaning representation from free text".

According to the specific needs, NLP techniques can be useful by themselves to perform some specific tasks or can be leveraged to aid in addressing the needs presented in the previous sections. In the following, some common NLP tasks are presented.

1.4.1 Part-Of-Speech Tagging

Each word in a text belongs to one part of speech (POS), a grammatical category denoting its function in a sentence. The most common parts of speech are nouns and verbs, usually present in every phrase, while examples of other parts are articles, pronouns and conjunctions.

While for a human it is often trivial to distinguish among such different parts of speech (it is usually learned at school), there are some concerns for computers. Given the potential ambiguity of natural language, the correct part of speech for a word might not be identifiable from the word by itself and may depend in a complex way on its context. For example, the word set might be a noun ("the set of natural numbers"), a verb ("parameters must be set") or an adjective ("the table is set for two people").

Specialized algorithms for part-of-speech tagging (POS tagging) exist, analyzing sentences in their entirety to determine the correct POS for each of their words. A POS tagging algorithm must have knowledge of the grammatical structures of the specific language to be analyzed: this may be given in the form of predefined rules, but it is common to use a training set of previously tagged text from which such knowledge is automatically extracted, for example in the form of hidden Markov models [23].

Known examples of these tagged datasets, which are also used for evaluating the correctness of the methods, are the Brown Corpus and the Penn Treebank. Each of such datasets is based on a specific tag set, which defines the possible labels which can be attributed to words. While commonly only about ten grammatical categories are considered (nouns, verbs, etc.), tag sets commonly contain many more elements in order to make finer distinctions within these categories: for example, in both datasets cited above, proper nouns are distinguished from the rest and singular and plural nouns are also separated.
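As a deliberately simplified illustration of the data-driven approach (not part of the work presented in this thesis), the following Java sketch learns from a tiny hand-tagged corpus the most frequent tag of each word and applies it to new tokens; real taggers, such as the HMM-based ones mentioned above, also exploit the surrounding context. The toy corpus, the tag names and the fallback tag are invented for the example.

    import java.util.*;

    // Toy "most frequent tag" tagger: a simple baseline compared to the
    // HMM-based taggers mentioned above, which also consider surrounding words.
    // The tagged corpus below is invented for illustration purposes only.
    public class BaselineTagger {

        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        // Learns tag frequencies from (word, tag) pairs of a pre-tagged corpus.
        void train(List<String[]> taggedWords) {
            for (String[] wt : taggedWords) {
                counts.computeIfAbsent(wt[0], k -> new HashMap<>())
                      .merge(wt[1], 1, Integer::sum);
            }
        }

        // Tags a word with its most frequently seen tag (assumed fallback: "NN").
        String tag(String word) {
            Map<String, Integer> tagCounts = counts.get(word);
            if (tagCounts == null) return "NN";
            return Collections.max(tagCounts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        public static void main(String[] args) {
            BaselineTagger tagger = new BaselineTagger();
            tagger.train(Arrays.asList(
                new String[]{"the", "DT"}, new String[]{"set", "NN"},
                new String[]{"must", "MD"}, new String[]{"be", "VB"},
                new String[]{"set", "VBN"}, new String[]{"set", "VBN"}
            ));
            System.out.println(tagger.tag("set")); // VBN: most frequent tag seen for "set"
        }
    }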

1.4.2 Word Sense Disambiguation

POS tagging, identifying the components of the underlying grammatical structure of text, is a first step in extracting its meaning. Anyway, a subsequent step is to identify the specific concepts expressed by each word. Again, the difficulty of this task lies in the ambiguity of language, where many homonymies exist: a word may refer to more than one meaning. A typical example is the word bank, which as a noun may refer to either a financial institution (as in "a bank account") or a portion of land ("the bank of the river"). Similarly to how distinguishing the POS of a word generally involves identifying those of the other words in the same phrase, the correct meaning of a word is usually suggested by the ones surrounding it in the text, as in the two example phrases above, where account and river are hints about the respective senses of bank.

Word sense disambiguation (WSD) is the specific task of identifying the correct sense for each ambiguous word in a text [84]. This task generally requires the availability of an external knowledge base to be used as a sense inventory, telling which are the possible senses of each word, which may differ across different sources; anyway, research on unsupervised discrimination (clustering) of different word senses exists.

This task may be regarded as similar to POS tagging, but different methods are often employed. Intuitively, sense disambiguation is generally based on the context in which every word appears, which usually consists of the surrounding words rather than the whole sentence containing it or even the whole document. Specifically, it is common to resolve ambiguities for multiple words at once, as the most likely sense for a word often depends on the senses of the others, so that the overall most likely combination must be found.

Various approaches to WSD exist, which can be subdivided according to the type of external knowledge sources required. Some methods are based on structured knowledge bases like dictionaries: a representative example is the Lesk algorithm, where the senses of multiple words are chosen in order to maximize the number of common words occurring in the dictionary definitions of all of them [64]. As with other tasks, other methods are instead based on training a general model from a collection of text which is fully or partially tagged with correct senses [63], but methods learning from untagged text also exist [126].
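To make the overlap criterion behind the Lesk algorithm concrete, the following minimal Java sketch picks, for a single ambiguous word, the sense whose gloss shares the most words with the surrounding context. The glosses and the tokenizer are simplified stand-ins for a real sense inventory such as WordNet, and the full algorithm described in [64] jointly disambiguates multiple words rather than one at a time.

    import java.util.*;

    // Minimal sketch of the overlap criterion of the Lesk algorithm: the sense
    // whose dictionary gloss shares most words with the context wins. Glosses
    // and context are plain strings here; a real system would take them from a
    // sense inventory such as WordNet.
    public class SimplifiedLesk {

        static Set<String> tokenize(String text) {
            return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        }

        // Returns the index of the gloss with the largest word overlap with the context.
        static int bestSense(String context, List<String> glosses) {
            Set<String> contextWords = tokenize(context);
            int best = 0, bestOverlap = -1;
            for (int i = 0; i < glosses.size(); i++) {
                Set<String> glossWords = tokenize(glosses.get(i));
                glossWords.retainAll(contextWords);   // overlap between gloss and context
                if (glossWords.size() > bestOverlap) {
                    bestOverlap = glossWords.size();
                    best = i;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            List<String> glossesOfBank = Arrays.asList(
                "a financial institution that accepts deposits and handles money", // sense 0
                "sloping land beside a body of water such as a river"              // sense 1
            );
            String context = "he sat on the bank of the river watching the water";
            System.out.println("chosen sense: " + bestSense(context, glossesOfBank)); // prints 1
        }
    }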

1.4.3 Other tasks

The following are some other specialized tasks which sometimes need to be performed.

• Parsing refers to inferring the structure of each sentence, which can usually be represented as a parse tree where the root corresponds to the whole sentence, intermediate nodes to phrases and leaves to single words. For example, the sentence "the book is red" can be decomposed into the noun phrase "the book" and the verb phrase "is red", which can in turn be decomposed into the single words.

• Named entity recognition (NER) is used to find proper nouns in a text and identify which specific person, place or other entity they refer to [83]. This usually implies resolving ambiguous references, for example identifying a person of whom only the last name is given.

• Recognition of pattern-identified entities refers to spotting some specific types of information from common patterns: for example a string containing a "@" character (e.g. "[email protected]") could be tagged as being an e-mail address, while a number followed by one of some specific strings (e.g. "88 mph") may be identified as a physical quantity.

• Coreference resolution refers to locating words in a text referencing the same entity. For example, a pronoun is generally used to refer to a person or an object explicitly named in a previous phrase, constituting what is called an anaphora: coreference resolution methods must correctly link the pronoun to the noun.

Chapter 2

General Techniques and Tools for Text Mining

Research on the automated processing of text documents has been active for decades and, as stated above, the theme is today of great interest. Throughout the literature, a number of recurring techniques have proved to be effective in many situations: this chapter gives an overview of these techniques, with some focus on those employed for text classification.

2.1 Brief history

Research on the automatic analysis of text dates back several decades: the need for it was generally driven by the necessity to efficiently retrieve specific knowledge from large amounts of information. Storage of information, primarily in written form, has been performed for hundreds and even thousands of years. As the amount of information grows, retrieving specific needed pieces of it naturally becomes more difficult, so any method to improve the efficiency of search across information sources has primary importance. Since shortly after their advent, the great potential of computers in supporting the storage and retrieval of very large amounts of information has been explored.

Information retrieval (IR) is the research field dedicated to this issue, which can be dated back to the 1950s: in [78] the general idea of a statistical approach based on keywords to automate the search for pertinent information was first proposed. The typical problem studied in IR is the selection, from a possibly large collection, of the documents which are pertinent to some query given by a user. Important advances in this field were made in the 1960s, including the development of the SMART Retrieval System [105], from which the vector space model (see next section) and other important concepts emerged.

One of the needs within IR systems was the indexing of documents, to improve the query response time. In [80] a concrete probabilistic technique for the indexing of documents in categories is presented along with an experimental evaluation of its accuracy: this and successive similar works [9, 35] can be seen as early examples of text categorization methods.

Until the 1990s, the experimental evaluation of information retrieval methods was usually performed on relatively small datasets, due to the lack of availability of large collections. To support this research field, the Text Retrieval Conference (TREC) [47] was launched in 1992, with the goal of encouraging large-scale information retrieval by providing the necessary infrastructure for its evaluation.

2.2 Bag-of-Words representation

The main hindrance in the automated processing of text is its unstructured form: normally, a computer sees a text as a sequence of characters, usually without any additional information about how words, sentences and sections are divided and what they represent. The grammatical structures which are usually easily recognizable by human readers can't be trivially understood by computers, especially due to semantic and even syntactic ambiguities. While NLP techniques can aid in discovering the grammatical structure and even the meaning of the analyzed text and in building a structured representation of it, applying them to large volumes of data may not be feasible, especially in contexts like the online classification of documents where responses must be given quickly.

One solution to this problem is to build a somewhat approximated representation of the text under analysis, which can be extracted with limited complexity but retains enough information to handle the task to be accomplished.

The basic elements constituting a text document are words: each word in a text has a meaning and constitutes a building block for phrases and complete sentences. The single words of a text can be extracted in a relatively trivial way, by splitting the sequence of characters constituting the text on spaces and punctuation signs: this process is known as tokenization.

The extraction of single words is usually the first step for NLP techniques, which are then able to detect sentence boundaries and find the function and the meaning of each word. Anyway, even without using such techniques, it is trivial to identify the set of distinct words found in a document and to obtain a count of occurrences for each of them: this information is used to represent the document as a bag of words (BoW). A bag of words is in practice the multiset (a set which can contain multiple occurrences of elements) of the words contained in a text document, without considering their order and excluding other elements like punctuation marks. While this representation implies a significant information loss, it proves to be effective for accomplishing many practical tasks with good accuracy. An example is text classification, where the most recurring words in a document can indicate quite straightforwardly which topic is treated in the document, given that representative words of the categories of interest are known.

To allow standard machine learning algorithms to handle a collection of documents, they must be represented as vectors using a global set of features, according to the vector space model (VSM). A set of bags of words can be translated straightforwardly into a set of such vectors, considering the union of all distinct words as the set of features and representing each document as a vector with the count of occurrences, or some other kind of weight (as explained in the following), of each word. In this view, each document is actually seen as a vector in a multidimensional space, where each dimension corresponds to one of the words appearing throughout the whole collection of documents.

The whole set of document vectors can be seen as a document-term matrix, where rows and columns correspond to documents and words (terms) respectively. A matrix built in such a way is generally largely sparse (many entries are 0), as usually the global number of distinct words in a collection of documents is very high, but each document only contains a small part of them.
While what has been described here is how to build a basic, "raw" representation of a collection of documents as bags of words and vectors, countless variants of this scheme exist to improve the accuracy of the representation and address the high dimensionality problem. As explained in the following, features other than words can be considered and a subset of them can be selected to represent documents.
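As an illustrative sketch of the process just described (not the framework actually used in this thesis), the following Java fragment tokenizes a tiny collection, builds a bag of words for each document and assembles the corresponding document-term matrix with raw occurrence counts; a dense array is used for brevity, although a sparse structure would be preferred in practice.

    import java.util.*;

    // Minimal sketch of the bag-of-words construction described above: each
    // document becomes a map from word to occurrence count, and the collection
    // is turned into a document-term matrix (dense here only for brevity).
    public class BagOfWords {

        static Map<String, Integer> toBag(String document) {
            Map<String, Integer> bag = new HashMap<>();
            for (String token : document.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    bag.merge(token, 1, Integer::sum);  // count occurrences, ignoring order
                }
            }
            return bag;
        }

        public static void main(String[] args) {
            List<String> documents = Arrays.asList(
                "The cat sat on the mat.",
                "The dog chased the cat."
            );

            // Global feature set: union of all distinct words in the collection.
            List<Map<String, Integer>> bags = new ArrayList<>();
            SortedSet<String> vocabulary = new TreeSet<>();
            for (String d : documents) {
                Map<String, Integer> bag = toBag(d);
                bags.add(bag);
                vocabulary.addAll(bag.keySet());
            }

            // Document-term matrix: one row per document, one column per word.
            List<String> terms = new ArrayList<>(vocabulary);
            int[][] matrix = new int[documents.size()][terms.size()];
            for (int i = 0; i < bags.size(); i++) {
                for (int j = 0; j < terms.size(); j++) {
                    matrix[i][j] = bags.get(i).getOrDefault(terms.get(j), 0);
                }
            }

            System.out.println(terms);
            System.out.println(Arrays.deepToString(matrix));
        }
    }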

2.2.1 Cosine similarity

In many circumstances, some of which will be shown in the following, it is required to compare one bag of words, in the form of a vector, with another one in order to evaluate their similarity. In a vector space R^n, it is generally possible to define a similarity function R^n × R^n → [0, 1] which maps two vectors to a numeric measure of how similar they are. Similarity functions are somehow the inverse of the better known distance functions, which map two vectors to a scalar value in [0, +∞): conversion from the distance to the similarity of two vectors can be done by any function mapping 0 to 1 (equal vectors) and very large values to very small ones (distant vectors), such as similarity = exp(−distance).

The distance measure most commonly employed in vector spaces is the euclidean distance. However, considering vectors representing text documents where values represent the frequency of each term, the similarity between two documents should depend on the proportions between the different values rather than on their absolute values. For example, two documents dealing with the same topic but with very different lengths (say 100 words versus 10,000) would generally be represented by two vectors similar in their orientation in the space, but different in their length (or magnitude), thus their euclidean distance would be misleadingly large.

For this reason, a suitable solution is to evaluate the angle between two vectors without considering their length: the smaller the angle, the more similar the documents. To obtain a similarity measure constrained in the range between 0 and 1, the cosine of such an angle is considered, which is equal to 1 for a null angle and is non-negative if all the vector components are so. This measure, called cosine similarity, can be computed from the components of two generic vectors a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) as follows.

    \cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
               = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}

As shown in the formula, the cosine similarity is in practice the dot product of the two vectors normalized by their lengths.

[Figure 2.1 – Comparison between euclidean distance and cosine similarity: points a, b and c in a two-dimensional space with axes x1 and x2.]

As an example, the point a in Figure 2.1 is closer to b than to c by euclidean distance (‖b − a‖ < ‖c − a‖), but has a higher cosine similarity with c than with b, because the angle between the two is smaller (cos(a, c) > cos(a, b)). Cosine similarity effectively expresses how much two bags of words are similar in the distribution of the words they contain, thus constituting a good estimate of how much the topics they discuss overlap.
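The following short Java sketch computes the cosine similarity exactly as in the formula above and illustrates the length-invariance just discussed with two vectors having the same proportions but very different magnitudes (the example values are invented).

    // Minimal sketch of the cosine similarity between two term-weight vectors,
    // following the formula above: dot product divided by the product of norms.
    public class CosineSimilarity {

        static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            if (normA == 0.0 || normB == 0.0) return 0.0; // empty documents treated as dissimilar
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // A document and a ten-times-longer one with the same word proportions:
            // their euclidean distance is large, but their cosine similarity is 1.
            double[] shortDoc = {3, 1, 0, 2};
            double[] longDoc  = {30, 10, 0, 20};
            System.out.println(cosine(shortDoc, longDoc)); // prints 1.0 (up to rounding)
        }
    }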

2.3 Extraction of features

Given the need to reduce text documents to vectors, it is important to define a set of predictive features which are effective in representing the original contents of the document. The trivial choice presented above is to consider words as they appear in the document, without considering their form or meaning. Variants of this approach aim to overcome this limitation by exploiting the semantics of words, possibly using external knowledge. Other approaches consider features other than single words, such as n-grams.

2.3.1 Lemmas

Given a dictionary of a certain language (e.g. English), each entry corresponds to one headword or lemma, which is the canonical form representing a possibly wider set of words corresponding to the same entry. For many of such lemmas, inflected forms also exist, which express the same concept with differences in number, gender, tense or other categories. For example, waiter and waiters refer to the same dictionary entry, having waiter as its lemma; similarly, served and serving are both inflected forms of (to) serve.

In the basic case presented above, where words are considered as meaningless tokens, different inflected forms of the same lemma are considered as distinct words. A different approach would be to represent each word by its lemma, without distinguishing its different forms. While this theoretically implies a loss of information, the advantage is a noteworthy reduction of the number of features which, in many cases, does not lead to a practical loss of accuracy. Taking again text classification as an example, it generally matters to know which lemmas appear and their frequency, while the specific form in which they appear (e.g. how many singular and plural forms of each noun there are) has negligible importance.

The application of this approach, however, is not straightforward, as a computer must be able to infer, for each distinct word encountered in a set of documents, its corresponding lemma. This process, known as lemmatisation, requires prior knowledge of the specific language under analysis. Such knowledge can be very complex, including a complete set of general rules addressing common cases (e.g. the -s termination for plurals in many nouns and for the singular third person in verbs) and enumerations of specific cases and exceptions to such rules (e.g. mice being the plural of mouse).
2.3.2 Stems

In the most general sense, a stem of a word is a part of it. Here, stem is used to indicate the part of a word which is common to all its inflected variants. Some words have their stem equal to their lemma, but this does not hold for every word. Recalling the examples above, waiter and waiters have waiter as their stem, which is also their lemma; however, served and serving have serv as stem, which is a truncation of the lemma (to) serve. In many cases, such as this latter example, the stem of a word, contrarily to the lemma, is not itself a word of the language (although a human reading a stem should generally be able to infer the lemma of the original word).

Similarly to lemmatisation, stemming is the process of extracting the stem of an arbitrary word. Also in this case, the process is dependent on the language and requires proper knowledge of it. However, stemming a word is usually simpler than extracting its lemma: the stem is always contained in the word itself and can usually be extracted by relying on a reasonably simple set of rules.

Given the efficiency of stemming algorithms, also known as stemmers, in the many situations where it is not necessary to have complete words as features, stems are used instead to efficiently identify groups of similar words. Several stemming algorithms have been proposed, some dating back decades. The most commonly used among such algorithms is the Porter stemmer [97]. This algorithm, like others, is based on suffix stripping: a number of rules divided into several steps is defined, so that each input word has its suffix removed or substituted according to the rules whose preconditions match the word. For example, the -ing suffix is removed only if the resulting stem contains at least one vowel, so that serving is stemmed into serv, but sing is instead left as is.
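A minimal sketch of suffix stripping with the Porter stemmer, assuming NLTK is available; it reproduces the serving/sing example above.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print(stemmer.stem("serving"))   # serv  (suffix removed, resulting stem contains a vowel)
    print(stemmer.stem("sing"))      # sing  (rule precondition not met, word left as is)
    print(stemmer.stem("waiters"))   # waiter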

2.3.3 n-grams and phrases

An n-gram is in general a sequence of n consecutive elements extracted from a wider sequence: common specific cases are n = 2 (bigrams) and n = 3 (trigrams). Elements grouped in n-grams are generally either letters or words. For example, considering letters, from the word example the bigrams ex, xa, am, mp, pl, le can be extracted. Some works use n-grams as features in place of or in addition to single words, sometimes mixing different lengths. n-grams of letters are not practical to use in place of words, as features usually serve to identify the topic discussed in a document and groups of letters (especially short ones) are generally poorly indicative of what the document discusses. Instead, sequences of letters are usually employed in particular tasks where they can actually be effective as predictive features, such as classifying documents by language [16]: for example, English texts can be recognized by the high frequency of some bigrams such as th and er. n-grams of words as features are instead more similar to single words, as they can be informative of the topic discussed in the text. The extraction of bigrams and trigrams can be useful to represent recurring expressions, such as text categorization or word sense disambiguation. A field where n-grams of words are currently particularly used is sentiment analysis, where some sequences of words expressing a specific polarity would not be accurately represented as single words [85].
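The extraction of n-grams of letters or words can be sketched in a few lines of Python; the snippet below is only illustrative and mirrors the example word used above.

    def char_ngrams(word, n):
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def word_ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(char_ngrams("example", 2))
    # ['ex', 'xa', 'am', 'mp', 'pl', 'le']
    print(word_ngrams("word sense disambiguation is hard".split(), 2))
    # [('word', 'sense'), ('sense', 'disambiguation'), ('disambiguation', 'is'), ('is', 'hard')]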

Some works also deal with the use of phrases as features, which may refer either to generic sequences of words which often recur in text – which reflects the above definition of word n-grams – or to syntactic phrases, which are those defined by the language grammar, having a meaning on their own. In [65] the use of syntactic phrases for text categorization is evaluated; it is observed anyway that phrases "have inferior statistical qualities" with respect to words, because distinct phrases are far more numerous and many of them with similar meanings end up being split across many features.

2.3.4 Concepts

The types of features described above are extracted in a relatively trivial way from words, with minimal processing of them. While words can generally give a good indication of the contents of the document, they are nevertheless not a perfect representation. A cause of this are the properties of synonymy and polysemy of natural language, implying that a meaning may be expressed by more than one word and that a word, taken out of its context, may have more than one possible meaning. An ideal solution would be to represent a document with the concepts expressed in it, rather than (or in addition to) the possibly ambiguous words.

Different methods exist to extract features of this type. Two general types of approaches can be distinguished: concepts can be extracted statistically from the documents or obtained from an external knowledge base. In the first case, the documents to be processed (or in some cases an external set of different documents) are analyzed in order to identify potential concepts according to co-occurrences between words in documents. For example, a set of words which (almost) always appear together in documents is likely to represent a unique meaning or very related ones: this set could be reduced to a single feature with no significant loss of information about the contents of documents. Some methods to extract latent semantic knowledge from collections of documents are discussed in Section 2.5. The other possible approach is to use some sort of external knowledge base to obtain information about all the existing concepts and to correctly map words found within documents to them. Different sorts of knowledge bases can be used for this purpose, as discussed in Section 2.6, while Section 5.2 summarizes some concrete methods for text categorization making use of external knowledge bases to extract semantic features from documents.

2.4 Term selection and weighting

As seen above, there is more than one possibility to extract predictive features from a collection of text documents, with single words (or stems thereof) being a common choice given the good balance between accuracy and efficiency. However, an issue to be generally faced with any of these methods is the high number of global features, which potentially leads to poor performance in subsequent steps of the analysis. In general data mining, this is known as the dimensionality problem. What usually happens in practice is that each single document contains a manageable number of words or other features (say hundreds), but the union of all documents in collections of typical sizes yields a very high number of distinct terms. In this phase it is generally useful to apply techniques for feature selection to reduce the global number of distinct features to a smaller, convenient set. This step, if correctly tuned, can enhance the efficiency of the analysis with small or negligible loss of accuracy, which might even be improved in some cases.

Once a set of features has been defined, each document must be represented as a vector with one value for each of these features: a criterion to assign these values is necessary. Term weighting is the general task where, within a set of documents D and given a fixed set of features T, a weight is decided for each term t ∈ T in each document d ∈ D.

In the following, some recurring methods for selecting and weighting terms are presented.

2.4.1 Basic word filtering

Referring here to the case where single words are used as features, there are a few trivial expedients which can be exploited while parsing documents to reduce the number of features.

An important aspect of the bag-of-words approach is that, while information about the presence and usually the frequency of distinct words is maintained, their position and their function in the text is lost. As words are stripped from their context, some more information becomes useless and may be discarded as well. For example, while the first word of each sentence in a text is generally written with an uppercase initial, when the text is reduced to a bag of words it is no longer important to distinguish sentence-opening words with an uppercase letter. This is the main motivation for case-folding, i.e. turning all letters of all words to the same case (conventionally lowercase). This reduces the number of distinct words because, for any word that appears both in lowercase and with an uppercase initial (and possibly all uppercase or in other forms), only one feature is extracted rather than two (or more).

As each word is taken without considering its context, it should have a meaning by itself to be somehow informative about the document it is found in. The words which best represent the content of a document by themselves are generally some nouns, verbs and adjectives which express the most relevant concepts of the text. On the other side, words like articles, pronouns and prepositions are quite common and do not express concepts by themselves, so they are less likely to be useful as features. An example is the definite article the: this is likely to appear multiple times in every document of a collection in the English language, but it does not mean anything by itself and does not help in representing the semantic content of the document. For this reason words like these, called stopwords, are usually filtered out while documents are parsed. The removal of stopwords requires the ability to recognize them: generally a language-specific list is given. There is no unique list of stopwords for English or any other language, but all available lists are similar and using one rather than another usually has negligible effects on the results.
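A minimal sketch of these two filtering steps follows, with a deliberately tiny, hypothetical stopword list (real lists contain a few hundred entries).

    import re

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}   # illustrative subset only

    def filter_words(text):
        words = re.findall(r"[a-z]+", text.lower())          # case-folding and tokenization
        return [w for w in words if w not in STOPWORDS]      # stopword removal

    print(filter_words("The waiter served the dish in a hurry."))
    # ['waiter', 'served', 'dish', 'hurry']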

2.4.2 Feature selection

As the dimensionality problem affects data analysis in various fields, among which text mining is a prominent example, general techniques to automatically select an optimal subset of features have been devised. General feature selection algorithms are usually based on statistical observation of the data, regardless of what the features actually represent. Generally, analyzing a set of vectors, the features which can be discarded without affecting accuracy are those whose variability is too low (a feature with a constant value being the limit case) or too high. Also, if the values of two or more features are highly correlated with each other, only one of them is useful. These are unsupervised approaches, applicable in any learning task.

A trivial selection technique of this type is based on document frequency: features appearing in fewer than a set number N of documents are removed. This serves to discard some very rare features which can usually be ignored without significant loss of information, such as rather uncommon terms or words with typos. This threshold N is usually very small, roughly 0.1% of the number of documents, but even with such values a consistent number of unimportant features can usually be removed.

If additional information about the documents under analysis is available, it can be exploited to improve the efficacy of the feature selection process. Specifically, if documents are organized in categories, features can be evaluated according to their correlation with them. This possibility will be discussed in §3.4.2, when dealing with text classification.

2.4.3 Term weighting

Once a set of features to represent documents as vectors has been decided, the vectors themselves must be computed: different term weighting schemes exist to determine the values to be assigned. The weight of each term in each document should denote its importance in representing the contents of the document itself. The most basic weighting schemes just follow this intuition, so that the weights of any document are assigned only on the basis of that document itself: these are referred to as local schemes. The most common schemes are briefly discussed below.

The most basic scheme is binary weighting, which assigns 1 to any term present in the document and 0 to the rest of the terms in T. This method is necessary with learning algorithms which only accept the presence of features and not weights, like some forms of probabilistic classifiers (see §3.5.1). Despite the fact that it discards the information about the number of occurrences of each term, this method often yields practical results very similar to those obtained with other weights.

If the number of occurrences #(t, d) of each term t in any document d is instead to be considered, the simplest solution is to use this number itself as the weight of each term: this value (and the weighting scheme using it) is referred to as term frequency (tf). Intuitively, multiple occurrences of a term in a document are equally or more important than single appearances (this is also known as the tf assumption), so giving higher weights to terms appearing more often in the document may help to better characterize them. A commonly employed variation of this scheme is logarithmic term frequency, where the number of occurrences is dampened using the logarithm function, with 1 added to the argument to maintain weight 0 on absent terms. This method can be employed to avoid overweighting the most recurring terms at the expense of other terms which are less frequent but still important. In order to avoid giving more bias to long documents, the term frequency may be normalized by dividing all counts of occurrences by their sum. Although this simple normalization scheme is not very frequently used, it will be applied in experiments described later in the thesis.

The weighting schemes presented above compute the importance of each term locally to each document, without considering the larger context which is the collection of documents. Global weighting schemes measure instead the relevance of a term across the whole collection. The by far most common global weighting scheme is inverse document frequency (idf), for which various similar formulations exist. The idf of a term t is usually computed as the logarithm of the inverse ratio of the number of documents n_t where t appears with respect to the total number N of documents in the collection. In another variant, commonly labeled as probabilistic, the total number of documents is substituted with the number of those where the term does not appear. This type of scheme follows the hypothesis (also known as the idf assumption) that terms appearing in many documents are not more important than rarer terms. Intuitively, the presence of a term appearing in many of the documents is not very informative: it could for example be strongly related to the topic treated in the whole collection and thus fail to be a distinctive feature for a limited amount of documents.
In summary, local and global weighting schemes, of which a short list is given in Table 2.1, assign convenient weights to terms in documents according to their importance measured respectively in the single documents and in the whole collection. These two aspects are complementary, but both are important to consider when creating the vectors for each document. For this reason, in most works, the adopted weighting scheme is the product of a local and a global one, so that each term is weighted with two factors denoting its importance in the two different aspects. The most common choice is to combine the basic term frequency with the inverse document frequency, to obtain the well known tf.idf weighting scheme.

tf.idf(t, d) = tf(t, d) · idf(t) = #(t, d) · log( |D| / |{δ ∈ D : #(t, δ) > 0}| )

Table 2.1 – Summary of local and global term weighting schemes

Scheme                                   Formula

Local (weight within document)
  binary                                 bin(t, d) = min(1, #(t, d))
  term frequency                         tf(t, d) = #(t, d)
  logarithmic term frequency             logtf(t, d) = log(1 + #(t, d))
  normalized term frequency              ntf(t, d) = #(t, d) / Σ_{τ∈T} #(τ, d)

Global (weight within collection)
  inverse document frequency             idf(t) = log(N / n_t)
  probabilistic inverse document freq.   idf_prob(t) = log((N − n_t) / n_t)

#(t, d) = number of occurrences of term t within document d
n_t = number of documents where t appears
N = total number of documents

To maintain final values in vectors in a [0, 1] range and in order to avoid giving more bias to longer documents, sometimes the vector of each document is normalized according to some scheme, as suggested above for the term frequency. The most common one is cosine normalization, which in practice transforms the vector so that its length (or magnitude) is brought to 1 without changing its direction, which is often the actual important property of a bag of words, as discussed in §2.2.1.

w_normalized(t, d) = w(t, d) / √( Σ_{τ∈T} w(τ, d)² )
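Putting the pieces together, the following sketch computes cosine-normalized tf.idf vectors for a tiny, hypothetical tokenized corpus; note how the term ball, appearing in every document, receives idf (and hence weight) 0, in line with the idf assumption.

    import math
    from collections import Counter

    docs = [["baseball", "bat", "ball"],
            ["hockey", "stick", "ball"],
            ["baseball", "pitcher", "ball", "ball"]]   # toy corpus, already tokenized

    N = len(docs)
    vocabulary = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}   # document frequencies

    def tfidf_vector(doc):
        counts = Counter(doc)
        weights = [counts[t] * math.log(N / df[t]) for t in vocabulary]   # tf.idf
        norm = math.sqrt(sum(w * w for w in weights))
        return [w / norm if norm else 0.0 for w in weights]               # cosine normalization

    print(tfidf_vector(docs[2]))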

Supervised term weighting schemes have also been proposed, which are based on a known organization of documents in categories and are therefore mostly used in text categorization: a couple of related works are cited in §3.4.2 together with supervised feature selection.

2.5 Extraction of latent semantic information

Ideally, features extracted from documents should uniquely represent high-level concepts and topics in order to give an accurate representation of their contents. In practice, words or other features are generally far from this ideal condition, due to the fact that several words are frequently used in each topic and each word may be used in more than one topic. From the point of view of the language, words can have multiple meanings (polysemy) and the same concept can be recalled by more than one word (synonymy). Additionally, some words may have different meanings but be somehow related.

These properties sometimes make it difficult to correctly spot similarities and correlations within data. For example, two documents may discuss the same topic using different words, so that they appear to be unrelated to each other. However, the relevant words of both documents may significantly occur simultaneously in either few or many other related documents: this information might serve to infer that all those words are somehow semantically related, and thus that the two example documents are potentially related despite differing significantly in the words they contain.

Solutions exist based on the use of external knowledge bases, which are described in the following section. However, another possible approach is to analyze the available documents to recognize recurring dependencies between words, which are usually indicative of relatedness between them. These techniques to extract latent semantic information from documents are based on statistics and probability and are used across different text mining and general information retrieval applications.

2.5.1 Latent semantic analysis

Latent semantic analysis (LSA) [30], also known as latent semantic indexing (LSI), is a general technique to analyze relationships between documents and terms in a collection, extract high-level concepts and transform the representation of documents according to the identified relationships. In summary, LSA transposes the documents of a collection and the terms therein into a latent feature space, where dimensions ideally correspond to high-level concepts or components. Therefore, each document is represented as a weighted mix of such components, while each term may similarly be related with different degrees to more than one concept. This scheme is very similar to principal component analysis, which is used to map a vector space with possible correlations between dimensions to another space without such correlations.

Given a collection with n documents and m distinct terms extracted from them, in order to apply LSA, an m × n term-document matrix X must be built, with each cell x_{i,j} containing the weight of term t_i in document d_j. Columns of X correspond in practice to bags of words for documents, with terms weighted according to some scheme: those presented above in §2.4.3 can be used, although different schemes based on entropy are often effective in this case. Within this matrix, the dot product (or cosine similarity) can be computed between two rows (terms) or two columns (documents) to estimate their correlation. A whole correlation matrix for terms or documents may be obtained by computing XXᵀ or XᵀX respectively. On the term-document matrix, singular value decomposition (SVD) is applied, a mathematical technique which computes a decomposition of the original matrix X into three matrices.

X = U Σ Vᵀ

Of the resulting matrices, U and V are orthogonal matrices sized m × r and n × r respectively, while Σ is an r × r diagonal matrix containing the singular values. The rationale is that each of the r singular values corresponds to one of the aforementioned high-level components traced in the collection of documents and denotes how relevant it is throughout the collection. The singular values are sorted along the diagonal of Σ in decreasing order, so that the ones coming first are related to the most relevant components. This allows to easily cut off less important components down to a number k ≤ r, simply by removing the relevant rows and columns in the matrices. This reduction potentially allows to remove noise in the data, which can be constituted for example by terms or groups thereof appearing in few documents and poorly related to other ones.

Once such a value k is set, it can be used to build an approximated version of the original term-document matrix X, by multiplying the three reduced matrices: the resulting matrix X′ will have its rank reduced from r to k. X′ is structurally identical to X (its rows and columns are representative of the same terms and documents as X), but term weights are corrected so that noise is removed and evident correlations between terms (or between documents) are accounted for. For example, if two terms t_a and t_b frequently occur together in documents, a document containing only t_a of the two will anyway have a weight for t_b higher than zero (and vice versa).

From the reconstructed matrix X′ or directly from the truncated matrices used to compute it, similarity between terms and between documents can be computed according to the corrected weights, which will generally differ from the corresponding values computed from the original matrix. In the common case where the documents most related to a query must be found, using the common approach where the query is represented like a document to be compared to known ones, the query should first be mapped into the latent space to undergo the same correction of values: this procedure is known as fold-in. In the latent space, related documents which do not contain the exact words of the query but strictly related ones can be found.
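In practice LSA can be implemented with a truncated SVD; a minimal sketch using scikit-learn (assumed available, with a purely illustrative corpus and k) follows. Note that scikit-learn builds a document-term matrix, i.e. the transpose of the X used in the text.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = ["the pitcher threw the ball",
              "the batter hit the ball",
              "the band played a song",
              "the singer sang a new song"]

    X = TfidfVectorizer().fit_transform(corpus)   # weighted document-term matrix
    lsa = TruncatedSVD(n_components=2)            # keep only k = 2 latent components
    docs_latent = lsa.fit_transform(X)            # documents mapped into the latent space
    print(docs_latent.shape)                      # (4, 2)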

2.5.2 Probabilistic models

The LSA technique described above is based on singular value decomposition, which assumes a normal distribution of weights in the term-document matrix: this modeling is not fully accurate, although particular weighting schemes can make it work better. For this reason, improved techniques have been proposed, based on different probability models, in particular on multinomial models, which better represent the occurrences of words in documents.

A first extension of the basic LSA technique has been probabilistic latent semantic analysis (PLSA) [49], which considers a probabilistic model based on a hidden class variable z ∈ Z, corresponding to the components (dimensions of the latent space) of LSA. In practice, each word and each document under analysis are considered to have affinities to these latent classes, which ideally represent topics, each with specific recurring words. From this, the occurrence of a word w in a document d is seen as a mixture of these classes; in another parameterization, a document is seen as a mixture of classes, which are in turn seen as mixtures of words.

P(d, w) = Σ_{z∈Z} P(z) P(d|z) P(w|z) = P(d) Σ_{z∈Z} P(z|d) P(w|z)

The conditional parameters of the model are estimated by the Expectation Maximization algorithm. While PLSA constitutes a better probabilistic model than the one assumed in LSA for a corpus of existing documents, it has the shortcoming of not providing a generative model able to represent any document, even outside of the corpus.

A further extension of PLSA, latent Dirichlet allocation (LDA) [5], addresses this issue by providing a probabilistic model for the generation of documents: it assumes the mixture of topics z conditioning the words w of a document to have a Dirichlet-distributed prior θ, which is parameterized by a vector α. The other parameter of the model besides α is the per-topic word distribution β. This gives the following joint distribution for a document, where N is the number of words (assumed constant here for simplicity).

P(θ, z, w | α, β) = P(θ | α) · ∏_{n=1..N} P(z_n | θ) P(w_n | z_n, β)

Further improvements and extensions of these models exist, such as the model of [70], which captures arbitrary correlations between topics which can be represented as a directed acyclic graph.
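As an illustration, the LDA model can be fitted with off-the-shelf implementations; the sketch below uses scikit-learn (assumed available) on a toy corpus and recovers per-document topic mixtures θ.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = ["the pitcher threw the ball to the batter",
              "the goalkeeper blocked the shot",
              "the band recorded a new song",
              "the singer performed an old song"]

    counts = CountVectorizer(stop_words="english").fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)   # 2 latent topics
    doc_topics = lda.fit_transform(counts)   # per-document topic mixtures (theta)
    print(doc_topics.round(2))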

2.6 Linguistic and semantic knowledge bases

Techniques described in the previous section are used to infer semantic information about the words present throughout a collection of documents under analysis, based on statistical co-occurrence within the collection itself. Obtaining information about semantic relatedness between words from the collection itself can be advantageous, as the words for which information is obtained are exactly those for which it is needed, independently from the domain. However, the extracted information may happen to be incomplete or even erroneous, due to statistical anomalies in the collection. Potentially, more accurate knowledge can be obtained from an external base of knowledge about the language under analysis. In general, a sufficiently complete knowledge base should be used, in order to grant coverage of the information needed in the domain of the collection under analysis.

Different resources have been used as knowledge bases for text mining applications, including some which were not initially built with this specific purpose. At a very high level, two types of external knowledge bases can generally be distinguished.

• Structured knowledge bases are specifically built to be machine-readable resources employable in specific text mining and natural language processing tasks. These can be subdivided according to the type of information they provide, roughly ranging from dictionaries or lexicons to ontologies. While the former are basically lists of words, possibly with associated descriptions, the latter are usually large sets of concepts with various types of links across them.

• Unstructured knowledge bases are generic collections of data, usually in the form of text primarily built for consultation by humans, from which structured knowledge can somehow be extracted. Sometimes, these consist of collections of documents different from those under analysis, from which additional knowledge can be extracted. Some of these collections are also partly structured, including for example small structured parts and/or links between documents: a prominent example is Wikipedia, which has often been used as a source of semantic knowledge.

The proper use of such external knowledge bases allows to more or less accurately "understand" words and assign them meanings, which can possibly be linked with each other. This allows for example to represent a document by the concepts it contains, rather than by possibly ambiguous words, as discussed in §2.3.4. An example of a task for which the use of knowledge bases is often important is word sense disambiguation (§1.4.2), as they provide possible senses for each word and can also be used to support the task of disambiguating between them. Methods for text categorization relying on these resources also exist: some of them will be summarized in Section 5.2.

2.6.1 WordNet

Throughout the text mining literature, one of the most employed knowledge bases is WordNet, which has subsequently been extended in many forms. WordNet is a lexical database of the English language, created in the Cognitive Science Laboratory of Princeton University [81]. WordNet stores more than 100,000 English lemmas, divided into four parts of speech (POSs, see explanation in §1.4.1): nouns, verbs, adjectives and adverbs. Each lemma appears in one or more synsets (contraction of synonym sets), also divided into the same four POSs: each synset represents a concept described by a human-readable gloss and includes one or more lemmas which express that same meaning (which are thus synonyms), possibly with only slightly different acceptations. The possibility for one synset to include more than one lemma reflects the synonymy property of the language, while the fact that one lemma may appear in more than one synset reflects polysemy.

Other than lemmas and synsets, an important part of WordNet is constituted by the relationships existing between them. Two types of relationships are defined within WordNet: each semantic relationship involves two synsets, while each lexical relationship involves the specific instances of two lemmas appearing in two distinct synsets, which in WordNet are formally referred to as words. Relationships are represented internally as pointers, which start from a synset or word and point to another one which is somehow related. Each pointer has a type, which indicates in which way the two objects are related; 29 different pointer types exist in WordNet. The following are the most recurring types of pointers which can be associated to a synset X.

• A hypernym is a synset Y which represents a more general concept with respect to X, whose meaning is included in Y. This relationship can be intuitively expressed with "X is a (type of) Y", as in for example "car is a vehicle" or "dog is a type of animal". Hypernymy exists between nouns and between verbs and creates a subsumption hierarchy of concepts in the form of a directed acyclic graph, where most synsets have exactly one hypernym (so almost tree-like). Nouns are organized into 25 hierarchies (animal, food, location, feeling, . . . ), merged at the top levels by generic concepts (unique beginners), having root in the synset entity.

• A hyponym is a synset Y representing a more specific concept than X. Hyponymy is exactly the inverse relationship of hypernymy: Y is a hyponym of X if and only if X is a hypernym of Y.

• An instance hyponym is a synset Y which represents a single specific object which is of the type represented by X. For example, Sun is an instance hyponym of star.

• An instance hypernym is the type of entity of which X is an instance. Instance hypernymy is the opposite of instance hyponymy.

• A meronym is a synset Y representing an object which is somehow part of X. Three types of meronymy are distinguished.
  – A part meronym is a constitutive part of the object X, necessary to obtain it (e.g. wheel is a part meronym of car).
  – A member meronym is a member of the set X (e.g. player is a member meronym of team).
  – A substance meronym is an object made of X as material (e.g. glass is a substance meronym of bottle).

• A holonym is a synset Y representing an object of which X is a part. Holonymy is the opposite of meronymy and has the same three distinctions (part holonymy, member holonymy, substance holonymy).

The following are instead the most recurring types of pointers which a word x may have.

• An antonym is a word expressing an opposite meaning with respect to x. Antonymy is a symmetric relationship. For example, male and female are antonyms of each other.

• A derivationally related form is a word with the same root or stem. This also is a symmetric relationship. For example, fish (the animal) and fishing are derivationally related.

From WordNet, many extensions and related projects have been developed. Some of such extensions consist in augmentations of the original data with additional information. An example of this is SentiWordNet, where weights are assigned to each synset indicating whether it expresses positivity (e.g. wonderful), negativity (e.g. awful) or objectivity (e.g. complex) [3]: this is useful in opinion mining tasks. Other projects are focused instead on creating equivalent databases for different languages, possibly with links across them. EuroWordNet is a system of wordnets in multiple languages, interconnected by an Inter-Lingual Index [117]. Among examples of projects focused on single languages, MultiWordNet is a database for the Italian language aligned with the original English one [95].
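For illustration, synsets and the relationships just described can be navigated through the NLTK interface to WordNet (assuming NLTK and its WordNet data are installed); the specific synsets returned depend on the WordNet version.

    # requires: pip install nltk  and  nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog")[0]                   # first synset for the lemma "dog"
    print(dog.definition())                      # human-readable gloss of the synset
    print([l.name() for l in dog.lemmas()])      # synonym lemmas included in the synset
    print(dog.hypernyms())                       # more general concepts
    print(dog.hyponyms()[:3])                    # some more specific concepts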

Other projects aim to interlink WordNet data with other resources and to integrate it into the Linked Open Data network constituting the Semantic Web. A notable example is BabelNet, a multilingual ontology created by automatically linking WordNet to Wikipedia. The original WordNet, being already widely used in many tasks, has been chosen as the reference knowledge base for the text categorization method based on semantic relationships between words presented in Chapter 5.


Chapter 3

Text Categorization

Here the general task of text categorization is described, distinguishing its variants and reporting some known solutions from relevant literature.

3.1 Problem description

Generally speaking, text categorization (or classification) is the task of labeling each text document of a set with categories (or classes, as they are commonly named in general data classification). Categories assigned to each document must be a subset of a priorly defined set of known categories, in contrast e.g. to text clustering, where no such set is given. Throughout the thesis, the global set of documents of a particular collection is usually denoted with D, while C denotes the set of known categories with which documents can be labeled. Given this notation, the general goal of text classification can be formalized as giving in output a labeling L̂ : D × C → {0, 1}, indicating for each couple (d, c) ∈ D × C whether d should be labeled with c (1) or not (0).

The set of categories C is decided according to some practical criterion. In the most common case, each category represents a topic which is discussed in some documents in the collection: the need is to extract the subset of topics significantly treated by each of the documents. Keeping the general scheme of the problem valid, categories may also refer to something other than topics. Works exist dealing with classification of documents by author or by language, for example. Anyway, these tasks often require ad hoc techniques to achieve good accuracy and efficiency: for example, many works classifying text by language work with n-grams of letters instead of words, to have a smaller and possibly more significant set of features (§2.3.3).

There are also tasks which are not always regarded as "text classification", but are ascribable as such, usually for being more specific. For example, spam filtering in a mail inbox may be regarded as labeling each text document (a mail message) as belonging to either the spam or not spam (or ham, in jargon) category. In these cases, the techniques usually employed in text classification by topic can be (and often are) applied.

3.2 Variants

The description above is very general and covers a wide range of works about text classification. Individual works usually give slightly narrower definitions of the problem they address, which are anyway similar, and some of them can be seen as generalizations or specializations of others. The most common variants of the general problem are described here.

3.2.1 Binary, single-label and multi-label classification

One thing to define, other than how many and which categories can be globally applied to documents, is how many of them can be applied to each single document.

Generally speaking about classification of objects (either documents or not), the most basic task is binary classification, where each object must be assigned to one and only one of two possible classes. It is common here to denote one class with some identifier, say c, and the other one as its complement c̄: in this point of view there is a single class to which each object may belong or not. This is the case of the spam filtering example cited above, as well as other similar filtering tasks. This type of problem fits well some machine learning models which are natively oriented to binary classification, such as support vector machines (§3.5.2).

While binary classification considers only two possible classes, multi-class classification entails an arbitrarily sized set of classes: in this case, two settings are common. In single-label classification, each document is assigned to one and only one possible category: in this case, considering classification by topics, we impose that each document must treat a single one of such topics, which may be reasonable or not according to the specific context. With multi-label classification, instead, no constraints are set on the labeling of each document: given a set C of possible categories, each document can be assigned an arbitrary subset of them, which is usually allowed to be empty. While some machine learning algorithms have been specifically proposed for multi-label classification problems (often being general-purpose adaptations of standard ones), most of them can only treat binary and single-label cases. The most common solution is to consider a multi-label problem with |C| possible categories as |C| distinct binary classification problems, each to determine which documents should be labeled with one of the categories. In this approach, the potential dependencies between categories are not considered, meaning that the probability for a document to belong to a certain category does not depend on its associations to the other categories. Although assuming independence between categories may be theoretically incorrect, it is usually an acceptable approximation in practice to ease the classification process.
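A minimal sketch of the decomposition of a multi-label problem into |C| binary problems, using scikit-learn's one-vs-rest wrapper (library assumed available; documents and labels are purely hypothetical):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    train_docs = ["the pitcher threw a fastball",
                  "stocks fell on wall street",
                  "the team signed a sponsorship deal"]
    train_labels = [["sports"], ["business"], ["sports", "business"]]

    X = TfidfVectorizer().fit_transform(train_docs)
    Y = MultiLabelBinarizer().fit_transform(train_labels)      # one binary column per category
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)  # one independent classifier per category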

3.2.2 Hierarchical classification

In the most general case, the categories of the set C are independent from each other: the assignment (or absence thereof) of a certain category to a document does not depend on which of the other categories are assigned to that document. This happens when the scopes of different categories do not significantly overlap. Consider for example the classification of documents under three categories corresponding to the topics computer, music and sports: while some documents may be related to more than one of these categories at the same time (e.g. a document about a soccer video game might regard both computer and sports), it is generally not possible to make assumptions about the likelihood for a document to belong to a category knowing whether it belongs to any other one.

However, topic categories may be inter-related by means of is-a relationships, meaning that some known topics are more specific branches of other topics, also represented by categories. For example, considering a set of known categories including sports, baseball and hockey, documents talking about baseball are necessarily also dealing with sports, and the same is valid for hockey. Given a set of topics presenting these relationships, a hierarchical taxonomy of these topics can be created, where some of them are recursively broken down into more specific branches.

Hierarchical text classification generally refers to classifying text documents under hierarchically-organized taxonomies of categories. Each of such taxonomies is generally structured as a single-rooted tree, where each node, apart from the root node, represents a category having a distinct, single node as its parent. Each category may have any number of child categories, representing more specific topics: categories with no child nodes are leaves of the tree. As a generalization, a taxonomy may also be represented by a directed acyclic graph (DAG), which in practice allows multiple roots and nodes with multiple parents, but in the following, unless otherwise stated, only single-root trees are considered.

A hierarchical taxonomy can be expressed formally as a partially ordered set ⟨C, ≺⟩, where C is the set of categories and ≺ ⊂ C × C is the asymmetric, transitive is-a relationship. In practice, c_d ≺ c_a means that c_d represents a specific topic within the wider discussion area represented by c_a: in this case, c_a is said to be an ancestor of c_d, which is in turn a descendant of c_a. In this formalization, the (direct) parent of a non-root category c_d is the only category c_p satisfying c_d ≺ c_p ∧ ∄γ ∈ C : c_d ≺ γ ≺ c_p, while the children of a category c_p are those categories of which c_p is the parent.

The use of a hierarchical taxonomy of categories is often useful to better organize documents, allowing to find specific ones starting from general discussion areas and progressively narrowing down the domain to the topic of interest. A typical example of this organization are web directories, where great numbers of websites are organized in a fine-grained taxonomy of categories which can be browsed by the user, from the home page presenting the top-level categories to the sub-pages of specific topics listing related websites and possibly even more specific sub-categories where other websites are distributed.
The largest and best known web directory is the Open Directory Project, also known as DMOZ1: it is maintained collaboratively by users and contains millions of web links organized into a dynamic hierarchical taxonomy of about one million categories. Figure 3.1 shows a small part of the top of this taxonomy, which in some branches goes down for ten levels and more.

1 http://dmoz.org


Figure 3.1 – Excerpt of the DMOZ taxonomy of categories

Within this general representation, there are some different specific cases handled by hierarchical algorithms. Documents are often classified under a single category of the tree, as in single-label classification, but assuming implicitly also its ancestor categories up to the root; in some cases documents may instead be assigned to multiple, independent categories. Also, the problem (and the used method) might restrict classification of documents to leaf categories only or allow to label them also with intermediate categories. Some works consider that a document classified to a category c is also automatically classified to all ancestor categories up to the root: in this vision the problem must be multi-label, but in many cases classification is anyway restricted to a single path between the root and the most specific category.

With some exceptions, hierarchical classification is generally tackled taking into consideration the taxonomy of categories, usually by a top-down approach which progressively narrows down the possible categories of each document. Solutions specifically relevant to hierarchical classification will be discussed in Section 3.7.
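To make the parent and ancestor relations concrete, a taxonomy can be stored as a simple parent map; the sketch below (with hypothetical categories) computes the ancestor categories implicitly assumed when a document is labeled with a specific one.

    # parent map for a small hypothetical taxonomy; None marks a root category
    PARENT = {"sports": None, "baseball": "sports", "hockey": "sports",
              "arts": None, "music": "arts", "drama": "arts"}

    def ancestors(category):
        """Categories that 'category' is-a, from the direct parent up to the root."""
        chain = []
        parent = PARENT[category]
        while parent is not None:
            chain.append(parent)
            parent = PARENT[parent]
        return chain

    print(ancestors("baseball"))   # ['sports']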

3.3 Knowledge engineering approach

Before the establishment of the now common machine learning-based methods, many solutions for automatic classification of text documents were based on manual coding of the required knowledge by human experts. This knowledge engineering approach was most common in operational settings.

The approach was to manually build a set of rules subsequently followed by a computer to classify documents. The accuracy of classification reached with this method was approximately at the same level as subsequent approaches based on machine learning, but reaching it required large amounts of human effort to write a sufficiently representative set of rules. This manual work had to be repeated every time the set of categories changed.

The mentioned rules have the general form {condition} ⇒ {category}, read as "if the document satisfies {condition}, then label it with {category}". Such conditions are usually given in disjunctive normal form, i.e. a disjunction (OR) of conjunctions (AND) of possibly negated clauses. Common primitive clauses refer to the presence of specific words in the document, similarly to many later methods. Below is given a hypothetical sample of one such rule.

(bat ∧ ball ∧ ¬softball) ∨ (pitcher) ⇒ baseball

This rule is relatively trivial; in real scenarios with many categories, a consistent amount of more complex rules generally needs to be compiled by knowledge engineers and domain experts. The machine learning-based methods that emerged later aim in practice to obtain this knowledge in an automated way.
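A minimal sketch of how such hand-written rules could be evaluated by a program, representing each rule as a disjunction of (required words, forbidden words) clauses; the rule encoded below is the hypothetical baseball example above.

    RULES = {
        "baseball": [({"bat", "ball"}, {"softball"}),   # (required words, forbidden words)
                     ({"pitcher"}, set())],
    }

    def classify(document_words, rules):
        labels = set()
        for category, clauses in rules.items():
            for required, forbidden in clauses:
                if required <= document_words and not (forbidden & document_words):
                    labels.add(category)
                    break   # one satisfied clause is enough for this category
        return labels

    print(classify({"the", "pitcher", "threw", "a", "curveball"}, RULES))   # {'baseball'}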

3.4 Machine learning approach

In the late 1990s, also due to the increasing computational power of computers, machine learning-based methods for text categorization became widely common. In this approach, machine learning algorithms are used to infer one or more knowledge models from a training set of labeled documents; the models can subsequently be used to predict correct categories for new documents.

An important limit of this approach is the need for a suitable training set: this must contain a consistent amount of documents as similar as possible to those to be classified and already labeled with the correct categories. In practice, if such a set is not available, some documents must be classified manually to automatize the rest of the process. On the other side, once a training set is available, the required knowledge is extracted from it in a fully automated way and can be as complex as needed to accurately classify further documents under the same categories.

In this approach to text categorization, documents are usually represented in a vector space using the bag-of-words representation described in Section 2.2, so that well-known standard machine learning algorithms can be used off the shelf, without the need to develop specific solutions.

3.4.1 General setup

Suppose that a set D of documents must be organized into a set C of categories. Machine learning-based text categorization starts from retrieving an adequate training set D_T of pre-labeled documents, which should be representative of the documents to be classified: this entails using roughly the same words and being labeled coherently with the same categories. An ideal and possible case is that some of the documents of D are manually labeled to build the set D_T. Anyway, a common assumption is that a suitable training set D_T is priorly available and that the set D of documents to be classified is instead unknown, so that no knowledge can be extracted from it. Ideally, the distribution of categories of training documents should be as balanced as possible, or correspond to the distribution of the subsequent documents to be classified.

Once a training set of labeled documents is defined, to apply standard machine learning algorithms, these documents must be represented as vectors in a common feature space. The typical choice is the general bag-of-words model, usually taking stemmed words as features; common preprocessing steps are case-folding and removal of stopwords. After applying feature selection and weighting, documents are reduced to the vectors constituting the training set.

The subsequent steps depend on the type of classification considered. In the simple case of single-label (possibly binary) classification, each training vector can be labeled with the single relevant category for the corresponding document. However, the more general case of multi-label classification has often been considered. As cited above, a practical solution is to consider one separate binary problem for each category, so that each document to be classified is included in or excluded from a category independently from the others. In this case, in practice, one binary classifier is trained for each category, always using the same algorithm and the same training set, but with "yes" and "no" labels differently assigned.

After one or more classifiers have been trained, new documents can be classified into the same categories. Each input document must be represented with a vector compatible with those of the training set: the same features selected for the training documents are extracted, with the possibility of losing some words (those which were not selected from the training set).
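The whole setup (preprocessing, weighting, training and classification of new documents with the same features) can be sketched, for the single-label case, with a scikit-learn pipeline; the library is assumed available and the tiny corpus is purely illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_docs = ["the pitcher threw a fastball",
                  "the goalkeeper blocked the shot",
                  "interest rates were raised again",
                  "the stock market closed lower"]
    train_labels = ["sports", "sports", "business", "business"]

    model = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),   # case-folding, stopwords, tf.idf
        LogisticRegression())
    model.fit(train_docs, train_labels)

    # new documents are mapped onto the same features selected from the training set
    print(model.predict(["the pitcher blocked the ball"]))   # likely ['sports'] on this toy data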

3.4.2 Supervised term selection

In Section 2.4, some common techniques to select a suitable set of features to represent documents have been presented. These techniques can generally be applied in any text mining task where a set of documents is given, regardless of any possible additional information about them. However, in text classification, the training set under analysis has a labeling associating documents to categories: this knowledge can be exploited during feature selection to improve it.

An ideal method is to select the features which are most helpful in determining the correct class of each object, even giving less consideration to unsupervised criteria. For example, given a set of documents with categories representing sports like baseball and hockey, the word injury would probably appear in an average number of them and would be a hint about the content of the text, but if it appears with similar likelihood in any category, it does not help in classifying the documents. On the other hand, a word like batting gives a greater clue about the category of a document, even if it happens to be rare throughout the collection.

Different methods exist to compute numerically the relevance of a feature in determining the class of data, theoretically based on statistics and information theory: they are often based on the probabilities of (co-)occurrence of features and classes in the data. These probabilities are estimated from the training set, which is assumed to be representative of the domain under analysis.

An evaluation scheme used to perform feature selection usually computes a value for each feature-class couple (t, c) ∈ T × C, indicating how much t is useful in determining the membership of documents in c. To obtain a general ranking of the most predictive features, a single score must be given to each of them: possible solutions are to consider the maximum or the mean of the scores for each class. From the ranking, a number of topmost entries can be selected as the set of features used to represent documents.

In this case, term selection is global because a single set of features is selected for the whole process. Anyway, as each category usually has its own representative words, it would be beneficial to consider them one by one. This is possible in the multi-label setting where one classifier for each category is trained: in this case, the models can be trained from training sets with different features, which are the best in each single case. This is referred to as local feature selection and usually brings slightly better results.

Some schemes exist to evaluate the predictive power of a feature t within a class c: these are commonly defined on the basis of probabilities, sampled from the training set, for a document to contain (t) or not (t̄) the term and to be labeled (c) or not (c̄) with the category. Alternative but similar formulations are based on four variables conventionally denoted with A, B, C and D, indicating respectively the number of training documents

• (A) containing t and labeled with c,

• (B) containing t and not labeled with c,

• (C) not containing t and labeled with c,

• (D) not containing t and not labeled with c.

The total number of training documents is indicated with N = A + B + C + D. Examples of functions used are the pointwise mutual information and the chi-square (χ²) statistic, expressed hereafter in the two alternative formulations.

PMI(t, c) = log [ P(t, c) / (P(t) P(c)) ] = log [ N · A / ((A + B)(A + C)) ]

χ²(t, c) = |D_T| · (P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c))² / (P(t) P(t̄) P(c) P(c̄)) = N · (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

Some studies [124, 38] compare how different schemes perform in text categorization tasks: information gain and chi-square generally result among the best choices.
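The chi-square score, in its A/B/C/D formulation, is straightforward to compute; a small illustrative function follows (the counts in the example call are hypothetical).

    def chi_square(A, B, C, D):
        """Chi-square statistic for a term/category pair from the four counts above."""
        N = A + B + C + D
        denominator = (A + C) * (B + D) * (A + B) * (C + D)
        if denominator == 0:
            return 0.0
        return N * (A * D - C * B) ** 2 / denominator

    # e.g. term "batting" against category "baseball" in a hypothetical training set
    print(chi_square(A=40, B=5, C=10, D=445))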

Other than in feature selection, knowledge of how documents are labeled with categories can be exploited in term weighting. In [29] it is proposed to use the schemes employed to evaluate which terms are selected also as global weights for them, possibly substituting the classic idf scheme. In [60] a relevance frequency factor is instead proposed to replace idf, based on the ratio between the variables A and B discussed above.

3.5 Common learning algorithms for text

To train a classifier on vectors extracted from text documents, theoretically, any standard supervised machine learning algorithm may be used to infer knowledge models to classify text documents suitably represented as vectors. Anyway, throughout the literature, some specific approaches have proven to work noticeably better than others, often with good efficiency. Here, two classes of general supervised learning models often used in text categorization are described in detail, then other relevant approaches are summarized.

Throughout the discussion, single data objects are referred to as instances, while class is used to denote one of some possible partitions of the data. These classes may not exactly correspond to the categories: for example, in a multi-label problem decomposed into multiple binary classification problems, each problem related to a category c has two complementary classes: documents to be labeled with c and documents not to be labeled with it.

3.5.1 Naïve Bayes

Naïve Bayes generally refers to a family of learning algorithms based on probability theory, especially on the well known Bayes theorem, relating belief conditioned by an evidence to previous belief.

P(A|B) = P(B|A) P(A) / P(B)

This is the most basic and known form of the Bayes theorem, stating that the posterior probability of an event A conditioned by an event B is the product between the prior probability of A and the support to A provided by B.

The rationale of a naïve Bayes classifier is to infer the support given by each feature to each class, assuming it to be completely independent from the other features: this is what makes the approach "naïve". Ideally, a classifier should compute the likelihood with which an instance should be labeled with class c from the combination of the values of features t1, t2, . . . , tn.

P(c | t1, t2, . . . , tn) = P(c) P(t1, t2, . . . , tn | c) / P(t1, t2, . . . , tn)

However, considering the possible combinations of values of an even moderately large number of features, an exact computation is unfeasible, so the "naïve" assumption of complete independence between features is used. This assumption allows to rewrite the formula in a tractable form.

P(c | t1, t2, . . . , tn) = P(c) ∏_{i=1..n} P(ti | c) / ∏_{i=1..n} P(ti) ∝ P(c) ∏_{i=1..n} P(ti | c)

The denominator just serves as a normalizing constant, so the most likely class c for an instance can be assumed to be the one yielding the highest numerator. Through normalization, proper probability estimations can be made, which can be useful in evaluating the obtained classifier (see §3.8.2). To avoid numeric underflow during computation of the product, its logarithm can be computed instead, allowing to decompose it into the sum of the logarithms of each factor, thus obtaining a linear formula.

log P(c | t1, t2, . . . , tn) ∝ log( P(c) ∏_{i=1..n} P(ti | c) ) = log P(c) + Σ_{i=1..n} log P(ti | c)

Summing up, the parameters of the classification model are, for each class c, the prior probability P(c) and the conditional likelihood P(t|c) for each feature t. The prior probabilities can be estimated from the training set itself as the ratios of instances belonging to each class, assumed to be equally distributed (P(c) = 1/n ∀c, where n is the number of classes) or set to other arbitrary values based on some additional knowledge of the context. For what concerns the P(t|c) parameters, always estimated from the training set, their computation depends on the possible values which can be assumed by each t and the model assumed for them, for which several possibilities exist.

In the basic case with binary-valued features, which in text categorization refer to the presence or absence of terms in each document, each parameter P(t|c) refers to the estimated probability of term t to either appear or not (according to the case under analysis) in class c. Denoting with p_{t,c} the probability, estimated from the training set, of t appearing in c and with x_t ∈ {0, 1} a sampled value of feature t, the related parameter can be written as the distribution between the two cases.

\[
P(t \mid c) = p_{t,c}^{\,x_t} \cdot (1 - p_{t,c})^{1 - x_t}
\]

It should be noted that, if a term t never appears within a class c, the probability p_{t,c} would be estimated as 0, implying P(t | c) = 0 and consequently P(c | t, ...) = 0 for any instance containing the term. To avoid this, Laplace smoothing is usually applied: the counts of occurrences of terms in classes are increased by one. If each feature is instead considered as the number of occurrences of some term (term frequency weighting), a multinomial model is often used, where P(t | c) must account for all possible values.

Despite the usually strong approximation brought by the independence assumption, naïve Bayes classifiers have been largely employed in information retrieval applications [66] and, combined with optimal pre-processing methods, have proven to be moderately effective in text categorization tasks, although yielding lower accuracies compared with other methods [123], also because of their susceptibility to unbalanced distributions of training classes.

Given its efficiency and ease of implementation, the naïve Bayes classifier is a common choice as a baseline method to be compared with newly proposed ones or as a learning algorithm to be used as an alternative to others in more complex classification methods. However, some works have tested improvements on the basic method with the goal of obtaining better results: this has been done, for example, by suitably normalizing the weights (likelihood parameters) of the model [57], by partially relaxing the standard complete independence assumption [93] or by combining some of such improvements [102].
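As a minimal illustration of a multinomial naïve Bayes text classifier with Laplace smoothing, the following Python sketch uses the scikit-learn library; the toy documents and labels are hypothetical placeholders, not data used elsewhere in this thesis.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical toy training set: documents with their category labels
    train_docs = ["the guitar solo was amazing", "stocks fell sharply today",
                  "the band played a long concert", "the market rallied after the report"]
    train_labels = ["music", "finance", "music", "finance"]

    # Term-frequency bag-of-words representation
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # alpha=1.0 corresponds to Laplace (add-one) smoothing of the term counts
    clf = MultinomialNB(alpha=1.0)
    clf.fit(X_train, train_labels)

    # Classify a new, unseen document
    X_new = vectorizer.transform(["the concert featured a guitar duet"])
    print(clf.predict(X_new))        # likely ['music']
    print(clf.predict_proba(X_new))  # posterior probability estimates per class

Internally the library works with log-probabilities, exactly as in the linear formula above, to avoid numeric underflow.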

[Figure 3.2 here: two-dimensional example with axes x1 and x2, normal vector w and separating hyperplane w · x − b = 0]

Figure 3.2 – Example of SVM operation: yellow and blue points are train- ing data points of class 1 and -1 respectively, the green line is the computed separation hyperplane, circled points are support vectors, red dashed lines are sub-optimal separation hyperplanes

3.5.2 Support Vector Machines

Support vector machines (SVM) are a class of supervised learning methods to infer binary classification models. The general idea is to infer, from a training set of points of a high-dimensional space divided into two classes, a hyperplane (or a set thereof) which is as effective as possible in correctly separating the points of the two classes.

Formally, considering an m-dimensional vector space, a usual SVM algorithm accepts as input a set of points x_1, x_2, ..., x_n ∈ X = R^m along with respective class labels y_1, y_2, ..., y_n ∈ {−1, 1}. The output is a hyperplane, representable as the set of points x satisfying w · x − b = 0, where w is the vector normal to the hyperplane itself.

If the points of the two classes are linearly separable, the parameters w and b can be defined so that all positive (class 1) points lie in the subspace of points x having w · x − b ≥ 1, while all negative (class −1) points lie in w · x − b ≤ −1. If the parameters are optimized to maximize the distance between the two hyperplanes w · x − b = 1 and w · x − b = −1 (equivalent to minimizing ||w||), some training points will lie on the hyperplanes themselves: these are the support vectors defining the position of the optimal hyperplane separating the two classes.

The soft margin variant [21] takes into consideration the possibility that the input data is not linearly separable, due for example to labeling errors or outliers: in this case, the generated model will unavoidably mislabel some training points, but this can be acceptable as long as the computed hyperplane is as effective as possible in correctly classifying new points. A cost parameter c is introduced, controlling the penalty given to misclassified training points. If its value is high, these misclassifications are minimized, but this could lead to choosing a sub-optimal hyperplane which does not properly generalize the training data (a phenomenon known as overfitting). On the other side, a low value of c allows more misclassified training instances in order to find a hyperplane with a higher distance from the remaining training points.

In its basic formulation, SVM can separate points of two classes only linearly. An intuitive solution to overcome this limitation is to define a non-linear mapping M : X → V from the original space X to a new space V and run the SVM training algorithm in the latter. However, transforming vector coordinates across the two spaces can be very inefficient. To enable non-linear separation while keeping the algorithm efficient, the kernel trick can be employed: instead of explicitly mapping points to V and computing their dot products (or, more generally, inner products ⟨·,·⟩) in it, an appropriate kernel function k : X × X → R is used which takes vectors of the original space as input and returns their inner product in the transformed space [10].

\[
\forall\, a, b \in X : \quad k(a, b) = \langle M(a), M(b) \rangle_V
\]

The best kernel function to use (if any) largely depends on the nature of the data to be classified. The following are some commonly employed kernel functions (in parentheses, the parameters of each).

• Polynomial (degree d, c): k(a, b) = (a · b + c)^d (using d = 1 and c = 0 entails no mapping at all)

ˆ Gaussian radial basis function (γ): k(a, b) = exp(γ a b 2) || − || ˆ Sigmoid kernel (κ, c): k(a, b) = tanh(κa b + c) · Additionally, kernel functions can be defined in domains different from real vector spaces, in order to apply SVM to different types of data without per- forming feature extraction. For example, string kernels measure similarity 3.5. Common learning algorithms for text 49 between sequences of symbols, while graph kernels exist operating on nodes of a graph or on whole graphs. The first notable use of support vector machines for text categorization is reported in [52], where arguments are given based on usual properties of text: high dimensionality, sparsity and (often) linearly separable categories. Successively, they have been largely employed in text categorization by topic and other text classification tasks, like spam filtering [32] and sentiment analysis [92], whose bipartite nature (in addition to reasons reported above) makes them good targets for SVMs. Several improvements have been proposed on support vector machines for text categorization: examples are the use of a kernel function based on external semantic knowledge [110], an active learning approach where users are only required to manually label few documents from an unlabeled set [115] and a co-training approach where two classifiers are iteratively trained on disjoint sets of features to integrate initially unlabeled documents [58].
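A minimal sketch of a linear soft-margin SVM for text, using scikit-learn on TF-IDF features, follows; the toy corpus is an illustrative assumption and the cost parameter C corresponds to the parameter c discussed above.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical toy corpus
    train_docs = ["the striker scored twice", "parliament passed the new bill",
                  "the goalkeeper saved a penalty", "the senate debated the reform"]
    train_labels = ["sports", "politics", "sports", "politics"]

    # TF-IDF bag-of-words features; text data is typically high-dimensional and sparse
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # Linear soft-margin SVM; C controls the penalty for misclassified training points
    clf = LinearSVC(C=1.0)
    clf.fit(X_train, train_labels)

    X_new = vectorizer.transform(["the goalkeeper scored in the final match"])
    print(clf.predict(X_new))  # likely ['sports']

    # For non-linear separation, sklearn.svm.SVC(kernel="rbf", gamma=...) applies the
    # kernel trick with the Gaussian radial basis function described above.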

3.5.3 Other methods

In the following are summarized some other relevant machine learning methods which have been proposed for text categorization.

• k-nearest neighbors (k-NN) classification consists in comparing each document to be classified with all those of the training set, finding the k most similar to it according to a defined similarity measure (e.g. cosine similarity) and labeling the new document with the most recurring category among these k training documents (a minimal sketch is given after this list). This approach, in its basic form (without e.g. indexing documents for faster queries), is often described as lazy, as a labeled "training" set is needed but no specific operation is actually performed at training time. Heavy computation is instead deferred to the classification phase, which could take a relatively long time to compare each incoming document with the whole training set. Many works test the performance obtained with k-NN in comparison with other learners [123, 122]; other papers propose improvements to the basic approach, for example by assigning differentiated weights to terms [46] or to training documents [113].

• Decision tree learners analyze training data to generalize it and infer a tree of simple rules, which are followed to classify new documents: the tree is explored top-down from the root and in each node

a branch is traversed according to the value of one feature, until a terminal node indicating the predicted class is reached. An interesting characteristic of decision tree models, compared to other types, is that they are easily interpretable by humans: for example, the discriminative feature used in the root node of a decision tree is usually among the ones giving the best clues about the class of each instance. Nowadays, the decision tree approach is rarely used for text categorization, but some early works proposed it [34, 67]. Decision trees are also employed in the method proposed in Chapter 5.

• Linear Least Squares Fit (LLSF) is in general a method to infer a linear function f(d) = Md, where M is the matrix for which, given a training set of pairs of vectors X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n), the error measured by the Frobenius norm ||MX − Y||_F is minimized. Rather than classification, this method performs regression, meaning that one or more real values are predicted rather than one class or a set thereof. However, it has been applied to text categorization, using bags of words as input vectors and vectors of likelihood scores for each possible category as outputs, although also in this case it has been proposed mostly in earlier works [122, 125].
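As referenced in the k-NN item above, the following is a minimal sketch of a cosine-similarity k-nearest neighbors text classifier, written in Python with NumPy over a generic document-term matrix; the vectors, labels and parameter values are illustrative assumptions only.

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, labels, k=3):
        """Label vector x with the majority category among its k nearest
        training vectors, using cosine similarity as the measure."""
        # Cosine similarity between x and every row of X_train
        norms = np.linalg.norm(X_train, axis=1) * np.linalg.norm(x)
        sims = X_train @ x / np.where(norms == 0, 1e-12, norms)
        top_k = np.argsort(sims)[::-1][:k]           # indices of the k most similar documents
        votes = Counter(labels[i] for i in top_k)    # majority vote among their categories
        return votes.most_common(1)[0][0]

    # Illustrative dense bag-of-words vectors and labels
    X_train = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 0.0, 2.0]])
    labels = ["music", "finance", "music"]
    print(knn_classify(np.array([1.0, 0.0, 1.0]), X_train, labels, k=1))  # likely 'music'

Note that no model is built in advance: all the comparison work happens at classification time, which is exactly the "lazy" behavior described above.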

3.6 Nearest centroid classification

The approaches to text categorization considered up to now rely on standard machine learning methods, commonly used also in tasks not related to text. Special focus is given here to the Rocchio classifier, or nearest centroid classifier, because it has its roots mainly in information retrieval and, especially, because it introduces the idea of explicit document-like representations of categories, which is prominently used in the methods presented in the next chapters.

Originally, the Rocchio algorithm is a method for relevance feedback in information retrieval. Relevance feedback transforms an arbitrary user-given search query on a set of documents, based on the results it initially returns, in order to yield more accurate results.

Assume we have a set of documents D = {d_1, d_2, ..., d_|D|}, represented as vectors in an n-dimensional space according to the bag-of-words model. A generic query given by a user would likely be mapped to a vector q, which

is compared to all documents to retrieve a subset D_q ⊆ D of documents considered relevant to that query. At this point, in the Rocchio method, the following formula is used to compute a modified query vector q_m, which is then used in place of the original vector q to retrieve a set of relevant documents.

\[
q_m = \alpha \cdot q \;+\; \beta \cdot \frac{1}{|D_q|} \sum_{d \in D_q} d \;-\; \gamma \cdot \frac{1}{|D - D_q|} \sum_{d \in D - D_q} d
\]

In practice, the modified query is computed as the sum of the following three factors, weighted according to the parameters α, β and γ:

• the original query,

• the mean of the related documents,

• the mean of the non-related documents (with negative weight).

This approach, intuitively, moves the query vector towards related documents and away from unrelated ones, to an extent dependent on the three weighting parameters. Specifically, what β and γ weight are the centroids of the sets D_q and D − D_q, which are used as representations of the whole sets.

In some works, this approach has been adapted to text categorization. Suppose a set D_T of training documents is given, organized in a set C of categories according to a labeling L; from each document d ∈ D_T a vector w_d has been extracted. For each category c ∈ C, having a set D_c = {d ∈ D_T : L(d, c) = 1} of training documents labeled with it, a representative vector w_c can be computed using Rocchio's formula without the query factor.

\[
w_c = \beta \cdot \frac{1}{|D_c|} \sum_{d \in D_c} w_d \;-\; \gamma \cdot \frac{1}{|D_T - D_c|} \sum_{d \in D_T - D_c} w_d
\]

As above, the parameters β and γ control respectively how much positive weight is given to documents in D_c and how much penalty is given to the other documents. Usually, β is greater than γ and in some cases γ is 0, meaning that only the training documents labeled with c contribute to building its vector.

Each vector w_c effectively serves as a representation (or profile) of a category c having the same form used to represent documents: for this reason, such representations are sometimes referred to as prototype documents. Being in the same space, vectors for documents and categories can have their distance measured through standard methods. Once a suitable similarity function s : R^n × R^n → [0, 1] in the vector space is defined, the classification of any document d can simply be seen as finding the category l̂(d) = c whose representation w_c is least distant from the vector w_d representing the document. This general approach is known as nearest centroid classification.

\[
\hat{l}(d) = \operatorname*{argmax}_{c \in C} \; s(w_d, w_c)
\]

The function most commonly used for this is cosine similarity, which, as noted in §2.2.1, is particularly suited to comparing bags of words. In the case of multi-label classification, a suitable threshold should be found, so that a document is assigned to all categories whose profile lies at a distance lower than the threshold.

Like the other approaches cited above, nearest centroid classification was proposed in early works about machine learning-based text categorization [34], but experiments have shown that it usually does not perform as well as methods like SVMs and naïve Bayes. This weakness can be ascribed to the impossibility, in many cases, of accurately representing a category with a single centroid [122]. Anyway, given the straightforward structure of centroids and the efficiency with which they can be extracted, their use is investigated in the methods presented in the following chapters, which are based on these aspects.
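To make the procedure concrete, the following Python sketch builds one centroid per category from TF-IDF vectors and classifies a new document by cosine similarity; the corpus, labels and vectorization choices are illustrative assumptions, not the exact setup evaluated in this thesis.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_docs = ["guitars and drums on stage", "interest rates rose again",
                  "a new album by the band", "inflation worries the markets"]
    train_labels = ["music", "finance", "music", "finance"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_docs).toarray()

    # One centroid (prototype document) per category: the mean of its document vectors
    categories = sorted(set(train_labels))
    centroids = np.array([X[[i for i, l in enumerate(train_labels) if l == c]].mean(axis=0)
                          for c in categories])

    # Nearest centroid classification of a new document via cosine similarity
    x_new = vectorizer.transform(["the band released a live album"]).toarray()
    sims = cosine_similarity(x_new, centroids)[0]
    print(categories[int(np.argmax(sims))])  # likely 'music'

The same centroids, being ordinary vectors in the document space, are what the cross-domain method of Chapter 4 iteratively refines.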

3.7 Hierarchical classification

As discussed in §3.2.2, in hierarchical classification problems categories are organized in a rooted tree-like taxonomy, where categories farther from the root represent more specific topics with respect to higher-level categories.

In many practical cases, accurate classification of documents in a hierarchy of categories is difficult due to the usually very high number of categories with respect to typical flat categorization cases. Anyway, the knowledge of how categories are organized in the taxonomy can be leveraged to decompose the classification problem in order to address it with more efficacy and efficiency. There are two main approaches to take this knowledge into consideration: conditioning the training of a single classifier operating on all categories of the taxonomy, or building multiple classifiers for different nodes of the tree.

3.7.1 Big-bang approach

In the big-bang (or global classifier) approach, a single classification model is trained on all categories of the taxonomy, which is however taken into consideration in some way. This can be done either by intervening in the feature extraction phase or by using an appropriate learning algorithm able to cope with classes organized in a hierarchy.

Following the first method, [14] investigates the use of concept-based document representations to supplement word- or phrase-based features: weak hypotheses are created based on term and concept features, which are combined using AdaBoost. The method in [116] transforms the classification output into a vector with boolean components corresponding to the possible categories; a distance-based metric is also used to measure the similarity of training examples in the classification tree.

As for adapted learning algorithms, [15] proposes a generalization of support vector machine learning where the discriminant functions (used to classify instances) can be adapted to mirror structures such as a category hierarchy. Multi-class SVMs are also used in [100], where the computation of the optimal separation hyperplanes takes leaf categories into consideration together with their respective ancestor categories.

3.7.2 Local classifiers

Approaches with local classifiers are generally based on standard learning algorithms, but use them to train multiple knowledge models on different portions of the taxonomy. Specifically, a specialized classifier is trained on each intermediate node of the tree, possibly using a local set of predictive features. Classification of a document then follows a top-down process starting from the root of the taxonomy, where the classifier of each node predicts the correct branch for the document and, if no definitive category label is decided, forwards it to another, more specific classifier. If documents can be classified in non-leaf nodes, it is necessary to decide at each step whether to classify the document at the current node or keep walking down the tree.

In [33] SVM classifiers are used in a two-level hierarchy of summaries of web documents, using feature sets created from documents and categories of the same node and assigning multiple leaf nodes to the test documents; similarly, [75] also uses multiple SVMs trained for each category. In [19] an incremental classifier is proposed, while [18] presents a refined evaluation scheme. In [121] a strategy based on pruning the original hierarchy is discussed, which first computes the similarity between the test document and all other documents, and then classifies the document within the pruned hierarchy. In [112] three methods are tested to limit the blocking problem in top-down approaches, that is, the problem of documents wrongly rejected by the classifiers at higher levels of the taxonomy. The expert refinements technique [4] uses cross-validation in the training phase to obtain a better estimation of the true probabilities of the predicted categories. In [104] a variant of the Hierarchical Mixture of Experts model is proposed, making use of two types of neural networks for intermediate nodes and for leaves. A comparison between the use of a standard flat classifier and of local classifiers in a hierarchy is given in [17], testing both SVM and naïve Bayes approaches.
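The top-down process just described can be summarized by the following Python sketch, where each non-leaf node of the taxonomy holds its own pre-trained local classifier; the tree structure and the scikit-learn-like predict interface are illustrative assumptions.

    class Node:
        """A taxonomy node: leaf categories have no children and no classifier."""
        def __init__(self, label, children=None, classifier=None):
            self.label = label
            self.children = children or {}   # maps child label -> Node
            self.classifier = classifier     # local classifier, predicts a child label

    def classify_top_down(doc_vector, node):
        # Walk down the tree until a leaf (or a node without a classifier) is reached
        while node.children and node.classifier is not None:
            predicted_child = node.classifier.predict([doc_vector])[0]
            if predicted_child not in node.children:
                break                        # no valid branch predicted: stop at the current node
            node = node.children[predicted_child]
        return node.label

Each local classifier only has to discriminate among the few children of its own node, which keeps individual decisions simple even in large taxonomies; on the other hand, an error near the root blocks the document from ever reaching the correct leaf, which is exactly the blocking problem mentioned above.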

3.8 Experimental evaluation

As commonly happens in machine learning and data mining problems, while developing an improved or novel method for text classification it is necessary to assess its effectiveness. Other than being theoretically motivated, every proposed method typically undergoes an experimental evaluation phase, where its accuracy is measured by applying it in a controlled environment where errors can be detected. The typical experimental evaluation of a text classification method can be broken down into the following steps.

1. First, one or more benchmark datasets must be collected, each consisting of a significant collection D of text documents which are already coherently labeled with meaningful categories of a set C, according to a labeling L : D × C → {0, 1}. Throughout the literature, a

number of recurring datasets have been used as benchmark: these are presented below.

2. From the set D of documents of each of the datasets, a training set D_T ⊆ D and a test set (or evaluation set) D_E ⊆ D must be sampled. In order to correctly assess the general efficacy of the classifier, the two sets must be disjoint (D_T ∩ D_E = ∅).

3. According to the specific method under analysis, the training set D_T is used to build the classifier(s).

4. This classifier is applied to all documents in D_E: a predicted labeling L̂ : D_E × C → {0, 1} is obtained for these documents.

5. The predicted labeling L̂ is compared against the actual labeling L for the test documents. One or more accuracy measures are computed, giving a quantifiable assessment of the goodness of the classifier.

In step 2, to create a training and a test set from a set of documents, the set is usually split into two parts, not necessarily of the same size: this is the basic hold-out method. Another common approach is k-fold cross-validation: the whole set of documents D is exhaustively partitioned into k disjoint sets D_1, D_2, ..., D_k; evaluation (steps 3-5) is then performed once for each set D_i, using it as a test set and the complementary set D − D_i for training; results for each D_i are then aggregated.

The goal of experimental evaluation is to verify how similar the labeling L̂ predicted by a trained classifier for the test documents D_E is to the actual labeling L known from the collection. There are multiple methods commonly used in machine learning and data mining to quantify this similarity: which ones to use generally depends on the specific problem under analysis, and none is universally acknowledged as the best. Anyway, also for this aspect, there are some common choices, which will be discussed briefly.
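A minimal sketch of both procedures using scikit-learn utilities follows; the toy corpus, the choice of classifier and the split proportions are placeholders, not the settings used in the experiments of this thesis.

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["cells divide rapidly", "the vote passed narrowly",
            "protein folding studies", "the election campaign began",
            "gene expression analysis", "the coalition formed a government",
            "new vaccine trial results", "parliament rejected the motion"]
    labels = ["sci", "pol"] * 4

    X = TfidfVectorizer().fit_transform(docs)

    # Hold-out: a single pair of disjoint training and test sets (here roughly 70% / 30%)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)
    clf = MultinomialNB().fit(X_train, y_train)
    print("hold-out accuracy:", clf.score(X_test, y_test))

    # k-fold cross-validation: k disjoint partitions, each used once as the test set
    scores = cross_val_score(MultinomialNB(), X, labels, cv=4)
    print("4-fold accuracies:", scores, "mean:", scores.mean())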

3.8.1 Benchmark datasets

In order to perform experimental evaluation of a text classification method, at least one benchmark dataset must be retrieved; however, experimental results for a method are usually given for two or three such datasets.

Throughout the literature, some collections of pre-classified documents, also known as corpora (singular: corpus), have usually been employed as benchmark datasets in text classification and some other text mining tasks. Using the same benchmark datasets across different works helps in comparing the efficacy of each method, as the experimental results are measured with methods working under the same conditions. It should be noted, anyway, that sometimes different works use different parts of the same dataset, or perform a different split between training and test documents. Here are presented some of these corpora commonly used as benchmarks. For some of them, one or more predefined splits between training and test documents exist.

• The Reuters-21578 collection (http://www.daviddlewis.com/resources/testcollections/reuters21578/) is a corpus of newswire stories of 1987 about economy and markets, primarily edited by David Lewis; it is a cleaned-up version of the older Reuters-22173 collection. As the name suggests, it includes 21,578 documents. Different splits between training and test set have been used, sometimes making it difficult to compare different works using this collection. One of the most common splits is ModApté: it contains 9,603 documents for training and 3,299 documents for testing (8,676 documents remain unused). Each document is assigned a subset of the possible labels, divided into 5 groups: topics, orgs, people, places and exchanges; those of the first group are often used as categories. Documents are distributed quite unevenly between topics: document counts for topics range from a handful to hundreds. Most works based on the ModApté split perform experiments considering one or both of two subsets of categories: the 90 topics having at least one document in both training and test set, and the 10 topics with the greatest numbers of documents (more than 200 each in the union of training and test set).

• The 20 Newsgroups collection (http://qwone.com/~jason/20Newsgroups/) contains almost 20,000 newsgroup posts from Usenet discussion groups, partitioned nearly evenly across 20 different groups (about 1,000 posts per group). These 20 groups represent more or less specific topics of different areas, so that there

are categories closely related to each other (such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), slightly more loosely related categories (talk.religion.misc and alt.atheism) and highly unrelated categories (rec.autos and talk.politics.guns). Slight variants of this collection also exist: a common one is the "bydate" selection of 18,846 posts, divided between 11,314 training posts and 7,532 test posts (a loading sketch is given after this list).

• Ohsumed is a subset of the MEDLINE database created by William Hersh and colleagues at the Oregon Health Sciences University; it consists of 348,566 documents obtained from 270 medical journals over the period 1987 to 1991. The documents were manually indexed using the MeSH (Medical Subject Headings) taxonomy, which consists of 18,000 categories. The common choice in the literature is to consider only the articles in the HealthDiseases category and its subtree in the MeSH taxonomy. This sub-hierarchy has 119 categories, of which only 102 contain usable data.

• Reuters Corpus Volume 1 (RCV1) [68] is, like Reuters-21578, a collection of newswire stories, labeled with different types of codes, which in this case are topic, industry and region. The whole collection contains over 800,000 articles, so that only portions of it are sometimes used. The 103 topic codes, usually treated as categories, are organized in a hierarchy with 4 top categories (Corporate/Industrial, Economics, Government/Social and Markets).
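As referenced in the 20 Newsgroups item above, that collection, in its "bydate" train/test split, can be loaded directly through scikit-learn; the category selection below is just an example of two closely related groups.

    from sklearn.datasets import fetch_20newsgroups

    cats = ["comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware"]
    train = fetch_20newsgroups(subset="train", categories=cats)  # "bydate" training portion
    test = fetch_20newsgroups(subset="test", categories=cats)    # "bydate" test portion

    print(len(train.data), "training posts,", len(test.data), "test posts")
    print(train.target_names)  # the selected category names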

Apart from these collections built on purpose, some sources exist across the Web from which collections of documents used by some works have been extracted. The Open Directory Project (DMOZ) (already cited in §3.2.2) is a collaboratively built web directory, organizing web pages into a detailed taxonomy of topics. As of this writing, it reports containing 4,129,130 links to web sites organized under more than 1,023,636 categories. While each category (apart from Top) has a single "hard" parent node in the database, other nodes may link to it as a cross-related sub-topic: considering these links, the structure of the taxonomy has the form of a directed acyclic graph (DAG) rather than a tree (where only one path must exist between two nodes).

    Actual       Predicted category
    category     computer   music   sports   (total)
    computer          532      42       13       587
    music              23     547       16       586
    sports             14      18      563       595
    (total)           569     607      592      1768

Figure 3.3 – Example of a confusion matrix with three categories; entries in bold constitute the diagonal

Yahoo! Directory (http://dir.yahoo.com, no longer active) was another similarly structured web directory.

3.8.2 Evaluation metrics

In the following, specific methods to numerically evaluate classification accuracy from test runs of a method are presented.

In the case of single-label classification (including binary classification), for each test document d ∈ D_E there is one single actual class l(d) ∈ C and one single predicted class l̂(d) ∈ C: the document is correctly classified if and only if the two are equal. In machine learning, it is common to summarize the co-occurrences between actual and predicted classes in a confusion matrix, showing the number of documents for each possible pair of them: Figure 3.3 shows an example. The numbers on the main diagonal are counts of cases where actual and predicted labels match, so they refer to correctly classified documents; the values outside the diagonal represent various types of errors.

From the confusion matrix many performance measures can be computed. The most obvious one is the ratio of correctly classified documents with respect to the total, which is often referred to as accuracy: it is in practice the sum of the elements of the diagonal divided by the sum of all elements (i.e. the total number of test documents). For example, from the confusion matrix reported in Figure 3.3, an accuracy of (532 + 547 + 563)/1768 ≈ 0.9287 = 92.87% can be computed.

\[
\mathrm{Acc} = \frac{|\{d \in D_E : l(d) = \hat{l}(d)\}|}{|D_E|}
\]

The accuracy is generally used when there are few categories (ideally two) and documents are equally distributed among them. When categories are unbalanced, the accuracy may underestimate errors made on minority categories. For example, supposing a test set with 90% of documents of class A and 10% of class B, even a trivial classifier predicting class A for any document would obtain a 90% accuracy, which is generally a high value.

Measures exist to evaluate performance on single categories. The recall ρ(c) of a category c is the ratio of correct classifications among the documents which are actually labeled with c, while its precision π(c) is conversely the ratio of correct classifications among the documents for which c has been predicted as the category.

\[
\rho(c) = \frac{|\{d \in D_E : l(d) = \hat{l}(d) = c\}|}{|\{d \in D_E : l(d) = c\}|}
\qquad
\pi(c) = \frac{|\{d \in D_E : l(d) = \hat{l}(d) = c\}|}{|\{d \in D_E : \hat{l}(d) = c\}|}
\]

The two values measure in practice how well the classifier is able to detect category c and to assign it only to correct documents. They are usually summarized in an F-measure, parameterized by a factor β which indicates the importance given to recall with respect to precision.

\[
F_\beta(c) = \frac{(\beta^2 + 1) \cdot \rho(c) \cdot \pi(c)}{\beta^2 \cdot \rho(c) + \pi(c)}
\]

The most common choice is to set β = 1, i.e. equally weighting recall and precision: this results in their harmonic mean, known as the F1 measure.

\[
F_1(c) = \frac{2 \cdot \rho(c) \cdot \pi(c)}{\rho(c) + \pi(c)}
\]

Consider now multi-label classification, where each test document d ∈ D_E has a set L(d) = {c ∈ C : L(d, c) = 1} of actual labels and a set L̂(d) = {c ∈ C : L̂(d, c) = 1} of predicted labels, rather than only one of each. The classifier in this case can be seen as taking |D_E × C| decisions between c (label the document with the category) and c̄ (do not label), rather than |D_E| decisions among the categories in C.

                     Predicted class
    Actual class     c                          c̄
    c                TP_c (true positives)      FN_c (false negatives)
    c̄                FP_c (false positives)     TN_c (true negatives)

Figure 3.4 – Scheme of a confusion matrix for a single category c

In this case, the standard accuracy cannot be computed as above, as each document may have an arbitrary number of actual and predicted categories. One obvious solution would be to compute the ratio of test documents for which the classifier predicted their exact set of categories, but this ratio would generally be quite small and can easily underestimate the efficacy of the classifier. On the other hand, the accuracy could be computed as the ratio of correct decisions of the classifier on single couples (d, c) ∈ D_E × C, but this tends instead to overestimate the efficacy. This is due to the fact that usually each document belongs to a small fraction of all the possible categories, so there is a vast majority of decisions which should be negative and usually most of them are trivial, as documents are very easily detected as unrelated to most categories.

In multi-label evaluation, it is not possible to extract a single confusion matrix for all categories. Instead, considering an independent binary classification problem for each category, one binary confusion matrix can be extracted for each category c, having c and its complement c̄ as classes. Each of these matrices (also known as tables of confusion) shows four counts related to a category c: true positives are documents correctly labeled with c, false negatives are documents for which the c label is erroneously missing, false positives are documents wrongly labeled with c, true negatives are documents correctly not labeled with c.

Recall ρ(c) and precision π(c) for each category c can be computed much as above, adapting the formulas to the multi-label case. From the two, the F1 measure is computed exactly as above.

\[
\rho(c) = \frac{TP_c}{TP_c + FN_c} = \frac{|\{d \in D_E : c \in L(d) \cap \hat{L}(d)\}|}{|\{d \in D_E : c \in L(d)\}|} = \frac{\sum_{d \in D_E} L(d, c) \cdot \hat{L}(d, c)}{\sum_{d \in D_E} L(d, c)}
\]
\[
\pi(c) = \frac{TP_c}{TP_c + FP_c} = \frac{|\{d \in D_E : c \in L(d) \cap \hat{L}(d)\}|}{|\{d \in D_E : c \in \hat{L}(d)\}|} = \frac{\sum_{d \in D_E} L(d, c) \cdot \hat{L}(d, c)}{\sum_{d \in D_E} \hat{L}(d, c)}
\]

In order to get an overall evaluation of the classifier, these values must be aggregated. Two approaches exist to compute an average of recall and precision over all categories. The macroaveraged recall (or precision) is simply the average of the recalls (or precisions) of all categories. Conversely, microaveraged recall and precision are computed from the confusion matrix obtained by summing component-wise those of each category: they refer in practice to all the individual decisions for each couple (d, c) ∈ D_E × C.

\[
\rho^m = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} TP_c + \sum_{c \in C} FN_c}
\qquad
\pi^m = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} TP_c + \sum_{c \in C} FP_c}
\]

All these accuracy measures, as stated above, can be applied in the case of a hard binary decision in {0, 1} for each document-category couple. Anyway, some classifiers commonly used in text classification (including some of those presented in Section 3.5) return by default a likelihood p(d, c) ∈ [0, 1] of a document d belonging to a category c, which may be equal to 0, to 1 or to any intermediate value. In this case, hard decisions can be obtained by setting a threshold value τ, so that a positive decision is taken for all and only the couples whose likelihood score is greater than (or equal to) the threshold.

By varying the value of the threshold, different classification decisions are taken about test documents. More specifically, if the threshold value is low, documents will generally be labeled with more categories and the number of missing correct labels will decrease (high recall), but documents will more likely be filed under wrong categories (low precision). Conversely, if the threshold is set to a high value, documents will be labeled with categories more rarely and wrong labels will drop (high precision), but more correct labels will also be missed (low recall). Summing up, the threshold can be set according to the desired tradeoff between recall (avoiding missing labels) and precision (avoiding erroneous labels): as the threshold grows, recall drops and precision increases. In many past works, it was common to search for the threshold setting at which precision and recall are equal (or as close as possible): their common value with this setting is called the breakeven point (BEP) and was often used as an overall accuracy measure for classifiers with a settable threshold.

As for hierarchical classification, with reference to a single-label case, the accuracy can be computed as above as the ratio of correctly classified test documents. However, distinctions can be made on the errors made by the classifier: a document may be assigned to an overly generic or overly specific category with respect to the correct one, or may even be classified under a category of a different branch of the tree, which can be at different distances from the correct one. When evaluating the classifier, one may want to weight these errors according to their gravity: for example, intuitively, classifying a document in a sibling (i.e. a node with the same parent) of the correct category is a smaller error than choosing the wrong path in the tree already at the root, thus ending in a very distant category. For example, with reference to the taxonomy in Figure 3.1 (page 39), classifying a document about drama under poetry could be deemed a more "forgivable" mistake than classifying it under movies or, even worse, under management.

Some different measures for the evaluation of hierarchical classification have been proposed. Among these, hierarchical recall and precision have been presented in [59] and endorsed by [109] as an intuitive adaptation of the standard recall and precision concepts to this context. Considering that each document classified under a category c is also automatically labeled with all its ancestor categories, the hierarchical recall ρ^H(d) is computed for each document d as the ratio of actual labels L(d) which have been predicted, while the hierarchical precision π^H(d) is the ratio of predicted labels L̂(d) which are correct.
In all cases, the root node is not considered.

• For a correct classification, both recall and precision are 1.

• For a generalization error (a document assigned to a correct but too generic category), the precision is 1 and the recall is proportional to the number of correct labels found.

• For a specialization error (a document labeled with too specific classes), the recall is 1 and the precision decreases as the predicted category is

deeper in the taxonomy.

• For other errors, recall and precision are both below 1 and are penalized respectively by the unpredicted actual categories and the erroneously assigned categories.

For example, still referencing the taxonomy in Figure 3.1, for a document d having drama as its correct category, classifying it generically under arts would imply a recall ρ^H(d) = 1/3 (as literature and drama are missed) and a precision π^H(d) = 1 (as arts is correct); classifying it under movies would instead yield ρ^H(d) = 1/3 and π^H(d) = 1/2. For multiple test documents, the measures are microaveraged: the numerators and denominators of the per-document ratios are summed separately, and the result is the ratio of these sums.

\[
\rho^H = \frac{\sum_{d \in D_E} |L(d) \cap \hat{L}(d)|}{\sum_{d \in D_E} |L(d)|}
\qquad
\pi^H = \frac{\sum_{d \in D_E} |L(d) \cap \hat{L}(d)|}{\sum_{d \in D_E} |\hat{L}(d)|}
\]

\[
F_1^H = \frac{2\,\rho^H \pi^H}{\rho^H + \pi^H}
\]
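As an illustration of these measures, the following Python sketch reproduces the drama example above, computing hierarchical recall, precision and F1 from label sets expanded with all ancestors (the root is excluded); the small taxonomy fragment is a hypothetical reading of Figure 3.1, not its full content.

    def ancestors(category, parent):
        """The category itself plus all its ancestors up to (and excluding) the root."""
        chain = []
        while category in parent:          # the root has no entry in `parent`
            chain.append(category)
            category = parent[category]
        return set(chain)

    # Hypothetical fragment of the taxonomy of Figure 3.1 (child -> parent)
    parent = {"arts": "root", "literature": "arts", "drama": "literature",
              "movies": "arts", "poetry": "literature"}

    def hierarchical_prf(actual_leaf, predicted_leaf):
        actual = ancestors(actual_leaf, parent)
        predicted = ancestors(predicted_leaf, parent)
        overlap = len(actual & predicted)
        recall = overlap / len(actual)
        precision = overlap / len(predicted)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return recall, precision, f1

    print(hierarchical_prf("drama", "arts"))    # (0.333..., 1.0, 0.5)
    print(hierarchical_prf("drama", "movies"))  # (0.333..., 0.5, 0.4)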


Chapter 4

Cross-Domain Text Categorization

This chapter deals with the specific task of cross-domain text categorization, where knowledge extracted from labeled documents must be applied to other documents of a different domain in which different terms are used, so that the methods described so far do not achieve optimal results. After an introduction of the problem, known methods to accomplish this task are reviewed; then a novel solution, along with experimental results, is presented.

4.1 Problem description

As seen in the previous chapter, typical approaches to automatic text classification into predefined categories require the availability of a training set of documents, from which the necessary knowledge is inferred. A training set generally must have some characteristics in order to be usable and to yield optimal results: it must contain an adequate amount of documents, which must be as representative as possible of the documents which will be classified and must be coherently labeled with the same categories which are going to be assigned to new documents. To be representative, training documents should intuitively contain the same words or, more generally, the same features which will be extracted to predict the categories of subsequent documents to be classified, which are referred to as target documents.

In other words, it can be said that training documents must be within the same domain as the target documents. Intuitively, the domain of the documents is the context within which they are produced and/or consumed; it dictates the words which are used and the categories under which they are or must be filed.

Retrieving a set of documents of the exact same domain as the target documents can often be difficult. As discussed previously, a solution would be to manually label some target documents, but creating a suitable training set may require labeling a great amount of documents, thus implying a significant amount of human effort. However, in some cases, a set of labeled documents of a slightly different domain may be available. The difference between the domains may consist in the use of some different terms or in an organization into categories representing slightly different concepts. Considering the general setting of the problem, the traditional techniques seen so far for text categorization might be applied to infer a classifier from the available training documents and apply it to the target documents. However, due to the differences between the two domains, this would likely not result in an accurate classification, as many of the features known by the classifier would not be found within the target documents.

Cases like these require specific methods to somehow transfer the knowledge extracted from the available training data to the target domain. Throughout the last decade, techniques for transfer learning have been devised to address these cases. Transfer learning generally involves solving a problem in a target domain by leveraging knowledge from a source domain whose data is fully available. Cross-domain text categorization refers to the specific task of classifying a set of target documents into predefined categories using as support a set of pre-labeled documents of a slightly different domain.

Contrary to traditional text categorization problems, where target documents are generally assumed not to be known in advance, typical cross-domain methods require that unlabeled target documents be given in advance, at least in part, as some knowledge of the target domain is necessary. Also, the majority of cross-domain methods consider a single-label setting (i.e. one and only one category label for each document), which is assumed by default for the rest of this chapter.

4.1.1 Formalization

Formally, an algorithm for cross-domain text classification has as its input a set D_S of source documents constituting the source domain, a set D_T of target documents making up the target domain, a set C of categories and a labeling C_S : D_S → C associating a single category to each source document. The required output is a predicted labeling Ĉ_T : D_T → C for the documents of the target domain.

Concerning the relationship between the two domains, in the general case of transductive transfer learning, under which cross-domain text categorization falls (see below), they must share the same feature space X and the same class labels Y: in the case of text documents, the first condition can be satisfied simply by selecting the same terms for the source and the target domain. The common assumption on the difference between the two domains is that the labels are equally conditioned by the input data, which though is distributed differently in the two. Denoting with X_S and Y_S the data and labels of the source domain and with X_T and Y_T those of the target domain, we have P(Y_S | X_S) = P(Y_T | X_T), but P(X_S) ≠ P(X_T): this condition is known as covariate shift [108].
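As a concrete illustration of the shared feature space mentioned above, the following Python sketch builds a single bag-of-words vocabulary over the union of source and target documents, so that vectors of the two domains are directly comparable; the document contents are placeholders and the use of scikit-learn is an assumption of the example.

    from sklearn.feature_extraction.text import TfidfVectorizer

    source_docs = ["a post about car engines and repairs", "a post about a new medical therapy"]
    source_labels = ["rec", "sci"]
    target_docs = ["a post about the hockey playoffs",
                   "a post about the rocket launch"]   # labels unknown

    # Fit the vectorizer on both domains so they share exactly the same term space
    vectorizer = TfidfVectorizer()
    vectorizer.fit(source_docs + target_docs)
    X_source = vectorizer.transform(source_docs)
    X_target = vectorizer.transform(target_docs)

    print(X_source.shape[1] == X_target.shape[1])  # True: same feature space for both domains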

4.1.2 Motivations

Cross-domain text categorization methods generally address situations where a set of target documents must be classified and a set of labeled documents of a slightly different domain is available. A situation of this kind can be the classification of documents into a set of categories C_T by leveraging documents pre-labeled with the categories of a set C_S dealing with similar topics, such that there is a one-to-one mapping between C_S and C_T. Consider for example, drawing category labels from the taxonomy depicted in Figure 3.1 (page 39), having unlabeled documents about either music or physics to be distinguished from each other: if one happens to have pre-labeled documents about movies and math, cross-domain methods can be used to extract knowledge from them which is then transferred to the similar categories. This problem can be seen as the classification of documents into two general categories, arts and science, each including one sub-category of the source domain and one of the target domain. The datasets typically used as benchmarks for evaluation simulate this situation (§4.3.1).

An interesting area of application of cross-domain text categorization on which current research is focused is sentiment analysis, specifically the classification of texts (forum posts, reviews, etc.) talking about a specific entity as either positive or negative. Through cross-domain methods, it is potentially possible to perform this classification on reviews of objects of one type by training a classifier on reviews of different objects: it is possible, for example, to classify comments about books by training on reviews of movies which are already labeled for their polarity (e.g. with a 1-to-5-stars rating), or similarly to distinguish positive and negative reviews of hotels by learning from those about restaurants. As in the other cases, cross-domain learning works better if the two domains are similar enough and there are many common words between the two.

4.2 State of the art

As stated above, cross-domain text categorization lies within the wide research branch of transfer learning, generally dealing with methods to transfer knowledge from a source to a target domain. A comprehensive survey on transfer learning is given in [90], where different types of tasks are described. In this survey, general cross-domain classification of data is referred to as transductive transfer learning, roughly meaning that the domains are different, but the classes are the same (as the survey points out, this usage differs from the usual meaning of "transductive" in machine learning, which instead indicates learning algorithms where test data must be known at training time). Text categorization is perhaps the specific task most commonly addressed within transductive methods: some works presented in the following are specific to it, while others are general approaches or can at least be easily adapted to other tasks. Two major approaches to transductive transfer learning are generally distinguished: instance transfer and feature representation transfer, treated separately in the following.

4.2.1 Instance transfer

Instance transfer-based approaches generally work by re-weighting instances (data samples) of the source domain to adapt them to the target domain, in order to compensate for the discrepancy between P(X_S) and P(X_T): this

generally involves estimating an importance P(x_S)/P(x_T) for each source instance x_S in order to reuse it as a training instance under the target domain.

Some works mainly address the related problem of sample selection bias, where a classifier must be learned from a training set with a biased data distribution. [128] analyzes the impact of the bias on various learning methods and proposes a correction method using knowledge of the selection probabilities. The kernel mean matching method [50] learns re-weighting factors by matching the means of the two domains' data in a reproducing kernel Hilbert space (RKHS); this is done without estimating P(X_S) and P(X_T) from a possibly limited quantity of samples. Among other works operating under this restriction there is the Kullback-Leibler importance estimation procedure [111], a model to estimate importance based on the minimization of the Kullback-Leibler divergence between the real and the expected P(X_T).

Among works specifically considering text classification, [25] trains a naïve Bayes classifier on the source domain and transfers it to the target domain through an iterative Expectation-Maximization algorithm. In [42] multiple classifiers are trained on possibly multiple source domains and combined in a locally weighted ensemble, based on similarity to a clustering of the target documents, in order to classify them.

4.2.2 Feature representation transfer

Approaches based on feature representation transfer generally work by finding a new feature space in which to represent instances of both the source and the target domain, where their differences are reduced and standard learning methods can be applied.

The structural correspondence learning method [6] works by introducing pivot features and learning linear predictors for them, whose resulting weights are transformed through Singular Value Decomposition and then used to augment training data instances. The paper [28] presents a simple method based on augmenting instances with features differentiating source and target domains, possibly improvable through nonlinear kernel mapping. In [72] a spectral classification-based framework is introduced, using an objective function which balances the source domain supervision and the target domain structure. With the Maximum Mean Discrepancy (MMD) Embedding method [87], source and target instances are brought to a common low-dimensional space where the differences between data distributions are reduced; transfer component analysis [89] improves this approach in terms of efficiency and generalization to unseen target data.

The following works are focused on text classification. In [24] an approach based on co-clustering of words and documents is used, where labels are transferred across domains using word clusters as a bridge. The topic-bridged PLSA method [120] is instead based on Probabilistic Latent Semantic Analysis, which is extended to accept unlabeled data. In [129] a framework for joint non-negative matrix tri-factorization of both domains is proposed. Topic correlation analysis [69] extracts both shared and domain-specific latent features and groups them, in order to support higher distribution gaps between domains.

4.2.3 Other related works

As in traditional text classification, some methods leverage external knowledge bases: these can be helpful to link knowledge across domains. The method presented in [118] improves the cited co-clustering-based approach [24] by representing documents with concepts extracted from Wikipedia. The bridging information gap method [119] instead exploits an auxiliary domain acting as a bridge between source and target, using Wikipedia articles as a practical example. These methods usually offer very high performance, but need a suitable knowledge base for the context of the analyzed documents, which might not be easily available for overly specialized domains.

Other than text classification by topic, another related task on which domain adaptation is frequently used is sentiment analysis, where positive and negative opinions about specific objects (products, brands, etc.) must be distinguished: a usual motivating example is the need to extract knowledge from labeled reviews of some products to classify reviews of products of a different type, with possibly different terminologies. Spectral feature alignment [88] works by clustering together words specific to different domains by leveraging the more general terms. In [8] a sentiment-sensitive thesaurus is built from possibly multiple source domains. In [20] a naïve Bayes classifier on syntax tree-based features is used.

Beyond the presented works, where domains differ only in the distribution of terms, methods for cross-language text classification exist, where source and target documents are written in different languages, so that there are few or no common words between the two domains. This scenario generally requires either some labeled documents for the target domain or an external knowledge base to be available: a dictionary for the translation of single terms is often used. As examples, in [73] an approach based on the information bottleneck is presented, where Chinese texts are translated into English to be classified, while the method in [99] is based on the structural correspondence learning cited above [6].

4.3 Evaluation

Throughout the works on cross-domain text categorization presented in the previous section, a small set of publicly available datasets is used as benchmarks for their experimental evaluation. All these datasets are generally organized in a shallow hierarchical structure, where a small number of top-level categories can be isolated so that each includes a number of roughly similar sub-categories. An evaluation dataset can thus be set up by choosing a small set of top categories of a collection, constituting the set C of possible categories, and splitting the documents of these categories into two groups: one contains documents of some branches of the top categories and is used as the source domain, the other, containing documents of different sub-categories, is used as the target domain. By labeling each document in the two domains with its top category, a dataset suitable for evaluation is obtained.

The tested method receives as input all the documents and the labeling of the source group and must give as output predicted labels for the documents of the target group, which are then compared to the known labels. Given the structure of the available datasets, the number of top categories is in most cases limited to two, with few works also performing tests with three or (rarely) four top categories, which usually give noticeably worse results compared to those with two categories only. Given the single-label setting, the very small number of categories and, in most cases, the balanced distribution of documents across them, the basic accuracy, intended as the ratio of correctly classified documents with respect to the total, is mostly used as the sole performance indicator.

4.3.1 Common benchmark datasets Here are presented the three recurring text collections for evaluation of topic-based cross-domain text categorization. The first two were already 72 Chapter 4. Cross-Domain Text Categorization presented in §3.8.1. The 20 Newsgroups collection is used for having its 20 categories divided among 6 main branches: the 4 most frequent of them are comp, rec, sci and talk and are used as top categories, each represented by 4 sub-categories (5 for comp). While many splits of sub-categories between source and target domain are possible, most works only consider one of them as a reference for each possible set of top categories. An example of splits for each of the six possible sets of two categories is given by [24] and reused in other subsequent works; this is reported in Table 4.1 and will be used as a reference in the following. For what regards the four possible sets of three categories and the full set of four categories, splits used as reference are taken from [118] (with the exception of comp vs. rec vs. talk, missing from the paper and made up here) and reported in Tables 4.2 and 4.3. In the Reuters-21578 collection, documents are labeled with 5 types of labels: while those of type topics are often used as topic labels in regular text categorization tasks, orgs, people and places are instead used as top categories for cross-domain classification. For each of the three possible pairs of top categories, sub-categories are usually evenly divided. The SRAA text collection2 is not much used for regular text catego- rization, but is very adapt for the cross-domain case. Like 20 Newsgroups, it is drawn from Usenet: it consists of 73,218 posts from discussion groups about simulated autos, simulated aviation, real autos and real aviation. With this setting, tests can be intuitively performed using two different sets of top categories: real, simulated and auto, aviation . Given the highly unbalanced{ distribution of documents} { among the four categories} in the base collection, tests are usually performed on a selection of 4,000 documents for each of them, for a total of 16,000 documents.

4.4 Iterative refining of category representations

In the following, a novel approach to cross-domain text categorization is proposed, based in summary on creating representations of the categories from the source domain and adapting them to the target domain by iteratively


Table 4.1 – Reference two-categories splits for cross-domain categorization experiments on the 20 Newsgroups datasets

Top categories Source domain Target domain comp.graphics comp.sys.ibm.pc.hardware comp.os.ms-windows.misc comp.sys.mac.hardware comp vs. sci comp.windows.x sci.crypt sci.med sci.electronics sci.space rec.autos rec.sport.baseball rec.motorcycles rec.sport.hockey rec vs. talk talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey rec vs. sci sci.med sci.crypt sci.space sci.electronics sci.electronics sci.crypt sci.med sci.space sci vs. talk talk.politics.misc talk.politics.guns talk.religion.misc talk.politics.mideast comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.windows.x comp vs. rec comp.sys.mac.hardware rec.motorcycles rec.autos rec.sport.hockey rec.sport.baseball comp.graphics comp.os.ms-windows.misc comp.sys.mac.hardware comp.sys.ibm.pc.hardware comp vs. talk comp.windows.x talk.politics.mideast talk.politics.guns talk.religion.misc talk.politics.misc 74 Chapter 4. Cross-Domain Text Categorization

Table 4.2 – Three- and four-categories splits used for cross-domain categorization experiments on the 20 Newsgroups collection

Top categories Source domain Target domain comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.windows.x comp.sys.mac.hardware comp vs. rec rec.motorcycles rec.autos vs. sci rec.sport.hockey rec.sport.baseball sci.med sci.crypt sci.space sci.electronics rec.autos rec.sport.baseball rec.motorcycles rec.sport.hockey rec vs. sci vs. sci.med sci.crypt talk sci.space sci.electronics talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc comp.graphics comp.os.ms-windows.misc comp.sys.mac.hardware comp.sys.ibm.pc.hardware comp.windows.x comp vs. sci sci.crypt sci.space vs. talk sci.electronics sci.med talk.politics.mideast talk.politics.misc talk.religion.misc talk.politics.guns comp.graphics comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x comp.os.ms-windows.misc comp vs. rec rec.autos rec.sport.baseball vs. talk rec.motorcycles rec.sport.hockey talk.politics.guns talk.politics.misc talk.religion.misc talk.politics.mideast 4.4. Iterative refining of category representations 75

Table 4.3 – Four-categories split used for cross-domain categorization ex- periments on the 20 Newsgroups collection

Top categories Source domain Target domain comp.graphics comp.sys.mac.hardware comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.windows.x comp vs. rec rec.autos rec.sport.baseball vs. sci vs. rec.motorcycles rec.sport.hockey talk sci.crypt sci.space sci.electronics sci.med talk.politics.mideast talk.politics.misc talk.religion.misc talk.politics.guns


4.4.1 Rationale

Recalling the nearest centroid classifier (or Rocchio classifier) discussed in Section 3.6, classification of text documents can be carried out by producing prototypical representations of the known categories and finding, for each document to be classified, which one is the most similar to it. In the case of cross-domain categorization, if one had representations of the categories of the target domain, the nearest centroid approach could be used to accurately classify its documents. However, these representations are not readily available, as they would require a correct labeling of the target documents, which is exactly what has to be found.

What can be extracted from the available data of a cross-domain problem, instead, are the centroids of categories in the source domain, as both its documents and their labeling are known. As these categories are different from those of the target domain, their representations generally differ from the optimal ones. However, given the setting of the problem, where categories of source and target domain correspond one-to-one, their representations are also expected to be similar. Specifically, considering the sets of words having the highest weights in the representations, a number of these will be unique to a source category or to its correspondent in the target domain, but some of them will be shared between the two, thus boosting their similarity.

Considering this, if representations of categories extracted from the source domain are compared to documents in the target domain, it is expected that nearest centroid classification will not be completely accurate, as the used profiles do not properly represent the domain. Nevertheless, a number of documents is likely to be associated to the respective correct categories. It is also expected that at least some of these documents happen to have a particularly high similarity to the prototype of their category, so that the confidence in their classification can be considered significantly high.

If all these assumptions hold, such documents are, with high probability, correctly classified within the target domain. From this set of labeled documents, new centroids of categories can be extracted as usual. Being extracted from a possibly small subset of the target documents, these representations are likely to constitute only an approximation of the categories of the target domain, but they should be closer to the correct representations than those originally extracted from the source domain.

In other words, it is expected that the newly obtained category profiles constitute an improvement over the previous ones obtained from source documents and thus, performing nearest centroid classification again with them, will yield more accurate predictions on the target documents. Documents which were used to create the new representations will likely still be associated to them with even higher confidence and, additionally, new documents will potentially reach a high similarity to their category profile as before.

All these documents can be used to create a further new representation of categories, from which the same sequence of operations can be repeated. This cycle can be iterated an arbitrary number of times, with the centroids of categories moving at each step, presumably getting closer to the optimal representations of categories in the target domain.
This procedure is similar to the k-means clustering algorithm, where centers are progressively moved to the centroids of the respective data points until they converge to a stable configuration. Likewise, the iterative algorithm proposed above can be stopped when the profiles of categories cease to change from one step to the next. At this point, nearest centroid classification can be run one last time to get the definitive predicted category for each target document. If the representations of categories have successively moved close to the real centroids of the target domain, classification will be as accurate as possible. This is the idea behind the proposed iterative refining method for cross-domain text categorization, formally described below.

4.4.2 Base method

The data given in input to the algorithm is that discussed in Section 4.1.1: the sets $\mathcal{D}_S$ and $\mathcal{D}_T$ of source and target documents, whose union is denoted with $\mathcal{D}$, the set $\mathcal{C}$ of possible category labels shared across the domains and the labeling $C_S : \mathcal{D}_S \to \mathcal{C}$ mapping source documents to categories. In addition, the following two parameters are given:

• a confidence threshold $\rho \in [\frac{1}{|\mathcal{C}|}, 1)$ used to select relevant documents in the iterative phase,

• a maximum number $N_I$ of iterations for the iterative phase.

The first step consists in typical document pre-processing, considering the whole set of all known documents $\mathcal{D} = \mathcal{D}_S \cup \mathcal{D}_T$. To this extent, the following steps are performed for each document:

• words are extracted by splitting on all non-letter characters (spaces, periods, dashes, . . . ),

• each word is casefolded (all letters are turned to lowercase) for uniformity,

• words shorter than 3 letters are removed,

• words appearing in a predefined list of 671 stopwords are removed³,

• Porter's stemming algorithm is applied to each word (a sketch of the whole pipeline is given below).
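A minimal sketch of this pre-processing pipeline, assuming NLTK's implementation of Porter's stemmer and a generic stopword set in place of the 671-word list referenced in the footnote:

```python
# Minimal sketch of the pre-processing steps listed above, assuming NLTK's
# PorterStemmer; `stopwords` stands in for the predefined stopword list.
import re
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text, stopwords):
    words = re.split(r"[^a-zA-Z]+", text)              # split on non-letter characters
    words = [w.lower() for w in words]                  # casefolding
    words = [w for w in words if len(w) >= 3]           # drop words shorter than 3 letters
    words = [w for w in words if w not in stopwords]    # stopword removal
    words = [stemmer.stem(w) for w in words]            # Porter stemming
    return Counter(words)                               # distinct words with occurrence counts
```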

After this, the distinct words along with their counts of occurrences can be obtained for each document. As correct labels are known only for source documents and not for target ones, unsupervised techniques for term selection and weighting are employed. Selecting features on both domains generally ensures that words appearing in only one of them are not removed.

³ The list was retrieved from http://www.ranks.nl/resources/stopwords.html; actually, some of the 671 words are already filtered out by the previous steps.

Specifically, feature selection is based simply on document frequency: terms appearing in fewer than 3 documents are discarded. As for term weighting, various schemes will be tested. In the end, from each document $d \in \mathcal{D}$ a vector $w_d$ of term weights is extracted.

From the bags of words of source documents, whose labeling is known, an initial representation $w_c^0$ for each category $c \in \mathcal{C}$ can be computed as the centroid of its relevant documents, whose set is denoted with $R_c^0$.

$$R_c^0 = \{d \in \mathcal{D}_S : C_S(d) = c\}$$

$$w_c^0 = \frac{1}{|R_c^0|} \sum_{d \in R_c^0} w_d$$

At this point, the iterative phase begins by measuring the similarity of these representations with the bags of words of target documents by means of cosine similarity between the vectors, as discussed in §2.2.1. For each couple $(d, c) \in \mathcal{D}_T \times \mathcal{C}$, a relatedness score $s^0(d, c) = \cos(w_d, w_c^0)$ is thus obtained. To get for each document $d \in \mathcal{D}_T$ a distribution of probabilities across the possible categories, the score for each category is divided by the sum of those for all categories, so that they sum to one: the result for each category $c$ is referred to as $p^0(d, c)$.

$$p^0(d, c) = \frac{s^0(d, c)}{\sum_{\gamma \in \mathcal{C}} s^0(d, \gamma)}$$

These relative probabilities are compared with the confidence threshold $\rho$ to select a set $R_c^1 \subseteq \mathcal{D}_T$ of representative documents for each category $c \in \mathcal{C}$. Specifically, the documents chosen for each category $c$ are those whose relative probability both outscores those for the other categories and is superior to the $\rho$ threshold.

$$R_c^1 = \{d \in \mathcal{D}_T : p^0(d, c) > \rho \ \wedge\ p^0(d, c) \geq p^0(d, \gamma)\ \forall \gamma \in \mathcal{C}\}$$

Now, a new representation for each category can be computed as the centroid of these representative documents.

$$w_c^1 = \frac{1}{|R_c^1|} \sum_{d \in R_c^1} w_d$$

This completes the first iteration of the refining loop. For each subsequent iteration $i$, the computations to be executed are obtained by generalizing the above formulas.

$$s^i(d, c) = \cos(w_d, w_c^i) \qquad p^i(d, c) = \frac{s^i(d, c)}{\sum_{\gamma \in \mathcal{C}} s^i(d, \gamma)}$$

$$R_c^{i+1} = \{d \in \mathcal{D}_T : p^i(d, c) > \rho \ \wedge\ p^i(d, c) \geq p^i(d, \gamma)\ \forall \gamma \in \mathcal{C}\}$$

$$w_c^{i+1} = \frac{1}{|R_c^{i+1}|} \sum_{d \in R_c^{i+1}} w_d$$

At the end of each iteration $i$, i.e. after computing the new category profiles, the stop conditions are checked. The iterative phase terminates when

• the limit of $N_I$ iterations is reached ($i + 1 = N_I$), or

• the representations of all categories are identical to those obtained after the previous iteration:

$$\forall c \in \mathcal{C} : w_c^{i+1} = w_c^i$$

In either case, the algorithm terminates by measuring once again the similarity between the category profiles obtained at the last iteration and the target documents, predicting for each target document the category having the most similar profile.

$$\hat{C}_T(d) = \operatorname*{argmax}_{c \in \mathcal{C}} \cos(w_d, w_c^{N})$$

where $N$ denotes the last completed iteration. The pseudo-code for the whole algorithm (not considering the common document pre-processing phase) is shown in Figure 4.1: as can be seen, the process is quite straightforward, allowing it to be easily implemented in many possible programming languages and environments.

4.4.3 Computational complexity

The process performs many operations on vectors of length $|\mathcal{T}|$: while these operations would generally require a time linear in this length, given the prevalent sparsity of these vectors, suitable data structures can be used to bound both storage space and computation time linearly w.r.t. the mean number of non-zero elements.

Input: a bag of words $w_d$ for each document $d \in \mathcal{D} = \mathcal{D}_S \cup \mathcal{D}_T$, the set $\mathcal{C}$ of top categories, the labeling $C_S : \mathcal{D}_S \to \mathcal{C}$ for source documents, the confidence threshold $\rho$, the maximum number $N_I$ of iterations
Output: the predicted labeling $\hat{C}_T$ for documents of the target domain

for all c ∈ C do
    R_c^0 ← {d ∈ D_S : C_S(d) = c}
    w_c^0 ← (1 / |R_c^0|) · Σ_{d ∈ R_c^0} w_d
end for
i ← 0
while i < N_I ∧ (i = 0 ∨ ∃c ∈ C : R_c^i ≠ R_c^{i-1}) do
    for all (d, c) ∈ D_T × C do
        s^i(d, c) ← cos(w_d, w_c^i)
        p^i(d, c) ← s^i(d, c) / Σ_{γ ∈ C} s^i(d, γ)
    end for
    for all c ∈ C do
        A_c^i ← {d ∈ D_T : argmax_{γ ∈ C} p^i(d, γ) = c}
        R_c^{i+1} ← {d ∈ A_c^i : p^i(d, c) > ρ}
        w_c^{i+1} ← (1 / |R_c^{i+1}|) · Σ_{d ∈ R_c^{i+1}} w_d
    end for
    i ← i + 1
end while
for all d ∈ D_T do
    Ĉ_T(d) ← argmax_{c ∈ C} cos(w_d, w_c^i)
end for
return Ĉ_T

Figure 4.1 – Pseudo-code for the iterative refining algorithm.
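As an illustration of how directly the pseudo-code translates into a concrete implementation, the following Python sketch mirrors Figure 4.1 under the assumption that bags of words are dense NumPy vectors over a shared term index; it is not the original implementation, and the handling of categories left with no representative documents is an arbitrary choice of this sketch.

```python
# Illustrative Python rendering of the iterative refining algorithm of Figure 4.1.
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na > 0 and nb > 0 else 0.0

def iterative_refining(source_vecs, source_labels, target_vecs,
                       categories, rho, max_iters=50):
    # Initial category profiles: centroids of labeled source documents.
    profiles = {c: np.mean([v for v, l in zip(source_vecs, source_labels) if l == c],
                           axis=0) for c in categories}
    prev_sets = None
    for _ in range(max_iters):
        # Relative probability of each target document w.r.t. each category.
        sims = {d: {c: cosine(target_vecs[d], profiles[c]) for c in categories}
                for d in range(len(target_vecs))}
        probs = {d: {c: s / (sum(sims[d].values()) or 1.0) for c, s in sims[d].items()}
                 for d in sims}
        # Representative documents: best category and relative probability above rho.
        rep_sets = {c: frozenset(d for d in probs
                                 if max(probs[d], key=probs[d].get) == c
                                 and probs[d][c] > rho)
                    for c in categories}
        if rep_sets == prev_sets:
            break  # supporting sets (hence profiles) no longer change
        for c in categories:
            if rep_sets[c]:  # keep the previous profile if no document was selected
                profiles[c] = np.mean([target_vecs[d] for d in rep_sets[c]], axis=0)
        prev_sets = rep_sets
    # Final nearest-centroid classification of target documents.
    return [max(categories, key=lambda c: cosine(target_vecs[d], profiles[c]))
            for d in range(len(target_vecs))]
```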

To this extent, we denote with $l_D$ and $l_C$ the mean number of non-zero elements in the bags of words of documents and categories, respectively. By definition, we have $l_D \leq |\mathcal{T}|$ and $l_C \leq |\mathcal{T}|$; from the experiments it is also generally observed that $l_D \ll l_C < |\mathcal{T}|$. The cosine similarity of two vectors with $l_D$ and $l_C$ non-zero elements respectively can be computed in $O(l_D + l_C)$ time, which can be written as $O(l_C)$ given that $l_D < l_C$.

The construction of the initial representations of categories is done in $O(|\mathcal{D}_S| \cdot l_D)$ time, as all values of all document representations must be summed up. In each iteration of the refining phase, the method computes the cosine similarity for $N_T = |\mathcal{D}_T| \cdot |\mathcal{C}|$ document-category pairs and normalizes the scores to obtain probability distributions in $O(N_T \cdot l_C)$ time; then, to build the new bags of words of categories, up to $|\mathcal{D}_T|$ document bags must be summed up, which is done in $O(|\mathcal{D}_T| \cdot l_D)$ time. The sum of these two steps, always considering $l_D < l_C$, is $O(|\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which must be multiplied by the final number $n_I$ of iterations.

Summing up, the overall complexity of the method is $O(|\mathcal{D}_S| \cdot l_D + n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which can be approximated by the component for the iterative phase, $O(n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, with $l_C \leq |\mathcal{W}|$. The complexity is therefore linear in the number $|\mathcal{D}_T|$ of documents, the number $|\mathcal{C}|$ of top categories (usually very small), the mean number $l_C$ of terms per category (having $|\mathcal{T}|$ as an upper bound) and the number $n_I$ of iterations of the iterative phase, which in the experiments presented below is almost always less than 20. This complexity is comparable to that of the other methods described.
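For concreteness, the following sketch shows a cosine similarity computed directly on sparse {term: weight} dictionaries, whose cost is indeed proportional to the number of non-zero entries of the two vectors:

```python
# Minimal sketch: cosine similarity between two sparse bags of words stored as
# {term: weight} dictionaries; only non-zero entries are visited.
from math import sqrt

def sparse_cosine(a, b):
    if len(a) > len(b):
        a, b = b, a                                   # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```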

4.4.4 Results

Here are presented the results of experimental runs of the algorithm on the datasets described in §4.3.1. The splits specifically used are the following.

• For the problems on 20 Newsgroups, the splits used in other works and reported in Tables 4.1, 4.2 and 4.3 are used.

• From SRAA, similarly to other works, 4,000 documents for each of the four classes have been picked randomly and used across all runs.

• For Reuters-21578, a publicly available⁴ split distribution has been used.

⁴ Download URL: http://www.cse.ust.hk/TL/dataset/Reuters.zip

Regarding the parameters, the maximum number $N_I$ of iterations is fixed to 50, so that the iterative phase usually converges before this limit is reached. Selection of features is always performed by the aforementioned document frequency thresholding with a minimum of three documents, which is the scheme followed by most other works. The remaining "free" parameters are the term weighting scheme and the similarity threshold $\rho$ for selecting relevant documents in the iterative phase.

To find optimal values for these parameters, we performed tests on validation datasets, different from the "final" ones used for comparison with other works. Specifically, for each of the final datasets extracted from the 20 Newsgroups collection, a source and a target domain have been obtained by picking, among the categories normally constituting the source domain, two sub-categories for each top category, one for the source and one for the target.

Through experiments with different local weighting measures, either used as they are or combined with idf (inverse document frequency), the following ones generally gave better results:

• bin.idf (binary weighting · idf),

• logtf.idf (logarithmic term frequency · idf),

• ntf.idf (normalized term frequency · idf),

• cntf.idf (cosine-normalized term frequency · idf).

Plots in Figures 4.2, 4.3 and 4.4 report the accuracy levels obtained in the tests on the validation datasets, using the weighting schemes listed above and for different values of the $\rho$ parameter. In all cases, the accuracy maintains optimal values for values of $\rho$ not too far from the minimum possible (which is $1/|\mathcal{C}|$, for example 0.5 with two top categories), while important and usually sharp drops are observed as $\rho$ is overly increased. The range of values of $\rho$ where the accuracy is optimal generally depends on the dataset and the weighting scheme: we observe that the ntf.idf and cntf.idf schemes usually yield a larger range compared to the other two. Selecting the minimum value of $\rho$ generally guarantees very high accuracy, although slightly below the optimal value reachable by tuning the parameter at its best setting.

From the results obtained here, we have to pick default parameter values to run the method on the final test datasets.
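For reference, the sketch below gives one plausible formulation of the four weighting schemes compared above; the exact normalizations used in the experiments may differ slightly (e.g. ntf is assumed here to divide by the maximum term frequency of the document, and cntf to divide by the Euclidean norm of the term frequency vector).

```python
# Plausible formulations of the compared weighting schemes (assumptions noted above).
from math import log, sqrt

def weight_document(tf, doc_freqs, n_docs, scheme="cntf.idf"):
    """tf: {term: raw count} in one document; doc_freqs: {term: document frequency};
    returns {term: weight} under the chosen scheme."""
    max_tf = max(tf.values())
    l2_tf = sqrt(sum(f * f for f in tf.values()))
    weights = {}
    for t, f in tf.items():
        if scheme == "bin.idf":
            local = 1.0                  # binary weighting
        elif scheme == "logtf.idf":
            local = 1.0 + log(f)         # logarithmic term frequency
        elif scheme == "ntf.idf":
            local = f / max_tf           # term frequency normalized by the maximum tf
        else:                            # "cntf.idf"
            local = f / l2_tf            # cosine-normalized term frequency
        weights[t] = local * log(n_docs / doc_freqs[t])   # local weight times idf
    return weights
```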

[Figure 4.2: six panels (comp vs sci, rec vs talk, rec vs sci, sci vs talk, comp vs rec, comp vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes]

Figure 4.2 – Accuracy on 20 Newsgroups two-categories validation splits with different term weighting schemes as the confidence threshold ρ varies

[Figure 4.3: four panels plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes]

Figure 4.3 – Accuracy on 20 Newsgroups three-categories validation splits with different term weighting schemes as the confidence threshold ρ varies

Regarding term weighting, cntf.idf (cosine-normalized tf·idf) is chosen as the scheme yielding the highest overall accuracy, considering an average over all tests, with ntf.idf being only slightly inferior. On the other hand, the optimal value of the confidence threshold ρ depends largely on the number $|\mathcal{C}|$ of top categories: we therefore differentiate its value on the basis of this variable. Again considering all the tests on the validation datasets, the values of ρ yielding the highest average accuracy on datasets with 2, 3 and 4 top categories are 0.55, 0.4 and 0.36, respectively: we pick those as the default values of the threshold.

[Figure 4.4: single panel (comp vs rec vs sci vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes]

Figure 4.4 – Accuracy on the 20 Newsgroups four-categories validation split with different term weighting schemes as the confidence threshold ρ varies

Using these default parameter values, results on the effective test datasets are now shown, along with some results for different values. To evaluate the "difficulty" of classifying the documents of each dataset independently from the cross-domain setting, two types of preliminary baseline results are given in Table 4.4. These results denote for each dataset a range within which the actual cross-domain accuracy should lie.

• The no transfer baseline is the accuracy obtained by classifying target documents directly using the initial category representations extracted from the source domain. It constitutes a sort of lower bound for the cross-domain accuracy.

• The same domain baseline is the accuracy obtained by classifying target documents using category representations extracted from the target domain itself, corresponding in practice to standard nearest centroid classification using the target domain as both training and test set. It constitutes an upper bound for the cross-domain accuracy.

In the upper-bound results, the comparison of the performances of the different weighting schemes does not reveal important differences between them.

Table 4.4 – Baseline results for classification in cross domain datasets with different term weighting schemes (names shortened for space constraints, so e.g. “bin” stands for “bin.idf”).

                          No transfer                   Same domain
Dataset                   bin    logtf  ntf    cntf     bin    logtf  ntf    cntf
20 Newsgroups
comp vs sci               0.828  0.819  0.760  0.786    0.982  0.983  0.988  0.987
rec vs talk               0.653  0.645  0.641  0.647    0.993  0.994  0.998  0.996
rec vs sci                0.843  0.847  0.821  0.829    0.989  0.987  0.990  0.989
sci vs talk               0.763  0.762  0.796  0.791    0.985  0.983  0.989  0.989
comp vs rec               0.874  0.881  0.904  0.902    0.985  0.988  0.992  0.990
comp vs talk              0.967  0.968  0.965  0.962    0.995  0.993  0.995  0.995
comp - rec - sci          0.668  0.674  0.682  0.683    0.960  0.964  0.973  0.972
rec - sci - talk          0.527  0.510  0.487  0.493    0.986  0.983  0.991  0.989
comp - sci - talk         0.709  0.707  0.721  0.705    0.982  0.982  0.986  0.985
comp - rec - talk         0.918  0.923  0.916  0.919    0.982  0.985  0.990  0.988
comp rec sci talk         0.641  0.632  0.587  0.602    0.978  0.977  0.983  0.980
SRAA
real vs simulated         0.700  0.708  0.723  0.719    0.961  0.961  0.965  0.961
auto vs aviation          0.807  0.822  0.816  0.818    0.972  0.971  0.978  0.976
Reuters-21578
orgs vs places            0.707  0.722  0.733  0.726    0.915  0.909  0.908  0.905
orgs vs people            0.783  0.768  0.781  0.774    0.928  0.930  0.916  0.932
people vs places          0.632  0.628  0.627  0.635    0.929  0.930  0.926  0.927

Nonetheless, ntf.idf usually performs better, closely followed by cntf.idf. For the lower bound, instead, the variance between the results with different schemes is much higher and there is no clearly best scheme: this shows the necessity of cross-domain methods, as normally classifying target documents using a model trained on the different source domain yields poor results.

Table 4.5 reports the effective cross-domain results for all datasets, comparing the same weighting measures with the default values of the confidence threshold and reporting both the final accuracy and the number of iterations run before convergence. As for the validation datasets, the cntf.idf scheme usually appears to perform best: with each of the other schemes, important accuracy drops are experienced in at least one case.

Table 4.5 – Accuracy and number of iterations for cross-domain catego- rization on different datasets with different term weighting schemes.

                       bin.idf       logtf.idf     ntf.idf       cntf.idf
Dataset                Acc.   Itrs.  Acc.   Itrs.  Acc.   Itrs.  Acc.   Itrs.
20 Newsgroups – 2 classes (ρ = 0.55)
comp vs sci            0.975    7    0.980    8    0.977   13    0.978    9
rec vs talk            0.992    7    0.992    6    0.994    8    0.993    6
rec vs sci             0.982    8    0.980    7    0.984    8    0.985    8
sci vs talk            0.972   15    0.972    9    0.976    9    0.976    9
comp vs rec            0.982   10    0.983    8    0.980   10    0.981    9
comp vs talk           0.990    5    0.990    5    0.991    6    0.992    5
20 Newsgroups – 3 classes (ρ = 0.4)
comp - rec - sci       0.657   16    0.791   24    0.939   16    0.938   29
rec - sci - talk       0.980   10    0.979   11    0.853   12    0.979   12
comp - sci - talk      0.819   10    0.818   12    0.953   15    0.966   13
comp - rec - talk      0.976   11    0.978    9    0.977    7    0.981    8
20 Newsgroups – 4 classes (ρ = 0.36)
comp rec sci talk      0.539   31    0.539   24    0.569   18    0.860   27
SRAA – 2 classes (ρ = 0.55)
real vs simulated      0.950   20    0.949   23    0.932   12    0.936   11
auto vs aviation       0.959   19    0.960   11    0.956   27    0.958   19
Reuters-21578 – 2 classes (ρ = 0.55)
orgs vs places         0.731    9    0.729    6    0.721   13    0.722    9
orgs vs people         0.787   16    0.777   14    0.803   22    0.770   40
people vs places       0.625   19    0.501   39    0.730   44    0.732   25

[Figure 4.5: two panels (comp vs sci, rec vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes]

Figure 4.5 – Accuracy on two 20 Newsgroups two-categories splits with different term weighting schemes as the confidence threshold ρ varies

Among the other schemes, ntf.idf is the one least subject to such drops. Regarding the number of iterations, no scheme shows a consistent advantage over the others, although cntf.idf often involves a nearly-optimal iteration count besides yielding a good accuracy.

Regarding instead the similarity threshold ρ, the plots in Figures 4.5 and 4.6 show how the accuracy varies on some datasets and for different weighting schemes as its value changes, similarly to the plots shown for the validation datasets. In all cases, the accuracy maintains optimal values for some values of ρ near the minimum possible, with sharp drops as ρ is increased: this generally confirms the observations made above while tuning the parameter.

By observing the running tests with high values of ρ and poor accuracies, it is usually noted that the iterative phase stops after few iterations, only two in most cases. In these cases, the category profiles generated after the first iteration are generally built from few or even no documents at all and they remain the same in the second iteration, causing the algorithm to terminate. What presumably happens is that the threshold is set so high that few or no target documents reach a sufficient similarity degree with the categories: the variant based on logistic regression presented below partly overcomes this phenomenon.

Table 4.6 – Results of our method (I.R., rightmost column) on selected test datasets, compared with those reported by other works; in the original table the best result for each dataset (excluding baselines) is marked in bold.

                       Other methods
Dataset                CoCC   TPLSA  CDSC   MTrick TCA      I.R.
20 Newsgroups – 2 classes (ρ = 0.55)
comp vs sci            0.870  0.989  0.902  -      0.891    0.978
rec vs talk            0.965  0.977  0.908  0.950  0.962    0.993
rec vs sci             0.945  0.951  0.876  0.955  0.879    0.985
sci vs talk            0.946  0.962  0.956  0.937  0.940    0.976
comp vs rec            0.958  0.951  0.958  -      0.940    0.981
comp vs talk           0.980  0.977  0.976  -      0.967    0.992
20 Newsgroups – 3 classes (ρ = 0.4)
comp - rec - sci       -      -      -      0.932  -        0.938
rec - sci - talk       -      -      -      0.936  -        0.979
comp - sci - talk      -      -      -      0.921  -        0.966
comp - rec - talk      -      -      -      0.955  -        0.981
SRAA – 2 classes (ρ = 0.55)
real vs simulated      0.880  0.889  0.812  -      -        0.936
auto vs aviation       0.932  0.947  0.880  -      -        0.958
Reuters-21578 – 2 classes (ρ = 0.55)
orgs vs places         0.680  0.653  0.682  0.768  0.730    0.722
orgs vs people         0.764  0.763  0.768  0.808  0.792    0.770
people vs places       0.826  0.805  0.798  0.690  0.626    0.732

Values for the 20 Newsgroups collection reported by "MTrick" are actually not computed on single runs, but are averages of multiple runs, each with an equal set of top categories, where a baseline document classifier trained on the source domain and tested on the target one got an accuracy higher than 65%.

[Figure 4.6: single panel (comp vs sci vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes]

Figure 4.6 – Accuracy on a 20 Newsgroups three-categories split with different term weighting schemes as the confidence threshold ρ varies

Table 4.6 compares the accuracy obtained by the iterative refining algorithm using default parameters with those reported by the papers presenting other cross-domain text categorization methods, namely:

CoCC the co-clustering based approach [24],

TPLSA the topic-bridged PLSA [120],

CDSC the spectral classification method [72],

MTrick the method based on matrix trifactorization [129],

TCA topic correlation analysis [69].

In most tests, the iterative refining approach performs comparably to or better than the other methods. The most peculiar results are on the Reuters-21578 collection, where the variance between the different methods is much higher⁵: this is probably due to the not fully proper use of orgs, people and places as top categories, as they are independent groups of labels which often co-occur in the documents; the relatively low results for the upper bounds (Table 4.4) also support this hypothesis.

⁵ Comparing the reported results, it is actually likely that the distribution used in some works has mislabeled datasets, as the results given by CoCC, TPLSA and CDSC are very different from the others but would be comparable by permuting the datasets.

Regarding the running times, depending on the considered dataset, the pre-processing phase where documents are read, processed and stored in memory in bag-of-words form takes about 5 to 10 seconds, the creation of the initial category profiles does not take more than one second and the refining phase requires about 0.5 to 2 seconds for each iteration. Overall, many single tests generally run within 15–20 seconds, with the longest-running ones reaching 30 seconds.

While the given results show a satisfactory performance of the algorithm on the benchmark datasets, in the rest of the section two variants are proposed to make the algorithm less sensitive to parameter tuning and to improve its running time. As the two variants operate on distinct parts of the algorithm, they can be applied simultaneously.

4.4.5 Variant with logistic regression

In the base method described above, the result of the cosine similarity between two bags of words is directly taken as an estimate of their absolute (i.e. not conditioned on the others) probability of being related. However, this is a rough approximation. For example, it would imply that a document and a category which are very related to each other have representative vectors with a mutual cosine similarity very close to 1, but this is unlikely to happen in practice. Similarly, the vectors of unrelated documents and categories should have a cosine similarity close to 0; but often there are some words which are present in almost all documents, granting a minimum degree of similarity to all documents and categories, although lower with respect to related ones.

As an example, the histograms in Figure 4.7 show how related and unrelated document-category couples of the source domain for the rec vs talk dataset used in the experiments are distributed across various ranges of cosine similarity. While the average value for unrelated couples is near 0, that for related couples is very far from the ideal value of 1. Couples extracted from other datasets generally follow similar distributions.

A possible solution is to apply some sort of correction to the "raw" cosine similarity value, so that typical similarity values for couples of related documents and categories are mapped to sensibly high probabilities and, conversely, typical similarities for unrelated couples are mapped to consistently low probabilities. This can be formalized as applying a suitable function $\pi : [0, 1] \to [0, 1]$ to the cosine similarity between the vectors of a document $d$ and a category $c$ to find their final relatedness probability $s^i(d, c)$ (with reference to a generic iteration $i$ of the process).

[Figure 4.7: two histograms of cosine similarity, for related couples (mean ≅ 0.144) and unrelated couples (mean ≅ 0.058)]

Figure 4.7 – Distribution of cosine similarity for couples of source documents and categories in the 20 Newsgroups rec vs talk dataset; dashed lines indicate averages

$$s^i(d, c) = \pi(\cos(w_d, w_c^i))$$

To pick a function which adheres to the aforementioned conditions, the typical values of cosine similarity for related and unrelated couples should be known. The option chosen here is simply to sample values from couples of the source domain: as the labeling of these documents is known, it is possible to measure the cosine similarity $\cos(w_d, w_c^0)$ for all couples $(d, c) \in \mathcal{D}_S \times \mathcal{C}$, which constitute samples of its distribution for both related couples (those where $C_S(d) = c$) and unrelated ones (the remaining couples).

To generalize these observations, univariate logistic regression is used. In general, logistic regression is a probabilistic classification model which can be used as a machine learning technique. In the specific univariate case, a value $\pi(x) \in [0, 1]$ is returned from a single predictive feature with a value $x \in \mathbb{R}$, according to a function of the following form:

$$\pi(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

The goal of the logistic regression algorithm is to find optimal values for the parameters $\beta_0$ and $\beta_1$.

[Figure 4.8: plot of relatedness (y axis) against cosine similarity (x axis)]

Figure 4.8 – Logistic function extracted by regression on the data of Figure 4.7, for which summary information is reported through box plots (whiskers are total range, box lines are quartiles, diamond is average)

Given $n$ observations $x_1, x_2, \ldots, x_n \in \mathbb{R}$, respectively labeled with $y_1, y_2, \ldots, y_n \in \{0, 1\}$, the value of the following objective function must be maximized.

$$\prod_{i=1}^{n} \pi(x_i)^{y_i} \, (1 - \pi(x_i))^{1 - y_i}$$

In the problem under analysis, each observation corresponds to a generic source document-category couple $(d, c) \in \mathcal{D}_S \times \mathcal{C}$, the corresponding $x$ value is the cosine similarity of their representations and the $y$ label is 1 if the two are related and 0 otherwise.

$$x_{d,c} = \cos(w_d, w_c^0) \qquad y_{d,c} = \begin{cases} 1 & \text{if } C_S(d) = c \\ 0 & \text{if } C_S(d) \neq c \end{cases}$$

As an example, Figure 4.8 shows the function which is obtained through logistic regression on couples of the rec vs talk dataset, whose distribution of samples was reported in Figure 4.7. The function returns values above 0.99 for any cosine similarity of at least 0.2, reflecting that such similarity values are reached almost exclusively by related couples.

In practice, the base method is modified as follows. After computing the representations of each document and each category from the source domain and before starting the iterative phase, a univariate logistic regression model $\pi$ is trained from observations extracted as specified above from all possible document-category couples $(d, c) \in \mathcal{D}_S \times \mathcal{C}$.

[Figure 4.9: two panels (comp vs sci, rec vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes with logistic regression, plus cntf.idf without logistic regression for comparison]

Figure 4.9 – Accuracy on two 20 Newsgroups two-categories splits with the logistic regression variant and different term weighting schemes as the confidence threshold ρ varies

Then, during the iterative phase, the model is applied to the result of the cosine similarity when computing the absolute relatedness between any document $d$ and any category $c$:

$$s^i(d, c) = \cos(w_d, w_c^i) \quad \text{is replaced by} \quad s^i(d, c) = \pi(\cos(w_d, w_c^i))$$
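A sketch of how this correction could be fitted and applied, assuming scikit-learn's logistic regression (the thesis does not prescribe a specific library, and the variable names are illustrative):

```python
# Sketch of the logistic regression correction, assuming scikit-learn.
# cos_source[i] is the cosine similarity of the i-th source document-category
# couple and related[i] is 1 if that couple is related, 0 otherwise.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_relatedness_model(cos_source, related):
    model = LogisticRegression()
    model.fit(np.asarray(cos_source).reshape(-1, 1), np.asarray(related))
    return model

def pi(model, cosine_value):
    """Corrected relatedness score for a raw cosine similarity."""
    return model.predict_proba([[cosine_value]])[0, 1]   # probability of the 'related' class
```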

For what concerns computational complexity, to fit the logistic regression model the cosine similarity for $N_S = |\mathcal{D}_S| \cdot |\mathcal{C}|$ pairs must be computed to acquire the input data, which requires $O(l_C \cdot N_S)$ time; then the model can be fit with one of various optimization methods, which are generally linear in the number $N_S$ of data samples [82]. In practice, in the experiments described hereafter, the whole process roughly takes the time of one or two iterations of the refining phase, i.e. not more than 5 seconds.

Experiments on this variant have been carried out using the same parameter values as those on the base algorithm. The plots in Figures 4.9 and 4.10 are the counterparts of those in Figures 4.5 and 4.6 with this variant applied: they show how accuracy varies with the weighting scheme and the confidence threshold ρ; in all of them, the results of the base method with the default cntf.idf weighting are also shown for comparison.

[Figure 4.10: single panel (comp vs sci vs talk) plotting accuracy against the confidence threshold for the bin.idf, logtf.idf, ntf.idf and cntf.idf schemes with logistic regression, plus cntf.idf without logistic regression for comparison]

Figure 4.10 – Accuracy on a 20 Newsgroups three-categories split with the logistic regression variant and different term weighting schemes as the confidence threshold ρ varies

As can be seen, there is still a drop of accuracy for too high values of the threshold, but this now happens at higher values across all datasets and weighting schemes, as shown by the comparison with cntf.idf in the normal case, which is the most "stable" scheme together with ntf.idf. In summary, this variant does not yield significant improvements of the results when the parameters are optimally tuned, but it guarantees optimal results for a wider range of parameter values, thanks to a more solid estimation of the similarities between document and category representations.

4.4.6 Variant with termination by quasi-similarity

As discussed above, the expected behavior of the iterative refining algorithm is that the representations of categories are progressively improved at each iteration, finally leading to an optimal classification accuracy. However, it is interesting to observe how much the profiles of the categories actually improve over the iterations and, specifically, how accurately documents would be classified with them.

Figure 4.11 reports a plot of the accuracy levels which would be obtained on different datasets by stopping the iterative phase at a certain number of steps and classifying target documents using the category profiles of that moment; the monitored tests are three of those using cntf.idf weighting reported in Table 4.5.

[Figure 4.11: accuracy progression over iterations (0 to 20) for the comp vs sci, comp sci talk and auto vs avi datasets]

Figure 4.11 – Progression of accuracy through refining iterations with the cntf.idf scheme on different datasets

It can be noted that the accuracy generally grows rapidly in the first iterations and only has minor fluctuations in later ones. In the shown cases, 4 iterations grant an accuracy at most 2% below the convergence value, while with 5 iterations the optimal result is within 1%.

In order to further optimize running times while maintaining a good accuracy level, the iterative phase could be stopped before reaching the convergence condition, i.e. before the profiles of each category are equal to those of the previous step. This could be achieved by setting a lower maximum number $N_I$ of iterations, such as 5 or 10, but the resulting approximation could significantly vary with the dataset and the values of the other parameters. For example, referring to Figure 4.11, setting $N_I = 2$ would cause an acceptable approximation of the accuracy in the auto vs avi dataset (95.2% instead of 95.8%), while for comp vs sci vs talk the drop would be larger (90.1% instead of 96.6%). An ideal solution should instead grant a similar approximation level across any dataset.

Rather than setting a fixed limit on the number of iterations, an alternative approach is to set a relaxed convergence condition which is satisfied more "easily" and thus potentially causes earlier termination. An intuitive solution is, for each category $c \in \mathcal{C}$, rather than requiring the new profile $w_c^{i+1}$ to be equal to the previous one $w_c^i$, to require it to be close enough.

The closeness between two successive representations can be computed trivially by means of cosine similarity. Formally, given a threshold $\beta \in [0, 1]$, the second termination condition of the base method (§4.4.2) is substituted with the following:

• the representations of all categories have a cosine similarity with their respective previous versions of at least $\beta$:

$$\forall c \in \mathcal{C} : \cos(w_c^{i+1}, w_c^i) \geq \beta$$
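In code, the change amounts to replacing the strict equality test on category profiles with a cosine-similarity threshold; a minimal sketch, assuming profiles stored as NumPy vectors:

```python
# Relaxed stop condition of the quasi-similarity variant: new and old profiles
# need not be identical, only cosine-similar up to beta (e.g. 0.999 or 0.9999).
import numpy as np

def profiles_converged(new_profiles, old_profiles, beta=0.999):
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return all(cos(new_profiles[c], old_profiles[c]) >= beta for c in new_profiles)
```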

The parameter β can in theory assume any value between 0 and 1; however, the values to be used in practice are those very close to 1, such as 0.99, 0.999 and similar: smaller values would usually stop the iterative phase too prematurely and cause significant accuracy drops. The value of β in practice sets a trade-off between keeping the accuracy optimal (for greater values) and reducing the number of iterations (for smaller values). It is expected that the effects of varying the parameter β are more consistent across datasets with respect to limiting the absolute maximum number $N_I$ of iterations, so that the new parameter does not need to be tuned for each single dataset.

Again, experiments on this variant are performed with the same parameter values used in the base method. Table 4.7 reports the results, in terms of both accuracy and number of iterations, for the tests with ntf.idf and cntf.idf already reported in Table 4.5, along with the new results obtained setting the β threshold to 0.9999 or 0.999. The results show that the number of iterations is generally reduced, sometimes significantly: with 0.999, it is at least halved in some cases. As regards the accuracy drop, for β = 0.9999 it is always within 0.4%, while for β = 0.999 it is always under 1%, although with an important exception in comp vs rec vs sci. There are even some cases, especially for the Reuters datasets, where stopping the iterative phase earlier even results in better accuracies.

Summing up, the goal generally reached with this variant is to reduce the number of iterations of the method and consequently its running time, with a limited drop of accuracy. It can also be noted that, apart from the cited exceptions, the drop of accuracy across different datasets is very consistent for equal values of the β parameter: this allows to set a generally predictable trade-off between accuracy and running time independently from the data under analysis.

Table 4.7 – Comparison with two weighting schemes of accuracy (A) and number of iterations (I) between the base method and two settings for termination by similarity threshold.

                       ntf.idf                                cntf.idf
β →                    base        0.9999      0.999          base        0.9999      0.999
Dataset                A      I    A      I    A      I       A      I    A      I    A      I
20 Newsgroups – 2 classes (ρ = 0.55)
comp vs sci            0.977 13    0.977  9    0.976  7       0.978  9    0.978  6    0.977  5
rec vs talk            0.994  8    0.994  6    0.993  5       0.993  6    0.993  5    0.993  4
rec vs sci             0.984  8    0.984  6    0.984  5       0.985  8    0.985  6    0.985  4
sci vs talk            0.976  9    0.976  7    0.975  5       0.976  9    0.976  6    0.976  5
comp vs rec            0.980 10    0.979  5    0.978  4       0.981  9    0.981  5    0.980  4
comp vs talk           0.991  6    0.991  6    0.990  3       0.992  5    0.992  4    0.992  3
20 Newsgroups – 3 classes (ρ = 0.4)
comp rec sci           0.939 16    0.939 13    0.937  9       0.938 29    0.935 21    0.886  6
rec sci talk           0.853 12    0.853 10    0.852  8       0.979 12    0.979  8    0.978  7
comp sci talk          0.953 15    0.952 11    0.949  8       0.966 13    0.965  9    0.961  6
comp rec talk          0.977  7    0.977  5    0.976  3       0.981  8    0.979  4    0.979  3
20 Newsgroups – 4 classes (ρ = 0.36)
comp rec sci talk      0.569 18    0.569 16    0.573 12       0.860 27    0.859 21    0.859 20
SRAA – 2 classes (ρ = 0.55)
real vs sim            0.932 12    0.928 11    0.922  5       0.936 11    0.930  7    0.928  4
auto vs avi            0.956 27    0.960 12    0.950  4       0.958 19    0.962  7    0.958  4
Reuters-21578 – 2 classes (ρ = 0.55)
orgs places            0.721 13    0.721 10    0.725  6       0.722  9    0.722  7    0.722  5
orgs people            0.803 22    0.805 12    0.820  7       0.770 40    0.815 10    0.824  6
people places          0.730 44    0.730 43    0.773  7       0.732 25    0.732 19    0.756 10

4.4.7 Discussion

The presented method for cross-domain text categorization based on iterative refining can be seen as an adaptation of nearest centroid classification, where representations of the categories of interest are extracted from the source domain and then adapted to the target domain by progressive improvements, so that they get as close as possible to the ideal profiles.

The comparison of the obtained results with the baselines shows a substantial improvement over the results which would be obtained without the iterative phase and, above all, they are rather close to the results which would be obtained by knowing in advance the correct centroids for the target domain, indicating that the method is effective in obtaining a good approximation of them. The suggested extensions do not give noticeable improvements in the classification accuracy, but the logistic regression model is useful to make the method more robust with respect to the choice of parameters, while termination by approximate similarity allows to set a trade-off between accuracy and running time with a roughly predictable effect. Compared with other cross-domain categorization methods, this one obtains better or comparable accuracy, despite its relative conceptual simplicity (leading to a trivial implementation) and limited running times.

Other than the experiments on cross-domain text classification by topic, some tests on datasets used for sentiment analysis were performed, which did not result in satisfactory accuracy levels with respect to the state of the art. However, no variants specifically suited to sentiment analysis have been tested so far: these possibly include the use of n-grams (§2.3.3) and of particular term weighting schemes [86].

Another important improvement of the iterative refining method could be obtained by substituting the centroids and the simple measurement of their distance from documents with a potentially more accurate model. For example, the documents selected as representative for a category at an iteration could be used to train a SVM or another machine learning-based model, used then to estimate a relatedness score for all documents. While this would likely increase the computational effort of each iteration, it is possible that this would be compensated by a smaller number of such iterations.

The logistic regression model presented in the related variant, which estimates the degree of relatedness between documents and categories from the cosine similarity of their representations, is trained on the documents of the source domain of each experiment and applied to the corresponding target documents: a key assumption made in this procedure is that the ideal function correlating measurable similarity and semantic relatedness remains sufficiently valid across the two domains. The work presented in the next chapter starts from a similar hypothesis, applied to a more complex model and to more general text categorization tasks.

Chapter 5

A Domain-Independent Model for Semantic Relatedness

Up to now, known approaches for standard text categorization based on machine learning and for the cross-domain variant have been analyzed. These are generally based on extracting a knowledge model from a training set of pre-labeled documents and using it to classify other documents of the same domain or, in the case of cross-domain methods, of a slightly different one.

This chapter introduces an approach differing from the previous ones, based on the creation and use of a knowledge model which further generalizes the information extracted from the training set, to make it potentially applicable to completely different domains. The goal is to investigate the possibility of creating a generally valid model for the evaluation of semantic relatedness, using text categorization as a concrete application for testing it.

5.1 General idea

Most text categorization methods use machine learning algorithms to infer classification models which allow to predict the membership of subsequent documents to the categories seen in the training set. Specifically, the extracted knowledge models embed information which is representative of the specific categories they were trained on and are aimed specifically at recognizing those categories. Obviously, if new categories to be recognized are subsequently introduced into the system, the training of new knowledge models is required, which may take a considerable amount of time. Even updating a classifier for some category with new documents requires an additional, possibly costly training phase, unless incremental learning methods are used.

In other words, most text categorization methods work by extracting knowledge which is focused on the taxonomy of categories (either flat or hierarchical) known from the training set and cannot be easily adapted to different topics which might be introduced at a later time in the same taxonomy or in a different context.

The nearest centroid classifier described in Section 3.6 can be considered less prone to this problem: rather than being based on arbitrarily complex classification models, each category is simply represented by the centroid of the respective relevant training documents. This approach is intuitively simpler in computational terms, as computing a mean of vectors is usually much cheaper than building a knowledge model like a decision tree or a SVM. In this case, introducing new categories in the classification context would still require providing a set of representative documents, but their processing would be very quick.

However, the simplicity of the approach, consisting in its basic form in measuring the similarity between vectors of known categories and new documents, cannot guarantee the optimal results obtained with more complex machine learning methods: categories must be linearly separable in the space and the number of features must be adequate. As discussed previously, because of these limits, centroid-based methods did not receive much focus in recent research on text categorization. The approach proposed here provides in practice a finer method to compare similarly structured representations of documents and categories, which could potentially improve the accuracy of the centroid-based approach while maintaining the flexibility advantages cited above.

Making a step backwards from the typically used vector space model, where documents are reduced to vectors based on a global feature set extracted at training time, each document can be seen as a set of arbitrary words, with associated counts of occurrences and possibly derived weights. Likewise, categories can be represented in the same way by summing up the representations of the single documents related to them. Each of the words carries a meaning, which might be related to other meanings expressed by other words in the same document or in other documents. These meanings can be inferred with the support of semantic knowledge bases, discussed in Section 2.6.

While works based on the vector space model leveraging semantic knowledge generally use it to augment the feature space or to modify the weights of terms, here it is used to determine which words are semantically related to each other and also how they are related.
When juxtaposing the words contained in a document to those representing a whole category, different types of semantic relationships between them can be identified, which could be indicative of their relatedness as wholes. The identification of such semantic relationships is possible by using a suitable "normative" knowledge base which explicitly indicates these relationships, such as WordNet (§2.6.1).

We expect that, from the semantic relationships between the words representing an object (either a document or a category) and those representing another one, the semantic relationship between the two objects themselves can be inferred. Concretely, this implies that the degree of relatedness between a document and a category may be measured from the relationships between their words: roughly, their similarity should be proportional to the quantity of detected relationships. Equal or similar assumptions are made in other works where semantic knowledge is employed. However, in addition to the intensity of the relationship between two objects, we consider that even its type might be distinguished among more possibilities: for example, within a hierarchical taxonomy of categories, one of them may be specifically identified as an ancestor of the other.

The identification of the relationships between objects requires correctly interpreting those between their words: for example, it must be known how many couples of related words must be present in order to consider the objects themselves significantly related. The proposed approach is based on using standard machine learning techniques to capture this general knowledge, rather than training models strictly on the documents and categories under analysis.

Training set for typical TC methods example features (terms) label document shot movie song ... category doc1 2 1 1 . . . movies doc2 0 1 3 . . . music doc3 1 2 4 . . . music ...... terms used to predict categories

Training set of couples in our approach example features (term relationships) label couple synonyms hypernyms ... relation doc1-movies 4 2 . . . related doc1-music 1 3 . . . unrelated doc2-movies 0 2 . . . unrelated ...... term relationships used to predict document-category relationships Figure 5.1 – Comparison between explanatory examples of training sets for classic text classification methods and for the proposed approach: for simplicity, absolute numbers of occurrences of terms in documents and of relationships in document-category couples are used as features, respectively. the known relationship between the document and the category involved: in the most basic case, the document may be either related to the category (i.e. labeled with it) or not. An adequate number of these labeled vectors can be gathered to consti- tute a training set, to be given in input to a standard supervised machine learning algorithm, like those seen so far. This yields a classification model where the predictive features are based on semantic relationships between words and the known classes are the possible relationships between docu- ments and categories. Both the final training set and the resulting model references neither specific words nor specific categories used to build the training set: this is shown in Table 5.1, comparing the training set of a typical document classifier to that for the model discussed here. Assuming the existence of general latent rules binding relationships be- 5.1. General idea 105 tween words to those between whole objects, a machine learning algorithm, given a proper training set, can potentially capture these rules in a knowl- edge model, which could then be applied in any context. In practice, once such a knowledge model is extracted from a domain, it could be applied to compare representations of documents and categories within it as well as in other domains. Using representations of categories, an arbitrary document can be compared to categories in order to find the ones deemed most related to it, thus classifying the document. For a new domain this requires the creation of representations of categories, which are simply computed as the mean point of relevant documents, but the training of the general semantic knowledge should not need to be repeated. The construction of the described knowledge model allows to perform classification of documents by comparing them to profiles of categories, likely to the nearest centroid method, but with potentially more precision due to the use of semantic information and with the possibility to identify multiple modalities of relatedness: we leverage this aspect to distinguish documents discussing a general topic from those dealing with a specific branch of it. Additionally, we expect that the model has the property of being equally valid in any domain, so new categories can be introduced at a later time in the same domain or even different domains can be targeted, just by providing representations of relevant categories. In the rest of the chapter, it is described in detail how to apply this model to text categorization. Over a standard approach, this method offers the advantages described hereafter.

5.1.1 Possible applications The described approach has some interesting potential applications in prac- tical text categorization tasks. Suppose to create a domain-independent knowledge model as described above from an adequate training set of labeled documents. A suitable classification system based on the model, using representations of categories already employed for its training, can be immediately set up to classify new documents in the same domain by comparing them with all categories (using the respective representations) to find the most likely ones. To possibly save time, the model may even be trained considering only a part of the categories, as long as they are in a number sufficient for the model to be general enough. 106 Chapter 5. Domain-Independent Semantic Relatedness Model

Moreover, it is generally possible to alter the organization of categories while the system is already operational. In particular, new categories can be introduced: once their profiles are given, the system can compare new documents to them in addition to profiles of original categories, so to be possibly labeled with them. In practice, given a set of representative docu- ments, a category can be added efficiently just by computing their centroid. Obviously, it is also possible to remove a category from the set of recog- nized ones in order to ignore it or to update one’s representation to reflect changes in the context. In the end, as suggested above, the same model as is can be transposed to an arbitrarily different domain where to classify documents. Once profiles of categories are built from a set of labeled documents, always by computing respective means as in the nearest centroids method, a classifier for the new domain based on the already available model can be made operational. Also in this context the taxonomy of categories can be dynamic, with efficient addition of new categories or other changes at a later time. Consider as a practical use case a news site classifying articles incom- ing from a constant stream into a taxonomy of categories, for the conve- nience of the users browsing the site. The taxonomy should include both top-level categories like economy, politics and sports and specific cat- egories therein about relevant topics and events like financial crisis, presidential elections and Olympic Games. As new events emerge over time, sometimes unexpectedly, categories representing them must be added while the classification system is already operational. The addition of new categories generally requires to both collect a suf- ficient number of documents representing the new topics and to train a new or updated classifier to update the knowledge of the system. Using this approach, after collecting the relevant documents, the update of the taxonomy requires negligible computational effort.

5.2 Related work

Related work about text categorization in general has already been presented throughout Chapter 3. Here some additional works having one or more aspects in common with the approach proposed here are presented.

A prominent characteristic of the presented method is the use of an external semantic knowledge base to obtain the required information: this aspect is common to many other methods for text categorization. As discussed in Section 2.6, the knowledge bases used can be either structured and built ad hoc or not fully structured, thus requiring adequate processing to be used. Of the two, the method presented here requires a knowledge base of the former kind (or anyway incorporating suitable structured knowledge) in order to retrieve semantic relationships between words.

In [56] the use of "internal" semantic information is tested: categorization is performed on the Brown Corpus, whose words are already sense-tagged, using alternatively words or senses as features; the measured difference in accuracy between the two approaches is marginal.

Many works make use of external semantic knowledge by substituting or enriching the representations of documents with concepts expressed by terms, as anticipated in §2.3.4. For example, in [106] WordNet synsets are used as features, weighting them also by the respective hypernyms found in the text. In [7] synsets are used alongside words, testing two domain-specific ontologies in addition to WordNet: hypernyms are searched to augment their weights and basic forms of word sense disambiguation are also tested. In [94] features corresponding to WordNet synsets are added to those for words and their weights are adjusted according to the subsumption hierarchy, by increasing that of each general synset according to those of its hyponyms; moreover, this method, like the one proposed here, is also based on creating category representations, which however are compared to documents simply by means of cosine similarity.

Alternative approaches to leveraging semantic information, other than adding features, also exist. As already cited in §3.5.2, in [110] WordNet is used to build a semantic kernel to be employed in a standard support vector machine learner. In [79] a term weighting scheme is instead proposed, based on semantic knowledge and also on category names. Other works make use of suitable unstructured data, which is processed to obtain a subsumption hierarchy of concepts and other information similar to that given by WordNet. In [40] the introduction of additional features is tested by extracting hundreds of thousands of concepts from DMOZ; the same authors also test an equivalent method using Wikipedia instead [41]. Similarly, in [114] unsupervised classification is performed by extracting categories from generalized world knowledge.

5.3 General working scheme for text categorization

Here the general idea exposed above is made concrete in a high-level procedure to build the domain-independent model and to use it to classify documents. In later sections, specific methods based on this general scheme, and evaluations thereof, are given for flat and hierarchical text categorization.

5.3.1 Semantic knowledge base

In both the training and classification phases, in order to know the semantic relationships existing between the words read in documents, some sort of semantic knowledge base is needed. In the following, such a knowledge base will be considered formally as a function Θ : W × W → [0, 1]^|R|, which maps an asymmetric pair of words (w_a, w_b) ∈ W × W (considered here as generic strings of characters) to a vector Θ(w_a, w_b) of values between 0 and 1, giving for each known relationship class of a set R a score indicating the degree with which that type of relationship holds (or may hold) from w_a to w_b. Intuitively, the elements of Θ(w_a, w_b) indicate with weights the possible ways (if any) in which w_a is related to w_b. As seen in the following, Θ will specifically represent an algorithm which searches for semantic relationships in the concrete knowledge base, and will also be referred to as the semantic search algorithm.

Considering as output a vector of weights, rather than a single relationship class, a subset thereof, or a single relatedness weight, allows for more expressiveness of the possible relationships between terms.

• More than one relationship class can be considered at once, because two words might be linked by more than one type of relationship, possibly because of the multiple meanings they might have.

• Giving a vector instead of a set allows fuzzy relationships to be expressed, which can be useful for different reasons, such as expressing uncertainty about a relationship class (for example if a word has several possible senses) or indicating a "weak" relationship.

The following are a couple of examples of the points above.

• Intuitively, wheel is a meronym of bicycle, as wheels are part of a bicycle. However, wheel is seldom used with the very same meaning of bicycle (i.e. the whole vehicle), so the two may also be regarded as synonyms. Using fuzzy relationships, a small (but non-zero) value could be assigned as a weight for synonymy, in addition to the high value for meronymy.

• The word dog is a hyponym of both mammal and organism (among many others), but the hypernymy/hyponymy relationship with the latter can be considered weak, because of the high generality of the term.

Supposing we have an algorithm which instead returns a single relationship class from a finite set, or a subset thereof, it can be trivially adapted to this scheme by considering a function which returns, for each input pair of words, a vector whose values are 1 for the classes of found relationships and 0 for the other classes. While specific algorithms exist to compute a single score denoting the degree of relatedness between words (see §5.4.2), the use of multiple scores allows different ways in which words can be related to be distinguished: this can be particularly useful, for example, to distinguish similarity (as between dog and cat) from hypernymy and hyponymy (as between dog and animal).

The semantic function Θ is used in the specific algorithm which compares representations of documents and categories, presented in §5.3.3. As a concrete knowledge base, WordNet (§2.6.1) is considered here, although it cannot be used directly as a function of the required type: this issue will be discussed thoroughly in Section 5.4.
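As a minimal illustration of the adaptation just described, the following Python sketch (with illustrative names, not taken from the thesis) wraps a crisp relationship finder, i.e. one returning a set of relationship class names, into a function of the required vector form; a fuzzy search algorithm would instead fill the vector with weights in [0, 1].

# Sketch: adapt a crisp relationship finder to the Theta(w_a, w_b) vector form.
# REL_CLASSES is an illustrative subset of the relationship classes in R.
REL_CLASSES = ["equality", "synonymy", "hypernymy", "hyponymy", "meronymy"]

def as_theta(crisp_finder):
    """crisp_finder(w_a, w_b) returns a set of relationship class names."""
    def theta(w_a, w_b):
        found = crisp_finder(w_a, w_b)                      # e.g. {"hypernymy"}
        return [1.0 if r in found else 0.0 for r in REL_CLASSES]
    return theta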

5.3.2 Model training

As input, the training of the domain-independent model requires a set D_T of documents labeled with categories of a set C_T. In this general schema, it is not considered explicitly how documents are assigned to categories and whether these are organized in a hierarchy. Instead, all this information is abstracted with a general function Δ_T : D_T × C_T → K, which maps a generic couple of a document d and a category c to a label indicating how they are related.

As a first step, a representation of the relevant words for each document d ∈ D_T must be obtained. Following the typical bag-of-words approach, distinct words are extracted from text and weighted proportionally to their recurrence in single documents and in the whole collection, according to any of the techniques described in §2.4.3. However, within this method, differently from typical text categorization approaches, no global set of features is selected: each bag of words is considered as an isolated entity rather than a vector with weights for a defined set of words. Formally, the representation of each document d is constituted by a function ω_d : W → R_0^+ mapping each word contained in the document to its corresponding weight and any other possible word to 0. This formalization allows all the parts of the method to be effectively decoupled from the specific dictionary of words involved.

In addition to documents, a representation ω_c of the same form is also extracted from each category c ∈ C_T. The most obvious approach, already seen for nearest centroid classification, is to compute a mean of the representations of the documents related to the category itself. In hierarchical classification, as discussed later in its own section, these means should also include documents of some sub-categories.

At this point, to create a training set for the model, labeled vectors must be extracted from couples of documents and categories. The first step is to pick a set of such couples from those available in the training corpus. According to some specific method, which depends on the type of categorization to perform, a set E ⊆ D_T × C_T of example couples is selected. Generally, this set should be representative of the different possible document-category relationships defined in K, with enough instances for each of them.

For each example couple (d, c) ∈ E, a vector p_{d,c} ∈ R^n must be obtained and then labeled with the known relationship Δ_T(d, c) ∈ K between the document d and the category c. The extraction of a vector from each couple is a task performed by a so-called semantic matching algorithm (SMA), which will be described shortly.

In the end, the training set of labeled vectors is passed as input to a standard machine learning algorithm, which infers the needed knowledge model, classifying subsequent vectors obtained by the SMA into the classes in K. In order to quantify the likelihood of relatedness between documents and categories, the learning algorithm must output a probabilistic classifier which, rather than indicating a single class from the set K in response to an input vector, returns a distribution of probabilities over the different classes.
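To make the flow of this training phase concrete, the following Python sketch (hypothetical names; the actual implementation is not reported in the thesis) builds the labeled training set from example couples, assuming a semantic matching algorithm sma corresponding to the function Ψ described in §5.3.3 and bag-of-words representations stored as word-to-weight dictionaries.

# Sketch of the construction of the SRSM training set from example couples.
# `example_couples` holds triples (doc_id, cat_id, label) with label in K;
# `doc_bags` and `cat_profiles` map ids to {word: weight} representations;
# `sma` is the semantic matching algorithm Psi of Section 5.3.3.

def build_srsm_training_set(example_couples, doc_bags, cat_profiles, sma):
    X, y = [], []
    for doc_id, cat_id, label in example_couples:
        p = sma(doc_bags[doc_id], cat_profiles[cat_id])   # vector p_{d,c}
        X.append(p)        # features: semantic relationship scores (plus their sum)
        y.append(label)    # known relationship, e.g. "related" / "unrelated"
    return X, y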


Figure 5.2 – Schema of the process to train the semantic relationship scoring model

This classification model, denoted with Ω : R^n → [0, 1]^|K| and referred to as the semantic relationship scoring model (SRSM), is subsequently used in the document classification phase to compare incoming documents with target categories, as explained later. The whole process described to obtain the SRSM is summarized in Figure 5.2.

5.3.3 Semantic matching algorithm

The purpose of the semantic matching algorithm, denoted with Ψ : (W → R_0^+) × (W → R_0^+) → R^n, is to extract a vector p_{d,c} ∈ R^n from the analysis of the semantic relationships between the words characterizing a document d and a category c, which are taken from their respective structured representations ω_d and ω_c.

The working of the SMA is influenced by two positive integer parameters n_WpD and n_WpC, which indicate the number of words to be considered for the representations of each document and of each category, respectively. The limit set by these parameters is imposed principally to reduce the number of queries to the Θ function for semantic relationships. Obviously, for documents and categories represented by fewer than n_WpD or n_WpC words, all of them are considered. The numbers of words considered by the SMA for d and c are denoted in the following with l_d and l_c respectively.

l_d = min(n_WpD, |{w ∈ W : ω_d(w) > 0}|)

l_c = min(n_WpC, |{w ∈ W : ω_c(w) > 0}|)

As only a part of the words (if more than n_WpD or n_WpC) representing the document and the category are considered, it is desirable to pick those having more importance within them. For this, the words of each bag are put in order by descending weights¹ and throughout the SMA only the first l_d for the document and the first l_c for the category are considered, denoted respectively with t^d_1, t^d_2, ..., t^d_{l_d} and t^c_1, t^c_2, ..., t^c_{l_c}.

ω_x(t^x_1) ≥ ω_x(t^x_2) ≥ ... ≥ ω_x(t^x_{l_x}) ≥ ω_x(τ)   ∀τ ∈ W \ {t^x_1, t^x_2, ..., t^x_{l_x}},   x ∈ {d, c}

At this point, every possible couple (t^d_i, t^c_j) of one of the considered words for d and one for c could be compared. However, to further reduce the number of comparisons with a limited loss of information, a subset T_{d,c} of the most relevant couples can be considered. Following some experiments, a criterion which was found to grant a reduction of comparisons with a negligible loss of final classification accuracy is the following.

T_{d,c} = { (t^d_i, t^c_j) : ((i − 1)/l_d)² + ((j − 1)/l_c)² < 1 }

In practice, picturing the possible couples disposed in a grid, this corresponds to discarding those which lie outside of a quarter ellipse centered at the origin with semi-axes l_d and l_c: the retained pairs are about π/4 of the total. In this way, the least relevant terms of each document (within the limit set by n_WpD) are still coupled with those most relevant for each category and vice versa, but pairs where both words are of little importance are discarded. This is shown graphically in Figure 5.3.

For each selected couple (w_d, w_c) ∈ T_{d,c}, the vector Θ(w_d, w_c) weighting the semantic relationships between the two is computed. The vectors for all the considered couples are then weighted by the product of the weights of the two involved words before being summed up. The sum is then normalized by dividing it by the lengths of the vectors constituted by the weights of the considered words of both document and category.

¹ No specific indication is given on how to break ties between words with equal weights: with most commonly used weighting schemes (especially those based on combinations of factors, such as tf.idf and the like), ties very rarely have any influence on the results.


Figure 5.3 – Schema of selection of relevant pairs of words between representations of a document d and a category c: pairs corresponding to white cells are considered in the computations, while those with a gray cell are discarded.

r_{d,c} = ( Σ_{(t^d_i, t^c_j) ∈ T_{d,c}} ω_d(t^d_i) · ω_c(t^c_j) · Θ(t^d_i, t^c_j) ) / ( sqrt(Σ_{i=1}^{l_d} ω_d(t^d_i)²) · sqrt(Σ_{j=1}^{l_c} ω_c(t^c_j)²) )

The formula is loosely based on cosine similarity (§2.2.1). Notably, if the number of considered words were not limited by n_WpD and n_WpC and the search for semantic relationships simply returned Θ(w_a, w_b) = (1) if w_a = w_b and (0) otherwise, the formula would return a vector with a single value, which would correspond to the cosine similarity of the two bags of words.

This vector r_{d,c} constitutes a summary of which semantic relationships occur between the relevant words of d and c and to what extent they are present, also taking into account how relevant each word is. The final output of the SMA is constituted by this vector along with the sum of its values as an additional feature, which can be helpful for certain learning algorithms. The total count of features is then |R| + 1.

p_{d,c} = Ψ(ω_d, ω_c) = r_{d,c} ⊕ sum(r_{d,c})
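The following Python sketch condenses the SMA as described above under a few assumptions: bags of words are dictionaries mapping words to weights, theta(w1, w2) returns a list of |R| relationship weights, and the normalization uses the Euclidean norms of the considered weight vectors, matching the cosine-similarity reading of the formula; function and parameter names are illustrative.

import math

def sma(omega_d, omega_c, theta, n_wpd=30, n_wpc=60, n_rel=14):
    """Sketch of the semantic matching algorithm Psi (illustrative names)."""
    # Top-weighted words of document and category (l_d and l_c of them).
    top_d = sorted(omega_d, key=omega_d.get, reverse=True)[:n_wpd]
    top_c = sorted(omega_c, key=omega_c.get, reverse=True)[:n_wpc]
    l_d, l_c = len(top_d), len(top_c)

    r = [0.0] * n_rel
    for i, wd in enumerate(top_d):
        for j, wc in enumerate(top_c):
            # Elliptic selection T_{d,c}: skip pairs where both words rank low.
            if (i / l_d) ** 2 + (j / l_c) ** 2 >= 1:
                continue
            pair_weight = omega_d[wd] * omega_c[wc]
            for k, score in enumerate(theta(wd, wc)):
                r[k] += pair_weight * score

    # Cosine-like normalization by the norms of the considered weight vectors.
    norm = (math.sqrt(sum(omega_d[w] ** 2 for w in top_d))
            * math.sqrt(sum(omega_c[w] ** 2 for w in top_c)))
    if norm > 0:
        r = [v / norm for v in r]
    return r + [sum(r)]        # p_{d,c}: relationship scores plus their sum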

This is the vector returned by the SMA in response to two representations ω_d and ω_c: in the training phase described above it is then labeled with the actual relationship between d and c to be put in the training set for the semantic model, while in the classification phase, as discussed below, it is passed to the existing model to predict such a relationship.

5.3.4 Classification

Once a semantic relationship scoring model Ω : R^{|R|+1} → [0, 1]^|K| is trained according to the procedure described above, it can be combined with the semantic matching algorithm Ψ : (W → R_0^+) × (W → R_0^+) → R^{|R|+1} to obtain a function which, given the representations of any document d and any category c, returns a distribution of probability among the possible relationships in K between them. In practice, the most likely relationship for a (d, c) couple is the one having the highest probability in the distribution obtained by Ω(Ψ(ω_d, ω_c)).

The combination of SMA and SRSM can be used within a document classifier, which labels input documents with categories of a target set C, which can be either the same used to train the model or a different one. In the proposed approach, classification is based in general on comparing the representation of the input document d with those of all or some of the target categories, according to the specific method used. This requires representations of the target categories, which should be obtained from a set of documents consistently labeled with them: this set is denoted with D_P and referred to as the profiling set. Another possibility, not tested here, would be to provide manually built representations, made up of some picked keywords. Once the representations are given, classification of new documents can be performed. The specific procedure to classify a document varies between the flat and the hierarchical case and will be described in detail in the respective sections. In both cases, it is possible to change the taxonomy of categories subsequently, by providing new profiles for categories to be added or updated.
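As a minimal sketch of this classification step (Python, illustrative names; srsm_predict is assumed to return a dictionary mapping each relationship class in K to its probability, as a probabilistic classifier would), a document is compared with every target category profile and, in the simple flat case discussed later, labeled with the categories whose most likely relationship is related.

def most_likely_relationship(omega_d, omega_c, sma, srsm_predict):
    """Omega(Psi(omega_d, omega_c)): distribution over K; prediction is its argmax."""
    probs = srsm_predict(sma(omega_d, omega_c))
    best = max(probs, key=probs.get)
    return best, probs[best]

def classify_document(omega_d, cat_profiles, sma, srsm_predict):
    # Compare the document against each target category profile (flat case).
    labels = []
    for cat, omega_c in cat_profiles.items():
        relationship, _ = most_likely_relationship(omega_d, omega_c, sma, srsm_predict)
        if relationship == "related":
            labels.append(cat)
    return labels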

5.3.5 Computational complexity

Throughout both the model training and the classification phase, computation generally consists of comparisons between representations of documents and categories, which are carried out by searching semantic relationships between the respective relevant words: the final complexity of the method therefore largely depends on the complexity of the adopted semantic search algorithm.

In both phases, profiles of categories must be extracted: this process changes slightly between flat and hierarchical classification, but in both cases it can be assumed to be linear in the number of documents to be processed and in their average number of words. In practice, however, this part takes negligible time with respect to the rest of the process.

In the training phase, the number of comparisons depends on the number |E| of example couples, which has |D_T| · |C_T| as an upper bound, but is in practice limited by other parameters which will be presented in the methods for flat and hierarchical categorization. For each couple, semantic relationships are searched for a number of word pairs within n_WpD · n_WpC. Overall, the construction of the training set for the SRSM requires O(|E| · n_WpD · n_WpC) queries to the semantic search algorithm. Regarding the training itself of the model, the complexity depends on the specific learning algorithm used, which is usually linear or superlinear with respect to the number of training instances, which in this case is |E|.

The complexity of the classification phase changes between flat and hierarchical classification, but in both cases it requires, for each document to be classified, its comparison with the profiles of either all possible categories of a set C or a subset thereof: similarly to the above, this requires O(|C| · n_WpD · n_WpC) searches for semantic relationships.

5.3.6 General experiment setup

In both the flat and the hierarchical case, the accuracy of the method has been evaluated experimentally on some benchmark datasets. Here the details about experiments which are common to the two cases are described; more details are given in the related parts of the respective sections.

In both cases, each dataset used as a benchmark is split beforehand into a training set and a test set. Generally, in experiments for other text categorization methods, the training set is used as input to a machine learning algorithm which extracts one or more classification models, which are then applied to the test set to compare their response with the known actual categories of each test document.

Table 5.1 – Usage of training and test splits of each dataset

Phase                          Training split                                         Test split
(1) SRSM training              training set (D_T): extraction of the training set     not used
                               from example couples
(2) Documents classification   profiling set (D_P): creation of category profiles     test set (D_E): documents to be classified

Within the method presented here, two decoupled tasks are considered: the training of a semantic model and its use to classify documents. Given the domain-independence assumption, it is implied that the text collections used to perform the two tasks can be different: documents of a dataset can be classified by leveraging the SRSM extracted from another dataset. However, as discussed above, the category profiles used when classifying target documents must reflect the exact topics of those documents, so the same dataset as the test documents is used to ensure this.

So, in practice, when training an SRSM throughout the experiments, the training split of a selected dataset is used as the set D_T of documents from which the training set of couple vectors is built. To test the method, given a built model, the training split of a dataset is used as the profiling set D_P from which category representations are built, then the documents of the test split of the same dataset are used to test the method, comparing the categories predicted by the classifier with the actual ones. This organization is summarized in Table 5.1.

To extract bags of words for documents, here in the form of functions W → R_0^+ rather than of vectors, a pre-processing step is required. In the following, this is considered to be composed of the following operations for each document (the first 4 steps are the same as in §4.4.2):

• words are extracted by splitting on all non-letter characters (spaces, periods, dashes, ...),

• each word is case-folded (all letters are turned lowercase) for uniformity,

• words shorter than 3 letters are removed,

• words appearing in a predefined list of 671 stopwords are removed²,

• lemmatisation (§2.3.1) is applied to reduce each inflected word to its base form.

The lemmatisation algorithm used is based on WordNet, so that words effectively appearing within it are obtained where possible: this aids in using WordNet itself to search for semantic relationships between them. The use of a stemming algorithm, usually employed in other works, would make the search for relationships in WordNet more difficult, as many stems would not be found in WordNet, not being complete words by themselves.

As semantic relationships between the relevant terms of documents must be searched, it would be potentially useful to either use words tagged with information about their meaning or to directly extract concepts instead (§2.3.4). While this could be obtained with techniques for part-of-speech tagging and word sense disambiguation (described in Section 1.4), we choose instead to take words as they appear, in order to keep the document processing phase as simple as possible, so that the computation of category profiles from still unprocessed documents can also run fast.

Terms are weighted by cntf.idf, the product of cosine-normalized term frequency and inverse document frequency (§2.4.3). As stated above, no feature selection is performed, as documents and categories are considered to be represented as lists of all the words they contain with associated weights rather than as vectors, so it is not necessary to establish a global set of features. Other parameters of the method will be discussed in detail in the sections related specifically to the variants for flat and hierarchical categorization.
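A compact Python sketch of this pre-processing and weighting pipeline follows; it uses NLTK as a stand-in for the actual tools (assuming the relevant NLTK corpora are installed), replaces the 671-word stop list with NLTK's English list purely for illustration, and applies noun lemmatisation only, so it is an approximation rather than the exact configuration used in the experiments.

# Sketch of the pre-processing steps and cntf.idf weighting described above.
import math, re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))      # stand-in for the 671-word list
LEMMATIZER = WordNetLemmatizer()            # WordNet-based lemmatisation

def preprocess(text):
    words = re.split(r"[^a-zA-Z]+", text.lower())    # split on non-letters, case-fold
    words = [w for w in words if len(w) >= 3 and w not in STOP]
    return [LEMMATIZER.lemmatize(w) for w in words]

def bags_of_words(token_lists):
    """Returns one {word: weight} dict per document, with cosine-normalized
    term frequency multiplied by inverse document frequency."""
    n = len(token_lists)
    df = Counter(w for tokens in token_lists for w in set(tokens))
    idf = {w: math.log(n / df[w]) for w in df}
    bags = []
    for tokens in token_lists:
        tf = Counter(tokens)
        norm = math.sqrt(sum(f * f for f in tf.values())) or 1.0
        bags.append({w: (f / norm) * idf[w] for w, f in tf.items()})
    return bags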

5.4 Search of semantic relationships

In the previous section, a semantic knowledge base has been presented as a generic function mapping a couple of arbitrary words to a vector expressing the potential semantic relationships holding between them. As a concrete source for the needed semantic knowledge, WordNet is chosen, given its wide adoption in the related literature. However, its usage in the method presented above is not straightforward.

² This is the same list cited in §4.4.2; see also the footnote there.

5.4.1 Use of WordNet

As discussed in §2.6.1, WordNet is a lexical database for the English language containing a set of lemmas divided among four part-of-speech classes and grouped into synsets of synonymous words, with various types of semantic relationships defined between synsets and between words. This database can be used as the knowledge base for the semantic matching algorithm, but some adjustments are necessary.

Firstly, the required knowledge base must define relationships between words, intended as generic strings of characters; WordNet instead defines relationships between words intended as specific instances of lemmas inside synsets, and between synsets themselves. A simple solution is to "transpose" these relationships to lemmas, which are then considered as the words found in documents. Specifically, setting R to the set of 28 possible relationship classes defined in WordNet, two words (lemmas) w_a and w_b can be considered to hold a relationship of type r ∈ R if and only if:

• two (WordNet) words, each an instance of one of the two lemmas, are related by r, or

• two synsets, each containing one of the two lemmas, are related by r.

If one or both of the words are not found in the WordNet database, no relationship can be found between them, so the Θ function returns a null vector.

With these rules, the relationships defined by WordNet between two words can be found. However, this does not include the synonymy relationship, which is not represented explicitly in WordNet but implicitly exists between two words in the same synset and can be transposed like the others to the corresponding lemmas. Besides this, two words extracted from different sets (like one from a document and one from a category profile, as happens in the method) may also be identical. For this reason, the two symmetric relationship classes equality and synonymy are introduced in the set R alongside the relationships explicitly defined in WordNet, so that:

• two words are equal if they are the same string (even if WordNet does not contain a lemma matching the string), and

• two words (lemmas) are synonyms if they are not equal and at least one synset contains them both as words.
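The rules above can be illustrated with the following Python sketch based on NLTK's WordNet interface (an assumption: the thesis does not state which library was used); only a handful of the 28 pointer types are checked, and the relationship names simply record which pointer type connects a synset of the first word to a synset of the second.

# Sketch of the direct ("transposed") relationship search between two words.
from nltk.corpus import wordnet as wn

def direct_relationships(w_a, w_b):
    rels = set()
    if w_a == w_b:
        return {"equality"}                      # same string, even if not in WordNet
    syns_a, syns_b = wn.synsets(w_a), wn.synsets(w_b)
    if set(syns_a) & set(syns_b):
        rels.add("synonymy")                     # the two lemmas share a synset
    targets = set(syns_b)
    for s in syns_a:
        # A pointer between any synset of w_a and any synset of w_b counts
        # as a relationship between the two lemmas (only a few types shown).
        if set(s.hypernyms()) & targets:
            rels.add("hypernymy")
        if set(s.hyponyms()) & targets:
            rels.add("hyponymy")
        if set(s.part_meronyms() + s.member_meronyms()) & targets:
            rels.add("meronymy")
        if set(s.part_holonyms() + s.member_holonyms()) & targets:
            rels.add("holonymy")
        if {ant.synset() for lem in s.lemmas() for ant in lem.antonyms()} & targets:
            rels.add("antonymy")
    return rels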

With these additions, all potential classes of relationships between words found in documents are covered. However, there is still a limit in applying the above rules for relationships between synsets and between words: the missing identification of indirect relationships.

Consider for example the nouns bicycle and vehicle: intuitively, the first should be identified as a hyponym (i.e. "a specific type") of the second and, conversely, the second as a hypernym of the first. However, WordNet does not explicitly state a relationship between these two words. Instead, bicycle has wheeled vehicle as its hypernym, which is in turn a hyponym of vehicle. An even more representative example is the similar relationship between dog and animal, which is obvious to a person, but WordNet organizes living organisms similarly to the standard biological classification, so the synsets for dog and animal are separated by the following chain of hypernyms, in order of decreasing specificity: canine, carnivore, placental, mammal, vertebrate, chordate.

In order to save space, the WordNet database only stores primitive relationships, while indirect relationships like the examples above must be inferred from the primitive ones. One possibility would be to precompute an expanded version of WordNet with all the indirect relationships made explicit. Instead, with the goal of maintaining limited memory usage, methods to search indirect relationships "on the fly" from the primitive WordNet graph are employed here. It should be noted that, given the high number of comparisons between words which are made while training and using the semantic model, it is critical for any algorithm used to search relationships between words to run very fast.

In the rest of the section, possible solutions to realize such an algorithm are discussed, along with how to consider chains of relationships involving different types of pointers. For example, if a word w_a in WordNet is a meronym of a hyponym of another word w_b, what is the actual general relationship between w_a and w_b, if any?

5.4.2 Existing approaches

Some literature exists in which the problem of inferring non-trivial semantic relationships between words, starting from a "primitive" knowledge base like WordNet, is studied.

Many works (especially earlier ones) only considered the hypernymy/hyponymy relationships between synsets, these being the most common and important of the WordNet semantic relationships. In general, most works only consider a part of the pointer types defined in WordNet, which have very different frequencies.

Existing works regarding the evaluation of relatedness between words using a semantic knowledge base are mostly focused on computing a single score for any couple of words, rather than a combination of scores for all possible semantic relationships. Specifically, proposed methods generally address one of two different concepts [12].

• Similarity of two words refers to the degree of overlap in their meaning: this may be given by synonymy (full overlap) or by hypernymy/hyponymy (partial overlap). For example, tiger and cat are quite similar (they are both felines, which is a rather specific concept), while money and bank are dissimilar (one is a medium of exchange, the other is an institution).

• Relatedness of two words refers to how much they are related in general by association of ideas; this is a far more generic concept, including virtually any type of semantic relationship. For example, money and bank are not similar (see above), but any person is likely to consider them strongly related, as they share the same domain of discourse.

Given a primitive graph of direct relationships between words, a basic intuition is that similar words have a short distance between them in the graph: in [101] this type of approach is tested on the MeSH (Medical Subject Headings) taxonomy. Among those based on WordNet, the similarity measure proposed in [103] is the maximum information content of the synsets subsuming (i.e. being hypernyms of) both analyzed words: the information content of a concept x is lower as the probability of encountering any concept subsumed by x is higher. In [48] the hypernymy, hyponymy and antonymy relationships of WordNet are considered: the relatedness between two words depends both on the number of links between them and on how many times the type of link ("direction") changes along the path; relatedness is anyway considered only for some specific patterns of these direction changes. The method proposed in [96] is able to consistently compare arbitrarily grained units of text (from single senses to whole documents) by reducing them to probability distributions of senses and measuring the similarity between these.

Recalling the distinction given in §2.6, other than using WordNet or similar ad hoc, structured knowledge bases, an alternative approach is to extract information from a large corpus (collection) of text, useful to infer the relatedness of words from their frequency of co-occurrence or other statistical properties. For example, in Explicit Semantic Analysis [41], the meaning of either single words or texts is mapped to a high-dimensional space of concepts extracted from Wikipedia: relatedness can then simply be computed as the (cosine) similarity between the corresponding vectors. In [2] a comparison of different methods of both approaches is given. Hybrid solutions also exist: in [51] the distance between two terms depends on their most specific common hypernym in WordNet and is computed from the information content of the three, statistically estimated from a corpus of text.

The evaluation of these methods can be difficult, as actual similarity and (especially) relatedness between words can be a rather subjective judgment. Some small collections of pairs of words with associated similarity or relatedness scores exist, which have been compiled by averaging judgments given by multiple persons: an example is the WordSimilarity-353 Test Collection³ [36]. A method can be evaluated by comparing its results for these pairs with the reference ones: accuracy is usually measured by the Spearman correlation between actual and predicted scores, which indicates accordance between the orders given to pairs by the scores rather than between the scores themselves. Another possible approach is to evaluate a measure by observing the performance on a task where it is used: by keeping the other parameters of this task constant, different measures can be compared.

³ http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html

5.4.3 Hypernyms-based algorithm

Here a relatively simple algorithm to search relationships between words is proposed, which partly overcomes the limitation of the simple search of direct relationships, namely its failure to retrieve indirect ones. This algorithm is essentially based on searching for relationships between the given words and also between

their respective hypernym synsets, up to a certain maximum number of steps, which is a parameter of the algorithm denoted with M_H. The set R of possible relationships recognized by this algorithm, as discussed in §5.4.1, is composed of all 28 pointer types defined in WordNet with the addition of equality and synonymy. This basic algorithm does not weight the returned relationships, so the resulting vector only contains zeros and ones according to the found relationships (either none, one or more than one).

Firstly, if the two compared words w_a and w_b are equal, the algorithm terminates returning only the equality relationship. Otherwise, the sets S_a and S_b of all synsets containing w_a and w_b are retrieved: if one or more synsets appear in both sets (S_a ∩ S_b ≠ ∅), it means that those synsets contain both w_a and w_b and the algorithm stops, labeling them as synonyms.

At this point, if the algorithm did not terminate, from each S_x (with x ∈ {a, b}) the hypernym set of synsets H_x ⊇ S_x is extracted, made up of all the synsets which can be reached within M_H steps starting from the synsets of S_x by following hypernymy pointers (including the synsets of S_x themselves). At first, these sets are used to check for an indirect hypernymy/hyponymy relationship between the two words: specifically, if at least one of the synsets of S_a is contained in H_b, i.e. S_a ∩ H_b ≠ ∅, then w_a is labeled as a hypernym of w_b; conversely, if S_b ∩ H_a ≠ ∅ then w_a is labeled as a hyponym of w_b. Regardless of the found relationships, the algorithm then proceeds with the next step.

As hypernymy conceptually represents an is-a relationship, synsets are considered to inherit the properties of their hypernyms, including relationships with other synsets. For this reason, direct relationships are searched between H_a and H_b rather than between S_a and S_b as would happen in a simpler method. This allows some relationships involving general synsets to be extended down to more specific synsets. For example, the part meronymy pointer from wheeled vehicle to wheel is inherited by bicycle, since it is a direct hyponym of wheeled vehicle: the algorithm thus returns the intuitively held meronymy relationship between bicycle and wheel, which would not be found by just looking for direct relationships. After this search, the algorithm terminates by returning a vector of ones for the found relationships and of zeros for all the others.

This is a relatively simple algorithm to find some indirect relationships, based on making the two searched terms "inherit" pointers from their hypernyms. As most synsets have a single hypernym each, searching them for each input word has roughly linear complexity with respect to the number M_H of considered levels, while the search of relationships among all hypernyms takes about quadratic time. The algorithm presented next starts from the same principle of examining hypernyms of the input words, using an improved method and extending it to some other relationship types.
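A condensed Python sketch of this algorithm follows, again assuming NLTK's WordNet interface and checking only a few pointer types (meronymy and holonymy) between the hypernym sets, rather than all 28; names are illustrative.

from nltk.corpus import wordnet as wn

def hypernym_set(synsets, max_steps):
    """Synsets reachable within max_steps hypernym pointers, including the start set."""
    frontier, reached = set(synsets), set(synsets)
    for _ in range(max_steps):
        frontier = {h for s in frontier for h in s.hypernyms()} - reached
        reached |= frontier
    return reached

def hypernyms_based_search(w_a, w_b, m_h=2):
    if w_a == w_b:
        return {"equality"}
    s_a, s_b = set(wn.synsets(w_a)), set(wn.synsets(w_b))
    if not s_a or not s_b:
        return set()
    if s_a & s_b:
        return {"synonymy"}
    rels = set()
    h_a, h_b = hypernym_set(s_a, m_h), hypernym_set(s_b, m_h)
    if s_a & h_b:
        rels.add("hypernymy")       # w_a appears among the hypernyms of w_b
    if s_b & h_a:
        rels.add("hyponymy")        # w_b appears among the hypernyms of w_a
    # Direct pointers are searched between the hypernym sets, so that specific
    # synsets inherit relationships from their ancestors (only two types shown).
    for x in h_a:
        if set(x.part_meronyms() + x.member_meronyms() + x.substance_meronyms()) & h_b:
            rels.add("meronymy")
        if set(x.part_holonyms() + x.member_holonyms() + x.substance_holonyms()) & h_b:
            rels.add("holonymy")
    return rels

# Example: bicycle inherits the "part meronym" pointer of wheeled vehicle,
# so the result is expected to include 'meronymy'.
# print(hypernyms_based_search("bicycle", "wheel"))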

5.4.4 Extended search algorithm

Here an improved algorithm for the search of indirect relationships in WordNet is presented, in which a new set R of relationship types is introduced and fuzzy relationships are considered. Specifically, the 14 types of relationships listed below are considered, so that the output of the algorithm is a vector of 14 values denoting the weight with which each of them holds.

• equality
• synonymy
• hypernymy
• hyponymy
• coordination
• holonymy
• meronymy
• co-occurrence
• similarity
• derivation
• antonymy
• domain
• indomain
• codomain

This method is based on what are hereby referred to as explorations of the WordNet synsets graph, whose set of nodes will be denoted here with S. Each exploration α starts from a set I_α ⊆ S of synsets, each s having an assigned positive score r_α(s) of 1 or less: these are the first synsets to be explored. When exploring a synset s, for each synset s′ among those related to it by certain pointer types and not previously explored, a score r_α(s′) = r_α(s) − ε_R is assigned, where ε_R ≥ 0 is a decay parameter: synsets whose score is greater than 0 will be explored in turn, the others are discarded. In the end, an exploration yields a set E_α ⊆ S of explored nodes, each with a score given by r_α.

Given two input words w_a and w_b, the steps performed by the algorithm are the following. The first two steps are the same as in the hypernyms-based algorithm and make the algorithm terminate early if the given conditions are met. In the final output vector, the weight of each relationship type is assumed to be 0 if not assigned.
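Before the step-by-step description, a minimal Python sketch of the exploration primitive may help fix ideas; it assumes NLTK's WordNet interface, uses a single decay value per exploration (the algorithm actually allows a distinct decay per pointer type), and all names are illustrative.

from nltk.corpus import wordnet as wn

def explore(initial_scores, pointer_fn, decay):
    """initial_scores: dict {synset: score in (0, 1]}; pointer_fn: synset -> related synsets.
    Returns the explored synsets E_alpha with their scores r_alpha."""
    scores = dict(initial_scores)
    frontier = list(initial_scores)
    while frontier:
        next_frontier = []
        for s in frontier:
            new_score = scores[s] - decay
            if new_score <= 0:
                continue                      # discarded: not explored further
            for s2 in pointer_fn(s):
                if s2 not in scores:          # never re-explore a synset
                    scores[s2] = new_score
                    next_frontier.append(s2)
        frontier = next_frontier
    return scores

# Example: the hypernym exploration eta_a for one input word, decay eps_hyp = 0.2.
# eta_a = explore({s: 1.0 for s in wn.synsets("dog")},
#                 lambda s: s.hypernyms() + s.instance_hypernyms(), decay=0.2)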

• If the two words are identical, terminate the algorithm by returning equality with weight 1.

• Extract the sets S_a and S_b of synsets for w_a and w_b. If the two share at least one synset, return synonymy with weight 1. If at least one of the two sets is empty, return a null vector.

• Two explorations η_a and η_b starting from S_a and S_b with score 1 for all synsets are performed, each considering hypernymy and instance hypernymy pointers, with distinct decay parameters ε_hyp and ε_inst. If at least one synset of S_b has been explored in η_a, set the score for hypernymy to the maximum score among such synsets. Conversely, if at least one synset of S_a was found in η_b, do the same for hyponymy. Only if both these conditions fail is the intersection between the sets of nodes explored in η_a and η_b checked: if it contains at least one synset, consider the one s for which r_{η_a}(s) · r_{η_b}(s) is maximum and use this value as the score for the coordination relation, indicating that the two input words have a common hypernym.

• Similar steps are now taken with holonymy in place of hypernymy. Two explorations θ_a and θ_b considering all three types of holonymy pointers are started from S_a and S_b with all synsets having score 1; the decay parameters for part, member and substance holonymy are respectively ε_part, ε_memb and ε_subst. If at least one synset explored in θ_a was also explored in η_b, the holonymy relationship is assigned a weight equal to the maximum score of these synsets within θ_a. Symmetrically, meronymy is weighted as the highest score among the synsets explored in θ_b which were also explored in η_a, if any. If neither of these two relationships holds, find the synset explored within both θ_a and θ_b for which the product of the scores given by the two is highest and, if any, use this product as the weight of the co-occurrence relationship, expressing in practice being part of a same entity.

• Explorations σ_a and σ_b are started from all synsets explored respectively within η_a and η_b (i.e. the hypernyms of the input words), from which the scores for the initial synsets are taken (∀s ∈ I_{σ_x} : r_{σ_x}(s) = r_{η_x}(s)): in these two explorations "verb group" and "similarity" pointers are followed, with respective decay parameters ε_verbg and ε_sim. These two relationships are considered together, as they both indicatively represent similarity between concepts, which are verbs in the first case and adjectives in the second. If one or more synsets are found in both explorations, excluding their initial synsets (for which the relevant relationships have already been found), the weight of the similarity relationship is set to the product of the σ_a and σ_b scores of the synset for which this product is maximum.

• All couples of synsets explored in η_a and η_b are checked for lexical relationships of type "derivationally related form", "derived from adjective" and "participle" between the words of the synsets of one and the words of the synsets of the other. The weight of the derivation relationship is set to the maximum product of the weights of synsets across the two sets whose words are related by these relationships: this type of relationship generally links words with a common morphology.

• Similarly to the above, the hypernym synsets are checked for antonymy relationships between the words of one set and the words of the other. The weight of antonymy is the maximum product of the weights of synsets across E_{η_a} and E_{η_b} containing antonym words.

• Lastly, "member of domain" relationships are considered. Explorations δ_a and δ_b starting from the synsets explored in η_a and η_b are run, following all three types of such pointers (topic, region and usage) with a unique decay factor ε_dom. Similarly to hyponymy/meronymy, if one or more synsets are shared between E_{δ_a} and E_{η_b}, the domain relationship is weighted according to the maximum score within δ_a of these synsets, while the indomain relationship is weighted with the maximum score in δ_b of the synsets also explored in η_a, if any. If neither of these two relationships holds, the codomain relationship weight is set to the maximum product of the scores in δ_a and δ_b of the non-initial synsets explored in both, if any.

The algorithm is parameterized by the decay factors referenced along the described steps and summarized in Table 5.2 along with their default values. These parameters have been tuned through some preliminary tests.

Table 5.2 – Decay parameters defined in the algorithm with their values

Exploration              Pointer type            Decay parameter
hypernyms (η)            hypernym                ε_hyp = 0.2
                         instance hypernym       ε_inst = 0
holonyms (θ)             part holonym            ε_part = 0.2
                         member holonym          ε_memb = 0.2
                         substance holonym       ε_subst = 0.4
similar synsets (σ)      verb group              ε_verbg = 0.2
                         similar to              ε_sim = 0.2
domain of synset (δ)     domain (all 3 types)    ε_dom = 0.4

In these tests, the search algorithm was applied to lists of word pairs with pre-assigned relatedness or similarity scores, cited in §5.4.2. For each pair, we compared the reference scores given by these lists with the maximum value of the output vector of the search algorithm across a subset of relationships, having picked those deemed to best represent the intended concepts of "relatedness" and "similarity".

In general, the proposed algorithm has been designed to find various types of indirect relationships, even involving pointers of multiple types. The set of pointer types defined in WordNet and used in the previous algorithm is reduced to a smaller set of relevant relationship classes, grouping together similar types and defining new ones which can only be obtained by composition of primitive relationships.

5.5 Flat categorization

In this section, the general scheme of Section 5.3 is made concrete in a method for multi-label classification of documents within a flat taxonomy of categories. This is a very simple case, which requires few additions to the scheme.

5.5.1 Example couples and category profiles

In flat categorization, the taxonomy in which documents are organized is given by a simple set of categories C. In this context, the possible relationships

between any document d and any category c are reduced to two cases: d is either labeled with c or not, according to some labeling L : D × C → {0, 1}. In these two cases, the document and the category are said to be related or unrelated, respectively.

K = {related, unrelated}

Δ(d, c) = related if L(d, c) = 1,  unrelated if L(d, c) = 0

The set E of example couples extracted from the model training documents should ideally be composed equally of related and unrelated couples, to yield a balanced training set for the semantic model. However, it is typical for each document to be labeled with few categories out of many possible ones, so unrelated pairs largely outnumber related ones. As a reference scheme for the selection of training couples, it is assumed that for each training document d ∈ D_T a fixed number n_CpD ≤ |C_T| of categories is selected randomly, so that the number of related couples is the maximum possible. Specifically, if a document is labeled with more than n_CpD categories, then it is coupled with n_CpD of them picked randomly; otherwise it is coupled with all the related categories and with randomly picked unrelated ones until n_CpD is reached.

The representation of each category in the flat case is built simply by averaging those of the single documents labeled with it, whose set is denoted with D_c = {d ∈ D_T : L(d, c) = 1}, as in nearest centroid classification. The averaging of standard vectors can be trivially adapted to the weighting functions used here instead as bag-of-words representations:

ω_c(w) = (1/|D_c|) Σ_{d ∈ D_c} ω_d(w)
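In code, this profile construction is a straightforward average of dictionaries; the following sketch (illustrative names) mirrors the formula above.

from collections import defaultdict

def category_profile(doc_bags):
    """doc_bags: bag-of-words dicts {word: weight} of the documents D_c of a category.
    Returns the centroid profile omega_c as a dict (doc_bags assumed non-empty)."""
    totals = defaultdict(float)
    for bag in doc_bags:
        for word, weight in bag.items():
            totals[word] += weight
    return {word: total / len(doc_bags) for word, total in totals.items()}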

5.5.2 Classification

As explained in the general scheme, the SMA is applied to the example couples to extract vectors of semantic features, which are then labeled as related or unrelated and used as the training set for the semantic model, which is then combined with the same SMA to obtain a function predicting how likely any document is to be related or not to any category.

Table 5.3 – Summary information about benchmark datasets for flat categorization

Dataset                                Split      Docs. count    Docs. per category (min. / avg. / max.)
Reuters (multi-label, 10 categories)   training   9,603          182 / 719.4 / 2,877
                                       test       3,299          56 / 278.8 / 1,087
20NG (single-label, 20 categories)     training   11,314         377 / 565.7 / 600
                                       test       7,532          251 / 376.6 / 398

As there are two possible classes, the probability distribution returned by the model for a generic document-category couple is of the form (p, 1 − p), with p being the likelihood with which the document should be labeled with the category. In this setting, similarly to standard methods for multi-label flat categorization where category-specific classifiers are used, classification of a document d is performed by comparing it with the profiles of all target categories to find which of them are related. Recalling the above, for each category c ∈ C, the semantic model returns a probability p_R(d, c) for the related label. In general, by setting a threshold τ, a document can be predicted to be related to all categories for which this probability reaches the threshold.

L̂(d, c) = 1 ⇔ p_R(d, c) ≥ τ

ˆ(d, c) = 1 p (d, c) τ L ⇔ R ≥ This yields for any document d a set of predicted categories, as entailed by multi-label classification. The method may also be easily adapted to the single-label case by labeling d with the category c with the highest relatedness likelihood.

L̂(d, c) = 1 ⇔ argmax_{γ ∈ C} p_R(d, γ) = c

5.5.3 Experiment setup

The method has been tested on two datasets commonly used as benchmarks in text categorization research, already described in §3.8.1. Both of them are provided already split into a training and a test set, which are used to train the semantic model and to test the classification of documents using it, as indicated in §5.3.6. Basic statistics about the two datasets are given in Table 5.3.

Table 5.4 – Summary of parameters of the flat categorization method and of their default values (used where no different indication is given)

Parameter function                                   Symbol and value(s)
Both training and classification
  Words considered by SMA for each document          n_WpD = 30
  Words considered by SMA for each category          n_WpC = 60
  Algorithm for search of semantic relationships     Θ (varying)
SRSM training
  Documents selected for example couples             n_ED = 4000
  Example couples per selected document              n_CpD = 4

• The ModApté split of Reuters-21578 (9,603 training documents, 3,299 test documents) has been considered for multi-label classification. As in other works, classification under the 10 most recurring topics is considered, namely: earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, corn. This dataset will be indicated in short as Reuters.

• The standard bydate distribution of 20 Newsgroups (11,314 training documents, 7,532 test documents) has been considered for single-label classification: every document belongs to one and only one of the 20 groups. This dataset will be indicated in short as 20NG.

The domains of the two datasets are somewhat different and disjoint: while 20NG can be considered more general, as it contains categories dealing with a variety of topics (sports, motors, religion, ...), Reuters is instead focused on economy and markets, a discourse area which has no significantly representative categories in 20NG.

As seen in the description, the method has a number of parameters influencing its performance. Most tests are run by fixing some parameters to arbitrary default values and testing different values for the remaining ones. Hereafter the parameters are described along with their default values, also summarized in Table 5.4.

Two important parameters of the process are the numbers n_WpD and n_WpC of most relevant words considered by the SMA for documents and categories respectively, when comparing them. These numbers must be high enough to guarantee that all the important words are considered for each document and category, but the number of queries to the semantic knowledge base grows linearly with them. In the same context, how the semantic relationships are computed is also relevant. As discussed above, WordNet is always used here as the primitive semantic knowledge base, but different algorithms can be used to infer higher-level information from it. In the following, three algorithms are considered:

• the direct algorithm only searches and reports direct, primitive relationships, using the trivial techniques reported in §5.4.1;

• the hypernyms algorithm also searches for relationships between the hypernyms of both words up to a number M_H of levels, as discussed in §5.4.3;

• the extended algorithm extends the search through more types of pointers and returns fuzzy relationships, as discussed in §5.4.4.

For the hypernyms algorithm, M_H = 2 is assumed by default. To train the SRSM, the numbers n_ED of documents to be coupled with categories and n_CpD of categories to couple each document with must be set. To have a sufficiently large training set, the default values are set to n_ED = 4000 and n_CpD = 4, yielding a total of 16,000 example couples. As the learning algorithm for the SRSM, multivariate logistic regression is used [61].

As classification accuracy measures, in common with most other works, the F1 measure and the break-even point are reported for Reuters, whose calculation is discussed in §3.8.2. In the single-label case of 20NG, the classifier is required to report a single category, namely the one with the highest similarity score for each document: here, accuracy (the ratio of correctly labeled test documents) is reported instead as the accuracy measure.
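As a sketch of this training step, scikit-learn's logistic regression (an assumption: the thesis only states that multivariate logistic regression is used, not the implementation) can be fitted on the example-couple vectors built earlier; predict_proba then provides the probability distribution over K required of the SRSM.

from sklearn.linear_model import LogisticRegression

def train_srsm(X, y):
    """X: list of p_{d,c} vectors (|R| + 1 features); y: 'related'/'unrelated' labels."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

# Usage sketch: probability of the "related" class for one couple vector p_dc.
# srsm = train_srsm(X, y)
# p_related = srsm.predict_proba([p_dc])[0][list(srsm.classes_).index("related")]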

5.5.4 Experiment results

To begin with, in order to test the performance of the approach as a standard text categorization method, results are given for experiments where the corpus used to train the semantic model is the same on which it is subsequently tested.


The experiments shown first are those based on the variation of the numbers n_WpD and n_WpC of words considered by the SMA for each document and category, respectively: different values are tested with the three different algorithms for searching semantic relationships listed above.

Figure 5.4 – Accuracy measures (Y axes) for flat categorization as the number n_WpD of words considered for each document varies (X axes)

In Figure 5.4, results are shown for different values of n_WpD, where n_WpC is always set to the default value 60. The expected result is that the accuracy measures grow as the number of words considered for each document is raised, as documents are better represented when comparing them to category profiles. The rise in accuracy over this range of values is about 4% for Reuters and 7% for 20NG, showing that the contribution of words of "average" importance to the representation of documents is at least in part helpful. In Reuters, the use of the extended algorithm appears to give an accuracy boost of about 2%, but for the other cases the differences between the three algorithms are not significant.


Figure 5.5 – Accuracy measures (Y axes) for flat categorization as the number n_WpC of words considered for each category varies (X axes)

Figure 5.5 shows the same results as n_WpC varies instead, with n_WpD fixed to 30. These results are essentially similar to those above: the use of more words to represent categories grants an increase in accuracy, although with a slightly smaller difference over the considered range (1.5% for Reuters, 5% for 20NG). The observations above about the semantic search algorithms still hold, with the extended one working better on Reuters.

Regarding the search algorithms, it can be interesting to see what happens when their parameters are varied. Tests have been performed on the hypernyms-based algorithm, which exposes only one parameter: the number M_H of levels of hypernyms to explore for each word. Figure 5.6 shows how accuracy on flat classification with this semantic search algorithm varies as its parameter is changed.


Figure 5.6 – Accuracy measures (Y axes) for flat categorization using the hypernyms-based algorithm to search for semantic relationships, as the maximum number M_H of considered hypernym levels varies (X axes)

Different results emerge for the two datasets: for Reuters a deeper search makes a noticeable difference in the results, while for 20NG the variation in accuracy is minimal. It is supposed that the Reuters dataset contains many words with indirect hypernymy/hyponymy relationships, which require a proper algorithm to be found: this happens with the hypernyms-based algorithm with a sufficient value for M_H and, as seen in the results above, also with the extended algorithm with its default values (given in §5.4.4), which likely grant a thorough exploration of hypernyms.

In the experiments above, fixed default values for n_ED and n_CpD are used. To see how the composition of the training set of example couples for the SRSM influences accuracy, different values are tested. Specifically, some tests have been performed by keeping the total number of example couples |E| = n_ED · n_CpD constant at 16,000, but alternatively doubling one factor while halving the other. Additionally, tests were performed with the selection of a reduced number of 2,000 example couples, given by n_ED = 1000 and n_CpD = 2, and also with the selection of all possible example couples, so that E = D_T × C_T, yielding 96,030 couples for Reuters and 226,280 for 20NG. Figure 5.7 reports the accuracy measures for all these tests. In the case of Reuters, small differences emerge between the considered cases: the loss of accuracy using 2,000 couples instead of 16,000 is within 1%. On the other hand, with 20NG the differences are more evident, although they are much reduced when the extended semantic search algorithm is used.


Figure 5.7 – Accuracy measures (Y axes) for flat categorization as the composition of the set of example couples to train the SRSM varies: X axes labels are given in the form n n CpD × ED 5.5. Flat categorization 135

Table 5.5 – Comparison of results on flat categorization with those reported in other works

Reuters – Comparison by break-even point
  this method (std. parameters, extended search)    0.733
  this method (best result among tests)             0.749
  Rocchio method [53]                               0.799
  Find Similar (Rocchio) [34]                       0.646
  SVM (best result with tested kernels) [53]        0.865
  SVM [34]                                          0.920

20NG – Comparison by accuracy
  this method (std. parameters, direct search)      0.655
  this method (best result among tests)             0.688
  Rocchio method, tf.idf [52]                       0.823
  probabilistic tf.idf classifier [52]              0.903

It can be noted that accuracy slightly tends to drop as matching documents to more categories is favored over considering more documents: this might be due to the fact that, when documents are matched to many categories, the training set for the SRSM tends to be highly unbalanced in favor of unrelated couples, and the learning algorithm may be sensitive to this. In comparison to the state of the art in flat text categorization, the results given so far are not satisfactory. Table 5.5 shows a comparison between the results obtained here with default parameters and others reported in relevant papers. Reported results for Reuters are those from [53] and [34] for the nearest centroid classifier (§3.6, referred to as the Rocchio method), as the proposed method is somewhat an extension of it, and for standard machine learning-based classification with support vector machines (§3.5.2), as the best overall performing method. Accuracy of the proposed method is about 5% below the best reported with nearest centroid classification and almost 20% below with respect to SVM. As regards 20NG, comparison is done with results from [52] regarding the use of centroid-based classification and the probabilistic approach proposed there, which is among the best accuracy scores for this dataset. Also in this case, the accuracy obtained here is inferior.

Figure 5.8 – Accuracy measures (Y axes) for flat categorization with different SRSM training and classification test collections as the composition of the set of example couples to train the SRSM varies: plot titles are in the form “training corpus → test corpus”, X axes labels are given in the form nCpD × nED; results which would be obtained by training the SRSM on the test collection are reported in gray

All tests discussed above are each performed on the same dataset on which the SRSM is trained. To evaluate the actual independence of the semantic model from the specific domain upon which it is trained, it must be applied to a different domain. The following tests show the accuracy obtained by the method when classifying documents of a dataset by using a SRSM trained on a different dataset. The approach allows this even when, as in this case, the datasets deal with different topics. Figure 5.8 shows the accuracy obtained by crossing Reuters and 20NG as training and test collections – i.e. by using the SRSM model extracted from one to classify documents within the other – as the composition of the set of example couples varies.

Table 5.6 – Accuracy measures for possible combinations of training and test collections

SRSM training        Reuters (extended search)           20NG (direct)
collection           F1 measure        Break-even        Accuracy
Reuters (2 × 8000)   0.733             0.731             0.610 (-0.045)
20NG (2 × 8000)      0.705 (-0.028)    0.699 (-0.032)    0.655

Results for each test collection are compared with those of Figure 5.7, where the SRSM is trained on the same collection. In the transfer of knowledge from one collection to the other there is some loss of accuracy, but in some cases it is of a few percentage points. The worst cases are those applying a model extracted from Reuters on 20NG, where the loss of accuracy can reach about 10%; on the other hand, when applying a 20NG-based model on Reuters using the extended semantic algorithm, the loss is as little as about 3%.

Finally, Table 5.6 summarizes the accuracy measures obtained for the possible combinations of the two datasets used for training and evaluation, with the best tested combinations of parameters related to the selection of example couples (nED = 8000 and nCpD = 2 in all cases) and to the search of semantic relationships (extended algorithm when testing on Reuters, direct when testing on 20NG). For each cross-collection case, the loss of accuracy with respect to a SRSM extracted from the test collection itself is reported: in these cases with optimal parameters, the difference is within 5%, showing an imperfect but fair capability of the SRSM to yield a generally valid model. Between the two cases of transfer of knowledge across domains, the use of the model from Reuters on 20NG seems more difficult than the opposite, given the higher accuracy loss with respect to the standard case. This could possibly be due to the generation of a not fully accurate SRSM from Reuters, for which we hypothesize as a possible cause the slightly greater specificity of its domain, focused on economy, with respect to the more general one of 20NG.

Summing up, experiments on flat categorization show that the method needs improvements to reach a classification accuracy comparable to the state of the art, but they also suggest that SRSM models are already domain-independent to some extent and can be applied in a context different from that where they are trained with limited penalties in accuracy. Regarding the running times of the tests, considering the whole process including model training and classification of test documents, they ranged between about 3 minutes and 3 hours, depending on the dataset and the parameters. The time required by the hypernyms-based search algorithm is roughly triple that required by the direct algorithm; the extended algorithm requires in turn about triple the time of the hypernyms-based one. For other parameters, running time varies linearly with them. Classification on Reuters is about four times faster than on 20NG, given the lower number of both documents and categories. Experiments with hierarchical classification, presented below after the description of the method, are in part similar to those shown here for flat categorization.

5.6 Hierarchical categorization

While the previous section described how to apply the general method based on the domain-independent semantic model to flat categorization, here single-label classification in a hierarchy of categories is discussed instead. This setting entails a somewhat more complex implementation of the method, especially in the classification phase, for which a top-down approach is followed, similarly to the methods with local classifiers described in §3.7.2.

5.6.1 Couple labeling and selection

Reusing the notation given in §3.2.2, the organization of the categories of a set C in the hierarchy is defined by a relation ≺, so that c_d ≺ c_a indicates that c_a is an ancestor of c_d, i.e. a more general category. Considering single-label classification, each document d is associated with a single category, indicated with l(d). As in the flat case, a document d is considered to be strictly related to a category c if l(d) = c, but the hierarchy of categories also introduces the possibility for l(d) and c to be related within it. Considering this, the specialized relation label is introduced for document-category couples (d, c) where the category is a generalization of the topic of the document, i.e. l(d) ≺ c. While a generalized label would make sense for couples where the document topic is more general than the compared category (c ≺ l(d)), it is not considered here, as in the top-down classification algorithm presented in the following a document should never happen to be compared with a more specific category. Considering this, the set K of possible couple relation labels and the relative labeling function ∆ are illustrated in Figure 5.9 and defined as follows.

Figure 5.9 – Coupling of a document with various categories in a hierarchy (example: a document about Geometry coupled with the categories Science, Physics, Math, Geometry and Probability)

K = {related, specialized, unrelated}

∆(d, c) = related       if l(d) = c
          specialized   if l(d) ≺ c
          unrelated     otherwise

The set of example couples E must then contain samples of all three relationship labels, with as much balance as possible between them. However, given the single-label setting, the number of related couples is limited to one for each training document, and specialized couples are also limited, being for each training document equal in number to the depth of the document in the hierarchy. Given two parameters nSpD and nUpD, a number NCpD = 1 + nSpD + nUpD of couples is selected for each document d ∈ D_T:

• the related couple (d, l(d)) with its own category,

• as many specialized couples as possible, up to nSpD, with ancestor categories ((d, c) : l(d) ≺ c),

• the number of unrelated couples necessary to reach NCpD (at least nUpD).

This yields a total of |D_T| · (1 + nSpD + nUpD) example couples. The parameters nSpD and nUpD, contrary to the others, only influence the training set of the model and not the classification using it.
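A minimal sketch of this couple labeling and selection step is given below; the dictionary-based data structures (a map from each training document to its category and a child-to-parent map encoding the ≺ relation) and the function names are assumptions made only for illustration.

import random

def ancestors(category, parent):
    # all ancestors of a category, nearest first, given a child -> parent map
    result = []
    while parent.get(category) is not None:
        category = parent[category]
        result.append(category)
    return result

def label_couple(doc_category, category, parent):
    # relation label of a (document, category) couple in the hierarchy
    if doc_category == category:
        return "related"
    if category in ancestors(doc_category, parent):
        return "specialized"
    return "unrelated"

def select_couples(doc_labels, categories, parent, n_spd=5, n_upd=5):
    # build example couples (document, category, label) for SRSM training
    couples = []
    for doc, cat in doc_labels.items():
        couples.append((doc, cat, "related"))
        anc = ancestors(cat, parent)[:n_spd]          # up to n_spd specialized couples
        couples += [(doc, a, "specialized") for a in anc]
        needed = n_spd + n_upd - len(anc)             # fill up to 1 + n_spd + n_upd couples
        others = [c for c in categories if c != cat and c not in anc]
        chosen = random.sample(others, min(needed, len(others)))
        couples += [(doc, c, "unrelated") for c in chosen]
    return couples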

5.6.2 Representation of categories

In the flat case, each category is represented simply as an average of all its representative documents, which are trivially identified as those labeled with it. For hierarchical classification, in general, it could be useful to also take into account documents of categories which are hierarchically related to the represented one. Specifically, as a top-down algorithm is used, a very specific category dealt with by a document must be distinguishable from the possible others from the first steps, where the correct top categories must be chosen to follow the path which leads to the final correct category. For this, the profile of a category should integrate information from documents of more specific categories, in order to make their documents recognizable at higher levels. However, taking all documents in the whole subtree of a category could entail a high computational effort even for simply computing their average. The adopted solution is the following: set a limit z of levels to explore, each category c has a bottom-up representation ω_c^H constituted by the average of documents labeled either with c or with a descendant of it whose tree distance is within z. Denoting with c_d ≺_n c_a that c_d is a descendant of c_a at distance n from it (e.g. c_d ≺_1 c_a denotes a direct child), a bottom-up representation is built as follows.

ω_c^H(w) = (1 / |D_c^H|) · Σ_{d ∈ D_c^H} ω_d(w)    where D_c^H = {d ∈ D_T : l(d) ≺_n c, n ≤ z}

Representations of this type are used as input for the SMA during the model training phase. Instead, within the top-down algorithm during the classification phase, the simple representation ω_c, made up of strictly related documents only, is also considered.
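The following sketch shows one way these bottom-up profiles might be computed from per-document bag-of-words vectors stored as dictionaries; the representation and the helper names are assumptions made for illustration, not the reference implementation.

from collections import Counter

def within_z_levels(doc_category, category, parent, z):
    # True if `category` is reachable from the document's category in at most z upward steps
    steps = 0
    current = doc_category
    while current is not None and steps <= z:
        if current == category:
            return True
        current = parent.get(current)
        steps += 1
    return False

def bottom_up_profiles(doc_vectors, doc_labels, categories, parent, z=2):
    # average the vectors of documents labeled with each category or with a descendant within z levels
    profiles = {}
    for c in categories:
        docs = [d for d, lab in doc_labels.items() if within_z_levels(lab, c, parent, z)]
        total = Counter()
        for d in docs:
            total.update(doc_vectors[d])
        profiles[c] = {w: v / len(docs) for w, v in total.items()} if docs else {}
    return profiles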

5.6.3 Top-down classification algorithm

While in the flat case a document is classified by comparing it with the profiles of all possible categories, for hierarchical categorization a top-down approach is used: the document is first compared to profiles of categories of higher levels to progressively narrow down the range of its topics, until the most likely exact category is reached. This avoids comparing a document with all existing profiles, which could be very numerous in hierarchical classification.

The algorithm is iterative, acting at each step on a set C of current categories, a subset of the whole taxonomy: at the beginning, C is the set containing only the root node of the taxonomy of categories.

At each step, the combination of SMA and SRSM is first used to compare the document d with the simple representation of each category in C: the probability for each c ∈ C to be related to d constitutes its score. Next, the document is compared with the bottom-up representation of each category which is a child of one of those in C: the score of each of these categories is computed as the probability of being in either a related or a specialized relationship with the document.

If one of the categories in C has the highest score of all (among both current categories and children), the classification algorithm terminates by returning it as the predicted category for d. Otherwise, a new set C′ is built by taking at most a predefined number nCC ≥ 1 of the examined children categories having scores not lower than the remaining ones and also not lower than a threshold τ. If the resulting set C′ is not empty, a new iteration is run using it as the new set C of current categories; otherwise, as no category is deemed likely enough to be related to the document, the algorithm terminates leaving the document unclassified, which expresses too high an uncertainty about its correct category.

The parameter nCC controls in practice the maximum number of branches explored throughout the search: if it is 1 only one path is tested, but for higher values the search can run through multiple, arbitrarily distant paths. The threshold τ instead has to be tuned to find an optimal value: a too low value may allow too many erroneous paths in the classification, while a too high value may yield a high number of unclassified documents.
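A compact sketch of this top-down procedure is reported below. The two scoring functions stand for the application of the SMA followed by the SRSM probability estimates on the simple and on the bottom-up profiles respectively; they, together with the children map and the function names, are assumptions introduced only to make the control flow explicit.

def classify_top_down(doc, root, children, score_related, score_related_or_spec,
                      n_cc=3, tau=0.15):
    # returns the predicted category for `doc`, or None if it is left unclassified
    current = [root]
    while True:
        # score current categories against their simple (strictly related) profiles
        current_scores = {c: score_related(doc, c) for c in current}
        # score their children against the bottom-up profiles
        child_scores = {}
        for c in current:
            for child in children.get(c, []):
                child_scores[child] = score_related_or_spec(doc, child)
        if not child_scores or max(current_scores.values()) >= max(child_scores.values()):
            # a current category has the highest score of all: predict it
            return max(current_scores, key=current_scores.get)
        # otherwise keep the (at most n_cc) best-scoring children above the threshold tau
        best = sorted(child_scores, key=child_scores.get, reverse=True)[:n_cc]
        next_set = [c for c in best if child_scores[c] >= tau]
        if not next_set:
            return None  # too much uncertainty: leave the document unclassified
        current = next_set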

Table 5.7 – Summary information about benchmark datasets for hierarchical categorization

Level    Yahoo! docs.   Yahoo! cats.   DMOZ docs.   DMOZ cats.
root             0              1              0            1
1               98              6            350           21
2              349             27          1,563           81
3              454             35          2,703           85
4               --             --          1,163           32
5               --             --             57            2
total          907             69          5,836          222

5.6.4 Experiment setup

Experiments on the hierarchical variant of the method have been carried out on the two datasets extracted from web directories (see also §3.8.1) and used as benchmarks in [17]4.

• The Yahoo! dataset was extracted from the Science branch of Yahoo! Directory: it contains a total of 907 documents organized in 69 categories.

• The DMOZ dataset was extracted from the Conditions and Diseases branch of the Health top category of the Open Directory Project: it contains a total of 5,836 documents organized in 222 categories.

Like those used in the flat categorization experiments, these two datasets represent different domains: while DMOZ deals with a specific branch of the Health theme, topics in Yahoo! are related more generically to science. As in the Yahoo! dataset there is no branch dealing with health, the two datasets have in practice no specific topics in common. For each of the two datasets, 2/3 of the documents are selected at random beforehand to constitute the training split, while the remaining 1/3 constitute the test split: both splits are used in the experiments as explained in §5.3.6.

4 Both datasets are available for download at http://www.di.uniba.it/~malerba/software/webclass/WebClassIII.htm

Table 5.8 – Summary of parameters of the hierarchical categorization method and of their default values (used where no different indication is given)

Parameter function                                   Symbol and value
Both training and classification
  Sublevels for bottom-up category profiles          z = 2
  Words considered by SMA for each document          nWpD = 10
  Words considered by SMA for each category          nWpC = 200
  Algorithm for search of semantic relationships     Θ (varying)
SRSM training
  Specialized couples per example document           nSpD = 5
  Unrelated couples per example document             nUpD = 5
Document classification
  Probability threshold for specialization           τ = 0.15
  Maximum attempted categories                       nCC = 3

Regarding parameters, some of those present in the flat version of the method are also present here, with the addition of those related to the presence of the hierarchy: these are the number z of sublevels of the taxonomy considered when building the bottom-up representation of each category, the parameters nSpD and nUpD related to the selection of example couples and the parameters τ and nCC of the hierarchical classification algorithm. Default values, used where not otherwise specified, are given in Table 5.8.

In this hierarchical context, the number nWpC of words considered for each category by the SMA is substantially raised with respect to the flat case in order to support the high number of words of higher-level categories, which must also be representative of their respective subcategories within up to z sublevels. nWpD is reduced to balance the additional workload. To train the semantic model, the Random Forest learner [11], based on an ensemble of decision trees (briefly discussed in §3.5.3), is used here. As accuracy measures for the hierarchical case, standard accuracy has been chosen to check how often the classifier picks the exact category, while the hierarchical F1 measure (hF1 for short, see §3.8.2) is reported to indicate how severe the errors generally are.
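Assuming the SMA has already turned each example couple into a fixed-length numeric vector summarizing the semantic relationships found between the relevant words of the document and of the category, training the SRSM and obtaining the class probabilities used as scores might look roughly as follows. The use of scikit-learn and the toy data are assumptions made for illustration; the thesis does not prescribe a specific library.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-in for the SMA output: one row per example couple
X_train = np.array([[3, 1, 0], [2, 2, 1], [0, 0, 4], [1, 0, 3]])
y_train = np.array(["related", "specialized", "unrelated", "unrelated"])

srsm = RandomForestClassifier(n_estimators=100, random_state=0)
srsm.fit(X_train, y_train)

# at classification time, each document-category couple is scored with the
# estimated probability of the "related" (or "related"/"specialized") label
X_new = np.array([[2, 1, 1]])
probabilities = srsm.predict_proba(X_new)   # one row per couple, one column per label
related_score = probabilities[0, list(srsm.classes_).index("related")]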

5.6.5 Experiment results

As above, experiments where the semantic model is trained on the same dataset on which it is tested are considered first. Firstly, the effect of varying the maximum number z of sublevels of the taxonomy from which to retrieve documents to build the bottom-up representations of categories is analyzed. Intuitively, this number should be high enough so that words belonging to deeper categories also appear in the profiles of the corresponding higher-level categories, so that, during the classification phase, these categories are correctly chosen. Results shown in Figure 5.10 confirm this intuition: for z < 2, a considerable loss of accuracy is experienced. Regarding efficiency, changing z does not have noticeable effects on the whole running time, as higher values generally just involve computing the average of more documents for each category, which is a relatively efficient operation anyway.

Now, as done in the flat case, the effect of varying the number of words considered by the SMA for documents and categories is considered, along with the specific algorithm used to search for semantic relationships. Figure 5.11 shows how accuracy varies with the number of relevant words for each document, while Figure 5.12 shows how it varies with the number of words per category. In both cases, obviously, the use of an excessively low number of words has negative effects on accuracy. However, these effects are in some cases reduced by the use of more complex algorithms for finding semantic relationships, although the hypernyms algorithm appears not to be positively effective on the Yahoo! dataset.

Another parameter which can influence the performance is the number nCC of open branches during top-down classification: Figure 5.13 shows the accuracy measures for different values of this parameter. It appears that in most cases it is necessary to explore at least two branches to obtain optimal accuracy, which instead drops when using only one and, to some extent, also when using too many branches, which in addition would make running times longer.

The effect on the final accuracy of the number of example couples used to train the SRSM is now investigated. Figure 5.14 reports the results for different values assigned to both nSpD and nUpD, indicating the number of specialized and unrelated couples produced for each document, in addition to the related couple with the document's own category. It is shown that these parameters can make a significant difference in the results, suggesting how substantial the set of example couples must be to obtain a sufficiently accurate model.

Figure 5.10 – Accuracy measures (Y axes) for hierarchical categorization as the number z of sublevels of the categories taxonomy to consider when building bottom-up representations varies (X axes)

Figure 5.11 – Accuracy measures (Y axes) for hierarchical categorization as the number nWpD of words considered for each document varies (X axes)

The obtained results, compared with the best ones reported in [17] for various methods and for the same datasets (although obtained with cross-validation rather than with a single training/test split), are partly satisfactory. This method obtains overall superior results on Yahoo!, while on DMOZ it does not outscore flat approaches applied to the dataset, but it is superior to hierarchical ones.

As in the flat case, experiments where the semantic model is trained on a collection different from that of the documents to be classified are considered, varying the semantic search algorithm and the number of selected example couples as above. Figure 5.15 shows these results, compared with those already given in Figure 5.14, where training and test collections are not crossed.

Figure 5.12 – Accuracy measures (Y axes) for hierarchical categorization as the number nWpC of words considered for each category varies (X axes)

Table 5.9 – Comparison of accuracy on hierarchical categorization with results reported by [17]

                                                   Yahoo!    DMOZ
this method (std. parameters, extended search)     0.6486    0.4837
this method (best results among tests)             0.6857    0.5053
flat, centroids-based                              ≈0.62     ≈0.50
flat, SVM                                          ≈0.61     ≈0.61
hierarchical, naïve Bayes                          ≈0.55     ≈0.41


Figure 5.13 – Accuracy measures (Y axes) for hierarchical categorization as the maximum number nCC of categories tested at each level during classification varies (X axes)

Among the studied cases of transfer of knowledge across collections, the one where the SRSM trained on DMOZ is applied to Yahoo! seems to be the more problematic: the loss of accuracy with respect to using a SRSM trained on Yahoo! itself is generally between 10% and 15%, while the hF1 measure is about 10% below. The opposite setting, where the SRSM is trained on Yahoo! and used on DMOZ, gives a slightly smaller gap with respect to the base case: the loss of both accuracy and hF1 measure is generally within 10%, with cases where the difference is around just 2%.

Table 5.10 summarizes the accuracy and the hF1 measure obtained for the possible combinations of the two datasets with standard parameters and the extended semantic search algorithm.

Figure 5.14 – Accuracy measures (Y axes) for hierarchical categorization as the composition of the set of example couples to train the SRSM varies: X axes labels indicate the values of both nSpD and nUpD and, between parentheses, the total number of obtained example couples for each training document

Table 5.10 – Accuracy measures on hierarchical categorization for possible combinations of training and test collections

SRSM         Yahoo!                              DMOZ
training     Accuracy          hF1               Accuracy          hF1
Yahoo!       0.629             0.816             0.428 (-0.056)    0.666 (-0.019)
DMOZ         0.477 (-0.152)    0.705 (-0.111)    0.484             0.685


Figure 5.15 – Accuracy measures (Y axes) for hierarchical categorization with different SRSM training and classification test collections as the composition of the set of example couples to train the SRSM varies: plot titles are in the form “training corpus → test corpus”, X axes labels indicate the values of both nSpD and nUpD and, between parentheses, the total number of obtained example couples for each training document; results which would be obtained by training the SRSM on the test collection are reported in gray

Again, for cross-collection cases, the loss with respect to the use of the same collection for training is shown. As in the plots above, the table shows the unsatisfactory result when classifying Yahoo! with a model from DMOZ and the more interesting result in the opposite setting, with a quite small difference between the two models.

Regarding the relatively poor results on the DMOZ dataset, we hypothesize as one possible cause the specificity of its Conditions and Diseases domain. As WordNet is a general-purpose database without a relevant coverage of domain-specific knowledge, it is very likely that a significant number of important words found within the DMOZ dataset do not have a corresponding entry in WordNet, thus making it impossible to identify semantic relationships involving them (apart from exact matching). The relative scarcity of word relationships potentially also affects the SRSM training phase and consequently the accuracy of the model: this could explain the poor results when using the model extracted from DMOZ to classify Yahoo! documents.

Most considerations about running times reported for flat categorization are also valid here: in these experiments, running times are roughly between 2 minutes and 3 hours, with experiments on DMOZ being about 3-4 times slower than equivalent ones on Yahoo!.

Summing up, the proposed method for hierarchical categorization, when the SRSM is trained on the same domain as the documents to be classified, seems to yield good results in comparison with existing work on the same datasets. The noticeable difference between the “absolute” classification accuracy and the hF1 measure in all cases suggests that at least some errors of the classifier happen at lower levels of the taxonomy. Regarding the capability of the SRSM to work across different domains, results are somewhat worse with respect to the flat case, with only one of the two tested cases giving a fairly limited gap with respect to the base case of no transfer of knowledge.

5.7 Discussion

The proposed approach to text categorization is based on a knowledge model trained to predict the mutual relatedness between documents and categories by analyzing the semantic relationships between their representative words. The model trained on documents within a domain is structurally decoupled from its specific words and categories, potentially allowing it to be reused in different domains. Experiments were carried out to test the effective capability of the extracted models to work across different domains, as well as the general efficacy of the method in classifying documents, either in a flat or in a hierarchical set of categories.

Considering all the reported results on both the flat and the hierarchical variants of the proposed method, the results are partly interesting but still not fully satisfactory. Regarding the overall accuracy in document classification, results for flat categorization are substantially below the best methods such as SVM-based classification and, with smaller gaps, below standard nearest centroid classification, which somehow constitutes the skeleton of this method. Gaps with existing results are more favorable in the case of hierarchical classification, where the obtained results are comparable and in part superior to those reported for the same datasets, which are however less widely used than those employed for flat categorization.

Throughout the experiments, some alternative methods have been tested to find semantic relationships between words. These methods, rather than returning a single relatedness or similarity score for a couple of words as most existing ones do, give a vector of scores for different relationship classes, allowing differentiation between them. In many cases, the use of a carefully developed algorithm to identify relationships not directly indicated by the WordNet database has resulted in improvements of a few percentage points in the accuracy measures.

Regarding the presumed generality of the semantic model in predicting document-category relationships, the results of the tests partially support this hypothesis. When classifying documents on the basis of a semantic model extracted from a different domain, the final accuracy is generally lower with respect to using a model obtained from the same domain, but in many cases the loss is limited to some percentage points. This happens despite the differences in the topics treated by the different test collections, suggesting that the obtained models are effectively domain-independent to some extent. Results seem to suggest that, to obtain a SRSM which works well across domains, it should be trained on a domain which is as general as possible: in fact, lower accuracy measures appear to occur exactly when the model is trained on domain-specific datasets.

By also looking at the results of other methods on the same datasets, we can suppose that the sub-optimal results are mostly a consequence of the centroid-based nature of the method. As discussed earlier and seen in the comparison of results, nearest centroid classification has been proven not to be among the best approaches to text categorization. We chose to use it for being the most straightforward method to obtain self-contained profiles of categories resembling the bags of words used for documents, so that each category is represented by a list of weighted relevant words.
Potential room for improvement could lie in using finer models for the representation of categories, which however requires either fitting them into the existing scheme involving the search of semantic relationships with the relevant words of documents, or somehow adapting this scheme to accept substantially different models.

Another aspect which could be improved is the search of semantic relationships. While the use of finer search algorithms helps in finding more useful relationships and often results in accuracy improvements, we nevertheless experienced, while testing such algorithms, that some intuitive relationships between words are still not found; it also happens that links are returned between actually unrelated words. These issues with false negatives and false positives could possibly be solved by further work on the semantic search or, probably better, by the use of richer semantic knowledge bases. From the emerging Semantic Web, it is possible to extract knowledge about a large number of objects, with many meaningful links between them.

Other parts of the method could also be refined in an attempt to improve its overall results. Other than trying different values for parameters, which would anyway in many cases have negative effects on running times, changes could be made in the creation of category profiles (for example by considering negative examples, as generally happens in the Rocchio method), in the selection of example couples, in the search of semantic relationships or in the training of the semantic model. Additionally, tests could be performed on larger datasets, checking for example whether more general semantic models, with optimal accuracy on other datasets, can be extracted from them.


Chapter 6

Conclusions

In this thesis, techniques for text mining, consisting of the automatic analysis of possibly large collections of text documents, have been discussed. The presentation of such techniques constitutes a general overview of the literature about text mining, borrowing from fields such as machine learning and information retrieval. These techniques mostly serve to create suitable structured representations of documents, which can then be handled by many algorithms.

Focus has been given to text categorization, a task prominently addressed within text mining with different applications, among which the distinction of topics discussed within documents is the most recurring one. In the common machine learning-based approach, a knowledge model able to classify new documents is extracted from a training set of already labeled ones, usually handled in the form of bags of words.

The first contribution addressed cross-domain text categorization, where a knowledge model to classify documents in a target domain is obtained with the support of labeled documents of a source domain. A novel method has been proposed, based on nearest centroid classification, where a profile for each category is extracted from training documents and each new document is classified with the category having the closest profile. In the cross-domain case, such profiles are extracted from the source domain and, through successive refining steps, they are adapted to the target domain and then used to classify the relevant documents.

Experiments on recurring benchmark datasets produced good results, with accuracy measures superior or comparable to those reported by many works and fast running times, all despite the relative simplicity of the approach. A variant is proposed to cut down the number of iterations with minimal loss of accuracy, while another one is proposed to obtain from the source domain a regression model mapping measurable similarity to effective relatedness.

Starting from the simple idea of this model, a more complex one based on semantic relationships between words has been conceived: the idea is to deduce the degree and also the nature of the relatedness between arbitrary documents and categories from the semantic relationships between the respective relevant words within them. The proposed model is trained from a set of labeled documents, but once built it is decoupled from them, as it references no specific words or categories. It is theorized that the model is effectively domain-independent, able to accurately predict the relationships between documents and categories from their representations even outside of the domain it was trained on.

To test this assumption, a general scheme for text categorization based on this model has been proposed, which could grant advantages like the possibility of introducing new categories at any time while classifying documents, without the need to train new or updated models, but just by providing profiles built efficiently from relevant documents. Two instances of this scheme have been proposed for flat and hierarchical categorization, respectively: in the latter case, a top-down algorithm is used to allow efficient classification within taxonomies with many categories.

Knowledge of semantic relationships between words is based on the lexical database WordNet, which stores primitive links between concepts.
In order to identify indirect relationships which exist but are not directly given, two algorithms of different complexity have been proposed, which have been tested in categorization experiments along with the simpler solution of looking for direct relationships only.

Experiments on flat categorization have been carried out on two widely used datasets, testing different values for some key parameters. Accuracy measures are below those reported by other works on the same datasets, but tests showed that a semantic relatedness model extracted from one of the two datasets can be employed to classify documents within the other dataset with a limited loss of accuracy, namely within 5% in the analyzed cases.

Also concerning hierarchical categorization, two datasets have been considered for the experiments. In this case, the comparison of accuracy with the results reported for other methods on them is partly satisfactory. With these two datasets, applying a model extracted from one to classify documents in the other brings results about as good as above in one case and worse results in the other.

Overall, the experiments show that classifying documents using knowledge extracted from a different domain yields an accuracy which often is only a few percentage points below that which would be obtained by using a model extracted from the same domain. This partly supports the hypothesis of the general validity of the semantic relatedness model. It is expected that some improvements could be made with further tuning of parameters and especially by using models extracted from larger datasets, possibly providing richer knowledge.

Comparison of the different approaches to the identification of semantic relationships shows that the more complex algorithm generally guarantees slightly higher accuracy than the simpler methods, although with longer running times. Possible further work on these algorithms will be focused on carefully selecting the operations to be performed, in order to retain good efficacy while improving efficiency.

6.1 Ideas for future research

A relevant limit of the presented work, concerning both the cross-domain categorization method and the semantic model, is their centroid-based nature, which is likely an important cause of the unsatisfactory accuracy measures in the latter part. Works on text categorization show that methods based on nearest centroid classification usually bring sub-optimal efficacy, due to the fact that categories cannot usually be accurately represented with single centroids. We have chosen this representation as a straightforward solution to experimentally investigate the possibility of extracting domain-independent knowledge. Future work might be dedicated to the search for a more robust model for the representation of categories, to be fitted into the proposed approach.

As regards the concrete text categorization method based on the domain-independent semantic model, further research may include, besides experiments on larger datasets and with different parameter values, tests of variants in different parts of the method, such as the selection of example couples and the classification process. The goal is to obtain improvements in the generality of the obtained models and in the overall accuracy of the classification of documents.

Other possible directions include the application of the proposed approach to different tasks, such as performing clustering of documents, by using the semantic model to measure mutual distances and relationships between documents. The distinction of multiple relatedness modalities, like distinguishing strictly related documents from those treating more general or more specific topics, may be useful in more advanced applications, such as hierarchical clustering and the induction of a taxonomy of categories.

Appendix A

Network Security through Distributed Data Clustering

While the rest of the thesis focuses on text mining and specifically on text categorization, an additional work developed during the Ph.D. course is presented here, consisting of the design of a system for the detection and identification of malicious traffic within a computer network based on distributed data mining. In this system, nodes of a network continuously gather statistics on the traffic obtained from SNMP agents on monitored hosts and periodically run a distributed data clustering algorithm to cooperatively build a knowledge model able to separate observations of normal traffic from those related to different types of attacks.

A.1 Distributed data mining

As described in §1.1.2, machine learning techniques are generally used within data mining to extract knowledge models from potentially large volumes of data, which can then be used to formulate predictions on subsequently produced data.

A branch of machine learning which is currently gathering interest is that of distributed learning algorithms, which run on multiple separate machines with more or less limited capabilities to communicate among them. A “distributed algorithm” in general is characterized by the distribution of computation across machines, in order to split the workload and thus reduce the total time necessary to complete it. In the context of data mining and machine learning, the data is often considered to be distributed as well: each machine has a different set of observations and must cooperate with the others to produce coherent general models based on their distributed knowledge.

These algorithms are useful in networks of nodes where each node gathers significant amounts of data which should be analyzed collectively, but transferring all data to a central point would be costly, inefficient or unfeasible for other reasons, such as privacy concerns between different parties. Many distributed learning algorithms work with minimal exchange of information: nodes are connected according to a network topology and exchange small messages with neighboring nodes only, thus limiting the costs of their interconnection and the latency of communications [22].

A.2 Network intrusion detection systems

Information security is an essential issue in distributed systems: any host connected to the Internet is a potential target of a variety of network-based attacks trying to discover possible system vulnerabilities (by means of, e.g., port scanning) and to exploit them for malicious intents, for example making network services unavailable to legitimate users (e.g. Denial of Service attacks) or gaining unauthorized access to them (e.g. brute force or buffer overflow attacks). To counter these network threats, specific countermeasures have been devised and already deployed in production environments, while the research community is still working to improve them.

Network-based Intrusion Detection Systems (NIDSs) have been developed to automatically detect potential attacks coming from the network, report them to the system administrators and, in some cases, even try to prevent them. The general model of intrusion detection systems, also used to monitor hardware and software on single hosts, was introduced in [31]: from this general idea, various solutions for network security have been proposed [54, 71].

Two major approaches to intrusion detection systems can be distinguished. Signature-based systems require preliminary knowledge of some sort of behavioral pattern of the threats to be detected, so that the system can trigger an alert as it detects the signature of a specific attack in the observed traffic. This type of IDS has the natural limitation of being able to detect only known threats, which can be described at different levels of detail. On the other hand, anomaly-based IDSs are trained to recognize the normal behavior of the monitored system and their task is to report any significant divergence from it. These systems are more likely to detect previously unknown types of attack, but they need a precise and often complex definition of the correct behavior of the system.

Some works proposed the use of machine learning techniques for intrusion detection, with [62] being the first to apply supervised learning techniques to network security, while the use of unsupervised learning started from [98] with the use of single-linkage clustering. These methods are based on the analysis of raw traffic or of summarized data obtained from platform-specific tools (e.g. Unix's netstat). Some newer methods leverage instead the information obtainable from the SNMP (Simple Network Management Protocol) agents usually running on many hosts, which gather statistical information about network activity, stored in a Management Information Base (MIB) with a standard structure: these methods are thus more easily portable across platforms. A first example of a MIB-based NIDS is given in [13], where abnormal rates of change of some variables are detected. An example of the application of machine learning to MIB data is given in [127], where Support Vector Data Descriptions (SVDD, a variant of one-class support vector machines) are used.

A.3 General system model

We present here the general architecture of the proposed system. It is composed of a set of monitoring stations that form a peer-to-peer (P2P) network according to a given overlay scheme: this way, each station has a number of neighbors it communicates with. Each station (or node) collects MIB data with statistical traffic information from a set of hosts (servers, workstations etc. and/or even itself) through the SNMP protocol, by querying them at regular time intervals.

Each node thus gathers a continuously growing set of observations, each containing part of the information given by the SNMP protocol about the network traffic traversing a given host at a given time, constituting a concise snapshot of such traffic. At some moments during this data collection process (e.g., at regular intervals), all the nodes in the network start a distributed data clustering algorithm, each using its own SNMP observations as a local data set. The objective of the data clustering algorithm is to partition the “global” data set (the union of all observations stored by all nodes in the network) into a number of clusters constituting groups of similar instances of data.

Each of the detected clusters corresponds to a particular traffic situation, which may denote either regular network activity or the presence of an attack against a host or the entire network it is part of. Initially, a node is not able to determine whether a given cluster corresponds to a normal situation or to an ongoing attack: therefore it is necessary to have some form of user supervision to match the located clusters to traffic situations as long as these are unknown. However, once a node knows that a cluster corresponds to a particular network threat, it can use this information to detect that threat. While collecting observations from SNMP, a node can trigger a warning if some of them fall into a cluster which is known to correspond to an attack. By periodically repeating the execution of the distributed clustering algorithm over time, the knowledge of the nodes is progressively updated and previously unknown types of attacks can be identified.

A.4 Implementation details

The previous section describes the general structure of the system, leaving much freedom about its design and implementation. We illustrate here some details of a possible implementation, used as a reference for the simulations described subsequently.

One thing to determine is the set of features of which each observation extracted by the monitoring stations is composed: SNMP provides a large amount of information, but processing all of it can be inefficient. This issue can be addressed by means of feature selection, using specific techniques to select a subset of all the possible features which can be much smaller but can still represent the observations with minimal loss of information. In our simulations (see below), once the test dataset was extracted, we ran on it the correlation-based feature selection algorithm [45], which selects attributes poorly correlated with each other to avoid redundancy: the algorithm has been run on multiple folds of the dataset, then the 14 features selected in at least half of the folds have been picked.

D_n ← local data set
k ← number of clusters (same for all nodes)
γ ← termination threshold (same for all nodes)
M_1 ← initial centroids (same set for all nodes)
i ← 1
repeat
    A_1, A_2, ..., A_k ← ∅
    for all d ∈ D_n do
        A_x ← A_x ∪ {d} : x = arg max_{x ∈ {1,...,k}} ( −‖d − m^n_{i,x}‖ )
    end for
    for all j ∈ {1, 2, ..., k} do
        l^n_{i,j} ← (1 / #A_j) Σ_{d ∈ A_j} d
    end for
    Send local centroids L^n_i and counts c^n_{i,j} to neighbors a ∈ Adj(n)
    Receive centroids L^a_i and counts c^a_{i,j} from neighbors a ∈ Adj(n)
    for all j ∈ {1, 2, ..., k} do
        m^n_{i+1,j} ← ( Σ_{a ∈ Adj′(n)} c^a_{i,j} · l^a_{i,j} ) / ( Σ_{a ∈ Adj′(n)} c^a_{i,j} )
    end for
    i ← i + 1
until max_{j ∈ {1,...,k}} ‖m^n_{i,j} − m^n_{i−1,j}‖ ≤ γ

Figure A.1 – Pseudo-code for the Local Synchronization-Based P2P K-means, as executed by a node n.

In a real setting, this feature selection process could be run periodically to keep the optimal set of features up to date, although it should not need to be repeated as frequently as the clustering process.

As the concrete distributed clustering algorithm for the experiments, an adaptation proposed in [27] of the classic k-means partitive algorithm has been used. All nodes start from a common set of k center points in the observation space, then each node n iteratively executes the standard k-means actions, followed by an exchange of information with the peers Adj(n) which are its neighbors in the topology. First, as always, each observation is assigned to the nearest center and the centroids of the observations of each group are then computed. Subsequently, these local centroids are sent to the neighbors, which in turn send theirs: the node then computes the new position of each center as the average of the centroid positions in the node itself and in its neighbors (Adj′(n) = {n} ∪ Adj(n)), weighted according to the number of observations assigned to it in each one. When, from one iteration to the next, no center moves by a distance greater than a threshold γ, the node has reached convergence and stops iterating, while continuing to communicate its local centroids to the neighbors still running the algorithm. In the end, each node has a clustering model based on the collective knowledge in the network. Figure A.1 shows the pseudo-code for the algorithm.
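As a rough illustration of a single synchronization round of this algorithm, the sketch below updates the centroids of one node from its local data and from the centroids and counts received from its neighbors; message passing is abstracted away and the function names are assumptions, so this is only a sketch of the computation in Figure A.1, not the actual implementation used for the simulations.

import numpy as np

def local_kmeans_round(data, centroids):
    # one local k-means step: assign points to the nearest centroid, then recompute centroids
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    local_centroids, counts = [], []
    for j in range(len(centroids)):
        members = data[assignment == j]
        counts.append(len(members))
        local_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
    return np.array(local_centroids), np.array(counts)

def merge_with_neighbors(own_centroids, own_counts, neighbor_results):
    # weighted average of local centroids over the node itself and its neighbors (Adj'(n))
    all_centroids = np.array([own_centroids] + [c for c, _ in neighbor_results])
    all_counts = np.array([own_counts] + [n for _, n in neighbor_results], dtype=float)
    totals = all_counts.sum(axis=0)
    totals[totals == 0] = 1.0                       # avoid division by zero for empty clusters
    weighted = (all_centroids * all_counts[:, :, None]).sum(axis=0)
    return weighted / totals[:, None]

A node would alternate these two steps, checking after each round whether every centroid has moved by less than the threshold γ with respect to the previous round.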

A.5 Simulation setup

To test the potential effectiveness of the proposed system in distinguishing observations generated from normal traffic and those generated from different types of attacks, we have simulated the execution of the distributed k-means-based clustering algorithm described above on an artificially generated yet realistic set of SNMP observations. We first generated the dataset by setting up a network with a “victim” server monitored by a separate host and connected to a network including machines simulating both regular clients and attackers. We gathered SNMP data at regular intervals from the monitored server in five different sessions. During the first session, we emulated regular network behavior by generating HTTP requests from the clients. Then, in the following four sessions, we added malicious traffic from the attackers by placing the same server under different kinds of network attacks. The attacks used in the respective sessions were:

1. Denial of Service,
2. Distributed Denial of Service,
3. Denial of Service on SSH,
4. Brute force on SSH.

We produced a total of 5,655 observations divided into five classes, according to the session each one was generated in, so one class corresponds to the absence of attacks, while each of the remaining four corresponds to one of the listed network attacks. After building this dataset, we reduced its initial set of hundreds of features to only 14 representative ones, using the selection algorithm mentioned above.

This dataset has been tested in multiple simulations of the clustering algorithm. Each simulation works by setting up a virtual network of nodes, assigning a training and a test set to each of them, running the algorithm with each node using its training set and finally measuring accuracy indicators on all nodes using their respective test sets. To assign data to nodes, we used specific data distribution algorithms to extract from the whole dataset a training subset and a test subset for each node. Simulations differ from each other in three principal aspects: the topology of the network on which the algorithm is run, the distribution of data across the nodes and the parameters specific to the clustering algorithm. Considering the distributed k-means variant presented above, we mainly tested the variation of the number k of clusters.

Five different topologies with 64 nodes have been tested: a scale-free network, a ring, a ring with 16 random additional links, a torus and a fully connected mesh. Regarding the distribution of data across nodes, two different general strategies have been tested to provide a training and a test set to each node: (A) distributing different observations of the same n classes to the two sets, or (B) picking n classes independently for each set and providing all observations of those classes. In all cases, data is distributed independently to each node and the distributed classes always include that of regular traffic, as it is supposed to be observed much more than the others in a real case.

For each tested combination of parameters, 20 random distributions of data are considered from the chosen distribution strategy; for each of these, the distributed k-means algorithm is run 50 times with different random initial positions of the centroids. Of these 50 runs, the one with the best results is considered, assuming the application of existing methods for optimal centroid initialization; the results of the 20 “best-case” simulations with differently distributed data are then averaged, as the distribution of data cannot be controlled in a real case.

For each single simulation, all accuracy measures are averaged across all nodes. The main accuracy measure, used to determine which are the best runs, is the ratio of test observations correctly classified considering the best possible mapping from clusters to classes (assumed to be supervised), referred to as attack identification accuracy. Other than this, attack detection accuracy is also measured, which only considers the ability of the nodes to distinguish regular traffic from attacks, regardless of their class, so that detecting an attack of one type in response to one of a different type is not considered an error. Additionally, to have indications about the type of committed errors, the false positive rate (ratio of regular traffic observations misclassified as threatening) and the false negative rate (ratio of undetected attack observations) are also measured.
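The appendix does not detail how the best mapping from clusters to classes is computed; one common choice, used here purely as an illustrative assumption, is to solve it as a one-to-one assignment problem on the cluster/class contingency table.

import numpy as np
from scipy.optimize import linear_sum_assignment

def identification_accuracy(cluster_ids, true_classes):
    # ratio of observations correctly classified under the best one-to-one cluster-to-class mapping
    clusters, classes = np.unique(cluster_ids), np.unique(true_classes)
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, cl in enumerate(clusters):
        for j, cs in enumerate(classes):
            table[i, j] = np.sum((cluster_ids == cl) & (true_classes == cs))
    rows, cols = linear_sum_assignment(-table)      # Hungarian assignment, maximizing matches
    return table[rows, cols].sum() / len(true_classes)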


Figure A.2 – Percentage accuracy measures for distributed k-means (on Y axis, in percentage) run on different 64-node topologies, using data distribution strategy A (train and test on different data of the same classes) with a variable number of training classes (on X axis).

A.6 Simulation results

A parameter with an important influence on the results is the distribution of data across nodes, especially the number of classes known by each node. Considering both the strategies proposed above, if each node knows 4 classes or all 5 of them, the detection is perfect and the identification accuracy is generally above 98%; we found these to be the same results which can be reached by running the standard centralized k-means algorithm on the whole dataset.

Figure A.3 – Percentage accuracy measures for distributed k-means (Y axis, in percentage) run on different 64-node topologies, using data distribution strategy B (training and testing on different classes) with a variable number of training classes per node (X axis). Panels: (a) traffic class identification, (b) attack detection, (c) false positive rate, (d) false negative rate; one curve per topology (fully connected, torus, ring with random links, ring, scale-free).

With 2 or 3 known classes per node, detection instead exceeds 97% and 99% respectively in most cases, with false positives more common than false negatives, while identification accuracy is usually above 90% with 3 classes per node. Figures A.2 and A.3 show these results on all tested network topologies for data distribution strategies A (same classes for training and test set) and B (sets with different classes) respectively: as explained above, each point is the average over 20 runs with different random data distributions, with error bars indicating the variance. The plots also show that network topologies with a higher connectivity degree can boost accuracy, as each node is more likely to receive information about the classes it misses; compare, for example, the plain ring topology with the ring with 16 additional links, for which results are given in the plots. Testing further amounts of additional links in the ring network, we observed the accuracy measures to grow roughly linearly with that amount, in cases where they were not already at optimal levels. A sketch of how such topologies can be generated in simulation is given at the end of this section.

Another tested parameter is the number k of clusters to be generated. A decently accurate identification of all 5 considered traffic classes obviously requires at least 5 clusters, which is the default value considered; however, just 3 clusters turn out to be generally sufficient to achieve good attack detection accuracy. With even higher numbers of clusters, all accuracy measures slightly increase, although the number of k-means iterations run by the nodes, and therefore the global complexity, also grows. In this respect, apart from the number of clusters, the conditions guaranteeing good accuracy (enough known classes per node and good network connectivity) often also entail a slightly lower average number of iterations per node.

Due to the lack of other works proposing decentralized data clustering for network security, a comparison of the results with the related literature is not fully feasible. However, the accuracy levels reached in the experiments with distributed clustering are generally better than or comparable to those obtained by other works performing standard centralized analysis, either supervised or not. The proposed system has the advantage over supervised ones of being able to distinguish previously unseen types of traffic and, given its distributed nature, it can be run on a network of nodes while causing minimal traffic among them.
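As a minimal illustration of the five tested topologies, the following sketch builds the corresponding 64-node graphs. It assumes the networkx library; the specific generators and parameters (e.g. the preferential-attachment parameter m for the scale-free network, or the seeds) are our own assumptions for the sake of the example, not the exact configuration used in the experiments.

    import random
    import networkx as nx

    N = 64

    def ring_with_random_links(n, extra_links, seed=None):
        # Start from a plain ring and add the requested number of distinct random links.
        rng = random.Random(seed)
        g = nx.cycle_graph(n)
        while g.number_of_edges() < n + extra_links:
            u, v = rng.sample(range(n), 2)
            g.add_edge(u, v)  # re-adding an existing edge is a no-op, so exactly extra_links new links result
        return g

    topologies = {
        "fully connected":      nx.complete_graph(N),
        "torus":                nx.convert_node_labels_to_integers(
                                    nx.grid_2d_graph(8, 8, periodic=True)),  # 8x8 wrap-around grid
        "ring":                 nx.cycle_graph(N),
        "ring w/ random links": ring_with_random_links(N, extra_links=16, seed=0),
        "scale free":           nx.barabasi_albert_graph(N, m=2, seed=0),    # preferential attachment
    }

In the simulations, each node communicates only with its neighbors in the chosen graph, which is why the more connected topologies spread information about missing classes faster and yield higher accuracy.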
