Natural Language Information Retrieval (Text, Speech and Language Technology)

Total Pages: 16

File Type: PDF, Size: 1020 KB

Natural Language Information Retrieval
Text, Speech and Language Technology, VOLUME 7

Series Editors: Nancy Ide, Vassar College, New York; Jean Véronis, Université de Provence and CNRS, France.

Editorial Board: Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands; Kenneth W. Church, AT&T Bell Labs, New Jersey, USA; Judith Klavans, Columbia University, New York, USA; David T. Barnard, University of Regina, Canada; Dan Tufis, Romanian Academy of Sciences, Romania; Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain; Stig Johansson, University of Oslo, Norway; Joseph Mariani, LIMSI-CNRS, France.

The titles published in this series are listed at the end of this volume.

Edited by Tomek Strzalkowski, General Electric, Research & Development.

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.

A C.I.P. Catalogue record for this book is available from the Library of Congress. ISBN 978-90-481-5209-4; ISBN 978-94-017-2388-6 (eBook); DOI 10.1007/978-94-017-2388-6. Printed on acid-free paper.

All Rights Reserved. © 1999 Springer Science+Business Media Dordrecht. Originally published by Kluwer Academic Publishers in 1999. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.

TABLE OF CONTENTS

PREFACE (xiii)

CONTRIBUTING AUTHORS (xxiii)

1 WHAT IS THE ROLE OF NLP IN TEXT RETRIEVAL? Karen Sparck Jones (1)
   1 Introduction (1)
   2 Linguistically-motivated indexing (2)
      2.1 Basic Concepts (2)
      2.2 Complex Descriptions and Terms (6)
   3 Research and tests (9)
      3.1 Phase 1: Experiments from the 1960s to the 1980s (10)
      3.2 Phase 2: The Nineties (16)
      3.3 The TREC Programme (17)
   4 Other roles for NLP (21)

2 NLP FOR TERM VARIANT EXTRACTION: SYNERGY BETWEEN MORPHOLOGY, LEXICON, AND SYNTAX. Christian Jacquemin and Evelyne Tzoukermann (25)
   1 Introduction: From Term Conflation to Linguistic Analysis of Term Variants (26)
      1.1 From Syntactic Variants to Morpho-syntactic Variants (27)
   2 Controlled Indexing and Term Variant Extraction (28)
      2.1 Conflation of Single Word Terms (29)
      2.2 Multi-word Term Conflation (31)
      2.3 Purpose of this Chapter (34)
   3 An Architecture for Controlled Indexing (35)
      3.1 Morphological Analysis (35)
      3.2 Multi-word Term Extraction and Conflation (38)
   4 Morphological Analysis (38)
      4.1 Inflectional Analysis (41)
      4.2 Morpho-syntactic Disambiguation (43)
      4.3 Derivational Analysis (47)
      4.4 System Implementation (50)
      4.5 Advantages of Overgenerating (50)
   5 FASTR: A Tool for Term Variant Extraction (52)
      5.1 A Grammar of Terms and a Metagrammar of Transformations (52)
      5.2 A Metagrammar for Syntactic Variants (54)
      5.3 A Metagrammar for Morpho-syntactic Variants (56)
      5.4 A Method for the Formulation of Metarules (60)
      5.5 Evaluation (66)
   6 Conclusion (70)

3 COMBINING CORPUS LINGUISTICS AND HUMAN MEMORY MODELS FOR AUTOMATIC TERM ASSOCIATION. Gerda Ruge (75)
   1 Introduction (75)
   2 Various Approaches to Automatic Term Association (77)
      2.1 Term Association by Statistic Corpus Analysis (78)
      2.2 Term Association by Linguistically Based Corpus Analysis (79)
   3 Human Memory Models (81)
      3.1 A Well Known Memory Model (81)
      3.2 A Memory Model Explaining Human Recall Capability (82)
   4 Associationism and Term Association (85)
   5 Spreading Activation with Heads and Modifiers (88)
      5.1 Spreading Activation on the Basis of Heads and Modifiers (88)
      5.2 Indirect Activation of Semantically Similar Words (89)
      5.3 Taking into account Synonymous Heads and Modifiers (90)
   6 Experiments (92)
      6.1 Test Data (92)
      6.2 Parameters of the Network (92)
      6.3 Similarity Measure (94)
      6.4 Results (94)
   7 Valuation of the Spreading Activation Approach (95)

4 USING NLP OR NLP RESOURCES FOR INFORMATION RETRIEVAL TASKS. Alan F. Smeaton (99)
   1 Introduction (99)
   2 Early Experiments (100)
   3 Using Natural Language Processes or NLP Resources (103)
   4 Using WordNet for Information Retrieval (105)
   5 Status and Plans (109)

5 EVALUATING NATURAL LANGUAGE PROCESSING TECHNIQUES IN INFORMATION RETRIEVAL. Tomek Strzalkowski, Fang Lin, Jin Wang and Jose Perez-Carballo (113)
   1 Introduction and Motivation (113)
   2 NLP-Based Indexing in Information Retrieval (116)
   3 NLP in Information Retrieval: A Perspective (118)
   4 Stream-based Information Retrieval Model (121)
   5 Advanced Linguistic Streams (123)
      5.1 Head+Modifier Pairs Stream (123)
      5.2 Simple Noun Phrase Stream (128)
      5.3 Name Stream (129)
      5.4 Other Streams (130)
   6 Stream Merging and Weighting (133)
      6.1 Inter-stream merging using score calculation (133)
      6.2 Inter-stream merging using precision distribution estimates (135)
      6.3 Stream coefficients (136)
   7 Query Expansion Experiments (136)
      7.1 Why Query Expansion? (136)
      7.2 Guidelines for manual query expansion (138)
      7.3 Automatic Query Expansion (139)
   8 Summary of Results (140)
      8.1 Ad-hoc runs (140)
      8.2 Routing Runs (141)
   9 Conclusions (142)

6 STYLISTIC EXPERIMENTS IN INFORMATION RETRIEVAL. Jussi Karlgren (147)
   1 Stylistics (147)
   2 Materials and Processing (148)
      2.1 Experiments Performed (148)
      2.2 Corpus (148)
      2.3 Variables Examined (149)
      2.4 On Non-Parametric Multivariate Statistics (153)
      2.5 Correlation Between Variables (153)
   3 Visualizing Stylistic Variation (153)
   4 Stylistics and Relevance (158)
      4.1 Relevance Judgments (158)
      4.2 Relevance of Stylistics to Relevance (159)
      4.3 Hypotheses (160)
      4.4 Results and Discussion (160)
   5 Stylistics and Precision (161)
   6 Further Work (163)

7 EXTRACTION-BASED TEXT CATEGORIZATION: GENERATING DOMAIN-SPECIFIC ROLE RELATIONSHIPS AUTOMATICALLY. Ellen Riloff and Jeffrey Lorenzen (167)
   1 Introduction (167)
   2 Extraction-based text categorization (169)
      2.1 Extraction patterns (169)
      2.2 Relevancy signatures (172)
      2.3 Augmented relevancy signatures (175)
   3 Automatically generating extraction patterns (177)
   4 Word-augmented relevancy signatures (179)
      4.1 Fully Automatic Text Categorization (181)
   5 Experimental results (182)
      5.1 The Terrorism Category (182)
      5.2 The Attack Category (185)
      5.3 The Bombing Category (188)
      5.4 The Kidnapping Category (191)
      5.5 Comparing automatic and hand-crafted dictionaries (193)
   6 Conclusions (194)

8 LASIE JUMPS THE GATE. Yorick Wilks and Robert Gaizauskas (197)
   1 Introduction (197)
   2 Background (199)
      2.1 TIPSTER (200)
   3 GATE Design (201)
      3.1 GDM (202)
      3.2 CREOLE (202)
      3.3 GGI (204)
   4 LaSIE: An Application in GATE (205)
      4.1 Significant Features (207)
      4.2 LaSIE Modules (207)
      4.3 System Performance (209)
   5 Other IE systems and modules within GATE (210)
   6 The European IE scene (211)
   7 Limitations of IE systems (212)
   8 Conclusion (213)

9 PHRASAL TERMS IN REAL-WORLD IR APPLICATIONS. Joe Zhou (215)
   1 Introduction (215)
   2 Phrasing/Proximity in IR: A Compatibility Study (217)
      2.1 Method (217)
      2.2 Empirical Data Input (219)
      2.3 Empirical Data Output (220)
      2.4 Evaluation (221)
      2.5 Claims (224)
   3 Automatic Suggestion of Key Terms (225)
      3.1 Introduction (225)
      3.2 Methodology (227)
      3.3 Results and Discussion (231)
   4 Information Retrieval Applications (247)
      4.1 Introduction (247)
      4.2 Document Surrogater: A Summarization Prototype (248)
      4.3 Document Sampler: A Categorization Prototype (253)
   5 Conclusion (257)

10 NAME RECOGNITION AND RETRIEVAL PERFORMANCE. Paul Thompson and Christopher Dozier (261)
   1 Introduction (261)
   2 Definitions, Problems, and Issues (262)
   3 The Study (264)
      3.1 Name Recognition Accuracy (264)
      3.2 Evaluation of Name Recognition and Retrieval Performance (264)
      3.3 Name Recognition Case Law Collection (265)
   4 Results (266)
      4.1 Name Recognition Accuracy (266)
      4.2 Effect on Retrieval Performance (267)
      4.3 Name Frequencies in the Case Law Collection (267)
   5 Discussion (268)
   6 Conclusions (270)
   A The 38 Case Law Queries with Names Highlighted (272)

11 COLLAGE: AN NLP TOOLSET TO SUPPORT BOOLEAN RETRIEVAL. Jim Cowie (273)
   1 Introduction (273)
   2 Objectives (274)
   3 Rube Goldberg (or Heath Robinson) Recipe (275)
   4 Language Processing Technology (277)
   5 Query Algebra (277)
   6 Topic Structuring (278)
      6.1 Name Recognition (278)
      6.2 Noun Phrase Recognition (279)
   7 Topic Parsing (279)
   8 Query Generation (281)
   9 Document Ranking (281)
   10 BRS/SEARCH (282)
   11 Lexical Resources (283)
   12 WordNet (283)
   13 Transfer Lexicons (283)
   14 Standard Source Lookup (284)
   15 Bi-gram generation (285)
   16 Further Work (286)

12 DOCUMENT CLASSIFICATION AND ROUTING. Louise Guthrie, Joe Guthrie and James Leistensnider (289)
   1 Background (289)
      1.1 Meaning of a Text (290)
      1.2 Flavor of a Text (290)
   2 Introduction (291)
      2.1 The Intuitive Model (291)
      2.2 Routing vs. Classification vs. Retrieval (292)
      2.3 The Relevance of a Topic (293)
      2.4 Some Approaches (294)
      2.5 Overview of this Paper (294)
   3 Application of the Multinomial Distribution to Classification (295)
      3.1 Flavors (295)
      3.2 Word Selection (296)
      3.3 A Simple Test (297)
   4 Application of the Multinomial Distribution to Routing (297)
      4.1 Word Selection (298)
      4.2 Zero Word Counts ...
Recommended publications
  • An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
    Imperial College London, Department of Computing. An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification. Author: Clavance Lim. Supervisors: Prof Alessandra Russo, Nuri Cingillioglu. Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London, September 2019. Contents: Abstract; Acknowledgements; 1 Introduction (Motivation; Aims and objectives; Outline); 2 Background (Overview: text classification; training, validation and test sets; cross validation; hyperparameter optimization; evaluation metrics. Text classification pipeline. Feature extraction: count vectorizer; TF-IDF vectorizer; word embeddings. Classifiers: Naive Bayes classifier; decision tree; random forest; logistic regression; support vector machines; k-Nearest Neighbours ...)
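A minimal sketch of the count-vectorizer/TF-IDF plus classifier pipeline the thesis surveys, using scikit-learn. The toy documents, labels, and the choice of logistic regression are illustrative assumptions, not the thesis's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical snippets standing in for labeled legal/non-legal documents.
docs = [
    "the court dismissed the appeal",
    "the contract was breached by the vendor",
    "the patient was treated for influenza",
    "the vaccine trial enrolled healthy volunteers",
]
labels = ["legal", "legal", "medical", "medical"]

pipeline = Pipeline([
    # Turn raw text into TF-IDF weighted term vectors (the thesis's Section 2.3.2).
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Any of the surveyed classifiers could be swapped in here.
    ("clf", LogisticRegression()),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["the judge ruled on the breach of contract"]))  # -> ['legal']
```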
  • fastText.zip: Compressing Text Classification Models
    Under review as a conference paper at ICLR 2017. FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou & Tomas Mikolov, Facebook AI Research. ABSTRACT: We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Combined with simple approaches specifically adapted to text classification, our approach derived from fastText requires, at test time, only a fraction of the memory compared to the original fastText, without noticeably sacrificing quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. 1 INTRODUCTION: Text classification is an important problem in Natural Language Processing (NLP). Real world use-cases include spam filtering or e-mail categorization. It is a core component in more complex systems such as search and ranking. Recently, deep learning techniques based on neural networks have achieved state of the art results in various NLP applications. One of the main successes of deep learning is due to the effectiveness of recurrent networks for language modeling and their application to speech recognition and machine translation (Mikolov, 2012).
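A minimal sketch of product quantization, the core idea the paper builds on: split each embedding into sub-vectors, run k-means on each block, and store one centroid index per block. This is an illustrative reimplementation with scikit-learn and a random toy embedding matrix, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float32)  # toy embedding matrix

n_subvectors, n_centroids = 8, 16      # 64 dims -> 8 blocks of 8 dims each
block = emb.shape[1] // n_subvectors

codebooks, codes = [], []
for b in range(n_subvectors):
    sub = emb[:, b * block:(b + 1) * block]
    km = KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_.astype(np.uint8))  # one byte per block per word

codes = np.stack(codes, axis=1)  # (n_words, n_subvectors): the compressed store

def reconstruct(word_id):
    """Approximate a word vector from its stored centroid indices."""
    return np.concatenate(
        [codebooks[b][codes[word_id, b]] for b in range(n_subvectors)]
    )

print(np.linalg.norm(emb[0] - reconstruct(0)))  # quantization error for word 0
```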
  • Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms
    Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms. Aleksander Wawer and Agnieszka Mykowiecka, Institute of Computer Science, PAS, Jana Kazimierza 5, 01-248 Warsaw, Poland. Abstract: This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first one is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods ... Unsupervised WSD algorithms aim at resolving word ambiguity without the use of annotated corpora. There are two popular categories of knowledge-based algorithms. The first one originates from the Lesk (1986) algorithm, and exploits the number of common words in two sense definitions (glosses) to select the proper meaning in a context. The Lesk algorithm relies on the set of dictionary entries and the information about the context in which the word occurs. In (Basile et al., 2014) the concept of overlap is replaced by similarity represented by a DSM model. The authors compute the overlap between the gloss of the meaning and the context as a similarity measure between their corresponding vector representations in a semantic space.
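For reference, a simplified Lesk-style gloss-overlap disambiguator of the kind the paper takes as its starting point: pick the sense whose dictionary gloss shares the most words with the context. A sketch assuming the NLTK WordNet data is available, not the paper's own method.

```python
# Requires: nltk.download("wordnet") on first use.
from nltk.corpus import wordnet as wn

def simple_lesk(word, context_words):
    """Return the WordNet sense whose gloss overlaps the context most."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simple_lesk("bank", "the river bank was muddy after the flood".split())
print(sense.name(), "-", sense.definition())
```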
  • Investigating Classification for Natural Language Processing Tasks
    UCAM-CL-TR-721, Technical Report ISSN 1476-2986, Number 721. University of Cambridge Computer Laboratory. Investigating classification for natural language processing tasks. Ben W. Medlock, June 2008. 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom. Phone +44 1223 763500, http://www.cl.cam.ac.uk/. © 2008 Ben W. Medlock. This technical report is based on a dissertation submitted September 2007 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Fitzwilliam College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ (ISSN 1476-2986). Abstract: This report investigates the application of classification techniques to four natural language processing (NLP) tasks. The classification paradigm falls within the family of statistical and machine learning (ML) methods and consists of a framework within which a mechanical 'learner' induces a functional mapping between elements drawn from a particular sample space and a set of designated target classes. It is applicable to a wide range of NLP problems and has met with a great deal of success due to its flexibility and firm theoretical foundations. The first task we investigate, topic classification, is firmly established within the NLP/ML communities as a benchmark application for classification research. Our aim is to arrive at a deeper understanding of how class granularity affects classification accuracy and to assess the impact of representational issues on different classification models. Our second task, content-based spam filtering, is a highly topical application for classification techniques due to the ever-worsening problem of unsolicited email.
  • Linked Data Triples Enhance Document Relevance Classification
    Applied Sciences, Article: Linked Data Triples Enhance Document Relevance Classification. Dinesh Nagumothu, Peter W. Eklund, Bahadorreza Ofoghi and Mohamed Reda Bouadjenek, School of Information Technology, Deakin University, Geelong, VIC 3220, Australia. Abstract: Standardized approaches to relevance classification in information retrieval use generative statistical models to identify the presence or absence of certain topics that might make a document relevant to the searcher. These approaches have been used to better predict relevance on the basis of what the document is "about", rather than a simple-minded analysis of the bag of words contained within the document. In more recent times, this idea has been extended by using pre-trained deep learning models and text representations, such as GloVe or BERT. These use an external corpus as a knowledge-base that conditions the model to help predict what a document is about. This paper adopts a hybrid approach that leverages the structure of knowledge embedded in a corpus. In particular, the paper reports on experiments where linked data triples (subject-predicate-object), constructed from natural language elements, are derived from deep learning. These are evaluated as additional latent semantic features for a relevant document classifier in a customized news-feed website. The research is a synthesis of current thinking in deep learning models in NLP and information retrieval and the predicate structure used in semantic web research. Our experiments indicate that linked data triples increased the F-score of the baseline GloVe representations by 6%.
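A minimal sketch of the hybrid feature idea: concatenating a document's dense embedding with binary indicators for extracted subject-predicate-object triples. The vectors and triples here are toy stand-ins; the paper's actual triple extraction uses a deep-learning pipeline.

```python
import numpy as np

# Toy stand-ins: a dense document vector (e.g. averaged GloVe embeddings)
# plus binary features marking which known triples the document contains.
doc_embedding = np.random.rand(100)
known_triples = [("fed", "raised", "rates"), ("court", "upheld", "ruling")]
doc_triples = {("fed", "raised", "rates")}  # triples extracted from this document

triple_features = np.array([1.0 if t in doc_triples else 0.0 for t in known_triples])
features = np.concatenate([doc_embedding, triple_features])  # classifier input
print(features.shape)  # (102,): dense features plus latent triple indicators
```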
  • Natural Language Processing Security- and Defense-Related Lessons Learned
    July 2021, Perspective: Expert Insights on a Timely Policy Issue. Peter Schirmer, Amber Jaycocks, Sean Mann, William Marcellino, Luke J. Matthews, John David Parsons, David Schulker. Natural Language Processing Security- and Defense-Related Lessons Learned. This Perspective offers a collection of lessons learned from RAND Corporation projects that employed natural language processing (NLP) tools and methods. It is written as a reference document for the practitioner and is not intended to be a primer on concepts, algorithms, or applications, nor does it purport to be a systematic inventory of all lessons relevant to NLP or data analytics. It is based on a convenience sample of NLP practitioners who spend or spent a majority of their time at RAND working on projects related to national defense, national intelligence, international security, or homeland security; thus, the lessons learned are drawn largely from projects in these areas. Although few of the lessons are applicable exclusively to the U.S. Department of Defense (DoD) and its NLP tasks, many may prove particularly salient for DoD, because its terminology is very domain-specific and full of jargon, much of its data are classified or sensitive, its computing environment is more restricted, and its information systems are generally not designed to support large-scale analysis. This Perspective addresses each of these issues and many more. The presentation prioritizes readability over literary grace. We use NLP as an umbrella term for the range of tools ...
  • Using Wordnet to Disambiguate Word Senses for Text Classification
    Using WordNet to Disambiguate Word Senses for Text Classification. Ying Liu, Peter Scheuermann, Xingsen Li, and Xingquan Zhu. Data Technology and Knowledge Economy Research Center, Chinese Academy of Sciences, Graduate University of Chinese Academy of Sciences, 100080, Beijing, China; Department of Electrical and Computer Engineering, Northwestern University, Evanston, Illinois, USA, 60208. Abstract: In this paper, we propose an automatic text classification method based on word sense disambiguation. We use the "hood" algorithm to remove word ambiguity so that each word is replaced by its sense in the context. The nearest ancestors of the senses of all the non-stopwords in a given document are selected as the classes for that document. We apply our algorithm to the Brown Corpus. The effectiveness is evaluated by comparing the classification results with the classification results using the manual disambiguation offered by Princeton University. Keywords: disambiguation, word sense, text classification, WordNet. 1 Introduction: Text classification aims at automatically assigning a document to a pre-defined topic category. A number of machine learning algorithms have been investigated for text classification, such as K-Nearest Neighbor (KNN) [1], Centroid Classifier [2], Naïve Bayes [3], Decision Trees [4], and Support Vector Machines (SVM) [4]. In such classifiers, each document is represented by an n-dimensional vector in a feature space, where each feature is a keyword in the given document. Traditional classification algorithms can then be applied to generate a classification model. To classify an unseen document, a feature vector is constructed by using the same set of n features and then passed to the model as the input.
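A sketch of the ancestor-as-class idea described above: map each noun to a hypernym a few levels up in WordNet and take the most frequent ancestors as the document's classes. The first-sense choice below stands in for the paper's "hood" disambiguation step, and the depth, input words, and function names are illustrative assumptions.

```python
# Requires: nltk.download("wordnet") on first use.
from collections import Counter
from nltk.corpus import wordnet as wn

def ancestor_classes(nouns, depth=2, top_k=3):
    """Climb `depth` hypernym levels per noun; return the commonest ancestors."""
    counts = Counter()
    for noun in nouns:
        senses = wn.synsets(noun, pos=wn.NOUN)
        if not senses:
            continue
        node = senses[0]  # stand-in for the paper's "hood" WSD step
        for _ in range(depth):  # move toward more general concepts
            hypernyms = node.hypernyms()
            if not hypernyms:
                break
            node = hypernyms[0]
        counts[node.name()] += 1
    return [name for name, _ in counts.most_common(top_k)]

print(ancestor_classes(["dog", "cat", "wolf", "car", "truck"]))
```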
  • An Automatic Text Document Classification Using Modified Weight and Semantic Method
    International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-12, October 2019. An Automatic Text Document Classification using Modified Weight and Semantic Method. K. Meena, R. Lawrance. Abstract: Text mining is the process of extracting useful information from structured or unstructured sources, and feature extraction is one of its vital parts. This paper analyses some existing feature extraction methods and proposes an enhanced one. The Term Frequency-Inverse Document Frequency (TF-IDF) method assigns a weight to a term based only on how often it occurs; the enhanced method additionally increases the weight of the most important words and decreases the weight of the less important ones, and is called M-TF-IDF. Since M-TF-IDF does not consider the semantic similarity between terms, the Latent Semantic Analysis (LSA) method is used for feature extraction and dimensionality reduction. To analyze the performance of the proposed feature extraction methods, two benchmark datasets, Reuters-21578-R8 and 20 Newsgroups, and two real time ... Text analytics involves information transformation, pattern recognition, information retrieval, association analysis, predictive analytics and visualization. The goal of text mining is to turn text data into useful data for analysis with the help of analytical methods and Natural Language Processing (NLP). The increased use of text management on the internet has spurred research activity in text mining. Nowadays everyone uses the internet for surfing, posting and creating blogs, and the internet holds a massive collection of text content of varying size and structure. To analyze the text content of the internet efficiently and effectively, it must be classified or clustered.
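A minimal sketch of the TF-IDF-then-LSA step described above, using scikit-learn. Plain TF-IDF stands in for the paper's M-TF-IDF weighting, and the toy corpus is illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "oil prices rose sharply in asian trade",
    "crude futures fell as opec lifted output",
    "the central bank cut interest rates",
    "rate cuts boosted bank lending margins",
]

# Step 1: weight terms (the paper would use M-TF-IDF here instead).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Step 2: LSA via truncated SVD for dimensionality reduction.
lsa = TruncatedSVD(n_components=2, random_state=0)
features = lsa.fit_transform(tfidf)
print(features.shape)  # (4, 2): dense document vectors in the latent space
```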
  • The Impact of NLP Techniques in the Multilabel Text Classification Problem
    The impact of NLP techniques in the multilabel text classification problem. Teresa Gonçalves and Paulo Quaresma, Departamento de Informática, Universidade de Évora, 7000 Évora, Portugal. Abstract: Support Vector Machines have been used successfully to classify text documents into sets of concepts. However, typically, linguistic information is not being used in the classification process or its use has not been fully evaluated. We apply and evaluate two basic linguistic procedures (stop-word removal and stemming/lemmatization) on the multilabel text classification problem. These procedures are applied to the Reuters dataset and to Portuguese juridical documents from Supreme Courts and the Attorney General's Office. 1 Introduction: Automatic classification of documents is an important problem in many domains. Just as an example, it is needed by web search engines and information retrieval systems to organise text bases into sets of semantic categories. In order to develop better algorithms for document classification it is necessary to integrate research from several areas. In this paper, we evaluate the impact of using natural language processing (NLP) and information retrieval (IR) techniques in a machine learning algorithm. Support Vector Machines (SVM) was the chosen classification algorithm. Several approaches to automatically classify text documents using linguistic transformations and Machine Learning have been pursued. For example, there are approaches that present alternative ways of representing documents using the RIPPER algorithm [10], that use Naïve Bayes to select features [5], and that generate phrasal features using both [2].
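A sketch of the two linguistic procedures under evaluation (stop-word removal and stemming) feeding an SVM, assuming NLTK's Porter stemmer and scikit-learn. The toy documents and labels are illustrative, not the Reuters or juridical corpora.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # stop-word removal
    return " ".join(stemmer.stem(t) for t in tokens)             # stemming

docs = [
    "the courts ruled on the appeals",
    "appealing the court ruling",
    "football teams played their matches",
    "the team won the football match",
]
labels = ["juridical", "juridical", "sports", "sports"]

X = CountVectorizer().fit_transform(preprocess(d) for d in docs)
clf = LinearSVC().fit(X, labels)  # SVM, as in the paper's setup
```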
  • Word Embeddings for Multi-Label Document Classification
    Word Embeddings for Multi-label Document Classification. Ladislav Lenc and Pavel Král, Department of Computer Science and Engineering and NTIS (New Technologies for the Information Society), University of West Bohemia, Plzeň, Czech Republic. Abstract: In this paper, we analyze and evaluate word embeddings for representation of longer texts in the multi-label document classification scenario. The embeddings are used in three convolutional neural network topologies. The experiments are realized on the Czech ČTK and English Reuters-21578 standard corpora. We compare the results of word2vec static and trainable embeddings with randomly initialized word vectors. We conclude that initialization does not play an important role for classification. However, learning of word vectors is crucial to obtain good results. 1 Introduction: ... approaches including computer vision and natural language processing. Therefore, in this work we use different approaches based on convolutional neural nets (CNNs) which were already presented in Kim (2014) and Lenc and Král (2017). Usually, the pre-trained word vectors obtained by some semantic model (e.g. word2vec (w2v) (Mikolov et al., 2013a) or GloVe (Pennington et al., 2014)) are used for initialization of the embedding layer of the particular neural net. These vectors can then be progressively adapted during neural network training. It was shown in many experiments that it is possible to obtain better results using these vectors compared to randomly initialized vectors. Moreover, it has been proven that even "static" vectors (initialized by pre-trained embeddings and fixed during the network training) usually bring better performance than randomly initialized and trained ones.
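A sketch of the static-versus-trainable comparison: the same convolutional topology built twice, with the embedding layer either frozen or fine-tuned. Keras is an assumed framework here (the paper does not specify this code), and the random matrix stands in for pre-trained word2vec vectors.

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab_size, emb_dim, n_labels = 5000, 100, 10
pretrained = np.random.rand(vocab_size, emb_dim)  # stand-in for word2vec vectors

def build_cnn(trainable_embeddings):
    model = models.Sequential([
        # Initialize the embedding layer from the pre-trained matrix; freeze it
        # for the "static" variant, let it adapt for the trainable one.
        layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=initializers.Constant(pretrained),
                         trainable=trainable_embeddings),
        layers.Conv1D(128, 5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(n_labels, activation="sigmoid"),  # multi-label output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

static_model = build_cnn(trainable_embeddings=False)    # "static" variant
trainable_model = build_cnn(trainable_embeddings=True)  # fine-tuned variant
```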
  • Experiments with One-Class SVM and Word Embeddings for Document Classification
    Experiments with One-Class SVM and Word Embeddings for Document Classification. Jingnan Bi (University of Texas at Dallas, School of Economic, Political, and Policy Sciences), Hyoungah Kim (Chung-Ang University, Institute of Public Policy and Administration), and Vito D'Orazio (University of Texas at Dallas). Abstract: Researchers often possess a document set that describes some concept of interest, but which has not been systematically collected. This means that the researcher's set of known, relevant documents is of little use in machine learning applications where the training data is ideally sampled from the same set as the testing data. Here, we propose and test several methods designed to help solve this problem. The central idea is to combine a one-class classifier, here the OCC-SVM, with feature engineering methods that we expect to make the input data more generalizable. Specifically, we use two word embeddings approaches, and combine each with a topic model approach. Our experiments show that Word2Vec with Vector Averaging produces the best model. Furthermore, this model is able to maintain high levels of recall at moderate levels of precision. This is valuable for researchers who place a high cost on false negatives. Keywords: document classification, content analysis, one-class algorithms, word embeddings. 1 Introduction: Text classification is widely used in the computational social sciences [5, 10, 12, 25]. Typically, the initial corpus is not labeled, but researchers wish to classify it to identify relevant information on some concept of interest. In an ideal setting, and to use standard supervised machine learning, a researcher would randomly sample from the corpus, label the sampled documents, and use that as training data.
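A minimal sketch of the best-reported configuration, Word2Vec with vector averaging feeding a one-class SVM, using scikit-learn. The tiny random "embedding table" and documents are illustrative stand-ins for trained word2vec vectors and a real relevant-document set.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy embedding lookup standing in for a trained word2vec model.
w2v = {w: rng.standard_normal(50) for w in
       ["troop", "border", "clash", "treaty", "ceasefire", "market", "goal"]}

def doc_vector(tokens):
    """Average the word vectors of in-vocabulary tokens (vector averaging)."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

relevant_docs = [["troop", "border", "clash"], ["treaty", "ceasefire"]]
X = np.stack([doc_vector(d) for d in relevant_docs])

# Train on the known-relevant set only; no negative examples needed.
occ = OneClassSVM(kernel="rbf", nu=0.1).fit(X)
print(occ.predict([doc_vector(["border", "treaty"])]))  # +1 relevant, -1 not
```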
  • A Novel Approach to Document Classification Using Wordnet
    A Novel Approach to Document Classification using WordNet. Koushiki Sarkar and Ritwika Law. ABSTRACT: Content based Document Classification is one of the biggest challenges in the context of free text mining. Current algorithms for document classification mostly rely on cluster analysis based on a bag-of-words approach, yet that method is still being applied to many modern scientific dilemmas and has established a strong enough presence in fields like economics and social science to merit serious attention from researchers. In this paper we propose and explore an alternative grounded more securely on dictionary classification and the correlatedness of words and phrases. It is expected that applying our existing knowledge about the underlying classification structure may improve the classifier's performance. 1. INTRODUCTION: Content based Document Classification is one of the biggest challenges in the context of free text mining. This is a problem relevant to many areas of the Physical and Social Sciences. At a basic level, it is a challenge of identifying features, useful for classification, in qualitative data. At a more applied level, it can be used for classifying sources (Human, Machine or Nature) of such data. Among the more important recent applications, we note its usage in Social Networking sites (see [1], [2], [5] and [6]), Medical Sciences [3] and Media [4], among others. One of the major problems with most data classification models is that the classification is essentially blind. Current algorithms for document classification mostly rely on cluster analysis based on a bag-of-words approach. The basic classifiers that use a bag-of-words approach, and the more sophisticated Bayesian classification algorithms, mostly use word frequency in one form or another.
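One way to make the proposed correlatedness idea concrete: score a document against each category by the WordNet path similarity between its words and hand-picked seed words, rather than by raw word frequency. A sketch under assumed categories and seeds, not the authors' algorithm.

```python
# Requires: nltk.download("wordnet") on first use.
from nltk.corpus import wordnet as wn

# Hypothetical categories with hand-picked seed words.
seeds = {"finance": ["money", "bank", "market"],
         "sport": ["game", "team", "score"]}

def best_similarity(w1, w2):
    """Highest path similarity over all sense pairs of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def classify(words):
    totals = {cat: sum(max(best_similarity(w, s) for s in seed_list)
                       for w in words)
              for cat, seed_list in seeds.items()}
    return max(totals, key=totals.get)

print(classify(["stock", "investor", "profit"]))  # expected: finance
```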