A Hybrid Document Features Extraction with Clustering Based Classification Framework on Large Document Sets
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification
University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 5-2020 Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification Marion Pauline Chiariglione University of Arkansas, Fayetteville Follow this and additional works at: https://scholarworks.uark.edu/etd Part of the Numerical Analysis and Scientific Computing Commons, and the Theory and Algorithms Commons Citation Chiariglione, M. P. (2020). Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification. Theses and Dissertations Retrieved from https://scholarworks.uark.edu/etd/3580 This Thesis is brought to you for free and open access by ScholarWorks@UARK. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of ScholarWorks@UARK. For more information, please contact [email protected]. Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science by Marion Pauline Chiariglione IUT Dijon, University of Burgundy Bachelor of Science in Computer Science, 2017 May 2020 University of Arkansas This thesis is approved for recommendation to the Graduate Council Susan Gauch, Ph.D. Thesis Director Qinghua Li, Ph.D. Committee member Khoa Luu, Ph.D. Committee member Abstract Quoting a borrowed excerpt of text within another literary work was infrequently done prior to the beginning of the eighteenth century. However, quoting other texts, particularly Shakespeare, became quite common after that. Our work develops automatic approaches to identify that trend. Initial work focuses on identifying exact and modified sections of texts taken from works of Shakespeare in novels spanning the eighteenth century. We then introduce a novel approach to identifying modified quotes by adapting the Edit Distance metric, which is character based, to a word based approach. -
Using N-Grams to Understand the Nature of Summaries
Using N-Grams to Understand the Nature of Summaries Michele Banko and Lucy Vanderwende One Microsoft Way Redmond, WA 98052 {mbanko, lucyv}@microsoft.com views of the event being described over different Abstract documents, or present a high-level view of an event that is not explicitly reflected in any single document. A Although single-document summarization is a useful multi-document summary may also indicate the well-studied task, the nature of multi- presence of new or distinct information contained within document summarization is only beginning to a set of documents describing the same topic (McKeown be studied in detail. While close attention has et. al., 1999, Mani and Bloedorn, 1999). To meet these been paid to what technologies are necessary expectations, a multi-document summary is required to when moving from single to multi-document generalize, condense and merge information coming summarization, the properties of human- from multiple sources. written multi-document summaries have not Although single-document summarization is a well- been quantified. In this paper, we empirically studied task (see Mani and Maybury, 1999 for an characterize human-written summaries overview), multi-document summarization is only provided in a widely used summarization recently being studied closely (Marcu & Gerber 2001). corpus by attempting to answer the questions: While close attention has been paid to multi-document Can multi-document summaries that are summarization technologies (Barzilay et al. 2002, written by humans be characterized as Goldstein et al 2000), the inherent properties of human- extractive or generative? Are multi-document written multi-document summaries have not yet been summaries less extractive than single- quantified. -
An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
Imperial College London Department of Computing An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification Supervisors: Author: Prof Alessandra Russo Clavance Lim Nuri Cingillioglu Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London September 2019 Contents Abstract 1 Acknowledgements 2 1 Introduction 3 1.1 Motivation .................................. 3 1.2 Aims and objectives ............................ 4 1.3 Outline .................................... 5 2 Background 6 2.1 Overview ................................... 6 2.1.1 Text classification .......................... 6 2.1.2 Training, validation and test sets ................. 6 2.1.3 Cross validation ........................... 7 2.1.4 Hyperparameter optimization ................... 8 2.1.5 Evaluation metrics ......................... 9 2.2 Text classification pipeline ......................... 14 2.3 Feature extraction ............................. 15 2.3.1 Count vectorizer .......................... 15 2.3.2 TF-IDF vectorizer ......................... 16 2.3.3 Word embeddings .......................... 17 2.4 Classifiers .................................. 18 2.4.1 Naive Bayes classifier ........................ 18 2.4.2 Decision tree ............................ 20 2.4.3 Random forest ........................... 21 2.4.4 Logistic regression ......................... 21 2.4.5 Support vector machines ...................... 22 2.4.6 k-Nearest Neighbours ....................... -
Fasttext.Zip: Compressing Text Classification Models
Under review as a conference paper at ICLR 2017 FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve´ Jegou´ & Tomas Mikolov Facebook AI Research fajoulin,egrave,bojanowski,matthijs,rvj,[email protected] ABSTRACT We consider the problem of producing compact architectures for text classifica- tion, such that the full model fits in a limited amount of memory. After consid- ering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quan- tization artefacts. Combined with simple approaches specifically adapted to text classification, our approach derived from fastText requires, at test time, only a fraction of the memory compared to the original FastText, without noticeably sacrificing quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. 1 INTRODUCTION Text classification is an important problem in Natural Language Processing (NLP). Real world use- cases include spam filtering or e-mail categorization. It is a core component in more complex sys- tems such as search and ranking. Recently, deep learning techniques based on neural networks have achieved state of the art results in various NLP applications. One of the main successes of deep learning is due to the effectiveness of recurrent networks for language modeling and their application to speech recognition and machine translation (Mikolov, 2012). -
Multi-Document Biography Summarization
Multi-document Biography Summarization Liang Zhou, Miruna Ticrea, Eduard Hovy University of Southern California Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 {liangz, miruna, hovy} @isi.edu Abstract In this paper we describe a biography summarization system using sentence classification and ideas from information retrieval. Although the individual techniques are not new, assembling and applying them to generate multi-document biographies is new. Our system was evaluated in DUC2004. It is among the top performers in task 5–short summaries focused by person questions. 1 Introduction Automatic text summarization is one form of information management. It is described as selecting a subset of sentences from a document that is in size a small percentage of the original and Figure 1. Overall design of the biography yet is just as informative. Summaries can serve as summarization system. surrogates of the full texts in the context of To determine what and how sentences are Information Retrieval (IR). Summaries are created selected and ranked, a simple IR method and from two types of text sources, a single document experimental classification methods both or a set of documents. Multi-document contributed. The set of top-scoring sentences, after summarization (MDS) is a natural and more redundancy removal, is the resulting biography. elaborative extension of the single-document As yet, the system contains no inter-sentence summarization, and poses additional difficulties on ‘smoothing’ stage. algorithm design. Various kinds of summaries fall In this paper, work in related areas is discussed into two broad categories: generic summaries are in Section 2; a description of our biography corpus the direct derivatives of the source texts; special- used for training and testing the classification interest summaries are generated in response to component is in Section 3; Section 4 explains the queries or topic-oriented questions. -
Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms
Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms Aleksander Wawer Agnieszka Mykowiecka Institute of Computer Science Institute of Computer Science PAS PAS Jana Kazimierza 5 Jana Kazimierza 5 01-248 Warsaw, Poland 01-248 Warsaw, Poland [email protected] [email protected] Abstract Unsupervised WSD algorithms aim at resolv- ing word ambiguity without the use of annotated This paper compares two approaches to corpora. There are two popular categories of word sense disambiguation using word knowledge-based algorithms. The first one orig- embeddings trained on unambiguous syn- inates from the Lesk (1986) algorithm, and ex- onyms. The first one is an unsupervised ploit the number of common words in two sense method based on computing log proba- definitions (glosses) to select the proper meaning bility from sequences of word embedding in a context. Lesk algorithm relies on the set of vectors, taking into account ambiguous dictionary entries and the information about the word senses and guessing correct sense context in which the word occurs. In (Basile et from context. The second method is super- al., 2014) the concept of overlap is replaced by vised. We use a multilayer neural network similarity represented by a DSM model. The au- model to learn a context-sensitive transfor- thors compute the overlap between the gloss of mation that maps an input vector of am- the meaning and the context as a similarity mea- biguous word into an output vector repre- sure between their corresponding vector represen- senting its sense. We evaluate both meth- tations in a semantic space. -
Investigating Classification for Natural Language Processing Tasks
UCAM-CL-TR-721 Technical Report ISSN 1476-2986 Number 721 Computer Laboratory Investigating classification for natural language processing tasks Ben W. Medlock June 2008 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2008 Ben W. Medlock This technical report is based on a dissertation submitted September 2007 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Fitzwilliam College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Abstract This report investigates the application of classification techniques to four natural lan- guage processing (NLP) tasks. The classification paradigm falls within the family of statistical and machine learning (ML) methods and consists of a framework within which a mechanical `learner' induces a functional mapping between elements drawn from a par- ticular sample space and a set of designated target classes. It is applicable to a wide range of NLP problems and has met with a great deal of success due to its flexibility and firm theoretical foundations. The first task we investigate, topic classification, is firmly established within the NLP/ML communities as a benchmark application for classification research. Our aim is to arrive at a deeper understanding of how class granularity affects classification accuracy and to assess the impact of representational issues on different classification models. Our second task, content-based spam filtering, is a highly topical application for classification techniques due to the ever-worsening problem of unsolicited email. -
Linked Data Triples Enhance Document Relevance Classification
applied sciences Article Linked Data Triples Enhance Document Relevance Classification Dinesh Nagumothu * , Peter W. Eklund , Bahadorreza Ofoghi and Mohamed Reda Bouadjenek School of Information Technology, Deakin University, Geelong, VIC 3220, Australia; [email protected] (P.W.E.); [email protected] (B.O.); [email protected] (M.R.B.) * Correspondence: [email protected] Abstract: Standardized approaches to relevance classification in information retrieval use generative statistical models to identify the presence or absence of certain topics that might make a document relevant to the searcher. These approaches have been used to better predict relevance on the basis of what the document is “about”, rather than a simple-minded analysis of the bag of words contained within the document. In more recent times, this idea has been extended by using pre-trained deep learning models and text representations, such as GloVe or BERT. These use an external corpus as a knowledge-base that conditions the model to help predict what a document is about. This paper adopts a hybrid approach that leverages the structure of knowledge embedded in a corpus. In particular, the paper reports on experiments where linked data triples (subject-predicate-object), constructed from natural language elements are derived from deep learning. These are evaluated as additional latent semantic features for a relevant document classifier in a customized news- feed website. The research is a synthesis of current thinking in deep learning models in NLP and information retrieval and the predicate structure used in semantic web research. Our experiments Citation: Nagumothu, D.; Eklund, indicate that linked data triples increased the F-score of the baseline GloVe representations by 6% P.W.; Ofoghi, B.; Bouadjenek, M.R. -
Quesgen Using Nlp 01
Quesgen Using Nlp 01 International Journal of Latest Trends in Engineering and Technology Vol.(13)Issue(2), pp.009-014 DOI: http://dx.doi.org/10.21172/1.132.02 e-ISSN:2278-621X QUESGEN USING NLP Pawan NGP1, Pooja Bahuguni2, Pooja Dattatri3, Shilpi Kumari4, Vikranth B.M5 Abstract— when people read for long hours, they seldom are able to grasp concepts and it gives them false sense of understanding it. The aim of this project is to tackle this problem by processing given text and generating applicable questions and answer. The steps followed are: 1. Candidate key sentences are selected (using Text Rank). 2. Candidate key words are selected from candidate key sentences (RAKE). 3. These selected key sentences and words are stored in the database (MongoDB) and presented to the user through chatbot interface. Keywords— NLP, NLP toolkit, Sentence extraction, Keyword extraction, ChatBot, RAKE, TextRank 1. INTRODUCTION Humans are the most curious by nature. Asking Questions to meet their never-ending quest for information and knowledge. For Example,teachers ask students, questions to evaluate performance of the students, pupils learn by asking questions to teachers,and even our normal life conversation consists of asking questions. Questions are the major part of countless learning interactions. However, with the advent of technology, attention spans of individuals have significantly gone down and they are not able to ask good questions. It has been noticed that when people try to read for long hours, they seldom are able to grasp concepts. But having spent some time reading gives people a false sense of understanding it. -
An Automatic Text Summarization for Malayalam Using Sentence Extraction
International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-3, Issue-8, Aug.-2015 AN AUTOMATIC TEXT SUMMARIZATION FOR MALAYALAM USING SENTENCE EXTRACTION 1RENJITH S R, 2SONY P 1M.Tech Computer and Information Science, Dept.of Computer Science, College of Engineering Cherthala Kerala, India-688541 2Assistant Professor, Dept. of Computer Science, College of Engineering Cherthala, Kerala, India-688541 Abstract—Text Summarization is the process of generating a short summary for the document that contains the significant portion of information. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter less redundant extract of the original text. The proposed method is a sentence extraction based single document text summarization which produces a generic summary for a Malayalam document. Sentences are ranked based on feature scores and Googles PageRank formula. Top k ranked sentences will be included in summary where k depends on the compression ratio between original text and summary. Performance evaluation will be done by comparing the summarization outputs with manual summaries generated by human evaluators. Keywords—Text summarization, Sentence Extraction, Stemming, TF-ISF score, Sentence similarity, PageRank formula, Summary generation. I. INTRODUCTION a summary, which represents the subject matter of an article by understanding the whole meaning, which With enormous growth of information on cyberspace, are generated by reformulating the salient unit conventional Information Retrieval techniques have selected from an input sentences. It may contain some become inefficient for finding relevant information text units which are not present in the input text. An effectively. When we give a keyword to be searched extract is a summary consisting of a number of on the internet, it returns thousands of documents sentences selected from the input text.Sentence overwhelming the user. -
Natural Language Processing Security- and Defense-Related Lessons Learned
July 2021 Perspective EXPERT INSIGHTS ON A TIMELY POLICY ISSUE PETER SCHIRMER, AMBER JAYCOCKS, SEAN MANN, WILLIAM MARCELLINO, LUKE J. MATTHEWS, JOHN DAVID PARSONS, DAVID SCHULKER Natural Language Processing Security- and Defense-Related Lessons Learned his Perspective offers a collection of lessons learned from RAND Corporation projects that employed natural language processing (NLP) tools and methods. It is written as a reference document for the practitioner Tand is not intended to be a primer on concepts, algorithms, or applications, nor does it purport to be a systematic inventory of all lessons relevant to NLP or data analytics. It is based on a convenience sample of NLP practitioners who spend or spent a majority of their time at RAND working on projects related to national defense, national intelligence, international security, or homeland security; thus, the lessons learned are drawn largely from projects in these areas. Although few of the lessons are applicable exclusively to the U.S. Department of Defense (DoD) and its NLP tasks, many may prove particularly salient for DoD, because its terminology is very domain-specific and full of jargon, much of its data are classified or sensitive, its computing environment is more restricted, and its information systems are gen- erally not designed to support large-scale analysis. This Perspective addresses each C O R P O R A T I O N of these issues and many more. The presentation prioritizes • identifying studies conducting health service readability over literary grace. research and primary care research that were sup- We use NLP as an umbrella term for the range of tools ported by federal agencies. -
Automatic Document Summarization by Sentence Extraction
Вычислительные технологии Том 12, № 5, 2007 AUTOMATIC DOCUMENT SUMMARIZATION BY SENTENCE EXTRACTION R. M. Aliguliyev Institute of Information Technology of the National Academy of Sciences of Azerbaijan, Baku e-mail: [email protected] Представлен метод автоматического реферирования документов, который ге- нерирует резюме документа путем кластеризации и извлечения предложений из исходного документа. Преимущество предложенного подхода в том, что сгенери- рованное резюме документа может включать основное содержание практически всех тем, представленных в документе. Для определения оптимального числа кла- стеров введен критерий оценки качества кластеризации. Introduction Automatic document processing is a research field that is currently extremely active. One important task in this field is automatic document summarization, which preserves its information content [1]. With a large volume of text documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Text search and summarization are the two essential technologies that complement each other. Text search engines return a set of documents that seem to be relevant to the user’s query, and text enable quick examinations through the returned documents. In general, automatic document summarization takes a source document (or source documents) as input, extracts the essence of the source(s), and presents a well-formed summary to the user. Mani and Maybury [1] formally defined automatic document summari- zation as the process of distilling the most important information from a source(s) to produce an abridged version for a particular user (or users) and task (or tasks). The process can be decomposed into three phases: analysis, transformation and synthesis. The analysis phase analyzes the input document and selects a few salient features.