Vec2GC - A Graph Based Clustering Method for Text Representations


Rajesh N Rao and Manojit Chakraborty
Robert Bosch Research and Technology Center, Bangalore, India

arXiv:2104.09439v1 [cs.IR] 15 Apr 2021

ABSTRACT
NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce a novel clustering algorithm, Vec2GC (Vector to Graph Communities), an end-to-end pipeline to cluster terms or documents for any given text corpus. Our method uses community detection on a weighted graph of the terms or documents, created using text representation learning. The Vec2GC clustering algorithm is a density-based approach that supports hierarchical clustering as well.

KEYWORDS
text clustering, embeddings, document clustering, graph clustering

1 INTRODUCTION
Dealing with a large corpus of unlabeled domain-specific documents is a common challenge in industrial NLP pipelines. Unsupervised algorithms like clustering are the first step in processing an unlabeled corpus to get an overview of the data distribution. Visual exploration based on clustering and dimensionality reduction algorithms provides a good overview of the data distribution; in this context, clustering and dimensionality reduction are important steps. Dimensionality reduction techniques like PCA [5], t-SNE [18] or UMAP [12] map the documents from the embedding space to a 2-dimensional space, as shown in Figure 1. A clustering algorithm groups semantically similar documents or terms together. Traditional clustering algorithms like k-means [9], k-medoids [16], DBSCAN [4] or HDBSCAN [11], with a distance metric derived from cosine similarity [10], do not do a very good job on this.

[Figure 1: UMAP 2D plot of the 20 Newsgroups dataset]

We propose Vec2GC (Vector To Graph Community), a clustering algorithm that converts the terms or documents in the vector embedding space [13][7] to a graph and generates clusters based on a graph community detection algorithm.

2 LITERATURE SURVEY
Hossain and Angryk [6] represented text documents as hierarchical document-graphs to extract frequent subgraphs for generating sense-based document clusters. Wang et al. [19] used vector representations of documents and ran k-means clustering on them to understand the general representation power of various embedding generation models. Angelov [1] proposed Top2Vec, which uses joint document and word semantic embedding to find topic vectors, using HDBSCAN as the clustering method to find dense regions in the embedding space. Saiyad et al. [16] presented a survey covering major significant works on semantic document clustering based on latent semantic indexing, graph representations, ontology and lexical chains.

3 ALGORITHM DETAILS
Vec2GC converts the vector space embeddings of terms or documents to a graph and performs clustering on the constructed graph. The algorithm consists of two steps: construction of the graph and generation of clusters using a graph community detection algorithm.

3.1 Graph Construction
For the graph construction, we consider each term or document embedding as a node. A node is denoted by a and its embedding by E_a. To construct the graph, we measure the cosine similarity of the embeddings, equation (1). An edge is drawn between two nodes if their cosine similarity is greater than a specific threshold θ, which is a tunable parameter in our algorithm.

    cs(a, b) = \frac{E_a \cdot E_b}{\lVert E_a \rVert \lVert E_b \rVert}    (1)

The edge weight is determined by the cosine similarity value and is given by equation (2):

    E(a, b) = \begin{cases} 0 & \text{if } cs(a, b) < \theta \\ \frac{1}{1 - cs(a, b)} & \text{if } cs(a, b) \geq \theta \end{cases}    (2)

Equation (2) maps the cosine similarity to edge weight as shown below:

    (\theta, 1) \rightarrow \left(\frac{1}{1 - \theta}, \infty\right)    (3)

As cosine similarity tends to 1, the edge weight tends to infinity. Note that in a graph, a higher edge weight corresponds to stronger connectivity. Also, the weights are mapped non-linearly from cosine similarity to edge weight, which increases separability between two node pairs that have similar cosine similarities. For example, a pair of nodes with cs(a, b) = 0.9 and another pair with cs(x, y) = 0.95 would have edge weights of 10 and 20 respectively. A stronger separation is created for cosine similarities closer to 1; thus a higher weight is given to embeddings that are very similar to each other.
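As a concrete illustration of the construction just described, the following is a minimal sketch, not the authors' implementation: the use of networkx, the default theta, and the clamp that keeps duplicate embeddings from producing an infinite weight are all our assumptions.

    import numpy as np
    import networkx as nx

    def build_similarity_graph(embeddings, theta=0.8):
        """One node per term/document embedding; edges only where cosine
        similarity >= theta, weighted by 1/(1 - cs) as in equation (2)."""
        emb = np.asarray(embeddings, dtype=float)
        # Normalize rows so a dot product of two rows is their cosine similarity.
        normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        cs = normed @ normed.T  # pairwise cosine similarities, equation (1)
        g = nx.Graph()
        g.add_nodes_from(range(len(emb)))
        for a in range(len(emb)):
            for b in range(a + 1, len(emb)):
                if cs[a, b] >= theta:
                    # Clamp so duplicate embeddings (cs == 1) stay finite.
                    g.add_edge(a, b, weight=1.0 / max(1.0 - cs[a, b], 1e-9))
        return g

With theta = 0.8, the worked example above falls out directly: a pair at cosine similarity 0.9 receives edge weight 10 and a pair at 0.95 receives edge weight 20.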
3.2 Graph Community Detection
We construct the graph with words or documents as nodes and edges between nodes whose cosine similarity is greater than θ. The graph community detection algorithm considers only the local neighborhood in community detection. If we consider documents in a corpus, cosine similarity is a strong indicator of similarity between two documents: when the cosine similarity is high (> θ), it strongly indicates that the two documents are semantically similar. However, when the cosine similarity is low (< θ), it indicates a dis-similarity between the two documents, and the strength of the dis-similarity is not indicated by the value of the cosine similarity: a cosine similarity of 0.2 does not indicate a higher dis-similarity than a cosine similarity of 0.4. Thus we eliminate the notion of dis-similarity by connecting only nodes that have a high degree of similarity; all node pairs with cosine similarity below the given threshold are ignored. Though we discuss the idea of similarity and dis-similarity in the context of documents, the argument extends equally well to terms represented by embeddings.

We apply a standard graph community detection algorithm, the Parallel Louvain Method (PLM) [2], to determine the communities in the graph. We calculate the modularity index [14], given by equation (4), for each execution of the PLM algorithm:

    Q = \frac{1}{2m} \sum_{a, b} \left[E_{ab} - \frac{k_a k_b}{2m}\right] \delta(c_a, c_b)    (4)

We execute the graph community detection algorithm recursively. The pseudocode of the recursive algorithm is shown in Algorithm 1.

Algorithm 1: Recursive Graph Community Detection

    def GetCommunity(g, c_node, tree, mod_thresh, max_size):
        mod_index, c_list = community_detection_algo(g)
        if mod_index < mod_thresh:
            tree.add_node(c_node)
            return
        foreach comm in c_list:
            if len(comm) > max_size:
                s_g = get_community_subgraph(comm)
                n_node = Node()
                tree.add_node(n_node)
                GetCommunity(s_g, n_node, tree, mod_thresh, max_size)
            else:
                new_node = Node()
                tree.add_node(new_node)
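The pseudocode above can be realized with off-the-shelf tooling. The sketch below is one plausible reading, assuming the python-louvain package (single-threaded Louvain in place of the paper's Parallel Louvain Method), placeholder defaults for mod_thresh and max_size, and a networkx DiGraph to hold the cluster hierarchy.

    import networkx as nx
    import community as community_louvain  # pip install python-louvain

    def get_community_tree(g, cur_node, tree, mod_thresh=0.4, max_size=50):
        """Recursive community detection in the spirit of Algorithm 1."""
        partition = community_louvain.best_partition(g, weight="weight")
        mod_index = community_louvain.modularity(partition, g, weight="weight")
        # Stop when community structure is weak or no further split is found;
        # the whole subgraph then becomes a single leaf cluster.
        if mod_index < mod_thresh or len(set(partition.values())) == 1:
            tree.nodes[cur_node]["members"] = list(g.nodes())
            return
        communities = {}
        for node, cid in partition.items():
            communities.setdefault(cid, []).append(node)
        for cid, members in communities.items():
            child = f"{cur_node}.{cid}"
            tree.add_edge(cur_node, child)  # grow the cluster hierarchy
            if len(members) > max_size:
                # Large community: recurse on its induced subgraph.
                get_community_tree(g.subgraph(members).copy(), child, tree,
                                   mod_thresh, max_size)
            else:
                tree.nodes[child]["members"] = members

    # Usage sketch:
    # g = build_similarity_graph(embeddings, theta=0.8)
    # tree = nx.DiGraph(); tree.add_node("root")
    # get_community_tree(g, "root", tree)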
3.3 Non Community Nodes
Note that not all nodes will be members of a community; there will be nodes that do not belong to any community. Nodes that are not connected, or not well connected, fail to become members of a community. We define such nodes as Non Community nodes. If we consider Vec2GC for term embeddings, we believe there are two reasons for a term to become a Non Community node: either it appears in multiple contexts and does not have a strong similarity with any specific context, or it is not close enough to a community to be included as a member.

4 EXPERIMENT
We perform an extensive set of experiments and comparisons to show the advantage of Vec2GC as a clustering algorithm for documents or words in a text corpus. We consider 5 different text document datasets along with class information. The dataset details are as follows.

4.1 Datasets
4.1.1 20 Newsgroups. The 20 Newsgroups dataset comprises approximately 20,000 newsgroup documents, evenly distributed across 20 different newsgroups, each corresponding to a different topic.¹

4.1.2 AG News. AG is a collection of more than 1 million news articles gathered from more than 2000 news sources by ComeToMyHead², an academic news search engine. The AG's news topic classification dataset was developed by Xiang Zhang³ from the above news article collection and consists of 127,600 documents. It was first used as a text classification benchmark in [20].

4.1.3 BBC Articles. This is a public dataset from the BBC, comprising 2225 articles, each labeled under one of 5 categories: Business, Entertainment, Politics, Sport or Tech.⁴

4.1.4 Stackoverflow QA. This is a dataset of 16,000 questions and answers from the Stackoverflow website⁵, labeled under 4 different categories of coding language: CSharp, JavaScript, Java and Python.⁶

4.1.5 DBpedia. DBpedia is a project aiming to extract structured content from the information created in Wikipedia. This dataset is extracted from the original DBpedia data and provides taxonomic, hierarchical categories or classes for 342,782 articles. There are 3 levels of classes, with 9, 70 and 219 classes respectively.

¹ http://qwone.com/~jason/20Newsgroups/
² http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
³ [email protected]
⁴ https://www.kaggle.com/c/learn-ai-bbc/data
⁵ www.stackoverflow.com
⁶ http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz

We use two different document embedding algorithms to generate document embeddings for all text datasets. The first algorithm is Doc2Vec, which creates document embeddings using the distributed memory and distributed bag of words models from [7]. We also create document embeddings using Sentence-BERT [15], which computes dense vector representations for documents, such that similar document embeddings are close in vector space, using pretrained language models built on transformer networks like BERT [3], RoBERTa [8], DistilBERT [17], etc.

Table 1: Comparison using Doc2Vec Embeddings (fraction of clusters at k% purity; the table is truncated in the source after the first BBC row)

    Dataset       Purity   KMedoids   HDBSCAN   Vec2GC
    20Newsgroup   50%      .53        .76       .89
    20Newsgroup   70%      .38        .56       .69
    20Newsgroup   90%      .07        .20       .39
    AG News       50%      .98        .98       .99
    AG News       70%      .74        .90       .94
    AG News       90%      .20        .63       .80
    BBC           50%      1.0        .99       .99
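Table 1 scores each method by the fraction of clusters that reach a given purity level. The paper does not spell out the computation; the sketch below is one plausible reading, under the assumption that a cluster's purity is the share of its most frequent true class.

    from collections import Counter

    def cluster_purity(member_labels):
        """Purity of one cluster: share of its most frequent true class."""
        counts = Counter(member_labels)
        return max(counts.values()) / len(member_labels)

    def fraction_at_purity(clusters, k):
        """Fraction of clusters whose purity is at least k (e.g. k = 0.9)."""
        return sum(cluster_purity(c) >= k for c in clusters) / len(clusters)

    # Hypothetical clusters of true class labels:
    clusters = [["sport", "sport", "tech"], ["biz", "biz", "biz"]]
    print(fraction_at_purity(clusters, 0.9))  # -> 0.5; only the second cluster qualifies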
Recommended publications
  • Probabilistic Topic Modelling with Semantic Graph
    Probabilistic Topic Modelling with Semantic Graph. Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang. School of Computing Science, University of Glasgow, Sir Alwyns Building, Glasgow, UK. [email protected]

    Abstract. In this paper we propose a novel framework, topic model with semantic graph (TMSG), which couples a topic model with the rich knowledge from DBpedia. To begin with, we extract the disambiguated entities from the document collection using a document entity linking system, i.e., DBpedia Spotlight, from which two types of entity graphs are created from DBpedia to capture local and global contextual knowledge, respectively. Given the semantic graph representation of the documents, we propagate the inherent topic-document distribution with the disambiguated entities of the semantic graphs. Experiments conducted on two real-world datasets show that TMSG can significantly outperform the state-of-the-art techniques, namely the author-topic model (ATM) and the topic model with biased propagation (TMBP). Keywords: Topic model · Semantic graph · DBpedia

    1 Introduction. Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and Latent Dirichlet Analysis (LDA) [2], have been remarkably successful in analyzing textual content. Specifically, each document in a document collection is represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Such a paradigm is widely applied in various areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made on incorporating external resources, such as time [8], geographic location [12], and authorship [15], into topic models.
  • A Hierarchical Clustering Approach for Dbpedia Based Contextual Information of Tweets
    Journal of Computer Science, Original Research Paper. A Hierarchical Clustering Approach for DBpedia based Contextual Information of Tweets. Venkatesha Maravanthe¹, Prasanth Ganesh Rao², Anita Kanavalli³, Deepa Shenoy Punjalkatte² and Venugopal Kuppanna Rajuk⁴. ¹Department of Computer Science and Engineering, VTU Research Resource Centre, Belagavi, India. ²Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bengaluru, India. ³Department of Computer Science and Engineering, Ramaiah Institute of Technology, Bengaluru, India. ⁴Department of Computer Science and Engineering, Bangalore University, Bengaluru, India. Article history: Received 21-01-2020; Revised 12-03-2020; Accepted 21-03-2020. Corresponding Author: Venkatesha Maravanthe, Department of Computer Science and Engineering, VTU Research Resource Centre, Belagavi, India. Email: [email protected]

    Abstract: The past decade has seen a tremendous increase in the adoption of the Social Web, leading to the generation of an enormous amount of user data every day. The constant stream of tweets with an innate complex sentimental and contextual nature makes searching for relevant information a herculean task. Multiple applications use Twitter for various domain-sensitive and analytical use-cases. This paper proposes a scalable context modeling framework for a set of tweets for finding two forms of metadata, termed primary and extended contexts. Further, our work presents a hierarchical clustering approach to find hidden patterns by using the generated primary and extended contexts. Ontologies from DBpedia are used for generating primary contexts and subsequently to find relevant extended contexts. DBpedia Spotlight in conjunction with the DBpedia Ontology forms the backbone of this proposed model. We consider both Twitter trend and stream data to demonstrate the application of these contextual parts of information in clustering.
  • Query-Based Multi-Document Summarization by Clustering of Documents
    Query-based Multi-Document Summarization by Clustering of Documents. Naveen Gopal K R, Dept. of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amrita School of Engineering, Amritapuri, Kollam-690525, [email protected]; Prema Nedungadi, Amrita CREATE and Dept. of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amrita School of Engineering, Amritapuri, Kollam-690525, [email protected]

    ABSTRACT: Information Retrieval (IR) systems such as search engines retrieve a large set of documents, images and videos in response to a user query. Computational methods such as Automatic Text Summarization (ATS) reduce this information load, enabling users to find information quickly without reading the original text. The challenges to ATS include both the time complexity and the accuracy of summarization. Our proposed Information Retrieval system consists of three different phases: a Retrieval phase, a Clustering phase and a Summarization phase. In the Clustering phase, we extend the Potential-based Hierarchical Agglomerative (PHA) clustering method to a hybrid PHA-ClusteringGain-K-Means clustering approach. Our studies using the DUC 2002 dataset show an increase in both the efficiency and accuracy of clustering. Keywords: Potential-based Hierarchical Agglomerative clustering, k-means, Clustering Gain.

    1. INTRODUCTION: Automatic Text Summarization [12] reduces the volume of information by creating a summary from one or more text documents. The focus of the summary may be either generic, which captures the important semantic concepts of the documents, or query-based, which captures the sub-concepts of the user query and provides personalized and relevant abstracts based on the matching between the input query and the document collection. Currently, summarization has gained research interest due to the enormous information load generated particularly on the web, including large text, audio, and video files, etc.
  • Wordnet-Based Metrics Do Not Seem to Help Document Clustering
    Wordnet-based metrics do not seem to help document clustering. Alexandre Passos and Jacques Wainer. Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP), Campinas, SP, Brazil.

    Abstract. In most document clustering systems documents are represented as normalized bags of words and clustering is done maximizing cosine similarity between documents in the same cluster. While this representation was found to be very effective at many different types of clustering, it has some intuitive drawbacks. One such drawback is that documents containing words with similar meanings might be considered very different if they use different words to say the same thing. This happens because in a traditional bag of words, all words are assumed to be orthogonal. In this paper we examine many possible ways of using WordNet to mitigate this problem, and find that WordNet does not help clustering if used only as a means of finding word similarity.

    1 Introduction. Document clustering is now an established technique, being used to improve the performance of information retrieval systems [11], as an aid to machine translation [12], and as a building block of narrative event chain learning [1]. Toolkits such as Weka [3] and NLTK [10] make it easy for anyone to experiment with document clustering and incorporate it into an application. Nonetheless, the most commonly accepted model (a bag of words representation [17], bisecting k-means algorithm [19] maximizing cosine similarity [20], preprocessing the corpus with a stemmer [14], tf-idf weighting [17], and a possible list of stop words [2]) is complex, unintuitive and has some obvious negative consequences.
  • Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints
    Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints. Javier Parapar and Álvaro Barreiro. IRLab, Computer Science Department, University of A Coruña, Spain. {javierparapar,[email protected]}

    Abstract. This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by trimming down the documents' size using fingerprinting methods. A thorough evaluation was performed over three different collections and considering four different metrics. The presented approach to document clustering achieved good values of effectiveness with considerable savings in memory space and computation time.

    1 Introduction and Motivation. A document's fingerprint could be defined as an abstraction of the original document that usually implies a reduction in terms of size. On the other hand, data clustering consists of the partition of the input data collection into sets that ideally share common properties. This paper studies the effect of using document fingerprints as input to clustering algorithms to achieve a better computational behaviour. Clustering has a long history in Information Retrieval (IR) [1], but only recently have Liu and Croft [2] demonstrated that cluster-based retrieval can also significantly outperform traditional document-based retrieval effectiveness. Other successful applications of clustering algorithms are document browsing, search results presentation and document summarisation. Our approach tries to be useful in operational systems where computing time is a critical point and the use of clustering techniques can significantly improve the quality of the outputs of different tasks such as those exposed above. Next, section 2 introduces the background in clustering and document representation.
  • Word Sense Disambiguation in Biomedical Ontologies with Term Co-Occurrence Analysis and Document Clustering
    Int. J. Data Mining and Bioinformatics, Vol. 2, No. 3, 2008, p. 193. Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering. Bill Andreopoulos*, Dimitra Alexopoulou and Michael Schroeder. Biotechnological Centre, Technische Universität Dresden, Germany. E-mail: [email protected]. *Corresponding author.

    Abstract: With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as 'development' can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%. Additionally, applying document clustering improves precision to 82%. We applied the same approach to disambiguate 'nucleus', 'transport', and 'spindle', and we achieved consistent results. Thus, our method is a viable approach towards the automation of literature-based genome annotation. Keywords: WSD; word sense disambiguation; ontologies; GeneOntology; text mining; annotation; Bayes; clustering; GoPubMed; data mining; bioinformatics. Reference to this paper should be made as follows: Andreopoulos, B., Alexopoulou, D. and Schroeder, M. (2008) 'Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering', Int. J. Data Mining and Bioinformatics, Vol. 2, No. 3.
  • A Query Focused Multi Document Automatic Summarization
    PACLIC 24 Proceedings, p. 545. A Query Focused Multi Document Automatic Summarization. Pinaki Bhaskar and Sivaji Bandyopadhyay. Department of Computer Science & Engineering, Jadavpur University, Kolkata 700032, India. [email protected], [email protected]

    Abstract. The present paper describes the development of a query focused multi-document automatic summarization system. A graph is constructed, where the nodes are sentences of the documents and edge scores reflect the correlation measure between the nodes. The system clusters similar texts having related topical features from the graph using edge scores. Next, query-dependent weights for each sentence are added to the edge score of the sentence and accumulated with the corresponding cluster score. The top-ranked sentence of each cluster is identified and compressed using a dependency parser. The compressed sentences are included in the output summary. The inter-document cluster is revisited in order until the length of the summary is less than the maximum limit. The summarizer has been tested on the standard TAC 2008 test data sets of the Update Summarization Track. Evaluation of the summarizer yielded accuracy scores of 0.10317 (ROUGE-2) and 0.13998 (ROUGE-SU-4). Keywords: Multi Document Summarizer, Query Focused, Cluster based approach, Parsed and Compressed Sentences, ROUGE Evaluation.

    1 Introduction. Text Summarization, as the process of identifying the most salient information in a document or set of documents (for multi-document summarization) and conveying it in less space, became an active field of research in both the Information Retrieval (IR) and Natural Language Processing (NLP) communities. Summarization shares some basic techniques with indexing, as both are concerned with identification of the essence of a document.
  • Semantic Based Document Clustering Using Lexical Chains
    International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072, Volume 04, Issue 01, Jan 2017, www.irjet.net. SEMANTIC BASED DOCUMENT CLUSTERING USING LEXICAL CHAINS. Shabana Afreen¹, Dr. B. Srinivasu². ¹M.Tech Scholar, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana-Hyderabad, India. ²Associate Professor, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana-Hyderabad, India.

    Abstract: Traditional clustering algorithms do not consider the semantic relationships among documents and so cannot accurately represent clusters of the documents. To overcome these problems, introducing semantic information from an ontology such as WordNet has been widely used to improve the quality of text clustering. However, there exist several challenges, such as extracting core semantics from texts, assigning appropriate descriptions for the generated clusters, and the diversity of vocabulary. In this project we report our attempt towards integrating WordNet with lexical chains to alleviate these problems. The proposed approach exploits the way we can identify the theme of the document based on the disambiguated core semantics. Document clustering is the process of organizing large document collections into smaller meaningful and manageable groups, which plays an important role in information retrieval, browsing and comprehension [1].

    1.2 Problem Description: Feature vectors generated using BoW result in very large dimensional vectors; the number of features selected is directly proportional to the dimension. Extracting core semantics from texts and selecting those features will reduce the number of terms, which depicts high semantic core content. The quality of the extracted lexical chains highly depends on the quality and quantity of the concepts within a document.
  • Latent Semantic Sentence Clustering for Multi-Document Summarization
    UCAM-CL-TR-802, Technical Report Number 802, ISSN 1476-2986. Computer Laboratory. Latent semantic sentence clustering for multi-document summarization. Johanna Geiß. July 2011. 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom. Phone +44 1223 763500. http://www.cl.cam.ac.uk/ © 2011 Johanna Geiß. This technical report is based on a dissertation submitted April 2011 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. Edmund's College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

    Summary: This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches like measuring the similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) tasks like information filtering and document classification (Dumais, 2004). In the course of this research, different parameters essential to sentence clustering using a hierarchical agglomerative clustering algorithm (HAC) in general and in combination with LSA in particular are investigated. These parameters include, inter alia, information about the type of vocabulary, the size of the semantic space and the optimal number of dimensions to be used in LSA. These parameters have not previously been studied and evaluated in combination with sentence clustering (chapter 4). This thesis also presents the first gold standard for sentence clustering in MDS.
  • Clustering News Articles Using K-Means and N-Grams by Desmond
    CLUSTERING NEWS ARTICLES USING K-MEANS AND N-GRAMS. By Desmond Bala Bisandu (A00019335). School of IT & Computing, American University of Nigeria, Yola, Adamawa State, Nigeria. Spring, 2018. In partial fulfillment of the requirements for the award of the degree of Master of Science (M.Sc.) in Computer Science, submitted to the School of IT & Computing, American University of Nigeria, Yola.

    DECLARATION. I, Desmond Bala Bisandu, declare that the work presented in this thesis entitled 'Clustering News Articles Using K-means and N-grams', submitted to the School of IT & Computing, American University of Nigeria, in partial fulfillment of the requirements for the award of the Master of Science (M.Sc.) in Computer Science. I have neither plagiarized nor submitted the same work for the award of any other degree. In case this undertaking is found incorrect, my degree may be withdrawn unconditionally by the University. Date: 16th April, 2018. Desmond Bala Bisandu, Place: Yola, A00019335.

    CERTIFICATION. I certify that the work in this document has not been previously submitted for a degree, nor has it been submitted as part of the requirements for a degree, except as fully acknowledged within this text. Signatories: Student (Desmond Bala Bisandu); Supervisor (Dr. Rajesh Prasad); Program Chair (Dr. Narasimha Rao Vajjhala); Graduate Coordinator, SITC (Dr. Rajesh Prasad); Dean, SITC (Dr. Mathias Fonkam).

    ACKNOWLEDGEMENT. My gratitude goes to the Almighty God for giving me the strength, knowledge, and courage to carry out this thesis successfully.
  • A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
    A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut. Computer Science Department, Institute of Bioinformatics, and Department of Mathematics, University of Georgia, Athens, GA. [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

    ABSTRACT: The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques, including text pre-processing, classification and clustering.

    1 INTRODUCTION: The Text Mining (TM) field has gained a great deal of attention in recent years due to the tremendous amount of text data, which is created in a variety of forms such as social networks, patient records, health care insurance data, news outlets, etc. IDC, in a report sponsored by EMC, predicts that the data volume will grow to 40 zettabytes by 2020, leading to a 50-fold growth from the beginning of 2010 [52]. Text data is a good example of unstructured information.
  • Document Clustering Using K-Means and K-Medoids Rakesh Chandra Balabantaray*, Chandrali Sarma**, Monica Jha***
    Article can be accessed online at http://www.publishingindia.com. Document Clustering using K-Means and K-Medoids. Rakesh Chandra Balabantaray*, Chandrali Sarma**, Monica Jha***

    Abstract: With the huge upsurge of information in day-to-day life, it has become difficult to assemble relevant information in the nick of time. But people are always in dearth of time; they need everything quick. Hence clustering was introduced to gather relevant information in a cluster. There are several algorithms for clustering information, out of which in this paper we apply the K-means and K-Medoids clustering algorithms, and a comparison is carried out to find which algorithm is best for clustering. On the best clusters formed, document summarization is executed based on sentence weight to focus on the key points of the whole document, which makes it easier for people to ascertain the information they want and thus read only the documents they find relevant. The major challenges of document clustering are big volume, high dimensionality and complex semantics. Our motive in the present paper is to extract a particular domain of work from a huge collection of documents using the K-Means and K-Medoids clustering algorithms and to obtain the best clusters, which can later be used for document summarization. Document clustering is a more specific technique for document organization, automatic topic extraction and fast IR; it has been carried out using the K-means clustering algorithm implemented in a tool named WEKA (Waikato Environment for Knowledge Analysis), while the K-Medoids algorithm is executed on the Java NetBeans IDE to obtain clusters. This allows a collection of various documents from different domains to be segregated into groups of a similar domain.