Journal of Applied Sciences Research
Copyright © 2015, American-Eurasian Network for Scientific Information publisher

JOURNAL OF APPLIED SCIENCES RESEARCH
ISSN: 1819-544X  EISSN: 1816-157X
Journal home page: http://www.aensiweb.com/JASR
2015 October; 11(19): pages 50-55. Published Online 10 November 2015.

Research Article

A Survey on Correlation between the Topic and Documents Based on the Pachinko Allocation Model

1 Dr. C. Sundar and 2 V. Sujitha
1 Associate Professor, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
2 PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.

Received: 23 September 2015; Accepted: 25 October 2015
© 2015 AENSI PUBLISHER All rights reserved

ABSTRACT

Latent Dirichlet allocation (LDA) and related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. In the existing system, a novel information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM), is used. Its patterns are generated from the words in the word-based topic representations of a traditional topic model such as LDA. This ensures that the patterns can represent the topics well, because they are composed of words that LDA extracts based on the occurrence of those words in the documents. The maximum matched patterns, which are the largest patterns in each equivalence class that appear in the received documents, are used to identify the most relevant words for representing topics. However, LDA does not capture correlations between topics and does not find the hidden topics in a document. To deal with this problem, the pachinko allocation model (PAM) is proposed. Topic models are a suite of algorithms for uncovering the hidden thematic structure of a collection of documents.
PAM improves upon earlier topic models such as LDA by modeling correlations between topics in addition to the word correlations which constitute topics. PAM provides more flexibility and greater expressive power than latent Dirichlet allocation.

Keywords: Topic model, information filtering, maximum matched pattern, correlation, hidden topics, PAM.

Corresponding Author: Dr. C. Sundar, Associate Professor, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India. E-mail: [email protected]

INTRODUCTION

Data mining is the practice of examining large pre-existing databases in order to generate new information: it is the process of analyzing data from different perspectives and summarizing it into useful information. Similarly, text mining is the process of extracting information from large collections of text; it refers to the process of deriving high-quality information from text. Topic modeling has become one of the most popular probabilistic text modeling techniques and has been quickly accepted by the machine learning and text mining communities. It can automatically classify the documents in a collection by a number of topics, and it represents every document with multiple topics and their corresponding distributions.

Topic models are an important tool because of their ability to identify latent semantic components in unlabeled text data. Recently, attention has focused on models that are able not only to identify topics but also to discover the organization and co-occurrences of the topics themselves. Text clustering is one of the main themes in text mining; it refers to the process of grouping documents with similar contents or topics into clusters to improve both the availability and the reliability of the mining [1].

Statistical topic models such as latent Dirichlet allocation (LDA) have been shown to be effective tools for topic extraction and analysis. These models can capture word correlations in a collection of textual documents with a low-dimensional set of multinomial distributions. Recent work in this area has investigated richer structures that also describe inter-topic correlations, leading to the discovery of large numbers of more accurate, fine-grained topics. The topics discovered by LDA capture correlations among words, but LDA does not explicitly model correlations among topics. This limitation arises because the topic proportions in each document are sampled from a single Dirichlet distribution.

In this paper, we introduce the pachinko allocation model (PAM), which uses a directed acyclic graph (DAG) [7], [9] structure to represent arbitrary, nested, and possibly sparse topic correlations.

Fig. 1: DAG structure

The DAG structure in PAM is extremely flexible, as shown in Fig. 1. In PAM, the concept of a topic is extended to distributions not only over words but also over other topics. The model structure consists of an arbitrary DAG in which each leaf node is associated with a word in the vocabulary, and each non-leaf "interior" node corresponds to a topic, having a distribution over its children. For example, consider a document collection that discusses four topics: cooking, health, insurance, and drugs. The cooking topic co-occurs often with health, while health, insurance, and drugs are often discussed together [9]. A DAG can describe this kind of correlation: for each topic, we have one node that is directly connected to the words, and there are two additional nodes at a higher level, where one is the parent of cooking and health, and the other is the parent of health, insurance, and drugs.

Fig. 2 shows the working principle of PAM. The model first takes its input from the user, a collection of documents. The documents are preprocessed to remove uncorrelated words, and the pachinko allocation model is then applied to generate the new topics. The model constructs the DAG [7] structure to represent the topics: the root node sits at the top of the hierarchy, and the interior nodes are treated as super-topics and sub-topics. Based on this structure, the model selects the correlated words to produce a new and accurate topic for the given document.

II. Pachinko Allocation Model:

Pachinko allocation models documents as a mixture of distributions over a single set of topics, using a directed acyclic graph to represent topic co-occurrences. Each node in the graph is a Dirichlet distribution. At the top level there is a single node. Except at the bottom level, each node represents a distribution over the nodes in the next lower level; the distributions at the bottom level are distributions over the words in the vocabulary. In the simplest version, there is a single layer of Dirichlet distributions between the root node and the word distributions at the bottom level.

Now we briefly explain how to generate a corpus of documents from PAM. The working principle of PAM can be compared with the restaurant process [10]: each document corresponds to a restaurant, and each word is associated with a customer. In the four-level DAG structure of PAM in Fig. 1, there are an infinite number of super-topics, represented as categories, and an infinite number of sub-topics, represented as dishes. Both super-topics and sub-topics are globally shared among all documents. To generate a word in a document, we first sample a super-topic; then, given the super-topic, we sample a sub-topic in the same way; finally, we sample the word from the sub-topic according to its multinomial distribution over words. Word correlation is captured mainly to assign a suitable topic to the document: given the documents, the model first checks for similar words and finds the maximum matched patterns, which have the highest probability [8].
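The super-topic → sub-topic → word sampling chain described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the numbers of super-topics and sub-topics, the vocabulary size, and the symmetric Dirichlet priors are all hypothetical choices, and the topic counts are fixed and finite rather than infinite.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical sizes for the sketch (not taken from the paper).
N_SUPER, N_SUB, VOCAB = 3, 5, 20

# Global word distributions: one multinomial over the vocabulary
# per sub-topic, drawn from a symmetric Dirichlet prior. These are
# shared among all documents, as the sub-topics are globally shared.
phi = rng.dirichlet(np.ones(VOCAB), size=N_SUB)

def generate_document(n_words):
    """Sample one document via the chain: super-topic -> sub-topic -> word."""
    # Per-document multinomials drawn from the Dirichlet nodes of the DAG:
    # the root's distribution over super-topics, and each super-topic's
    # distribution over sub-topics.
    theta_root = rng.dirichlet(np.ones(N_SUPER))
    theta_super = rng.dirichlet(np.ones(N_SUB), size=N_SUPER)

    words = []
    for _ in range(n_words):
        z_super = rng.choice(N_SUPER, p=theta_root)        # pick a "category"
        z_sub = rng.choice(N_SUB, p=theta_super[z_super])  # pick a "dish"
        words.append(rng.choice(VOCAB, p=phi[z_sub]))      # pick a word
    return words

doc = generate_document(50)
```

Because the root's distribution over super-topics and each super-topic's distribution over sub-topics are drawn per document, documents that favor the same super-topic tend to reuse the same group of sub-topics, which is how the sketch mirrors PAM's topic correlations.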
Based on the high probability, the words are arranged in descending order; after that, the topic is chosen using the topic-model technique. Note that PAM does not represent word distributions as parents of other distributions, but it does exhibit several hierarchy-related phenomena.

Fig. 2: Architecture diagram (client → data set → pachinko allocation model → topics and sub-topics → topic model)

In LDA the structure of the DAG is not defined, and the topics are not split into super-topics and sub-topics; LDA concentrates only on assigning multiple topics to the given document, not on finding the most accurate one. PAM captures the word correlations and also gives an accurate topic for the given document, so the topic model acquires a better representation of the topics.

Pachinko allocation captures topic correlations with a directed acyclic graph (DAG) in which each leaf node is associated with a word and each interior node corresponds to a topic, having a distribution over its children. The DAG may contain cross-connected edges and edges skipping levels, and some other models can be viewed as special cases of PAM. An interior node whose children are all leaves would correspond to a traditional LDA topic, but some interior nodes may also have children that are other topics, thus representing a mixture over topics.

A. Four-level PAM:

While PAM allows arbitrary DAGs to model the topic correlations, in this paper we focus on one special class of structures in our experiments: a four-level hierarchy consisting of one root topic, s1 topics at the second level, s2 topics at the third level, and words at the bottom. We call the topics at the second level super-topics and the ones at the third level sub-topics. The root is connected to all super-topics, super-topics are fully connected to sub-topics, and sub-topics are fully connected to the words in the vocabulary.
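As a rough sketch of the connectivity just described, the edge set of the four-level hierarchy can be enumerated directly. The sizes s1, s2, and the vocabulary size V below are placeholders, not values from the paper:

```python
# Hypothetical sizes: s1 super-topics, s2 sub-topics, V vocabulary words.
s1, s2, V = 4, 10, 50

super_topics = [f"super_{i}" for i in range(s1)]
sub_topics = [f"sub_{j}" for j in range(s2)]
vocab = [f"w_{k}" for k in range(V)]

edges = []
edges += [("root", s) for s in super_topics]                 # root -> every super-topic
edges += [(s, t) for s in super_topics for t in sub_topics]  # fully connected layers
edges += [(t, w) for t in sub_topics for w in vocab]         # sub-topic -> every word

# Total edge count: s1 + s1*s2 + s2*V = 4 + 40 + 500 = 544
print(len(edges))
```

In the full model, each interior node carries a Dirichlet-distributed multinomial over its outgoing edges; the fully connected middle layers are what allow every super-topic to express its own correlation pattern over the shared sub-topics.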