Latent Semantic Kernels for WordNet: Transforming a Tree-like Structure into a Matrix


Latent Semantic Kernels for WordNet: Transforming a Tree-like Structure into a Matrix

Young-Bum Kim, Yu-Seop Kim
Department of Computer Engineering, Hallym University
Chuncheon, Gangwon, 200-702, Korea
[email protected], [email protected]

Abstract

WordNet is one of the most widely used linguistic resources in the computational linguistics community. However, many applications using the WordNet hierarchical structure suffer from the word sense disambiguation (WSD) problem caused by its polysemy. In order to solve the problem, we propose a matrix representing the WordNet hierarchical structure. Firstly, we transform a term into a vector whose elements each correspond to a synset of WordNet. Then, with singular value decomposition (SVD), we reduce the dimension of the vector to represent the latent semantic structure. For evaluation, we implement an automatic assessment system for short essays and acquire reliable accuracy. As a result, the scores assessed by the automatic assessment system are significantly correlated with those of human assessors. The new WordNet representation is expected to be easily combined with other matrix-based approaches.

1 Introduction

For a decade, WordNet [4] has been one of the most widely used linguistic resources; it is a broad-coverage lexical network of English words [2]. [1] made use of WordNet to measure the semantic distance between concepts, and [13] used WordNet to disambiguate word senses. [9], [5], and [11] measured the relatedness of concepts, not of words themselves, and the relatedness of words was estimated using an assumption proposed by [11]: the relatedness between two words is the relatedness between their most-related pair of concepts.

The assumption is reasonable for measuring the relatedness of words themselves. However, it is less suitable for measuring the similarity between sentences, paraphrases, and even documents, and a new, more complicated method would need to be developed to disambiguate the sense of each word by considering its context. The Latent Semantic Kernel, proposed by [3], offers a simplified way to measure the similarity between documents and has been applied to many applications, such as the automatic assessment system of [7]. Unlike the WordNet-based methods, the kernel method fortunately has no need to consider the polysemy problem during the measurement process. This is why we tried to transform the WordNet structure into the kernel form, that is, a matrix. The matrix derived from WordNet has the additional nice property that it can easily be integrated with other matrix-based information in other applications later.

At first, we built a term-synset matrix, analogous to a term-document matrix in the traditional IR community, from a given Korean WordNet called KorLEX [10]. We initially gave each element of the matrix a score calculated from the distance to the given synset. Then we constructed a new semantic kernel from the matrix using the SVD algorithm. For evaluation of the new kernel matrix, we implemented an automatic assessment system for short essay questions. The correlation coefficient between the automatic system and a human assessor was 0.922.

Section 2 describes the representation of the transformed WordNet. The latent semantic kernel method integrated with WordNet is explained in section 3. The experimental results and the concluding remarks are described in sections 4 and 5, respectively.

2 Transformed WordNet Structure

Synsets (synonym sets) represent specific underlying lexical concepts in WordNet. Even though WordNet has been used to solve many semantic problems in the computational linguistics and information retrieval communities, its inherent polysemy problem, that is, that one term can appear in multiple synsets, has also caused further problems. In order to solve this problem, we adapted the latent semantic kernel to WordNet.

Firstly, we built a term-synset matrix by transforming the WordNet hierarchical structure. Each row vector of the matrix is associated with a term that is listed in WordNet and that also appears more than once in a corpus of about 38,000 documents. The vector is represented as

t_j = <s_1, s_2, ..., s_i, ..., s_N>    (1)

where t_j is the row vector for the j-th term and N is the total number of synsets in WordNet. Each s_i, set to zero initially, is calculated by

s_i = α / 2^k    (2)

where α is a constant. The value s_i decreases with the number of edges k, 0 ≤ k ≤ k_max, on the path from the synset containing the term t_j to the i-th synset. The parameter k_max determines the range of synsets related to the term: the larger k_max is, the more synsets are regarded as related to the synset of the term. In this paper we set k_max = 2 and α = 2.

Figure 1, taken from [12], shows a part of the WordNet extract for the term 'car'. The term appears in multiple synsets: {car, gondola}, {car, railway car}, and {car, automobile}. The values of s_i for these synsets are therefore set to α / 2^0 = 2 / 1 = 2. The synsets adjacent to {car, automobile}, which are {motor vehicle}, {coupe}, {sedan}, and {taxi} in Figure 1, are all given s_i = α / 2^1 = 2 / 2 = 1. This procedure continues until the k_max-th adjacent synsets are reached.
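As a concrete illustration, the scoring in equations (1) and (2) amounts to a short breadth-first walk over the synset graph. Below is a minimal Python sketch under stated assumptions: a tiny hand-made synset graph stands in for the KorLEX hierarchy used in the paper, and the synset and term names are purely illustrative.

```python
import numpy as np
from collections import deque

ALPHA, K_MAX = 2.0, 2   # values used in the paper

# Hypothetical synset graph: synset -> neighbouring synsets (hypernym/hyponym edges).
graph = {
    "car.automobile": ["motor_vehicle", "coupe", "sedan", "taxi"],
    "motor_vehicle": ["car.automobile", "go_kart"],
    "coupe": ["car.automobile"],
    "sedan": ["car.automobile"],
    "taxi": ["car.automobile"],
    "go_kart": ["motor_vehicle"],
    "car.gondola": [],
    "car.railway_car": [],
}
synsets = sorted(graph)                      # columns of the term-synset matrix
col = {s: i for i, s in enumerate(synsets)}

# Terms mapped to the synsets that contain them (a term may be polysemous).
term_synsets = {"car": ["car.automobile", "car.gondola", "car.railway_car"]}

def term_vector(term):
    """Breadth-first walk from every synset containing the term, scoring each
    reached synset with alpha / 2**k, where k is the edge distance (eq. 2)."""
    v = np.zeros(len(synsets))
    for start in term_synsets[term]:
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, k = queue.popleft()
            v[col[node]] = max(v[col[node]], ALPHA / 2 ** k)
            if k < K_MAX:
                for nb in graph[node]:
                    if nb not in seen:
                        seen.add(nb)
                        queue.append((nb, k + 1))
    return v

A = np.vstack([term_vector(t) for t in term_synsets])   # term-synset matrix (eq. 1)
print(dict(zip(synsets, A[0])))   # 'car' row: 2.0 for its own synsets, 1.0 for neighbours
```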
3 Latent Semantic Kernel for Similarity Measurement

With the initial term-synset matrix A created above, we build the latent semantic kernel [3]. The similarity between two documents d_1 and d_2 is estimated as

sim(d_1, d_2) = cos(P^T d_1, P^T d_2) = (d_1^T P P^T d_2) / (|P^T d_1| |P^T d_2|)    (3)

where P is a matrix transforming documents from the input space to a feature space. A kernel function k(d_1, d_2) = <φ(d_1), φ(d_2)> uses the matrix P to replace φ(d_1) with P^T d_1.

To find P, the term-synset matrix A is decomposed by SVD as

A = U Σ V^T    (4)

where Σ is a diagonal matrix composed of the nonzero singular values of A, and U and V are the orthogonal matrices whose columns are the eigenvectors associated with the r nonzero eigenvalues of AA^T and A^T A, respectively. The original term-synset matrix A has size m × n. One component matrix, U, of size m × r, describes the original row entities as vectors of derived orthogonal factor values; another, V, of size n × r, describes the original column entities in the same way; and the third, Σ, is an r × r diagonal matrix of scaling values such that when the three components are matrix-multiplied, the original matrix is reconstructed. The singular vectors corresponding to the k (k ≤ r) largest singular values are then used to define a k-dimensional synset space. Using these vectors, the m × k and n × k matrices U_k and V_k may be redefined along with the k × k singular value matrix Σ_k. It is known that A_k = U_k Σ_k V_k^T is the closest rank-k matrix to the original matrix A, and U_k is used as P. [8] explains the details of these SVD-based methods, known as latent semantic analysis (LSA).
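Equations (3) and (4) translate almost directly into a few lines of NumPy. The sketch below is not the authors' implementation; it simply truncates the SVD of an arbitrary term-synset matrix to obtain P = U_k and evaluates the cosine of equation (3), with a random matrix standing in for A and an illustrative k.

```python
import numpy as np

def build_kernel(A, k):
    """Truncated SVD of A (eq. 4); the paper takes P = U_k (m x k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

def similarity(P, d1, d2):
    """Equation (3): cosine of the two documents mapped into the synset space."""
    x1, x2 = P.T @ d1, P.T @ d2
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

rng = np.random.default_rng(0)
A = rng.random((6, 8))                   # stand-in 6-term x 8-synset matrix
P = build_kernel(A, k=2)
d1, d2 = rng.random(6), rng.random(6)    # documents as term-frequency vectors
print(round(similarity(P, d1, d2), 3))
```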
4 Evaluation

In this section, we first explain the automatic assessment procedure developed to evaluate the usefulness of the new WordNet-based kernel, and then show its usefulness by reporting the assessment accuracy.

4.1 Automatic Assessment for Short Essays

Figure 2 shows the whole process of the automatic assessment system developed to evaluate the new kernel. The process starts with a sentence written by a student. At first, the Korean Morphological Analyzer (KMA) [6] extracts the main index terms from the input sentence. From the term list, an initial vector is constituted, whose elements correspond to the vocabulary generated from both a large document collection and WordNet; 16,000 words were listed in the vocabulary. Then the dimension is reduced by computing P^T d, where P is the kernel from WordNet and d is the initial vector. With the model sentences, which were created by instructors and transformed in the same way as the input student sentence, the similarities are estimated using equation (3), where the student sentence is mapped to d_1 and the model sentences are mapped to d_2. In this research, we stored five model sentences for each question. Finally, the highest similarity value is taken as the final score of the student sentence.

4.2 Comparison to Human Assessors

We gave 30 questions about Chinese proverbs to 100 students, asking for the proper meaning of each proverb. Firstly, a human assessor decided whether each answer was correct or not and gave one of two scores, 1 or 0, respectively. The score of each student was then computed. Likewise, our assessment system does the same thing, except that it gives each answer a partial score ranging from 0 to 1.

Figure 3 shows the correlation between the scores from the human assessor and those of the automatic assessor. The correlation coefficient value is 0.922. As shown in Figure 3, the assessment system tends to give slightly lower scores than the human assessor.

[Figure 1. WordNet Extract for the term 'car']
[Figure 2. Automatic Assessment Process]

References

[1] A. Budanitsky and G. Hirst. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, 2001.
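The scoring step of section 4.1 reduces to mapping the student answer and each instructor-written model answer into the synset space and keeping the best match. A minimal sketch, reusing the kernel P from the previous snippet; the vectors here are random placeholders rather than KMA-derived term vectors.

```python
import numpy as np

def score_answer(P, student_vec, model_vecs):
    """Return the final score: the highest kernel similarity between the
    student answer and any of the instructor-written model answers."""
    def sim(d1, d2):
        x1, x2 = P.T @ d1, P.T @ d2
        return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
    return max(sim(student_vec, m) for m in model_vecs)

# Hypothetical data: a 6-term vocabulary, one student answer, five model answers.
rng = np.random.default_rng(1)
P = rng.random((6, 2))                      # kernel matrix (as built above)
student = rng.random(6)
models = [rng.random(6) for _ in range(5)]  # the paper stores five per question
print(round(score_answer(P, student, models), 3))
```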
Recommended publications
  • Automatic Wordnet Mapping Using Word Sense Disambiguation*
    Automatic WordNet mapping using word sense disambiguation*
    Changki Lee, Geunbae Lee (Natural Language Processing Lab, Dept. of Computer Science and Engineering, Pohang University of Science & Technology, San 31, Hyoja-Dong, Pohang, 790-784, Korea; {leeck,gblee}@postech.ac.kr) and Seo JungYun (Natural Language Processing Lab, Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, Korea; [email protected])

    Abstract. This paper presents the automatic construction of a Korean WordNet from pre-existing lexical resources. A set of automatic WSD techniques is described for linking Korean words collected from a bilingual MRD to English WordNet synsets. We will show how the individual linking provided by each WSD method is then combined to produce a Korean WordNet for nouns.

    1 Introduction. There is no doubt on the increasing importance of using wide-coverage ontologies for NLP tasks, especially for information [...] bilingual Korean-English dictionary. The first sense of 'gwan-mog' has 'bush' as a translation in English, and 'bush' has five synsets in WordNet. Therefore the first sense of 'gwan-mog' has five candidate synsets. Somehow we decide on the synset {shrub, bush} among the five candidate synsets and link the sense of 'gwan-mog' to this synset. As seen from this example, when we link the senses of Korean words to WordNet synsets, there are semantic ambiguities. To remove the ambiguities we develop new word sense disambiguation heuristics and an automatic mapping method to construct a Korean WordNet based on the existing English WordNet. This paper is organized as follows. In section 2, we describe multiple heuristics for word sense disambiguation for sense linking.
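The candidate-synset step described in this excerpt can be illustrated with NLTK's WordNet interface: look up the English translations of a Korean headword in a bilingual dictionary and collect the synsets of each translation. A hedged sketch only; the dictionary entry is hard-coded for illustration and the final sense selection is left as a placeholder.

```python
from nltk.corpus import wordnet as wn  # requires a prior nltk.download('wordnet')

# Hypothetical bilingual dictionary entry (Korean headword -> English translations).
bilingual_dict = {"gwan-mog": ["bush", "shrub"]}

def candidate_synsets(korean_word):
    """Collect every WordNet noun synset of every English translation."""
    candidates = []
    for translation in bilingual_dict.get(korean_word, []):
        candidates.extend(wn.synsets(translation, pos=wn.NOUN))
    return candidates

cands = candidate_synsets("gwan-mog")
print([s.name() for s in cands])   # 'shrub.n.01' appears among the candidates
# A real mapper would now apply WSD heuristics to pick one synset to link to.
```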
  • Probabilistic Topic Modelling with Semantic Graph
    Probabilistic Topic Modelling with Semantic Graph
    Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang
    School of Computing Science, University of Glasgow, Sir Alwyns Building, Glasgow, UK
    [email protected]

    Abstract. In this paper we propose a novel framework, topic model with semantic graph (TMSG), which couples a topic model with the rich knowledge from DBpedia. To begin with, we extract the disambiguated entities from the document collection using a document entity linking system, i.e., DBpedia Spotlight, from which two types of entity graphs are created from DBpedia to capture local and global contextual knowledge, respectively. Given the semantic graph representation of the documents, we propagate the inherent topic-document distribution with the disambiguated entities of the semantic graphs. Experiments conducted on two real-world datasets show that TMSG can significantly outperform the state-of-the-art techniques, namely, the author-topic model (ATM) and the topic model with biased propagation (TMBP).

    Keywords: Topic model · Semantic graph · DBpedia

    1 Introduction. Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and Latent Dirichlet Allocation (LDA) [2], have been remarkably successful in analyzing textual content. Specifically, each document in a document collection is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words. Such a paradigm is widely applied in various areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made on incorporating external resources, such as time [8], geographic location [12], and authorship [15], into topic models.
  • Employee Matching Using Machine Learning Methods
    Master of Science in Computer Science, May 2019
    Employee Matching Using Machine Learning Methods
    Sumeesha Marakani
    Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

    This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies. The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

    Contact Information: Author(s): Sumeesha Marakani, E-mail: [email protected]. University advisor: Prof. Veselka Boeva, Department of Computer Science. External advisors: Lars Tornberg [email protected], Daniel Lundgren [email protected]. Faculty of Computing, Blekinge Institute of Technology, SE–371 79 Karlskrona, Sweden. Internet: www.bth.se, Phone: +46 455 38 50 00, Fax: +46 455 38 50 57.

    Abstract. Background. Expertise retrieval is an information retrieval technique that focuses on techniques to identify the most suitable 'expert' for a task from a list of individuals. Objectives. This master thesis is a collaboration with Volvo Cars to attempt applying this concept and match employees based on information that was extracted from an internal tool of the company. In this tool, the employees describe themselves in free flowing text. This text is extracted from the tool and analyzed using Natural Language Processing (NLP) techniques.
  • Matrix Decompositions and Latent Semantic Indexing
    18 Matrix decompositions and latent semantic indexing

    On page 123 we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 372). Understanding its full potential remains an area of active research. Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.

    18.1 Linear algebra review. We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative.
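The low-rank approximation underlying latent semantic indexing is easy to verify numerically: truncate the SVD of a small term-document matrix and measure the reconstruction error. A minimal NumPy sketch with a made-up 5 × 4 count matrix (not an example taken from the book).

```python
import numpy as np

# Made-up term-document count matrix C (5 terms x 4 documents).
C = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 3, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2                                           # keep the 2 largest singular values
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of C

# By the Eckart-Young theorem, C_k is the closest rank-k matrix to C in Frobenius norm.
print("singular values:", np.round(s, 3))
print("Frobenius error of rank-2 approximation:", round(np.linalg.norm(C - C_k), 3))
```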
  • Universal Or Variation? Semantic Networks in English and Chinese
    Universal or variation? Semantic networks in English and Chinese

    Understanding the structures of semantic networks can provide great insights into lexico-semantic knowledge representation. Previous work reveals small-world structure in English, a structure with the following properties: short average path lengths between words and strong local clustering, with a scale-free distribution in which most nodes have few connections while a small number of nodes have many connections [1]. However, it is not clear whether such semantic network properties hold across human languages. In this study, we investigate the universal structures and cross-linguistic variations by comparing the semantic networks in English and Chinese.

    Network description. To construct the Chinese and the English semantic networks, we used Chinese Open Wordnet [2,3] and English WordNet [4]. The two wordnets have different word forms in Chinese and English but common word meanings. Word meanings are connected not only to word forms, but also to other word meanings if they form relations such as hypernyms and meronyms (Figure 1).

    1. Cross-linguistic comparisons. Analysis: The large-scale structures of the Chinese and the English networks were measured with two key network metrics, small-worldness [5] and scale-free distribution [6]. Results: The two networks have similar size and both exhibit small-worldness (Table 1). However, the small-worldness is much greater in the Chinese network (σ = 213.35) than in the English network (σ = 83.15); this difference is primarily due to the higher average clustering coefficient (ACC) of the Chinese network. The scale-free distributions are similar across the two networks, as indicated by ANCOVA, F(1, 48) = 0.84, p = .37.
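For readers who want to reproduce the kind of metrics mentioned here, average clustering and average path length (and a crude small-worldness ratio) can be computed with NetworkX. A toy sketch on a synthetic graph standing in for a wordnet-derived semantic network; the graph parameters are arbitrary.

```python
import networkx as nx

# Stand-in semantic network: a connected Watts-Strogatz small-world graph
# (200 nodes, each joined to 10 neighbours, 10% of edges rewired).
G = nx.connected_watts_strogatz_graph(n=200, k=10, p=0.1, seed=42)

acc = nx.average_clustering(G)             # average clustering coefficient (ACC)
apl = nx.average_shortest_path_length(G)   # average shortest path length

# Small-worldness sigma compares both values against an equivalent random graph:
# sigma = (ACC / ACC_random) / (APL / APL_random); values well above 1 indicate a small world.
R = nx.connected_watts_strogatz_graph(n=200, k=10, p=1.0, seed=7)  # fully rewired ~ random
sigma = (acc / nx.average_clustering(R)) / (apl / nx.average_shortest_path_length(R))
print(f"ACC={acc:.3f}  APL={apl:.3f}  sigma={sigma:.2f}")
```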
  • Latent Semantic Analysis for Text-Based Research
    Latent semantic analysis for text-based research
    Peter W. Foltz, New Mexico State University, Las Cruces, New Mexico
    Behavior Research Methods, Instruments, & Computers, 1996, 28 (2), 197-202. Analysis of Semantic and Clinical Data, chaired by Matthew S. McGlone, Lafayette College.

    Latent semantic analysis (LSA) is a statistical model of word usage that permits comparisons of semantic similarity between pieces of textual information. This paper summarizes three experiments that illustrate how LSA may be used in text-based research. Two experiments describe methods for analyzing a subject's essay for determining from what text a subject learned the information and for grading the quality of information cited in the essay. The third experiment describes using LSA to measure the coherence and comprehensibility of texts.

    One of the primary goals in text-comprehension research is to understand what factors influence a reader's ability to extract and retain information from textual material. The typical approach in text-comprehension research is to have subjects read textual material and then have them produce some form of summary, such as answering questions or writing an essay. This summary permits the experimenter to determine what information the [...] A theoretical approach to studying text comprehension has been to develop cognitive models of the reader's representation of the text (e.g., Kintsch, 1988; van Dijk & Kintsch, 1983). In such a model, semantic information from both the text and the reader's summary are represented as sets of semantic components called propositions. Typically, each clause in a text is represented by a single proposition.
  • How Latent Is Latent Semantic Analysis?
    How Latent is Latent Semantic Analysis?
    Peter Wiemer-Hastings, University of Memphis, Department of Psychology, Campus Box 526400, Memphis TN 38152-6400, [email protected]*

    Abstract. Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge.

    In the late 1980's, a group at Bellcore doing research on information retrieval techniques developed a statistical, corpus-based method for retrieving texts. Unlike the simple techniques which rely on weighted matches of keywords in the texts and queries, their method, called Latent Semantic Analysis (LSA), created a high-dimensional, spatial representation of a corpus and allowed texts to be compared geometrically. In the last few years, several researchers have applied this technique to a variety of tasks, including the synonym section of the Test of English as a Foreign Language [Landauer et al., 1997], general lexical acquisition from text [Landauer and Dumais, 1997], selecting texts for students to read [Wolfe et al., 1998], judging the coherence of student essays [Foltz et al., 1998], and the evaluation of student contributions in an intelligent tutoring environment [Wiemer-Hastings et al., 1998; ...].
  • Fouilla: Navigating DBpedia by Topic
    Fouilla: Navigating DBpedia by Topic
    Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frédérique Laforest
    Univ Lyon, UJM Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516, Saint-Etienne, France
    [email protected]

    To cite this version: Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frédérique Laforest. Fouilla: Navigating DBpedia by Topic. CIKM 2018, Oct 2018, Turin, Italy. HAL Id: hal-01860672, submitted on 23 Aug 2018. https://hal.archives-ouvertes.fr/hal-01860672

    Abstract. Navigating large knowledge bases made of billions of triples is very challenging. In this demonstration, we showcase Fouilla, a topical Knowledge Base browser that offers a seamless navigational experience of DBpedia. We propose an original approach that leverages both structural and semantic contents of Wikipedia to enable a topic-oriented filter on DBpedia entities. [...] only the triples that concern this topic. For example, a user is interested in Italy through the prism of Sports while another through the prism of World War II. For each of these topics, the relevant triples of the Italy entity differ. In such circumstances, faceted browsing offers no solution to retrieve the entities relative to a defined topic if the knowledge graph does not explicitly contain an adequate [...]
  • NL Assistant: A Toolkit for Developing Natural Language Applications
    NL Assistant: A Toolkit for Developing Natural Language Applications
    Deborah A. Dahl, Lewis M. Norton, Ahmed Bouzid, and Li Li
    Unisys Corporation

    Introduction. We will be demonstrating a toolkit for developing natural language-based applications and two applications. The goals of this toolkit are to reduce development time and cost for natural language based applications by reducing the amount of linguistic and programming work needed. Linguistic work has been reduced by integrating large-scale linguistic resources: Comlex (Grishman, et al., 1993) and WordNet (Miller, 1990). Programming work is reduced by automating some of the programming tasks. The toolkit is designed for both speech- and text-based interface applications. It runs in a Windows NT environment. Applications can run in either Windows NT or Unix.

    System Components. The NL Assistant toolkit consists of [...] scale lexical resources have been integrated with the toolkit. These include Comlex and WordNet as well as additional, internally developed resources. The second strategy is to provide easy to use editors for entering linguistic information.

    Servers. Lexical information is supplied by four external servers which are accessed by the natural language engine during processing. Syntactic information is supplied by a lexical server based on the 50K word Comlex dictionary available from the Linguistic Data Consortium. Semantic information comes from two servers, a KB server based on the noun portion of WordNet (70K concepts), and a semantics server containing case frame information for 2500 English verbs. A denotations server using unique concept names generated for each WordNet synset at ISI links the words in the lexicon to the concepts in the KB.
  • Document and Topic Models: pLSA and LDA
    Document and Topic Models: pLSA and LDA
    Andrew Levandoski and Jonathan Lobo
    CS 3750 Advanced Topics in Machine Learning, 2 October 2018

    Outline: Topic Models; pLSA (LSA, Model, Fitting via EM, pHITS: link analysis); LDA (Dirichlet distribution, Generative process, Model, Geometric Interpretation, Inference).

    Topic Models: Visual Representation. Topics; Documents; Topic proportions and assignments.

    Topic Models: Importance. For a given corpus, we learn two things: 1. Topics: from the full vocabulary set, we learn important subsets. 2. Topic proportions: we learn what each document is about. This can be viewed as a form of dimensionality reduction: from the large vocabulary set, we extract basis vectors (topics) and represent each document in topic space (topic proportions). Dimensionality is reduced from w_i ∈ Z_V^N to θ ∈ R^K. Topic proportions are useful for several applications including document classification, discovery of semantic structures, sentiment analysis, object localization in images, etc.

    Topic Models: Terminology. Document model: a word is an element in a vocabulary set; a document is a collection of words; a corpus is a collection of documents. Topic model: a topic is a collection of words (a subset of the vocabulary); a document is represented by a (latent) mixture of topics, p(w|d) = Σ_z p(w|z) p(z|d), where z is the topic. Note: a document is a collection of words, not a sequence (the 'bag of words' assumption); in probability, we call this the exchangeability assumption: p(w_1, ..., w_N) = p(w_σ(1), ..., w_σ(N)) for any permutation σ.

    Topic Models: Terminology (cont'd). Represent each document as a vector space. A word is an item from a vocabulary indexed by {1, ..., V}. We represent words using unit-basis vectors.
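The mixture p(w|d) = Σ_z p(w|z) p(z|d) in this excerpt is just a matrix product of per-topic word distributions and per-document topic proportions. A tiny NumPy sketch with made-up numbers for two topics over a four-word vocabulary.

```python
import numpy as np

# Made-up distributions for K = 2 topics over a V = 4 word vocabulary.
p_w_given_z = np.array([
    [0.5, 0.3, 0.1, 0.1],   # topic 0: distribution over words
    [0.1, 0.1, 0.4, 0.4],   # topic 1: distribution over words
])
p_z_given_d = np.array([0.7, 0.3])   # topic proportions of one document d

# p(w|d) = sum_z p(w|z) p(z|d): marginalize the topic out of the mixture.
p_w_given_d = p_z_given_d @ p_w_given_z
print(p_w_given_d, p_w_given_d.sum())   # a proper distribution over the 4 words
```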
  • Vector-Based Word Representations for Sentiment Analysis: a Comparative Study
    Vector-based word representations for sentiment analysis: a comparative study
    M. Paula Villegas, M. José Garciarena Ucelay, Juan Pablo Fernández, Miguel A. Álvarez-Carmona, Marcelo L. Errecalde, Leticia C. Cagnina
    LIDIC, Universidad Nacional de San Luis, San Luis, Argentina; Language Technologies Laboratory, INAOE, Puebla, México; CONICET, Argentina
    {villegasmariapaula74, mjgarciarenaucelay, miguelangel.alvarezcarmona, merrecalde, lcagnina}@gmail.com

    Abstract. New applications of text categorization methods like opinion mining and sentiment analysis, author profiling and plagiarism detection require more elaborate and effective document representation models than classical Information Retrieval approaches like the Bag of Words representation. In this context, word representation models in general, and vector-based word representations in particular, have gained increasing interest to overcome or alleviate some of the limitations that Bag of Words-based representations exhibit. In this article, we analyze the use of several vector-based word representations in a sentiment analysis task with movie reviews. Experimental results show the effectiveness of some vector-based word representations in comparison to standard Bag of Words representations. In particular, the Second Order Attributes representation seems to be very robust and effective because, independently of the classifier used with it, the results are good.

    Keywords: text mining, word-based representations, text categorization, movie reviews, sentiment analysis

    1 Introduction. Selecting a good document representation model is a key aspect in text categorization tasks. The usual approach, named the Bag of Words (BoW) model, considers that words are simple indexes in a term vocabulary. In this model, originated in the information retrieval field, documents are represented as vectors indexed by those words.
  • Similarity Detection Using Latent Semantic Analysis Algorithm
    Similarity Detection Using Latent Semantic Analysis Algorithm
    Priyanka R. Patil (PG Student, Department of Computer Engineering, North Maharashtra University, Jalgaon, Maharashtra, India), Shital A. Patil (Associate Professor, Department of Computer Engineering, North Maharashtra University, Jalgaon, Maharashtra, India)
    International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 6, Issue 8, August 2017

    Abstract. Similarity View is an application for visually comparing and exploring multiple models of text and collections of documents. Friendbook finds the lifestyles of users from user-centric sensor data, measures the similarity of lifestyles between users, and recommends friends to users if their lifestyles have high similarity. Modeling a user's daily life as life documents, their lifestyles are extracted by using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be used for checking research papers, as the assigned reviewer may have insufficient knowledge of the research disciplines or differing subjective views, causing possible misinterpretations. There is an urgent need for an effective and feasible approach to check submitted research papers with the support of automated software. A text mining method can solve the problem of automatically checking research papers semantically. The proposed method finds the similarity of text across a collection of documents by using the Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with a synonym algorithm, which finds synonyms of text index-wise by using the English WordNet dictionary; another algorithm, LSA without synonyms, is used to find the similarity of text based on the index.