Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

16 pages; PDF, 1020 KB

Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Xuerui Wang, Andrew McCallum, Xing Wei
University of Massachusetts, 140 Governors Dr, Amherst, MA 01003
{xuerui, mccallum, xwei}@cs.umass.edu

Abstract

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experimental results showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.

1 Introduction

Although the bag-of-words assumption is prevalent in document classification and topic models, the great majority of natural language processing methods represent word order, including n-gram language models for speech recognition, finite-state models for information extraction and context-free grammars for parsing. Word order is not only important for syntax, but also important for lexical meaning. A collocation is a phrase with meaning beyond its individual words. For example, the phrase "white house" carries a special meaning beyond the appearance of its individual words, whereas "yellow house" does not. Note, however, that whether or not a phrase is a collocation may depend on the topic context. In the context of a document about real estate, "white house" may not be a collocation.

N-gram phrases are fundamentally important in many areas of natural language processing and text mining, including parsing, machine translation and information retrieval. In general, a phrase as a whole carries more information than the sum of its individual components, and phrases are therefore much more crucial than individual words in determining the topics of collections. Most topic models, such as latent Dirichlet allocation (LDA) [2], nevertheless assume that words are generated independently of each other, i.e., under the bag-of-words assumption. Adding phrases could be useful in certain contexts, but it increases a model's complexity, and this potential over-complication has led these topic models to ignore phrases completely. It is true that models built on the bag-of-words assumption have enjoyed great success and attracted a lot of interest from researchers with different backgrounds. We believe, however, that a topic model that considers phrases would definitely be more useful in certain applications.

Assume that we conduct topic analysis on a large collection of research papers. The acknowledgment sections of research papers have a distinctive vocabulary. Not surprisingly, we would end up with a particular topic on acknowledgments (or funding agencies), since many papers have an acknowledgment section that is not tightly coupled with the content of the paper. One might therefore expect to find words such as "thank", "support" and "grant" in a single topic. One might be very confused, however, to find words like "health" and "science" in the same topic, unless they are presented in context: "National Institutes of Health" and "National Science Foundation".

Phrases often have specialized meaning, but not always. For instance, "neural networks" is considered a phrase because of its frequent use as a fixed expression. However, it denotes two distinct concepts: biological neural networks in neuroscience and artificial neural networks in modern usage. Without consulting the context in which the term is located, it is hard to determine its actual meaning. In many situations, topic is very useful for accurately capturing the meaning. Furthermore, topic can play a role in phrase discovery. Consider learning English: a beginner usually has difficulty telling "strong tea" from "powerful tea" [15], both of which are grammatically correct. The topic associated with "tea" might help to discover the misuse of "powerful".

In this paper, we propose a new topical n-gram (TNG) model that automatically determines unigram words and phrases based on context and assigns a mixture of topics to both individual words and n-gram phrases. The ability to form phrases only where appropriate is unique to our model, distinguishing it from the traditional collocation discovery methods discussed in Section 3, where a discovered phrase is always treated as a collocation regardless of the context (which would possibly make us incorrectly conclude that "white house" remains a phrase in a document about real estate). Thus, TNG is not only a topic model that uses phrases, but also helps linguists discover meaningful phrases in the right context, in a completely probabilistic manner. We show examples of extracted phrases and more interpretable topics on the NIPS data, and, in a text mining application, we present better information retrieval performance on an ad-hoc retrieval task over a TREC collection.

2 N-gram based Topic Models

Before presenting our topical n-gram model, we first describe two related n-gram models. The notation used in this paper is listed in Table 1, and the graphical models are shown in Figure 1. For simplicity, all the models discussed in this section make the 1st-order Markov assumption, that is, they are actually bigram models. However, all the models have the ability to "model" higher-order n-grams (n > 2) by concatenating consecutive bigrams; a short sketch of this concatenation follows Table 1.

Table 1. Notation used in this paper

T: number of topics
D: number of documents
W: number of unique words
N_d: number of word tokens in document d
z_i^(d): the topic associated with the i-th token in document d
x_i^(d): the bigram status between the (i-1)-th token and the i-th token in document d
w_i^(d): the i-th token in document d
θ^(d): the multinomial (Discrete) distribution of topics w.r.t. document d
φ_z: the multinomial (Discrete) unigram distribution of words w.r.t. topic z
ψ_v: in Figure 1(b), the binomial (Bernoulli) distribution of status variables w.r.t. previous word v
ψ_zv: in Figure 1(c), the binomial (Bernoulli) distribution of status variables w.r.t. previous topic z/word v
σ_zv: in Figure 1(a) and (c), the multinomial (Discrete) bigram distribution of words w.r.t. topic z/word v
σ_v: in Figure 1(b), the multinomial (Discrete) bigram distribution of words w.r.t. previous word v
α: Dirichlet prior of θ
β: Dirichlet prior of φ
γ: Dirichlet prior of ψ
δ: Dirichlet prior of σ
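To make the bigram-status bookkeeping in Table 1 concrete, the short sketch below shows how a token sequence and its sampled status variables x_i are stitched into unigrams and longer phrases, which is the "concatenating consecutive bigrams" behavior mentioned above. This is an illustration only; the function name and example tokens are ours, not the paper's.

```python
def assemble_phrases(tokens, x):
    """Concatenate consecutive bigrams into longer phrases.

    tokens : word tokens in textual order
    x      : bigram-status indicators; x[i] = 1 means token i forms a bigram
             with token i-1 (x[0] is 0, since only a unigram is allowed at
             the start of a document).
    """
    phrases = []
    for token, status in zip(tokens, x):
        if status == 1 and phrases:
            phrases[-1] += " " + token   # extend the phrase in progress
        else:
            phrases.append(token)        # start a new unigram
    return phrases

# Three successive bigram links yield a single four-word phrase.
print(assemble_phrases(
    ["the", "white", "house", "press", "conference", "ended"],
    [0, 0, 1, 1, 1, 0]))
# ['the', 'white house press conference', 'ended']
```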
2.1 Bigram Topic Model (BTM)

Recently, Wallach developed a bigram topic model [22] on the basis of the hierarchical Dirichlet language model [14], by incorporating the concept of topic into bigram models. This model is one solution to the "neural network" example in Section 1. We assume a dummy word w_0 exists at the beginning of each document. The graphical model presentation of this model is shown in Figure 1(a). The generative process of this model can be described as follows:

1. draw Discrete distributions σ_zw from a Dirichlet prior δ for each topic z and each word w;
2. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:
   (a) draw z_i^(d) from Discrete θ^(d); and
   (b) draw w_i^(d) from Discrete σ_{z_i^(d) w_{i-1}^(d)}.

[Figure 1(a): Bigram topic model. The graphical model shows the per-document topic distribution θ (prior α), the topic chain z_{i-1}, z_i, z_{i+1}, z_{i+2}, ..., the word chain w_{i-1}, w_i, w_{i+1}, w_{i+2}, ..., and the T × W bigram distributions σ (prior δ), replicated over D documents.]

2.2 LDA Collocation Model (LDACOL)

Starting from the LDA topic model, the LDA collocation model [20] (not yet published) introduces a new set of random variables for bigram status, x (x_i = 1: w_{i-1} and w_i form a bigram; x_i = 0: they do not), that denote whether a bigram can be formed with the previous token, in addition to the two sets of random variables z and w in LDA. Thus, it has the power to decide whether to generate a bigram or a unigram. In this respect, it is more realistic than the bigram topic model, which always generates bigrams. After all, unigrams are the major components of a document. We assume the status variable x_1 is observed, and only a unigram is allowed at the beginning of a document. If we want to put more constraints into the model (e.g., no bigram is allowed across a sentence/paragraph boundary; only a unigram can be considered for the next word after a stop word is removed; etc.), we can assume that the corresponding status variables are observed as well. This model's graphical model presentation is shown in Figure 1(b).

The generative process of the LDA collocation model is described as follows:

1. draw Discrete distributions φ_z from a Dirichlet prior β for each topic z;
2. draw Bernoulli distributions ψ_w from a Beta prior γ for each word w;
3. draw Discrete distributions σ_w from a Dirichlet prior δ for each word w;
4. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:
   (a) draw x_i^(d) from Bernoulli ψ_{w_{i-1}^(d)};
   (b) draw z_i^(d) from Discrete θ^(d); and
   (c) draw w_i^(d) from Discrete σ_{w_{i-1}^(d)} if x_i^(d) = 1; else draw w_i^(d) from Discrete φ_{z_i^(d)}.
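As a concrete illustration of the generative story just described, here is a minimal forward-sampling sketch of the LDA collocation model. It is not the authors' code: the toy vocabulary size, the symmetric priors, and all variable names are assumptions made for the example, and inference (e.g., Gibbs sampling) is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
W, T = 6, 2                                        # toy vocabulary and topic counts
alpha, beta, gamma, delta = 0.5, 0.1, (1.0, 1.0), 0.1

# Corpus-level parameters, drawn once.
phi = rng.dirichlet(np.full(W, beta), size=T)      # phi_z: topic-specific unigram distributions
psi = rng.beta(gamma[0], gamma[1], size=W)         # psi_v: P(x_i = 1 | previous word v)
sigma = rng.dirichlet(np.full(W, delta), size=W)   # sigma_v: bigram distribution given previous word v

def generate_document(n_tokens):
    theta = rng.dirichlet(np.full(T, alpha))       # theta^(d): per-document topic proportions
    words, status = [], []
    prev = 0                                       # dummy word w_0 at the start of the document
    for i in range(n_tokens):
        x = 0 if i == 0 else rng.binomial(1, psi[prev])  # x_1 observed: unigram at the start
        z = rng.choice(T, p=theta)
        if x == 1:
            w = rng.choice(W, p=sigma[prev])       # bigram continuation: condition on previous word
        else:
            w = rng.choice(W, p=phi[z])            # unigram: condition on the sampled topic
        words.append(int(w)); status.append(int(x))
        prev = w
    return words, status

print(generate_document(10))
```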
Recommended publications
  • A Way with Words: Recent Advances in Lexical Theory and Analysis: a Festschrift for Patrick Hanks
    A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks Gilles-Maurice de Schryver (editor) (Ghent University and University of the Western Cape) Kampala: Menha Publishers, 2010, vii+375 pp; ISBN 978-9970-10-101-6, €59.95 Reviewed by Paul Cook University of Toronto In his introduction to this collection of articles dedicated to Patrick Hanks, de Schryver presents a quote from Atkins referring to Hanks as "the ideal lexicographer's lexicographer." Indeed, Hanks has had a formidable career in lexicography, including playing an editorial role in the production of four major English dictionaries. But Hanks's achievements reach far beyond lexicography; in particular, Hanks has made numerous contributions to computational linguistics. Hanks is co-author of the tremendously influential paper "Word association norms, mutual information, and lexicography" (Church and Hanks 1989) and maintains close ties to our field. Furthermore, Hanks has advanced the understanding of the relationship between words and their meanings in text through his theory of norms and exploitations (Hanks forthcoming). The range of Hanks's interests is reflected in the authors and topics of the articles in this Festschrift; this review concentrates more on those articles that are likely to be of most interest to computational linguists, and does not discuss some articles that appeal primarily to a lexicographical audience. Following the introduction to Hanks and his career by de Schryver, the collection is split into three parts: Theoretical Aspects and Background, Computing Lexical Relations, and Lexical Analysis and Dictionary Writing. 1. Part I: Theoretical Aspects and Background Part I begins with an unfinished article by the late John Sinclair, in which he begins to put forward an argument that multi-word units of meaning should be given the same status in dictionaries as ordinary headwords.
  • Finetuning Pre-Trained Language Models for Sentiment Classification of COVID19 Tweets
    Technological University Dublin ARROW@TU Dublin Dissertations School of Computer Sciences 2020 Finetuning Pre-Trained Language Models for Sentiment Classification of COVID19 Tweets Arjun Dussa Technological University Dublin Follow this and additional works at: https://arrow.tudublin.ie/scschcomdis Part of the Computer Engineering Commons Recommended Citation Dussa, A. (2020) Finetuning Pre-trained language models for sentiment classification of COVID19 tweets, Dissertation, Technological University Dublin. doi:10.21427/fhx8-vk25 This Dissertation is brought to you for free and open access by the School of Computer Sciences at ARROW@TU Dublin. It has been accepted for inclusion in Dissertations by an authorized administrator of ARROW@TU Dublin. For more information, please contact [email protected], [email protected]. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License Finetuning Pre-trained language models for sentiment classification of COVID19 tweets Arjun Dussa A dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computer Science (Data Analytics) September 2020 Declaration I certify that this dissertation which I now submit for examination for the award of MSc in Computing (Data Analytics), is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. This dissertation was prepared according to the regulations for postgraduate study of the Technological University Dublin and has not been submitted in whole or part for an award in any other Institute or University.
  • An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
    Imperial College London, Department of Computing. An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification. Author: Clavance Lim. Supervisors: Prof Alessandra Russo, Nuri Cingillioglu. Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London, September 2019. Contents: Abstract; Acknowledgements; 1 Introduction (1.1 Motivation, 1.2 Aims and objectives, 1.3 Outline); 2 Background (2.1 Overview: 2.1.1 Text classification, 2.1.2 Training, validation and test sets, 2.1.3 Cross validation, 2.1.4 Hyperparameter optimization, 2.1.5 Evaluation metrics; 2.2 Text classification pipeline; 2.3 Feature extraction: 2.3.1 Count vectorizer, 2.3.2 TF-IDF vectorizer, 2.3.3 Word embeddings; 2.4 Classifiers: 2.4.1 Naive Bayes classifier, 2.4.2 Decision tree, 2.4.3 Random forest, 2.4.4 Logistic regression, 2.4.5 Support vector machines, 2.4.6 k-Nearest Neighbours ...)
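The contents above walk through a standard text-classification pipeline (count/TF-IDF features, classifiers, cross validation). As a generic illustration of how those pieces fit together, not code from the thesis, here is a minimal scikit-learn sketch; the toy documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = ["the contract was breached", "the tenant paid rent late",
        "the defendant denies liability", "the lease was renewed"] * 5
labels = [1, 0, 1, 0] * 5   # 1 = litigation-related, 0 = not (toy labels)

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram TF-IDF features
    LogisticRegression(max_iter=1000),
)
# 5-fold cross validation over the toy corpus.
print(cross_val_score(pipeline, docs, labels, cv=5).mean())
```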
  • Fasttext.Zip: Compressing Text Classification Models
    Under review as a conference paper at ICLR 2017 FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou & Tomas Mikolov Facebook AI Research {ajoulin,egrave,bojanowski,matthijs,rvj,[email protected] ABSTRACT We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Combined with simple approaches specifically adapted to text classification, our approach derived from fastText requires, at test time, only a fraction of the memory compared to the original fastText, without noticeably sacrificing quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. 1 INTRODUCTION Text classification is an important problem in Natural Language Processing (NLP). Real world use-cases include spam filtering or e-mail categorization. It is a core component in more complex systems such as search and ranking. Recently, deep learning techniques based on neural networks have achieved state of the art results in various NLP applications. One of the main successes of deep learning is due to the effectiveness of recurrent networks for language modeling and their application to speech recognition and machine translation (Mikolov, 2012).
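The abstract above centers on product quantization (PQ) for shrinking the word-embedding matrix. The sketch below is a generic PQ illustration, not the paper's implementation: the matrix size, the split into 8 sub-blocks, the 256-centroid codebooks, and the use of scikit-learn's KMeans are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)  # toy word-embedding matrix

n_subvectors, n_centroids = 8, 256           # 64 dims split into 8 sub-blocks of 8 dims
sub_dim = embeddings.shape[1] // n_subvectors

codebooks, codes = [], []
for b in range(n_subvectors):
    block = embeddings[:, b * sub_dim:(b + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)       # 256 x sub_dim floats per block
    codes.append(km.labels_.astype(np.uint8))   # one byte per word per block

codes = np.stack(codes, axis=1)                 # each word stored as 8 bytes instead of 256

def decode(word_id):
    """Reconstruct an embedding from its per-block codes and codebooks."""
    return np.concatenate([codebooks[b][codes[word_id, b]] for b in range(n_subvectors)])

err = np.linalg.norm(decode(0) - embeddings[0]) / np.linalg.norm(embeddings[0])
print(f"compressed 256 bytes -> 8 bytes per word, relative error {err:.2f}")
```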
  • Lexical Analysis
    From words to numbers: parsing, tokenization, extracting information. Natural Language Processing, Piotr Fulmański. Lecture goals: • Tokenizing your text into words and n-grams (tokens) • Compressing your token vocabulary with stemming and lemmatization • Building a vector representation of a statement. Natural Language Processing in Action by Hobson Lane, Cole Howard, and Hannes Max Hapke, Manning Publications, 2019. Terminology • A phoneme is a unit of sound that distinguishes one word from another in a particular language. • In linguistics, a word of a spoken language can be defined as the smallest sequence of phonemes that can be uttered in isolation with objective or practical meaning. • The concept of "word" is usually distinguished from that of a morpheme. • Every word is composed of one or more morphemes. • A morpheme is the smallest meaningful unit in a language even if it will not stand on its own. A morpheme is not necessarily the same as a word. The main difference between a morpheme and a word is that a morpheme sometimes does not stand alone, but a word, by definition, always stands alone. SEGMENTATION • Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. • Word segmentation is the problem of dividing a string of written language into its component words. (Text segmentation, retrieved 2020-10-20, https://en.wikipedia.org/wiki/Text_segmentation) SEGMENTATION IS NOT SO EASY • Trying to resolve the question of what a word is and how to divide up text into words, we can face many "exceptions": • Is "ice cream" one word or two to you? Don't both words have entries in your mental dictionary that are separate from the compound word "ice cream"? • What about the contraction "don't"? Should that string of characters be split into one or two "packets of meaning"? • The single statement "Don't!" means "Don't you do that!" or "You, do not do that!" That's three hidden packets of meaning for a total of five tokens you'd like your machine to know about.
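As a hands-on companion to the tokenization and n-gram bullets above, here is a tiny, deliberately crude sketch in plain Python (not from the lecture or the book). The regular expression and the example sentence are our own choices, and it sidesteps the harder segmentation questions raised above: it keeps "don't" as a single token and leaves "ice cream" as two.

```python
import re

def tokenize(text):
    """Lowercase and pull out runs of letters, digits and apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Don't both words have entries for ice cream?")
print(tokens)              # ["don't", 'both', 'words', 'have', 'entries', 'for', 'ice', 'cream']
print(ngrams(tokens, 2))   # bigrams, e.g. ('ice', 'cream')
```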
  • WELDA: Enhancing Topic Models by Incorporating Local Word Context
    WELDA: Enhancing Topic Models by Incorporating Local Word Context Stefan Bunk and Ralf Krestel, Hasso Plattner Institute, Potsdam, Germany, [email protected] [email protected] ABSTRACT The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks. CCS CONCEPTS • Information systems → Document topic models; Document collection models; • Applied computing → Document analysis; KEYWORDS topic models, word embeddings, document representations [Figure 1: Illustration of an LDA topic in the word embedding space. The gray isolines show the learned probability distribution for the topic. Exchanging topic words having low probability in the embedding space (red minuses) with high probability words (green pluses) leads to a superior LDA topic.] ..."word by the company it keeps" [11]. In other words, if two words occur in similar contexts, they are similar.
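The exchange step described in the abstract (estimate a topic's distribution in embedding space, then swap low-probability topic words for high-probability ones) can be caricatured in a few lines. This is a rough sketch under our own simplifying assumptions: a distance-based density proxy, a single deterministic swap instead of Gibbs sampling, and made-up vectors. It does not reproduce WELDA's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = [f"w{i}" for i in range(50)]
vectors = rng.normal(size=(50, 16))              # toy word embeddings
topic_words = list(range(10))                    # indices of a topic's current top words

# Estimate the topic's location in embedding space as the mean of its top words.
mu = vectors[topic_words].mean(axis=0)
densities = -np.linalg.norm(vectors[topic_words] - mu, axis=1)  # crude stand-in for log-density

# Exchange the lowest-density topic word for the nearest non-topic word to the topic mean.
worst = topic_words[int(np.argmin(densities))]
candidates = [i for i in range(len(vocab)) if i not in topic_words]
nearest = min(candidates, key=lambda i: np.linalg.norm(vectors[i] - mu))
print(f"swap out {vocab[worst]}, swap in {vocab[nearest]}")
```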
  • Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms
    Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms Aleksander Wawer and Agnieszka Mykowiecka, Institute of Computer Science PAS, Jana Kazimierza 5, 01-248 Warsaw, Poland, [email protected] [email protected] Abstract This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first one is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both meth... Unsupervised WSD algorithms aim at resolving word ambiguity without the use of annotated corpora. There are two popular categories of knowledge-based algorithms. The first one originates from the Lesk (1986) algorithm, and exploits the number of common words in two sense definitions (glosses) to select the proper meaning in a context. The Lesk algorithm relies on the set of dictionary entries and the information about the context in which the word occurs. In (Basile et al., 2014) the concept of overlap is replaced by similarity represented by a DSM model. The authors compute the overlap between the gloss of the meaning and the context as a similarity measure between their corresponding vector representations in a semantic space.
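The similarity-based variant of Lesk mentioned above (compare the gloss of each sense with the context in a vector space) can be illustrated with a toy example. The 3-dimensional vectors, the glosses, and the function names below are invented for the sketch and have nothing to do with the paper's actual models or data.

```python
import numpy as np

def embed(tokens, vectors):
    """Average the embeddings of known tokens (a simple bag-of-vectors encoder)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(context, glosses, vectors):
    """Pick the sense whose gloss is most similar to the context in embedding space."""
    ctx = embed(context, vectors)
    return max(glosses, key=lambda sense: cosine(ctx, embed(glosses[sense], vectors)))

# Toy 3-d "embeddings" and two senses of "bank"; a real system would use trained vectors.
vectors = {"money": np.array([1.0, 0.1, 0.0]), "deposit": np.array([0.9, 0.2, 0.0]),
           "loan": np.array([0.8, 0.0, 0.1]), "river": np.array([0.0, 1.0, 0.1]),
           "water": np.array([0.1, 0.9, 0.2]), "shore": np.array([0.0, 0.8, 0.3])}
glosses = {"bank/finance": ["money", "deposit", "loan"],
           "bank/river": ["river", "water", "shore"]}
print(disambiguate(["deposit", "money"], glosses, vectors))   # -> bank/finance
```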
  • Investigating Classification for Natural Language Processing Tasks
    UCAM-CL-TR-721 Technical Report ISSN 1476-2986 Number 721 Computer Laboratory Investigating classification for natural language processing tasks Ben W. Medlock June 2008 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom, phone +44 1223 763500, http://www.cl.cam.ac.uk/ © 2008 Ben W. Medlock This technical report is based on a dissertation submitted September 2007 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Fitzwilliam College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Abstract This report investigates the application of classification techniques to four natural language processing (NLP) tasks. The classification paradigm falls within the family of statistical and machine learning (ML) methods and consists of a framework within which a mechanical 'learner' induces a functional mapping between elements drawn from a particular sample space and a set of designated target classes. It is applicable to a wide range of NLP problems and has met with a great deal of success due to its flexibility and firm theoretical foundations. The first task we investigate, topic classification, is firmly established within the NLP/ML communities as a benchmark application for classification research. Our aim is to arrive at a deeper understanding of how class granularity affects classification accuracy and to assess the impact of representational issues on different classification models. Our second task, content-based spam filtering, is a highly topical application for classification techniques due to the ever-worsening problem of unsolicited email.
  • Generative Topic Embedding: a Continuous Representation of Documents
    Generative Topic Embedding: a Continuous Representation of Documents Shaohua Li 1,2, Tat-Seng Chua 1, Jun Zhu 3, Chunyan Miao 2 [email protected] [email protected] [email protected] [email protected] 1. School of Computing, National University of Singapore 2. Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY) 3. Department of Computer Science and Technology, Tsinghua University Abstract Word embedding maps words into a low-dimensional continuous embedding space by exploiting the local word collocation patterns in a small context window. On the other hand, topic modeling maps documents onto a low-dimensional topic space, by utilizing the global word collocation patterns in the same document. These two types of patterns are complementary. In this paper, we propose a generative topic embedding model to combine the two types of patterns. In our model, topics are represented by embedding vectors, and are shared across documents. ...as continuous vectors in a low-dimensional embedding space (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). The learned embedding for a word encodes its semantic/syntactic relatedness with other words, by utilizing local word collocation patterns. In each method, one core component is the embedding link function, which predicts a word's distribution given its context words, parameterized by their embeddings. When it comes to documents, we wish to find a method to encode their overall semantics. Given the embeddings of each word in a document, we can imagine the document as a "bag-of-vectors". Related words in the document point in similar di...
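Two ingredients named above, the "bag-of-vectors" view of a document and an embedding link function that predicts a word distribution from context embeddings, can be sketched in a few lines of NumPy. The dimensions, the mean-pooling choice, and the toy data are our assumptions for illustration, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 8, 4
E = rng.normal(size=(V, dim))      # toy word embeddings, one row per vocabulary word

def doc_vector(word_ids):
    """Bag-of-vectors: represent a document by the mean of its word embeddings."""
    return E[word_ids].mean(axis=0)

def link_function(context_ids):
    """Softmax embedding link: P(w | context) from dot products between each
    word's embedding and the averaged context embedding."""
    scores = E @ doc_vector(context_ids)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

doc = [0, 3, 3, 5]
print(doc_vector(doc))                 # continuous document representation
print(link_function(doc).round(3))     # word distribution given the context
```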
  • Linked Data Triples Enhance Document Relevance Classification
    applied sciences Article Linked Data Triples Enhance Document Relevance Classification Dinesh Nagumothu *, Peter W. Eklund, Bahadorreza Ofoghi and Mohamed Reda Bouadjenek School of Information Technology, Deakin University, Geelong, VIC 3220, Australia; [email protected] (P.W.E.); [email protected] (B.O.); [email protected] (M.R.B.) * Correspondence: [email protected] Abstract: Standardized approaches to relevance classification in information retrieval use generative statistical models to identify the presence or absence of certain topics that might make a document relevant to the searcher. These approaches have been used to better predict relevance on the basis of what the document is "about", rather than a simple-minded analysis of the bag of words contained within the document. In more recent times, this idea has been extended by using pre-trained deep learning models and text representations, such as GloVe or BERT. These use an external corpus as a knowledge-base that conditions the model to help predict what a document is about. This paper adopts a hybrid approach that leverages the structure of knowledge embedded in a corpus. In particular, the paper reports on experiments where linked data triples (subject-predicate-object), constructed from natural language elements, are derived from deep learning. These are evaluated as additional latent semantic features for a relevant document classifier in a customized news-feed website. The research is a synthesis of current thinking in deep learning models in NLP and information retrieval and the predicate structure used in semantic web research. Our experiments indicate that linked data triples increased the F-score of the baseline GloVe representations by 6%...
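As a generic illustration of the idea of adding (subject, predicate, object) triples as extra features alongside ordinary text features in a relevance classifier, here is a small scikit-learn sketch. It is not the paper's system: the hand-written triples, toy labels, and TF-IDF-over-triple-strings encoding are assumptions made for the example (the paper derives its triples from natural language with deep learning models).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

docs = [
    {"text": "central bank raises interest rates", "triples": [("bank", "raises", "rates")]},
    {"text": "local team wins the championship", "triples": [("team", "wins", "championship")]},
    {"text": "bank cuts rates amid slowdown", "triples": [("bank", "cuts", "rates")]},
    {"text": "striker scores twice in final", "triples": [("striker", "scores", "goal")]},
]
labels = [1, 0, 1, 0]   # 1 = relevant to a finance news feed (toy labels)

get_text = FunctionTransformer(lambda X: [d["text"] for d in X])
get_triples = FunctionTransformer(
    lambda X: [" ".join("_".join(t) for t in d["triples"]) for d in X])

# Concatenate bag-of-words features with triple features.
features = FeatureUnion([
    ("text", Pipeline([("select", get_text), ("tfidf", TfidfVectorizer())])),
    ("triples", Pipeline([("select", get_triples), ("tfidf", TfidfVectorizer())])),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())]).fit(docs, labels)
print(model.predict([{"text": "bank holds rates steady",
                      "triples": [("bank", "holds", "rates")]}]))
```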
  • Document and Topic Models: Plsa
    Document and Topic Models: pLSA and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline • Topic Models • pLSA • LSA • Model • Fitting via EM • pHITS: link analysis • LDA • Dirichlet distribution • Generative process • Model • Geometric Interpretation • Inference Topic Models: Visual Representation [slide figure: topics, documents, and per-document topic proportions and assignments] Topic Models: Importance • For a given corpus, we learn two things: 1. Topic: from full vocabulary set, we learn important subsets 2. Topic proportion: we learn what each document is about • This can be viewed as a form of dimensionality reduction • From large vocabulary set, extract basis vectors (topics) • Represent document in topic space (topic proportions) • Dimensionality is reduced from w ∈ Z_V^N to θ ∈ R^K • Topic proportion is useful for several applications including document classification, discovery of semantic structures, sentiment analysis, object localization in images, etc. Topic Models: Terminology • Document Model • Word: element in a vocabulary set • Document: collection of words • Corpus: collection of documents • Topic Model • Topic: collection of words (subset of vocabulary) • Document is represented by (latent) mixture of topics • p(w|d) = Σ_z p(w|z) p(z|d) (z: topic) • Note: document is a collection of words (not a sequence) • 'Bag of words' assumption • In probability, we call this the exchangeability assumption • p(w_1, ..., w_N) = p(w_σ(1), ..., w_σ(N)) (σ: permutation) Topic Models: Terminology (cont'd) • Represent each document as a vector space • A word is an item from a vocabulary indexed by {1, ..., V}. We represent words using unit-basis vectors.
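The mixture identity above, p(w|d) = Σ_z p(w|z) p(z|d), can be checked numerically in a few lines. The two-topic, four-word example below is made up purely to show the computation.

```python
import numpy as np

# Toy setup: V = 4 words, K = 2 topics, one document.
p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],    # topic 0 word distribution
                        [0.1, 0.1, 0.4, 0.4]])   # topic 1 word distribution
p_z_given_d = np.array([0.25, 0.75])             # this document is mostly topic 1

# p(w|d) = sum_z p(w|z) p(z|d), i.e. a mixture of the topic word distributions.
p_w_given_d = p_z_given_d @ p_w_given_z
print(p_w_given_d)            # [0.2   0.15  0.325 0.325]
print(p_w_given_d.sum())      # 1.0 (still a valid distribution over the vocabulary)
```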
  • Lexical Sense Labeling and Sentiment Potential Analysis Using Corpus-Based Dependency Graph
    mathematics Article Lexical Sense Labeling and Sentiment Potential Analysis Using Corpus-Based Dependency Graph Tajana Ban Kirigin 1,*, Sanda Bujačić Babić 1 and Benedikt Perak 2 1 Department of Mathematics, University of Rijeka, R. Matejčić 2, 51000 Rijeka, Croatia; [email protected] 2 Faculty of Humanities and Social Sciences, University of Rijeka, Sveučilišna Avenija 4, 51000 Rijeka, Croatia; [email protected] * Correspondence: [email protected] Abstract: This paper describes a graph method for labeling word senses and identifying lexical sentiment potential by integrating the corpus-based syntactic-semantic dependency graph layer, lexical semantic and sentiment dictionaries. The method, implemented as the ConGraCNet application on different languages and corpora, projects a semantic function onto a particular syntactical dependency layer and constructs a seed lexeme graph with collocates of high conceptual similarity. The seed lexeme graph is clustered into subgraphs that reveal the polysemous semantic nature of a lexeme in a corpus. The construction of the WordNet hypernym graph provides a set of synset labels that generalize the senses for each lexical cluster. By integrating sentiment dictionaries, we introduce graph propagation methods for sentiment analysis. Original dictionary sentiment values are integrated into the ConGraCNet lexical graph to compute sentiment values of node lexemes and lexical clusters, and identify the sentiment potential of lexemes with respect to a corpus. The method can be used to resolve sparseness of sentiment dictionaries and enrich the sentiment evaluation of lexical structures in sentiment dictionaries by revealing the relative sentiment potential of polysemous lexemes with respect to a specific corpus.
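The core pipeline described above (build a graph of a seed lexeme's collocates weighted by similarity, then cluster it into subgraphs that stand in for senses) can be imitated with a toy graph. The example graph, the similarity weights, and the use of networkx's greedy modularity communities are all our own assumptions for illustration; ConGraCNet's actual construction and clustering are more involved.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy collocate graph for the seed lexeme "bank": edges connect collocates that
# are conceptually similar, with a similarity weight.
G = nx.Graph()
G.add_weighted_edges_from([
    ("loan", "deposit", 0.9), ("deposit", "interest", 0.8), ("loan", "interest", 0.7),
    ("river", "shore", 0.9), ("shore", "water", 0.8), ("river", "water", 0.7),
    ("deposit", "water", 0.1),   # weak cross-sense link
])

# Cluster the seed lexeme graph into subgraphs; each community approximates one sense.
communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(f"sense cluster {i}: {sorted(community)}")
# Expected: one financial cluster and one geographical cluster for "bank".
```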