Discriminative Topic Mining Via Category-Name Guided Text Embedding

Total Page:16

File Type:pdf, Size:1020Kb

Discriminative Topic Mining Via Category-Name Guided Text Embedding Discriminative Topic Mining via Category-Name Guided Text Embedding Yu Meng1∗, Jiaxin Huang1∗, Guangyuan Wang1, Zihan Wang1, Chao Zhang2, Yu Zhang1, Jiawei Han1 1Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA 2College of Computing, Georgia Institute of Technology, GA, USA 1{yumeng5, jiaxinh3, gwang10, zihanw2, yuz9, hanj}@illinois.edu [email protected] ABSTRACT 1 INTRODUCTION Mining a set of meaningful and distinctive topics automatically To help users effectively and efficiently comprehend a large setof from massive text corpora has broad applications. Existing topic text documents, it is of great interest to generate a set of meaning- models, however, typically work in a purely unsupervised way, ful and coherent topics automatically from a given corpus. Topic which often generate topics that do not fit users’ particular needs models [6, 19] are such unsupervised statistical tools that discover and yield suboptimal performance on downstream tasks. We pro- latent topics from text corpora. Due to their effectiveness in uncov- pose a new task, discriminative topic mining, which leverages a set ering hidden semantic structure in text collections, topic models of user-provided category names to mine discriminative topics from are widely used in text mining [15, 29] and information retrieval text corpora. This new task not only helps a user understand clearly tasks [14, 52]. and distinctively the topics he/she is most interested in, but also Despite of their effectiveness, traditional topic models suffer from benefits directly keyword-driven classification tasks. We develop two noteworthy limitations: (1) Failure to incorporate user guidance. CatE, a novel category-name guided text embedding method for dis- Topic models tend to retrieve the most general and prominent topics criminative topic mining, which effectively leverages minimal user from a text collection, which may not be of a user’s particular guidance to learn a discriminative embedding space and discover interest, or provide a skewed and biased summarization of the category representative terms in an iterative manner. We conduct corpus. (2) Failure to enforce distinctiveness among retrieved topics. a comprehensive set of experiments to show that CatE mines high- Concepts are most effectively interpreted via their uniquely defining quality set of topics guided by category names only, and benefits a features. For example, Egypt is known for pyramids and China is variety of downstream applications including weakly-supervised known for the Great Wall. Topic models, however, do not impose classification and lexical entailment direction identification1. disriminative constraints, resulting in vague interpretations of the retrieved topics. Table 1 demonstrates three retrieved topics from CCS CONCEPTS the New York Times (NYT) annotated corpus [42] via LDA [6]. We • Information systems ! Data mining; Document topic models; can see that it is difficult to clearly define the meaning of thethree Clustering and classification; topics due to an overlap of their semantics (e.g., the term “united states” appears in all three topics). KEYWORDS Table 1: LDA retrieved topics on NYT dataset. The meanings Topic Mining, Discriminative Analysis, Text Embedding, Text Clas- of the retrieved topics have overlap with each other. sification Topic 1 Topic 2 Topic 3 ACM Reference Format: Yu Meng1∗, Jiaxin Huang1∗, Guangyuan Wang1, Zihan Wang1, Chao Zhang2, canada, united states sports, united states united states, iraq Yu Zhang1, Jiawei Han1. 2020. Discriminative Topic Mining via Category- canadian, economy olympic, games government, president Name Guided Text Embedding. In Proceedings of The Web Conference 2020 arXiv:1908.07162v2 [cs.CL] 27 Jan 2020 (WWW ’20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, In order to incorporate user knowledge or preference into topic 11 pages. https://doi.org/10.1145/3366423.3380278 discovery for mining distinctive topics from a text corpus, we pro- pose a new task, Discriminative Topic Mining, which takes only 1Source code can be found at https://github.com/yumeng5/CatE. a set of category names as user guidance, and aims to retrieve a set of representative and discriminative terms under each pro- vided category. In many cases, a user may have a specific set of ∗Equal Contribution. interested topics in mind, or have prior knowledge about the po- tential topics in a corpus. Such user interest or prior knowledge This paper is published under the Creative Commons Attribution 4.0 International may come naturally in the form of a set of category names that (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. could be used to guide the topic discovery process, resulting in WWW ’20, April 20–24, 2020, Taipei, Taiwan more desirable results that better cater to a user’s need and fit spe- © 2020 IW3C2 (International World Wide Web Conference Committee), published cific downstream applications. For example, a user may provide under Creative Commons CC BY 4.0 License. ACM ISBN 978-1-4503-7023-3/20/04. several country names and rely on discriminative topic mining to https://doi.org/10.1145/3366423.3380278 retrieve each country’s provinces, cities, currency, etc. from a text corpus. We will show that this new task not only helps the user to from D for each category ci such that each term in Si semantically clearly and distinctively understand his/her interested topics, but belongs to and only belongs to category ci . also benefits keywords-driven classification tasks. Example 1. Given a set of country names, c : “The United States”, There exist previous studies that attempt to incorporate prior 1 c : “France” and c : “Canada”, it is correct to retrieve “Ontario” knowledge into topic models. Along one line of work, supervised 2 3 as an element in S , because Ontario is a province in Canada and topic models such as Supervised LDA [5] and DiscLDA [23] guide 3 exclusively belongs to Canada semantically. However, it is incorrect the model to predict category labels based on document-level train- to retrieve “North America” as an element in S , because North ing data. While they do improve the discriminative power of unsu- 3 America is a continent and does not belong to any countries. It is pervised topic models on classification tasks, they rely on massive also incorrect to retrieve “English” as an element in S , because hand-labeled documents, which may be difficult to obtain in practi- 3 English is also the national language of the United States. cal applications. Along another line of work that is more similar to our setting, users are asked to provide a set of seed words to The differences between discriminative topic mining and stan- guide the topic discovery process, which is referred to as seed- dard topic modeling are mainly two-fold: (1) Discriminative topic guided topic modeling [1, 21]. However, they still do not impose mining requires a set of user provided category names and only requirements on the distinctiveness of the retrieved topics and thus focuses on retrieving terms belonging to the given categories. (2) are not optimized for discriminative topic presentation and other Discriminative topic mining imposes strong discriminative require- applications such as keyword-driven classification. ments that each retrieved term under the corresponding category We develop a novel category-name guided text embedding method, must belong to and only belong to that category semantically. CatE, for discriminative topic mining. CatE consists of two mod- ules: (1) A category-name guided text embedding learning module 3 CATEGORY-NAME GUIDED EMBEDDING that takes a set of category names to learn category distinctive word In this section, we first formulate a text generative process under embeddings by modeling the text generative process conditioned user guidance, and then cast the learning of the generative process on the user provided categories, and (2) a category representative as a category-name guided text embedding model. Words, docu- word retrieval module that selects category representative words ments and categories are jointly embedded into a shared space based on both word embedding similarity and word distributional where embeddings are not only learned according to the corpus specificity. The two modules collaborate in an iterative way:At generative assumption, but also encouraged to incorporate category each iteration, the former refines word embeddings and category distinctive information. embeddings for accurate representative word retrieval; and the latter selects representative words that will be used by the former 3.1 Motivation at the next iteration. Traditional topic models like LDA [6] use document-topic and topic- Our contributions can be summarized as follows. word distributions to model the text generation process, where an (1) We propose discriminative topic mining, a new task for topic dis- obvious defect exists due to the bag-of-words generation assumption— covery from text corpora with a set of category names as the only each word is drawn independently from the topic-word distribution supervision. We show qualitatively and quantitatively that this without considering the correlations between adjacent words. In new task helps users obtain a clear and distinctive understand- addition, topic models make explicit probabilistic assumptions re- ing of interested topics, and directly benefits keyword-driven
Recommended publications
  • Multi-View Learning for Hierarchical Topic Detection on Corpus of Documents
    Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria de Sistemas e Industrial. Bogot´a,Colombia 2021 Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Tesis presentada como requisito parcial para optar al t´ıtulode: Magister en Ingenier´ıade Sistemas y Computaciøn´ Director: Ph.D. Luis Fernando Ni~noV. L´ıneade Investigaci´on: Procesamiento de lenguaje natural Grupo de Investigaci´on: Laboratorio de investigaci´onen sistemas inteligentes - LISI Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria en Sistemas e Industrial. Bogot´a,Colombia 2021 To my parents Maria Helena and Jaime. To my aunts Patricia and Rosa. To my grandmothers Lilia and Santos. Acknowledgements To Camilo Alberto Pino, as the original thesis idea was his, and for his invaluable teaching of multi-view learning. To my thesis advisor, Luis Fernando Ni~no,and the Laboratorio de investigaci´onen sistemas inteligentes - LISI, for constantly allowing me to learn new knowl- edge, and for their valuable recommendations on the thesis. V Abstract Topic detection on a large corpus of documents requires a considerable amount of com- putational resources, and the number of topics increases the burden as well. However, even a large number of topics might not be as specific as desired, or simply the topic quality starts decreasing after a certain number. To overcome these obstacles, we propose a new method- ology for hierarchical topic detection, which uses multi-view clustering to link different topic models extracted from document named entities and part of speech tags.
    [Show full text]
  • Text Mining Course for KNIME Analytics Platform
    Text Mining Course for KNIME Analytics Platform KNIME AG Copyright © 2018 KNIME AG Table of Contents 1. The Open Analytics Platform 2. The Text Processing Extension 3. Importing Text 4. Enrichment 5. Preprocessing 6. Transformation 7. Classification 8. Visualization 9. Clustering 10. Supplementary Workflows Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 2 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ Overview KNIME Analytics Platform Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 3 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ What is KNIME Analytics Platform? • A tool for data analysis, manipulation, visualization, and reporting • Based on the graphical programming paradigm • Provides a diverse array of extensions: • Text Mining • Network Mining • Cheminformatics • Many integrations, such as Java, R, Python, Weka, H2O, etc. Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 4 Noncommercial-Share Alike license 2 https://creativecommons.org/licenses/by-nc-sa/4.0/ Visual KNIME Workflows NODES perform tasks on data Not Configured Configured Outputs Inputs Executed Status Error Nodes are combined to create WORKFLOWS Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 5 Noncommercial-Share Alike license 3 https://creativecommons.org/licenses/by-nc-sa/4.0/ Data Access • Databases • MySQL, MS SQL Server, PostgreSQL • any JDBC (Oracle, DB2, …) • Files • CSV, txt
    [Show full text]
  • Application of Machine Learning in Automatic Sentiment Recognition from Human Speech
    Application of Machine Learning in Automatic Sentiment Recognition from Human Speech Zhang Liu Ng EYK Anglo-Chinese Junior College College of Engineering Singapore Nanyang Technological University (NTU) Singapore [email protected] Abstract— Opinions and sentiments are central to almost all human human activities and have a wide range of applications. As activities and have a wide range of applications. As many decision many decision makers turn to social media due to large makers turn to social media due to large volume of opinion data volume of opinion data available, efficient and accurate available, efficient and accurate sentiment analysis is necessary to sentiment analysis is necessary to extract those data. Business extract those data. Hence, text sentiment analysis has recently organisations in different sectors use social media to find out become a popular field and has attracted many researchers. However, consumer opinions to improve their products and services. extracting sentiments from audio speech remains a challenge. This project explores the possibility of applying supervised Machine Political party leaders need to know the current public Learning in recognising sentiments in English utterances on a sentiment to come up with campaign strategies. Government sentence level. In addition, the project also aims to examine the effect agencies also monitor citizens’ opinions on social media. of combining acoustic and linguistic features on classification Police agencies, for example, detect criminal intents and cyber accuracy. Six audio tracks are randomly selected to be training data threats by analysing sentiment valence in social media posts. from 40 YouTube videos (monologue) with strong presence of In addition, sentiment information can be used to make sentiments.
    [Show full text]
  • Nonstationary Latent Dirichlet Allocation for Speech Recognition
    Nonstationary Latent Dirichlet Allocation for Speech Recognition Chuang-Hua Chueh and Jen-Tzung Chien Department of Computer Science and Information Engineering National Cheng Kung University, Tainan, Taiwan, ROC {chchueh,chien}@chien.csie.ncku.edu.tw model was presented for document representation with time Abstract evolution. Current parameters served as the prior information Latent Dirichlet allocation (LDA) has been successful for to estimate new parameters for next time period. Furthermore, document modeling. LDA extracts the latent topics across a continuous time dynamic topic model [12] was developed by documents. Words in a document are generated by the same incorporating a Brownian motion in the dynamic topic model, topic distribution. However, in real-world documents, the and so the continuous-time topic evolution was fulfilled. usage of words in different paragraphs is varied and Sparse variational inference was used to reduce the accompanied with different writing styles. This study extends computational complexity. In [6][7], LDA was merged with the LDA and copes with the variations of topic information the hidden Markov model (HMM) as the HMM-LDA model within a document. We build the nonstationary LDA (NLDA) where the Markov states were used to discover the syntactic by incorporating a Markov chain which is used to detect the structure of a document. Each word was generated either from stylistic segments in a document. Each segment corresponds to the topic or the syntactic state. The syntactic words and a particular style in composition of a document. This NLDA content words were modeled separately. can exploit the topic information between documents as well This study also considers the superiority of HMM in as the word variations within a document.
    [Show full text]
  • Large-Scale Hierarchical Topic Models
    Large-Scale Hierarchical Topic Models Jay Pujara Peter Skomoroch Department of Computer Science LinkedIn Corporation University of Maryland 2029 Stierlin Ct. College Park, MD 20742 Mountain View, CA 94043 [email protected] [email protected] Abstract In the past decade, a number of advances in topic modeling have produced sophis- ticated models that are capable of generating hierarchies of topics. One challenge for these models is scalability: they are incapable of working at the massive scale of millions of documents and hundreds of thousands of terms. We address this challenge with a technique that learns a hierarchy of topics by iteratively apply- ing topic models and processing subtrees of the hierarchy in parallel. This ap- proach has a number of scalability advantages compared to existing techniques, and shows promising results in experiments assessing runtime and human evalu- ations of quality. We detail extensions to this approach that may further improve hierarchical topic modeling for large-scale applications. 1 Motivation With massive datasets and corresponding computational resources readily available, the Big Data movement aims to provide deep insights into real-world data. Realizing this goal can require new approaches to well-studied problems. Complex models that, for example, incorporate many de- pendencies between parameters have alluring results for small datasets and single machines but are difficult to adapt to the Big Data paradigm. Topic models are an interesting example of this phenomenon. In the last decade, a number of sophis- ticated techniques have been developed to model collections of text, from Latent Dirichlet Allocation (LDA)[1] through extensions using statistical machinery such as the nested Chinese Restaurant Pro- cess [2][3] and Pachinko Allocation[4].
    [Show full text]
  • Text KM & Text Mining
    Text KM & Text Mining An Introduction to Managing Knowledge in Unstructured Natural Language Documents Dekai Wu HKUST Human Language Technology Center Department of Computer Science and Engineering University of Science and Technology Hong Kong http://www.cs.ust.hk/~dekai © 2008 Dekai Wu Lecture Objectives Introduction to the concept of Text KM and Text Mining (TM) How to exploit knowledge encoded in text form How text mining is different from data mining Introduction to the various aspects of Natural Language Processing (NLP) Introduction to the different tools and methods available for TM HKUST Human Language Technology Center © 2008 Dekai Wu Textual Knowledge Management Text KM oversees the storage, capturing and sharing of knowledge encoded in unstructured natural language documents 80-90% of an organization’s explicit knowledge resides in plain English (Chinese, Japanese, Spanish, …) documents – not in structured relational databases! Case libraries are much more reasonably stored as natural language documents, than encoded into relational databases Most knowledge encoded as text will never pass through explicit KM processes (eg, email) HKUST Human Language Technology Center © 2008 Dekai Wu Text Mining Text Mining analyzes unstructured natural language documents to extract targeted types of knowledge Extracts knowledge that can then be inserted into databases, thereby facilitating structured data mining techniques Provides a more natural user interface for entering knowledge, for both employees and developers Reduces
    [Show full text]
  • Journal of Applied Sciences Research
    Copyright © 2015, American-Eurasian Network for Scientific Information publisher JOURNAL OF APPLIED SCIENCES RESEARCH ISSN: 1819-544X EISSN: 1816-157X JOURNAL home page: http://www.aensiweb.com/JASR 2015 October; 11(19): pages 50-55. Published Online 10 November 2015. Research Article A Survey on Correlation between the Topic and Documents Based on the Pachinko Allocation Model 1Dr.C.Sundar and 2V.Sujitha 1Associate Professor, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India. 2PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India. Received: 23 September 2015; Accepted: 25 October 2015 © 2015 AENSI PUBLISHER All rights reserved ABSTRACT Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. In existing system, a novel information filtering model, Maximum matched Pattern-based Topic Model (MPBTM), is used.The patterns are generated from the words in the word-based topic representations of a traditional topic model such as the LDA model. This ensures that the patterns can well represent the topics because these patterns are comprised of the words which are extracted by LDA based on sample occurrence of the words in the documents. The Maximum matched pat-terns, which are the largest patterns in each equivalence class that exist in the received documents, are used to calculate the relevance words to represent topics. However, LDA does not capture correlations between topics and these not find the hidden topics in the document. To deal with the above problem the pachinko allocation model (PAM) is proposed.
    [Show full text]
  • Incorporating Domain Knowledge in Latent Topic Models
    INCORPORATING DOMAIN KNOWLEDGE IN LATENT TOPIC MODELS by David Michael Andrzejewski A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON 2010 c Copyright by David Michael Andrzejewski 2010 All Rights Reserved i For my parents and my Cho. ii ACKNOWLEDGMENTS Obviously none of this would have been possible without the diligent advising of Mark Craven and Xiaojin (Jerry) Zhu. Taking a bioinformatics class from Mark as an undergraduate initially got me excited about the power of statistical machine learning to extract insights from seemingly impenetrable datasets. Jerry’s enthusiasm for research and relentless pursuit of excellence were a great inspiration for me. On countless occasions, Mark and Jerry have selflessly donated their time and effort to help me develop better research skills and to improve the quality of my work. I would have been lucky to have even a single advisor as excellent as either Mark or Jerry; I have been extremely fortunate to have them both as co-advisors. My committee members have also been indispensable. Jude Shavlik has always brought an emphasis on clear communication and solid experimental technique which will hopefully stay with me for the rest of my career. Michael Newton helped me understand the modeling issues in this research from a statistical perspective. Working with prelim committee member Ben Liblit gave me the exciting opportunity to apply machine learning to a very challenging problem in the debugging work presented in Chapter 4. I also learned a lot about how other computer scientists think from meetings with Ben.
    [Show full text]
  • Extração De Informação Semântica De Conteúdo Da Web 2.0
    Mestrado em Engenharia Informática Dissertação Relatório Final Extração de Informação Semântica de Conteúdo da Web 2.0 Ana Rita Bento Carvalheira [email protected] Orientador: Paulo Jorge de Sousa Gomes [email protected] Data: 1 de Julho de 2014 Agradecimentos Gostaria de começar por agradecer ao Professor Paulo Gomes pelo profissionalismo e apoio incondicional, pela sincera amizade e a total disponibilidade demonstrada ao longo do ano. O seu apoio, não só foi determinante para a elaboração desta tese, como me motivou sempre a querer saber mais e ter vontade de fazer melhor. À minha Avó Maria e Avô Francisco, por sempre estarem presentes quando eu precisei, pelo carinho e afeto, bem como todo o esforço que fizeram para que nunca me faltasse nada. Espero um dia poder retribuir de alguma forma tudo aquilo que fizeram por mim. Aos meus Pais, pelos ensinamentos e valores transmitidos, por tudo o que me proporcionaram e por toda a disponibilidade e dedicação que, constantemente, me oferecem. Tudo aquilo que sou, devo-o a vocês. Ao David agradeço toda a ajuda e compreensão ao longo do ano, todo o carinho e apoio demonstrado em todas as minhas decisões e por sempre me ter encorajado a seguir os meus sonhos. Admiro-te sobretudo pela tua competência e humildade, pela transmissão de força e confiança que me dás em todos os momentos. Resumo A massiva proliferação de blogues e redes sociais fez com que o conteúdo gerado pelos utilizadores, presente em plataformas como o Twitter ou Facebook, se tornasse bastante valioso pela quantidade de informação passível de ser extraída e explorada.
    [Show full text]
  • Knowledge-Powered Deep Learning for Word Embedding
    Knowledge-Powered Deep Learning for Word Embedding Jiang Bian, Bin Gao, and Tie-Yan Liu Microsoft Research {jibian,bingao,tyliu}@microsoft.com Abstract. The basis of applying deep learning to solve natural language process- ing tasks is to obtain high-quality distributed representations of words, i.e., word embeddings, from large amounts of text data. However, text itself usually con- tains incomplete and ambiguous information, which makes necessity to leverage extra knowledge to understand it. Fortunately, text itself already contains well- defined morphological and syntactic knowledge; moreover, the large amount of texts on the Web enable the extraction of plenty of semantic knowledge. There- fore, it makes sense to design novel deep learning algorithms and systems in order to leverage the above knowledge to compute more effective word embed- dings. In this paper, we conduct an empirical study on the capacity of leveraging morphological, syntactic, and semantic knowledge to achieve high-quality word embeddings. Our study explores these types of knowledge to define new basis for word representation, provide additional input information, and serve as auxiliary supervision in deep learning, respectively. Experiments on an analogical reason- ing task, a word similarity task, and a word completion task have all demonstrated that knowledge-powered deep learning can enhance the effectiveness of word em- bedding. 1 Introduction With rapid development of deep learning techniques in recent years, it has drawn in- creasing attention to train complex and deep models on large amounts of data, in order to solve a wide range of text mining and natural language processing (NLP) tasks [4, 1, 8, 13, 19, 20].
    [Show full text]
  • An Unsupervised Topic Segmentation Model Incorporating Word Order
    An Unsupervised Topic Segmentation Model Incorporating Word Order Shoaib Jameel and Wai Lam The Chinese University of Hong Kong One line summary of the work We will see how maintaining the document structure such as paragraphs, sentences, and the word order helps improve the performance of a topic model. Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 1 Outline Motivation Related Work I Probabilistic Unigram Topic Model (LDA) I Probabilistic N-gram Topic Models I Topic Segmentation Models Overview of our model I Our N-gram Topic Segmentation model (NTSeg) Text Mining Experiments of NTSeg I Word-Topic and Segment-Topic Correlation Graph I Topic Segmentation Experiment I Document Classification Experiment I Document Likelihood Experiment Conclusions and Future Directions Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 2 Motivation Many works in the topic modeling literature assume exchangeability among the words. As a result we see many ambiguous words in topics. For example, consider few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model: Example Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 architecture order connectionist potential prior recurrent first role membrane bayesian network second binding current data module analysis structures synaptic evidence modules small distributed dendritic experts The problem with the LDA model Words in topics are not insightful. Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 3 Motivation Many works in the topic modeling literature assume exchangeability among the words. As a result we see many ambiguous words in topics. For example, consider few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model: Example Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 architecture order connectionist potential prior recurrent first role membrane bayesian network second binding current data module analysis structures synaptic evidence modules small distributed dendritic experts The problem with the LDA model Words in topics are not insightful.
    [Show full text]
  • A New Method and Application for Studying Political Text in Multiple Languages
    The Pennsylvania State University The Graduate School THE RADICAL RIGHT IN PARLIAMENT: A NEW METHOD AND APPLICATION FOR STUDYING POLITICAL TEXT IN MULTIPLE LANGUAGES A Dissertation in Political Science by Mitchell Goist Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2020 ii The dissertation of Mitchell Goist was reviewed and approved* by the following: Burt L. Monroe Liberal Arts Professor of Political Science Dissertation Advisor Chair of Committee Bruce Desmarais Associate Professor of Political Science Matt Golder Professor of Political Science Sarah Rajtmajer Assistant Professor of Information Science and Tecnology Glenn Palmer Professor of Political Science and Director of Graduate Studies iii ABSTRACT Since a new wave of radical right support in the early 1980s, scholars have sought to understand the motivations and programmatic appeals of far-right parties. However, due to their small size and dearth of data, existing methodological approaches were did not allow the direct study of these parties’ behavior in parliament. Using a collection of parliamentary speeches from the United Kingdom, Germany, Spain, Italy, the Netherlands, Finland, Sweden, and the Czech Re- public, Chapter 1 of this dissertation addresses this problem by developing a new model for the study of political text in multiple languages. Using this new method allows the construction of a shared issue space where each party is embedded regardless of the language spoken in the speech or the country of origin. Chapter 2 builds on this new method by explicating the ideolog- ical appeals of radical right parties. It finds that in some instances radical right parties behave similarly to mainstream, center-right parties, but distinguish themselves by a focus on individual crime and an emphasis on negative rhetorical frames.
    [Show full text]