Distributional Semantics
Recommended publications
-
Approximate Pattern Matching Using Hierarchical Graph Construction and Sparse Distributed Representation
Portland State University PDXScholar Dissertations and Theses Dissertations and Theses 9-29-2020 Approximate Pattern Matching Using Hierarchical Graph Construction and Sparse Distributed Representation Aakanksha Mathuria Portland State University Follow this and additional works at: https://pdxscholar.library.pdx.edu/open_access_etds Part of the Electrical and Computer Engineering Commons Let us know how access to this document benefits you. Recommended Citation Mathuria, Aakanksha, "Approximate Pattern Matching Using Hierarchical Graph Construction and Sparse Distributed Representation" (2020). Dissertations and Theses. Paper 5581. https://doi.org/10.15760/etd.7453 This Thesis is brought to you for free and open access. It has been accepted for inclusion in Dissertations and Theses by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: [email protected]. Approximate Pattern Matching using Hierarchical Graph Construction and Sparse Distributed Representation by Aakanksha Mathuria A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering Thesis Committee: Dan Hammerstrom, Chair Christof Teuscher Nirupama Bulusu Portland State University 2020 Abstract With recent developments in deep networks, there have been significant advances in visual object detection and recognition. However, some of these networks are still easily fooled/hacked and have shown "bag of features" kinds of failures. Some of this is due to the fact that even deep networks make only marginal use of the complex structure that exists in real-world images. Primate visual systems appear to capture the structure in images, but how? In the research presented here, we are studying approaches for robust pattern matching using static, 2D Blocks World images based on graphical representations of the various components of an image. -
Probabilistic Topic Modelling with Semantic Graph
Probabilistic Topic Modelling with Semantic Graph Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang School of Computing Science, University of Glasgow, Sir Alwyns Building, Glasgow, UK [email protected] Abstract. In this paper we propose a novel framework, topic model with semantic graph (TMSG), which couples topic model with the rich knowledge from DBpedia. To begin with, we extract the disambiguated entities from the document collection using a document entity linking system, i.e., DBpedia Spotlight, from which two types of entity graphs are created from DBpedia to capture local and global contextual knowledge, respectively. Given the semantic graph representation of the documents, we propagate the inherent topic-document distribution with the disambiguated entities of the semantic graphs. Experiments conducted on two real-world datasets show that TMSG can significantly outperform the state-of-the-art techniques, namely, author-topic Model (ATM) and topic model with biased propagation (TMBP). Keywords: Topic model · Semantic graph · DBpedia 1 Introduction Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and Latent Dirichlet Allocation (LDA) [2], have been remarkably successful in analyzing textual content. Specifically, each document in a document collection is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words. Such a paradigm is widely applied in various areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made on incorporating external resources, such as time [8], geographic location [12], and authorship [15], into topic models. -
Employee Matching Using Machine Learning Methods
Master of Science in Computer Science May 2019 Employee Matching Using Machine Learning Methods Sumeesha Marakani Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies. The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree. Contact Information: Author(s): Sumeesha Marakani E-mail: [email protected] University advisor: Prof. Veselka Boeva Department of Computer Science External advisors: Lars Tornberg [email protected] Daniel Lundgren [email protected] Faculty of Computing, Blekinge Institute of Technology, SE–371 79 Karlskrona, Sweden Internet: www.bth.se Phone: +46 455 38 50 00 Fax: +46 455 38 50 57 Abstract Background. Expertise retrieval is an information retrieval technique that focuses on techniques to identify the most suitable 'expert' for a task from a list of individuals. Objectives. This master thesis is a collaboration with Volvo Cars to attempt applying this concept and match employees based on information that was extracted from an internal tool of the company. In this tool, the employees describe themselves in free flowing text. This text is extracted from the tool and analyzed using Natural Language Processing (NLP) techniques. -
Matrix Decompositions and Latent Semantic Indexing
Online edition (c)2009 Cambridge UP DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome. 18 Matrix decompositions and latent semantic indexing On page 123 we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 372). Understanding its full potential remains an area of active research. Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter. 18.1 Linear algebra review We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative. -
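The low-rank approximation mentioned in this excerpt can be illustrated with a truncated singular value decomposition. The following is a minimal sketch, not the book's own code: it assumes NumPy is available, and the terms, the counts in C, the helper name low_rank_approximation, and the choice k = 2 are all invented for illustration.

```python
import numpy as np

# Toy term-document matrix C (M terms x N documents).
# The terms and counts are invented for illustration only.
terms = ["car", "engine", "river", "boat"]
C = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 1, 2, 1],
], dtype=float)


def low_rank_approximation(matrix, k):
    """Return a rank-k approximation of `matrix` by truncating its SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


C2 = low_rank_approximation(C, k=2)
singular_values = np.linalg.svd(C, compute_uv=False)
error = np.linalg.norm(C - C2, ord="fro")

print(np.round(C2, 2))
print("Frobenius error of the rank-2 approximation:", float(error))
```

The Frobenius error of the truncated matrix equals the square root of the sum of the squared singular values that were dropped, which is why truncating the SVD is the usual way to build such an approximation.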
Semantic Computing
SEMANTIC COMPUTING Lecture 7: Introduction to Distributional and Distributed Semantics
Dagmar Gromann International Center For Computational Logic TU Dresden, 30 November 2018
Overview
• Distributional Semantics
• Distributed Semantics – Word Embeddings
Dagmar Gromann, 30 November 2018 Semantic Computing 2
Distributional Semantics
Definition
• meaning of a word is the set of contexts in which it occurs
• no other information is used than the corpus-derived information about word distribution in contexts (co-occurrence information of words)
• semantic similarity can be inferred from proximity in contexts
• At the very core: Distributional Hypothesis
Distributional Hypothesis
"similarity of meaning correlates with similarity of distribution" - Harris Z. S. (1954) "Distributional structure". Word, Vol. 10, No. 2-3, pp. 146-162
meaning = use = distribution in context => semantic distance
Remember Lecture 1? Types of Word Meaning
• Encyclopaedic meaning: words provide access to a large inventory of structured knowledge (world knowledge)
• Denotational meaning: reference of a word to object/concept or its "dictionary definition" (signifier <-> signified)
• Connotative meaning: word meaning is understood by its cultural or emotional association (positive, negative, neutral connotation; e.g. "She's a dragon" in Chinese and English)
• Conceptual meaning: word meaning is associated with the mental concepts it gives access to (e.g. prototype theory)
• Distributional meaning: "You shall know a word by the company it keeps" (J.R. Firth 1957: 11)
John Rupert Firth (1957). "A synopsis of linguistic theory 1930-1955." In Special Volume of the Philological Society. Oxford: Oxford University Press.
Distributional Hypothesis in Practice
Study by McDonald and Ramscar (2001):
• The man poured from a balack into a handleless cup. -
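The definition in these slides (a word's meaning is the set of contexts it occurs in, and similarity follows from shared contexts) can be sketched with plain co-occurrence counts. This is only an illustrative sketch under stated assumptions: the toy corpus, the window size of 2, and the helper names cooccurrence_vectors and cosine are invented here and do not come from the lecture.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus; a real study would use a large text collection.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "the car needs fuel",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(corpus)
print("cat vs dog: ", cosine(vectors["cat"], vectors["dog"]))
print("cat vs fuel:", cosine(vectors["cat"], vectors["fuel"]))
```

In this toy corpus "cat" and "dog" share contexts such as "drinks" and "chases", while "cat" and "fuel" share none, so the first similarity comes out higher than the second.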
Latent Semantic Analysis for Text-Based Research
Behavior Research Methods, Instruments, & Computers 1996, 28 (2), 197-202 ANALYSIS OF SEMANTIC AND CLINICAL DATA Chaired by Matthew S. McGlone, Lafayette College Latent semantic analysis for text-based research PETER W. FOLTZ New Mexico State University, Las Cruces, New Mexico Latent semantic analysis (LSA) is a statistical model of word usage that permits comparisons of semantic similarity between pieces of textual information. This paper summarizes three experiments that illustrate how LSA may be used in text-based research. Two experiments describe methods for analyzing a subject's essay for determining from what text a subject learned the information and for grading the quality of information cited in the essay. The third experiment describes using LSA to measure the coherence and comprehensibility of texts. One of the primary goals in text-comprehension research is to understand what factors influence a reader's ability to extract and retain information from textual material. The typical approach in text-comprehension research is to have subjects read textual material and then have them produce some form of summary, such as answering questions or writing an essay. This summary permits the experimenter to determine what information the … A theoretical approach to studying text comprehension has been to develop cognitive models of the reader's representation of the text (e.g., Kintsch, 1988; van Dijk & Kintsch, 1983). In such a model, semantic information from both the text and the reader's summary are represented as sets of semantic components called propositions. Typically, each clause in a text is represented by a single proposition. -
A Comparison of Word Embeddings and N-Gram Models for Dbpedia Type and Invalid Entity Detection †
information Article A Comparison of Word Embeddings and N-gram Models for DBpedia Type and Invalid Entity Detection † Hanqing Zhou *, Amal Zouaq and Diana Inkpen School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa ON K1N 6N5, Canada; [email protected] (A.Z.); [email protected] (D.I.) * Correspondence: [email protected]; Tel.: +1-613-562-5800 † This paper is an extended version of our conference paper: Hanqing Zhou, Amal Zouaq, and Diana Inkpen. DBpedia Entity Type Detection using Entity Embeddings and N-Gram Models. In Proceedings of the International Conference on Knowledge Engineering and Semantic Web (KESW 2017), Szczecin, Poland, 8–10 November 2017, pp. 309–322. Received: 6 November 2018; Accepted: 20 December 2018; Published: 25 December 2018 Abstract: This article presents and evaluates a method for the detection of DBpedia types and entities that can be used for knowledge base completion and maintenance. This method compares entity embeddings with traditional N-gram models coupled with clustering and classification. We tackle two challenges: (a) the detection of entity types, which can be used to detect invalid DBpedia types and assign DBpedia types for type-less entities; and (b) the detection of invalid entities in the resource description of a DBpedia entity. Our results show that entity embeddings outperform n-gram models for type and entity detection and can contribute to the improvement of DBpedia’s quality, maintenance, and evolution. Keywords: semantic web; DBpedia; entity embedding; n-grams; type identification; entity identification; data mining; machine learning 1. Introduction The Semantic Web is defined by Berners-Lee et al. -
How Latent Is Latent Semantic Analysis?
How Latent is Latent Semantic Analysis? Peter Wiemer-Hastings University of Memphis Department of Psychology Campus Box 526400 Memphis TN 38152-6400 [email protected]* Abstract Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge. In the late 1980's, a group at Bellcore doing research on information retrieval techniques developed a statistical, corpus-based method for retrieving texts. Unlike the simple techniques which rely on weighted matches of keywords in the texts and queries, their method, called Latent Semantic Analysis (LSA), created a high-dimensional, spatial representation of a corpus and allowed texts to be compared geometrically. In the last few years, several researchers have applied this technique to a variety of tasks including the synonym section of the Test of English as a Foreign Language [Landauer et al., 1997], general lexical acquisition from text [Landauer and Dumais, 1997], selecting texts for students to read [Wolfe et al., 1998], judging the coherence of student essays [Foltz et al., 1998], and the evaluation of student contributions in an intelligent tutoring environment [Wiemer-Hastings et al., 1998; -
Navigating Dbpedia by Topic Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frederique Laforest
Fouilla: Navigating DBpedia by Topic Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frederique Laforest To cite this version: Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frederique Laforest. Fouilla: Navigating DBpedia by Topic. CIKM 2018, Oct 2018, Turin, Italy. hal-01860672 HAL Id: hal-01860672 https://hal.archives-ouvertes.fr/hal-01860672 Submitted on 23 Aug 2018 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Fouilla: Navigating DBpedia by Topic Tanguy Raynaud, Julien Subercaze, Delphine Boucard, Vincent Battu, Frédérique Laforest Univ Lyon, UJM Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516 Saint-Etienne, France [email protected] ABSTRACT Navigating large knowledge bases made of billions of triples is very challenging. In this demonstration, we showcase Fouilla, a topical Knowledge Base browser that offers a seamless navigational experience of DBpedia. We propose an original approach that leverages both structural and semantic contents of Wikipedia to enable a topic-oriented filter on DBpedia entities. … only the triples that concern this topic. For example, a user is interested in Italy through the prism of Sports while another through the prism of World War II. For each of these topics, the relevant triples of the Italy entity differ. In such circumstances, faceted browsing offers no solution to retrieve the entities relative to a defined topic if the knowledge graph does not explicitly contain an adequate -
Distributional Semantics
Distributional Semantics
Computational Linguistics: Jordan Boyd-Graber University of Maryland
SLIDES ADAPTED FROM YOAV GOLDBERG AND OMER LEVY
Computational Linguistics: Jordan Boyd-Graber | UMD | Distributional Semantics | 1 / 19
From Distributional to Distributed Semantics
The new kid on the block
Deep learning / neural networks
"Distributed" word representations
Feed text into neural-net. Get back "word embeddings".
Each word is represented as a low-dimensional vector.
Vectors capture "semantics"
word2vec (Mikolov et al)
From Distributional to Distributed Semantics
This part of the talk
word2vec as a black box
a peek inside the black box
relation between word-embeddings and the distributional representation
tailoring word embeddings to your needs using word2vec
word2vec
dog: cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
sheep: cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
november: october, december, april, june, february, july, september, january, august, march
jerusalem: tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
teva: pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia
Working with Dense Vectors
Word Similarity
Similarity is calculated using cosine similarity: $\mathrm{sim}(\vec{dog}, \vec{cat}) = \dfrac{\vec{dog} \cdot \vec{cat}}{\lVert \vec{dog} \rVert \, \lVert \vec{cat} \rVert}$
For normalized vectors ($\lVert x \rVert = 1$), this is equivalent to a dot product: $\mathrm{sim}(\vec{dog}, \vec{cat}) = \vec{dog} \cdot \vec{cat}$
Normalize the vectors when loading them. -
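The word-similarity slide above can be checked with a short sketch: normalize the vectors once at load time, after which a plain dot product gives the cosine similarity. The three-dimensional vectors below are invented for illustration (real word2vec embeddings are learned from a corpus and much higher-dimensional), and the helper names dot, cosine, normalize, and most_similar are hypothetical, not part of any word2vec toolkit.

```python
import math

# Invented 3-dimensional "embeddings"; values are illustrative only.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "november": [0.1, 0.2, 0.9],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]


# Normalize once, when the vectors are loaded.
unit = {word: normalize(vec) for word, vec in embeddings.items()}


def most_similar(word, table):
    """Rank the other words by dot product; `table` must hold unit vectors."""
    scores = [(other, dot(table[word], vec))
              for other, vec in table.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(cosine(embeddings["dog"], embeddings["cat"]))
print(dot(unit["dog"], unit["cat"]))
print(most_similar("dog", unit))
```

Normalizing up front means every later similarity query costs only a dot product, which matters when ranking one word against a vocabulary of many thousands.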
Facing the Facts of Fake: a Distributional Semantics and Corpus Annotation Approach Bert Cappelle, Pascal Denis, Mikaela Keller
Facing the facts of fake: a distributional semantics and corpus annotation approach Bert Cappelle, Pascal Denis, Mikaela Keller To cite this version: Bert Cappelle, Pascal Denis, Mikaela Keller. Facing the facts of fake: a distributional semantics and corpus annotation approach. Yearbook of the German Cognitive Linguistics Association, De Gruyter, 2018, 6 (9-42). hal-01959609 HAL Id: hal-01959609 https://hal.archives-ouvertes.fr/hal-01959609 Submitted on 18 Dec 2018 Facing the facts of fake: a distributional semantics and corpus annotation approach Bert Cappelle, Pascal Denis and Mikaela Keller Université de Lille, Inria Lille Nord Europe Fake is often considered the textbook example of a so-called 'privative' adjective, one which, in other words, allows the proposition that '(a) fake x is not (an) x'. This study tests the hypothesis that the contexts of an adjective-noun combination are more different from the contexts of the noun when the adjective is such a 'privative' one than when it is an ordinary (subsective) one. We here use 'embeddings', that is, dense vector representations based on word co-occurrences in a large corpus, which in our study is the entire English Wikipedia as it was in 2013. -
Sentiment, Stance and Applications of Distributional Semantics
Sentiment, stance and applications of distributional semantics
Maria Skeppstedt
Sentiment analysis (opinion mining)
• Aims at determining the attitude of the speaker/writer
  * In a chunk of text (e.g., a document or sentence)
  * Towards a certain topic
• Categories:
  * Positive, negative, (neutral)
  * More detailed
Movie reviews
• Excruciatingly unfunny and pitifully unromantic.
• The picture emerges as a surprisingly anemic disappointment.
• A sensitive, insightful and beautifully rendered film.
• The spark of special anime magic here is unmistakable and hard to resist.
• Maybe you'll be lucky, and there'll be a power outage during your screening so you can get your money back.
Methods for sentiment analysis
• Train a machine learning classifier
• Use lexicon and rules
• (or combine the methods)
Train a machine learning classifier
[Figure: annotating text and training the model, then applying the model; example text "Excruciatingly unfunny", "pitifully unromantic."]
Examples of machine learning frameworks
• Scikit-learn
• NLTK (natural language toolkit)
• CRF++
Methods for sentiment analysis
• Train a machine learning classifier
  * See for instance Socher et al. (2013), who classified English movie review sentences into positive and negative sentiment.
  * They trained their classifier on 11,855 manually labelled sentences. (R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank)
Methods based on lexicons and rules
• Lexicons for different polarities
• Sometimes with different strength on the words
  * Does not need the large amount of training data that is typically required for machine learning methods
  * More flexible (e.g., to add new sentiment poles and new languages)
Commercial use of sentiment analysis
I want to retrieve everything positive and negative that is said about the products I sell.
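The lexicon-and-rules approach described in these slides (polarity lexicons, optionally with word strengths, plus simple rules) can be sketched as follows. Everything here is an assumption made for illustration: the lexicon entries and their weights, the single negation rule, and the function names sentiment_score and classify are invented and are not taken from the slides or from any published lexicon.

```python
import re

# Tiny hand-made polarity lexicon with strengths; illustrative entries only.
LEXICON = {
    "insightful": 2,
    "beautifully": 2,
    "sensitive": 1,
    "unfunny": -2,
    "unromantic": -2,
    "disappointment": -2,
    "pitifully": -1,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon weights over the tokens, flipping a weight after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        weight = LEXICON.get(token, 0)
        # Rule: a negation word directly before a lexicon word flips its polarity.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            weight = -weight
        score += weight
    return score


def classify(text):
    """Map the summed score to a coarse polarity label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "Excruciatingly unfunny and pitifully unromantic.",
    "A sensitive, insightful and beautifully rendered film.",
]:
    print(classify(review), "<-", review)
```

Such a scorer needs no labelled training data and is easy to extend with new words or languages, but it misses anything outside the lexicon, for example the sarcasm in the "power outage" review.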