Semantic SPARQL Similarity Search Over RDF Knowledge Graphs

Total Page:16

File Type:pdf, Size:1020Kb

Semantic SPARQL Similarity Search Over RDF Knowledge Graphs Semantic SPARQL Similarity Search Over RDF Knowledge Graphs Weiguo Zheng1, Lei Zou1, Wei Peng1, Xifeng Yan2, Shaoxu Song3, Dongyan Zhao1 1Peking University, Beijing, China, 100080; 2University of California at Santa Barbara, California, USA, 93106; 3Tsinghua University, Beijing, China, 100084. fzhengweiguo,zoulei,pengw,[email protected],[email protected],[email protected] ABSTRACT world fact. Thus, if we want to find more answers to a question, RDF knowledge graphs have attracted increasing attentions these complex SPARQL queries that contain multiple UNION operators years. However, due to the schema-free nature of RDF data, it is are required. Clearly, it is very difficult for users (even the profes- very difficult for users to have full knowledge of the underlying sional users) to conceive the complicated SPARQL queries that not schema. Furthermore, the same kind of information can be repre- only conform to the syntax but also consider the flexible underlying sented in diverse graph fragments. Hence, it is a huge challenge to schemas. We illustrate the challenges by the following motivating formulate complex SPARQL expressions by taking the union of all example. possible structures. 1.1 Motivating Example In this paper, we propose an effective framework to access the RDF repository even if users have no full knowledge of the un- Fig. 1 presents a piece of RDF graph extracted from DBpedia. derlying schema. Specifically, given a SPARQL query, the system Assume that we want to find the cars that are produced in Ger- could return as more answers that match the query based on the many. There are at least three different German car brands, such as semantic similarity as possible. Interestingly, we propose a sys- Porsche Cayenne, Mercedes Benz and BMWX6, which are stored tematic method to mine diverse semantically equivalent structure in three different schemas in Fig. 1. In order to enable the query, patterns. More importantly, incorporating both structural and se- we should issue the following SPARQL query, which is composed mantic similarities we are the first to propose a novel similarity of three subqueries corresponding to the query graphs q1, q2, and measure, semantic graph edit distance. In order to improve the q3 in Fig. 2(a). efficiency performance, we apply the semantic summary graph to SELECT ?x WHERE f summarize the knowledge graph, which supports both high-level f?x <type> Automobile. ?x <p r o d u c t i o n > Germany . g pruning and drill-down pruning. We also devise an effective lower UNION bound based on the TA-style access to each of the candidate sets. f?x <type> Automobile. ?x <assembly> Germany . g Extensive experiments over real datasets confirm the effectiveness UNION f?x <type> Automobile. ?x <manufacturer > ?y . and efficiency of our approach. ?y <l o c a t i o n > Germany .gg Since different structural patterns may express the same seman- 1. INTRODUCTION tic meanings, formulating a SPARQL query that considers all pos- An RDF repository, which consists of a set of triples hsubject, sible structures is not a trivial task. Although we can conceive the predicate, objecti, can be modeled as an RDF graph, where the complete SPARQL query for “the cars produced in Germany”, we vertices represent subjects and objects, and the labeled edges cor- cannot always use the same query pattern to answer other ques- respond to predicates. The rapidly growing RDF knowledge repos- tions. For instance, in order to find all cars produced in Aus- itories, such as DBpedia, Yago and Freebase, increase the demand tralia, we should add another SPARQL subquery “?x < type > for managing graph data effectively and efficiently. AutomobileO f Australia” to capture the complete answers. SPARQL, a structural query language proposed by W3C, is de- signed for querying RDF data. Since SPARQL queries can be rep- Benz manufacturer Mercedes_Benz type ty resented as query graphs [28], a SPARQL query can be answered p on BMW_X6 e by performing the graph pattern matching over RDF graphs [1]. locati Company p rodu Germany Due to the “schema-free” nature of RDF data, different data con- type ction tributors may adopt different schemas to describe the same real- Chery ly emb ype ass t assembly type Porsche_Cayenne type BYD Automobile p Statesman_V6 rodu ction pe China Actor This work is licensed under the Creative Commons Attribution- ty Japan t y on NonCommercial-NoDerivatives 4.0 International License. To view a copy p Toyota ucti e d pro ty e p p bornIn of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For e e y p production t type any use beyond those covered by this license, obtain permission by emailing ty Automobile Andy Lau [email protected]. OfAustralia Proceedings of the VLDB Endowment, Vol. 9, No. 11 Honda AsianCoutry Copyright 2016 VLDB Endowment 2150-8097/16/07. Figure 1: An RDF knowledge graph. 840 SELECT ?x WHERE { ?x ?x { ?x <type> Automobile. type production (q1) SELECT ?x WHERE type production ?x <production> Germany.} { ?x <type> Automobile. UNION Automobile Germany ?x <production> Germany.} Automobile Germany { ?x <type> Automobile. ?x <assembly> Germany. } ?x (q1) (q2) UNION type assembly { ?x <type> Automobile. ?x <manufacturer> ?y. Automobile Germany Query rewriting semantic graph patterns ?y <location> Germany.} } ?x (Exact) SPARQL SPARQL Similarity Search type manufacturer (q3) Query location VM Automobile ?y VM Porsche_Cayenne Germany Porsche_Cayenne Answers: Answers: Benz Benz BMW_X6 BMW_X6 (a) Traditional method (b) Our method Figure 2: Traditional method vs. our method To obtain more correct answers, the traditional method (as shown vertex labels by a similarity function. However, it requires that q in Fig. 2(a)) demands users to have the full knowledge about the and its match must share the same graph structure including the schema of an RDF graph. In other words, it requires that user- edge label constraints. Recently, SLQ [23] presents a query engine s should not only know all the predicates in the knowledge base that integrates a set of transformation functions, such as “synonym” clearly, but also be aware of different structural expressions for i- and “distance”. Although it can plug in the “distance” transfor- dentical semantic facts. It will be more difficult for open-domain mation (i.e., transforming an edge to a shortest path), more com- knowledge graphs, such as DBpedia. Every coin has two sides. plicated structures (e.g., graphs) are hard to deal with. Further- The “schema-free” nature of RDF facilitates the dataset construc- more, structural transformation (corresponding to “distance”) and tion, but it inevitably leads to the inherent difficulty of querying the semantic transformation (corresponding to “ontology”) are taken knowledge base. into consideration separately. The goal of this paper is to provide an effective way to access the 1.3 Challenges and Contributions RDF repository even if one has no full knowledge of the underlying Challenge 1: Mining Diverse Structure Patterns with Equiva- schema. To this end, we provide an effective query model. Given lent Semantic Meanings. It is a common case that many subgraph- an RDF graph G, a user just needs to write a SPARQL query fol- s of a knowledge graph convey the same semantic meaning even if lowing one possible schema to express his/her query intention. The they do not share the identical structure. For example, graphs g , system should return as many answers that semantically match the 1 g2, and g3 in Fig. 3, are three different subgraphs extracted from query as possible. Fig. 2(b) illustrates our framework of SPARQL the knowledge graph G in Fig. 1. Although they are different in similarity search. terms of graph structures, they share the same semantic meaning, For example, given the SPARQL query in Fig. 2(b) (it corre- i.e., the automobiles that are produced in Germany. Note that the sponds to q1), our system can find all cars that are produced in task of mining diverse structural patterns with equivalent seman- Germany. Graphs g1, g2, and g3 in Figs. 3(a), 3(b), and 3(c) are tic meanings is different from schema mapping [13] and ontology three of the matches based on the semantic similarity. alignment [16]. (1) Different inputs: Schema mapping and ontolo- 1.2 Limitations of Existing Approaches gy alignment take two schema/ontologies as inputs; but our input is Although lots of efforts have been devoted to the graph similarity a single knowledge graph. (2) Different outputs: Our task is to find search [24, 8, 2, 7, 22, 23], they suffer from various drawbacks. sets of graph patterns that describe the same semantic meanings. Resorting to Structure Similarity. Several approaches are pro- However, schema mapping and ontology alignment aim to find the posed for the approximate subgraph query, but most of them focus mapping between two elements from two schemas or two concepts on the structure similarity, such as SAPPER [24], kGPM [2] and from two ontologies. Ness [8]. SAPPER [24] investigates the problem of approximate In this paper, we propose an instance-driven approach to mine subgraph search allowing some edges unmatched. It does not sup- these semantically equivalent patterns. According to the mining port the vertex/edge label substitution. kGPM [2] proposes a graph results, we define three representative graph patterns of semantic e- pattern query, which allows a path to match an edge. However, quivalence, i.e., concept generation, edge redirection, and inductive it restricts that the vertex/edge labels specified in the query graph inference, over the RDF knowledge graph. q should be exactly matched. Exploiting the neighborhood-based Challenge 2: Measuring Semantic Similarity in a Uniform Man- similarity measure, Ness [8] and NeMa [7] try to identify the top-k ner.
Recommended publications
  • Combining Semantic and Lexical Measures to Evaluate Medical Terms Similarity?
    Combining semantic and lexical measures to evaluate medical terms similarity? Silvio Domingos Cardoso1;2, Marcos Da Silveira1, Ying-Chi Lin3, Victor Christen3, Erhard Rahm3, Chantal Reynaud-Dela^ıtre2, and C´edricPruski1 1 LIST, Luxembourg Institute of Science and Technology, Luxembourg fsilvio.cardoso,marcos.dasilveira,[email protected] 2 LRI, University of Paris-Sud XI, France [email protected] 3 Department of Computer Science, Universit¨atLeipzig, Germany flin,christen,[email protected] Abstract. The use of similarity measures in various domains is corner- stone for different tasks ranging from ontology alignment to information retrieval. To this end, existing metrics can be classified into several cate- gories among which lexical and semantic families of similarity measures predominate but have rarely been combined to complete the aforemen- tioned tasks. In this paper, we propose an original approach combining lexical and ontology-based semantic similarity measures to improve the evaluation of terms relatedness. We validate our approach through a set of experiments based on a corpus of reference constructed by domain experts of the medical field and further evaluate the impact of ontology evolution on the used semantic similarity measures. Keywords: Similarity measures · Ontology evolution · Semantic Web · Medical terminologies 1 Introduction Measuring the similarity between terms is at the heart of many research inves- tigations. In ontology matching, similarity measures are used to evaluate the relatedness between concepts from different ontologies [9]. The outcomes are the mappings between the ontologies, increasing the coverage of domain knowledge and optimize semantic interoperability between information systems. In informa- tion retrieval, similarity measures are used to evaluate the relatedness between units of language (e.g., words, sentences, documents) to optimize search [39].
    [Show full text]
  • Evaluating Lexical Similarity to Build Sentiment Similarity
    Evaluating Lexical Similarity to Build Sentiment Similarity Grégoire Jadi∗, Vincent Claveauy, Béatrice Daille∗, Laura Monceaux-Cachard∗ ∗ LINA - Univ. Nantes {gregoire.jadi beatrice.daille laura.monceaux}@univ-nantes.fr y IRISA–CNRS, France [email protected] Abstract In this article, we propose to evaluate the lexical similarity information provided by word representations against several opinion resources using traditional Information Retrieval tools. Word representation have been used to build and to extend opinion resources such as lexicon, and ontology and their performance have been evaluated on sentiment analysis tasks. We question this method by measuring the correlation between the sentiment proximity provided by opinion resources and the semantic similarity provided by word representations using different correlation coefficients. We also compare the neighbors found in word representations and list of similar opinion words. Our results show that the proximity of words in state-of-the-art word representations is not very effective to build sentiment similarity. Keywords: opinion lexicon evaluation, sentiment analysis, spectral representation, word embedding 1. Introduction tend or build sentiment lexicons. For example, (Toh and The goal of opinion mining or sentiment analysis is to iden- Wang, 2014) uses the syntactic categories from WordNet tify and classify texts according to the opinion, sentiment or as features for a CRF to extract aspects and WordNet re- emotion that they convey. It requires a good knowledge of lations such as antonymy and synonymy to extend an ex- the sentiment properties of each word. This properties can isting opinion lexicons. Similarly, (Agarwal et al., 2011) be either discovered automatically by machine learning al- look for synonyms in WordNet to find the polarity of words gorithms, or embedded into language resources.
    [Show full text]
  • Using Prior Knowledge to Guide BERT's Attention in Semantic
    Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks Tingyu Xia Yue Wang School of Artificial Intelligence, Jilin University School of Information and Library Science Key Laboratory of Symbolic Computation and Knowledge University of North Carolina at Chapel Hill Engineering, Jilin University [email protected] [email protected] Yuan Tian∗ Yi Chang∗ School of Artificial Intelligence, Jilin University School of Artificial Intelligence, Jilin University Key Laboratory of Symbolic Computation and Knowledge International Center of Future Science, Jilin University Engineering, Jilin University Key Laboratory of Symbolic Computation and Knowledge [email protected] Engineering, Jilin University [email protected] ABSTRACT 1 INTRODUCTION We study the problem of incorporating prior knowledge into a deep Measuring semantic similarity between two pieces of text is a fun- Transformer-based model, i.e., Bidirectional Encoder Representa- damental and important task in natural language understanding. tions from Transformers (BERT), to enhance its performance on Early works on this task often leverage knowledge resources such semantic textual matching tasks. By probing and analyzing what as WordNet [28] and UMLS [2] as these resources contain well- BERT has already known when solving this task, we obtain better defined types and relations between words and concepts. Recent understanding of what task-specific knowledge BERT needs the works have shown that deep learning models are more effective most and where it is most needed. The analysis further motivates on this task by learning linguistic knowledge from large-scale text us to take a different approach than most existing works. Instead of data and representing text (words and sentences) as continuous using prior knowledge to create a new training task for fine-tuning trainable embedding vectors [10, 17].
    [Show full text]
  • The Design & Implementation of an Abstract Semantic Graph For
    Clemson University TigerPrints All Dissertations Dissertations 12-2011 The esiD gn & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications Edward Duffy Clemson University, [email protected] Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations Part of the Computer Sciences Commons Recommended Citation Duffy, Edward, "The eD sign & Implementation of an Abstract Semantic Graph for Statement-Level Dynamic Analysis of C++ Applications" (2011). All Dissertations. 832. https://tigerprints.clemson.edu/all_dissertations/832 This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected]. THE DESIGN &IMPLEMENTATION OF AN ABSTRACT SEMANTIC GRAPH FOR STATEMENT-LEVEL DYNAMIC ANALYSIS OF C++ APPLICATIONS A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Computer Science by Edward B. Duffy December 2011 Accepted by: Dr. Brian A. Malloy, Committee Chair Dr. James B. von Oehsen Dr. Jason P. Hallstrom Dr. Pradip K. Srimani In this thesis, we describe our system, Hylian, for statement-level analysis, both static and dynamic, of a C++ application. We begin by extending the GNU gcc parser to generate parse trees in XML format for each of the compilation units in a C++ application. We then provide verification that the generated parse trees are structurally equivalent to the code in the original C++ application. We use the generated parse trees, together with an augmented version of the gcc test suite, to recover a grammar for the C++ dialect that we parse.
    [Show full text]
  • Latent Semantic Analysis for Text-Based Research
    Behavior Research Methods, Instruments, & Computers 1996, 28 (2), 197-202 ANALYSIS OF SEMANTIC AND CLINICAL DATA Chaired by Matthew S. McGlone, Lafayette College Latent semantic analysis for text-based research PETER W. FOLTZ New Mexico State University, Las Cruces, New Mexico Latent semantic analysis (LSA) is a statistical model of word usage that permits comparisons of se­ mantic similarity between pieces of textual information. This papersummarizes three experiments that illustrate how LSA may be used in text-based research. Two experiments describe methods for ana­ lyzinga subject's essay for determining from what text a subject learned the information and for grad­ ing the quality of information cited in the essay. The third experiment describes using LSAto measure the coherence and comprehensibility of texts. One of the primary goals in text-comprehension re­ A theoretical approach to studying text comprehen­ search is to understand what factors influence a reader's sion has been to develop cognitive models ofthe reader's ability to extract and retain information from textual ma­ representation ofthe text (e.g., Kintsch, 1988; van Dijk terial. The typical approach in text-comprehension re­ & Kintsch, 1983). In such a model, semantic information search is to have subjects read textual material and then from both the text and the reader's summary are repre­ have them produce some form of summary, such as an­ sented as sets ofsemantic components called propositions. swering questions or writing an essay. This summary per­ Typically, each clause in a text is represented by a single mits the experimenter to determine what information the proposition.
    [Show full text]
  • Ying Liu, Xi Yang
    Liu, Y. & Yang, X. (2018). A similar legal case retrieval Liu & Yang system by multiple speech question and answer. In Proceedings of The 18th International Conference on Electronic Business (pp. 72-81). ICEB, Guilin, China, December 2-6. A Similar Legal Case Retrieval System by Multiple Speech Question and Answer (Full paper) Ying Liu, School of Computer Science and Information Engineering, Guangxi Normal University, China, [email protected] Xi Yang*, School of Computer Science and Information Engineering, Guangxi Normal University, China, [email protected] ABSTRACT In real life, on the one hand people often lack legal knowledge and legal awareness; on the other hand lawyers are busy, time- consuming, and expensive. As a result, a series of legal consultation problems cannot be properly handled. Although some legal robots can answer some problems about basic legal knowledge that are matched by keywords, they cannot do similar case retrieval and sentencing prediction according to the criminal facts described in natural language. To overcome the difficulty, we propose a similar case retrieval system based on natural language understanding. The system uses online speech synthesis of IFLYTEK and speech reading and writing technology, integrates natural language semantic processing technology and multiple rounds of question-and-answer dialogue mechanism to realise the legal knowledge question and answer with the memory-based context processing ability, and finally retrieves a case that is most similar to the criminal facts that the user consulted. After trial use, the system has a good level of human-computer interaction and high accuracy of information retrieval, which can largely satisfy people's consulting needs for legal issues.
    [Show full text]
  • Untangling Semantic Similarity: Modeling Lexical Processing Experiments with Distributional Semantic Models
    Untangling Semantic Similarity: Modeling Lexical Processing Experiments with Distributional Semantic Models. Farhan Samir Barend Beekhuizen Suzanne Stevenson Department of Computer Science Department of Language Studies Department of Computer Science University of Toronto University of Toronto, Mississauga University of Toronto ([email protected]) ([email protected]) ([email protected]) Abstract DSMs thought to instantiate one but not the other kind of sim- Distributional semantic models (DSMs) are substantially var- ilarity have been found to explain different priming effects on ied in the types of semantic similarity that they output. Despite an aggregate level (as opposed to an item level). this high variance, the different types of similarity are often The underlying conception of semantic versus associative conflated as a monolithic concept in models of behavioural data. We apply the insight that word2vec’s representations similarity, however, has been disputed (Ettinger & Linzen, can be used for capturing both paradigmatic similarity (sub- 2016; McRae et al., 2012), as has one of the main ways in stitutability) and syntagmatic similarity (co-occurrence) to two which it is operationalized (Gunther¨ et al., 2016). In this pa- sets of experimental findings (semantic priming and the effect of semantic neighbourhood density) that have previously been per, we instead follow another distinction, based on distri- modeled with monolithic conceptions of DSM-based seman- butional properties (e.g. Schutze¨ & Pedersen, 1993), namely, tic similarity. Using paradigmatic and syntagmatic similarity that between syntagmatically related words (words that oc- based on word2vec, we show that for some tasks and types of items the two types of similarity play complementary ex- cur in each other’s near proximity, such as drink–coffee), planatory roles, whereas for others, only syntagmatic similar- and paradigmatically related words (words that can be sub- ity seems to matter.
    [Show full text]
  • Measuring Semantic Similarity of Words Using Concept Networks
    Measuring semantic similarity of words using concept networks Gabor´ Recski Eszter Iklodi´ Research Institute for Linguistics Dept of Automation and Applied Informatics Hungarian Academy of Sciences Budapest U of Technology and Economics H-1068 Budapest, Benczur´ u. 33 H-1117 Budapest, Magyar tudosok´ krt. 2 [email protected] [email protected] Katalin Pajkossy Andras´ Kornai Department of Algebra Institute for Computer Science Budapest U of Technology and Economics Hungarian Academy of Sciences H-1111 Budapest, Egry J. u. 1 H-1111 Budapest, Kende u. 13-17 [email protected] [email protected] Abstract using word embeddings and WordNet. In Sec- tion 3 we briefly introduce the 4lang resources We present a state-of-the-art algorithm and the formalism it uses for encoding the mean- for measuring the semantic similarity of ing of words as directed graphs of concepts, then word pairs using novel combinations of document our efforts to develop novel 4lang- word embeddings, WordNet, and the con- based similarity features. Besides improving the cept dictionary 4lang. We evaluate our performance of existing systems for measuring system on the SimLex-999 benchmark word similarity, the goal of the present project is to data. Our top score of 0.76 is higher than examine the potential of 4lang representations in any published system that we are aware of, representing non-trivial lexical relationships that well beyond the average inter-annotator are beyond the scope of word embeddings and agreement of 0.67, and close to the 0.78 standard linguistic ontologies. average correlation between a human rater Section 4 presents our results and pro- and the average of all other ratings, sug- vides rough error analysis.
    [Show full text]
  • Application of Graph Databases for Static Code Analysis of Web-Applications
    Application of Graph Databases for Static Code Analysis of Web-Applications Daniil Sadyrin [0000-0001-5002-3639], Andrey Dergachev [0000-0002-1754-7120], Ivan Loginov [0000-0002-6254-6098], Iurii Korenkov [0000-0002-8948-2776], and Aglaya Ilina [0000-0003-1866-7914] ITMO University, Kronverkskiy prospekt, 49, St. Petersburg, 197101, Russia [email protected], [email protected], [email protected], [email protected], [email protected] Abstract. Graph databases offer a very flexible data model. We present the approach of static code analysis using graph databases. The main stage of the analysis algorithm is the construction of ASG (Abstract Source Graph), which represents relationships between AST (Abstract Syntax Tree) nodes. The ASG is saved to a graph database (like Neo4j) and queries to the database are made to get code properties for analysis. The approach is applied to detect and exploit Object Injection vulnerability in PHP web-applications. This vulnerability occurs when unsanitized user data enters PHP unserialize function. Successful exploitation of this vulnerability means building of “object chain”: a nested object, in the process of deserializing of it, a sequence of methods is being called leading to dangerous function call. In time of deserializing, some “magic” PHP methods (__wakeup or __destruct) are called on the object. To create the “object chain”, it’s necessary to analyze methods of classes declared in web-application, and find sequence of methods called from “magic” methods. The main idea of author’s approach is to save relationships between methods and functions in graph database and use queries to the database on Cypher language to find appropriate method calls.
    [Show full text]
  • Semantic Similarity Between Sentences
    International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072 SEMANTIC SIMILARITY BETWEEN SENTENCES PANTULKAR SRAVANTHI1, DR. B. SRINIVASU2 1M.tech Scholar Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana- Hyderabad, India 2Associate Professor - Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Telangana- Hyderabad, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - The task of measuring sentence similarity is [1].Sentence similarity is one of the core elements of Natural defined as determining how similar the meanings of two Language Processing (NLP) tasks such as Recognizing sentences are. Computing sentence similarity is not a trivial Textual Entailment (RTE)[2] and Paraphrase Recognition[3]. task, due to the variability of natural language - expressions. Given two sentences, the task of measuring sentence Measuring semantic similarity of sentences is closely related to similarity is defined as determining how similar the meaning semantic similarity between words. It makes a relationship of two sentences is. The higher the score, the more similar between a word and the sentence through their meanings. The the meaning of the two sentences. WordNet and similarity intention is to enhance the concepts of semantics over the measures play an important role in sentence level similarity syntactic measures that are able to categorize the pair of than document level[4]. sentences effectively. Semantic similarity plays a vital role in Natural language processing, Informational Retrieval, Text 1.1 Problem Description Mining, Q & A systems, text-related research and application area.
    [Show full text]
  • Semantic Sentence Similarity: Size Does Not Always Matter
    INTERSPEECH 2021 30 August – 3 September, 2021, Brno, Czechia Semantic sentence similarity: size does not always matter Danny Merkx, Stefan L. Frank, Mirjam Ernestus Centre for Language Studies, Radboud University, Nijmegen, The Netherlands [email protected], [email protected], [email protected] Abstract liest VGS models, used a database of around 8,000 utterances This study addresses the question whether visually [15]. Harwath and colleagues introduced the first ‘modern’ neu- grounded speech recognition (VGS) models learn to capture ral network based approach which was trained on the Flickr8k sentence semantics without access to any prior linguistic knowl- Audio Caption corpus, a corpus with 40,000 utterances [16]. edge. We produce synthetic and natural spoken versions of a This corpus was quickly followed up by Places Audio Captions well known semantic textual similarity database and show that (400,000 utterances) [17] and, most recently, by SpokenCOCO our VGS model produces embeddings that correlate well with (600,000 utterances) [14]. human semantic similarity judgements. Our results show that However, previous work on visual grounding using writ- a model trained on a small image-caption database outperforms ten captions has shown that larger databases do not always re- two models trained on much larger databases, indicating that sult in better models. In [18], we compared models trained on database size is not all that matters. We also investigate the im- the written captions of Flickr8k [19] and MSCOCO [20]. We portance of having multiple captions per image and find that showed that, although the much larger MSCOCO (600k sen- this is indeed helpful even if the total number of images is tences) achieved better performance on the training task, the lower, suggesting that paraphrasing is a valuable learning sig- model trained on the smaller Flickr database performed better nal.
    [Show full text]
  • Word Embeddings for Semantic Similarity: Comparing LDA with Word2vec
    Word embeddings for semantic similarity: comparing LDA with Word2vec Luandrie Potgieter1 and Alta de Waal12 1 Department of Statistics, University of Pretoria 2 Center for Artificial Intelligence Research (CAIR) 1 Introduction Distributional semantics is a subfield in Natural Language Processing (NLP) that studies methods to derive meaning, and semantic representations for text. These representations can be thought of as statistical distributions of words assuming that it characterises semantic behaviour. These statistical distributions are often referred to as embeddings, or lower dimensional representations of the text. The two main categories of embeddings are count-based and prediction-based methods. Count-based methods most often result in global word embeddings as it utilizes the frequency of words in documents. Prediction-based embeddings on the other hand are often referred to as local embeddings as it takes into account a window of words adjacent to the word of interest. Once a corpus is represented by its semantic distribution (which is in a vector space), all kinds of similarity measures can be applied. This again leads to other NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots and machine translation. In this research, we investigate the appropriateness of Latent Dirichlet Allo- cation (LDA) [1] and word2vec [4] embeddings as semantic representations of a corpus. Once the semantic representation of a corpus is obtained, the distances between the documents and query documents are obtained by means of dis- tance measures such as cosine, Euclidean or Jensen-Shannon. We expect short distances between documents with similar semantic representations and longer distances between documents with different semantic representations.
    [Show full text]