Semantic Text Similarity Using N-Grams, Wordnet, Syntactic Analysis, ESA and Information Retrieval Based Features


LIPN-CORE: Semantic Text Similarity using n-grams, WordNet, Syntactic Analysis, ESA and Information Retrieval based Features

Davide Buscaldi, Joseph Le Roux, Jorge J. García Flores
Laboratoire d'Informatique de Paris Nord, CNRS (UMR 7030), Université Paris 13, Sorbonne Paris Cité, F-93430 Villetaneuse, France
{buscaldi,joseph.le-roux,jgflores}@lipn.univ-paris13.fr

Adrian Popescu
CEA, LIST, Vision & Content Engineering Laboratory, F-91190 Gif-sur-Yvette, France
[email protected]

Published in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task, pages 162-168, Atlanta, Georgia, June 13-14, 2013. © 2013 Association for Computational Linguistics.

Abstract

This paper describes the system used by the LIPN team in the Semantic Textual Similarity task at *SEM 2013. It uses a support vector regression model, combining different text similarity measures that constitute the features. These measures include simple distances like Levenshtein edit distance, cosine, Named Entities overlap and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

1 Introduction

The Semantic Textual Similarity task (STS) at *SEM 2013 requires systems to grade the degree of similarity between pairs of sentences. It is closely related to other well-known tasks in NLP such as textual entailment, question answering or paraphrase detection. However, as noticed in (Bär et al., 2012), the major difference is that STS systems must give a graded, as opposed to binary, answer.

One of the most successful systems in *SEM 2012 STS (Bär et al., 2012) managed to grade pairs of sentences accurately by combining focused measures, either simple ones based on surface features (i.e. n-grams), more elaborate ones based on lexical semantics, or measures requiring external corpora such as Explicit Semantic Analysis, into a robust measure by using a log-linear regression model.

The LIPN-CORE system is built upon this idea of combining simple measures with a regression model to obtain a robust and accurate measure of textual similarity, using the individual measures as features for the global system. These measures include simple distances like Levenshtein edit distance, cosine, Named Entities overlap and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

The paper is organized as follows. Measures are presented in Section 2. Then the regression model, based on Support Vector Machines, is described in Section 3. Finally we discuss the results of the system in Section 4.

2 Text Similarity Measures

2.1 WordNet-based Conceptual Similarity (ProxiGenea)

First of all, sentences p and q are analysed in order to extract all the included WordNet synsets. For each WordNet synset, we keep noun synsets and put them into the sets of synsets associated to the sentences, C_p and C_q, respectively. If a synset belongs to one of the other POS categories (verb, adjective, adverb), we look for its derivationally related forms in order to find a related noun synset: if there is one, we put this synset in C_p (or C_q). For instance, the word "playing" can be associated in WordNet to synset (v)play#2, which has two derivationally related forms corresponding to synsets (n)play#5 and (n)play#6: these are the synsets that are added to the synset set of the sentence. No disambiguation process is carried out, so we take all possible meanings into account.
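As an illustration of the synset-set construction just described (not the authors' code), the following sketch uses NLTK's WordNet interface; the whitespace tokenisation and the use of derivationally_related_forms() are implementation assumptions, since the paper does not specify how the extraction is done.

```python
# Sketch of the concept-set construction of Section 2.1: keep noun synsets,
# map verb/adjective/adverb synsets to noun synsets via derivational links,
# and perform no disambiguation (all senses are kept).
from nltk.corpus import wordnet as wn

def concept_set(sentence):
    """Return the set of noun synsets associated with a sentence."""
    concepts = set()
    for token in sentence.lower().split():      # naive tokenisation, for illustration only
        for synset in wn.synsets(token):
            if synset.pos() == wn.NOUN:
                concepts.add(synset)
            else:
                # e.g. (v)play#2 -> (n)play#5 and (n)play#6
                for lemma in synset.lemmas():
                    for related in lemma.derivationally_related_forms():
                        if related.synset().pos() == wn.NOUN:
                            concepts.add(related.synset())
    return concepts

C_p = concept_set("The children are playing in the garden")
C_q = concept_set("Kids play outside")
```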
Given C_p and C_q as the sets of concepts contained in sentences p and q, respectively, with |C_p| ≥ |C_q|, the conceptual similarity between p and q is calculated as:

  ss(p, q) = \frac{\sum_{c_1 \in C_p} \max_{c_2 \in C_q} s(c_1, c_2)}{|C_p|}    (1)

where s(c_1, c_2) is a conceptual similarity measure. Concept similarity can be calculated in different ways. For the participation in the 2013 Semantic Textual Similarity task, we used a variation of the Wu-Palmer formula (Wu and Palmer, 1994) named "ProxiGenea" (from the French Proximité Généalogique, genealogical proximity), introduced by (Dudognon et al., 2010), which is inspired by the analogy between a family tree and the concept hierarchy in WordNet. Among the different formulations proposed by (Dudognon et al., 2010), we chose the ProxiGenea3 variant, already used in the STS 2012 task by the IRIT team (Buscaldi et al., 2012). The ProxiGenea3 measure is defined as:

  s(c_1, c_2) = \frac{1}{1 + d(c_1) + d(c_2) - 2 \cdot d(c_0)}    (2)

where c_0 is the most specific concept that is present both in the synset path of c_1 and c_2 (that is, the Least Common Subsumer or LCS). The function returning the depth of a concept is noted with d.
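A minimal sketch of Equations (1) and (2), assuming NLTK synsets such as those produced above, could look as follows; taking min_depth() as the depth function d and the deepest lowest common hypernym as c_0 are assumptions, since the excerpt does not fix these implementation details.

```python
# Sketch of ProxiGenea3 (Eq. 2) and of the sentence-level score ss(p, q) (Eq. 1).
# Assumption: d(c) is NLTK's min_depth(); pairs without a common ancestor score 0.
def proxigenea3(c1, c2):
    ancestors = c1.lowest_common_hypernyms(c2)
    if not ancestors:
        return 0.0
    c0 = max(ancestors, key=lambda s: s.min_depth())   # most specific common subsumer
    return 1.0 / (1 + c1.min_depth() + c2.min_depth() - 2 * c0.min_depth())

def ss(Cp, Cq):
    """Conceptual similarity between two synset sets (Eq. 1)."""
    if len(Cp) < len(Cq):                              # enforce the |Cp| >= |Cq| convention
        Cp, Cq = Cq, Cp
    if not Cp or not Cq:
        return 0.0
    return sum(max(proxigenea3(c1, c2) for c2 in Cq) for c1 in Cp) / len(Cp)

# e.g. ss(concept_set("A man is playing a guitar"), concept_set("Someone plays music"))
```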
2.2 IC-based Similarity

This measure has been proposed by (Mihalcea et al., 2006) as a corpus-based measure which uses Resnik's Information Content (IC) and the Jiang-Conrath (Jiang and Conrath, 1997) similarity metric:

  s_{jc}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(c_0)}    (3)

where IC is the information content introduced by (Resnik, 1995) as IC(c) = -\log P(c). The similarity between two text segments T_1 and T_2 is therefore determined as:

  sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} \max_{w_2 \in T_2} ws(w, w_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} \max_{w_1 \in T_1} ws(w, w_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)    (4)

where idf(w) is calculated as the inverse document frequency of word w, taking into account Google Web 1T (Brants and Franz, 2006) frequency counts. The semantic similarity between words is calculated as:

  ws(w_i, w_j) = \max_{c_i \in W_i, c_j \in W_j} s_{jc}(c_i, c_j)    (5)

where W_i and W_j are the sets containing all synsets in WordNet corresponding to words w_i and w_j, respectively. The IC values used are those calculated by Ted Pedersen (Pedersen et al., 2004) on the British National Corpus [1].

2.3 Syntactic Dependencies

We also wanted our system to take syntactic similarity into account. As our measures are lexically grounded, we chose to use dependencies rather than constituents. Previous experiments showed that converting constituents to dependencies still achieved best results on out-of-domain texts (Le Roux et al., 2012), so we decided to use a two-step architecture to obtain syntactic dependencies. First we parsed pairs of sentences with the LORG parser [2]. Second we converted the resulting parse trees to Stanford dependencies [3].

Given the sets of parsed dependencies D_p and D_q, for sentences p and q, a dependency d ∈ D_x is a triple (l, h, t) where l is the dependency label (for instance, dobj or prep), h the governor and t the dependant. We define the following similarity measure between two syntactic dependencies d_1 = (l_1, h_1, t_1) and d_2 = (l_2, h_2, t_2):

  dsim(d_1, d_2) = Lev(l_1, l_2) \cdot \frac{idf_h \cdot s_{WN}(h_1, h_2) + idf_t \cdot s_{WN}(t_1, t_2)}{2}    (6)

where idf_h = max(idf(h_1), idf(h_2)) and idf_t = max(idf(t_1), idf(t_2)) are the inverse document frequencies calculated on Google Web 1T for the governors and the dependants (we retain the maximum for each pair), and s_{WN} is calculated using formula (1), with two differences:

• if the two words to be compared are antonyms, then the returned score is 0;
• if one of the words to be compared is not in WordNet, their similarity is calculated using the Levenshtein distance.

The similarity score between p and q is then calculated as:

  s_{SD}(p, q) = \max \left( \frac{\sum_{d_i \in D_p} \max_{d_j \in D_q} dsim(d_i, d_j)}{|D_p|}, \frac{\sum_{d_i \in D_q} \max_{d_j \in D_p} dsim(d_i, d_j)}{|D_q|} \right)    (7)

2.4 Information Retrieval-based Similarity

Let us consider two texts p and q, an Information Retrieval (IR) system S and a document collection D indexed by S. This measure is based on the assumption that p and q are similar if the documents retrieved by S for the two texts, used as input queries, are ranked similarly. Let L_p = {d_{p_1}, ..., d_{p_K}} and L_q = {d_{q_1}, ..., d_{q_K}}, with d_{x_i} ∈ D, be the sets of the top K documents retrieved by S for texts p and q, respectively.

2.5 ESA

[...] weighted vector of Wikipedia concepts. Weights are supposed to quantify the strength of the relation between a word and each Wikipedia concept using the tf-idf measure. A text is then represented as a high-dimensional real-valued vector space spanning all along the Wikipedia database. For this particular task we adapt the research-esa implementation (Sorg and Cimiano, 2008) to our own home-made weighted vectors corresponding to a Wikipedia snapshot of February 4th, 2013.

2.6 N-gram based Similarity

This feature is based on the Clustered Keywords Positional Distance (CKPD) model proposed in (Buscaldi et al., 2009) for the passage retrieval task. The similarity between a text fragment p and another text fragment q is calculated as:

  sim_{ngrams}(p, q) = \frac{\sum_{\forall x \in Q} h(x, P) \frac{1}{d(x, x_{max})}}{\sum_{i=1}^{n} w_i}    (9)

where P is the set of n-grams with the highest [...]

Footnotes:
[1] http://www.d.umn.edu/~tpederse/similarity.html
[2] https://github.com/CNGLdlab/LORG-Release
[3] We used the default built-in converter provided with the Stanford Parser (2012-11-12 revision).
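The dependency-based measure of Section 2.3 (Equations 6 and 7) can be sketched as below; the label similarity Lev, the Web 1T idf lookup and the word similarity s_WN (with its antonym and out-of-vocabulary exceptions) are passed in as plain functions, since their implementations are described but not shown in this excerpt.

```python
# Sketch of dsim (Eq. 6) and s_SD (Eq. 7) over dependency triples (label, governor, dependant).
# lev, idf and s_wn are placeholders for the components described in Section 2.3.
def dsim(d1, d2, lev, idf, s_wn):
    (l1, h1, t1), (l2, h2, t2) = d1, d2
    idf_h = max(idf(h1), idf(h2))          # governor idf (Google Web 1T)
    idf_t = max(idf(t1), idf(t2))          # dependant idf
    return lev(l1, l2) * (idf_h * s_wn(h1, h2) + idf_t * s_wn(t1, t2)) / 2.0

def s_sd(Dp, Dq, lev, idf, s_wn):
    """Dependency-set similarity between two parsed sentences (Eq. 7)."""
    if not Dp or not Dq:
        return 0.0
    left = sum(max(dsim(di, dj, lev, idf, s_wn) for dj in Dq) for di in Dp) / len(Dp)
    right = sum(max(dsim(di, dj, lev, idf, s_wn) for dj in Dp) for di in Dq) / len(Dq)
    return max(left, right)
```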
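Section 3 of the paper, which is not part of this excerpt, combines all of the above measures as features of a support vector regression model, as stated in the abstract and introduction. A generic sketch of that kind of combination, assuming scikit-learn and toy feature values, is:

```python
# Generic sketch of combining per-pair similarity scores (WordNet, IC-based,
# syntactic, IR, ESA, n-gram, ...) as features of a support vector regression
# model trained on gold STS scores. scikit-learn is an assumption; the excerpt
# does not say which SVR implementation the authors used.
import numpy as np
from sklearn.svm import SVR

# One row per sentence pair, one column per similarity measure (toy values).
X_train = np.array([[0.82, 0.75, 0.60, 0.71],
                    [0.10, 0.22, 0.05, 0.18],
                    [0.55, 0.48, 0.40, 0.52]])
y_train = np.array([4.6, 0.8, 2.9])        # gold similarity scores in [0, 5]

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

X_test = np.array([[0.70, 0.66, 0.51, 0.63]])
print(model.predict(X_test))               # graded similarity estimate for a new pair
```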