Arxiv:1904.05033V1

Better Word Embeddings by Disentangling Contextual n-Gram Information

Prakhar Gupta* (EPFL, Switzerland), Matteo Pagliardini* (Iprova SA, Switzerland), Martin Jaggi (EPFL, Switzerland)

* indicates equal contribution

arXiv:1904.05033v1 [cs.CL] 10 Apr 2019

Abstract

Pre-trained word vectors are ubiquitous in Natural Language Processing applications. In this paper, we show how training word embeddings jointly with bigram and even trigram embeddings, results in improved unigram embeddings. We claim that training word embeddings along with higher n-gram embeddings helps in the removal of the contextual information from the unigrams, resulting in better stand-alone word embeddings. We empirically show the validity of our hypothesis by outperforming other competing word representation models by a significant margin on a wide variety of tasks. We make our models publicly available.

1 Introduction

Distributed word representations are essential building blocks of modern NLP systems. Used as features in downstream applications, they often enhance generalization of models trained on a limited amount of data. They do so by capturing relevant distributional information about words from large volumes of unlabeled text.

Efficient methods to learn word vectors have been introduced in the past, most of them based on the distributional hypothesis of Harris (1954); Firth (1957): “a word is characterized by the company it keeps”. While a standard approach relies on global corpus statistics (Pennington et al., 2014) formulated as a matrix factorization using mean square reconstruction loss, other widely used methods are the bilinear word2vec architectures introduced by Mikolov et al. (2013a): While skip-gram aims to predict nearby words from a given word, CBOW predicts a target word from its set of context words.

Recently, significant improvements in the quality of the word embeddings were obtained by augmenting word-context pairs with sub-word information in the form of character n-grams (Bojanowski et al., 2017), especially for morphologically rich languages. Nevertheless, to the best of our knowledge, no method has been introduced leveraging collocations of words with higher order word n-grams such as bigrams or trigrams as well as character n-grams together.

In this paper, we show how using higher order word n-grams along with unigrams during training can significantly improve the quality of obtained word embeddings. The addition furthermore helps to disentangle contextual information present in the training data from the unigrams and results in overall better distributed word representations.

To validate our claim, we train two modifications of CBOW augmented with word-n-gram information during training. One is a recent sentence embedding method, Sent2Vec (Pagliardini et al., 2018), which we repurpose to obtain word vectors. The second method we propose is a modification of CBOW enriched with character n-gram information (Bojanowski et al., 2017) that we again augment with word n-gram information. In both cases, we compare the resulting vectors with the most widely used word embedding methods on word similarity and analogy tasks and show significant quality improvements. The code used to train the models presented in this paper as well as the models themselves are made available to the public¹.

¹ publicly available on http://github.com/epfml/sent2vec

2 Model Description

Before introducing our model, we recapitulate fundamental existing word embeddings methods.

CBOW and skip-gram models. Continuous bag-of-words (CBOW) and skip-gram models are standard log-bilinear models for obtaining word embeddings based on word-context pair information (Mikolov et al., 2013a). Context here refers to a symmetric window centered on the target word $w_t$, containing the surrounding tokens at a distance less than some window size $ws$: $C_t = \{w_k \mid k \in [t - ws, t + ws]\}$. The CBOW model tries to predict the target word given its context, maximizing the likelihood $\prod_{t=1}^{T} p(w_t \mid C_t)$, whereas skip-gram learns by predicting the context for a given target word maximizing $\prod_{t=1}^{T} p(C_t \mid w_t)$. To model those probabilities, a softmax activation is used on top of the inner product between a target vector $u_{w_t}$ and its context vector $\frac{1}{|C_t|} \sum_{w \in C_t} v_w$.

To overcome the computational bottleneck of the softmax for large vocabulary, negative sampling or noise contrastive estimation are well-established (Mikolov et al., 2013b), with the idea of employing simpler pairwise binary classifier loss functions to differentiate between the valid context $C_t$ and fake contexts $N_{C_t}$ sampled at random. While generating target-context pairs, both CBOW and skip-gram also use input word subsampling, discarding higher-frequency words with higher probability during training, in order to prevent the model from overfitting the most frequent tokens. Standard CBOW also uses a dynamic context window size: for each subsampled target word $w$, the size of its associated context window is sampled uniformly between 1 and $ws$ (Mikolov et al., 2013b).

Adding character n-grams. Bojanowski et al. (2017) have augmented CBOW and skip-gram by adding character n-grams to the context representations. Word vectors are expressed as the sum of its unigram and average of its character n-gram embeddings $W_w$:

$$v := v_w + \frac{1}{|W_w|} \sum_{c \in W_w} v_c$$

Character n-grams are hashed to an index in the embedding matrix. The training remains the same as for CBOW and skip-gram. This approach greatly improves the performances of CBOW and skip-gram on morpho-syntactic tasks. For the rest of the paper, we will refer to the CBOW and skip-gram methods enriched with subword-information as CBOW-char and skip-gram-char respectively.

GloVe. Instead of training online on local window contexts, GloVe vectors (Pennington et al., 2014) are trained using global co-occurrence statistics by factorizing the word-context co-occurrence matrix.

Ngram2vec. In order to leverage the performance of word vectors, training of word vectors using the skip-gram objective function with negative sampling is augmented with n-gram co-occurrence information (Zhao et al., 2017).

2.1 Improving unigram embeddings by adding higher order word-n-grams to contexts

CBOW-char with word n-grams. We propose to augment CBOW-char to additionally use word n-gram context vectors (in addition to char n-grams and the context word itself). More precisely, during training, the context vector for a given word $w_t$ is given by the average of all word-n-grams $N_t$, all char-n-grams, and all unigrams in the span of the current context window $C_t$:

$$v := \frac{\sum_{w \in C_t} v_w + \sum_{n \in N_t} v_n + \sum_{w \in C_t} \sum_{c \in W_w} v_c}{|C_t| + |N_t| + \sum_{w \in C_t} |W_w|} \tag{1}$$

For a given sentence, we apply input subsampling and a sliding context window as for standard CBOW. In addition, we keep the mapping from the subsampled sentence to the original sentence for the purpose of extracting word n-grams from the original sequence of words, within the span of the context window. Word n-grams are added to the context using the hashing trick in the same way char-n-grams are handled. We use two different hashing index ranges to ensure there is no collision between char n-gram and word n-gram representations.

Sent2Vec for word embeddings. Initially implemented for sentence embeddings, Sent2Vec (Pagliardini et al., 2018) can be seen as a derivative of word2vec’s CBOW. The key differences between CBOW and Sent2Vec are the removal of the input subsampling, considering the entire sentence as context, as well as the addition of word-n-grams.

Here, word and n-grams embeddings from an entire sentence are averaged to form the corresponding sentence (context) embedding.

For both proposed CBOW-char and Sent2Vec models, we employ dropout on word n-grams during training. For both models, word embeddings are obtained by simply discarding the higher order n-gram embeddings after training.

Model                         WS 353         WS 353 Relatedness   WS 353 Similarity
CBOW-char                     .709 ± .006    .626 ± .009          .783 ± .004
CBOW-char + bi.               .719 ± .007    .652 ± .010          .778 ± .007
CBOW-char + bi. + tri.        .727 ± .008    .664 ± .008          .783 ± .004
Sent2Vec uni.                 .705 ± .004    .593 ± .005          .793 ± .006
Sent2Vec uni. + bi.           .755 ± .005    .683 ± .008          .817 ± .007
Sent2Vec uni. + bi. + tri.    .780 ± .003    .721 ± .006          .828 ± .003

Model                         SimLex-999     MEN            Rare Words     Mechanical Turk
CBOW-char                     .424 ± .004    .769 ± .002    .497 ± .002    .675 ± .007
CBOW-char + bi.               .436 ± .004    .786 ± .002    .506 ± .001    .671 ± .007
CBOW-char + bi. + tri.        .441 ± .003    .788 ± .002    .509 ± .003    .678 ± .010
Sent2Vec uni.                 .450 ± .003    .765 ± .002    .444 ± .001    .625 ± .005
Sent2Vec uni. + bi.           .440 ± .002    .791 ± .002    .430 ± .002    .661 ± .005
Sent2Vec uni. + bi. + tri.    .464 ± .003    .798 ± .001    .432 ± .003    .658 ± .006

Model                         Google (Syntactic Analogies)   Google (Semantic Analogies)   MSR
CBOW-char                     .920 ± .001                    .799 ± .004                   .842 ± .002
CBOW-char + bi.               .928 ± .003                    .798 ± .006                   .856 ± .004
CBOW-char + bi. + tri.        .929 ± .001                    .794 ± .005                   .857 ± .002
Sent2Vec uni.                 .826 ± .003                    .847 ± .003                   .734 ± .003
Sent2Vec uni. + bi.           .843 ± .004                    .844 ± .002                   .754 ± .004
Sent2Vec uni. + bi. + tri.    .837 ± .003                    .853 ± .003                   .745 ± .001

Table 1: Impact of using word n-grams: Models are compared using Spearman correlation measures for word similarity tasks and accuracy for word analogy tasks. Top performances on each dataset are shown in bold. An underline shows the best model(s) restricted to each architecture type. The abbreviations uni., bi., and tri. stand for unigrams, bigrams, and trigrams respectively.

3 Experimental Setup

3.1 Training

We train all competing models on a wikipedia dump of 69 million sentences containing 1.7 bil- [...] best performing model and is included in the comparison.

For each method, we extensively tuned hyperparameters starting from the recommended values.
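The context vector of Equation (1) in Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vocabulary, the bucket sizes, the CRC32 hash, and all function names are assumptions chosen for illustration. It also simplifies the method by extracting word n-grams directly from the given context tokens (the paper extracts them from the original, non-subsampled sentence) and omits n-gram dropout. What it does show is the averaging over unigrams, word n-grams, and char n-grams, and the use of two separate hash index ranges so that char n-grams and word n-grams never share a row.

```python
import zlib

DIM = 4                                  # embedding dimension (illustrative)
VOCAB = {"the": 0, "cat": 1, "sat": 2}   # toy unigram vocabulary (hypothetical)
CHAR_BUCKETS = 50                        # rows reserved for char n-grams (hypothetical)
WORD_BUCKETS = 50                        # rows reserved for word n-grams (hypothetical)

# One embedding matrix, three disjoint row ranges:
# unigrams | hashed char n-grams | hashed word n-grams
CHAR_OFFSET = len(VOCAB)
WORD_OFFSET = CHAR_OFFSET + CHAR_BUCKETS
N_ROWS = WORD_OFFSET + WORD_BUCKETS


def bucket(text, n_buckets):
    """Hash a string into [0, n_buckets). CRC32 is a stand-in hash function."""
    return zlib.crc32(text.encode("utf-8")) % n_buckets


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def word_ngrams(tokens, n_max=2):
    """Word n-grams (bigrams up to n_max-grams) over consecutive tokens."""
    return [" ".join(tokens[i:i + n])
            for n in range(2, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def context_rows(context_tokens, n_max=2):
    """Embedding-matrix rows for all unigrams, word n-grams, and char n-grams."""
    rows = [VOCAB[w] for w in context_tokens]                       # C_t
    rows += [WORD_OFFSET + bucket(g, WORD_BUCKETS)                  # N_t
             for g in word_ngrams(context_tokens, n_max)]
    for w in context_tokens:                                        # W_w for each w
        rows += [CHAR_OFFSET + bucket(g, CHAR_BUCKETS) for g in char_ngrams(w)]
    return rows


def context_vector(embeddings, context_tokens, n_max=2):
    """Average of all selected rows; len(rows) is the denominator of Eq. (1)."""
    rows = context_rows(context_tokens, n_max)
    return [sum(embeddings[r][d] for r in rows) / len(rows) for d in range(DIM)]


# Usage with a dummy embedding matrix (values are arbitrary).
embeddings = [[0.01 * r] * DIM for r in range(N_ROWS)]
vec = context_vector(embeddings, ["the", "cat"])
print(len(vec))  # prints 4, one value per embedding dimension
```

Because hashed rows are counted once per occurrence, the number of averaged rows equals $|C_t| + |N_t| + \sum_{w \in C_t} |W_w|$ even when two n-grams collide within the same range. After training, only the unigram rows would be kept as word embeddings, in line with the paper's statement that the higher order n-gram embeddings are discarded.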