Towards Building a Multilingual Sememe Knowledge Base

Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets

Fanchao Qi1∗, Liang Chang2∗†, Maosong Sun1,3‡, Sicong Ouyang2†, Zhiyuan Liu1

1Department of Computer Science and Technology, Tsinghua University
Institute for Artificial Intelligence, Tsinghua University
Beijing National Research Center for Information Science and Technology
2Beijing University of Posts and Telecommunications
3Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University
[email protected], [email protected] fsms, [email protected], [email protected]

∗Indicates equal contribution
†Work done during internship at Tsinghua University
‡Corresponding author

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
arXiv:1912.01795v1 [cs.CL] 4 Dec 2019

Abstract

A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs are built on only a few languages, which hinders their widespread utilization. To address the issue, we propose to build a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset serving as the seed of the multilingual sememe KB. It manually annotates sememes for over 15 thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, aiming to expand the seed dataset into a usable KB. We also propose two simple and effective models, which exploit different information of synsets. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work can be obtained on https://github.com/thunlp/BabelNet-Sememe-Prediction.

Introduction

A word is the smallest element of human languages that can stand by itself, but not the smallest indivisible semantic unit. In fact, the meaning of a word can be divided into smaller components. For example, one of the meanings of “man” can be represented as the composition of the meanings of “human”, “male” and “adult”. In linguistics, a sememe (Bloomfield 1926) is defined as the minimum semantic unit of human languages. Some linguists believe that meanings of all the words in any language can be decomposed into a limited set of predefined sememes, which is related to the idea of universal semantic primitives (Wierzbicka 1996).

Sememes are implicit in words. To utilize them in practical applications, people manually annotate words with predefined sememes to construct sememe knowledge bases (KBs). HowNet (Dong and Dong 2003) is the most famous one, which uses about 2,000 language-independent sememes to annotate senses of over 100 thousand Chinese and English words. Figure 1 illustrates an example of how words are annotated with sememes in HowNet.

[Figure 1: Sememe annotation of the word “husband” in HowNet. The word has two senses: "married man", annotated with the sememes human, family, male and spouse, and "carefully use", annotated with the sememe economize.]

Different from most linguistic KBs like WordNet (Miller 1995), which explain meanings of words by word-level relations, sememe KBs like HowNet provide intensional definitions for words using infra-word sememes. Sememe KBs have two unique strengths. The first one is their sememe-to-word semantic compositionality, which endows them with special suitability for integration into neural networks (Gu et al. 2018; Qi et al. 2019a). The second one is their nature that limited sememes can represent unlimited meanings, which makes sememes very useful in low data regimes, e.g., improving embeddings of low-frequency words (Sun and Chen 2016; Niu et al. 2017). In fact, sememe KBs have been proven beneficial to various NLP tasks such as word sense disambiguation (Duan, Zhao, and Xu 2007) and sentiment analysis (Fu et al. 2013).

Most languages have no sememe KBs, which prevents NLP applications of these languages from benefiting from sememe knowledge. However, building a sememe KB for a new language from scratch is time-consuming and labor-intensive: the construction of HowNet took several linguistic experts more than two decades. To tackle this challenge, Qi et al. (2018) present the task of cross-lingual lexical sememe prediction (CLSP), aiming to facilitate the construction of a new language’s sememe KB by predicting sememes for words in that language. However, CLSP can predict sememes for only one language at a time, which means repetitive efforts, including manual correctness checking conducted by native speakers, are required when constructing sememe KBs for multiple languages.

To solve this problem, we propose to build a unified sememe KB for multiple languages, namely a multilingual sememe KB, based on BabelNet (Navigli and Ponzetto 2012a), which is a more economical and efficient way to transfer sememe knowledge to other languages. BabelNet is a multilingual encyclopedic dictionary and comprises over 15 million entries called BabelNet synsets. Each BabelNet synset contains words in multiple languages with the same meaning (multilingual synonyms), and they should have the same sememe annotation. Therefore, building a multilingual sememe KB by annotating sememes for BabelNet synsets can actually provide sememe annotation for words in multiple languages simultaneously (Figure 2 shows an example).

[Figure 2: Annotating sememes for the BabelNet synset whose ID is bn:00045106n. The synset comprises words in different languages (multilingual synonyms) having the same meaning “the man a woman is married to”, and they share the four sememes on the right part. Synset contents: en: husband, hubby; zh: 丈夫, 老公, 先生, 夫; fr: mari, époux, marié; de: Ehemann, Gemahl, Gatte; ... Sememes: human, family, male, spouse.]

To advance the creation of a multilingual sememe KB, we build a seed dataset named BabelSememe, which contains about 15 thousand BabelNet synsets manually annotated with sememes. Also, we present a novel task of automatic sememe prediction for BabelNet synsets (SPBS), aiming to gradually expand the seed dataset into a usable multilingual sememe KB. In addition, we exploratively put forward two simple and effective models for SPBS, which utilize different information incorporated in BabelNet synsets. The first model exploits semantic information and recommends similar sememes to semantically close BabelNet synsets, while the second uses relational information and directly predicts relations between sememes and BabelNet synsets. In experiments, we evaluate sememe prediction performance of the two models on BabelSememe, finding they achieve satisfying results. Moreover, the ensemble of the two models yields obvious performance enhancement. Finally, we conduct detailed quantitative and qualitative analyses to explore the factors influencing sememe prediction results, aiming to point out the characteristics and difficulties [...]

[...] proposing two simple and effective models and conducting detailed analyses of factors in SPBS.

Dataset and Task Formalization

BabelSememe Dataset

We build the BabelSememe dataset, which is expected to be the seed of a multilingual sememe KB and expanded steadily by automatic sememe prediction together with examination of humans. Sememe annotation in HowNet embodies hierarchical structures of sememes, as shown in Figure 1. Nevertheless, considering the structures of sememes are seldom used in existing applications (Niu et al. 2017; Qi et al. 2019a) and it is very hard for ordinary people to make structured sememe annotation, we ignore them in BabelSememe. Thus, each BabelNet synset in BabelSememe is annotated with a set of sememes (as shown in Figure 2). In the following part, we elaborate how we build the dataset and provide its statistics.

Selecting Target Synsets. We first select 20 thousand synsets as target synsets. Each of them includes English and Chinese synonyms annotated with sememes in HowNet, so that we can generate candidate sememes for them.

Generating Candidate Sememes. We generate candidate sememes for each target synset using the sememes annotated to its synonyms. Some sememes of the synonyms in a target synset should be annotated to the target synset. For example, the word “husband” is annotated with five sememes in HowNet (Figure 1) in total, and the four sememes annotated to the sense “married man” should be annotated to the BabelNet synset bn:00045106n (Figure 2). Thus, we group the sememes of all the synonyms in a target synset together to form the candidate sememe set of the target synset.

Annotating Appropriate Sememes. We ask more than 100 annotation participants to select appropriate sememes from the corresponding candidate sememe set for each target synset. All the participants have a good grasp of both Chinese and English. We show them Chinese and English synonyms as well as definitions of each synset, making sure its meaning is fully understood. When annotation is finished, each target synset has been annotated by at least three participants. We remove the synsets whose inter-annotator agreement (IAA) is poor, where we use Krippendorff’s alpha coefficient (Krippendorff 2013) to measure IAA. Finally, 15,756 BabelNet synsets are retained, which are annotated with 43,154 sememes selected from 104,938 candidate sememes, and their average Krippendorff’s alpha coefficient is 0.702.

Dataset Statistics. Detailed statistics of BabelNet synsets [...]
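The candidate generation step described under "Generating Candidate Sememes" can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary layout (word, then sense, then sememe set), the name TOY_HOWNET and the function candidate_sememes are assumptions made for illustration. Only the entry for "husband" follows Figure 1; the other synonyms are deliberately left out of the toy dictionary rather than given invented annotations.

```python
from typing import Dict, List, Set

# Toy stand-in for HowNet (hypothetical layout): word -> sense -> sememes.
# Only "husband" is filled in, following Figure 1 of the paper.
TOY_HOWNET: Dict[str, Dict[str, Set[str]]] = {
    "husband": {
        "married man": {"human", "family", "male", "spouse"},
        "carefully use": {"economize"},
    },
}


def candidate_sememes(
    synonyms: List[str], hownet: Dict[str, Dict[str, Set[str]]]
) -> Set[str]:
    """Union the sememes of every sense of every synonym in a synset."""
    candidates: Set[str] = set()
    for word in synonyms:
        # Synonyms without an entry contribute nothing.
        for sememes in hownet.get(word, {}).values():
            candidates |= sememes
    return candidates


# A few synonyms of synset bn:00045106n, taken from Figure 2.
synset_synonyms = ["husband", "hubby", "丈夫", "mari", "Ehemann"]
print(sorted(candidate_sememes(synset_synonyms, TOY_HOWNET)))
# ['economize', 'family', 'human', 'male', 'spouse']
```

The candidate set contains all five sememes of "husband", including "economize" from the unrelated sense "carefully use". This is why a selection step is still needed, in which annotators keep only the four sememes that fit the synset's meaning.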