Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary

Total Page:16

File Type:pdf, Size:1020Kb

Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary Torsten Zesch and Christof Müller and Iryna Gurevych Ubiquitous Knowledge Processing Lab Computer Science Department Technische Universität Darmstadt, Hochschulstraße 10 D-64289 Darmstadt, Germany {zesch,mueller,gurevych} (at) tk.informatik.tu-darmstadt.de Abstract Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes. 1. Introduction with linguistic knowledge bases in Section 2. We review ex- Currently, the world wide web is undergoing a major isting mechanisms of accessing Wikipedia and Wiktionary change as more and more people are actively contributing in Section 3. In Section 4., we introduce the system archi- to the content available in the so called Web 2.0. Some of tecture that is used to provide structured access to the lexi- these rapidly growing web sites, e.g. Wikipedia (Wikime- cal semantic information contained in Wikipedia and Wik- dia Foundation, 2008a) or Wiktionary (Wikimedia Founda- tionary. In Section 5., we show how selected NLP tasks can tion, 2008b), have the potential to be used as a new kind benefit from the improved access capabilities provided by of lexical semantic resource due to their increasing size and the proposed APIs. We conclude with a summary in Sec- significant coverage of past and current developments. tion 6. In particular, the potential of Wikipedia as a lexical se- 2. Collaborative Knowledge Bases mantic knowledge base has recently started to get ex- Wikipedia and Wiktionary are instances of knowledge plored. It has been used in NLP tasks like text catego- bases that are collaboratively constructed by mainly non- rization (Gabrilovich and Markovitch, 2006), information professional volunteers on the web. We call such a knowl- extraction (Ruiz-Casado et al., 2005), information retrieval edge base Collaborative Knowledge Base (CKB), as op- (Gurevych et al., 2007), question answering (Ahn et al., posed to a Linguistic Knowledge Base (LKB) like Word- 2004), computing semantic relatedness (Zesch et al., 2007), Net (Fellbaum, 1998) or GermaNet (Kunze, 2004). In this or named entity recognition (Bunescu and Pasca, 2006). section, we briefly analyze the CKBs Wikipedia and Wik- Wiktionary has not yet been exploited for research pur- tionary as lexical semantic knowledge bases, and compare poses as extensively as Wikipedia. Interest has nonetheless them with traditionally used LKBs. already arisen, as it has recently been employed in areas like subjectivity and polarity classification (Chesley et al., 2.1. Wikipedia 2006), or diachronic phonology (Bouchard et al., 2007). Wikipedia is a multilingual, web-based, freely available en- All these tasks require reliable lexical semantic infor- cyclopedia, constructed in a collaborative effort of volun- mation which usually comes from linguistic knowledge tary contributors. It grows rapidly, and with approx 7.5 mil- bases like WordNet (Fellbaum, 1998) or GermaNet (Kunze, lion articles in more than 250 languages it has arguably be- 2004). They are usually shipped with easy-to-use appli- come the largest collection of freely available knowledge.3 1 cation programming interfaces (APIs), e.g. JWNL or Articles in Wikipedia form a heavily interlinked knowl- 2 GermaNetAPI , that allow for easy integration into applica- edge base, enriched with a category system emerging from tions. However, Wikipedia and Wiktionary have lacked this collaborative tagging, which constitutes a thesaurus (Voss, kind of support so far which constitutes a significant imped- 2006). Wikipedia thus contains a rich body of lexical iment for NLP research. Therefore, we developed general semantic information, whose aspects are thoroughly de- purpose, high performance Java-based APIs for Wikipedia scribed in (Zesch et al., 2007). This includes knowledge and Wiktionary that we made freely available to the re- about named entities, domain specific terms or domain spe- search community. cific word senses that is rarely available in LKBs. Addi- In this paper, we first describe Wikipedia and Wiktionary tionally, the redirect system of Wikipedia articles can be from a lexical semantic point of view, and compare them used as a dictionary for synonyms, spelling variations and abbreviations. 1 http://sourceforge.net/projects/jwordnet 2 3 http://projects.villa-bosch.de/nlpsoft/gn_api/index.html http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons 1646 Language Rank Entries English Wiktionary German Wiktionary French 1 730,193 English German English German English 2 682,982 Entries 176,410 10,487 3,231 20,557 Vietnamese 3 225,380 Nouns 99,456 6,759 2,116 13,977 Turkish 4 185,603 Verbs 31,164 1,257 378 1,872 Russian 5 132,386 Adjectives 23,041 1,117 357 2,261 Ido 6 128,366 Examples 34,083 465 1,217 20,053 Chinese 7 115,318 Quotations 8,849 55 0 0 Greek 8 102,198 Categories 4,019 992 32 89 Arabic 9 95,020 Derived terms 43,903 944 2,319 36,259 Polish 10 85,494 Collocations 0 0 1,568 28,785 German 12 71,399 Synonyms 29,703 1,916 2,651 34,488 Spanish 20 31,652 Hyponyms 94 0 390 17,103 Hypernyms 42 0 336 17,286 Table 1: Size of Wiktionary language editions as of Febru- Antonyms 4,305 238 283 10,902 ary 29, 2008. Table 2: The number of entries and selected types of lex- ical semantic information available from the English and 2.2. Wiktionary German editions of Wiktionary as of September 2007. Wiktionary is a multilingual, web-based, freely available dictionary, thesaurus and phrase book, designed as the lex- ical companion to Wikipedia. It is also collaboratively con- mology, pronunciation, declension, examples, sample quo- structed by volunteers with no specialized qualifications tations, translations, collocations, derived terms, and usage necessary. notes. Lexically or semantically related terms of several Wiktionary targets common vocabulary and matters of lan- types like synonyms, antonyms, hypernyms and hyponyms guage and wordsmithing. It includes terms from all parts are included as well. On top of that, the English Wiktionary of speech, but excludes in-depth factual and encyclope- edition offers a remarkable amount of information not typi- dic information, as this kind of information is contained cally found in LKBs, including compounds, abbreviations, in Wikipedia.4 Thus, Wikipedia and Wiktionary are largely acronyms and initialisms, common misspellings (e.g. ba- complementary. sicly vs. basically), simplified spelling variants (e.g. thru vs. through), contractions (e.g. o’ vs. of ), proverbs (e.g. Languages and size Wiktionary consists of approx 3.5 no pain, no gain), disputed usage words (e.g. irregardless 5 million entries in 172 language editions. Unlike most vs. irrespective or regardless), protologisms (e.g. iPodian), LKBs each Wiktionary edition also contains entries for onomatopoeia (e.g. grr), or even colloquial, slang and pejo- foreign language terms. Therefore, each language edi- rative language forms. Most of these lexical semantic rela- tion comprises a multilingual dictionary with a substantial tions are explicitly encoded in the structure of a Wiktionary amount of entries in different languages (cf. Table 2). For entry. This stands in clear contrast to Wikipedia, where instance, the English Wiktionary contains the German entry links between articles usually lack clearly defined seman- Haus, which is explained in English as meaning house. tics. The size of a particular language edition of Wiktionary Different Wiktionary editions may include different types largely depends on how active the corresponding commu- of information; e.g. the German edition offers mnemonics, nity is (see Table 1). Surprisingly, the English edition while it currently does not contain quotations. The English (682,982 entries), started on December 12, 2002, is, though edition has no collocations and only very few instances of the oldest, not the largest one. The French Wiktionary hyponymy or hypernymy (see Table 2). Like in Wikipedia, (730,193 entries), which was launched over a year later, is each entry in Wiktionary is additionally connected to a list the largest. Other major languages like German (71,399 of categories. Finally, entries in Wiktionary are massively entries) or Spanish (31,652 entries) are not found among linked to other entries in different ways: they are intra- the top ten, while Ido, a constructed language, has the 6th linked, pointing to other entries in the same Wiktionary; largest edition of Wiktionary containing 128,366 entries. they are inter-linked, pointing to corresponding entries in Table 2 shows the number of English and German entries other language editions of Wiktionary; they also link to ex- in the corresponding Wiktionary editions. The English ternal knowledge bases such as Wikipedia and other web- Wiktionary edition exceeds the size of WordNet 3.0 with based dictionaries. 176,410 English entries as compared to 155,287 unique lexical units in WordNet. In contrast, the 20,557 German 2.3. Comparison with LKBs entries in the German Wiktionary edition are considerably Wikipedia and Wiktionary are instances of collabora- fewer than the 76,563 lexical units in GermaNet 5.0. tive knowledge bases (other examples are dmoz6 or Citi- Lexical semantic information Entries in Wiktionary are zendium7).
Recommended publications
  • Dictionary Users Do Look up Frequent Words. a Logfile Analysis
    Erschienen in: Müller-Spitzer, Carolin (Hrsg.): Using Online Dictionaries. Berlin/ Boston: de Gruyter, 2014. (Lexicographica: Series Maior 145), S. 229-249. Alexander Koplenig, Peter Meyer, Carolin Müller-Spitzer Dictionary users do look up frequent words. A logfile analysis Abstract: In this paper, we use the 2012 log files of two German online dictionaries (Digital Dictionary of the German Language1 and the German Version of Wiktionary) and the 100,000 most frequent words in the Mannheim German Reference Corpus from 2009 to answer the question of whether dictionary users really do look up fre- quent words, first asked by de Schryver et al. (2006). By using an approach to the comparison of log files and corpus data which is completely different from that of the aforementioned authors, we provide empirical evidence that indicates - contra - ry to the results of de Schryver et al. and Verlinde/Binon (2010) - that the corpus frequency of a word can indeed be an important factor in determining what online dictionary users look up. Finally, we incorporate word dass Information readily available in Wiktionary into our analysis to improve our results considerably. Keywords: log file, frequency, corpus, headword list, monolingual dictionary, multi- lingual dictionary Alexander Koplenig: Institut für Deutsche Sprache, R 5, 6-13, 68161 Mannheim, +49-(0)621-1581- 435, [email protected] Peter Meyer: Institut für Deutsche Sprache, R 5, 6-13, 68161 Mannheim, +49-(0)621-1581-427, [email protected] Carolin Müller-Spitzer: Institut für Deutsche Sprache, R 5, 6-13, 68161 Mannheim, +49-(0)621-1581- 429, [email protected] Introduction We would like to Start this chapter by asking one of the most fundamental questions for any general lexicographical endeavour to describe the words of one (or more) language(s): which words should be included in a dictionary? At first glance, the answer seems rather simple (especially when the primary objective is to describe a language as completely as possible): it would be best to include every word in the dictionary.
    [Show full text]
  • Etytree: a Graphical and Interactive Etymology Dictionary Based on Wiktionary
    Etytree: A Graphical and Interactive Etymology Dictionary Based on Wiktionary Ester Pantaleo Vito Walter Anelli Wikimedia Foundation grantee Politecnico di Bari Italy Italy [email protected] [email protected] Tommaso Di Noia Gilles Sérasset Politecnico di Bari Univ. Grenoble Alpes, CNRS Italy Grenoble INP, LIG, F-38000 Grenoble, France [email protected] [email protected] ABSTRACT a new method1 that parses Etymology, Derived terms, De- We present etytree (from etymology + family tree): a scendants sections, the namespace for Reconstructed Terms, new on-line multilingual tool to extract and visualize et- and the etymtree template in Wiktionary. ymological relationships between words from the English With etytree, a RDF (Resource Description Framework) Wiktionary. A first version of etytree is available at http: lexical database of etymological relationships collecting all //tools.wmflabs.org/etytree/. the extracted relationships and lexical data attached to lex- With etytree users can search a word and interactively emes has also been released. The database consists of triples explore etymologically related words (ancestors, descendants, or data entities composed of subject-predicate-object where cognates) in many languages using a graphical interface. a possible statement can be (for example) a triple with a lex- The data is synchronised with the English Wiktionary dump eme as subject, a lexeme as object, and\derivesFrom"or\et- at every new release, and can be queried via SPARQL from a ymologicallyEquivalentTo" as predicate. The RDF database Virtuoso endpoint. has been exposed via a SPARQL endpoint and can be queried Etytree is the first graphical etymology dictionary, which at http://etytree-virtuoso.wmflabs.org/sparql.
    [Show full text]
  • User Contributions to Online Dictionaries Andrea Abel and Christian M
    The dynamics outside the paper: User Contributions to Online Dictionaries Andrea Abel and Christian M. Meyer Electronic lexicography in the 21st century (eLex): Thinking outside the paper, Tallinn, Estonia, October 17–19, 2013. 18.10.2013 | European Academy of Bozen/Bolzano and Technische Universität Darmstadt | Andrea Abel and Christian M. Meyer | 1 Introduction Online dictionaries rely increasingly on their users and leverage methods for facilitating user contributions at basically any step of the lexicographic process. 18.10.2013 | European Academy of Bozen/Bolzano and Technische Universität Darmstadt | Andrea Abel and Christian M. Meyer | 2 Previous work . Mostly focused on specific type/dimension of user contribution . e.g., Carr (1997), Storrer (1998, 2010), Køhler Simonsen (2005), Fuertes-Olivera (2009), Melchior (2012), Lew (2011, 2013) . Ambiguous, partly overlapping terms: www.wordle.net/show/wrdl/7124863/User_contributions (02.10.2013) www.wordle.net/show/wrdl/7124863/User_contributions http:// 18.10.2013 | European Academy of Bozen/Bolzano and Technische Universität Darmstadt | Andrea Abel and Christian M. Meyer | 3 Research goals and contribution How to effectively plan the user contributions for new and established dictionaries? . Comprehensive and systematic classification is still missing . Mann (2010): analysis of 88 online dictionaries . User contributions roughly categorized as direct / indirect contributions and exchange with other dictionary users . Little detail given, since outside the scope of the analysis . We use this as a basis for our work We propose a novel classification of the different types of user contributions based on many practical examples 18.10.2013 | European Academy of Bozen/Bolzano and Technische Universität Darmstadt | Andrea Abel and Christian M.
    [Show full text]
  • Wiktionary Matcher
    Wiktionary Matcher Jan Portisch1;2[0000−0001−5420−0663], Michael Hladik2[0000−0002−2204−3138], and Heiko Paulheim1[0000−0003−4386−8195] 1 Data and Web Science Group, University of Mannheim, Germany fjan, [email protected] 2 SAP SE Product Engineering Financial Services, Walldorf, Germany fjan.portisch, [email protected] Abstract. In this paper, we introduce Wiktionary Matcher, an ontology matching tool that exploits Wiktionary as external background knowl- edge source. Wiktionary is a large lexical knowledge resource that is collaboratively built online. Multiple current language versions of Wik- tionary are merged and used for monolingual ontology matching by ex- ploiting synonymy relations and for multilingual matching by exploiting the translations given in the resource. We show that Wiktionary can be used as external background knowledge source for the task of ontology matching with reasonable matching and runtime performance.3 Keywords: Ontology Matching · Ontology Alignment · External Re- sources · Background Knowledge · Wiktionary 1 Presentation of the System 1.1 State, Purpose, General Statement The Wiktionary Matcher is an element-level, label-based matcher which uses an online lexical resource, namely Wiktionary. The latter is "[a] collaborative project run by the Wikimedia Foundation to produce a free and complete dic- tionary in every language"4. The dictionary is organized similarly to Wikipedia: Everybody can contribute to the project and the content is reviewed in a com- munity process. Compared to WordNet [4], Wiktionary is significantly larger and also available in other languages than English. This matcher uses DBnary [15], an RDF version of Wiktionary that is publicly available5. The DBnary data set makes use of an extended LEMON model [11] to describe the data.
    [Show full text]
  • Extracting an Etymological Database from Wiktionary Benoît Sagot
    Extracting an Etymological Database from Wiktionary Benoît Sagot To cite this version: Benoît Sagot. Extracting an Etymological Database from Wiktionary. Electronic Lexicography in the 21st century (eLex 2017), Sep 2017, Leiden, Netherlands. pp.716-728. hal-01592061 HAL Id: hal-01592061 https://hal.inria.fr/hal-01592061 Submitted on 22 Sep 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Extracting an Etymological Database from Wiktionary Benoît Sagot Inria 2 rue Simone Iff, 75012 Paris, France E-mail: [email protected] Abstract Electronic lexical resources almost never contain etymological information. The availability of such information, if properly formalised, could open up the possibility of developing automatic tools targeted towards historical and comparative linguistics, as well as significantly improving the automatic processing of ancient languages. We describe here the process we implemented for extracting etymological data from the etymological notices found in Wiktionary. We have produced a multilingual database of nearly one million lexemes and a database of more than half a million etymological relations between lexemes. Keywords: Lexical resource development; etymology; Wiktionary 1. Introduction Electronic lexical resources used in the fields of natural language processing and com- putational linguistics are almost exclusively synchronic resources; they mostly include information about inflectional, derivational, syntactic, semantic or even pragmatic prop- erties of their entries.
    [Show full text]
  • Wiktionary Matcher Results for OAEI 2020
    Wiktionary Matcher Results for OAEI 2020 Jan Portisch1;2[0000−0001−5420−0663] and Heiko Paulheim1[0000−0003−4386−8195] 1 Data and Web Science Group, University of Mannheim, Germany fjan, [email protected] 2 SAP SE Product Engineering Financial Services, Walldorf, Germany [email protected] Abstract. This paper presents the results of the Wiktionary Matcher in the Ontology Alignment Evaluation Initiative (OAEI) 2020. Wiktionary Matcher is an ontology matching tool that exploits Wiktionary as exter- nal background knowledge source. Wiktionary is a large lexical knowl- edge resource that is collaboratively built online. Multiple current lan- guage versions of Wiktionary are merged and used for monolingual on- tology matching by exploiting synonymy relations and for multilingual matching by exploiting the translations given in the resource. This is the second OAEI participation of the matching system. Wiktionary Matcher has been improved and is the best performing system on the knowledge graph track this year.3 Keywords: Ontology Matching · Ontology Alignment · External Re- sources · Background Knowledge · Wiktionary 1 Presentation of the System 1.1 State, Purpose, General Statement The Wiktionary Matcher is an element-level, label-based matcher which uses an online lexical resource, namely Wiktionary. The latter is "[a] collaborative project run by the Wikimedia Foundation to produce a free and complete dic- tionary in every language"4. The dictionary is organized similarly to Wikipedia: Everybody can contribute to the project and the content is reviewed in a com- munity process. Compared to WordNet [2], Wiktionary is significantly larger and also available in other languages than English. This matcher uses DBnary [13], an RDF version of Wiktionary that is publicly available5.
    [Show full text]
  • S Wiktionary Wikisource Wikibooks Wikiquote Wikimedia Commons
    SCHWESTERPROJEKTE Wiktionary S Das Wiktionary ist der lexikalische Partner der freien Enzyklopädie Wikipedia: ein Projekt zur Erstellung freier Wörterbücher und Thesau- ri. Während die Wikipedia inhaltliche Konzepte beschreibt, geht es in ihrem ältesten Schwester- projekt, dem 2002 gegründeten Wiktionary um Wörter, ihre Grammatik und Etymologie, Homo- nyme und Synonyme und Übersetzungen. Wikisource Wikisource ist eine Sammlung von Texten, die entweder urheberrechtsfrei sind oder unter ei- ner freien Lizenz stehen. Das Projekt wurde am 24. November 2003 gestartet. Der Wiktionary-EIntrag zum Wort Schnee: Das Wörterbuch präsen- Zunächst mehrsprachig auf einer gemeinsamen tiert Bedeutung, Deklination, Synonyme und Übersetzungen. Plattform angelegt, wurde es später in einzel- ne Sprachversionen aufgesplittet. Das deutsche Teilprojekt zählte im März 2006 über 2000 Texte Wikisource-Mitarbeiter arbeiten an einer digitalen, korrekturge- und über 100 registrierte Benutzer. lesenen und annotierten Ausgabe der Zimmerischen Chronik. Wikibooks Das im Juli 2003 aus der Taufe gehobene Projekt Wikibooks dient der gemeinschaftlichen Schaf- fung freier Lehrmaterialien – vom Schulbuch über den Sprachkurs bis zum praktischen Klet- terhandbuch oder der Go-Spielanleitung Wikiquote Wikiquote zielt darauf ab, auf Wiki-Basis ein freies Kompendium von Zitaten und Das Wikibooks-Handbuch Go enthält eine ausführliche Spielanleitung Sprichwörtern in jeder Sprache zu schaffen. Die des japanischen Strategiespiels. Artikel über Zitate bieten (soweit bekannt) eine Quellenangabe und werden gegebenenfalls in die deutsche Sprache übersetzt. Für zusätzliche Das Wikimedia-Projekt Wikiquote sammelt Sprichwörter und Informationen sorgen Links in die Wikipedia. Zitate, hier die Seite zum Schauspieler Woody Allen Wikimedia Commons Wikimedia Commons wurde im September 2004 zur zentralen Aufbewahrung von Multime- dia-Material – Bilder, Videos, Musik – für alle Wi- kimedia-Projekte gegründet.
    [Show full text]
  • Thank You - Wiktionary 4/24/11 9:56 PM Thank You
    thank you - Wiktionary 4/24/11 9:56 PM thank you Definition from Wiktionary, the free dictionary See also thankyou and thank-you Contents 1 English 1.1 Etymology 1.2 Pronunciation 1.3 Interjection 1.3.1 Synonyms 1.3.2 Translations 1.4 Noun 1.5 References 1.6 See also English Etymology Thank you is a shortened expression for I thank you; it is attested since c. 1400.[1] Pronunciation Audio (UK) (file) Interjection thank you 1. An expression of gratitude or politeness, in response to something done or given. http://en.wiktionary.org/wiki/thank_you Page 1 of 5 thank you - Wiktionary 4/24/11 9:56 PM Synonyms cheers (informal), thanks, thanks very much, thank you very much, thanks a lot, ta (UK, Australia), thanks a bunch (informal), thanks a million (informal), much obliged Translations an expression of gratitude [hide ▲] Select targeted languages ﺳﻮﭘﺎﺱ ,Afar: gadda ge Kurdish: spas Afrikaans: dankie, baie dankie Ladin: giulan Albanian: faleminderit, ju falem nderit Lao: (kööp cai) Gheg: falimineres Latin: benignē dīcis (la), tibi gratiās agō (la) Latvian: paldies (lv) Aleut: qaĝaasakuq (Atkan), qaĝaalakux̂ Lithuanian: ačiū (lt) (Eastern) Luo: erokamano Alutiiq: quyanaa Macedonian: благодарам (mk) (blagódaram) American Sign Language: OpenB@Chin- (formal), фала (mk) (fála) (informal) PalmBack OpenB@FromChin-PalmUp Malagasy: misaotra (mg) Amharic: (am) (amesegenallo) Malay: terima kasih (ms) (ar) (mersii) Malayalam: (ml) (nandhi) ﻣﻴﺮﺳﻲ ,(ar) (shúkran) ﺷﻜﺮًﺍ :Arabic (informal) Maltese: grazzi (mt) Armenian: Maori: kia ora (mi) շնորհակալություն
    [Show full text]
  • Detecting Non-Compositional MWE Components Using Wiktionary
    Detecting Non-compositional MWE Components using Wiktionary Bahar Salehi,♣♥ Paul Cook♥ and Timothy Baldwin♣♥ NICTA Victoria Research Laboratory Department♣ of Computing and Information Systems ♥ The University of Melbourne Victoria 3010, Australia [email protected], [email protected], [email protected] Abstract components of a given MWE have a composi- tional usage. Experiments over two widely-used We propose a simple unsupervised ap- datasets show that our approach outperforms state- proach to detecting non-compositional of-the-art methods. components in multiword expressions based on Wiktionary. The approach makes 2 Related Work use of the definitions, synonyms and trans- Previous studies which have considered MWE lations in Wiktionary, and is applicable to compositionality have focused on either the iden- any type of MWE in any language, assum- tification of non-compositional MWE token in- ing the MWE is contained in Wiktionary. stances (Kim and Baldwin, 2007; Fazly et al., Our experiments show that the proposed 2009; Forthergill and Baldwin, 2011; Muzny and approach achieves higher F-score than Zettlemoyer, 2013), or the prediction of the com- state-of-the-art methods. positionality of MWE types (Reddy et al., 2011; 1 Introduction Salehi and Cook, 2013; Salehi et al., 2014). The identification of non-compositional MWE tokens A multiword expression (MWE) is a combina- is an important task when a word combination tion of words with lexical, syntactic or seman- such as kick the bucket or saw logs is ambiguous tic idiosyncrasy (Sag et al., 2002; Baldwin and between a compositional (generally non-MWE) Kim, 2009). An MWE is considered (semanti- and non-compositional MWE usage.
    [Show full text]
  • The Wiki Family of Web Sites
    The Federal Lawyer In Cyberia MICHAEL J. TONSING The Wiki Family of Web Sites he growth in the number of informative Web set of resources. Much of what follows is drawn from the sites seems exponential. It is increasingly hard self-descriptions on the Wiki Web sites themselves. Tto keep up with them. If you’re not using the “Wiki” family of sites, you’re missing some sources that Wiktionary are almost stupefying in their scope. A “wiki” is essen- Wiktionary (en.wikipedia.org/wiki/Wiktionary) is tially a Web site in which contributors can add their a sister site to Wikipedia. Wiktionary is not an online own copy. (I capitalize the word in most instances in encyclopedia but an online dictionary, and its entries this column to make it clear that I am referring to a par- are about words. A Wiktionary entry focuses on mat- ticular family of wikis. There are many other wikis out ters of language and wordsmithery, spelling, pronun- there. You may decide to start your own someday.) ciation, etymology, translation, usage, quotations, and links to related words and concepts. Because Wiktion- Wikipedia ary is not written on paper, it has no size limits, it can Wikipedia (www.wikipedia.org), then, is an online include links, and its information can be more timely encyclopedia in which visitors can read infor- than that found in a written dictionary. mation on the topics they visit, then edit the information on the site itself if they choose Wikisource to do so. (The name “Wikipedia” is a meld- Wikisource (en.wikipedia.org/wiki/Wikisource), which ing of the works “wiki” and “encyclopedia.”) dubs itself “the free library,” collects and stores previ- Out of what might at first have seemed like ously published texts in digital format; contents include online chaos has come semirespectability and novels, nonfiction works, letters, speeches, constitutional order.
    [Show full text]
  • The Wiki-Fication of the Dictionary: Defining Lexicography in the Digital Age
    THE WIKI-FICATION OF THE DICTIONARY: DEFINING LEXICOGRAPHY IN THE DIGITAL AGE Darrell J. Penta College of Professional Studies Northeastern University [email protected] Abstract The future of lexical reference books, such as the 20-volume Oxford English Dictionary (OED), is going to be determined, in part, by the emergence of free on-line dictionaries, such as Wiktionary and Urban Dictionary. Specifically, we are witnessing a paradigmatic shift of authority in which users, rather than editorial boards, are making decisions concerning the content associated with a lexical entry’s definition. In effect, an exclusive privilege formerly enjoyed by professional lexicographers is now being extended unequivocally to laypersons. It is pertinent to ask, therefore, what effect this state of affairs will have on the ways that dictionaries are compiled and used. For some, including Jill Lepore of the New Yorker Magazine, online collaborative lexical references are “Maoist” resources, “cobble[d]…together” by non-experts who “pilfer” definitions (79). This paper rejects such a characterization and seeks, instead, to provide a description more suitable for critical inquiry. By contrasting the entry “bomb” as it appears in the OED, Wiktionary, and Urban Dictionary, and by making use of contemporary linguistic theory, the author posits that: word meanings are highly constrained by popular usage; and, users regularly provide semantically and pragmatically significant, and grammatical accurate, definitions; and, in providing users the flexibility to modify entries in real-time, user-generated dictionaries are uniquely practical as catalogues of the current state of language. It is concluded that, whereas traditional dictionaries may be the better resource for diachronic analyses of words, Wiktionary and the like may prove better for synchronic analyses.
    [Show full text]
  • SLIDE – a Sentiment Lexicon of Common Idioms
    SLIDE – a Sentiment Lexicon of Common Idioms Charles Jochim, Francesca Bonin, Roy Bar-Haim, Noam Slonim IBM Research fcharlesj,[email protected], froybar,[email protected] Abstract Idiomatic expressions are problematic for most sentiment analysis approaches, which rely on words as the basic linguistic unit. Compositional solutions for phrase sentiment are not able to handle idioms correctly because their sentiment is not derived from the sentiment of the individual words. Previous work has explored the importance of idioms for sentiment analysis, but has not addressed the breadth of idiomatic expressions in English. In this paper we present an approach for collecting sentiment annotation of idiomatic multiword expressions using crowdsourcing. We collect 10 annotations for each idiom and the aggregated label is shown to have good agreement with expert annotations. We describe the resulting publicly available lexicon and how it captures sentiment strength and ambiguity. The Sentiment Lexicon of IDiomatic Expressions (SLIDE) is much larger than previous idiom lexicons. The lexicon includes 5,000 frequently occurring idioms, as estimated from a large English corpus. The idioms were selected from Wiktionary, and over 40% of them were labeled as sentiment-bearing. Keywords: Idiom, Lexicon, Sentiment Analysis 1. Introduction of magnitude larger than previous idiom sentiment lexi- cons and focuses specifically on the most frequently used Multiword expressions (MWE) are a key challenge in Natu- idioms.1 In creating this resource, we are somewhat ag- ral Language Processing (Sag et al., 2002). Among MWEs, nostic to the exact definition of idiom. We are more gener- idioms are often defined as non-compositional multiword ally interested in sentiment analysis that can handle MWEs.
    [Show full text]