Mining Wiki Resources for Multilingual Named Entity Recognition

Alexander E. Richman
Department of Defense
Washington, DC 20310
[email protected]

Patrick Schone
Department of Defense
Fort George G. Meade, MD 20755
[email protected]

Proceedings of ACL-08: HLT, pages 1–9, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

Abstract

In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's Identifinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores as high as 84.7% on independent, human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.

1 Introduction

Named Entity Recognition (NER) has long been a major task of natural language processing. Most of the research in the field has been restricted to a few languages and almost all methods require substantial linguistic expertise, whether creating a rule-based technique specific to a language or manually annotating a body of text to be used as a training set for a statistical engine or machine learning.

In this paper, we focus on using the multilingual Wikipedia (wikipedia.org) to automatically create an annotated corpus of text in any given language, with no linguistic expertise required on the part of the user at run-time (and only English knowledge required during development). The expectation is that for any language in which Wikipedia is sufficiently well-developed, a usable set of training data can be obtained with minimal human intervention. As Wikipedia is constantly expanding, it follows that the derived models are continually improved and that increasingly many languages can be usefully modeled by this method.

In order to make sure that the process is as language-independent as possible, we declined to make use of any non-English linguistic resources outside of the Wikimedia domain (specifically, Wikipedia and the English language Wiktionary (en.wiktionary.org)). In particular, we did not use any semantic resources such as WordNet or part of speech taggers. We used our automatically annotated corpus along with an internally modified variant of BBN's IdentiFinder (Bikel et al., 1999), specifically modified to emphasize fast text processing, called "PhoenixIDF," to create several language models that could be tested outside of the Wikipedia framework. We built on top of an existing system, and left existing lists and tables intact. Depending on language, we evaluated our derived models against human or machine annotated data sets to test the system.

2 Wikipedia

2.1 Structure

Wikipedia is a multilingual, collaborative encyclopedia on the Web which is freely available for research purposes. As of October 2007, there were over 2 million articles in English, with versions available in 250 languages. This includes 30 languages with at least 50,000 articles and another 40 with at least 10,000 articles. Each language is available for download (download.wikimedia.org) in a text format suitable for inclusion in a database. For the remainder of this paper, we refer to this format.

Within Wikipedia, we take advantage of five major features:
• Article links, links from one article to another of the same language;
• Category links, links from an article to special "Category" pages;
• Interwiki links, links from an article to a presumably equivalent article in another language;
• Redirect pages, short pages which often provide equivalent names for an entity; and
• Disambiguation pages, pages with little content that link to multiple similarly named articles.
The first three types are collectively referred to as wikilinks.

A typical sentence in the database format looks like the following:

"Nescopeck Creek is a [[tributary]] of the [[North Branch Susquehanna River]] in [[Luzerne County, Pennsylvania|Luzerne County]]."

The double bracket is used to signify wikilinks. In this snippet, there are three article links to English language Wikipedia pages, titled "Tributary," "North Branch Susquehanna River," and "Luzerne County, Pennsylvania." Notice that in the last link, the phrase preceding the vertical bar is the name of the article, while the following phrase is what is actually displayed to a visitor of the webpage.

Near the end of the same article, we find the following representations of Category links: [[Category:Luzerne County, Pennsylvania]], [[Category:Rivers of Pennsylvania]], {{Pennsylvania-geo-stub}}. The first two are direct links to Category pages. The third is a link to a Template, which (among other things) links the article to "Category:Pennsylvania geography stubs". We will typically say that a given entity belongs to those categories to which it is linked in these ways.

The last major type of wikilink is the link between different languages. For example, in the Turkish language article "Kanuni Sultan Süleyman" one finds a set of links including [[en:Suleiman the Magnificent]] and [[ru:Сулейман I]]. These represent links to the English language article "Suleiman the Magnificent" and the Russian language article "Сулейман I." In almost all cases, the articles linked in this manner represent articles on the same subject.

A redirect page is a short entry whose sole purpose is to direct a query to the proper page. There are a few reasons that redirect pages exist, but the primary purpose is exemplified by the fact that "USA" is an entry which redirects the user to the page entitled "United States." That is, in the vast majority of cases, redirect pages provide another name for an entity.

A disambiguation page is a special article which contains little content but typically lists a number of entries which might be what the user was seeking. For instance, the page "Franklin" contains 70 links, including the singer "Aretha Franklin," the town "Franklin, Virginia," the "Franklin River" in Tasmania, and the cartoon character "Franklin (Peanuts)." Most disambiguation pages are in Category:Disambiguation or one of its subcategories.

2.2 Related Studies

Wikipedia has been the subject of a considerable amount of research in recent years including Gabrilovich and Markovitch (2007), Strube and Ponzetto (2006), Milne et al. (2006), Zesch et al. (2007), and Weale (2007). The most relevant to our work are Kazama and Torisawa (2007), Toral and Muñoz (2006), and Cucerzan (2007). More details follow, but it is worth noting that all known prior results are fundamentally monolingual, often developing algorithms that can be adapted to other languages pending availability of the appropriate semantic resource. In this paper, we emphasize the use of links between articles of different languages, specifically between English (the largest and best linked Wikipedia) and other languages.

Toral and Muñoz (2006) used Wikipedia to create lists of named entities. They used the first sentence of Wikipedia articles as likely definitions of the article titles, and used them to attempt to classify the titles as people, locations, organizations, or none. Unlike the method presented in this paper, their algorithm relied on WordNet (or an equivalent resource in another language). The authors noted that their results would need to pass a manual supervision step before being useful for the NER task, and thus did not evaluate their results in the context of a full NER system.

Similarly, Kazama and Torisawa (2007) used Wikipedia, particularly the first sentence of each article, to create lists of entities. Rather than building entity dictionaries associating words and phrases to the classical NER tags (PERSON, LOCATION, etc.) they used a noun phrase following forms of the verb "to be" to derive a label. For example, they used the sentence "Franz Fischler ... is an Austrian politician" to associate the label "politician" to the surface form "Franz Fischler." They proceeded to show that the dictionaries generated by their method are useful when integrated into an NER system. We note that their technique relies upon a part of speech tagger, and thus was not appropriate for inclusion as part of our non-English system.

Cucerzan (2007), by contrast to the above, used Wikipedia primarily for Named Entity Disambiguation, following the path of Bunescu and Paşca (2006). As in this paper, and unlike the above mentioned works, Cucerzan made use of the explicit Category information found within Wikipedia.

... not marked in an existing corpus or not needed for a given purpose, the system can easily be adapted.

Our goal was to automatically annotate the text portion of a large number of non-English articles with tags like <ENAMEX TYPE="GPE">Place Name</ENAMEX> as used in MUC (Message Understanding Conference). In order to do so, our system first identifies words and phrases within the text that might represent entities, primarily through the use of wikilinks. The system then uses category links and/or interwiki links to associate that phrase with an English language phrase or set of Categories. Finally, it determines the appropriate type of the English language data and assumes that the original phrase is of the same type.

In practice, the English language categorization should be treated as one-time work, since it is identical regardless of the language model being ...