
Wiktionary and NLP: Improving synonymy networks

Emmanuel Navarro (IRIT, CNRS & Université de Toulouse)
Franck Sajous (CLLE-ERSS, CNRS & Université de Toulouse)
Bruno Gaume (CLLE-ERSS & IRIT, CNRS & Université de Toulouse)

Laurent Prévot (LPL, CNRS & Université de Provence)
Hsieh ShuKai (English Department, NTNU, Taiwan)
Kuo Tzu-Yi (Graduate Institute of Linguistics, NTU, Taiwan)

Pierre Magistry (TIGP, CLCLP, Academia Sinica & GIL, NTU, Taiwan)
Huang Chu-Ren (Dept. of Chinese and Bilingual Studies, Hong Kong Poly U., Hong Kong)

Abstract

Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. However, it needs to be processed before it can be used efficiently as an NLP resource. After describing the aspects of Wiktionary that are relevant for our purposes, we focus on its structural properties. Then, we describe how we extracted synonymy networks from this resource. We provide an in-depth study of these synonymy networks and compare them to those extracted from traditional resources. Finally, we describe two methods for semi-automatically improving this network by adding missing relations: (i) using a kind of semantic proximity measure; (ii) using relations of Wiktionary itself.

Note: The experiments of this paper are based on Wiktionary dumps downloaded in 2008. Differences may be observed with the current versions available online.

1 Introduction

Reliable and comprehensive lexical resources constitute a crucial prerequisite for various NLP tasks. However, their building cost keeps them rare. In this context, the success of the Princeton WordNet (PWN) (Fellbaum, 1998) can be explained by the quality of the resource but also by the lack of serious competitors. Widening this observation to more languages only makes it more acute. In spite of various initiatives, costs make resource development extremely slow and/or result in non freely accessible resources. Collaborative resources might bring an attractive solution to this difficult situation. Among them, Wiktionary seems to be the perfect resource for building computational mono-lingual and multi-lingual lexica. This paper therefore focuses on Wiktionary, on how to improve it, and on its exploitation for creating resources.

In the next section, we present some relevant information about Wiktionary. Section 3 presents the lexical graphs we are using and the way we build them. We then pay some attention to evaluation (§4) before exploring some tracks of improvement suggested by Wiktionary's structure itself.

2 Wiktionary

As previously said, NLP suffers from a lack of lexical resources, be it due to the low quality or non-existence of such resources, or to copyright-related problems. As an example, consider French resources. Jacquin et al. (2002) highlighted the limitations and inconsistencies of the French EuroWordNet. Later, Sagot and Fišer (2008) explained how they needed to resort to PWN, BalkaNet (Tufis, 2000) and other resources (notably Wikipedia) to build WOLF, a free French WordNet that is promising but still a very preliminary resource. Some languages are straight-off purely under-resourced.

The Web as Corpus initiative arose (Kilgarriff and Grefenstette, 2003) as an attempt to design tools and methodologies to use the web for overcoming data sparseness (Keller and Lapata, 2002). Nevertheless, this initiative raised non-trivial technical problems, described in Baroni et al. (2008). Moreover, the web is not structured enough to easily and massively extract semantic relations.

In this context, Wiktionary could appear to be a paradisiac playground for creating various lexical resources.

We describe below the Wiktionary resource and we explain the restrictions and problems we face when trying to exploit it. This description may complement a few earlier ones, for example Zesch et al. (2008a).

2.1 Collaborative editing

Wiktionary, the lexical companion to Wikipedia, is a collaborative project to produce a free-content multilingual dictionary [1]. Like the other satellite projects of Wikipedia, the resource is not expert-led, but rather filled in by any kind of user. The possible inaccuracy of the resulting resource has been discussed at length and we will not debate it here: see Giles (2005) and Britannica (2006) for an illustration of the controversy. Nevertheless, we think that Wiktionary should be less subject (so far) than Wikipedia to voluntarily misleading content (be it for ideological or commercial reasons, or alike).

2.2 Articles content

As one may expect, a Wiktionary article [2] may (not systematically) give information on a word's part of speech, etymology, definitions, examples, pronunciation, translations, synonyms/antonyms, hypernyms/hyponyms, etc.

2.2.1 Multilingual aspects

Wiktionary's multilingual organisation may be surprising and does not always meet one's expectations or intuitions. Wiktionaries exist in 172 languages, but we can read on the main page that there are "1,248,097 entries with English definitions from over 295 languages". Indeed, a given wiktionary describes the words of its own language but also foreign words. For example, the English article moral includes the word in English (adjective and noun) and in Spanish (adjective and noun), but not in French. As another example, boucher, which does not exist in English, is an article of the English wiktionary, dedicated to the French noun (a butcher) and the French verb (to cork up). A given wiktionary's 'in other languages' left-menu links point to articles in other wiktionaries describing the word in the current language. For example, the Français link in the dictionary article of the English wiktionary points to an article in the French one, describing the English word dictionary.

2.2.2 Layouts

In the following paragraph, we outline wiktionary's general structure. We only consider entries in the wiktionary's own language. An entry consists of a graphical form and a corresponding article that is divided into the following, possibly embedded, sections:

• etymology sections separate homonyms when relevant;
• within an etymology section, different parts of speech may occur;
• definitions and examples belong to a part of speech section and may be subdivided into subsenses;
• translations, synonyms/antonyms and hypernyms/hyponyms are linked to a given part of speech, with or without subsense distinctions.

An example of an article's layout is depicted in figure 1.

[Figure 1: Layout of boot article (shortened)]

Regarding subsenses, they are identified with an index when first introduced, but they may appear as a plain-text semantic feature (without index) when used in relations (translations, synonyms, etc.). It is therefore impossible to associate the relation arguments with subsenses. Secondly, the subsense index appears only in the current word (the source of the relation) and not in the target word's article it is linked to (see orange, French N. and Adj., Jan. 10, 2008 [3]). A simplified illustration of what extracting synonyms from such a layout involves is given below.

[1] http://en.wiktionary.org/
[2] What article refers to is fuzzier than the classical notions of entry or acceptation.
[3] http://fr.wiktionary.org/w/index.php?title=orange&oldid=2981313
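To make the extraction problem concrete, the sketch below pulls the synonyms of each part of speech out of a simplified wikitext article. The sample wikitext, the section markers it expects and the extract_synonyms helper are illustrative assumptions of ours, not the parser actually used for the experiments; real articles involve templates, subsense markers and language sections that require far more machinery.

    import re

    # A simplified (hypothetical) excerpt of an English Wiktionary article,
    # reduced to the parts relevant for synonymy extraction.
    SAMPLE_WIKITEXT = """
    ==English==
    ===Noun===
    # Footwear covering the foot and ankle.
    ====Synonyms====
    * [[buskin]]
    * [[kick]]
    ===Verb===
    # To kick.
    ====Synonyms====
    * [[kick]]
    """

    def extract_synonyms(wikitext):
        """Return {part_of_speech: [synonyms]} from a simplified article."""
        synonyms = {}
        pos, in_synonyms = None, False
        for line in wikitext.splitlines():
            stripped = line.strip()
            if re.fullmatch(r"===[^=]+===", stripped):
                # part of speech header, e.g. ===Noun===
                pos, in_synonyms = stripped.strip("="), False
            elif re.fullmatch(r"====[^=]+====", stripped):
                # sub-section header, e.g. ====Synonyms====
                in_synonyms = stripped.strip("=").lower() == "synonyms"
            elif in_synonyms and stripped.startswith("*") and pos is not None:
                # bullet lines list [[wiki links]] to the synonym targets
                for target in re.findall(r"\[\[([^\]|#]+)", stripped):
                    synonyms.setdefault(pos, []).append(target.strip())
        return synonyms

    print(extract_synonyms(SAMPLE_WIKITEXT))
    # {'Noun': ['buskin', 'kick'], 'Verb': ['kick']}

Even this toy version shows why such a parser is brittle: any deviation from the expected headers or link syntax silently loses relations.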

A more serious issue appears when relations are shared by several part of speech sections. In Italian, for instance, both the synonyms and the translations parts are common to all of a word's categories (see for example cardinale, N. and Adj., Apr. 26, 2009 [4]).

2.3 Technical issues

As for Wikipedia and the other Wikimedia Foundation projects, Wiktionary's content management system relies on the MediaWiki software and on wikitext. As stated in Wikipedia's MetaWiki article, "no formal syntax has been defined" for MediaWiki, and consequently it is not possible to write a 100% reliable parser. Unlike Wikipedia, no HTML dump is available and one has to parse the wikicode. Wikicode is difficult to handle since templates require handwritten rules that need to be regularly updated. Another difficulty is the language-specific encoding of the information. To mention just one example, the target language of a translation link is identified by a 2- or 3-letter ISO-639 code for most languages; in the Polish wiktionary, however, the complete name of the language (angielski, francuski, ...) is used, as illustrated by the sketch below.

[4] http://it.wiktionary.org/w/index.php?title=cardinale&oldid=758205
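A minimal sketch of the kind of normalisation this forces on a multi-language extractor; the mapping table and the normalise_language helper are hypothetical examples, not the rules used in our pipeline.

    # Hypothetical normalisation table: Polish language names, as used in the
    # Polish Wiktionary's translation sections, mapped to ISO-639 codes.
    POLISH_NAME_TO_ISO = {
        "angielski": "en",   # English
        "francuski": "fr",   # French
        "niemiecki": "de",   # German
    }

    def normalise_language(tag):
        """Map a raw language tag (ISO code or full Polish name) to an ISO code."""
        tag = tag.strip().lower()
        if len(tag) in (2, 3) and tag.isalpha():
            return tag                      # already an ISO-639 code
        return POLISH_NAME_TO_ISO.get(tag)  # None if the name is unknown

    print(normalise_language("en"), normalise_language("angielski"))
    # en en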
2.4 Parsing and modeling

The (non-exhaustive) list of difficulties mentioned above (see §2.2.2 and §2.3) leads to the following consequences:

• Writing a parser for a given wiktionary is possible only after an in-depth observation of its source. Even intensive work will not prevent all errors as long as (i) no syntax checking is made when editing an article and (ii) flexibility with respect to the "tacitly agreed" layout conventions is preserved. Indeed, flexibility is presented as a characteristic of the framework: "[...] it is not a set of rigid rules. You may experiment with deviations, but other editors may find those deviations unacceptable, and revert those changes. They have just as much right to do that as you have to make them." [5] Moreover, a parser has to be updated for every new dump, as templates, layout conventions (and so on) may change.

• Writing parsers for different languages is not a simple adjustment but rather a complete overhaul.

• When extracting a network of semantic relations from a given wiktionary, some choices are driven more by the wiktionary's inner format than by scientific modelling choices. An illustration follows in §3.2. When merging information extracted from several languages, the homogenisation of the data structure often leads to the choice of the poorest one, resulting in a loss of information.

2.5 The bigger the better?

Taking advantage of colleagues mastering various languages, we studied the wiktionaries of the following languages: French, English, German, Polish and Mandarin Chinese. A first remark concerns the size of the resource. The official number of declared articles in a given wiktionary includes a great number of meta-articles which are not word entries. As of April 2009, the French wiktionary ranks first [6], before the English one. This can be explained by the automated import of articles from public-domain dictionaries (Littré 1863 and Dictionnaire de l'Académie Française 1932-1935). Table 1 shows the ratio between the total number of articles and the "relevant" ones (numbers based on year 2008 snapshots).

        Total     Meta*    Other**   Relevant      %
  fr    728,266   25,244   369,948   337,074     46
  en    905,963   46,202   667,430   192,331     21
  de     88,912    7,235    49,672    32,005     36
  pl    110,369    4,975    95,241    10,153      9
  zh    131,752    8,195   112,520     1,037    0.7

  * template definitions, help pages, user talks, etc.
  ** other languages, redirection links, etc.

Table 1: Ratio of "relevant" articles in wiktionaries

By "relevant", we mean an article about a word in the wiktionary's own language (e.g. not an article about a French word in the English Wiktionary). Among the "relevant" articles, some are empty and some do not contain any translation or synonym link. Therefore, before deciding to use Wiktionary, it is necessary to compare the amount of information that can be extracted with the amount of work required to obtain it.

[5] http://en.wiktionary.org/wiki/WT:ELE
[6] http://meta.wikimedia.org/wiki/List_of_Wiktionaries

3 Study of synonymy networks

In this section, we study synonymy networks built from different resources. First, we introduce some general properties of lexical networks (§3.1). Then we explain how we build Wiktionary's synonymy network and how we analyse its properties. In §3.3, we show how we build similar graphs from traditional resources for evaluation purposes.

3.1 Structure of lexical networks

In the following sections, a graph G = (V, E) is defined by a set V of n vertices and a set E ⊂ V² of m edges. In this paper, V is a set of words and E is defined by a relation →R ↦ E_R: (w1, w2) ∈ E_R if and only if w1 →R w2.

Most lexical networks, like many networks extracted from real-world data, are small world (SW) networks. Comparing the structural characteristics of wiktionary-based lexical networks to a standard resource should be done according to the well-known properties of SW networks (Watts and Strogatz, 1998; Barabasi et al., 2000; Newman, 2003; Gaume et al., 2008). These properties are:

• Edge sparsity: SW networks are sparse in edges: m = O(n) or m = O(n log(n)).
• Short paths: in SW networks, the average path length (L) [7] is short. Generally there is at least one short path between any two nodes.
• High clustering: in SW networks, the clustering coefficient (C), which expresses the probability that two distinct nodes adjacent to a given third one are adjacent, is an order of magnitude higher than for Erdos-Renyi (random) graphs: C_SW >> C_random; this indicates that the graph is locally dense, although it is globally sparse.
• Heavy-tailed degree distribution: the distribution of the vertices' incidence degrees follows a power law in a SW graph. The probability P(k) that a given node has k neighbours decreases as a power law, P(k) ≈ k^a (a being a constant characteristic of the graph). Random graphs conform to a Poisson law.

3.2 Wiktionary's network

Graph extraction. Considering what was said in §2.2.2 and §2.4, we made the following choices [8]:

• Vertices: a vertex is built for each entry's part of speech.
• Parts of speech: when modeling the links from X (X having part of speech PosX) to one of its synonyms Y, we assume that PosY = PosX, thus building the vertex PosY.Y.
• Subsenses: subsenses are flattened. First, the subsenses are not always mentioned in the synonyms section. Second, if we take the subsenses into account, they only appear on the source side of the relation. For example, considering in figure 1 the relation boot -syn-> kick (both nouns), and given the 10 subsenses of boot and the 5 of kick, we should build 15 vertices, and we should then add all the links between the mentioned subsenses of boot and the 5 existing subsenses of kick. This would lead to a high number of edges, but the graph would not be closer to reality. The way subsenses appear in Wiktionary is unpredictable: "subsenses" sometimes correspond to homonyms or clear-cut senses of polysemous words, but they can also correspond to facets, word usages or regular polysemy. Moreover, some entries have no subsense distinction where it would be worthwhile. More globally, the relevance of discrete word senses has been seriously questioned; see (Victorri and Fuchs, 1996) or (Kilgarriff, 1997) for very convincing discussions. Two more practical reasons led us to this choice: we want our method to be reproducible for other languages, and some wiktionaries do not include subsenses. Finally, some gold standard resources (e.g. Dicosyn) have their subsenses flattened too, and we want to compare the resources against each other.
• Edges: wiktionary's synonymy links are oriented, but we made the graph symmetric. For example, boot does not appear among kick's synonyms. Some words even appear as synonyms without being an entry of Wiktionary.

From the boot example (figure 1), we extract the vertices {N.boot, V.boot}, build {N.buskin, N.kick, V.kick} and add the following (symmetrized) edges: N.boot <-> N.buskin, N.boot <-> N.kick and V.boot <-> V.kick (a sketch of this construction and of the structural indicators reported in table 2 is given below).

Graph properties. By observing table 2, we can see that the graphs of synonyms extracted from Wiktionary are all typical small worlds. Indeed their l_lcc remains short, their C_lcc is always greater than or equal to 0.2, and the distribution curves of the vertices' incidence degrees are very close to a power law (a least-square method always gives an exponent a_lcc ≈ -2.35 with a confidence r²_lcc always greater than 0.89). It can also be seen that the average incidence k_lcc ranges from 2.32 to 3.32 [9]. It means that no matter which language or part of speech, m = O(n), as for most SW graphs (Newman, 2003; Gaume et al., 2008).

  graph   n      m      n_lcc  m_lcc  k_lcc  l_lcc  C_lcc  a_lcc  r²_lcc
  fr-N    18017  9650   3945   4690   2.38   10.18  0.2    -2.03  0.89
  fr-A    5411   2516   1160   1499   2.58   8.86   0.23   -2.04  0.95
  fr-V    3897   1792   886    1104   2.49   9.84   0.21   -1.65  0.91
  en-N    22075  11545  3863   4817   2.49   9.7    0.24   -2.31  0.95
  en-A    8437   4178   2486   3276   2.64   8.26   0.2    -2.35  0.95
  en-V    6368   3274   2093   2665   2.55   8.33   0.2    -2.01  0.93
  de-N    32824  26622  12955  18521  2.86   7.99   0.28   -2.16  0.93
  de-A    5856   6591   3690   5911   3.2    6.78   0.24   -1.93  0.9
  de-V    5469   7838   4574   7594   3.32   5.75   0.23   -1.92  0.9
  pl-N    8941   4333   2575   3143   2.44   9.85   0.24   -2.31  0.95
  pl-A    1449   731    449    523    2.33   7.79   0.21   -1.71  0.94
  pl-V    1315   848    601    698    2.32   5.34   0.2    -1.61  0.92

  n: number of vertices; m: number of edges; k: avg. number of neighbours per vertex; l: avg. path length between vertices; C: clustering rate; a: power law exponent with r² confidence; _lcc: computed on the largest connected component.

Table 2: Wiktionary synonymy graphs properties

[7] Average length of the shortest path between any two nodes.
[8] These choices can clearly be discussed from a linguistic point of view and judged to be biased. Nevertheless, we adopted them as a first approximation to make the modelling possible.
[9] It is noteworthy that the mean incidence of vertices is almost always the same (close to 2.8) no matter what the graph size is. If we assume that all wiktionary graphs grow in a similar way but at different speed rates (after all, it is the same framework), graphs (at least their statistical properties) from different languages can be seen as snapshots of the same graph at different times. This would mean that the number of graph edges tends to grow proportionally with the number of vertices. This fits with the dynamic properties of small worlds (Steyvers and Tenenbaum, 2005). It means that for a wiktionary system, even with many contributions, graph density is likely to remain constant, and we will see that in comparison to traditional lexical resources this density is quite low. We would like to study the evolution of wiktionaries through time, but this is outside the scope of this paper.
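As an illustration of the extraction choices and of the indicators reported in table 2, the following sketch builds the tiny symmetrized graph of the boot example and computes n, m and, on the largest connected component, the average degree, average path length and clustering coefficient. It is a minimal sketch relying on the networkx library (our choice here, not necessarily the tooling used for the experiments); the power-law fit of the degree distribution is omitted.

    import networkx as nx

    # Symmetrized synonymy graph of the boot example (figure 1):
    # one vertex per (part of speech, word) pair, undirected edges.
    g = nx.Graph()
    g.add_edges_from([
        ("N.boot", "N.buskin"),
        ("N.boot", "N.kick"),
        ("V.boot", "V.kick"),
    ])

    n, m = g.number_of_nodes(), g.number_of_edges()

    # The indicators of table 2 are computed on the largest connected component (lcc).
    lcc_nodes = max(nx.connected_components(g), key=len)
    lcc = g.subgraph(lcc_nodes)

    k_lcc = 2.0 * lcc.number_of_edges() / lcc.number_of_nodes()  # average degree
    l_lcc = nx.average_shortest_path_length(lcc)                 # average path length
    c_lcc = nx.average_clustering(lcc)                           # clustering coefficient

    print(n, m, k_lcc, l_lcc, c_lcc)

On such a toy graph the values are of course not meaningful; the figures of table 2 were obtained on the full graphs extracted from the 2008 dumps.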

3.3 Building synonymy networks from known standards

WordNet. There are many possible ways of building lexical networks from PWN. We tried several methods, but only two of them are worth mentioning here. The graphs we built have words as vertices, not synsets or senses. A first straightforward method (method A) consists in adding an edge between two vertices only if the corresponding words appear as elements of the same synset. This method produced many disconnected graphs of various sizes. Both the computational method we planned to use and our intuitions about such graphs were pointing towards a bigger graph that would cover most of the lexical network.

We therefore decided to exploit the hypernymy relation. Traditional dictionaries indeed propose hypernyms when one looks for synonyms of very specific terms, making hypernymy the closest relation to synonymy, at least from a lexicographic viewpoint. However, adding all the hypernymy relations resulted in a network extremely dense in edges, with some vertices having a very high number of neighbours. This was due to the tree-like organisation of WordNet, which gives a very special importance to the higher nodes of the tree. In the end we retained method B, which consists in adding edges in the following cases:

• if two words belong to the same synset;
• if a word only appears in a synset that is a leaf of the tree and contains only this word, then edges are created linking it to the words included in the hypernym synset(s).

Therefore, when a vertex w does not get any neighbour according to method A, method B adds edges linking w to the words included in the hypernym synset(s) of the synset {w}. We only added hypernyms for the leaves of the tree in order to keep our relations close to the synonymy idea. This idea has already been exploited for some WordNet-based semantic distance calculations taking into account the depth of the relation in the tree (Leacock and Chodorow, 1998). A sketch of both methods is given below.

Dicosyn graphs. Dicosyn is a compilation of synonym relations extracted from seven dictionaries (Bailly, Benac, Du Chazaud, Guizot, Lafaye, Larousse and Robert) [10]: there is an edge r -> s if and only if r and s have the same syntactic category and at least one dictionary proposes s as a synonym in the dictionary entry r. Each of the three graphs obtained (Nouns, Verbs, Adjectives) is then made symmetric (dicosyn-fr-N, dicosyn-fr-V and dicosyn-fr-A).

Properties of the extracted graphs. Table 3 sums up the structural properties of the synonymy networks built from the standard resources. We can see that all the synonymy graphs extracted from PWN or Dicosyn are SW graphs. Indeed their l_lcc remains short, their C_lcc is always greater than or equal to 0.35, and the distribution curves of the vertices' incidence degrees are very close to a power law (a least-square method always gives an exponent a_lcc near -2.30 with a confidence r²_lcc always greater than 0.85). It can also be observed that, no matter the part of speech, the average incidence of the Dicosyn-based graphs is always higher than that of the WordNet-based ones.

  graph         n       m       n_lcc  m_lcc  k_lcc  l_lcc  C_lcc  a_lcc  r²_lcc
  pwn-en-N-A    117798  104929  12617  28608  4.53   9.89   0.76   -2.62  0.89
  pwn-en-N-B    117798  168704  40359  95439  4.73   7.79   0.72   -2.41  0.91
  pwn-en-A-A    21479   22164   4406   11276  5.12   9.08   0.75   -2.32  0.85
  pwn-en-A-B    21479   46614   15945  43925  5.51   6.23   0.78   -2.09  0.9
  pwn-en-V-A    11529   23019   6534   20806  6.37   5.93   0.7    -2.34  0.87
  pwn-en-V-B    11529   40919   9674   39459  8.16   4.66   0.64   -2.06  0.91
  dicosyn-fr-N  29372   100759  26143  98627  7.55   5.37   0.35   -2.17  0.92
  dicosyn-fr-A  9452    42403   8451   41753  9.88   4.7    0.37   -1.92  0.92
  dicosyn-fr-V  9147    51423   8993   51333  11.42  4.2    0.41   -1.88  0.91

Table 3: Gold standards' synonymy graphs properties

[10] Dicosyn was first produced at ATILF, before being corrected at the CRISCO laboratory (http://elsap1.unicaen.fr/dicosyn.html).
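The following sketch illustrates the two PWN-based constructions described above, using NLTK's WordNet interface. It is an approximation of methods A and B under our reading of the description (a synset is treated as a leaf when it has no hyponyms, and the additional condition that the word appears in no other synset is omitted for brevity); it is not the exact code used to build the pwn-en-* graphs.

    import networkx as nx
    from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

    def pwn_synonymy_graph(pos="v", method="B"):
        """Build an undirected word graph from PWN synsets (methods A/B above)."""
        g = nx.Graph()
        for synset in wn.all_synsets(pos):
            lemmas = synset.lemma_names()
            g.add_nodes_from(lemmas)
            # Method A: link words that co-occur in the same synset.
            g.add_edges_from((u, v) for i, u in enumerate(lemmas)
                             for v in lemmas[i + 1:])
            # Method B: for a single-word synset that is a leaf of the
            # hypernymy tree, also link the word to the words of its
            # hypernym synset(s).
            if method == "B" and len(lemmas) == 1 and not synset.hyponyms():
                for hyper in synset.hypernyms():
                    g.add_edges_from((lemmas[0], h) for h in hyper.lemma_names())
        return g

    g_a = pwn_synonymy_graph(method="A")
    g_b = pwn_synonymy_graph(method="B")
    print(g_a.number_of_nodes(), g_a.number_of_edges())
    print(g_b.number_of_nodes(), g_b.number_of_edges())

Since method B only adds edges on top of method A, every method-A edge is also a method-B edge, which is the property used in the evaluation of §4.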

4 Wiktionary graphs evaluation

Coverage and global SW analysis. By comparing tables 2 and 3, one can observe that:

• The lexical coverage of the Wiktionary-based synonymy graphs is always quantitatively lower than that of the standard resources, although this may change. For example, to horn (in PWN), absent from Wiktionary in 2008, appeared in 2009. Conversely, Wiktionary is more inclined to include some classes of words such as to poo (childish) or to prefetch and to google (technical neologisms).

• The average number of synonyms for an entry of a Wiktionary-based resource is smaller than in the standard resources. For example, common synonyms such as to act/to play appear in PWN and not in Wiktionary. Nevertheless, some others (rightly) appear in Wiktionary: to reduce/to decrease, to cook/to microwave.

• The clustering rate of the Wiktionary-based graphs is always smaller than that of the standard resources. This is particularly the case for English. However, this specificity might be due to differences between the resources themselves (Dicosyn vs. PWN) rather than to structural differences at the linguistic level.

Evaluation of synonymy. In order to evaluate the quality of the synonymy graphs extracted from Wiktionary, we use recall and precision measures. The objects we compare are not simple sets but graphs (G = (V, E)), thus we compare separately the set of vertices (V) and the set of edges (E). Vertices are words and edges are synonymy links. The vertices evaluation measures the resource coverage, whereas the edges evaluation measures the quality of the synonymy links in the Wiktionary resource (a sketch of this computation is given below, after table 5).

First of all, the global picture (table 4) clearly shows that the lexical coverage is rather poor. A lot of words included in the standard resources are not yet included in the corresponding wiktionary resources. Overall, the lexical coverage is always lower than 50%. This has to be kept in mind when looking at the evaluation of relations shown in table 5. To compute the relations evaluation, each resource has first been restricted to the links between words present in both resources. Regarding PWN, since every link added with method A is also added with method B, the precision of the Wiktionary-based graphs' synonymy links will always be lower against the "method A graphs" than against the "method B graphs". Precision is rather good while recall is very low. That means that a lot of synonymy links of the standard resources are missing from Wiktionary. As for Dicosyn, the picture is similar, with even better precision but very low recall.

(a) English Wiktionary vs. WordNet
            Precision                Recall
  Nouns     14120/22075 = 0.64       14120/117798 = 0.12
  Adj.      5874/8437   = 0.70       5874/21479   = 0.27
  Verbs     5157/6368   = 0.81       5157/11529   = 0.45

(b) French Wiktionary vs. Dicosyn
  Nouns     10393/18017 = 0.58       10393/29372  = 0.35
  Adj.      3076/5411   = 0.57       3076/9452    = 0.33
  Verbs     2966/3897   = 0.76       2966/9147    = 0.32

Table 4: Wiktionary coverage

(a) English Wiktionary vs. WordNet
              Precision              Recall
  Nouns (A)   2503/6453 = 0.39       2503/11021 = 0.23
  Nouns (B)   2763/6453 = 0.43       2763/18440 = 0.15
  Adj. (A)    786/3139  = 0.25       786/5712   = 0.14
  Adj. (B)    1314/3139 = 0.42       1314/12792 = 0.10
  Verbs (A)   866/2667  = 0.32       866/10332  = 0.08
  Verbs (B)   993/2667  = 0.37       993/18725  = 0.05

(b) French Wiktionary vs. Dicosyn
  Nouns       3510/5075 = 0.69       3510/44501 = 0.08
  Adj.        1300/1677 = 0.78       1300/17404 = 0.07
  Verbs       899/1267  = 0.71       899/23968  = 0.04

Table 5: Wiktionary synonymy links precision & recall
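To make the edge evaluation of table 5 explicit, the sketch below computes precision and recall of a synonymy graph against a gold standard after restricting both graphs to their common vocabulary. The two toy graphs are invented, and this is our reconstruction of the procedure described above rather than the original evaluation script.

    import networkx as nx

    def edge_precision_recall(candidate, gold):
        """Precision/recall of candidate edges w.r.t. gold edges, restricted to
        the words present in both graphs (as done for table 5)."""
        common = set(candidate.nodes()) & set(gold.nodes())
        cand_edges = {frozenset(e) for e in candidate.subgraph(common).edges()}
        gold_edges = {frozenset(e) for e in gold.subgraph(common).edges()}
        hits = len(cand_edges & gold_edges)
        precision = hits / len(cand_edges) if cand_edges else 0.0
        recall = hits / len(gold_edges) if gold_edges else 0.0
        return precision, recall

    # Toy example: a small Wiktionary-like graph against a small gold standard.
    wikt = nx.Graph([("boot", "kick"), ("boot", "buskin")])
    gold = nx.Graph([("boot", "kick"), ("kick", "punt"), ("boot", "buskin")])
    print(edge_precision_recall(wikt, gold))
    # (1.0, 1.0): "punt" is absent from the candidate graph, so the edge
    # kick-punt is discarded by the vocabulary restriction.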

5 Exploiting Wiktionary for improving Wiktionary

As seen in section 4, the Wiktionary-based resources are very incomplete with regard to synonymy. We propose two tasks for adding some of these missing links:

Task 1: adding synonyms to Wiktionary by taking into account its small world characteristics for proposing new synonyms.

Task 2: adding synonyms to Wiktionary by taking into account the translation relations.

We evaluate these two tasks against the benchmarks presented in section 3.2.

5.1 Improving synonymy in Wiktionary by exploiting its small world structure

We propose here to enrich the synonymy links of Wiktionary by taking into account the fact that lexical networks have a high clustering coefficient. Our hypothesis is that the missing links in Wiktionary should lie within clusters. A high clustering coefficient means that two words which are connected to a third one are likely to be connected together: in other words, the neighbours of my neighbours should also be in my neighbourhood. We propose to reverse this property into the following hypothesis: "a neighbour of my neighbours which is not in my neighbourhood should be a good neighbour candidate". Thus the first method we test consists simply in connecting every vertex to the neighbours of its neighbours. One can repeat this operation until the expected number of edges is obtained [11].

Secondly, we used the Prox approach proposed by (Gaume et al., 2009). It is a stochastic method designed for studying "hierarchical small worlds". Briefly put, for a given vertex u, one computes for every other vertex v the probability that a randomly wandering particle starting from u stands in v after a fixed number of steps. Let P(u, v) be this value. We propose to connect u to the k first vertices ranked in descending order of P(u, v). We always choose k proportionally to the original degree of u (its number of neighbours). For a small number of steps (3 in our case), random walks tend to be trapped in local cluster structures, so a vertex v with a high P(u, v) is likely to belong to the same cluster as u, which means that a link u <-> v might be relevant. A sketch of both enrichment methods is given at the end of this subsection.

[Figure 2: Precision, recall and F-score of French verbs graph enlarged using only existing synonymy links. Three panels (P, R, F) plotted against the number of edges (up to 14,000), with curves for prox3, neigh and random.]

Figure 2 shows the evolution of precision, recall and F-score for the French verbs graph when edges are added using the "neighbourhood" method (neigh) and using the "Prox" method. The dashed line corresponds to the value theoretically obtained by choosing edges at random. First, both methods are clearly more efficient than a random addition, which is not surprising but seems to confirm our hypothesis that the missing edges are within clusters. Adding the neighbours of neighbours outright seems to be as good as adding edges ranked by Prox; the rank provided by Prox, however, permits adding a given number of edges. This ranking can also be useful for ordering potential links if one thinks of a user validation system. Synonyms added by Prox and absent from the gold standards are not necessarily false. For example, Prox proposes a relevant link absolve/forgive, not included in PWN. Moreover, many false positives are still interesting to consider for improving the resource. For example, Prox adds relations such as hypernyms (to uncover/to peel) or inter-domain 'synonyms' (to skin/to peel). This is due to the high clustering (see §3.1) and to the fact that clusters in synonymy networks correlate with language concepts (Gaume et al., 2008; Duvignau and Gaume, 2008; Gaume et al., 2009; Fellbaum, 1999). Finally, note that the results are similar for the other parts of speech and the other languages.

[11] We repeat it only two times, otherwise the number of added edges is too large.
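The sketch below illustrates both enrichment heuristics on a networkx graph: the "neighbours of neighbours" expansion, and a simple Prox-like ranking obtained from 3-step random-walk probabilities computed with the row-normalised adjacency matrix. It is our simplified reading of the method of (Gaume et al., 2009), not their implementation; the proportionality factor for k and the toy graph are illustrative.

    import networkx as nx
    import numpy as np

    def neighbours_of_neighbours(g):
        """One expansion step: connect every vertex to the neighbours of its neighbours."""
        enlarged = g.copy()
        for u in g:
            for v in g.neighbors(u):
                for w in g.neighbors(v):
                    if w != u:
                        enlarged.add_edge(u, w)
        return enlarged

    def prox_candidates(g, steps=3, ratio=1.0):
        """Rank non-neighbours of each vertex u by the probability P(u, v) that a
        'steps'-step random walk starting from u ends in v; keep the top
        k = ratio * degree(u) of them as candidate synonymy links.
        Assumes the graph has no isolated vertices."""
        nodes = list(g)
        index = {u: i for i, u in enumerate(nodes)}
        a = nx.to_numpy_array(g, nodelist=nodes)
        p = a / a.sum(axis=1, keepdims=True)        # row-normalised transition matrix
        p_steps = np.linalg.matrix_power(p, steps)  # P(u, v) after 'steps' steps
        proposals = []
        for u in nodes:
            i = index[u]
            candidates = [(p_steps[i, index[v]], v) for v in nodes
                          if v != u and not g.has_edge(u, v)]
            candidates.sort(reverse=True)
            k = max(1, int(ratio * g.degree(u)))
            proposals.extend((u, v, score) for score, v in candidates[:k])
        return proposals

    g = nx.Graph([("boot", "kick"), ("kick", "punt"), ("boot", "buskin")])
    print(neighbours_of_neighbours(g).edges())
    print(prox_candidates(g)[:3])

On this toy graph, both heuristics propose, for instance, the missing edge buskin-kick, since buskin and kick share the neighbour boot.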
5.2 Using Wiktionary's translation links to improve its synonymy network

Assuming that two words sharing many translations in different languages are likely to be synonymous, we propose to use Wiktionary's translation links to enhance the synonymy network of a given language. In order to rank the links to be potentially added, we use a simple Jaccard measure: let Tw be the set of a word w's translations; then for every couple of words (w, w') we have:

    Jaccard(w, w') = |Tw ∩ Tw'| / |Tw ∪ Tw'|

We compute this measure for every possible pair of words and then, starting from Wiktionary's synonymy graph, we incrementally add links according to their Jaccard rank, as sketched below.
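A minimal sketch of this ranking, assuming the translation sets have already been extracted into a dictionary mapping each word to its set of translations; the sample data for three French verbs are invented.

    def jaccard(a, b):
        """Jaccard similarity between two sets of translations."""
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical translation sets (word -> set of language-tagged translations).
    translations = {
        "bondir": {"en:jump", "en:leap", "de:springen"},
        "sauter": {"en:jump", "en:leap", "en:skip", "de:springen"},
        "manger": {"en:eat", "de:essen"},
    }

    # Rank every unordered pair by its Jaccard score; links are then added to the
    # synonymy graph in decreasing order of this score.
    words = sorted(translations)
    pairs = [(jaccard(translations[w1], translations[w2]), w1, w2)
             for i, w1 in enumerate(words) for w2 in words[i + 1:]]
    for score, w1, w2 in sorted(pairs, reverse=True):
        print(f"{w1} - {w2}: {score:.2f}")
    # bondir - sauter: 0.75, the other pairs score 0.00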

We notice first that most of the synonymy links added by this method were not initially included in Wiktionary's synonymy network. For example, regarding French verbs, 95% of the 2000 best ranked proposed links are new. Hence this method may be efficient for improving graph density. However, one can wonder about the quality of the newly added links, so we discuss precision in the next paragraph.

Figure 3 depicts the evolution of precision, recall and F-score for French verbs in the enlarged graph with respect to the total number of edges. We use the Dicosyn graph as a gold standard. The dashed line corresponds to the theoretical scores one can expect by adding randomly chosen links. First, we notice that both precision and recall are significantly higher than what we can expect from a random addition. This confirms that words sharing the same translations are good synonym candidates. Added links seem to be particularly relevant at the beginning, for higher Jaccard scores. From the first dot to the second one, we add about 1000 edges (whereas the original graph contains 1792 edges) and the precision only decreases from 0.71 to 0.69.

[Figure 3: Precision, recall and F-score of French verbs graph enlarged using translation links. Three panels (P, R, F) plotted against the number of edges (up to 12,000), with a dashed curve for the random baseline.]

The methods we proposed in this section are quite simple and there is room for improvement. First, both methods can be combined in order to improve the resource using the translation links and then the cluster structure. One can also think of the corollary task that would consist in adding translation links between two languages using the synonymy links of other languages.

6 Conclusion and future work

This paper gave us the opportunity to share some experience related to building lexical resources from Wiktionary. In addition, we presented two approaches for improving these resources, together with their evaluation. The first approach relies on the small world structure of synonymy networks: we postulated that many missing links in Wiktionary should be added among members of the same cluster. The second approach assumes that two words sharing many translations in different languages are likely to be synonymous. The comparison with traditional resources shows that our hypotheses are confirmed. We now plan to combine both approaches.

The work presented in this paper combines an NLP contribution, involving data extraction and rough processing of the data, and a mathematical contribution concerning the graph-like resource. In our view, the second aspect of our work is therefore complementary to other NLP contributions, like (Zesch et al., 2008b), involving more sophisticated NLP processing of the resource.

Support for collaborative editing. Our results should be useful for setting up a more efficient framework for Wiktionary collaborative editing. We should be able to always propose a set of synonymy relations that are likely to be relevant. For example, when a contributor creates or edits an article, he may think about adding very few links but might not bother providing an exhaustive list of synonyms. Our tool can propose a list of potential synonyms, ordered by relevancy. Each item of this list would only need to be validated (or not).

Diachronic study. An interesting topic for future work is a "diachronic" study of the resource. It is possible to access Wiktionary at several stages, and this can be used for studying how such resources evolve. Grounded on this kind of study, one may predict the evolution of newer wiktionaries and foresee contributors' NLP needs. We would like to set up a framework for everyone to test out new methodologies for enriching and using Wiktionary resources. Such an observatory would allow us to follow not only the evolution of Wiktionary but also of Wiktionary-grounded resources, which will only improve thanks to steady collaborative development.

Invariants and variability. Wiktionary, as a massively multilingual synonymy network, is an extremely promising resource for studying the (in)variability of semantic pairings such as house/family, child/fruit, feel/know... (Sweetser, 1991; Gaume et al., 2009). A systematic study of Wiktionary data within the semantic approximation framework presented in this paper will be carried out in the future.

References

A.-L. Barabasi, R. Albert, H. Jeong, and G. Bianconi. 2000. Power-law distribution of the World Wide Web. Science, 287 (in Technical Comments).

M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. 2008. Cleaneval: a competition for cleaning web pages. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Marrakech.

Encyclopaedia Britannica. 2006. Fatally flawed: refuting the recent study on encyclopedic accuracy by the journal Nature.

K. Duvignau and B. Gaume. 2008. Between words and world: verbal "metaphor" as semantic or pragmatic approximation? In Proceedings of the International Conference "Language, Communication and Cognition", Brighton.

C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

C. Fellbaum. 1999. La représentation des verbes dans le réseau sémantique WordNet. Langages, 33(136):27–40.

B. Gaume, K. Duvignau, L. Prévot, and Y. Desalle. 2008. Toward a cognitive organization for electronic dictionaries, the case for semantic proxemy. In Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008), pages 86–93, Manchester.

B. Gaume, K. Duvignau, and M. Vanhove. 2009. Semantic associations and confluences in paradigmatic networks. In M. Vanhove, editor, From Polysemy to Semantic Change: Towards a Typology of Lexical Semantic Associations, pages 233–264. John Benjamins Publishing.

J. Giles. 2005. Internet encyclopaedias go head to head. Nature, 438:900–901.

C. Jacquin, E. Desmontils, and L. Monceaux. 2002. French EuroWordNet lexical database improvements. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City.

F. Keller and M. Lapata. 2002. Using the web to overcome data sparseness. In Proceedings of EMNLP-02, pages 230–237.

A. Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics, 29:333–347.

A. Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

C. Leacock and M. Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 265–283. MIT Press.

M. Newman. 2003. The structure and function of complex networks.

B. Sagot and D. Fišer. 2008. Building a free French WordNet from multilingual resources. In Proceedings of OntoLex 2008, Marrakech.

M. Steyvers and J. B. Tenenbaum. 2005. The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29:41–78.

E. Sweetser. 1991. From Etymology to Pragmatics. Cambridge University Press.

D. Tufis. 2000. BalkaNet: design and development of a multilingual Balkan wordnet. Romanian Journal of Information Science and Technology, 7(1-2).

B. Victorri and C. Fuchs. 1996. La polysémie, construction dynamique du sens. Hermès.

D. J. Watts and S. H. Strogatz. 1998. Collective dynamics of small-world networks. Nature, 393:440–442.

T. Zesch, C. Müller, and I. Gurevych. 2008a. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Marrakech.

T. Zesch, C. Müller, and I. Gurevych. 2008b. Using Wiktionary for computing semantic relatedness. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence.
