The Apertium Bilingual Dictionaries on the Web of Data
Semantic Web 0 (0) 1
IOS Press

Editor(s): Philipp Cimiano, Universität Bielefeld, Germany
Solicited review(s): Roberto Navigli, Sapienza University of Rome, Italy; Bettina Klimek, University of Leipzig, Germany; Sebastian Hellmann, University of Leipzig, Germany; one anonymous reviewer

Jorge Gracia a,*, Marta Villegas b, Asunción Gómez-Pérez a, and Núria Bel b
a Ontology Engineering Group, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, Boadilla del Monte, 28660 Madrid, Spain. E-mail: {jgracia,asun}@fi.upm.es
b Institut Universitari Linguistica Aplicada, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Spain. E-mail: {marta.villegas,nuria.bel}@upf.edu
* Corresponding author

1570-0844/0-1900/$27.50 © 0 – IOS Press and the authors. All rights reserved

Abstract. Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats, and are accessible only through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantically enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Description Framework) and how their data have been made accessible on the Web as linked data. As a result, all the converted dictionaries (many of them covering under-resourced languages) are interconnected and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries.

Keywords: linguistic linked data, multilingualism, Apertium, bilingual dictionaries, lexicons, lemon, translation

1. Introduction

The publication of bilingual and multilingual language resources as linked data (LD) on the Web can largely benefit the creation of the critical mass of cross-lingual connections required by the vision of the multilingual Web of Data [9]. The benefits of sharing linguistic information on the Web of Data have been recently recognized by the language resources community, which has shown increasing interest in publishing its linguistic data and metadata as LD on the Web [2]. As a result of interlinking multilingual and open language resources, the Linguistic Linked Open Data (LLOD) cloud is emerging¹, that is, a new linguistic ecosystem based on the LD principles that will allow the open exploitation of such data at global scale.

In this article we will focus on the case of electronic bilingual dictionaries as a particular type of language resource. Bilingual dictionaries are specialized dictionaries that describe translations of words or phrases from one language to another. They can be unidirectional or bidirectional, allowing translation, in the latter case, to and from both languages. A bilingual dictionary can contain additional information such as part of speech, gender, declension model and other grammatical properties.

Electronic bilingual dictionaries have typically been developed in isolation, in their own formats, and are accessible only through proprietary APIs. We propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantically enabled resources in a direct manner, based on standard languages and query means. The result of a principled conversion into RDF is that the LD dictionaries are interconnected [7] and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries. Other potential uses of bilingual dictionaries in LD are the enhancement of LD-based machine translation [10] and cross-lingual information access over LD [9].

In particular, we have converted the Apertium family of bilingual dictionaries [4] into RDF, making their data interconnected and accessible on the Web as LD. Thus, the main contributions of this work are:

– We propose a method for converting bilingual dictionaries into RDF and publishing them as LD on the Web, which we have particularised to the Apertium case.
– As a result, we contribute to the LLOD cloud with 22 new linguistic datasets, many of them covering under-resourced languages that had little or no presence so far in the Web of Data (e.g., Occitan, Asturian, Aragonese, Esperanto, Basque, etc.).
– We also analyse the resulting Apertium RDF graph and exemplify how to traverse it to obtain direct and indirect translations.
– We propose two different algorithms based on the RDF graph structure to compute the confidence degree of indirect translations.

The remainder of this paper is organised as follows. In Section 2 we briefly describe the Apertium source data. Section 3 discusses how lexica and language translations can be represented as LD. Section 4 describes the method we have followed to convert the Apertium dictionaries into RDF and to publish them as LD on the Web. In Section 5 we exemplify how to traverse the Apertium RDF graph to obtain translations, and how to compute a confidence degree for indirect translations. Section 6 analyses related work and, finally, conclusions can be found in Section 7.

¹ An updated picture of the current LLOD cloud can be found at http://linguistic-lod.org/

2. The source data

Apertium is a free/open-source machine translation platform [4], initially aimed at related-language pairs, which currently includes up to 40 language pairs². The system was released under the terms of the GNU General Public License.

The translation engine consists of a series of assembled modules which communicate using text streams. One of the modules is the lexical transfer module, which reads lexical forms of the source language and delivers the corresponding target language lexical forms. The module uses a bilingual dictionary which contains an equivalent for each source lexical form. Apertium dictionaries are designed so that they can be compiled into letter transducers [6], which are able to process input strings and produce output strings. Accordingly, dictionaries are made of entries consisting of string pairs that correspond to the inputs and outputs of the transducer. The Apertium dictionaries are described in XML. Note that the Apertium initiative especially benefits under-resourced languages, for which it is difficult to apply Statistical Machine Translation techniques due to the lack of large parallel corpora.

During the METANET4U Project³, a good number of lexicons were converted into Lexical Markup Framework [5] (LMF) in an effort to upgrade existing resources to agreed standards and guidelines. Many Apertium lexicons were included in that process⁴. Thus, following the LMF model, for each bilingual Apertium lexicon a new LexicalResource was created. Each LexicalResource contains two Lexicons (for source and target languages) and a set of SenseAxis elements that are used to link senses in different languages. Only open categories were considered (nouns, including proper nouns, verbs, adjectives and adverbs). The corresponding IDs were generated by concatenating the word form, the part of speech (PoS) tag and the language tag. The following XML code exemplifies the LMF representation of a single translation ("bench"@en → "banco"@es):

    <LexicalResource>
      <Lexicon>
        <feat att="language" val="en"/>
        <LexicalEntry id="bench-n-en">
          <feat att="partOfSpeech" val="n"/>
          <Lemma>
            <feat att="writtenForm" val="bench"/>
          </Lemma>
          <Sense id="bench_banco-n-l"/>
        </LexicalEntry>
      </Lexicon>
      <Lexicon>
        <feat att="language" val="es"/>
        <LexicalEntry id="banco-n-es">
          <feat att="partOfSpeech" val="n"/>
          <Lemma>
            <feat att="writtenForm" val="banco"/>
          </Lemma>
          <Sense id="banco_bench-n-r"/>
        </LexicalEntry>
      </Lexicon>
      <SenseAxis id="bench_banco-n-banco_bench-n"
                 senses="bench_banco-n-l banco_bench-n-r"/>
    </LexicalResource>

As shown in the above example, every resulting LexicalEntry includes the lemma, the part of speech and a Sense element. Sense elements are needed as placeholders to encode translation equivalents in the SenseAxis elements. Each LexicalEntry has as many senses as target equivalences. Sense elements only have an ID, which is formed by concatenating the source form, the target form, the PoS tag and the ‘l’ or ‘r’ tags (which in the original dictionaries indicate "left" and "right" respectively for the source and target languages).

² http://wiki.apertium.org/wiki/Main_Page
³ http://www.meta-net.eu/projects/METANET4U/
⁴ A complete list of available lexicons can be found at http://lod.iula.upf.edu/types/Lexica/by/standards

[…] properties partially derived from ISOcat⁶. The core of the lemon model consists of the following elements⁷:

LexicalEntry. An entry in the lexicon (word, multiword expression or even affix) that is assumed to represent a single lexical unit with common properties (e.g., PoS) across all its forms and meanings.
LexicalForm. A form represents a particular version of a lexical entry, for example a plural or some other inflected form.
LexicalSense. The sense refers to the usage of a lexical entry with a specific meaning and can also be considered as a reification of the relation between a lexical entry and the ontological entity that characterizes its meaning in a certain context.
Reference. The reference is an entity in the ontology that defines the formal semantics associated with the lexical entry.

The lemon model did not initially consider the representation of explicit translations. To that end, an extension of lemon was proposed for representing translations on the Web of Data: the lemon translation module [8]. The translation module consists