Mining Translations from the Web of Open Linked Data
Total Page:16
File Type:pdf, Size:1020Kb
Mining translations from the web of open linked data John Philip McCrae Philipp Cimiano University of Bielefeld University of Bielefeld [email protected] [email protected] Abstract OmegaWiki on the web of data should ameliorate In this paper we consider the prospect of the process of harvesting translations from these extracting translations for words from the resources. web of linked data. By searching for We consider two sources for translations from entities that have labels in both English linked data: firstly, we consider mining labels for and German we extract 665,000 transla- concepts from the data contained in the 2010 Bil- tions. We then also consider a linguis- lion Triple Challenge (BTC) data set, as well as tic linked data resource, lemonUby, from DBpedia (Auer et al., 2007) and FreeBase (Bol- which we extract a further 115,000 transla- lacker et al., 2008). Secondly, we mine transla- tions. We combine these translations with tions from lemonUby (Eckle-Kohler et al., 2013), the Moses statistical machine translation, a resource that integrates a number of distinct dic- and we show that the translations extracted tionary language resources in the lemon (Lexi- from the linked data can be used to im- con Model for Ontologies) format (McCrae et al., prove the translation of unknown words. 2012), which is a model for representing rich lex- ical information including forms, sense, morphol- 1 Introduction ogy and syntax of ontology labels. We then con- In recent years there has been a massive explo- sider the process of including these extra transla- sion in the amount and quality of data available tions into an existing translation system, namely as linked data on the web. This data frequently de- Moses (Koehn et al., 2007). We show that we scribes entities in multiple languages and as such can extract many translations which are comple- can be used as a source of translations. In par- mentary to those found by the statistical machine ticular, the web contains much data about named translation system and that these translations im- entities, such as locations, films, people and so prove the translation performance of the system. forth, and often these named entities have trans- 2 Mining translations from the linked lations in many languages. In this paper, we ad- open data cloud dress the question of what can be achieved by us- ing the large amounts of data available as multi- Obtaining translations from the Linked Open Data lingual Linked Open Data (LOD). As state-of-the- (LOD) cloud is a non-trivial task as there are art statistical machine translation systems (Koehn, many different properties used to specify the la- 2010) are typically trained on outdated or out-of- bel or name of resources on the LOD. The stan- domain parallel corpora such as on the transcripts dard method of identifying the language of a label of the European Parliament (Koehn, 2005), we is by means of the xml:lang annotation, which expect to increase the coverage of domain-specific should be an ISO-639 code, such as “en” or “eng” terminology. for English. As noted by Ell et al. (2011), very lit- In addition, there has recently been a move to- tle of the data found on the web actually has such wards the publication of of language resources us- a language tag. Further, as the labels are typically ing linked data principles (Chiarcos et al., 2011), short it is very difficult to infer the language re- which can be expected to lead to a significant in- liably based on its surface form. As such we are crease the availability of information relevant to compelled to rely on the language tags, and this NLP on the Web. In particular, the representa- means that from the data we can only recover a tion of legacy resources such as Wiktionary and small amount of what may be available. In par- 8 Proceedings of the Joint Workshop on NLP&LOD and SWAIE, pages 8–11, Hissar, Bulgaria, 12 September 2013. ticular, out of the Billion Triple Challenge data Resource Translations we recover approximately 398 million labels in OmegaWiki (English) 56,077 English but only 144,000 labels in German. Un- OmegaWiki (German) 55,990 surprisingly, given the dominance of English on Wiktionary (English) 34,421 the web of data (Gracia et al., 2012), we find that Wiktionary (German) 43,212 there are many more labels in English. We then All 114,644 filter these two sets so that we keep only URIs for lemonUby and cloud 777,173 which there is both an English and a German label. Among this data is a significant number of long Table 2: Number of translations extracted for lin- textual descriptions, which are very unlikely to be guistic linked data resource lemonUby useful for translation, such that we also filter out all labels whose length is more than 10 characters. Out of these resources, FrameNet, VerbNet and This filter was necessary to reduce the amount of WordNet are monolingual English resources, so noisy “translations”. We do not filter according to we focus on the OmegaWiki and Wiktionary part any particular property, e.g., we do not limit our- of the resources. LemonUby contains direct trans- selves to the RDFS label property, but in the case lations on the lexical senses of many of the re- where we have multiple labels on the same entity sources (see Figure 1 for an example). The to- we select the RDFS label property as a preference. tal number of translations extracted for each sub- In addition to the BTC data, we also include resource is given in Table 2 as well as the results two large resources for which there are multilin- in combination with the number of translation ex- gual labels in their entirety. These resources are tracted from labels in the previous section. DBpedia1 and FreeBase2 as these resources con- tain many labels in many languages. As the re- 4 Exploiting mined translations mined sources are consistent in the use of the rdfs:label from linked data property, we extract translations by looking at the rdfs:label property and the language tag. In Ta- In order evaluate the utility of the translations ble 1 we see the number of translations that we extracted from the linked data cloud, we inte- extract from the three resources. We see that we grate them into the phrase table of the Moses sys- extract fewer resources from the BTC data but a tem (Koehn et al., 2007) trained on Europarl large and reasonable number of translations from data (Koehn, 2005). We used the system primarily the other two resources in an ‘off-the-shelf’ manner in order to focus on the effect of adding the linked data translations. 3 Translations mined from linguistic Moses uses a log-linear model as the baseline linked data for its translations, where translations are gener- ated by a decoder and evaluated according to the For finding translations from linguistic linked following model: data, we focus on the lemonUby (Eckle-Kohler et X al., 2013) resource, which is a linked data version p(t|f) = exp( φi(t, f)) of the UBY resource (Gurevych et al., 2012). This i resource contains lemon versions of a number of Where t is the candidate translation sentence, f resources in particular: is the input foreign text and φi are scoring func- tions. In the phrase-based model the translation • FrameNet (Baker et al., 1998) is derived compositionally by considering phrases • OmegaWiki 3 and their translations stored in the so called phrase table of the Moses system. • VerbNet (Schuler, 2005) The main challenge in integrating these transla- 4 • Wiktionary tions derived from linked data lies in the fact that • WordNet (Fellbaum, 2010) they lack a probability score. For each transla- tion pair (a, b) derived from the linked data, we 1 We use the dump of the 3.5 version distinguish two cases: If the translation is already 2The dump was downloaded on June 21st 2013 3http://omegawiki.org in the phrase table, we add a new feature that is 4http://www.wiktionary.org set to 1.0 to indicate that the translation was found 9 Resource English Labels German Labels Translations BTC 398,902,866 144,226 51,756 DBpedia 7,332,616 590,381 540,134 FreeBase 41,261,806 1,654,254 259,923 All 447,497,288 2,338,861 665,910 Table 1: The number of labels and translations found in the linked data cloud by resource <OW_eng_LexicalEntry_0#CanonicalForm> lemon:writtenRep "rain"@eng. <OW_eng_LexicalEntry_0> lemon:canonicalForm <OW_eng_LexicalEntry_0#CanonicalForm> ; lemon:sense <OW_eng_Sense_0> . <OW_eng_Sense_0> uby:equivalent "schiffen"@deu , "regnen"@deu . Figure 1: An example of the relevant data for a lemon lexical entry from the OmegaWiki English section of lemonUby BLEU Evaluator 1 Evaluator 2 proved the translation quality almost exclusively Baseline 11.80 36% 34% in the case that the machine translation system did +LD 11.78 64% 64% not have an existing candidate. We then generated all translations for the baseline system without any Table 3: The comparative evaluation of the trans- linked data translations and for the system aug- lations with and without linked data mented with all the linked data translations. Out of 3,003 translations in the test set, we found 346 from the Linked Data Cloud. If the translation was translations which were changed by the introduc- not in the phrase table, we add a new entry with tion of linked data translation. For each transla- probability 1.0 for all scores and the feature for tion, we performed a manual evaluation with two linked data set to 1.0.