<<

Lessons learned (and questions raised) from an interdisciplinary Machine approach

Position paper for the W3C Workshop on the Open Data on the Web, 23 - 24 April 2013, Campus, Shoreditch, London

Timm Heuss University of Plymouth, Plymouth, United Kingdom Timm.Heuss@{plymouth.ac.uk,web.de}

March 2, 2013

field of Natural Processing (NLP), Abstract: Linked Open Data (LOD) has ultimate which widely relies on formalized knowledge in benefits in various fields of computer science and es- various kinds, formats and applications. One pecially the large area of Processing example are , which are usually de- (NLP) might be a very promising use case for it, as it signed to be in a very application specific or widely relies on formalized knowledge. Previously, the even proprietary format. In addition, world author has published1 a fast-forward combinatorial ap- knowledge often plays an important role for the proach he called “ based Machine Trans- quality of a NLP system. lation” (SWMT), which tried to solve a common prob- In the NLP-subfield of Machine Translation lem in the NLP-subfield of Machine Translation (MT) (MT), it is often crucial not only to have a with world knowledge that is, in form of LOD, inherent comprehensive , but to understand in the Web of Data. This paper first introduces this the correctly, as otherwise ambi- practical idea shortly and then summarizes the lessons guities may result in incomprehensible target learned and the questions raised through this approach language [5]. and prototype, regarding the Semantic Web tool stack Within the last decades, the field of MT and design principles. Thereby, the author aimes at fos- has undergone various changes of underlying tering further discussions with the international LOD , from rule-based to statistical ap- community. proaches. Especially the statistical approaches benefit from the vast amounts of information 1 Motivation on the ”eyeball Web” [3, p. 82]. However, none of these MT approaches re- There are arbitrary uses residing in the nature flect the knowledge-based developments the of every information and with 5-star-Linked Web made in the realms of the Semantic Web Open Data (LOD) [2], humanity could finally respectively the Web of LOD since the mille- let machines benefit from the knowledge that nium. In the opinion of the author, a logical is available world wide, to everyone. next step for MT is to involve the knowledge While all fields of computer science could ul- that is inherent in the Web of Data. He there- timately benefit from the Web of Data, one of fore developed a combinatorial approach of the the most pressing area is probably the giant Semantic Web and MT disciplines, which he 1 http://mt-archive.info/EACL-2012-Harr called Semantic Web based Machine Transla- iehausen.pdf (accessed 2013-02-23). tion (SWMT) [5].

1 2 Semantic Web based translate [5, page 5]. But for SWMT, a trans- Machine Translation lation is a logical conclusion of world knowl- edge. The idea behind SWMT is basically to use the knowledge that is available in form of LOD as 2.2 Involved knowledge / LOD dictionary for a MT task. Thereby, a principle design goal is to use the vision and standards of The sentence mentioned above contains the the W3C as strict as possible: There are a few world knowledge that a vendor called Apple Resource Description Framework (RDF) state- produced a product named Pages and a ven- ments containing the world knowledge, nat- dor called Microsoft with the short form MS ural language is represented as rdfs:label, produced a product named . as a ”human-readable version of a resource’s Of course, DBpedia3 is the first address to name” [4], including a language notation fol- retrieve that knowledge from. For example, the lowing RFC-30662 [6]. In addition, reasoning following triples have been extracted to reflect by the (OWL) ex- world knowledge about Microsoft: pands the knowledge logically. Thus, the ap- :Microsoft_Word dbp:developer proach works with implicit, inferenced knowl- dbpedia:Microsoft . edge that is not stated explicitly by RDF state- dbpedia:Microsoft_Excel dbp:developer ments. dbpedia:Microsoft . dbpedia:Excel dbo:wikiPageDisambiguates 2.1 Sample scenario dbpedia:Microsoft_Excel ; rdfs:label "Excel". To demonstrate the powerfulness of this ap- dbo: wikiPageDisambiguates dbpedia: tricky sentence was literally designed to be Microsoft_Word ; rdfs:label"Word". translated in the author’s native language Ger- dbpedia:MS dbo:wikiPageDisambiguates man: dbpedia:Microsoft ; rdfs:label"MS".

dbo:wikiPageDisambigua- Pages by Apple is a word The property tes processor like Word by MS. is refereed to add common short forms (e.g. Excel for Microsoft Excel) and ab- Of course, Pages, Apple, Word and MS breviations (e.g. MS for Microsoft). are proper names and should not be trans- This general-use LOD is supplemented with lated. An additional measure to stress a two application specific enrichments: First, the machine translation is to use indirect prod- property dbo:developer gets an more expres- uct names (Pages by Apple and not Apple sive and especially internationalized human- Pages). Thus, these phrases cannot be de- readable label than DBpedia’s developer: rived from possible dictionary entries. Also, dbp:developer rdfs:label"by"@en,"von" the company Microsoft is abbreviated by the @de,"produziert von"@de . common MS. :produces rdfs:label"produces"@en," produziert"@de. For traditional translation approaches, sen- :produces owl:inverseOf dbp:developer. tences like this are usually extremly hard to

2 http:/ / www.ietf.org/ rfc/ rfc3066.txt 3 http:/ / dbpedia.org/ About (accessed 2013- (accessed 2012-01-25). 03-01).

2 The second addition is the definition of ap- plication specific trigger , which are used in the following section to intensionally trig- ger the enrichment of the translation dictio- nary out of LOD: dbpedia:Microsoft_Worda :trigger . dbpedia:Apple_Inca :trigger .

2.3 Prototype Since its publication last year [5], the pro- totype4 has changed significantly. The basic principle, however, is still the same: The ap- plication translates an English sentence to Ger- man, by emulating a traditional translation Figure 1: Prototype architecture consisting of system that translates words and phrases with a traditional, dictionary based translation mecha- the help of a (also very traditional) dictionary. nism and a corresponding dictionary. This dictio- The novelty of the approach is the Semantic nary is enriched with Semantic Web technology by Web tool stack, that uses the LOD to create reasoning and querying LOD from DBpedia. new dictionary entries based on reasoning on world knowledge. The figure 1 depicts this combination of traditional and novel transla- ?gSb dbo:wikiPageDisambiguates ?sb . ?gSb rdfs:label ?sbLabel . tion components. ?gSth dbo:wikiPageDisambiguates ?sth As mentioned in the previous section, the . DBpedia entries dbpedia:Microsoft Word ?gSth rdfs:label ?sthLabel and dbpedia:Apple Inc are defined as trig- FILTER langMatches(lang(? gers. Thus, if a sentence contains words like doesLabelFrom),"EN") FILTER langMatches(lang(?doesLabelTo Word or Apple (labels that are retrieved by ref- ),"DE") erencing the dbo:wikiPageDisambiguates- FILTER( regex(str(?sbLabel),"Pages relation), the Semantic Web tool stack is asked ","i") || regex(str(?sthLabel), to produce additional dictionary entries for "Pages","i"))} the translation. After creating an inferred As a result, simple but analogous and valid model with an OWL-reasoner, this essentially translations based on actual word knowledge involves a Simple Protocol And RDF Query are produced. In the given scenario, for the Language (SPARQL) query. By the example trigger word Word the phrases shown in the of the trigger word Pages, this query looks like following figure are produced: the following: SELECT ?sbLabel ?doesLabelFrom ? English German sthLabel ?doesLabelTo WHERE Word by MS Word produziert von MS { ?sb ?does ?sth . Word by MS Word von MS ?does rdfs:label ?doesLabelFrom . ?does rdfs:label ?doesLabelTo . MS produces Word MS produziert Word

4 https://github.com/heussd/swmt (accessed Figure 2: Valid translation phrases, reasoned and 2013-02-28). queried out of the LOD from DBpedia.

3 These phrases are then added to the dictio- input format for a extraction, transformation nary. As a result, the translator produces a and load process. Unquestionable, the LOD- valid and analogous German translation for the vision ends with the step of transformation into given sample sentence that would have never a non-RDF system, when cutting off the links been translated by traditional approaches: to the outside world - does it? Is a SPARQL- capable triple store the highest possible storage Pages von Apple ist eine Text- form in a properly designed LOD application? verarbeitung wie Word von MS.

This application is available at GitHub5. 3.2 DBpedia Endpoint The author was thrilled to work with the real 3 Lessons learned and ques- DBpedia endpoint iSPARQL7. Unfortunately, 8 tions raised the designed query seems to be to expensive for the endpoint policies. As mentioned, the prototype is designed as While the author of course understands the close as possible to the vision of LOD and the technical necessity of such restrictions - is it building paradigms of the Semantic Web. Con- really true that the endpoint to the world’s nected to this design goal, some issues emerged largest pooled collection of LOD cannot be during development. In the following, the au- queried above a simple level complexity? thor describes those issues and where he sees Is it best practice for LOD applications with conceptual mismatches, either of the two do- moderate complex SPARQL queries to extract mains NLP and LOD or in the visions of the and hold LOD for its own and not working with Semantic Web and this concrete application the live DBpedia or other endpoints? scenario. 3.3 Incorporating statistics 3.1 Performance A lot of NLP applications utilize various forms Since this approach has first been introduced of statistics, e.g. in so called n-grams9. Dis- [5], the performance of the prototype has been astrously, one basic principle of the Semantic significantly improved. However, production of Web makes creation of statistics hard and even the translation phrases still takes considerable counting “difficult” [1, page 252]: The Open 6 time , even with the very small data sets in the World Assumption - the fact that “at any time given scenario. Thus, this approach might be [...] new information could come to light” [1, unfeasible for realtime-NLP-scenarios. page 10]. So, it is very questionable if, for ex- A typical answer could be the deployment of ample, counting frequencies of certain triple a triple store. But when following this reason- combinations is a valid operation in the Se- ing, one could came up with the idea of using mantic Web. an even more optimized, application-specific storage structure and to treat LOD just as 7 http:/ / dbpedia.org/ isparql/ (accessed 2013-03-01). 5 http://github.com/heussd/swmt (accessed 8The service returns the error message: “The esti- 2013-03-01). mated execution time 7219 (sec) exceeds the limit of 6On the author’s 2.4 GHz Intel Core 2 Duo machine, 3000 (sec)” OWL-reasoning and quering the data takes about one 9 http:/ / en.wikipedia.org/ wiki/ N-gram second. (accessed 2013-03-02).

4 However, in SWMT, statistics would be re- 4 Conclusion quired to reveal the “best” or “the most fit- ting” translation for a given phrase. Currently, This paper describes a combinatory approach SWMT can only find equally weighted trans- to support a Machine Translation (MT) pro- lation alternatives - and it’s pure coincidence cess with world knowledge that is inherent in if the phrase Word by MS is translated with the Web of Linked Open Data (LOD). With Word produziert von MS or with Word von a simple but expressive demo scenario, a pro- MS, because both forms are valid translations totype application is developed and the syner- (see figure 2 on page 3). gistic addition of Semantic Web technology is shown. Then, the author reflects his current stage of work and the issues he encountered in 3.4 Restrictions of RDF connection with application performance, DB- A fundamental withdraw of the approach is the pedia connectivity, integration of statistics and fact that it is limited to translation of triples, the Resource Description Framework (RDF). consisting of the three RDF elements subject, Based on the feedback and the workshop re- predicate and object [1, page 31]. These triples sults, the author continues his research in the might usually resolve to small phrases of three, fields of Natural Language Processing (NLP) sometimes four words (as shown in figure 2 on and Semantic Web. page 3), but they will never constitute com- plete English sentences of a medium or high References complexity. The project SPARQL2NL10, although hav- [1] Dean Allemang and James A. Hendler. Semantic Web for the Working Ontologist - Effective Model- ing available a very impressive demonstration, ing in RDFS and OWL, Second Edition. Morgan basically seems to work similar to SWMT Kaufmann, 2011. ISBN 978-0-12-385965-5. and thus they both underlie the same triple- [2] Tim Berners-Lee. . Webpage, June restriction 2009. URL http://www.w3.org/DesignIs However, this leads to a more general ques- sues/LinkedData.html. tion: Does the LOD claim to carry the com- [3] John G. Breslin, Alexandre Passant, and Stefan plete sense of a (human) language? Can any Decker. The Social Semantic Web. Springer, Berlin, natural language sentence be converted to n 2009. ISBN 978-3-642-01171-9. doi: 10.1007/978- triples - and can this conversion be seamlessly 3-642-01172-6. reversed, preserving at least the original mean- [4] R.V. Guha. RDF Vocabulary Description Language ing (not to speak of the exact wording)? 1.0: RDF Schema. 2004. NLP applications usually have a language [5] Bettina Harriehausen-M¨uhlbauerand Timm Heuss. model, that includes the language’s morphol- Semantic Web based Machine Translation. In Pro- ogy, and . One hint could ceedings of the Joint Workshop on Exploiting Syn- be including that knowledge of a language in ergies between and Machine the SWMT translation process. Therefore, the Translation (ESIRMT) and Hybrid Approaches to 11 Machine Translation (HyTra), pages 1–9, Avignon, NLP Interchange Format (NIF) looks right in France, April 2012. Association for Computational doing this the LOD way. . URL http://www.aclweb.org/a nthology/W12-0101. 10 http://blog.aksw.org/2013/two-aksw-pa pers-at-www/ (accessed 2013-02-18). [6] G. Klyne and J.J. Carroll. Resource description 11 http:/ / svn.aksw.org/ papers/ 2012/ WWW_ framework (RDF): Concepts and Abstract Syntax. NIF/public.pdf (accessed 2013-03-02). Technical Report February, W3C, 2004.

5