Rdfization of Japanese Electronic Dictionaries and LOD

RDFization of Japanese Electronic Dictionaries and LOD Seiji Koide Hideaki Takeda Research Organization of Information National Institute of Informatics and Systems 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo [email protected] [email protected] Abstract linkage to DBpedia Japanese. Section 5 presents the publication of our work as LOD. Related work This paper describes the practice and the is discussed in Section 6, and Section 7 finally reality of OWL conversion of Japanese gives the summary and the discussion for future WordNet and Japanese dictionary IPAdic. work. The outcomes of OWL conversion are linked to DBpedia Japanese dataset us- 2 LOD and DBpedia ing lexical word matching. The difficulty 2.1 Linguistic LOD and Five Stars originating from the specialty of Japanese, which is shareable by non-English lan- In Linked Open Data (LOD), Tim Berners-Lee, guages, is focused. The potential of LOD the inventor of the Web and Linked Data initia- 1 in linguistics is also discussed. The goal of tor, suggested a five-star deployment scheme. In our study on Linguistics by LOD is to pro- this view, there was no LOD resource for Japanese vide an open and rich environment in lin- linguistics up to this study. EDR (Yokoi, 1995) guistics that propels multi-lingual studies by Japan Electronic Dictionary Research Center for linguistics researchers and bottom-up and lately NICT, GoiTaikei (Ikehara, et al., 1997) style ontology buildings for ontologists. by NTT, and a Japanese corpora by National In- stitute for Japanese Language and Linguistics2 1 Introduction are provided in machine readable forms but not in free use. However, the property of Japanese The traditional study of linguistics in Japanese is WordNet (Isahara, et al., 2008), IPAdic/NAIST- somehow domestic and not open so far to unre- jdic (Matsumoto, et al., 1999), and UniDic (Den, lated people. Linguistics by Linked Open Data et al., 2008) is in free use. (LLOD) has a potential to break this tradition and Based on the five-star scheme for LOD, we can to open linguistic resources to broad researchers deduce the condition of making LOD of a domain unlimited within linguistics. However, Japanese as follows. linguistic LOD embraces special difficulties that arise from specialties of the nature of Japanese. 1. Are materials in the domain open (free in These difficulties are not only limited to Japanese use)? but also common to non-English languages. 2. Is the structure of materials disclosed being In this paper, we describe the practice and the sufficient for RDFization? reality of OWL conversion of Japanese WordNet and Japanese dictionary IPAdic. To make the out- 3. Is it possible to name the components by con- comes into LOD, we linked the entities of them to trollable URIs? DBpedia Japanese and made them accessible on WWWs. 4. Is it possible to make linkage to other re- In the next section, we summarize what is LOD sources? and address the benefit of LLOD along with the Therefore, Japanese WordNet, IPAdic/NAIST- introduction of DBpedia Japanese. Our work of jdic, and UniDic deserve the conversion to RDFization of Japanese WordNet and linkage to 1See http://5stardata.info/. DBpedia Japanese are described in Section 3. Sec- 2See, http://www.ninjal.ac.jp/corpus_ tion 4 introduces the RDFization of IPAdic and the center/kotonoha.html. 64 RDF/OWL data format in order to let them turn 3.1.1 Normalization of UNICODE data resources in LOD, namely making URIs As known by the popular picture of Semantic of all components in dictionaries with control- Web Layer Cake6, UNICODE is the proper char- lable domain names and letting them enable to acter encoding set of Semantic Web and LOD. be referenced on the webs (i.e., dereferenceable). However, it is not known that strings in an RDF Whereby, we can enjoy Japanese linguistic re- graph should be in Normal Form C (NFC) of UNI- sources in the new paradigm of LOD. CODE.7 Otherwise, serious problems may hap- We propose the benefit of LLOD as follows. pen in Japanese and other non-English languages. Enables the sharing of linguistic resources. For example, ‘o’¨ that is located in Basic Plane 0 • is encoded to U+00F6 but it is also printed by Enables the comparison of linguistic re- octets U+006F (Latin small letter o) + U+0308 • sources among them over silos of different (combining dieresis). Then, we may miss string dictionaries in their own definitions. matching “Godel”¨ between one that consists of U+00F6 and the other that consists of U+006F + Enables the usage of linguistic resources with • U+0308. The same thing can happen in case of other non-linguistic resources (e.g., DBpe- Plato (Πλατων´ ) in which ‘α´’ may be U+03AC, dia). or the combination of U+03B1(Greek small letter Enables the development of ontologies start- alpha) and U+0301(combining acute accent). In • Japanese, ‘が’(U+304C) may be represented by ing at the lexical level for multiple vocabu- { か + ゛ , and ‘ぷ’(U+3077) may be represented lary sets. } by ふ + ゜ . The normalization of NFC solves { } 2.2 DBpedia Japanese as LOD Hub this ambiguity of character strings in UNICODE. DBpedia Japanese is a database generated from Japanese Wikipedia using DBpedia Information 3.1.2 Supplementary Ideographic Plane in Extraction Framework (DIEF).3 Although there UNICODE was significant delay in the deployment of DBpe- Several extended kanji characters are located in dia Japanese, it was launched in 2012 by our col- Supplementary Ideographic Plane of UNICODE, leagues at National Institute of Informatics (NII). which is implemented by surrogate pairs, and Since then, all LOD resources in Japan are being these extended kanji characters has been used for linked to the DBpedia Japanese and it has become Japanese person names before the age of electron- the hub of LOD-cloud in Japan as English DBpe- ics. For example, ‘ஷ’ (U+20BB7) is very similar dia (Bizer, et al., 2009) is in the world. In Japan, to basic kanji ‘吉’ (U+5409), and ‘丈’ (U+2000B) there are currently 23 data sets linked directly or is similar to basic kanji ‘丈’ (U+4E08), but many indirectly to DBpedia Japanese, which contains computer systems cannot print out the extended 77,445,359 triples, at the time of writing this pa- kanji characters in Supplementary Ideographic per. Plane. Then, Wikipedia titles a page for a pro- boxer to “辰吉丈一郎” instead of his proper name 3 RDFization of Japanese WordNet and “辰ஷ丈一郎”, and then guides us to the page8, Links to DBpedia Japanese even if we, on top of Wikipedia, search a page with the proper name “辰ஷ丈一郎”. We must take care 3.1 Practice of RDFization of extended kanji characters with surrogate pairs 4 5 In addition to RDF syntax and RDF semantics , in data resources. we have discovered some pragmatics on RDFiza- tion in LOD. General ones over diverse domains 3.1.3 URI vs. IRI are described in Heath and Bizer (2011). In this N-Triples9 is a line-based, plain text format for en- section, we describe more specific practices in coding an RDF graph, but the character encoding RDFization of Japanese resources. 6http://en.wikipedia.org/wiki/ 3https://github.com/dbpedia/ Semantic_Web_Stack extraction-framework/wiki/ 7http://www.w3.org/TR/rdf-mt/ The-DBpedia-Information-Extraction-Framework#graphsyntax 4http://www.w3.org/TR/ 8http://ja.wikipedia.org/wiki/ 辰吉丈一郎 rdf-syntax-grammar/ 9http://www.w3.org/TR/rdf-testcases/ 5http://www.w3.org/TR/rdf-mt/ #ntriples 65 in string is designated to 7-bit US-ASCII. So, non- After the W3C proposal for OWL conver- ASCII characters must be made available by \- sion of WordNet, the Princeton WordNet was escape sequences, such as ‘\u3042’ for Japanese updated to version 2.1, in which new relations hiragana ‘あ’ (U+3042).10 of instanceHypernym and instanceHyponym has RDF/XML syntax11 designates %-encoding for been introduced, and now the latest version is disallowable characters that do not correspond to 3.0. In following the updates of WordNet, the permitted US-ASCII in URI encoding, in spite that RDF schema for WordNet 2.0 should be reused the UNICODE string as UTF-8 is designated to to 2.1 and 3.0, according to one of rules for the RDF/XML representation. Therefore, the dis- the best practice in LOD. Only for two new allowed URL http://ja.dbpedia.org/page/ 辰 properties, wn21schema:instanceHyponymOf and 吉丈一郎 must be escaped as http://ja.dbpedia. wn21schema:instanceHypernymOf should be de- org/page/ %E8%BE%B0%E5%90%89%E4%B8%88%E4% fined in WordNet 2.1. On the other hand, B8%80%E9%83%8E in RDF/XML syntax. the namespaces of every instance of words, Turtle12 and JSON-LD13 allow IRIs. We expect word senses, and synsets may be updated to every platform for Semantic Web and LOD can wn21instances or wn30instances, de- process format files of Turtle and JSON-LD, and pending on the version numbers in order to dis- then the revised edition of RDF/XML will allow tinguish the version of data, even if the content of IRIs in near future. an entry was not updated in a new version. At the end, we will be able to choose URIs if 3.3 RDFization of Japanese WordNet we focus on the international usability of the data, or IRIs if we take care of domestic understand- The latest Japanese WordNet is built on top of ability. The RFC3986, the standard of URI, says Princeton’s English WordNet 3.0 by adding ap- for the design of URI, “a URI often has to be re- propriate Japanese words to the content of Prince- membered by people, and it is easier for people to ton WordNet 3.0 on the framework of the Word- remember a URI when it consists of meaningful Net.

Load more