Building up a Large Scale of Ontology from Japanese Wikipedia

Takahira Yamaguchi Keio University, Today’s Talk z Background z Proposed Methods IS‐A Hierarchy, Class – Instance, RDF triple, Property Domains, Synonyms z Application 1 (Knowledge Creation Support) Demo z Application 2 (Human Robot Interaction) Demo z Evaluation (Results, Related Work) z Conclusions and Future Work

2 Background: ODP&T (Ontology Development Process and Tool) Process determine consider enumerate define define define create scope reuse terms classes properties constraints instances (Noy 2003)

Tool Ontology Search Ontology Learning Linked Open Data (LOD) (by SWOOGLE, WATSON) from Text, Search Monkey Ontology Matching & Semi‐structured (Enhanced Results) Alignment Resources

Wikipedia2 Ontology

3 Proposed Methods

Institute Educational Intstitute Univ

Japanese Univ Wikipedia Keio Univ Wikipedia Keio Univ foundation Ontology

1858

Institute location foundation

Keio Univ

Keio University 4 Extracting Is‐a Relationships

• String Matching Methods on the category names • Matching Infobox Templates and Categories

5 Wikipedia Category Tree

Category A

Category B

Article a Category D

Article b Category C

Category E Article c Category F Article d

Article e 6 Category Tree Category and Category Tree http://ja.wikipedia.org/wiki/Category:プログラミング言語 Category 「Programming Language」

Mixing is-a, has-a, class-instance, and other relationshipsSub categores in the of Programming category tree Language

Extracting is-a relationships from category tree

The number of categories: 91,316 http://ja.wikipedia.org/wiki/Wikipedia:カテゴリ7 String Matching Methods on the category names

„ Backward String Matching Super Class Sub Class Motorway Japanese Motorway AirportHigh-speed rail Taiwan High-speedAirport rail Seafood Japanese Seafood SubSum categoryo of Amateur Sumo Is-a 7,971 Is-a relationships JapaneseJunior Airport college Japanese JuniorJapanese college Airport Short film Disney Short film Category Tree Is-a relationship Total: 12,558 „ Forward Matched String Eliminating Super Class Sub Class Noodle Yakisoba JapaneseBird AthleteDomestic duck Athlete Bird Penguin Musician Sub categoryLyricist of Is-a 4,587 Is-a relationships JapaneseMusician Golfer Composer Golfer Media Music CategoryMedia TreeNewspaper Is-a relationship Author Poetry 8 Matching Infobox Template Name to Category Name

「Instrument」Super Class Template Sub Class Super Class Sub Class Organic compound Software Free Software 「Piano」Article Amine Organic compound Ester Software Image Processing Software Organic compound Carboxylic acid Software Text Editor Organic compound Terpenoid SoftwareInst楽器rument Web Browser Organic compound Organonitrogen compounds Software E-mail Software Organic compound Aromaticity Is-aSoftware Security Software Organic compound Heterocyclic compound Software Word processor Organic compound Amide Keyboard Software CAD Software 鍵盤楽器instrument Organic compound Amino acid instrumentSoftware System Software Organic compound AlkaloidIs-a Software Windows Game Software Organic compound Alcohol Software Application Software Organic compound AldehydeピアノPiano Software TeX

Is-a relationships: Categories that the piano article belongs Keyboard instrument | Piano 3,782 9 Extracting Class‐Instance Relationships Title → CClasslass Title Item → Instance List of People from People from Tokyo ==Mathmatician Class Instance * Keio University People Yukichi Fukuzawa Keio University People Atsushi Seike Item Listing Pages: * Shokichi Iyanaga Keio University People Yuichiro Anzai aboutJapanese #8,300 Tourist attraction Kyoto ==Physicist Class-Instance Japanese Tourist attraction Hiroshima Japanese Tourist attraction Akihabara * Sin‐Itiro Tomonaga Relationships: 421,989 Japanese Tourist attraction Yokohama Nobel laureates Hideki Yukawa Procedure Nobel laureates(1) Scrape lines includingMasatoshi instances’ Koshibastring (2) Eliminate ‘*’ lines which are used in the explanatory text of the list. Nobel laureates(3) Eliminate ‘*’ linesYoichiro which are Nambu linked to other listing pages. (4) Eliminate ‘*’ lines which belong to unconcerned content index like “recital”. (5) Eliminate ‘*’ lines which aren’t correct as instances like “*REDIRECT”. (6) Eliminate ‘*’ lines which are used to describe year like “*19th”. 10 (7) Scrape the string of instance out of each ‘*’ lines by using symbol of link “[[ ]]” to identify it. Extracting RDF Triples

Subject Subject Predicate Object Tokyo region Kanto Tokyo area 2,187.65 Tokyo population 12,988,797 Tokyo density 5,940 Tokyo tree Ginkgo tree Tokyo flower Somei‐Yoshino Predicate Object Tokyo bird Black‐headed Gull Tokyo governor Shintaro Ishikawa

・・・・・・1,485,751 Triples

11 Implementation of Wikipedia Ontology Search Application

Institute Is-a Relationships Educational Intstitute ?String Matching Methods Univ ?Matching Templates and Categories

Japanese Univ Class-Instance Relationships Wikipedia ?Scraping Listing Pages Keio Univ

Keio Univ RDF Triples foundation ?Scraping Infoboxes 1858 Institute Domain of Property Wikipedia location ?Scraping Infoboxes foundation Synonym Ontology Keio Univ ?Extracting Redirect Links Keio University Support ・ Idea generation ・ Analysis

Search Results 12 Demo1 WiLD (Wikipedia Linked Data Application) (think about Japanese famous novelist) 1min.

Linked Data ・ Book ・ Restaurant

13 13 Demo2 HRI (Human Robot Interaction) (1)

Microphone Speaker NAO Sonar Python Programming

Inertial sensor 58cm

Pressure sensor

NAO comes from Aldebaran in . http://www.aldebaran‐robotics.com/en 14 14 Demo2 HRI (Human Robot Interaction) (2) 1. An user asks Nao the ways for health-care. 2. Using WikipediaJapn ontology, Nao enumerates them. 3. The user selects tai-chi from them. 4. Using Action ontology with Nao, Nao shows the user tai-chi actions that Nao can do. 5. The user selects tai-chi_1 from them. 6. Nao does the action of tai-chi_1. Demo: by Japanese 実行可能動作 1min30sec 健康法 基本動作 複合動作 Tai‐chi 太極拳

太極拳 ウォーキング ピラティス早寝早起き 禁煙 日光浴 移動 回転 姿勢 重心 屈伸 ダンス ラジオ体操 陳式太極拳 楊式太極拳 スイミング 孫式太極拳 ターンゆっくりターン 左足屈伸右足屈伸 ラジオ体操第一ラジオ体操第二 背泳ぎ 前進 後退横歩き 立つ 座る 太ももを 武式太極拳呉式太極拳 バタフライ ヨガ 両手を 前屈 寝転ぶ 伸ばす 自由形 平泳ぎ 歩くゆっくり歩く ゆっくり腰に手を 太極拳1 サイドステップ 前に出す 太極拳2ヨガ 自己紹介スターウォーズ カルマヨーガ ジャパヨーガ サイドステップ当てる右足に 左足に ギャーナヨーガ 後ろ足ゆっくり後ろ足 ダンス ダンス 重心をのせる中心に重心をのせる パクティヨーガ ラージャヨーガ スリラーダンス マントラヨーガ 重心を戻す WikipediaJapn ontology Action ontology 15 Learning Results from Wikipedia Japan (1)

Relations # Precision Al IS-A 93,322 76.30% IS-A By string matching 12,558 93.1±1.51% By template matching 3,782 95.6±1.09% By contents headlines 83,288 72.6±2.74% Class - Instance 421,989 97.2±1.02% RDF triple 1,485,751 95.8±1.79% Property Domains 6,485 95.4±1.22% Synonyms 106,671 67.0±2.90% Total 1,834,753 -

16 How is WJO going ?

Less upper concepts Some concepts with many sub concepts

Some classes with no instances Some classes with popular instances

17 Related Work Property Instance IS‐A Property Class‐ RDF Property domain Instance triple Bizer DBpedia ○○ Suchanek YAGO ○△○ △ Ponzetto - ○ Wikipedia Keio Univ. Japan2 ○○○○○ Ontology

DBpedia - A Crystallization Point for the Web of Data

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak Sebastian Hellmann Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154–165, 2009. http://wiki.dbpedia.org/

YAGO - A Large Ontology from Wikipedia and WordNet Fabian M Suchanek, Gjergi Kasneci, Gerhard Weikum Elsevier Journal of Web Semantics http://www.mpi-inf.mpg.de/yago-naga/yago/ 18 19 Conclusions and Future work

Conclusions „ Wikipedia Japan works for light‐weight (not heavy‐weight) ontology development.

„ Wikipedia Japan Ontology works for knowledge creation support and HRI.

Less upper concepts Some concepts with Future Work many sub concepts Manage the right issues of Wikipedia Japan Ontology by with Upper Ontologies. Some classes with no instances Some classes with popular instances 19