
UNCOVERING THE SEMANTICS OF WIKIPEDIA CATEGORIES

Nicolas Heist | Heiko Paulheim

THE WIKIPEDIA CATEGORY GRAPH

LEARNING AXIOMS Introduction

⊆ dbo:Album
⊆ ∃dbo:genre.{dbr:Rock_music}
⊆ ∃dbo:artist.{dbr:Nine_Inch_Nails}

LEARNING AXIOMS The Easy Mode

USE CATEGORY NAMES

The Beatles albums
  ⊆ dbo:Album
  ⊆ ∃dbo:artist.{dbr:The_Beatles}

USE CATEGORY RESOURCES

The Beatles albums
  ⊆ ∃dbo:genre.{…}

… BUT
- 37% 20% of categories have 0 resources
- 66% 50% of categories have <= 3 resources
- 79% 68% of categories have <= 7 resources

LEARNING AXIOMS Existing Approaches

<Artist> albums    <Genre> albums

CATRIPLE [1]
- Identify relation patterns in category hierarchy
- Assign axioms based on extracted keywords or voting

C-DF [2]
- Use tf-idf inspired score on relations and types to find candidate axioms
- Learn rules from candidate axioms
- Apply rules to discover final axioms

[1] Liu et al.: Catriple: Extracting Triples from Wikipedia Categories, ASWC 2008
[2] Xu et al.: Learning Defining Features for Categories, IJCAI 2016

Cat2Ax Idea

Category name = c_var + c_fix, e.g. c_var = 'The Beatles', c_fix = 'albums'

(1) … albums → ⊆ dbo:Album
(2) someArtist albums → ⊆ ∃dbo:artist.{dbr:someArtist}
(3) someGenre albums → ⊆ ∃dbo:genre.{dbr:someGenre}
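
The split of a category name into a variable and a fixed part can be illustrated with a small sketch. It assumes that the fixed part is simply the longest run of trailing words shared by all names in a set. The slides do not say how the split is actually computed, so the function name `split_by_shared_suffix` and the example set are hypothetical.

```python
from typing import List, Tuple


def split_by_shared_suffix(names: List[str]) -> Tuple[List[str], str]:
    """Split category names into variable parts (c_var) and one fixed part (c_fix).

    Assumption for illustration: c_fix is the longest sequence of trailing
    words that all names in the set have in common.
    """
    tokenized = [name.split() for name in names]
    shared: List[str] = []
    # Walk backwards from the last word for as long as every name agrees.
    for words in zip(*(reversed(tokens) for tokens in tokenized)):
        if len(set(words)) != 1:
            break
        shared.insert(0, words[0])
    c_fix = " ".join(shared)
    c_vars = [" ".join(tokens[: len(tokens) - len(shared)]) for tokens in tokenized]
    return c_vars, c_fix


# Hypothetical category set sharing the fixed part "albums".
c_vars, c_fix = split_by_shared_suffix(
    ["The Beatles albums", "Reggae albums", "Nine Inch Nails albums"]
)
```

The variable parts are what patterns (2) and (3) try to link to a resource, while the fixed part feeds pattern (1).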

Cat2Ax Relation Axioms

score_rel(c, p, v) = freq(c, p, v) * lexScore(v, c_var)

score_rel(The Beatles albums, dbo:artist, dbr:The_Beatles)
  = freq(The Beatles albums, dbo:artist, dbr:The_Beatles) * lexScore(dbr:The_Beatles, 'The Beatles')
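
A minimal sketch of the relation score follows. It assumes that freq(c, p, v) is the share of resources in category c that carry the triple (resource, p, v); the slide does not define freq, so this reading is an assumption. The album identifiers are hypothetical toy data, and 0.97 is the value the deck lists for 'the beatles' referring to dbr:The_Beatles.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def freq(resources: Iterable[str], triples: Set[Triple], prop: str, value: str) -> float:
    """Share of the category's resources that have (resource, prop, value). Assumed definition."""
    members = list(resources)
    if not members:
        return 0.0
    hits = sum(1 for resource in members if (resource, prop, value) in triples)
    return hits / len(members)


def score_rel(
    resources: Iterable[str],
    triples: Set[Triple],
    prop: str,
    value: str,
    lex_score: float,
) -> float:
    """score_rel(c, p, v) = freq(c, p, v) * lexScore(v, c_var)."""
    return freq(resources, triples, prop, value) * lex_score


# Hypothetical category with four albums, three of them linked to the artist.
albums = ["dbr:Album_A", "dbr:Album_B", "dbr:Album_C", "dbr:Album_D"]
triples = {
    ("dbr:Album_A", "dbo:artist", "dbr:The_Beatles"),
    ("dbr:Album_B", "dbo:artist", "dbr:The_Beatles"),
    ("dbr:Album_C", "dbo:artist", "dbr:The_Beatles"),
}
score = score_rel(albums, triples, "dbo:artist", "dbr:The_Beatles", lex_score=0.97)
```

Because the two factors are multiplied, a value that never occurs among the category's resources, or that the category name does not plausibly refer to, yields a score of 0.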

Accept pattern if the median of score_rel(c, p, v) over the categories c in the set is > 0

Cat2Ax Type Axioms

score_type(c, t) = freq(c, rdf:type, t) * lexScore(t, c_fix)

score_type(The Beatles albums, dbo:Album)
  = freq(The Beatles albums, rdf:type, dbo:Album) * lexScore(dbo:Album, 'albums')
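
The deck accepts a pattern when the median score over a category set is greater than 0. A minimal sketch of that check, usable for type or relation scores alike, is given below. The function name `accept_pattern` and the score lists are hypothetical, and treating an empty set as rejected is an assumption.

```python
from statistics import median
from typing import Sequence


def accept_pattern(scores: Sequence[float]) -> bool:
    """Accept when the median score across the category set is positive.

    An empty set is rejected here; the slides do not cover that case.
    """
    return len(scores) > 0 and median(scores) > 0


# Hypothetical per-category scores for two candidate patterns.
accepted = accept_pattern([0.9, 0.0, 0.6, 0.8])  # median is 0.7
rejected = accept_pattern([0.0, 0.0, 0.7])       # median is 0.0
```

Using the median rather than the mean means a single high-scoring category cannot carry a pattern when most categories in the set score 0.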

Accept pattern if the median of score_type(c, t) over the categories c in the set is > 0

Cat2Ax Type Lexicalisations

[Figure: example resources of type dbo:Album]

Cat2Ax Lexicalisation Score

ENTITY → LEXICALISATIONS

dbo:Band: 'band': 16354, 'rock': 8264, 'american': 5417, 'group': 3651, 'metal': 2583, 'music': 1716, 'pop': 1711, 'indie': 1662, ..

dbr:The_Beatles: 'the beatles': 3401, 'beatles': 204, 'beatle': 72, '': 9, 'the beat brothers': 4, '': 3, ..

LEXICALISATION → ENTITIES

'band': dbo:Band [0.87], dbo:MusicalArtist [0.09], dbo:Person [0.01], dbo:Album [0.004], …

'the beatles': dbr:The_Beatles [0.97], dbr:The_Beatles_(album) [0.02], dbr:The_Beatles_(TV_series) [0.003], …
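
One way to get from the entity-to-lexicalisation counts to the lexicalisation-to-entity scores is to normalise each lexicalisation's counts over all entities. The slides do not state how the scores are derived, so this normalisation is an assumption. The counts for dbo:Band are taken from the slide; the other counts and the function name `invert_lexicalisations` are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def invert_lexicalisations(counts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Turn entity -> {lexicalisation: count} into lexicalisation -> {entity: share}."""
    totals: Dict[str, int] = defaultdict(int)
    for lex_counts in counts.values():
        for lex, n in lex_counts.items():
            totals[lex] += n

    scores: Dict[str, Dict[str, float]] = defaultdict(dict)
    for entity, lex_counts in counts.items():
        for lex, n in lex_counts.items():
            if totals[lex] > 0:  # skip lexicalisations that were never observed
                scores[lex][entity] = n / totals[lex]
    return {lex: dict(entities) for lex, entities in scores.items()}


counts = {
    "dbo:Band": {"band": 16354, "rock": 8264},  # counts shown on the slide
    "dbo:MusicalArtist": {"band": 1700},        # hypothetical count
    "dbo:Album": {"band": 80, "rock": 900},     # hypothetical counts
}
lex_scores = invert_lexicalisations(counts)
best_for_band = max(lex_scores["band"], key=lex_scores["band"].get)
```

With this reading, the scores for one lexicalisation sum to 1, which matches the shape of the bracketed values on the slide.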

Cat2Ax Pattern Learning

+ 500 others: ⊆ ∃dbo:artist.{dbr:someArtist}
+ 100 others: ⊆ ∃dbo:genre.{dbr:someGenre}

ID   PATTERN                                          SUPPORT  CONFIDENCE
PA1  lex(dbr:res) albums → ⊆ ∃dbo:artist.{dbr:res}    503      0.83
PA2  lex(dbr:res) albums → ⊆ ∃dbo:genre.{dbr:res}     103      0.17
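
The confidence values in the table are consistent with dividing each pattern's support by the total support of the competing patterns for the same textual form. The slide does not state this formula, so it is an assumption. The sketch below reproduces the two values under that reading; the dictionary keys are hypothetical labels.

```python
from typing import Dict


def pattern_confidences(support: Dict[str, int]) -> Dict[str, float]:
    """Confidence as a pattern's share of the total support (assumed formula)."""
    total = sum(support.values())
    if total == 0:
        return {pattern: 0.0 for pattern in support}
    return {pattern: n / total for pattern, n in support.items()}


# Support values from the table; the keys are illustrative names.
conf = pattern_confidences({"artist_pattern": 503, "genre_pattern": 103})
rounded = {pattern: round(value, 2) for pattern, value in conf.items()}
```

Under this reading the two competing patterns split the confidence mass between them, so a frequent interpretation such as the artist pattern starts with a much higher prior than the genre pattern.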

Cat2Ax Pattern Application

conf(c, PA, v) = conf(PA) * score_rel(c, p_PA, v)

conf(Reggae albums, PA1, dbr:Reggae) = conf(PA1) * score_rel(Reggae albums, p_PA1, dbr:Reggae) = 0.83 * 0 = 0
conf(Reggae albums, PA2, dbr:Reggae) = conf(PA2) * score_rel(Reggae albums, p_PA2, dbr:Reggae) = 0.17 * 0.74 = 0.13

RESULTS

Category Sets: 176,785 candidate sets found (containing 8 categories on average)
Patterns: 54,465 textual patterns found (99% imply types, 44% imply properties)

EVALUATION

Methodology: Manual evaluation by annotators on Amazon Mechanical Turk
Sample size: 1250 axioms and 1250 assertions with three annotations each


OUTLOOK

• Create true hierarchy of Wikipedia categories and apply Cat2Ax
  → remove invalid edges in category graph
  → apply Cat2Ax to materialized graph

• Investigate and exploit dependencies between discovered patterns
  → find inconsistencies in generated axioms
  → find inconsistencies in pattern hierarchy

OUTLOOK

A large-scale semantic knowledge graph with
• a rich ontology compiled from the DBpedia ontology and Wikipedia Categories & Listpages
  - 750K classes
  - 880K subclass relations
• class restrictions based on Cat2Ax
  - 200K relation axioms
• additional instances extracted from Wikipedia Listpages
  - 7.2M typed instances (50% more than in DBpedia)
  - 870K new instances from Listpages

More information on http://caligraph.org

THANK YOU

THRESHOLD DETERMINATION

Methodology: Manual evaluation of 50 type and relation axioms for each interval

RESULTS

Category Axioms: 430,405 type axioms and 272,707 relation axioms extracted
Assertions: 3,342,057 type assertions and 4,424,785 relation assertions added

[Figures: Coverage of DBpedia; Discovery in DBpedia]
