UNCOVERING THE SEMANTICS OF WIKIPEDIA CATEGORIES
Nicolas Heist | Heiko Paulheim
THE WIKIPEDIA CATEGORY GRAPH
LEARNING AXIOMS Introduction
[category] ⊆ dbo:Album
[category] ⊆ ∃dbo:genre.{dbr:Rock_music}
[category] ⊆ ∃dbo:artist.{dbr:Nine_Inch_Nails}
LEARNING AXIOMS The Easy Mode
USE CATEGORY NAMES
The Beatles albums ⊆ dbo:Album
The Beatles albums ⊆ ∃dbo:genre.{dbr:The_Beatles}
The Beatles albums ⊆ ∃dbo:artist.{dbr:The_Beatles}
USE CATEGORY RESOURCES
The Beatles albums …
BUT
37% 20% of categories have 0 resources
66% 50% of categories have <= 3 resources
79% 68% of categories have <= 7 resources
LEARNING AXIOMS Existing Approaches
<Artist> albums    <Genre> albums
CATRIPLE [1]
- Identify relation patterns in category hierarchy
- Assign axioms based on extracted keywords or voting
C-DF [2]
- Use tf-idf inspired score on relations and types to find candidate axioms
- Learn rules from candidate axioms
- Apply rules to discover final axioms
[1] Liu et al.: Catriple: Extracting Triples from Wikipedia Categories, ASWC 2008
[2] Xu et al.: Learning Defining Features for Categories, IJCAI 2016
Cat2Ax Idea
[Figure: category names with a variable part c_var and a fixed part c_fix, annotated (1), (2) and (1), (3)]
(1) … albums → ⊆ dbo:Album
(2) someArtist albums → ⊆ ∃dbo:artist.{dbr:someArtist}
(3) someGenre albums → ⊆ ∃dbo:genre.{dbr:someGenre}
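The split of a category name into c_var and c_fix can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the fixed part is the longest word suffix shared by a group of category names; the function name `split_fixed_suffix` and the input names are hypothetical, and the slides do not state how Cat2Ax actually determines the two parts.

```python
def split_fixed_suffix(category_names):
    """Return (c_fix, c_vars) for a non-empty list of category names.

    Assumption for this sketch: c_fix is the longest word suffix shared
    by all names, and c_var is whatever precedes it in each name.
    """
    tokenized = [name.split() for name in category_names]
    shortest = min(len(tokens) for tokens in tokenized)
    shared = 0
    # Walk backwards from the last word while all names still agree.
    while shared < shortest and len({t[-1 - shared] for t in tokenized}) == 1:
        shared += 1
    if shared == 0:
        return "", list(category_names)
    c_fix = " ".join(tokenized[0][-shared:])
    c_vars = [" ".join(tokens[:-shared]) for tokens in tokenized]
    return c_fix, c_vars


# Illustrative input; the grouping of these names is an assumption.
names = ["The Beatles albums", "Nine Inch Nails albums", "Reggae albums"]
c_fix, c_vars = split_fixed_suffix(names)
print(c_fix)   # shared fixed part
print(c_vars)  # one variable part per category
```

With this reading, the fixed part would suggest a type axiom such as (1), while each variable part would be a candidate value for a relation axiom such as (2) or (3).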
Cat2Ax Relation Axioms
\[ \mathit{score}_{rel}(c, p, v) = \mathit{freq}(c, p, v) \cdot \mathit{lexScore}(v, c_{var}) \]
Example:
\[ \mathit{score}_{rel}(\text{The Beatles albums}, \text{dbo:artist}, \text{dbr:The\_Beatles}) = \mathit{freq}(\text{The Beatles albums}, \text{dbo:artist}, \text{dbr:The\_Beatles}) \cdot \mathit{lexScore}(\text{dbr:The\_Beatles}, \text{``The Beatles''}) \]
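The relation score can be sketched in Python as follows. The slide does not define freq, so this sketch assumes it is the fraction of a category's resources that carry the triple (resource, p, v). The resource names, the triple set and the lexScore value of 0.97 are illustrative placeholders, not data from the talk.

```python
def freq(members, triples, p, v):
    """Fraction of the category's resources r with a known triple (r, p, v).

    Assumption: this relative-frequency reading of freq is not spelled
    out on the slide.
    """
    if not members:
        return 0.0
    return sum((r, p, v) in triples for r in members) / len(members)


def score_rel(members, triples, p, v, lex_score):
    """score_rel(c, p, v) = freq(c, p, v) * lexScore(v, c_var)."""
    return freq(members, triples, p, v) * lex_score


# Hypothetical members of the category "The Beatles albums".
members = ["dbr:Album_A", "dbr:Album_B", "dbr:Album_C", "dbr:Album_D"]
# Hypothetical triples: three of the four albums name the artist.
triples = {
    ("dbr:Album_A", "dbo:artist", "dbr:The_Beatles"),
    ("dbr:Album_B", "dbo:artist", "dbr:The_Beatles"),
    ("dbr:Album_C", "dbo:artist", "dbr:The_Beatles"),
}
score = score_rel(members, triples, "dbo:artist", "dbr:The_Beatles", lex_score=0.97)
print(round(score, 4))
```

Because the two factors are multiplied, a value that never occurs among the category's resources, or one that the category name does not plausibly refer to, drives the score to zero.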
Accept pattern if
\[ \operatorname{median}\big(\mathit{score}_{rel}(c, p, v),\ \mathit{score}_{rel}(c, p, v),\ \mathit{score}_{rel}(c, p, v)\big) > 0 \]
Cat2Ax Type Axioms
\[ \mathit{score}_{type}(c, t) = \mathit{freq}(c, \text{rdf:type}, t) \cdot \mathit{lexScore}(t, c_{fix}) \]
Example:
\[ \mathit{score}_{type}(\text{The Beatles albums}, \text{dbo:Album}) = \mathit{freq}(\text{The Beatles albums}, \text{rdf:type}, \text{dbo:Album}) \cdot \mathit{lexScore}(\text{dbo:Album}, \text{``albums''}) \]
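The median-based acceptance rule used for relation and type patterns can be sketched in a few lines of Python. The sketch assumes that one score is computed per category in a candidate set and that the median is taken over those scores; the function name `accept_pattern` and the score values are hypothetical.

```python
from statistics import median


def accept_pattern(scores):
    """Accept a pattern if the median of the per-category scores is > 0.

    Assumption: `scores` holds one score_rel or score_type value per
    category of the candidate set. An empty list is rejected.
    """
    return bool(scores) and median(scores) > 0


# Hypothetical score lists for two candidate patterns.
mostly_supported = [0.7, 0.0, 0.4]   # median 0.4
rarely_supported = [0.0, 0.0, 0.9]   # median 0.0
print(accept_pattern(mostly_supported))
print(accept_pattern(rarely_supported))
```

One plausible motivation for the median is robustness: a single category with a high score is not enough to accept a pattern that fails for most of the set.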
Accept pattern if
\[ \operatorname{median}\big(\mathit{score}_{type}(c, t),\ \mathit{score}_{type}(c, t),\ \mathit{score}_{type}(c, t)\big) > 0 \]
Cat2Ax Type Lexicalisations
[Figure: dbo:Album and example resources of that type, e.g. dbr:Help!_(album)]
Cat2Ax Lexicalisation Score
ENTITY LEXICALISATIONS
dbo:Band           ‘band’: 16354, ‘rock’: 8264, ‘american’: 5417, ‘group’: 3651, ‘metal’: 2583, ‘music’: 1716, ‘pop’: 1711, ‘indie’: 1662, ..
dbr:The_Beatles    ‘the beatles’: 3401, ‘beatles’: 204, ‘beatle’: 72, ‘beatlesque’: 9, ‘the beat brothers’: 4, ‘beatlemania’: 3, ..
LEXICALISATION ENTITIES
‘band’             dbo:Band [0.87], dbo:MusicalArtist [0.09], dbo:Person [0.01], dbo:Album [0.004], …
‘the beatles’      dbr:The_Beatles [0.97], dbr:The_Beatles_(album) [0.02], dbr:The_Beatles_(TV_series) [0.003], …
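One way to get from per-entity lexicalisation counts to per-lexicalisation entity scores is to invert the count table and normalise each lexicalisation's counts so that they sum to 1. The Python sketch below does this on toy counts. Whether the bracketed values on the slide are computed exactly this way is an assumption, and the counts and the function name `invert_and_normalise` are hypothetical.

```python
from collections import defaultdict


def invert_and_normalise(entity_lex_counts):
    """Turn {entity: {lexicalisation: count}} into
    {lexicalisation: {entity: share}}, where shares sum to 1 per lexicalisation."""
    by_lex = defaultdict(dict)
    for entity, counts in entity_lex_counts.items():
        for lex, n in counts.items():
            by_lex[lex][entity] = n
    scores = {}
    for lex, entities in by_lex.items():
        total = sum(entities.values())
        scores[lex] = {entity: n / total for entity, n in entities.items()}
    return scores


# Toy counts, chosen only to make the arithmetic easy to follow.
toy_counts = {
    "dbo:Band": {"band": 87, "rock": 40},
    "dbo:MusicalArtist": {"band": 9, "singer": 50},
    "dbo:Album": {"band": 4},
}
lex_scores = invert_and_normalise(toy_counts)
print(lex_scores["band"])
```

Read this way, a score close to 1 means that the word almost always refers to that one entity, which is what lexScore needs in order to weight a candidate type or value.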
Cat2Ax Pattern Learning
⊆ ∃dbo:artist.{dbr:someArtist}   (+ 500 others)
⊆ ∃dbo:genre.{dbr:someGenre}     (+ 100 others)
ID    PATTERN                                            SUPPORT   CONFIDENCE
P1    lex(dbr:res) albums → ⊆ ∃dbo:artist.{dbr:res}      503       0.83
P2    lex(dbr:res) albums → ⊆ ∃dbo:genre.{dbr:res}       103       0.17
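The confidence values in the table are consistent with dividing each pattern's support by the total support of all patterns that share the same premise: 503 / (503 + 103) ≈ 0.83 and 103 / (503 + 103) ≈ 0.17. This reading is inferred from the numbers and is not stated on the slide. A minimal Python sketch, with hypothetical dictionary keys:

```python
def pattern_confidences(supports):
    """Confidence of each pattern = its support / total support of all
    patterns with the same premise (an assumption inferred from the table)."""
    total = sum(supports.values())
    return {pattern: support / total for pattern, support in supports.items()}


# Supports taken from the table; the keys are illustrative labels.
confidences = pattern_confidences({"artist_pattern": 503, "genre_pattern": 103})
for pattern, conf in confidences.items():
    print(pattern, round(conf, 2))
```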
Cat2Ax Pattern Application
\[ \mathit{conf}(c, P, v) = \mathit{conf}(P) \cdot \mathit{score}_{rel}(c, p_P, v) \]
Examples:
\[ \mathit{conf}(\text{Reggae albums}, P_1, \text{dbr:Reggae}) = \mathit{conf}(P_1) \cdot \mathit{score}_{rel}(\text{Reggae albums}, p_{P_1}, \text{dbr:Reggae}) = 0.83 \cdot 0 = 0 \]
\[ \mathit{conf}(\text{Reggae albums}, P_2, \text{dbr:Reggae}) = \mathit{conf}(P_2) \cdot \mathit{score}_{rel}(\text{Reggae albums}, p_{P_2}, \text{dbr:Reggae}) = 0.17 \cdot 0.74 \approx 0.13 \]
RESULTS
Category Sets: 176,785 candidate sets found (containing 8 categories on average)
Patterns: 54,465 textual patterns found (99% imply types, 44% imply properties)
EVALUATION
Methodology: Manual evaluation by annotators on Amazon Mechanical Turk
Sample size: 1250 axioms and 1250 assertions with three annotations each
EVALUATION
OUTLOOK
• Create true hierarchy of Wikipedia categories and apply Cat2Ax
  - remove invalid edges in category graph
  - apply Cat2Ax to materialized graph
• Investigate and exploit dependencies between discovered patterns
  - find inconsistencies in generated axioms
  - find inconsistencies in pattern hierarchy
OUTLOOK
A large-scale semantic knowledge graph with
• a rich ontology compiled from the DBpedia ontology and Wikipedia Categories & Listpages
• class restrictions based on Cat2Ax
• additional instances extracted from Wikipedia Listpages
• 750K classes
• 880K subclass relations
• 200K relation axioms
• 7.2M typed instances (50% more than in DBP)
• 870K new instances from Listpages
More information on http://caligraph.org
THANK YOU
THRESHOLD DETERMINATION
Methodology: Manual evaluation of 50 type and relation axioms for each interval
RESULTS
Category Axioms: 430,405 type axioms and 272,707 relation axioms extracted
Assertions: 3,342,057 type assertions and 4,424,785 relation assertions added
[Figures: Coverage of DBpedia; Discovery in DBpedia]