The Consistency and Conformance of Web Document Collection Based on Heterogeneous DAC Graph

Marek Kopel and Aleksander Zgrzywa

www.iis.pwr.wroc.pl

www.zsi.pwr.wroc.pl

Outline

• Background & Idea
• Personal Web of Trust
• User and Agent Trust
• Local Document Ranking & Filtering
• Example Scenario
• Conclusions & Future Work

Relationships in WWW

• directed graph - most common model of a Web document collection
• documents' hyperlinking relationship (edges) → PageRank, HITS
• Tim Berners-Lee (in reference to the social aspect of Web 2.0): “I called this graph the Semantic Web, but maybe it should have been Giant Global Graph!”

Relationships in WWW (2)

There's more to a hyperlink than href:
• HTML 4.01 attributes rel and rev, e.g. used definitions:
  – navigation in a document collection (start, prev, next, contents, index)
  – structure (chapter, section, subsection, appendix, glossary)
  – meta (copyright, help)
• XHTML 2.0
  – custom namespaces

Relationships in WWW (3)

Popular relation ontologies:
• FOAF
• XFN microformat
  – friendship (contact, acquaintance, friend)
  – family (child, parent, sibling, spouse, kin)
  – professional (co-worker, colleague)
  – physical (met)
  – geographical (co-resident, neighbor)
  – romantic (muse, crush, date, sweetheart)

• rel- microformat -

Heterogeneous DAC Graph

• DAC graph
  – nodes of three types:
    • Document
    • Author
    • Concept
  – edges between nodes model the relationships
  – most of the relationships can be acquired directly from the Web data
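As an illustration of acquiring relationships directly from Web data, the sketch below collects rel and rev values from the links of one HTML page using Python's standard html.parser module. The names RelLinkParser and extract_relationships are hypothetical, and labelling XFN values as author-author edges and everything else as document-document edges is an assumption made for this example, not a rule stated on the slides.

```python
from html.parser import HTMLParser

# XFN values listed on the slides; assumed here to describe author-author relationships.
XFN_VALUES = {
    "contact", "acquaintance", "friend", "child", "parent", "sibling",
    "spouse", "kin", "co-worker", "colleague", "met", "co-resident",
    "neighbor", "muse", "crush", "date", "sweetheart",
}


class RelLinkParser(HTMLParser):
    """Collects (source, target, relation, kind) edges from <a> and <link> tags."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.edges = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        for name in ("rel", "rev"):
            # rel/rev hold space-separated link types; a missing attribute yields no values.
            for value in (attr.get(name) or "").lower().split():
                if name == "rel":
                    src, dst = self.source_url, href
                else:  # rev describes the reverse direction
                    src, dst = href, self.source_url
                kind = "author-author" if value in XFN_VALUES else "document-document"
                self.edges.append((src, dst, value, kind))


def extract_relationships(source_url, html):
    parser = RelLinkParser(source_url)
    parser.feed(html)
    return parser.edges


page = ('<a href="http://b.example/" rel="friend met">Bob</a>'
        '<link rel="next" href="ch2.html">'
        '<a href="toc.html" rev="contents">TOC</a>')
for edge in extract_relationships("http://a.example/", page):
    print(edge)
```

Each tuple can become a directed, typed edge of the DAC graph; how a relation name is turned into a numeric relationship value is left open here.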

Consistency and Conformance

• Consistency of a Web document collection
  – inner similarity concerning subject
    • similarly tagged (Web 2.0) – authors assigned the same tags or categories
    • same keywords (digital libraries)
• Conformance of a Web document collection
  – document authors' relationship
  – authors with strong relationship → often coauthors (agree on some subjects)
  – citing and referencing
  – Web of Trust

Relationships in DAC Graph

• Document

• Author

• Concept

Fragment of a DAC graph of a Web document collection

[Figure: DAC graph fragment with author nodes a1, a2, document nodes d1, d2, d3, d4 and concept nodes c1, c2]

Consistency Collection: document-concept graph

[Figure: document-concept view of the DAC graph fragment (nodes a1, a2, d1, d2, d3, d4, c1, c2)]

Conformance Collection: document-author graph

[Figure: document-author view of the DAC graph fragment (nodes a1, a2, d1, d2, d3, d4, c1, c2)]

Deriving Relationships

[Figure: graph fragment with nodes a1, a2, a3, c1, c2, c3, c4, c5 and d1]

For all authors $a_i$ that have relationships with both concept $c_1$ and document $d_1$:

$$\mathrm{rel}(c_1, d_1) \mathrel{+}= \mathrm{rel}(c_1, a_i) \cdot \mathrm{rel}(a_i, d_1)$$

For all concepts $c_i$ that have relationships with both concept $c_1$ and document $d_1$:

$$\mathrm{rel}(c_1, d_1) \mathrel{+}= \mathrm{rel}(c_1, c_i) \cdot \mathrm{rel}(c_i, d_1)$$

Consistency and Conformance
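The update rule from the Deriving Relationships slide, whose results feed the clustering step, can be sketched as follows. This is a minimal illustration, assuming relationship values are stored in a nested dictionary rel[source][target]; the function name derive_relationship and all numeric values are hypothetical.

```python
def derive_relationship(rel, source, target, proxies):
    """Return rel(source, target) increased by the two-hop products through each proxy.

    A proxy contributes only if it has a relationship with both endpoints.
    """
    derived = rel.get(source, {}).get(target, 0.0)
    for proxy in proxies:
        first_hop = rel.get(source, {}).get(proxy)
        second_hop = rel.get(proxy, {}).get(target)
        if first_hop is not None and second_hop is not None:
            derived += first_hop * second_hop
    return derived


# Illustrative (made-up) relationship values.
rel = {
    "c1": {"a1": 0.5, "a2": 0.25, "a3": 0.5, "c2": 0.5},
    "a1": {"d1": 0.5},
    "a2": {"d1": 1.0},
    "c2": {"d1": 0.25},
}
# a3 is skipped: it is related to c1 but has no relationship with d1.
rel["c1"]["d1"] = derive_relationship(rel, "c1", "d1", ["a1", "a2", "a3", "c2"])
print(rel["c1"]["d1"])  # 0.5*0.5 + 0.25*1.0 + 0.5*0.25 = 0.625
```

Looking values up as rel[source][target] keeps the direction of every relationship explicit, so asymmetric values need no special handling.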

• Subgraphs are clustered
  – only the relationships' values
• consistency collection graph
  – output biggest cluster's doc. nodes → consistent subcollection
• conformance collection graph
  – output biggest cluster's doc. nodes → conformable subcollection

$$\mathrm{consistency} = \frac{\mathrm{card}(\mathrm{cons\_sub}(C))}{\mathrm{card}(C)}$$

$$\mathrm{conformance} = \frac{\mathrm{card}(\mathrm{conf\_sub}(C))}{\mathrm{card}(C)}$$

• C – Web document collection
• cons_sub(C) – consistent subcollection of C
• conf_sub(C) – conformable subcollection of C

Conclusions and Future Work

• Relationships are asymmetric, so undirected → directed graph
• Relationship deriving using: paths with one → n proxy nodes
• Graph clustering:
  – MCA - Markov Cluster Algorithm (currently)
  – Other algorithms
  – Maximum clique technique
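To make the clustering step and the consistency measure concrete, here is a small pure-Python sketch of Markov clustering (expansion and inflation on a column-stochastic matrix) followed by the ratio card(cons_sub(C)) / card(C). It is a simplified illustration, not the authors' implementation: the function names, the inflation value 2.0, the added self-loops, and picking the biggest cluster by its number of document nodes are all assumptions.

```python
def column_normalize(m):
    """Scale every column of the square matrix m (a list of rows) to sum to 1, in place."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def markov_cluster(nodes, weights, inflation=2.0, max_iter=100, eps=1e-9):
    """Cluster a directed weighted graph; weights maps (source, target) -> value."""
    n = len(nodes)
    index = {node: i for i, node in enumerate(nodes)}
    m = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        m[index[v]][index[u]] = w      # column u holds the flow leaving u
    for i in range(n):
        m[i][i] = 1.0                  # self-loops, a common convention (assumed here)
    column_normalize(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (flow along two-step paths).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalise the columns.
        inflated = column_normalize([[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < eps:
            break
    # Rows with a positive diagonal are attractors; each one collects a cluster.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
                for i in range(n) if m[i][i] > eps}
    return [set(c) for c in clusters]


def consistency(documents, nodes, weights):
    """card(document nodes of the biggest cluster) / card(document collection)."""
    clusters = markov_cluster(nodes, weights)
    doc_sets = [c & set(documents) for c in clusters]
    return len(max(doc_sets, key=len)) / len(documents)


# Made-up document-concept graph: d1, d2, d3 share concept c1; d4 only has c2.
documents = ["d1", "d2", "d3", "d4"]
nodes = documents + ["c1", "c2"]
weights = {}
for d, c in [("d1", "c1"), ("d2", "c1"), ("d3", "c1"), ("d4", "c2")]:
    weights[(d, c)] = weights[(c, d)] = 1.0
print(consistency(documents, nodes, weights))  # 0.75
```

The same function applied to a document-author graph would give the conformance value; only the input graph changes.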

Q & A