Ontology Learning from Semi-Structured Web Documents
Dissertation for the award of the academic degree Doktoringenieur (Dr.-Ing.), accepted by the Faculty of Computer Science of the Otto-von-Guericke-Universität Magdeburg

by Dipl. Wirt.-Ing. (FH) Marko Brunzel, born 21 January 1977 in Meerane

Reviewers:
Prof. Dr. Steffen Staab
Prof. Dr. Myra Spiliopoulou
Prof. Dr. Andreas Dengel

Magdeburg, 17 February 2010

Contents

List of Figures
List of Tables
List of Algorithms
1 Introduction
  1.1 Motivation
  1.2 Using the Web for Ontology Learning
  1.3 Objectives
  1.4 Foundations
    1.4.1 Introductory Examples
    1.4.2 Notions of Sibling Relations
    1.4.3 Definitions
    1.4.4 Sibling Relations beyond Ontologies
  1.5 Outline
2 Related Work
  2.1 Learning from the Web
  2.2 Learning from HTML Documents
    2.2.1 Markup in General
    2.2.2 Tables
    2.2.3 Headings
    2.2.4 Lists
  2.3 Learning Sibling Relations
3 Group-By-Path
  3.1 Web Document Structures
  3.2 Group-By-Path Algorithm
  3.3 Real World Example and Application Outlook
  3.4 Related Work
    3.4.1 Wrapper
    3.4.2 XPath-Siblings
    3.4.3 XML Document Similarity
    3.4.4 Further Path based Approaches
  3.5 Summary
4 Learning Sibling Groups - XTREEM-SG
  4.1 XTREEM-SG Procedure
    4.1.1 Step 1 - Querying & Retrieving:
    4.1.2 Step 2 - Group-By-Path:
    4.1.3 Step 3 - Filtering:
    4.1.4 Step 4 - Vectorization:
    4.1.5 Step 5 - Clustering
    4.1.6 Step 6 - Cluster Labelling
  4.2 Evaluation Methodology
    4.2.1 Evaluation Criteria: Sibling Group Overlap
    4.2.2 Evaluation Reference
    4.2.3 Inputs
    4.2.4 Variations on Procedure and Parameters
  4.3 Experiments
    4.3.1 Experiment 1: Sibling Relations from Group-By-Path in contrast to alternative Methods
    4.3.2 Experiment 2: Sibling Relations from Labelled Clusters
    4.3.3 Experiment 3: Varying the Cluster Labelling Threshold
    4.3.4 Experiment 4: Varying the Number of Clusters
    4.3.5 Experiment 5: Varying the Topic Bias
    4.3.6 Experiment 6: Variations on the Minimum Support
    4.3.7 Experiment 7: Sampling on Tagpath Clustering
    4.3.8 Experiment 8: Frequent Itemsets in Comparison to Clusters
    4.3.9 Experiment 9: Tagpath Clustering in Comparison to Term Clustering
    4.3.10 Experiment 10: Sampling on Term Clustering
    4.3.11 Results from Term Clustering
  4.4 Conclusion
5 Learning Sibling Groups Hierarchies - XTREEM-SGH
  5.1 Hierarchical clustering for Sibling Groups Hierarchies
    5.1.1 Hierarchical Term Clustering
    5.1.2 Hierarchical Tagpath Clustering
    5.1.3 XTREEM-SGH Procedure
  5.2 Evaluation Methodology
  5.3 Experiments
    5.3.1 Experiment 1: K-Means in Comparison to Bi-Secting-K-Means
    5.3.2 Experiment 2: Different Observation Strategies on the Cluster Hierarchy
    5.3.3 Experiment 3: Best Matching Hierarchy Levels
  5.4 Conclusion
6 Learning Sibling Pairs - XTREEM-SP
  6.1 XTREEM-SP Procedure
    6.1.1 Step 4 - Co-Occurrence Counting
    6.1.2 Step 5 - Computing Association Scores
  6.2 Evaluation Methodology
    6.2.1 Evaluation Criteria: Precision and Recall
    6.2.2 Evaluation Reference
    6.2.3 Inputs
    6.2.4 Variations on Procedure and Parameters
  6.3 Experiments
    6.3.1 Experiment 1: Sibling Relations from Group-By-Path in contrast to alternative Methods
    6.3.2 Experiment 2: Association Measures in Comparison
    6.3.3 Experiment 3: Varying the Topic Bias
    6.3.4 Experiment 4: Variations on the Minimum Support
  6.4 Conclusion
7 Vocabulary Extraction with XTREEM-T
  7.1 Related Work
  7.2 XTREEM-T Procedure
    7.2.1 Step 1 - Querying & Retrieving:
    7.2.2 Step 2 - Markup Exploitation:
    7.2.3 Step 3 - Text span Counting:
    7.2.4 Step 4 - Order By Frequency:
  7.3 Evaluation Methodology
    7.3.1 Evaluation Criteria: Precision
    7.3.2 Inputs
  7.4 Experiments
    7.4.1 Experiment 1: Human Vocabulary Evaluation
    7.4.2 Experiment 2: N-Gram Level Distribution
    7.4.3 Experiment 3: POS Patterns
  7.5 Conclusion
8 Finding Synonyms with XTREEM-S
  8.1 Related Work
  8.2 XTREEM-S Procedure
    8.2.1 Step 1 - Querying & Retrieving:
    8.2.2 Step 2 - Group-By-Path:
    8.2.3 Step 3 - Filtering:
    8.2.4 Step 4 - Vectorization:
    8.2.5 Step 5 - First Order Association Computation:
    8.2.6 Step 6 - Second Order Association Computation:
  8.3 Evaluation Methodology
    8.3.1 Evaluation Criteria: Precision and Recall
    8.3.2 Evaluation Reference
  8.4 Experiment
  8.5 Conclusion
9 Domain Relevance enhanced Term Weighting for Learning Sibling Groups - XTREEM-SGT;DR
  9.1 Motivation
    9.1.1 Distorted Occurrence Distributions
    9.1.2 Interest towards Domain Relevant Terms
  9.2 Related Work
    9.2.1 Term Weighting
    9.2.2 Domain Relevance
  9.3 XTREEM-SGT;DR Procedure
  9.4 Evaluation Methodology
    9.4.1 Evaluation Criteria: DRSum
    9.4.2 Evaluation Reference
    9.4.3 Inputs
    9.4.4 Variations on Procedure and Parameters
  9.5 Experiments
    9.5.1 Experiment 1: DRSumI
    9.5.2 Experiment 2: DRSumII
    9.5.3 Experiment 3: DRSumIII
  9.6 Conclusion
10 Indexing and Retrieving of Sibling Terms with XTREEM-SL
  10.1 Related Work
  10.2 XTREEM-SL Procedure
    10.2.1 Creating the XTREEM-SL Index
    10.2.2 Term Retrieval on the XTREEM-SL Index
  10.3 Evaluation Methodology
    10.3.1 Evaluation Criteria: Rediscovering Rank
    10.3.2 Evaluation Reference
    10.3.3 Inputs
    10.3.4 Variations on Procedure and Parameters
  10.4 Experiments
    10.4.1 Experiment 1: Text span Length
    10.4.2 Experiment 2: Tagpath Cardinality
    10.4.3 Experiment 3: A Priori Evaluation
    10.4.4 Experiment 4: Occurrence Frequency
    10.4.5 Experiment 5: A Posteriori Evaluation
    10.4.6 Experiment 6: XTREEM-SL in Comparison to Google Sets
  10.5 Conclusion
11 Conclusions and Outlook
  11.1 Main Contributions
  11.2 Future Work
A Exemplary Ontology Structure
B Reference Sibling Groups from Gold Standard Ontologies
Bibliography

List of Figures

1.1 Ontology learning layer cake [Cimiano, 2006]. The layers examined in this thesis are highlighted.
1.2 Distinguished sub-ordination and co-ordination directions of concept hierarchies within the ontology learning layer cake
1.3 Example hierarchy of geographic entities (adopted from [Buitelaar and Cimiano, 2007], shown in appendix A, figure A.1). Sibling concepts are emphasized by dotted ellipses.
1.4 Example hierarchy of geographic entities where, in addition to the concepts shown in figure 1.3 as blue boxes, instances depicted by green boxes are present.
1.5 Exemplary usage of sibling items on an e-commerce website
1.6 Thesis overview. Dependencies between chapters.
3.1 Highlighted terms in an exemplary HTML Web document
3.2 Headings in an exemplary HTML Web document
3.3 Web document rendered in a Web browser
3.4 Source code of a Web document
3.5 Tree structure of a Web document
3.6 A Web document with its tagpaths and text spans
3.7 Grouping of text spans with the same preceding tagpath
3.8 An exemplary real-world Web document (http://www.seasky.org/reeflife/sea2i.html)
3.9 Tagpaths and text spans from Web document
3.10 Text spans from Web document grouped according to tagpaths
4.1 Dataflow diagram of the XTREEM-SG procedure
4.2 Exemplary fragment of a Group-By-Path vectorization
4.3 FMASO for different K and for different document representation methods (query1, τ=0.2) for (a) GSO1 and (b) GSO2
4.4 FMASO for different K and for different τ (query1) for (a) GSO1 and (b) GSO2
4.5 SOFICL for different K and τ (query1) for GSO1
4.6 NODFICL for different K and τ (query1) for GSO1
4.7 FMASO for different K and for different queries (τ=0.2) for (a) GSO1 and (b) GSO2
4.8 FMASO for different frequency support levels (query1, τ=0.2) for (a) GSO1 and (b) GSO2
4.9 Sampling for tagpath clustering for (a) GSO1 and (b) GSO2
4.10 Comparison of frequent itemsets and K-Means generated cluster labels for (a) GSO1 and (b) GSO2
4.11 Comparison of K-Means tagpath clustering to term clustering for (a) GSO1 and (b) GSO2
4.12 Sampling on term clustering for (a) GSO1 and (b) GSO2
4.13 Resulting clusters from term clustering for GSO1
4.14 Resulting clusters from term clustering for GSO2 - part.