Dissertation

The SnoopyConcept: Leveraging Recommendations for Knowledge Curation

Wolfgang Gassler

submitted to the Faculty of Mathematics, Computer Science and Physics of the University of Innsbruck in partial fulfillment of the requirements for the degree of “Doktor der technischen Wissenschaften”

Advisor: Univ.-Prof. Dr. Günther Specht

Innsbruck, August 11, 2017

Abstract

In recent years, knowledge curation on the internet has been lifted to a new level—online mass-collaboration. Knowledge bases such as Wikipedia and Wikidata are curated by huge communities and produce a vast amount of knowledge. Therefore, the prevailing challenge is to maintain a homogeneous structure in those knowledge bases to ensure efficient search capabilities, allow automated reasoning, and provide semantically linkable knowledge for the Linked Open Data cloud. In this thesis we propose the SnoopyConcept, which leverages recommender systems to support the user in homogenizing knowledge already during the insertion process. The recommendations furthermore aim at increasing the quality and quantity of stored knowledge in collaborative information systems. In addition, we implement the universal SnoopyConcept in several domains and systems to evaluate and assess the recommender-system-based approach.

Acknowledgements

Eva, Sarah, and Robert, thanks for 10 incredible years; we shared almost everything. We enjoyed the good moments, and handled the bad. Robert, you were not only my flat mate, you also challenged me every day, by sharing your deep technical knowledge. Thanks for all the fun weekends in the office working on world-shaking algorithms and software. And thanks for being such a good friend all the time. Sarah, you introduced me to the beautiful world of spontaneity, taught me how to be happy without being in control of everything, and pushed me out of my comfort zone to make bold steps. I never would have been at this point without you; thank you a lot for everything. Eva, you have accompanied me through the world of academia for more than 15 years. We’ve worked on the “Weltfrieden”, survived rainy days, and have been riding the waves. Thanks for listening, supporting, and coaching me everywhere and anytime, and of course, being my best friend.

Thanks to all the DBIS members: Niko, Doris, Rainer, Gabi, Sylvia, Peter, Michael, Matthias, and Lukas. Special thanks go to Michi, Martin, Seppi, Skifahrlehrer Schorsch, Hansi, and Dominic for all the fun we had together. Thanks to my supporting friends: Herbert, Ingä, Stefanie, Jakob, —Zugi—, and Niki, who never got tired of nudging me about my thesis — your relentless persistence paid off.

I also want to thank my advisor Günther Specht for giving me the opportunity to write this thesis, all the resources, and his support the whole time, even after I left DBIS after eight instructive years.

A special “thank you” goes to my whole family, especially my parents who have never forced me to do anything and have always supported me, no matter how ridiculous my idea was.

Thanks to all the amazing members and students at the Institute of Computer Science Innsbruck who taught me a lot, even outside the world of science.

This work was only possible due to the huge number of available Open Source tools. Only Open Source tools were used to build and evaluate the SnoopyConcept and to write all related publications and this thesis.

Thanks a lot to all the Open Source contributors.

This work was partially funded by the Austrian Research Promotion Agency (FFG) and the Nachwuchsförderung of the University of Innsbruck.

Eidesstattliche Erklärung (Declaration in Lieu of an Oath)

I hereby declare in lieu of an oath, by my own signature, that I have written this thesis independently and have not used any sources or aids other than those indicated. All passages taken literally or in substance from the cited sources have been marked as such. This thesis has not previously been submitted in the same or a similar form as a Magister/Master/Diploma thesis or dissertation.

Datum Wolfgang Gassler

The beautiful thing about learning is nobody can take it away from you.

B.B. King, 1997

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Aims and Research Questions
  1.2 Contributions and Published Work
  1.3 Thesis Outline

2 Background & Related Work
  2.1 Information Representation & Structure
    2.1.1 Strictly Structured Information Systems
    2.1.2 Unstructured Information Systems—Wikis
    2.1.3 Semi-structured Data
  2.2 Wiki Systems and Knowledge Curation
    2.2.1 Wikipedia and its Structural Limitations
    2.2.2 Wiki Enhancements
    2.2.3 Semantics, Wikidata & Knowledge Bases
  2.3 Recommender Systems
    2.3.1 Introduction to Recommender Systems
    2.3.2 Classification
    2.3.3 Collaborative Filtering
    2.3.4 Content-based Recommender Systems
    2.3.5 Knowledge-based Recommender Systems
    2.3.6 Context-aware Recommender Systems
    2.3.7 Hybrid Recommender Systems
    2.3.8 Summary
  2.4 Recommender Systems in Information Systems
    2.4.1 Synonyms and Search Capabilities
    2.4.2 Guided Refinement and Enrichment
    2.4.3 Guidance during the Insertion
  2.5 Database Models
    2.5.1 Relational Database Systems
    2.5.2 NoSQL: Document-oriented Stores
    2.5.3 NoSQL: Graph Stores
  2.6 Summary

3 The SnoopyConcept
  3.1 Key Idea: Incorporate the User
  3.2 Data Model
  3.3 Recommendations
    3.3.1 Recommending Structure
    3.3.2 Avoiding Synonyms by Recommendations
    3.3.3 Semantic Refinement and Value Recommendations
    3.3.4 Validation and Recommendations of Data Types
    3.3.5 Ranking
  3.4 Recommendation Algorithm
    3.4.1 Basic Recommendation Algorithm
    3.4.2 Ranking Algorithms
    3.4.3 Context-Sensitive Optimization
    3.4.4 Refinement Recommendations
  3.5 Personalizing the SnoopyConcept
    3.5.1 User Modeling
    3.5.2 Algorithm Extension by User Modeling
  3.6 Tackling the Cold-Start Problem
  3.7 Summary

4 SnoopyConcept Storage Models and Recommendation Implementation
  4.1 Relational Database Systems
    4.1.1 RDF Storage Models
    4.1.2 RDF Storage Index Structures
    4.1.3 SnoopyConcept Relational Storage Model
    4.1.4 Recommendation Computation
    4.1.5 Rule Based Recommendation
  4.2 NoSQL: Document-oriented Stores
    4.2.1 Storage Model
    4.2.2 Recommendation Computation
    4.2.3 Distribution and Sharding
  4.3 NoSQL: Graph Stores
    4.3.1 Storage Model
    4.3.2 Graph-based Rule Generation
    4.3.3 Graph-based Recommendation Algorithm
  4.4 Storage Model Evaluation
  4.5 Storage Models Summary

5 SnoopyConcept Showcases
  5.1 First Reference Implementation: SnoopyDB
  5.2 SnoopyTagging
    5.2.1 Structured Tags
    5.2.2 Recommendations based on the SnoopyConcept
  5.3 hash5
    5.3.1 Main Concepts
    5.3.2 Recommendations based on the SnoopyConcept
  5.4 Wikidata
    5.4.1 Wikidata Property Suggestor
    5.4.2 Incorporating the SnoopyConcept Algorithm
  5.5 Summary

6 Evaluation
  6.1 Recommendation Algorithms
    6.1.1 Offline Evaluation
    6.1.2 Wikidata Property Suggestor Evaluation
    6.1.3 Online Evaluation/User Experiment
  6.2 User Modeling Evaluation
  6.3 Evaluation Summary

7 Conclusion

Bibliography

CHAPTER 1

Introduction

The proverb Scientia potentia est¹ has emphasized the importance of knowledge for centuries. Especially since the beginning of the information age [37], the economy has been based on the computerization of information, and thus access to knowledge has become crucial. Information systems and knowledge bases are used to store knowledge and provide users with access to it. In the past, information systems were used solely by specialists and usually maintained by a small group of people. This has changed drastically with the rise of the internet and the Web 2.0 movement, which has lifted the curation of knowledge to a new level—online mass-collaboration.

In mass-collaborative information systems, such as Wikipedia², thousands of people with different backgrounds collaborate to curate knowledge. Those platforms usually do not force the user to use a predefined structure to store and structure information. The shortcoming of this structure-less paradigm is its limited search capability. Consider a complex query such as "Which Austrian cities have more than 10,000 inhabitants and have a female mayor who has a doctoral degree?". Such a query cannot be answered by the full-text search provided by most wiki systems. Weikum et al. [171] observed that modern information systems have to be able to support both structured and unstructured data to combine the advantages of both worlds and to be able to answer such questions. One possible solution is the usage of semi-structured information systems (cf. Section 2.1.3), which allow for defining a structure but do not require a predefined schema. For example, tagging systems, fact-based knowledge bases, document-oriented databases or XML files provide the possibility to store arbitrary data in a structured way without adhering to a predefined schema.

1 Knowledge is power, https://en.wikipedia.org/w/index.php?title=Scientia_potentia_est&oldid=784909874 (revision 10 June 2017)
2 http://www.wikipedia.org, accessed 2017-07-17

The flexibility of the previously described schemaless, semi-structured storage in combination with collaborative data curation leads to a massive problem. Every single user has her own view of structuring knowledge and information and uses her own terminology. Furnas et al. [59] already showed in the 1980s that two people spontaneously choose the same word for an object with a probability of less than 20%. This suggests that collaboratively built knowledge based on the semi-structured model shows a very high proliferation of structures, schemata and vocabulary. The resulting heterogeneous schema impedes the search facilities, as a common schema is essential to answer complex structured queries. E.g., a user who searches for numberOfStudents cannot find information which was stored using the properties students, numberStudents or num students. Therefore, especially in collaborative knowledge systems, the creation of a common schema without restricting the domain, type or amount of information is desired. Wikipedia fights such heterogeneity by introducing collaboratively created templates and through extended supervision by its committed community. The task of creating structure in Wikipedia is very demanding for the community, as shown by Boulain et al. [26]. The authors analyzed the edits in Wikipedia and identified that only 35% of all edits within Wikipedia are related to content, whereas all other edits aim at enhancing the structure within the Wikipedia knowledge base.

In this thesis, we propose a self-learning system which leverages recommender systems to guide the user to a homogeneous schema without restricting her in her way of structuring information. The recommendations are based on information already stored in the system and aim to prevent synonyms or diverging structures which would impede the search capabilities.

Another severe problem in collaboratively built knowledge bases is the barrier for new users to insert information into the knowledge base. In Wikipedia, most of the content is created by a very small group of users [94]. Furthermore, articles have to conform to many policies and other regulations, which increases the barrier for contributions by newcomers [133, 162]. Not only in public knowledge bases, but especially in enterprise knowledge bases and wikis, the poor rate of content contribution constitutes a major problem. Besides social group phenomena, the high cost for users to contribute to and manage wikis (e.g. wiki syntax, complex user interfaces, etc.) is the main reason for a poor contribution rate [97].

Therefore, in this thesis we introduce the SnoopyConcept, which aims at incorporating the user already during the insertion process. An intuitive and guided process, facilitated by recommendations, supports the user in curating semi-structured knowledge with a common and normalized schema. The goal of this assistance is to lower the entry barrier and increase the amount of stored knowledge in the information system.

1.1 Aims and Research Questions

In the introduction we mentioned several challenges in the area of collaborative information systems which are tackled by the SnoopyConcept presented in this thesis. The SnoopyConcept aims at incorporating the user already during the insertion process to homogenize the structure of stored knowledge and thus improve the search capabilities. Therefore, we propose to leverage recommender systems to recommend highly suitable structures based on knowledge already stored in the system. The overall goal of providing guidance and recommendations to the user is to increase the quality and quantity of knowledge stored in the information system. The aims can be summarized by the following research questions which are addressed in this thesis:

• How can recommender systems empower collaborative information sys- tems to become more structured without losing their flexibility?

• How can direct user communication during the insertion process be facilitated by recommender systems to increase the quality of information?

• How can automated user guidance by a recommendation system result in an increased quantity of information?

1.2 Contributions and Published Work

This thesis mainly covers contributions that have been made and published between 2009 and 2014.

The main contribution is the SnoopyConcept, which tackles the previously described research questions by providing smart recommendations already during the insertion process. This approach is the first work which uses a recommender system to prevent schema proliferation and at the same time increase the amount of knowledge stored in the information system. The core idea of the SnoopyConcept was first published in 2010 at the ACM Conference on Hypertext and Hypermedia [61]. Subsequently, we identified different types of recommendations which were implemented in a fully functional prototype called SnoopyDB. At the ACM Recommender Systems Conference 2010, the most widely known conference on recommender systems, a more detailed description of the underlying algorithms was published [182]. At the CTS conference 2011 (International Conference on Collaboration Technologies and Systems), the paper "The Snoopy Concept: Fighting Heterogeneity in Semistructured and Collaborative Information Systems by using Recommendations" [60] was nominated for the best paper award. Based on this paper, an extended version was published in the Journal on Future Generation Computer Systems (impact factor 1.978) [64], and an extended survey about approaches tackling structure heterogeneity was published as a book chapter with the title "Dealing with Structure Heterogeneity in Semantic Collaborative Environments" in the book "Collaboration and the Semantic Web: Social Networks, Knowledge Networks and Knowledge Resources" [183]. We furthermore developed different models to implement the storage system and recommendation engine of the SnoopyConcept, which are presented and discussed in Chapter 4. In this chapter, we also propose a scaling approach to cope with very big datasets. Besides the SnoopyConcept reference implementation SnoopyDB, we evaluated the SnoopyConcept in several other domains. For this purpose, we developed several prototypes which are described in detail in Chapter 5. The SnoopyTagging approach, which implemented the SnoopyConcept and furthermore extended the algorithm by incorporating user preferences, was published at the World Wide Web Conference 2012 (WWW2012) [63]. The developed personal information system hash5 incorporates the SnoopyConcept and additionally recommends structures, tags, and values based on the full text in PIM entries. The recommendation of Wikipedia infobox structures using the SnoopyConcept approach was published in 2011 at the ACM Conference on Hypertext and Hypermedia [103]. Still in 2016 the SnoopyConcept approach was relevant, and we proposed algorithms based on the SnoopyConcept to recommend structures in Wikidata at the OpenSym 2016 conference (International Symposium on Open Collaboration) [184]. The following list contains all my publications related to the SnoopyConcept.

• W. Gassler, E. Zangerle, M. Tschuggnall, and G. Specht. SnoopyDB: Narrowing the Gap between Structured and Unstructured Information using Recommendations. In HT'10, Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, Toronto, Ontario, Canada, June 13-16, 2010, pages 271–272, 2010


• E. Zangerle, W. Gassler, and G. Specht. Recommending Structure in Collaborative Semistructured Information Systems. In Proceedings of the fourth ACM Conference on Recommender Systems, pages 261–264, Barcelona, Spain. ACM, 2010

• W. Gassler and E. Zangerle. Recommendation-Based Evolvement of Dynamic Schemata in Semistructured Information Systems. In Proceedings of the 22nd Workshop Grundlagen von Datenbanken (GvDB 2010), Bad Helmstedt, Germany. CEUR-WS.org, 2010

• W. Gassler, E. Zangerle, and G. Specht. The Snoopy Concept: Fighting Heterogeneity in Semistructured and Collaborative Information Systems by using Recommendations. In The 2011 International Conference on Collaboration Technologies and Systems (CTS 2011), pages 61–68, Philadelphia, PA, 2011

• A. Larcher, E. Zangerle, W. Gassler, and G. Specht. Key Recommendations for Infoboxes in Wikipedia. Website of the 22nd ACM Conference on Hypertext and Hypermedia, 2011. Poster Presentation

• E. Zangerle and W. Gassler. Dealing with Structure Heterogeneity in Semantic Collaborative Environments. In Collaboration and the Semantic Web: Social Networks, Knowledge Networks and Knowledge Resources. IGI Publishers, Hershey, Pennsylvania (USA), 2012

• W. Gassler, E. Zangerle, M. Bürgler, and G. Specht. SnoopyTagging: Recommending Contextualized Tags to Increase the Quality and Quantity of Meta-information. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion, pages 511–512, Lyon, France. ACM, 2012

• W. Gassler, E. Zangerle, and G. Specht. Guided Curation of Semistructured Data in Collaboratively-built Knowledge Bases. Journal on Future Generation Computer Systems, 31:111–119, May 2014. Impact factor 1.978.

• E. Zangerle, W. Gassler, M. Pichl, S. Steinhauser, and G. Specht. An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases. In Proceedings of the 12th International Symposium on Open Collaboration, OpenSym 2016, Berlin, Germany, August 17-19, 2016, 18:1–18:8. ACM, 2016


Besides the publications related to this thesis, contributions have also been made in the fields of databases [65], main-memory graph storage [22], Twitter hashtag recommendations [186, 187, 188] and music recommendations [185, 189]. The following list contains all other publications which were made throughout the course of my PhD studies:

• R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying. In Proceedings of the 1st International Workshop on Semantic Data Management (SemData) at the 36th International Conference on Very Large Data Bases (VLDB 2010), Singapore. CEUR-WS.org, 2010

• W. Gassler, E. Zangerle, and G. Specht, editors. Proceedings of the 23rd GI-Workshop ”Grundlagen von Datenbanken 2011”, Obergurgl, Austria, May 31 - June 03, 2011, volume 733 of CEUR Workshop Proceedings, 2011. CEUR-WS.org

• R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: A Native Main Memory Approach for Graph Storage. In Proceedings of the 23rd Workshop Grundlagen von Datenbanken (GvDB 2011), Obergurgl, Austria. CEUR-WS.org, ISSN 1613-0073, Vol. 733, 2011

• E. Zangerle, W. Gassler, and G. Specht. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web 2011 in connection with the 19th International Conference on User Modeling, Adaptation and Personalization, UMAP 2011, pages 67–78, Gerona, Spain. CEUR-WS.org, 2011

• E. Zangerle, W. Gassler, and G. Specht. Using Tag Recommendations to Homogenize Folksonomies in Microblogging Environments. In Social Informatics: Third International Conference, SocInfo 2011, Singapore, October 6-8, 2011. Proceedings. Springer Berlin Heidelberg, 2011, pages 113–126

• E. Zangerle, W. Gassler, and G. Specht. Exploiting Twitter’s Collective Knowledge for Music Recommendations. In Proceedings, 2nd Workshop on Making Sense of Microposts (#MSM2012): Big things come in small packages, Lyon, France, 16 April 2012, pages 14–17, 2012

• E. Zangerle, W. Gassler, and G. Specht. On the impact of text similarity functions on hashtag recommendations in microblogging environments. Social Network Analysis and Mining, 3(4):889–898, 2013


• E. Zangerle, M. Pichl, W. Gassler, and G. Specht. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the 1st ACM International Workshop on Internet-Scale Multimedia Management, ISMM ’14, pages 21–26, Orlando, Florida, USA. ACM, June 2014

1.3 Thesis Outline

This dissertation is structured as follows. Chapter 2 features an introduction to information systems, wikis and recommender systems in the area of information systems. The chapter also covers database models that are used throughout the course of this thesis. In Chapter 3, we present the SnoopyConcept and the key idea to use recommender systems to increase the quality and quantity of knowledge in information systems. Chapter 4 presents the implementation of the SnoopyConcept with regard to the used storage models and optimizations of the proposed recommendation algorithms. In Chapter 5, we describe our reference implementation of the SnoopyConcept and present several other showcases which incorporate the SnoopyConcept in different domains. Chapter 6 presents and discusses the results of the conducted offline and online experiments which evaluate the SnoopyConcept algorithms. Chapter 7 concludes this thesis.


CHAPTER 2

Background & Related Work

The SnoopyConcept aims at leveraging recommender systems to support collaborative knowledge curation. Therefore, we focus in this chapter on foundations and related work in the area of collaborative knowledge curation, more specifically wiki systems in Section 2.2. Furthermore, we introduce recommender systems in Section 2.3 and their application in the field of information systems and knowledge bases in Section 2.4. The last section is dedicated to database models that are used as an underlying basis for all presented SnoopyConcept storage approaches and concepts.

2.1 Information Representation & Structure

Knowledge is always structured but can be represented in different ways. We distinguish between three main classes of information systems depending on the knowledge representation and the strictness of structure in the system. In the following sections the three classes are described in detail.

2.1.1 Strictly Structured Information Systems

Strictly structured information systems are based on the relational model and have been used since the advent of digital information processing. They provide a fixed data schema which is developed by an administrator or developer in advance. The end user has to adapt the information and knowledge to the predefined schema in order to store any data in the system. A common example is that of classical banking applications which provide the possibility to store bank transactions. Each transaction has to adhere to the predefined schema. Additional concepts such as normal forms or integrity constraints, which are very common in the world of relational data, further restrict the data. These restrictions and constraints are the key features of relational databases and strictly structured information systems, as the quality of the stored data is very high and verified by additionally defined constraints which are guaranteed by the database system. Due to these properties, the relational model is best suited for applications which are used in a fixed domain. As most modern business applications are limited to one single domain, the relational model can be used to store all data in a very efficient way. Modern relational databases are able to handle hundreds of thousands of transactions per second [93], where each entry is strictly structured and adheres to a predefined schema. If an application needs more flexibility regarding the schema, as not all data points adhere to the same structure, the relational model runs up against its limits and other models are needed to process heterogeneously structured data in an efficient way. Those models are discussed in the two following sections.
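The following minimal sketch, which is not part of the thesis and only illustrates the strictly structured paradigm described above, uses Python's built-in sqlite3 module; the table and column names are invented for this example. The predefined schema and its constraints are enforced by the database system itself, so non-conforming data is rejected at insertion time.

import sqlite3

# Strictly structured storage: the schema is fixed in advance and enforced
# by the database system (illustrative table and column names).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bank_transaction (
        id        INTEGER PRIMARY KEY,
        account   TEXT NOT NULL,
        amount    REAL NOT NULL CHECK (amount <> 0),
        booked_at TEXT NOT NULL
    )
""")

# Data that adheres to the predefined schema is accepted ...
conn.execute(
    "INSERT INTO bank_transaction (account, amount, booked_at) VALUES (?, ?, ?)",
    ("AT61 1904 3002 3457 3201", -42.5, "2017-07-17"),
)

# ... while data violating the schema is rejected by an integrity constraint.
try:
    conn.execute(
        "INSERT INTO bank_transaction (account, amount, booked_at) VALUES (?, ?, ?)",
        (None, 10.0, "2017-07-17"),
    )
except sqlite3.IntegrityError as error:
    print("rejected:", error)

Rejecting the second insert because of the NOT NULL constraint is exactly the kind of guarantee the relational model provides and that the schemaless models discussed in the following sections give up in favor of flexibility.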

2.1.2 Unstructured Information Systems—Wikis

With the advent of the Web 2.0, which has shifted content creation from the author to the website users, new storage models have been introduced. The key feature of these models is the handling of loosely structured or unstructured information and knowledge. One of the most successful concepts is the wiki, which allows storing knowledge of any type or structure. Most wikis realize this feature by simply offering a plain text field to insert any textual information into a page. Additional features such as interlinking or infoboxes provide the user with the possibility to add further structural information to the plain text. The best-known example is Wikipedia, which has become the world's largest knowledge base [110] by using this flexible paradigm. An overview of wikis, their features, used technologies and their drawbacks can be found in Section 2.2. A combination of the strictly structured approach and the unstructured wiki approach is discussed in the following section.


2.1.3 Semi-structured Data

The semi-structured data model incorporates both the structured and the unstructured storage paradigm. The need for such a new storage paradigm already arose in the 1990s [147, 32]. Back then it became clear that it would be increasingly important to be able to store mostly unstructured data while at the same time providing efficient and structured querying facilities. With the advent of the World Wide Web, which currently forms the largest unstructured knowledge base, it became obvious that such data cannot be fitted into a predefined schema in order to be able to query it. The application of traditional retrieval and extraction techniques to query such unstructured data yielded unsatisfactory results, as the formulation of structured and precise queries (e.g. "Which Austrian cities have more than 10,000 inhabitants and have a female mayor who has a doctoral degree?") was not possible due to the lack of structure. Thus, the semi-structured data model combines both the structured and the unstructured data model and provides a highly flexible way of storing data, as it supports the storage of information in a structured way without the need to specify a predefined schema.

This approach has also been adopted by several wiki systems. For example, the Semantic Mediawiki [100] allows specifying semantic information in a semi-structured way (cf. Section 2.2). In 2012, Wikipedia also introduced a new project called Wikidata (cf. Section 2.2.3), which offers the insertion of semantic data that is computer-processable, in contrast to the fulltext format of classical articles on Wikipedia.

Throughout the last two decades, various models for semi-structured data have been developed, e.g. [4, 38]. Currently, the most popular example of the semi-structured data model is RDF (Resource Description Framework, a W3C recommendation¹) [95]. RDF basically models knowledge as triples consisting of a subject, a predicate and an object. The subject (also called resource) is described by multiple pairs of predicates and according objects. The resource is uniquely identified by a URI (Uniform Resource Identifier²). Important facts about the University of Innsbruck within a knowledge base can be stored using triples as shown, e.g., in Listing 2.1.

Listing 2.1: Triples

<University_of_Innsbruck>  <numberOfStudents>  <26626>
<University_of_Innsbruck>  <foundedIn>         <1669>

1 http://www.w3.org/RDF/, accessed 2017-07-17
2 http://www.w3.org/TR/uri-clarification/, accessed 2017-07-17


These predicate-object pairs—in combination with the URI of the article itself (the resource, in this case the University of Innsbruck)—constitute triples. The subjects, predicates and objects are not restricted in any way and can therefore hold any information, while at the same time providing structure due to the triple concept, as all objects are described by their according predicates. The triples are machine-readable and processable and thus provide the basis for structured access and complex structured queries. In order to query such semi-structured RDF knowledge bases, the standard query language is SPARQL (SPARQL Protocol and RDF Query Language) [135].
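As a minimal sketch of how such triples can be stored and queried with SPARQL, the following Python example uses the rdflib library; the namespace and predicate names are assumptions made for this illustration and do not refer to a specific knowledge base.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Illustrative namespace; real knowledge bases use their own vocabularies.
EX = Namespace("http://example.org/")

g = Graph()
uibk = EX["University_of_Innsbruck"]
g.add((uibk, EX.numberOfStudents, Literal(26626, datatype=XSD.integer)))
g.add((uibk, EX.foundedIn, Literal(1669, datatype=XSD.integer)))

# A structured SPARQL query over the semi-structured triples.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?uni ?students WHERE {
        ?uni ex:numberOfStudents ?students .
        FILTER (?students > 10000)
    }
""")
for row in results:
    print(row.uni, row.students)

The FILTER clause shows how a structured constraint (here, more than 10,000 students) can be expressed even though no schema was defined beforehand.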

As RDF is very flexible and is able to link even to external resources by specifying an external URI as the object of a triple, the interlinking between knowledge bases has become very popular. Sir Tim Berners-Lee coined the term Linked Open Data (LOD) [23] for such linked and semi-structured data (cf. Section 2.2.3). Further information about knowledge bases can also be found in Section 2.2.3.

It is important to note that within semi-structured systems, users can arbitrarily choose the predicate used for storing information. This is very beneficial as it provides a huge amount of flexibility to the users of the system while—due to the predicate-object format—still featuring a certain amount of structure. This flexibility is crucial in online mass-collaboration information systems, as there are thousands of different users who come from different social levels and backgrounds and edit information of different domains and contexts. Along with this essential feature of flexibility, several challenges arise, such as preserving a homogeneous structure.

To tackle those challenges related to the maintenance of a homogeneous schema, we propose in the SnoopyConcept to leverage recommender systems to guide the user to a homogeneous structure already during the insertion process. More details about the approach can be found in Section 3.1.

2.2 Wiki Systems and Knowledge Curation

Wiki systems are widely used to facilitate mass-collaborative knowledge curation in public projects such as Wikipedia [110] but also in closed corporate environments [111]. Therefore, many research projects have dealt with wikis, the behaviour of their users, and extensions and improvements in the ecosystem of Wikipedia. In this section we discuss wiki approaches that improve knowledge curation, but also drawbacks and limitations of wiki systems, which can be mapped to the SnoopyConcept as they describe universal problems in the area of knowledge curation.


2.2.1 Wikipedia and its Structural Limitations

In the context of the SnoopyConcept, especially the topics of semi-structured and machine-readable data in Wikipedia and its quality and quantity are most relevant. The core of Wikipedia, creating and maintaining simple plain text documents, does not support structured or machine-readable information at all. Information retrieval in Wikipedia is limited to full-text search and the usage of links. Especially the interlinking of articles is very important to structure and categorize content and knowledge and to enhance search capabilities. Already in 2005, Krötzsch et al. [101] pointed out the limitations of the link and category system in Wikipedia³, which was realized on top of the classical page-oriented system of Mediawiki. Many link pages which just contain a list of articles, for example a list of all cities in Austria, were created and have to be maintained and updated manually. To create a more restricted view, e.g., of all capital cities in Austria, a new page has to be created and maintained. This example is relatively static as cities do not change that often. When considering a more dynamic example, such as a list of all Open Source wiki software, the change frequency increases dramatically. Keeping all link pages and categories up to date by using a manual approach is very error prone and time intensive.

Due to this huge amount of manual tasks, bots⁴ have been running since the beginning of Wikipedia to fulfill easy tasks automatically and support the committed community of Wikipedia. Currently, there are about 2,000 bots⁵ which take care of Wikipedia's content. However, they are limited to simple tasks (e.g. informing authors about syntax errors), as semantic reasoning is a difficult task, especially when dealing with Wikipedia's plain text content, which provides almost no structure at all.

The only available structure in Wikipedia is the linking system which has been heavily exploited by many research projects. Very popular examples are dictionaries and thesauri that are based on the interlinking system and disambiguation pages of Wikipedia [159, 117]. In Section 2.2.3 we present more approaches which heavily rely on the knowledge stored in Wikipedia.

3 https://en.wikipedia.org/w/index.php?title=Wikipedia:Categories,_lists,_and_navigation_templates&oldid=710181660, accessed 2017-07-17
4 https://en.wikipedia.org/w/index.php?title=Wikipedia:Bots&oldid=705099747, accessed 2017-07-17
5 https://en.wikipedia.org/w/index.php?title=Wikipedia:Bots&oldid=705099747, accessed 2017-07-17


In the next section, shortcomings of the first wiki systems and of Wikipedia are discussed, and improvements in the wiki ecosystem in research as well as in corporate environments are shown.

2.2.2 Wiki Enhancements

Since the introduction of the first wiki systems, which only provided basic editing functionality, many improvements and additional features have been introduced to the world of collaborative editing in wikis. Especially off the beaten track of Wikipedia, which has shown a very inert ability to innovate [154], many innovations have been introduced and developed in the enormous ecosystem of wiki systems. Two major shortcomings of classical wiki systems are relevant in the context of this thesis: the complicated process of inserting information and the way of finding information in wiki systems.

Enhancing the Editing Process

Most wiki systems provide a wiki-specific markup language to create and edit content and knowledge. The editing itself is done in a simple plain text field which contains the markup code. Especially at the beginning, wikis were used by rather technical or experienced users who were able to understand and use a markup language. From a technical perspective, markup editors are easy to implement but constitute a barrier for less experienced users. In general, contributing to a wiki system requires knowledge in many areas. The user has to be aware of all rules and the overall process of creating new knowledge. Furthermore, the user has to be able to cope with all technical barriers such as the markup language or other structural enrichment like infoboxes or links between articles. Halfaker et al. showed in a study in 2013 [67] that the severe drop-out rate of Wikipedia contributors, about 30% between 2006 and 2011, can be traced back to the barriers that have to be overcome in order to contribute to Wikipedia. Besides rules, regulations and the missing knowledge about the contribution process, they also mentioned the lack of a simple visual editor [144, 145]. To improve the user experience, many wiki systems have introduced WYSIWYG (what you see is what you get) editors which transparently hide the underlying markup language. The Wikimedia Foundation growth team⁶, which was initiated to increase the number of editors again, introduced a peer group system to support editors⁷ and a new visual editor⁸ which was initially launched in 2012 for testing and has been used as the default editor in Wikipedia since 2015. Pfisterer et al. conducted a user study [129] to evaluate how the semantic features of the Semantic MediaWiki can be promoted to unfamiliar users and, in general, how the user interaction process can be optimized. Based on the results, the authors proposed and also implemented improvements of the user interface, such as the direct editing of annotations without editing the source code, which dramatically lowers the barrier to adding annotations.

6 https://www.mediawiki.org/wiki/Growth, accessed 2017-07-17
7 https://meta.wikimedia.org/wiki/Research:Teahouse, accessed 2017-07-17
8 https://en.wikipedia.org/wiki/VisualEditor, accessed 2017-07-17

Due to those findings, the SnoopyDB reference implementation of the SnoopyConcept presented in Section 5.1 puts a strong emphasis on the usability of the editor and its recommendations. For example, the SnoopyDB prototype aims at identifying the type of an inserted value and suggests units or semantic enhancements. This leads to an increase in the quality of the inserted knowledge, as described in Section 3.3.3 in more detail. Furthermore, the SnoopyConcept recommendations point to missing information in the system and thus encourage the user to insert more information into the system.

Enhancing the Navigation and Search

Classical wikis are based on the hypermedia principle, which features navigation by links; categories or tree structures are not provided by default. Especially wikis in the enterprise environment, such as Confluence⁹, have added support for navigating using a tree structure. Trees are well known as they are used on websites, in content management systems and in file browsers, and are therefore naturally understood by many users. Furthermore, content can be classified by using the tree structure and attached to departments or other internal structures of an enterprise.

9 https://www.atlassian.com/software/confluence, accessed 2017-07-17

Zubiaga et al. [191] identified three main navigation patterns in Wikipedia. Besides the simple keyword-based approach, there are two major patterns. Category-driven navigation describes browsing by using Wikipedia's taxonomy, which is limited as Wikipedia's category system is manually maintained and not fully complete. A more successful approach is to use the link system within Wikipedia. The high number of links is one of the main characteristics of Wikipedia and therefore well suited for link-driven navigation. Nevertheless, if a link is missing, the user cannot be sure whether the information is simply not linked or not present in the system at all. Therefore, Zubiaga et al. proposed a social tagging system to improve the search capabilities. Tags can be used for searching but also for filtering, e.g., gathering documents containing a tag but excluding another one. Sweetwiki [31] aimed for an improvement of the user experience by providing easy-to-use editors and by adding semantic features to improve the search experience. Besides social tagging mechanisms and a SPARQL interface, the wiki includes an ontology builder to maintain the taxonomy that can be used to further structure the content in Sweetwiki. WikSAR [15] represented a wiki as an interactive graph that can be used to understand complex structures. Furthermore, its query language could be used to generate collections which are updated automatically based on the query. As underlying technologies, the semantic standards RDF and SPARQL were used.

The presented optimizations and approaches show that the gap between plain-text knowledge and structured knowledge is crucial, as it directly influences the navigation and search capabilities. The SnoopyConcept addresses this issue by proposing semantic enhancements, e.g., specifying semantic links to other subjects. Those links can be used to navigate the knowledge base but also increase the search capabilities. The following sections describe more approaches that also try to bridge this gap and build computer-processable knowledge.

2.2.3 Semantics, Wikidata & Knowledge Bases

One of the most successful meta projects based on the knowledge of Wikipedia is DBpedia [14]. DBpedia parses all textual information of Wikipedia articles, extracts semantic information and provides the knowledge of Wikipedia in a computer-processable format. One of the most important sources in Wikipedia pages are infoboxes, which are tabular aggregations of the most important facts of a Wikipedia article. Figure 2.1 shows the article "Innsbruck" and its infobox.

Figure 2.1: Infobox in Wikipedia article about Innsbruck

In contrast to fulltext, these facts have clear semantics as they are inserted as key-value pairs. DBpedia extracts these pairs from the complex markup language, applies cleaning filters and finally converts the extracted knowledge into a computer-processable format such as RDF. Furthermore, DBpedia incorporates all links, categories and other structures which are used to classify and structure Wikipedia articles. The DBpedia release 2016-04¹⁰ consists of 9.5 billion triples which were extracted from Wikipedia and interlinked with many other datasets in the LOD cloud. DBpedia is one of the most important sources in the LOD cloud and is often shown as the central repository which is used as a linkage base for all other datasets¹¹. The DBpedia dataset was also used in our offline evaluation presented in Section 6.1.1.

10 http://wiki.dbpedia.org/dbpedia-version-2016-04, accessed 2017-07-17
11 Linking Open Data cloud diagram, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net, accessed 2017-07-17

Google also introduced a new feature which relies on the knowledge of Wikipedia and is called the Knowledge Graph [83]. This graph is exploited to better understand search terms and to show a summary (cf. Figure 2.2) of important facts regarding the entered search terms. Considering the search term "Albert Einstein", Google shows a box consisting of photos of Albert Einstein, important facts such as his birthday, and relations to other persons, e.g. Stephen Hawking. Besides Wikipedia, Google's Knowledge Graph is also based on Freebase¹², which was acquired by Google in 2010. Freebase [24] stores about 1.9 billion facts (June 2017) about well-known people, places, music, books and things in general. It is publicly available and can be accessed by multiple APIs and interfaces. In contrast to DBpedia, which is based on extracted Wikipedia knowledge, the content of Freebase already consists of structured fact tuples which can be edited directly. This direct usage of such a structure reduces noise and errors which can be introduced by intermediary processing steps such as extracting and unifying knowledge.

12 http://www.freebase.com, accessed 2017-07-17

Figure 2.2: Knowledge graph summary for Albert Einstein at Google

Many people, especially around Krötzsch and Völkel [101, 167], have criticized the lack of semantic features in Wikipedia since 2005, which constitutes a big shortcoming when dealing with the processing of information in Wikipedia by automated processes. Due to these shortcomings, the project Semantic Mediawiki [100] was developed, which provides the possibility to add semantic information to the textual knowledge by using a simple markup language. The semantic information is created by annotating wiki pages. The Semantic Mediawiki introduced three different types of annotations: Categories to classify wiki pages, Relations to describe relationships between wiki pages, and Attributes to specify information of the wiki page in a structured and semantic manner. Categories are based on the already available category system of Mediawiki, which is just lifted to a semantic representation and interpretation. Relations can be seen as simple links between articles, which are also already present in Mediawiki. The Semantic Mediawiki extended the power of links by specifying the type of a link. This provides the possibility to add additional semantic information to a link. Consider the example of Albert Einstein, who is linked with Alfred Kleiner. One cannot see how these two people are related by just considering the simple link. If a typed relation is added to the Semantic Wiki page of Albert Einstein, such as [[doctoral advisor::Alfred Kleiner]], one can easily identify the type of the relationship between Albert Einstein and his doctoral advisor Alfred Kleiner. Attributes, the last type of annotation in the Semantic Mediawiki, allow specifying relationships to things which are not present as a wiki page. For example, the birthday of Albert Einstein can be specified by adding the text snippet [[birthday:=1879-03-14]] to the wiki page of Albert Einstein. Furthermore, attributes can be typed by specifying an additional relationship between the attribute and a type. For example, the Attribute::birthday and the Type:Date can be interlinked to specify the type of birthday. All this information can be used for further automatic processing and reasoning. The Semantic Mediawiki system provides such processing and offers querying facilities to answer queries or even create dynamic pages or parts of pages. For example, on the wiki page of Alfred Kleiner, all advised PhD students can be listed automatically. The big advantage is that such lists do not need to be kept up to date manually as they are computed dynamically. Also complex queries such as "List all scientists who were born in 1879" are possible by querying the category and birthday attribute of wiki pages. As this approach does not limit the user in the choice of attributes, the problem of proliferation of attributes (cf. Section 2.4.1) arises and directly impedes the search capabilities. To solve this problem, the SnoopyConcept computes recommendations of suitable attributes to prevent synonyms in the vocabulary of attributes and thus increase the search capabilities.
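The machine-processability of such typed annotations can be illustrated with a small Python sketch that extracts relations and attributes from a page body using regular expressions; the page text, property names and patterns below are simplified assumptions and only cover the two basic annotation forms mentioned above.

import re

# Simplified excerpt of an annotated wiki page (invented for illustration).
page = """
Albert Einstein was advised by [[doctoral advisor::Alfred Kleiner]].
He was born on [[birthday:=1879-03-14]] in [[birthplace::Ulm]].
"""

relation_pattern = re.compile(r"\[\[([^:\]]+)::([^\]]+)\]\]")   # [[property::Target Page]]
attribute_pattern = re.compile(r"\[\[([^:\]]+):=([^\]]+)\]\]")  # [[property:=literal value]]

subject = "Albert Einstein"
triples = []
for prop, target in relation_pattern.findall(page):
    triples.append((subject, prop.strip(), target.strip()))
for prop, value in attribute_pattern.findall(page):
    triples.append((subject, prop.strip(), value.strip()))

for triple in triples:
    print(triple)
# ('Albert Einstein', 'doctoral advisor', 'Alfred Kleiner')
# ('Albert Einstein', 'birthplace', 'Ulm')
# ('Albert Einstein', 'birthday', '1879-03-14')

The extracted tuples correspond to the subject-predicate-object triples discussed in Section 2.1.3 and could be fed directly into an RDF store for querying.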

The community of the Semantic Mediawiki ecosystem has requested the integration of the semantic annotation features into the Wikipedia system since the beginning of the semantic extension of Mediawiki. But it took nearly 10 years until the worth of a semantic knowledge base was appreciated, and in April 2012 a new project in the Wikipedia ecosystem was initiated — Wikidata [169, 170]. Wikidata aims at providing computer-processable knowledge which is maintained by the Wikipedia community. It was the first new project by the Wikimedia Foundation since 2006 and was driven by the German chapter of Wikimedia. The project was funded by the Allen Institute for Artificial Intelligence, the Gordon and Betty Moore Foundation, and Google, Inc. and has been led by Denny Vrandečić, who was one of the founders of the Semantic Mediawiki¹³. The Wikidata project is based on the Mediawiki software to provide highest interoperability and offers the possibility to store key-value pairs in the fashion of a document-oriented database. The main goal of the database is to store general facts in a central place. Consider, for example, the birthday of a famous person, which is stored in multiple Wikipedia articles in different languages. As each language version is operated by its own Wikimedia association and its own infrastructure, the birthday is stored and maintained in different databases. Especially frequently updated values, such as the population of a country, have to be maintained in different languages and Wikipedia versions. This storage concept is very error prone and leads to multiple inconsistencies. Thus, Wikidata provides the possibility to store these facts in a centralized way, so that they can be accessed and referenced by all Wikimedia projects. In addition to the fact itself, a reference URL is stored to specify the source of the information and increase traceability. Besides the facts which are located in infoboxes, other structural or semantic knowledge is stored in Wikipedia as well. The whole category system or interlanguage links between articles about the same item in different languages were defined by using the markup language inside articles. Such links, category membership information and lists have to be maintained in each language by their local community. All this information can be stored in Wikidata and reused, for example, to automatically generate category member lists. Currently, many people and projects, including their bots, are striving to move already present data to Wikidata and to fill the new knowledge base [170]. Furthermore, Wikidata is not only important for all Wikimedia projects but also paves the way for further external projects and researchers who are now able to access millions of community-proven facts. We used the Wikidata dataset to evaluate Wikidata's property suggestor and an extension we proposed in Section 5.4.

13 http://en.wikipedia.org/w/index.php?title=Wikidata&oldid=571240689, accessed 2017-07-17

Besides Wikidata, DBpedia and Freebase, several other knowledge bases which represent common world knowledge exist. Many of them are based on knowledge harvesting methods (cf. Section 2.4.2) which try to construct knowledge bases automatically by crawling and analyzing websites. KnowItAll/ReVerb [55, 53], NELL [36] and Yago [76, 161] aim at extracting facts from written sentences. The systems use predefined knowledge for seeding and identify patterns which are subsequently used to find new knowledge. Most approaches present the automatically extracted facts to users who can decide whether a fact is true. This manual feedback can be used to refine the extraction algorithms, judge found patterns and improve the precision of the certainty level of found facts. Besides the scientific approaches, commercial databases such as Microsoft's Concept Graph¹⁴ or Watson's database [113] have also been built.

14 https://concept.research.microsoft.com/, accessed 2017-07-17

Most of the presented datasets are curated by a committed community which tries to sustain a homogeneous schema in the knowledge base. The SnoopyConcept is well suited to building such a knowledge base while relieving the community by pushing the task of homogenization to the user. This is achieved by guiding the user to a homogeneous structure already during the insertion process, as explained in detail in Section 3.3. Furthermore, the SnoopyConcept recommends semantic enrichments to the user, which increases the number of links in the dataset. Especially when incorporating the LOD cloud, this interlinking is beneficial for the search capabilities.

2.3 Recommender Systems

The core of the SnoopyConcept is a recommender system which aims at guiding the user to a homogeneous schema and increasing the quality and quantity of stored knowledge in an information system. In this section we introduce the foundations of recommender systems and describe different recommendation approaches.

2.3.1 Introduction to Recommender Systems

As human beings we are often forced to make choices without sufficient experience in the respective area. To cope with this lack of experience we often rely on recommendations from other people by word of mouth, recommendation letters or general reviews in newspapers and guides [138].

Recommender systems assist the user and augment the natural social interaction with the system by recommending suitable items to the user [138]. The first recommender system was implemented in 1992 [66] and introduced "collaborative filtering", which has become the de facto standard in the area of recommender systems. Since then, the topic of recommender systems has grown heavily in research and industry and is nowadays present in many different areas.

These days, recommender systems can be found on most online platforms. One of the first well-known examples is the shopping platform Amazon¹⁵, which suggests further suitable articles to its users based on user profiles [107, 155].

Besides suggesting books to users, recommender systems have become an important piece of technology in many areas within the last decades. Especially in areas that heavily rely on user preferences, it is natural to use recommender systems. Therefore, the recommendation of suitable music and movies to users has become a very important field in research and industry. The most prominent representative is MovieLens [71], a dataset consisting of movie ratings that was released in 1998 and has since served as the basis for many research approaches. Boosted by the rise of streaming services for music and movies, this field of research has been growing rapidly within the last few years [148]. This is also reflected in the number of workshops and tracks related to music and movie recommender systems at top conferences such as RecSys¹⁶ or ISMIR¹⁷. In particular, in the domain of music and movies not only user profiles are used to build recommendation algorithms, but content-based features are also considered. For example, the style or genre of a music track or the metadata of a movie can be incorporated to improve the recommender system. Such content-based recommender systems are often combined with the classical user-based approach to improve the overall performance of the recommender. An additional dimension which has become increasingly important within the last few years is the context of recommendations. Contextual information, such as location and time, can have a significant influence on the recommended items. Systems that incorporate the context in the computation of recommendations are so-called context-aware recommender systems and are widely used [92]. Especially one field that was originally not classified as a recommender system heavily relies on context adaptation, namely search engines. They have evolved from simple search tools to powerful, context-aware recommender systems that exploit user profiles and several other sources to deliver personalized search results [156]. The mentioned areas only scratch the surface of all use cases for recommender systems but already demonstrate that recommender systems are widely used in many different areas and domains. The different types of recommender systems are described in the following section.

15 http://www.amazon.com, accessed 2017-07-17
16 http://recsys.acm.org, accessed 2017-07-17
17 http://www.ismir.net, accessed 2017-07-17

2.3.2 Classification

Recommender systems have been classified by several authors [86, 6, 139, 119, 40], and the classifications differ on a detailed level. In this work we classify recommender systems in a similar way into the following types.

• Collaborative filtering

• Content-based recommender systems

• Knowledge-based recommender systems

• Context-aware recommender systems

• Hybrid recommender systems

In the following sections the approaches and their differences are described.

2.3.3 Collaborative Filtering

Collaborative filtering is the de facto standard in the area of recommender systems and was also used by Goldberg in 1992 [66] to implement the first recommendation system. Goldberg et al. introduced the term Collaborative Filtering, which describes the usage of collected opinions or behaviors of existing users. Based on this gathered data, the recommendation system tries to predict which items the current or new user likes or is interested in. As this approach was the first recommender approach, it has been extensively explored by the research community and is nowadays widely used in many areas and products.

The basic algorithm is based on a matrix with two dimensions, namely users and items. Every cell contains the information whether the respective user likes or dislikes the respective item. The algorithm takes this matrix as input and provides a ranked list of items that are interesting for a specified user. The provided list contains only items that the user has not yet marked as (not) interesting.

The recommendation task is formally defined by Adomavicius et al. [6] as follows:

∀c ∈ C, s′_c = arg max_{s ∈ S} u(c, s)    (2.1)

Let u in Equation 2.1 be a utility function that measures the usefulness of an item s to the user c with u : C × S → R, where R is a totally ordered set. For each user c ∈ C we want to find a new item s′ ∈ S that maximizes the user's utility function. As the function u is not defined for every combination of s and c, the aim of a recommender system is to predict the result of the function u for items that have not been rated so far. For example, a movie recommendation system has to predict whether a non-rated movie is "useful" for a specific user c.

The most common approach is a user-based algorithm which relies on user profiles and their similarities. A user profile of user c_j is defined by all computed values of the utility function u(c_j, s) over all rated items. The prediction of u(c_j, s) for a non-rated item s is realized by considering all other user profiles that are similar to the user profile of user c_j. Based on the ratings for item s by all similar users, the result of u(c_j, s) can be estimated. Considering the example of a user c in a movie recommender system, the recommender system tries to find other users who have a similar taste. The taste is defined by likes and dislikes of movies. If users like or dislike the same movies it can be assumed that they have a similar taste. The recommender system can subsequently recommend movies that have not been watched by the user c but are already liked by similar users. The similarity between two users can be calculated by, e.g., applying the cosine similarity or any other set-based similarity measure to both user vectors that contain the users' ratings of all items.
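To illustrate the user-based scheme, the following sketch ranks unrated items for one user by aggregating the preferences of cosine-similar users; the toy rating matrix and all variable names are invented for illustration only.

    import numpy as np

    # Toy user-item matrix: rows = users, columns = items,
    # 1 = like, 0 = unknown/not rated.
    R = np.array([
        [1, 1, 0, 0, 1],
        [1, 1, 1, 0, 0],
        [0, 0, 1, 1, 0],
    ])

    def cosine(u, v):
        """Cosine similarity between two rating vectors."""
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / norm) if norm else 0.0

    def recommend(R, user, k=2):
        """Recommend items the user has not rated yet, weighted by the
        cosine similarity of the other users' profiles."""
        sims = np.array([cosine(R[user], R[other]) if other != user else 0.0
                         for other in range(R.shape[0])])
        scores = sims @ R                      # aggregate similar users' likes
        scores[R[user] > 0] = -np.inf          # never recommend already rated items
        return np.argsort(scores)[::-1][:k]

    print(recommend(R, user=0))                # item 2 ranked first for this toy data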


Since its beginnings in 1992, the basic algorithm has been improved and extended by several approaches [87]. For example, nowadays the "usefulness" of an item is no longer defined as a boolean value ("like" or "dislike") but as a rating (e.g. between 1 and 10), which allows a more fine-grained computation of recommendations. Also the user profile can be extended [6] by, e.g., adding demographic information or other meta-data about the user. This additional information can be used to improve the search for similar users, as not only the ratings are considered. Another improvement to gain more insights into the taste of users is to exploit external sources. For example, publicly available data such as tweets can be harvested to build user-item information and find relations between items. For example, we built a music recommender which implements a collaborative filtering approach based on tweets about listened music tracks [185]. Especially to overcome the cold start problem or to cope with very small user bases, the consultation of external sources, such as Twitter, is very beneficial.

Regarding the implementation of collaborative filtering, we can distinguish between two major classes of approaches [6, 28], namely memory-based and model-based approaches.

Memory-Based Collaborative Filtering

Memory-based algorithms are straightforward as they consider the full matrix of all previous ratings created by all users. The name derives from the fact that the matrix has to be loaded into memory to compute a recommendation. Most approaches apply nearest-neighbor algorithms to find similar users. Widely used similarity measures are Pearson's correlation coefficient, adjusted cosine similarity, Spearman's rank correlation coefficient and the mean squared difference measure [87]. According to Herlocker et al. [74], Pearson's correlation coefficient outperforms the other measures when considering user-based collaborative filtering.
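As a minimal illustration of such a memory-based prediction, the following sketch estimates a missing rating from the most similar neighbors using Pearson's correlation coefficient; the small rating matrix and the neighborhood size are arbitrary choices.

    import numpy as np

    # Ratings 1-5, 0 = not rated yet (rows = users, columns = items).
    R = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [1, 0, 4, 4],
    ], dtype=float)

    def pearson(u, v):
        """Pearson correlation over the items both users have rated."""
        both = (u > 0) & (v > 0)
        if both.sum() < 2:
            return 0.0
        u, v = u[both] - u[both].mean(), v[both] - v[both].mean()
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    def predict(R, user, item, k=2):
        """Predict R[user, item] as the similarity-weighted mean deviation
        of the k most similar users who rated the item."""
        mean_u = R[user][R[user] > 0].mean()
        neighbors = [(pearson(R[user], R[v]), v)
                     for v in range(len(R)) if v != user and R[v, item] > 0]
        neighbors = sorted(neighbors, reverse=True)[:k]
        num = sum(sim * (R[v, item] - R[v][R[v] > 0].mean()) for sim, v in neighbors)
        den = sum(abs(sim) for sim, _ in neighbors)
        return mean_u + num / den if den else mean_u

    print(round(predict(R, user=1, item=1), 2))   # 3.17 for this toy data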

The basic collaborative filtering approaches consider likes for all items with the same weight. In real-world examples it can be seen that there are many items that are rated by most of the users. Votes or ratings on such items are not very helpful for finding good recommendations. Therefore, the idea is to put more weight on more controversial items that are not voted on very often. Breese et al. [29] introduced the "inverse user frequency" and Herlocker et al. [74] address the problem with a so-called "variance weighting factor" to improve the recommender accuracy.
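The inverse user frequency can be sketched in a few lines; the weighting formula log(n/n_j) follows the formulation commonly attributed to Breese et al. [29], and the toy matrix is invented.

    import numpy as np

    def inverse_user_frequency(R):
        """Weight every item j with log(n / n_j), where n is the number of users
        and n_j the number of users who rated item j (cf. Breese et al. [29]).
        Items that everybody rated receive weight 0 and thus no influence."""
        n = R.shape[0]
        n_j = np.maximum(np.count_nonzero(R, axis=0), 1)   # avoid division by zero
        return np.log(n / n_j)

    R = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 0]])
    print(inverse_user_frequency(R))   # the universally rated first item gets weight 0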


Model-Based Collaborative Filtering

In contrast to the memory-based approach, which computes recommendations by considering all available data, the model-based approach uses a precomputed model for the recommendation computation [87]. This model is computed in advance in an offline step and takes all available raw data into account. The "learned" model is able to approximate results but does not need all the raw data and therefore is more efficient in terms of memory usage than the memory-based approach. Especially when considering very large user bases with tens of millions of users, a memory-based approach might no longer be feasible.

For the model creation task, "dimension reduction" algorithms and "matrix factorization" methods are widely used [87, 85, 99]. These techniques derive a set of latent factors from the user-item matrix to be able to collapse the matrix into a smaller approximation. One of the first approaches was "singular value decomposition" (SVD), which was introduced by Deerwester et al. in 1990 [49] in the area of semantic analysis. The algorithm aims at finding co-occurring terms which are highly correlated and can be reduced to one single latent value.
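A minimal sketch of such a low-rank approximation using a truncated SVD is shown below; the rank k and the toy rating matrix are arbitrary choices for illustration.

    import numpy as np

    # Toy user-item rating matrix.
    R = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    # Full SVD, then keep only the k strongest latent factors.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = 2
    R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # R_hat approximates R and fills the unrated (zero) cells with
    # estimates that can be used to rank candidate items per user.
    print(np.round(R_hat, 2))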

As model-based approaches are based on approximations, one could state that they have a lower precision because not all data is considered, which is true in many cases. However, experiments showed [87, 146] that in several use cases model-based approaches outperform memory-based approaches, which can be attributed to the removal of noise that negatively impacts the accuracy.

Furthermore, within the last decades many improvements in the area of model-based collaborative filtering have been introduced [139, 98, 126, 137] and there is still a very large research community working on further improvements in this area. One striking example is the winning team [165] of the Netflix Prize competition [18, 17], which created a highly tailored recommender system that outperformed the Netflix recommendation algorithm by over 10%.

Besides the optimization of the user-based memory- or model-based approaches, many algorithms also consider meta-data about items. In literature and research, item-based approaches are sometimes not classified as collaborative filtering approaches but form their own class of content-based recommender systems, which are described in the following section.


2.3.4 Content-based Recommender Systems

Content-based recommender approaches [88] take information about the item into account to compute recommendations. This is achieved by building an item profile and a user profile which are then compared to recommend suitable items. For example, in the area of movie recommendation systems a genre-based classification can be used. For the computation process in this simplified example, we consider user profiles which consist of genre preferences to find movies that adhere to the user's preferences.

The meta-data about items can be classified into technical features and subjective or qualitative features. Technical features such as the genre or the list of actors of a movie can be easily retrieved and exploited for the computation of recommendations. In contrast, qualitative features such as quality or taste are difficult to retrieve and more difficult to interpret. Especially the subjective impression and the resulting rating of a movie are very difficult to track and understand, as they are often not based on hard, technical facts that are accessible to the recommender system.

The core of a content-based recommender system is a similarity function that compares items based on defined features. Based on the calculated similarity, the recommender system suggests items that are similar to the user's previously liked items. For example, simple text or keywords can be used to describe items. As the problem of describing and classifying objects is very well known in the domain of information retrieval, many algorithms can be adapted to compute recommendations. The most widely used technique in the area of information retrieval is the vector space model, which describes an item by a vector of features. In this example, keywords could be used to create the vectors, which consist of multiple boolean values. Every boolean entry in the vector defines whether the respective keyword is present. To improve the search process, which aims at finding similar vectors, the TF-IDF weighting scheme can furthermore be applied. The weighting scheme was introduced by Salton et al. in 1975 [143] and is based on the frequency of occurrences of keywords. It considers how often the keyword is present in the description and in the whole dataset. The TF-IDF weighting scheme has become a de facto standard which nowadays is used in several variations in many text-based information retrieval systems. Another very popular approach is to use k-nearest neighbour (kNN) methods [19] to find similar items in the vector space model.
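A minimal sketch of a content-based similarity computation over keyword descriptions, here using the TF-IDF vectorizer of the scikit-learn library (the item descriptions are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Invented keyword descriptions of three movies.
    items = {
        "Movie A": "science fiction space adventure",
        "Movie B": "space opera science fiction",
        "Movie C": "romantic comedy love story",
    }

    # Build TF-IDF vectors in the vector space model and compare them.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(items.values())
    sim = cosine_similarity(X)

    # Items most similar to "Movie A" (excluding itself).
    names = list(items)
    ranked = sorted(((sim[0, i], names[i]) for i in range(1, len(names))), reverse=True)
    print(ranked)   # "Movie B" ranks above "Movie C"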

In the context of recommendation systems, the features in the vector space model can be manifold. For example, the previously defined user-item matrix which is used for user-based recommendations can be reused and exploited in a content-based recommendation approach as well. To achieve this, the transposed matrix is used and every item is represented by its rating vector. The similarity of two items is subsequently defined as the similarity between all user ratings of these two items. However, the more common approach is to use features that can be directly extracted from the items without analyzing the user interaction. This comes along with the big advantage of being able to overcome the cold start problem, as items can already be recommended right after the user liked the first item. For example, when considering a music recommender system, typical acoustic features and sound textures about genre, mood, timbre and tempo are automatically extracted [166, 152]. Besides the automated extraction of information about items, external sources can also be exploited to feed a recommender system [51]. For example, information about the genre, actors, budget, director or revenue of movies can be retrieved from an external catalog and form the basis for a movie recommender system.

In general, recommender systems that solely rely on item-based features are not very common as they do not use the very powerful source of user opinions. Therefore, it is natural to combine the user-based and the item-based approach and benefit from both sides. These so-called hybrid recommender systems are explained in more detail in Section 2.3.7.

2.3.5 Knowledge-based Recommender Systems

Knowledge-based recommender systems [89] come into play when user-based or content-based approaches reach their limits. Consider a shopping platform for cars, real estate, horses or specialized technical equipment, which are not bought very often. For such use cases, recommender systems usually have only little information about the correlation of items or about the users, as the systems have to deal with just one order per user in the worst case. Even a recurrent user could have changed her preferences regarding buying a new house or car due to a changed family situation. The same holds for technical equipment, which becomes outdated very quickly due to the fast pace of development. For example, information about the order of a PC system five years ago is not very suitable to compute recommendations that are up to date. Also in very specialized domains such as medicine, information about, e.g., a medical device can become outdated very quickly.

To tackle this problem there are two common approaches used in a knowledge-based recommendation system. The first method exploits explicit recommendation rules (knowledge) which are manually created by a domain expert. The other possibility is to interact with the user to create a user profile on the fly, which can subsequently be used to find suitable items. Such a system could also be considered as a search engine but still conforms to the definition of a recommender system by Burke [33]: "guide a user in a personalized way to interesting or useful objects in a large space of possible options or that produce such objects as output".

Knowledge-based recommender systems are able to cope with some shortcomings of traditional user-based or content-based approaches and furthermore provide the big advantage that there is no cold start problem at all. However, the costs of a domain expert are usually very high and the system is very inflexible due to the fact that all rules and constraints are manually created. Therefore, the approach comes along with high maintenance costs and thus is not widely used.
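The rule-based variant can be sketched as follows: manually created constraints encode the expert knowledge and filter the catalog against an interactively collected user profile. All rules, items and attribute names are invented for illustration.

    # Invented catalog of cars and hand-written expert rules.
    catalog = [
        {"model": "CityCar",   "price": 15000, "seats": 4, "usage": "city"},
        {"model": "FamilyVan", "price": 32000, "seats": 7, "usage": "family"},
        {"model": "Roadster",  "price": 55000, "seats": 2, "usage": "leisure"},
    ]

    # Expert rules map an elicited user profile to hard constraints.
    def constraints(profile):
        rules = []
        if profile.get("children", 0) > 2:
            rules.append(lambda car: car["seats"] >= 6)   # a big family needs space
        if "budget" in profile:
            rules.append(lambda car: car["price"] <= profile["budget"])
        return rules

    def recommend(profile):
        """Return all catalog items that satisfy every applicable rule."""
        active = constraints(profile)
        return [car["model"] for car in catalog if all(rule(car) for rule in active)]

    print(recommend({"children": 3, "budget": 40000}))   # ['FamilyVan']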

2.3.6 Context-aware Recommender Systems

All previously described recommender approaches deal with only two types of entities, namely users and items. Context-aware recommender systems (CARS) [5] consider the additional dimension "context" of the user. For example, time, location, mood or the company of other people could completely change the preferences of a user and thus have to be considered for the computation of recommendations. Every time the user interacts with the system, the context might have changed, which directly influences the user preferences and subsequently the user profile. The context could be modeled as a continuously changing user profile. As such a modeling approach would interfere with most recommender definitions, Adomavicius et al. [5] model the context as an additional dimension as shown in Equation 2.2.

ClassicalRecommenderSystem : User × Item → Rating
ContextAwareRS : User × Item × Context → Rating    (2.2)

According to this definition, the rating information is attached to a triple (user, item, context), in contrast to a classic recommender system which uses the pair (user, item) to identify a rating. The context can be defined by additional attributes in the same way as users or items can consist of further properties. As contextual information may contain a lot of different aspects respectively dimensions, it can increase the complexity dramatically. Consider again the very simple example of a movie recommender system that includes contextual information about location (or distance to theaters), time and company. These three dimensions can be modeled by the following hierarchical trees:

GeoCoordinates → City → State → Country
Date → DayOfWeek → TimeOfWeek
Date → Month → Quarter → Year
Company: Girlfriend → Friends → NotAlone → AnyCompany

These already simplified dimensions form a huge multidimensional space that has to be considered in the computation of recommendations. Furthermore, the dimension time is very dynamic and involves hypes and a lot of exceptions such as holidays. The dimension company may flip the preferences of a user completely. Due to these additional non-trivial dimensions, context-aware recommender systems are significantly more complex than traditional systems [5].

For the implementation, Adomavicius et al. [5] distinguish between three different approaches, namely contextual pre-filtering, contextual post-filtering and contextual modeling. The first two approaches are built on top of classic recommender algorithms and apply additional filters. Based on the context, the pre-filtering technique filters the dataset that is used by the recommender system. Post-filtering applies a filter on the final list of recommendations to adapt the recommendations to the respective context. Contextual modeling integrates the context directly into the core function of the recommender and influences the selection, scoring, ranking or the model of the system.
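The pre-filtering idea can be sketched in a few lines: only ratings whose context matches the target context are fed into an otherwise unchanged two-dimensional recommender. The rating events, the context attribute and the placeholder recommender are purely illustrative.

    from collections import defaultdict

    # Invented rating events annotated with a context dimension.
    ratings = [
        {"user": "u1", "item": "HorrorNight", "rating": 5, "company": "friends"},
        {"user": "u1", "item": "RomComX",     "rating": 4, "company": "girlfriend"},
        {"user": "u2", "item": "HorrorNight", "rating": 4, "company": "friends"},
    ]

    def contextual_prefilter(ratings, **context):
        """Keep only the ratings that were given in the target context."""
        return [r for r in ratings
                if all(r.get(key) == value for key, value in context.items())]

    def classic_recommender(filtered):
        """Placeholder for any two-dimensional (user x item) recommender:
        here simply the items ranked by their average rating."""
        sums, counts = defaultdict(float), defaultdict(int)
        for r in filtered:
            sums[r["item"]] += r["rating"]
            counts[r["item"]] += 1
        return sorted(sums, key=lambda i: sums[i] / counts[i], reverse=True)

    print(classic_recommender(contextual_prefilter(ratings, company="friends")))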

Similar to the combination of traditional user-item based recommenders and contextual information, other approaches can also be combined to improve the overall accuracy. These so-called hybrid systems are described in the following section.

2.3.7 Hybrid Recommender Systems

The previously described approaches, collaborative filtering, content-based, knowledge-based and context-aware recommender systems, come with different advantages and disadvantages depending on the use case and domain. Therefore, it is natural to combine different approaches to gain from the advantages or to be able to cope with the disadvantages. According to [90], nowadays most recommender systems are already hybrid solutions that combine different approaches to improve the overall accuracy. Jannach et al. [90] provide an overview of hybrid recommender systems and discuss different aspects of their implementations.
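One common hybridization strategy is a weighted combination of the scores produced by two (or more) recommenders. The sketch below blends a collaborative and a content-based score list with a fixed weight; all scores and the weight are invented.

    # Invented normalized scores from two different recommenders for one user.
    collaborative = {"ItemA": 0.9, "ItemB": 0.4, "ItemC": 0.1}
    content_based = {"ItemA": 0.2, "ItemB": 0.8, "ItemD": 0.6}

    def weighted_hybrid(scores_a, scores_b, alpha=0.6):
        """Blend two score dictionaries; items missing in one list get score 0."""
        items = set(scores_a) | set(scores_b)
        return sorted(
            ((alpha * scores_a.get(i, 0.0) + (1 - alpha) * scores_b.get(i, 0.0), i)
             for i in items),
            reverse=True)

    for score, item in weighted_hybrid(collaborative, content_based):
        print(f"{item}: {score:.2f}")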

2.3.8 Summary

In the previous sections about recommender systems we provided an overview of the main approaches to implement recommender systems and presented a basic classification which can be used to classify different types of recommender systems. Collaborative filtering approaches analyze the user behavior to compute recommendations, whereas content-based approaches take meta-information about items into account. Especially if such meta-information is not present or correlations between items are not known, the collaborative filtering approach is more powerful as it relies solely on usage data. Therefore, the availability of a sufficient amount of user data is a hard requirement for collaborative filtering approaches. To tackle the lack of meta-information or usage data, knowledge-based approaches incorporate specialists and their domain knowledge to build recommender systems in a more manual fashion. Furthermore, extensions such as the consideration of contextual information to refine recommendations were discussed. The examples that were presented in the previous sections are mostly located in the domain of e-commerce and the personalization in this domain. As most recommender systems were developed to provide shopping suggestions, it is not surprising that many approaches are tailored to this domain. In the next section we discuss a very young field of research, namely the usage of recommender systems in information systems to guide users during their interaction with the system.

2.4 Recommender Systems in Information Systems

Recommender systems are very common in the domain of e-commerce, especially to realize personalization. Nowadays, recommender systems are also used in the domain of information systems to guide users during their interactions with the system. One crucial property of mass-collaborative information systems is the homogeneity of the structured knowledge, as it determines accessibility and searchability. The challenge of maintaining a homogeneous schema, which is also the main goal of the SnoopyConcept, and possible solutions for this challenge are described in the following sections.


2.4.1 Synonyms and Search Capabilities

One of the most severe challenges in semi-structured information systems is the usage of synonyms—semantically equivalent but syntactically different terms. They dramatically decrease the search capabilities of the information system [54] as information of the same type is tagged with different labels. This problem occurs in information systems that do not limit the user to a predefined schema but provide the possibility to freely choose terms to store information. Consider for example two properties inhabitants and population which were introduced by two users in an information system. Both properties may be used to describe the same information—namely the number of citizens—and therefore are semantically equivalent. On the other hand, both properties are syntactically different and therefore increase the heterogeneity of the structure of the information system. This heterogeneity decreases the search capabilities, as the property inhabitants, respectively the subject that uses this property, is not found when searching for the property population and vice versa. This challenge was already described by Furnas et al. [59], who showed in the 1980s that the chance of two humans choosing the same term for the same object is only about 20%. Especially in the area of mass-collaboration, this fact has a dramatic impact as thousands of different users from different social levels and backgrounds add terms and properties in different domains and settings to the information system.

In science, this phenomenon has been studied deeply in the area of folksonomies. The term folksonomy was introduced by Thomas Vander Wal [128]; it is a portmanteau of folk and taxonomy and describes the resulting tag cloud of a social tagging system. Tagging systems allow the user to freely annotate arbitrary resources with simple terms. The term social tagging refers to large public systems such as BibSonomy18 (tagging of publications), del.icio.us19 (tagging of bookmarks) or Flickr20 (tagging of photos), which use a collaborative tagging system to organize their community-created content. Folksonomies contain information about the usage of tags within a social tagging system in the form of triples. In particular, a triple in the form of (tag, user, resource) is used to store the information about the user who assigned a tag to a specific resource. The concept of social tagging was one of the main principles of the Web 2.0 era and has become a de facto standard nowadays. For researchers, such folksonomies have become a major research target as they are very large, consist of thousands of terms and are publicly available. Many research publications analyze the social usage and creation of folksonomies and try to find

18 http://www.bibsonomy.org, accessed 2017-07-17
19 http://del.icio.us, accessed 2017-07-17
20 http://www.flickr.com, accessed 2017-07-17

trends and user behavior patterns in the area of folksonomies. The found patterns give insights into the behavior of communities or can be used as a basis for other research topics, such as recommender systems. Multiple research results [108, 109, 68, 40] show that folksonomies follow a long-tail distribution, which means that few tags are used very often while a large share of all unique tags is used only very rarely.

Characteristic                                                        Value
Messages containing one or more hashtags                         49,696,615
Hashtag usages in total                                           65,612,803
Average number of hashtags per message                                  0.16
Average number of hashtags per message
  (within the set of tweets containing at least one hashtag)            1.32
Maximum number of hashtags per message                                    47
Median of hashtags per message                                             1
Distinct hashtags                                                  7,777,194
Hashtags occurring ≥ 5 times in total                                757,832
Hashtags occurring < 5 times in total                              7,135,627
Hashtags occurring < 3 times in total                              6,841,523
Hashtags occurring once                                            5,765,835
Average number of usages per hashtag                                    8.43
Median number of usages per hashtag                                        1

Table 2.1: Hashtag distribution of crawled Twitter dataset

This long-tail distribution can be found in many collaboratively created datasets. For example, it was also clearly evident in our crawled Twitter dataset, which was published in [188], as roughly 6 million of almost 8 million distinct hashtags appear only once (cf. Figure 2.3).

Only about 700,000 hashtags occur more than four times, as shown in Table 2.1. This distribution is strongly connected to the usage of synonymous terms, as every user has a different linguistic competence and uses different thought patterns. Furthermore, especially in this example of Twitter, which limits messages on its platform to 140 characters, abbreviations and other short forms increase the probability of additional synonyms. For example, our dataset, which was crawled in the summer of 2012, contains many tweets regarding the Tour de France (a world-famous bicycle race in France) which were annotated with #tdf, #tdf12, #tdf2012, #tourdefrance, etc. All hashtags have the same meaning but feature a different syntactic structure and therefore limit the search capabilities if only one hashtag is used to search for tweets.
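The long-tail characteristics summarized in Table 2.1 can be reproduced on any tag list with a simple frequency count, as the following sketch on an invented list of hashtags illustrates.

    from collections import Counter

    # Invented list of extracted hashtags (lower-cased).
    hashtags = ["#tdf", "#tdf", "#tdf2012", "#tourdefrance", "#tdf", "#rain",
                "#olympics", "#olympics", "#sunset"]

    counts = Counter(hashtags)
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)

    print(f"distinct hashtags: {distinct}")
    print(f"used only once:    {singletons} ({singletons / distinct:.0%})")
    print(counts.most_common(3))   # the short head of the distribution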


Figure 2.3: Hashtag distribution of crawled Twitter dataset

Due to the long-tail distribution of folksonomies, recommender systems have evolved in this area to cope with the problem of synonyms. The avoidance of synonyms is also one of the key features of the SnoopyConcept (cf. Section 3.3). In general, the approaches which aim at homogenizing knowledge can be categorized into two main classes. The first class consists of traditional approaches that deal with already stored knowledge, which is aligned by other users or administrators at a later point in time. The second class of approaches aims at aligning the knowledge already during the insertion process. Both types are sketched in Figure 2.4 and discussed in more detail in the following sections.

2.4.2 Guided Refinement and Enrichment

The main idea of the following approaches is the collaborative refinement of already existing semi-structured data. The possible refinements, alignments and extensions of already stored knowledge are mostly computed based on external sources. Such sources can be other structured or semi-structured knowledge bases. But even the web—the biggest available knowledge base—can be exploited to gather new refinements of already stored knowledge. Especially the exploitation of the web is very challenging due to its size, the unstructured format and the associated scaling and performance issues.

Figure 2.4: Two main approaches to improve heterogeneity

Approaches which use external sources to compute refinements and subsequently rely on a collaborative review of the detected refinements (sketched in Figure 2.4, right box) are described in the following sections.

Alignment & Refinement

As already pointed out in the introduction of this section, collaboratively curated data may also be refined and aligned after the insertion of the data. This is a crucial step in order to provide efficient querying facilities. Such an alignment mostly aims at homogenizing the set of predicates towards a common schema.

During the last years, various approaches for aligning and matching RDF data have been developed. Hausenblas and Halb use a manual approach [73] that enables users to collaboratively interlink resources within RDF data. Besides such a manual alignment, automatic alignment methods have also been developed. The method proposed by Horrocks et al. [78] is based on common naming schemata, namely the comparison of properties within the datasets. Further approaches rely only on string matching, e.g. the DBpedia Lookup service21. Such alignment services may also be based on context information of the entities which have to be aligned, e.g. on the geographic coordinates, data types, etc. The Silk project [168] aims at combining these different alignment techniques to interlink RDF datasets.

Knowledge Harvesting & Information Extraction

The process of information extraction or knowledge harvesting aims at scanning the unstructured text of arbitrary documents and extracting structured knowledge to extend or build large knowledge bases. The retrieval of facts from natural language sources, such as unstructured web documents, is mostly based on NLP techniques. Natural Language Processing (NLP) [91] is basically concerned with how a computer system can understand natural language in order to grasp the sense of a sentence. By doing so, computers are able to summarize texts, detect entities and extract certain facts from a text.

YAGO and DBpedia (cf. Section 2.2.3) also extract information from the plain text of articles, although the extraction method is very limited as it is based on fixed patterns which are focused on a very small portion of the available information in an article—either infoboxes or category information. Therefore, the following approaches extend the simple pattern matching to more complex matching algorithms or NLP techniques which analyze the grammatical structure of sentences.

For an automated creation of large knowledge bases, the manual creation of patterns is not feasible. Therefore, ground truth data or other knowledge bases are used to create patterns which are learned and refined automatically. Suchanek et al. [160] introduced SOFIE, which uses the knowledge base YAGO to find new patterns. This is accomplished by searching already known information (trusted facts) in natural language documents. Consider the known fact that Einstein was born in Ulm, which is stored in a triple (Einstein, bornIn, Ulm). If many documents contain the sentence "Albert Einstein was born in Ulm", the pattern "X was born in Y" can be derived for finding values for the property bornIn. After this step, the pattern can be used to harvest new bornIn information about already known persons (X) and cities (Y) in arbitrary natural language documents. Nakashole et al. [122] showed that such an approach is also feasible for the web in terms of scalability.
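A drastically simplified sketch of this pattern-based harvesting step is shown below, using a single hand-written regular expression in place of learned and weighted patterns; the documents are invented and real systems such as SOFIE are considerably more sophisticated.

    import re

    # The surface pattern "X was born in Y" as a regular expression.
    BORN_IN = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) was born in ([A-Z]\w+)")

    documents = [
        "Albert Einstein was born in Ulm and later moved to Bern.",
        "It is well known that Marie Curie was born in Warsaw.",
    ]

    # Harvest new (subject, bornIn, object) triples from plain text.
    facts = [(person, "bornIn", city)
             for doc in documents
             for person, city in BORN_IN.findall(doc)]
    print(facts)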

Wu et al. [178] introduced an approach to find missing values of infoboxes in the natural language text of Wikipedia articles as part of their "Intelligence in Wikipedia" project. The extraction is accomplished by the Kylin

21http://lookup.dbpedia.org/, accessed 2017-07-17

extractor. The extraction process is started by a preprocessing step which is responsible for the creation of a training set for the extractor. This is done by retrieving the most popular attributes from pages which make use of the same infobox template. For each of these attributes, standard NLP techniques are used to label a matching sentence which contains the attribute's value. The extraction of facts for Wikipedia infoboxes is then done by learned extractors based on conditional random fields.

All described approaches result in new semi-structured facts with varying confidence values, as they were created by an automated process without any user interaction. Therefore, an additional review process is needed to dismiss invalid or wrong facts and to confirm true facts. The SnoopyConcept guides the user by recommending semantic refinements already during the insertion (cf. Section 3.3.3) and therefore does not require a review process at a later point in time. Different approaches for the review process are described in the following section.

Review Process / Mixed-Initiative

Knowledge bases such as YAGO or DBpedia, described in the section above, provide reliable knowledge with a very high confidence as they are based on very limited patterns which are optimized for a single source like Wikipedia infoboxes. To go beyond this scope, newer approaches introduce more flexible patterns and extractors such as Kylin or SOFIE described above. However, this flexibility implies higher error rates and lower confidence, which have to be tackled. This can be realized by automatic reasoning or by a collaborative review system which exploits the human resources and knowledge of a large community. Automatic reasoning is able to dismiss many wrong facts by using logical rules [122, 160]. E.g., it is not possible that a doctoral advisor is more than 100 years older than the PhD student. The precision (correctly found facts) of reasoning-enhanced systems can reach up to 98% [122, 160] if the underlying knowledge base is comprehensive and the searched relation is clear and without ambiguity. Especially semantic meanings and ambiguities decrease precision dramatically. E.g., the ambiguity of city names in the United States for a bornIn relation is very misleading (there are over 40 different cities called "Washington" in the USA).

Especially the disambiguation of such homonyms or the review of more complex relations requires the collaborative involvement of human users. Such approaches lead to a higher number of correct facts [76, 178, 180]. Approaches combining human and machine intelligence are called mixed-initiative approaches [79]. They basically try to exploit the advantages of both the user and automatic information extraction processes, as the extracted chunks of information are verified by human users before this information is finally published. The "Intelligence in Wikipedia" [173] project uses the extraction framework described in the section above to compute new infobox entries. Subsequently, these candidates are shown to the users of the system, who are then able to decide whether the candidate infobox entry was successfully and correctly extracted from the text. This is of crucial importance especially in the case of ambiguities which cannot be resolved by automated processes and can only be resolved by users. This mixed-initiative approach features the advantage that automatically extracted information is still reviewed by humans and therefore (i) verified information is published as it is subsequently added to the according infobox and (ii) the training sets of the information extraction system are refined and reviewed and, as such, the extraction process as a whole is improved. It was also shown that the acceptance of such a system is very high, as the tasks that have to be fulfilled by the users are small and easy to accomplish. The users only have to decide whether an extracted fact is correct or not. By doing so, normal users (not just active contributors to Wikipedia) are encouraged to contribute to the mixed-initiative approach.

All described approaches combine the intelligence of both humans and machines, as the algorithms can filter extensive data sources by mining algorithms in the first step. Subsequently, the human user can review the small amount of recommended knowledge and import it by accepting the recommendation, or resolve ambiguities and other errors. Furthermore, the recommendations and guidance mechanisms help to lower the barrier to contributing to a system and encourage the user to increase the quality and quantity of information within the knowledge base. This approach is also used in the SnoopyConcept but is already applied during the insertion process, when the user—usually a specialist in her domain—is still present and open to refining her inserted content. More details about the guidance during the insertion can be found in Chapter 3 and the following section.

2.4.3 Guidance during the Insertion

In the previous section, all approaches deal with already stored knowledge. Considering the fact that in most cases the user who inserts data into the knowledge base is a specialist in the field of the edited content, it is worthwhile to incorporate the user into the complex task of aligning and improving the content. Consider Wikipedia as an example: the interface to insert new information into Wikipedia consists of only one input field. All features to structure, link or format the content are realized by a very complex wiki syntax which the user has to know by heart. There is hardly any support or wizard functionality which would support the user during her task of providing new knowledge to the information system. To the best of our knowledge, the SnoopyConcept presented in Chapter 3 was the first approach which aims at using this powerful resource of knowledge [61, 60].

A very similar approach named "property suggestor" was introduced in 2014 on the Wikidata platform (cf. Section 2.2.3) to assist the user in entering information by providing suggestions for novel properties. This tool, which is discussed in more detail in Section 5.4, is based on the predicate suggestion approach by Abedjan and Naumann, which aims at enriching RDF datasets by a set of rule mining approaches [3, 2]. The approach uses FP-growth [69] to find patterns and distinguishes between subject mining, predicate mining, and object mining. Subject mining aims at finding similar subjects and rules like George Washington → Lyndon B. Johnson by analyzing the used predicates. If subjects share the same predicates it is more likely that they describe similar things, e.g., persons or presidents. Predicate mining uses the information about predicates which occur together on subjects to generate rules such as {associatedBand, instrument} → associatedMusicalArtist. Such rules identify co-occurrence patterns which can subsequently be used for recommendations. The last type of mining analyzes the co-occurrence of objects and constructs rules like Buenos Aires → Argentina if they are frequently stored in the same subject. For the recommendation process proposed in [3], the predicate and object mining is taken into account. For computing predicate recommendations, the approach uses a two-dimensional predicate-to-predicate matrix including the association rule confidence values, which are finally added up to generate a ranking score. This ranking strategy is equal to a SnoopyConcept ranking strategy described in Section 3.4.3 which would only consider the confidence value for ranking. The authors showed that the approach was able to reconstruct randomly removed predicates in a DBpedia dataset with a precision of up to 64% at 5 recommendations. In comparison to the SnoopyConcept, the approach proposes a very basic ranking algorithm. We extended the proposed algorithm by applying SnoopyConcept ranking strategies (cf. Sections 3.4 and 3.5) and conducted an evaluation of the extended algorithm [184]. In Section 6.1.2 the evaluation results prove that the enrichment by context, as proposed in the SnoopyConcept, significantly increases the accuracy of the recommendations. More details about the approach and its improvement can be found in Section 5.4.
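The confidence-summing ranking described above can be sketched as follows; the small set of subjects and all helper names are invented, and the FP-growth mining step is replaced by a simple pairwise count for brevity.

    from collections import defaultdict
    from itertools import permutations

    # Invented knowledge base: predicates used per subject.
    subjects = {
        "Austria":   {"capital", "population", "area"},
        "France":    {"capital", "population", "area", "headOfState"},
        "Innsbruck": {"population", "area", "mayor"},
    }

    # Pairwise co-occurrence counts and per-predicate support.
    cooc = defaultdict(lambda: defaultdict(int))
    support = defaultdict(int)
    for preds in subjects.values():
        for p in preds:
            support[p] += 1
        for p, q in permutations(preds, 2):
            cooc[p][q] += 1

    def recommend(present, k=3):
        """Rank candidate predicates by summing the confidences
        conf(p -> q) = cooc(p, q) / support(p) over all present predicates p."""
        scores = defaultdict(float)
        for p in present:
            for q, c in cooc[p].items():
                if q not in present:
                    scores[q] += c / support[p]
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(recommend({"population", "area"}))   # 'capital' ranked first for this toy data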

The concept of incorporating the user already during the insertion process is still a largely unexplored research area and not widely used in information systems. Nevertheless, the presented work and the fact that Wikidata introduced the "property suggestor" are strong indicators that the user has become more important during the insertion process and thus is incorporated to increase the quality and quantity of knowledge stored in a semi-structured information system.


2.5 Database Models

The presented SnoopyConcept and its recommendation algorithms are independent of the underlying storage technology. We propose three models in Section 5 to implement the SnoopyConcept. In this section we introduce the underlying database models, namely relational databases, document-oriented databases and graph databases.

2.5.1 Relational Database Systems

Relational Database Systems are based on the Relational Model which was introduced by Codd in 1970 [42]. Despite its age, it is still the most popular database model and has a long tradition in the database ecosystem. In the DB-Engines database model ranking, which can be seen in Figure 2.5, the Relational Model still has a popularity score of over 80%. The score is calculated by measuring mentions on web pages, entries in discussion forums, the number of job offers and mentions in professional social networks such as LinkedIn22. As these measurements might have a bias towards modern movements which are more present on the internet, in contrast to older, mature systems, it is even more remarkable that this model reaches such a high popularity score.

Figure 2.5: Ranking scores per database model, popularity measures based on internet information retrieval methods, Source: DB-Engines.com, Jun 2017

The Relational Model stores information in mathematical relations which are connected by relationships. Every relation conforms to a predefined schema.

22http://www.linkedin.com, accessed 2017-07-17


Popular representatives are Oracle23, DB224, Microsoft SQL Server25 and MySQL26. The Relational Model is strongly connected with the Structured Query Language (SQL). This quasi-standard for querying information stored in a Relational Database System is also used to create schemata and data. The declarative language is very flexible and allows creating filtered, aggregated and converted views of the data stored in the database system. It is easy to learn and thus also very popular with non-developers such as management or controlling staff. SQL can be seen as the figurehead of relational systems and therefore is also used in the name of the biggest counter-movement to relational systems—NoSQL—which is described in the following section.
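As a small illustration of the fixed schema and the declarative querying style, the following sketch uses Python's built-in sqlite3 module; the table and column names are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Every relation conforms to a predefined schema.
    cur.execute("""CREATE TABLE university (
                       name     TEXT PRIMARY KEY,
                       country  TEXT,
                       students INTEGER
                   )""")
    cur.execute("INSERT INTO university VALUES (?, ?, ?)",
                ("University of Innsbruck", "Austria", 27439))

    # Declarative, filtered and aggregated view of the stored data.
    cur.execute("SELECT country, SUM(students) FROM university GROUP BY country")
    print(cur.fetchall())   # [('Austria', 27439)]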

2.5.2 NoSQL: Document-oriented Stores

In contrast to the Relational Model, the term NoSQL does not describe a model but stands for a big movement which has grown dramatically within the last few years. The term, which is an abbreviation for "Not only SQL", is used for many different models and systems which want to break with the traditional monolithic database design of relational database systems. Stonebraker et al. argued already in 2007 [157] that the era of "one size fits all" databases is over and that more specialized data storage engines tailored to specific use cases will evolve. Especially NoSQL databases try to sacrifice unnecessary features to gain other advantages in return. For example, the softening of consistency constraints simplifies the distribution of data and thus enables scalability-related features.

One of the most frequently mentioned models behind the NoSQL movement is the so-called key-value store. It is a database system which stores arbitrary values under a corresponding unique key, which can be compared to a primary key in the relational model. The type of the value is not specified in detail and can be any binary format such as an integer, a string or an image. Document-oriented stores are extended key-value stores which store documents identified by unique keys. Documents are values which feature a structure. A document is similar to a tuple in the relational model, which is likewise identified by a unique key. The big difference is that the structure of a document is not specified in advance. A document consists of arbitrary

23 https://www.oracle.com/database, accessed 2017-07-17
24 http://www.ibm.com/db2, accessed 2017-07-17
25 https://www.microsoft.com/sql-server, accessed 2017-07-17
26 http://www.mysql.org, accessed 2017-07-17


fields which can be further structured to build complex hierarchies within a document.

Widely used representatives of document-oriented stores are, for example, MongoDB27, CouchDB28 and the full-text-search-focused system ElasticSearch29. One big difference to traditional databases such as Oracle or DB2 is the simplified handling and maintenance of NoSQL databases. The NoSQL databases are easy to install, run out of the box and do not need any additional tweaking or tuning for most use cases. This simplification was also one reason for the great success of these alternative databases and is also referred to as "no knobs" by Stonebraker et al. [157]. Furthermore, most of the NoSQL databases and all three named document-oriented databases do not comply with all ACID properties. They soften the ACID properties and guarantee only eventual consistency, which is often classified as providing BASE (Basically Available, Soft state, Eventual consistency) [134]. This is one key factor to enable the good scalability which is provided by most of the document-oriented stores. As strong consistency would result in a synchronous update mechanism of all replicas in the cluster and thus in a high latency for all writes, the only way to achieve low-latency writes in distributed systems is to soften the consistency guarantee. Furthermore, document-oriented systems hold all data of one document physically together. Relational systems store all information about one "document" normalized and distributed over several tables. This approach reduces the amount of needed disk space but increases the complexity when retrieving data in a distributed environment, due to the fact that the data has to be fetched and merged from different tables using a distributed join. Distributed joins are very inefficient as a lot of data has to be sent over the network [50]. Most document-oriented stores do not support joins at all, as related information is stored together in one document.
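A schema-free document as stored in such systems can be illustrated by a plain JSON-style structure; the field names are invented, and a comparable document could be inserted into a document store such as MongoDB without declaring a schema first.

    import json

    # Two documents in the same collection may use completely different fields;
    # nested structures keep all related information physically together.
    documents = {
        "univ-innsbruck": {
            "name": "University of Innsbruck",
            "country": "Austria",
            "students": 27439,
            "faculties": [{"name": "Mathematics"}, {"name": "Computer Science"}],
        },
        "iphone": {
            "type": "mobile phone",
            "weight_grams": 135,
            "manufacturer": "Apple Inc.",
        },
    }

    # The unique key plays the role of the primary key in the relational model.
    print(json.dumps(documents["iphone"], indent=2))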

These two big differences to relational systems—the document-oriented model and eventual consistency—provide the basis for distributed high-performance systems and enable a very important feature: scalability. Especially in the field of web-oriented systems, scalability is often more important than consistency. Most of these systems do not deal with critical data such as bank account data but have a high demand for scalability due to the nature of web growth. For such web projects, which expect very fast growth, it can be seen that "one size fits all" no longer holds and more specialized databases are needed.

27http://www.mongodb.com, accessed 2017-07-17 28couchdb.apache.org, accessed 2017-07-17 29http://www.elasticsearch.com, accessed 2017-07-17


Another class of NoSQL systems, described in the next section, are graph-based stores, which offer support for relations to connect information.

2.5.3 NoSQL: Graph Stores

Graph stores are often mentioned as a subclass of the new NoSQL movement. New graph stores can be classified like this; nevertheless, databases based on a graph model were already introduced in the 1960s. One of the first such databases was IDS, developed by Charles Bachman [16], who received the Turing Award in 1973. The goal of the IDS system was to provide a unified storage system for all systems at General Electric; IDS used a network model to store data. In contrast to the hierarchical model, which is more restrictive, the network model allows connecting objects by arbitrary relationships—and thus forms a graph. Due to limitations regarding the query facilities and the strong coupling between physical storage and processing of data, the relational model took over in the late 1970s [42]. In the 1990s, object-oriented databases—another type of graph databases—evolved as the object-oriented paradigm became more popular [44, 105, 12]. Also more sophisticated graph-based models such as the Hypergraph and the Hypernode model [106, 131] were introduced. The Hypernode model allows nested graphs to encapsulate data and is still used in new graph-based database systems. Nevertheless, object-oriented systems did not become generally accepted and the relational model has remained the predominant model until now. However, within the last decade the graph model has become more important again due to the nature of mass-collaboratively created content and network-based data on the internet. Especially social media and linked data stimulate the development of graph-based databases and revive this very traditional data model. Nowadays, most modern graph databases implement a variation of the Hypernode model or use the RDF model, both of which are described in the following sections.

Hypernode/Property Graph Model

The basic mathematical model of a graph is defined as follows: an ordered pair G = (V, E) comprising a set of nodes/vertices V = {1, 2, . . . , n} and a set of edges E ⊆ V × V. Consequently, two nodes can be connected by more than one edge as long as the edges differ (e.g. in their direction). Moreover, the definition also allows self-loops, i.e. edges connecting a node with itself.

A Hypergraph is a generalization of the basic graph in which an edge can connect an arbitrary number of nodes. Formally it is defined as a pair H = (V, E) where V is a set of nodes/vertices and E is a set of non-empty subsets of V, called hyperedges.

Figure 2.6: Example of an RDF subject University of Innsbruck

An extension of this definition is the Hypernode model [131], which defines that every node can contain a nested subgraph. A related graph model is the property graph model which is used, for example, in Neo4j30. It is similar to the Hypernode model as it allows additional information on nodes and edges in the graph. In the property graph model, this additional information is represented as key-value pairs which can be attached to nodes and edges.
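A property graph for the example of Figure 2.6 could be sketched with the networkx library, attaching key-value pairs to nodes and edges; the library choice and the attribute names are illustrative and not tied to any particular graph database.

    import networkx as nx

    # Directed multigraph with key-value properties on nodes and edges.
    G = nx.MultiDiGraph()
    G.add_node("University of Innsbruck",
               established=1669, numberOfStudents=27439, numberOfFaculties=16)
    G.add_node("Austria")
    G.add_node("Europe")
    G.add_edge("University of Innsbruck", "Austria", label="country")
    G.add_edge("Austria", "Europe", label="locatedIn")

    # Nodes that act as objects can be reused as subjects of further edges.
    print(G.nodes["University of Innsbruck"]["established"])        # 1669
    print([(u, v, d["label"]) for u, v, d in G.edges(data=True)])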

RDF Graph Model

The Resource Description Framework (RDF) [104] enables storing information about a certain subject as property-value pairs, e.g. the number of students at a certain university could be stored as numberOfStudents: 20.000. One such fact is called a triple as it consists of (subject, predicate, object). Subjects and predicates are usually specified by URIs to uniquely identify them. The object can be a literal or a URI that links to another resource. By applying this simple structure, any arbitrary graph structure can be represented. Due to its simplicity and flexibility, RDF has become the de facto standard to represent knowledge.

Information about the University of Innsbruck could be structured using RDF as shown in Table 2.2 or, respectively, in Figure 2.6. The node Austria illustrates another important feature of graphs and RDF: the node, which is used as an object connected via the property country, contains further

30https://neo4j.com/developer/graph-database/#property-graph, accessed 2017-07-17

connections and therefore acts as a subject as well. Thus, nodes can be reused and connected in any arbitrary way.

University of Innsbruck
  country              Austria
  numberOfStudents     27,439
  numberOfFaculties    16
  established          1669

Table 2.2: Example of a subject

By using such triples to represent information, all stored information is machine-readable and can therefore, e.g., further be used for automatic reasoning tasks and complex structured search facilities. To query RDF data, the query language SPARQL [135] has become widely used and is supported by many graph databases. RDF is also used in the SnoopyConcept, which is explained in more detail in the following chapter.
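The example of Table 2.2 expressed as RDF triples and queried with SPARQL can be sketched with the rdflib library; the namespace URI is invented for illustration.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")   # invented namespace
    g = Graph()

    uibk = EX["University_of_Innsbruck"]
    g.add((uibk, EX.country, EX.Austria))
    g.add((uibk, EX.numberOfStudents, Literal(27439)))
    g.add((uibk, EX.numberOfFaculties, Literal(16)))
    g.add((uibk, EX.established, Literal(1669)))

    # SPARQL query over the small graph.
    results = g.query("""
        SELECT ?s ?students WHERE {
            ?s <http://example.org/numberOfStudents> ?students .
        }""")
    for row in results:
        print(row.s, row.students)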

2.6 Summary

In this chapter we discussed and introduced related work and technologies used throughout the course of this thesis. The first section consists of an introduction to knowledge representation and the semi-structured format which is used by the SnoopyConcept. Challenges in the area of mass-collaborative curation of knowledge, using the example of wiki systems and knowledge bases, are discussed in the second section. Especially the goals of the SnoopyConcept, a simplified editing process, the navigation in the dataset, and the maintenance of a homogeneous structure, are covered. As the SnoopyConcept leverages recommender systems to tackle those challenges, we subsequently introduced recommender systems in the third section. The very young research field of recommender systems in the area of information systems is presented in the fourth section. The last section is dedicated to database models that can be used as an underlying basis for all presented SnoopyConcept storage approaches and concepts. In the following chapter we present the SnoopyConcept and its algorithms.

CHAPTER 3

The SnoopyConcept

There are many approaches which aim at increasing the quality and quantity of knowledge in an information system. The majority of these approaches enhance the stored knowledge after it was entered into the information system. This is a complex task, especially when dealing with semantic enhancement. In this chapter, we introduce the SnoopyConcept, which aims at increasing the quality and quantity of knowledge by incorporating the user who inserts data into the information system. The concept exploits the knowledge of the user and tries to "snoop" as much information as possible. This incorporation of the user already at an early stage increases the quantity of information while decreasing the heterogeneity of structure, which improves the search capabilities of the information system. The foundation and the benefits of the SnoopyConcept are discussed in the following sections.

3.1 Key Idea: Incorporate the User

Considering a domain expert who inserts new knowledge into an information system, it is highly important to facilitate direct communication with her already during the insertion process to exploit her expertise in the respective domain. Take the example of a user who inserts information about the iPhone mobile phone into a semi-structured information system as shown in Table 3.1.

All three entered entries are incomplete and require further refinement to be understood by a computer. The first problem arises with the term "handy", which is only used by German-speaking people as a synonym for "mobile phone". The second line, "weight", does not contain any information about the unit, such as grams, ounces, kilos, etc. The last property, the manufacturer, is specified as "apple".

iPhone
  type            handy
  weight          135
  manufacturer    apple

Table 3.1: Example of a set of information about the iPhone

The task of completing all information or enhancing the information to more distinguishable facts, which could lead to the enriched entry shown in Table 3.2, is very trivial for a human being. The common-sense inferences that 135 kilos or ounces is too heavy for a mobile phone, that the manufacturer cannot be a fruit, and the knowledge that a "handy" is a mobile phone are obvious for human beings. These rules are very trivial, although a computer system which is able to solve these problems would be considered very smart (cf. [113]). If we consider a more complex example, e.g. inserting information about a recently discovered moon of Pluto, the task becomes even harder for human beings and terribly difficult for computer systems.

iPhone
  type            mobile phone
  weight          135g
  manufacturer    Apple Inc.

Table 3.2: Guided version of the iPhone entry

The SnoopyConcept aims at solving these problems by identifying missing semantic information at an early stage and consulting the person who probably has the most extensive knowledge about the entry which is currently inserted—the user herself. The following three simple questions asked by the information system may lead to a completed and enriched entry which can be used for further automatic processing by computer systems:

• Many other people used type:mobile phone... are you sure about the term handy?


• Other users used gram, ounces or tons for their property weight. Which unit do you want to use?

• We found two other entries in the system: Apple Inc. and Apple (Fruit). Do you want to link the property manufacturer to one of them?

There is no automatic process which is able to offer the same or better performance than the user herself when dealing with semantic issues such as the meaning of the entered value Apple. Thus, the key idea of the SnoopyConcept is to exploit the knowledge of the collaborating users already during the insertion process—just a few clicks for the user, but a really time-consuming task for the community (cf. Section 2.2) and an even more difficult task for a computer system.

Another goal of the SnoopyConcept is to support the users in the creation of a common, homogeneous structure. The task of maintaining a homogeneous structure in an information system is very demanding. Boulain et al. [26] showed that only 35% of all edits within Wikipedia are related to content, whereas all other edits aim at enhancing the structure within the Wikipedia knowledge base. Additionally, Wu and Weld [179] showed that infoboxes which adhere to predefined templates are nevertheless divergent and noisy. The SnoopyConcept aims at coping with this problem by incorporating the user into the structure alignment process. This is realized by suggesting highly suitable structures to the user during the insertion process. Consider again the example of inserting information about Kerberos, a moon of Pluto which was discovered in 2011. Assume that the user already inserted the information that Kerberos is a satellite of Pluto as shown in Table 3.3.

Kerberos
  Satellite of    Pluto

Table 3.3: User just started to insert information about the moon Kerberos

The entered triple is already sufficient to compute other highly suitable attributes which can be suggested to the user. For example, the properties "Discovered by", "Discovery date", "Named after", "Orbital period" or "Apparent magnitude" are recommended to the user as shown in Table 3.4. The user is not tempted to introduce new properties or structures to describe the moon, as she only has to choose suitable properties from the recommendation list. If the user did not retrieve any recommendations, she could introduce synonymous properties, such as "Discoverer" or "Discovered on", which would introduce new structures that decrease the homogeneity and thus impede the search capabilities.

Kerberos
    Satellite of          Pluto

    Recommendations for Kerberos
    Discovered by         Example: Douglas Adams
    Discovery date        Example: February 18, 1930
    Orbital period        Example: 90465 days
    Named after           Example: Fritz Important
    Apparent magnitude    Example: 26.1 +/– 0.3
    Known satellites      Example: 5

Table 3.4: Recommended properties based on Table 3.3

The user has to analyze and recognize all recommendations and has to decide whether a suggestion is useful in the current context. This recognition is sped up by example values, as they help to illustrate the context and type of the property. Consider the example of the property "Known satellites", e.g., for describing planets. If the user chooses to insert information about the satellites, she may insert the total number of satellites but may also insert a comma-separated list of the satellites' names, which would break a common structure. Therefore, the example value "5" is necessary to help the user understand the meaning and type of the property "Known satellites" as fast as possible and to prevent type conflicts in the system.

Figure 3.1: SnoopyConcept workflow of curating knowledge (the user fills the insertion form for a resource; the recommendation computation, optionally drawing on external sources, returns recommendations and refinements; the result is inserted into the semi-structured knowledge base)

In summary, the SnoopyConcept enables the user to insert information as simply and efficiently as possible. More precisely, the SnoopyConcept exploits the laziness of the user, who retrieves suitable recommendations and therefore does not need to invest additional thought into the structure. By using this user-centric insertion approach, which is also sketched in Figure 3.1, the following benefits can be achieved:

• Avoid proliferation of structures (cf. Section 3.3.1)

• Avoid synonyms in the system (cf. Section 3.3.2)

• Semantic refinement by resolving homonyms (cf. Section 3.3.3)

• Exploit user’s extensive and valuable knowledge (cf. Section 3.3)

• Increase the quantity of information contained in the system (cf. Section 3.3.1)

• Increase the quality of information in the system (cf. Section 3.3)

All recommendations aim at supporting the user, exploiting the knowledge of the user and therefore "snooping" as much information as possible. Thus, the quantity and quality of stored information are increased. The underlying measures and approaches of the SnoopyConcept enabling these benefits are discussed in the following section. The algorithms are explained in detail in Section 3.4.

3.2 Data Model

The SnoopyConcept is based on the semi-structured data model. Essentially, the SnoopyConcept proposes to model information and knowledge as subject-property-value triples, which is based on the concept of RDF [104]. This format enables users to store information about a certain subject as property-value pairs, e.g., the number of students at a certain university can be stored as numberOfStudents: 20,000. Information about a certain subject (also called resource in RDF or article in the context of wikis), e.g. the University of Innsbruck, could be structured as shown in Table 3.5.

By using such triples to represent information, all stored information is machine-readable and can therefore, e.g., be further used for automatic reasoning tasks and complex structured search facilities. Furthermore, as the data model is compatible with the RDF model, the data can be exported as valid RDF at any time, which ensures high interoperability with other systems.

University of Innsbruck
    country             Austria
    numberOfStudents    27,439
    numberOfFaculties   16
    established         1669

Table 3.5: Example of a subject
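As an illustration, such a subject and its property-value pairs can be serialized as RDF-style triples in a few lines; the following is a minimal sketch (the namespace and serialization are assumptions for illustration only, not the actual SnoopyDB export):

    # Minimal sketch: a subject is a set of property-value pairs; exporting it
    # as N-Triples-style lines is a simple mapping.
    BASE = "http://example.org/"  # hypothetical namespace, for illustration only

    def to_ntriples(subject: str, properties: dict) -> str:
        """Serialize a subject and its property-value pairs as triple lines."""
        lines = []
        for prop, value in properties.items():
            lines.append(f'<{BASE}{subject}> <{BASE}{prop}> "{value}" .')
        return "\n".join(lines)

    innsbruck = {
        "country": "Austria",
        "numberOfStudents": "27,439",
        "numberOfFaculties": "16",
        "established": "1669",
    }
    print(to_ntriples("University_of_Innsbruck", innsbruck))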

3.3 Recommendations

The key enabler for the described benefits within a knowledge repository based on the SnoopyConcept is a recommender system [138]. Essentially, a recommender system analyses all information stored within the system to subsequently provide its users with useful recommendations. Traditionally, recommender systems are used in online shops where customers are pointed to further products (cf. Section 2.3).

In the context of the SnoopyConcept, the recommender system suggests suitable structures the user might want to use (cf. Table 3.4 in Section 3.1). However, not only properties (structures) are recommended to the user: the SnoopyConcept also proposes to recommend values, links, types, input formats and other refinements. These recommendation types and their benefits are described in detail in the next sections.

3.3.1 Recommending Structure

The term structure recommendation refers to the recommendation of additional properties during the insertion process and is the most important feature of the SnoopyConcept, as it significantly contributes to a common and homogeneous schema within the system. We propose an additional ranking mechanism to ensure that more popular or more suitable properties are preferred in the list of recommendations, in order to use the limited visual space of a user interface efficiently and to cope with the limited cognition of the user. The ranking algorithm is described in detail in Section 3.3.5.


These recommendations are computed on the fly and are based on the properties just entered by the user as well as all properties already used in the knowledge base. Any additionally specified property or accepted property recommendation during the insertion process results in the recomputation and refinement of all structure recommendations for the current subject.

In contrast to classical fixed-schema information systems, the common schemata in a SnoopyConcept-enabled system are very dynamic, as they are based on all stored subjects and therefore are influenced by every newly stored subject and its properties. Every newly stored property is automatically taken into the set of possible property recommendations and can influence further recommendations to other users who want to enter information about a similar subject. The similarity of subjects is solely defined by the properties used on these subjects, as subjects do not feature an explicit, predefined type or schema (e.g., the type "University" or "Building"). The type of a subject is rather defined solely by its properties and hence, the type or category system within Snoopy is neither predefined nor rigid. Instead, the implicit, continuously changing types within Snoopy are dynamically computed based on the information/properties stored about a subject. Such an approach is also widely used for classification or object identification in the domain of schemaless data such as Linked Open Data [125]. In this domain of schemaless data, the analysis of structures is used to identify and reconcile objects or to find similar objects of the same type.

Due to this described flexibility and the fact that users are free to modify the recommended structure, the system cannot guarantee a completely unified and aligned schema. Nevertheless, the user is guided towards a common schema without being restricted in her way of structuring information or extending existing schemata. Therefore, in the best case, the SnoopyConcept does not require any schema matching [153] after the insertion of data. The alignment is done implicitly during the insertion process by the user with her extensive knowledge. The user is always more powerful than any automated alignment algorithm, as the algorithm does not have all the information that is available to the user (e.g. interlinked information that is stored in the human brain). Furthermore, in the case of multiple semantically similar properties which are all used within the system, the community decides which one is more appropriate by using the according property and hence 'voting' for it. This way, a property becomes more popular, gets recommended and hence, used.

Furthermore, the recommendation of structure increases the quantity of information, as the recommendations indicate "missing" bits of information and ease the input of new information. In the mentioned example of the domain of universities, the system could recommend the property rector. By providing such additional property recommendations, the user is encouraged to enter more information than she originally intended to insert, and the valuable knowledge of the user is once more exploited. Such information gathered by the "snooping" process would be lost without recommendations, cannot be completed by any machine afterwards, and therefore enhances the information system dramatically.

All details about the recommendation algorithm can be found in Section 3.4.

3.3.2 Avoiding Synonyms by Recommendations

The usage of synonyms is a severe challenge in semi-structured information systems, as synonyms can dramatically decrease the search capabilities of the information system (cf. Section 2.4.1). The SnoopyConcept tackles this issue by applying the following recommendation procedure. During the specification of further content, the user is supported by an intelligent auto-completion feature. The system suggests suitable properties to the user which have already been used within the information system. Consider a user who started entering the property number. The system subsequently suggests all previously used properties in the system which are related to the term number. In this case, the system would suggest numberOfStudents and numberOfFaculties. In most cases the user accepts such a recommended property if it is suitable in the respective context. In this example, the user would implicitly be prevented from inserting a new synonymous property, such as numberFaculties. A more severe challenge in information systems lies in coping with syntactically different synonyms, as they cannot be resolved by string-based matching approaches. Consider the example of a user entering Faculty; a string-based matching approach would not recommend the already entered property Academic Staff. By using a thesaurus, Faculty can be matched with the semantically equivalent, already existent property Academic Staff, which can then be suggested to the user. This approach is heavily dependent on the quality of the thesaurus. Especially in very broad information systems or frequently changing domains, it is very difficult to find appropriate thesauri. Novel developments that extract thesauri from public community-maintained sources such as Wikipedia [118, 123, 80] may help to cope with this problem. Regardless of the used technologies, the list of auto-completion recommendations may be very long, e.g., there are over 400 properties in the DBpedia dataset that contain the substring number. To increase the accuracy of recommendations, the set of auto-completion suggestions can be filtered and ranked by applying the current context of the subject, similar to the approach of structure recommendations explained in the previous Section 3.3.1. For example, if the property rector was already inserted to the subject, the properties numberOfStudents and numberOfFaculties are more appropriate than the property numberOfMoons, which can be determined by analyzing similar subjects that contain the property rector. The procedure can be implemented analogously to the approach in Section 3.3.1, which is described in detail in Section 3.4.
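The context-aware auto-completion described above can be sketched as follows; this is a minimal illustration, not the actual SnoopyDB implementation, and the property names, usage counts, context boosts and the tiny thesaurus are made up:

    # Minimal sketch of prefix matching plus thesaurus lookup, ranked by global
    # usage and boosted by co-occurrence with the subject's current context.
    SYNONYMS = {"faculty": {"academic staff"}}  # hypothetical thesaurus entries

    def autocomplete(prefix, existing_properties, usage_counts, context_boost):
        """Suggest already used properties matching the prefix or a synonym."""
        prefix = prefix.lower()
        candidates = set()
        for prop in usage_counts:
            if prefix in prop.lower():
                candidates.add(prop)
            # thesaurus match: syntactically different but semantically similar
            if any(syn in prop.lower() for syn in SYNONYMS.get(prefix, ())):
                candidates.add(prop)
        candidates -= set(existing_properties)  # do not suggest existing properties
        return sorted(candidates,
                      key=lambda p: usage_counts[p] + context_boost.get(p, 0),
                      reverse=True)

    usage = {"numberOfStudents": 800, "numberOfFaculties": 300, "numberOfMoons": 50}
    boost = {"numberOfStudents": 1000, "numberOfFaculties": 1000}  # co-occur with rector
    print(autocomplete("number", ["rector"], usage, boost))
    # ['numberOfStudents', 'numberOfFaculties', 'numberOfMoons']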

3.3.3 Semantic Refinement and Value Recommendations

The guidance of the user is not limited to properties; possible values can be suggested as well. This mechanism has the advantage of providing the user with values in an already aligned form, which prevents the user from entering synonyms of already existing values. This can be achieved by using the same mechanisms used for the recommendation of properties as previously described.

In contrast to properties that are present in a common dictionary, values often contain proper names or strings which cannot be found in a dictionary. To prevent typos and misspellings in this context, the recommendation of already present values is crucial as the usage of a dictionary can solve this problem only partially.

Furthermore, not only the value itself can be recommended; the system can additionally suggest semantic links between values and other subjects. This measure copes with the common challenge of preserving the semantic "correctness" of homonyms. If the user, e.g., specifies the city of Freiburg as a twin city of Innsbruck, it is not clear whether the user refers to Freiburg in Germany or Freiburg in Switzerland. Within the SnoopyConcept, this problem is resolved by recommending possible semantic links from the entered value Freiburg to already existent, semantically equivalent subjects (e.g., Freiburg, Germany and Freiburg, Switzerland). The user is then able to specify the meaning just by accepting the appropriate recommendation and therefore add more contextual information to the entry by creating a semantic link to the respective subject. As the user has extensive knowledge about the content to be inserted, she can provide more semantic information than any automated extraction process can do afterwards. Even if the user is not aware of ambiguities (e.g. the second city Freiburg), she is pointed to the ambiguity problem and should be able to make a correct decision based on already present knowledge or by considering external sources. Using these measures, homonyms are further semantically enhanced by humans, which leads to a high confidence of semantic data in the system.


3.3.4 Validation and Recommendations of Data Types

The SnoopyConcept also proposes to extend the analysis of values in the information system by a validation process, which includes determining a data type for each newly entered value; e.g., the value for the property numberOfStudents is asserted to be an integer value. Vice versa, if a property is added that already exists and has a data type assigned (due to the usage of this data type by the majority of other property instances), the user is prompted to enter values according to this data type. As already explained in the previous sections, displaying example values indicates the correct or frequently used data type and avoids incorrect data types at an early stage. The detection of data types, especially in the case of numeric types, is also very important regarding search capabilities, as it is crucial for queries based on numeric evaluation. E.g., the query "List all universities having more than 10,000 students" is only possible if the value of the property numberOfStudents is stored as a numeric value. Furthermore, other data types such as date, time, HTML, file, image, audio or video are possible and lead to special behavior in the user interface according to the data type (e.g., date picker, calendar views, content-based image search, mp3 metadata search, etc.). Additionally, the syntactic correctness is validated according to the respective data type (e.g., correct date format).
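The benefit of numerically typed values for such range queries can be illustrated with a short sketch; the subject data below is made up and only shows the idea:

    # Minimal sketch: only values stored with a numeric type can answer range
    # queries such as "all universities with more than 10,000 students".
    subjects = {
        "University of Innsbruck": {"numberOfStudents": 27439},        # typed integer
        "Example College":         {"numberOfStudents": "approx. 9k"}, # untyped string
    }

    def universities_with_more_than(threshold):
        for name, props in subjects.items():
            value = props.get("numberOfStudents")
            if isinstance(value, int) and value > threshold:  # numeric comparison
                yield name

    print(list(universities_with_more_than(10000)))  # ['University of Innsbruck']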

All these measures enable the user to enter information fast and efficiently by just accepting recommendations, while at the same time additional, semantically equivalent properties and values are avoided. Furthermore, the amount of unified information, the confidence and the amount of semantic data in the system are increased.

3.3.5 Ranking

Due to the fact that both the cognition of the user and the space available for displaying the recommendations are limited, the size of the set of recommended properties is restricted. In most cases a set of 5–10 recommendations is most appropriate, which also corresponds to the capacity of short-term memory [115] and furthermore can be perceived very quickly in a user interface. The problem of choice overload has also been addressed by Bollen et al. [25], who state that top-5 recommendations are easy for the user to choose from. The goal of the ranking algorithm is to push the most suitable recommendations out of the (probably large) set of recommendation candidates into the limited list of suggestions which are actually shown to the user. To achieve this task, the ranking algorithm has to take multiple impact factors into account which influence the suitability of each recommendation candidate in the given context.

The most important influence factor is the confidence of the recommended item, which defines how suitable the recommended property is in the current context based on data already stored in the system. The context may consist of multiple dimensions, such as the information already inserted for the subject by the user, the user herself respectively the user's profile, and the collected data about the user.

The usage behavior of the property itself also has to be considered by the algorithm. For example, if only the popularity of a property is taken into account, novel properties will not be recommended due to their rare usage in the information system (cf. Section 2.4.1). Therefore, to overcome this "chicken or the egg" dilemma, the SnoopyConcept proposes to randomly recommend novel properties to increase the probability of them being chosen by the user. The concept of serendipity is well known; for example, Bono [47] proposed "lateral thinking" to avoid selective and sequential thinking and to accept accidental aspects that seem not to have relevance or simply are not sought for. Within the last years, serendipity has also become more important in recommender systems to escape the filter bubble [163] and discover new elements [190, 82].

This usage behavior can be further analyzed, and a potential rise in popularity of a property may indicate its importance. Consider the example of an information system about cars. Twenty years ago, information about the fuel economy and CO2 emissions of motor vehicles was not very common. Nowadays, such information is very common and even required by law. For example, the rise in popularity of a property about CO2 emissions, which could become very important and popular after new legislation, can potentially be recognized by an intelligent information system and used in recommendations at an early stage to point users to this new, very popular but missing piece of information.

The detailed ranking algorithm of the SnoopyConcept can be found in Section 3.4.2. In the following section, the technical implementation of all recommendation algorithms of the SnoopyConcept is described in detail.

3.4 Recommendation Algorithm

In this section we discuss the algorithms that are used to recommend suitable structures to the user as previously described. The basic algorithm of the SnoopyConcept is based on an association rule approach [7, 8], which was adapted for the mining of relations between properties. Association rules are used to compute so-called frequent item sets, which were originally used in the area of business intelligence to find items that are frequently bought together. Formally, to compute a frequent itemset we define the set of all items in our database I = {i1, i2, ..., in} and the set of all transactions T = {T1, T2, ..., Tm}. An arbitrary transaction Tk ⊆ I contains a subset of all items in I and represents the shopping basket of a customer. Based on this transaction database, the goal is to calculate association rules, which are implications X → Y, where X and Y are subsets of I, and X and Y do not have any items in common (X ∩ Y = ∅). A transaction T confirms to a rule X → Y if both X and Y are contained in T — respectively, if they are found together in one shopping basket and therefore are bought together. Consider the example rule diapers, nuts → beer, which indicates that a customer who has diapers and nuts in his basket is going to take beer as well (there are different opinions about the truthfulness of this example in reality [132]).

Finding truthful association rules is a very demanding task, as the number of potential rules corresponds to the number of all permutations of articles in all transactions/baskets. All potential rules have to be checked for their accuracy, i.e., whether they are valid in as many transactions as possible. Most business cases define a minimum support for rules to decrease the number of candidate rules and reduce the rule space which has to be analyzed.

The approach based on association rules can be easily mapped to the recommendation problem of the SnoopyConcept and is described in the following section.

3.4.1 Basic Recommendation Algorithm

The basic recommendation algorithm of the SnoopyConcept is deduced from the concept of association rules as described in the previous section and is defined as follows. The item set I = {p1, p2, ..., pn} consists of all properties occurring in the Snoopy system, and a transaction sj ⊆ I comprises all properties occurring together within the subject sj. The set of all subjects forms the transaction database S = {s1, s2, ..., sm}. Based on this transaction database, the goal is to calculate association rules, which are implications X → Y, where X is a property and Y is another property which co-occurs with X on the same subject. For performance reasons (cf. Chapter 5), the head X and tail Y of all rules consist of exactly one property, in contrast to the original set-based definition of association rules [7]. This constraint reduces the complexity and the number of permutations and therefore allows a simplified and fast computation of rules.


Algorithm 1 is based on the association rule approach described above and shows the generation of the set R consisting of all rules in the form of (pa, pb, c) with the head property pa, the tail property pb and an additional counter c which indicates the number of subjects which support the respective rule. If a user, e.g., specified the properties name, location and numberOfStudents on the same subject, the pairs (name, location), (name, numberOfStudents) and (location, numberOfStudents) are formed. If the same property is specified multiple times in the same subject — for example when specifying lists — the property is only considered once; respectively, the support value is not influenced by multiple occurrences of the same property in one subject. For performance reasons regarding the search of rules, every rule is stored directed and therefore is stored in the system twice in different order. Thus, the ruleset R contains the rule (pa, pb, c) and the reversed rule (pb, pa, c).

Input:  set S of all subjects
Output: set R of rules

R ← ∅
foreach sj ∈ S do
    foreach pk ∈ sj do
        foreach pl ∈ sj do
            if pk ≠ pl then
                f ← false
                foreach (pa, pb, c) ∈ R do
                    if pa = pk ∧ pb = pl then
                        R ← R \ {(pa, pb, c)}
                        R ← R ∪ {(pa, pb, c + 1)}
                        f ← true
                    end
                end
                if f = false then
                    R ← R ∪ {(pk, pl, 1)}
                end
            end
        end
    end
end
return R

Algorithm 1: Computation of recommendation rules
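As an illustration, the rule computation of Algorithm 1 can be sketched in a few lines of Python; this is a minimal sketch that keeps the rule set as a dictionary of directed property pairs (the subject data is made up), not the actual SnoopyDB implementation:

    from itertools import permutations
    from collections import Counter

    def compute_rules(subjects):
        """Count, for every directed property pair (pa, pb), the number of
        subjects on which both properties co-occur (cf. Algorithm 1)."""
        rules = Counter()
        for properties in subjects.values():
            for pa, pb in permutations(set(properties), 2):  # duplicates counted once
                rules[(pa, pb)] += 1
        return rules

    subjects = {
        "University of Innsbruck": ["name", "location", "numberOfStudents"],
        "TU Wien": ["name", "location", "rector"],
    }
    print(compute_rules(subjects)[("name", "location")])  # 2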

The goal of the basic recommendation algorithm of the SnoopyConcept is to compute further suitable properties for an arbitrary subject which already contains some properties, based on frequently used patterns. Classical association rule based mining algorithms such as the Apriori [7, 8] or FP-growth [69] algorithm try to find the most used patterns and do not consider rarely used rules or rules which show a very low support. Such an approach is optimized for the domain of business intelligence and data warehousing but is not suitable for the SnoopyConcept, as recommendations of structures also have to be present for infrequently maintained subjects or a very specialized topic which consists of only a few subjects. Consider the information about the moons of Pluto stored in Wikipedia. Only a few articles respectively subjects are stored in the system which contain information about this topic. Even in this rarely changing area a recommender system has to be able to suggest suitable structures. For performance reasons, most classical association rule based mining algorithms require support and confidence thresholds and therefore are not able to consider such rarely changing areas.

Another very important property of a recommender system is its accuracy. In contrast to the classical field of application of association rule based mining algorithms, such as retail, the accuracy of online recommender systems is directly related to the dynamics of the recommender system. An online recommender system has to reflect a user's recent activity to keep the accuracy of recommendations very high [181, 46]. Consider the video sharing platform YouTube and a user who is watching videos. It is crucial that the recommender system takes the recent activity of the user into account [46]. This can be achieved by adapting the user's profile by adding watched videos. This adaptation takes place in user profiles only and does not change any recommendation/association rules on the fly. Therefore, the real-time adaption of the recommender system is achieved by small and fast changes on the user profiles. The complex and very expensive task of updating association rules can be carried out in the background or regularly every night by analyzing, e.g., all log data [46].

There are, however, other use cases which place higher demands on the update performance of the recommender system. Consider a photo tagging platform (cf. SnoopyTagging in Section 5.2) and a user who tags photos in the system. If the user introduces a new tag, she will expect that the recommender system considers her new tag and recommends it when she tags the next photo. In this use case, it is imperative that the recommender system executes updates in real time and considers new tags immediately. Thus, to be up to date, the recommender system has to adapt not only the user profiles but also has to consider the behavior of all users and update the association rules by incorporating this behavior in real time or near real time.

Especially when considering mass-collaboration systems, the real-time computation of association rules cannot be executed within reasonable time, as the computation of all rules in a very large system is very expensive. An alternative to the recomputation of all rules (full update) is the incremental computation of rules based on updates in the system. Depending on the real-time demands of the use case, the incremental adaptions of rules can be computed asynchronously and thus can be executed whenever computational resources are available. For example, YouTube regularly analyzes log files to adapt its recommender systems [46]. When considering the presented SnoopyConcept approach, an incremental update mechanism of the recommender system can exploit the simple form of the used association rule format, which consists of one property in the head part and one property in the tail part of a rule. Algorithm 2 shows an incremental update algorithm based on the presented rule approach.

Input:  set R of rules, properties PSi of the edited subject Si,
        updated property pu, update type typeu (removed or added)
Output: updated set Rnew of rules

Rnew ← R
foreach pk ∈ PSi do
    found ← 0
    foreach (head, tail, c) ∈ Rnew do
        if (head = pk ∧ tail = pu) ∨ (head = pu ∧ tail = pk) then
            found ← 1
            Rnew ← Rnew \ {(head, tail, c)}
            if typeu = removed ∧ c ≠ 1 then
                Rnew ← Rnew ∪ {(head, tail, c − 1)}
            end
            if typeu = added then
                Rnew ← Rnew ∪ {(head, tail, c + 1)}
            end
        end
    end
    if typeu = added ∧ found = 0 then
        Rnew ← Rnew ∪ {(pk, pu, 1), (pu, pk, 1)}
    end
end
return Rnew

Algorithm 2: Incremental computation of recommendation rules based on updates in a SnoopyConcept based system

To implement the incremental update, two different cases have to be considered — namely the addition of a new property which was not already used in the subject, and the removal of a property that is no longer used in the subject. The decision which procedure has to be executed can be made locally by only taking the properties of the edited subject into account. Also the next step, the increase or decrease command (the counter increment or decrement in Algorithm 2), can be generated locally. These increase and decrease commands can be stored to be processed at a later point in time. All stored commands can be processed regularly in a batch-like job when the system is not under heavy load. It is also possible to group all update commands by rule and thus decrease the number of update statements on the large rule set. Due to this incremental update algorithm, which can also be executed asynchronously, the recommender system can be realized as a near real-time system. This real-time behavior is very important, as the accuracy of recommender systems is strongly connected with their acceptance by the users. Users explore recommendations to test how the recommender system behaves and evaluate if the recommendations are "useful" for them [75]. Therefore, it is very important to build trust by incorporating the user's behavior and demonstrating to the user that the recommender system adapts to the user's needs [41].
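The incremental update of Algorithm 2 can be sketched as follows, reusing the dictionary representation of the rule set from the previous sketch; this is a minimal illustration under those assumptions, not the actual implementation:

    from collections import Counter

    def update_rules(rules, subject_properties, updated_property, added):
        """Incrementally adjust the directed co-occurrence counts when one
        property is added to or removed from a subject (cf. Algorithm 2)."""
        delta = 1 if added else -1
        for pk in subject_properties:
            if pk == updated_property:
                continue
            for pair in ((pk, updated_property), (updated_property, pk)):
                rules[pair] += delta
                if rules[pair] <= 0:      # drop rules that are no longer supported
                    del rules[pair]
        return rules

    rules = Counter({("name", "location"): 2, ("location", "name"): 2})
    update_rules(rules, ["name", "location"], "rector", added=True)
    print(rules[("name", "rector")])  # 1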

The design of the basic algorithm of the SnoopyConcept, which is described in the following part, tries to consider all previously discussed properties of recommender systems and provides the highest flexibility regarding real-time adaptions and computability to satisfy the requirements of real-time scenarios as described above.

Algorithm 3 shows the basic step of the SnoopyConcept recommendation algorithm to compute the recommendation candidates. The input for this algorithm consists of the previously computed set of all rules R (see Algorithm 1) and an arbitrary subject Si. The set of properties belonging to a certain subject Si is denoted as PSi. The final set C will contain all suitable properties (recommendation candidates) for the subject Si.

For each property pi occurring on the input subject Si, all rule triples (pa, pb, c) where pi is the head of the rule are detected. Rules containing a tail which is already present in Si are not considered. All found rules are subsequently compressed (cf. the second foreach loop in Algorithm 3) to pairs which represent the recommendation candidates. Every recommendation candidate pair holds a suitable property, which was extracted from the tails of all found rules, and an additional value c which indicates the suitability of the respective property. This value c is defined as the sum of the count values of all rules which pointed to the respective property. This number indicates the probability that the property is suitable in the current context of the subject Si and its already present properties PSi. The higher the value c, the higher the probability that the recommendation candidate is appropriate in the current context.

Input:  PSi, R
Output: set C of all recommendation candidates for Si

C ← ∅
T ← ∅
foreach pi ∈ PSi do
    foreach (pa, pb, c) ∈ R do
        if pa = pi ∧ pb ∉ PSi then
            T ← T ∪ {(pa, pb, c)}
        end
    end
end
foreach (pa, pb, ci) ∈ T do
    if ∃(px, c) ∈ C, where px = pb then
        C ← C \ {(px, c)}
        C ← C ∪ {(px, c + ci)}
    else
        C ← C ∪ {(pb, ci)}
    end
end
return C

Algorithm 3: Computation of recommendation candidates

After having detected this (probably large) set of recommendation candidates, the most important and therefore most useful recommendations for the user have to be extracted. This is done by ranking algorithms, which are explained in the following part.
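Before turning to ranking, the candidate computation itself can be sketched in a few lines; the following is a minimal Python version of Algorithm 3 under the assumption that the rule set is kept as a dictionary of directed pairs, as in the earlier sketches:

    from collections import Counter

    def recommendation_candidates(subject_properties, rules):
        """Aggregate, for every property not yet present on the subject, the
        counts of all rules whose head is one of the subject's properties
        (cf. Algorithm 3)."""
        present = set(subject_properties)
        candidates = Counter()
        for (head, tail), count in rules.items():
            if head in present and tail not in present:
                candidates[tail] += count
        return candidates

    rules = Counter({("name", "location"): 2, ("name", "rector"): 1,
                     ("location", "rector"): 1})
    print(recommendation_candidates(["name", "location"], rules))
    # Counter({'rector': 2})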

3.4.2 Ranking Algorithms

The set of recommendation candidates computed as described in Section 3.4.1 is probably very large. Due to the fact that both the cognition of the user and the space available for displaying the recommendations are limited, the size of the set of recommended properties is restricted. To show the top-n suitable properties, an efficient ranking algorithm has to be applied to the large set of recommendation candidates. The naive ranking is performed by sorting the pairs by the value c (see Algorithm 3), which indicates the suitability of the property. The value is defined as the sum of the support values of all rules which point to the respective property.

The value c is an absolute number which is based on the number of subjects which confirm to the respective rule. Therefore, rules with a high support value are ranked very high, even if there were more suitable rules with a lower support.

rule                    count
inhabitants → mayor       100
elevation → mayor         100
name → mayor              100
name → website         50,000

Table 3.6: Example of rules and corresponding count values

property        global count
website              100,000
mayor                  1,000
inhabitants              800
elevation                500

Table 3.7: Global occurrences of properties

New York City
    name           New York City
    inhabitants    21,001
    elevation      33 feet

Table 3.8: Example subject about New York

Consider the example in Table 3.6, which lists some example rules and corresponding support values (the number of subjects which confirm to the respective rule). Table 3.7 lists properties and their total number of occurrences in the system. If these example rules and the example subject listed in Table 3.8 are input to the recommender system, the recommendation candidates and the according c-values are:

{(mayor, 300), (website, 50000)}

After applying the ranking algorithm, the property website is ranked higher than the property mayor, as the single rule pointing to website has a very high c-value. This would indicate that website is more appropriate in the given context. When using this approach, strong but often trivial rules which confirm to a large set of subjects (e.g. the property name appears on every subject) are hard to "beat" by other, more appropriate rules. For example, mayor might be the better recommendation in this context, as it is not as obvious as name or website. This problem can be traced back to the ordering, which is based on the total sum of appearances and neglects the context of the recommendation — the subject itself. Especially when showing only a limited number of recommendations (e.g., ten properties due to user interface constraints), such trivial rules have a huge impact on the accuracy of recommendations, as they would always occupy the first ten ranked positions. To cope with this problem, the algorithm has to be adapted to use a relative score which takes into account the number of rules which feature a property candidate, as described in the following section.

3.4.3 Context-Sensitive Optimization

Recommendations are always computed in a specific context and should be as accurate as possible in the respective context. In the scope of the SnoopyConcept, one dimension of the context is the subject itself and its structure. As the previously presented Algorithm 3 heavily relies on the global popularity of co-occurring properties (confidence), the incorporation of the context further augments and emphasizes the importance of rule suitability for the respective subject and weakens the influence of globally popular rules. This is achieved by extending Algorithm 3 to provide two additional scores which can be used for a more sophisticated ranking strategy. Thus, the set of recommendation candidates consists of triples in the form of:

(p, ccontext, cconfidence)

The score ccontext of property p indicates the number of rules which feature property p. cconfidence is similar to the count score already present in Algorithm 3 but takes the total number of global occurrences of the respective property into account and thus mitigates very strong rules.

Equations 3.1 and 3.2 show the computation of the score ccontext of a given property pcandidate of the set of recommendation candidates. For each property pi which was already stored in subject Si (the subject which is currently edited by the user), we check whether there exists a rule with the head pi and the tail pcandidate and count all occurrences of matching rules.

compute_context(pcandidate) = Σ pi∈PSi feature(pi, pcandidate)    (3.1)

feature(rhead, rtail) = 1 if (rhead → rtail) ∈ R, 0 otherwise    (3.2)

Equation 3.4 shows the computation of cconfidence of a given property pcandidate. For each property pi which was already stored in subject Si, we take the count value of the rule pi → pcandidate divided by the global count value of pi. The resulting sum specifies the confidence of the property candidate pcandidate.


compute_confidence(pcandidate) = Σ pi∈PSi confidence(pi, pcandidate)    (3.3)

confidence(rhead, rtail) = count(rhead → rtail) / count(rhead)    (3.4)

Subsequently, the computation of recommendation candidates as shown in Algorithm 3 can be extended by incorporating cconfidence and ccontext, as presented in Algorithm 4. The new function getGlobalPopularity(pi) returns the number of global occurrences of a given property pi. For example, considering Table 3.7, the function call getGlobalPopularity(website) would return 100,000.

Input:  PSi, R
Output: set C of all recommendation candidates for Si

C ← ∅
foreach pi ∈ PSi do
    foreach (pa, pb, c) ∈ R do
        if pa = pi ∧ pb ∉ PSi then
            cpopularity ← getGlobalPopularity(pi)
            if ∃(px, ccontext, cconfidence) ∈ C, where px = pb then
                C ← C \ {(px, ccontext, cconfidence)}
                C ← C ∪ {(px, ccontext + 1, cconfidence + c / cpopularity)}
            else
                C ← C ∪ {(px, 1, c / cpopularity)}
            end
        end
    end
end
return C

Algorithm 4: Computation of recommendation candidates by considering context and confidence
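A minimal Python sketch of this context/confidence computation (Equations 3.1–3.4 and Algorithm 4) is given below; the rule counts and global counts in the toy data are made up for illustration and are not meant to reproduce the tables of this chapter exactly:

    def candidates_with_context(subject_properties, rules, global_counts):
        """For every candidate property, count the matching rules (context
        score) and sum rule count / global count of the rule's head property
        (confidence score), as defined in Equations 3.1-3.4."""
        present = set(subject_properties)
        scores = {}
        for (head, tail), count in rules.items():
            if head in present and tail not in present:
                context, confidence = scores.get(tail, (0, 0.0))
                scores[tail] = (context + 1,
                                confidence + count / global_counts[head])
        return scores

    # Made-up toy data, for illustration only.
    rules = {("inhabitants", "mayor"): 100, ("elevation", "mayor"): 100,
             ("name", "website"): 50_000}
    counts = {"inhabitants": 800, "elevation": 500, "name": 100_000}
    print(candidates_with_context(["name", "inhabitants", "elevation"],
                                  rules, counts))
    # mayor: context 2, confidence 0.325; website: context 1, confidence 0.5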

Consider again the example subject in Table 3.8 and the available rules in Table 3.6. When applying the adapted Algorithm 4 to the example, the resulting set of property candidates is:


{(mayor, 3, 100/800 + 100/500 + 100/1000), (website, 1, 50000/100000)} ≡ {(mayor, 3, 0.425), (website, 1, 0.5)}

Using the previous Algorithm 3, the property website with a score of 50,000 always "beats" the property mayor with a score of 300. In contrast, the new algorithm balances the confidence scores to the very similar values of 0.425 and 0.5 and furthermore indicates a strong suitability for the property mayor by its context score of 3.

Both values ccontext and cconfidence are very valuable and should be incorporated by the ranking algorithm. This can be achieved by using either a simple ordering by two attributes or a weighted and normalized combination. The simple ordering sorts all entries based on the first attribute, e.g., ccontext, and takes the second attribute, e.g., cconfidence, into account if two entries feature the same ccontext. The normalized and weighted ranking normalizes both values to the range between 0 and 1 (feature scaling [10]) and subsequently applies a weighting function as follows:

total score = α · ccontext + (1 − α) · cconfidence with α ∈ [0, 1]

The weighting factor α, which can be chosen in the range between 0 and 1, indicates whether the context score or the confidence score has the higher influence on the final score. For example, a weighting factor of 0.5 indicates a balanced weighting between both scoring values. The final ranking is subsequently computed by using the combined score value.

Our evaluation results in Section 17 showed, for the used dataset, that the normalized and weighted ranking is not able to perform better than the simple ordering by two attributes. Thus, the overhead of normalizing and weighting can be omitted, which results in a performance gain of more than 30% (cf. Section 4.4).
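The two ranking variants can be sketched as follows; a minimal illustration assuming candidates are given as (property, ccontext, cconfidence) triples, with made-up values:

    def rank_by_two_attributes(candidates):
        """Simple ordering: sort by context first, break ties by confidence."""
        return sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)

    def rank_weighted(candidates, alpha=0.5):
        """Weighted ranking on min-max normalized scores:
        total = alpha * context_norm + (1 - alpha) * confidence_norm."""
        def normalize(values):
            lo, hi = min(values), max(values)
            return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
        ctx = normalize([c[1] for c in candidates])
        conf = normalize([c[2] for c in candidates])
        scored = [(p, alpha * x + (1 - alpha) * y)
                  for (p, _, _), x, y in zip(candidates, ctx, conf)]
        return sorted(scored, key=lambda s: s[1], reverse=True)

    candidates = [("mayor", 3, 0.425), ("website", 1, 0.5)]
    print(rank_by_two_attributes(candidates))        # mayor ranked first
    print(rank_weighted(candidates, alpha=0.7))      # mayor first with more weight on context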

3.4.4 Refinement Recommendations

The previously introduced algorithms aim at recommending structure, respectively properties. In this section we describe approaches to compute refinement recommendations, which incorporate the user already during the insertion process to increase the quality of data before it is stored in the information system, as proposed in Sections 3.3.3 and 3.3.4.

Refinement recommendations guide the user to refine values and therefore increase the quality of the inserted values. The first type of recommendation guides the user to specify the type of a value. For this, we apply pattern matching using predefined regular expressions for, e.g., integers, strings, and URLs to automatically determine the type of the value in our reference implementation SnoopyDB, described in Section 5.1. Furthermore, we provide manually curated unit patterns to recognize, e.g., weight or volume units. Matched types and units are suggested to the user, who can validate, approve, or adapt the proposal. Predefined conversion formulas automatically convert typed values internally to a unified SI base unit [84]. For example, SnoopyDB unifies all weight values internally to kilograms. This unified typed value is subsequently used for search and thus increases the search capabilities.
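The regex-based unit detection and conversion can be sketched as follows; the pattern and conversion factors are illustrative assumptions, not SnoopyDB's actual rules:

    import re

    # Minimal sketch of regex-based type/unit detection with conversion to a
    # base unit (kilograms).
    WEIGHT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g|lb)\s*$", re.IGNORECASE)
    TO_KILOGRAM = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

    def detect_weight(value: str):
        """Return (numeric value in kg, original unit) if the value looks like
        a weight, otherwise None; the caller can then suggest the detected type."""
        match = WEIGHT_PATTERN.match(value)
        if not match:
            return None
        amount, unit = float(match.group(1)), match.group(2).lower()
        return amount * TO_KILOGRAM[unit], unit

    print(detect_weight("135 g"))    # (0.135, 'g') -> stored internally in kg
    print(detect_weight("Austria"))  # None -> not a weight value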

The second type of recommendation aims at reusing already present values or subjects. To achieve this, an auto-completion mechanism proposes entries while the user is typing a new value. The recommendations are computed by applying a full-text search using, e.g., n-grams [39] on all values and subjects in the information system and subsequently applying a ranking according to how often each matched item is used. Additionally, we use a thesaurus [114] to improve the auto-completion by searching for syntactically different but semantically similar terms. Furthermore, a dictionary-based spell checker is used to suggest corrections to the user to avoid typos or misspellings. Both measures aim at preventing the introduction of additional synonyms.

If those full-text based recommendations contain matching subjects, the recommendations can be used to propose semantic links which resolve ambiguities as described in Section 3.3.3. The system suggests all matching subjects and proposes to choose one to set a semantic link to the respective subject. By accepting a proposed subject, the system creates a semantic relationship between the value/object and the respective subject, which enhances the plain-text value entry by specifying its meaning.

All these recommendations incorporate the user and aim at increasing the quality of stored information. Our user experiments presented in Section 6.1.3 show that those refinement recommendations are accepted by the user and result in a higher quality of information stored in the system.


3.5 Personalizing the SnoopyConcept

The basic SnoopyConcept algorithm pursues a "wisdom of the crowd" approach which does not incorporate the behavior of the currently active user in the system. Thus, the computation of recommendations is based on the collective knowledge only and does not consider the habits of a single user. This approach is well suited for community-based knowledge bases such as Wikipedia due to the democratic, community-based decision process. In Wikipedia, the community regulates the suggested structure which should be used to create articles, respectively to store knowledge in the information system. Other systems and use cases allow their users more freedom when inserting data into the information system. Consider a large tagging system, which provides the possibility to categorize objects by using tags. It is obvious that the tag vocabulary should be kept as small as possible to reduce synonyms and minimize the long-tail distribution of tags in the information system. This can be achieved by encouraging the user to reuse already present tags in the information system. To accomplish this task, a recommendation system which uses the basic Snoopy algorithm can be used. However, in such a scenario the personal preferences of a very committed user are not considered. If the user, for example, is tagging photos which were shot in Innsbruck, she always uses the tag City:Innsbruck. If the algorithm is only based on the collective knowledge, the tag City:NewYork would be recommended, as it is more often used than the tag City:Innsbruck, even if the less used tag is more appropriate in the context of the user who is interested in photos about Innsbruck. This observation has triggered the incorporation of an additional context, namely the context of the user's preferences. Especially for recurring users, the extension of the Snoopy algorithm by the context of the user's preferences is very important and increases the acceptance rate of the recommendation system. Therefore, in this section we discuss an extension of the basic Snoopy algorithm which incorporates a user modeling approach.

3.5.1 User Modeling

User modeling is a traditional research area [96, 57, 30] which aims at building and refining user profiles which are subsequently used to adapt an information system to its users. The challenge of modern information systems was summed up by Gerhard Fischer [57] as follows: "The challenge in an information-rich world is not only to make information available to people at any time, at any place, and in any form, but specially to say the 'right' thing at the 'right' time in the 'right' way." This challenge is tackled by so-called user-adaptive systems which use user modeling strategies to filter or adapt their information based on experience with their users.

The very mature research area of user modeling can be traced back to Allen, Cohen, Perrault and Rich in 1979 [96, 140, 43, 127]. Since then, many developed user modeling techniques have been used and adapted by the recommender system community due to the very similar goals of user modeling and recommender systems. These adaptions can also be mapped to the challenges in the SnoopyConcept and are discussed in the following section.

3.5.2 Algorithm Extension by User Modeling

Due to the described limitation of a recommendation approach which is only based on global knowledge, the algorithm was extended by a user modeling approach. We introduced this extension of the algorithm in [63] (see also Section 5.2); it takes the user's context and behavior into account. This is realized by incorporating the user's co-occurrence space to personalize all types of recommendations. The user's co-occurrence space is a subset of the global co-occurrence space and consists of all subjects which are related to the user. The decision whether an arbitrary subject is related or non-related to a specific user depends on the use case of the information system. For example, subjects may be related to their authors, their creators or users who starred, liked or somehow declared interest in the subject. In the SnoopyTagging prototype (see Section 5.2), subjects that were created by the specific user were defined as "related" to the user. The generalized extension of the SnoopyConcept algorithm (see Algorithm 3 in Section 3.4) to incorporate the user's co-occurrence space is shown in Algorithm 5. In contrast to the basic algorithm, we now distinguish between two sets, namely global, which incorporates all subjects of the information system, and user, which incorporates only subjects which are related to the specific user. Therefore, we have to compute two candidate sets as shown in the extended Algorithm 5. The first set Cglobal is computed by applying Algorithm 4 and contains all recommendation candidates which are based on the global rule set denoted by Rglobal. In the second step, the user's specific rule set Ruser, which only consists of rules that are related to the respective user u, is used for the computation of the candidate set Cuser. Furthermore, the function getGlobalPopularity(pi) is replaced by the personalized function getUserSpecificPopularity(pi, u). The return value of this function defines how often a given property pi was used by the specified user u. The computation of ccontext and cconfidence, which is used in both candidate sets, corresponds to the already described procedure of Algorithm 4, which incorporates the context of the currently edited subject.


Input:  PSi, Rglobal, Ruser, u
Output: sets Cglobal and Cuser of all recommendation candidates for Si and user u

Cglobal ← ∅
Cuser ← ∅
foreach pi ∈ PSi do
    foreach (pa, pb, c) ∈ Rglobal do
        if pa = pi ∧ pb ∉ PSi then
            cpopularity ← getGlobalPopularity(pi)
            if ∃(px, ccontext, cconfidence) ∈ Cglobal, where px = pb then
                Cglobal ← Cglobal \ {(px, ccontext, cconfidence)}
                Cglobal ← Cglobal ∪ {(px, ccontext + 1, cconfidence + c / cpopularity)}
            else
                Cglobal ← Cglobal ∪ {(px, 1, c / cpopularity)}
            end
        end
    end
    foreach (pa, pb, c) ∈ Ruser do
        if pa = pi ∧ pb ∉ PSi then
            cpopularity ← getUserSpecificPopularity(pi, u)
            if ∃(px, ccontext, cconfidence) ∈ Cuser, where px = pb then
                Cuser ← Cuser \ {(px, ccontext, cconfidence)}
                Cuser ← Cuser ∪ {(px, ccontext + 1, cconfidence + c / cpopularity)}
            else
                Cuser ← Cuser ∪ {(px, 1, c / cpopularity)}
            end
        end
    end
end
return Cglobal, Cuser

Algorithm 5: Computation of recommendation candidates by considering user and global context, based on Algorithm 4

For the final ranking algorithm, we have to prepare both candidate sets Cglobal and Cuser as shown in Algorithm 6. This preparation step consists of the normalization (feature scaling) of ccontext and cconfidence in the corresponding candidate set to the range between 0 and 1. Normalizing all involved values to this range makes it possible to easily combine these values in further computations and ranking algorithms.


Input:  C
Output: C′

cconfidence_min ← undefined
cconfidence_max ← undefined
ccontext_min ← undefined
ccontext_max ← undefined
foreach (p, ccontext, cconfidence) ∈ C do
    cconfidence_min ← min(cconfidence_min, cconfidence)
    cconfidence_max ← max(cconfidence_max, cconfidence)
    ccontext_min ← min(ccontext_min, ccontext)
    ccontext_max ← max(ccontext_max, ccontext)
end
foreach (p, ccontext, cconfidence) ∈ C do
    C′ ← C′ ∪ {(p, (ccontext − ccontext_min) / (ccontext_max − ccontext_min),
                   (cconfidence − cconfidence_min) / (cconfidence_max − cconfidence_min))}
end
return C′

Algorithm 6: Normalization of context and confidence scores

As Algorithm 5, which computes candidates based on the global and the user's scope, returns two sets of potential candidates with corresponding context and confidence scores, we have to adapt the ranking algorithm to incorporate both sets and compute the most suitable recommendations. We propose to use a hybrid scoring function as follows:

total score = γ · scoreglobal + (1 − γ) · scoreuser with γ ∈ [0, 1]

The weighting factor γ defines the weight of the final score between the user-specific and the global set of recommendations, as shown in Algorithm 7 below. The additional weighting factors α and β regulate the balance between the context and confidence score in the respective scope (global or user), according to the ranking described in Section 3.4.3.
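As a minimal sketch of this hybrid scoring (Algorithm 7 below gives the pseudocode), assuming both candidate sets have already been normalized as in Algorithm 6; the tag names, scores and weights are illustrative only:

    def hybrid_score(global_candidates, user_candidates, alpha, beta, gamma):
        """Combine the normalized global and user-specific candidate scores:
        total = gamma * score_global + (1 - gamma) * score_user; if a candidate
        is unknown in the user's scope, only the global score is used."""
        ranked = {}
        for prop, (ctx_g, conf_g) in global_candidates.items():
            score_global = alpha * ctx_g + (1 - alpha) * conf_g
            if prop in user_candidates:
                ctx_u, conf_u = user_candidates[prop]
                score_user = beta * ctx_u + (1 - beta) * conf_u
                ranked[prop] = gamma * score_global + (1 - gamma) * score_user
            else:
                ranked[prop] = score_global
        return sorted(ranked.items(), key=lambda item: item[1], reverse=True)

    global_c = {"City:NewYork": (1.0, 1.0), "City:Innsbruck": (0.6, 0.5)}
    user_c = {"City:Innsbruck": (1.0, 1.0), "City:NewYork": (0.2, 0.1)}
    print(hybrid_score(global_c, user_c, alpha=0.5, beta=0.5, gamma=0.1))
    # City:Innsbruck is ranked first once the user scope dominates (gamma = 0.1)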

Input:  Cglobal, Cuser, α, β, γ
Output: set C of all recommendation candidates for Si

C ← ∅
foreach (pg, ccontext_g, cconfidence_g) ∈ Cglobal do
    γ′ ← γ
    scoreglobal ← α · ccontext_g + (1 − α) · cconfidence_g
    if ∃(pu, ccontext_u, cconfidence_u) ∈ Cuser, where pu = pg then
        scoreuser ← β · ccontext_u + (1 − β) · cconfidence_u
    else
        γ′ ← 1
    end
    total score ← γ′ · scoreglobal + (1 − γ′) · scoreuser
    C ← C ∪ {(pg, total score)}
end
return C

Algorithm 7: Hybrid ranking of candidate computation based on global and user context

Detailed information about the behavior of the ranking with respect to the three weighting factors α, β and γ can be found in Section 6.2. In our experiments, a value of γ = 0.1 turned out to be beneficial; the user's scope was thus much more important than the global scope. Although this is strongly dependent on the use case of the information system, it shows that incorporating the user scope can improve the quality of recommendations drastically and can outperform the default Snoopy algorithm. The quality of global recommendations is heavily dependent on the data that has already been stored in the system. Especially at the beginning, when the information system is almost empty, recommendations can be very inaccurate. This problem is called the cold-start problem and is described in the following section.

3.6 Tackling the Cold-Start Problem

The cold-start problem is a typical challenge of recommender systems [149], especially when using collaborative filtering approaches. Consider a movie rental system which recommends suitable movies to a user. A recommendation algorithm of such a system could be based on movie ratings and user profiles. Such a scenario is a typical use case for collaborative filtering approaches, which compute recommendations by finding similar user profiles and incorporating user reviews of movies. These recommendation algorithms presume a sufficient amount of user ratings of movies and extensive user profiles. If we consider a movie which has not been rated or rented yet, the recommendation algorithm cannot recommend this movie, as no information about it is present. This issue can be solved by introducing a "new movies" section or by using content-based and metadata-based (e.g. genres) recommendations (cf. [99]). This behavior is realized by so-called Switching Hybrid Recommender Systems [90], which switch the recommendation method based on the expected quality of the recommendations. Therefore, the recommender system can switch to a fallback method if insufficient data is present to compute suitable recommendations. The more severe problem is the computation of recommendations for a new user who has just registered. The system does not have any profile information about this new user after the registration and therefore cannot suggest any suitable movies to the user.

This cold-start problem is also present when recommending structures to the user. Consider an information system which implements the SnoopyConcept and computes structural recommendations during the insertion process. Four possible cold-start types can be identified:

• Empty information system

• New (empty) subject

• New user

• New properties

The first type “empty information system” occurs when the SnoopyConcept based system is just installed and it does not contain any data yet. In this state, the recommendation algorithm has no information at all and is not able to give any recommendation to the user. Therefore, the system is very vul- nerable during this state as all information which are inserted to the system during this phase will influence the recommendations in the future. First users are therefore very powerful as they may bias the information system. Con- sider a simply typo in a property which was added by the first user. Due to the cold-start problem, the system does not have sufficient data to cope with this typo and accepts the information as ground truth. Recommendations to all further users will contain the incorrectly spelled property. If all users just accept recommendations and do not recognize the typo, the property will be- come mature and it will be more often used in recommendations. Especially typos can be multiplied over years as they are simply overlooked by many users. The same problem holds for a property which was used in an improper context. Such biased recommendations cannot be handled by the system auto- matically. Thus, it is very crucial that in this phase the inserted information is well structured as this inserted structure is continuously suggested to all users. Typos can be avoided by spell checkers but the avoidance of improper semantic usage is very hard and requests a more sophisticated approach. To tackle this problem, two approaches are imaginable. The obvious approach is a manual approach which is done by a very committed community or admin- istrators who have an extensive domain knowledge. They can fill the system with a suitable ground-truth dataset which represents an optimal structure

This approach is only successful if the domain of the information in the information system is limited. Considering a use case such as Wikipedia, introducing a suitable set of properties is very difficult, as all possibilities have to be covered. In such an extensive scenario, the administrative staff can try to fix wrongly used properties as early as possible. These tasks can be supported by a spell checker or by the usage of ontologies to locate problems and point the administrative staff to them for a closer look. This manual approach of initially filling the system with ground-truth data was also used in the evaluation (cf. Section 6.1.3) to overcome the cold-start problem.

The second possible approach is an automated one which is based on an external source that can be exploited by the recommendation algorithm. For example, Wikipedia, DBpedia, or already existing enterprise information systems within a company can be exploited to recommend possible properties or structures while the system is still empty and not able to compute suitable recommendations based on its own data. The prerequisite for the usage of such pre-existing systems is that they exhibit a common structure which is enforced by a fixed predefined schema or maintained by an administrator or by a community, e.g., the community of Wikipedia (cf. Section 2.2). One could also fill the system with automatically extracted information from external systems, such as Mail2Wiki [70], which extracts information from emails. As the recommendations heavily rely on the initial data, the quality of the recommendations is directly related to the quality of the automated extraction process. Therefore, manually well-curated data is more suitable for tackling the cold-start problem than automatically generated data.

A combination of the manual and the automated approach was chosen to realize an importer for the SnoopyDB [130]. This importer allows an administrator to import arbitrary JSON APIs, which are very common and provided by many online platforms. Such an importer can also be used to overcome the "empty information system" problem, as the system can be filled automatically with a suitable ground-truth set imported from a JSON-based API. The semi-automated process of the importer allows the user to define a mapping between the schema of the JSON data and the final schema of the SnoopyDB. Figure 3.2 shows a screenshot of the main step of the mapping process. The process is realized as an iterative process which complies with the insertion process in SnoopyDB. Thus, recommendations for the mapping are computed and suggested to the user who is defining the mapping. The raw attribute information (name and value) of the JSON data is used as the basis for the recommendation computation. Based on the name of an attribute, suitable synonyms or other keys which are already present in the system are suggested to the user. Step by step, the user may choose suitable attributes and define a mapping between the original attribute in the JSON object and the key in SnoopyDB.


In the example in Figure 3.2, the Rotten Tomatoes JSON API1 is used to initially import movies. The screenshot shows mappings already defined by the user, indicated by the arrows. Based on these mappings, additional suitable properties are recommended and can easily be added by choosing a field of the original JSON data from the drop-down field. An additional preview of entries in the JSON list (bottom right) supports the user in finding the correct mapping. After the successful definition of a mapping scheme, all subjects of the JSON API can be imported automatically. If there are any uncertainties, the importer can ask the user to clarify data (e.g., spelling issues or possibilities to link values to other subjects to improve the semantic information in the system). The prototype of the SnoopyConcept-based importer was published in [130].

Figure 3.2: SnoopyImporter web interface: importing Rotten Tomatoes

1http://www.rottentomatoes.com, accessed 2017-07-17


The second type of cold-start problem occurs when a user creates a new subject which does not contain any properties yet. In this case, the system cannot compute any recommendations based on the currently entered information. The SnoopyConcept proposes a switching hybrid recommender to tackle this problem and uses a fallback which recommends properties that are often used in the system. For example, a very popular property such as name or website, which is used by most subjects, can be recommended to the user. If the user who inserts information is well known to the information system, the recommendation engine can be extended by incorporating the user's context and behavior based on previously entered subjects. The computation of popular properties is then applied to the user's scope, and the recommendations contain properties which are often used by this user. Consider a user who has already inserted many photos of the location Innsbruck. When creating a new subject without any properties, the snoopyfied recommendation engine recommends the property and its value location:Innsbruck to the user, as the probability is very high that the user is entering another image of Innsbruck. A minimal sketch of such a popularity-based fallback is shown below.
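To make the fallback concrete, the following is only a minimal sketch in SQL, assuming a simplified triple table triples(subject, property, object) and a hypothetical created_by column recording the inserting user; neither the table layout nor the column name is prescribed by the SnoopyConcept (cf. the storage models discussed in Chapter 4).

-- Global fallback: recommend the most popular properties in the whole system.
SELECT property, COUNT(DISTINCT subject) AS usage_count
FROM triples
GROUP BY property
ORDER BY usage_count DESC
LIMIT 10;

-- Personalized fallback: restrict the popularity count to the user's scope
-- (assumes the hypothetical created_by column mentioned above).
SELECT property, COUNT(DISTINCT subject) AS usage_count
FROM triples
WHERE created_by = 'user42'
GROUP BY property
ORDER BY usage_count DESC
LIMIT 10;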

When incorporating the user's scope, the third cold-start problem, "new user", emerges, as new users do not have any profile which could be used in the recommendation computation. As the SnoopyConcept proposes a hybrid recommender system, this cold-start problem can be tackled by a switching approach which switches to a fallback method to compute recommendations. In this case, the fallback is to compute recommendations based on the global scope without performing any refinements based on user profiles. An improvement of this fallback handling is the exploitation of vague user profiles, for example based on the location retrieved by analyzing the user's IP address, or on other vague information that can be derived from domain knowledge (e.g., the referrer). The fallback approach based on global computations corresponds to the basic Snoopy algorithm without any user modeling extensions.

The last topic regarding the cold-start problem is concerned with the insertion of unknown properties. Consider a subject which consists only of properties that are unknown to the information system. This situation is very similar to the previously described type of empty subjects, as the recommender is not able to recommend any properties based on the currently inserted information. The same approach of recommending popular properties based on the global scope or on the personalized user's scope is possible. Furthermore, a thesaurus approach can be applied to try to understand the unknown properties. This can be realized by using an external thesaurus and finding synonyms or related words for the currently inserted properties. All found words which are semantically equivalent or related to the inserted properties can then be used as a basis for the default recommendation strategy to find suitable recommendations.


Summing up, the optimal strategy to cope with the cold-start problem is a mixed-initiative approach which involves a committed community supported by automated tools that cope with the large amount of data (cf. Section 2.4.2). Tools can analyze big data automatically and very fast and can point the administrative staff to potential problems. Especially complex semantic problems can only be detected at this scale by automated tools, but have to be solved by humans who are able to understand the complex semantic context.

3.7 Summary

In this chapter we presented the SnoopyConcept and its key idea of incorporating the user already during the insertion process to increase the quality and quantity of knowledge in the information system by guiding and encouraging the user to insert homogeneously structured data. Besides the underlying triple-based data model, the recommendation algorithm was explained. Furthermore, the extension of the algorithm with user modeling was described. Last, approaches to tackle the cold-start problem in information systems applying the SnoopyConcept were discussed. In the subsequent chapter we present different data models which can be used to implement the SnoopyConcept. Furthermore, implications and details of the implementation of the recommendation algorithms on top of the respective data model are discussed.

CHAPTER 4

SnoopyConcept Storage Models and Recommendation Implementation

In this chapter we discuss different implementation aspects of the SnoopyConcept and its algorithms. The implementation of the SnoopyConcept can be split into two major challenges.

The first challenge is to provide a fast storage and retrieval system for triples which hold predicate-object information about subjects. As the SnoopyConcept (cf. Section 3.2) is similar to the generic RDF model (cf. Section 2.1.3), all RDF-based databases and approaches may be used to implement the basic storage system for Snoopy triples.

The second and most important challenge of the SnoopyConcept is the implementation of a high-performance recommender system which is able to compute suitable recommendations (cf. Section 3.4) within milliseconds. This performance provides a real-time experience to the user and increases the user's perceived accuracy [136].

To tackle these challenges, we propose three different implementation approaches based on Relational Database Systems, Document-oriented Stores, and Graph Stores.

For every database model we discuss a suitable storage model and the computation of recommendations. We cover performance and scalability considerations which are crucial when dealing with the large datasets that are common in mass-collaboration systems. The last section consists of a performance evaluation of all proposed storage models.

4.1 Relational Database Systems

The Relational Model (see Section 2.5.1) is one of the most widely used database models and has therefore served as the basic layer of many RDF databases. Consequently, it was our first choice for implementing the SnoopyConcept. In the following sections we describe common RDF storage approaches and how we adapt them to develop the relational storage model for the SnoopyConcept.

4.1.1 RDF Storage Models

So-called Triple Stores based on the relational model have been realized using different mapping strategies. The two main classes of mapping strategies are schema-oblivious and schema-aware representations [164]. The schema-aware mapping makes use of a present RDF schema (RDF/S) which defines the structure of the RDF data. As each subject adheres to a predefined class in the schema, each class can directly be mapped to a corresponding relation (table) in the relational model. Each type/property is mapped to an attribute (column) in the corresponding relation. Thus, one tuple (row) in a relation represents one resource (subject) in the RDF representation. The schema-oblivious approach is feasible even when no RDF schema is present and the structure of the RDF data is not known in advance, which is the case for the SnoopyConcept. Each triple in the RDF data (subject, predicate, object) is mapped to one tuple in one big relation. This relation consists of three columns which represent subject, predicate, and object. The differences between the mapping approaches mainly show in terms of retrieval performance. The schema-aware approach is able to retrieve one resource by fetching only one row. When using the schema-oblivious approach, multiple rows have to be fetched to retrieve all properties of one single resource. On the other hand, the schema-aware approach might lead to many NULL values, as especially in collaboratively curated knowledge bases many properties are used only rarely (cf. Section 2.4.1). To cope with this sparsity problem, hybrid solutions were introduced which map frequently used properties to corresponding property columns and store all remaining sparse properties that do not occur frequently by using the schema-oblivious approach.
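To illustrate the two mapping strategies, the following sketch contrasts a schema-aware table for a hypothetical City class with the generic schema-oblivious triple table; table names, column names, and types are only illustrative and not taken from a concrete system.

-- Schema-aware mapping: one table per RDF/S class, one column per property.
CREATE TABLE city (
  subject    VARCHAR(255) PRIMARY KEY,
  name       VARCHAR(255),
  country    VARCHAR(255),
  population INT,
  mayor      VARCHAR(255)  -- rarely used properties cause many NULL values
);

-- Schema-oblivious mapping: one generic three-column table for all triples.
CREATE TABLE triples (
  subject   VARCHAR(255),
  predicate VARCHAR(255),
  object    VARCHAR(255)
);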

To realize such a hybrid approach, the RDF data has to be analyzed, which dramatically increases the complexity of the approach. This is one major reason why hybrid systems have not become widespread. Furthermore, the number of tables can be extremely high due to the potentially large number of classes in collaborative information systems. Especially traditional relational database management systems are not able to handle thousands of tables, which restricts the approach when using a classical RDBMS.

One approach to handling such a large set of relations and the problem of many NULL values was proposed by Abadi et al. in [1]. The approach makes use of column-oriented database systems which are able to handle large numbers of tables and columns. They propose to map classes to tables and each property to one column. Column-oriented database systems do not store data table- or row-wise but store each column as a separate data structure. This storage model allows compression algorithms to be applied column-wise and columns to be compacted more efficiently. Considering the problem of many NULL values in one column due to the fixed mapping of all properties to columns, such an approach is able to compact all NULL values by, e.g., a simple run-length encoding [158]. In contrast to classical relational database systems, column-oriented database systems are designed without any limits regarding the number of tables or the number of columns in one table. This feature is crucial for the proposed mapping, as RDF data usually contains large classes with hundreds of properties, which results in hundreds of columns in one table in the relational model. The big advantage of this vertical approach [1] lies in its reduced complexity, as one fixed mapping can be used while, at the same time, the problem of efficiently storing the data and the resulting NULL values is solved.

In general, due to the high complexity of schema-aware and hybrid approaches, the schema-oblivious approach has prevailed and is now the default approach for fast and modern Triple Stores based on the Relational Model. The schema-oblivious approach can be realized with different underlying storage models, such as row- or column-oriented storage models. When using a row-based approach, typically the three-column layout of a Triple Store is used. The performance penalty of mapping each property to its own row can be compensated by optimizations such as adding well-chosen index structures. Such indices are very important due to the nature of the relational representation. Especially the usage of a schema-oblivious mapping results in many self-joins, as all information is stored in one big table. Consider the example query "List all cities which have a female mayor and print the name of the city and the mayor" which is executed on one big triple table holding all information. The query transformed to SQL can be found in Listing 4.1 and consists of four self-joins to be able to retrieve all needed facts. The relation t1 (line 3) selects all entries which indicate that the subject is of type city. In the next step, the connection between the city and the mayor is resolved (line 5).


Listing 4.1: Trivial query transformed to SQL results in four self joins

1 SELECT t5.object as CityName, t4.object as MayorName
2 FROM triples t1, triples t2, triples t3, triples t4, triples t5
3 WHERE t1.predicate = 'type' AND t1.object = 'City'
4   AND t1.subject = t2.subject
5   AND t2.predicate = 'mayor' AND t2.object = t3.subject
6   AND t3.subject = t4.subject
7   AND t3.predicate = 'gender' AND t3.object = 'female'
8   AND t4.predicate = 'name'
9   AND t5.subject = t1.subject AND t5.predicate = 'name'

Relation t3 contains the gender of all mayors found via t2 and is subsequently filtered to female mayors by the selection in the WHERE clause (line 7). The join between t3 and t4 results in the set of all names (t4) of the found female mayors (t3). In the last step, the same approach is used to fetch the names of all found cities, which are contained in t5. Thus, the objects of t4 and t5, which contain the names of the mayors and the cities respectively, are projected and printed in the final result set. Due to this heavy usage of self-joins, indices are needed to execute such queries in an acceptable amount of time when using the schema-oblivious approach. Those index structures are discussed in the following section.

4.1.2 RDF Storage Index Structures

To access the triple-based storage described in the previous section as fast as possible, additional index structures are needed. Harth et al. proposed in [72] to create multiple indices to cover all access patterns of queries and to speed up query execution. They used a quadruple-based RDF storage system which, besides the subject-property-object triple concept, also takes the context (namespace/prefix) into account and therefore stores all RDF information chunks as quads in four columns. To speed up all possible queries, where any of subject, property, object, or context is either defined or a variable, they introduced 2^4 = 16 access patterns. To cover all access patterns, 16 indices are needed in a naive approach. Harth et al. also showed in [72] that all 16 patterns can be covered by only six indices when using B+-tree index structures. The resulting indices are spoc, poc, ocs, csp, cp, and os, where each letter denotes the corresponding column in the original table. A very similar approach was proposed by Wood et al. in [177], who store quads in six different orders to cover all possible queries. However, both approaches do not consider all possible permutations (4! = 24 when using quads) when taking into account the cyclic order of the indices. For example, there is no suitable index when querying all subjects of a given property, e.g., all cities where the property mayor is specified, as we would need an index with the combination ps to answer this query. This problem is tackled by Weiss et al. in the Hexastore approach [172].

They argue that most other RDF stores prioritize subjects or properties and therefore require specified property or subject values in queries in order to use their indices. Patterns which do not specify properties are very common in more complex queries, e.g., finding incoming edges/properties of a specified object, but are not addressed by most RDF stores. In [172], the concept of Hexastores is introduced, which pays equal attention to all RDF items and does not privilege any of them. This goal is reached by materializing all possible orders of the spo triple, which results in six indices (3! = 6): spo, sop, pos, pso, osp, and ops. Table 4.1 shows that these six indices are able to cover all 15 patterns which are possible in a query.

Pattern → Index (Hexastore)
s?? → spo or sop
sp? → spo
spo → *
so? → sop
sop → *
p?? → pos or pso
ps? → pso
pso → *
po? → pos
pos → *
o?? → osp or ops
op? → ops
ops → *
os? → osp
osp → *

Table 4.1: Coverage of Hexastore’s indices

Weiss et al. use vector-based data structures in their Hexastore approach [172]. Thus, in the spo index, a vector with all associated properties is attached to every subject. Furthermore, every property points to another vector which consists of the associated objects of the spo index. The advantage of storing all information in separate vectors is the possibility of reusing vectors. For example, the object vectors attached to sp in the spo index can be reused in the pso index, as the objects are defined by sp and ps independently of the order of subject and property. This approach reduces the amount of required disk space and allows fast join algorithms due to the sorted layout of the vectors. We use a very similar storage model in SnoopyDB, our reference implementation of the SnoopyConcept, which is described in more detail in the following section.


4.1.3 SnoopyConcept Relational Storage Model

We build our approach on the classical relational database model using a schema-oblivious mapping. Thus, each RDF triple is stored in a triple table consisting of three columns, namely subject, property, and object. For performance reasons and to save disk storage, we use a dictionary approach: only numeric ids are stored in the triple table, while the string representations or URIs of subjects, properties, and objects are stored in corresponding lookup tables. Furthermore, the lookup tables can be used to store statistics about the respective values which are used in the computation of recommendations, e.g., the global count of a property.
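A minimal DDL sketch of this dictionary layout is shown below. It is simplified (the reference implementation additionally stores, e.g., type, language, and further statistics such as distinct counts), and the types as well as the names dict_sub and dict_obj are only assumed for illustration, while dict_prop and its columns follow the listings later in this section.

-- Lookup (dictionary) tables: every string/URI is stored exactly once.
CREATE TABLE dict_sub  (sub_id  INT PRIMARY KEY, sub_text  TEXT, sub_count  INT);
CREATE TABLE dict_prop (prop_id INT PRIMARY KEY, prop_text TEXT, prop_count INT);
CREATE TABLE dict_obj  (obj_id  INT PRIMARY KEY, obj_text  TEXT, obj_count  INT);

-- Main triple table: holds only the numeric ids of the dictionary entries.
CREATE TABLE triples (
  sub_id  INT,
  prop_id INT,
  obj_id  INT
);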

Subjects
sub_id  sub_text                                    sub_count
1       http://dbpedia.org/resource/Innsbruck       4
2       http://dbpedia.org/resource/Austria         1

Properties
prop_id  prop_text                                   prop_count
1        http://dbpedia.org/ontology/country         1
2        http://www.w3.org/2000/01/rdf-schema#label  3
3        http://xmlns.com/foaf/0.1/homepage          1

Objects
obj_id  obj_text                               obj_count
1       http://dbpedia.org/resource/Austria    1
2       "Innsbruck"@en                         1
3       "Innsbruck"@de                         1
4       http://innsbruck.at/                   1
5       "Austria"@de                           1

Table 4.2: Dictionary/lookup tables

Tables 4.2 and 4.3 contain an excerpt of the DBpedia entry of the city Innsbruck1 mapped to the SnoopyConcept storage model. The shown table structure is simplified, as the SnoopyConcept reference implementation additionally stores information about the type, the language, and other statistical data. The raw data of the subject Innsbruck is stored in the triple format as shown in Table 4.3. The corresponding string and URI representations are stored in the dictionary tables as shown in Table 4.2.

1http://dbpedia.org/page/Innsbruck, accessed 2017-07-17

The numeric ids of all string representations are generated during the import process. When querying the data, the strings used in the query have to be mapped to their numeric representation. The same holds for the result set, which has to be mapped back to string representations. The dictionary approach has a huge impact on performance, which is mainly based on the reduced disk storage and the handling of numeric values. As every string is stored only once, the size of the main triple table can be reduced dramatically. This storage reduction results in faster processing of all triples and a reduced size of all index structures on the main triple table. Especially when dealing with a lot of self-joins, the database system works more efficiently with numeric values, as integer-based indexes are faster and therefore speed up the lookups of join algorithms. The overhead of the mapping and the lookups is negligible, as the compute-heavy operations, such as joins, are sped up.

sub_id  prop_id  obj_id
1       1        1
1       2        2
1       2        3
1       3        4
2       2        5

Table 4.3: Triple table which contains the raw information
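As an illustration of the mapping between strings and numeric ids, the following query (using the sketched dictionary tables from above, whose names are assumptions) retrieves all labels of the subject Innsbruck and maps the resulting object ids back to their string representations; on the data of Tables 4.2 and 4.3 it would return "Innsbruck"@en and "Innsbruck"@de.

SELECT o.obj_text
FROM triples t
JOIN dict_sub  s ON t.sub_id  = s.sub_id
JOIN dict_prop p ON t.prop_id = p.prop_id
JOIN dict_obj  o ON t.obj_id  = o.obj_id
WHERE s.sub_text  = 'http://dbpedia.org/resource/Innsbruck'
  AND p.prop_text = 'http://www.w3.org/2000/01/rdf-schema#label';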

For the definition of indexes on the main triple table, we identified in a first step all possible query patterns regarding the defined columns (subject, property, or object) in the selection part and the queried columns in the projection part which form the result set. Considering again the query about cities that contain information about the mayor, a typical pattern would be (?,mayor,?) to identify all subjects that have the property mayor specified. As the order of the defined columns does not influence the choice of the index, we were able to identify 13 possible query patterns, which are shown in Table 4.4.

The first three columns in Table 4.4 indicate which columns are defined (X) in the corresponding query pattern. A question mark indicates which columns are queried and therefore projected as the result of the query pattern. The fourth column shows the column order which is needed in a suitable index to answer the query pattern. The last column lists the indices of SnoopyDB that are suitable to answer the corresponding query pattern.

As the information about the object column in the spo and pso indices is redundant, we store the full spo index and reduce the pso index to ps. This procedure is applied accordingly to the other indices, which results in six indices: four trivalent indices and two bivalent indices, namely spo, sop, pos, osp, op, and ps.


S  P  O   Needed information (ordered)   Index SnoopyDB
X  ?  -   sp                             spo
X  -  ?   so                             sop
X  ?  ?   spo or sop                     spo or sop
-  X  ?   po                             pos
?  X  -   ps                             ps
?  X  ?   pso or pos                     pos
-  ?  X   op                             op
?  -  X   os                             osp
?  ?  X   osp or ops                     osp
X  X  ?   spo or pso                     spo
?  X  X   pos or ops                     pos
X  ?  X   sop or osp                     sop or osp
X  X  X   *                              sop, spo, pos or osp

Table 4.4: Coverage of SnoopyDB's indices

As shown in Table 4.4, we are able to answer all possible query patterns with these six indices while, at the same time, reducing the amount of needed disk space to about 87% of six full triple indices. In contrast to the Hexastore approach [172], our approach does not need its own storage layout and implementation. The presented index approach relies solely on default index structures which are provided by all classical relational database systems. This huge advantage simplifies the implementation of the SnoopyConcept drastically, as it can be developed on top of any relational database system.
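A possible declaration of these six indices on the triple table sketched above could look as follows (MySQL-style syntax; the index names are only illustrative):

-- Four trivalent indices covering the full column orders.
CREATE INDEX idx_spo ON triples (sub_id, prop_id, obj_id);
CREATE INDEX idx_sop ON triples (sub_id, obj_id, prop_id);
CREATE INDEX idx_pos ON triples (prop_id, obj_id, sub_id);
CREATE INDEX idx_osp ON triples (obj_id, sub_id, prop_id);

-- Two bivalent indices replacing the redundant pso and ops orders.
CREATE INDEX idx_ps ON triples (prop_id, sub_id);
CREATE INDEX idx_op ON triples (obj_id, prop_id);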

4.1.4 Recommendation Computation

For the recommendation part of the SnoopyConcept, the relational model offers two possible approaches to compute suitable recommendations. The first approach computes recommendations on the fly. This means that, based on the currently edited subject, the system has to find suitable properties. If we consider a simple triple store as described above, we have to perform a query as shown in Listing 4.2 to get a ranked list of all suitable properties.

The query searches for subjects that contain properties which are present in the currently edited subject and thus show a similarity to it. The set of properties that are present in these similar subjects but not in the currently edited subject (guaranteed by the NOT IN statement) forms the set of suitable properties. This candidate set needs to be ranked to recommend the most suitable properties in the current context. In the query in Listing 4.2, properties are ranked by the number of votes (cf. Section 3.4.3) they get from other properties; thus, the higher the number of properties in t1 that point to the respective property in t2, the higher the rank in the final set of recommendations.

Listing 4.2: On-the-fly computation of recommendations

SELECT t2.property FROM triple t1, triple t2 WHERE
  t1.property IN ('prop1','prop2','prop3') AND
  t2.property NOT IN ('prop1','prop2','prop3')
  AND t1.subject = t2.subject
GROUP BY t2.property
ORDER BY count(*) DESC

Although the query contains only one join and a grouping, in practice these two operations are very expensive, as the join generates an extensive intermediate result set which is subsequently grouped by the property. These two operations are inefficient on large datasets and therefore too slow for any user-satisfying real-time recommendation computation.

4.1.5 Rule Based Recommendation

Due to the previously explained performance limitations of the on-the-fly computation, the second and recommended approach to compute the set of recommendations involves the usage of precomputed association rules (see Section 3.4). The realization of such rules in the relational model is straightforward: a two-column table represents the head and the tail of one rule. The generation of rules is shown in Listing 4.3, which loops over all properties in the system and computes all corresponding rules where the current property denotes the head of the rule. The result is a very large table containing all rules, including duplicates. To compact the rule table and to provide a possibility to perform a simple ranking as described above, an additional count can be introduced. By grouping all rules by head and tail and storing the count value, as shown in Listing 4.4, the quality of the recommendations can be increased by a simple ranking based on the count values, which is also described in detail in Section 3.4.


After generating all rules, the recommendation computation can be performed by a very simple and therefore fast query on the rule set, as shown in Listing 4.5. As the set of rules is already compacted and no join is needed, the SQL query does not have to handle large intermediate join results and only needs to perform a grouping operation on a manageable set of suitable properties. Moreover, there is no need for additional index structures, as the primary key on head and tail is sufficient for all computations. A possible definition of the underlying rule tables is sketched below.
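A minimal sketch of the two rule tables assumed by Listings 4.3-4.5 could look as follows (column types are illustrative):

-- Raw rules as generated by Listing 4.3 (may contain duplicates).
CREATE TABLE rule (
  head INT,
  tail INT
);

-- Compacted rules as produced by Listing 4.4; the primary key on
-- (head, tail) is the only index needed for the recommendation queries.
CREATE TABLE rulecount (
  head INT,
  tail INT,
  c    INT,
  PRIMARY KEY (head, tail)
);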

Listing 4.3: Precompute association rules

FOREACH (properties as curr_prop)
  INSERT INTO rule (head,tail) (
    SELECT t1.prop, t2.prop
    FROM triple t1, triple t2
    WHERE t1.subject = t2.subject
      AND t1.prop = curr_prop
      AND t1.prop <> t2.prop
    GROUP BY t1.subject,t1.prop,t2.prop
  );
ENDFOREACH;

Listing 4.4: Compacting rules

INSERT INTO rulecount (head,tail,c)
  (SELECT head, tail, count(*)
   FROM rule GROUP BY head,tail);

Listing 4.5: Compute top-10 recommendations based on precomputed rules

SELECT tail as recprop FROM rulecount
WHERE head IN ('prop1','prop2','prop3',...)
  AND tail NOT IN ('prop1','prop2','prop3',...)
GROUP BY tail
ORDER BY sum(c) DESC
LIMIT 10;

The recommendation algorithm can be improved by taking the confidence and context values into account (cf. Section 3.4.3). The SQL query of the optimized ranking can be found in Listing 4.6. The context value count(*) reflects the number of rules which point to the respective property. The confidence value sum(c/prop_count_dist) denotes the strength of the rules in proportion to the number of subjects that contain the respective property at least once.

Listing 4.6: Recommendation computation considering confidence

SELECT tail as recprop, count(*),
       sum(c/prop_count_dist) FROM rulecount
JOIN dict_prop ON (head=prop_id)
WHERE head IN ('prop1','prop2','prop3',...)
  AND tail NOT IN ('prop1','prop2','prop3',...)
GROUP BY tail
ORDER BY count(*) DESC, sum(c/prop_count_dist) DESC
LIMIT 10;

The normalized (feature-scaled) and weighted combination of context and confidence as described in Section 3.4.3 can be translated to SQL as shown in Listing 4.7.
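Written out, the normalization (min-max feature scaling) and the weighted combination implemented in Listing 4.7 correspond to the following formulas, where α denotes the weighting factor @alpha:

\[
context_{norm} = \frac{context - \min(context)}{\max(context) - \min(context)}, \qquad
confidence_{norm} = \frac{confidence - \min(confidence)}{\max(confidence) - \min(confidence)}
\]
\[
score = \alpha \cdot context_{norm} + (1 - \alpha) \cdot confidence_{norm}
\]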

Listing 4.7: Hybrid recommendation computation incl. weighting

SELECT recprop,
  IFNULL((context-mincontext)/(maxcontext-mincontext), 0)
    as contextnorm,
  IFNULL((confidence-minconfidence)/(maxconfidence-minconfidence), 0)
    as confidencenorm
FROM
  (SELECT min(context) as mincontext,
          max(context) as maxcontext,
          min(confidence) as minconfidence,
          max(confidence) as maxconfidence
   FROM
     (SELECT tail as recprop,
             count(*) as context,
             sum(c/prop_count) as confidence
      FROM rulecount JOIN dict_prop ON (head=prop_id)
      WHERE head IN ('prop1','prop2','prop3',...)
        AND tail NOT IN ('prop1','prop2','prop3',...)
      GROUP BY tail) as rectemp
  ) as maxmin,

  (SELECT tail as recprop,
          count(*) as context,
          sum(c/prop_count) as confidence
   FROM rulecount JOIN dict_prop ON (head=prop_id)
   WHERE head IN ('prop1','prop2','prop3',...)
     AND tail NOT IN ('prop1','prop2','prop3',...)
   GROUP BY tail
   ORDER BY context DESC, confidence DESC
  ) as recommendations

ORDER BY (contextnorm * @alpha) + ((1-@alpha)*confidencenorm) DESC
LIMIT 10;

The implementation based on the relational model turned out to be very flexible, as it can be used on top of any relational database system while, at the same time, computing recommendations very fast. In our evaluation, the recommendations were computed in 3-20 ms, depending on the number of initial properties. More details regarding the performance of this approach can be found in Section 4.4.


4.2 NoSQL: Document-oriented Stores

In contrast to the previously described Relational Model, most NoSQL databases (see also Section 2.5.2) do not provide any support for join operations. Especially in document-oriented stores, documents are not normalized or split into multiple chunks and thus do not need any join operations to reassemble the original document. Such a non-normalized storage model results in the redundant storage of data in exchange for increased performance due to the lack of reassembly costs and the easier distribution to multiple nodes in a cluster environment (cf. Section 2.5.2).

4.2.1 Storage Model

One obvious model to store RDF data in a document-oriented store is to map all properties of one subject to one document. Each document is referenced by its key (in the use case of RDF, the name or URI of the subject), which can be used to retrieve all properties of one subject by executing a single point query. Listing 4.8 shows the subject "University of Innsbruck" stored in a document in MongoDB. MongoDB is a popular NoSQL database which we use for the implementation of the SnoopyConcept.

Listing 4.8: Document in MongoDB about the University of Innsbruck

{
  "_id" : "University of Innsbruck",
  "country" : "Austria",
  "numberOfStudents" : 27439,
  "numberOfFaculties" : 16,
  "established" : "1669"
}

When only properties of one subject are considered, such a query can be executed very fast. However, due to the lack of join operations, more complex queries which have to resolve relationships between subjects cannot be realized by one single query. Consider again the example query "List all cities which have a female mayor and print the name of the city and the mayor". The first task of finding all cities which include information about the mayor can easily be realized by one query. In the next step, the documents of all found mayors, which include the information about the gender of the mayor, have to be retrieved by an additional query.

The first query is realized by performing a full-table scan, as document-oriented stores only provide fast access via the primary key. If the primary key is not given in the query, all data has to be analyzed. Depending on the underlying database system and its features, it is possible to increase the query execution performance by creating additional indices which allow non-primary-key fields to be specified in the search query.

In document-oriented stores, indices are created on properties (similar to column-based indices in classical relational database systems). The index is realized as a secondary index which points to all documents that use the respective property. Furthermore, depending on the implementation, the corresponding value of the property can also be stored in the index. To speed up all possible queries, a new index would be needed for each used property.

When using an index-based approach to speed up queries, the same problems arise as already discussed in Section 4.1. When trying to cover all properties used in the document-oriented store by indices, an index has to be created for each property. Properties are not known in advance, and creating an index for each new property would lead to an unpredictably large number of indices, which would result in a large overhead. For example, to store DBpedia, the database would have to manage over 32,000 indices to cover all properties in the DBpedia dataset.

4.2.2 Recommendation Computation

Due to the limitations regarding indices described in the previous section, a live computation of recommendations based on the raw data is not possible in an acceptable amount of time. Therefore, a precomputation approach similar to the approach described in Section 4.1 has to be applied to compute recommendations efficiently.

The rule-based approach explained in the previous section has to be adapted to the document-oriented model. Access in document-oriented databases is optimized towards retrieving single documents. Thus, one single document should contain as much related information as possible. Storing only one rule per document would lead to a large number of documents which have to be fetched to compute one single recommendation. Therefore, multiple rules have to be combined and stored in one document.

The recommendation computation explained in Section 3.4 fetches rules based on their heads. Thus, the rules can be grouped by their head, as shown in Listing 4.9. The id/primary key of the document denotes the head of all rules that are stored in the document. The tails and the corresponding count values are stored in the document as well. This storage model requires only a few documents to be fetched to get all rules needed for computing one recommendation.


The number of documents that have to be fetched corresponds to the number of distinct properties that are already present on the currently edited subject.

Listing 4.9: Representation of grouped rules in one MongoDB document

{
  "_id" : "name",
  "tails" : [ {
      "property": "mayor",
      "count": 100
    },{
      "property": "website",
      "count": 50000
    }
  ]
}

After the retrieval of all required documents, respectively all suitable rules, the rules have to be grouped and merged into the set of recommendation candidates. Subsequently, a ranking has to be applied to get the final set of recommendations.

For the realization of the recommendation computation, the aggregation framework of MongoDB (available since version 2.2) is used, as it provides a fast alternative to long-running map/reduce queries and therefore is best suited for the real-time computation of recommendations. It provides a pipelining interface to apply multiple aggregation and matching stages in an ordered sequence. An example of a MongoDB pipeline which corresponds to the basic recommendation computation of Listing 4.5 can be found in Listing 4.10.

Listing 4.10: MongoDB aggregation pipeline to compute recommendations

db.rules.aggregate( [
  { $match : { _id : { $in : ['prop1','prop2','prop3',...] } } },
  { $unwind : "$tails" },
  { $group :
    { _id : "$tails.property", number : { $sum : "$tails.count" } }
  },
  { $match : { _id : { $nin : ['prop1','prop2','prop3',...] } } },
  { $sort : { number : -1 } },
  { $limit : 10 }
] )

Another big advantage of most NoSQL databases is the scalability they offer. Especially when dealing with big datasets this can be a major benefit, which is explained in more detail in the following section.


4.2.3 Distribution and Sharding

The processing of large datasets which are created by mass-collaboration at web scale can be very demanding regarding data throughput. Especially the computation of up-to-date recommendations in a constantly changing online collaboration system is very expensive, as all changes have to be considered to be able to react to trends and new information. If one single node is no longer able to handle the traffic and the computation, a distribution of computation and data storage is needed. In this section, we describe how the SnoopyConcept can be used in a sharded and distributed environment.

Most state-of-the-art NoSQL databases provide such scalability features out of the box, which is also a substantial reason for their success. MongoDB provides an automated distribution feature that distributes and replicates the stored data to several nodes while taking care of replication and redundancy. Not only the data distribution is done automatically, but also the computation is distributed to speed up the execution. For example, a query using the MongoDB aggregation framework is automatically distributed in a sharded MongoDB cluster. Furthermore, MongoDB provides the possibility to use the MapReduce model to distribute computation. MapReduce was developed by Google [48] for complex computations and big data use cases that cannot be processed or stored on one single node.

Listing 4.11: MapReduce algorithm to count words [48]

map(String key, String value):
  // key: document name
  // value: document contents
  foreach word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  foreach v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The MapReduce paradigm can be traced back to the functions map and reduce that are widely used in several functional programming languages such as Lisp. In data processing, many operations can be split into a map function which is applied to all entries and outputs key-value pairs as a result. The key-value pairs are subsequently grouped by their key and sent to a reduce function to gather the final result. Due to the principles of functional programming, the map and reduce functions themselves do not have any global dependencies and can be executed independently. These prerequisites make it possible to execute all functions in parallel and to distribute the execution to different nodes.

To explain the MapReduce concept, the example of the word-count problem is commonly used. Listing 4.11 shows the pseudo-code of a map and a reduce function to count words in documents.

The map function in the shown code example emits every occurrence of every word with a count of "1". The reduce function receives all key-value pairs grouped by key and adds up the values to a final sum, the count value of every used word. Due to the independent scope of each function, their execution can be distributed over many nodes. Consider, for example, 1000 documents which are distributed over ten nodes: every node can process 100 documents independently and execute the map function. Even on one node, documents can be processed in parallel on, e.g., multiple cores. The final key-value pairs can be sent to reduce functions on a separate node. To speed up the process, we can also build a hierarchical reduce system which already builds a local sum on every node and finally sends the sums to a centralized reduce function. Furthermore, the independence of all functions provides the possibility to easily adapt the process to an incremental approach by storing the states of all reduce functions. All these advantages, the flexibility, and the simplicity of this model have lifted the MapReduce approach to a de facto standard of distributed data processing which is nowadays implemented by many systems.

Listing 4.12: Map and Reduce function in MongoDB to create rules

map: function() {
  for (var head in this) {
    if (head !== '_id') {
      for (var tail in this) {
        if (tail !== '_id' && tail !== head) {
          emit(head, tail);
        }
      }
    }
  }
}

reduce: function(key, values) {
  var tailmap = {};
  for (var i = 0; i < values.length; i++) {
    if (!tailmap[values[i]]) {
      tailmap[values[i]] = 1;
    } else {
      tailmap[values[i]] += 1;
    }
  }
  return {tails: tailmap};
}

Therefore, we decided to propose an implementation of the Snoopy algorithm using MapReduce, as it can be distributed without any additional changes.


In the following, we discuss how the rule generation described in Section 3.4.1 can be mapped to a scalable MapReduce-based approach.

The map function collects all properties which are used on a specific subject and computes all permutations of properties to create the corresponding rules of this subject. Each emit call sends one rule consisting of one head and one tail property. The reduce function collects all rules, groups them based on their heads and tails, and counts the occurrences of the rules. After this reduce procedure, the final set contains triples of the form head→tail with a corresponding count value.

As described in the previous section, rules with the same head are stored in the same document in MongoDB. To map the reduced information to this schema, the reduce function furthermore groups all rule information based on the head of the rule. The MapReduce functions are shown in Listing 4.12. The generated rules can subsequently be used to compute the set of recommendations as shown in Listing 4.10.

The presented approach using MongoDB does not only represent an implementation based on a document-oriented store, but furthermore demonstrates a generic distributed approach using MapReduce. This distributed approach is scalable, can easily be adapted to incremental processing, and improves the performance when dealing with very large datasets.

4.3 NoSQL: Graph Stores

The SnoopyConcept uses an RDF-based model to store information, which can be represented as a graph. Therefore, it is natural to also consider the usage of a graph-based database to store the raw data. In this section, we discuss the implementation using a graph database. More information about graph database models can be found in Section 2.5.3. For the graph-based implementation we use the property graph model [141], which is supported by Neo4j2, one of the leading graph database engines. Property graphs are attributed, labeled, directed multi-graphs which can be defined as follows:

• A set of vertices, where

– each vertex has a unique identifier.

2http://www.neo4j.com, accessed 2017-07-17


– each vertex has a set of outgoing edges.

– each vertex has a set of incoming edges.

– each vertex has a collection of properties defined by a map from key to value.

• A set of edges, where

– each edge has a unique identifier.

– each edge has an outgoing tail vertex.

– each edge has an incoming head vertex.

– each edge has a label that denotes the type of relationship between its two vertices.

– each edge has a collection of properties defined by a map from key to value.

Furthermore, Neo4j provides the possibility to label vertices (nodes) and edges with tags which can be used to describe the type of nodes and edges. The expressiveness of the property graph model is not extended by labels, as all information could also be stored using properties; however, modeling and querying are simplified by using labels. For example, nodes can be labeled as subject, property, or object, as shown in the implementation presented in the next sections.

4.3.1 Storage Model

The graph-based model of the SnoopyConcept combines the storage of facts and recommendation-related information in one graph. The proposed model is an adaptation of the two-layer model which was introduced by Huang et al. in [81].

The adapted two-layer model in Figure 4.1 stores all property nodes in the upper layer and all subject nodes in the lower layer. Neo4j allows nodes to be labeled, which can be used to group or classify nodes and subsequently to restrict the type of nodes as a filter in queries. In our model, properties carry the label prop and subjects the label sub. The edges between the two layers represent usages of properties on the respective subjects and are modeled as relationships of the type HAS_PROPERTY. The value of the property is stored as an additional key-value pair on the respective HAS_PROPERTY relationship. This data corresponds to the raw data of the knowledge graph of the SnoopyConcept. All relationships between nodes on one layer are related to recommendations and represent additional meta knowledge (cf. Section 4.3.2). The properties are connected by similarity edges, respectively rule edges (relationship type USED), with a count key-value pair which specifies the number of co-occurrences. The dashed edges between subjects are currently not used by the SnoopyConcept but could be derived by analyzing the used properties to compute, e.g., the similarity of subjects.

Figure 4.1: Adapted two-layer model according to Huang et al. [81] to store knowledge and rules (property layer incl. rules above the subject layer)

The resulting model is very flexible, as one can directly access recommendation-relevant information as well as the data of the knowledge graph itself. The generation of rule edges is described in the following section, whereas the computation of recommendations based on this model is presented in Section 4.3.3.

4.3.2 Graph-based Rule Generation

One of the frequently stated advantages of graph-based models is the direct access to data and the usage of relationships without consulting index structures or computing additional joins. However, computing recommendations on the fly on a very large dataset is not possible within a reasonable time frame. Consider the Cypher query in Listing 4.13, which computes the top-5 suitable properties for the subject Innsbruck based on the co-occurrence frequency.

The computation of this query takes several hours and shows that a real-time computation is not advisable.

Listing 4.13: Cypher query to compute co-occurrence on the fly

MATCH (start:sub)-[r1:HAS_PROPERTY]->(p1:prop)
      <-[r2:HAS_PROPERTY]-(s:sub)-[r3:HAS_PROPERTY]->(p2:prop)
WHERE start.name = 'http://dbpedia.org/resource/Innsbruck'
WITH p2, count(p2) as frequency
ORDER BY frequency DESC
LIMIT 5
RETURN p2, frequency;

Therefore, the computation of recommendations has to be accelerated by precomputing rules. The rule generation cannot be done with one single query in Neo4j, as the intermediate results are too big and the optimizer is not able to cope with them. To solve this issue, we split the rule computation into one computation per property, as shown in Listing 4.14.

Listing 4.14: Query to compute rules based on one given property

MATCH (p1:prop)<-[r1:HAS_PROPERTY]-(s:sub)
WHERE id(p1) = ${ID}
WITH DISTINCT s as s2,p1
MATCH (s2:sub)-[r2:HAS_PROPERTY]->(p2:prop)
WHERE id(p2) <> ${ID}
WITH p1,p2,count(DISTINCT s2) as freq
MERGE (p1)-[r:USED]-(p2)
ON CREATE SET r.count = freq
ON MATCH SET r.count = r.count + freq;

To compute all rules, we fetch all properties and their ids in a first step. Subsequently, we execute the query shown in Listing 4.14 for every property, which is passed via the variable ${ID}. The query creates a relationship of type USED between two properties if they are used together on a subject. Furthermore, the relationship holds a counter which defines the number of subjects that contain both properties. These relationships hold exactly the same information as the rules presented in the relational approach in Section 4.1 and can subsequently be used to compute recommendations as described in the following section.

4.3.3 Graph-based Recommendation Algorithm

Due to the insufficient performance of computing recommendations on the fly as shown in Listing 4.13, the query presented in Listing 4.15 corresponds to the basic algorithm presented in Section 3.4.1 and is based on rules which are stored as USED relationships in the graph store.


Listing 4.15: Compute recommendation based on USED-relationships

MATCH (head:prop)-[r:USED]-(tail:prop)
WHERE
  head.name = 'http://dbpedia.org/property/areaTotalKm'
  OR head.name = 'http://dbpedia.org/property/postalCode'
RETURN tail.name, count(r) as cx, sum(r.count) as cglobal
ORDER BY cx DESC, cglobal DESC LIMIT 10;

| tail.name                                      | cx | cglobal |
| "http://dbpedia.org/property/populationTotal"  | 2  | 307308  |
| "http://dbpedia.org/property/populationAsOf"   | 2  | 306218  |
| "http://dbpedia.org/property/longd"            | 2  | 299958  |
| "http://dbpedia.org/property/latd"             | 2  | 299942  |
| "http://dbpedia.org/property/subdivisionType"  | 2  | 292618  |
| "http://dbpedia.org/property/subdivisionName"  | 2  | 287156  |
| "http://dbpedia.org/property/longew"           | 2  | 286244  |
| "http://dbpedia.org/property/latns"            | 2  | 286138  |
| "http://dbpedia.org/property/longm"            | 2  | 273188  |
| "http://dbpedia.org/property/latm"             | 2  | 273164  |
10 rows
290 ms

By defining the starting points, respectively rule heads, the query fetches all connected properties, i.e., rule tails, and ranks the result list by the context count and the global count value (cf. Section 3.4). The listing consists of an example query and the corresponding result set.

Listing 4.16: Graph based normalized hybrid (70:30) recommendation

MATCH p=(head:prop)-[r:USED]-(tail:prop)
WHERE
  head.name = 'http://dbpedia.org/property/locationCountry'
  OR head.name = 'http://dbpedia.org/property/name'
  OR head.name = 'http://dbpedia.org/property/logo'
  OR head.name = 'http://dbpedia.org/property/website'
WITH tail, count(r) as context, sum(r.count/tail.count) as confidence
WITH min(context) as mincontext, max(context) as maxcontext,
     min(confidence) as minconfidence, max(confidence) as maxconfidence
MATCH (head:prop)-[r:USED]-(tail:prop)
WHERE
  head.name = 'http://dbpedia.org/property/locationCountry'
  OR head.name = 'http://dbpedia.org/property/name'
  OR head.name = 'http://dbpedia.org/property/logo'
  OR head.name = 'http://dbpedia.org/property/website'
RETURN
  tail.name,
  count(r) as context,
  sum(toFloat(r.count)/tail.count) as confidence,
  (count(r)-mincontext)/toFloat(maxcontext-mincontext) as contextnorm,
  (sum(r.count/tail.count)-minconfidence)/toFloat(maxconfidence-minconfidence) as confidencenorm,
  ((count(r)-mincontext)/toFloat(maxcontext-mincontext) * (1 - ALPHA)) +
  (ALPHA * (sum(r.count/tail.count)-minconfidence)/(maxconfidence-minconfidence)) as score
ORDER BY score DESC LIMIT 10;

| tail               | cnxt | conf. | cnxtnorm | conf.norm | score |
| patents            | 4    | 8.0   | 1.0      | 1.0       | 1.0   |
| trademarkedSlogans | 4    | 8.0   | 1.0      | 1.0       | 1.0   |
| annualNetSales     | 4    | 8.0   | 1.0      | 1.0       | 1.0   |
| siteType           | 4    | 8.0   | 1.0      | 1.0       | 1.0   |
| developerWebsite   | 4    | 6.6   | 1.0      | 0.75      | 0.82  |
| ceasedTrading      | 4    | 6.0   | 1.0      | 0.75      | 0.82  |
| numberOfLocations  | 4    | 6.0   | 1.0      | 0.75      | 0.82  |
| helpline           | 4    | 5.0   | 1.0      | 0.62      | 0.73  |
| tradingAs          | 3    | 6.0   | 0.66     | 0.75      | 0.72  |
| keyDistributor     | 3    | 6.0   | 0.66     | 0.75      | 0.72  |
10 rows
1248 ms

The context-sensitive algorithm described in Section 3.4.3 aims at finding the right balance between popular properties and the properties most suitable in the given context. In Listing 4.16, the extended algorithm (cf. Section 3.4), which incorporates the normalized context and confidence values by applying a weight ALPHA, is represented as a Neo4j query (output shortened). The weighting factor ALPHA, which can be defined in the range between 0 and 1, indicates whether the context score or the confidence score has the higher influence on the final score. For example, a weighting factor of 0.5 indicates a balanced weighting between both scoring values.

4.4 Storage Model Evaluation

In this section we evaluate and discuss our previously proposed storage models with respect to performance. As the SnoopyConcept is compatible with all triple-based storage systems (cf. Section 4), which are usually optimized for querying semi-structured data [56, 142, 120], we do not cover query facilities in this evaluation but focus on the previously presented computation of recommendations in the context of the SnoopyConcept. The evaluation covers the following three main steps in the process of computing recommendations:

• Import of raw data of the evaluation dataset

• Creation of rules

• Computation of recommendations


All three steps are examined for every storage model presented in this chapter. The database systems used and their versions are listed in Table 4.5. The evaluation was conducted on a machine with two Intel Xeon E5530 CPUs @ 2.40 GHz (8 cores, 16 threads in total), 192 GB main memory, RAID-5 with 5x SAS-HDD900 10k, and CentOS Linux release 7.3.1611. For the performance evaluation we imported a DBpedia dump which is also used for the evaluation in Chapter 6. The dataset consists of 59 million triples, which results in 701 million rule instances. The detailed structure of the dataset is described in Section 6.1.1.

Storage Model             System    Version   Engine
Relational Model          MySQL     5.7.18    InnoDB
NoSQL / Document-based    MongoDB   3.4.5     WiredTiger
Graph Model               Neo4j     3.2.0     Neo4j

Table 4.5: Specifications of Databases used in the Evaluation

Performance numbers regarding the data import can be found in Figure 4.3. Figure 4.2 depicts the database sizes after the import of the dataset. The import into MySQL was realized by piping a pregenerated SQL file including all insertion commands to the native client. Every insertion command triggers a stored procedure in MySQL which coordinates the import, including id lookups and dictionary table management. For the MongoDB import, the PHP mongodb client version 1.2.9-1 and PHP 7.1 were used. To decrease the import time, the import process used write tasks³ consisting of 1,000 bulk upsert commands each. In Neo4j, the native csv import command⁴ was used to import a pregenerated csv file containing all triples. The periodic commit value was set to 10,000 to increase the import performance.
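The MongoDB import itself was implemented with the PHP client; the following is only a rough Python (pymongo) analogue of the batched bulk-upsert idea, with a hypothetical collection layout (one document per subject with an embedded list of property-value pairs), not the original import code.

from pymongo import MongoClient, UpdateOne

def import_triples(triples, batch_size=1000):
    """Bulk-upsert (subject, property, value) triples in batches.

    Hypothetical sketch: the database/collection names and the document
    layout are assumptions, not the layout used in the evaluation."""
    coll = MongoClient()["snoopy"]["subjects"]
    batch = []
    for subject, prop, value in triples:
        batch.append(UpdateOne(
            {"_id": subject},
            {"$push": {"props": {"p": prop, "v": value}}},
            upsert=True))
        if len(batch) == batch_size:
            coll.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        coll.bulk_write(batch, ordered=False)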

After the successful import, the rule generation for the individual systems was executed. The rule generation in MySQL was implemented by a stored procedure following the approach described in Section 4.1.5. For MongoDB, a map-reduce job was created to generate all rules (cf. Section 4.2.3). The map-reduce approach provides the possibility to scale the rule generation and distribute the workload to multiple nodes; in this evaluation, however, the map-reduce job was not distributed and was executed on a single node only. The rule generation in Neo4j was realized using a bash script which sends a generation query for every property in the system, as described in Section 4.3.2. The time consumption for the rule generation can be found in Figure 4.3.
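Independent of the concrete storage system, the rule generation boils down to counting how often two properties co-occur on the same subject. The following is a minimal, storage-agnostic Python sketch of this counting step, not the actual MySQL procedure, map-reduce job, or Cypher queries.

from collections import Counter
from itertools import permutations

def generate_rules(subjects):
    """Count co-occurring property pairs per subject.

    subjects: dict mapping a subject to its collection of properties.
    Returns a Counter of (head, tail) -> co-occurrence count, i.e. a
    compacted rule set as used by the recommender."""
    rules = Counter()
    for props in subjects.values():
        # each ordered pair of distinct properties yields one rule instance
        for head, tail in permutations(set(props), 2):
            rules[(head, tail)] += 1
    return rules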

³ https://docs.mongodb.com/manual/reference/method/db.collection.bulkWrite/, accessed 2017-07-17
⁴ http://neo4j.com/docs/developer-manual/current/cypher/clauses/load-csv/, accessed 2017-07-17


[Bar charts: memory consumption (MB) of the imported data and the generated rules (Figure 4.2) and the time in minutes for the data import and the rule generation (Figure 4.3), each for MongoDB, Neo4j, and MySQL.]

Figure 4.2: Memory (MB)
Figure 4.3: Import (minutes)

MySQL takes 574 minutes to import the dataset and is therefore the slowest database for the import task. This can be attributed to the manually implemented dictionary approach (cf. Section 4.1.3), which performs several lookups for every inserted triple, e.g., for subject, property, object, language, and type. MongoDB and Neo4j receive all strings without any preprocessing and therefore perform similarly for the import, i.e., 118 minutes for MongoDB and 203 minutes for Neo4j. Both systems also compress the stored data, but they apply the compression natively and therefore more efficiently.

Neo4j reduces disk usage by applying an automated dictionary compression approach⁵ similar to the dictionary approach we introduced in Section 4.1.3 for the relational model. The Neo4j compression algorithm tries to predict whether it is more efficient to store a value inline or in a separate string store by applying length-based rules. For the used DBpedia dataset, the Neo4j string store consumes 2.59 GB of disk space, which corresponds to the disk usage of the dictionary tables in MySQL.

MongoDB has the lowest disk space consumption of less than 3 GB, which can be attributed to the compression⁶ applied by the WiredTiger storage engine. By default, WiredTiger uses block compression with the snappy compression library and prefix compression for all indexes. The imported DBpedia dataset mainly contains URIs using the same namespaces, which leads to a very efficient compression.

MySQL stores the id-based triple data in 3.2 GB and allocates another 2.3 GB for the lookup tables. To speed up queries that reason on the raw data, we proposed five additional indexes besides the primary-key spo index, as described in Section 4.1.3. Every additional index on the main triple table requires about 1 GB, which results in 5.5 GB of additional disk space. In total, MySQL consumes 11,082 MB for all data and additional index structures. For the computation of recommendations these additional indexes are not used.

⁵ https://neo4j.com/docs/operations-manual/current/performance/property-compression/, accessed 2017-07-17
⁶ https://docs.mongodb.com/manual/core/wiredtiger/#compression, accessed 2017-07-17


In comparison to the raw data, the rules are very compact in all database systems. This compact size has an impact on the performance of the recommendation computation, as it is mainly based on rules. MySQL requires the largest amount, 361 MB, for storing the rules. Nevertheless, this amount of data can easily be held in main memory and thus be provided efficiently to the recommender algorithm.

It is also important to mention that all databases use just one CPU core for the import. For the rule generation, MongoDB could be scaled by using a cluster setup of several instances on the same machine. This approach would speed up the rule generation drastically, as the map-reduce approach scales with the number of nodes in the cluster. MySQL and Neo4j scale on a query basis, which results in a parallel execution of queries sent by multiple clients. As the import process is single-threaded and thus only one client is connected to the database, neither MySQL nor Neo4j can use their parallelization strategy for the import or the rule generation.

Initial properties    MySQL     MongoDB     Neo4j
3                      2.86       4.86      29.27
5                      4.65       6.10      37.16
10                    10.16       9.68      58.69
15                    17.01      13.28     102.40

Figure 4.4: Computing top-10 recommendations (time in milliseconds)

For the computation of recommendations, Figure 4.4 shows the performance of MySQL, MongoDB, and Neo4j. The presented times are needed for the computation of top-10 recommendations for an input of 3, 5, 10, and 15 initial properties. The used algorithm corresponds to the algorithm presented in Section 3.4.3, which incorporates confidence and context. For this evaluation, we used the best performing ranking strategy "context confidence" (cf. Section 17), which applies an order by context, confidence. The evaluation was performed by sending a pregenerated set of 1,000 recommendation calls to the database systems using the native database clients. To be able to compare the systems, the random initial properties were pregenerated and re-used for every database system. Thus, every database receives the same starting properties and therefore computes exactly the same recommendations. The measured time includes the creation of one connection to the local server, the transfer of 1,000 queries incl. initial properties to the server, the execution

of 1,000 queries, and the transfer of the computed recommendation sets back to the client. All evaluations were executed six times: the first run serves to warm the caches and the remaining five runs were used to calculate the overall average time to compute a recommendation set.
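A minimal sketch of this measurement procedure is shown below; `compute_top10` stands for the database-specific client call and is purely hypothetical, as is the structure of the pregenerated query set.

import time

def benchmark(queries, compute_top10, runs=5):
    """Average time per recommendation call over `runs` measured runs.

    `queries` is the pregenerated list of 1,000 initial-property sets,
    `compute_top10` the (assumed) client call that sends one query to the
    database and returns the recommendation set. One extra warm-up run is
    executed first and not measured, as in the evaluation setup."""
    for q in queries:                 # warm-up run, fills the caches
        compute_top10(q)
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        for q in queries:
            compute_top10(q)
        total += time.perf_counter() - start
    return total / (runs * len(queries))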

In general, it can be seen in Figure 4.4 that every database system is able to compute recommendations in around 100 ms or less. Even when adding the network latency to the client and the time to render the user interface, this performance is sufficient to provide structure recommendations, e.g., additional properties. For recommendations which guide the user already during typing, such as auto-completion, the requirements are stricter and the total latency should stay below 100 ms for the user to feel that the system reacts instantaneously [124, 35, 116].

MongoDB and MySQL meet this criterion by providing ten recommendations in less than 18 ms. The fastest computation, based on three initial properties, is achieved by MySQL in less than three milliseconds. If more properties are initially present, MongoDB provides recommendations faster than MySQL. This can be attributed to the larger intermediate results which slow down the JOIN performance in MySQL. The reason that Neo4j computes recommendations six times slower than MongoDB and ten times slower than MySQL is that Neo4j does not store the rules in a clustered memory layout. Nevertheless, Neo4j is more flexible when querying the raw data, as the graph structure can be exploited directly [13].

Furthermore, we also compared the MySQL performance of the simple ordering with the normalized and weighted ranking presented in Section 3.4.3 for ten initial properties. The complex ranking takes 15.98 ms on average in comparison to 10.16 ms for the simple ranking, i.e., an overhead of more than 50% which does not lead to better results, as shown in Section 17.

Based on the evaluated database performance and the presented results, we recommend using MySQL or MongoDB when computing recommendations for real-time interaction with the user. For a more holistic approach and other use cases requiring complex graph queries, Neo4j might be better suited.

4.5 Storage Models Summary

In this chapter, we presented the implementation of the SnoopyConcept using three different storage models, namely the relational model, the document-oriented model, and the graph model. We also discussed the efficient rule-based computation of recommendations in all models and showed that the relational

model provides the best performance with our dataset in terms of recommendation computation, i.e., it is able to compute top-10 recommendations in less than 3 ms.


CHAPTER 5

SnoopyConcept Showcases

In this chapter, we present multiple showcases which we developed to prove that the SnoopyConcept and its algorithms can be applied to different domains to improve the quality and quantity of knowledge in information systems. The presented prototypes were also used in the evaluation of the SnoopyConcept, which is described in detail in Chapter 6.

5.1 First Reference Implementation: SnoopyDB

The described SnoopyConcept was implemented in a first prototype called "SnoopyDB". A screenshot of the SnoopyDB prototype can be seen in Figure 5.1. This figure shows the screen of a user entering information about the University of Innsbruck. The user has already entered four property-value pairs about the foundation year, the founder, the number of professors, and the official website of the University of Innsbruck. The three additional rows displayed in grey font mark the properties which are recommended by the system. The value fields corresponding to these properties already contain exemplary values. This way, the user can immediately recognize the type of the value, e.g., a numeric value for employees. The box on the right side of the screenshot contains further suitable properties for the current subject. These properties can easily be added to the input form by clicking on the arrow icon.

The screenshot also shows how the system automatically detects the data type of the entered information. In line 4, a link to the website of the university is created. The system automatically detected that a large percentage of the values belonging to the property website were stored as links and therefore suggested the insertion of a link to the user. In this case, the user accepted this suggestion and entered the URL of the official website of the University of Innsbruck.

Figure 5.1: Screenshot of the SnoopyDB Prototype

SnoopyDB was the first reference implementation in 2009 and implements the algorithm (without user modeling) presented in Section 3.3. The prototype was used to perform the online evaluation which is presented in Section 6.1.3 and was published in [61].

5.2 SnoopyTagging

Another use case of the SnoopyConcept can be found in the area of tagging. The so-called SnoopyTagging [63, 34] prototype covers the field of collaborative online tagging, which has encountered tremendous success in the last decade. The tagging paradigm enables users to annotate online resources such as images or bookmarks with keywords, aiming at creating a categorization that simplifies the search and retrieval of resources. Especially in large collaborative systems, the definition of a fixed set of categories in advance is not feasible and would limit the flexibility of the system. Hence, tagging is very important to bring some kind of order to the possibly existing chaos. One major feature of tags is their simple usage, as tags may be chosen freely without any restrictions. Thus, tags can be used to describe different aspects of online resources. Consider the example of an image tagged with "Robert Capa". By just considering the tag, it is not clear whether Robert Capa is pictured on the photo, whether Robert Capa is the photographer, or whether the picture shows


Robert Capa's house. SnoopyTagging aims at solving this problem by adding contextual meta-information to tags while at the same time increasing the homogeneity of the resulting tag vocabulary by employing a recommender system. As a showcase, we implemented our approach on top of Flickr, which is described in the following sections.

5.2.1 Structured Tags

The SnoopyTagging concept is based on the SnoopyConcept and introduces Structured Tags. The SnoopyConcept-based self-learning recommendation engine aims at decreasing the proliferation of tags and homogenizing the tag vocabulary. The paradigm of Structured or Contextualized Tags, which was first introduced as "The Poor Man's RDF" [11], represents the basis for the SnoopyTagging system. Structured Tags consist of two parts which are divided by a delimiter sign (a colon in our case): context:tag. The first part of a Structured Tag defines the context of the actual tag, whereas the actual tag is specified after the delimiter sign. As most of today's tagging platforms allow colons in tags, the backward compatibility of SnoopyTagging is guaranteed. Structured Tags enable the user to provide tags with context, e.g., the tag photographer:Robert Capa expresses that Robert Capa is the photographer of a certain photo.

The disadvantage of freely chosen tags is the increasing latent heterogeneity of the resulting folksonomies (cf. Section 2.4.1). This fact is crucial in online mass-collaboration tagging systems, as there are thousands of different users from different social levels and backgrounds who add tags in different domains and settings. Consider, e.g., the tags takenBy:Robert Capa and photographer:Robert Capa, which are differentiated during the search process although they are semantically equivalent. We tackle this problem by applying the SnoopyConcept and creating a recommender engine which provides suitable tags and contexts to homogenize the used vocabulary and avoid the use of synonyms.
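A Structured Tag can be split at the first delimiter sign. The following small Python sketch (the helper name is ours, not part of the prototype) illustrates the parsing and the backward compatibility with plain tags.

def parse_structured_tag(raw):
    """Split a Structured Tag of the form 'context:tag'.

    Returns a (context, tag) pair; for a plain tag without a delimiter
    the context is None, which keeps classical tags fully usable."""
    context, sep, tag = raw.partition(":")
    if not sep:
        return None, raw            # plain, non-structured tag
    return context, tag

# e.g. parse_structured_tag("photographer:Robert Capa")
#   -> ("photographer", "Robert Capa")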

5.2.2 Recommendations based on the SnoopyConcept

In order to create and maintain a homogeneous set of both contexts and tags, we make use of our Snoopy approach (see Chapter 3) and implemented SnoopyTagging (see the screenshot in Figure 5.2), a first prototype based on the Flickr platform. The Snoopy approach addresses the problem of heterogeneous vocabularies by providing the users with suitable recommendations. SnoopyTagging includes recommendations for both the context and the actual tag of a Structured Tag.

As these recommendations are computed based on all tags which have previously been entered by users, they are strongly tied to the community and its vocabulary. We distinguish between two types of recommendations:

(i) Additional Structured Tags are recommended based on a co-occurrence analysis (see Section 3.3) of the already entered tags, which leads to the suggestion of further applicable contexts. If, e.g., the tag photographer:Robert Capa was already specified, the contexts camera:? and location:? may be recommended to the user during the tag insertion process. Hence, this type of recommendation aims at encouraging the user to (a) use Structured Tags and (b) enter more (meta-)information.

(ii) The extended auto-completion feature recommends contexts and tags while the user is typing. This feature encourages the re-use of existing contexts and tags. Thus, it avoids the insertion of additional synonyms (see Section 3.3.2) and leads to a more homogeneous vocabulary. E.g., if loc has been entered, the key location and corresponding values, e.g., Vienna, are recommended.

Figure 5.2: Prototype SnoopyTagging

The SnoopyTagging approach incorporates the user co-occurrence space to personalize both types of recommendations as described in Section 3.5. Once a user has entered a context a, the system computes the tags co-occurring with the context a within both the set of global tags (of all users) and the set of user-specific tags (tags used by the current user). As only the top-5 recommendations are shown in the interface, a ranking function is applied. The ranking is based on the co-occurrence rate of the already entered tags on the respective resource and the remaining tags within the datasets (ranking of the global set: rank_global, ranking of the user-specific set: rank_user). For the final set of recommendations, which is based on both previously computed sets (user and global co-occurrence), we propose a hybrid ranking algorithm rank = γ · rank_global + (1 − γ) · rank_user with the factor γ which defines the weight

of the final rank between the user-specific and the global set of recommendations (cf. Section 3.5.2). In our experiments, a value of γ = 0.1 turned out to be beneficial.
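A minimal Python sketch of this hybrid ranking is given below; the two score dictionaries are assumed to hold the precomputed global and user-specific co-occurrence scores and are not part of the actual prototype code.

def hybrid_rank(global_scores, user_scores, gamma=0.1, top_n=5):
    """Blend global and user-specific co-occurrence scores.

    rank = gamma * rank_global + (1 - gamma) * rank_user, as proposed in
    the text (gamma = 0.1 proved beneficial in the experiments). The two
    dictionaries map candidate tags to their co-occurrence-based scores."""
    candidates = set(global_scores) | set(user_scores)
    ranked = sorted(
        candidates,
        key=lambda t: gamma * global_scores.get(t, 0.0)
                      + (1 - gamma) * user_scores.get(t, 0.0),
        reverse=True)
    return ranked[:top_n]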

We published the SnoopyTagging approach in 2012 [63], including an evaluation which shows that the implemented prototype based on Flickr adds context to simple tags and homogenizes the tag vocabulary. The conducted evaluation (cf. Section 6.2) showed the acceptance of the concept of contexts by the users. Our experiments also showed the frequent re-usage of tags and the creation of a homogeneous tagging vocabulary due to the strong acceptance of the underlying SnoopyConcept-based recommendation engine.

5.3 hash5

The previously described SnoopyDB prototype was designed to store facts and information in a structured format, while the SnoopyTagging approach aims at annotating already existing resources with metadata. The use case explained in this section combines the two approaches by providing a way to store unstructured notes together with structured meta information. This use case is located in the area of personal information manager (PIM) systems [174] and was covered by a prototype called hash5 which integrates the SnoopyConcept.

5.3.1 Main Concepts

The main purpose of a PIM system is to store and manage personal information, such as notes, addresses, todos, and calendar dates. hash5 provides the possibility to store textual entries and to add semi-structured meta information to each entry. The meta information is added using a novel hashtag model that extends the classical hashtag model used, for example, by Twitter and Facebook.

The classical hashtag model was established on Twitter by its users. Twitter allows textual entries only and does not provide methods to categorize tweets. Due to this limitation, Twitter users introduced the idea of prefixing tag or category information with the hash sign [27]. This simple concept did not need any further support by Twitter, is easy to understand by all users, and provides great flexibility as the vocabulary is not restricted. The hashtag concept has become very popular and is used by many platforms.


Figure 5.3: Dashboard view of hash5 (desktop/mobile screen)

Even Facebook has supported hashtags since 2013¹ to provide an additional, flexible categorization facility for Facebook messages.

hash5 has taken up this concept of hashtags and lifted it to a two-dimensional tagging model. The classical hashtag model allows only one word, i.e., one dimension, to be specified; it is not possible to specify a context. Consider the example of the hashtag #2013-01-01 10am which is attached to a textual calendar entry of a meeting. In this example it is not clear whether the date corresponds to the start of the meeting, the end of the meeting, the time of the reminder, or the creation timestamp of the entry itself. This ambiguity can only be resolved by adding a context, which is not possible with the classical one-dimensional hashtag model. The idea of adding context to tags was already proposed in the article "The Poor Man's RDF?" by Buzz Andersen in 2005 [11] and is also used in the SnoopyTagging approach presented in Section 5.2. Andersen proposed to enhance tagging systems by introducing a split symbol which splits up the tag into a context on the left-hand side of the splitting symbol and the tag value itself on the right-hand side. Considering again the example, this approach would lead to the hashtag #start:2013-01-01 10am which indicates the start of the meeting at 10 am.

Figure 5.3 shows the responsive dashboard of hash5, which gives an overview of all notes and todos. One can already recognize the large number of hashtags which are used to categorize entries, set the status of todos, or specify additional structured meta-data such as the start and end time of an appointment. Also, the columns are managed by hashtags and correspond to search queries which contain hashtags.

¹ http://www.theguardian.com/technology/2013/jun/13/facebook-to-introduce-clickable-hashtags, accessed 2017-07-17


By relying on hashtags that heavily and by introducing another dimension to tags, namely the context of a tag, the possible tagging space grows tremendously. Even in the rather small scope of private notes, this leads to a proliferation of tags as described in Section 2.4.1. hash5 applies the SnoopyConcept to prevent this proliferation, which is described in the following section in more detail.

5.3.2 Recommendations based on the SnoopyConcept

To simplify the usage of those structured tags and to cope with the proliferation of tags, recommendations support the user during the insertion of new chunks of information into the hash5 system. The hash5 system implements the SnoopyConcept and provides additional recommendation mechanisms. hash5 recommends and auto-completes keys and values and incorporates the following information into the ranking and recommendation algorithm:

• Already used tags in the current entry

• Key of the current tag when recommending values

• Full-text of the current entry

• Popularity of keys and values

Besides the known types of SnoopyConcept recommendations, which are mainly based on the usage respectively the co-occurrence of tags and are also provided in hash5, the incorporation of the full-text of an entry to improve the recommendations is very promising. Especially in the area of PIM systems, which often contain similar information, the correlation of text and tags can be exploited. Consider an invite to a recurring meeting with a customer. The invite usually contains information about the company, title, agenda, attendees, signatures, and the meeting point. This information can be used to suggest hashtags which were already used in a previous meeting with the same or a similar customer. Such recommendations can, e.g., accelerate the process of classifying entries in the PIM system and improve the user experience. A screenshot of the edit window, which also contains hashtag recommendations below the entry field, is shown in the right screenshot of Figure 5.4.

The recommendation system is also integrated into the search process, as shown in the left screenshot of Figure 5.4. By suggesting and auto-completing hashtags during the search process, the exploration of a potentially huge amount of entries in a PIM is optimized. Every search can also be saved and shown as a dynamic column in the dashboard, as shown in Figure 5.4.


Figure 5.4: Search and edit an entry in hash5

The hash5 project was developed and improved by the authors of the following Bachelor theses, which contain more information about the technical realization of the different clients and the corresponding backend systems.

• M. Schmakeit. Extension and Optimization of the Hash5-Server - a Hashtag-based PIM-System. Bachelor's Thesis, University of Innsbruck, Jan. 2015. Supervised by Eva Zangerle, Wolfgang Gassler, Günther Specht

• C. Esswein. Development of a responsive webapp for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2014. Supervised by Wolfgang Gassler, Eva Zangerle, Günther Specht

• C. Laqua. Development of a responsive Android client for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2014. Supervised by Eva Zangerle, Wolfgang Gassler, Günther Specht

• M. Wolf. Extension of Information Types for an Android Client for a Hashtag-based PIM System. Bachelor's Thesis, University of Innsbruck, Oct. 2014. Supervised by Eva Zangerle, Wolfgang Gassler, Günther Specht

• T. Fraydenegg. Development of a light-weight HTML5 Web Client for an Information System adapting Social-Media Concepts. Bachelor's Thesis, University of Innsbruck, June 2013. Supervised by Wolfgang Gassler, Eva Zangerle, Günther Specht

• R. Bierbauer. Development of a light-weight web client for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2013. Supervised by Wolfgang Gassler, Eva Zangerle, Günther Specht

• M. Müller. Development of a self-learning recommendation module for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2013. Supervised by Wolfgang Gassler, Eva Zangerle, Günther Specht

• D. Hoppe. Social-media concepts in personal information management systems. Bachelor's Thesis, University of Innsbruck, June 2012. Supervised by Wolfgang Gassler and Eva Zangerle

5.4 Wikidata

The Wikidata platform was introduced in 2012 and has since evolved into a central, consistent, and structured knowledge base which is maintained by an active community. The knowledge captured in Wikidata is semi-structured and is stored in a format very similar to the SnoopyConcept storage format. The main goal of Wikidata is to provide facts which are human and machine readable and can be used across different languages, as stored facts are not tied to a single language [170]. The provided data is used by an increasing number of applications, with Wikipedia being the most popular one. As of May 2017, Wikidata features more than 26 million data items, which in principle "represent all the things in human knowledge, including topics, concepts, and objects"². E.g., Albert Einstein or the city of New York are such items. On Wikidata, these items are described by so-called statements, where a property serves as a descriptor for a value representing the actual information. E.g., a property of the item Albert Einstein is dateOfBirth and the corresponding value is 1879-03-14. Wikidata items and the corresponding statements have been curated

² https://www.wikidata.org/wiki/Help:Items, accessed 2017-07-17


by approximately 484 million edits³. On average, each item on the platform is shaped by 18 edits performed either by humans or by bots.

As of 2014, 90% of all edits were made by bots, as the Wikidata project strives to automate tasks and outsources them to bots [170]. However, roughly one million edits are still performed by human users every month and hence, contributions by human users play an important role. To improve this manual process of inserting new facts, the Wikidata platform supports its committed community with a so-called "property suggestor", which was introduced in 2014. This tool assists the user in entering information by providing suggestions for novel properties which are likely to be added to the current data item; it is described in the following section.

5.4.1 Wikidata Property Suggestor

The Wikidata platform provides users with a property suggestor which aims at assisting users when adding new statements to a given Wikidata item. The approach taken by the Entity Suggester is based on the predicate suggestion approach by Abedjan and Naumann, which aims to enrich RDF datasets by a set of rule mining approaches [3, 2]. This approach is also inspired by traditional association rules [9]. Abedjan and Naumann claim that once RDF triples are grouped by their subject, traditional association rules can be used to provide predicate suggestions to the user. Based on the previously inserted predicates for a given subject, this approach collects all rules that incorporate these predicates as their heads. The predicates appearing in the consequences of those rules are extracted and form the set of candidate predicates. These candidate predicates are subsequently ranked by the sum of the confidence values of all rules that have the corresponding predicate as their consequence.
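The following Python sketch illustrates this ranking idea under the assumption that the mined rules are available as a mapping from (head, consequence) pairs to confidence values; it is not the actual Wikidata implementation.

from collections import defaultdict

def suggest_predicates(present, rules, top_n=10):
    """Rank candidate predicates by summed rule confidence.

    `present` is the set of predicates already used on the subject and
    `rules` maps (head, consequence) pairs to a confidence value; both
    structures are assumptions for this sketch. Predicates already
    present on the subject are not suggested again."""
    scores = defaultdict(float)
    for (head, consequence), confidence in rules.items():
        if head in present and consequence not in present:
            scores[consequence] += confidence
    return sorted(scores, key=scores.get, reverse=True)[:top_n]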

The Wikidata Property Suggestor further adds so-called "classifying" properties, which are the properties "instanceOf" and "subclassOf". The property instanceOf refers to the fact that the described item "is a specific example and a member of that class"⁴, whereas the subclassOf property signifies that all instances of these items are instances of the given superclass⁵. For the recommendation approach, these properties are treated differently: not only the property itself is used as the head of rules, but also the combination of the

³ https://www.wikidata.org/wiki/Special:Statistics, accessed 2017-07-17
⁴ https://www.wikidata.org/wiki/Property:P31, accessed 2017-07-17
⁵ https://www.wikidata.org/wiki/Property:P279, accessed 2017-07-17

property and each occurring object. I.e., rules such as (instanceOf, human) → dateOfBirth are formed and added to the set of rules. This allows further information to be added to the recommendation computation process, as information about the type of the given data item can be inferred and exploited (if available). Naturally, such information is valuable for the recommendation of suitable properties.
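A minimal sketch of how such classifying rule heads could be derived from an item's statements is shown below; the representation of statements as (property, value) pairs is an assumption made for illustration.

CLASSIFYING = {"instanceOf", "subclassOf"}

def rule_heads(statements):
    """Derive the rule heads for one item.

    `statements` is a list of (property, value) pairs of the item. Plain
    properties become heads on their own; for classifying properties the
    (property, value) combination is used as an additional head, so that
    e.g. ('instanceOf', 'human') -> dateOfBirth rules can be formed."""
    heads = []
    for prop, value in statements:
        heads.append(prop)
        if prop in CLASSIFYING:
            heads.append((prop, value))
    return heads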

As this concept heavily relies on manual classification using a previously defined classification system, we analyzed in [184] whether the manual classification is beneficial and whether the algorithm can be further improved by applying additional elements of the SnoopyConcept. The combination of the algorithms is described in the following section.

5.4.2 Incorporating the SnoopyConcept Algorithm

The SnoopyConcept recommendation approach is also based on association rules but furthermore incorporates the context to improve the ranking of recommendations. In this case, the notion of context refers to the rules' heads, as these describe the context a given property is embedded in (cf. Section 3.4). The more such information about a property is available, the broader the foundation for recommending that property. Therefore, we utilize the number of different contexts a rule is embedded in for the ranking computation. I.e., we count the number of distinct rules that have a given property as their consequence. For this context-based approach, we rank properties by the number of distinct rules leading to the property: the higher the number of distinct rules with a given property in the consequence, the higher the rank of the property. In case of a rank tie, we further rely on the number of total occurrences of the particular rules to resolve the tie.
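The following sketch illustrates this context-based ranking, again under the assumption that the rules are available as a mapping from (head, consequence) pairs, here to their occurrence counts; it is an illustration, not the evaluated implementation.

def rank_by_context(present, rules, top_n=10):
    """Rank candidates by the number of distinct supporting rules.

    `rules` maps (head, consequence) pairs to their occurrence count.
    For every candidate we count how many distinct rules (contexts) point
    to it and use the summed occurrence count only to break rank ties."""
    distinct, occurrences = {}, {}
    for (head, consequence), count in rules.items():
        if head in present and consequence not in present:
            distinct[consequence] = distinct.get(consequence, 0) + 1
            occurrences[consequence] = occurrences.get(consequence, 0) + count
    return sorted(distinct,
                  key=lambda c: (distinct[c], occurrences[c]),
                  reverse=True)[:top_n]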

We propose to create hybrid variants of the three recommender approaches presented previously, as we identified one distinctive characteristic for each of them: (i) using the sum of the rule confidences of all applicable rules for ranking in the case of the algorithm by Abedjan and Naumann [3, 2] (AN), (ii) utilizing classifying properties to add type information in the case of the Wikidata approach (WD), and (iii) the use of context information for ranking property candidates in the case of the SnoopyConcept approach (SN).

We evaluated each of the approaches extended by the distinctive characteristics of the other two recommendation algorithms. This includes combining the SN simple approach with classifying properties as described in Section 5.4.1 (SN classified), as well as additionally using the context to perform the ranking

in this hybrid configuration (SN context classified). A second stream of hybrid configurations results from taking the predicate suggestion approach described in the previous section and combining it with Snoopy's contextual ranking approach (AN context). Lastly, we also evaluated the extension of the WD approach with Snoopy's context ranking approach (WD context).

The evaluation was focused on the Wikidata use case and therefore a Wikidata dataset was used. The evaluation results, which are presented in detail in Section 6.1.2, show that in the context of Wikidata the Wikidata Entity Suggester (WD) works better than the other presented approaches. In the course of our analyses, we identified two key aspects which are essential for the quality of recommendations: incorporating classifying properties and making use of contextual information for ranking the property recommendation candidates. Combining the current Wikidata Entity Suggester approach with Snoopy's ranking strategy, which makes use of contextual information, significantly increases the performance of the current Wikidata recommender approach.

5.5 Summary

In this chapter it was shown that the SnoopyConcept can be used in various fields to improve the insertion of data. The basic idea of the SnoopyConcept is to involve the user during the insertion of knowledge into a traditional information system such as Wikipedia. This was shown by the SnoopyConcept reference implementation called SnoopyDB (cf. Section 5.1). The extension of the basic SnoopyConcept algorithm by user modeling was demonstrated by SnoopyTagging (cf. Section 5.2). Furthermore, SnoopyTagging proves that the SnoopyConcept is very flexible, as it is compatible with common tagging systems and can easily be implemented to enhance classic tagging approaches. hash5, which is explained in Section 5.3, demonstrates the usage of the SnoopyConcept in a completely different field: PIM systems have other requirements but can nevertheless benefit from the implementation of the SnoopyConcept. Since the first introduction of the SnoopyConcept in 2010 [61, 182], very similar approaches have been presented by other authors. The implementation by Wikidata which conforms to the SnoopyConcept, discussed in Section 5.4, proves that the approach of guiding the user during the insertion of knowledge has become mainstream.

The evaluation of all presented SnoopyConcept approaches, showcases, and implementations can be found in the following Chapter 6.

CHAPTER 6

Evaluation

In this chapter, all SnoopyConcept algorithms are evaluated under different aspects. In the first section we evaluate the core recommendation algorithms by conducting offline and online evaluations. Furthermore, we compare the SnoopyConcept with Wikidata's Property Suggestor and evaluate the proposed context extensions for it. Our user-centric experiments presented in Section 6.1.3 shed more light on the user acceptance of the SnoopyConcept approach. In the last section we evaluate the personalized SnoopyConcept algorithm using the SnoopyTagging prototype to show whether personalization is able to increase the accuracy of the SnoopyConcept.

6.1 Recommendation Algorithms

In this section, the recommendation algorithms which build the basis of the SnoopyConcept are evaluated. Two major test approaches were chosen to verify the recommendation algorithms and prove that the computed recommendations are suitable [151, 75]. The first evaluation section comprises an offline evaluation which evaluates the performance of the algorithms on a large, mass-collaboratively created dataset which is based on Wikipedia. In the second section we compare the SnoopyConcept with Wikidata's Property Suggestor by conducting an offline evaluation using the Wikidata dataset.

We discuss differences and assess the accuracy of our proposed context extension of the Property Suggestor described in Section 5.4. The second test approach evaluates the SnoopyConcept, respectively its recommendation algorithms, by conducting an online evaluation using the SnoopyDB prototype. This test incorporates real users and assesses the SnoopyConcept in a real-life environment to show whether recommendations are accepted by the users and whether the quality and quantity of information in the system can be increased. In the last section we evaluate the personalized SnoopyConcept algorithm based on the SnoopyTagging prototype presented in Section 5.2. We determine whether the personalization approach is able to increase the accuracy of recommendations and support the user during the insertion process.

6.1.1 Offline Evaluation

The idea of offline evaluations is to evaluate an algorithm or system by using a pre-existing dataset and simulating the user behavior based on this dataset. For example, in the domain of recommender systems, datasets including user ratings are often used to verify the performance of a recommendation algorithm [151, 75]. Therefore, the dataset is split into a training and a test set. The performance of an algorithm is measured by using the training set as the basis for the recommendation algorithm. The computed recommendations are subsequently compared with the test set to assess their performance.

The usefulness of offline evaluations mainly depends on the underlying dataset, which has to be as similar as possible to the real environment in which the recommendation algorithm will be deployed. To evaluate the SnoopyConcept algorithm, a dataset of a semi-structured information system is needed. As described in Section 2.1, there are two major types of information systems. The dataset of a strictly structured information system would be useless as there is no freedom regarding the used structures in the system, and therefore it is not suitable to verify an algorithm which recommends structure. On the other hand, an information system that does not limit the used structures in any way is not suitable to verify the recommendation of structures either, as the recommendations are designed to homogenize the structures in such a non-limited information system. Furthermore, to obtain significant results, the dataset must be of sufficient size. Due to these constraints, it is difficult to find a suitable dataset which conforms to all described requirements. To the best of our knowledge, there exists no system which corresponds to the described SnoopyConcept and thus, there is no dataset which conforms to all requirements described above.


Dataset

Due to the limitations described above, we decided to base our offline evaluation on the DBpedia dataset [14]. It is collaboratively curated, adheres to a flexible schema maintained by the community, and therefore has characteristics similar to a dataset which would have been created by the SnoopyConcept. DBpedia contains triple-based facts in the form of subject, property, value which are extracted from Wikipedia's infoboxes (cf. Section 2.1.2). Infoboxes contain manually aggregated information about an article and are represented in a tabular format on Wikipedia pages. The structure of such infoboxes has to adhere to predefined infobox templates which are constantly changed and adapted by the committed Wikipedia community. Despite this manual maintenance, the schemata of infoboxes are not completely homogeneous and feature a noisy collaborative style [179]. For example, synonyms are used for the same type of property and different templates are used for the same type of article. This noisy data, which has grown with the Wikipedia system and is partly homogenized by the community, can be compared to the recommendation-based evolved schemata which would be present in a system that implements the SnoopyConcept and its algorithms. All these similarities to datasets of semi-structured and guided information systems, and also the very large size of the DBpedia dataset, confirm its suitability for our experiments.

For the evaluation of this approach, we used DBpedia version 3.9 (released on September 17, 2013) [14] containing 4.0 million things (Wikipedia articles) and 70 million triples (infobox statements only) which were extracted from Wikipedia. After our cleaning (e.g., removal of invalid characters) and import procedure, we stored 3,024,427 distinct subjects and 59,058,009 triples in our database. We randomly chose 261,808 subjects (approximately 9%) out of 3 million instances for our test set. Based on the remaining set of 2.7 million instances (50 million triples), we computed 700 million rule instances (cf. Algorithm 1 in Section 3.4.1). The compacted rule set of 6.5 million distinct rules provided the basis for all further recommendation computations. All details about the dataset and the number of rules are summarized in Table 6.1.

Experiment Description & Metrics

In this offline evaluation, we simulate the behavior of a "snoopyfied" information system based on the underlying DBpedia dataset. To achieve this, we conducted a leave-one-out test [45] and used the training and test set as previously described. To simulate the user's behavior, we performed a reconstruction process which tries to predict previously removed properties of subjects in the test set, as described below.


Property               Count
Subjects               3,024,427
Distinct properties    49,868
Distinct objects       12,888,228
Triples (total)        59,058,009
Triples in test set    5,895,613
Rule instances         701,645,970
Rules (compacted)      6,457,426

Table 6.1: Properties of the DBpedia dataset after cleaning and import

Before we start the reconstruction process of a single subject, we randomly choose three facts (property-value pairs) of the subject and remove all other properties from the subject. Those removed properties serve as our ground truth. The state in which only three properties of the subject are present simulates the state after the insertion of some basic properties during the insertion of a new subject by the user. Thus, our hypothesis is that three properties are already sufficient to reconstruct a subject. Based on these three initial properties, the reconstruction process is started by executing the recommendation algorithm as described in Section 3.4. Subsequently, the resulting top-n recommended properties are analyzed and compared with the set of removed properties. If there is no matching property in the list of recommendations, the reconstruction process is stopped. If the recommendation computation returns at least one suitable property, i.e., a member of the removed property set, the first matched property is accepted and thus removed from the removed property set and added to the current properties of the subject. This step simulates a user who accepts a recommended and suitable property and adds it to the subject. In this case, the recommendation was successful and therefore the reconstruction process is continued by recomputing the recommendations. In the next recommendation iteration, the previously added property is already taken into account, as the algorithm takes the updated and extended property set of the subject as input.

This recommendation and acceptance loop is executed as long as there are further properties in the removed property set and a matching property is presented in the top-n recommendations. As subjects may contain the same property multiple times with different objects, duplicated properties on subjects are removed and considered only once. The procedure of the preprocessing step, i.e., generating the initial properties, and the reconstruction phase, which consists of multiple recommendation steps, is shown in Algorithm 8.


Input: set S_test of all subjects in the test set
foreach s_j ∈ S_test do
    P_rem ← choose_random_properties(P_sj, 3)
    P_cur ← P_sj \ P_rem
    repeat
        P_rec ← top_n_rec(P_cur, n)
        p_match ← ∅
        foreach p_r ∈ P_rec do
            if (p_r ∈ P_rem) ∧ (p_match ≡ ∅) then
                p_match ← p_match ∪ {p_r}
            end
        end
        if (p_match ≠ ∅) then
            P_rem ← P_rem \ p_match
            P_cur ← P_cur ∪ p_match
        end
    until (P_rem ≡ ∅) ∨ (p_match ≡ ∅)
end

Algorithm 8: Reconstruction process including the preprocessing and reconstruction phase
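A compact Python transcription of Algorithm 8 could look as follows; `top_n_rec` stands for the recommendation computation of Section 3.4 and is assumed to be given, and subjects are assumed to have more than three properties. The sketch additionally returns the reconstruction ratio per subject.

import random

def reconstruct(subjects, top_n_rec, n=10, keep=3):
    """Simulate the reconstruction process of Algorithm 8.

    For every subject, `keep` randomly chosen properties are retained and
    the rest serve as ground truth. As long as the top-n recommendations
    contain a removed property, the first match is accepted and added to
    the current set. Returns the recovered fraction per subject."""
    results = {}
    for subject, props in subjects.items():
        current = set(random.sample(sorted(props), keep))
        removed = set(props) - current
        recovered = 0
        while removed:
            match = next((p for p in top_n_rec(current, n) if p in removed), None)
            if match is None:
                break                      # no suitable recommendation left
            removed.discard(match)
            current.add(match)
            recovered += 1
        results[subject] = recovered / max(len(props) - keep, 1)
    return results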

During the reconstruction process of the 261,808 subjects, multiple evaluation metrics are recorded [112]. ReconstructionTotal denotes to which extent a subject instance can be reconstructed using Algorithm 8. Reconstruction describes the percentage of the removed property set which was reconstructed. After each recommendation computation, precision and recall are calculated. Precision and recall are defined as follows:

\[ \mathit{precision}(P_{rec}) = \frac{|P_{rec} \cap P_{rem}|}{|P_{rec}|} \qquad \mathit{recall}(P_{rec}) = \frac{|P_{rec} \cap P_{rem}|}{\min(|P_{rem}|, n)} \]

where P_rem denotes the previously removed properties and P_rec the set of top-n recommendations. Since the number of removed properties decreases constantly and may therefore be smaller than the number of computed recommendations (n), the number of matching recommendations is divided by the smaller of the two values, n or |P_rem|.
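In code, these two measures for a single, non-empty recommendation set read as follows (a direct transcription of the definitions above):

def precision_recall(recommended, removed, n):
    """Precision and recall of one top-n recommendation set.

    `recommended` is the ordered list of top-n recommended properties,
    `removed` the set of still-missing ground-truth properties. The
    recall denominator is min(|removed|, n), as defined above."""
    hits = len(set(recommended) & removed)
    precision = hits / len(recommended)
    recall = hits / min(len(removed), n)
    return precision, recall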

Furthermore, the F_β-measure, which combines recall and precision, is computed. The weight β defines the emphasis on precision respectively recall.


The balanced F-measure F1 is calculated using equal weights on precision and recall and is defined as follows:

\[ F_\beta = (1 + \beta^2) \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\beta^2 \cdot \mathit{precision} + \mathit{recall}} \qquad F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \]

As an additional measure, we also calculated the mean reciprocal rank, which focuses on the ranking of the recommendations. The rank considers the positions of correct recommendations and therefore ranks algorithms higher that place more correct recommendations at the top positions of the list of recommendations. The rank is defined as follows:

\[ \mathit{MRR} = \sum_{i=1}^{|\mathit{correct}|} \frac{1}{\mathit{pos}(\mathit{correct}_i)} \quad \text{with } \mathit{pos}(\mathit{item}) = \text{position of } \mathit{item} \]
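Since every correct item contributes its reciprocal position, the value of a single recommendation set can exceed 1 (cf. the MRR values reported below). A direct transcription of the definition:

def reciprocal_rank_score(recommended, removed):
    """Sum of reciprocal positions of all correct recommendations.

    Positions are 1-based; averaging this score over all recommendation
    sets yields the reported MRR values."""
    return sum(1.0 / (pos + 1)
               for pos, prop in enumerate(recommended)
               if prop in removed)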

We conducted our evaluations using the top-n recommendations with n = 5, 10, 15, 20, and 25 on all 261,808 subjects. Due to the choice overload described in Section 3.3.5, it is not advisable to suggest more than 10 items. Nevertheless, we compute recommendations of up to 25 items to examine the progression of the measured metrics. The first recommendation step, which is based on the remaining three properties of every subject, is denoted by step=0. If no step is specified, the value denotes the average over all iteratively computed top-n recommendation sets for the respective measurement. The significance (p-value) of all results was determined by using the Wilcoxon signed-rank test [175].

The described tests were conducted using the rule-based algorithm presented in Section 3.4.1. Furthermore, the ranking strategies shown in Table 6.2 were evaluated. The main difference between the evaluated ranking strategies is the combination of the confidence and context value (cf. Section 3.4.2) which is available for every recommendation candidate. c_confidence denotes the confidence of the underlying rules which resulted in proposing the respective candidate. c_context represents the number of rules in the respective context (subject) that pointed to the candidate. The baseline ranking "simple" is based on the simple count score (c_count_score), which adds up the count values of all underlying rules.


The scores used for the ranking are listed in Table 6.2. If two scores are specified and separated by a comma, the sorting is done sequentially and the second value is only considered if the first one is equal for two candidates. The last ranking strategy, context confidence norm, is based on the normalized values denoted by c'_context and c'_confidence and incorporates a weighting factor α which can be chosen in the range between 0 and 1. This factor indicates the weight of the context score, e.g., a value of 1 would ignore the confidence score. The value of α used in the respective test run is appended to the ranking strategy name.

Name                         Ranking based on
simple                       c_count_score
confidence                   c_confidence
context simple               c_context, c_count_score
context confidence           c_context, c_confidence
context confidence norm_α    (c'_context · α) + ((1 − α) · c'_confidence)

Table 6.2: Ranking algorithms used in the offline evaluation experiments
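The comma-separated scores in Table 6.2 translate into a lexicographic sort: the second score only decides between candidates whose first score is equal. A minimal sketch, assuming the scores are already available per candidate:

def rank_context_confidence(candidates):
    """Order candidates by context first, confidence second.

    `candidates` maps each property to a (c_context, c_confidence) pair;
    tuple comparison makes the second value decisive only when the first
    one is equal, matching the comma notation of Table 6.2."""
    return sorted(candidates,
                  key=lambda p: candidates[p],
                  reverse=True)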

Experiment Results & Discussion

This section presents the results for the evaluation metrics proposed in the previous section. Firstly, we present the results of the reconstruction evaluation of the SnoopyConcept. Secondly, we take a closer look at the ranking strategies by assessing recall, precision, and the mean reciprocal rank of all ranking strategies.

[Plots of reconstruction@k (left) and reconstructionTotal@k (right) over top-k for the ranking strategies context_confidence, confidence, context_simple, and simple.]

Figure 6.1: Reconstruction of removed properties / subjects

Firstly, we evaluate the presented algorithms with regard to reconstruction, which is presented in Figure 6.1.


The ranking strategy "simple" is based on a simple counting value and represents the baseline in all graphs. The count is not normalized and therefore mainly reflects the popularity of properties. By incorporating the context, respectively the number of supporting rules, the algorithm "context simple" already achieves a 7% higher (p < 0.001) reconstruction rate (cf. Figure 6.1). The algorithm "confidence" is similar to the simple count algorithm but takes the total number of global occurrences of the respective property into account and thus mitigates the effect of very popular rules. Due to this, "confidence" is able to reconstruct 87% of all removed properties when recommending ten properties. The best performing (p < 0.001) algorithm, "context confidence", with a reconstruction rate of 88%, was able to fully reconstruct 163,442 of the 261,808 processed subjects when recommending ten properties. Overall, 253,969 subjects were reconstructed to more than 50%, which amounts to 97% of all subjects. These numbers also show that only three initial properties inserted by the user are sufficient to guide the user to a common schema by recommendations.

[Histograms (number of subjects on a log scale vs. reconstruction in %) for the strategies simple and context_confidence, top-1.]

Figure 6.2: Histogram of reconstructed subjects top-1

The histograms in Figures 6.2 and 6.3 show the distribution of reconstructed subjects and the corresponding reconstruction percentage. The y-axis represents the number of subjects on a log scale and the x-axis represents the reconstruction buckets with a step size of 5%. The difference between the algorithms "simple" and "context confidence" is clearly visible, as the distribution of the algorithm "context confidence" is skewed to the right and thus more subjects were reconstructed to a higher degree. Furthermore, the histograms of "context confidence" show that recommending just one single property (top-1) decreases the amount of reconstructed properties drastically. For example, the first bucket with a reconstruction rate of 0 to 4% contains 31,809 subjects when recommending only one item (top-1) in comparison to 5,013 subjects when recommending ten items (top-10).


[Histograms (number of subjects on a log scale vs. reconstruction in %) for the strategies simple and context_confidence, top-10.]

Figure 6.3: Histogram of reconstructed subjects top-10

Figure 6.4 shows the precision of all conducted evaluation runs. The algorithm "context confidence" performs significantly better (p < 0.001) than all other algorithms. Even on a large and collaboratively created dataset such as the used DBpedia dataset, the algorithm achieves a precision of over 38% over all 2.6 million top-10 recommendation sets computed in the test run. This means that at least three out of ten recommendations are appropriate; the remaining seven can be inappropriate or also adequate but not used in the respective Wikipedia infobox. The precision of the first recommendation set (step=0), as shown in Figure 6.4, is even higher and lies above 61% when recommending ten properties. Thus, only four out of ten recommendations are not suitable for the user when she has already entered three properties. For the top-1 setting, the best precision value at step 0 of 0.87 denotes that the first recommended property after entering three properties is correct and suitable with a probability of 87%.

[Plots of precision@k (average over all steps, left; step=0, right) over top-k for context_confidence, confidence, context_simple, and simple.]

Figure 6.4: Precision of computed recommendations

The recall curves shown in Figure 6.5 present the average recall over all recommendations during the reconstruction process. Due to the definition of recall

as presented on page 121, the only possible recall values for top-1 recommendations are 1 or 0. As long as the single recommended property matches, the reconstruction of the respective subject continues. The recall value is 0 if the recommended property does not match and, subsequently, the reconstruction process of the currently processed subject is stopped. Therefore, the recall value is higher when recommending only one or three properties. Recommending just one property (top-1) leads to a lower reconstruction rate, as shown in Figure 6.1, but results in a very high recall value, as the reconstruction process consists of many reconstruction steps with a recall value of 1.

[Plots of recall@k (average over all steps, left; step=0, right) over top-k for context_confidence, confidence, context_simple, and simple.]

Figure 6.5: Recall of computed recommendations

[Plots of MRR@k (average over all steps, left; step=0, right) over top-k for context_confidence, confidence, context_simple, and simple.]

Figure 6.6: Mean reciprocal rank of computed recommendations

Figure 6.6 shows the mean reciprocal rank (MRR) for all algorithms. This measure is very helpful when analyzing the performance of the ranking function of a given recommender system. Considering the MRR at step 0 when recommending just one property, the value of 0.87 confirms the high precision value for top-1 recommendations and shows that more than eight out of ten top-1 recommendations are accurate and suitable in the respective context of a subject containing three initial properties. This high top-1 accuracy is crucial for system features which are restricted to showing only one recommendation.


For example, the input field in the subject editor can already be prefilled with a greyed-out placeholder property as shown in Figure 5.1 on page 106. Such a feature further reduces the barrier to accepting recommendations and thus contributes to a more homogeneous vocabulary of properties.

[Plots of F1@k (average over all steps, left; step=0, right) over top-k for context_confidence, confidence, context_simple, and simple.]

Figure 6.7: F1-measure of computed recommendations

[Plots of reconstruction@k (left) and recall@k (right) over top-k for context_confidence and context_confidence_norm with α = 0.9, 0.7, 0.5, 0.3, and 0.1.]

Figure 6.8: Reconstruction and recall of “context confidence normα”

In general, it can be seen that the curves flatten with an increasing top-k value. This behaviour is clearly visible in the MRR graph in Figure 6.6, which shows a very flat curve for top-k values greater than 10. It is also confirmed by the sloping curve shown in the F1-measure graph in Figure 6.7. Especially when considering a good user experience and the prevention of choice overload caused by too many recommendations (cf. Section 3.3.5 and [115, 25]), this property is crucial and shows that a limited number of seven to ten recommended items is sufficient to guide the user to a common schema.

In Figure 6.8, the reconstruction and recall of the normalized version of the context-sensitive algorithm presented in Section 3.4.3 are shown. The number attached to the name of the ranking algorithm indicates the weight α of the normalized c_context. For example, “context confidence norm0.9” defines that c_context is incorporated in the final score with a weight of 90% and c_confidence with a weight of 10%. It can be seen that a decreasing impact of the context results in lower reconstruction and recall. As shown in Table 6.3, “context confidence norm0.9” and “context confidence” both reach the best recall value of 0.74. All other weights result in lower reconstruction and recall values. These results show that the computationally more complex ranking strategy “context confidence norm”, which requires 50% more time to compute recommendations (cf. Section 4.4), is not able to perform significantly better than “context confidence”. Therefore, we conclude that the weighting and normalization overhead can be omitted without sacrificing recommendation quality. Thus, the algorithm “context confidence” performs best in terms of both runtime and accuracy.
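A minimal sketch of the weighted variant, assuming that both scores have already been normalized to a comparable range (the exact normalization is defined in Section 3.4.3 and not reproduced here):

def combined_score(c_context_norm, c_confidence_norm, alpha):
    # context_confidence_norm(alpha): linear combination of the normalized
    # context score and the normalized confidence score; alpha = 0.9 weights
    # the context with 90% and the confidence with 10%.
    return alpha * c_context_norm + (1 - alpha) * c_confidence_norm

With α close to 1, the ranking is dominated by the context score, which matches the observation that the highest weights perform best among the normalized variants.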

Algorithm                    Recall   Precision   MRR      Reconstruction
simple                       0.5121   0.2865      1.1720   0.7423
context simple               0.5550   0.3034      1.2288   0.7965
confidence                   0.7325   0.3852      1.5735   0.9084
context confidence           0.7456   0.3894      1.5858   0.9176
context confidence norm0.1   0.7365   0.3865      1.5773   0.9108
context confidence norm0.3   0.7411   0.3880      1.5818   0.9137
context confidence norm0.5   0.7437   0.3889      1.5844   0.9157
context confidence norm0.7   0.7450   0.3893      1.5855   0.9168
context confidence norm0.9   0.7456   0.3894      1.5858   0.9175

Table 6.3: Metrics for all presented algorithms, top-10

To conclude the evaluation of all proposed recommendation algorithms, Table 6.3 lists all measured performance values in terms of recall, precision, mean reciprocal rank (MRR) and reconstruction for recommending ten properties. As can be seen, the best results are obtained by the “context confidence” approach across all measures.

Summary & Limitations

The presented experiments evaluate the automatic recommendation of structure without any user interaction. However, the SnoopyConcept provides additional recommendations while the user is typing. Consider a user who types the first character of a property, e.g., “A”. This information can cut down the number of suitable recommendations dramatically. Furthermore, by using a thesaurus, synonyms can be matched and a more commonly used synonym can be recommended to the user. Consider a user who enters citizens as a property: the system recommends the usage of population, and by accepting this recommendation, the user contributes to a more homogeneous schema. All these recommendations based on user input substantially increase recall and precision but cannot be tested within the presented test scenario, as real interaction of a human user is required.

Furthermore, the presented offline evaluation is based on a collaboratively created dataset which serves as ground truth and thus defines whether a recommendation is considered valid and correct. We argue that such an evaluation can only serve as a baseline measure due to the restriction that only properties that have already been used on the given data item may resolve to true positives and hence influence the evaluation result. Therefore, we assume that the recommendation algorithm performs even better than we can prove using this offline-based approach.

The performed evaluations reveal that both scores, c_context and c_confidence, are relevant and have to be incorporated in the ranking to achieve significantly better results in terms of recall, precision, MRR and reconstruction. The scores can be combined sequentially and do not need any normalization or weighted combination. We also showed that recommending seven to ten properties is sufficient and leads to reconstruction rates of up to 91% while reaching a precision of 39%.

6.1.2 Wikidata Property Suggestor Evaluation

The SnoopyConcept algorithm was also evaluated in the context of Wikidata and its property suggestor as described in Section 5.4. In contrast to the SnoopyConcept, the Wikidata property suggestor heavily relies on the defined class of a subject. This additional information is very beneficial for a recommender system as it allows recommending properties which are defined for the respective class. The SnoopyConcept, in contrast, was designed to compute recommendations without any manually created information about the type or class. This advantage makes it possible to also compute accurate recommendations for very specific domains such as, e.g., the moons of Pluto (cf. Section 3.1).

In this section, we present the results of the comparison between the SnoopyConcept, the Wikidata Property Suggestor and a new hybrid approach which is described in Section 5.4. For the evaluation, we employed a leave-one-out test as described in Section 6.1.1 to evaluate their ability to provide suitable recommendations and reconstruct the properties of a given subject.


[Histogram: number of subjects (y-axis, up to 800) per number of properties (x-axis, 0–75) in the test set.]

Figure 6.9: Test Set Properties per Subject Distribution

Dataset & Experiment Description

For the evaluation, we used a Wikidata dump1 of 2015-10-26. For the test set, we randomly selected 10,000 different subjects with a minimal requirement of four properties. This requirement is fulfilled by 9,254 data items, which subsequently form the test set underlying the evaluation. Figure 6.9 shows the distribution of the number of properties per item in the test set. For each of these subjects, we randomly select three properties and remove all other properties from the data item. We store the removed properties as they form the ground truth data for the evaluation. The resulting reduced subject serves as the first input for the recommender systems. Subsequently, the evaluation process described in the previous Section 6.1.1 attempts to reconstruct the subject. As measures for evaluating the reconstruction process, we rely on the traditional information retrieval measures recall, precision and F-measure, the mean reciprocal rank (MRR) [112] and the reconstruction measure. All metric definitions and an according description can be found in Section 6.1.1.
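The construction of the test set can be summarized by the following sketch; the subject representation (a dictionary with an "id" and a "properties" list) and the function name are assumptions made for illustration only.

import random

def build_test_set(subjects, sample_size=10000, min_props=4, seed_props=3):
    # Randomly pick subjects, keep only those with at least four properties,
    # retain three random properties as the initial input and store the
    # removed ones as ground truth for the reconstruction.
    candidates = [s for s in random.sample(subjects, sample_size)
                  if len(s["properties"]) >= min_props]
    test_set = []
    for subject in candidates:
        props = list(subject["properties"])
        random.shuffle(props)
        seed, removed = props[:seed_props], props[seed_props:]
        test_set.append({"id": subject["id"],
                         "seed": set(seed),
                         "ground_truth": set(removed)})
    return test_set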

Experiment Results & Discussion

This section presents the results of the evaluation methods proposed in the previous section in the context of Wikidata. Firstly, we present the results of the evaluation of the individual recommendation approaches with regard to recall and precision. Secondly, we take a closer look at the mean reciprocal rank and, in a third step, we look at the reconstructive power of the individual algorithms.

1https://www.wikidata.org/wiki/Wikidata:Database_download, accessed 2017-07-17


[Two panels: (a) recall@k and (b) precision@k over top-k ∈ {1, 3, 5, 7, 10} for the algorithms WD_context, WD, AN_context, AN, SN_context_classified, SN_classified, SN_context, and SN.]

Figure 6.10: Evaluation measures@k of all evaluated algorithms

Firstly, we evaluate the presented recommender algorithms with regard to recall and precision. Figure 6.10a shows a plot of the recall@k-measure for the presented recommenders, i.e., we evaluate the recall measure for one, three, five, seven and ten provided recommendations and depict the different results. We observe that in terms of recall, the Wikidata recommendation approach enhanced with contextual information (WD context, as presented in Section 5.4) performs best, achieving a recall@1 of 76.29% and a recall@10 of 83.64%. The evaluation shows that this approach performs best across all numbers of recommendations given (significantly better than all other approaches; p < 0.001). The second best approach is the AN approach, followed by SN context. We can also observe that the SN approach performs worst; however, it can be significantly improved by utilizing contextual information (p < 0.001).

Figure 6.10b shows the precision@k results for the evaluated algorithms. Again, WD context performs best, reaching a precision@1 of 76.29% and a precision@10 of 21.55%, again closely followed by the AN approach. We detect significant differences in the performance of these two (p < 0.001) and can observe that the algorithms perform rather similarly, again with SN being the worst performing algorithm. The evaluations show that using contextual information for ranking improves each of the proposed approaches significantly. We verify this finding by comparing the results of the individual algorithms once without using context for ranking and once with added context information for the ranking computation. This comparison shows that for all of the approaches, recall as well as precision is significantly increased for all recommendation list sizes when context is introduced into the ranking process (p < 0.001). As for the contribution of classifying properties, we observe similar results: adding special rules for classifying properties to the set of association rules significantly increases recall as well as precision of all of the approaches (p < 0.001).


[Two panels: (a) mean reciprocal rank@k and (b) reconstruction@k over top-k ∈ {1, 3, 5, 7, 10} for the algorithms WD_context, WD, AN_context, AN, SN_context_classified, SN_classified, SN_context, and SN.]

Figure 6.11: Evaluation Measures

Figure 6.11a shows the mean reciprocal rank for all algorithms. This measure can be very helpful when analyzing the performance of the ranking function of a given recommender system. Generally, we observe that the findings in this evaluation are in line with the previous results. I.e., the best performing algorithms in terms of recall and precision also perform well in terms of ranking.

The results of the evaluation of the reconstructive power of the evaluated property recommendation algorithms can be seen in Figure 6.11b. Again, the results correlate with the previous findings and show the best performance for the WD context approach, significantly better than the WD approach (p < 0.001).

To conclude the evaluation of the individual recommender algorithms, we provide a detailed comparison of all variants of the three proposed approaches in Table 6.4. On the Wikidata platform, the default number of recommendations provided to a user entering new information is seven. Therefore, we list the performance@7 in Table 6.4, where we measure performance in terms of recall, precision, the F1-measure, the mean reciprocal rank (MRR) as well as the reconstruction value. As can be seen, the best results are obtained by the WD context approach across all measures.


Algorithm               Prec.    Recall   F1       MRR    Reconstruction
WD context              0.2849   0.8152   0.4223   1.17   0.9042
WD                      0.2797   0.7971   0.4141   1.16   0.8887
AN context              0.2678   0.7644   0.3966   1.10   0.8667
AN                      0.2620   0.7469   0.3879   1.09   0.8498
SN context classified   0.2589   0.7451   0.3843   1.08   0.8529
SN classified           0.2460   0.7071   0.3650   1.04   0.8143
SN context              0.2352   0.6854   0.3502   0.99   0.8060
SN                      0.1859   0.5742   0.2809   0.86   0.7239

Table 6.4: Detailed Evaluation of all Algorithms@7

The performed evaluations reveal two important influence factors when it comes to recommending properties to users who are currently entering information on the Wikidata platform: context (cf. Section 3.4.3) and classifying properties. The evaluations show that incorporating these two aspects into the recommendation process can significantly enhance the resulting recommendations. Using classifying properties to gain further information about the type of the given data item is an important factor. Similarly, considering the context of a property for ranking, i.e., the number of distinct rules leading to its recommendation, further adds to high-quality recommendations.
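Both influence factors can be made explicit in a small ranking sketch: candidates are generated by association rules (optionally including rules over classifying properties such as “instance of”), and ranked by the number of distinct rules supporting a candidate together with the summed confidence of those rules. The rule representation and the way the two signals are combined are illustrative assumptions, not the exact Wikidata or SnoopyConcept implementation.

from collections import defaultdict

def rank_candidates(present_properties, rules):
    # rules: iterable of (antecedent_property, recommended_property, confidence).
    # Candidates are ranked by the number of distinct applicable rules
    # (contextual signal) and by the summed confidence of those rules.
    confidence_sum = defaultdict(float)
    rule_count = defaultdict(int)
    for antecedent, candidate, confidence in rules:
        if antecedent in present_properties and candidate not in present_properties:
            confidence_sum[candidate] += confidence
            rule_count[candidate] += 1
    return sorted(confidence_sum,
                  key=lambda c: (rule_count[c], confidence_sum[c]),
                  reverse=True)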

One limitation of the classifying approach lies in its reduced flexibility and hence its reduced generalizability. We argue that information about the type or superclass of a data item may not always be available, especially when applying these concepts in a broader context. The manual choice of classifying properties is rather inflexible and not necessarily generalizable. Therefore, we argue that by setting up a more rigid recommender system through manually specified classifying properties, we trade flexibility for (slightly) improved results. The evaluations show that, e.g., in terms of recall@7, the difference between the WD context and WD approaches is 1.81%. On the other hand, considering that the majority of data items feature only a small number of properties (on average 4.13 for the dataset underlying our evaluations), any recommender system has to face the cold-start problem (cf. Section 3.6).

We showed that it is beneficial to incorporate contextual information (as proposed by the Snoopy approach) into the recommendation process of the Wikidata property suggestor. The evaluations show that considering not only the sum of confidences of the applicable rules, but also the number of rules leading to a given recommendation candidate, leads to significantly better results in terms of recall, precision, MRR and reconstruction.

6.1.3 Online Evaluation/User Experiment

The main goal of the SnoopyConcept is to incorporate the user and encourage her to interact with the system. The evaluation of such an interactive information system cannot be conducted artificially, as the interaction of users with the system cannot be simulated. Therefore, we chose to conduct a user-centric experiment based on the SnoopyDB prototype which focused on the user interaction regarding the acceptance of recommendations and the provided support in general [61, 60]. The goal of the evaluation was to assess the user’s acceptance of the recommendation and guidance mechanisms provided by the system.

In total, 24 test users took part in the experiment. These users were recruited from different backgrounds: two thirds of the test users were computer scientists and one third were average computer users without any special computer knowledge or experience with handling semi-structured data.

For the evaluation, the test users were presented with two different systems: one system supported the users with all recommendation and guidance features described in the previous sections (in the following referred to as system A). The other system did not support the users at all and thus provided no recommendations, neither for properties nor for values (in the following referred to as system B). System A was bootstrapped with data created within the previous evaluation of the system [61], which contained subjects originating from two domains (cities and musicians), in order to be able to provide basic recommendations and to assess how the system and the provided recommendations adapt if subjects stemming from a new, unknown domain are added.

In the course of the experiment, users were asked to fulfill the following tasks in order:

1. Insert data about an arbitrary university into system B (no support provided to the user)

2. Insert data about an arbitrary subject related to the motor vehicle industry into system B


3. Insert data about an arbitrary university into system A (guidance by recommendations)

4. Insert data about an arbitrary subject related to the motor vehicle industry into system A

Neither the concrete subjects nor the actual information to be entered about a certain subject were specified; this was solely chosen by the test users themselves. Also, no minimum number of property-value pairs was specified and hence the amount of information added about a certain subject was solely decided by the user. The only restriction was that users were only allowed to use the English language for the names of the properties. Furthermore, users were not allowed to use the English Wikipedia for seeking information, as already aligned infobox properties could possibly influence the German-speaking test users and the resulting property names. During the experiments, all actions of the participating test users were logged and stored in order to be able to evaluate the differences in the performance of the two systems with regard to the homogeneity of the resulting vocabulary and the acceptance of the provided recommendations. The results of the analysis of the information gathered during the user experiments are discussed in the following.

As for the acceptance of the proposed recommendations, the evaluations showed that 22% of all recommended properties were accepted by the users. This seemingly low number can be attributed to the fact that the system always proposes five additional properties. Each time a user accepts a recommendation or adds new information, the recommendation list is recomputed. Hence, the total number of recommended properties is high during each edit session. In total, 49% of all newly added property-value pairs were added by accepting a property recommendation. In 62% of all user edit sessions (including re-editing of subjects), at least one property recommendation was accepted. Furthermore, the auto-completion feature, provided as an additional guidance and support mechanism, was used frequently: 23% of all properties and 17% of all values were entered by accepting the entries proposed by the auto-completion mechanism. A spellchecker was also implemented in the SnoopyDB prototype. The corrections proposed by the spellchecker were accepted by the participating test users in 37% of all cases. This acceptance rate is low, which can be attributed to the fact that the spellchecker web service is based on information extracted from a web corpus. Hence, the spellchecker did not only suggest corrections for simple typos but also suggested, e.g., ‘Formula 1’ instead of ‘Formula1’ or ‘Leopold II’ instead of ‘Leopold I’, which were not accepted by the users. Considering only simple typos (e.g. one missing character), applicable spellchecker recommendations were accepted in 100% of all cases.


The recommendation of semantic refinements in such a user experiment is limited, as only subjects already present in the system can be linked. Nevertheless, users classified 52 values in SnoopyDB (system A) as links (external or internal) and created nine internal semantic links. Thus, nine values were semantically enhanced by links pointing to the correct subject.

[Line chart “Development of Distinct Properties”: number of distinct properties (y-axis) over the number of collections (x-axis, 0–70) for system A (with guidance) and system B (w/o guidance).]

Figure 6.12: Number of distinct properties in system A/B

Figure 6.12 shows that the schema entered into the recommendation-providing system A was 33% more homogeneous with regard to the set of entered properties than without supporting the user (system B). Homogeneity within a set of properties describes how many synonymous terms were used for the description of the same subject, i.e., how many property names were directly reused and hence no synonym was used instead. In this evaluation, the resulting property vocabulary in system A is 33% smaller than the vocabulary of system B. Furthermore, the recommendation of values in system A encouraged the users to reuse values: 18% of all values used in system A were created by accepting already present recommended values. In system B, without recommendations, only 6.8% of all values were reused. For example, in system B not a single value of the property “genre” was reused, while in system A “pop” was reused ten times and “rock” was reused seven times. Despite the smaller property and value vocabulary, the users entered 31% more information into system A when supported by recommendations, as shown in Figure 6.13. This implies that by guiding the user during the insertion process, enabling her to easily add more information, and pointing her to pieces of information she still might want to enter, the total amount of useful and structured information can be increased. This value is biased by the fact that the users had already dealt with the subjects in the first two tasks of the evaluation (system B).


[Line chart “Development of Number of Properties”: total number of properties (y-axis) over the number of collections (x-axis, 0–70) for system A (with guidance) and system B (w/o guidance).]

Figure 6.13: Total number of properties/triples in system A/B

However, as we wanted to simulate the guidance and motivation of domain expert users, who already have extensive knowledge about the subject, such a high percentage may also be achievable in real-world environments. Another important finding of the evaluations was that the introduction of the new domains did not result in a dramatic increase of newly added properties. This implies that most of the properties were reused.

In summary, this user-centric experiment showed that the recommendations of the SnoopyConcept are accepted by users and lead to an increase in both the quality and quantity of knowledge in the information system.

6.2 User Modeling Evaluation

The evaluation of the extended SnoopyConcept algorithm, which also incorporates user preferences as described in Section 3.5, was based on the SnoopyTagging prototype [34] which is presented in Section 5.2 and was officially released as a Flickr App2 in the Flickr App Garden. The conducted user tests were also published in [34, 63]. In total, 20 voluntary test users (originating from various backgrounds and having diverging levels of computer skills) took part in this evaluation. The users were asked to use the SnoopyTagging prototype to upload and subsequently tag arbitrary photos.

2https://www.flickr.com/services/apps/72157625264834816/, accessed 2017-07-17


[Two panels: recall@5 (left) and precision@5 (right) as a function of the weight coefficient γ ∈ [0, 1].]

Figure 6.14: Precision/Recall@5 using different γ weight coefficients

Overall, 98% of all tags the users created were Structured Tags, even though the insertion of simple tags was still possible. This number is a strong indicator for the acceptance of the concept of contexts. 350 recommendations (233 contexts, 117 tags) were accepted for the creation of 310 Structured Tags. Such a high percentage of Structured Tags and accepted recommendations can be attributed to the user-friendly SnoopyTagging system, which allows users to easily create Structured Tags, and to the acceptance of the underlying recommender system.

During the user experiments, all displayed recommendations, user inputs and accepted recommendations were logged. Based on this usage data, we evaluated the performance of the hybrid ranking function. For this purpose, we replayed all recorded user interactions while varying the γ-value responsible for the weighting of the ranking, as shown in the following definition. Subsequently, we evaluated recall and precision of the recommendations against the dataset which was manually created by all users.

total_score = γ · score_global + (1 − γ) · score_user,   with γ ∈ [0, 1]
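A minimal sketch of this hybrid score, with a fallback to the global score for users without a sufficient tag history (the fallback mirrors the behaviour discussed below; function and parameter names are illustrative):

def tag_score(tag, global_scores, user_scores, gamma):
    # Linear blend of the community-wide score and the user-specific score.
    # gamma = 1 ignores the user model; gamma = 0 relies only on the user's
    # own tagging history.
    if not user_scores:                      # unknown user: no personal history yet
        return global_scores.get(tag, 0.0)
    return (gamma * global_scores.get(tag, 0.0)
            + (1 - gamma) * user_scores.get(tag, 0.0))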

The resulting recall and precision at five recommendations can be seen in Figure 6.14. The results show that the basic recommendation algorithm without considering any user preferences (γ = 1) already provides very good recommendations with a recall value of 74%. By increasing the influence of the user-specific recommendations, the recall value can be increased up to 79% with γ = 0.1. The relatively low recall value of 58% when setting γ = 0, and therefore considering only user-specific recommendations, can be attributed to the fact that the algorithm cannot suggest any suitable tags for unknown users. This drawback can easily be mitigated by taking global recommendations into account whenever there is not sufficient data about the respective user. The precision values at five recommendations furthermore indicate that setting γ ≤ 0.5 results in stabilized precision values around 0.16. The relatively high precision value of 18.5% with γ = 0 can be explained by the definition of precision: when considering only user-specific recommendations, it is not always possible to provide five recommendations due to the small tag vocabularies of new users. Thus, fewer tags are recommended, which results in a peaking precision value. In Figure 6.15, the precision values for recommending only the top-ranked item are shown. When recommending just the top-ranked item, recall and precision behave identically, which results in the same graph. The impact of the user-specific recommendations is easily noticeable by increasing recall and precision values. The best result of about 0.37 is achieved by setting γ = 0.2.

[Single panel: precision@1 as a function of the weight coefficient γ ∈ [0, 1].]

Figure 6.15: Precision@1 using different γ weight coefficients

This strong emphasis on personalized recommendations (γ ≤ 0.5) can be explained by the users’ preference to reuse their own tags, but also by the limited number of users and the resulting small size of the folksonomy in the system. The global set of recommendations will grow with the number of users, as there are more appropriate tags to recommend. However, to the best of our knowledge there is no Web 2.0 platform which promotes Structured Tags and therefore it is not possible to observe this behaviour in a large community in a real environment.

As for the homogeneity of the tagging vocabulary resulting from the experiments, a total of 170 distinct Structured Tags were entered, whereas only 37 different contexts were used. This is remarkable considering that users were allowed to upload photos about arbitrary topics, and it shows the frequent reuse of contexts driven by recommendations.


6.3 Evaluation Summary

In this chapter we presented the results of different evaluations of the SnoopyConcept and its algorithms. The offline evaluation, which is based on a DBpedia dataset, showed that the main recommendation algorithm is able to recommend suitable properties by simulating a user who creates new items. The usage of the DBpedia set with over 59 million triples furthermore showed that the algorithm is able to cope with large datasets which were created using mass-collaboration information systems. The offline reconstruction evaluation showed that up to 91% of all triples can be reconstructed by the proposed recommendation algorithm. When recommending ten properties, the algorithm was able to reach a precision value of 39% and a recall value of 74%.

The user experiment in Section 6.1.3 compares a naive system and a system implementing the SnoopyConcept. We showed that the snoopyfied system is able to homogenize the structures and reduce the size of the vocabulary by 33% while at the same time encouraging the user to enter more information, which resulted in 31% more stored knowledge in the system. Furthermore, we showed that the recommender system is accepted by the users, as 66% of all created triples were based on properties that were recommended and accepted by the users.

In Section 6.1.2 we showed that the extension of the Wikidata Property Suggestor algorithm by the context used in the SnoopyConcept algorithm led to higher accuracy. Furthermore, we showed that the usage of a manual classification system, as used in the Wikidata approach, has only a limited impact on the accuracy of structure recommendations. It improves the reconstruction rate by 2-4% but adds the requirement to manually classify every subject and thus rigidifies the system and impedes the general applicability of a recommendation algorithm.

In Section 6.2 we evaluated the personalized recommendation algorithm of the SnoopyConcept and showed that incorporating the personal preferences of a user can lead to a considerable increase in recommendation accuracy.

CHAPTER 7

Conclusion

During the last decades, with the enormous growth of the internet and the Web 2.0 movement, collaboration has been lifted to a new level—online mass-collaboration. In this thesis, we analyzed how recommender systems can be leveraged to empower semi-structured knowledge bases to improve the curation, search and storage of the huge amounts of knowledge created in a collaborative fashion. Thus, we identified three main research questions in Section 1.1 which are tackled by the proposed SnoopyConcept:

How can recommender systems empower collaborative information systems to become more structured without losing their flexibility?

The main idea of the SnoopyConcept is to incorporate the user already during the insertion process to prevent heterogeneous structures. Therefore, we developed a recommender system (cf. Section 3.3) which guides the user during the insertion process to align the knowledge with a common homogeneous schema. The recommender system is solely based on already inserted knowledge and therefore adapts itself to new domains or structures inserted by users. We showed in Section 6.1.1 that the SnoopyConcept can be applied to a mass-collaboratively curated dataset and provide suitable recommendations to increase homogeneity. This evaluation showed that 40% of all recommendations were accurate and that the SnoopyConcept reached a recall of up to 90%. The conducted online evaluation described in Section 6.1.3 revealed that the SnoopyConcept recommendations were able to create a more homogeneous schema by reducing the size of the vocabulary used in the information system by 33%. Nevertheless, the user is not forced into a schema and was able to insert new domains and structures when needed.

How can direct user communication during the insertion process be facilitated by recommender systems to increase the quality of information?

The user who inserts new knowledge into an information system is usually a domain expert. Therefore, it is very important to use the opportunity of direct communication with the user already during the insertion process to exploit her expertise. Especially semantic uncertainties and type conflicts can be resolved at an early stage. The recommendation approach proposed in Section 3.3 aims at enriching entries with semantic information. We showed in our user-centric experiment presented in Section 6.1.3 that semantic enhancement recommendations are accepted and lead to more semantically meaningful information in the system, such as semantic links. Our online evaluation in Section 6.2 showed that 350 recommendations were accepted for the creation of 310 tags. In the user-centric experiment described in Section 6.1.3, 49% of all newly added property-value pairs were added by accepting a recommendation. Those numbers are strong indicators for the acceptance of the recommender system and for the possibility to influence or guide the user to insert more homogeneous and semantically enhanced information.

How can automated user guidance by a recommendation system result in an increased quantity of information?

By recommending suitable properties, the user is guided to reuse properties already present in the information system. As the user is a domain expert, this approach can also encourage her to insert more information. In our online evaluation (cf. Section 6.1.3) we showed that this approach of pointing to missing pieces of information encouraged the users to enter 31% more information.

In general, we developed the universal SnoopyConcept to compute accurate recommendations without limiting the fields of application.

The concept is not bound to an underlying technology, which was shown by implementing the approach and its algorithms using three different data storage technologies and models (cf. Chapter 5). The best performing model was the relational model, which was able to compute suitable recommendations in 3 ms on the mass-collaboratively created DBpedia dataset. This fulfills the performance requirements of an online real-time recommender system, which is also crucial for the acceptance of a recommender system.

Furthermore, the general applicability of the SnoopyConcept was demonstrated by applying the approach to different domains (cf. Chapter 4), such as image tagging and personal information management systems. In those domains the recommendation approach was furthermore personalized to optimize the recommendations (cf. Section 3.5). In the evaluation of the personalized recommender algorithm in Section 6.2 we showed that the recall can be increased from 74% to 79% by incorporating the user-specific behaviour.

Moreover, the property suggestor of Wikidata, which conforms to the SnoopyConcept and was discussed in Section 5.4, was introduced in 2014. It shows that the SnoopyConcept approach of guiding the user during the insertion of knowledge, which we already introduced in 2010 [61], has become mainstream within the past years and emphasizes the usefulness of the approach. In Section 5.4.2 we proposed an extension of the property suggestor by incorporating the SnoopyConcept context (cf. Section 3.4.3). Our evaluation in Section 6.1.2 showed that this context is able to further improve the accuracy of the system.

In conclusion, we showed in this thesis that the SnoopyConcept is able to increase the quantity and quality of knowledge in semi-structured information systems by leveraging recommender systems to incorporate and guide the user already during the insertion process.


Bibliography

[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 411–422, Vienna, Austria. VLDB Endowment, 2007. [2] Z. Abedjan and F. Naumann. Amending RDF Entities with New Facts. In The Semantic Web: ESWC 2014 Satellite Events: ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers. Springer International Publishing, Cham, 2014, pages 131–143. [3] Z. Abedjan and F. Naumann. Improving RDF Data Through Association Rule Mining. Datenbank-Spektrum, 13(2):111–120, 2013. [4] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries, 1(1):68–88, 1997. [5] G. Adomavicius and A. Tuzhilin. Context-Aware Recommender Systems. In Recommender Systems Handbook. Springer US, Boston, MA, 2011, pages 217–253. [6] G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005. [7] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of data, pages 207–216, Washington, D.C., United States. ACM, 1993.

[8] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings 20th Int. Conf. Very Large Data Bases, VLDB, vol- ume 1215, page 487499, 1994. [9] R. Agrawal, T. Imieli´nski,and A. Swami. Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2):207–216, 1993. [10] S. Aksoy and R. M. Haralick. Feature Normalization and Likelihood- based Similarity Measures for Image Retrieval. Pattern Recogn. Lett., 22(5):563–582, Apr. 2001. [11] B. Andersen. Meta Tags: The Poor Man’s RDF? Sci-Fi Hi-Fi Blog http: // weblog. scifihifi. com/ 2005/ 08/ 05/ meta-tags-the- poor - mans - rdf/ , 2005. available at http://web.archive.org version September 2, 2011. accessed 2017-07-17. [12] R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing Surveys, 40(1):1:1–1:39, Feb. 2008. [13] R. Angles, A. Prat-P´erez, D. Dominguez-Sal, and J.-L. Larriba-Pey. Benchmarking Database Systems for Social Network Applications. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, 15:1–15:7, New York. ACM, 2013. [14] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11- 15, 2007. Springer Berlin Heidelberg, 2007, pages 722–735. [15] D. Aumueller and S. Auer. Towards a Semantic Wiki Experience: Desk- top Integration and Interactivity in WikSAR. In Proceedings of the 2005 International Conference on Semantic Desktop Workshop: Next Generation Information Management & Collaboration Infrastructure - Volume 175, SDW’05, pages 212–217, Galway, Ireland. CEUR-WS.org, 2005. [16] C. W. Bachman. The Origin of the Integrated Data Store (IDS): The First Direct-Access DBMS. IEEE Annals of the History of Computing, 31(4):42–54, 2009. [17] R. M. Bell, Y. Koren, and C. Volinsky. All Together Now: A Perspective on the Netflix Prize. CHANCE, 23(1):24–29, 2010. [18] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Work- shop in conjunction with International Conference on Knowledge Dis- covery and Data Mining, 2007. [19] N. Bhatia and Vandana. Survey of Nearest Neighbor Techniques. Com- puting Research Repository (CoRR), abs/1007.0085, 2010.


[20] R. Bierbauer. Development of a light-weight web client for a hashtag based PIM system. Bachelor’s Thesis, University of Innsbruck, Nov. 2013. supervised by Wolfgang Gassler, Eva Zangerle, G¨unther Specht. [21] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. Spider- Store: A Native Main Memory Approach for Graph Storage. In Pro- ceedings of the 23nd Workshop Grundlagen von Datenbanken (GvDB 2011), Obergurgl, Austria. CEUR-WS.org, ISSN 1613-0073, Vol. 733, 2011. [22] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. Spider- Store: Exploiting Main Memory for Efficient RDF Graph Representa- tion and Fast Querying. In Proceedings of the 1st International Work- shop on Semantic Data Management (SemData) at the 36th Interna- tional Conference on Very Large Data Bases (VLDB 2010), Singapore. CEUR-WS.org, 2010. [23] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data-the Story so far. International Journal on Semantic Web and Information Systems, 4(2):1–22, 2009. [24] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Free- base: A Collaboratively Created Graph Database for Structuring Hu- man Knowledge. In Proceedings of the 2008 ACM SIGMOD Interna- tional Conference on Management of Data, SIGMOD ’08, pages 1247– 1250, Vancouver, Canada. ACM, 2008. [25] D. Bollen, B. P. Knijnenburg, M. C. Willemsen, and M. Graus. Un- derstanding choice overload in recommender systems. In Proceedings of the fourth ACM Conference on Recommender Systems, RecSys ’10, pages 63–70, Barcelona, Spain. ACM, 2010. [26] P. Boulain, N. Shadbolt, and N. Gibbins. Hyperstructure Maintenance Costs in Large-scale Wikis. In Proceedings of the WWW 2008 Work- shop on Social Web and Knowledge Management, Beijing, China, April 22, 2008, volume 356 of CEUR Workshop Proceedings. CEUR-WS.org, 2008. [27] S. Boyd. Hash Tags = Twitter Groupings. Stoweboyd.com http : / / stoweboyd . com / post / 39877198249 / hash - tags - twitter - groupings , Aug. 2007. accessed 2017-07-17. [28] J. S. Breese, D. Heckerman, and C. Kadie. Empirical Analysis of Predic- tive Algorithms for Collaborative Filtering. In Proceedings of the Four- teenth Conference on Uncertainty in Artificial Intelligence, UAI’98, pages 43–52, Madison, Wisconsin. Morgan Kaufmann Publishers Inc., 1998.


[29] J. Breese, D. Heckerman, and C. Kadie. Empirical Analysis of Predic- tive Algorithms for Collaborative Filtering. In Proceedings of the four- teenth Conference on Uncertainty in Artificial Intelligence, pages 43– 52, Burlington, MA, USA. Morgan Kaufmann Publishers Inc., 1998. [30] P. Brusilovsky. Adaptive Hypermedia. English. User Modeling and User-Adapted Interaction, 11(1-2):87–110, 2001. [31] M. Buffa, F. Gandon, G. Ereteo, P. Sander, and C. Faron. SweetWiki: A Semantic Wiki. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):84–97, Feb. 2008. [32] P. Buneman. Semistructured data. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 117–121. ACM, 1997. [33] R. Burke. Knowledge-based recommender systems. Encyclopedia of li- brary and information science, 69(Supplement 32):180, 2000. [34] M. B¨urgler. SnoopyTagging. Master’s thesis, University of Innsbruck, Oct. 2011. supervised by Eva Zangerle and Wolfgang Gassler. [35] S. K. Card, G. G. Robertson, and J. D. Mackinlay. The Information Vi- sualizer, an Information Workspace. In Proceedings of the SIGCHI Con- ference on Human Factors in Computing Systems, CHI ’91, pages 181– 186, New Orleans, Louisiana, USA. ACM, 1991. [36] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka J., and T. M. Mitchell. Toward an Architecture for Never-ending Language Learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1306–1313, Atlanta, Georgia. AAAI Press, 2010. [37] M. Castells. The Rise of The Network Society: The Information Age: Economy, Society and Culture. Information Age Series. Wiley, 2000. [38] R. Cattell, D. Barry, D. Bartels, M. Berler, J. Eastman, S. Gamer- man, D. Jordan, A. Springer, H. Strickland, and D. Wade. The Object Database Standard: ODMG 2.0, volume 131. Morgan Kaufmann Pub- lishers, 1997. [39] W. B. Cavnar and J. M. Trenkle. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, 1994. [40] O.` Celma. Music Recommendation and Discovery - The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer, 2010, pages I– XVI, 1–194. [41] P. Y. Chau, S. Y. Ho, K. K. Ho, and Y. Yao. Examining the Effects of Malfunctioning Personalized Services on Online Users’ Distrust and Behaviors. Decision Support Systems, 56(C):180–191, Dec. 2013.


[42] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6):377–387, June 1970. [43] P. R. Cohen and C. R. Perrault. Elements of a Plan-Based Theory of Speech Acts*. Cognitive Science, 3(3):177–212, 1979. [44] G. Copeland and D. Maier. Making Smalltalk a Database System. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD ’84, pages 316–325, Boston, Mas- sachusetts. ACM, 1984. [45] P. Cremonesi, R. Turrin, E. Lentini, and M. Matteucci. An evaluation methodology for collaborative recommender systems. In AXMEDIS’08. International Conference on Automated solutions for Cross Media Con- tent and Multi-channel Distribution, pages 224–231. IEEE, 2008. [46] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The YouTube Video Recommendation System. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 293– 296, Barcelona, Spain. ACM, 2010. [47] B. E. De. Lateral Thinking (Pelican). Penguin UK, 1977. [48] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04 Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, pages 10– 10, San Francisco, CA. USENIX Association, 2004. [49] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the Amer- ican Society for Information Science, 41(6):391–407, 1990. [50] D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85–98, June 1992. [51] T. Di Noia, R. Mirizzi, V. C. Ostuni, D. Romito, and M. Zanker. Linked Open Data to Support Content-based Recommender Systems. In Pro- ceedings of the 8th International Conference on Semantic Systems, I- SEMANTICS ’12, pages 1–8, Graz, Austria. ACM, 2012. [52] C. Esswein. Development of a responsive webapp for a hashtag based PIM system. Bachelor’s Thesis, University of Innsbruck, Nov. 2014. supervised by Wolfgang Gassler, Eva Zangerle, G¨unther Specht. [53] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Unsupervised named-entity ex- traction from the Web: An experimental study. Artificial Intelligence, 165(1):91 –134, 2005.


[54] D. Eynard, L. Mazzola, and A. Dattolo. Exploiting tag similarities to discover synonyms and homonyms in folksonomies. Software: Practice and Experience, 43(12):1437–1457, 2013. [55] A. Fader, S. Soderland, and O. Etzioni. Identifying Relations for Open Information Extraction. In Proceedings of the Conference on Empiri- cal Methods in Natural Language Processing, EMNLP ’11, pages 1535– 1545, Edinburgh, United Kingdom. Association for Computational Lin- guistics, 2011. [56] D. C. Faye, O. Cur´e,and G. Blin. A survey of RDF storage approaches. Revue Africaine de la Recherche en Informatique et Math´ematiques Appliqu´ees, 15:11–35, 2012. [57] G. Fischer. User Modeling in Human–Computer Interaction. English. User Modeling and User-Adapted Interaction, 11(1-2):65–86, 2001. [58] T. Fraydenegg. Development of a light-weight HTML5 Web Client for an Information System adapting Social-Media Concepts. Bachelor’s Thesis, University of Innsbruck, June 2013. supervised by Wolfgang Gassler, Eva Zangerle, G¨unther Specht. [59] G. Furnas, T. Landauer, L. Gomez, and S. Dumais. The Vocabulary Problem in Human-System Communication. Communications of the ACM, 30(11):971, 1987. [60] W. Gassler, Zangerle, and G. Specht. The Snoopy Concept: Fighting Heterogeneity in Semistructured and Collaborative Information Sys- tems by using Recommendations. In The 2011 International Confer- ence on Collaboration Technologies and Systems (CTS 2011), pages 61– 68, Philadelphia, PE, 2011. [61] W. Gassler, E. Zangerle, M. Tschuggnall, and G. Specht. SnoopyDB: Narrowing the Gap between Structured and Unstructured Information using Recommendations. In HT’10, Proceedings of the 21st ACM Con- ference on Hypertext and Hypermedia, Toronto, Ontario, Canada, June 13-16, 2010, pages 271–272, 2010. [62] W. Gassler and E. Zangerle. Recommendation-Based Evolvement of Dynamic Schemata in Semistructured Information Systems. In Pro- ceedings of the 22nd Workshop Grundlagen von Datenbanken (GvDB 2010), Bad Helmstedt, Germany. CEUR-WS.org, 2010. [63] W. Gassler, E. Zangerle, M. B¨urgler,and G. Specht. SnoopyTag- ging: Recommending Contextualized Tags to Increase the Quality and Quantity of Meta-information. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW ’12 Companion, pages 511–512, Lyon, France. ACM, 2012.


[64] W. Gassler, E. Zangerle, and G. Specht. Guided Curation of Semistruc- tured Data in Collaboratively-built Knowledge Bases. Journal on Fu- ture Generation Computer Systems, 31:111–119, May 2014. impact fac- tor 1.978. [65] Proceedings of the 23rd GI-Workshop ”Grundlagen von Datenbanken 2011”, Obergurgl, Austria, May 31 - June 03, 2011, volume 733 of CEUR Workshop Proceedings, 2011. CEUR-WS.org. [66] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collabora- tive filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992. [67] A. Halfaker, R. S. Geiger, J. T. Morgan, and J. Riedl. The Rise and De- cline of an Open Collaboration System. American Behavioral Scientist, 57(5):664–688, 2013. [68] H. Halpin, V. Robu, and H. Shepherd. The Complex Dynamics of Col- laborative Tagging. In Proceedings of the 16th International Confer- ence on World Wide Web, WWW ’07, pages 211–220, Banff, Alberta, Canada. ACM, 2007. [69] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns Without Candi- date Generation. In Proceedings of the 2000 ACM SIGMOD Interna- tional Conference on Management of Data, SIGMOD ’00, pages 1–12, Dallas, Texas, USA. ACM, 2000. [70] B. Hanrahan, G. Bouchard, G. Convertino, T. Weksteen, N. Kong, C. Archambeau, and E. H. Chi. Mail2Wiki: Low-cost Sharing and Early Curation from Email to Wikis. In Proceedings of the 5th International Conference on Communities and Technologies, pages 98–107, Brisbane, Australia. ACM, 2011. [71] F. M. Harper and J. A. Konstan. The MovieLens Datasets: History and Context. ACM Transactions on. Interactive Intelligent Systems, 5(4):19:1–19:19, Dec. 2015. [72] A. Harth and S. Decker. Optimized index structures for querying RDF from the Web. In Web Congress, 2005. LA-WEB 2005. Third Latin American, 10 pp. 2005. [73] M. Hausenblas and W. Halb. Interlinking of Resources with Semantics. In 5th European Semantic Web Conference, 2008. [74] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An Algorith- mic Framework for Performing Collaborative Filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 230–237, Berkeley, California, USA. ACM, 1999.


[75] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluat- ing Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 22(1):5–53, Jan. 2004. [76] J. Hoffart, F. Suchanek, K. Berberich, E. Lewis-Kelham, G. d. Melo, and G. Weikum. YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and many Languages. In Proceedings of the 20th International Conference on World Wide Web, pages 229–232, Hyderabad, India. ACM, 2011. [77] D. Hoppe. Social-media concepts in personal information management systems. Bachelor’s Thesis, University of Innsbruck, June 2012. super- vised by Wolfgang Gassler and Eva Zangerle. [78] I. Horrocks. Semantic Web: the Story so far. In Proceedings of the 2007 International Cross-Disciplinary Conference on Web Accessibil- ity (W4A), pages 120–125. ACM, 2007. [79] E. Horvitz. Principles of Mixed-Initiative User Interfaces. In Proceed- ings of the SIGCHI Conference on Human Factors in Computing Sys- tems, pages 159–166. ACM, 1999. [80] J. Hu, L. Fang, Y. Cao, H.-J. Zeng, H. Li, Q. Yang, and Z. Chen. Enhancing Text Clustering by Leveraging Wikipedia Semantics. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 179–186, Singapore, Singapore. ACM, 2008. [81] Z. Huang, W. Chung, and H. Chen. A graph model for E-commerce recommender systems. Journal of the American Society for Information Science and Technology, 55(3):259–274, 2004. [82] L. Iaquinta, M. d. Gemmis, P. Lops, G. Semeraro, M. Filannino, and P. Molino. Introducing Serendipity in a Content-Based Recommender System. In Eighth International Conference on Hybrid Intelligent Sys- tems, pages 168–173, 2008. [83] G. Inc. Introducing the Knowledge Graph: things, not strings. Official Google Blog http : / / googleblog . blogspot . co . at / 2012 / 05 / introducing-knowledge-graph-things-not. , 2012. accessed 2017-07-17. [84] International Bureau of Weights and Measures. The International Sys- tem of Units (SI). Stedi Media, Paris, 2006. [85] D. Jannach, P. Resnick, A. Tuzhilin, and M. Zanker. Recommender Systems — Beyond Matrix Completion. Communications of the ACM, 59(11):94–102, Oct. 2016. [86] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems, An Introduction. Cambridge University Press, 2010. Cam- bridge Books Online.


[87] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems, Chapter 2 - Collaborative recommendation. Cambridge Uni- versity Press, 2010. Cambridge Books Online. [88] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems, Chapter 3 - Content-based recommendation. Cambridge Uni- versity Press, 2010. Cambridge Books Online. [89] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems, Chapter 4 - Knowledge-based recommendation. Cambridge University Press, 2010. Cambridge Books Online. [90] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems, Chapter 5 - Hybrid recommendation approaches. Cambridge University Press, 2010. Cambridge Books Online. [91] D. Jurafsky and J. Martin. Speech and Language Processing. Prentice Hall, 2nd edition, 2008. [92] M. Kaminskas and F. Ricci. Contextual music information retrieval and recommendation: State of the art and challenges. Computer Science Review, 6(2–3):89 –119, 2012. [93] A. Kemper and T. Neumann. HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots. In IEEE 27th International Conference on Data Engineering (ICDE), pages 195–206, 2011. [94] A. Kittur, E. Chi, B. Pendleton, B. Suh, and T. Mytkowicz. Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. World Wide Web, 1(2):19, 2007. [95] G. Klyne, J. Carroll, and B. McBride. Resource Description Framework (RDF). W3C recommendation, 2004. [96] A. Kobsa. Generic User Modeling Systems. English. User Modeling and User-Adapted Interaction, 11(1-2):49–63, 2001. [97] N. Kong, B. Hanrahan, T. Weksteen, G. Convertino, and E. H. Chi. Vi- sualWikiCurator: human and machine intelligence for organizing wiki content. In Proceedings of the 16th International Conference on Intel- ligent User Interfaces, IUI ’11, pages 367–370, Palo Alto, CA, USA. ACM, 2011. [98] Y. Koren. Factorization Meets the Neighborhood: A Multifaceted Col- laborative Filtering Model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 426–434, Las Vegas, Nevada, USA. ACM, 2008. [99] Y. Koren, R. Bell, and C. Volinsky. Matrix Factorization Techniques for Recommender Systems. Computer, 42(8):30–37, Aug. 2009.


[100] M. Krötzsch, D. Vrandečić, and M. Völkel. Semantic MediaWiki. In The Semantic Web - ISWC 2006: 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings. Springer Berlin Heidelberg, 2006, pages 935-942.
[101] M. Krötzsch, D. Vrandečić, and M. Völkel. Wikipedia and the Semantic Web - The Missing Links. In Proceedings of Wikimania 2005, 2005.
[102] C. Laqua. Development of a responsive Android client for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2014. Supervised by Eva Zangerle, Wolfgang Gassler, and Günther Specht.
[103] A. Larcher, E. Zangerle, W. Gassler, and G. Specht. Key Recommendations for Infoboxes in Wikipedia. Website of the 22nd ACM Conference on Hypertext and Hypermedia, 2011. Poster presentation.
[104] O. Lassila and R. R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation, W3C, 1999.
[105] C. Lécluse, P. Richard, and F. Vélez. O2, an Object-Oriented Data Model. In Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, EDBT '88, pages 556-562, London, UK. Springer-Verlag, 1988.
[106] M. Levene and A. Poulovassilis. An Object-oriented Data Model Formalised Through Hypergraphs. Data & Knowledge Engineering, 6(3):205-224, May 1991.
[107] G. Linden, B. Smith, and J. York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7:76-80, 2003.
[108] M. Lipczak. Tag recommendation for folksonomies oriented towards individual users. In European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, Discovery Challenge, pages 84-95, 2008.
[109] M. Lux, M. Granitzer, and R. Kern. Aspects of Broad Folksonomies. In DEXA '07: 18th International Workshop on Database and Expert Systems Applications, pages 283-287, 2007.
[110] C. Macfarquhar and G. Gleig. Encyclopædia Britannica: or, A dictionary of arts, sciences, and miscellaneous literature. A. Bell and C. Macfarquhar, 1797.
[111] A. Majchrzak, C. Wagner, and D. Yates. Corporate wiki users: results of a survey. In WikiSym '06: Proceedings of the 2006 International Symposium on Wikis, pages 99-104, Odense, Denmark. ACM, 2006.
[112] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to information retrieval. Cambridge University Press, Cambridge, 2008.

[113] J. Markoff. Computer Wins on 'Jeopardy!': Trivial, It's Not. The New York Times, http://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html, Feb. 2011. Accessed 2017-07-17.
[114] G. Miller. WordNet: a Lexical Database for English. Communications of the ACM, 38(11):39-41, 1995.
[115] G. Miller. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.
[116] R. B. Miller. Response Time in Man-computer Conversational Transactions. In Proceedings of the Fall Joint Computer Conference, Part I, AFIPS '68 (Fall, part I), pages 267-277, San Francisco, California. ACM, 1968.
[117] D. Milne, O. Medelyan, and I. H. Witten. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), pages 442-448, 2006.
[118] D. Milne, O. Medelyan, and I. H. Witten. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI '06, pages 442-448, Washington, DC, USA. IEEE Computer Society, 2006.
[119] B. Mobasher. Recommender Systems. Künstliche Intelligenz, Special Issue on Web Mining, 3:41-43, 2007.
[120] G. E. Modoni, M. Sacco, and W. Terkaj. A survey of RDF store solutions. In 2014 International Conference on Engineering, Technology and Innovation (ICE), pages 1-7, 2014.
[121] M. Müller. Development of a self-learning recommendation module for a hashtag based PIM system. Bachelor's Thesis, University of Innsbruck, Nov. 2013. Supervised by Wolfgang Gassler, Eva Zangerle, and Günther Specht.
[122] N. Nakashole, M. Theobald, and G. Weikum. Scalable Knowledge Harvesting with High Precision and High Recall. In Proceedings of the Fourth International Conference on Web Search and Data Mining, pages 227-236, Hong Kong, China. ACM, 2011.
[123] K. Nakayama, T. Hara, and S. Nishio. Wikipedia Mining for an Association Web Thesaurus Construction. In Web Information Systems Engineering - WISE 2007: 8th International Conference on Web Information Systems Engineering, Nancy, France, December 3-7, 2007, Proceedings. Springer Berlin Heidelberg, 2007, pages 322-334.
[124] J. Nielsen. Usability Engineering. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[125] J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt. Leveraging Terminological Structure for Object Reconciliation. In The Semantic Web: Research and Applications: 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 - June 3, 2010, Proceedings, Part II. Springer Berlin Heidelberg, 2010, pages 334-348.
[126] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In KDD Cup Workshop at SIGKDD'07, 13th ACM Int. Conf. on Knowledge Discovery and Data Mining, pages 39-42, San Jose, CA, USA, 2007.
[127] C. R. Perrault, J. F. Allen, and P. R. Cohen. Speech Acts As a Basis for Understanding Dialogue Coherence. In Proceedings of the 1978 Workshop on Theoretical Issues in Natural Language Processing, TINLAP '78, pages 125-132, Urbana-Champaign, Illinois. Association for Computational Linguistics, 1978.
[128] I. Peters. Folksonomies. Indexing and Retrieval in Web 2.0. Walter de Gruyter & Co., Hawthorne, NJ, USA, 2009.
[129] F. Pfisterer, M. Nitsche, A. Jameson, and C. Barbu. User-Centered Design and Evaluation of Interface Enhancements to the Semantic MediaWiki. In Proceedings of the 5th International Workshop on Semantic Web User Interaction (SWUI 2008), Florence, Italy, April 5, 2009.
[130] J. Poissonnier. Extraction of Semi-Structured Online Data. Bachelor's Thesis, University of Innsbruck, May 2012. Supervised by Eva Zangerle and Wolfgang Gassler.
[131] A. Poulovassilis and M. Levene. A Nested-graph Model for the Representation and Manipulation of Complex Objects. ACM Transactions on Information Systems, 12(1):35-68, Jan. 1994.
[132] D. J. Power. DSS News, A Bi-Weekly Publication of DSSResources.COM, http://www.dssresources.com/newsletters/66.php, Nov. 2002. Accessed 2017-07-17.
[133] R. Priedhorsky, J. Chen, S. K. Lam, K. Panciera, L. Terveen, and J. Riedl. Creating, Destroying, and Restoring Value in Wikipedia. In Proceedings of the 2007 International Conference on Supporting Group Work, pages 259-268. ACM, 2007.
[134] D. Pritchett. BASE: An Acid Alternative. Queue, 6(3):48-55, May 2008.
[135] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. Technical report, W3C, 2008.
[136] P. Pu, L. Chen, and R. Hu. A User-centric Evaluation Framework for Recommender Systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 157-164, Chicago, Illinois, USA. ACM, 2011.

[137] S. Rendle. Factorization Machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, pages 995-1000, Washington, DC, USA. IEEE Computer Society, 2010.
[138] P. Resnick and H. Varian. Recommender Systems. Communications of the ACM, 40(3):58, 1997.
[139] Recommender Systems Handbook. Springer, Berlin, Heidelberg, New York, 2011.
[140] E. Rich. User Modeling via Stereotypes. Cognitive Science, 3(4):329-354, 1979.
[141] M. A. Rodriguez and P. Neubauer. Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology, 36(6):35-41, 2010.
[142] S. Sakr and G. Al-Naymat. Relational Processing of RDF Queries: A Survey. ACM SIGMOD Record, 38(4):23-28, June 2010.
[143] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613-620, Nov. 1975.
[144] T. Sampson. Nobody wants to edit Wikipedia anymore. The Daily Dot, http://www.dailydot.com/business/wikipedia-editors-decline-wikimedia-fellows/, 2013. Accessed 2017-07-17.
[145] T. Sampson. Will Wikipedia's pretty new editing software solve its recruitment crisis? The Daily Dot, http://www.dailydot.com/business/wikipedia-visual-editor-/, 2013. Accessed 2017-07-17.
[146] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. T. Riedl. Application of Dimensionality Reduction in Recommender System - A Case Study. In ACM WebKDD Workshop in conjunction with the ACM-SIGKDD Conference on Knowledge Discovery in Databases, 2000.
[147] P. Schäuble. SPIDER: a Multiuser Information Retrieval System for Semistructured and Dynamic Data. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-327, Pittsburgh, Pennsylvania, United States. ACM, 1993.
[148] M. Schedl, P. Knees, B. McFee, D. Bogdanov, and M. Kaminskas. Music Recommender Systems. In Recommender Systems Handbook. Springer US, Boston, MA, 2015, pages 453-492.
[149] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253-260. ACM, 2002.

[150] M. Schmakeit. Extension and Optimization of the Hash5-Server - a Hashtag-based PIM-System. Bachelor's Thesis, University of Innsbruck, Jan. 2015. Supervised by Eva Zangerle, Wolfgang Gassler, and Günther Specht.
[151] G. Shani and A. Gunawardana. Evaluating Recommendation Systems. In Recommender Systems Handbook. Springer US, Boston, MA, 2011, pages 257-297.
[152] B. Shao, D. Wang, T. Li, and M. Ogihara. Music Recommendation Based on Acoustic Features and User Access Patterns. IEEE Transactions on Audio, Speech, and Language Processing, 17(8):1602-1611, 2009.
[153] P. Shvaiko and J. Euzenat. A Survey of Schema-based Matching Approaches. Journal on Data Semantics, 4:146-171, 2005.
[154] T. Simonite. The Decline of Wikipedia. MIT Technology Review, https://www.technologyreview.com/s/520446/the-decline-of-wikipedia/, Oct. 2013. Accessed 2017-07-17.
[155] B. Smith and G. Linden. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Computing, 21(3):12-18, 2017.
[156] B. Smyth, M. Coyle, P. Briggs, K. McNally, and M. P. O'Mahony. Collaboration, Reputation and Recommender Systems in Social Web Search. In Recommender Systems Handbook, pages 569-608. Springer US, Boston, MA, 2015.
[157] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 1150-1160, Vienna, Austria. VLDB Endowment, 2007.
[158] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: A Column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 553-564, Trondheim, Norway. VLDB Endowment, 2005.
[159] M. Strube and S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI'06, pages 1419-1424, Boston, Massachusetts. AAAI Press, 2006.
[160] F. Suchanek, M. Sozio, and G. Weikum. SOFIE: a Self-organizing Framework for Information Extraction. In Proceedings of the 18th International Conference on World Wide Web, pages 631-640, Madrid, Spain. ACM, 2009.

[161] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697-706, Banff, Alberta, Canada. ACM, 2007.
[162] B. Suh, G. Convertino, E. Chi, and P. Pirolli. The Singularity is not near: slowing growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, page 8. ACM, 2009.
[163] M. Taramigkou, E. Bothos, K. Christidis, D. Apostolou, and G. Mentzas. Escape the Bubble: Guided Exploration of Music Preferences for Serendipity and Novelty. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 335-338, Hong Kong, China. ACM, 2013.
[164] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking Database Representations of RDF/S Stores. In The Semantic Web - ISWC 2005: 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6-10, 2005, Proceedings. Springer Berlin Heidelberg, 2005, pages 685-701.
[165] A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos Solution to the Netflix Grand Prize, 2009. Accessed 2017-07-17.
[166] F. Vignoli and S. Pauws. A music retrieval system based on user-driven similarity and its evaluation. In Proceedings of the International Symposium on Music Information Retrieval, pages 272-279, 2005.
[167] M. Völkel, M. Krötzsch, D. Vrandečić, H. Haller, and R. Studer. Semantic Wikipedia. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 585-594, Edinburgh, Scotland. ACM, 2006.
[168] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - A Link Discovery Framework for the Web of Data. In Proceedings of the 2nd Linked Data on the Web Workshop, 2009.
[169] D. Vrandečić. Wikidata: a new platform for collaborative data collection. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion, pages 1063-1064, Lyon, France. ACM, 2012.
[170] D. Vrandečić and M. Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 57(10):78-85, Sept. 2014.
[171] G. Weikum, G. Kasneci, M. Ramanath, and F. Suchanek. Database and Information-Retrieval Methods for Knowledge Discovery. Communications of the ACM, 52(4):56-64, 2009.

[172] C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1):1008-1019, Aug. 2008.
[173] D. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty, R. Hoffmann, K. Patel, and M. Skinner. Intelligence in Wikipedia. In Proceedings of the 23rd National Conference on AI, pages 1609-1614, Chicago, Illinois. AAAI Press, 2008.
[174] S. Whittaker. Personal information management: From information consumption to curation. Annual Review of Information Science and Technology, 45(1):1-62, 2011.
[175] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83, 1945.
[176] M. Wolf. Extension of Information Types for an Android Client for a Hashtag-based PIM System. Bachelor's Thesis, University of Innsbruck, Oct. 2014. Supervised by Eva Zangerle, Wolfgang Gassler, and Günther Specht.
[177] D. Wood. Kowari: A Platform for Semantic Web Storage and Analysis. In XTech 2005 Conference, 2005.
[178] F. Wu, R. Hoffmann, and D. Weld. Information Extraction from Wikipedia: Moving down the Long Tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 731-739. ACM, 2008.
[179] F. Wu and D. S. Weld. Automatically Refining the Wikipedia Infobox Ontology. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 635-644, Beijing, China. ACM, 2008.
[180] F. Wu and D. Weld. Autonomously Semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 41-50. ACM, 2007.
[181] V. Zanardi and L. Capra. Dynamic Updating of Online Recommender Systems via Feed-forward Controllers. In Proceedings of the 6th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, SEAMS '11, pages 11-19, Waikiki, Honolulu, USA. ACM, 2011.
[182] E. Zangerle, W. Gassler, and G. Specht. Recommending Structure in Collaborative Semistructured Information Systems. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 261-264, Barcelona, Spain. ACM, 2010.
[183] E. Zangerle and W. Gassler. Dealing with Structure Heterogeneity in Semantic Collaborative Environments. In Collaboration and the Semantic Web: Social Networks, Knowledge Networks and Knowledge Resources. IGI Publishers, Hershey, Pennsylvania, USA, 2012.

[184] E. Zangerle, W. Gassler, M. Pichl, S. Steinhauser, and G. Specht. An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases. In Proceedings of the 12th International Symposium on Open Collaboration, OpenSym 2016, Berlin, Germany, August 17-19, 2016, pages 18:1-18:8. ACM, 2016.
[185] E. Zangerle, W. Gassler, and G. Specht. Exploiting Twitter's Collective Knowledge for Music Recommendations. In Proceedings of the 2nd Workshop on Making Sense of Microposts (#MSM2012): Big things come in small packages, Lyon, France, 16 April 2012, pages 14-17, 2012.
[186] E. Zangerle, W. Gassler, and G. Specht. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web 2011 in connection with the 19th International Conference on User Modeling, Adaptation and Personalization, UMAP 2011, pages 67-78, Gerona, Spain. CEUR-WS.org, 2011.
[187] E. Zangerle, W. Gassler, and G. Specht. Using Tag Recommendations to Homogenize Folksonomies in Microblogging Environments. In Social Informatics: Third International Conference, SocInfo 2011, Singapore, October 6-8, 2011, Proceedings. Springer Berlin Heidelberg, 2011, pages 113-126.
[188] E. Zangerle, W. Gassler, and G. Specht. On the impact of text similarity functions on hashtag recommendations in microblogging environments. Social Network Analysis and Mining, 3(4):889-898, 2013.
[189] E. Zangerle, M. Pichl, W. Gassler, and G. Specht. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the 1st ACM International Workshop on Internet-Scale Multimedia Management, ISMM '14, pages 21-26, Orlando, Florida, USA. ACM, June 2014.
[190] Y. C. Zhang, D. Ó Séaghdha, D. Quercia, and T. Jambor. Auralist: Introducing Serendipity into Music Recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 13-22, Seattle, Washington, USA. ACM, 2012.
[191] A. Zubiaga. Enhancing Navigation on Wikipedia with Social Tags. Computing Research Repository (CoRR), abs/1202.5469, 2012.