This Electronic Thesis Or Dissertation Has Been Downloaded from Explore Bristol Research
Total Page:16
File Type:pdf, Size:1020Kb
This electronic thesis or dissertation has been downloaded from Explore Bristol Research, http://research-information.bristol.ac.uk Author: Price, Simon Title: Higher-order frameworks for profiling and matching heterogeneous data General rights Access to the thesis is subject to the Creative Commons Attribution - NonCommercial-No Derivatives 4.0 International Public License. A copy of this may be found at https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode This license sets out your rights and the restrictions that apply to your access to the thesis so it is important you read this before proceeding. Take down policy Some pages of this thesis may have been removed for copyright restrictions prior to having it been deposited in Explore Bristol Research. However, if you have discovered material within the thesis that you consider to be unlawful e.g. breaches of copyright (either yours or that of a third party) or any other law, including but not limited to those relating to patent, trademark, confidentiality, data protection, obscenity, defamation, libel, then please contact [email protected] and include the following information in your message: •Your contact details •Bibliographic details for the item, including a URL •An outline nature of the complaint Your claim will be investigated and, where appropriate, the item in question will be removed from public view as soon as possible. Higher-order frameworks for profiling and matching heterogeneous data Simon Price Department of Computer Science University of Bristol A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Doctor of Philosophy in the Faculty of Engineering, Department of Computer Science. June 2014 Word count: circa 70,000 Abstract This Thesis brings together complementary research from higher-order computational logic and workflow systems to investigate software and theoretical frameworks for profiling and matching heterogeneous data. A motivating use case is submission sifting, which matches submitted con- ference or journal papers to potential peer reviewers based on the similar- ity between the paper’s abstract and the reviewer’s publications as found in online bibliographic databases. Inspired by e-Science workflows, we introduce the SubSift submission sifting framework for developing web- based research intelligence applications that profile and match hetero- geneous textual content from web pages and documents. Abstracting SubSift we define a formal higher-order dataflow framework that ranges over a class of higher-order relations that are sufficiently expressive to represent a wide variety data types and structures. This dataflow model is shown to be embarrassingly parallel. JSONMatch, our proof of concept serial implementation, is used to demonstrate that the combination of this model and higher-order representation provides a flexible approach to analysing heterogeneous data. Finally we propose a theoretical frame- work for querying structured data, elevating Codd’s relational algebra to a higher-order algebra defined on the basic terms of a higher-order logic. An extension incorporates approximate joins on structured data and is demonstrated to be feasible and have promise for future work. iii Acknowledgements Firstly, I would like to thank my co-authors of works cited in this Thesis: Christopher Bailey and Nikki Rogers (ILRT, University of Bristol), Peter Flach, Bruno Golénia and Sebastian Spiegler (Intelligent Systems Laboratory, University of Bristol), John Guiver and Thore Graepel (Microsoft Research, Cambridge), Ralf Herbrich (Ama- zon, Mountain View, CA), and Mohammed Zaki (Rensselaer Polytechnic Institute). Thanks are also due to further colleagues at ILRT for their expert advice on RDF and other topics: Mike Jones, Damian Steer and Jasper Tredgold. The implemen- tation work in Chapter 5 benefited enormously from software engineering debates with my co-authors of, “Coding guidelines for Prolog”: Michael Covington, Roberto Bagnara, Richard O’Keefe and Jan Wielemaker. I am grateful to the following conference Programme Committee chairs for test- ing the SubSift software: Peter Flach (KDD’09 and ECML/PKDD’12), Bart Goethals (SDM’10), Qiang Yang (KDD’10), Geoffrey Webb (ICDM’10) and Mohammed Zaki (KDD’09 and PAKDD’10) – some of whom merit a second mention, in their capac- ity as journal Editor-in-Chiefs and Editorial Board members, for adopting SubSift in support of their reviewer assignment process: Peter Flach (Machine Learning), and Bart Goethals and Geoffrey Webb (Data Mining and Knowledge Discovery). Special recognition should be given to my Thesis examiners, Carole Goble CBE (University of Manchester) and Tim Kovacs (University of Bristol), for their insight- ful reviews and a thoroughly inspirational viva. Finally, I am particularly indebted to Peter Flach, my supervisor and colleague, for his contagious passion for data science and his unwavering support. v Author’s declaration I declare that the work in this dissertation was carried out in accordance with the re- quirements of the University’s Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Ex- cept where indicated by specific reference in the text, the work is the candidate’s own work. Work done in collaboration with, or with the assistance of, others, is indicated as such. Any views expressed in the dissertation are those of the author. SIGNED: ............................................................. DATE:.......................... vii Contents 1 Introduction 1 1.1 Motivation............................... 1 1.1.1 Use Case 1 – Submission Sifting . 2 1.1.2 Use Case 2 – Finding an Expert . 3 1.1.3 Use Case 3 – Visualising Similarity Networks . 3 1.1.4 Use Case 4 – Profiling Reading Lists . 4 1.1.5 Use Case 5 – Ranking News Stories . 4 1.1.6 RelatedUseCases ...................... 5 1.2 Frameworks.............................. 6 1.2.1 Dataflows........................... 6 1.2.2 Workflows .......................... 10 1.2.3 Dataspaces .......................... 18 1.3 Research Questions and Objectives . 20 1.4 Contributions ............................. 22 1.5 Publications.............................. 24 1.6 StructureofThesis .......................... 25 2 Background 27 2.1 Terminology.............................. 27 2.2 Informative Background Topics . 29 2.2.1 Representing Structured Data . 30 2.2.2 Comparing Heterogeneous Data . 40 2.2.3 Information Retrieval . 44 2.2.4 Comparing Structured Data . 45 2.3 Required Background Theory . 49 2.3.1 Vector Space Model . 49 2.3.2 Individuals as Terms . 50 2.3.3 Basic Terms in a Higher-Order Logic . 52 2.3.4 Kernels and Distances for Basic Terms . 57 2.3.5 RESTfulWebServices . 61 2.4 Summary ............................... 63 ix 3 Workflows for Profiling and Matching Textual Content 65 3.1 SubSiftCaseStudy .......................... 65 3.2 SubSiftWebServices . ..... ...... ..... ...... .. 79 3.2.1 Example REST API Enactment of Submission Sifting . 82 3.3 Profiling and Matching Workflow Demonstrations . 86 3.3.1 Use Case 1 – Submission Sifting . 87 3.3.2 Use Case 2 – Finding an Expert . 88 3.3.3 Use Case 3 – Visualising Similarity Networks . 89 3.3.4 Use Case 4 – Profiling Reading Lists . 91 3.3.5 Use Case 5 – Ranking News Stories . 91 3.3.6 RelatedUseCases ...................... 94 3.4 Summary ............................... 96 4 Higher-Order Dataflows 99 4.1 A Higher-Order Dataflow Model . 101 4.1.1 Generator Function g ................... 103 p q 4.1.2 Data Constructor h .................... 105 p q 4.2 Knowledge Representation . 105 4.3 Formalising the Higher-Order Dataflow Model . 106 4.3.1 Paths ............................. 111 4.3.2 Path Expressions . 113 4.3.3 Parallelisability and Scalability of the Model . 115 4.3.4 Continuations in the Model . 121 4.4 Implementation ............................ 123 4.4.1 DataFormats ......................... 123 4.4.2 Templates and Embedded Functions . 124 4.4.3 ExampleDataflows. 125 4.4.4 Other Example Dataflows . 131 4.5 Comparison of JSONMatch with SubSift . 132 4.5.1 Submission Sifting Workflow Implementation . 132 4.5.2 Asynchronous Behaviour . 134 4.5.3 Feature Comparison . 135 4.5.4 Summary of Comparision . 138 4.6 Summary ............................... 138 5 Querying and Merging Heterogeneous Data 141 5.1 RelationalModel ........................... 143 5.1.1 Relational Operations . 144 5.1.2 Relational Algebra . 148 5.2 Lifting the Relational Model to a Higher-Order . 148 5.2.1 Indexing Basic Terms . 149 5.3 Basic Term Relational Model . 157 x 5.3.1 Basic Term Relational Representation . 157 5.3.2 Indexing Basic Term Relations . 158 5.3.3 Basic Term Relational Operations . 158 5.3.4 Basic Term Relational Algebra . 161 5.4 Approximate Relational Joins . 162 5.4.1 Proximity-Joins . 162 5.4.2 Basic Term Proximity-Join . 163 5.5 ProofofConcept ........................... 164 5.5.1 Kernels and Distances for Approximate Joins . 164 5.5.2 Demonstrations . 168 5.6 Summary ............................... 173 6 Related Work 179 6.1 Broadly Related Areas . 179 6.1.1 BigData ........................... 180 6.1.2 Clustering........................... 181 6.1.3 ExpertFinding ... ..... ...... ..... ..... 181 6.1.4 Higher-Order Frameworks . 182 6.1.5 Scientometrics and Bibliometrics . 183 6.1.6 SemanticWeb ........................ 183 6.1.7 Text Processing . 184 6.2 Approaches to Profiling and Matching . 185 6.2.1 Schema and Ontology Matching . 185 6.2.2 Dataspaces .........................