This Electronic Thesis Or Dissertation Has Been Downloaded from Explore Bristol Research

This electronic thesis or dissertation has been downloaded from Explore Bristol Research, http://research-information.bristol.ac.uk

Author: Price, Simon
Title: Higher-order frameworks for profiling and matching heterogeneous data

General rights

Access to the thesis is subject to the Creative Commons Attribution - NonCommercial-No Derivatives 4.0 International Public License. A copy of this may be found at https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode This license sets out your rights and the restrictions that apply to your access to the thesis so it is important you read this before proceeding.

Take down policy

Some pages of this thesis may have been removed for copyright restrictions prior to having it been deposited in Explore Bristol Research. However, if you have discovered material within the thesis that you consider to be unlawful e.g. breaches of copyright (either yours or that of a third party) or any other law, including but not limited to those relating to patent, trademark, confidentiality, data protection, obscenity, defamation, libel, then please contact [email protected] and include the following information in your message:

  • Your contact details
  • Bibliographic details for the item, including a URL
  • An outline nature of the complaint

Your claim will be investigated and, where appropriate, the item in question will be removed from public view as soon as possible.

Higher-order frameworks for profiling and matching heterogeneous data

Simon Price
Department of Computer Science
University of Bristol

A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Doctor of Philosophy in the Faculty of Engineering, Department of Computer Science.
June 2014
Word count: circa 70,000

Abstract

This Thesis brings together complementary research from higher-order computational logic and workflow systems to investigate software and theoretical frameworks for profiling and matching heterogeneous data. A motivating use case is submission sifting, which matches submitted conference or journal papers to potential peer reviewers based on the similarity between the paper’s abstract and the reviewer’s publications as found in online bibliographic databases. Inspired by e-Science workflows, we introduce the SubSift submission sifting framework for developing web-based research intelligence applications that profile and match heterogeneous textual content from web pages and documents. Abstracting SubSift, we define a formal higher-order dataflow framework that ranges over a class of higher-order relations that are sufficiently expressive to represent a wide variety of data types and structures. This dataflow model is shown to be embarrassingly parallel. JSONMatch, our proof-of-concept serial implementation, is used to demonstrate that the combination of this model and higher-order representation provides a flexible approach to analysing heterogeneous data. Finally, we propose a theoretical framework for querying structured data, elevating Codd’s relational algebra to a higher-order algebra defined on the basic terms of a higher-order logic. An extension incorporates approximate joins on structured data and is demonstrated to be feasible and have promise for future work.

Acknowledgements

Firstly, I would like to thank my co-authors of works cited in this Thesis: Christopher Bailey and Nikki Rogers (ILRT, University of Bristol), Peter Flach, Bruno Golénia and Sebastian Spiegler (Intelligent Systems Laboratory, University of Bristol), John Guiver and Thore Graepel (Microsoft Research, Cambridge), Ralf Herbrich (Amazon, Mountain View, CA), and Mohammed Zaki (Rensselaer Polytechnic Institute).
Thanks are also due to further colleagues at ILRT for their expert advice on RDF and other topics: Mike Jones, Damian Steer and Jasper Tredgold. The implementation work in Chapter 5 benefited enormously from software engineering debates with my co-authors of “Coding guidelines for Prolog”: Michael Covington, Roberto Bagnara, Richard O’Keefe and Jan Wielemaker.

I am grateful to the following conference Programme Committee chairs for testing the SubSift software: Peter Flach (KDD’09 and ECML/PKDD’12), Bart Goethals (SDM’10), Qiang Yang (KDD’10), Geoffrey Webb (ICDM’10) and Mohammed Zaki (KDD’09 and PAKDD’10) – some of whom merit a second mention, in their capacity as journal Editors-in-Chief and Editorial Board members, for adopting SubSift in support of their reviewer assignment process: Peter Flach (Machine Learning), and Bart Goethals and Geoffrey Webb (Data Mining and Knowledge Discovery).

Special recognition should be given to my Thesis examiners, Carole Goble CBE (University of Manchester) and Tim Kovacs (University of Bristol), for their insightful reviews and a thoroughly inspirational viva.

Finally, I am particularly indebted to Peter Flach, my supervisor and colleague, for his contagious passion for data science and his unwavering support.

Author’s declaration

I declare that the work in this dissertation was carried out in accordance with the requirements of the University’s Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Except where indicated by specific reference in the text, the work is the candidate’s own work. Work done in collaboration with, or with the assistance of, others, is indicated as such. Any views expressed in the dissertation are those of the author.

SIGNED: ............................................................. DATE:..........................

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Use Case 1 – Submission Sifting
    1.1.2 Use Case 2 – Finding an Expert
    1.1.3 Use Case 3 – Visualising Similarity Networks
    1.1.4 Use Case 4 – Profiling Reading Lists
    1.1.5 Use Case 5 – Ranking News Stories
    1.1.6 Related Use Cases
  1.2 Frameworks
    1.2.1 Dataflows
    1.2.2 Workflows
    1.2.3 Dataspaces
  1.3 Research Questions and Objectives
  1.4 Contributions
  1.5 Publications
  1.6 Structure of Thesis

2 Background
  2.1 Terminology
  2.2 Informative Background Topics
    2.2.1 Representing Structured Data
    2.2.2 Comparing Heterogeneous Data
    2.2.3 Information Retrieval
    2.2.4 Comparing Structured Data
  2.3 Required Background Theory
    2.3.1 Vector Space Model
    2.3.2 Individuals as Terms
    2.3.3 Basic Terms in a Higher-Order Logic
    2.3.4 Kernels and Distances for Basic Terms
    2.3.5 RESTful Web Services
  2.4 Summary

3 Workflows for Profiling and Matching Textual Content
  3.1 SubSift Case Study
  3.2 SubSift Web Services
    3.2.1 Example REST API Enactment of Submission Sifting
  3.3 Profiling and Matching Workflow Demonstrations
    3.3.1 Use Case 1 – Submission Sifting
    3.3.2 Use Case 2 – Finding an Expert
    3.3.3 Use Case 3 – Visualising Similarity Networks
    3.3.4 Use Case 4 – Profiling Reading Lists
    3.3.5 Use Case 5 – Ranking News Stories
    3.3.6 Related Use Cases
  3.4 Summary

4 Higher-Order Dataflows
  4.1 A Higher-Order Dataflow Model
    4.1.1 Generator Function g
    4.1.2 Data Constructor h
  4.2 Knowledge Representation
  4.3 Formalising the Higher-Order Dataflow Model
    4.3.1 Paths
    4.3.2 Path Expressions
    4.3.3 Parallelisability and Scalability of the Model
    4.3.4 Continuations in the Model
  4.4 Implementation
    4.4.1 Data Formats
    4.4.2 Templates and Embedded Functions
    4.4.3 Example Dataflows
    4.4.4 Other Example Dataflows
  4.5 Comparison of JSONMatch with SubSift
    4.5.1 Submission Sifting Workflow Implementation
    4.5.2 Asynchronous Behaviour
    4.5.3 Feature Comparison
    4.5.4 Summary of Comparison
  4.6 Summary

5 Querying and Merging Heterogeneous Data
  5.1 Relational Model
    5.1.1 Relational Operations
    5.1.2 Relational Algebra
  5.2 Lifting the Relational Model to a Higher-Order
    5.2.1 Indexing Basic Terms
  5.3 Basic Term Relational Model
    5.3.1 Basic Term Relational Representation
    5.3.2 Indexing Basic Term Relations
    5.3.3 Basic Term Relational Operations
    5.3.4 Basic Term Relational Algebra
  5.4 Approximate Relational Joins
    5.4.1 Proximity-Joins
    5.4.2 Basic Term Proximity-Join
  5.5 Proof of Concept
    5.5.1 Kernels and Distances for Approximate Joins
    5.5.2 Demonstrations
  5.6 Summary

6 Related Work
  6.1 Broadly Related Areas
    6.1.1 Big Data
    6.1.2 Clustering
    6.1.3 Expert Finding
    6.1.4 Higher-Order Frameworks
    6.1.5 Scientometrics and Bibliometrics
    6.1.6 Semantic Web
    6.1.7 Text Processing
  6.2 Approaches to Profiling and Matching
    6.2.1 Schema and Ontology Matching
    6.2.2 Dataspaces