ACTA UNIVERSITATIS UPSALIENSIS Uppsala Dissertations from the Faculty of Science and Technology 121
Andrej Andrejev

Semantic Web Queries over Scientific Data

Dissertation presented at Uppsala University to be publicly examined in Lecture hall 2446, Polacksbacken, Uppsala, Wednesday, 23 March 2016 at 14:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Gerhard Weikum (Max Planck Institute for Informatics).

Abstract

Andrejev, A. 2016. Semantic Web Queries over Scientific Data. Uppsala Dissertations from the Faculty of Science and Technology 121. 214 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9465-0.

Semantic Web and Linked Open Data provide a potential platform for interoperability of scientific data, offering a flexible model for providing machine-readable and queryable metadata. However, RDF and SPARQL gained limited adoption within the scientific community, mainly due to the lack of support for managing massive numeric data, along with certain other important features – such as extensibility with user-defined functions, query modularity, and integration with existing environments and workflows. We present the design, implementation and evaluation of Scientific SPARQL – a language for querying data and metadata combined, represented using the RDF graph model extended with numeric multidimensional arrays as node values – RDF with Arrays. The techniques used to store RDF with Arrays in a scalable way and process Scientific SPARQL queries and updates are implemented in our prototype software – Scientific SPARQL Database Manager, SSDM, and its integrations with data storage systems and computational frameworks. This includes scalable storage solutions for numeric multidimensional arrays and an efficient implementation of array operations.
The arrays can be physically stored in a variety of external storage systems, including files, relational databases, and specialized array data stores, using our Array Storage Extensibility Interface. Whenever possible, SSDM accumulates array operations and accesses array contents in a lazy fashion. In scientific applications, numeric computations are often used for filtering or post-processing the retrieved data, which can be expressed in a functional way. Scientific SPARQL allows expressing common query sub-tasks with functions defined as parameterized queries. This becomes especially useful along with functional language abstractions such as lexical closures and second-order functions, e.g. array mappers. Existing computational libraries can be interfaced and invoked from Scientific SPARQL queries as foreign functions. Cost estimates and alternative evaluation directions may be specified, aiding the construction of better execution plans. Costly array processing, e.g. filtering and aggregation, is thus performed on the server, reducing the amount of communication. Furthermore, common supported operations are delegated to the array storage back-ends, according to their capabilities. Both expressivity and performance of Scientific SPARQL are evaluated on a real-world example, and further performance tests are run using our mini-benchmark for array queries.

Keywords: RDF, SPARQL, Arrays, Query optimization, Second-order functions, Scientific workflows

Andrej Andrejev, Department of Information Technology, Division of Computer Systems, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.
© Andrej Andrejev 2016
ISSN 1104-2516
ISBN 978-91-554-9465-0
urn:nbn:se:uu:diva-274856 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-274856)

Contents

1 Introduction
2 Background and Related Work
2.1 Semantic Web
2.2 RDF Repositories
2.2.1 SPARQL endpoints and Linked Data
2.2.2 SPARQL extensions
2.2.3 Storing RDF graphs
2.3 Exposing Non-RDF Data as RDF
2.3.1 Relational data to RDF
2.3.2 Objects to RDF
2.3.3 XML to RDF
2.3.4 Spreadsheets to RDF
2.3.5 Multidimensional data in RDF
2.4 Array Models
2.5 Array Databases
2.6 The Amos II System
3 SPARQL Language Overview
3.1 Example Dataset
3.1.1 Turtle Syntax
3.2 Graph Patterns
3.3 Combining the Graph Patterns
3.3.1 Optional Graph Patterns
3.3.2 Matching Alternatives
3.3.3 Existence Quantifiers and Other Filters
3.3.4 Addressing Multiple Graphs
3.4 Property Path Expressions
3.4.1 Precedence of Path Operators
3.4.2 Algebraic Properties of Path Operators
3.5 Aggregation and Grouping
3.6 Error Handling
3.7 Ordering and Segmentation
3.8 Constructing New RDF Graphs
3.9 Updating the Datasets
4 Scientific SPARQL
4.1 Array Queries
4.1.1 Array Dereference Syntax
4.1.2 Variables Bound to Array Subscripts
4.1.3 Built-in Array Functions
4.1.4 Array Arithmetic
4.1.5 Intra-array Computations
4.1.6 Array Equality
4.2 Parameterized Queries - Functional Views
4.3 Lexical Closures and Second-Order Functions
4.3.1 Array Algebra Second-order Functions
4.4 Foreign Functions
4.5 Calling SciSPARQL from Algorithmic Languages
5 Scientific SPARQL Database Manager
5.1 Architecture overview
5.1.1 Example Dataset
5.1.2 Example Query
5.2 Numeric Multidimensional Arrays
5.2.1 Storage of Resident Arrays
5.2.2 Array Transformations
5.3 Data Loaders
5.3.1 File Links
5.3.2 RDF Collections
5.3.3 Data Cube Vocabulary
5.4 Scientific SPARQL Query Processor
5.4.1 SciSPARQL Query Structure
5.4.2 Compositional vs. Operational SPARQL Semantics
5.4.3 AmosQL Query Structure
5.4.4 Extensions to ObjectLog and Physical Algebra
5.4.5 The Translation Algorithm
5.5 Polymorphic Properties Problem
5.5.1 Directionality Problem
5.5.2 Normalization Problem
6 External Storage of RDF with Arrays
6.1 Array Storage Extensibility Interface