<<

Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2020

A Relational Matrix Algebra and its Implementation in a Column Store

Dolmatova, Oksana ; Augsten, Nikolaus ; Böhlen, Michael Hanspeter

Abstract: Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

DOI: https://doi.org/10.1145/3318464.3389747

Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-200904 Conference or Workshop Item Published Version

Originally published at: Dolmatova, Oksana; Augsten, Nikolaus; Böhlen, Michael Hanspeter (2020). A Relational Matrix Algebra and its Implementation in a Column Store. In: ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14 June 2020 - 19 June 2020, 2573-2587. DOI: https://doi.org/10.1145/3318464.3389747

A Relational Matrix Algebra and its Implementation in a Column Store

Oksana Dolmatova (University of Zürich, Zürich, Switzerland, [email protected]); Nikolaus Augsten (University of Salzburg, Salzburg, Austria, [email protected]); Michael H. Böhlen (University of Zürich, Zürich, Switzerland, [email protected])

ABSTRACT

Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

ACM Reference Format: Oksana Dolmatova, Nikolaus Augsten, and Michael H. Böhlen. 2020. A Relational Matrix Algebra and its Implementation in a Column Store. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 14-19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3318464.3389747

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD'20, June 14-19, 2020, Portland, OR, USA. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-6735-6/20/06...$15.00. https://doi.org/10.1145/3318464.3389747

1 INTRODUCTION

Many data that are stored in relational databases include numerical parts that must be analyzed, for example, sensor data from industrial plants, scientific observations, or point of sales data. The analysis of these data, which are not purely numerical but also include important non-numerical values, demands mixed queries that apply relational and linear algebra operations on the same data.

Dealing with mixed workloads is challenging since the gap between relations and matrices must be bridged. Current relational systems are poorly equipped for this task. Previous attempts to deal with mixed workloads have focused on the implementation level, for example, by introducing ordered data types; by storing matrices in special relations or key-value structures; or by splitting queries into their relational and matrix parts. This paper resolves the gap between relations and matrices.

We propose a principled solution for mixed workloads and introduce the relational matrix algebra (RMA) to support complex data analysis within the relational model. The goal is to (1) solve the integration of relations and linear algebra at the logical level, (2) thus achieve independence from the implementation at the physical level, and (3) prove the feasibility of our model by extending an existing system. We are the first to achieve these goals: Other works focus on facilitating the transition between the relational and the linear algebra model. We eliminate the dichotomy between matrices and relations by seamlessly integrating linear algebra into the relational model. Our implementation of RMA in MonetDB shows the feasibility of our approach.

We define linear operations over relations and systematically process and maintain non-numerical information. We show that the relational model is well-suited for complex data analysis if ordering and contextual information are dealt with properly. RMA is purely based on relations and does not introduce any ordered data structures. Instead, the relevant row order for matrix operations is computed from contextual information in the argument relations. All relational matrix operations return relations with origins.
Origins are constructed from the contextual information (attribute names and non-numerical values) of the input relations and uniquely identify and describe each cell in the result relation.

We extend the syntax of SQL to support relational matrix operations. As an example, consider a relation rating with schema (User, Balto, Heat, Net) that stores users and their ratings for the three films ("Balto", "Heat", and "Net", one column per film). The SQL query

SELECT * FROM INV( rating BY User );

orders the relation by users and computes the inversion of the matrix formed by the values of the ordered numerical columns. The result is a relation with the same schema: The values of attribute User are preserved, and the values of the remaining three attributes are provided by matrix inversion (see Section 5 for details). The origin of a numerical result value is given by the user name in its row and the attribute name of its column.
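To make the intended semantics concrete, the following is a minimal illustrative sketch (ours, not the paper's implementation) that mimics the effect of the query above with pandas and NumPy; the rating instance is the one later shown in Figure 5.

# Illustrative sketch (ours, using pandas/NumPy): the effect of
# SELECT * FROM INV(rating BY User); on the rating relation of Figure 5.
import numpy as np
import pandas as pd

rating = pd.DataFrame({"User": ["Ann", "Tom", "Jan"],
                       "Balto": [1.5, 0.0, 4.0],
                       "Heat": [2.0, 0.0, 1.0],
                       "Net": [0.5, 1.5, 1.0]})

ordered = rating.sort_values("User")                            # BY User fixes the row order
inv = np.linalg.inv(ordered[["Balto", "Heat", "Net"]].to_numpy())  # invert the numeric block
result = pd.DataFrame(inv, columns=["Balto", "Heat", "Net"])
result.insert(0, "User", ordered["User"].to_numpy())            # origins: user names are kept
print(result)                                                   # same schema (User, Balto, Heat, Net)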
At the system level, we have integrated our solution into MonetDB. Specifically, we extended the kernel with relational matrix operations implemented over binary association tables (BATs). The physical implementation of matrix operations is flexible and may be transparently delegated to specialized libraries that leverage the underlying hardware (e.g., MKL [2] for CPUs or cuBLAS [1] for GPUs). The new functionality is introduced without changing the main data structures and the processing pipeline of MonetDB, and without affecting existing functionality.

Our technical contributions are as follows:

• We propose the relational matrix algebra (RMA), which extends the relational model with matrix operations. This is the first approach to show that the relational model is sufficient to support matrix operations. The new set of operations is closed: All relational matrix operations are performed on relations and result in relations, and no additional data structure is required.

• We show that matrix operations are shape restricted, which allows us to systematically define the results of matrix operations over relations. We define row and column origins, the part of contextual information that describes values in the result relation, and prove that all our operations return relations with origins.

• We implement and evaluate our solution in detail. We show that our solution is feasible and leverages existing data structures and optimizations.

RMA opens new opportunities for advanced data analytics that combine relational and linear algebra functionality, speeds up analytical queries, and triggers the development of new logical and physical optimization techniques.

The paper is organized as follows. Sect. 2 discusses related work. We introduce basics in Sect. 3 and introduce the relational matrix algebra (RMA) in Sect. 4. We show an application example in Sect. 5, discuss important properties of RMA in Sect. 6, and its implementation in MonetDB in Sect. 7. We evaluate our solution in Sect. 8 and conclude in Sect. 9.

2 RELATED WORK

Relational DBMSs offer simple linear algebra operations, such as the pair-wise addition of attribute values in a relation.¹ Some operations, e.g., matrix multiplication, can be expressed via syntactically complex and slow SQL queries. The set of operations is limited and does not include operations whose results depend on the row order. For instance, there are no SQL solutions for inversion or determinant computation. Complex operations must be programmed as UDFs. Ordonez et al. [26] suggest UDFs for linear regression with a matrix-like result type. The UTL_NLA package [25] for Oracle DBMS offers linear algebra operations defined over UTL_NLA_ARRAY. UDFs provide a technical interface but do not define matrix operations over relations. No systematic approach to maintain contextual information is provided.

Luo et al. [20] extend SimSQL [8], a Hadoop-based relational system, with linear algebra functionality. RasDaMan [6, 22] manages and processes image raster data. Both systems introduce matrices as ordered numeric-only attribute types. Although relations and matrices coexist, operations are defined over different objects. Linear operations are not defined over unordered objects and they do not support contextual information for individual cells of a matrix.

SciQL [33, 34] extends MonetDB [23] with a new data type, ARRAY, as a first-class object. An array is stored as an object on its own. Arrays have a fixed schema: The last attribute stores the data values of a matrix, all other attributes are dimension attributes and must be numeric. Arrays come with a limited set of operations, such as addition, filtering, and aggregation, and they must be converted to relations to perform relational operations. The presence of contextual information and its inheritance are not addressed.

The MADlib library [16] for in-database analytics offers a broad range of linear and statistical operations, defined as either UDFs with C++ implementations or Eigen library calls. Matrix operations require a specific input format: Tables must have one attribute with a row id value and another array-valued attribute for matrix rows. Matrix operations return purely numeric results and cannot be nested.

Hutchison et al. [17] propose LARA, an algebra with tuple-wise operations, attribute-wise operations, and tuple extensions. LARA defines linear and relational algebra operations using the same set of primitives. This is a good basis for inter-algebra optimizations that span linear and relational operations. LARA offers a strong theoretical basis, works out properties of the solution, and allows to store row and column descriptions during the operations. The maintenance of contextual information is not considered for operations that change the number of rows or columns.

¹ Some approaches support multi-dimensional arrays. Since we target linear algebra, we focus on two dimensions and use the term matrix throughout.
LevelHeaded [4, 5], an engine for relational and linear algebra operations, uses a special key-value structure: Each object has keys (dimension attributes) and annotations (value attributes). Dimension and value attributes are stored in a trie and a flat columnar buffer, respectively. Linear operations are available through an extended SQL syntax. Key values guarantee contextual information for rows. However, the trie key structure restricts relational operations: For example, aggregations of keys and join predicates over non-key attributes (i.e., subselects in SQL) are not allowed.

SciDB [31] is a DBMS that is based on arrays. Matrices and relations are implemented as nested arrays. SciDB focuses on efficient array processing and performs linear algebra operations over arrays. SciDB supports element-wise operations and selected linear operations, such as SVD. The system also offers relational algebra operations on arrays but cannot compete with relational DBMSs such as MonetDB in terms of performance. A systematic approach to maintain contextual information is not considered.

Statistical packages, such as R [27] and pandas [21], offer a broad range of linear and relational algebra operations over arrays. Each cell may be associated with descriptive information, but this information is not always inherited as part of operations (e.g., usv). No systematic solution for associating contextual information with numeric results is provided. The most important relational operations are supported, but even basic optimizations (e.g., join ordering) are missing.

The R package RIOT-DB [32] uses MySQL as a backend and translates linear computations to SQL. RIOT-DB addresses the main memory limitations of R. The optimization of constructed SQL statements yields inter-operation optimization. However, it is difficult (or sometimes impossible) to express linear algebra operations in SQL, and only a few simple operations, such as subtraction and multiplication, are discussed.

AIDA [11] integrates MonetDB and NumPy [3] and exploits the fact that both systems use C arrays as an internal data structure: To avoid copying NumPy data to MonetDB, AIDA passes pointers to arrays. Data copying is still needed to pass MonetDB results to NumPy since MonetDB does not guarantee that multiple columns are contiguous in memory, which is required by NumPy. AIDA offers a Python-like procedural language for relational and linear operations. Sequences of relational operations are evaluated lazily, which allows AIDA to combine and optimize sequences of relational operations. The optimization does not include linear algebra operations.

SystemML [14] offers a set of linear algebra primitives that are expressed in a high-level, declarative language and are implemented on MapReduce. SystemML includes linear algebra optimizations that are similar to relational optimizations (e.g., selecting the order of execution of matrix multiplications). The system considers only linear algebra operations.

3 PRELIMINARIES

This section presents notation for relations and matrices, and introduces the basic matrix algebra operations.

Figure 1 shows relation r with schema (O, V, W) and tuples (A, 30, 1), (C, 22, 5), (B, 10, 1); matrix d with the single column (D, B); matrix e with rows (1, 3) and (2, 4); and the concatenation d □ e with rows (D, 1, 3) and (B, 2, 4). (Figure 1: Relation r; matrices d, e, and d □ e.)

3.1 Relations

A relation r is a set of tuples r_i with schema R. A schema, R = (A, B, . . .), is a finite, ordered set of attribute names. A tuple r_i ∈ r has a value from the appropriate domain for each attribute in the schema. We write r_i.A to denote the value of attribute A in tuple r_i and r.A to denote the set of all values r_i.A in relation r. Ordered subsets of a schema, U ⊆ R, are typeset in bold. |r| is the number of tuples in relation r.

Let r be a relation and U ⊆ R be attributes that form a key of R. We write r^(U,k) to denote the k-th tuple of relation r sorted by the values of attributes U (in ascending order):

    r_i = r^(U,k)  ⟺  r_i ∈ r  ∧  |{r_j | r_j ∈ r ∧ r_j.U < r_i.U}| = k − 1     (1)

The column cast ▽U creates an ordered set L from the sorted values of an attribute U that forms a key in relation r:

    L = ▽U  ⟺  |L| = |r|  ∧  ∀ 1 ≤ i ≤ |r| (L[i] = r^(U,i).U)     (2)

The column cast is used to generate a schema from a set of values. We use this for operations tra, usv, and opd (see Table 2). The column cast is applicable if the cardinality of a list of attributes U is one.

Example 3.1. Consider relation r in Figure 1. The third tuple of relation r sorted by the values of attribute V is r^((V),3) = (A, 30, 1), the column cast of O is ▽O = (A, B, C), and the values of attribute W are r.W = {1, 5, 1}.

We use set notation and apply it to bags. Bags can be ordered or unordered. To emphasize the difference, parentheses are used for ordered bags (or lists), e.g., (3,2,3), and curly braces for unordered bags, e.g., {3,2,3}. When transitioning from unordered to ordered bags, the order is specified explicitly.
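To make the sorted-tuple access r^(U,k) of Equation (1) and the column cast ▽U of Equation (2) concrete, here is a small illustrative sketch (ours, not part of the paper) in plain Python over the example relation r of Figure 1; the helper names are ours.

# Illustrative sketch (ours): r^(U,k) and the column cast ▽U from Section 3.1,
# evaluated over the example relation r of Figure 1 with schema (O, V, W).
r = [("A", 30, 1), ("C", 22, 5), ("B", 10, 1)]

def sorted_tuple(rel, key_index, k):
    """r^(U,k): the k-th tuple (1-based) of rel sorted by the key attribute."""
    return sorted(rel, key=lambda t: t[key_index])[k - 1]

def column_cast(rel, key_index):
    """▽U: the ordered list of values of the key attribute U."""
    return [t[key_index] for t in sorted(rel, key=lambda t: t[key_index])]

print(sorted_tuple(r, 1, 3))   # r^((V),3) = ('A', 30, 1)
print(column_cast(r, 0))       # ▽O = ['A', 'B', 'C']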
3.2 Matrices

An n × k matrix m is a two-dimensional array with n rows and k columns. |m| is the number of rows, #m the number of columns. The element in the i-th row and the j-th column of matrix m is m[i, j]; the i-th row is m[i, ∗]; the j-th column is m[∗, j].

We consider the operations from the R Matrix Algebra [28]: element-wise multiplication (EMU), matrix multiplication (MMU), outer product (OPD), cross product (CPD), matrix addition (ADD), matrix subtraction (SUB), transpose (TRA), solve equation (SOL), inversion (INV), eigenvectors (EVC), eigenvalues (EVL), QR decomposition (QQR, RQR), SVD – singular value decomposition (DSV, USV, VSV), determinant (DET), rank (RNK), and Choleski factorization (CHF). Note that QR and SVD return more than one matrix, therefore we split the operations: QQR and RQR return matrix Q and matrix R of the QR decomposition, respectively; DSV, USV, and VSV return vector D with the singular values, matrix U with the left singular vectors, and matrix V with the right singular vectors of SVD, respectively.

The matrix concatenation of matrices m and n with k rows each returns a matrix h with k rows. The i-th row of h is the concatenation of the i-th row of m and the i-th row of n:

    h = m □ n  ⟺  |h| = |m|  ∧  ∀ 1 ≤ i ≤ |h| (h[i, ∗] = m[i, ∗] ◦ n[i, ∗])     (3)

The schema cast ΔU of attributes U creates a matrix m (with a single column) from the attribute names of U:

    m = ΔU  ⟺  #m = 1  ∧  |m| = |U|  ∧  ∀ 1 ≤ i ≤ |U| (m[i, 1] = U[i])     (4)

Example 3.2. Consider attributes U = (D, B). Matrix d in Figure 1 is the result of the schema cast d = ΔU. The result of concatenating matrix d and matrix e is d □ e. Note that the row and column numbers (cells shaded in gray) in the matrix illustrations are not part of the matrix.

Matrix operations are shape restricted, i.e., the number of result rows is equal to the number of rows of one of the input matrices (r), the number of columns of one of the input matrices (c), or one (1). The same holds for the number of result columns.

The dimensionality of result matrices defines the shape type of matrix operations. We write r1 if the result dimensionality is equal to the number of rows in the first matrix, r2 if the result dimensionality is equal to the number of rows in the second matrix, and r∗ if the result dimensionality is equal to the number of rows in the first and second matrix (i.e., r1 = r2). The same notation holds for the number of columns. Table 1 summarizes the shape types of matrix operations.

Example 3.3. Matrix multiplication has shape type (r1,c2), which states that the number of result rows is equal to the number of rows of the first argument matrix, and the number of columns is equal to the number of columns of the second argument matrix. Matrix addition has shape type (r∗,c∗), which states that the number of result rows is equal to the number of rows of the first matrix and the number of rows of the second matrix.

Table 1: Shape types of matrix operations

Cardinalities                          Shape type    Operations
|i1 × j1| → |i1 × i1|                  (r1,r1)       USV
|i1 × j1|, |i2 × j1| → |i1 × i2|       (r1,r2)       OPD
|i1 × i1| → |i1 × i1|                  (r1,c1)       INV, EVC, CHF
|i1 × j1| → |i1 × j1|                  (r1,c1)       QQR
|i1 × j1|, |j1 × j2| → |i1 × j2|       (r1,c2)       MMU
|i1 × i1| → |i1 × 1|                   (r1,1)        EVL
|i1 × j1| → |i1 × 1|                   (r1,1)        VSV
|i1 × j1| → |j1 × i1|                  (c1,r1)       TRA
|i1 × j1| → |j1 × j1|                  (c1,c1)       RQR, DSV
|i1 × j1|, |i1 × j2| → |j1 × j2|       (c1,c2)       CPD
|i1 × j1|, |i1 × 1| → |j1 × 1|         (c1,c2)       SOL
|i1 × j1|, |i1 × j1| → |i1 × j1|       (r∗,c∗)       EMU, ADD, SUB
|i1 × i1| → |1 × 1|                    (1,1)         DET
|i1 × j1| → |1 × 1|                    (1,1)         RNK

All operations of the matrix algebra are shape restricted. This follows directly from the definitions of the matrix operations [15]. The first column of Table 1 lists the relevant cardinalities from these definitions. We use shape restriction to determine the inheritance of contextual information. Shape restriction has also been used in size propagation techniques [7] for the purpose of cost-based optimization of chains of matrix operations.

4 RELATIONAL MATRIX ALGEBRA

To seamlessly integrate matrix operations into the relational model, we extend the relational algebra to the relational matrix algebra (RMA). For each of the matrix operations we define a corresponding relational matrix operation in RMA: emu, mmu, opd, cpd, add, sub, tra, sol, inv, evc, evl, qqr, rqr, dsv, usv, vsv, det, rnk, chf. We use upper case for matrix operations (e.g., TRA) and lower case for RMA operations (e.g., tra). RMA includes both relational algebra and relational matrix operations. The new operations behave like regular operations with relations as input and output.

For each argument relation, r, of a relational matrix operation one parameter must be specified: The order schema U ⊆ R imposes an order on the tuples for the purpose of the operation. The attributes of the order schema must form a key.² The attributes of relation r that are not part of the order schema U, i.e., Ū = R − U, form the application schema. The application schema identifies the attributes with the data to which the matrix operation is applied.

² Attributes that neither belong to the order schema nor the application schema must be dropped explicitly with a projection (or added to the order schema, thus forming a super key).

The order schema U ⊆ R splits relation r into four non-overlapping areas: Order schema U; order part r.U; application schema Ū; and application part r.Ū. The parts of r that do not include matrix values, i.e., the order and application schemas (U and Ū) and the order part (r.U), form the contextual information for application part r.Ū. Intuitively, the order schema and application schema provide context for columns while the order part provides context for rows.

Figure 2 shows the structure of a relation instance: Relation r has schema (T, H, W) and tuples (5am, 1, 3), (8am, 8, 5), (7am, 6, 7), (6am, 1, 4); attribute T forms the order schema with order part r.(T), and (H, W) form the application schema with application part r.(H, W). (Figure 2: Structure of a relation instance.)

Example 4.1. Order schema U = (T) splits relation r in Figure 2 into four parts: Order schema U = (T), application schema Ū = (H, W), order part r.U = r.(T) = {5am, 8am, 7am, 6am}, and application part r.Ū = r.(H, W) = {(1, 3), (8, 5), (6, 7), (1, 4)}.

4.1 Matrix and Relation Constructors

Figure 3 summarizes our approach for the inv operation and example relation r′ = σ_{T>6am}(r): (1) Two matrix constructors define matrices m and n that correspond to order and application part of r′, respectively; (2) INV inverts matrix n, resulting in matrix h; and (3) the relation constructor combines m □ h and R into result relation v.

Definition 4.2. (Matrix constructor) Let r be a relation, U be an order schema. The matrix constructor μ_U(r) returns a matrix that includes the values of r.U sorted by U:

    m = μ_U(r)  ⟺  |m| = |r|  ∧  ∀ 1 ≤ i ≤ |r| (m[i, ∗] = r^(U,i).U)

We use the complement notation μ̄_U(r) to denote the matrix that includes the values of r.Ū sorted by U.

Example 4.3. Consider Figure 3 with relation instance r′ = σ_{T>6am}(r) and schema R = (T, H, W). The matrix constructor μ̄_T(r′) returns matrix n.

Definition 4.4. (Relation constructor) Let m be a matrix with unique rows, and R be a relation schema with #m attributes. The relation constructor γ(m, R) returns relation r with schema R:

    r = γ(m, R)  ⟺  |m| = |r|  ∧  ∀ t (t ∈ r ⟺ ∃ 1 ≤ i ≤ |m| (t = m[i, ∗]))

Example 4.5. In Figure 3, a relation constructor is applied to schema R and the concatenated matrices m □ h to construct the result relation: v = γ(m □ h, R).

Matrix and relation constructors map between relations and matrices. We use constructors and matrices to define relational matrix operations and to analyze their properties. At the implementation level, constructors are very efficient since they split and combine lists of attribute names and do not access the data (cf. Section 7).

4.2 Relational Matrix Operations

Relational matrix operations offer the functionality of matrix operations in a relational context. The general form of a unary relational matrix operation is op_U(r), where U is the order schema. A binary operation op_{U;V}(r, s) has an additional order schema V for argument relation s.

The result of a relational matrix operation is a relation that consists of (a) the base result of the corresponding matrix operation, and (b) contextual information, appropriately morphed from the contextual information of the argument relations to reflect the semantics of the operation.

Definition 4.6. (Base result) Consider a unary relational matrix operation op_U(r). The matrix that is the result of matrix operation OP(μ̄_U(r)) is the base result of op_U(r). The base result for binary operations is defined analogously.

Example 4.7. Consider inv_T(σ_{T>6am}(r)) in Fig. 3. The base result is matrix h, which results from INV(μ̄_T(σ_{T>6am}(r))).

Table 2 defines the details of how contextual information is maintained in relational matrix operations. All definitions follow the structure illustrated in Figure 3. A result relation is composed from order parts, base result, and schemas with the help of a relation constructor. For example, inv is defined according to its shape type in Table 2: inv_U(r) = γ(μ_U(r) □ INV(μ̄_U(r)), U ◦ Ū), where μ_U(r) are the rows of the order part, INV(μ̄_U(r)) is the base result, and U ◦ Ū is the result schema.

Operations that have a different number of rows than any of the input relations add a new attribute C to the result relation. This attribute C is for contextual information (cf. Example 4.8): Its values are either the attribute names of the application schema of an input relation or the operation name. The operations add, sub, emu require union compatible application schemas and non-overlapping order schemas. Operations usv, opd, and tra construct the application schema of the result from the order schema of an input relation. Therefore the cardinality of the order schemas U of tra and usv, and V of opd must be one.

Figure 3 (structure of our solution for the inversion example, v = inv_T(σ_{T>6am}(r))): the selection yields r′ = {(8am, 8, 5), (7am, 6, 7)}; m = μ_T(r′) holds the sorted order part (7am, 8am); n = μ̄_T(r′) holds the application part; h = INV(n); and v = γ(m □ h, T ◦ T̄) = {(7am, −0.19, 0.27), (8am, 0.31, −0.23)}.

Table 2: Splitting and morphing relations and matrices

Shape type    Operations               Definition
(r1,r1)       usv                      op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ ▽U)
(r1,r2)       opd                      op_{U;V}(r, s) = γ(μ_U(r) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ ▽V)
(r1,c1)       inv, evc, chf, qqr       op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ Ū)
(r1,c2)       mmu                      op_{U;V}(r, s) = γ(μ_U(r) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ V̄)
(r1,1)        evl, vsv                 op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ (op))
(c1,r1)       tra                      op_U(r) = γ(ΔŪ □ OP(μ̄_U(r)), (C) ◦ ▽U)
(c1,c1)       rqr, dsv                 op_U(r) = γ(ΔŪ □ OP(μ̄_U(r)), (C) ◦ Ū)
(c1,c2)       cpd, sol                 op_{U;V}(r, s) = γ(ΔŪ □ OP(μ̄_U(r), μ̄_V(s)), (C) ◦ V̄)
(r∗,c∗)       emu, add, sub            op_{U;V}(r, s) = γ(μ_U(r) □ μ_V(s) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ V ◦ Ū)
(1,1)         det, rnk                 op_U(r) = γ(r ◦ OP(μ̄_U(r)), (C, op))

Example 4.8. Consider Figures 2 and 4. Figure 4a illustrates the result of qqr_T(r). The values of T define the ordering of tuples for this operation. The values of H and W are the values of matrix Q computed as part of the QR decomposition. Figure 4b illustrates the result of tra_T(r). The column cast ▽T of ordering attribute T provides names for the attributes in the transposed relation. The result relation has a new attribute C whose values are the names of the attributes in the application schema of r. Note that all result relations come with sufficient contextual information for each value. For example, relation r in Figure 2 records that Humidity (H) was 1 at 6am, which is also recorded in the transposed relation in Figure 4b.
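As a small illustrative aside (our sketch in plain Python, not the paper's code), the tra_T(r) result of Example 4.8 can be reproduced directly from the definitions: the application-schema names become values of the new attribute C, and the column cast ▽T names the result columns.

# Illustrative sketch (ours): tra_T(r) for the relation of Figure 2.
r = [("5am", 1, 3), ("8am", 8, 5), ("7am", 6, 7), ("6am", 1, 4)]  # schema (T, H, W)
app_schema = ("H", "W")

ordered = sorted(r, key=lambda t: t[0])          # order imposed by T
col_names = ["C"] + [t[0] for t in ordered]      # (C) ◦ ▽T
rows = []
for j, name in enumerate(app_schema, start=1):   # one result tuple per attribute H, W
    rows.append((name,) + tuple(t[j] for t in ordered))

print(col_names)   # ['C', '5am', '6am', '7am', '8am']
print(rows)        # [('H', 1, 1, 6, 8), ('W', 3, 4, 7, 5)]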

Figure 4 illustrates two relational matrix operations on relation r from Figure 2: (a) QR decomposition qqr_T(r) with schema (T, H, W) and tuples (5am, 0.1, 0.5), (6am, 0.8, −0.4), (7am, 0.6, 0.4), (8am, 0.1, 0.7); (b) transpose tra_T(r) with schema (C, 5am, 6am, 7am, 8am) and tuples (H, 1, 1, 6, 8), (W, 3, 4, 7, 5). (Figure 4: Examples of relational matrix operations.)

Figure 5 shows the example database. Relation u (user) with schema (User, State, YoB) contains u1 = (Ann, CA, 1980), u2 = (Tom, FL, 1965), u3 = (Jan, CA, 1970). Relation f (film) with schema (Title, RelY, Director) contains f1 = (Heat, 1995, Lee), f2 = (Balto, 1995, Lee), f3 = (Net, 1995, Smith). Relation r (rating) with schema (User, Balto, Heat, Net) contains r1 = (Ann, 1.5, 2.0, 0.5), r2 = (Tom, 0.0, 0.0, 1.5), r3 = (Jan, 4.0, 1.0, 1.0). (Figure 5: Example database.)

The application example maintains all data in regular relations and illustrates the importance of maintaining contextual information. Consider relations u, f, and r in Figure 5. Relation u records name, state and year of birth of users; relation f records title, release year and director of films; relation r records user ratings for films. Tuple u1 states that user Ann lives in California and was born in 1980; tuple f1 states that film Heat was directed by Lee and was released in 1995; tuple r1 states that Ann's ratings for Balto, Heat, and Net are, respectively, 1.5, 2.0, and 0.5.

5 RMA IN ACTION The task is to determine how similar each of Lee’s films This section gives an application example with a mixed work- is to any other film, based on the ratings from California load that combines relational and linear algebra operations. users. The covariance [19] is used to compute this similarity. In addition, we need relational algebra operations (e.g., se- w4 w3 w8 Z C Ann Jan lection 휎, aggregation 휗, rename 휌, and join ) to retrieve U B H N T B H N B -1.25 1.25 selected ratings and films, aggregate ratings, and combine Ann -1.25 0.5 -0.25 푧1 B 1.56 -0.62 -2.5 H 0.5 -0.5 information from different tables. The key observation is Jan 1.25 -0.5 0.25 푧2 H -0.62 0.25 1 that a mixture of matrix and relational operations is required N -0.25 0.25 to determine the similarities of the ratings. The solution in Figure 6 includes three key steps: Data Figure 7: Steps during the computation preparation (푤1), covariance computation (푤2-푤7), and re- trieving Lee’s films together with all similarities푤 ( 8). Note result of a relational matrix operation can be reduced to the the seamless integration of linear and relational algebra3: result of the corresponding matrix operation. Origins guar- The entire process frequently switches between linear and antee that each result relation includes sufficient inherited relational operations. contextual information to relate argument and result relation. We prove that each relational matrix operations is matrix consistent and returns a relation with origins. 푤1 = 휋푈 ,퐵,퐻,푁 (휎푆=’CA’ (푢 Z 푟)) 푤2 = 휗퐴푉퐺 (퐵),퐴푉퐺 (퐻),퐴푉퐺 (푁 ) (푤1) 6.1 Matrix Consistency = 푤3 휋푈 ,퐵,퐻,푁 (sub푈 ;푉 (푤1, 휌푉 (휋푈 (푤1)) × 푤2)) Matrix consistency ensures that the result relation includes = all cell values that are present in the base result and the order 푤4 tra푈 (푤3) 푤5 = mmu (푤4,푤3) of rows in the base result can be derived from contextual in- 퐶;푈 formation in the result relations. First, we define reducibility = 푤6 푤5 × 휌푀 (휗퐶푂푈 푁푇 (∗) (푤1)) to transition from relations to matrices. 푤7 = 휋 /( − ) /( − ) /( − ) (푤6) 퐶,퐵 푀 1 ,퐻 푀 1 ,푁 푀 1 Definition 6.1. (Reducibility) Let 푟 be a relation, U be an = Z 푤8 휋푇,퐵,퐻,푁 (휎퐷=’Lee’ (푤7 퐶=푇 푓 )) order schema. Relation 푟 is reducible to matrix 푚 iff 푚 can be constructed from the attribute values of U in relation 푟 Figure 6: Computing the similarity of the ratings sorted by U: 푟 →U 푚 ⇐⇒ 휇U (푟) = 푚
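As a cross-check of the workflow in Figure 6 (our illustrative sketch, not part of the paper), the covariance pipeline can be mirrored with NumPy on the example data of Figure 5, restricted to the California users Ann and Jan; w2, w3, w4, and w7 below correspond to the intermediate relations discussed in the text, using the 1/(n−1) definition of covariance cited there.

# Illustrative sketch (ours): the covariance computation of Figure 6 on the
# Figure 5 data, restricted to the California users Ann and Jan.
import numpy as np

w1 = np.array([[1.5, 2.0, 0.5],    # Ann's ratings for Balto, Heat, Net
               [4.0, 1.0, 1.0]])   # Jan's ratings
w2 = w1.mean(axis=0)               # AVG(B), AVG(H), AVG(N)
w3 = w1 - w2                       # sub: centered ratings
w4 = w3.T                          # tra: films x users
w7 = (w4 @ w3) / (len(w1) - 1)     # mmu and division by (M-1)
print(w3)                          # [[-1.25  0.5  -0.25], [ 1.25 -0.5   0.25]]
print(w7)                          # 3x3 covariance matrix between Balto, Heat, Net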

In the following we discuss the algebra expressions in ′ Example 6.2. Consider Fig. 3 with relation 푟 = 휎푇 >6푎푚 (푟), Figure 6. First, we join 푢 and 푟 to select ratings from Cali- matrix 푛, and order schema 푇 . From Example 4.3 we have fornia users (푤1). Next, we compute the covariance using 휇 (푟 ′) = 푛. Relation 푟 ′ is reducible to matrix 푛 since 푛 can = 1 푇 its standard definition19 [ ]: 푐표푣 (푋, 푌) 푛−1 [(푋 − 퐸[푋]) ∗ T be constructed from the values of 퐻 and 푊 in the argument ( − [ ]) ] ( ) = ′ 푌 퐸 푌 . The expectation of an attribute, e.g., 퐸 퐻 relation sorted by 푇 , i.e., 푟 →푇 푛. 휗퐴푉퐺 (퐻) (...), is computed via aggregation (푤2). Relational matrix operations, sub, tra and mmu, are used to subtract (푋 − Definition 6.3. (Matrix consistency) Consider a unary ma- 퐸[푋]), transpose (T), and multiply (∗) relations (푤3,푤4,푤5). trix operation OP(푚). The corresponding relational matrix Next, we compute the unbiased covarinace (푤6,푤7). Finally, operation op is matrix consistent iff for all relations 푟 that are we join 푤7 and 푓 to select Lee’s films. reducible to matrix 푚, the result relation opU (푟) is reducible to OP(푚): Figure 7 illustrates relations 푤3, 푤4, and 푤8. Consider ′ ∀푟,푚, U(푟 →U 푚 =⇒ ∃U (opU (푟) →U′ OP(푚))) transpose tra푈 (푤3) with order schema 푈 and application schema 푈 = (퐵, 퐻, 푁 ). The result of this operation is a re- A binary relational matrix operation is matrix consistent lation 푤4 with schema (퐶,퐴푛푛,퐽푎푛). The values of attribute if its result is reducible to the result of the corresponding 퐶 are the attribute names in the application schema of 푤3. binary matrix operation. Note that each operation preserves schema and ordering Example 6.4. Consider Figures 2 and 8 with relation 푟, information as the crucial parts of contextual information. matrix 푔, matrix RQR(푔) and relation rqr (푟). This makes it possible to interpret the tuples in result rela- 푇 • 푟 →푇 푔: Relation 푟 is reducible to matrix 푔 tion 푤8. For example, tuple 푧1 states that Lee’s film Balto has • rqr ( ) → RQR( ) rqr ( ) the smallest covariance to film Net. 푇 푟 퐶 푔 : relation 푇 푟 is reducible to matrix RQR(푔) 6 PROPERTIES OF RMA 6.2 Origins of Result Relations This section defines two crucial requirements for relational The result of a relational matrix operation is a relation that, in matrix operations. Matrix consistency guarantees that the addition to the base result, includes a row origin and a column 3We use the first character of the attribute name to refer to attributes. origin. Origins (1) uniquely define the relative positioning of 푔 Column origins (푐표) are marked by rectangles (all values 1 2 RQR(푔) rqr푇 (푟) inside a rectangle form together the column origin for the 1 1 3 1 2 C H W relation). Row origins (푟표) are marked by ellipses. 2 1 4 1 -10.1 -8.8 H -10.1 -8.8 3 6 7 2 0.0 -4.6 W 0.0 -4.6 = = 푝2 usv푇 (푟) 4 8 5 푝1 rnk퐻 (휋퐻,푊 (푟)) C rnk T 5am 6am 7am 8am Figure 8: Example of matrix consistency 푟 1 5am -0.2 0.5 -0.8 0.4 푟표 = 푟 = (r) 6am -0.3 0.6 0.6 0.4 푐표 = op = (rnk) 7am -0.7 0.2 0.0 -0.7 result values, (2) give a meaning to values with respect to the 8am -0.7 -0.6 0.0 0.4 applied operation, and (3) establish a connection between 푟표 = 푟 .U = (5am, 6am, 7am, 8am) argument relations of an operation and its result relation. 푟 푐표 = ▽U = (5am, 6am, 7am, 8am) T H W 푝3 = qqr (푟) Example 6.5. Consider inversion and result relation 푣 in 5am 1 3 푊 ,푇 W T H Figure 3. 
Values 7푎푚 and 8푎푚 show that (1) value -0.19 pre- 8am 8 5 3 5am 0.1 cedes value 0.31 because 7푎푚 precedes 8푎푚; (2) -0.19 is the 7am 6 7 4 6am 0.1 inversion value for humidity and for time 7푎푚; (3) value -0.19 6am 1 4 5 8am 0.7 in relation 푣 is connected to value 6 in the argument relation 7 7am 0.5 since they have the same origins (7푎푚 and H). 푟표 = 푟 .U = ((3,5am), (4,6am), (5,8am), (7,7am)) U Origins are either inherited order or application schemas 푐표 = = (H) from argument relations, or constants. The shape type of an Figure 9: Examples of origins operation determines the cardinality of inherited contextual information. The indices in the shape type specify the input = U = U = U′ = For 푝2 usv푇 (푟), we have (푇 ), (퐻,푊 ), (푇 ), relation, from which an origin is inherited. For example, if ′ and U = (5am, 6am, 7am, 8am). The shape type of usv is (r1, the first element of the shape type isc1, the row origin is the r1) (see Table 2) this makes 푝2.푇 = 푟.푇 a row origin and ▽푇 schema cast of the application schema of the first argument = (5am, 6am, 7am, 8am) a column origin. relation. Indices ∗ and 2 apply only to binary operations.

Definition 6.6. (Origins) Consider a unary, 푣 = opU (푟), 6.3 Correctness = Theorem 6.8. or binary, 푣 opU;V (푟, 푠), matrix consistent operation with All relational matrix operations return ma- shape type (푥,푦), base result 푚, and attribute list U′ such that trix consistent relations with a row and column origin. ′ ′ 푣 →U′ 푚. Consider Table 3. 푣.U is a row origin iff 푣.U is We provide the proof for the theorem in our technical equal to ro for the given shape type 푥. U′ is a column origin report [10]. iff U′ is equal to co for the given shape type 푦. The following example uses a sequence of tra operations to illustrate the importance of origins. A result relation with Table 3: Definition of origins for shape type푥 ( ,푦) origins inherits sufficient contextual information, such that each value can be interpreted. Origins also carry sufficient x ro y co information about the order of rows, such that in sequences U ▽U r1 푟 . r1 of relational matrix operations no ordering information is V ▽V r2 푠. r2 lost between operations. c1 ΔU c1 U c2 ΔV c2 V Example 6.9. Consider a relation instance 푟 that is re- r∗ (푟 .U, 푠.V) r∗ ▽U ducible to matrix 푛. Figure 10 illustrates the matrix expres- ΔU ΔV U c∗ ( , ) c∗ sion TRA(TRA(푛)) and the corresponding relational matrix 1 ′푟′ 1 ′표푝′ expression tra퐶 (tra푇 (푟)). Operation 푇 tra(푟) returns re- lation 푟1, which in addition to the application schema (at- tributes 5푎푚, 6푎푚, 7푎푚, 8푎푚) also includes attribute퐶, which Example 6.7. Figure 9 illustrates relation 푟 and the origins is preserved together with the application schema. for operations rnk • 퐻 (휋퐻,푊 (푟)) with shape type (1,1) 7 IMPLEMENTATION usv • 푇 (푟) with shape type (r1,r1) We discuss the integration of our solution into MonetDB. • qqr푊 ,푇 (푟) with shape type (r1,c1) The implementation of relational matrix operations includes 푟 푛 푋 ↓푌 returns BAT 푋, whose OIDs have the same order as T H W 1 2 OIDs of BAT 푌 . 푋 ↓푋 denotes 푋 sorted by its own values. 5am 1 3 1 1 3 8am 8 5 푟 →푇 푛 2 1 4 7.2 RMA Integration 7am 6 7 3 6 7 As a first step, we have extended the SQL parser to make the 6am 1 4 4 8 5 relational matrix operations available in the FROM clause tra (r BY U) 푇 (푟) TRA(푛) of SQL [9]. The syntax specifies ordering for an argument relation 푟. As an example, consider relations 푟 and 푟1 푛1 푠 and ordering attributes U and V. The unary operation invU C 5am 6am 7am 8am 1 2 3 4 and the binary operation mmuU;V are expressed as: H 1 1 6 8 푟1 →퐶 푛1 1 1 1 6 8 W 3 4 7 5 2 3 4 7 5 SELECT * FROM INV(r BY U); SELECT * FROM MMU(r BY U, s BY V); tra (푟1) TRA(푛1) 퐶 These basic constructs can be composed to build more 푟2 푛2 complex expressions. For instance, folding 푤5, 푤6 and 푤7 C H W 1 2 from Figure 6 yields the RMA expression: 5am 1 3 1 1 3 휋퐶,퐵/(푀−1),퐻/(푀−1),푁 /(푀−1) ( 6am 1 4 푟2 →퐶 푛2 2 1 4 7am 6 7 3 6 7 mmu퐶;푈 (푤4,푤3) × 휌푀 (휗퐶푂푈 푁푇 (∗) (푤1))) 8am 8 5 4 8 5 The SQL translation of this expression is: Figure 10: Origins and matrix consistency SELECT C, B/(M-1), H/(M-1), N/(M-1) FROM MMU(w4 BY C, w3 BY U) AS w5 CROSSJOIN ( SELECTCOUNT (*) AS M FROM w1 ) AS t; the processing of contextual information and the computa- tion of the base result. Contextual information is handled Algorithm 1 processes a node that represents a unary re- inside MonetDB, while the computation of the base result lational matrix operation opU (푟) and translates it to a list of can be done in MonetDB or delegated to external libraries BAT expressions. In lines 2 - 7, the BATs of relation 푟 are (e.g., MKL). 
The integration of each relational matrix oper- split, sorted, and morphed to get BATs 푋 with row origins ation requires extensions throughout the system, but does and BATs 푌 with the application part. Splitting (lines 2 and not change the query processing pipeline and no new data 4) divides a relation into two parts: The application part, on structures are introduced. To extend MonetDB with addition, which the matrix operations are performed, and the contex- QR decomposition, linear regression, and the transformation tual information, which gives a meaning to the application of numerical data to the MKL format we touch 20 (out of part. BATs 퐵 are split into application part and order part 4500) files and add 2500 lines of code. according to U. Sorting (lines 3 and 4) determines the order of the tuples for a specific matrix operation. The order schema 7.1 MonetDB U is used to sort the BATs: BATs in U are sorted according to their values while the other BATs in 퐵 are sorted according MonetDB stores each column of a table as a binary associa- to the OIDs of the BATs in U. The order is established for tion table (BAT). A BAT is a table with two columns: Head each operation based on the contextual values in the relation. and tail. The head is a column with object identifiers (OID), Morphing (lines 5-7) morphs contextual information so that it while the tail is a column with attribute values. All attribute can be added to the base result. Finally, the matrix operation values of a tuple in a relation have the same OID value. Thus, is applied to 푌 (line 8). Merging (line 9) combines the result of a tuple can be constructed by concatenating all tail values a matrix operation with relevant contextual information and with the same OID. MonetDB operations manipulate BATs, constructs the result relation with row and column origins. and relational operations are represented and executed as Merging and splitting are efficient operations that work at sequences of BAT operations. Example BAT operations are the schema level and do not access the data. 퐵1 ∗ 퐵2, 퐵1/퐵2, and 퐵1 − 퐵2 for element-wise multiplication, division, and subtraction, and 푠푢푚(퐵) to sum the values in Example 7.1. Figure 11 illustrates Algorithm 1 for 푣 = = BAT 퐵. inv푇 (휎푇 >6푎푚 (푟)). Splitting: input list 퐵 (푇, 퐻,푊 ) is split One important BAT operation is leftfetchjoin (↓), which into order list 퐷 = (푇 ) and application list 퐵 \ 퐷 = (퐻,푊 ). returns a BAT with OIDs sorted according to the order of Sorting: BAT 푇 is sorted, producing 퐺. Then, (퐻,푊 ) are OIDs of another BAT from the same relation. For instance, sorted according to 퐺 returning (퐻 ↓ 푇,푊 ↓ 푇 ). Morphing: input splitting sorting morphing eval merging

Figure 11: Splitting, sorting, morphing, merging for query v = inv_T(σ_{T>6am}(r)). The input σ_{T>6am}(r) = {(8am, 8, 5), (7am, 6, 7)} is split into the order list D = (T) and the application list B \ D = (H, W); sorting yields (T↓T, H↓T, W↓T) with tuples (7am, 6, 7) and (8am, 8, 5); morphing keeps X = T↓T; eval computes F over (H, W); and merging returns Concat(X, F) = {(7am, −0.19, 0.27), (8am, 0.31, −0.23)}.
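For intuition, the following is a small column-at-a-time sketch (ours, with plain Python lists standing in for BATs; not the MonetDB code) of the splitting, sorting, morphing, and merging steps shown in Figure 11.

# Illustrative sketch (ours): the pipeline of Figure 11 for v = inv_T(σ_{T>6am}(r)).
import numpy as np

# Column-wise input after the selection σ_{T>6am}(r)
T, H, W = ["8am", "7am"], [8, 6], [5, 7]

# Splitting: order list D = (T), application list B \ D = (H, W)
# Sorting: sort T, then align H and W with the sorted order (leftfetchjoin)
perm = sorted(range(len(T)), key=lambda i: T[i])
G = [T[i] for i in perm]                      # T sorted by its own values
Y = [[H[i] for i in perm], [W[i] for i in perm]]

# Eval: compute the base result on the application columns
F = np.linalg.inv(np.array(Y, float).T)       # rows follow the order of G

# Morphing/Merging: shape type (r1,c1) keeps the order column as row context
v = [(G[i],) + tuple(F[i]) for i in range(len(G))]
print(v)   # [('7am', -0.19..., 0.27...), ('8am', 0.31..., -0.23...)]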

Algorithm 1: UnaryRMA(op, U, r)
1  B ← BATs(r); Y ← {};
2  D ← {b | b ∈ B, b is in order schema U};
3  G ← sort(D);
4  for b ∈ B \ D do Y ← Y ∪ b↓G;
5  if ShapeType(op) ∈ {(r,r), (r,c), (r,1)} then X ← G;
6  else if ShapeType(op) ∈ {(c,r), (c,c)} then X ← newBAT(Y);
7  else X ← newBAT(r);
8  F ← eval(op, Y);
9  return Concat(X, F);

Algorithm 2: INV(B)
1  n ← B.length; BR ← IDmatrix(n);
2  for i = 1 to n do
3      v1 ← sel(Bi, i); Bi ← Bi / v1; BRi ← BRi / v1;
4      for j = 1 to n do
5          if i ≠ j then
6              v2 ← sel(Bj, i); Bj ← Bj − Bi ∗ v2;
7              BRj ← BRj − BRi ∗ v2;
8  return BR;
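A compact Python rendering of this column-wise Gauss-Jordan scheme (our sketch, operating on lists of NumPy columns rather than BATs) may help to see how Algorithm 2 works with whole columns and touches single elements only for the pivot lookups.

# Illustrative sketch (ours): Gauss-Jordan inversion over a list of columns,
# mirroring Algorithm 2 (B and BR are lists of columns, not BATs).
import numpy as np

def inv_columns(B):
    n = len(B)
    B = [np.array(col, float) for col in B]
    BR = [np.array([1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    for i in range(n):
        v1 = B[i][i]                    # sel(B_i, i): pivot element
        B[i] = B[i] / v1; BR[i] = BR[i] / v1
        for j in range(n):
            if i != j:
                v2 = B[j][i]            # sel(B_j, i)
                B[j] = B[j] - B[i] * v2
                BR[j] = BR[j] - BR[i] * v2
    return BR                           # columns of the inverse matrix

# Columns of the matrix [[6, 7], [8, 5]] from the running example
print(inv_columns([[6.0, 8.0], [7.0, 5.0]]))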

returns the 푖th value in 퐵. With the exception of the 푠푒푙 oper- Since inv is of shape type (r1,c1), row contextual informa- ation, all operations are standard MonetDB BAT operations tion is the order schema: 푋 = 푇 ↓ 푇 . Merging: BATs 푋 are that are also used for relational queries. For example, the concatenated with BATs 퐹 to form the result. operation on 퐵푖 ← 퐵푖 /푣1; divides each element of a BAT by a scalar value. 7.3 Computing the Base Result 8 PERFORMANCE EVALUATION Line 8 of Algorithm 1 calls the procedure that computes the matrix operation. The computation can be done either in Setup. All runtimes are averages over 3 runs on an Intel(R) the kernel of MonetDB or by calling an external library (e.g., Xeon(R) E5-2603 CPU, 1.7 GHz, 12 cores, no hyper-threading, MKL [2]). Calling an external library requires copying data 98 GB RAM (L1: 32+32K, L2:256K, L3:15360K), Debian 4.9.189. from BATs to the external format and copying the result Competitors. We empirically compare the implementa- back. The query optimizer decides about external library tions of our relational matrix algebra (RMA+4) with the sta- calls based on the complexity of the operation, the amount of tistical package R, the array database SciDB, and two state- data to be copied, and the relative performance of the matrix of-the-art in-database solutions, AIDA [11] and MADlib [16]. operation in MonetDB compared to the external library. (1) We implemented RMA+ in MonetDB (v11.29.8) with two The no-copy implementation of matrix operations in the options for matrix operations: (a) BATs (RMA+BAT): No- kernel of MonetDB is performed over BATs directly. Essen- copy implementation in the kernel of MonetDB; (b) MKL tially, standard algorithms must be reduced to BAT opera- (RMA+MKL): Copy BATs to an MKL (v2019.5.281) [2] com- tions. The process of reducing is highly dependent on the patible format (contiguous array of doubles), then copy the operation. The goal is to design algorithms that access entire result back to BATs. We execute linear operations (add, sub, columns and minimize accesses to single elements of BATs. emu) on BATs and use MKL for more complex operations. To achieve this standard, value-based algorithms must be When the matrices do not fit into main memory we switch to transformed to vectorized BAT operations. BATs. Due to the full integration of RMA+, MonetDB takes Algorithm 2 illustrates the reduction for the Gauss Jor- care of core usage and work distribution, and all cores are dan elimination method for the INV computation. The al- used for relational and for matrix operations. (2) SciDB [31] gorithm takes a list of BATs 퐵 = (퐵 , 퐵 , .., 퐵 ) and returns 1 2 푛 uses an array data model, and queries are expressed in the the inversion as a list of BATs 퐵푅 of the same size. Function 퐼퐷푚푎푡푟푖푥 (푛) creates a list of BATs that represents the iden- tity matrix of size 푛 × 푛. The selection operation 푠푒푙 (퐵, 푖) 4The implementation can be found here: https://github.com/oksdolm/RMA high-level, declarative language AQL (Array Query Lan- Figure 12 shows the results. (1) Handling contextual infor- guage) [30]. SciDB uses all available cores. (3) AIDA is a state- mation is efficient and scales to large numbers of attributes. of-the-art solution for the integration of matrix operations (2) The optimized operators that (partially) avoid sorting into a relational database and was shown to outperform other clearly outperform their non-optimized counterparts. solutions like Spark or the pandas library for Python [11]. 
4 4 AIDA executes matrix operations in Python and offers a add add add, relative sorting add, relative sorting Python-like syntax for relational operations, which are then 3 qqr 3 qqr qqr, w/o sorting qqr, w/o sorting translated into SQL and executed in MonetDB (v11.29.3). 2 2

AIDA uses all cores both in MonetDB and in Python. We also 1 integrate our solution into MonetDB, which makes AIDA a Runtime (sec) Runtime (sec) 1 0 particularly interesting competitor. (4) MADlib [16] (v1.10) 200 400 600 800 1,000 20 40 60 80 100 provides a collection of UDFs for PostgreSQL (v9.6) for in- #attributes in order schema #attributes in order schema database matrix and statistical calculations. MADlib does (a) 100K tuples (b) 1M tuples not use multiple cores, which affects its overall performance. (5) The R package (v3.2.3) is highly tuned for matrix opera- Figure 12: Handling contextual information tions and is a representative of a non-database solution. R performs all relational operations with data.tables structures Note that a number of operations (cpd, sol, rqr, dsv, tra, and transforms the relevant columns to matrices to com- det, rnk) do not preserve row context since the number pute the matrix operations. An alternative approach to use of rows changes. Instead, a single column with predefined character matrices for all operations is very inefficient (cf. values (operation name or attribute names of the application Section 8.5). R uses all cores for matrix operations but runs schema) is created, which is negligible in the overall runtime. relational operations on a single core. Data. BIXI [18] stores trips and stations of Montreal’s pub- 8.2 Wide and Sparse Relations lic bicycle sharing system, years 2014-2017. DBLP [24] stores Wide relations. Current databases scale better in the num- authors with their publication counts per conference as well ber of tuples than in the number of attributes. We test our as conference rankings. The synthetic dataset used in the RMA+ implementation in MonetDB on wide relations. We experiment to the effect of sparsity includes values generate relations with 1000 tuples, one order attribute, and between 0 and 5,000,000. All other synthetic datasets include a varying number of application attributes. In Table 4, we in- real-valued numeric attributes with uniformly distributed crease the number of attributes from 1K to 10K and measure values between 0 and 10,000. the runtime of the add operation. MonetDB can handle wide relations with several thousands of attributes, even though the runtime per column increases with the attribute number.

8.1 Maintaining Contextual Information Table 4: add over wide relations in RMA+ A salient feature of our approach is that contextual informa- #attr 1K 2K 3K 4K 5K 6K 7K 8K 9K 10K tion is maintained during matrix operations. We analyze the sec 0.6 2.2 4.8 8.8 13.4 20 27 36 47 62 scalability of maintaining context and study an optimization that avoids sorting. To this end, we generate relations with a single application column and an increasing number of order Sparse relations. We analyze the effect of MonetDB’s built- columns. We compute add and qqr on these relations. Since in compression on relations with many zeros. We add two add and qqr are inexpensive for single column matrices, the relations (5M tuples, one order, 10 application attributes) main cost is the maintenance of the order part. with uniformly distributed non-zero values (range 1-5M). In To handle contextual information we split, sort, morph, Table 5 we increase the percentage of zero values (position and merge lists of BATs (cf. Section 7.2). Sorting is the most of zeros is random) and measure the runtime: The add oper- expensive operation. Fortunately, sorting is not always nec- ation on sparse matrices is up to two times faster than the essary. For example, permuting the input rows for the qqr same operation on dense matrices. Thus, RMA+ leverages operation will affect the order of the result rows, but will MonetDB’s compression features. not change their values. Therefore, sorting is not required. Table 5: add over sparse relations in RMA+ In element-wise operations like add, emu, or sol, only the relative order of the rows in the two input relations matters. % 0 10 20 30 40 50 60 70 80 90 100 Thus, only the order part of the second relation requires sec 1.68 1.60 1.49 1.41 1.33 1.25 1.16 0.99 0.94 0.89 0.76 sorting (to get the same order). 8.3 RMA+ vs. Non-Database Approaches data type for matrix operations and the data.tables storage We study the scalability of RMA+ to large relations and com- structure for relational operations. While data.tables sup- pare to 푅 as a non-database solutions for matrix operations. ports simple linear operations like linear model construction, In Table 6 we measure the runtime for qqr on tables with up the data must be transformed to the matrix type for more to 100M tuples and 70 attributes in the application schema. complex operations like CPD, OPD, or MMU. Matrices cannot For relations up to a size of 50Mx40, RMA+ delegates the store a mix of numerical and non-numerical values, which matrix computation to MKL; the runtime includes copying is required when working with tables; R offers character the data. RMA+ is consistently faster than R since MKL can matrices, but they are very inefficient, e.g., joining trips and better leverage the hardware. R fails for sizes above 50Mx40 stations in the BIXI dataset takes 40 sec for the character since it runs out of memory. In RMA+ we switch to the BAT matrix type and less than 2 sec for data.tables. implementation, which leverages the memory management Figure 13 shows the percentage of time spent for data of MonetDB. The Gram-Schmidt qqr baseline [13] that we transformations on relations with 50 columns and a vary- implemented over BATs is slower than the MKL algorithm ing number of rows (100k to 500k). For R we measure the (e.g., 834 vs. 61.4 sec for 50Mx40), which explains the increase time of transforming the relation from data.table to matrix in runtime. 
RMA+ scales to large relations that do not fit into and back as a percentage of the overall query time, which memory (e.g., relation size 100Mx70 requires 56GB). includes the actual matrix operation. For RMA+ we measure the time share for copying the data from a list of BATs to a Table 6: Runtimes of qqr in seconds in R and RMA+ contiguous, one-dimensional array for MKL, and for copy- ing the result back; the overall runtime in addition includes 10 attr 40 attr 70 attr the matrix computation in MKL (but excludes the MonetDB System R RMA+ R RMA+ R RMA+ query pipeline of query parsing, query tree creation, etc.). 5M tup 3.5 2.1 20 6.6 47 11.6 50M tup 37 21.3 221 61.4 fail 2018 100M tup 74 40 fail 1690 fail 4064 #rows (#columns = 50) #rows (#columns = 50) 500K 81 75 64 21 7 7 500K 92 92 86 53 44 43 300K 79 77 63 21 7 7 300K 91 91 86 55 45 40 8.4 RMA+ vs. Array Databases 100K 84 74 69 23 9 10 100K 86 86 80 48 37 35 We study the performance of RMA+ vs. SciDB [31] as a ADD EMU MMU QQR DSV VSV ADD EMU MMU QQR DSV VSV representative of array databases. We compute add on two (a) Data.table and matrix (b) List of BATs and 1D array matrices with 10 columns and a varying number of rows, followed by a selection 5. The resulting runtimes for Ubuntu Figure 13: Data transformation share: (a) R, (b) RMA+ are shown in Table 7. RMA+ outperforms SciDB by more than an order of magnitude. RMA+ performs addition directly Clearly, the overhead of transforming data matters for over pairs of relations, while SciDB must compute a so-called both R and RMA+. We draw the following conclusions: (a) array join [30] over the input arrays in order to add their Transforming data between data structures is costly. (b) For values. simple operations like ADD and EMU, the transformation over- head dominates the overall runtime (up to 92%). (c) For com- Table 7: add followed by a selection: RMA+ vs. SciDB plex operations, the performance of the matrix operation #tuples 1M 5M 10M 15M dominates the overall runtime. RMA+ 4.6s 24.4s 1m18s 1m39s SciDB 1m21s 7m6s 13m2s 18m23s 8.6 Efficiency for Mixed Workloads We analyze four workloads that require a mix of relational operations and matrix operations, and we compare our im- 8.5 Overhead of Data Transformation plementation of RMA (RMA+) to its competitors (R, AIDA, We investigate the overhead of data transformation for vari- MADlib). The workloads stem from applications on our real- ous matrix operations in a mixed relational/matrix scenario. world datasets and differ in the complexity of relational vs. RMA+ is free to execute matrix operations directly on matrix part. On the BIXI dataset, we compute (1) the lin- BATs or rearrange the numerical data in main memory and ear regression between distance and duration for individual delegate the matrix operations to specialized packages like trips, and (2) journeys connecting up to 5 trips; on DBLP we MKL [2]. R does not enjoy this flexibility: R uses the matrix compute the (3) covariance between conferences based on 5We run this experiment on Ubuntu 14.04 since SciDB does not support the publication counts per conference and author; (4) on a Debian; Ubuntu runs on a server with 4 cores and 16GB of RAM. synthetic dataset based on BIXI we count trips per rider. (1) Trips ś Ordinary Linear Regression. Trips in BIXI in- coordinates to compute the distances between subsequent clude start date and start station, end date and end station, stations in a journey. 
At the matrix level, we do a multiple duration, and a membership flag for the rider; stations have linear regression analysis with the distances as independent a code, a name, and coordinates. At the level of relations, we variables and the overall duration as the dependent variable. need to perform the following steps: (a) Ag- Figure 15a shows the runtime for journey lengths of 1 gregate the trips and select those trips that were performed to 5 trips (i.e., 1 to 5 independent variables). The solid part at least 50 times; (b) join trips and stations to retrieve the is the time for data preparation (relational operations); the station coordinates and compute the distance. We use the dashed light part is the time for multiple linear regression OLS method [29] to compute the linear regression between (matrix operations). RMA+ and AIDA again outperform R on distance and duration. OLS uses cross product, matrix mul- the relational part of the query. The relational part operates tiplication, and inversion: MMU(INV(CPD(퐴, 퐴)), CPD(퐴,푉 )), on purely numerical data and AIDA shows comparable join where 퐴 is the matrix with the independent variables, and 푉 performance to RMA+. MADlib spends about two third of the is the vector with the dependent variable. relational runtime on distance computations and is therefore Figure 14a shows the runtime results for trips reported in slower than its competitors also on the relational part. the years 2014 (3.1M trips), 2014-2015 (6.1M trips), 2014-2016 10 (10.5M trips), and 2014-2017 (14.5M trips), respectively. The 200 RMA+ AIDA R MADlib RMA+MKL RMA+BAT input data consists of numeric and non-numeric types such 8 150 as date and time. We break the runtime down into data prepa- 6 100 ration (solid area of the bar) and matrix computation time 4 Runtime (sec) 50 Runtime (sec) (dashed light area) for RMA+, R, and AIDA; for R we also 2 show the load time from a CSV file (dark area). RMA+ and 0 1 2 3 4 5 0 1 2 3 4 5 AIDA outperform R and MADlib in all scenarios. R performs #trips #trips poorly on the relational operations of the data preparation (a) System Comparison (b) RMA+BAT vs RMA+MKL step: The join implementation of R does not leverage mul- tiple cores, and R lacks a query optimizer, which adversely Figure 15: Journeys (Multiple Linear Regression) affects the relational performance. MADlib is outperformed by all other solutions due to the slow computation of the (3) Conferences ś Covariance Computation. We compute linear regression. RMA+ outperforms AIDA on all datasets. the covariance between conferences with A++ rating to Although both RMA+ and AIDA compute the relational op- lower rated conferences based on the number of publications erations in MonetDB, RMA+ is up to 6.3 times faster: While per author and conference. The data includes two tables. AIDA passes pointers to access numerical Python data in 푟푎푛푘푖푛푔 stores a rating (e.g., A++, A+, B) for each conference. MonetDB, this does not work for other data types (e.g., date, 푝푢푏푙푖푐푎푡푖표푛 stores the number of publications per author time, string) due to different storage formats12 [ ]. Therefore, and conference; the first attribute is the author, the other expensive data transformations must be applied. attributes are conference names (i.e., the result of SQL PIVOT over a count-aggregate by conference and author). The query 5 25 RMA+ AIDA R MADlib RMA+MKL RMA+BAT computes the covariance matrix on 푝푢푏푙푖푐푎푡푖표푛 and joins the 20 4 result with 푟푎푛푘푖푛푔 to select A++ conferences.
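For reference, the OLS estimator used for the trip and journey workloads, MMU(INV(CPD(A, A)), CPD(A, V)), is the textbook closed form β = (AᵀA)⁻¹AᵀV; the following minimal NumPy sketch (ours, with made-up distance/duration values, not the BIXI data) spells it out.

# Illustrative sketch (ours): the OLS estimator MMU(INV(CPD(A, A)), CPD(A, V)),
# i.e., beta = (A^T A)^(-1) A^T V, with hypothetical distance/duration values.
import numpy as np

A = np.array([[1.0, 0.8],    # column of ones (intercept) and trip distance
              [1.0, 2.5],
              [1.0, 4.1],
              [1.0, 5.0]])
V = np.array([[ 6.0],        # trip duration
              [14.0],
              [22.0],
              [27.0]])

beta = np.linalg.inv(A.T @ A) @ (A.T @ V)   # CPD, INV, MMU
print(beta)                                  # intercept and slope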

Figure 14a shows the runtime results for trips reported in the years 2014 (3.1M trips), 2014-2015 (6.1M trips), 2014-2016 (10.5M trips), and 2014-2017 (14.5M trips), respectively. The input data consists of numeric and non-numeric types such as date and time. We break the runtime down into data preparation (solid area of the bar) and matrix computation time (dashed light area) for RMA+, R, and AIDA; for R we also show the load time from a CSV file (dark area). RMA+ and AIDA outperform R and MADlib in all scenarios. R performs poorly on the relational operations of the data preparation step: The join implementation of R does not leverage multiple cores, and R lacks a query optimizer, which adversely affects the relational performance. MADlib is outperformed by all other solutions due to the slow computation of the linear regression. RMA+ outperforms AIDA on all datasets. Although both RMA+ and AIDA compute the relational operations in MonetDB, RMA+ is up to 6.3 times faster: While AIDA passes pointers to access numerical Python data in MonetDB, this does not work for other data types (e.g., date, time, string) due to different storage formats [12]. Therefore, expensive data transformations must be applied.

[Figure 14: Trips (Ordinary Linear Regression). Panels: (a) System Comparison (RMA+, AIDA, R, MADlib), (b) RMA+BAT vs RMA+MKL; x-axis: #tuples (M), y-axis: runtime (sec).]

(2) Journeys – Multiple Linear Regression. We compose trips that meet in a station into journeys. We start from 15M one-trip journeys of the form (start station, end station, duration); all attributes are numerical. During data preparation, we perform joins to create journeys of up to five trips, select those that appear at least 50 times, and join stations with their coordinates to compute the distances between subsequent stations in a journey. At the matrix level, we do a multiple linear regression analysis with the distances as independent variables and the overall duration as the dependent variable.

Figure 15a shows the runtime for journey lengths of 1 to 5 trips (i.e., 1 to 5 independent variables). The solid part is the time for data preparation (relational operations); the dashed light part is the time for multiple linear regression (matrix operations). RMA+ and AIDA again outperform R on the relational part of the query. The relational part operates on purely numerical data and AIDA shows comparable join performance to RMA+. MADlib spends about two thirds of the relational runtime on distance computations and is therefore slower than its competitors also on the relational part.

[Figure 15: Journeys (Multiple Linear Regression). Panels: (a) System Comparison (RMA+, AIDA, R, MADlib), (b) RMA+BAT vs RMA+MKL; x-axis: #trips (1 to 5), y-axis: runtime (sec).]

(3) Conferences – Covariance Computation. We compute the covariance between conferences with A++ rating and lower rated conferences based on the number of publications per author and conference. The data includes two tables. ranking stores a rating (e.g., A++, A+, B) for each conference. publication stores the number of publications per author and conference; the first attribute is the author, the other attributes are conference names (i.e., the result of SQL PIVOT over a count-aggregate by conference and author). The query computes the covariance matrix on publication and joins the result with ranking to select A++ conferences.

We measure the runtime for publication tables of increasing sizes: (1) 337363x266 (i.e., 337363 authors and 266 conferences), (2) 550085x519, (3) 722891x744, and (4) 876559x882. The ranking table stores 882 tuples. Note that the number of result rows of covariance is identical to the number of input columns, e.g., covariance of publications with 266 columns returns a relation (or matrix) of size 266x266.
Figure 16a shows the runtime results for RMA+, R, and AIDA. MADlib runs for 77, 429, 1086, resp. 1814 seconds on the different relation sizes and, thus, is omitted from the figure. In all systems, the covariance computation dominates the overall runtime with at least 90%. Since AIDA does not support covariance, we implement covariance via cross product [19] in all algorithms except MADlib, which has a cov() function but does not support cross product. For the cross product in RMA+ we use the routine cblas_dsyrk() since the result of the multiplication is symmetric; in AIDA we use a.t @ a, and in R we use crossprod⁶. Note that the covariance computations in AIDA and R do not return contextual information. In order to join the result with ranking and to select all A++ conferences, the conference names must be manually added as a new column.

⁶ We do not use function cov() since it uses a single core only and is slower.
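As an illustration only (not the actual RMA+ kernel), the following C sketch computes the cross product AᵀA of an n x k column-major matrix with cblas_dsyrk, which writes only one triangle of the symmetric result; centering each column at its mean and dividing the cross product by n-1 would turn it into the sample covariance matrix.

    #include <mkl.h>   /* cblas_dsyrk and the CBLAS enums */

    /* Cross product C = A^T * A for an n x k matrix A stored column-major
     * (as assembled from k numerical columns); C is a k x k buffer. */
    void cross_product(const double *A, double *C, int n, int k)
    {
        /* dsyrk fills only the upper triangle of the symmetric result ... */
        cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                    k, n, 1.0, A, n, 0.0, C, k);
        /* ... so mirror it to the lower triangle to obtain the full matrix. */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < i; j++)
                C[i + j * k] = C[j + i * k];
    }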

[Figure 16: Conferences (Covariance Computation). Panels: (a) Systems Comparison (RMA+, AIDA, R), (b) RMA+BAT vs RMA+MKL; x-axis: size of publications table (configurations 1 to 4), y-axis: runtime (sec).]

(4) Trip Count. In Figure 17 we compute the number of trips per rider to 10 different destinations. Each tuple in the input relations stores a rider and the number of trips to each of the 10 locations for one year. We use add on the relations of two different years to get the trip count for a period of two years. We vary the number of riders from 1M to 15M and measure the runtime. Since add is a simple operation, RMA+ uses the no-copy implementation on BATs (RMA+BAT). RMA+ is faster than AIDA and R because it does not transfer data to Python (as AIDA) and does not translate data.tables to matrices (as R). MADlib takes 23, 119, 299, resp. 480 seconds for the different input sizes and, thus, is again omitted from the figure.

[Figure 17: Trip Count (Matrix Addition). Panels: (a) Systems Comparison (RMA+, AIDA, R), (b) RMA+BAT vs RMA+MKL; x-axis: #tuples (M), y-axis: runtime (sec).]

RMA+BAT vs. RMA+MKL. Following our policy, RMA+ delegates matrix operations to MKL (RMA+MKL) in Figures 14a, 15a, and 16a (the operations are complex and we do not run out of memory), and uses the no-copy implementation on BATs (RMA+BAT) in Figure 17a (add is a linear operation). We compare RMA+BAT to RMA+MKL in all scenarios. RMA+MKL outperforms RMA+BAT for the queries on trips (factor 1.8-3.8, cf. Figure 14b) and journeys (factor 1.4-1.9, cf. Figure 15b). For the conference query, RMA+MKL is 24 to 70 times faster since the cross product requires single element access and operates on relations with a large number of attributes. For the trip count, RMA+BAT outperforms RMA+MKL in all settings (cf. Figure 17b). Although elementwise addition is highly efficient in MKL, the transformation overhead cannot be amortized.

8.7 Discussion

The key learnings from our empirical evaluation are the following: (1) RMA+ excels for mixed workloads that include both standard relational and matrix operations. (2) Only RMA+ can avoid data transformations in mixed workloads; data transformations may be costly and consume more than 90% of the overall runtime. (3) For complex matrix operations, however, transforming the data to a suitable format may pay off: In our approach, we are free to transform the data whenever beneficial. (4) In terms of scalability to large relations/matrices, our solution outperforms all competitors since it relies on the memory management of the database system for both the standard relational and the matrix operations. (5) Finally, the handling of contextual information, a feature of RMA, is efficient and can leverage optimizations that avoid expensive sorting.

9 CONCLUSION

In this paper, we targeted applications that store data in relations and must deal with queries that mix relational and linear algebra operations. We proposed the relational matrix algebra (RMA), an extension of the relational model with matrix operations that maintain important contextual information. RMA operations are defined over relations and can be nested. We implemented RMA over the internal data structures of MonetDB, and the implementation does not require changes in the query processing pipeline. Our integration is competitive with state-of-the-art approaches and excels for mixed workloads.

RMA opens new opportunities for cross algebra optimizations that involve both relational and linear algebra operations. It is also interesting to investigate the handling of wide tables, e.g., by storing them as skinny tables that are accessed accordingly or by combining operations to avoid the generation of wide intermediate tables.

ACKNOWLEDGMENTS

We thank Joseph Vinish D'silva for providing the source code of AIDA and helping out with the system. We thank the MonetDB team for their support. We thank Roland Kwitt for the discussion about application scenarios. The project was partially supported by the Swiss National Science Foundation (SNSF) through project number 407540_167177.

REFERENCES

[1] cuBLAS – NVIDIA BLAS library for GPUs. https://developer.nvidia.com/cublas.
[2] MKL – Intel Math Kernel Library. https://software.intel.com/en-us/mkl.
[3] NumPy. http://www.numpy.org, 2018.
[4] C. R. Aberger, A. Lamb, K. Olukotun, and C. Re. Levelheaded: A unified engine for business intelligence and linear algebra querying. In ICDE. IEEE, 2018.
[5] C. R. Aberger, S. Tu, K. Olukotun, and C. Re. Emptyheaded: A relational engine for graph processing. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 431–446, New York, NY, USA, 2016. ACM.
[6] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. SIGMOD Rec., 27(2), June 1998.
[7] M. Boehm, A. Kumar, J. Yang, and H. V. Jagadish. Data Management in Machine Learning Systems. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2019.
[8] Z. Cai, Z. Vagena, L. Perez, S. Arumugam, P. J. Haas, and C. Jermaine. Simulation of database-valued Markov chains using SimSQL. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 637–648, New York, NY, USA, 2013. ACM.
[9] O. Dolmatova, N. Augsten, and M. H. Böhlen. Preserving contextual information in relational matrix operations. In Proceedings of the 36th International Conference on Data Engineering, ICDE '20. To appear.
[10] O. Dolmatova, N. Augsten, and M. H. Böhlen. A relational matrix algebra and its implementation in a column store (extended version). CoRR, abs/2004.05517, 2020.
[11] J. V. D'silva, F. De Moor, and B. Kemme. AIDA: Abstraction for advanced in-database analytics. Proc. VLDB Endow., 11(11):1400–1413, July 2018.
[12] J. V. D'silva, F. De Moor, and B. Kemme. Keep your host language object and also query it: A case for SQL query support in RDBMS for host language objects. In Carlos Maltzahn and Tanu Malik, editors, Proceedings of the 31st International Conference on Scientific and Statistical Database Management, SSDBM 2019, Santa Cruz, CA, USA, July 23-25, 2019, pages 133–144. ACM, 2019.
[13] W. Gander. Algorithms for the QR-Decomposition. Technical report, ETH Zurich, April 1980.
[14] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 231–242, Washington, DC, USA, 2011. IEEE Computer Society.
[15] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
[16] J. M. Hellerstein, C. Re, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, August 2012.
[17] D. Hutchison, B. Howe, and D. Suciu. LaraDB: A minimalist kernel for linear and relational algebra computation. CoRR, abs/1703.07342, 2017.
[18] Kaggle Inc. BIXI Montreal (public bicycle sharing system). https://www.kaggle.com/aubertsigouin/biximtl, 2019.
[19] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson Prentice Hall, 2007.
[20] S. Luo, Z. J. Gao, M. Gubanov, L. L. Perez, and C. Jermaine. Scalable linear algebra on a relational database system. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 523–534, April 2017.
[21] W. McKinney. pandas: a foundational Python library for data analysis and statistics, 2011.
[22] D. Misev and P. Baumann. Homogenizing data and metadata retrieval in scientific applications. In Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP, DOLAP '15, pages 25–34, New York, NY, USA, 2015. ACM.
[23] MonetDB. Online MonetDB reference. https://www.monetdb.org/Home, 2017.
[24] University of Trier. DBLP computer science bibliography. https://dblp.uni-trier.de, 2019.
[25] Oracle. Online Oracle UTL_NLA Package Reference. https://docs.oracle.com/database/121/ARPLS/u_nla.htm, 2016.
[26] C. Ordonez. Building statistical models and scoring with UDFs. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 1005–1016, New York, NY, USA, 2007. ACM.
[27] R project. The R Project for Statistical Computing. https://www.r-project.org/, 2018.
[28] Quick R. R Matrix Algebra package overview. http://www.statmethods.net/advstats/matrix.html, 2017.
[29] C. Radhakrishna Rao and H. Toutenburg. Linear Models, Least Squares and Alternatives. Springer Series in Statistics. Springer-Verlag New York, 1995.
[30] SciDB, Inc. SciDB User's Guide Version 13.3.6203. In SciDB User's Guide, pages 1–258, 2013.
[31] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The Architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, SSDBM '11, pages 1–16. Springer-Verlag, 2011.
[32] Y. Zhang, H. Herodotou, and J. Yang. RIOT: I/O-efficient numerical computing without SQL. CoRR, abs/0909.1766, 2009.
[33] Y. Zhang, M. Kersten, M. Ivanova, and N. Nes. SciQL: Bridging the Gap Between Science and Relational DBMS. In Proceedings of the 15th Symposium on International Database Engineering & Applications, IDEAS '11. ACM, 2011.
[34] Y. Zhang, M. Kersten, and S. Manegold. SciQL: Array Data Processing Inside an RDBMS. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1049–1052. ACM, 2013.