<<

Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2020

A Relational Matrix Algebra and its Implementation in a Column Store

Dolmatova, Oksana ; Augsten, Nikolaus ; Böhlen, Michael Hanspeter

Abstract: Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

DOI: https://doi.org/10.1145/3318464.3389747

Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-200904 Conference or Workshop Item Published Version

Originally published at: Dolmatova, Oksana; Augsten, Nikolaus; Böhlen, Michael Hanspeter (2020). A Relational Matrix Algebra and its Implementation in a Column Store. In: ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14 June 2020 - 19 June 2020, 2573-2587. DOI: https://doi.org/10.1145/3318464.3389747

A Relational Matrix Algebra and its Implementation in a Column Store

Oksana Dolmatova (University of Zürich, Zürich, Switzerland, [email protected]); Nikolaus Augsten (University of Salzburg, Salzburg, Austria, [email protected]); Michael H. Böhlen (University of Zürich, Zürich, Switzerland, [email protected])

ABSTRACT

Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.

ACM Reference Format: Oksana Dolmatova, Nikolaus Augsten, and Michael H. Böhlen. 2020. A Relational Matrix Algebra and its Implementation in a Column Store. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 14-19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3318464.3389747

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD'20, June 14-19, 2020, Portland, OR, USA. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-6735-6/20/06...$15.00. https://doi.org/10.1145/3318464.3389747

1 INTRODUCTION

Many data that are stored in relational databases include numerical parts that must be analyzed, for example, sensor data from industrial plants, scientific observations, or point of sales data. The analysis of these data, which are not purely numerical but also include important non-numerical values, demands mixed queries that apply relational and linear algebra operations on the same data.

Dealing with mixed workloads is challenging since the gap between relations and matrices must be bridged. Current relational systems are poorly equipped for this task. Previous attempts to deal with mixed workloads have focused on the implementation level, for example, by introducing ordered data types; by storing matrices in special relations or key-value structures; or by splitting queries into their relational and matrix parts. This paper resolves the gap between relations and matrices.

We propose a principled solution for mixed workloads and introduce the relational matrix algebra (RMA) to support complex data analysis within the relational model. The goal is to (1) solve the integration of relations and linear algebra at the logical level, (2) thus achieve independence from the implementation at the physical level, and (3) prove the feasibility of our model by extending an existing system. We are the first to achieve these goals: Other works focus on facilitating the transition between the relational and the linear algebra model. We eliminate the dichotomy between matrices and relations by seamlessly integrating linear algebra into the relational model. Our implementation of RMA in MonetDB shows the feasibility of our approach.

We define linear operations over relations and systematically process and maintain non-numerical information. We show that the relational model is well-suited for complex data analysis if ordering and contextual information are dealt with properly. RMA is purely based on relations and does not introduce any ordered data structures. Instead, the relevant row order for matrix operations is computed from contextual information in the argument relations. All relational matrix operations return relations with origins.
Origins are constructed from the contextual information (attribute names and non-numerical values) of the input relations and uniquely identify and describe each cell in the result relation.

We extend the syntax of SQL to support relational matrix operations. As an example, consider a relation rating with schema (User, Balto, Heat, Net) that stores users and their ratings for the three films ("Balto", "Heat", and "Net", one column per film). The SQL query

SELECT * FROM INV( rating BY User );

orders the relation by users and computes the inversion of the matrix formed by the values of the ordered numerical columns. The result is a relation with the same schema: The values of attribute User are preserved, and the values of the remaining three attributes are provided by matrix inversion (see Section 5 for details). The origin of a numerical result value is given by the user name in its row and the attribute name of its column.
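To make the intended semantics concrete, the following is a minimal illustrative sketch (ours, not the paper's implementation) that mimics the effect of the query above with pandas and NumPy; the rating instance is the one later shown in Figure 5.

# Illustrative sketch (ours, using pandas/NumPy): the effect of
# SELECT * FROM INV(rating BY User); on the rating relation of Figure 5.
import numpy as np
import pandas as pd

rating = pd.DataFrame({"User": ["Ann", "Tom", "Jan"],
                       "Balto": [1.5, 0.0, 4.0],
                       "Heat": [2.0, 0.0, 1.0],
                       "Net": [0.5, 1.5, 1.0]})

ordered = rating.sort_values("User")                            # BY User fixes the row order
inv = np.linalg.inv(ordered[["Balto", "Heat", "Net"]].to_numpy())  # invert the numeric block
result = pd.DataFrame(inv, columns=["Balto", "Heat", "Net"])
result.insert(0, "User", ordered["User"].to_numpy())            # origins: user names are kept
print(result)                                                   # same schema (User, Balto, Heat, Net)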
At the system level, we have integrated our solution into MonetDB. Specifically, we extended the kernel with relational matrix operations implemented over binary association tables (BATs). The physical implementation of matrix operations is flexible and may be transparently delegated to specialized libraries that leverage the underlying hardware (e.g., MKL [2] for CPUs or cuBLAS [1] for GPUs). The new functionality is introduced without changing the main data structures and the processing pipeline of MonetDB, and without affecting existing functionality.

Our technical contributions are as follows:

• We propose the relational matrix algebra (RMA), which extends the relational model with matrix operations. This is the first approach to show that the relational model is sufficient to support matrix operations. The new set of operations is closed: All relational matrix operations are performed on relations and result in relations, and no additional data structure is required.

• We show that matrix operations are shape restricted, which allows us to systematically define the results of matrix operations over relations. We define row and column origins, the part of contextual information that describes values in the result relation, and prove that all our operations return relations with origins.

• We implement and evaluate our solution in detail. We show that our solution is feasible and leverages existing data structures and optimizations.

RMA opens new opportunities for advanced data analytics that combine relational and linear algebra functionality, speeds up analytical queries, and triggers the development of new logical and physical optimization techniques.

The paper is organized as follows. Sect. 2 discusses related work. We introduce basics in Sect. 3 and introduce the relational matrix algebra (RMA) in Sect. 4. We show an application example in Sect. 5, discuss important properties of RMA in Sect. 6, and its implementation in MonetDB in Sect. 7. We evaluate our solution in Sect. 8 and conclude in Sect. 9.

2 RELATED WORK

Relational DBMSs offer simple linear algebra operations, such as the pair-wise addition of attribute values in a relation.¹ Some operations, e.g., matrix multiplication, can be expressed via syntactically complex and slow SQL queries. The set of operations is limited and does not include operations whose results depend on the row order. For instance, there are no SQL solutions for inversion or determinant computation. Complex operations must be programmed as UDFs. Ordonez et al. [26] suggest UDFs for linear regression with a matrix-like result type. The UTL_NLA package [25] for Oracle DBMS offers linear algebra operations defined over UTL_NLA_ARRAY. UDFs provide a technical interface but do not define matrix operations over relations. No systematic approach to maintain contextual information is provided.

Luo et al. [20] extend SimSQL [8], a Hadoop-based relational system, with linear algebra functionality. RasDaMan [6, 22] manages and processes image raster data. Both systems introduce matrices as ordered numeric-only attribute types. Although relations and matrices coexist, operations are defined over different objects. Linear operations are not defined over unordered objects and they do not support contextual information for individual cells of a matrix.

SciQL [33, 34] extends MonetDB [23] with a new data type, ARRAY, as a first-class object. An array is stored as an object on its own. Arrays have a fixed schema: The last attribute stores the data values of a matrix, all other attributes are dimension attributes and must be numeric. Arrays come with a limited set of operations, such as addition, filtering, and aggregation, and they must be converted to relations to perform relational operations. The presence of contextual information and its inheritance are not addressed.

The MADlib library [16] for in-database analytics offers a broad range of linear and statistical operations, defined as either UDFs with C++ implementations or Eigen library calls. Matrix operations require a specific input format: Tables must have one attribute with a row id value and another array-valued attribute for matrix rows. Matrix operations return purely numeric results and cannot be nested.

Hutchison et al. [17] propose LARA, an algebra with tuple-wise operations, attribute-wise operations, and tuple extensions. LARA defines linear and relational algebra operations using the same set of primitives. This is a good basis for inter-algebra optimizations that span linear and relational operations. LARA offers a strong theoretical basis, works out properties of the solution, and allows to store row and column descriptions during the operations. The maintenance of contextual information is not considered for operations that change the number of rows or columns.

¹ Some approaches support multi-dimensional arrays. Since we target linear algebra, we focus on two dimensions and use the term matrix throughout.
LevelHeaded [4, 5], an engine for relational and linear algebra operations, uses a special key-value structure: Each object has keys (dimension attributes) and annotations (value attributes). Dimension and value attributes are stored in a trie and a flat columnar buffer, respectively. Linear operations are available through an extended SQL syntax. Key values guarantee contextual information for rows. However, the trie key structure restricts relational operations: For example, aggregations of keys and join predicates over non-key attributes (i.e., subselects in SQL) are not allowed.

SciDB [31] is a DBMS that is based on arrays. Matrices and relations are implemented as nested arrays. SciDB focuses on efficient array processing and performs linear algebra operations over arrays. SciDB supports element-wise operations and selected linear operations, such as SVD. The system also offers relational algebra operations on arrays but cannot compete with relational DBMSs such as MonetDB in terms of performance. A systematic approach to maintain contextual information is not considered.

Statistical packages, such as R [27] and pandas [21], offer a broad range of linear and relational algebra operations over arrays. Each cell may be associated with descriptive information, but this information is not always inherited as part of operations (e.g., usv). No systematic solution for associating contextual information with numeric results is provided. The most important relational operations are supported, but even basic optimizations (e.g., join ordering) are missing.

The R package RIOT-DB [32] uses MySQL as a backend and translates linear computations to SQL. RIOT-DB addresses the main memory limitations of R. The optimization of constructed SQL statements yields inter-operation optimization. However, it is difficult (or sometimes impossible) to express linear algebra operations in SQL, and only a few simple operations, such as subtraction and multiplication, are discussed.

AIDA [11] integrates MonetDB and NumPy [3] and exploits the fact that both systems use C arrays as an internal data structure: To avoid copying NumPy data to MonetDB, AIDA passes pointers to arrays. Data copying is still needed to pass MonetDB results to NumPy since MonetDB does not guarantee that multiple columns are contiguous in memory, which is required by NumPy. AIDA offers a Python-like procedural language for relational and linear operations. Sequences of relational operations are evaluated lazily, which allows AIDA to combine and optimize sequences of relational operations. The optimization does not include linear algebra operations.

SystemML [14] offers a set of linear algebra primitives that are expressed in a high-level, declarative language and are implemented on MapReduce. SystemML includes linear algebra optimizations that are similar to relational optimizations (e.g., selecting the order of execution of matrix multiplications). The system considers only linear algebra operations.

3 PRELIMINARIES

This section presents notation for relations and matrices, and introduces the basic matrix algebra operations.

Figure 1 shows relation r with schema (O, V, W) and tuples (A, 30, 1), (C, 22, 5), (B, 10, 1); matrix d with the single column (D, B); matrix e with rows (1, 3) and (2, 4); and the concatenation d □ e with rows (D, 1, 3) and (B, 2, 4). (Figure 1: Relation r; matrices d, e, and d □ e.)

3.1 Relations

A relation r is a set of tuples r_i with schema R. A schema, R = (A, B, . . .), is a finite, ordered set of attribute names. A tuple r_i ∈ r has a value from the appropriate domain for each attribute in the schema. We write r_i.A to denote the value of attribute A in tuple r_i and r.A to denote the set of all values r_i.A in relation r. Ordered subsets of a schema, U ⊆ R, are typeset in bold. |r| is the number of tuples in relation r.

Let r be a relation and U ⊆ R be attributes that form a key of R. We write r^(U,k) to denote the k-th tuple of relation r sorted by the values of attributes U (in ascending order):

    r_i = r^(U,k)  ⟺  r_i ∈ r  ∧  |{r_j | r_j ∈ r ∧ r_j.U < r_i.U}| = k − 1     (1)

The column cast ▽U creates an ordered set L from the sorted values of an attribute U that forms a key in relation r:

    L = ▽U  ⟺  |L| = |r|  ∧  ∀ 1 ≤ i ≤ |r| (L[i] = r^(U,i).U)     (2)

The column cast is used to generate a schema from a set of values. We use this for operations tra, usv, and opd (see Table 2). The column cast is applicable if the cardinality of a list of attributes U is one.

Example 3.1. Consider relation r in Figure 1. The third tuple of relation r sorted by the values of attribute V is r^((V),3) = (A, 30, 1), the column cast of O is ▽O = (A, B, C), and the values of attribute W are r.W = {1, 5, 1}.

We use set notation and apply it to bags. Bags can be ordered or unordered. To emphasize the difference, parentheses are used for ordered bags (or lists), e.g., (3,2,3), and curly braces for unordered bags, e.g., {3,2,3}. When transitioning from unordered to ordered bags, the order is specified explicitly.
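To make the sorted-tuple access r^(U,k) of Equation (1) and the column cast ▽U of Equation (2) concrete, here is a small illustrative sketch (ours, not part of the paper) in plain Python over the example relation r of Figure 1; the helper names are ours.

# Illustrative sketch (ours): r^(U,k) and the column cast ▽U from Section 3.1,
# evaluated over the example relation r of Figure 1 with schema (O, V, W).
r = [("A", 30, 1), ("C", 22, 5), ("B", 10, 1)]

def sorted_tuple(rel, key_index, k):
    """r^(U,k): the k-th tuple (1-based) of rel sorted by the key attribute."""
    return sorted(rel, key=lambda t: t[key_index])[k - 1]

def column_cast(rel, key_index):
    """▽U: the ordered list of values of the key attribute U."""
    return [t[key_index] for t in sorted(rel, key=lambda t: t[key_index])]

print(sorted_tuple(r, 1, 3))   # r^((V),3) = ('A', 30, 1)
print(column_cast(r, 0))       # ▽O = ['A', 'B', 'C']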
3.2 Matrices

An n × k matrix m is a two-dimensional array with n rows and k columns. |m| is the number of rows, #m the number of columns. The element in the i-th row and the j-th column of matrix m is m[i, j]; the i-th row is m[i, ∗]; the j-th column is m[∗, j].

We consider the operations from the R Matrix Algebra [28]: element-wise multiplication (EMU), matrix multiplication (MMU), outer product (OPD), cross product (CPD), matrix addition (ADD), matrix subtraction (SUB), transpose (TRA), solve equation (SOL), inversion (INV), eigenvectors (EVC), eigenvalues (EVL), QR decomposition (QQR, RQR), SVD – singular value decomposition (DSV, USV, VSV), determinant (DET), rank (RNK), and Choleski factorization (CHF). Note that QR and SVD return more than one matrix, therefore we split the operations: QQR and RQR return matrix Q and matrix R of the QR decomposition, respectively; DSV, USV, and VSV return vector D with the singular values, matrix U with the left singular vectors, and matrix V with the right singular vectors of SVD, respectively.

The matrix concatenation of matrices m and n with k rows each returns a matrix h with k rows. The i-th row of h is the concatenation of the i-th row of m and the i-th row of n:

    h = m □ n  ⟺  |h| = |m|  ∧  ∀ 1 ≤ i ≤ |h| (h[i, ∗] = m[i, ∗] ◦ n[i, ∗])     (3)

The schema cast ΔU of attributes U creates a matrix m (with a single column) from the attribute names of U:

    m = ΔU  ⟺  #m = 1  ∧  |m| = |U|  ∧  ∀ 1 ≤ i ≤ |U| (m[i, 1] = U[i])     (4)

Example 3.2. Consider attributes U = (D, B). Matrix d in Figure 1 is the result of the schema cast d = ΔU. The result of concatenating matrix d and matrix e is d □ e. Note that the row and column numbers (cells shaded in gray) in the matrix illustrations are not part of the matrix.

Matrix operations are shape restricted, i.e., the number of result rows is equal to the number of rows of one of the input matrices (r), the number of columns of one of the input matrices (c), or one (1). The same holds for the number of result columns.

The dimensionality of result matrices defines the shape type of matrix operations. We write r1 if the result dimensionality is equal to the number of rows in the first matrix, r2 if the result dimensionality is equal to the number of rows in the second matrix, and r∗ if the result dimensionality is equal to the number of rows in the first and second matrix (i.e., r1 = r2). The same notation holds for the number of columns. Table 1 summarizes the shape types of matrix operations.

Example 3.3. Matrix multiplication has shape type (r1,c2), which states that the number of result rows is equal to the number of rows of the first argument matrix, and the number of columns is equal to the number of columns of the second argument matrix. Matrix addition has shape type (r∗,c∗), which states that the number of result rows is equal to the number of rows of the first matrix and the number of rows of the second matrix.

Table 1: Shape types of matrix operations

Cardinalities                          Shape type    Operations
|i1 × j1| → |i1 × i1|                  (r1,r1)       USV
|i1 × j1|, |i2 × j1| → |i1 × i2|       (r1,r2)       OPD
|i1 × i1| → |i1 × i1|                  (r1,c1)       INV, EVC, CHF
|i1 × j1| → |i1 × j1|                  (r1,c1)       QQR
|i1 × j1|, |j1 × j2| → |i1 × j2|       (r1,c2)       MMU
|i1 × i1| → |i1 × 1|                   (r1,1)        EVL
|i1 × j1| → |i1 × 1|                   (r1,1)        VSV
|i1 × j1| → |j1 × i1|                  (c1,r1)       TRA
|i1 × j1| → |j1 × j1|                  (c1,c1)       RQR, DSV
|i1 × j1|, |i1 × j2| → |j1 × j2|       (c1,c2)       CPD
|i1 × j1|, |i1 × 1| → |j1 × 1|         (c1,c2)       SOL
|i1 × j1|, |i1 × j1| → |i1 × j1|       (r∗,c∗)       EMU, ADD, SUB
|i1 × i1| → |1 × 1|                    (1,1)         DET
|i1 × j1| → |1 × 1|                    (1,1)         RNK

All operations of the matrix algebra are shape restricted. This follows directly from the definitions of the matrix operations [15]. The first column of Table 1 lists the relevant cardinalities from these definitions. We use shape restriction to determine the inheritance of contextual information. Shape restriction has also been used in size propagation techniques [7] for the purpose of cost-based optimization of chains of matrix operations.

4 RELATIONAL MATRIX ALGEBRA

To seamlessly integrate matrix operations into the relational model, we extend the relational algebra to the relational matrix algebra (RMA). For each of the matrix operations we define a corresponding relational matrix operation in RMA: emu, mmu, opd, cpd, add, sub, tra, sol, inv, evc, evl, qqr, rqr, dsv, usv, vsv, det, rnk, chf. We use upper case for matrix operations (e.g., TRA) and lower case for RMA operations (e.g., tra). RMA includes both relational algebra and relational matrix operations. The new operations behave like regular operations with relations as input and output.

For each argument relation, r, of a relational matrix operation one parameter must be specified: The order schema U ⊆ R imposes an order on the tuples for the purpose of the operation. The attributes of the order schema must form a key.² The attributes of relation r that are not part of the order schema U, i.e., Ū = R − U, form the application schema. The application schema identifies the attributes with the data to which the matrix operation is applied.

² Attributes that neither belong to the order schema nor the application schema must be dropped explicitly with a projection (or added to the order schema, thus forming a super key).

The order schema U ⊆ R splits relation r into four non-overlapping areas: Order schema U; order part r.U; application schema Ū; and application part r.Ū. The parts of r that do not include matrix values, i.e., the order and application schemas (U and Ū) and the order part (r.U), form the contextual information for application part r.Ū. Intuitively, the order schema and application schema provide context for columns while the order part provides context for rows.

Figure 2 shows the structure of a relation instance: Relation r has schema (T, H, W) and tuples (5am, 1, 3), (8am, 8, 5), (7am, 6, 7), (6am, 1, 4); attribute T forms the order schema with order part r.(T), and (H, W) form the application schema with application part r.(H, W). (Figure 2: Structure of a relation instance.)

Example 4.1. Order schema U = (T) splits relation r in Figure 2 into four parts: Order schema U = (T), application schema Ū = (H, W), order part r.U = r.(T) = {5am, 8am, 7am, 6am}, and application part r.Ū = r.(H, W) = {(1, 3), (8, 5), (6, 7), (1, 4)}.

4.1 Matrix and Relation Constructors

Figure 3 summarizes our approach for the inv operation and example relation r′ = σ_{T>6am}(r): (1) Two matrix constructors define matrices m and n that correspond to order and application part of r′, respectively; (2) INV inverts matrix n, resulting in matrix h; and (3) the relation constructor combines m □ h and R into result relation v.

Definition 4.2. (Matrix constructor) Let r be a relation, U be an order schema. The matrix constructor μ_U(r) returns a matrix that includes the values of r.U sorted by U:

    m = μ_U(r)  ⟺  |m| = |r|  ∧  ∀ 1 ≤ i ≤ |r| (m[i, ∗] = r^(U,i).U)

We use the complement notation μ̄_U(r) to denote the matrix that includes the values of r.Ū sorted by U.

Example 4.3. Consider Figure 3 with relation instance r′ = σ_{T>6am}(r) and schema R = (T, H, W). The matrix constructor μ̄_T(r′) returns matrix n.

Definition 4.4. (Relation constructor) Let m be a matrix with unique rows, and R be a relation schema with #m attributes. The relation constructor γ(m, R) returns relation r with schema R:

    r = γ(m, R)  ⟺  |m| = |r|  ∧  ∀ t (t ∈ r ⟺ ∃ 1 ≤ i ≤ |m| (t = m[i, ∗]))

Example 4.5. In Figure 3, a relation constructor is applied to schema R and the concatenated matrices m □ h to construct the result relation: v = γ(m □ h, R).

Matrix and relation constructors map between relations and matrices. We use constructors and matrices to define relational matrix operations and to analyze their properties. At the implementation level, constructors are very efficient since they split and combine lists of attribute names and do not access the data (cf. Section 7).

4.2 Relational Matrix Operations

Relational matrix operations offer the functionality of matrix operations in a relational context. The general form of a unary relational matrix operation is op_U(r), where U is the order schema. A binary operation op_{U;V}(r, s) has an additional order schema V for argument relation s.

The result of a relational matrix operation is a relation that consists of (a) the base result of the corresponding matrix operation, and (b) contextual information, appropriately morphed from the contextual information of the argument relations to reflect the semantics of the operation.

Definition 4.6. (Base result) Consider a unary relational matrix operation op_U(r). The matrix that is the result of matrix operation OP(μ̄_U(r)) is the base result of op_U(r). The base result for binary operations is defined analogously.

Example 4.7. Consider inv_T(σ_{T>6am}(r)) in Fig. 3. The base result is matrix h, which results from INV(μ̄_T(σ_{T>6am}(r))).

Table 2 defines the details of how contextual information is maintained in relational matrix operations. All definitions follow the structure illustrated in Figure 3. A result relation is composed from order parts, base result, and schemas with the help of a relation constructor. For example, inv is defined according to its shape type in Table 2: inv_U(r) = γ(μ_U(r) □ INV(μ̄_U(r)), U ◦ Ū), where μ_U(r) are the rows of the order part, INV(μ̄_U(r)) is the base result, and U ◦ Ū is the result schema.

Operations that have a different number of rows than any of the input relations add a new attribute C to the result relation. This attribute C is for contextual information (cf. Example 4.8): Its values are either the attribute names of the application schema of an input relation or the operation name. The operations add, sub, emu require union compatible application schemas and non-overlapping order schemas. Operations usv, opd, and tra construct the application schema of the result from the order schema of an input relation. Therefore the cardinality of the order schemas U of tra and usv, and V of opd must be one.

Figure 3 (structure of our solution for the inversion example, v = inv_T(σ_{T>6am}(r))): the selection yields r′ = {(8am, 8, 5), (7am, 6, 7)}; m = μ_T(r′) holds the sorted order part (7am, 8am); n = μ̄_T(r′) holds the application part; h = INV(n); and v = γ(m □ h, T ◦ T̄) = {(7am, −0.19, 0.27), (8am, 0.31, −0.23)}.

Table 2: Splitting and morphing relations and matrices

Shape type    Operations               Definition
(r1,r1)       usv                      op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ ▽U)
(r1,r2)       opd                      op_{U;V}(r, s) = γ(μ_U(r) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ ▽V)
(r1,c1)       inv, evc, chf, qqr       op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ Ū)
(r1,c2)       mmu                      op_{U;V}(r, s) = γ(μ_U(r) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ V̄)
(r1,1)        evl, vsv                 op_U(r) = γ(μ_U(r) □ OP(μ̄_U(r)), U ◦ (op))
(c1,r1)       tra                      op_U(r) = γ(ΔŪ □ OP(μ̄_U(r)), (C) ◦ ▽U)
(c1,c1)       rqr, dsv                 op_U(r) = γ(ΔŪ □ OP(μ̄_U(r)), (C) ◦ Ū)
(c1,c2)       cpd, sol                 op_{U;V}(r, s) = γ(ΔŪ □ OP(μ̄_U(r), μ̄_V(s)), (C) ◦ V̄)
(r∗,c∗)       emu, add, sub            op_{U;V}(r, s) = γ(μ_U(r) □ μ_V(s) □ OP(μ̄_U(r), μ̄_V(s)), U ◦ V ◦ Ū)
(1,1)         det, rnk                 op_U(r) = γ(r ◦ OP(μ̄_U(r)), (C, op))

Example 4.8. Consider Figures 2 and 4. Figure 4a illustrates the result of qqr_T(r). The values of T define the ordering of tuples for this operation. The values of H and W are the values of matrix Q computed as part of the QR decomposition. Figure 4b illustrates the result of tra_T(r). The column cast ▽T of ordering attribute T provides names for the attributes in the transposed relation. The result relation has a new attribute C whose values are the names of the attributes in the application schema of r. Note that all result relations come with sufficient contextual information for each value. For example, relation r in Figure 2 records that Humidity (H) was 1 at 6am, which is also recorded in the transposed relation in Figure 4b.
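As a small illustrative aside (our sketch in plain Python, not the paper's code), the tra_T(r) result of Example 4.8 can be reproduced directly from the definitions: the application-schema names become values of the new attribute C, and the column cast ▽T names the result columns.

# Illustrative sketch (ours): tra_T(r) for the relation of Figure 2.
r = [("5am", 1, 3), ("8am", 8, 5), ("7am", 6, 7), ("6am", 1, 4)]  # schema (T, H, W)
app_schema = ("H", "W")

ordered = sorted(r, key=lambda t: t[0])          # order imposed by T
col_names = ["C"] + [t[0] for t in ordered]      # (C) ◦ ▽T
rows = []
for j, name in enumerate(app_schema, start=1):   # one result tuple per attribute H, W
    rows.append((name,) + tuple(t[j] for t in ordered))

print(col_names)   # ['C', '5am', '6am', '7am', '8am']
print(rows)        # [('H', 1, 1, 6, 8), ('W', 3, 4, 7, 5)]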

Figure 4 illustrates two relational matrix operations on relation r from Figure 2: (a) QR decomposition qqr_T(r) with schema (T, H, W) and tuples (5am, 0.1, 0.5), (6am, 0.8, −0.4), (7am, 0.6, 0.4), (8am, 0.1, 0.7); (b) transpose tra_T(r) with schema (C, 5am, 6am, 7am, 8am) and tuples (H, 1, 1, 6, 8), (W, 3, 4, 7, 5). (Figure 4: Examples of relational matrix operations.)

Figure 5 shows the example database. Relation u (user) with schema (User, State, YoB) contains u1 = (Ann, CA, 1980), u2 = (Tom, FL, 1965), u3 = (Jan, CA, 1970). Relation f (film) with schema (Title, RelY, Director) contains f1 = (Heat, 1995, Lee), f2 = (Balto, 1995, Lee), f3 = (Net, 1995, Smith). Relation r (rating) with schema (User, Balto, Heat, Net) contains r1 = (Ann, 1.5, 2.0, 0.5), r2 = (Tom, 0.0, 0.0, 1.5), r3 = (Jan, 4.0, 1.0, 1.0). (Figure 5: Example database.)

The application example maintains all data in regular relations and illustrates the importance of maintaining contextual information. Consider relations u, f, and r in Figure 5. Relation u records name, state and year of birth of users; relation f records title, release year and director of films; relation r records user ratings for films. Tuple u1 states that user Ann lives in California and was born in 1980; tuple f1 states that film Heat was directed by Lee and was released in 1995; tuple r1 states that Ann's ratings for Balto, Heat, and Net are, respectively, 1.5, 2.0, and 0.5.

5 RMA IN ACTION The task is to determine how similar each of Lee’s films This section gives an application example with a mixed work- is to any other film, based on the ratings from California load that combines relational and linear algebra operations. users. The covariance [19] is used to compute this similarity. In addition, we need relational algebra operations (e.g., se- w4 w3 w8 Z C Ann Jan lection 휎, aggregation 휗, rename 휌, and join ) to retrieve U B H N T B H N B -1.25 1.25 selected ratings and films, aggregate ratings, and combine Ann -1.25 0.5 -0.25 푧1 B 1.56 -0.62 -2.5 H 0.5 -0.5 information from different tables. The key observation is Jan 1.25 -0.5 0.25 푧2 H -0.62 0.25 1 that a mixture of matrix and relational operations is required N -0.25 0.25 to determine the similarities of the ratings. The solution in Figure 6 includes three key steps: Data Figure 7: Steps during the computation preparation (푤1), covariance computation (푤2-푤7), and re- trieving Lee’s films together with all similarities푤 ( 8). Note result of a relational matrix operation can be reduced to the the seamless integration of linear and relational algebra3: result of the corresponding matrix operation. Origins guar- The entire process frequently switches between linear and antee that each result relation includes sufficient inherited relational operations. contextual information to relate argument and result relation. We prove that each relational matrix operations is matrix consistent and returns a relation with origins. 푤1 = 휋푈 ,퐵,퐻,푁 (휎푆=’CA’ (푢 Z 푟)) 푤2 = 휗퐴푉퐺 (퐵),퐴푉퐺 (퐻),퐴푉퐺 (푁 ) (푤1) 6.1 Matrix Consistency = 푤3 휋푈 ,퐵,퐻,푁 (sub푈 ;푉 (푤1, 휌푉 (휋푈 (푤1)) × 푤2)) Matrix consistency ensures that the result relation includes = all cell values that are present in the base result and the order 푤4 tra푈 (푤3) 푤5 = mmu (푤4,푤3) of rows in the base result can be derived from contextual in- 퐶;푈 formation in the result relations. First, we define reducibility = 푤6 푤5 × 휌푀 (휗퐶푂푈 푁푇 (∗) (푤1)) to transition from relations to matrices. 푤7 = 휋 /( − ) /( − ) /( − ) (푤6) 퐶,퐵 푀 1 ,퐻 푀 1 ,푁 푀 1 Definition 6.1. (Reducibility) Let 푟 be a relation, U be an = Z 푤8 휋푇,퐵,퐻,푁 (휎퐷=’Lee’ (푤7 퐶=푇 푓 )) order schema. Relation 푟 is reducible to matrix 푚 iff 푚 can be constructed from the attribute values of U in relation 푟 Figure 6: Computing the similarity of the ratings sorted by U: 푟 →U 푚 ⇐⇒ 휇U (푟) = 푚
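As a cross-check of the workflow in Figure 6 (our illustrative sketch, not part of the paper), the covariance pipeline can be mirrored with NumPy on the example data of Figure 5, restricted to the California users Ann and Jan; w2, w3, w4, and w7 below correspond to the intermediate relations discussed in the text, using the 1/(n−1) definition of covariance cited there.

# Illustrative sketch (ours): the covariance computation of Figure 6 on the
# Figure 5 data, restricted to the California users Ann and Jan.
import numpy as np

w1 = np.array([[1.5, 2.0, 0.5],    # Ann's ratings for Balto, Heat, Net
               [4.0, 1.0, 1.0]])   # Jan's ratings
w2 = w1.mean(axis=0)               # AVG(B), AVG(H), AVG(N)
w3 = w1 - w2                       # sub: centered ratings
w4 = w3.T                          # tra: films x users
w7 = (w4 @ w3) / (len(w1) - 1)     # mmu and division by (M-1)
print(w3)                          # [[-1.25  0.5  -0.25], [ 1.25 -0.5   0.25]]
print(w7)                          # 3x3 covariance matrix between Balto, Heat, Net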

In the following we discuss the algebra expressions in ′ Example 6.2. Consider Fig. 3 with relation 푟 = 휎푇 >6푎푚 (푟), Figure 6. First, we join 푢 and 푟 to select ratings from Cali- matrix 푛, and order schema 푇 . From Example 4.3 we have fornia users (푤1). Next, we compute the covariance using 휇 (푟 ′) = 푛. Relation 푟 ′ is reducible to matrix 푛 since 푛 can = 1 푇 its standard definition19 [ ]: 푐표푣 (푋, 푌) 푛−1 [(푋 − 퐸[푋]) ∗ T be constructed from the values of 퐻 and 푊 in the argument ( − [ ]) ] ( ) = ′ 푌 퐸 푌 . The expectation of an attribute, e.g., 퐸 퐻 relation sorted by 푇 , i.e., 푟 →푇 푛. 휗퐴푉퐺 (퐻) (...), is computed via aggregation (푤2). Relational matrix operations, sub, tra and mmu, are used to subtract (푋 − Definition 6.3. (Matrix consistency) Consider a unary ma- 퐸[푋]), transpose (T), and multiply (∗) relations (푤3,푤4,푤5). trix operation OP(푚). The corresponding relational matrix Next, we compute the unbiased covarinace (푤6,푤7). Finally, operation op is matrix consistent iff for all relations 푟 that are we join 푤7 and 푓 to select Lee’s films. reducible to matrix 푚, the result relation opU (푟) is reducible to OP(푚): Figure 7 illustrates relations 푤3, 푤4, and 푤8. Consider ′ ∀푟,푚, U(푟 →U 푚 =⇒ ∃U (opU (푟) →U′ OP(푚))) transpose tra푈 (푤3) with order schema 푈 and application schema 푈 = (퐵, 퐻, 푁 ). The result of this operation is a re- A binary relational matrix operation is matrix consistent lation 푤4 with schema (퐶,퐴푛푛,퐽푎푛). The values of attribute if its result is reducible to the result of the corresponding 퐶 are the attribute names in the application schema of 푤3. binary matrix operation. Note that each operation preserves schema and ordering Example 6.4. Consider Figures 2 and 8 with relation 푟, information as the crucial parts of contextual information. matrix 푔, matrix RQR(푔) and relation rqr (푟). This makes it possible to interpret the tuples in result rela- 푇 • 푟 →푇 푔: Relation 푟 is reducible to matrix 푔 tion 푤8. For example, tuple 푧1 states that Lee’s film Balto has • rqr ( ) → RQR( ) rqr ( ) the smallest covariance to film Net. 푇 푟 퐶 푔 : relation 푇 푟 is reducible to matrix RQR(푔) 6 PROPERTIES OF RMA 6.2 Origins of Result Relations This section defines two crucial requirements for relational The result of a relational matrix operation is a relation that, in matrix operations. Matrix consistency guarantees that the addition to the base result, includes a row origin and a column 3We use the first character of the attribute name to refer to attributes. origin. Origins (1) uniquely define the relative positioning of 푔 Column origins (푐표) are marked by rectangles (all values 1 2 RQR(푔) rqr푇 (푟) inside a rectangle form together the column origin for the 1 1 3 1 2 C H W relation). Row origins (푟표) are marked by ellipses. 2 1 4 1 -10.1 -8.8 H -10.1 -8.8 3 6 7 2 0.0 -4.6 W 0.0 -4.6 = = 푝2 usv푇 (푟) 4 8 5 푝1 rnk퐻 (휋퐻,푊 (푟)) C rnk T 5am 6am 7am 8am Figure 8: Example of matrix consistency 푟 1 5am -0.2 0.5 -0.8 0.4 푟표 = 푟 = (r) 6am -0.3 0.6 0.6 0.4 푐표 = op = (rnk) 7am -0.7 0.2 0.0 -0.7 result values, (2) give a meaning to values with respect to the 8am -0.7 -0.6 0.0 0.4 applied operation, and (3) establish a connection between 푟표 = 푟 .U = (5am, 6am, 7am, 8am) argument relations of an operation and its result relation. 푟 푐표 = ▽U = (5am, 6am, 7am, 8am) T H W 푝3 = qqr (푟) Example 6.5. Consider inversion and result relation 푣 in 5am 1 3 푊 ,푇 W T H Figure 3. 
Values 7푎푚 and 8푎푚 show that (1) value -0.19 pre- 8am 8 5 3 5am 0.1 cedes value 0.31 because 7푎푚 precedes 8푎푚; (2) -0.19 is the 7am 6 7 4 6am 0.1 inversion value for humidity and for time 7푎푚; (3) value -0.19 6am 1 4 5 8am 0.7 in relation 푣 is connected to value 6 in the argument relation 7 7am 0.5 since they have the same origins (7푎푚 and H). 푟표 = 푟 .U = ((3,5am), (4,6am), (5,8am), (7,7am)) U Origins are either inherited order or application schemas 푐표 = = (H) from argument relations, or constants. The shape type of an Figure 9: Examples of origins operation determines the cardinality of inherited contextual information. The indices in the shape type specify the input = U = U = U′ = For 푝2 usv푇 (푟), we have (푇 ), (퐻,푊 ), (푇 ), relation, from which an origin is inherited. For example, if ′ and U = (5am, 6am, 7am, 8am). The shape type of usv is (r1, the first element of the shape type isc1, the row origin is the r1) (see Table 2) this makes 푝2.푇 = 푟.푇 a row origin and ▽푇 schema cast of the application schema of the first argument = (5am, 6am, 7am, 8am) a column origin. relation. Indices ∗ and 2 apply only to binary operations.

Definition 6.6. (Origins) Consider a unary, 푣 = opU (푟), 6.3 Correctness = Theorem 6.8. or binary, 푣 opU;V (푟, 푠), matrix consistent operation with All relational matrix operations return ma- shape type (푥,푦), base result 푚, and attribute list U′ such that trix consistent relations with a row and column origin. ′ ′ 푣 →U′ 푚. Consider Table 3. 푣.U is a row origin iff 푣.U is We provide the proof for the theorem in our technical equal to ro for the given shape type 푥. U′ is a column origin report [10]. iff U′ is equal to co for the given shape type 푦. The following example uses a sequence of tra operations to illustrate the importance of origins. A result relation with Table 3: Definition of origins for shape type푥 ( ,푦) origins inherits sufficient contextual information, such that each value can be interpreted. Origins also carry sufficient x ro y co information about the order of rows, such that in sequences U ▽U r1 푟 . r1 of relational matrix operations no ordering information is V ▽V r2 푠. r2 lost between operations. c1 ΔU c1 U c2 ΔV c2 V Example 6.9. Consider a relation instance 푟 that is re- r∗ (푟 .U, 푠.V) r∗ ▽U ducible to matrix 푛. Figure 10 illustrates the matrix expres- ΔU ΔV U c∗ ( , ) c∗ sion TRA(TRA(푛)) and the corresponding relational matrix 1 ′푟′ 1 ′표푝′ expression tra퐶 (tra푇 (푟)). Operation 푇 tra(푟) returns re- lation 푟1, which in addition to the application schema (at- tributes 5푎푚, 6푎푚, 7푎푚, 8푎푚) also includes attribute퐶, which Example 6.7. Figure 9 illustrates relation 푟 and the origins is preserved together with the application schema. for operations rnk • 퐻 (휋퐻,푊 (푟)) with shape type (1,1) 7 IMPLEMENTATION usv • 푇 (푟) with shape type (r1,r1) We discuss the integration of our solution into MonetDB. • qqr푊 ,푇 (푟) with shape type (r1,c1) The implementation of relational matrix operations includes 푟 푛 푋 ↓푌 returns BAT 푋, whose OIDs have the same order as T H W 1 2 OIDs of BAT 푌 . 푋 ↓푋 denotes 푋 sorted by its own values. 5am 1 3 1 1 3 8am 8 5 푟 →푇 푛 2 1 4 7.2 RMA Integration 7am 6 7 3 6 7 As a first step, we have extended the SQL parser to make the 6am 1 4 4 8 5 relational matrix operations available in the FROM clause tra (r BY U) 푇 (푟) TRA(푛) of SQL [9]. The syntax specifies ordering for an argument relation 푟. As an example, consider relations 푟 and 푟1 푛1 푠 and ordering attributes U and V. The unary operation invU C 5am 6am 7am 8am 1 2 3 4 and the binary operation mmuU;V are expressed as: H 1 1 6 8 푟1 →퐶 푛1 1 1 1 6 8 W 3 4 7 5 2 3 4 7 5 SELECT * FROM INV(r BY U); SELECT * FROM MMU(r BY U, s BY V); tra (푟1) TRA(푛1) 퐶 These basic constructs can be composed to build more 푟2 푛2 complex expressions. For instance, folding 푤5, 푤6 and 푤7 C H W 1 2 from Figure 6 yields the RMA expression: 5am 1 3 1 1 3 휋퐶,퐵/(푀−1),퐻/(푀−1),푁 /(푀−1) ( 6am 1 4 푟2 →퐶 푛2 2 1 4 7am 6 7 3 6 7 mmu퐶;푈 (푤4,푤3) × 휌푀 (휗퐶푂푈 푁푇 (∗) (푤1))) 8am 8 5 4 8 5 The SQL translation of this expression is: Figure 10: Origins and matrix consistency SELECT C, B/(M-1), H/(M-1), N/(M-1) FROM MMU(w4 BY C, w3 BY U) AS w5 CROSSJOIN ( SELECTCOUNT (*) AS M FROM w1 ) AS t; the processing of contextual information and the computa- tion of the base result. Contextual information is handled Algorithm 1 processes a node that represents a unary re- inside MonetDB, while the computation of the base result lational matrix operation opU (푟) and translates it to a list of can be done in MonetDB or delegated to external libraries BAT expressions. In lines 2 - 7, the BATs of relation 푟 are (e.g., MKL). 
The integration of each relational matrix oper- split, sorted, and morphed to get BATs 푋 with row origins ation requires extensions throughout the system, but does and BATs 푌 with the application part. Splitting (lines 2 and not change the query processing pipeline and no new data 4) divides a relation into two parts: The application part, on structures are introduced. To extend MonetDB with addition, which the matrix operations are performed, and the contex- QR decomposition, linear regression, and the transformation tual information, which gives a meaning to the application of numerical data to the MKL format we touch 20 (out of part. BATs 퐵 are split into application part and order part 4500) files and add 2500 lines of code. according to U. Sorting (lines 3 and 4) determines the order of the tuples for a specific matrix operation. The order schema 7.1 MonetDB U is used to sort the BATs: BATs in U are sorted according to their values while the other BATs in 퐵 are sorted according MonetDB stores each column of a table as a binary associa- to the OIDs of the BATs in U. The order is established for tion table (BAT). A BAT is a table with two columns: Head each operation based on the contextual values in the relation. and tail. The head is a column with object identifiers (OID), Morphing (lines 5-7) morphs contextual information so that it while the tail is a column with attribute values. All attribute can be added to the base result. Finally, the matrix operation values of a tuple in a relation have the same OID value. Thus, is applied to 푌 (line 8). Merging (line 9) combines the result of a tuple can be constructed by concatenating all tail values a matrix operation with relevant contextual information and with the same OID. MonetDB operations manipulate BATs, constructs the result relation with row and column origins. and relational operations are represented and executed as Merging and splitting are efficient operations that work at sequences of BAT operations. Example BAT operations are the schema level and do not access the data. 퐵1 ∗ 퐵2, 퐵1/퐵2, and 퐵1 − 퐵2 for element-wise multiplication, division, and subtraction, and 푠푢푚(퐵) to sum the values in Example 7.1. Figure 11 illustrates Algorithm 1 for 푣 = = BAT 퐵. inv푇 (휎푇 >6푎푚 (푟)). Splitting: input list 퐵 (푇, 퐻,푊 ) is split One important BAT operation is leftfetchjoin (↓), which into order list 퐷 = (푇 ) and application list 퐵 \ 퐷 = (퐻,푊 ). returns a BAT with OIDs sorted according to the order of Sorting: BAT 푇 is sorted, producing 퐺. Then, (퐻,푊 ) are OIDs of another BAT from the same relation. For instance, sorted according to 퐺 returning (퐻 ↓ 푇,푊 ↓ 푇 ). Morphing: input splitting sorting morphing eval merging

Figure 11: Splitting, sorting, morphing, merging for query v = inv_T(σ_{T>6am}(r)). The input σ_{T>6am}(r) = {(8am, 8, 5), (7am, 6, 7)} is split into the order list D = (T) and the application list B \ D = (H, W); sorting yields (T↓T, H↓T, W↓T) with tuples (7am, 6, 7) and (8am, 8, 5); morphing keeps X = T↓T; eval computes F over (H, W); and merging returns Concat(X, F) = {(7am, −0.19, 0.27), (8am, 0.31, −0.23)}.
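For intuition, the following is a small column-at-a-time sketch (ours, with plain Python lists standing in for BATs; not the MonetDB code) of the splitting, sorting, morphing, and merging steps shown in Figure 11.

# Illustrative sketch (ours): the pipeline of Figure 11 for v = inv_T(σ_{T>6am}(r)).
import numpy as np

# Column-wise input after the selection σ_{T>6am}(r)
T, H, W = ["8am", "7am"], [8, 6], [5, 7]

# Splitting: order list D = (T), application list B \ D = (H, W)
# Sorting: sort T, then align H and W with the sorted order (leftfetchjoin)
perm = sorted(range(len(T)), key=lambda i: T[i])
G = [T[i] for i in perm]                      # T sorted by its own values
Y = [[H[i] for i in perm], [W[i] for i in perm]]

# Eval: compute the base result on the application columns
F = np.linalg.inv(np.array(Y, float).T)       # rows follow the order of G

# Morphing/Merging: shape type (r1,c1) keeps the order column as row context
v = [(G[i],) + tuple(F[i]) for i in range(len(G))]
print(v)   # [('7am', -0.19..., 0.27...), ('8am', 0.31..., -0.23...)]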

Algorithm 1: UnaryRMA(op, U, r)
1  B ← BATs(r); Y ← {};
2  D ← {b | b ∈ B, b is in order schema U};
3  G ← sort(D);
4  for b ∈ B \ D do Y ← Y ∪ b↓G;
5  if ShapeType(op) ∈ {(r,r), (r,c), (r,1)} then X ← G;
6  else if ShapeType(op) ∈ {(c,r), (c,c)} then X ← newBAT(Y);
7  else X ← newBAT(r);
8  F ← eval(op, Y);
9  return Concat(X, F);

Algorithm 2: INV(B)
1  n ← B.length; BR ← IDmatrix(n);
2  for i = 1 to n do
3      v1 ← sel(Bi, i); Bi ← Bi / v1; BRi ← BRi / v1;
4      for j = 1 to n do
5          if i ≠ j then
6              v2 ← sel(Bj, i); Bj ← Bj − Bi ∗ v2;
7              BRj ← BRj − BRi ∗ v2;
8  return BR;
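A compact Python rendering of this column-wise Gauss-Jordan scheme (our sketch, operating on lists of NumPy columns rather than BATs) may help to see how Algorithm 2 works with whole columns and touches single elements only for the pivot lookups.

# Illustrative sketch (ours): Gauss-Jordan inversion over a list of columns,
# mirroring Algorithm 2 (B and BR are lists of columns, not BATs).
import numpy as np

def inv_columns(B):
    n = len(B)
    B = [np.array(col, float) for col in B]
    BR = [np.array([1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    for i in range(n):
        v1 = B[i][i]                    # sel(B_i, i): pivot element
        B[i] = B[i] / v1; BR[i] = BR[i] / v1
        for j in range(n):
            if i != j:
                v2 = B[j][i]            # sel(B_j, i)
                B[j] = B[j] - B[i] * v2
                BR[j] = BR[j] - BR[i] * v2
    return BR                           # columns of the inverse matrix

# Columns of the matrix [[6, 7], [8, 5]] from the running example
print(inv_columns([[6.0, 8.0], [7.0, 5.0]]))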

returns the 푖th value in 퐵. With the exception of the 푠푒푙 oper- Since inv is of shape type (r1,c1), row contextual informa- ation, all operations are standard MonetDB BAT operations tion is the order schema: 푋 = 푇 ↓ 푇 . Merging: BATs 푋 are that are also used for relational queries. For example, the concatenated with BATs 퐹 to form the result. operation on 퐵푖 ← 퐵푖 /푣1; divides each element of a BAT by a scalar value. 7.3 Computing the Base Result 8 PERFORMANCE EVALUATION Line 8 of Algorithm 1 calls the procedure that computes the matrix operation. The computation can be done either in Setup. All runtimes are averages over 3 runs on an Intel(R) the kernel of MonetDB or by calling an external library (e.g., Xeon(R) E5-2603 CPU, 1.7 GHz, 12 cores, no hyper-threading, MKL [2]). Calling an external library requires copying data 98 GB RAM (L1: 32+32K, L2:256K, L3:15360K), Debian 4.9.189. from BATs to the external format and copying the result Competitors. We empirically compare the implementa- back. The query optimizer decides about external library tions of our relational matrix algebra (RMA+4) with the sta- calls based on the complexity of the operation, the amount of tistical package R, the array database SciDB, and two state- data to be copied, and the relative performance of the matrix of-the-art in-database solutions, AIDA [11] and MADlib [16]. operation in MonetDB compared to the external library. (1) We implemented RMA+ in MonetDB (v11.29.8) with two The no-copy implementation of matrix operations in the options for matrix operations: (a) BATs (RMA+BAT): No- kernel of MonetDB is performed over BATs directly. Essen- copy implementation in the kernel of MonetDB; (b) MKL tially, standard algorithms must be reduced to BAT opera- (RMA+MKL): Copy BATs to an MKL (v2019.5.281) [2] com- tions. The process of reducing is highly dependent on the patible format (contiguous array of doubles), then copy the operation. The goal is to design algorithms that access entire result back to BATs. We execute linear operations (add, sub, columns and minimize accesses to single elements of BATs. emu) on BATs and use MKL for more complex operations. To achieve this standard, value-based algorithms must be When the matrices do not fit into main memory we switch to transformed to vectorized BAT operations. BATs. Due to the full integration of RMA+, MonetDB takes Algorithm 2 illustrates the reduction for the Gauss Jor- care of core usage and work distribution, and all cores are dan elimination method for the INV computation. The al- used for relational and for matrix operations. (2) SciDB [31] gorithm takes a list of BATs 퐵 = (퐵 , 퐵 , .., 퐵 ) and returns 1 2 푛 uses an array data model, and queries are expressed in the the inversion as a list of BATs 퐵푅 of the same size. Function 퐼퐷푚푎푡푟푖푥 (푛) creates a list of BATs that represents the iden- tity matrix of size 푛 × 푛. The selection operation 푠푒푙 (퐵, 푖) 4The implementation can be found here: https://github.com/oksdolm/RMA high-level, declarative language AQL (Array Query Lan- Figure 12 shows the results. (1) Handling contextual infor- guage) [30]. SciDB uses all available cores. (3) AIDA is a state- mation is efficient and scales to large numbers of attributes. of-the-art solution for the integration of matrix operations (2) The optimized operators that (partially) avoid sorting into a relational database and was shown to outperform other clearly outperform their non-optimized counterparts. solutions like Spark or the pandas library for Python [11]. 
4 4 AIDA executes matrix operations in Python and offers a add add add, relative sorting add, relative sorting Python-like syntax for relational operations, which are then 3 qqr 3 qqr qqr, w/o sorting qqr, w/o sorting translated into SQL and executed in MonetDB (v11.29.3). 2 2

AIDA uses all cores both in MonetDB and in Python. We also 1 integrate our solution into MonetDB, which makes AIDA a Runtime (sec) Runtime (sec) 1 0 particularly interesting competitor. (4) MADlib [16] (v1.10) 200 400 600 800 1,000 20 40 60 80 100 provides a collection of UDFs for PostgreSQL (v9.6) for in- #attributes in order schema #attributes in order schema database matrix and statistical calculations. MADlib does (a) 100K tuples (b) 1M tuples not use multiple cores, which affects its overall performance. (5) The R package (v3.2.3) is highly tuned for matrix opera- Figure 12: Handling contextual information tions and is a representative of a non-database solution. R performs all relational operations with data.tables structures Note that a number of operations (cpd, sol, rqr, dsv, tra, and transforms the relevant columns to matrices to com- det, rnk) do not preserve row context since the number pute the matrix operations. An alternative approach to use of rows changes. Instead, a single column with predefined character matrices for all operations is very inefficient (cf. values (operation name or attribute names of the application Section 8.5). R uses all cores for matrix operations but runs schema) is created, which is negligible in the overall runtime. relational operations on a single core. Data. BIXI [18] stores trips and stations of Montreal’s pub- 8.2 Wide and Sparse Relations lic bicycle sharing system, years 2014-2017. DBLP [24] stores Wide relations. Current databases scale better in the num- authors with their publication counts per conference as well ber of tuples than in the number of attributes. We test our as conference rankings. The synthetic dataset used in the RMA+ implementation in MonetDB on wide relations. We experiment to the effect of sparsity includes values generate relations with 1000 tuples, one order attribute, and between 0 and 5,000,000. All other synthetic datasets include a varying number of application attributes. In Table 4, we in- real-valued numeric attributes with uniformly distributed crease the number of attributes from 1K to 10K and measure values between 0 and 10,000. the runtime of the add operation. MonetDB can handle wide relations with several thousands of attributes, even though the runtime per column increases with the attribute number.

8.1 Maintaining Contextual Information Table 4: add over wide relations in RMA+ A salient feature of our approach is that contextual informa- #attr 1K 2K 3K 4K 5K 6K 7K 8K 9K 10K tion is maintained during matrix operations. We analyze the sec 0.6 2.2 4.8 8.8 13.4 20 27 36 47 62 scalability of maintaining context and study an optimization that avoids sorting. To this end, we generate relations with a single application column and an increasing number of order Sparse relations. We analyze the effect of MonetDB’s built- columns. We compute add and qqr on these relations. Since in compression on relations with many zeros. We add two add and qqr are inexpensive for single column matrices, the relations (5M tuples, one order, 10 application attributes) main cost is the maintenance of the order part. with uniformly distributed non-zero values (range 1-5M). In To handle contextual information we split, sort, morph, Table 5 we increase the percentage of zero values (position and merge lists of BATs (cf. Section 7.2). Sorting is the most of zeros is random) and measure the runtime: The add oper- expensive operation. Fortunately, sorting is not always nec- ation on sparse matrices is up to two times faster than the essary. For example, permuting the input rows for the qqr same operation on dense matrices. Thus, RMA+ leverages operation will affect the order of the result rows, but will MonetDB’s compression features. not change their values. Therefore, sorting is not required. Table 5: add over sparse relations in RMA+ In element-wise operations like add, emu, or sol, only the relative order of the rows in the two input relations matters. % 0 10 20 30 40 50 60 70 80 90 100 Thus, only the order part of the second relation requires sec 1.68 1.60 1.49 1.41 1.33 1.25 1.16 0.99 0.94 0.89 0.76 sorting (to get the same order). 8.3 RMA+ vs. Non-Database Approaches data type for matrix operations and the data.tables storage We study the scalability of RMA+ to large relations and com- structure for relational operations. While data.tables sup- pare to 푅 as a non-database solutions for matrix operations. ports simple linear operations like linear model construction, In Table 6 we measure the runtime for qqr on tables with up the data must be transformed to the matrix type for more to 100M tuples and 70 attributes in the application schema. complex operations like CPD, OPD, or MMU. Matrices cannot For relations up to a size of 50Mx40, RMA+ delegates the store a mix of numerical and non-numerical values, which matrix computation to MKL; the runtime includes copying is required when working with tables; R offers character the data. RMA+ is consistently faster than R since MKL can matrices, but they are very inefficient, e.g., joining trips and better leverage the hardware. R fails for sizes above 50Mx40 stations in the BIXI dataset takes 40 sec for the character since it runs out of memory. In RMA+ we switch to the BAT matrix type and less than 2 sec for data.tables. implementation, which leverages the memory management Figure 13 shows the percentage of time spent for data of MonetDB. The Gram-Schmidt qqr baseline [13] that we transformations on relations with 50 columns and a vary- implemented over BATs is slower than the MKL algorithm ing number of rows (100k to 500k). For R we measure the (e.g., 834 vs. 61.4 sec for 50Mx40), which explains the increase time of transforming the relation from data.table to matrix in runtime. 
RMA+ scales to large relations that do not fit into and back as a percentage of the overall query time, which memory (e.g., relation size 100Mx70 requires 56GB). includes the actual matrix operation. For RMA+ we measure the time share for copying the data from a list of BATs to a Table 6: Runtimes of qqr in seconds in R and RMA+ contiguous, one-dimensional array for MKL, and for copy- ing the result back; the overall runtime in addition includes 10 attr 40 attr 70 attr the matrix computation in MKL (but excludes the MonetDB System R RMA+ R RMA+ R RMA+ query pipeline of query parsing, query tree creation, etc.). 5M tup 3.5 2.1 20 6.6 47 11.6 50M tup 37 21.3 221 61.4 fail 2018 100M tup 74 40 fail 1690 fail 4064 #rows (#columns = 50) #rows (#columns = 50) 500K 81 75 64 21 7 7 500K 92 92 86 53 44 43 300K 79 77 63 21 7 7 300K 91 91 86 55 45 40 8.4 RMA+ vs. Array Databases 100K 84 74 69 23 9 10 100K 86 86 80 48 37 35 We study the performance of RMA+ vs. SciDB [31] as a ADD EMU MMU QQR DSV VSV ADD EMU MMU QQR DSV VSV representative of array databases. We compute add on two (a) Data.table and matrix (b) List of BATs and 1D array matrices with 10 columns and a varying number of rows, followed by a selection 5. The resulting runtimes for Ubuntu Figure 13: Data transformation share: (a) R, (b) RMA+ are shown in Table 7. RMA+ outperforms SciDB by more than an order of magnitude. RMA+ performs addition directly Clearly, the overhead of transforming data matters for over pairs of relations, while SciDB must compute a so-called both R and RMA+. We draw the following conclusions: (a) array join [30] over the input arrays in order to add their Transforming data between data structures is costly. (b) For values. simple operations like ADD and EMU, the transformation over- head dominates the overall runtime (up to 92%). (c) For com- Table 7: add followed by a selection: RMA+ vs. SciDB plex operations, the performance of the matrix operation #tuples 1M 5M 10M 15M dominates the overall runtime. RMA+ 4.6s 24.4s 1m18s 1m39s SciDB 1m21s 7m6s 13m2s 18m23s 8.6 Efficiency for Mixed Workloads We analyze four workloads that require a mix of relational operations and matrix operations, and we compare our im- 8.5 Overhead of Data Transformation plementation of RMA (RMA+) to its competitors (R, AIDA, We investigate the overhead of data transformation for vari- MADlib). The workloads stem from applications on our real- ous matrix operations in a mixed relational/matrix scenario. world datasets and differ in the complexity of relational vs. RMA+ is free to execute matrix operations directly on matrix part. On the BIXI dataset, we compute (1) the lin- BATs or rearrange the numerical data in main memory and ear regression between distance and duration for individual delegate the matrix operations to specialized packages like trips, and (2) journeys connecting up to 5 trips; on DBLP we MKL [2]. R does not enjoy this flexibility: R uses the matrix compute the (3) covariance between conferences based on 5We run this experiment on Ubuntu 14.04 since SciDB does not support the publication counts per conference and author; (4) on a Debian; Ubuntu runs on a server with 4 cores and 16GB of RAM. synthetic dataset based on BIXI we count trips per rider. (1) Trips ś Ordinary Linear Regression. Trips in BIXI in- coordinates to compute the distances between subsequent clude start date and start station, end date and end station, stations in a journey. 
At the matrix level, we do a multiple duration, and a membership flag for the rider; stations have linear regression analysis with the distances as independent a code, a name, and coordinates. At the level of relations, we variables and the overall duration as the dependent variable. need to perform the following steps: (a) Ag- Figure 15a shows the runtime for journey lengths of 1 gregate the trips and select those trips that were performed to 5 trips (i.e., 1 to 5 independent variables). The solid part at least 50 times; (b) join trips and stations to retrieve the is the time for data preparation (relational operations); the station coordinates and compute the distance. We use the dashed light part is the time for multiple linear regression OLS method [29] to compute the linear regression between (matrix operations). RMA+ and AIDA again outperform R on distance and duration. OLS uses cross product, matrix mul- the relational part of the query. The relational part operates tiplication, and inversion: MMU(INV(CPD(퐴, 퐴)), CPD(퐴,푉 )), on purely numerical data and AIDA shows comparable join where 퐴 is the matrix with the independent variables, and 푉 performance to RMA+. MADlib spends about two third of the is the vector with the dependent variable. relational runtime on distance computations and is therefore Figure 14a shows the runtime results for trips reported in slower than its competitors also on the relational part. the years 2014 (3.1M trips), 2014-2015 (6.1M trips), 2014-2016 10 (10.5M trips), and 2014-2017 (14.5M trips), respectively. The 200 RMA+ AIDA R MADlib RMA+MKL RMA+BAT input data consists of numeric and non-numeric types such 8 150 as date and time. We break the runtime down into data prepa- 6 100 ration (solid area of the bar) and matrix computation time 4 Runtime (sec) 50 Runtime (sec) (dashed light area) for RMA+, R, and AIDA; for R we also 2 show the load time from a CSV file (dark area). RMA+ and 0 1 2 3 4 5 0 1 2 3 4 5 AIDA outperform R and MADlib in all scenarios. R performs #trips #trips poorly on the relational operations of the data preparation (a) System Comparison (b) RMA+BAT vs RMA+MKL step: The join implementation of R does not leverage mul- tiple cores, and R lacks a query optimizer, which adversely Figure 15: Journeys (Multiple Linear Regression) affects the relational performance. MADlib is outperformed by all other solutions due to the slow computation of the (3) Conferences ś Covariance Computation. We compute linear regression. RMA+ outperforms AIDA on all datasets. the covariance between conferences with A++ rating to Although both RMA+ and AIDA compute the relational op- lower rated conferences based on the number of publications erations in MonetDB, RMA+ is up to 6.3 times faster: While per author and conference. The data includes two tables. AIDA passes pointers to access numerical Python data in 푟푎푛푘푖푛푔 stores a rating (e.g., A++, A+, B) for each conference. MonetDB, this does not work for other data types (e.g., date, 푝푢푏푙푖푐푎푡푖표푛 stores the number of publications per author time, string) due to different storage formats12 [ ]. Therefore, and conference; the first attribute is the author, the other expensive data transformations must be applied. attributes are conference names (i.e., the result of SQL PIVOT over a count-aggregate by conference and author). The query 5 25 RMA+ AIDA R MADlib RMA+MKL RMA+BAT computes the covariance matrix on 푝푢푏푙푖푐푎푡푖표푛 and joins the 20 4 result with 푟푎푛푘푖푛푔 to select A++ conferences.
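For reference, the OLS estimator used for the trip and journey workloads, MMU(INV(CPD(A, A)), CPD(A, V)), is the textbook closed form β = (AᵀA)⁻¹AᵀV; the following minimal NumPy sketch (ours, with made-up distance/duration values, not the BIXI data) spells it out.

# Illustrative sketch (ours): the OLS estimator MMU(INV(CPD(A, A)), CPD(A, V)),
# i.e., beta = (A^T A)^(-1) A^T V, with hypothetical distance/duration values.
import numpy as np

A = np.array([[1.0, 0.8],    # column of ones (intercept) and trip distance
              [1.0, 2.5],
              [1.0, 4.1],
              [1.0, 5.0]])
V = np.array([[ 6.0],        # trip duration
              [14.0],
              [22.0],
              [27.0]])

beta = np.linalg.inv(A.T @ A) @ (A.T @ V)   # CPD, INV, MMU
print(beta)                                  # intercept and slope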

Figure 14a shows the runtime results for trips reported in the years 2014 (3.1M trips), 2014-2015 (6.1M trips), 2014-2016 (10.5M trips), and 2014-2017 (14.5M trips), respectively. The input data consists of numeric and non-numeric types such as date and time. We break the runtime down into data preparation (solid area of the bar) and matrix computation time (dashed light area) for RMA+, R, and AIDA; for R we also show the load time from a CSV file (dark area). RMA+ and AIDA outperform R and MADlib in all scenarios. R performs poorly on the relational operations of the data preparation step: The join implementation of R does not leverage multiple cores, and R lacks a query optimizer, which adversely affects the relational performance. MADlib is outperformed by all other solutions due to the slow computation of the linear regression. RMA+ outperforms AIDA on all datasets. Although both RMA+ and AIDA compute the relational operations in MonetDB, RMA+ is up to 6.3 times faster: While AIDA passes pointers to access numerical Python data in MonetDB, this does not work for other data types (e.g., date, time, string) due to different storage formats [12]. Therefore, expensive data transformations must be applied.

[Figure 14: Trips (Ordinary Linear Regression). Panels: (a) System Comparison (RMA+, AIDA, R, MADlib), (b) RMA+BAT vs RMA+MKL; x-axis: #tuples (M), y-axis: runtime (sec).]

(2) Journeys – Multiple Linear Regression. We compose trips that meet in a station into journeys. We start from 15M one-trip journeys of the form (start station, end station, duration); all attributes are numerical. During data preparation, we perform joins to create journeys of up to five trips, select those that appear at least 50 times, and join stations with their coordinates to compute the distances between subsequent stations in a journey. At the matrix level, we do a multiple linear regression analysis with the distances as independent variables and the overall duration as the dependent variable.

Figure 15a shows the runtime for journey lengths of 1 to 5 trips (i.e., 1 to 5 independent variables). The solid part is the time for data preparation (relational operations); the dashed light part is the time for multiple linear regression (matrix operations). RMA+ and AIDA again outperform R on the relational part of the query. The relational part operates on purely numerical data and AIDA shows comparable join performance to RMA+. MADlib spends about two thirds of the relational runtime on distance computations and is therefore slower than its competitors also on the relational part.

[Figure 15: Journeys (Multiple Linear Regression). Panels: (a) System Comparison (RMA+, AIDA, R, MADlib), (b) RMA+BAT vs RMA+MKL; x-axis: #trips (1 to 5), y-axis: runtime (sec).]

(3) Conferences – Covariance Computation. We compute the covariance between conferences with A++ rating and lower rated conferences based on the number of publications per author and conference. The data includes two tables. ranking stores a rating (e.g., A++, A+, B) for each conference. publication stores the number of publications per author and conference; the first attribute is the author, the other attributes are conference names (i.e., the result of SQL PIVOT over a count-aggregate by conference and author). The query computes the covariance matrix on publication and joins the result with ranking to select A++ conferences.

We measure the runtime for publication tables of increasing sizes: (1) 337363x266 (i.e., 337363 authors and 266 conferences), (2) 550085x519, (3) 722891x744, and (4) 876559x882. The ranking table stores 882 tuples. Note that the number of result rows of covariance is identical to the number of input columns, e.g., covariance of publications with 266 columns returns a relation (or matrix) of size 266x266.
Figure 16a shows the runtime results for RMA+, R, and AIDA. MADlib runs for 77, 429, 1086, resp. 1814 seconds on the different relation sizes and, thus, is omitted from the figure. In all systems, the covariance computation dominates the overall runtime with at least 90%. Since AIDA does not support covariance, we implement covariance via cross product [19] in all algorithms except MADlib, which has a cov() function but does not support cross product. For the cross product in RMA+ we use the routine cblas_dsyrk() since the result of the multiplication is symmetric; in AIDA we use a.t @ a, and in R we use crossprod⁶. Note that the covariance computations in AIDA and R do not return contextual information. In order to join the result with ranking and to select all A++ conferences, the conference names must be manually added as a new column.

⁶ We do not use function cov() since it uses a single core only and is slower.
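As an illustration only (not the actual RMA+ kernel), the following C sketch computes the cross product AᵀA of an n x k column-major matrix with cblas_dsyrk, which writes only one triangle of the symmetric result; centering each column at its mean and dividing the cross product by n-1 would turn it into the sample covariance matrix.

    #include <mkl.h>   /* cblas_dsyrk and the CBLAS enums */

    /* Cross product C = A^T * A for an n x k matrix A stored column-major
     * (as assembled from k numerical columns); C is a k x k buffer. */
    void cross_product(const double *A, double *C, int n, int k)
    {
        /* dsyrk fills only the upper triangle of the symmetric result ... */
        cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                    k, n, 1.0, A, n, 0.0, C, k);
        /* ... so mirror it to the lower triangle to obtain the full matrix. */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < i; j++)
                C[i + j * k] = C[j + i * k];
    }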

[Figure 16: Conferences (Covariance Computation). Panels: (a) Systems Comparison (RMA+, AIDA, R), (b) RMA+BAT vs RMA+MKL; x-axis: size of publications table (configurations 1 to 4), y-axis: runtime (sec).]

(4) Trip Count. In Figure 17 we compute the number of trips per rider to 10 different destinations. Each tuple in the input relations stores a rider and the number of trips to each of the 10 locations for one year. We use add on the relations of two different years to get the trip count for a period of two years. We vary the number of riders from 1M to 15M and measure the runtime. Since add is a simple operation, RMA+ uses the no-copy implementation on BATs (RMA+BAT). RMA+ is faster than AIDA and R because it does not transfer data to Python (as AIDA) and does not translate data.tables to matrices (as R). MADlib takes 23, 119, 299, resp. 480 seconds for the different input sizes and, thus, is again omitted from the figure.

[Figure 17: Trip Count (Matrix Addition). Panels: (a) Systems Comparison (RMA+, AIDA, R), (b) RMA+BAT vs RMA+MKL; x-axis: #tuples (M), y-axis: runtime (sec).]

RMA+BAT vs. RMA+MKL. Following our policy, RMA+ delegates matrix operations to MKL (RMA+MKL) in Figures 14a, 15a, and 16a (the operations are complex and we do not run out of memory), and uses the no-copy implementation on BATs (RMA+BAT) in Figure 17a (add is a linear operation). We compare RMA+BAT to RMA+MKL in all scenarios. RMA+MKL outperforms RMA+BAT for the queries on trips (factor 1.8-3.8, cf. Figure 14b) and journeys (factor 1.4-1.9, cf. Figure 15b). For the conference query, RMA+MKL is 24 to 70 times faster since the cross product requires single element access and operates on relations with a large number of attributes. For the trip count, RMA+BAT outperforms RMA+MKL in all settings (cf. Figure 17b). Although elementwise addition is highly efficient in MKL, the transformation overhead cannot be amortized.

8.7 Discussion

The key learnings from our empirical evaluation are the following: (1) RMA+ excels for mixed workloads that include both standard relational and matrix operations. (2) Only RMA+ can avoid data transformations in mixed workloads; data transformations may be costly and consume more than 90% of the overall runtime. (3) For complex matrix operations, however, transforming the data to a suitable format may pay off: In our approach, we are free to transform the data whenever beneficial. (4) In terms of scalability to large relations/matrices, our solution outperforms all competitors since it relies on the memory management of the database system for both the standard relational and the matrix operations. (5) Finally, the handling of contextual information, a feature of RMA, is efficient and can leverage optimizations that avoid expensive sorting.

9 CONCLUSION

In this paper, we targeted applications that store data in relations and must deal with queries that mix relational and linear algebra operations. We proposed the relational matrix algebra (RMA), an extension of the relational model with matrix operations that maintain important contextual information. RMA operations are defined over relations and can be nested. We implemented RMA over the internal data structures of MonetDB, and the implementation does not require changes in the query processing pipeline. Our integration is competitive with state-of-the-art approaches and excels for mixed workloads.

RMA opens new opportunities for cross algebra optimizations that involve both relational and linear algebra operations. It is also interesting to investigate the handling of wide tables, e.g., by storing them as skinny tables that are accessed accordingly or by combining operations to avoid the generation of wide intermediate tables.

ACKNOWLEDGMENTS

We thank Joseph Vinish D'silva for providing the source code of AIDA and helping out with the system. We thank the MonetDB team for their support. We thank Roland Kwitt for the discussion about application scenarios. The project was partially supported by the Swiss National Science Foundation (SNSF) through project number 407540_167177.

REFERENCES

[1] cuBLAS – NVIDIA BLAS library for GPUs. https://developer.nvidia.com/cublas.
[2] MKL – Intel Math Kernel Library. https://software.intel.com/en-us/mkl.
[3] NumPy. http://www.numpy.org, 2018.
[4] C. R. Aberger, A. Lamb, K. Olukotun, and C. Re. Levelheaded: A unified engine for business intelligence and linear algebra querying. In ICDE. IEEE, 2018.
[5] C. R. Aberger, S. Tu, K. Olukotun, and C. Re. Emptyheaded: A relational engine for graph processing. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 431–446, New York, NY, USA, 2016. ACM.
[6] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. SIGMOD Rec., 27(2), June 1998.
[7] M. Boehm, A. Kumar, J. Yang, and H. V. Jagadish. Data Management in Machine Learning Systems. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2019.
[8] Z. Cai, Z. Vagena, L. Perez, S. Arumugam, P. J. Haas, and C. Jermaine. Simulation of database-valued Markov chains using SimSQL. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 637–648, New York, NY, USA, 2013. ACM.
[9] O. Dolmatova, N. Augsten, and M. H. Böhlen. Preserving contextual information in relational matrix operations. In Proceedings of the 36th International Conference on Data Engineering, ICDE '20. To appear.
[10] O. Dolmatova, N. Augsten, and M. H. Böhlen. A relational matrix algebra and its implementation in a column store (extended version). CoRR, abs/2004.05517, 2020.
[11] J. V. D'silva, F. De Moor, and B. Kemme. AIDA: Abstraction for advanced in-database analytics. Proc. VLDB Endow., 11(11):1400–1413, July 2018.
[12] J. V. D'silva, F. De Moor, and B. Kemme. Keep your host language object and also query it: A case for SQL query support in RDBMS for host language objects. In Carlos Maltzahn and Tanu Malik, editors, Proceedings of the 31st International Conference on Scientific and Statistical Database Management, SSDBM 2019, Santa Cruz, CA, USA, July 23-25, 2019, pages 133–144. ACM, 2019.
[13] W. Gander. Algorithms for the QR-Decomposition. Technical report, ETH Zurich, April 1980.
[14] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 231–242, Washington, DC, USA, 2011. IEEE Computer Society.
[15] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
[16] J. M. Hellerstein, C. Re, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, August 2012.
[17] D. Hutchison, B. Howe, and D. Suciu. LaraDB: A minimalist kernel for linear and relational algebra computation. CoRR, abs/1703.07342, 2017.
[18] Kaggle Inc. BIXI Montreal (public bicycle sharing system). https://www.kaggle.com/aubertsigouin/biximtl, 2019.
[19] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson Prentice Hall, 2007.
[20] S. Luo, Z. J. Gao, M. Gubanov, L. L. Perez, and C. Jermaine. Scalable linear algebra on a relational database system. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 523–534, April 2017.
[21] W. McKinney. pandas: a foundational Python library for data analysis and statistics, 2011.
[22] D. Misev and P. Baumann. Homogenizing data and metadata retrieval in scientific applications. In Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP, DOLAP '15, pages 25–34, New York, NY, USA, 2015. ACM.
[23] MonetDB. Online MonetDB reference. https://www.monetdb.org/Home, 2017.
[24] University of Trier. DBLP computer science bibliography. https://dblp.uni-trier.de, 2019.
[25] Oracle. Online Oracle UTL_NLA Package Reference. https://docs.oracle.com/database/121/ARPLS/u_nla.htm, 2016.
[26] C. Ordonez. Building statistical models and scoring with UDFs. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 1005–1016, New York, NY, USA, 2007. ACM.
[27] R project. The R Project for Statistical Computing. https://www.r-project.org/, 2018.
[28] Quick R. R Matrix Algebra package overview. http://www.statmethods.net/advstats/matrix.html, 2017.
[29] C. Radhakrishna Rao and H. Toutenburg. Linear Models, Least Squares and Alternatives. Springer Series in Statistics. Springer-Verlag New York, 1995.
[30] SciDB, Inc. SciDB User's Guide Version 13.3.6203. In SciDB User's Guide, pages 1–258, 2013.
[31] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The Architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, SSDBM '11, pages 1–16. Springer-Verlag, 2011.
[32] Y. Zhang, H. Herodotou, and J. Yang. RIOT: I/O-efficient numerical computing without SQL. CoRR, abs/0909.1766, 2009.
[33] Y. Zhang, M. Kersten, M. Ivanova, and N. Nes. SciQL: Bridging the Gap Between Science and Relational DBMS. In Proceedings of the 15th Symposium on International Database Engineering & Applications, IDEAS '11. ACM, 2011.
[34] Y. Zhang, M. Kersten, and S. Manegold. SciQL: Array Data Processing Inside an RDBMS. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1049–1052. ACM, 2013.