Tensor Relational Algebra for Distributed Machine Learning System Design
Binhang Yuan†, Dimitrije Jankov†, Jia Zou‡, Yuxin Tang†, Daniel Bourgeois†, Chris Jermaine†
†Rice University; ‡Arizona State University
†{by8, dj16, yuxin.tang, dcb10, cmj4}@rice.edu; ‡[email protected]

ABSTRACT

We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and is amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.

PVLDB Reference Format:
Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, and Chris Jermaine. Tensor Relational Algebra for Distributed Machine Learning System Design. PVLDB, 14(8): 1338-1350, 2021.
doi:10.14778/3457390.3457399

1 INTRODUCTION

In widely-used machine learning (ML) systems such as TensorFlow [4] and PyTorch [3], computations are usually specified operationally: a sequence of mathematical operations such as matrix multiplications, activation functions (ReLU, tanh), convolutions, etc., are applied to a set of input tensors to define a computation. HPC systems such as ScaLAPACK [17] and distributed analytic packages such as Dask [1] offer a similar, operational interface. Operations are "black boxes" in the sense that the internals of the operation are mostly opaque to the system. An operation such as a matrix multiply is not a logical operation that the ML system figures out how to best optimize at runtime. Instead, it is a physical operation that has to run somewhere, on some hardware, via the invocation of a computational kernel.

This operational approach has certain drawbacks, namely that the system has limited latitude for automatic optimization. The programmer is responsible for making sure that the operations can run successfully using available resources (such as the amount of RAM on each GPU), and if the operations cannot run successfully, the programmer must figure out how to break the operations up into smaller pieces that can run. Tasks such as getting a computation to run on multiple GPUs, or on multiple machines in a distributed cluster so as to minimize communication, are left to the programmer.

Toward declarative tensor programming. There has been work on programming models that are more declarative. PyTorch and TensorFlow now both support variants of Einstein notation, a classical notation for specifying operations over tensors, and work on optimizing such computations has made its way into commercial systems [8]. Researchers have proposed variants of the Ricci calculus as a specification language for ML [32]. There have been other proposals for declarative tensor programming languages that allow for automatic generation of compute kernels that can be optimized to handle the data at hand, such as Tensor Comprehensions [43].

Nevertheless, while there has been attention paid to the question of how to specify ML computations declaratively, there has been little attention paid to the question of what the correct implementation abstraction for ML system design should be. That is, what target should an ML system back-end present to the front-end?[1] There are several requirements for such a back-end interface. It should be able to express most/all important ML or numerical computations. It should be hardware agnostic, but computations specified using the interface should be easily scaled to multiple ASICs and multiple machines. It should allow for computations over very large input data. It should facilitate easy, automatic optimization. And it should provide for execution times that are as fast as a carefully-designed, "hand-built" computation on top of the very best tools such as ScaLAPACK, Dask, TensorFlow, or PyTorch.

The tensor relational algebra. In this paper, we argue that the implementation abstraction that should be offered by an ML system back-end is the tensor relational algebra, or TRA for short. The TRA is a simple variant of the classical relational algebra (RA), which serves as the standard implementation abstraction for modern relational database system design. The key difference is that in the TRA, the relations over which computations are performed are always binary relations between k-dimensional vectors (keys) and r-dimensional tensors.
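To make the declarative style mentioned above concrete: in Einstein notation, a matrix multiply names the shared index and the summed-out dimension explicitly, rather than invoking an opaque matmul kernel. A minimal NumPy illustration follows (PyTorch's torch.einsum accepts the same subscript string); this is our own example, not code from the paper:

```python
import numpy as np

# Einstein notation for matrix multiply: C[i, j] = sum over k of A[i, k] * B[k, j].
# The repeated index k is matched between the two operands and then summed out.
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

C = np.einsum('ik,kj->ij', A, B)

assert np.allclose(C, A @ B)  # same result as the built-in matmul kernel
```

The subscript string makes the index structure of the computation visible to the system, which is exactly the information a declarative front-end can hand to an optimizing back-end.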
[1] Throughout the paper, we use the term front-end to refer to the programmer-facing API and compiler; in PyTorch, for example, this would be the part of the system that accepts Einstein notation and transforms it into a set of executable operations. We use the term back-end to refer to the sub-system that actually executes the computation.

Proceedings of the VLDB Endowment, Vol. 14, No. 8. ISSN 2150-8097.

Of course, it is widely understood that a k-dimensional tensor can be stored as a binary relation between k-dimensional keys and real number values, e.g. [25]. Thus, why not use classical RA as the implementation abstraction? There is good evidence that a compute engine based upon such an abstraction will perform very poorly over the dense-tensor computations common in deep learning [35]: the overhead associated with storing each entry in a high-dimensional tensor as a separate tuple and moving billions of tuples through a system can result in poor performance, especially when the competition is a high-performance CPU or GPU kernel. This is why the TRA specifically allows for tensors or "chunks" of a larger tensor to be stored in each tuple: the fixed, per-tuple cost is incurred by a much smaller number of tuples.

We argue that the TRA has three key advantages as an implementation abstraction: expressivity, easy optimization, and high performance. The joins and aggregations offered by the TRA can implement the indexed operations required by the Einstein notation and related specification languages. It is easy to implement and optimize relational operations in a distributed environment, as the decades of success enjoyed by relational database systems have demonstrated. And by design, an ML system back-end implementing the TRA is able to run classical HPC algorithms which rely on decomposing matrices into chunks, and it can run them almost as well as special-purpose HPC software such as ScaLAPACK.

Our Contributions. We propose the TRA as well as an implementation algebra (IA) which is easy to implement in a distributed system. We consider how computations in the TRA can be transformed into computations in the IA, and propose a set of transformations or equivalences that allow re-writes of computations in the IA. Finally, we implement a prototype of the IA, and show that it can enable efficient, distributed implementations of ML computations, which can reach or even significantly outperform existing HPC and ML systems including ScaLAPACK, Dask, TensorFlow, and PyTorch.

2 TRA OVERVIEW

2.1 Motivation

There has been significant interest in tensor manipulation languages, which allow for declarative specification of ML and numerical computations. [...] compiler and auto-differentiation engine that make up the ML system's front-end?

2.2 TRA: The Basics

We propose the TRA as this ML implementation abstraction. The TRA operates over tensor relations containing pairs of the form:

    (key, array).

Conceptually, these tensor relations store sets of arrays. Each key value serves, not surprisingly, as the key for the pair.

Tensors are decomposed into sets of sub-tensors to represent them as tensor relations. For example, consider the matrix A,

    A = [  1   2   5   6 ]
        [  3   4   7   8 ]
        [  9  10  13  14 ]
        [ 11  12  15  16 ]

We may store this as a tensor relation (writing each 2x2 block in row-major, semicolon-separated form)

    R_A = { ( <0,0>, [1 2; 3 4] ),   ( <0,1>, [5 6; 7 8] ),
            ( <1,0>, [9 10; 11 12] ), ( <1,1>, [13 14; 15 16] ) }.

The TRA offers a similar set of operations to the RA: joins, aggregations, selections. It makes an excellent compilation target for tensor manipulation languages like the Einstein notation for several reasons. First, these languages typically match elements in tensors based on shared index values (which can be implemented as relational joins) and then sum out various dimensions (implemented as aggregations). The problem with using the RA as an implementation abstraction for tensors, where tensors are stored relationally as keys identifying a single non-zero value in a tensor, is performance. Tensors can have millions or billions of entries, and it seems impossible to build a back-end with acceptable performance if it has to process one tuple per non-zero value.

Thus, the TRA operates over sub-tensors or "chunks", and expects the ML system front-end to supply a kernel function (typically a high-performance CPU or GPU operation) to operate over the
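To make the chunked representation and the join-plus-aggregation execution pattern concrete, here is a minimal, single-machine Python/NumPy sketch. It is illustrative only: the function names are ours, not the paper's API, and np.matmul stands in for the front-end-supplied kernel. A matrix is decomposed into a tensor relation of (key, block) pairs, and a blocked matrix multiply is expressed as a join on the shared inner chunk index followed by a sum aggregation.

```python
import numpy as np

def to_tensor_relation(mat, block):
    """Decompose a matrix into a tensor relation: a set of
    (key, array) pairs where the key <i, j> locates a block."""
    n, m = mat.shape
    assert n % block == 0 and m % block == 0
    return {
        (i // block, j // block): mat[i:i + block, j:j + block]
        for i in range(0, n, block)
        for j in range(0, m, block)
    }

def tra_matmul(R_A, R_B):
    """Blocked matrix multiply in TRA style: join the tuples of
    R_A and R_B on the shared inner chunk index, apply a compute
    kernel (np.matmul) to each matched pair of blocks, then
    aggregate (sum) over the inner index."""
    out = {}
    for (i, k), a in R_A.items():          # join ...
        for (k2, j), b in R_B.items():
            if k == k2:                    # ... on the shared index k
                blk = a @ b                # one kernel call per match
                if (i, j) in out:
                    out[(i, j)] = out[(i, j)] + blk  # aggregation
                else:
                    out[(i, j)] = blk
    return out

def from_tensor_relation(R):
    """Reassemble a matrix from its tensor relation."""
    rows = 1 + max(i for i, _ in R)
    cols = 1 + max(j for _, j in R)
    return np.block([[R[(i, j)] for j in range(cols)] for i in range(rows)])
```

On the matrix A from the example in Section 2.2, to_tensor_relation(A, 2) yields exactly the four tuples of R_A, with keys (0,0), (0,1), (1,0), (1,1). In the actual TRA the join and aggregation would be distributed relational operators rather than nested loops; this sketch only shows the dataflow, and in particular why the fixed per-tuple cost is paid once per block rather than once per matrix entry.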