In-RDBMS Hardware Acceleration of Advanced Analytics

Divya Mahajan∗ Joon Kyung Kim∗ Jacob Sacks∗ Adel Ardalan† Arun Kumar‡ Hadi Esmaeilzadeh‡ ∗Georgia Institute of Technology †University of Wisconsin-Madison ‡University of California, San Diego {divya mahajan,jkkim,jsacks}@gatech.edu [email protected] {arunkk,hadi}@eng.ucsd.edu

ABSTRACT
The data revolution is fueled by advances in machine learning, databases, and hardware design. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enables hardware acceleration at the intersection of these disjoint fields. This paper sets out to be the initial step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User Defined Function (UDF), expressed as a part of an SQL query using a Python-embedded Domain-Specific Language (DSL). To realize an efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directly interface with the buffer pool of the database. Striders extract, cleanse, and process the training data tuples that are consumed by a multi-threaded FPGA engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL to generate hardware accelerators for a range of real-world and synthetic datasets running diverse ML algorithms. Results show that DAnA-enhanced PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets, with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average, 4.0x faster than the multi-threaded Apache MADlib running on Greenplum. DAnA provides these benefits while hiding the complexity of hardware design from data scientists and allowing them to express the algorithm in approximately 30-60 lines of Python.

PVLDB Reference Format:
Divya Mahajan, Joon Kyung Kim, Jacob Sacks, Adel Ardalan, Arun Kumar, and Hadi Esmaeilzadeh. In-RDBMS Hardware Acceleration of Advanced Analytics. PVLDB, 11(11): 1317-1331, 2018.
DOI: https://doi.org/10.14778/3236187.3236188

Figure 1: DAnA represents the fusion of three research directions (enterprise in-database analytics such as Bismarck [7], Glade [8], Centaur [3], and DoppioDB [4]; modern acceleration platforms such as TABLA [5] and Catapult [6]; and analytical programming paradigms), in contrast with prior works [3-5, 7-9] that merge two of the areas.

1. INTRODUCTION
Relational Database Management Systems (RDBMSs) are the cornerstone of large-scale data management in almost all major enterprise settings. However, data-driven applications in such environments are increasingly migrating from simple SQL queries towards advanced analytics, especially machine learning (ML), over large datasets [1, 2]. As illustrated in Figure 1, there are three concurrent and important, but hitherto disconnected, trends in this data systems landscape: (1) enterprise in-database analytics [3, 4], (2) modern hardware acceleration platforms [5, 6], and (3) programming paradigms which facilitate the use of analytics [7, 8].

The database industry is investing in the integration of ML algorithms within RDBMSs, both on-premise and cloud-based [10, 11]. This integration enables enterprises to exploit ML without sacrificing the auxiliary benefits of an RDBMS, such as transparent scalability, access control, security, and integration with their business intelligence interfaces [7, 8, 12-18]. Concurrently, the computer architecture community is extensively studying the integration of specialized hardware accelerators within the traditional compute stack for ML applications [5, 9, 19-21]. Recent work at the intersection of databases and computer architecture has led to a growing interest in hardware acceleration for relational queries as well. This includes exploiting GPUs [22] and reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs) [3, 4, 23-25], for relational operations. Furthermore, cloud service providers like Amazon AWS [26], Microsoft Azure [27], and Google Cloud [28] are also offering high-performance specialized platforms due to the potential gains from modern hardware. Finally, the applicability and practicality of both in-database analytics and hardware acceleration hinge upon exposing a high-level interface to the user. This triad of research areas is currently studied in isolation and evolving independently. Little work has explored the impact of moving analytics within databases on the design, implementation, and integration of hardware accelerators. Unification of these research directions can help mitigate the inefficiencies and reduced productivity of data scientists who can benefit from in-database hardware acceleration for analytics. Consider the following example.

Example 1. A marketing firm uses the Amazon Web Services (AWS) Relational Data Service (RDS) to maintain a PostgreSQL database of its customers. A data scientist in that company forecasts the hourly ad serving load by running a multi-regression model across a hundred features available in their data. Due to large training times, she decides to accelerate her workload using FPGAs on Amazon EC2 F1 instances [26]. Currently, this requires her to learn a hardware description language, such as Verilog or VHDL, program the FPGAs, and go through the painful process of hardware design, testing, and deployment, individually for each ML algorithm. Recent research has developed tools to simplify FPGA acceleration for ML algorithms [19, 29, 30]. However, these solutions do not interface with or support RDBMSs, requiring her to manually extract, copy, and reformat her large dataset.

To overcome the aforementioned roadblocks, we devise DAnA, a cohesive stack that enables deep integration between FPGA acceleration and in-RDBMS execution of advanced analytics. DAnA exposes a high-level programming interface for data scientists/analysts based on conventional languages, such as SQL and Python. Building such a system requires: (1) providing an intuitive programming abstraction to express the combination of ML algorithm and required data schemas; and (2) designing a hardware mechanism that transparently connects the FPGA accelerator to the database engine for direct access to the training data pages.

To address the first challenge, DAnA enables the user to express RDBMS User-Defined Functions (UDFs) using familiar practices of Python and SQL. The user provides their ML algorithm as an update rule using a Python-embedded Domain Specific Language (DSL), while an SQL query specifies data management and retrieval. To convert this high-level ML specification into an accelerated execution without manual intervention, we develop a comprehensive stack. Thus, DAnA is a solution that breaks the algorithm-data pair into software execution on the RDBMS for data retrieval and hardware acceleration for running the analytics algorithm.

With respect to the second challenge, DAnA offers Striders, which avoid the inefficiencies of conventional Von-Neumann CPUs for data handoff by seamlessly connecting the RDBMS and FPGA. Striders directly feed the data to the analytics accelerator by walking the RDBMS buffer pool. Circumventing the CPU alleviates the cost of data transfer through the traditional memory subsystem. These Striders are backed with an Instruction Set Architecture (ISA) to ensure programmability and the ability to cater to the variations in database page organization and tuple length across different algorithms and training datasets. They are designed to ensure multi-threaded acceleration of the learning algorithm to amortize the cost of data accesses across concurrent threads. DAnA automatically generates the architecture of these accelerator threads, called execution engines, that selectively combine a Multi-Instruction Multi-Data (MIMD) execution model with Single-Instruction Multi-Data (SIMD) semantics to reduce the instruction footprint. While generating this MIMD-SIMD accelerator, DAnA tailors its architecture to the ML algorithm's computation patterns, RDBMS page format, and available FPGA resources. As such, this paper makes the following technical contributions:
• Merges three disjoint research areas to enable transparent and efficient hardware acceleration for in-RDBMS analytics. Data scientists with no hardware design expertise can use DAnA to harness hardware acceleration without manual data retrieval and extraction whilst retaining familiar programming environments.
• Exposes a high-level programming interface, which combines SQL UDFs with a Python DSL, to jointly specify training data and computation. This unified abstraction is backed by an extensive compilation workflow that automatically transforms the specification to an accelerated execution.
• Integrates an FPGA and an RDBMS engine through Striders, which are novel on-chip interfaces. Striders bypass the CPU to directly access the training data from the buffer pool, transfer this data onto the FPGA, and unpack the feature vectors and labels.
• Offers a novel execution model that fuses thread-level and data-level parallelism to execute the learning algorithm computations. This model exposes a domain specific instruction set architecture that offers automation while providing efficiency.

We prototype DAnA with PostgreSQL to automatically accelerate the execution of several popular ML algorithms. Through a comprehensive experimental evaluation using real-world and synthetic datasets, we compare DAnA against the popular in-RDBMS ML toolkit, Apache MADlib [15], on both PostgreSQL and its parallel counterpart, Greenplum. Using a Xilinx UltraScale+ VU9P FPGA, we observe that DAnA generated accelerators provide on average 8.3x and 4.0x end-to-end runtime speedups over PostgreSQL and Greenplum running MADlib, respectively. An average of 4.6x of the speedup benefits are obtained through Striders, as they effectively bypass the CPU and its memory subsystem overhead.

2. BACKGROUND
Before delving into the details of DAnA, this section discusses the properties of ML algorithms targeted by our holistic framework.

2.1 Iterative Optimization and Update Rules
During training, a wide range of supervised machine learning algorithms go through a cyclic process that demands constant iteration, tuning, and improvement. These algorithms use optimization procedures that iteratively minimize a loss function -- distinct for each learning algorithm -- by using one tuple (input-output pair) at a time to generate updates for the learning model. Each ML algorithm has a specific loss function that mathematically captures the measure of the learning model's error. Improving the model corresponds to minimizing this loss function using an update rule, which is applied repeatedly over the training model, one training data tuple (input-output pair) at a time, until convergence.

Example. Given a set of N pairs {(x_1, y_1^*), ..., (x_N, y_N^*)} constituting the training data, the goal is to find a hypothesis function h(w^(t), x) that can accurately map x -> y. The equation below specifies an entire update rule, where l(w^(t), x_i, y_i^*) is the loss function that signifies the error between the output y_i^* and the predicted output estimated by the hypothesis function h(w^(t), x_i) for input x_i:

    w^(t+1) = w^(t) - mu * ( d l(w^(t), x_i, y_i^*) / d w^(t) )        (1)

For each (x, y^*) pair, the goal is to find a model (w) that minimizes the loss function l(w^(t), x_i, y_i^*) using an iterative update rule. While the hypothesis (y = h(w, x)) and loss function vary substantially across different ML algorithms, the optimization algorithm that iteratively minimizes the loss function remains fixed. As such, the two required components are the hypothesis function to define the machine learning algorithm and an optimization algorithm that iteratively applies the update rule.

Amortizing the cost of data accesses by parallelizing the optimization. In Equation (1), a single (x_i, y_i^*) tuple is used to update the model. However, it is feasible to use a batch of tuples and compute multiple updates independently when the optimizer supports combining partial updates [31-37]. This offers a unique opportunity for DAnA to rapidly consume data pages brought on-chip by Striders while efficiently utilizing the large, ever-growing amounts of compute resources available on FPGAs through simultaneous multi-threaded acceleration.
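To make Equation (1) and its batched variant concrete, the following Python/NumPy sketch implements the generic optimization loop for an assumed squared loss. The function names and the choice of loss are purely illustrative and are not part of DAnA, which generates this computation in hardware from the DSL specification.

# Minimal software sketch of the iterative optimization in Equation (1),
# using a squared loss and plain NumPy. Names are illustrative only.
import numpy as np

def squared_loss_gradient(w, x, y_star):
    # Gradient of l(w, x, y*) = 0.5 * (w.x - y*)^2 with respect to w.
    return (np.dot(w, x) - y_star) * x

def train(X, y, mu=0.01, epochs=10, batch=8):
    # X: N x d matrix of inputs, y: N-vector of labels.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for start in range(0, len(y), batch):
            xs, ys = X[start:start + batch], y[start:start + batch]
            # Each tuple's partial update is computed independently
            # (the parallel threads in DAnA) and then merged.
            grads = [squared_loss_gradient(w, xi, yi) for xi, yi in zip(xs, ys)]
            w = w - mu * np.mean(grads, axis=0)   # w(t+1) = w(t) - mu * dl/dw
    return w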

Figure 2: Overview of DAnA, which integrates FPGA acceleration with the RDBMS engine. The Python-embedded DSL is an interface to express the ML algorithm that is converted to a hardware architecture and its execution schedules (stored in the RDBMS catalog). The RDBMS engine fills the buffer pool. FPGA Striders directly access the data pages to extract the tuples and feed them to the threads. Shaded areas show the entangled components of RDBMS and FPGA working in tandem to accelerate in-database analytics.

Examples of commonly used iterative optimization algorithms that support parallel iterations are variants of gradient descent methods, which can be applied across a diverse range of ML models. DAnA is equipped to accelerate the training phase of any hypothesis and objective function that can be minimized using such iterative optimization. Thus, the user simply provides the update rule via the DAnA DSL described in §4.1.

In addition to providing a background on the properties of machine learning algorithms targeted by DAnA, Appendix A in our tech report (http://act-lab.org/artifacts/dana/addendum.pdf) provides a brief overview of Field Programmable Gate Arrays (FPGAs). It provides details about the reconfigurability of FPGAs and how they offer a potent solution for hardware acceleration.

2.2 Insights Driving DANA
Database and hardware interface considerations. To obtain large benefits from hardware acceleration, the overheads of a traditional Von-Neumann architecture and memory subsystem need to be avoided. Moreover, data accesses from the buffer pool need to be at large enough granularities to efficiently utilize the FPGA bandwidth. DAnA satisfies these criteria through Striders, its database-aware reconfigurable memory interface, discussed in §5.1.

Algorithmic considerations. The training data retrieved from the buffer pool and stored on-chip must be consumed promptly to avoid throttling the memory resources on the FPGA. DAnA achieves this by leveraging the algorithmic properties of iterative optimization to execute multiple instances of the update rule. The Python-embedded DSL provides a concise means of expressing this update rule for a broad set of algorithms while facilitating parallelization.

DAnA leverages these insights to provide a cross-stack solution that generates FPGA-synthesizable accelerators that directly interface with the RDBMS engine's buffer pool. The next section provides an overview of DAnA.

3. DANA WORKFLOW
Figure 2 illustrates DAnA's integration within the traditional software stack of data management systems. With DAnA, the data scientist specifies her desired ML algorithm as a UDF using a simple DSL integrated within Python. DAnA performs static analysis and compilation of the Python functions to program the FPGA with a high-performance, energy-efficient hardware accelerator design. The hardware design is tailored to both the ML algorithm and the page specifications of the RDBMS engine. To run the hardware accelerated UDF on her training data, the user provides a SQL query. DAnA stores accelerator metadata (Strider and execution engine instruction schedules) in the RDBMS's catalog along with the name of a UDF to be invoked from the query. As shown in Figure 2, the RDBMS catalog is shared by the database engine and the FPGA. The RDBMS parses, optimizes, and executes the query while treating the UDF as a black box. During query execution, the RDBMS fills the buffer pool, from which DAnA ships the data pages to the FPGA for processing. DAnA and the RDBMS engine work in tandem to generate the appropriate data stream, data route, and accelerator design for the {ML algorithm, database page layout, FPGA} triad. Each component of DAnA is briefly described below.

Programming interface. The front end of DAnA exposes a Python-embedded DSL (discussed in §4.1) to express the ML algorithm as a UDF. The UDF includes an update rule that specifies how each tuple or record in the training data updates the ML model. It also expects a merge function that specifies how to process multiple tuples in parallel and aggregate the resulting ML models. DAnA's DSL constitutes a diverse set of operations and data types that cater to a wide range of advanced analytics algorithms. Any legitimate combination of these operations can be automatically converted to a final synthesizable FPGA accelerator.

Translator. The user-provided UDF is converted into a hierarchical DataFlow Graph (hDFG) by DAnA's parser, discussed in detail in §4.4. Each node in the hDFG represents a mathematical operation allowed by the DSL, and each edge is a multi-dimensional vector on which the operations are performed. The information in the hDFG enables DAnA's backend to optimally customize the reconfigurable architecture and to schedule and map each operation for a high-performance execution.

Strider-based customizable machine learning architecture. To target a wide range of ML algorithms, DAnA offers a parametric reconfigurable hardware design solution that is hand optimized by expert hardware designers, as described in §5. The hardware interfaces with the database engine through a specialized structure, called Striders, that extract the training data and provide it for high-performance, low-energy computation. Striders eliminate the CPU from the data transformation process by directly interfacing with the database's buffer pool to extract the training data pages. They process data at a page granularity to amortize the cost of per-tuple data transfer from memory to the FPGA. To exploit this vast amount of data available on-chip, the architecture is equipped with execution engines that run multiple parallel instances of the update rule. This architecture is customized by DAnA's compiler and hardware generator in accordance with the FPGA specifications, database page layout, and the analytics function.

Instruction Set Architectures. Both Striders and the execution engine can be programmed using their respective Instruction Set Architectures (ISAs). The Strider instructions process page headers and tuple headers, and extract the raw training data from a database page. Different page sizes and page layouts can be targeted using this ISA. The execution engine's ISA describes the operation flow required to run the analytics algorithm in selective SIMD mode.

Compiler and hardware generator. DAnA's compiler and hardware generator ensure compatibility between the hDFG and the hardware accelerator. For the given hDFG and FPGA specifications (such as the number of DSP slices and BRAMs), the hardware generator determines the parameters for the execution engine and Striders to generate the final FPGA-synthesizable accelerator. The compiler converts the database page configuration into a set of Strider instructions that process the page and tuple headers and transform the user data into a floating point format. Additionally, the compiler generates a static schedule for the accelerator, a map of where each operation is performed, and execution engine instructions.

As described above, providing flexibility and reconfigurability of hardware accelerators for advanced analytics is a challenging but pertinent problem. DAnA is a multifaceted solution that untangles these challenges one by one.

4. FRONT-END INTERFACE FOR DANA
DAnA's DSL provides an entry point for data scientists to exploit hardware acceleration for in-RDBMS advanced analytics. This section elaborates on the constructs and features of the DSL and how they can be used to train a wide range of learning algorithms for advanced analytics. This section also explains how a UDF defined in this DSL is translated into an intermediate representation, in this case a hierarchical DataFlow Graph (hDFG).

4.1 Programming For DANA
DAnA exposes a high-level DSL for database users to express their learning algorithm as a UDF. Embedding this DSL within Python allows support for intricate update rules using a framework familiar to database users whilst not requiring a full language compiler. This DSL meets the following objectives:
1. Incorporates language constructs commonly seen in a wide class of supervised learning algorithms.
2. Supports expression of any iterative update rule, not just variants of gradient descent, whilst conforming to the DSL constructs.
3. Segregates algorithmic specification from hardware-dependent implementation.
The language constructs of this DSL -- components, data declarations, mathematical operations, and built-in functions -- are summarized in Table 1. Users express the learning algorithm using these constructs and provide the (1) update rule, to decide how each tuple in the training data updates the model; (2) merge function, to specify the combination of distinct parallel update rule threads; and (3) terminator, to describe convergence.

Table 1: Language constructs of DAnA's Python-embedded DSL.
Type                     | Keyword                     | Description
Component                | algo                        | To specify an instance of the learning algorithm
Data Types               | input                       | Algorithm input
                         | output                      | Algorithm output
                         | model                       | Machine learning model
                         | inter                       | Interim data type
                         | meta                        | Meta parameters
Mathematical Operations  | +, -, *, /, >, <            | Primary operations
                         | sigmoid, gaussian, sqrt     | Non-linear operations
                         | sigma, norm, pi             | Group operations
Built-In Special         | merge(x, int, "operation")  | Specify merge operation and number of merge instances
Functions                | setEpochs(int)              | Set the maximum number of epochs
                         | setConvergence(x)           | Specify the convergence criterion
                         | setModel(x)                 | Set the model variable

4.2 Language Constructs
Data declarations. Data declarations delineate the semantics of the data types used in the ML algorithm. The DSL supports the following data declarations: input, output, inter, model, and meta. Each variable can be declared by specifying its type and dimensions. A variable is an implied scalar if no dimensions are specified. Once the analyst imports the dana package, she can express the required variables. The code snippet below declares a multi-dimensional ML model of size [5][2] using the dana.model construct.
mo = dana.model([5][2])
In addition to dana.model, the user can provide dana.input and dana.output to express a single input-output pair in the training dataset. The user can specify meta variables using dana.meta, the value of which remains constant throughout execution. As such, meta variables can be directly sent to the FPGA before algorithm execution. All variables used for a particular algorithm are linked to an algo construct.
algorithm = dana.algo(mo, in, out)
The algo component allows the user to link together the three functions -- update rule, merge, and terminator -- of a single UDF. Additionally, the analyst can use untyped intermediate variables, which are automatically labeled as dana.inter by DAnA's backend.

Mathematical operations. The DSL supports mathematical operations performed on both declared and untyped intermediate variables. Primary and non-linear operations, such as *, +, ..., sigmoid, only require the operands as input. The dimensionality of the operation and its output is automatically inferred by DAnA's translator (as discussed in §4.4) in accordance with the operands' dimensions. Group operations, such as sigma, pi, and norm, perform computation across elements. Sigma refers to summation, pi indicates the product operator, and norm calculates the magnitude of a multidimensional vector. Group operations require the input operands and the grouping axis, which is expressed as a constant and alleviates the need to explicitly specify loops. The underlying primary operation is performed on the input operands prior to grouping.

Built-in functions. The DSL provides four built-in functions to specify the merge condition, set the convergence criterion, and link the updated model variable to the algo component. The merge(x, int, "op") function is used to specify how multiple threads of the update rule are combined. Convergence is dictated either by specifying a fixed number of epochs (1 epoch is a single pass over the entire training data set) or a user-specified condition. Function setEpochs(int) sets the number of terminating epochs and setConvergence(x) frames termination based on a boolean variable x. Finally, the setModel(x) function links a DAnA variable (the updated model) to the corresponding algo component.

All the described language constructs are supported by DAnA's reconfigurable architecture, hence they can be synthesized on an FPGA. An example usage of these constructs to express the update rule, merge function, and convergence for the linear regression algorithm running the gradient descent optimizer is provided below.

4.3 Linear Regression Example
Update rule. As the code snippet below illustrates, the data scientist first declares the different data types and their corresponding dimensions. Then she defines the computations performed over these variables specific to linear regression.
#Data Declarations
mo = dana.model([10])
in = dana.input([10])
out = dana.output()
lr = dana.meta(0.3) #learning rate
linearR = dana.algo(mo, in, out)

#Gradient or Derivative of the Loss Function
s = sigma(mo * in, 1)
er = s - out
grad = er * in
#Gradient Descent Optimizer
up = lr * grad
mo_up = mo - up
linearR.setModel(mo_up)

Figure 3: Translator-generated hDFG for the linear regression code snippet expressed in DAnA's DSL: (a) code snippet and (b) hierarchical DFG.

In this example, the update rule uses the gradient of the loss function. The gradient descent optimizer updates the model in the negative direction of the loss function derivative (d l / d w^(t)). The analyst concludes with the setModel() function to identify the updated model, in this case mo_up.

Merge function. The merge function facilitates multiple concurrent threads of the update rule on the FPGA accelerator by specifying the functionality at the point of merge.
merge_coef = dana.meta(8)
grad = linearR.merge(grad, merge_coef, "+")
In the above merge function, the intermediate grad variable has been combined using addition, and the merge coefficient (merge_coef) specifies the batch size. DAnA's compiler implicitly understands that the merge function is performed before the gradient descent optimizer. Specifically, the grad variable is calculated separately for each tuple per batch. The results are aggregated together across the batches and used to update the model. Alternatively, partial model updates for each batch could be merged:
merge_coef = dana.meta(8)
m1 = linearR.merge(mo_up, merge_coef, "+")
m2 = m1/merge_coef
linearR.setModel(m2)
The mo_up is calculated by each thread for the tuples in its batch separately and consecutively averaged. Thus, DAnA's DSL provides the flexibility to create different learning algorithms without requiring any modification to the update rule by specifying different merge points. In the above example, the first definition of the merge function creates a linear regression running a batched gradient descent optimizer, whereas the second definition corresponds to a parallelized stochastic gradient descent optimizer.
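For reference, the snippet below is a plain NumPy rendering of the same update rule and additive merge. It is only meant to clarify the semantics of the DSL code above; it does not reflect how DAnA executes the UDF, which is compiled to the FPGA accelerator.

# Illustrative NumPy equivalent of the linear regression update rule and
# the "+" merge over a batch of merge_coef tuples (not part of DAnA).
import numpy as np

def update_rule(mo, x, out):
    s = np.sum(mo * x)          # s    = sigma(mo * in, 1)
    er = s - out                # er   = s - out
    return er * x               # grad = er * in

def batched_step(mo, X, outs, lr=0.3, merge_coef=8):
    # grad is computed per tuple (one per thread) and merged with "+",
    # mirroring grad = linearR.merge(grad, merge_coef, "+").
    merged = np.zeros_like(mo)
    for x, out in zip(X[:merge_coef], outs[:merge_coef]):
        merged += update_rule(mo, x, out)
    up = lr * merged            # up    = lr * grad
    return mo - up              # mo_up = mo - up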
Convergence function. The user also provides the termination criteria. As shown in the code snippet below, the convergence checks the conv variable, which, if true, terminates the training. Variable conv compares the Euclidean norm of grad with a convergence factor constant.
convergenceFactor = dana.meta(0.01)
n = norm(grad, i)
conv = n < convergenceFactor
linearR.setConvergence(conv)
Alternatively, the number of epochs can be used for convergence using the syntax linearR.setEpochs(10000).

Query. A UDF comprising the update rule, merge function, and convergence check describes the entire analytics algorithm. The linearR UDF can then be called within a query as follows:
SELECT * FROM dana.linearR('training_data_table');
Currently, for high efficiency and low latency, DAnA's DSL and compiler do not support dynamic variables, as the FPGA and CPU do not interchange runtime values and only interact for data handoff. DAnA only supports variable types which either have been explicitly instantiated as DAnA's data declarations, or inferred as intermediate variables (dana.inter) by DAnA's translator. As such, this Python-embedded DSL provides a high-level programming abstraction that can easily be invoked by an SQL query and extended to incorporate algorithmic advancements. In the next section we discuss the process of converting this UDF into an hDFG.

4.4 Translator
DAnA's translator is the front-end of the compiler, which converts the user-provided UDF to a hierarchical DataFlow Graph (hDFG). The hDFG represents the coalesced update rule, merge function, and convergence check whilst maintaining the data dependencies. Each node of the hDFG represents a multi-dimensional operation, which can be decomposed into smaller atomic sub-nodes. An atomic sub-node is a single operation performed by the accelerator. The hDFG transformation for the linear regression example provided in the previous section is shown in Figure 3.

The aim of the translator is to expose as much parallelism available in the algorithm as possible to the remainder of the DAnA workflow. This includes parallelism within a single instance of the update rule and among different threads, each running a version of the update rule. To accomplish this, the translator (1) maintains the function boundaries, especially between the merge function and parallelizable portions of the update rule, and (2) automatically infers the dimensionality of nodes and edges in the graph.

The merge function and convergence criteria are performed once per epoch. In Figure 3b, the colored node represents the merge operation that combines the gradients generated by separate instances of the update rule. These update rule instances are run in parallel and consume different records or tuples from the training data; thus, they can be readily parallelized across multiple threads. To generate the hDFG, the translator first infers the dimensions of each operation node and its output edge(s). For basic operations, if both inputs have the same dimensions, the node translates into an element-by-element operation in the hardware. In case the inputs do not have the same dimensions, the input with the lower dimension is logically replicated, and the generated output possesses the dimensions of the larger input. Nonlinear operations have a single input that determines the output dimensions. For group operations, the output dimension is determined by the axis constant. For example, a node performing sigma(mo * in, 2), where variables mo and in are matrices of sizes [5][10] and [2][10], respectively, generates a [5][2] output.

The information captured within the hDFG allows the hardware generator to configure the accelerator architecture to optimally cater for its operations. Resources available on the FPGA are distributed on-demand within and across multiple threads. Furthermore, DAnA's compiler maps all the operations to the accelerator architecture to exploit fine-grained parallelism within an update rule. Before delving into the details of hardware generation and compilation, we discuss the reconfigurable architecture for the FPGA (Strider and execution engine).
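The group-operation example above can be checked with a few lines of NumPy. The einsum formulation is one way to read sigma(mo * in, 2) and is shown only to illustrate the inferred output dimensions; it is not DAnA's implementation.

# Shape check for the example: sigma(mo * in, 2) with mo = [5][10] and
# in = [2][10] contracts the shared axis of length 10 and yields [5][2]
# (equivalent to mo @ in.T in this case).
import numpy as np
mo = np.random.rand(5, 10)
inp = np.random.rand(2, 10)
result = np.einsum('ik,jk->ij', mo, inp)   # sum over the axis of length 10
assert result.shape == (5, 2)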

5. HARDWARE DESIGN FOR IN-DATABASE ACCELERATION
DAnA employs a parametric accelerator architecture comprising a multi-threaded access engine and a multi-threaded execution engine, shown in Figure 4. Both engines have their respective custom Instruction Set Architectures (ISAs) to program their hardware designs. The access engine harbors Striders to ensure compatibility between the data stored in a particular database engine and the execution engines that perform the computations required by the learning algorithm. The access and execution engines are configured according to the page layout and UDF specification, respectively. The details of each of these components are discussed below.

Figure 4: Reconfigurable accelerator design in its entirety. The access engine reads and processes the data via its Striders, while the execution engine operates on this data according to the UDF.

5.1 Access Engine and Striders
5.1.1 Architecture and Design
The multi-threaded access engine is responsible for storing pages of data and converting them from a database page format to raw numbers that are processed by the execution engine. Figure 5 shows a detailed diagram of this access engine. The access engine uses the Advanced eXtensible Interface (AXI) to transfer the data to and from the FPGA, the shifters properly align the data, and the Striders unpack the database pages. The AXI interface is a type of Advanced Microcontroller Bus Architecture open-standard, on-chip interconnect specification for system-on-a-chip (SoC) designs. It is vendor agnostic and standardized across different hardware platforms. The access engine uses this interface to transfer uncompressed database pages to page buffers and configuration data to configuration registers. Configuration data comprises Strider and execution engine instructions and necessary meta-data. Both the training data in the database pages and the configuration data are passed through a shifter for alignment, according to the read width of the block RAM on the target FPGA. A separate channel for configuration data incorporates a finite state machine to dictate the route and destination of the configuration information.

To amortize the cost of data transfer and avoid suboptimal usage of the FPGA bandwidth, the access engine and Striders process database data at a page-level granularity. Training data is written to multiple page buffers, where each buffer stores one database page at a time and has access to its personal Strider. Alternatively, each tuple could have been extracted from the page by the CPU and sent to the FPGA for consumption by the execution engine. This approach would fail to exploit the bandwidth available on the FPGA, as only one tuple would be sent at a time. Furthermore, using the CPU for data extraction would have a significant overhead due to the handshaking between CPU and FPGA. Offloading tuple extraction to the accelerator using Striders provides a unique opportunity to dynamically interleave unpacking of data in the access engine and processing it in the execution engine.

It is common for data to be spread across pages, where each page requires plenty of pointer chasing. Two tuples cannot be simultaneously processed from a single page buffer, as the location of one could depend on the previous. Therefore, we store multiple pages on the FPGA and parallelize data extraction from the pages across their corresponding Striders. For every page, the Strider first processes the page header and extracts the necessary information about the page and stores it in the configuration registers. The information includes offsets, such as the beginning and size of each tuple, which is either located or computed from the data in the header. This auxiliary page information is used to trace the tuple addresses and read the corresponding data from the page buffer. After each page buffer, the shifter ensures alignment of the tuple data for the Strider. From the tuple data, its header is processed to extract and route the training data to the execution engine. The number of Striders and database pages stored on-chip can be adjusted according to the BRAM storage available on the target FPGA. The internal workings of the Strider are dictated by its instructions, which depend on the page layout and page size of the target RDBMS. We next discuss the novel ISA to program these Striders.

5.1.2 Instruction Set Architecture for Striders
We devise a novel fixed-length Instruction Set Architecture (ISA) for the Striders that can target a range of RDBMS engines, such as PostgreSQL and MySQL (InnoDB), that have similar backend page layouts. An uncompressed page from these RDBMSs, once transferred to the page buffers, can be read, extracted, and cleansed using this ISA, which comprises lightweight instructions specialized for pointer chasing and data extraction. Each Strider is programmed with the same instructions but operates on different pages. These instructions are generated statically by the compiler.

Table 2 illustrates the instructions of this ISA. Every instruction is 22 bits long, comprising a unique operation code (opcode, bits 21-18) as identification. The remaining bits (17-12, 11-6, 5-0) are specific to the opcode.

Table 2: Strider ISA to read, extract, and clean the page data.
Instruction   | Code   | Opcode | Operand fields (bits 17-0)
Read Bytes    | readB  | 0      | Read Address, # of Bytes, Write Address
Extract Bytes | extrB  | 1      | Byte Offset, # of Bytes, Write Address
Write Bytes   | writeB | 2      | Read Address, # of Bytes, Write Address
Extract Bits  | extrBi | 3      | Start Location, # of Bits, Offset
Clean         | cln    | 4      | Start Location, # of Bits, Offset
Insert        | ins    | 5      | Start Location, # of Bits, Reserved
Add           | ad     | 6      | Read Address 1, Read Address 2 / Immediate Operand
Subtract      | sub    | 7      | Read Address 1, Read Address 2 / Immediate Operand
Multiply      | mul    | 8      | Read Address 1, Read Address 2 / Immediate Operand
Branch Enter  | bentr  | 9      | Condition, Value
Branch Exit   | bexit  | 10     | Condition, Value

The Read Bytes and Write Bytes instructions are responsible for reading and writing data from the page buffer, respectively. The ISA provides the flexibility to extract data at byte and bit granularity using the Extract Byte and Extract Bit instructions. The Clean instruction can remove parts of the data not required by the execution engine. Conversely, the Insert instruction can add bits to the data, such as NULL characters and auxiliary information, which is particularly useful when the page is to be written back to memory. Basic math operations, Add, Subtract, and Multiply, allow calculation of tuple sizes, byte offsets, etc. Finally, the bentr and bexit branch instructions are used to specify jumps or loop exits, respectively. This feature invariably reduces the instruction footprint, as repeated patterns can be succinctly expressed using branches, while enabling loop exits that depend on a dynamic runtime variable.
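As a point of reference, the following Python sketch emulates in software the page walk that a Strider program performs: parse the page header, pick up the first tuple pointer, then iterate over fixed-size tuples until the free space is reached. The header offsets and field widths are simplified placeholders rather than the exact PostgreSQL page format, and the function is illustrative only.

# Software analogue of a Strider page walk over a simplified, PostgreSQL-like
# page layout (header, tuple pointers, tuple data). Offsets are illustrative.
import struct

def walk_page(page: bytes, tuple_header_len: int):
    # "Page header processing": recover free-space bound and first tuple info.
    free_space_start, _free_space_end = struct.unpack_from('<HH', page, 0)
    first_off, tuple_len = struct.unpack_from('<HH', page, 4)
    # "Tuple extraction and processing": fixed-size tuples, so later offsets
    # are computed by adding the tuple length (the ad/readB/extrB/cln loop).
    off = first_off
    while off + tuple_len <= free_space_start:   # bexit condition
        raw = page[off:off + tuple_len]
        yield raw[tuple_header_len:]             # cln: drop the tuple header
        off += tuple_len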


Figure 5: Access engine design uses Striders as the main interface between the RDBMS and execution engines. Uncompressed data pages are read from the buffer pool and stored in on-chip page buffers. Each page has a corresponding Strider to extract the tuple data.

Figure 6: Sample page layout similar to PostgreSQL.

An example page layout representative of PostgreSQL and MySQL is illustrated in Figure 6. Such layouts are divided into a page header, tuple pointers, and tuple data, and can be processed using the following assembly code snippet written in the Strider ISA.
\\Page Header Processing
readB 0, 8, %cr
readB 8, 2, %cr
readB 10, 4, %cr
extrB %cr, 2, %cr
\\Tuple Pointer Processing
readB %cr, 4, %treg
extrB 0, 1, %cr
extrB 1, 1, %treg
\\Tuple extraction and processing
bentr
ad %treg, %treg, 0
readB %treg, %cr, %treg
extrB %treg, %cr, %treg
cln %treg, %cr, 2
bexit 1, %treg, %cr
Each line in the assembly code is identified by its instruction name (opcode) and its corresponding fields. The first four assembly instructions process the page header to obtain the configuration information. For example, the (readB 0, 8, %cr) instruction reads 8 bytes from address 0 in the page buffer and adds this page size information into a configuration register. Each variable shown as %(reg) corresponds to an actual Strider hardware register: %cr is a configuration register, and %t is a temporary register. Next, the first tuple pointer is read to extract the byte-offset and length (bytes) of the tuple. Only the first tuple pointer is processed, as all the training data tuples are expected to be identical. Each corresponding tuple is processed by adding the tuple size to the previous offset to generate the page address. This address is used to read the data from the page buffer, which is then cleansed by removing its auxiliary information. The above step is repeated for each tuple using the bentr and bexit instructions. The loop is exited when the tuple offset address reaches the free space in the page. Finally, cleaned data is sent to the execution engines.

5.2 Execution Engine Architecture
The execution engines execute the hDFG of the user-provided UDF using the Strider-processed training data pages. More and more database pages can now be stored on-chip, as BRAM capacity is rapidly increasing with new FPGAs such as the Arria 10, which offers 7 MB, and the UltraScale+ VU9P, with 44 MB of memory. Therefore, the execution engine needs to furnish enough computational resources to process this copious amount of on-chip data. Our reconfigurable execution engine architecture can run multiple threads of parallel update rules for different data tuples. This architecture is backed by a Variable Length Selective SIMD ISA that aims to exploit both regular and irregular parallelism in ML algorithms whilst providing the flexibility for each component of the architecture to run independently.

Reconfigurable compute architecture. All the threads in the execution engine are architecturally identical and perform the same computations on different training data tuples. DAnA balances the resources allocated per thread vs. the number of threads to ensure high performance for each algorithm. The hardware generator of DAnA (discussed in §6.1) determines this division by taking into account the parallelism in the hDFG, the number of compute resources available on chip, and the number of Striders/page buffers that can fit on the on-chip BRAM. The architecture of a single thread is a hierarchical design comprising analytic clusters (ACs) composed of multiple analytic units (AUs). As discussed below, the AC architecture is designed while keeping in mind the algorithmic properties of multi-threaded iterative optimizations, and the AU caters to commonly seen compute operations in data analytics.

Analytic cluster. An Analytic Cluster (AC), shown in Figure 7a, is a collection of AUs designed to reduce the data transfer latency between them. Thus, hDFG nodes which exhibit high data dependencies are all scheduled to a single cluster. In addition to providing greater connectivity among the AUs within an AC, the cluster serves as the control hub for all its constituent AUs. The AC runs in a selective SIMD mode, where the AC specifies which AUs within a cluster perform an operation. Each AU within a cluster is expected to execute either a cluster-level instruction (add, subtract, multiply, divide, etc.) or a no-operation (NOP). Finer details about the source type, source operands, and destination type can be stored in each individual AU for additional flexibility. This collective instruction technique simplifies the AU design, as each AU no longer requires a separate controller to decode and process the instruction. Instead, the AC controller processes the instruction and sends control signals to all the AUs. When the designated AUs complete their execution, the AC proceeds to the next instruction by incrementing the program counter. To exploit the data locality among the operations performed within an AC, different connectivity options are provided. Each AU within an AC is connected to both its neighbors, and the AC has a shared line topology bus. The number of AUs per AC is fixed to 8 to obtain the highest operational frequency. A single thread generally contains more than one instance of an AC, each performing instructions independently. Data sharing among ACs is made possible via a shared line topology inter-AC bus.
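The following Python model illustrates the selective SIMD idea: one cluster-level instruction is decoded by the AC and broadcast, and only the AUs selected by an enable mask execute it while the rest issue a NOP. This is a behavioral sketch for exposition, not DAnA's RTL, and the instruction encoding is illustrative.

# Behavioral model of one analytic cluster executing a cluster-level
# instruction in selective SIMD mode (illustrative, not the hardware design).
AUS_PER_AC = 8

def ac_step(instruction, au_registers):
    op, enable_mask, src_a, src_b, dst = instruction
    ops = {'add': lambda a, b: a + b,
           'sub': lambda a, b: a - b,
           'mul': lambda a, b: a * b}
    for au in range(AUS_PER_AC):
        if not (enable_mask >> au) & 1:
            continue                         # NOP for AUs outside the mask
        regs = au_registers[au]
        regs[dst] = ops[op](regs[src_a], regs[src_b])

# Example: AUs 0-3 multiply r0 by r1 into r2; AUs 4-7 stay idle this cycle.
regs = [{'r0': float(i), 'r1': 2.0, 'r2': 0.0} for i in range(AUS_PER_AC)]
ac_step(('mul', 0b00001111, 'r0', 'r1', 'r2'), regs)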

Figure 7: (a) A single analytic cluster comprising analytic units operating in a selective SIMD mode and (b) an analytic unit that is the pipelined compute hub of the architecture.

Analytic unit. The Analytic Unit (AU), shown in Figure 7b, is the basic compute element of the execution engine. It is tailored by the hardware generator to satisfy the mathematical requirements of the hDFG. Control signals are provided by the AC. Data for each operation can be read from the memory according to the source type of the AC instruction. Training data and intermediate results are stored in the data memory. Additionally, data can be read from the bus FIFO (First In First Out) and/or the registers corresponding to the left and right neighbor AUs. Data is then sent to the Arithmetic Logic Unit (ALU), which executes both basic mathematical operations and complicated non-linear operations, such as sigmoid, gaussian, and square root. The internals of the ALU are reconfigured according to the operations required by the hDFG. The ALU then sends its output to the neighboring AUs, the shared bus within the AC, and/or the memory, as per the instruction.

Bringing the Execution Engine together. Results across the threads are combined via a computationally-enabled tree bus in accordance with the merge function. This bus has attached ALUs to perform computations on in-flight data. The pliability of the architecture enables DAnA to generate high-performance designs that efficiently utilize the resources on the FPGA for the given RDBMS engine and algorithm. The execution engine is programmed using its own novel ISA. Due to space constraints, the details of this ISA are added to Appendix B of our tech report (http://act-lab.org/artifacts/dana/addendum.pdf).

6. BACKEND FOR DANA
DAnA's translator, scheduler, and hardware generator together configure the accelerator design for the UDF and create its runtime schedule. As discussed in §4.4, the translator converts the user-provided UDF, merge function, and convergence criteria into an hDFG. Each node of the hDFG comprises sub-nodes, where each sub-node is a single instruction in the execution engine. Thus, all the sub-nodes in the hDFG are scheduled and mapped to the final accelerator hardware design. The hardware generator outputs a single-threaded architecture for the operations of these sub-nodes and determines the number of threads to be instantiated. The scheduler then statically maps all operations to this architecture.

6.1 Hardware Generator
The hardware generator finalizes the parameters of the reconfigurable architecture (Figure 4) for the Striders and the execution engine. The hardware generator obtains the database page layout information, model, and training data schema from the DBMS catalog. FPGA-specific information, such as the number of DSP slices, the number of BRAMs, the capacity of each BRAM, the number of read/write ports on a BRAM, and the off-chip communication bandwidth, is provided by the user. Using this information, the hardware generator distributes the resources among the access and execution engines. The sizes of the DBMS page, model, and a single training data record determine the amount of memory utilized by each Strider. Specifically, a portion of the BRAM is allocated to store the extracted raw training data and model. The remainder of the BRAM memory is assigned to the page buffer to store as many pages as possible to maximize the off-chip bandwidth utilization.

Once the number of resident pages is determined, the hardware generator uses the FPGA's DSP information to calculate the number of AUs which can be synthesized on the target FPGA. Within each AU, the ALU is customized to contain all the operations required by the hDFG. The number of AUs determines the number of ACs. Each thread is allocated a number of ACs determined by the merge coefficient provided by the programmer. It creates at most as many threads as the coefficient. To decide the allocation of resources to each thread vs. the number of threads, we equip the hardware generator with a performance estimation tool that uses the static schedule of the operations for each design point to estimate its relative performance. It chooses the smallest and best-performing design point which strikes a balance between the number of cycles for data processing and transfer. Performance estimation is viable, as the hDFG does not change, there is no hardware-managed cache, and the accelerator architecture is fixed during execution. Thus, there are no dynamic irregularities that hinder estimation. This technique is commensurate with prior works [5, 19, 30] that perform a similar restricted design space exploration in less than five minutes with estimates within 5% of the physical measurements.

Using these specifications, the hardware generator converts the final architecture into a functional and synthesizable design that can efficiently run the analytics algorithm.

6.2 Compiler
The compiler schedules, maps, and generates the micro-instructions for both ACs and AUs for each sub-node in the hDFG. For scheduling and mapping a node, the compiler keeps track of the sequence of scheduled nodes assigned to each AC and AU on a per-cycle basis. For each node which is "ready", i.e., all its predecessors have been scheduled, the compiler tries to place that operation with the goal of improving throughput. Elementary and non-linear operation nodes are spread across as many AUs as required by the dimensionality of the operation. As these operations are completely parallel and do not have any data dependencies within a node, they can be dispersed. For instance, an element-wise vector-vector multiplication, where each vector contains 16 scalar values, will be scheduled across two ACs (8 AUs per AC). Group operations exhibit data dependencies, hence they are mapped to minimize the communication cost. After all the sub-nodes are mapped, the compiler generates the AC and AU micro-instructions.

The FPGA design, its schedule, operation map, and instructions are then stored in the RDBMS catalog. These components are executed when the query calls for the corresponding UDF.
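The core of this scheduling pass can be sketched as a greedy list scheduler. The code below is a simplified illustration under stated assumptions: it ignores AC grouping and the communication-cost minimization for group operations, which the actual compiler also handles.

# Simplified list scheduler: sub-nodes whose predecessors are already
# scheduled are greedily placed on free AUs, cycle by cycle.
def schedule(subnodes, deps, num_aus):
    # subnodes: list of node ids; deps: {node: set of predecessor nodes}
    done, schedule_table, cycle = set(), [], 0
    while len(done) < len(subnodes):
        ready = [n for n in subnodes
                 if n not in done and deps.get(n, set()) <= done]
        issued = ready[:num_aus]          # at most one sub-node per AU per cycle
        schedule_table.append((cycle, issued))
        done.update(issued)
        cycle += 1
    return schedule_table

# Example: a tiny DAG where a and b feed c; with 2 AUs, a and b share cycle 0.
print(schedule(['a', 'b', 'c'], {'c': {'a', 'b'}}, num_aus=2))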
Workloads MADlib+PostgreSQL MADlib+Greenplum DAnA+PostgreSQL Training Data Workloads Machine Learning Algorithm Model Topology Remote Sensing LR 3s 600ms 1s 100ms 0s 100ms # of Tuples # 32KB Pages Size (MB) WLAN Remote Sensing Logistic Regression, SVM 54 581,102 4,924 154 14s 0ms 14s 0ms 0s 610ms WLAN Logistic Regression 520 19,937 1,330 42 Remote Sensing SVM 1s 700ms 0s 600ms 0s 90ms Netflix Low Rank Matrix Factorization 6040, 3952, 10 6,040 3,068 96 Netflix 62s 300ms 69s 200ms 7s 890ms Patient Linear Regression 384 53,500 1,941 61 Blog Feedback Linear Regression 280 52,397 2,675 84 Patient 2s 800ms 0s 900ms 1s 180ms S\N Logistic Logistic Regression 2,000 387,944 96,986 3,031 Blog Feedback 1s 600ms 0s 500ms 0s 340ms S\N SVM SVM 1,740 678,392 169,598 5,300 S/N Logistic 54m 52s 49m 53s 2m 11s S\N LRMF Low Rank Matrix Factorization 19880, 19880, 10 19,880 50,784 1,587 S\N Linear Linear Regression 8,000 130,503 130,503 4,078 S/N SVM 56m 26s 12m 50s 4m 4s S\E Logistic Logistic Regression 6,033 1,044,024 809,339 25,292 S/N LRMF 0m 23s 0m 3s 0m 2s S\E SVM SVM 7,129 1,356,784 1,242,871 38,840 S/N Linear S\E LRMF Low Rank Matrix Factorization 28002, 45064, 10 45,064 162,146 5,067 29m 7s 24m 16s 5m 35s S\E Linear Linear Regression 8000 1,000,000 1,027,961 32,124 S/E Logistic 66h 45m 0s 8h 30m 0s 0h 11m 24s Table 4: Xilinx Virtex UltraScale+ VU9P FPGA specifications. S/E SVM 0h 6m 0s 0h 5m 24s 0h 1m 12s S/E LRMF 0h 54m 36s 0h 26m 24s 0h 39m 0s FPGA Capacity Frequency BRAM Size # DSPs S/E Linear 6h 36m 36s 5h 22m 12s 0h 16m 48s 1,182 K LUTS 2,364 K Flip-Flops 150 MHz 44 MB 6,840 tivity by varying the page size on PostgreSQL and Greenplum, we of these three systems. Next, we investigate the impact of Strid- measured end-to-end runtimes for 8, 16, and 32 KB page sizes. We ers on the overall runtime of DAnA and how accelerator perfor- found that page size had no significant impact on the runtimes. Ad- mance varies with the system parameters. Such parameters in- ditionally, we did a sweep for 4, 8, and 16 segments for Greenplum. clude the buffer page-size, number of Greenplum segments, multi- We observed the most benefits with 8 segments, making it our de- threading on the hardware accelerator, and bandwidth and com- fault choice. Results are obtained for both warm cache and cold pute capability of the target FPGA. We also aim to understand the cache settings to better interpret the impact of I/O on the overall overheads of performing analytics within RDBMS, thus compare runtime. In the case of a warm cache, before query execution, train- MADlib+PostgreSQL with software libraries optimized to perform ing data tables for the publicly available dataset reside in the buffer analytics outside the database. Furthermore, to delineate the over- pool, whereas only a part of the synthetic datasets are contained in head of reconfigurable architecture, we compare our FPGA designs the buffer pool. For the cold cache setting, before execution, no with custom hard coded hardware designs targeting a single or fixed training data tables reside in the buffer pool. set of machine learning algorithms. Datasets and workloads. Table 3 lists the datasets and machine 7.1 End-to-End Performance learning models used to evaluate DAnA. These workloads cover a Publicly available datasets. 
Experimental setup. We use the Xilinx Virtex UltraScale+ VU9P as the FPGA platform for DAnA and synthesize the hardware at 150 MHz using Vivado 2018.2. Specifications of the FPGA board used for the DAnA accelerators are provided in Table 4. The baseline experiments for MADlib were performed on a machine with four Intel i7-6700 cores at 3.40 GHz running Ubuntu 16.04 LTS with kernel 4.8.0-41, 32 GB of memory, and a 256 GB solid-state drive. We run each workload with MADlib v1.12 on PostgreSQL v9.6.1 and Greenplum v5.1.0 to measure single- and multi-threaded performance, respectively.

Table 4: Xilinx Virtex UltraScale+ VU9P FPGA specifications.

FPGA Capacity                    | Frequency | BRAM Size | # DSPs
1,182 K LUTs, 2,364 K Flip-Flops | 150 MHz   | 44 MB     | 6,840

Default setup. Our default setup uses a 32 KB buffer page size and an 8 GB buffer pool across all the systems. As DAnA operates with uncompressed pages to avoid on-chip decompression overheads, 32 KB pages are used as the default so that at least one tuple fits per page for all the datasets. To understand the performance sensitivity to page size on PostgreSQL and Greenplum, we measured end-to-end runtimes for 8, 16, and 32 KB page sizes and found that page size had no significant impact on the runtimes. Additionally, we did a sweep for 4, 8, and 16 segments for Greenplum and observed the most benefit with 8 segments, making it our default choice. Results are obtained for both warm-cache and cold-cache settings to better interpret the impact of I/O on the overall runtime. In the warm-cache case, before query execution, the training data tables for the publicly available datasets reside in the buffer pool, whereas only a part of each synthetic dataset is contained in the buffer pool. For the cold-cache setting, no training data tables reside in the buffer pool before execution.
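The warm-cache setting can be emulated on a stock PostgreSQL installation by pre-loading the training table into the buffer pool before issuing the query. The snippet below is a minimal sketch using psycopg2 and the standard pg_prewarm extension; the database and table names are hypothetical, and this is not necessarily the exact procedure used for the measurements in this paper.

import psycopg2

# Illustrative warm-cache preparation: load the training table's pages into
# the buffer pool before running the analytics UDF (names are hypothetical).
conn = psycopg2.connect(dbname="dana_eval")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm;")
cur.execute("SELECT pg_prewarm(%s::regclass);", ("remote_sensing_train",))
pages = cur.fetchone()[0]
print(f"pre-loaded {pages} pages into the buffer pool")

cur.close()
conn.close()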
7.1 End-to-End Performance

Publicly available datasets. Figures 8a and 8b illustrate the end-to-end performance of MADlib+PostgreSQL, MADlib+Greenplum, and DAnA for warm and cold cache. The x-axis represents the individual workloads and the y-axis the speedup; the last bar provides the geometric mean (geomean) across all workloads. On average, DAnA provides 8.3× and 4.8× end-to-end speedup over PostgreSQL and 4.0× and 2.5× speedup over 8-segment Greenplum for publicly available datasets in the warm- and cold-cache settings, respectively. The benefits diminish for cold cache as the I/O time adds to the runtime and cannot be parallelized. The overall runtime of the benchmarks reduces from 14 to 1.3 seconds with DAnA in contrast to MADlib+PostgreSQL.

The maximum speedup is obtained by Remote Sensing LR: 28.2× with warm cache and 14.6× with cold cache. This workload runs the logistic regression algorithm to perform non-linear transformations to categorize data into different classes and offers copious amounts of parallelism for exploitation by DAnA's accelerator. In contrast, Blog Feedback sees the smallest speedup, 1.9× (warm cache) and 1.5× (cold cache), due to the high CPU vectorization potential of the linear regression algorithm.

Figure 8: End-to-end runtime performance comparison for publicly available datasets with MADlib+PostgreSQL as baseline. (a) End-to-end performance for warm cache; (b) end-to-end performance for cold cache.

Synthetic nominal and extensive datasets. Figures 9 and 10 depict the end-to-end performance comparison for synthetic nominal and extensive datasets across our three systems. Across S/N datasets, shown in Figure 9, DAnA achieves an average speedup of 13.2× in the warm-cache and 9.5× in the cold-cache setting. In comparison to 8-segment Greenplum, for S/N datasets, DAnA observes a gain of 5.0× for warm cache and 3.5× for cold cache. The average speedups across S/E datasets, as shown in Figure 10, are 12.9× for warm cache and 11.9× for cold cache in comparison to MADlib+PostgreSQL. These speedups reduce to 5.9× (warm cache) and 7.0× (cold cache) when compared against 8-segment MADlib+Greenplum. Higher benefits of acceleration are observed with larger datasets, as DAnA accelerators are exposed to more opportunities for parallelization, which enables the accelerator to hide overheads such as data transfer across platforms, on-chip data alignment, and setting up the execution pipeline. These results show the efficacy of the multi-threading employed by DAnA's execution engine in comparison to the scale-out Greenplum engine.

Figure 9: End-to-end runtime performance comparison for synthetic nominal datasets with MADlib+PostgreSQL as baseline.

Figure 10: End-to-end runtime performance comparison for synthetic extensive datasets with MADlib+PostgreSQL as baseline.

The total runtime for S/N Logistic and S/E Logistic reduces from 55 minutes to 2 minutes and from 66 hours to 12 minutes, respectively. These workloads achieve a high reduction in total runtime due to their high skew towards compute time, which DAnA's execution engine is specialized in handling. For S/N LRMF, DAnA is only able to reduce the runtime by 20 seconds. This can be attributed to the high I/O time incurred by the benchmark in comparison to its compute time; thus, the accelerator frequently stalls for buffer pool page replacements to complete. Nevertheless, for S/E LRMF, DAnA still reduces the absolute runtime from 55 to 39 minutes.

Evaluating Striders. A crucial part of DAnA's accelerators is their direct integration with the buffer pool via Striders. To evaluate the effectiveness of Striders, we simulate an alternate design in which DAnA's execution engines are fed by the CPU. In this alternative, the CPU transforms the training tuples and sends them to the execution engines. Figure 11 compares the end-to-end runtime of DAnA with and without Striders, using warm-cache MADlib+PostgreSQL as the baseline. DAnA with and without Striders achieves, on average, 10.7× and 2.3× speedup in comparison to the baseline. Even though raw application hardware acceleration has its benefits, integrating Striders to directly interface with the database engine amplifies those performance benefits by 4.6×. The Striders bypass the bottlenecks in the memory subsystem of CPUs and provide an on-chip opportunity to intersperse the tasks of the access and execution engines. The above evaluation demonstrates the effectiveness of DAnA and Striders in integrating with PostgreSQL.

Figure 11: Comparison of DAnA with and without Striders with PostgreSQL+MADlib as the baseline.
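The gap between the two designs can be illustrated with a toy timing model: without Striders, CPU-side tuple transformation and the transfer to the FPGA serialize with the execution engine, whereas with Striders the page walk happens on chip and overlaps with execution. The functions and the numbers below are illustrative assumptions, not measurements from Figure 11.

def epoch_without_striders(t_cpu_transform: float, t_transfer: float, t_exec: float) -> float:
    """CPU parses and reformats tuples, ships them to the FPGA, then the engine runs."""
    return t_cpu_transform + t_transfer + t_exec

def epoch_with_striders(t_page_access: float, t_exec: float) -> float:
    """Striders walk buffer-pool pages on chip, so access overlaps with execution."""
    return max(t_page_access, t_exec)

# Hypothetical per-epoch times in seconds for one workload.
t_cpu, t_xfer, t_exec, t_access = 0.8, 0.3, 0.4, 0.25
print(epoch_without_striders(t_cpu, t_xfer, t_exec))   # 1.5 s
print(epoch_with_striders(t_access, t_exec))           # 0.4 s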

7.2 Performance Sensitivity

Multi-threading in Greenplum. As shown in Figure 13, for publicly available datasets, the default configuration of 8-segment Greenplum provides 2.1× (warm cache) and 1.9× (cold cache) higher speedups than its PostgreSQL counterpart. The 8-segment Greenplum performs the best amongst all options, and performance does not scale as the segments increase.

Figure 13: Greenplum performance with varying segments.

Performance Sensitivity to FPGA resources. Two main resource constraints on the FPGA are its compute capability and bandwidth. DAnA configures the template architecture in accordance with the algorithmic parameters and FPGA constraints. To maximize compute resource utilization, DAnA supports a multi-threaded execution engine, where each thread runs a version of the update rule. We perform an analysis for a varying number of threads on the final accelerator by changing the merge coefficient. A merge coefficient of 2 implies a maximum of two threads. However, a large merge coefficient, such as 2048, does not warrant 2048 threads, as the FPGA may not have enough resources. On the UltraScale+ FPGA, a maximum of 1024 compute units can be instantiated.
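The interaction between the merge coefficient and the FPGA's resource budget can be captured by a small bound, sketched below. The 1024-unit cap follows the UltraScale+ figure quoted above, while units_per_thread is an assumed, illustrative per-thread cost rather than a number from DAnA.

def planned_threads(merge_coefficient: int,
                    total_compute_units: int = 1024,
                    units_per_thread: int = 8) -> int:
    """Illustrative bound on execution-engine threads: the merge coefficient
    caps how many partial models may be merged, and the FPGA's compute units
    cap how many threads can physically be instantiated."""
    resource_cap = max(1, total_compute_units // units_per_thread)
    return max(1, min(merge_coefficient, resource_cap))

print(planned_threads(2))      # merge coefficient of 2 -> at most 2 threads
print(planned_threads(2048))   # 2048 is clipped to 128 by the assumed budget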


Figure 12 shows the performance sensitivity with increasing compute utilization of the FPGA for different workloads. Each plot shows DAnA's accelerator runtime (access engine + execution engine) in comparison to a single thread. The sensitivity towards compute resources is a function of algorithm type, model width, and number of epochs; thus, each workload fares differently with varying compute resources. Workloads such as Remote Sensing LR and Remote Sensing SVM have a narrow model size, so increasing the number of threads increases performance until they reach peak compute utilization. On the other hand, LRMF algorithm workloads do not experience higher performance with an increasing number of threads. This can be attributed to the copious amount of parallelism available in a single instance of the update rule. Thus, increasing the number of threads reduces the ACs allocated to a single thread, whereas merging across multiple different threads incurs an overhead. One of the challenges tackled by the compiler is to allocate the on-chip resources by striking a balance between single-thread performance and multi-thread parallelism.

Figure 12: Runtime performance of DAnA with increasing # of threads compared with single-thread as baseline.

Figure 14 illustrates the impact of FPGA bandwidth (in comparison to the baseline bandwidth) on the speedup of DAnA accelerators. The results show that as the size of the benchmark increases, except for the ones that run the LRMF algorithm, the workloads become bandwidth bound. The workloads S/N LRMF and S/E LRMF are compute heavy; thus, a bandwidth increase does not have a significant impact on the accelerator runtime.

Figure 14: Comparison of FPGA time with varying bandwidth.
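A rough way to see why the larger non-LRMF workloads become bandwidth bound is a roofline-style estimate: compare the time to stream the training pages against the time to execute the update rule over them. The constants below (arithmetic intensity, peak bandwidth, peak compute) are illustrative assumptions, not DAnA's internal model or the board's exact specifications.

def bound_estimate(pages: int,
                   page_bytes: int = 32 * 1024,
                   ops_per_byte: float = 0.25,     # assumed arithmetic intensity
                   bandwidth_gbs: float = 16.0,    # assumed GB/s into the FPGA
                   compute_gops: float = 150.0):   # assumed sustained GOP/s
    """Roofline-style check: is one pass over the data I/O- or compute-limited?"""
    bytes_streamed = pages * page_bytes
    t_io = bytes_streamed / (bandwidth_gbs * 1e9)
    t_compute = bytes_streamed * ops_per_byte / (compute_gops * 1e9)
    verdict = "bandwidth bound" if t_io > t_compute else "compute bound"
    return verdict, round(t_io, 3), round(t_compute, 3)

# Using the page counts of Table 3: a wide linear-model workload streams far
# more data per unit of compute than an LRMF workload, so extra bandwidth
# helps the former but barely moves the latter.
print(bound_estimate(pages=1_027_961))                  # S/E Linear-sized dataset
print(bound_estimate(pages=50_784, ops_per_byte=40.0))  # S/N LRMF-sized, compute heavy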
Figure 12 shows performance sensitivity as the compute utilization of the FPGA increases for the different workloads. Each plot reports the runtime of DAnA's accelerator (access engine + execution engine) relative to a single-threaded configuration. Sensitivity to compute resources is a function of the algorithm type, the model width, and the number of epochs, so each workload responds differently to additional compute. Workloads such as Remote Sensing LR and Remote Sensing SVM have narrow models; increasing the number of threads therefore improves performance until they reach peak compute utilization. In contrast, the LRMF workloads do not benefit from additional threads. This can be attributed to the copious parallelism available within a single instance of the update rule: increasing the number of threads reduces the ACs allocated to each thread, while merging results across threads incurs an overhead. One of the challenges tackled by the compiler is therefore to allocate the on-chip resources by striking a balance between single-thread performance and multi-thread parallelism.
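To make this trade-off concrete, the following toy cost model illustrates why narrow-model workloads benefit from more threads while LRMF-style workloads do not. It is not DAnA's compiler or scheduler; the constants and the cost formula are illustrative assumptions only.

def epoch_cycles(model_width, num_threads, num_instances=1024,
                 total_acs=64, merge_cost=10):
    """Toy cost model (illustrative constants, not DAnA's real scheduler).

    Each thread receives an equal slice of the ACs (the execution engine's
    compute units) and processes its share of the update-rule instances in
    an epoch; the partial models are then merged at a per-thread cost.
    """
    acs_per_thread = max(1, total_acs // num_threads)
    cycles_per_instance = -(-model_width // acs_per_thread)   # ceil division
    compute = (num_instances // num_threads) * cycles_per_instance
    merge = merge_cost * (num_threads - 1)
    return compute + merge

# A narrow model cannot use all ACs in one update-rule instance, so extra
# threads add concurrency almost for free; a wide (LRMF-like) model already
# saturates the ACs, so extra threads mostly add merge overhead.
for width in (8, 4096):
    best = min(range(1, 17), key=lambda t: epoch_cycles(width, t))
    print(f"model width {width:5d}: best thread count = {best}")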

Figure 14 illustrates the impact of FPGA bandwidth (relative to the baseline bandwidth) on the speedup of DAnA accelerators. As the benchmark size grows, the workloads become bandwidth bound, with the exception of those that run the LRMF algorithm. The S/N LRMF and S/E LRMF workloads are compute heavy, so increasing the bandwidth has no significant impact on their accelerator runtime.
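A simple roofline-style check clarifies when a workload becomes bandwidth bound. The sketch below is a back-of-the-envelope estimate with assumed tuple sizes, operation counts, and bandwidth/compute figures, not measured DAnA parameters.

def bound_by(bytes_per_epoch, ops_per_epoch, bandwidth_gbs, compute_gops):
    """Roofline-style estimate: compare the time to stream the training
    data against the time to execute the update-rule operations, and
    report which resource dominates one epoch."""
    io_time = bytes_per_epoch / (bandwidth_gbs * 1e9)
    compute_time = ops_per_epoch / (compute_gops * 1e9)
    return "bandwidth bound" if io_time > compute_time else "compute bound"

# Assumed examples: a linear model touches each feature a handful of times
# per tuple, whereas an LRMF update performs many more operations per byte.
tuples, features = 1_000_000, 1024
data_bytes = tuples * features * 4                      # 32-bit features
print(bound_by(data_bytes, ops_per_epoch=8 * tuples * features,
               bandwidth_gbs=16, compute_gops=200))     # linear-model-like
print(bound_by(data_bytes, ops_per_epoch=400 * tuples * features,
               bandwidth_gbs=16, compute_gops=200))     # LRMF-like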

7.3 Comparison to Custom Designs
Potential alternatives to DAnA are (1) custom software libraries [39–41] that run multi-core advanced analytics and (2) algorithm-specific FPGA implementations [42–44]. For these alternatives, if the training data is stored in the database, there is an overhead to extract, transform, and supply the data in the format each of them requires. We compare the performance of these alternatives with MADlib+PostgreSQL and DAnA.

Figure 15: Comparison to external software libraries. (a) Runtime breakdown of Liblinear and DimmWitted into data export, data transformation, and analytics; (b) compute time comparison; (c) end-to-end runtime comparison.

Optimized software libraries. We compare the C++-optimized libraries DimmWitted and Liblinear-Multicore against MADlib+PostgreSQL, Greenplum+MADlib, and DAnA accelerators. Liblinear supports Logistic Regression and SVM, while DimmWitted supports SVM, Logistic Regression, Linear Regression, Linear Programming, Quadratic Programming, Gibbs Sampling, and Neural Networks. Logistic Regression, SVM, and Linear Regression (the latter only for DimmWitted) overlap with our benchmarks, so we compare multi-core versions (2, 4, 8, and 16 threads) of these libraries and use the minimum runtime. We maintain the same hyper-parameters, such as tolerance and choice of optimizer, and compare the runtime of one epoch across all systems. We separately compare the compute time and the end-to-end runtime (data extraction from PostgreSQL + data transformation + compute), and also provide a runtime breakdown. Figure 15a breaks the end-to-end runtime of Liblinear and DimmWitted into its constituent phases; exporting and reformatting the data for these external specialized ML tools is an overhead specific to performing analytics outside the RDBMS. The results show that DAnA is uniformly faster, as it (1) does not export the data from the database, (2) employs Striders on the FPGA to walk through the data, and (3) accelerates the ML computation with the FPGA. However, the different software solutions exhibit different trends, as elaborated below.
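The end-to-end measurement for the external libraries can be reproduced with a harness along the following lines. This is a sketch only: the connection string, the table name training_data, and the train_one_epoch routine (standing in for the Liblinear or DimmWitted call) are assumptions; only the phase boundaries (export, transform, compute) mirror the methodology above.

import io
import time

import numpy as np
import psycopg2


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def export_from_postgres(dsn, table="training_data"):
    """Phase 1: dump the training tuples out of the database via COPY."""
    buf = io.StringIO()
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV", buf)
    buf.seek(0)
    return buf


def transform(csv_buf):
    """Phase 2: reformat into the dense matrix layout the library expects."""
    data = np.loadtxt(csv_buf, delimiter=",")
    return data[:, :-1], data[:, -1]          # features, labels


def train_one_epoch(features, labels):
    """Phase 3 placeholder: stands in for one epoch of Liblinear/DimmWitted."""
    weights = np.zeros(features.shape[1])
    for x, y in zip(features, labels):        # plain SGD pass, illustrative only
        weights += 0.01 * (y - x @ weights) * x
    return weights


csv_buf, t_export = timed(export_from_postgres, "dbname=bench user=bench")
(features, labels), t_transform = timed(transform, csv_buf)
_, t_compute = timed(train_one_epoch, features, labels)
print(f"export {t_export:.2f}s, transform {t_transform:.2f}s, compute {t_compute:.2f}s")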

• Logistic regression: As Figure 15b shows, Liblinear and DimmWitted provide 3.8× and 1.8× speedups, respectively, over MADlib+PostgreSQL for the logistic regression workloads in terms of compute time. With respect to end-to-end runtime (Figure 15c), the benefit of Liblinear drops to 2.4×, while that of DimmWitted rises to 2.1×. DAnA, on the other hand, outperforms Liblinear by 2.2× and DimmWitted by 4.7× in compute time; for overall runtime, DAnA is 9.1× faster than Liblinear and 10.4× faster than DimmWitted. The compute time of the Remote Sensing LR benchmark receives the least benefit from Liblinear and DimmWitted and even slows down in end-to-end runtime. This can be attributed to its small model size, which, despite a large dataset, does not expose enough parallelism for these libraries to exploit. Sparse datasets, such as WLAN, are handled more efficiently by these libraries.

• SVM: As shown in Figure 15b, which compares the compute time of the SVM workloads, Liblinear and DimmWitted are 18.1× and 22.3× slower than MADlib+PostgreSQL, respectively. For end-to-end runtime (Figure 15c), the slowdown is reduced to 14.6× for Liblinear and 15.9× for DimmWitted because of the complex interplay between data accesses and UDF execution in MADlib+PostgreSQL. In comparison, DAnA outperforms Liblinear by 30.7× and DimmWitted by 37.7× in compute time; for overall runtime, DAnA is 127× and 138.3× faster than Liblinear and DimmWitted, respectively.

• Linear regression: For the linear regression workloads, DimmWitted is 4.3× faster than MADlib+PostgreSQL in compute time. For end-to-end runtime compared to MADlib+PostgreSQL (Figure 15c), the speedup of DimmWitted is reduced to 12%. DAnA outperforms DimmWitted by 1.6× in compute time and 6.0× in overall runtime.
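The aggregate figures quoted above summarize per-benchmark results as geometric means. The small helper below (with made-up example values, not the measured speedups) shows how such a summary is computed.

import math

def geomean(speedups):
    """Geometric mean, the usual way to summarize per-benchmark speedups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-benchmark speedups of one system over a baseline.
example = [2.1, 0.9, 4.8, 1.3, 7.5]
print(f"geometric-mean speedup: {geomean(example):.2f}x")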

Specific FPGA implementations. We compare hand-optimized FPGA designs, each created for a single algorithm, with our reconfigurable architecture. DAnA's execution engine performs on par with Parallel SVM [42], is 44% slower than Heterogeneous SVM [43], and is 1.47× faster than Falcon Logistic Regression [44]. In addition to speedup, we compare Giga Ops Per Second (GOPS) to measure the numerical compute performance of these architectures; in terms of GOPS, DAnA performs, on average, 16% fewer operations than these hand-coded designs. While providing comparable performance, DAnA relieves the data scientist of the arduous task of hardware design and testing and integrates seamlessly within the database engine, whereas these custom designs require hardware design expertise and long verification cycles to write ≈15,000 lines of Verilog code.

Figure 16: Performance comparison of DAnA over TABLA.

Comparison with TABLA. We compare DAnA with TABLA [5], an open-source framework [45] that generates optimized FPGA implementations for a wide variety of analytics algorithms. We modify its templates for UltraScale+ and perform design space exploration to present the best-case TABLA results. Figure 16 shows that DAnA-generated accelerators run 4.7× faster than TABLA accelerators. DAnA's benefits can be attributed to (1) the interleaving of Striders in the access engine with the execution engine, which mitigates the overheads of data transformation, and (2) the multi-threading capability of the execution engines, which exploits parallelism between different instances of the update rule. TABLA, in contrast, offers only single-threaded acceleration.
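The multi-threading advantage over TABLA comes from running several instances of the update rule concurrently and periodically merging their partial models. The sketch below is a CPU-side analogue of that execution pattern in plain Python, with averaging as an illustrative merge function; it is not the hardware implementation itself.

import numpy as np

def sgd_partition(features, labels, weights, lr=0.01):
    """One update-rule instance: a sequential SGD pass over one data partition."""
    w = weights.copy()
    for x, y in zip(features, labels):
        w += lr * (y - x @ w) * x
    return w

def multithreaded_epoch(features, labels, weights, num_threads=4):
    """Apply the update rule to disjoint partitions (run in parallel on the
    FPGA threads; executed sequentially here), then merge by averaging."""
    parts_x = np.array_split(features, num_threads)
    parts_y = np.array_split(labels, num_threads)
    partial = [sgd_partition(px, py, weights) for px, py in zip(parts_x, parts_y)]
    return np.mean(partial, axis=0)           # merge function

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 16))
true_w = rng.normal(size=16)
y = X @ true_w
w = np.zeros(16)
for _ in range(5):                            # a few epochs
    w = multithreaded_epoch(X, y, w)
print("weight error:", np.linalg.norm(w - true_w))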

8. RELATED WORK
Hardware acceleration for data management. Accelerating database operations is a popular research direction that connects modern acceleration platforms with enterprise in-database analytics, as shown in Figure 1. These prior FPGA-based solutions aim to accelerate DBMS operations (some portion of the query) [3, 4, 23, 24, 46], such as join and hash. LINQits [46] accelerates database queries but does not focus on machine learning. Centaur [3] dynamically decides which operators in a MonetDB [47] query plan can be executed on the FPGA and creates a pipeline between the FPGA and the CPU. Another work [24] uses FPGAs to provide a robust hashing mechanism that accelerates data partitioning in database engines. In the GPU realm, HippogriffDB [22] aims to balance the I/O and GPU bandwidth by compressing the data that is transferred to the GPU. Support for in-database advanced analytics on FPGAs, in tandem with Striders, sets this work apart from the aforementioned literature, which does not focus on providing components that integrate FPGAs within an RDBMS engine for machine learning.

Hardware acceleration for advanced analytics. Both research and industry have recently focused on hardware acceleration for machine learning [5, 21, 25, 48], especially deep neural networks [49–53], connecting two of the vertices of the Figure 1 triad. These works either focus on a fixed set of algorithms or do not offer reconfigurability of the architecture. Among them, several works [19, 29, 54] provide frameworks that automatically generate hardware accelerators for stochastic gradient descent. However, none of these works provide the hardware structures or software components needed to embed FPGAs within an RDBMS engine. DAnA's Python DSL builds upon the mathematical language of prior work [5, 19]; however, the integration with both a conventional language (Python) and a data access language (SQL) is a significant extension, as it enables support for UDFs that include general iterative update rules, merge functions, and convergence functions.
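To illustrate what such a UDF comprises, the snippet below sketches the three components as ordinary Python functions for a linear-regression update. The function names and signatures are hypothetical and do not reproduce DAnA's actual DSL.

import numpy as np

def update_rule(w, x, y, lr=0.05):
    """Iterative update rule: gradient step for one training tuple."""
    return w + lr * (y - x @ w) * x

def merge(partial_models):
    """Merge function: combine partial models produced by parallel threads."""
    return np.mean(partial_models, axis=0)

def converged(w_old, w_new, tol=1e-4):
    """Convergence function: stop once the model changes less than tol."""
    return np.linalg.norm(w_new - w_old) < tol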

In-Database advanced analytics. Recent work at the intersection of databases and machine learning seeks to facilitate efficient in-database analytics and has built frameworks and systems to realize such an integration [7, 14, 15, 17, 55–61] (see [62] for a survey of methods and systems). DAnA takes a step forward and exposes FPGA acceleration for in-database analytics by providing a specialized component, the Strider, that directly interfaces with the database to alleviate some of the shortcomings of the traditional Von Neumann architecture in general-purpose compute systems. Past work on Bismarck [7] provides a unified architecture for in-database analytics, with UDFs as the interface through which analysts describe their desired analytics models. Unlike DAnA, however, Bismarck lacks a hardware acceleration backend and support for general iterative optimization algorithms.

9. CONCLUSION
This paper aims to bridge the power of well-established and extensively researched means of structuring, defining, protecting, and accessing data, i.e., the RDBMS, with FPGA accelerators for compute-intensive advanced data analytics. DAnA provides an initial coalescence between these paradigms and empowers data scientists with no knowledge of hardware design to use accelerators within their existing in-database analytics procedures.

10. ACKNOWLEDGMENTS
We thank Sridhar Krishnamurthy and Xilinx for donating the FPGA boards. We thank Yannis Papakonstantinou, Jongse Park, Hardik Sharma, and Balaji Chandrasekaran for providing insightful feedback. This work was supported in part by NSF awards CNS#1703812 and ECCS#1609823, Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) award #FA9550-17-1-0274, and gifts from Google, Microsoft, Xilinx, Qualcomm, and Opera Solutions.
11. REFERENCES
[1] Gartner Report on Analytics. gartner.com/it/page.jsp?id=1971516.
[2] SAS Report on Analytics. sas.com/reg/wp/corp/23876.
[3] M. Owaida, D. Sidler, K. Kara, and G. Alonso. Centaur: A framework for hybrid CPU-FPGA databases. In 2017 IEEE 25th International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 211–218, April 2017.
[4] David Sidler, Muhsen Owaida, Zsolt István, Kaan Kara, and Gustavo Alonso. doppioDB: A hardware accelerated database. In 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, page 1, 2017.
[5] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016.
[6] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Xiao, and Doug Burger. A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), pages 13–24, 2014.
[7] Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 325–336. ACM, 2012.
[8] Yu Cheng, Chengjie Qin, and Florin Rusu. GLADE: Big data analytics made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 697–700. ACM, 2012.
[9] Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures. In Field Programmable Gate Arrays (FPGA), 2008.
[10] Amazon Web Services. https://aws.amazon.com/rds/postgresql/.
[11] Azure SQL Database. https://azure.microsoft.com/en-us/services/sql-database/.
[12] Oracle Data Mining. http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/overview/index..
[13] Oracle R Enterprise. http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html.
[14] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. MAD skills: New analysis practices for big data. PVLDB, 2(2):1481–1492, 2009.
[15] Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. The MADlib analytics library: Or MAD skills, the SQL. PVLDB, 5(12):1700–1711, 2012.
[16] Microsoft SQL Server Data Mining. https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/data-mining-ssas.
[17] Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24. Curran Associates, Inc., 2011.
[18] Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1969–1984. ACM, 2015.
[19] Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. Scale-out acceleration for machine learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 367–381. ACM, 2017.
[20] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609–622. IEEE, 2014.
[21] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: A polyvalent machine learning accelerator. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
[22] Jing Li, Hung-Wei Tseng, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. HippogriffDB: Balancing I/O and GPU bandwidth in big data analytics. PVLDB, 9(14):1647–1658, 2016.
[23] Rene Mueller, Jens Teubner, and Gustavo Alonso. Data processing on FPGAs. PVLDB, 2(1):910–921, 2009.
[24] Kaan Kara, Jana Giceva, and Gustavo Alonso. FPGA-based data partitioning. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), pages 433–445. ACM, 2017.
[25] Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, and Ce Zhang. FPGA-accelerated dense linear machine learning: A precision-convergence trade-off. In 2017 IEEE 25th FCCM, pages 160–167, 2017.
[26] Amazon EC2 F1 instances: Run custom FPGAs in the Amazon Web Services (AWS) cloud. https://aws.amazon.com/ec2/instance-types/f1/, 2017.
[27] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. A cloud-scale acceleration architecture. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13. IEEE, 2016.
[28] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.
[29] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In HPCA, 2016.
[30] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to FPGAs. In ACM/IEEE International Symposium on Microarchitecture (MICRO), October 2016.
[31] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. On parallelizability of stochastic gradient descent for speech DNNs. In ICASSP, 2014.
[32] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Neural Information Processing Systems, 2010.
[33] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
[34] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, 2009.
[35] Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, and Daniel D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.
[36] Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv:1602.06709 [cs], 2016.
[37] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016.
[38] A. Frank and A. Asuncion. University of California, Irvine (UCI) machine learning repository, 2010.
[39] Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, and Eric P. Xing. STRADS: A distributed framework for scheduled model parallel machine learning. In Proceedings of the 11th European Conference on Computer Systems, pages 5:1–5:16. ACM, 2016.
[40] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, June 2008.
[41] Ce Zhang and Christopher Ré. DimmWitted: A study of main-memory statistical analytics. CoRR, abs/1403.7550, 2014.
[42] S. Cadambi, I. Durdanovic, V. Jakkula, M. Sankaradass, E. Cosatto, S. Chakradhar, and H. P. Graf. A massively parallel FPGA-based coprocessor for support vector machines. In 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM), pages 115–122, April 2009.
[43] M. Papadonikolakis and C. S. Bouganis. A heterogeneous FPGA architecture for support vector machine training. In 2010 18th IEEE FCCM, pages 211–214, May 2010.
[44] Falcon Computing. http://cadlab.cs.ucla.edu/~cong/slides/HALO15 keynote.pdf.
[45] TABLA source code. http://www.act-lab.org/artifacts/tabla/.
[46] Eric S. Chung, John D. Davis, and Jaewon Lee. LINQits: Big data on little clients. In ISCA, 2013.
[47] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. MonetDB: Two decades of research in column-oriented database architectures. IEEE Technical Committee on Data Engineering, 35(1):40–45, 2012.
[48] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[49] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting vision processing closer to the sensor. In 42nd International Symposium on Computer Architecture (ISCA), 2015.
[50] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE Press, 2016.
[51] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ISCA, 2016.
[52] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In ISCA, 2016.
[53] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, Jose Miguel Hernandez-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ISCA, 2016.
[54] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, October 2016.
[55] Boriana L. Milenova, Joseph S. Yarmus, and Marcos M. Campos. SVM in Oracle Database 10g: Removing the barriers to widespread adoption of support vector machines. PVLDB, pages 1152–1163, 2005.
[56] Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein. Querying probabilistic information extraction. PVLDB, 3(1-2):1057–1067, 2010.
[57] Michael Wick, Andrew McCallum, and Gerome Miklau. Scalable probabilistic databases with factor graphs and MCMC. PVLDB, 3(1-2):794–804, 2010.
[58] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 13–24. ACM, 2013.
[59] M. Levent Koc and Christopher Ré. Incrementally maintaining classification using an RDBMS. PVLDB, 4(5):302–313, 2011.
[60] Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Cetintemel, and Stan Zdonik. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, 2015.
[61] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. Weld: A common runtime for high performance data analytics. January 2017.
[62] Arun Kumar, Matthias Boehm, and Jun Yang. Data management in machine learning: Challenges, techniques, and systems. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), pages 1717–1722. ACM, 2017.