BLAS-on-flash: An Alternative for Large Scale ML Training and Inference?

Suhas Jayaram Subramanya (Microsoft Research India, [email protected]), Harsha Vardhan Simhadri (Microsoft Research India, [email protected]), Srajan Garg (IIT Bombay, [email protected]), Anil Kag (Microsoft Research India, [email protected]), B. Venkatesh (Microsoft Research India, [email protected])

Abstract

Many large scale machine learning training and inference tasks are memory-bound rather than compute-bound. That is, on large data sets, the working set of these algorithms does not fit in memory for jobs that could run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm for distributed platforms such as parameter servers and Spark.

We propose an inexpensive and efficient alternative based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. A library that supports a BLAS and sparseBLAS interface on large SSD-resident matrices can enable multi-threaded code to scale to industrial scale datasets on a single workstation.

We demonstrate that not only can such a library provide near in-memory performance for BLAS, but it can also be used to write implementations of complex algorithms such as eigensolvers that outperform in-memory (ARPACK) and distributed (Spark) counterparts.

Existing multi-threaded in-memory code can link to our library with minor changes and scale to hundreds of gigabytes of training or inference data at near in-memory processing speeds. We demonstrate this with two industrial scale use cases arising in ranking and relevance pipelines: training large scale topic models and inference for extreme multi-label learning.

This suggests that our approach could be an efficient alternative to expensive distributed big-data systems for scaling up structurally complex machine learning tasks.

1 Introduction

Data analysis pipelines in scientific computing as well as ranking and relevance often work on datasets that are hundreds of gigabytes to a few terabytes in size. Many algorithms in these pipelines, such as topic modeling [6], matrix factorizations [34], spectral clustering [33], and extreme multi-label learning [45], are memory limited as opposed to being limited by compute. That is, on large datasets, a training algorithm that requires a few hours of compute on a multi-core processor would run out of DRAM for its working set.

This forces users to move the algorithm to distributed big-data platforms such as Apache Spark [61] or systems based on Parameter Servers [37, 19, 58], which incurs three costs: (1) the cost of rewriting code in a distributed framework, (2) the cost of a cluster of nodes or non-availability in production environments, and (3) inefficiencies of the platform in using the hardware. Training on these platforms can require dozens of nodes for moderate speedups over single threaded code for non-trivial algorithms [23, 38]. This could be due to platform overheads as well as mismatch between the structure of the algorithm and the platform's programming model [56, 9, 18], resulting in low processor utilization.

Several light-weight frameworks for single node workstations demonstrate that this inefficiency is unnecessary for many classes of algorithms that admit multi-threaded implementations that are orders of magnitude more efficient [35, 49, 50, 17]. It is also widely observed that many machine learning problems admit algorithms that are essentially compositions of linear algebra operations on sparse and dense matrices. High performance code for these algorithms is typically written as a main thread consisting of glue code that invokes linear algebra calls through standard APIs such as BLAS [10] and sparseBLAS [21]. High performance implementations of these standard APIs are provided by hardware vendors [27, 28, 41, 42].

Linear algebra kernels offer plenty of locality, so much so that the bandwidth required for running them on high-end multiprocessors can be provided by non-volatile memory over a PCIe or SATA bus [54, 5, 13]. Non-volatile memory is already widely deployed in the cloud, and developments in the hardware and software eco-system position non-volatile memory as an inexpensive alternative to DRAM [3, 47, 20, 16]. Hardware technology and interfaces for non-volatile memories have increasingly lower end-to-end latency (a few µs) [26] and higher bandwidth: from 4-8 GT/s in PCIe3.0 to 16GT/s in PCIe4.0 [43] and 32GT/s in PCIe5.0. Hardware manufacturers are also packaging non-volatile memory with processing units, e.g. Radeon PRO SSG [2], to increase available memory.

These observations point to a cost-effective solution for scaling linear algebra based algorithms to large datasets in many scenarios: use inexpensive PCIe-connected SSDs to store large matrices corresponding to the data and the model, and exploit the locality of linear algebra to develop a library of routines that can operate on these matrices with a limited amount of DRAM. By conforming to the standard APIs, such a library could be a replacement for code that would have linked to BLAS libraries such as Intel MKL or OpenBLAS [57].

We present empirical evidence that this approach can be practical, easy, and fast by developing a library which provides near in-memory speeds on NVM-resident data for subroutines on dense and sparse matrices. The performance of our BLAS-on-flash library is comparable to that of in-memory Intel MKL implementations for level-3 BLAS and sparseBLAS kernels such as gemm (dense-dense matrix multiplication) and csrmm (sparse-dense matrix multiplication) on multiprocessor machines with SSDs. The key to this performance is using the knowledge of data-access patterns arising in linear algebra kernels to effectively pipeline IO with computation. Using these kernels, we can implement algorithms such as k-means clustering that run at near in-memory speeds.

To illustrate that this approach is not limited to simple kernels, we consider one of the most structurally complex numerical algorithms – eigensolvers. Using the BLAS-on-flash library, we built a general purpose symmetric eigensolver which is critical to dimensionality reduction (e.g. PCA) and spectral methods. Specifically, we wrote an implementation of the restarted block Krylov-Schur [63] algorithm that can compute hundreds or thousands of eigenvectors on SSD-resident data faster than standard in-memory solvers based on the IRAM algorithm [52] (e.g., Spectra [46], ARPACK [36]). On large bag-of-words text data sets running into hundreds of gigabytes, our implementation running on one multi-core workstation with under 50GB DRAM outperforms Spark MLlib's computeSVD [39] deployed on hundreds of executors, representing an order of magnitude efficiency gain in hardware utilization. Further, our solver can compute thousands of eigenvalues, while computeSVD is limited to 500 or fewer.

We present two use cases of the library for algorithms used in ranking and relevance pipelines that process hundreds of gigabytes of data: training topic models, and inference in Extreme Multi-Label learning.

Topic modeling [11] involves summarizing a corpus of documents, where each document is a collection of words from a fixed vocabulary, as a set of topics that are probability distributions over the vocabulary. Although most large scale algorithms are based on approximating and scaling an intractable probabilistic model on parameter servers [59, 14, 60], recent research [6] has shown that linear algebra based approaches can be just as good qualitatively. We take a highly optimized version of the algorithm in [6] that already outperforms prior art on single node workstations, and link it to the eigensolvers and clustering algorithms written using our framework. This allows the algorithm to train a 2000 topic model on a 60 billion token (500GB on disk) corpus in under 4 hours.

Extreme Multi-Label Learning (XML) is the problem of learning to automatically annotate a data point with the most relevant subset of labels from an extremely large label set (often many millions of labels). This is an important task that has many applications in tagging, ranking and recommendation [8]. Models in extreme multi-label learning tasks are often ensembles of deep trees with smaller classifier(s) at each node, e.g. PfastreXML [45], Parabel [44]. In production, models that exceed DRAM in size need to score (i.e. infer) several hundreds of millions of sparse data points from a space with a million+ dimensions every week, on a platform that provides machines with moderate sized DRAM. As datasets grow in size, XML algorithms need to scale 10x along multiple axes: model size, number of points scored and the dimensionality of data.

In this work, we start with PfastreXML and Parabel models and a dataset that needed 440 and 900 compute hours respectively on a VM with large RAM. We optimized this code and reduced in-memory compute time by a factor of six. When the optimized code is linked to our library, it runs at about 90% of in-memory speed with a much smaller memory footprint.

These results suggest that for complicated numerical algorithms, our approach is capable of running at near in-memory speeds on large datasets while providing significant benefits in hardware utilization as compared to general-purpose big-data systems. Further, we envision our library being useful in the following scenarios: (1) environments without multi-node support for MPI, Spark etc., (2) laptops and workstations or VMs in the cloud with limited RAM but large non-volatile memories, (3) batch mode periodic retraining and inferencing of large scale models in production data analysis pipelines, and (4) extending the capabilities of legacy single-node ML training code.

Roadmap. Sections 2, 3 and 4 provide an overview of the interface, design and the architecture of the library. Section 5 presents an evaluation of the performance of our library and algorithms written using the library.

2 BLAS-on-flash: Overview and Interface

The BLAS-on-flash library provides an easy way to write external memory parallel algorithms, especially numerical algorithms processing large matrices, that run at near in-memory speed on SSD-resident data. At its core, it pipelines calls to an existing math library (Intel MKL) on in-memory data blocks with prefetching via standard Linux asynchronous IO calls (io_submit) that use NVMe block drivers to access the SSD. The in-memory math and the IO calls can easily be replaced with other libraries. The size of matrices that the library can handle is limited by the size of the SSD.

The BLAS-on-flash library is intended for programmers who already write multi-threaded code in C++ using shared memory pointers. The interface of the library is C++ based and designed to make it easy to integrate with such code with a few modifications.

Typically, programmers writing high-performance native code track data objects with pointers and manipulate the data by passing these pointers to user-defined functions or linked libraries that perform operations such as matrix multiplication. In a similar spirit, the BLAS-on-flash library interface provides a flash_ptr<T> type to refer to large, possibly SSD-resident, objects in place of the pointer type T*. The programmer can invoke functions provided by the library that operate on objects of type flash_ptr<T>. The library allows users to define new functions that operate on flash_ptr<T> by specializing the Task class. Programmers can define programs by stitching together a directed acyclic graph (DAG) of tasks using existing code with library-defined and user-defined tasks. In this section, we show how to use each of these functionalities.

2.1 The flash_ptr<T> type

The flash_ptr<T> is a replacement for in-memory pointers of type T* that allows the programmer to handle large blocks of SSD-resident data. There are two ways of creating a new flash_ptr<T>. The first is by allocating a large block on the disk using the allocator provided by the library. For example, using

flash_ptr<int> mat = flash::malloc<int>(len);

in place of

int *mat = (int *)malloc(len);

allows the user to create a large scratch space on SSD. The second way to create a flash_ptr<T> is to map it to an existing file. For example, using

flash_ptr<float> mat_fptr = map_file<float>(mat_file, READWRITE);

allows read and write accesses to the floating point matrix in the file named mat_file. This allows the user to read and write to the file through the library. For example, the following writes N elements to the file mapped to mat_fptr from an in-memory buffer mat_ptr.

flash::write_sync(mat_fptr, mat_ptr, N);

The flash_ptr<T> type supports pointer arithmetic and can be cast and used as a normal pointer through memory mapping for functionality not supported by the library.

float* mmap_mat_ptr = mat_fptr.ptr;

2.2 Library Kernels

BLAS-on-flash kernels are functions that operate on flash_ptr<T> types, and are designed to be drop-in replacements for in-memory calls that operate on T* types. Kernels we have implemented include:

• gemm: Takes two input matrices A, B of type flash_ptr<float> and outputs C := α · op(A) · op(B) + β · C, where α and β are floating point scalars, and op(·) is either the matrix X or its transpose. The library allows the striding and layout choices that a standard BLAS gemm call would offer.

• csrmm: Performs the same computation as gemm, but on a sparse A in Compressed Sparse Row (CSR) format, and allows op(X) only on B. In addition to the version where all matrices are of type flash_ptr<float>, we also provide a variant where B and C are in-memory pointers. The CSR format stores three arrays: the non-zero values ordered first by row and then by column, the column index of each non-zero value, and the offsets into the two previous arrays where each row starts.

• csrgemv: Takes a sparse matrix A on disk and computes c := op(A) · b for in-memory vectors b and c, where op(X) = X or X^T.

• csrcsc: Converts a sparse matrix in CSR form into its Compressed Sparse Column (CSC) form with both inputs and outputs as flash_ptr types. This is equivalent to computing the transpose of the input matrix.

• kmeans: Given seed centers and input data points, all as flash_ptr<float> types, the kernel runs a specified number of Lloyd's iterations and overwrites the seeds with the final cluster centroids.

• sort: Sorts an array of type flash_ptr<T> using a user-defined comparator using the parallel SampleSort algorithm.

Using these kernels, one can overcome the memory limitations faced by their in-memory variants. For example, using csrmm and csrgemv, one could implement an eigensolver for flash-resident matrices. In a later section, we describe complex algorithms using these and other custom kernels to process large amounts of data.
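To make the calling pattern concrete, the sketch below strings the pieces of this section together: map SSD-resident inputs with flash_ptr<T>, allocate SSD-backed scratch for the output, and invoke a kernel as a drop-in replacement for an in-memory BLAS call. The exact header names, namespaces, template arguments, and the gemm argument order are assumptions made for illustration; only the overall usage pattern follows the interface described above.

```cpp
#include <cstdint>
// Hypothetical header names; the paper does not list the actual include paths.
#include "flash_ptr.h"
#include "flash_kernels.h"

int main() {
  const uint64_t m = 32768, k = 32768, n = 32768;

  // SSD-resident inputs mapped from existing files, in place of float* buffers.
  flash::flash_ptr<float> A = flash::map_file<float>("A.bin", flash::READWRITE);
  flash::flash_ptr<float> B = flash::map_file<float>("B.bin", flash::READWRITE);

  // SSD-backed scratch space for the output, analogous to malloc().
  flash::flash_ptr<float> C = flash::malloc<float>(m * n);

  // Drop-in analogue of an in-memory sgemm: C := 1.0 * A * B + 0.0 * C.
  flash::gemm('N', 'N', m, n, k, 1.0f, A, B, 0.0f, C);

  // For functionality the library does not cover, fall back to a mapped pointer.
  float* c_view = C.ptr;
  (void)c_view;
  return 0;
}
```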

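The csrmm and csrcsc kernels above assume the standard CSR layout; for completeness, here is a minimal illustration of the three arrays that the csrmm bullet describes. The struct and field names are ours, not the library's.

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row layout for the 3x4 matrix
//   [ 5 0 0 2 ]
//   [ 0 0 3 0 ]
//   [ 0 1 0 4 ]
struct CsrMatrix {
  std::vector<float>    vals;     // non-zero values, ordered row by row
  std::vector<uint64_t> col_idx;  // column index of each value in `vals`
  std::vector<uint64_t> row_off;  // row i occupies [row_off[i], row_off[i+1]) above
};

CsrMatrix example() {
  return CsrMatrix{
      {5.0f, 2.0f, 3.0f, 1.0f, 4.0f},  // vals
      {0, 3, 2, 1, 3},                 // col_idx
      {0, 2, 3, 5},                    // row_off: rows start at 0, 2, 3; 5 non-zeros total
  };
}
```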
2.3 Tasks and Computation Graphs

A BLAS-on-flash kernel operating on large inputs is composed of smaller units of computation called tasks. New tasks are defined using the Task interface of the library. The Task interface allows users to define in-memory computations on smaller portions of the input and provides a mechanism to compose a computation graph by allowing parent-child relationships between tasks. Our scheduler guarantees that a child task will not be executed until its parent tasks are complete.

Task inputs and outputs are uniquely described using an access specifier: ⟨flash_ptr<T>, StrideInfo⟩. Here flash_ptr<T> describes the start of an access and StrideInfo describes an access pattern starting at flash_ptr<T>. An access pattern could be a:

• Strided access to retrieve a matrix block that touches a small strip – i.e. a subset – of each row/column of a dense matrix. This is specified using 3 parameters: the number of strides, the access length per stride (strip size), and the length to stride before the next access. For the matrix block b in Figure 2, these are n, l, and s respectively.

• Single contiguous access to a chunk of data, equivalent to a strided access with only one strip.

In addition to specifying the input and output, the user must supply the execute function to complete the definition of a new task. Inputs specified in the task interface are made available as in-memory buffers for the execute function to access. Once the execution is complete, the buffers marked as outputs through the task interface are written back to their corresponding location on file. Figure 1(a) illustrates a task G^k_{i,j}, its inputs (A_{i,k}, B_{k,j}, C_{i,j}), and the computation in its execute as a block-matrix multiplication on its inputs using an in-memory gemm call.

A user can create a new kernel by specifying a directed acyclic graph (DAG) with a Task at each node and directed edges from parent task to child task. Once a Task's parents are specified, the user injects it through the BLAS-on-flash Scheduler interface. By allowing tasks to be injected into the scheduler at runtime, the user can specify data-dependent computation graphs required for certain algorithms like eigensolvers.

Figures 1a and 1b illustrate the gemm kernel and the DAG associated with its implementation using the Block Matrix Multiplication algorithm. For inputs A, B, and C, shown with 16 blocks for each matrix, the output block C_{i,j} is given by C_{i,j} := β · C_{i,j} + α · Σ_{k=0}^{3} A_{i,k} · B_{k,j}. The inner summation is converted into an accumulate chain by using a task G^k_{i,j} in Figure 1a for each k. G^k_{i,j} indicates the dependence between successive tasks in the accumulate chain using arrows from a parent task to its child task. Figure 1a illustrates the composition of the gemm kernel using accumulate chains and Figure 1b gives the complete DAG for A, B, and C as the inputs and C as the output. The parallel composition operator X ∥ Y allows both X and Y to execute in parallel, while the serial composition operator X → Y allows Y to execute only after X.

The task injection and the logic required for creating a DAG corresponding to a kernel are packaged into a single module representing the kernel. This helps the user adapt in-memory code to use our library by modifying one computational kernel at a time. We demonstrate this by adapting the memory-intensive kernels in the ISLE topic modeling algorithm.

3 Library Design

In this section, we enumerate the technical challenges our library solves to achieve near in-memory performance for matrix operations with a small amount of DRAM. To motivate these challenges, we use the example of a gemm kernel on single precision (32-bit) floating point matrices A, B, and C of sizes 32768×32768 each, blocked as shown in Figure 1. Assume that the matrix block size is 8192×8192. Consider its execution on a machine, test, with 32 cores capable of 1 TFLOP/s, and an NVMe SSD with sustained read and write bandwidths of 3GB/s and 0.5GB/s respectively.

3.1 Pipelining

Since our library focuses on batch compute, utilizing disk bandwidth, rather than minimizing disk access latency, is critical for performance. Since PCIe-based SSDs have limited bandwidth, where possible, the library must overlap prefetching inputs and writing back outputs with actual computation. To maximize throughput from the system, we saturate all available cores with compute and use hyper-threading to perform IO. To illustrate the balance between computation and IO, we present an analysis of the gemm kernel.

Each task in the gemm kernel performs 1 TFLOP of compute on 768MB of input to produce 256MB of output. On our test system, each such task requires 0.75s of IO time for 1s of compute (using all 32 threads per task). Since every task has the same IO and compute requirements, a gemm kernel with 64 tasks would take 112s to execute out of memory, instead of 64s if executed completely in memory. However, if we prefetch and issue write-backs for the next and previous tasks respectively during any task's compute time, we can in principle complete the kernel in 64.75s, accounting for the prefetch of the first task and the write-back of the last task. In reality, however, we noticed that mixing reads and writes results in less than peak bandwidths for just reading and just writing (measured with the fio tool [4]). As a result, the kernel runs slower than would be expected with perfect pipelining. Getting the kernel to run at in-memory speed requires additional measures to save bandwidth.
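The 0.75s, 112s and 64.75s figures quoted above follow directly from the stated block size and bandwidths; written out, with one 8192×8192 single-precision block occupying 256MB:

t_read  = (3 × 256 MB) / (3 GB/s) ≈ 0.25 s
t_write = 256 MB / (0.5 GB/s) ≈ 0.5 s
t_IO    = t_read + t_write ≈ 0.75 s per task

T_unpipelined = 64 × (1 s + 0.75 s) = 112 s
T_pipelined   ≈ 64 × 1 s + 0.75 s = 64.75 s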

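To make the Task interface of Section 2.3 concrete, here is a minimal sketch of what one accumulate-chain task G^k_{i,j} from Figure 1a might look like. The base-class name, the add_input/add_inout helpers, and the buffer() accessor are illustrative assumptions; only the overall shape — declare access specifiers for inputs and outputs, then supply an execute that runs an in-memory MKL gemm on the prefetched buffers — follows the description above.

```cpp
#include <cstdint>
#include <mkl.h>  // the in-memory gemm executed inside the task

// Hypothetical names; the library's actual Task interface may differ in detail.
class GemmBlockTask : public flash::Task {
 public:
  GemmBlockTask(flash::AccessSpecifier a_ik, flash::AccessSpecifier b_kj,
                flash::AccessSpecifier c_ij, uint64_t blk, float alpha, float beta)
      : a_(a_ik), b_(b_kj), c_(c_ij), blk_(blk), alpha_(alpha), beta_(beta) {
    // Declaring inputs/outputs lets the scheduler prefetch and write back buffers.
    add_input(a_);
    add_input(b_);
    add_inout(c_);
  }

  // Invoked once all declared inputs are resident in memory.
  void execute() override {
    auto* a = static_cast<float*>(buffer(a_));
    auto* b = static_cast<float*>(buffer(b_));
    auto* c = static_cast<float*>(buffer(c_));
    // One block multiply: C_ij := alpha * A_ik * B_kj + beta * C_ij.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                blk_, blk_, blk_, alpha_, a, blk_, b, blk_, beta_, c, blk_);
  }

 private:
  flash::AccessSpecifier a_, b_, c_;
  uint64_t blk_;
  float alpha_, beta_;
};
```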
[Figure 1 (three panels): (a) the blocked matrices A, B, and C (4×4 blocks each) and the accumulate-chain tasks; (b) the complete task DAG for the gemm kernel; (c) sector sharing between adjacent unaligned output blocks such as C_{2,2} and C_{2,3}. The panels define:
gemm(A, B, C, α, β) := C ← α · A · B + β · C
G^k_{i,j} := gemm(A_{i,k}, B_{k,j}, C_{i,j}, α, β) with scalar β for k = 0 and 1 for k ≠ 0
G_{i,j} := G^0_{i,j} → G^1_{i,j} → G^2_{i,j} → G^3_{i,j}
gemm(A, B, C, α, β) = G_{0,0} ∥ G_{0,1} ∥ · · · ∥ G_{3,2} ∥ G_{3,3}]

Figure 1: The gemm kernel, its DAG using the Task interface, and sector-sharing among adjacent output blocks in C.

[Figure 2 shows a block b inside a flash-resident M×N matrix B stored in row-major layout, with l = access length per stride, s = stride length, and n = number of strides.]

Figure 2: ⟨b, {l, s, n}⟩ is an access specifier for block b of a flash-resident matrix B stored in row-major layout.

3.2 Buffer Sharing

To increase pipelining efficiency, we use knowledge of task inputs and outputs to increase reuse of data prefetched into memory. Certain kernels like gemm perform O(n^3) computation on inputs of size O(n^2). This gap between computation and IO implies that prefetched data can be reused. A naive task scheduler executing the gemm kernel using the DAG in Figure 1b can perform (n/b)^3 reads, resulting in multiple copies of the same data being present in memory. In Figure 1b, if G^0_{0,0} and G^0_{1,0} execute in parallel, a naive scheduler would prefetch B_{0,0} twice. Data duplication and redundant IO adversely affect pipeline efficiency, and can also reduce the effective available memory, which could stall prefetching. Redundant IO results in more mixing of read and write accesses to disk than necessary, resulting in worse performance than expected.

We reduce such redundancy by mapping access specifiers to unique buffers. This allows the BLAS-on-flash scheduler to issue only one read to fetch B_{0,0}, and the same buffer is provided to both G^0_{0,0} and G^0_{1,0} as read-only buffers.

3.3 Prioritized Scheduling

How does one build a real-time online scheduler that can maximize buffer sharing among tasks in a dynamic DAG with little context? In a First-In-First-Out (FIFO) task scheduler, any occurrence of buffer sharing is purely accidental. We propose a heuristic to select a task to prefetch based on the data currently buffered in memory and the input requirements of the tasks in the DAG that are ready at the moment.

Our heuristic selects the task that requires the fewest number of bytes to be prefetched given the current content of the memory buffer. This forces buffer sharing and maximizes overlap with in-memory buffers while reading the minimum number of bytes from disk. For kernels like gemm, shown in Figure 1b, our heuristic keeps an output matrix block, C_{i,j}, in memory and executes the accumulate chain G_{i,j} on it. C_{i,j} is written back to disk only once, at the end of the chain G_{i,j}. Furthermore, our heuristic schedules accumulate chains with high input and output locality, i.e. chains that use adjacent blocks in both input and output matrices. For sparse kernels, like csrmm, our heuristic achieves optimal buffer sharing by scheduling tasks with a common input block.

Consider the following example on the gemm kernel (Figure 1). Let M = {A_{0,0}, A_{1,0}, A_{1,1}, B_{0,0}, B_{1,0}, B_{1,1}, C_{0,0}, C_{1,0}, C_{1,1}} be the set of all matrix blocks in memory and let the following be four tasks that are part of the gemm kernel:

G^1_{0,0} := gemm(A_{0,1}, B_{1,0}, C_{0,0}, α, 1)
G^0_{1,0} := gemm(A_{1,0}, B_{0,0}, C_{1,0}, α, β)
G^1_{1,1} := gemm(A_{1,1}, B_{1,1}, C_{1,1}, α, 1)
G^1_{1,0} := gemm(A_{1,1}, B_{1,0}, C_{1,0}, α, 1)

Let G^1_{0,0}, G^0_{1,0}, and G^1_{1,1} be the latest 3 tasks to complete execution. Since G^0_{1,0} has completed, its child, G^1_{1,0}, is now ready for execution. By scheduling G^1_{1,0} instead of a next-in-queue task, G^1_{1,0} can immediately start execution without requiring any IO. Since outputs from the accumulate chains G_{0,0}, G_{0,1}, G_{1,0}, and G_{1,1} exhibit high locality, such nearby accumulate chains execute concurrently and share more buffers than non-nearby chains.
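A compact sketch of the selection rule just described: among the ready tasks, prefetch the one whose inputs need the fewest bytes that are not already resident. The ReadyTask representation and helper names below are hypothetical; only the selection criterion follows Section 3.3.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical representation of a ready task: the byte ranges its inputs occupy on disk.
struct ReadyTask {
  std::vector<std::pair<uint64_t, uint64_t>> inputs;  // (offset, length) pairs
};

// Bytes of task `t` not already buffered; `resident` holds offsets of cached ranges.
// A real implementation would key on access specifiers rather than raw offsets.
uint64_t bytes_missing(const ReadyTask& t,
                       const std::unordered_set<uint64_t>& resident) {
  uint64_t missing = 0;
  for (const auto& [offset, length] : t.inputs)
    if (resident.find(offset) == resident.end()) missing += length;
  return missing;
}

// Pick the ready task that can start with the least additional IO.
int pick_next_task(const std::vector<ReadyTask>& ready,
                   const std::unordered_set<uint64_t>& resident) {
  int best = -1;
  uint64_t best_missing = std::numeric_limits<uint64_t>::max();
  for (int i = 0; i < static_cast<int>(ready.size()); ++i) {
    uint64_t m = bytes_missing(ready[i], resident);
    if (m < best_missing) { best_missing = m; best = i; }
  }
  return best;  // -1 if no task is ready
}
```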

[Figure 3 shows the BLAS-on-flash software stack: Kernels sit on top of the Scheduler, which uses the Program Cache and Prioritizer; these drive the I/O Executor and File Handle layers over the OS kernel I/O interface.]

Figure 3: The BLAS-on-flash software stack

4 Architecture

The BLAS-on-flash library implementation consists of the software stack in Figure 3. We describe the role of each of the 5 layers:

File Handle provides a read-write interface using access specifiers for all library calls. Specialized implementations can be provided for different hardware IO interfaces (e.g. SATA or network). We implement the IO interface for SSDs using the Linux NVMe driver. This implementation uses the Linux kernel asynchronous IO syscall interface, io_submit, to submit IO jobs and io_getevents to reap completions of operations on flash-resident files. The io_submit syscall interface provides a simple way to bypass kernel buffers and execute asynchronous IO at sector level. We chose this over user-space NVMe drivers like SPDK [25] and unvme [40] because of its simplicity. Each access specifier for large matrix blocks translates into a list of contiguous accesses that can be listed using an array of iocb structs and submitted at once. The large number of simultaneous accesses saturates the available bandwidth.

IO Executor consists of a threadpool that accepts IO requests generated by the Program Cache and executes non-overlapping requests concurrently. The overlap check is necessary for ensuring correctness since the IO layer does not attempt to serialize access to sectors. Consider Figure 1c. When matrix block sizes are aligned to a multiple of the flash device's sector size, C_{2,2} and C_{2,3} can be read from and/or written into concurrently. However, if C_{2,2} and C_{2,3} happen to share sectors on the disk, i.e. there exist some sectors on the device containing portions of both C_{2,2} and C_{2,3}, writes to the common sectors must be ordered to avoid data corruption. Every write access is advertised and followed up with an overlap check against other threads' writes. If a conflict is detected, the thread adds the access to a local backlog and requests a different access to execute. Backlog accesses are revisited in the next cycle and executed if no overlap is detected.

Program Cache tracks BLAS-on-flash memory usage, maps access specifiers to buffers and stores each mapping as a unique entry. Each entry is given one of four tags: Active (A), Prefetch (P), Write-Back (W), or Zero-Reference (Z). Tag A against an entry indicates an active reference, i.e., at least one task has a reference to the buffer. Tag P indicates that a prefetch associated with the access specifier is in progress, W indicates a write-back in progress, and Z otherwise. The cache uses this information to serve the following types of requests.

• Batch HIT/MISS requests ask if buffers corresponding to a batch of access specifiers are in memory. A response is returned for each unique access specifier.

• A COMMIT request asks the cache to commit a task to memory by issuing prefetch requests to the IO Executor. A task is said to be committed if all its inputs and outputs can be allocated by either provisioning unused memory or by evicting unused buffers (i.e. buffers whose entry has a Z tag). If the request is successful, some buffers with the Z tag are evicted to free up memory; this causes some write-backs, and prefetch requests are enqueued to a backlog. Program Cache state is unchanged if the request is unsuccessful.

• A RELEASE request returns a task's inputs and outputs to the cache. Reference counts are decreased for all returned buffers, and entries tagged A with zero references are tagged with Z and made available for eviction in the future.

• A SERVICE request checks entries tagged W for completion of write-backs and frees the associated buffer if the operation is complete. If memory is available, backlog prefetch requests are issued to the IO Executor and are tagged with P.

• A GET request queries the status of entries tagged P that have issued prefetch requests. If an entry tagged P has finished prefetching, it is tagged with A and the associated buffer is returned as the response; a NULL buffer is returned otherwise.

• A FLUSH request evicts all buffers tagged Z and issues write-backs if required. Entries tagged A, P, and W are not affected.

Prioritizer implements scheduling heuristics by maintaining a list of ready tasks and issuing HIT/MISS requests to the Program Cache. Other task selection heuristics can be implemented in place of the Prioritizer.

Scheduler provides an interface for users to inject tasks at runtime and track their progress through the 4 stages of the pipeline — Wait, Prefetch, Compute and Complete. It also provides an interface to enable/disable the Prioritizer, enable/disable overlap checks in the IO Executor, and exposes the Program Cache's FLUSH request to the user.
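The File Handle layer above batches sector-level reads through the Linux AIO syscalls. The sketch below shows the general shape of turning one strided access specifier ⟨start, {l, s, n}⟩ into a single io_submit batch using libaio. It illustrates the mechanism named in the text (io_submit/io_getevents over an array of iocb structs), not the library's actual File Handle code, and it omits the O_DIRECT buffer alignment and error handling a real implementation needs.

```cpp
#include <libaio.h>  // Linux AIO: io_setup, io_prep_pread, io_submit, io_getevents
#include <cstdint>
#include <vector>

// Read n strips of l bytes, spaced s bytes apart, starting at `start`, into dst.
bool read_strided(int fd, uint64_t start, uint64_t l, uint64_t s, uint64_t n, char* dst) {
  io_context_t ctx = 0;
  if (io_setup(static_cast<int>(n), &ctx) != 0) return false;

  std::vector<iocb> cbs(n);
  std::vector<iocb*> cbps(n);
  for (uint64_t i = 0; i < n; ++i) {
    // One contiguous read per stride: l bytes at offset start + i * s.
    io_prep_pread(&cbs[i], fd, dst + i * l, l, static_cast<long long>(start + i * s));
    cbps[i] = &cbs[i];
  }

  // Submit the whole batch at once; many in-flight requests keep the SSD busy.
  if (io_submit(ctx, static_cast<long>(n), cbps.data()) != static_cast<long>(n)) {
    io_destroy(ctx);
    return false;
  }

  // Reap completions; a real implementation would overlap this with compute.
  std::vector<io_event> events(n);
  long done = io_getevents(ctx, n, n, events.data(), /*timeout=*/nullptr);
  io_destroy(ctx);
  return done == static_cast<long>(n);
}
```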

6 To select the next task to prefetch, the Scheduler IO. For this, we build on well-established results on ex- queries the Prioritizer and issues a ALLOCATE request ploiting locality in matrix multiplications [31, 29, 5]. We to the Program Cache. A task is said to be Complete if also use the fact that BLAS and sparseBLAS computa- it has finished all 4 stages. It is then removed from the tions can be tiled so that they write the output to disk just DAG and a RELEASE request is issued for the task to once [13, 12], thus saving on write bandwidth. the Program Cache. gemm. The block matrix multiplication algorithm in Figure 1 requires O(n3) floating point operations for 3 5 Algorithms and Evaluation n × n matrices. With block size b, it reads O(n /b) bytes from disk and writes O(n2) bytes back. It is ideal We now discuss the implementation and performance of for the library to increase the block size b as much as its the kernels provided by the library as well as algorithms in-memory buffer allows so as to decrease the amount of built using the library — eigensolvers, an SVD-based IO required. Figure 4 presents the ratio of running times algorithm for topic modeling, and inference in extreme of the in-memory MKL gemm call to that of our library multi-label learning. We will compare the running time for various reduction dimension sizes in two cases: and memory requirements of in-memory and SSD-based • 512-aligned. A matrix is 512-aligned if the size versions of these algorithms. We use Intel MKL 2018 of its leading dimension is a multiple of 512. For and Ubuntu 16.04LTS in all our experiments. example, a 1024x1000 float matrix in row-major 5.1 Machines layout that would require 4096 bytes for each row is 512-aligned. Table 1 lists the configurations of machines used to eval- uate our library. sandbox is a high-end bare-metal • unaligned. A matrix is said to be unaligned if it server with enterprise class Samsung PM1725a SSD ca- is not 512-aligned. For example, a 500x500 ma- pable of sustained read speeds of up to 4GB/s and write trix with 32-bit floats in row-major form would speeds of up to 1GB/s for the strided accesses created require 2000 bytes, which is not a multiple of 512, by our library. z840 is chosen to represent a typi- and is said to be unaligned. cal bare-metal workstation machine configured with two The distinction between 512-aligned and unaligned ma- Samsung 960EVO SSDs in RAID0 configuration. This trices is important as the two cases generate a different provides sustained read speed of about 3GB/s and write number of disk access when a block of the matrix is to be L32s VM speed of about 2.2GB/s. is a virtual machine fetched or written to. To flush an unaligned matrix block, configured for heavy IO. We believe that the underly- we need to read in the start and end sectors of each row in ing hardware supports very high bandwidths; however, the block, overwrite them with new values and then issue the hypervisor throttles it to 1.6GB/s sustained reads and a write to disk. In the case of aligned matrix blocks, we M64-32ms VM writes or 160K IO ops/second. is a vir- need only one write to flush it to the disk. tual machine with 1.7TB RAM which we use for run- We define the reduction dimension to be the dimen- ning in-memory experiments that require large amounts sion along which summation is carried out during ma- of memory. The Apache Spark instances used for com- trix multiplication. Using notation from Figure 1, if all DS14v2 VM parison run MLLib 2.1 on clusters. 
three matrices, A, B, and C, are stored in row-major form, reduction dimension is the number of columns in Name Processor Cores RAM SSD A, or equivalently the number of rows in B. For a given sandbox Gold 6140 36 512GB 3.2TB block size, increasing reduction dimension increases the z840 E5-2620v4 16 32GB 2TB length of the accumulate chain, which in turn translates L32s VM E5-2698Bv3 32 256GB 6TB to fewer disk writes per FLOP. Since write bandwidth is M64-32ms VM E7-8890v3 32 1.7TB – low, and writes affect read bandwidth as well, our sched- DS14v2 VM E5-2673v3 16 112GB – uler should use a smaller ratio of writes to FLOPs to im- prove performance for larger reduction dimension. Fig- Table 1: Intel Xeon-based machines used in experiments. ure 4 demonstrates that this is indeed the case. In fact, because of careful pipelining, our library outperforms in- memory MKL calls in many instances. 5.2 Matrix kernels Further, as expected, BLAS-on-flash performs better General Matrix Multiply (gemm) and Sparse (CSR) Ma- on 512-aligned instances than in the unaligned instances. trix Multiply (csrmm) are perhaps the most used kernels We attribute this to the increased number of requests is- in math libraries. Therefore, it is important to optimize sued to the disk in the unaligned case. their performance with careful selection of tiling patterns csrmm. The csrmm kernel requires O(n3 ∗ s) float- and prefetch and execution orders in order to minimize ing point operations on an input matrices of n × n di-

7 1.8 1.4 1.6 1.3 1.2 1.4 1.1 1.2 1 1 0.9 0.8 0.8 0.7 0.6 0.6 16384 32768 65536 131072 262144 15000 31000 63000 127000 255000 z840 L32s VM sandbox z840 L32s VM sandbox Figure 4: Ratio of in-memory MKL gemm to BLAS-on-flash gemm running times for 512-aligned (left) and unaligned (right) instances for various values of reduction dimension (d). The matrix dimensions are 215 × d × 215 and 31000 × d×31000 for the aligned and unaligned plots. BLAS-on-flash library has a 8GB Program Cache. gemm tasks in BLAS- on-flash library use 4 threads each and the number of simultaneous tasks is detemined by Program Cache budget. mensions with sparsity s (this represents an input size Dataset #Cols #Rows NNZs Tokens Size of O(n2(1 + s)) and output size of n2). For a ma- Small 8.15M 140K 428M 650M 10.3GB trix whose sparsity is uniform across rows and columns, Medium 22M 1.56M 6.3B 15.6B 151GB with a block size of b, the compute to IO ratio is only Large 81.7M 2.27M 22.2B 65B 533GB O(bs) as opposed to b for gemm. For sparse matri- ces such as those in Table 2 arising from text data (e.g. Table 2: Sparse matrices bag-of-words text data sets. bag-of-words representation) , sparsity can be as low as Columns and rows of the matrix represent the documents s = 10−4. Therefore, although the execution of tasks and words in the vocabulary of a text corpora. The (i, j)- in csrmm loaded into memory is slower (sparse opera- th entry of the matrix represents the number of times the tions are 10 − 100× slower than dense operations), the j-th word in the vocabulary occurs in the i-th document. low locality (bs as opposed to b) makes it hard to always obtain near in-memory performance. Figure 5 demon- means that the matrices are 512-aligned. A performance strates the effect of sparsity on csrmm by fixing the drop is expected in the unaligned case just as in gemm. problem dimensions at (220 × 217 × 212) and measuring Table 3 compares the performance of the csrmm in the ratio of in-memory to BLAS-on-flash running times BLAS-on-flash to the in-memory version implemented for s ∈ {10−2, 10−3, 10−4}. It is evident that the effi- by MKL on z840 , L32s VM and sandbox ma- ciency of the csrmm kernel decreases with sparisty. chines. z840 is too small to run the in-memory ver- sion for all three data sets because it has only 32GB 1 RAM. Since projecting the Large dataset into 1024 di- 0.8 mensions requires 559GB of RAM, both L32s VM and 0.6 sandbox are unable to do it in memory. As an ap- 0.4 proximation to the speed of an in-memory call on L32s

0.2 VM we ran it on M64-32ms VM which has 1.7TB RAM. Despite a sparsiy of 2×10−4, the csrmm in BLAS-on- 0 0.01 0.001 0.0001 flash is about 50% as fast as its in-memory counterpart on z840 L32s VM sandbox the Medium dataset (when the dense matrices are in row- major layout). We picked row-major order for dense ma- Figure 5: Ratio of in-memory MKL csrmm to BLAS- trices because our library was able to outperform MKL’s on-flash csrmm running times for (220 × 217 × 212) csrmm implementation for column-major order by over sized instances and various values of sparsity. BLAS-on- 2× on Small and Medium datasets. We attribute this to flash library has a 8GB Program Cache. csrmm tasks in poor multi-threading in MKL’s implementation. We con- BLAS-on-flash library use 4 threads each and the number clude that the BLAS-on-flash csrmm kernel is reasonably of simultaneous tasks is detemined by Program Cache. efficient even on large and extremely-sparse inputs. We also benchmark the csrmm call required to project csrcsc.To support the compressed sparse column for- the sparse bag-of-words datasets listed in Table 2 into mat (CSC), we implemented a kernel csrcsc to convert a 1024-dimensional space (say, obtained from Principal a large sparse matrix between the CSR and CSC formats. Component Analysis). The dense input matrix and out- This kernel is relatively fast and takes about 100 seconds put matrices are in a row-major format. Note that this for a 100GB matrix. Intel MKL’s csrcsc call fails on

Medium and Large data sets.

Dataset   z840 (flash)   L32s VM (in-mem)   L32s VM (flash)   sandbox (in-mem)   sandbox (flash)
Small     34.7           8.2                24.5               6.9                35.2
Medium    135.75         58.5               101.3              49.5               98.0
Large     636.2          512.3*             390.6              –                  354.9

Table 3: Running times in seconds for csrmm operations that project the datasets in Table 2 into 1024 dimensions. BLAS-on-flash has a 16GB Program Cache. *This is run on M64-32ms VM as an approximation to L32s VM.

5.3 Eigensolver

Eigen-decomposition is widely used in data analytics, e.g., dimensionality reduction. Given a symmetric matrix A, a symmetric eigensolver attempts to find k eigenvalue-eigenvector pairs (λ_i, v_i) such that

A v_i = λ_i v_i  ∀i
v_i^T v_j = 0, ‖v_i‖_2 = 1  ∀i ≠ j
|λ_1| ≥ |λ_2| ≥ ... ≥ |λ_k|

Popular dimensionality reduction techniques like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) use the symmetric eigenvalue decomposition (syevd) to compute the projection matrices required for dimensionality reduction. The following equations describe the SVD of a matrix M, and how it may be formulated as a symmetric eigen-decomposition problem.

M v_i = σ_i u_i, ‖v_i‖_2 = ‖u_i‖_2 = 1  ∀i
u_i^T u_j = 0, v_i^T v_j = 0  ∀i ≠ j
|σ_1| ≥ |σ_2| ≥ ... ≥ |σ_k|
M M^T u_i = σ_i^2 u_i,  M^T M v_i = σ_i^2 v_i
svd(M) = syevd(M M^T) = syevd(M^T M)

To showcase the versatility of our library, we implement a symmetric eigensolver and time it on large matrices (in CSR format) obtained from the text corpora in Table 2. Eigensolvers are typically iterative and involve many subroutines. Among the many flavors of eigensolvers available, we picked the Krylov-subspace class of algorithms because they have been shown to be stable for a wide variety of matrices. This class of algorithms uses iterated sparse matrix-vector (csrgemv) products to converge on the solution pairs.

csrgemv is bandwidth bound and not suitable for an eigensolver operating on SSD-resident matrices. To overcome this limitation, we implemented the Restarted Block Krylov-Schur (Block KS) [63] algorithm. The Block KS algorithm can potentially use fewer matrix accesses to achieve the same tolerance by using a csrmm kernel in place of csrgemv, expanding the Krylov basis several columns at a time. Although the Block KS algorithm performs extra computation compared to its non-block variants, this extra work is highly parallel and the IO saved makes up for the extra compute.

The running time of an eigensolver depends not only on the input size and the number of eigenpairs desired, but also on the distribution of the eigenvalues of the matrix. The further apart the eigenvalues are spaced, the faster the convergence. The largest singular values of the sparse matrices we use are well separated – this helps Block KS converge faster without many restarts.

Evaluation. We benchmark both our in-memory and flash-based single node implementations of the Block KS algorithm against a single node and a distributed implementation of the Implicitly Restarted Arnoldi Method (IRAM) algorithm. The single node version is provided by Spectra [46], a C++ header-only implementation of ARPACK [36], while the distributed version is computeSVD in the Apache Spark MLLib library (v2.1). The Spark job was deployed on both a shared and a dedicated Hadoop cluster through YARN [53] to workers with 1 core and 8GB memory each and a driver node with 96GB memory. The shared cluster uses Xeon E5-2450L processors and 10Gb Ethernet, while the dedicated cluster uses DS14v2 VM nodes.

Table 4 compares the time taken to solve for the top singular values of the sparse matrices in Table 2 to a tolerance of 10^-4 (this is sufficient for the SVD-based topic modeling algorithm described in the next subsection). It must be noted that computeSVD uses double precision floating point numbers while our algorithm uses single precision. We solve for only 200 singular values on the Large data set and 500 on the Medium data set because the Spark solver would not solve for more. On the other hand, our implementation easily scales to thousands of singular values.

The flash version of Block KS works almost as fast as the in-memory version. Further, both Block KS implementations easily outperform both the Spectra and Spark jobs in time to convergence. Spark does not see any benefit from more workers beyond a point; in fact it becomes slower. These results demonstrate that our flash-based eigensolver utilizes hardware order(s) of magnitude more efficiently than distributed methods.

5.4 SVD-based Topic Modeling

Topic modeling involves the recovery of underlying topics from a text corpus where each document is represented by the frequency of words that occur in it. Mathematically, the problem posits the existence of a topic matrix M whose columns M_{.l} are probability distributions over the vocabulary of the corpus. The observed data

9 Dataset Block Krylov-Schur Spectra computeSVD (shared) computeSVD (dedicated) (#eigenvalues) L32s VM sandbox Number of Spark Executors Number of Spark Executors in-mem flash in-mem flash 64 128 256 512 64 128 256 512 Medium(500) 76 182 63 95 934 320 275 365 450 460 225 228 226 Large (200) 154∗ 429 – 153 – – – 169 230 236 126 104 164 Table 4: Time (in minutes) to compute eigenvalues. For both Medium and Large datasets, Block KS is run with block=25. For Medium, nev=500 and ncv=2500 and for Large, nev=200 and ncv=1500. We run Block KS in-memory on M64-32ms VM as an approximation to L32s VM . Spark MLLib’s computeSVD was timed with 64, 128, 256, 512 workers with 8GB memory on both a shared and a dedicated cluster. The Large dataset needs at least 256 workers to run on the shared clsuter. On standalone cluster with 64 works, the Large dataset needed 10GB memory per worker. is assumed to be generated by (1) picking a matrix W , Dataset Sample sandbox L32s VM whose columns sum to one and represent linear combina- (# Topics) Rate in-mem flash in-mem flash tions of topic columns in M, (2) calculating P = MW , Small(1000) 1.0 15 27 18 37 where the j-th column P.j represents the probability of Medium(1000) 0.1 46 66 63 72 words in the document j, and (3) sampling the observed Medium(2000) 0.1 119 144 158 212 documents A.j using a multinomial distribution based on Large(1000) 0.1 – 149 163* 172 the p.d.f. P.j. The computational problem is to recover Large(2000) 0.1 – 228 285* 279 the underlying topic matrix M, given the observations A. Large(5000) 0.1 – 522 980* 664 Large(2000) 0.4 – 532 684* 869 ISLE, or Importance Sampling for Learning Edge top- ics, is a direct adaption of the TSVD algorithm [6] for Table 5: Running time of the ISLE algorithm in minutes. recovering topic models. It is based on linear-algebraic *We use M64-32ms VM as an approximation to L32s techniques (unlike other LDA based algorithms which VM for the Large dataset. are based on MCMC techniques) that can provably re- cover the underlying topic matrix under reasonable as- ple nodes [37]. sumptions on the observed data. Empirically, it has been On the Medium data set where it is possible to run shown to yield qualitatively better topics in real world an in-memory version, notice that the code linked to data. The open source implementation is faster than other BLAS-on-flash achieved about 65−80% in-memory per- single node implementations of any topic modeling al- formance while requiring under 128GB of main mem- gorithms [51]. It takes as input bag-of-words represen- ory. For the Large dataset, the flash version running tation for documents in CSR or CSC format, and does on sandbox is able to perform better than the in- the following steps: (1) threshold to denoise the data, memory version on M64-32ms VM . We think this is (2) use SVD to compute a lower dimensional space to because sandbox has newer hardware, and because the project the documents into, (3) cluster documents using eigensolver and kmeans kernels written using BLAS-on- k-means++ initialization and the k-means algorithm on flash achieve near in-memory performance. the projected space, (4) use the resultant clusters to seed clustering in the original space using the k-means algo- rithm, and finally (5) construct the topic model. For large 5.5 Extreme Multi-Label Learning datasets, sampling techniques can be used to pick a sub- Extreme multi-label (XML) learning addresses the prob- set of data for the expensive steps (2), (3), and (4). 
We lem of automatically annotating a data point with the linked ISLE to the BLAS-on-flash framework to leverage most relevant subset of labels from an extremely large la- the flash-based Block KS eigensolver and the clustering bel set. It has many applications in tagging, ranking and algorithms built in our framework. recommendation. For instance, one might wish to build Evaluation. Table 5 compares the running times of the an extreme multi-label classifier that recommends a sub- in-memory version and the version that uses the BLAS- set of millions of Amazon items that a user might wish on-flash library. Using this redesigned pipeline, we were to buy or view next after buying or viewing a given item. able to train a 5000-topic model with a DRAM require- Many popular XML algorithms use tree based meth- ment of 1.5TB on both L32s VM and sandbox ma- ods due to low training and prediction times. Here, we chines with only 32GB budget for in-memory buffers. present experiments with two such algorithms which use We note that the number of tokens in this data set (about ensembles of trees: PfastreXML [30] and Parabel [44]. 65 billion) is in the same ballpark as the number of In a current deployment, both these algorithms train tokens processed by LDA-based topic modeling algo- an ensemble of trees (50 trees for PfastreXML, and 3 for rithms in Parameter Server based systems that use multi- Parabel) from 40 million data points, each of which is a

10 sparse vector in a 4.5 million-dimensional space. Once takes about 440 hours and 900 hours for PfastreXML and trained, each tree in the ensemble predicts label proba- Parabel inference, respectively, on Azure D14 v2 SKUs bilities/ranks for 250 million test data points. The test with 112GB RAM and 16 cores. The orchestration re- dataset takes 500GB of space when stored in a sparse quired to complete the inference in under two days is format. Both training and inference are difficult to scale complex and increases the likelihood of disk, node or – training requires weeks on a machine with multiple network failures. terabytes of RAM, and inference currently consumes In the code we started from, PfastreXML inference in- dozens of machines. As XML algorithms are applied to volves a depth-first traversal of a non-balanced binary larger search problems and advertisement domains, it is tree while Parabel inference requires breadth-first beam projected that they need to scale to much larger datasets search of a balanced binary tree. In both cases, the no- with billions of points and hundreds of millions of labels ticed that the baseline code was inefficient. We improved (e.g. web search), and train trees that are hundreds of the code to take a batch of test data points (about 2-4 mil- gigabytes in size. lion per batch) and traverse the tree in a Breadth-First or- In such scenarios, because of the memory limitations der, i.e., level by level. With this transformation, the in- of the platforms on which these algorithms are deployed, memory running times of the inference code improved orchestrating data and models out of SSDs becomes crit- by about 6× on nodes with a large amount of RAM. We ical. We demonstrate the capabilities of our library in feel this is close to the limit of how fast this code can run such cases. We focus on inference here since it is run at based on DDR3 bandwidth. a much higher frequency than training — typically once We use the BLAS-on-flash library to orchestrate level- every week in ad recommendation applications. Similar by-level traversal of the trees for a batch of points with techniques can be applied for training. the Task interface. We construct one task for a (level, PfastreXML: For training, trees are grown by recur- batch) pair and dynamically inject new tasks into the sively partitioning nodes starting at the root until each scheduler. Since inference is data parallel for a given tree is fully grown. Nodes are split by learning a hy- (level, batch) pair, we execute multiple batches in par- perplane which partitions training points between left allel and rely on the BLAS-on-flash library to prefetch and right children. Node partitioning terminates when a levels and points. node contains less than a certain number of points. Leaf Evaluation. We compare the in-memory and BLAS- nodes contain a probability distribution over label. Dur- on-flash -based versions of the inference code on mod- ing inference, predictions are made in logarithmic time els in two regimes – medium and large. The medium- by passing a test point down from the root to a leaf in sized models consist of 20GB trees containing about 25 each of the trees in ensemble. At each internal node, the million nodes, while the large Parabel model consists of test point is sent to the left or right child based on its 122GB trees. The Medium sized models fit in the mem- position with respect to the hyperplane at that node. 
ory of the largest machines used in the inference plat- Parabel: For training, trees are grown by recursively form, while the large dataset does not fit in the memory partitioning the nodes by distributing the labels assigned of any machine in the platform. We use a total of 50 to the node in equal amounts to each of its two children. trees for PfastreXML and 3 trees for Parabel inference. Nodes containing less than a user specified number of Our test data consists of 250 million points drawn from labels are split into multiple leaf nodes with each la- a 4.3 million dimensional space and is about 500GB in bel allotted to a separate leaf. Each tree node contains compressed sparse format. a probabilistic linear classifier which decides whether a We benchmark both inference algorithms on z840 , data point has some relevant labels in its left or right sub- L32s VM , and sandbox and use 221 points per batch tree. The node classifiers are trained so as to maximize for z840 and 222 points for L32s VM and sandbox . the aposteriori probability distribution over the training The size of Program Cache for the BLAS-on-flash li- data. For inference, predictions are made in logarithmic brary is 20GB for z840 and 40GB for L32s VM and time by passing a test data point down a fixed number sandbox . We use 32 compute threads on z840 and of paths, at most 10 in our case, in each of the balanced L32s VM and 64 threads on sandbox . trees of the ensemble. Along each such path from root to Tables 6 and 7 present the running time and memory a leaf, the probabilities estimated by classifiers are multi- requirement of our code on the medium- and large-sized plied to give the corresponding path probability. Finally, models. Using the BLAS-on-flash library, the inference the labels are ranked in the decreasing order of the aver- code runs at over 90% of in-memory speed using only a age probabilities where average is over paths that lead to third of the required memory. The memory requirement the label in different trees. can be further reduced by decreasing the data batch size The inference code downloaded for both algorithms or splitting each level of the tree into multiple tasks. The from the XML repository [8] is single-threaded and reduction in working set with practically no impact on

11 PfastreXML (50 trees) Parabel (3 trees) FaRM [20] and UPC [22, 15, 62, 32] that present an uni- in-mem flash in-mem flash fied view of the entire memory available in a distributed sandbox 45 (155) 51.0 (42) 27.3 (125) 25.3 (47.6)system present an alternative for programs considered L32s VM 69.2 (149) 67.0 (42) 44.3 (123) 45.8 (48) here to scale to larger data and model sizes. However, z840 – 118 (26.2) – 71.5 (30.5)the network bandwidth available presents a barrier to the scalability of sparse kernels just as in the case of Spark. Table 6: Running time in hours and peak DRAM usage in Further, with careful co-design, we feel that a large range GB (inside paranthesis) for XML inference on an ensem- of workloads (of up to a few terabytes in size) can be ble of medium-sized trees for 250 × 106 data points. We processed on a single node without the cost overhead of used 64 threads on sandbox and 32 threads on L32s a cluster of RDMA-enabled nodes. Scaling our library to VM and z840 . such systems remains future work.

Time (hours) RAM (GB) in-mem flash in-mem flash 7 Conclusion sandbox 51.7 57.0 241.3 80.1 Our results suggests that operating on data stored in fast L32s VM 108.4 118.2 235.5 80.9 non-volatile memories on a singlde node could provide an efficient alternative to distributed big-data systems for Table 7: Running time in hours and peak DRAM us- training and inference of industrial scale machine learn- age (in GB) for inference on the large Parabel model ing models for algorithms that are not computation in- for 250 million data points. We used 64 threads on tensive. On complicated numerical algorithms such as sandbox and 32 threads on L32s VM and allocate eigensolvers, we demonstrated that careful co-design of 70GB as Program Cache budget. algorithm and software stack can offer large gains in hardware utilization and keep the costs of data analyt- performance critically enables us to execute inference on ics pipelines low. Further, our library provides a higher larger models (for ranking and relevance tasks) that do value proposition for the large quantity of NVM storage not fit in DRAM, for greater accuracy. that has already been deployed as storage in data centers. Our library can also be adapted to support GPU and other PCIe storage devices like Optane with minor changes. 6 Other Related Work

Recent work [12, 7, 13] has studied parallel and sequen- 8 Acknowledgments tual external memory algorithms in the setting where writes to non-volatile memories are much more expen- The authors would like to thank Anirudh Badam, Ravi sive than reads. They conclude that for kernels like sort- Kannam, Muthian Sivathanu, and Manik Varma for their ing and FFTs, decreasing writes to non-volatile external useful comments and advice. memory is possible at the price of more reads. Fortu- nately, in the case of linear algebra, this is not the case. 9 Availability Simple reordering of the matrix tiles on which the in- memory computation is performed can achieve asymp- The topic modeling training and extreme multi-label totic reduction in the amount of writes for gemm and learning inference code that we have adapted to use the csrmm calls without increase in reads. We use this ob- BLAS-on-flash library can be downloaded from the two servation extensively in our work. following sites: While our system uses existing processing and mem- github.com/Microsoft/ISLE ory hardware, new hardware and accelerators that move manikvarma.org/downloads/XC/XMLRepository.html computation to the memory have been proposed. For ex- We intend to release our library, the eigensolvers, and the ample, [1] proposes how expensive access patterns such adapted code on github once the conditions of anonynomity as shuffle, tranpose, pack/unpack might be performed in are removed. accelerator co-located with DRAM, and analyzes poten- tial energy gains for math kernels from such accelerators. References Further, systems that proposes moving entire workloads [1] AKIN,B.,FRANCHETTI, F., AND HOE, J. C. Data reorgani- to memory systems have been proposed [48, 24, 55]. zation in memory using 3d-stacked dram. In Proceedings of the Some of the techniques in our work are complementary 42Nd Annual International Symposium on Computer Architecture to this line of work. (New York, NY, USA, 2015), ISCA ’15, ACM, pp. 131–143. Partitioned Global Address Space systems such as [2] AMD. RadeonTM Pro SSG, 2018.

[3] ARULRAJ, J., AND PAVLO, A. How to build a non-volatile memory database management system. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD '17, ACM, pp. 1753–1758.
[4] AXBOE, J. Fio - flexible io tester. URL http://freecode.com/projects/fio (2014).
[5] BALLARD, G., DEMMEL, J., HOLTZ, O., AND SCHWARTZ, O. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications 32, 3 (2011), 866–901.
[6] BANSAL, T., BHATTACHARYYA, C., AND KANNAN, R. A provable svd-based algorithm for learning topics in dominant admixture corpus. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2014), NIPS'14, MIT Press, pp. 1997–2005.
[7] BEN-DAVID, N., BLELLOCH, G. E., FINEMAN, J. T., GIBBONS, P. B., GU, Y., MCGUFFEY, C., AND SHUN, J. Parallel algorithms for asymmetric read-write costs. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2016), SPAA '16, ACM, pp. 145–156.
[8] BHATIA, K., DAHIYA, K., JAIN, H., PRABHU, Y., AND VARMA, M. The extreme classification repository: Multi-label datasets and code.
[9] BILENKO, M., FINLEY, T., KATZENBERGER, S., KOCHMAN, S., MAHAJAN, D., NARAYANAMURTHY, S., WANG, J., WANG, S., AND WEIMER, M. Salmon: Towards production-grade, platform-independent distributed ml. In The ML Systems Workshop at ICML (2016).
[10] BLACKFORD, L. S., DEMMEL, J., DONGARRA, J., DUFF, I., HAMMARLING, S., HENRY, G., HEROUX, M., KAUFMAN, L., LUMSDAINE, A., PETIET, A., POZO, R., REMINGTON, K., AND WHALEY, R. C. An updated set of basic linear algebra subprograms (blas). ACM Trans. Math. Softw. 28, 2 (June 2002), 135–151.
[11] BLEI, D. M., NG, A. Y., AND JORDAN, M. I. Latent dirichlet allocation. J. Mach. Learn. Res. 3 (Mar. 2003), 993–1022.
[12] BLELLOCH, G. E., FINEMAN, J. T., GIBBONS, P. B., GU, Y., AND SHUN, J. Sorting with asymmetric read and write costs. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2015), SPAA '15, ACM, pp. 1–12.
[13] CARSON, E., DEMMEL, J., GRIGORI, L., KNIGHT, N., KOANANTAKOOL, P., SCHWARTZ, O., AND SIMHADRI, H. V. Write-avoiding algorithms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (May 2016), pp. 648–658.
[14] CHEN, J., LI, K., ZHU, J., AND CHEN, W. Warplda: A cache efficient o(1) algorithm for latent dirichlet allocation. Proc. VLDB Endow. 9, 10 (June 2016), 744–755.
[15] CHEN, W.-Y., BONACHEA, D., DUELL, J., HUSBANDS, P., IANCU, C., AND YELICK, K. A performance analysis of the berkeley upc compiler. In Proceedings of the 17th Annual International Conference on Supercomputing (New York, NY, USA, 2003), ICS '03, ACM, pp. 63–73.
[16] DASK DEVELOPMENT TEAM. Dask: Library for dynamic task scheduling, 2016.
[17] DHULIPALA, L., BLELLOCH, G., AND SHUN, J. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2017), SPAA '17, ACM, pp. 293–304.
[18] DINH, D., SIMHADRI, H. V., AND TANG, Y. Extending the nested parallel model to the nested dataflow model with provably efficient schedulers. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2016), SPAA '16, ACM, pp. 49–60.
[19] DMTK. Multiverso: Parameter server for distributed machine learning, 2015.
[20] DRAGOJEVIC, A., NARAYANAN, D., CASTRO, M., AND HODSON, O. Farm: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2014) (April 2014).
[21] DUFF, I. S., HEROUX, M. A., AND POZO, R. An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum. ACM Trans. Math. Softw. 28, 2 (June 2002), 239–267.
[22] EL-GHAZAWI, T., AND SMITH, L. Upc: Unified parallel c. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2006), SC '06, ACM.
[23] GITTENS, A., DEVARAKONDA, A., RACAH, E., RINGENBURG, M., GERHARDT, L., KOTTALAM, J., LIU, J., MASCHHOFF, K., CANON, S., CHHUGANI, J., SHARMA, P., YANG, J., DEMMEL, J., HARRELL, J., KRISHNAMURTHY, V., MAHONEY, M. W., AND PRABHAT. Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies. ArXiv e-prints (July 2016).
[24] GUO, Q., GUO, X., BAI, Y., AND İPEK, E. A resistive tcam accelerator for data-intensive computing. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (New York, NY, USA, 2011), MICRO-44, ACM, pp. 339–350.
[25] INTEL. Storage performance development kit (spdk), 2016.
[26] INTEL®. Optane™ memory, 2017.
[27] INTEL®. Math Kernel Library Sparse BLAS level 2 and 3 routines, 2018.
[28] INTEL®. Math Kernel Library Sparse BLAS level 2 and 3 routines, 2018.
[29] IRONY, D., TOLEDO, S., AND TISKIN, A. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64, 9 (Sept. 2004), 1017–1026.
[30] JAIN, H., PRABHU, Y., AND VARMA, M. Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (August 2016).
[31] JIA-WEI, H., AND KUNG, H. T. I/o complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1981), STOC '81, ACM, pp. 326–333.
[32] KAMIL, A., ZHENG, Y., AND YELICK, K. A local-view array library for partitioned global address space c++ programs. In Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (New York, NY, USA, 2014), ARRAY'14, ACM, pp. 26:26–26:31.
[33] KANNAN, R., VEMPALA, S., AND VETTA, A. On clusterings: Good, bad and spectral. J. ACM 51, 3 (May 2004), 497–515.
[34] KUMAR, A., SINDHWANI, V., AND KAMBADUR, P. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (2013), ICML'13, JMLR.org, pp. I-231–I-239.
[35] KYROLA, A., BLELLOCH, G., AND GUESTRIN, C. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI'12, USENIX Association, pp. 31–46.
[36] LEHOUCQ, R., MASCHHOFF, K., SORENSEN, D., AND YANG, C. ARPACK software, 2009.
[37] LI, M., ANDERSEN, D. G., PARK, J. W., SMOLA, A. J., AHMED, A., JOSIFOVSKI, V., LONG, J., SHEKITA, E. J., AND SU, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2014), OSDI'14, USENIX Association, pp. 583–598.
[38] MCSHERRY, F., ISARD, M., AND MURRAY, D. G. Scalability! but at what cost? In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2015), HOTOS'15, USENIX Association, pp. 14–14.
[39] MENG, X., BRADLEY, J., YAVUZ, B., SPARKS, E., VENKATARAMAN, S., LIU, D., FREEMAN, J., TSAI, D., AMDE, M., OWEN, S., XIN, D., XIN, R., FRANKLIN, M. J., ZADEH, R., ZAHARIA, M., AND TALWALKAR, A. Mllib: Machine learning in apache spark. J. Mach. Learn. Res. 17, 1 (Jan. 2016), 1235–1241.
[40] MICRONSSD. UNVMe - A User Space NVMe Driver, 2016. https://github.com/MicronSSD/unvme.
[41] NICKOLLS, J., BUCK, I., GARLAND, M., AND SKADRON, K. Scalable parallel programming with cuda. Queue 6, 2 (Mar. 2008), 40–53.
[42] NVIDIA. cuSPARSE library, 2017.
[43] PCI-SIG. Pci express base specification revision 4.0, version 1.0, October 2017.
[44] PRABHU, Y., KAG, A., HARSOLA, S., AGRAWAL, R., AND VARMA, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the International World Wide Web Conference (April 2018).
[45] PRABHU, Y., AND VARMA, M. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2014), KDD '14, ACM, pp. 263–272.
[46] QIU, Y. Spectra - Sparse Eigenvalue Computation Toolkit as a Redesigned ARPACK, 2015. https://spectralib.org.
[47] SCALEMP™. vSMP Foundation Flash Expansion, 2018.
[48] SHAFIEE, A., NAG, A., MURALIMANOHAR, N., BALASUBRAMONIAN, R., STRACHAN, J. P., HU, M., WILLIAMS, R. S., AND SRIKUMAR, V. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture (Piscataway, NJ, USA, 2016), ISCA '16, IEEE Press, pp. 14–26.
[49] SHUN, J., AND BLELLOCH, G. E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2013), PPoPP '13, ACM, pp. 135–146.
[50] SHUN, J., ROOSTA-KHORASANI, F., FOUNTOULAKIS, K., AND MAHONEY, M. W. Parallel local graph clustering. Proc. VLDB Endow. 9, 12 (Aug. 2016), 1041–1052.
[51] SIMHADRI, H. V. Svd and importance sampling-based algorithms for large scale topic modeling. https://github.com/Microsoft/ISLE, 2017.
[52] SORENSEN, D. C. Implicit application of polynomial filters in a k-step arnoldi method. SIAM Journal on Matrix Analysis and Applications 13, 1 (1992), 357–385.
[53] VAVILAPALLI, V. K., MURTHY, A. C., DOUGLAS, C., AGARWAL, S., KONAR, M., EVANS, R., GRAVES, T., LOWE, J., SHAH, H., SETH, S., ET AL. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (2013), ACM, p. 5.
[54] VITTER, J. S. External memory algorithms and data structures: Dealing with massive data. ACM Comput. Surv. 33, 2 (June 2001), 209–271.
[55] WANG, K., ANGSTADT, K., BO, C., BRUNELLE, N., SADREDINI, E., TRACY, II, T., WADDEN, J., STAN, M., AND SKADRON, K. An overview of micron's automata processor. In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (New York, NY, USA, 2016), CODES '16, ACM, pp. 14:1–14:3.
[56] WEIMER, M., CHEN, Y., CHUN, B.-G., CONDIE, T., CURINO, C., DOUGLAS, C., LEE, Y., MAJESTRO, T., MALKHI, D., MATUSEVYCH, S., ET AL. Reef: Retainable evaluator execution framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015), ACM, pp. 1343–1355.
[57] XIANYI, Z. Openblas, 2017.
[58] XING, E. P., HO, Q., DAI, W., KIM, J.-K., WEI, J., LEE, S., ZHENG, X., XIE, P., KUMAR, A., AND YU, Y. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2015), KDD '15, ACM, pp. 1335–1344.
[59] YUAN, J., GAO, F., HO, Q., DAI, W., WEI, J., ZHENG, X., XING, E. P., LIU, T.-Y., AND MA, W.-Y. Lightlda: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web (Republic and Canton of Geneva, Switzerland, 2015), WWW '15, International World Wide Web Conferences Steering Committee, pp. 1351–1361.
[60] YU, L., ZHANG, C., SHAO, Y., AND CUI, B. Lda*: A robust and large-scale topic modeling system. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1406–1417.
[61] ZAHARIA, M., CHOWDHURY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2010), HotCloud'10, USENIX Association, pp. 10–10.
[62] ZHENG, Y., KAMIL, A., DRISCOLL, M. B., SHAN, H., AND YELICK, K. Upc++: A pgas extension for c++. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (Washington, DC, USA, 2014), IPDPS '14, IEEE Computer Society, pp. 1105–1114.
[63] ZHOU, Y., AND SAAD, Y. Block Krylov–Schur method for large symmetric eigenvalue problems. Numerical Algorithms 47, 4 (Apr 2008), 341–359.