On Linear Learning with Manycore Processors

On Linear Learning with Manycore Processors Eliza Wszola Celestine Mendler-Dunner¨ y Department of Computer Science Department of Electrical Engineering and Computer Sciences ETH Zurich UC Berkeley Zurich, Switzerland Berkeley, US [email protected] [email protected] Martin Jaggi Markus Puschel¨ School of Computer and Communication Sciences Department of Computer Science EPFL ETH Zurich Lausanne, Switzerland Zurich, Switzerland martin.jaggi@epfl.ch [email protected] Abstract—A new generation of manycore processors is on the limitations to perform it given straightforward C (or worse, rise that offers dozens and more cores on a chip and, in a sense, Java, Python, etc.) code, a problem that has been known fuses host processor and accelerator. In this paper we target the already for the earlier, simpler multicore and single core efficient training of generalized linear models on these machines. We propose a novel approach for achieving parallelism which systems [3]. Challenges include vector instruction sets, deep we call Heterogeneous Tasks on Homogeneous Cores (HTHC). cache hierarchies, non-uniform memory architectures, and It divides the problem into multiple fundamentally different efficient parallelization. tasks, which themselves are parallelized. For evaluation, we The challenge we address in this paper is how to map ma- design a detailed, architecture-cognizant implementation of our chine learning workloads to manycore processors. We focus on scheme on a recent 72-core Knights Landing processor that is adaptive to the cache, memory, and core structure. Our library recent standalone manycores and the important task of training efficiently supports dense and sparse datasets as well as 4- generalized linear models used for regression, classification, bit quantized data for further possible gains in performance. and feature selection. Our core contribution is to show that in We show benchmarks for Lasso and SVM with different data contrast to prior approaches, which assign the same kind of sets against straightforward parallel implementations and prior subtask to each core, we can often achieve significantly better software. In particular, for Lasso on dense data, we improve the state-of-the-art by an order of magnitude. overall performance and adaptivity to the system resources, by Index Terms—Manycore, performance, machine learning, co- distinguishing between two fundamentally different tasks. A ordinate descent, GLM, SVM, Lasso subset of the cores will be assigned a task A that only reads the model parameters, while the other subset of cores will perform I. INTRODUCTION a task B that updates them. So in the manycore setting, while The evolution of mainstream computing systems has moved the cores are homogeneous, we show that assigning them het- from the multicore to the manycore area. This means that erogeneous tasks results in improved performance and use of a few dozen to even hundreds of cores are provided on a compute, memory, and cache resources. The adaptivity of our single chip, packaged with up to hundreds of gigabytes of approach is particularly crucial: the number and assignment memory at high bandwidth. Examples include Intel Xeon Phi of threads can be adapted to the computing platform and the (up to 72 cores), ARM ThunderX2 (64 cores), Qualcomm problem at hand. arXiv:1905.00626v6 [cs.PF] 5 Feb 2020 Centriq 2400 (48 cores), and of course GPUs (100s of cores). We make the following contributions: One declared target of the recent generation of manycores 1) We describe a novel scheme, consisting of two hetero- is machine learning. While much work has been devoted geneous tasks, to train generalized linear models on ho- to efficient learning and inference of neural nets on GPUs, mogeneous manycore systems. We call it Heterogeneous e.g. [1], [2], other domains of machine learning and manycores Tasks on Homogeneous Cores (HTHC). have received less attention. 2) We provide a complete, performance-optimized imple- One exciting trend in manycore is the move from ac- mentation of HTHC on a 72-core Intel Xeon Phi proces- celerators (like GPUs) to standalone manycore processors. sor. Our library supports both sparse and dense data as These remove the burden of writing two types of code and well as data quantized to 4 bits for further possible gains enable easier integration with applications and legacy code. in performance. Our code is publicly available1. However, the efficient mapping of the required mathematics 3) We present a model for choosing the best distribution to manycores is a difficult task as compilers have inherent of threads for each task with respect to the machine’s yWork conducted while at IBM Research, Zurich. 1https://github.com/ElizaWszola/HTHC memory system. We demonstrate that with explicit con- Different such measures exist and they depend either on the trol over parallelism our approach provides an order dataset, the current model parameters, or both. They can be of magnitude speedup over a straightforward OpenMP used for the selection of important coordinates during the implementation. coordinate descent procedure, speeding up overall conver- 4) We show benchmarks for Lasso and SVM with different gence, either by deriving sampling probabilities (importance data sets against straightforward parallel implementations sampling, e.g. [6], [7]), or by simply picking the parameters and prior software. In particular, for Lasso on dense data, with the highest importance score (greedy approach, e.g. [8], we improve the state-of-the-art by an order of magnitude. [9]). II. PROBLEM STATEMENT &BACKGROUND C. Duality-gap based coordinate selection This section details the considered problem class, provides A particular measure of coordinate-wise importance that necessary background on asynchronous stochastic coordinate we will adopt in this paper, is the coordinate-wise duality descent and coordinate selection, and introduces our target gap certificate proposed in [10]. The authors have shown platform: the Intel Knights Landing (KNL) manycore proces- that choosing model parameters to update based on their sor. contribution to the duality gap provides faster convergence than random selection and classical importance sampling [11]. A. Problem specification To define the duality gap measure we denote the convex ∗ ∗ We focus on the training of generalized linear models conjugates of gi by gi , defined as gi (v) := maxv vu − gi(u). (GLMs). In mathematical terms this can be expressed by the Then, the duality gap (see [12]) of our objective (1) can be following optimization task: written as X X min F(α) := f(Dα) + gi(αi); (1) gap(α; w) = gapi(αi; w); with n α2R i2[n] i2[n] ∗ gapi(αi; w) := αihw; dii + gi(αi) + gi (−hw; dii); (2) where [n] = f1; : : : ; ng, f and gi are convex functions, and α 2 Rn is the model to be learned from the training data where the model vector α; w are related through the primal- d×n matrix D 2 R with columns d1;:::; dn. The function f is dual mapping w := rf(Dα). Importantly, knowing the assumed to be smooth. This general setup covers many widely parameters α and w, it is possible to calculate the duality applied machine learning models including logistic regression, gap values (2) for every i 2 [n] independently and thus in support vector machines (SVM), and sparse models such as parallel. In our implementation, we introduce the auxiliary Lasso and elastic-net. vector v = Dα from which w can be computed using a simple linear transformation for many problems of interest. B. Coordinate descent A popular method for solving machine learning problems D. The Knights Landing architecture of the form (1) is stochastic (a.k.a. random) coordinate de- Intel Knights Landing (KNL) is a manycore processor P scent, where the separability of the term g := i gi(αi) is architecture used in the second generation Intel Xeon Phi crucial for its efficiency. Coordinate descent methods [4] form devices, the first host processors, i.e., not external accelerators, a group of algorithms which minimize F coordinate-wise, offered in this line. It provides both high performance (with i.e., by performing the optimization across multiple iterations machine learning as one declared target) and x86 backwards i 2 f1;:::;T g, where each iteration updates a single model compatibility. A KNL processor consists of 64–72 cores with parameter αi using the corresponding data column di, while low base frequency (1.3–1.5 GHz). KNL offers AVX-512, a the other parameters remain fixed. In this way, an optimization vector instruction set for 512-bit data words, which allows over a complex model with many variables can be split into a parallel computation on 16 single or 8 double precision floats. sequence of one-dimensional optimization problems. Note that It also supports vector FMA (fused multiply-add) instructions this approach can also be extended to batches where a subset of (e.g., d = ab + c) for further fine-grained parallelism. Each coordinates is updated at a time. While the coordinate descent core can issue two such instructions per cycle, which yields a algorithm is by its nature sequential, it can be parallelized such theoretical single precision peak performance of 64 floating that the coordinates are updated asynchronously in parallel point operations (flops) per cycle. Additionally, AVX-512 by multiple threads. It can be shown that such a procedure introduces gather-scatter intrinsics facilitating computations on maintains convergence guarantees if the delay between reading sparse data formats. The KNL cores are joined in pairs called and updating the model vector is small enough [5]. This delay, tiles located on a 2D mesh. Each core has its own 32 KB L1 which is a proxy for the staleness of the model information cache and each tile has a 1 MB L2 cache. The latter supports used to compute each update, depends on the data density and two reads and one write every two cycles.

Load more