TabulaROSA: Tabular Operating System Architecture for Massively Parallel Heterogeneous Compute Engines

Jeremy Kepner1−4, Ron Brightwell5, Alan Edelman2,3, Vijay Gadepally1,2,4, Hayden Jananthan1,4,6, Michael Jones1,4, Sam Madden2, Peter Michaleas1,4, Hamed Okhravi4, Kevin Pedretti5, Albert Reuther1,4, Thomas Sterling7, Mike Stonebraker2

1MIT Lincoln Laboratory Supercomputing Center, 2MIT Computer Science & AI Laboratory, 3MIT Mathematics Department, 4MIT Lincoln Laboratory Cyber Security Division, 5Sandia National Laboratories Center for Computational Research, 6Vanderbilt University Mathematics Department, 7Indiana University Center for Research in Extreme Scale Technologies

Abstract—The rise in computing hardware choices is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations. In this context, an operating system can be viewed as software that brokers and tracks the resources of the compute engines and is akin to a database management system. To explore the idea of using a database in an operating system role, this work defines key operating system functions in terms of rigorous mathematical semantics (associative array algebra) that are directly translatable into database operations. These operations possess a number of mathematical properties that are ideal for parallel operating systems by guaranteeing correctness over a wide range of parallel operations. The resulting operating system equations provide a mathematical specification for a Tabular Operating System Architecture (TabulaROSA) that can be implemented on any platform. Simulations of forking in TabulaROSA are performed using an associative array implementation and compared to Linux on a 32,000+ core supercomputer. Using over 262,000 forkers managing over 68,000,000,000 processes, the simulations show that TabulaROSA has the potential to perform operating system functions on a massively parallel scale. The TabulaROSA simulations show 20x higher performance as compared to Linux while managing 2000x more processes in fully searchable tables.

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

I. INTRODUCTION

Next generation computing hardware is increasingly purpose built for simulation [1], data analysis [2], and machine learning [3]. The rise in computing hardware choices: general purpose central processing units (CPUs), vector processors, graphics processing units (GPUs), tensor processing units (TPUs), field programmable gate arrays (FPGAs), optical computers, and quantum computers is driving a reevaluation of operating systems. Even within machine learning, there has been a trend toward developing more specialized processors for different stages and types of deep neural networks (e.g., training vs inference, dense vs sparse networks). Such hardware is often massively parallel, distributed, heterogeneous, and non-deterministic, and must satisfy a wide range of security requirements. Current mainstream operating systems (OS) can trace their lineages back 50 years to computers designed for basic office functions running on serial, local, homogeneous, deterministic hardware operating in benign environments (see Figure 1). Increasingly, these traditional operating systems are bystanders at best and impediments at worst to using purpose-built processors. This trend is illustrated by the current GPU programming model whereby the user engages with a conventional OS to acquire the privilege of accessing the GPU and then implements most OS functions (managing memory, processes, and IO) inside their own application code.

    Computing Past              Computing Present
    ------------------------    ------------------------
    Serial                      Massively Parallel
    Local                       Distributed
    Homogeneous                 Heterogeneous
    Deterministic               Non-Deterministic
    OS Managed: Processes,      User Managed: Processes,
      Memory, Files,              Memory, Files,
      Communications, Security    Communications, Security
    TX-0, PDP-11, Intel x86     Manycore Processors,
                                Massive Data Centers,
                                Specialized Processors

Fig. 1. Current operating systems can trace their lineage back to the first computers and still have many features that are designed for that era. Modern computers are very different and currently require the user to perform most operating system functions.

The role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations [4]–[6]. Traditional operating systems like Linux are built to execute a shared kernel on homogeneous cores [7]. Popcorn Linux [8] and K2 [9] run multiple Linux instances on heterogeneous cores. Barrelfish [10] and FOS (Factored Operating System) [11] aim to support many heterogeneous cores over a distributed system. NIX [12], based on Plan 9 [13], relaxes the requirement on executing a kernel on every core by introducing application cores. Helios [14], a derivative from Singularity [15], reduces the requirements one step further by using software isolation instead of address space protection. Thereby, neither a memory management unit nor a privileged mode is required. In order to address the increasing role of accelerators, the M3 operating system goes further and removes all requirements on processor features [16].

In this context, an OS can be viewed as software that brokers and tracks the resources of compute engines. Traditional supercomputing schedulers currently fill the role of managing heterogeneous resources but have inherent scalability limitations [17]. In many respects, this new operating system role is akin to the traditional role of a database management system (DBMS) and suggests that databases may be well suited to operating system tasks for future hardware architectures. To explore this hypothesis, this work defines key operating system functions in terms of rigorous mathematical semantics (associative array algebra) that are directly translatable into database operations. Because the mathematics of database table operations are based on a linear system over the union and intersection semiring, these operations possess a number of mathematical properties that are ideal for parallel operating systems by guaranteeing correctness over a wide range of parallel operations. The resulting operating system equations provide a mathematical specification for a Tabular Operating System Architecture (TabulaROSA) that can be implemented on any platform. Simulations of selected TabulaROSA functions are performed with an associative array implementation on state-of-the-art, highly parallel processors. The measurements show that TabulaROSA has the potential to perform operating system functions on a massively parallel scale with 20x higher performance.

II. STANDARD OS AND DBMS OPERATIONS

TabulaROSA seeks to explore the potential benefits of implementing OS functions in a way that leverages the power and mathematical properties of database systems. This exploration begins with a brief description of standard OS and DBMS functions.

Many concepts in modern operating systems can trace their roots to the very first time-sharing computers. Unix was first built in 1970 on the Digital Equipment Corporation (DEC) Programmed Data Processor (PDP) [18], [19], which was based on the MIT Lincoln Laboratory Transistorized eXperimental computer zero (TX-0) [20]–[23]. Modern operating systems, such as Linux, are vast, but their core concepts can be reduced to a manageable set of operations, such as those captured in the Xv6 operating system [24]. Xv6 is a teaching operating system developed in 2006 for MIT's operating systems course 6.828: Operating System Engineering. Xv6 draws inspiration from Unix V6 [25] and Lions' Commentary on UNIX, 6th Edition [26]. An elegant aspect of Unix is that its primary interface is the C language, which is also the primary language that programmers use to develop applications in Unix. Xv6 can be summarized by its C language kernel system calls that define its interface to the user programmer (see Figure 2). This subset of system calls is representative of the services that are core to many Unix based operating systems and serves as a point of departure for TabulaROSA. At a deeper level, many of the Xv6 operating system functions can be viewed as adding, updating, and removing records from a series of C data structures that are similar to purpose-built database tables (a view sketched in code at the end of this section).

  System call                Description
  fork()                     Create process
  exit()                     Terminate current process
  wait()                     Wait for a child process to exit
  kill(pid)                  Terminate process pid
  getpid()                   Return current process's id
  sleep(n)                   Sleep for n seconds
  exec(filename, *argv)      Load a file and execute it
  sbrk(n)                    Grow process's memory by n bytes
  open(filename, flags)      Open a file; flags indicate read/write
  read(fd, buf, n)           Read n bytes from an open file into buf
  write(fd, buf, n)          Write n bytes to an open file
  close(fd)                  Release open file fd
  dup(fd)                    Duplicate fd
  pipe(p)                    Create a pipe and return fd's in p
  chdir(dirname)             Change the current directory
  mkdir(dirname)             Create a new directory
  mknod(name, major, minor)  Create a device file
  fstat(fd)                  Return info about an open file
  link(f1, f2)               Create another name (f2) for the file f1
  unlink(filename)           Remove a file

Fig. 2. (Adapted from [24]). Xv6 operating system kernel functions.

Modern database systems are designed to perform many of the same functions as purpose-built data analysis, machine learning, and simulation hardware. The key tasks of many modern data processing systems can be summarized as follows [27]:

• Ingesting data from operational data systems
• Data cleaning
• Transformations
• Schema integration
• Entity consolidation
• Complex analytics
• Exporting unified data to downstream systems

To meet these requirements, database management systems perform many operating system functions (see Figure 3) [28].

Fig. 3. (Adapted from [28]). Database management systems provide many of the functions found in an operating system. Nearly all functions are managed via manipulations of database tables. (The figure shows a DBMS composed of a connection manager and admission control; a request processor with parser, request rewriter, optimizer, and executor; a storage manager with access methods, buffer manager, lock manager, and log manager; and shared utilities including a memory manager, process manager, replication services, admin utilities, and a transaction manager.)

An elegant aspect of many DBMSs is that their primary data structures are tables, which are also the primary data structures that programmers use to develop applications on these systems. In many databases, these table operations can be mapped onto well-defined mathematical operations with known mathematical properties. For example, relational (or SQL) databases [29]–[31] are described by relational algebra [32]–[34] that corresponds to the union-intersection semiring ∪.∩ [35]. Triple-store databases (NoSQL) [36]–[39] and analytic databases (NewSQL) [40]–[45] follow similar mathematics [46]. The table operations of these databases are further encompassed by associative array algebra, which brings the beneficial properties of matrix mathematics and sparse linear systems theory, such as closure, commutativity, associativity, and distributivity [47].

The aforementioned mathematical properties provide strong correctness guarantees that are independent of scale and are particularly helpful when trying to reason about massively parallel systems. Intersection ∩ distributing over union ∪ is essential to database query planning and parallel query execution over partitioned/sharded database tables [48]–[54]. Similarly, matrix multiplication distributing over matrix addition ensures the correctness of massively parallel implementations on the world's largest computers [55] and machine learning systems [56]–[58]. In software engineering, the scalable commutativity rule guarantees the existence of a conflict-free (parallel) implementation [59]–[61].
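To make the records-in-tables view concrete, the following minimal Python sketch treats the process table as database rows. The schema, column names, and helper functions are illustrative assumptions, not Xv6 structures or TabulaROSA code.

——————————————————————–
# Minimal sketch: OS process state as rows in a database table.
# Table layout and process IDs are illustrative assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE proc (pid TEXT PRIMARY KEY, parent TEXT, state TEXT)")
db.execute("INSERT INTO proc VALUES ('c321-p123', NULL, 'running')")

def fork(parent_pid, child_pid):
    # Creating a process is an insert into the process table.
    db.execute("INSERT INTO proc VALUES (?, ?, 'runnable')",
               (child_pid, parent_pid))

def exit_(pid):
    # Exiting a process is a delete from the process table.
    db.execute("DELETE FROM proc WHERE pid = ?", (pid,))

fork("c321-p123", "c321-p124")
print(db.execute("SELECT * FROM proc ORDER BY pid").fetchall())
exit_("c321-p124")
——————————————————————–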

III. ASSOCIATIVE ARRAY ALGEBRA

The full mathematics of associative arrays and the ways they encompass matrix mathematics and relational algebra are described in the aforementioned references [35], [46], [47]. Only the essential mathematical properties of associative arrays necessary for describing TabulaROSA are reviewed here. The essence of associative array algebra is three operations: element-wise addition (database table union), element-wise multiplication (database table intersection), and array multiplication (database table transformation). In brief, an associative array A is defined as a mapping from sets of keys to values

  A : K1 × K2 → V

where K1 are the row keys and K2 are the column keys and can be any sortable set, such as integers, real numbers, and strings. The row keys are equivalent to the sequence ID in a relational database table or the process ID or file ID in an OS data structure. The column keys are equivalent to the column names in a database table and the field names in an OS data structure. V is a set of values that forms a semiring (V, ⊕, ⊗, 0, 1) with addition operation ⊕, multiplication operation ⊗, additive identity/multiplicative annihilator 0, and multiplicative identity 1. The values can take on many forms, such as numbers, strings, and sets. One of the most powerful features of associative arrays is that addition and multiplication can be a wide variety of operations. Some of the common combinations of addition and multiplication operations that have proven valuable are standard arithmetic addition and multiplication +.×, the aforementioned union and intersection ∪.∩, and various tropical algebras that are important in finance [62]–[64] and neural networks [65]: max.+, min.+, max.×, min.×, max.min, and min.max.

The construction of an associative array is denoted

  A = A(k1, k2, v)

where k1, k2, and v are vectors of the row keys, column keys, and values of the nonzero elements of A. When the values are 1 and there is only one nonzero entry per row or column, this associative array is denoted

  I(k1, k2) = A(k1, k2, 1)

and when I(k) = I(k, k), this array is referred to as the identity.

Given associative arrays A, B, and C, element-wise addition is denoted

  C = A ⊕ B

or more specifically

  C(k1, k2) = A(k1, k2) ⊕ B(k1, k2)

where k1 ∈ K1 and k2 ∈ K2. Similarly, element-wise multiplication is denoted

  C = A ⊗ B

or more specifically

  C(k1, k2) = A(k1, k2) ⊗ B(k1, k2)

Array multiplication combines addition and multiplication and is written

  C = AB = A ⊕.⊗ B

or more specifically

  C(k1, k2) = ⊕_k A(k1, k) ⊗ B(k, k2)

where k corresponds to the column key of A and the row key of B. Finally, the array transpose is denoted

  A(k2, k1) = A^T(k1, k2)
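The three operations can be prototyped directly. The following minimal Python sketch stores an associative array as a {(row_key, column_key): value} dict and takes the semiring's ⊕ and ⊗ as function arguments; it illustrates the algebra only and is not the D4M implementation.

——————————————————————–
from collections import defaultdict
from functools import reduce

def ewise_add(A, B, add):
    # Element-wise addition: union of the keys (database table union).
    C = dict(A)
    for k, v in B.items():
        C[k] = add(C[k], v) if k in C else v
    return C

def ewise_mult(A, B, mult):
    # Element-wise multiplication: intersection of the keys
    # (database table intersection).
    return {k: mult(A[k], B[k]) for k in A.keys() & B.keys()}

def array_mult(A, B, add, mult):
    # Array multiplication: match A's column key with B's row key and
    # reduce with the semiring addition (database table transformation).
    terms = defaultdict(list)
    for (i, ka), a in A.items():
        for (kb, j), b in B.items():
            if ka == kb:
                terms[(i, j)].append(mult(a, b))
    return {k: reduce(add, vs) for k, vs in terms.items()}

# Example over the tropical max.+ semiring:
A = {("p1", "x"): 1, ("p1", "y"): 2}
B = {("x", "q1"): 3, ("y", "q1"): 4}
print(array_mult(A, B, max, lambda a, b: a + b))  # {('p1', 'q1'): 6}
——————————————————————–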

The above operations have been found to enable a wide range of database algorithms and matrix mathematics while also preserving several valuable mathematical properties that ensure the correctness of parallel execution. These properties include commutativity

  A ⊕ B = B ⊕ A
  A ⊗ B = B ⊗ A
  (AB)^T = B^T A^T

associativity

  (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C)
  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
  (AB)C = A(BC)

distributivity

  A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C)
  A(B ⊕ C) = (AB) ⊕ (AC)

and the additive and multiplicative identities

  A ⊕ 0 = A    A ⊗ 1 = A    AI = A

where 0 is an array of all 0, 1 is an array of all 1, and I is an array with 1 along its diagonal. Furthermore, these arrays possess a multiplicative annihilator

  A ⊗ 0 = 0    A0 = 0

  Dense schema:
             dir   file   parent     size  state
  c321-p123  tmp/  a.txt  c321-p103  1K    sleeping
  c321-p124  doc/  b.csv  c321-p104  10K   runnable
  c321-p125  img/  c.tsv  c321-p105  100K  running
  c321-p126  bin/  d.bin  c321-p106  1M    zombie

  Sparse schema: each field|value pair becomes its own column holding
  a 1, e.g., row c321-p123 has a 1 in columns dir|tmp/, file|a.txt,
  parent|c321-p103, size|1K, and state|sleeping.

Fig. 4. Notional examples of dense (top) and sparse (bottom) schemas for the distributed global process associative array P, where the rows are the process IDs and the columns are metadata describing each process.
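The sparse schema of Figure 4 can be generated mechanically from field/value records. A minimal Python sketch, with a record layout assumed only for illustration:

——————————————————————–
# Sketch of the Figure 4 sparse schema: every field|value pair becomes
# its own column holding a 1, so any field is directly searchable.
def sparsify(records):
    # records: {process_id: {field: value}} -> {(pid, "field|value"): 1}
    return {(pid, f"{field}|{value}"): 1
            for pid, meta in records.items()
            for field, value in meta.items()}

P = sparsify({
    "c321-p123": {"dir": "tmp/", "file": "a.txt", "parent": "c321-p103",
                  "size": "1K", "state": "sleeping"},
    "c321-p124": {"dir": "doc/", "file": "b.csv", "parent": "c321-p104",
                  "size": "10K", "state": "runnable"},
})

# A column lookup answers a search, e.g., all sleeping processes:
sleeping = [pid for (pid, col) in P if col == "state|sleeping"]
print(sleeping)  # ['c321-p123']
——————————————————————–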

Most significantly, the properties of associative arrays are determined by the properties of the value set V. In other words, if V is linear (distributive), then so are the corresponding associative arrays.

IV. TABULAROSA MATHEMATICS

There are many possible ways of describing the Xv6 OS functions in terms of associative arrays. One possible approach begins with the following definitions. P is the distributed global process associative array, where the rows are the process IDs and the columns are metadata describing each process. In Xv6, there are approximately ten metadata fields attributed to each process. Notional examples of dense and sparse schemas for P are shown in Figure 4. In this analysis, a hybrid of these schemas is assumed as it naturally provides fast search on any row or column, enables most OS operations to be performed with array multiplication, and allows direct computation on numeric values. p is a vector containing one or more unique process IDs and is implicitly the output of the getpid() accessor function or all processes associated with the current context. Similarly, ṗ is implicitly the output of allocproc().

Associative array specifications of all the Xv6 functions listed in Figure 2 are provided in Appendix A. Perhaps the most important of these functions is fork(), which is used to create new processes and is described mathematically as follows

——————————————————————–
ṗ = fork()              # Function for creating processes
–––––––––––––––––––––––––––
ṗ = allocproc()         # Create new process IDs
Ṗ = I(ṗ, p)P            # Create new Ṗ from P
Ṗ ⊕= I(ṗ, parent|p)     # Add parent identifiers
Ṗ ⊕= I(p, child|ṗ)      # Add child identifiers
P ⊕= Ṗ                  # Add new processes to global table
——————————————————————–

where | implies concatenation with a separator such as |. The above mathematical description of fork() can also be algebraically compressed into the following single equation

  P ⊕= I(ṗ, p)P ⊕ I(ṗ, parent|p) ⊕ I(p, child|ṗ)

Additionally, forking the same process into multiple new processes can be done by adding nonzero rows to I(ṗ, p). Likewise, combining existing processes into a single new process can be done by adding nonzero entries to any row of I(ṗ, p). Thus, the associative array representation of fork() can accommodate a range of fork() operations that normally require distinct implementations. The above equation is independent of scale and describes a method for simultaneously forking many processes at once. The above equation is linear, so many of the operations can be ordered in a variety of ways while preserving correctness. Because the dominant computation is the array multiplication I(ṗ, p)P, the performance and scalability of forking in TabulaROSA can be estimated by simulating this operation.
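Using the dict-based associative arrays sketched in Section III, the fork() specification above can be traced line by line in Python. The allocproc() policy (sequential IDs) and the key formats are assumptions for illustration only:

——————————————————————–
# Sketch of fork() as associative array operations; comments map each
# step to the specification above. allocproc() here mints sequential
# IDs, an assumed stand-in for rand()/hash().
import itertools
_next_id = itertools.count(200)

def allocproc(pids):
    # p_dot = allocproc(): one new process ID per current process ID.
    return {p: f"c321-p{next(_next_id)}" for p in pids}

def fork(P, pids):
    new = allocproc(pids)
    P_dot = {(new[p], col): v                 # P_dot = I(p_dot, p) P
             for (p, col), v in P.items() if p in new}
    for p, pd in new.items():
        P_dot[(pd, f"parent|{p}")] = 1        # P_dot (+)= I(p_dot, parent|p)
        P_dot[(p, f"child|{pd}")] = 1         # P_dot (+)= I(p, child|p_dot)
    P.update(P_dot)                           # P (+)= P_dot
    return list(new.values())

P = {("c321-p123", "state|running"): 1}
print(fork(P, ["c321-p123"]))  # e.g. ['c321-p200']
——————————————————————–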
V. SIMULATION RESULTS

The associative array representation of fork() indicates that the core computational driver in a massively parallel implementation would be associative array multiplication I(ṗ, p)P. This operation can be readily simulated with the D4M (Dynamic Distributed Dimensional Data Model) implementation of associative arrays that is available in a variety of programming languages (d4m.mit.edu) [47], [66], [67]. In the simulation, each forker constructs a 2^18 × 2^18 sparse associative array P with approximately 10 nonzero entries per row. The row keys of P are the globally unique process IDs p. The column keys of P are strings that are surrogates for process ID metadata fields. The fork() operation is simulated by array multiplication of P by a 2^18 × 2^18 permutation array I(ṗ, p). Each row and column in I(ṗ, p) has one randomly assigned nonzero entry that corresponds to the mapping of the current process IDs in p to the new process IDs in ṗ. The full sparse representation is the most computationally challenging as it represents the case whereby all unique metadata have their own column and are directly searchable. For comparison, the Linux fork() command is also timed, and it is assumed that the maximum number of simultaneous processes on a compute node is the standard value of 2^16.
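The simulated kernel I(ṗ, p)P can be sketched at reduced scale with off-the-shelf sparse matrices. In the Python sketch below, scipy.sparse is an assumed stand-in for the D4M associative arrays (the study used D4M itself), and the array size is cut from 2^18 to 2^10 rows:

——————————————————————–
# Scaled-down sketch of the fork() simulation kernel: multiply the
# sparse process array P by a random permutation array I(p_dot, p).
import numpy as np
import scipy.sparse as sp

n, nnz_per_row = 2**10, 10            # paper uses n = 2**18 per forker
rng = np.random.default_rng(0)

rows = np.repeat(np.arange(n), nnz_per_row)
cols = rng.integers(0, n, size=n * nnz_per_row)
P = sp.csr_matrix((np.ones(n * nnz_per_row), (rows, cols)), shape=(n, n))

perm = rng.permutation(n)             # one nonzero per row and column
I_new = sp.csr_matrix((np.ones(n), (perm, np.arange(n))), shape=(n, n))

P_new = I_new @ P                     # the dominant computation of fork()
assert P_new.nnz == P.nnz             # a permutation preserves all entries
——————————————————————–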

Both the D4M simulation and the Linux fork() comparison were run in parallel on a supercomputer consisting of 648 compute nodes, each with at least 64 Xeon processing cores, for a total of 41,472 processing cores. The results are shown in Figures 5 and 6. In both cases, a single forker was run on 1, 2, 4, . . . , and 512 compute nodes, followed by running 2, 4, . . . , and 512 forkers on each of the 512 compute nodes to achieve a maximum of 262,144 simultaneous forkers. This pleasingly parallel calculation would be expected to scale linearly on a supercomputer.

Fig. 5. Number of processes managed vs. total number of forkers for the TabulaROSA D4M simulation and the Linux operating system running on a 32,000+ core system.

The number of processes managed in the D4M simulation grows linearly with the number of forkers, while in the Linux fork() comparison the number of processes managed grows linearly with the number of compute nodes. Figure 5 shows the total number of processes managed, which for D4M is 2^18 times the number of forkers with a largest value of 2^36 or over 68,000,000,000. For the Linux fork() comparison, the total number of processes managed is 2^16 times the number of nodes with a largest value of 2^25. In both computations, the number of processes managed could be increased. For these calculations, typical values are used. Since associative array operations are readily translatable to databases using disk storage, the capacity of this approach is very large even when there are high update rates [68].
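For reference, the process counts quoted above follow directly from the experiment parameters:

——————————————————————–
# Checking the quoted scaling arithmetic (values taken from the text).
forkers = 512 * 512            # 262,144 simultaneous forkers = 2**18
print(forkers * 2**18)         # D4M: 68,719,476,736 processes = 2**36
print(512 * 2**16)             # Linux: 33,554,432 processes = 2**25
——————————————————————–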

Figure 6 shows that the rate at which processes are forked in the D4M simulation grows linearly with the number of forkers and peaks at 64 forkers per node, which corresponds to the number of physical cores per node. In the Linux fork() comparison, the rate grows linearly with the number of forkers and peaks at 8 forkers per node. The largest fork rate for D4M in this simulation is approximately 800,000,000 forks per second. The largest fork rate for the Linux fork() comparison is approximately 40,000,000 forks per second.

Fig. 6. Fork rate vs. total number of forkers for the TabulaROSA D4M simulation and the Linux operating system running on a 32,000+ core system.

The purpose of these comparisons is not to highlight any particular attribute of Linux, but merely to set the context for the D4M simulations, which demonstrate the size and speed potential of TabulaROSA.

VI. CONCLUSION

The rise in computing hardware choices: general purpose central processing units, vector processors, graphics processing units, tensor processing units, field programmable gate arrays, optical computers, and quantum computers is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution on its own hardware is evolving toward a model whereby the controlling processor is completely distinct from the compute engines that are performing most of the computations. In many respects, this new operating system role is akin to the traditional role of a database management system and suggests that databases may be well suited to operating system tasks for future hardware architectures. To explore this hypothesis, this work defines key operating system functions in terms of rigorous mathematical semantics (associative array algebra) that are directly translatable into database operations. Because the mathematics of database table operations are based on a linear system over the union and intersection semiring, these operations possess a number of mathematical properties that are ideal for parallel operating systems by guaranteeing correctness over a wide range of parallel operations. The resulting operating system equations provide a mathematical specification for a Tabular Operating System Architecture (TabulaROSA) that can be implemented on any platform. Simulations of forking in TabulaROSA are performed by using an associative array implementation and are compared to Linux on a 32,000+ core supercomputer. Using over 262,000 forkers managing over 68,000,000,000 processes, the simulations show that TabulaROSA has the potential to perform operating system functions on a massively parallel scale. The TabulaROSA simulations show 20x higher performance compared to Linux, while managing 2000x more processes in fully searchable tables.

ACKNOWLEDGMENTS

The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, Paul Burkhardt, Sterling Foster, Charles Leiserson, Dave Martinez, Steve Pritchard, Victor Roytburd, and Michael Wright.

REFERENCES

[1] T. Sterling, M. Anderson, and M. Brodowicz, High Performance Computing: Modern Systems and Practices. Morgan Kaufmann, 2017.
[2] W. S. Song, V. Gleyzer, A. Lomakin, and J. Kepner, "Novel graph processor architecture, prototype system, and results," in High Performance Extreme Computing Conference (HPEC), IEEE, 2016.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] P. Beckman, R. Brightwell, B. de Supinski, M. Gokhale, S. Hofmeyr, S. Krishnamoorthy, M. Lang, B. Maccabe, J. Shalf, and M. Snir, "Exascale operating systems and runtime software report," US Department of Energy, Technical Report, December, 2012.
[5] M. Schwarzkopf, "Operating system support for warehouse-scale computing," PhD thesis, University of Cambridge, 2015.
[6] P. Laplante and D. Milojicic, "Rethinking operating systems for rebooted computing," in Rebooting Computing (ICRC), IEEE International Conference on, pp. 1–8, IEEE, 2016.
[7] L. Torvalds, "Linux: a portable operating system," Master's thesis, University of Helsinki, Dept. of Computing Science, 1997.
[8] A. Barbalace, M. Sadini, S. Ansary, C. Jelesnianski, A. Ravichandran, C. Kendir, A. Murray, and B. Ravindran, "Popcorn: bridging the programmability gap in heterogeneous-ISA platforms," in Proceedings of the Tenth European Conference on Computer Systems, p. 29, ACM, 2015.
[9] F. X. Lin, Z. Wang, and L. Zhong, "K2: a mobile operating system for heterogeneous coherence domains," ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 285–300, 2014.
[10] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The multikernel: a new OS architecture for scalable multicore systems," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 29–44, ACM, 2009.
[11] D. Wentzlaff and A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores," ACM SIGOPS Operating Systems Review, vol. 43, no. 2, pp. 76–85, 2009.
[12] F. J. Ballesteros, N. Evans, C. Forsyth, G. Guardiola, J. McKie, R. Minnich, and E. Soriano-Salvador, "NIX: A case for a manycore system for cloud computing," Bell Labs Technical Journal, vol. 17, no. 2, pp. 41–54, 2012.
[13] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom, "Plan 9 from Bell Labs," Computing Systems, vol. 8, no. 2, pp. 221–254, 1995.
[14] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt, "Helios: heterogeneous multiprocessing with satellite kernels," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 221–234, ACM, 2009.
[15] M. Fähndrich, M. Aiken, C. Hawblitzel, O. Hodson, G. Hunt, J. R. Larus, and S. Levi, "Language support for fast and reliable message-based communication in Singularity OS," in ACM SIGOPS Operating Systems Review, vol. 40, pp. 177–190, ACM, 2006.
[16] N. Asmussen, M. Völp, B. Nöthen, H. Härtig, and G. Fettweis, "M3: A hardware/operating-system co-design to tame heterogeneous manycores," in ACM SIGPLAN Notices, vol. 51, pp. 189–203, ACM, 2016.
[17] A. Reuther, C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, M. Jones, P. Michaleas, A. Prout, A. Rosa, and J. Kepner, "Scalable system scheduling for HPC and big data," Journal of Parallel and Distributed Computing, vol. 111, pp. 76–92, 2018.
[18] J. B. Dennis, "A multiuser computation facility for education and research," Communications of the ACM, vol. 7, no. 9, pp. 521–529, 1964.
[19] G. Bell, R. Cady, H. McFarland, B. Delagi, J. O'Laughlin, R. Noonan, and W. Wulf, "A new architecture for mini-computers: The DEC PDP-11," in Proceedings of the May 5-7, 1970, Spring Joint Computer Conference, pp. 657–675, ACM, 1970.
[20] J. Mitchell and K. Olsen, "TX-0, a transistor computer with a 256 by 256 memory," in Papers and Discussions Presented at the December 10-12, 1956, Eastern Joint Computer Conference: New Developments in Computers, pp. 93–101, ACM, 1956.
[21] W. A. Clark, "The Lincoln TX-2 computer development," in Papers Presented at the February 26-28, 1957, Western Joint Computer Conference: Techniques for Reliability, pp. 143–145, ACM, 1957.
[22] J. McCarthy, S. Boilen, E. Fredkin, and J. Licklider, "A time-sharing debugging system for a small computer," in Proceedings of the May 21-23, 1963, Spring Joint Computer Conference, pp. 51–57, ACM, 1963.
[23] T. Myer and I. E. Sutherland, "On the design of display processors," Communications of the ACM, vol. 11, no. 6, pp. 410–414, 1968.
[24] R. Cox, M. F. Kaashoek, and R. Morris, "Xv6, a simple Unix-like teaching operating system," 2011.
[25] O. Ritchie and K. Thompson, "The UNIX time-sharing system," The Bell System Technical Journal, vol. 57, no. 6, pp. 1905–1929, 1978.
[26] J. Lions, Lions' Commentary on UNIX 6th Edition. Peer-to-Peer Communications, ISBN 1-57398-013-7, 2000.
[27] M. Stonebraker, "The seven tenets of scalable data unification," 2017.
[28] J. M. Hellerstein and M. Stonebraker, Readings in Database Systems. MIT Press, 2005.
[29] M. Stonebraker, G. Held, E. Wong, and P. Kreps, "The design and implementation of INGRES," ACM Transactions on Database Systems (TODS), vol. 1, no. 3, pp. 189–222, 1976.
[30] C. J. Date and H. Darwen, A Guide to the SQL Standard: A User's Guide to the Standard Relational Language SQL. Addison-Wesley, 1989.
[31] R. Elmasri and S. Navathe, Fundamentals of Database Systems. Addison-Wesley Publishing Company, 2010.
[32] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
[33] D. Maier, The Theory of Relational Databases, vol. 11. Computer Science Press, Rockville, 1983.
[34] S. Abiteboul, R. Hull, and V. Vianu, eds., Foundations of Databases: The Logical Level. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1st ed., 1995.
[35] H. Jananthan, Z. Zhou, V. Gadepally, D. Hutchison, S. Kim, and J. Kepner, "Polystore mathematics of relational algebra," in Big Data Workshop on Methods to Manage Heterogeneous Big Data and Polystore Databases, IEEE, 2017.
[36] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205–220, 2007.
[37] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
[38] L. George, HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O'Reilly Media, Inc., 2011.
[39] A. Cordova, B. Rinaldi, and M. Wall, Accumulo: Application Development, Table Design, and Best Practices. O'Reilly Media, Inc., 2015.
[40] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik, "C-Store: A column-oriented DBMS," in Proceedings of the 31st International Conference on Very Large Data Bases, pp. 553–564, VLDB Endowment, 2005.
[41] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. Abadi, "H-store: A high-performance, distributed main memory transaction processing system," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1496–1499, 2008.
[42] M. Balazinska, J. Becla, D. Heath, D. Maier, M. Stonebraker, and S. Zdonik, "A demonstration of SciDB: A science-oriented DBMS," Cell, vol. 1, no. a2, 2009.
[43] M. Stonebraker and A. Weisberg, "The VoltDB main memory DBMS," IEEE Data Eng. Bull., vol. 36, no. 2, pp. 21–27, 2013.
[44] D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs, "Graphulo implementation of server-side sparse matrix multiply in the Accumulo database," in High Performance Extreme Computing Conference (HPEC), IEEE, 2015.
[45] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner, "Graphulo: Linear algebra graph kernels for NoSQL databases," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, pp. 822–830, IEEE, 2015.
[46] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther, "Associative array model of SQL, NoSQL, and NewSQL databases," in High Performance Extreme Computing Conference (HPEC), IEEE, 2016.
[47] J. Kepner and H. Jananthan, Mathematics of Big Data. MIT Press, 2018.
[48] G. M. Booth, "Distributed information systems," in Proceedings of the June 7-10, 1976, National Computer Conference and Exposition, pp. 789–794, ACM, 1976.
[49] D. E. Shaw, "A relational database machine architecture," in ACM SIGIR Forum, vol. 15, pp. 84–95, ACM, 1980.
[50] M. Stonebraker, "The case for shared nothing," IEEE Database Eng. Bull., vol. 9, no. 1, pp. 4–9, 1986.
[51] L. A. Barroso, J. Dean, and U. Hölzle, "Web search for a planet: The Google cluster architecture," IEEE Micro, vol. 23, no. 2, pp. 22–28, 2003.
[52] C. Curino, E. Jones, Y. Zhang, and S. Madden, "Schism: a workload-driven approach to database replication and partitioning," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 48–57, 2010.
[53] A. Pavlo, C. Curino, and S. Zdonik, "Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 61–72, ACM, 2012.
[54] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al., "Spanner: Google's globally distributed database," ACM Transactions on Computer Systems (TOCS), vol. 31, no. 3, p. 8, 2013.
[55] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
[56] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525–533, 1993.
[57] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, vol. 1. John Wiley & Sons, 1994.
[58] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[59] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler, "The scalable commutativity rule: Designing scalable software for multicore processors," ACM Transactions on Computer Systems (TOCS), vol. 32, no. 4, p. 10, 2015.
[60] A. T. Clements, M. F. Kaashoek, E. Kohler, R. T. Morris, and N. Zeldovich, "The scalable commutativity rule: designing scalable software for multicore processors," Communications of the ACM, vol. 60, no. 8, pp. 83–90, 2017.
[61] S. S. Bhat, Designing Multicore Scalable Filesystems with Durability and Crash Consistency. PhD thesis, Massachusetts Institute of Technology, 2017.
[62] P. Klemperer, "The product-mix auction: A new auction design for differentiated goods," Journal of the European Economic Association, vol. 8, no. 2-3, pp. 526–536, 2010.
[63] E. Baldwin and P. Klemperer, "Understanding preferences: 'demand types', and the existence of equilibrium with indivisibilities," SSRN, 2016.
[64] B. A. Mason, "Tropical algebra, graph theory, & foreign exchange arbitrage," 2016.
[65] J. Kepner, M. Kumar, J. Moreira, P. Pattnaik, M. Serrano, and H. Tufo, "Enabling massive deep neural networks with the GraphBLAS," in High Performance Extreme Computing Conference (HPEC), IEEE, 2017.
[66] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz, A. McCabe, P. Michaleas, A. Prout, A. Reuther, A. Rosa, and C. Yee, "Dynamic Distributed Dimensional Data Model (D4M) database and computation system," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5349–5352, IEEE, 2012.
[67] A. Chen, A. Edelman, J. Kepner, V. Gadepally, and D. Hutchison, "Julia implementation of the Dynamic Distributed Dimensional Data Model," in High Performance Extreme Computing Conference (HPEC), IEEE, 2016.
[68] J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, V. Gadepally, M. Hubbell, P. Michaleas, J. Mullen, A. Prout, A. Reuther, A. Rosa, and C. Yee, "Achieving 100,000,000 database inserts per second using Accumulo and D4M," in High Performance Extreme Computing Conference (HPEC), IEEE, 2014.

APPENDIX A: TABULAROSA SPECIFICATION

P is the distributed global process associative array, where the rows are the process IDs and the columns are metadata describing each process. In Xv6, approximately 10 metadata fields are attributed to each process. Notional examples of dense and sparse schemas for P are shown in Figure 4. In this analysis, a hybrid schema is assumed as it naturally provides fast search on any row or column and allows most OS operations to be performed with array multiplication while still being able to perform direct computation on numeric values. p is a vector containing one or more unique process IDs and is implicitly the output of the getpid() accessor function or all processes associated with the current context. Similarly, ṗ is implicitly the output of allocproc(). F is the distributed global files associative array where the rows are the file IDs and the columns are metadata describing each file and arguments for corresponding file operations. f is a vector containing one or more unique file IDs, and A_f are the associative arrays corresponding to the contents in file identifiers f.

——————————————————————–
ṗ = fork()               # Function for creating processes
–––––––––––––––––––––––––––
ṗ = allocproc()          # Create new process IDs
Ṗ = I(ṗ, p)P             # Create new Ṗ from P
Ṗ ⊕= I(ṗ, parent|p)      # Add parent identifiers
Ṗ ⊕= I(p, child|ṗ)       # Add child identifiers
P ⊕= Ṗ                   # Add new processes to global table
——————————————————————–
exit()                   # Exit current processes
–––––––––––––––––––––––––––
P ⊕= -(I(p)P)            # Remove exiting processes
——————————————————————–
wait()                   # Wait for child processes to exit
–––––––––––––––––––––––––––
Ṗ = I(p) P I(child|∗)    # Get child processes
while(Ṗ)
  Ṗ = I(p) P I(child|∗)  # Get exiting processes
——————————————————————–
kill(p)                  # Terminate processes
–––––––––––––––––––––––––––
P ⊕= -(I(p)P)            # Remove processes to kill
——————————————————————–
p = getpid()             # Return current process IDs
–––––––––––––––––––––––––––
p = row(P I(current|∗))  # Get process IDs
——————————————————————–
ṗ = allocproc()          # Return vector of new process IDs
–––––––––––––––––––––––––––
ṗ = rand(), hash(), ...  # Create new process IDs
——————————————————————–
sleep(n)                 # Sleep for n seconds
–––––––––––––––––––––––––––
P ⊕= A(p, sleep, n)      # Add n seconds of sleep
——————————————————————–
exec(Ḟ)                  # Load files and execute them
–––––––––––––––––––––––––––
ḟ = open(Ḟ)              # Open files
P ⊕= I(p, ḟ)Ḟ            # Replace current instructions
——————————————————————–
sbrk(n)                  # Grow memory by n bytes
–––––––––––––––––––––––––––
P ⊕= A(p, memory, n)     # Add n bytes of memory
——————————————————————–
ḟ = open(Ḟ)              # Open files Ḟ
–––––––––––––––––––––––––––
P ⊕= Ḟ ⊕ A(ḟ, open, 1)   # Mark as open
——————————————————————–
A_buf = read(A_f, row, col)  # Read selected data
–––––––––––––––––––––––––––
ḟ = open(Ḟ)              # Open files
A_buf = I(row) A_f I(col)    # Select and copy to buffer
——————————————————————–
A_f = write(A_buf, row, col) # Write selected data
–––––––––––––––––––––––––––
ḟ = open(Ḟ)              # Open files
A_f = I(row) A_buf I(col)    # Select and copy to file
——————————————————————–
close(ḟ)                 # Close file IDs ḟ
–––––––––––––––––––––––––––
P ⊕= -A(ḟ, open, 1)      # Remove open flags
——————————————————————–
Ḟ = dup(ḟ)               # Duplicate file descriptors
–––––––––––––––––––––––––––
ḟ = ...                  # Create new file IDs
F ⊕= I(ḟ, f)F            # Copy file metadata to new IDs
——————————————————————–
pipe(Ḟ)                  # Make pipe Ḟ
–––––––––––––––––––––––––––
close(open(Ḟ))           # Open and close to create
——————————————————————–
chdir(dir)               # Change current directories
–––––––––––––––––––––––––––
P ⊕= A(p, cwd, dir)      # Add new dir
——————————————————————–
mkdir(Ḟ)                 # Make directories Ḟ
–––––––––––––––––––––––––––
close(open(Ḟ))           # Open and close to create
——————————————————————–
mknod(Ḟ)                 # Make nodes to devices Ḟ
–––––––––––––––––––––––––––
close(open(Ḟ))           # Open and close to create
——————————————————————–
Ḟ = fstat(ḟ)             # Get metadata on ḟ
–––––––––––––––––––––––––––
Ḟ = I(ḟ)F                # Get corresponding rows
——————————————————————–
link(Ḟ)                  # Create links to files
–––––––––––––––––––––––––––
F ⊕= Ḟ                   # Copy file metadata to new files
——————————————————————–
unlink(Ḟ)                # Unlink files
–––––––––––––––––––––––––––
F ⊕= -Ḟ                  # Remove file metadata from files
——————————————————————–
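As one worked instance of the appendix patterns, exit() and kill() both reduce to P ⊕= -(I(p)P). A minimal Python sketch using the dict-based associative arrays from the earlier sketches (helper names are assumptions):

——————————————————————–
# Sketch of the exit()/kill() pattern P (+)= -(I(p)P): select the rows
# of the terminating processes and subtract them from the global array.
def select_rows(P, pids):
    # I(p)P: keep only the rows whose process ID is in pids.
    return {k: v for k, v in P.items() if k[0] in pids}

def kill(P, pids):
    # P (+)= -(I(p)P): remove the selected rows.
    doomed = select_rows(P, set(pids))
    return {k: v for k, v in P.items() if k not in doomed}

P = {("c321-p123", "state|running"): 1, ("c321-p124", "state|zombie"): 1}
P = kill(P, ["c321-p124"])
print(sorted({pid for pid, _ in P}))  # ['c321-p123']
——————————————————————–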