RAFTLIB
Presented by: Jonathan Beard
C++Now 2016

Alternate Titles:
➤ This thing I started on a plane
➤ What’s this RaftLib thingy?
➤ OMG, another threading library…
➤ Why I hate parallel programming
➤ A self-help guide for pthread anxiety

All thoughts and opinions are my own. RaftLib is not a product of ARM Inc. Please don’t ask about ARM products or strategy; I will scowl and not answer. Thank you.

ABOUT ME

my website: http://www.jonathanbeard.io

slides at: http://goo.gl/cwT5UB

project page: raftlib.io

WHERE I STARTED


const uint8_t fsm_table[ STATES ][ CHARS ] = {
    /* 0 - @       */ { [9]=0x21 },
    /* 1 - NAME    */ { [0 ... 8]=0x11, [11]=0x32 },
    /* 2 - SEQ     */ { [1 ... 2]=0x42, [5]=0x42, [8]=0x42, [11]=0x13 },
    /* 3 - \n      */ { [1 ... 2]=0x42, [5]=0x42, [8]=0x42, [10]=0x14 },
    /* 4 - NAME2   */ { [0 ... 8]=0x54, [11]=0x15 },
    /* 5 - SCORE   */ { [0 ... 10]=0x65, [11]=0x16 },
    /* 6 - SEND/SO */ { [9]=0x71, [11]=0x16 }
};
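Each byte in the table appears to pack two nibbles. A minimal driver sketch under that assumption (low nibble = next state, high nibble = action; char_class and do_action are hypothetical stand-ins, not from the talk):

#include <cstdint>
#include <string>

/* hypothetical helper, not from the talk: map a character to a CHARS column */
static std::uint8_t char_class( const char c )
{
    return( c == '@' ? 9 : ( c == '\n' ? 11 : 0 ) ); /* placeholder logic */
}

/* hypothetical helper: emit, accumulate, or reset based on the action nibble */
static void do_action( const std::uint8_t action, const char c )
{
    (void) action; (void) c;
}

void run_fsm( const std::string &input )
{
    std::uint8_t state( 0 );
    for( const char c : input )
    {
        const auto entry( fsm_table[ state ][ char_class( c ) ] );
        do_action( entry >> 4, c );  /* high nibble: action (assumed) */
        state = entry & 0xF;         /* low nibble: next state (assumed) */
    }
}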

NECESSITY DRIVES IDEAS

[Figure: tool-chain diagram; labels lost in extraction]

AN (PERHAPS BAD) ANALOGY

ANALOGY PART TWO

TOPOLOGY

THE JELLYFISH: THE FIRST ORGANISM TO OVERLAP ACCESS AND EXECUTION

DATA MOVEMENT DOMINATES

Source: Shekhar Borkar, Journal of Lightwave Technology, 2013

I SHOULDN’T HAVE TO CARE

Multiple Sequence Alignment ⚙ Gene Expression Data ⚙

HARDWARE

1984: Cray X-MP/48, $19 million / GFLOP
2015: $0.08 / GFLOP

CORES PER DEVICE

FINANCIAL INCENTIVE

Sequential JS: $4-7/line

Sequential Java: $5-10/line

Embedded Code: $30-50/line

HPC Code: $100/line

WHY YOU SHOULD CARE

PRODUCTIVITY / EFFICIENCY: AN EQUALIZER

➤ Titan SC estimates for porting range from 5 to more than 20 million USD
➤ Most code is never really optimized for machine topology, wasting $$ (time/product) and energy
➤ Getting the most out of what you have is an equalizer

HAVES AND HAVE NOTS

Start-up / Gov’t / Big Business:
•Lots of $$
•Can hire the best people
•Can acquire the rest
•Plenty of compute resources

Small Government:
•Not a lot of $$
•Often can’t hire the best people
•Left to the mercy of cloud providers

AN IDEA

Let’s make computers super fast, and easy to program

WHERE TO START

Brook for GPUs: Stream Computing on Graphics Hardware

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan (Stanford University)

[Slide shows the first pages of three prior works: the Brook for GPUs paper above; “X-Sim and X-Eval: Tools for Simulation and Analysis of Heterogeneous Pipelined Architectures,” Saurabh Gayen, M.S. thesis (directed by Prof. Mark A. Franklin), Washington University in St. Louis, May 2008; and “SPADE: The System S Declarative Stream Processing Engine,” Buğra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, MyungCheol Doo, SIGMOD ’08.]

WHERE TO START

Raft Language:
•Object oriented
•Multiple inheritance
•Simple types
•…lots more

Language or Library? Candidates: Java, C, C++, C#, Python

RAFTLIB

C++ Streaming Template Library:
•Simplifies parallelization of code
•Abstracts the details
•Auto-manages allocation types and data movement

software download: http://raftlib.io

ISSUES THAT NEED SOLUTIONS

➤ Usable Generic API
➤ Where to run the code
➤ Where to allocate memory
➤ How big of allocations to start off with
➤ How many user vs. kernel threads
➤ …Gilligan had a better chance of having a 3 hour tour
➤ …than we have of fitting all the issues on a single slide

STREAM PROCESSING (STRING SEARCH EXAMPLE)

[Diagram: Read File, Distribute → Match 1 … Match i … Match n → Reduce]

[Kernel detail: input data stream → Get Data … Send Data → output data stream]

[Firing rule: no data available → don’t run; data available → run the kernel]

HOW IT WORKS (HIGH LEVEL)

[Flow: Define Topology → Check Graph Connectivity & Streaming Types → Map Kernels to Resources → Continually Monitor Kernel Performance / Continually Monitor Buffer Behavior]

Stream Processing (Search Ex)

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

MODELING

Mapping:
•RaftLib optionally uses Scotch
•Otherwise basic affinity

Scheduling:
•OS managed
•Affinity locked (Linux/Unix)
•Run-time yields when blocked

Allocation / Modeling: mostly my work, which we can talk about

MODELING ISSUES - SIZE OF Q1

A Q1 B

WHY DO WE NEED TO SOLVE

➤ Buffer allocations take time and energy
➤ Programmers are horrible at deciding (too many parameters)
➤ Hardware-specific locations matter (NUMA)
➤ Re-allocating with an accelerator takes even more time (bus latency, hand-shakes, etc.)
➤ Must be solved in conjunction with partitioning/scheduling/placement

MODELING

A B C

“Stream” is modeled as a Queue

A Q1 B Q2 C

MONITOR

[Diagram: Kernel A → stream → Kernel B; each kernel runs on its own kernel thread; a separate monitor thread observes the stream; the OS scheduler places threads on processor cores]

APPROXIMATE INSTRUMENTATION

A Q1 B

SOLVE FOR THROUGHPUT QUICKLY

Use a network flow model with per-kernel service rates (μ_A, μ_B, μ_C, μ_D, in GB/s) to quickly estimate flow within a streaming graph.

Decompose the queueing network and solve each queueing station independently.
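As a sanity check on that decomposition, two textbook open-network identities apply (standard queueing theory, not necessarily the exact model RaftLib uses): a pipeline's sustainable throughput is capped by its slowest station, and each station's utilization is its arrival rate over its service rate:

\[ \lambda_{\max} \;=\; \min_i \mu_i, \qquad \rho_i \;=\; \frac{\lambda}{\mu_i} \quad (\text{stable iff } \rho_i < 1) \]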

HOW WOULD YOU GET THE RIGHT BUFFER SIZE?

Queueing Model to Solve for Buffer Size

M/D/1 M/M/1 None ?

A Q1 B
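For intuition, the M/M/1 choice yields a closed-form estimate (a textbook result, shown as illustration rather than as the talk's exact method): the stationary queue-length tail decays geometrically, so the smallest buffer B whose overflow probability stays below a tolerance \(\varepsilon\) is

\[ P( N \ge B ) \;=\; \rho^{B} \quad\Longrightarrow\quad B \;\ge\; \frac{\ln \varepsilon}{\ln \rho}, \qquad \rho = \lambda / \mu < 1 . \]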

ML BASED MODEL SELECTION

[Diagram: a classifier selects the queueing model (here M/M/1); then apply the model]

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

COMPUTE KERNEL

[Kernel diagram: input data stream → Get Data … Send Data → output data stream]


class akernel : public raft::kernel
{
public:
    akernel() : raft::kernel()
    {
        //add input ports
        input.addPort< /** type **/ >( "x0", "x1", "x..." );
        //add output ports
        output.addPort< /** type **/ >( "y0", "y1", "y..." );
    }

    virtual raft::kstatus run()
    {
        /** get data from input ports **/
        auto &valFromX( input[ "x..." ].peek< /** type of "x..." **/ >() );
        /** do something with data **/
        const auto ret_val( do_something( valFromX ) );
        output[ "y..." ].push( ret_val );
        input[ "x..." ].unpeek();
        input[ "x..." ].recycle();
        return( raft::proceed /** or stop **/ );
    }
};

COMPUTE KERNEL

class example : public raft::parallel_k
{
public:
    example() : parallel_k()
    {
        input.addPort< /** some type **/ >( "0" );
        /** add a starter output port **/
        addPort();
    }

    /** implement virtual function **/
    virtual std::size_t addPort()
    {
        return( (this)->addPortTo< /** type **/ >( output /** direction **/ ) );
    }
};

[Diagram: the example kernel fans out to b, c, “...”]

RECEIVING DATA

/**
 * return reference to memory on in-bound stream
 */
template< class T >
T& peek( raft::signal *signal = nullptr )

template< class T >
autorelease< T, peekrange > peek_range( const std::size_t n )

•Returns object with “special access to stream”
•Operator [] overload returns autopair
•Direct reference, as in peek(), for each element

template < class T >
struct autopair
{
    T              &ele;
    Buffer::Signal &sig;
};
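A hedged usage sketch (port name, element type, and process() are hypothetical; exact autorelease semantics may vary by version):

/* inside a kernel's run(): examine n elements without copying */
const std::size_t n( 4 );
auto range( input[ "x0" ].peek_range< int >( n ) );
for( std::size_t i( 0 ); i < n; i++ )
{
    process( range[ i ].ele );  /* operator[] yields an autopair; .ele is a direct reference */
}
/* the autorelease object returns the range to the stream on scope exit */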

RECEIVING DATA

/**
 * necessary to tell inbound stream we're done looking at it
 */
virtual void unpeek()

/**
 * free up index in fifo, lazily deallocate large objects
 */
void recycle( const std::size_t range = 1 )

[Flowchart: on resize/relocate, check “is a peek() in progress?”; if the object is externally allocated, wait to see if it is pushed downstream; otherwise immediately free the slot within the stream]

RECEIVING DATA

/**
 * these pops produce a copy
 */
template< class T >
void pop( T &item, raft::signal *signal = nullptr )

template< class T >
using pop_range_t = std::vector< std::pair< T, raft::signal > >;

template< class T >
void pop_range( pop_range_t< T > &items, const std::size_t n_items )

/**
 * no copy, slightly higher overhead, "smart object"
 * implements peek, unpeek, recycle
 */
template< class T >
autorelease< T, poptype > pop_s()
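For flavor, the copying form inside run() might look like this (port names and type are hypothetical):

/* copy one element off "x0", transform, forward on "y0" */
int value;
input[ "x0" ].pop( value );        /* copying pop; signal argument omitted */
output[ "y0" ].push( value * 2 );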

SENDING DATA

/** in-place allocation **/
template < class T, class ... Args >
T& allocate( Args&&... params )

[Decision tree: if sizeof( T ) <= cache line → has constructor? yes → in-place new with constructor; no → bare uninitialized memory; otherwise → allocate from external pool]

/** in-place alloc of range for fundamental types **/
template < class T >
auto allocate_range( const std::size_t n )
    -> std::vector< std::reference_wrapper< T > >

SENDING DATA

/** release data to stream **/
virtual void send( const raft::signal = raft::none )

/** release data to stream **/
virtual void send_range( const raft::signal = raft::none )

/** oops, don't need this memory **/
virtual void deallocate()
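Putting allocate and send together, a hedged sketch (type and port name are hypothetical):

/* construct directly in stream memory, then release downstream */
auto &out( output[ "y0" ].allocate< std::string >( "hello" ) );
out += " world";          /* mutate in place before release */
output[ "y0" ].send();    /* or deallocate() if the memory isn't needed */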

SENDING DATA


/** multiple forms **/
template < class T >
void push( const T &item, const raft::signal signal = raft::none )

/** insert from container within run() function to stream **/
template< class iterator_type >
void insert( iterator_type begin,
             iterator_type end,
             const raft::signal signal = raft::none )

INCLUDED KERNELS

/**
 * thread safe print, specialization for '\n' vs. '\0'
 */
template< typename T, char delim = '\0' >
class print

/** read from iterator to streams **/
static raft::readeach< T, Iterator >
read_each( Iterator &&begin, Iterator &&end )

/** write from iterator to streams **/
template < class T, class BackInsert >
static writeeach< T, BackInsert >
write_each( BackInsert &&bi )
{
    return( writeeach< T, BackInsert >( bi ) );
}
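These pieces compose into a complete program; the following is modeled on the project's hello-world example (hedged: API details may differ across RaftLib versions; see raftlib.io for the current form):

#include <raft>
#include <raftio>
#include <cstdlib>
#include <string>

class hello : public raft::kernel
{
public:
    hello() : raft::kernel()
    {
        output.addPort< std::string >( "0" );
    }

    virtual raft::kstatus run()
    {
        output[ "0" ].push( std::string( "Hello World\n" ) );
        return( raft::stop );  /* fire once, then quit */
    }
};

int main( int argc, char **argv )
{
    hello h;
    raft::print< std::string > p;  /* included print kernel */
    raft::map m;
    m += h >> p;                   /* hello's output feeds print's input */
    m.exe();                       /* run the map to completion */
    return( EXIT_SUCCESS );
}

CONNECTING COMPUTE KERNELS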

raft::map m;
/** example only **/
raft::kernel a, b;
m += a >> b;

[Diagram: a → b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b;
m += a[ "y0" ] >> b;

[Diagram: a “y0” → b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b;
m += a[ "y0" ] >> b[ "x0" ];

[Diagram: a “y0” → “x0” b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a <= b >= c;

Topology the user specifies:

[Diagram: a → b → c]

RaftLib Turns Into

[Diagram: a → cloned b kernels → c]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c, d;
m += a <= b >> c >= d;

RaftLib Turns Into

[Diagram: a → parallel b → c chains → d]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a >> raft::order::out >> b >> raft::order::out >> c;

•Clone at SESE points
•Use CLONE macro

[Diagram: a → cloned b kernels → c]

AUTO-PARALLELIZATION

1) RaftLib figures out methodology
2) Runtime calls CLONE() macro of kernel “b”

#define CLONE() \
virtual raft::kernel* clone() \
{ \
    auto *ptr( \
        new typename std::remove_reference< decltype( *this ) >::type( \
            ( *( (typename std::decay< decltype( *this ) >::type * ) this ) ) ) ); \
    /** RL needs to dealloc this one **/ \
    ptr->internal_alloc = true; \
    return( ptr ); \
}
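In user code the macro simply sits inside a copy-constructible kernel; a hedged sketch (the match kernel is hypothetical):

class match : public raft::kernel
{
public:
    match() : raft::kernel()
    {
        input.addPort< std::string >( "in" );
        output.addPort< bool >( "out" );
    }

    /* clone() copy-constructs, so copying must remain valid */
    match( const match &other ) = default;

    virtual raft::kstatus run()
    {
        /* ... match logic ... */
        return( raft::proceed );
    }

    CLONE();  /* lets the runtime duplicate this kernel at run time */
};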

[Diagram: a → cloned b kernels → c]

AUTO-PARALLELIZATION

3) Lock output port container of “a”
4) Register new port
5) Decide where to run it
6) Allocate memory for stream
7) Unlock output container of kernel “a”
// do the same on the output side of “b”

[Diagram: a → cloned b kernels → c]

CONNECTION TODO ITEMS

➤ Better SESE implementation
➤ Decide on syntax for set of kernels:

raft::map m;
/** example only **/
raft::kernel a, b, c, d;
m += src <= raft::kset( a, b, c, d ) >= dst;

➤ Address space stream modifier (new VM space):

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a >> raft::vm::part >> b >> raft::vm::part >> c;

➤ Anything else?

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

IMPLEMENTATION DETAILS

[Flow: Define Topology → Check Graph Connectivity & Streaming Types → Map Kernels to Resources → Continually Monitor Kernel Performance / Continually Monitor Buffer Behavior]

TOPOLOGY CHECK

➤ Add kernels to map
➤ Check type of each link (potential for type optimization)
➤ Handle static split/joins (produce any new kernels)
➤ DFS to ensure no unconnected edges

[Diagram: Read File, Distribute → Match 1 … Match i … Match n → Reduce]

PARTITION (LINUX / UNIX)

➤ Take RaftLib representation, convert to Scotch format
➤ Use fixed communications cost at each edge for initial partition
➤ 2 main dimensions:
    ➤ Flow between each edge in the application graph
    ➤ Bandwidth available between compute resources
➤ Set affinity to partitioned compute cores
➤ Repartition as needed @ run-time
➤ TODO: incorporate OpenMPI utilities hwloc and netloc to get cross-platform hardware topology information

CHOOSE ALLOCATIONS

➤ Alignment:
    ➤ SIMD ops often require memory alignment; RaftLib takes an alignment-by-default approach for in-stream allocations
➤ In-stream vs. External Pool Allocate (template allocators)

[Decision tree, as before: if sizeof( T ) <= cache line → has constructor? yes → in-place new; no → bare uninitialized memory; otherwise → allocate from external pool]

MONITOR BEHAVIOR - ALLOCATORS

➤ Two options for figuring out optimal buffer size while running:
    ➤ model based (discussed on the modeling adventure path)
    ➤ branch & bound search
➤ Separate thread, exits when app done
➤ Pseudocode:

while( not done )
{
    if( queue_utilization > .5 )
    {
        queue->resize();
    }
    sleep( ALLOC_INTERVAL );
}

RUN-TIME LOCK FREE FIFO RESIZING

enum access_key : key_t
{
    allocate       = 0,
    allocate_range = 1,
    push           = 3,
    recycle        = 4,
    pop            = 5,
    peek           = 6,
    size           = 7,
    N
};

struct ThreadAccess
{
    union
    {
        std::uint64_t whole = 0; /** just in case, default zero **/
        dm::key_t     flag[ 8 ];
    };
    std::uint8_t padding[ L1D_CACHE_LINE_SIZE - 8 /** padding **/ ];
}
#if defined __APPLE__ || defined __linux
__attribute__((aligned( L1D_CACHE_LINE_SIZE )))
#endif
volatile thread_access[ 2 ];

/** std::memory_order_relaxed **/
std::atomic< std::uint64_t > checking_size = { 0 };

RUN-TIME LOCK FREE FIFO RESIZING

➤ Optimization…wait for the right conditions:

if( rpt < wpt )
{
    // perfect to copy w/ std::memcpy
}

➤ Factory allocators:

/** allocator factory map **/
std::map< Type::RingBufferType, instr_map_t* > const_map;

/** initialize some factories **/
const_map.insert( std::make_pair( Type::Heap, new instr_map_t() ) );

const_map[ Type::Heap ]->insert(
    std::make_pair( false /** no instrumentation **/,
                    RingBuffer< T, Type::Heap, false >::make_new_fifo ) );

const_map[ Type::Heap ]->insert(
    std::make_pair( true /** yes instrumentation **/,
                    RingBuffer< T, Type::Heap, true >::make_new_fifo ) );

…many more

MONITOR BEHAVIOR - PARALLELIZATION

➤ Mechanics covered in the interface section; simple model here
➤ Run in separate thread, terminate on exit

/** apply criteria **/
if( in_utilization > .5 && out_utilization < .5 )
{
    // tag kernel
    auto &tag( ag[ reinterpret_cast< std::uintptr_t >( kernel ) ] );
    tag += 1;
    if( tag == 3 )
    {
        dup_list.emplace_back( kernel );
    }
}

// after checking all kernels, handle duplication
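A hedged sketch of that duplication step (names come from the snippet above; per the earlier slides, the real runtime locks ports, registers the new stream, places the clone, then unlocks):

for( auto *kernel : dup_list )
{
    auto *clone( kernel->clone() );  /* provided by the CLONE() macro */
    /* steps 3–7: lock ports, register new stream, pick a core, unlock */
}

IMPLEMENTATION TODO ITEMS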

➤ Find fast SVM library to integrate (research code used LibSVM) for buffer model selection
➤ Integrate more production-capable network flow model for run-time re-partitioning choices
➤ Performant TCP links…
➤ RDMA on wish list
➤ QThreads integration (see pool scheduler)
➤ hwloc and netloc integration (see partition_scotch)
➤ Perf data caching (useful for initial partition)

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

PERFORMANCE

➤ Decent compared to the stock pthread implementation of pbzip2
➤ Parallel bzip2 example: https://goo.gl/xyQAhm

➤ Fixed string search compared to Apache Spark and GNU Parallel + GNU grep

[Chart: throughput of RaftLib (Boyer-Moore), RaftLib (Aho-Corasick), Apache Spark, and GNU Parallel + GNU grep]

ABOUT ME

my website: http://www.jonathanbeard.io

slides at: http://goo.gl/cwT5UB

project page: raftlib.io

video of this talk (#CppNow2016): http://goo.gl/mbxAwK

Stream Processing

for i ← 0 through N do
    a[i] ← ( b[i] + c[i] )
    i++
end do

Traditional Control Flow vs. Streaming:

[Flowchart (control flow): read a, b, c, i; while i <= N: a[i] ← ( b[i] + c[i] ), i++; then exit]
[Flowchart (streaming): read b, c → out ← b + c → write a]

SBS Stream Based Supercomputing Lab, http://sbs.wustl.edu