RAFTLIB
Presented by: Jonathan Beard
C++Now 2016

Alternate Titles:
➤ This thing I started on a plane
➤ What’s this RaftLib thingy?
➤ OMG, another threading library…
➤ Why I hate parallel programming
➤ A self-help guide for pthread anxiety

All thoughts and opinions are my own. RaftLib is not a product of ARM Inc. Please don’t ask about ARM products or strategy; I will scowl and not answer. Thank you.

ABOUT ME

my website: http://www.jonathanbeard.io

slides at: http://goo.gl/cwT5UB

project page: raftlib.io

WHERE I STARTED


const uint8_t fsm_table[ STATES ][ CHARS ] = {
    /* 0 - @       */ { [9]=0x21 },
    /* 1 - NAME    */ { [0 ... 8]=0x11, [11]=0x32 },
    /* 2 - SEQ     */ { [1 ... 2]=0x42, [5]=0x42, [8]=0x42, [11]=0x13 },
    /* 3 - \n      */ { [1 ... 2]=0x42, [5]=0x42, [8]=0x42, [10]=0x14 },
    /* 4 - NAME2   */ { [0 ... 8]=0x54, [11]=0x15 },
    /* 5 - SCORE   */ { [0 ... 10]=0x65, [11]=0x16 },
    /* 6 - SEND/SO */ { [9]=0x71, [11]=0x16 }
};
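Each byte in the table appears to pack two nibbles. A minimal driver sketch under that assumption (low nibble = next state, high nibble = action; char_class and do_action are hypothetical stand-ins, not from the talk):

#include <cstdint>
#include <string>

/* hypothetical helper, not from the talk: map a character to a CHARS column */
static std::uint8_t char_class( const char c )
{
    return( c == '@' ? 9 : ( c == '\n' ? 11 : 0 ) ); /* placeholder logic */
}

/* hypothetical helper: emit, accumulate, or reset based on the action nibble */
static void do_action( const std::uint8_t action, const char c )
{
    (void) action; (void) c;
}

void run_fsm( const std::string &input )
{
    std::uint8_t state( 0 );
    for( const char c : input )
    {
        const auto entry( fsm_table[ state ][ char_class( c ) ] );
        do_action( entry >> 4, c );  /* high nibble: action (assumed) */
        state = entry & 0xF;         /* low nibble: next state (assumed) */
    }
}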

NECESSITY DRIVES IDEAS

[Figure: tool-chain diagram; labels lost in extraction]

AN (PERHAPS BAD) ANALOGY

ANALOGY PART TWO

TOPOLOGY

THE JELLYFISH: THE FIRST ORGANISM TO OVERLAP ACCESS AND EXECUTION

DATA MOVEMENT DOMINATES

Source: Shekhar Borkar, Journal of Lightwave Technology, 2013

I SHOULDN’T HAVE TO CARE

Multiple Sequence Alignment ⚙ Gene Expression Data ⚙

HARDWARE

1984: Cray X-MP/48, $19 million / GFLOP
2015: $0.08 / GFLOP

CORES PER DEVICE

FINANCIAL INCENTIVE

Sequential JS: $4-7/line

Sequential Java: $5-10/line

Embedded Code: $30-50/line

HPC Code: $100/line

WHY YOU SHOULD CARE

PRODUCTIVITY / EFFICIENCY: AN EQUALIZER

➤ Titan SC estimates for porting range from 5 to more than 20 million USD
➤ Most code is never really optimized for machine topology, wasting $$ (time/product) and energy
➤ Getting the most out of what you have is an equalizer

HAVES AND HAVE NOTS

Start-up / Gov’t / Big Business:
•Lots of $$
•Can hire the best people
•Can acquire the rest
•Plenty of compute resources

Small Government:
•Not a lot of $$
•Often can’t hire the best people
•Left to the mercy of cloud providers

AN IDEA

Let’s make computers super fast, and easy to program

WHERE TO START

Brook for GPUs: Stream Computing on Graphics Hardware

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan (Stanford University)

[Slide shows the first pages of three prior works: the Brook for GPUs paper above; “X-Sim and X-Eval: Tools for Simulation and Analysis of Heterogeneous Pipelined Architectures,” Saurabh Gayen, M.S. thesis (directed by Prof. Mark A. Franklin), Washington University in St. Louis, May 2008; and “SPADE: The System S Declarative Stream Processing Engine,” Buğra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, MyungCheol Doo, SIGMOD ’08.]

WHERE TO START

Raft Language:
•Object oriented
•Multiple inheritance
•Simple types
•…lots more

Language or Library? Candidates: Java, C, C++, C#, Python

RAFTLIB

C++ Streaming Template Library:
•Simplifies parallelization of code
•Abstracts the details
•Auto-manages allocation types and data movement

software download: http://raftlib.io

ISSUES THAT NEED SOLUTIONS

➤ Usable Generic API
➤ Where to run the code
➤ Where to allocate memory
➤ How big of allocations to start off with
➤ How many user vs. kernel threads
➤ …Gilligan had a better chance of having a 3 hour tour
➤ …than we have of fitting all the issues on a single slide

STREAM PROCESSING (STRING SEARCH EXAMPLE)

[Diagram: Read File, Distribute → Match 1 … Match i … Match n → Reduce]

[Kernel detail: input data stream → Get Data … Send Data → output data stream]

[Firing rule: no data available → don’t run; data available → run the kernel]

HOW IT WORKS (HIGH LEVEL)

[Flow: Define Topology → Check Graph Connectivity & Streaming Types → Map Kernels to Resources → Continually Monitor Kernel Performance / Continually Monitor Buffer Behavior]

Stream Processing (Search Ex)

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

MODELING

Mapping:
•RaftLib optionally uses Scotch
•Otherwise basic affinity

Scheduling:
•OS managed
•Affinity locked (Linux/Unix)
•Run-time yields when blocked

Allocation / Modeling: mostly my work, which we can talk about

MODELING ISSUES - SIZE OF Q1

A Q1 B

WHY DO WE NEED TO SOLVE

➤ Buffer allocations take time and energy
➤ Programmers are horrible at deciding (too many parameters)
➤ Hardware-specific locations matter (NUMA)
➤ Re-allocating with an accelerator takes even more time (bus latency, hand-shakes, etc.)
➤ Must be solved in conjunction with partitioning/scheduling/placement

MODELING

A B C

“Stream” is modeled as a Queue

A Q1 B Q2 C

MONITOR

[Diagram: Kernel A → stream → Kernel B; each kernel runs on its own kernel thread; a separate monitor thread observes the stream; the OS scheduler places threads on processor cores]

APPROXIMATE INSTRUMENTATION

A Q1 B

SOLVE FOR THROUGHPUT QUICKLY

Use a network flow model with per-kernel service rates (μ_A, μ_B, μ_C, μ_D, in GB/s) to quickly estimate flow within a streaming graph.

Decompose the queueing network and solve each queueing station independently.
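As a sanity check on that decomposition, two textbook open-network identities apply (standard queueing theory, not necessarily the exact model RaftLib uses): a pipeline's sustainable throughput is capped by its slowest station, and each station's utilization is its arrival rate over its service rate:

\[ \lambda_{\max} \;=\; \min_i \mu_i, \qquad \rho_i \;=\; \frac{\lambda}{\mu_i} \quad (\text{stable iff } \rho_i < 1) \]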

HOW WOULD YOU GET THE RIGHT BUFFER SIZE?

Queueing Model to Solve for Buffer Size

M/D/1 M/M/1 None ?

A Q1 B
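For intuition, the M/M/1 choice yields a closed-form estimate (a textbook result, shown as illustration rather than as the talk's exact method): the stationary queue-length tail decays geometrically, so the smallest buffer B whose overflow probability stays below a tolerance \(\varepsilon\) is

\[ P( N \ge B ) \;=\; \rho^{B} \quad\Longrightarrow\quad B \;\ge\; \frac{\ln \varepsilon}{\ln \rho}, \qquad \rho = \lambda / \mu < 1 . \]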

ML BASED MODEL SELECTION

[Diagram: a classifier selects the queueing model (here M/M/1); then apply the model]

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

COMPUTE KERNEL

[Kernel diagram: input data stream → Get Data … Send Data → output data stream]


class akernel : public raft::kernel
{
public:
    akernel() : raft::kernel()
    {
        //add input ports
        input.addPort< /** type **/ >( "x0", "x1", "x..." );
        //add output ports
        output.addPort< /** type **/ >( "y0", "y1", "y..." );
    }

    virtual raft::kstatus run()
    {
        /** get data from input ports **/
        auto &valFromX( input[ "x..." ].peek< /** type of "x..." **/ >() );
        /** do something with data **/
        const auto ret_val( do_something( valFromX ) );
        output[ "y..." ].push( ret_val );
        input[ "x..." ].unpeek();
        input[ "x..." ].recycle();
        return( raft::proceed /** or stop **/ );
    }
};

COMPUTE KERNEL

class example : public raft::parallel_k
{
public:
    example() : parallel_k()
    {
        input.addPort< /** some type **/ >( "0" );
        /** add a starter output port **/
        addPort();
    }

    /** implement virtual function **/
    virtual std::size_t addPort()
    {
        return( (this)->addPortTo< /** type **/ >( output /** direction **/ ) );
    }
};

[Diagram: the example kernel fans out to b, c, “...”]

RECEIVING DATA

/**
 * return reference to memory on in-bound stream
 */
template< class T >
T& peek( raft::signal *signal = nullptr )

template< class T >
autorelease< T, peekrange > peek_range( const std::size_t n )

•Returns object with “special access to stream”
•Operator [] overload returns autopair
•Direct reference, as in peek(), for each element

template < class T >
struct autopair
{
    T              &ele;
    Buffer::Signal &sig;
};
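A hedged usage sketch (port name, element type, and process() are hypothetical; exact autorelease semantics may vary by version):

/* inside a kernel's run(): examine n elements without copying */
const std::size_t n( 4 );
auto range( input[ "x0" ].peek_range< int >( n ) );
for( std::size_t i( 0 ); i < n; i++ )
{
    process( range[ i ].ele );  /* operator[] yields an autopair; .ele is a direct reference */
}
/* the autorelease object returns the range to the stream on scope exit */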

RECEIVING DATA

/**
 * necessary to tell inbound stream we're done looking at it
 */
virtual void unpeek()

/**
 * free up index in fifo, lazily deallocate large objects
 */
void recycle( const std::size_t range = 1 )

[Flowchart: on resize/relocate, check “is a peek() in progress?”; if the object is externally allocated, wait to see if it is pushed downstream; otherwise immediately free the slot within the stream]

RECEIVING DATA

/**
 * these pops produce a copy
 */
template< class T >
void pop( T &item, raft::signal *signal = nullptr )

template< class T >
using pop_range_t = std::vector< std::pair< T, raft::signal > >;

template< class T >
void pop_range( pop_range_t< T > &items, const std::size_t n_items )

/**
 * no copy, slightly higher overhead, "smart object"
 * implements peek, unpeek, recycle
 */
template< class T >
autorelease< T, poptype > pop_s()
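For flavor, the copying form inside run() might look like this (port names and type are hypothetical):

/* copy one element off "x0", transform, forward on "y0" */
int value;
input[ "x0" ].pop( value );        /* copying pop; signal argument omitted */
output[ "y0" ].push( value * 2 );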

SENDING DATA

/** in-place allocation **/
template < class T, class ... Args >
T& allocate( Args&&... params )

[Decision tree: if sizeof( T ) <= cache line → has constructor? yes → in-place new with constructor; no → bare uninitialized memory; otherwise → allocate from external pool]

/** in-place alloc of range for fundamental types **/
template < class T >
auto allocate_range( const std::size_t n )
    -> std::vector< std::reference_wrapper< T > >

SENDING DATA

/** release data to stream **/
virtual void send( const raft::signal = raft::none )

/** release data to stream **/
virtual void send_range( const raft::signal = raft::none )

/** oops, don't need this memory **/
virtual void deallocate()
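Putting allocate and send together, a hedged sketch (type and port name are hypothetical):

/* construct directly in stream memory, then release downstream */
auto &out( output[ "y0" ].allocate< std::string >( "hello" ) );
out += " world";          /* mutate in place before release */
output[ "y0" ].send();    /* or deallocate() if the memory isn't needed */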

SENDING DATA


/** multiple forms **/
template < class T >
void push( const T &item, const raft::signal signal = raft::none )

/** insert from container within run() function to stream **/
template< class iterator_type >
void insert( iterator_type begin,
             iterator_type end,
             const raft::signal signal = raft::none )

INCLUDED KERNELS

/**
 * thread safe print, specialization for '\n' vs. '\0'
 */
template< typename T, char delim = '\0' >
class print

/** read from iterator to streams **/
static raft::readeach< T, Iterator >
read_each( Iterator &&begin, Iterator &&end )

/** write from iterator to streams **/
template < class T, class BackInsert >
static writeeach< T, BackInsert >
write_each( BackInsert &&bi )
{
    return( writeeach< T, BackInsert >( bi ) );
}
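These pieces compose into a complete program; the following is modeled on the project's hello-world example (hedged: API details may differ across RaftLib versions; see raftlib.io for the current form):

#include <raft>
#include <raftio>
#include <cstdlib>
#include <string>

class hello : public raft::kernel
{
public:
    hello() : raft::kernel()
    {
        output.addPort< std::string >( "0" );
    }

    virtual raft::kstatus run()
    {
        output[ "0" ].push( std::string( "Hello World\n" ) );
        return( raft::stop );  /* fire once, then quit */
    }
};

int main( int argc, char **argv )
{
    hello h;
    raft::print< std::string > p;  /* included print kernel */
    raft::map m;
    m += h >> p;                   /* hello's output feeds print's input */
    m.exe();                       /* run the map to completion */
    return( EXIT_SUCCESS );
}

CONNECTING COMPUTE KERNELS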

raft::map m;
/** example only **/
raft::kernel a, b;
m += a >> b;

[Diagram: a → b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b;
m += a[ "y0" ] >> b;

[Diagram: a “y0” → b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b;
m += a[ "y0" ] >> b[ "x0" ];

[Diagram: a “y0” → “x0” b]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a <= b >= c;

Topology the user specifies:

[Diagram: a → b → c]

RaftLib Turns Into

[Diagram: a → cloned b kernels → c]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c, d;
m += a <= b >> c >= d;

RaftLib Turns Into

[Diagram: a → parallel b → c chains → d]

CONNECTING COMPUTE KERNELS

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a >> raft::order::out >> b >> raft::order::out >> c;

•Clone at SESE points
•Use CLONE macro

[Diagram: a → cloned b kernels → c]

AUTO-PARALLELIZATION

1) RaftLib figures out methodology
2) Runtime calls CLONE() macro of kernel “b”

#define CLONE() \
virtual raft::kernel* clone() \
{ \
    auto *ptr( \
        new typename std::remove_reference< decltype( *this ) >::type( \
            ( *( (typename std::decay< decltype( *this ) >::type * ) this ) ) ) ); \
    /** RL needs to dealloc this one **/ \
    ptr->internal_alloc = true; \
    return( ptr ); \
}
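In user code the macro simply sits inside a copy-constructible kernel; a hedged sketch (the match kernel is hypothetical):

class match : public raft::kernel
{
public:
    match() : raft::kernel()
    {
        input.addPort< std::string >( "in" );
        output.addPort< bool >( "out" );
    }

    /* clone() copy-constructs, so copying must remain valid */
    match( const match &other ) = default;

    virtual raft::kstatus run()
    {
        /* ... match logic ... */
        return( raft::proceed );
    }

    CLONE();  /* lets the runtime duplicate this kernel at run time */
};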

[Diagram: a → cloned b kernels → c]

AUTO-PARALLELIZATION

3) Lock output port container of “a”
4) Register new port
5) Decide where to run it
6) Allocate memory for stream
7) Unlock output container of kernel “a”
// do the same on the output side of “b”

[Diagram: a → cloned b kernels → c]

CONNECTION TODO ITEMS

➤ Better SESE implementation
➤ Decide on syntax for set of kernels:

raft::map m;
/** example only **/
raft::kernel a, b, c, d;
m += src <= raft::kset( a, b, c, d ) >= dst;

➤ Address space stream modifier (new VM space):

raft::map m;
/** example only **/
raft::kernel a, b, c;
m += a >> raft::vm::part >> b >> raft::vm::part >> c;

➤ Anything else?

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

IMPLEMENTATION DETAILS

[Flow: Define Topology → Check Graph Connectivity & Streaming Types → Map Kernels to Resources → Continually Monitor Kernel Performance / Continually Monitor Buffer Behavior]

TOPOLOGY CHECK

➤ Add kernels to map
➤ Check type of each link (potential for type optimization)
➤ Handle static split/joins (produce any new kernels)
➤ DFS to ensure no unconnected edges

[Diagram: Read File, Distribute → Match 1 … Match i … Match n → Reduce]

PARTITION (LINUX / UNIX)

➤ Take RaftLib representation, convert to Scotch format
➤ Use fixed communications cost at each edge for initial partition
➤ 2 main dimensions:
    ➤ Flow between each edge in the application graph
    ➤ Bandwidth available between compute resources
➤ Set affinity to partitioned compute cores
➤ Repartition as needed @ run-time
➤ TODO: incorporate OpenMPI utilities hwloc and netloc to get cross-platform hardware topology information

CHOOSE ALLOCATIONS

➤ Alignment:
    ➤ SIMD ops often require memory alignment; RaftLib takes an alignment-by-default approach for in-stream allocations
➤ In-stream vs. External Pool Allocate (template allocators)

[Decision tree, as before: if sizeof( T ) <= cache line → has constructor? yes → in-place new; no → bare uninitialized memory; otherwise → allocate from external pool]

MONITOR BEHAVIOR - ALLOCATORS

➤ Two options for figuring out optimal buffer size while running:
    ➤ model based (discussed on the modeling adventure path)
    ➤ branch & bound search
➤ Separate thread, exits when app done
➤ Pseudocode:

while( not done )
{
    if( queue_utilization > .5 )
    {
        queue->resize();
    }
    sleep( ALLOC_INTERVAL );
}

RUN-TIME LOCK FREE FIFO RESIZING

enum access_key : key_t
{
    allocate       = 0,
    allocate_range = 1,
    push           = 3,
    recycle        = 4,
    pop            = 5,
    peek           = 6,
    size           = 7,
    N
};

struct ThreadAccess
{
    union
    {
        std::uint64_t whole = 0; /** just in case, default zero **/
        dm::key_t     flag[ 8 ];
    };
    std::uint8_t padding[ L1D_CACHE_LINE_SIZE - 8 /** padding **/ ];
}
#if defined __APPLE__ || defined __linux
__attribute__((aligned( L1D_CACHE_LINE_SIZE )))
#endif
volatile thread_access[ 2 ];

/** std::memory_order_relaxed **/
std::atomic< std::uint64_t > checking_size = { 0 };

RUN-TIME LOCK FREE FIFO RESIZING

➤ Optimization…wait for the right conditions:

if( rpt < wpt )
{
    // perfect to copy w/ std::memcpy
}

➤ Factory allocators:

/** allocator factory map **/
std::map< Type::RingBufferType, instr_map_t* > const_map;

/** initialize some factories **/
const_map.insert( std::make_pair( Type::Heap, new instr_map_t() ) );

const_map[ Type::Heap ]->insert(
    std::make_pair( false /** no instrumentation **/,
                    RingBuffer< T, Type::Heap, false >::make_new_fifo ) );

const_map[ Type::Heap ]->insert(
    std::make_pair( true /** yes instrumentation **/,
                    RingBuffer< T, Type::Heap, true >::make_new_fifo ) );

…many more

MONITOR BEHAVIOR - PARALLELIZATION

➤ Mechanics covered in the interface section; simple model here
➤ Run in separate thread, terminate on exit

/** apply criteria **/
if( in_utilization > .5 && out_utilization < .5 )
{
    // tag kernel
    auto &tag( ag[ reinterpret_cast< std::uintptr_t >( kernel ) ] );
    tag += 1;
    if( tag == 3 )
    {
        dup_list.emplace_back( kernel );
    }
}

// after checking all kernels, handle duplication
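A hedged sketch of that duplication step (names come from the snippet above; per the earlier slides, the real runtime locks ports, registers the new stream, places the clone, then unlocks):

for( auto *kernel : dup_list )
{
    auto *clone( kernel->clone() );  /* provided by the CLONE() macro */
    /* steps 3–7: lock ports, register new stream, pick a core, unlock */
}

IMPLEMENTATION TODO ITEMS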

➤ Find fast SVM library to integrate (research code used LibSVM) for buffer model selection
➤ Integrate more production-capable network flow model for run-time re-partitioning choices
➤ Performant TCP links…
➤ RDMA on wish list
➤ QThreads integration (see pool scheduler)
➤ hwloc and netloc integration (see partition_scotch)
➤ Perf data caching (useful for initial partition)

CHOOSE YOUR ADVENTURE

Models Interface ⏏ Implementation

PERFORMANCE

➤ Decent compared to the stock pthread implementation of pbzip2
➤ Parallel bzip2 example: https://goo.gl/xyQAhm

➤ Fixed string search compared to Apache Spark and GNU Parallel + GNU grep

[Chart: throughput of RaftLib (Boyer-Moore), RaftLib (Aho-Corasick), Apache Spark, and GNU Parallel + GNU grep]

ABOUT ME

my website: http://www.jonathanbeard.io

slides at: http://goo.gl/cwT5UB

project page: raftlib.io

video of this talk (#CppNow2016): http://goo.gl/mbxAwK

Stream Processing

for i ← 0 through N do
    a[i] ← ( b[i] + c[i] )
    i++
end do

Traditional Control Flow vs. Streaming:

[Flowchart (control flow): read a, b, c, i; while i <= N: a[i] ← ( b[i] + c[i] ), i++; then exit]
[Flowchart (streaming): read b, c → out ← b + c → write a]

SBS Stream Based Supercomputing Lab, http://sbs.wustl.edu