Computer Systems @ Colorado | Computer and Cyberphysical Systems Research at Colorado | http://systems.cs.colorado.edu

SKIR: Just-in-Time Compilation for Parallelism with LLVM

Jeff Fifield, University of Colorado

2011 LLVM Developers Meeting, November 18, 2011

The SKIR Project is:

- Stream and Kernel Intermediate Representation (SKIR)
- SKIR = LLVM + Streams/Kernels
- Stream language front-ends
- SKIR code optimizations for stream parallelism
- Dynamic scheduler for shared memory
- LLVM JIT back-end for CPU
- LLVM to OpenCL back-end for GPU

What is Stream Parallelism?

stateful  Regular data-centric computation kernel

 Independent processing stages data

with explicit communication e

n parallel

i

l

e kernel

p

 i

Pipeline, Data, and Task p Parallelism split/scatter

 Examples:

fine  Digital Signal Processing

 Encoding/Decoding

y

t

i r  a Compression, Cryptography

l

u streams n  join/gather a Video, Audio

r

g  Network Processing

 Real-time “Big Data” services coarse task data

Formal Models of Stream Parallelism

- Kahn Process Networks (KPN) [1]
  - computation as a graph of independent processes
  - communication over unidirectional FIFO channels
  - block on read from an empty channel, never block on write
  - deterministic kernels ⇒ deterministic network

- Synchronous Data Flow Networks (SDF) [2]
  - restriction of KPN where kernels have fixed I/O rates
  - allows better analysis
  - allows static scheduling techniques
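To make the static-scheduling claim concrete: with fixed I/O rates, a steady-state schedule can be computed from balance equations. For an edge where kernel A pushes p tokens per firing and kernel B pops c tokens per firing, the firing counts must satisfy rep(A)*p == rep(B)*c. A minimal sketch (illustrative helper names, not SKIR code):

```c
#include <assert.h>

/* Greatest common divisor, used to find the smallest repetition vector. */
static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/*
 * Smallest positive firing counts (rep_a, rep_b) for a two-kernel SDF
 * edge where A pushes `p` tokens per firing and B pops `c` tokens per
 * firing.  Balance equation: rep_a * p == rep_b * c.
 */
static void sdf_repetitions(int p, int c, int *rep_a, int *rep_b)
{
    int g = gcd(p, c);
    *rep_a = c / g;   /* A fires c/g times per steady-state round ... */
    *rep_b = p / g;   /* ... while B fires p/g times */
}
```

A scheduler can then unroll one steady-state round statically, with no runtime occupancy checks.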

[1] G. Kahn, The semantics of a simple language for parallel programming, Information Processing (1974), 471–475.
[2] E. A. Lee and D. G. Messerschmitt, Static scheduling of synchronous data flow programs for digital signal processing, IEEE Transactions on Computers 36 (1987), no. 1, 24–35.

Why Stream Parallelism? It can target increasingly diverse parallel hardware

- Intel Knights Corner: 50 cores
- AMD Bulldozer: shared FPU
- AMD Llano: on-die GPU
- Intel Sandy Bridge: shared L3 between graphics and CPU
- NVIDIA Tegra 3: asymmetric multicore

Why Stream Parallelism? It supports a variety of data-centric applications

- Audio processing
- Physics Simulations
- Image processing
- Financial Applications
- Compression
- Network Processing
- Encryption
- Computational Bio
- Data Mining
- Game Physics
- Software Radio
- Twitter Parsing
- 2-D and 3-D Graphics
- Marmot Detection
- And many more...

Why SKIR? Embedded domain-specific parallelism is useful

- Embed domain-specific knowledge into the language, compiler, or runtime
- Programming model tailored to your problem
  - higher level of abstraction ⇒ higher productivity
  - restricted programming model ⇒ higher performance
- Your favorite language, with better parallelism
- Examples:
  - PLINQ: optimizable embedded query language
  - ArBB: vector computation on parallel arrays
  - CUDA: memory and execution models for GPUs
  - SKIR: language-independent stream parallelism

[Figure: compilation flow from Application / Parallel Algorithm through a Standard Compiler (offline) and a Domain-Specific Compiler (runtime) down to Parallel Hardware]

SKIR: Overview

- Organized as a JIT compiler
  - for performance portability
  - for dynamic program graphs
  - for dynamic optimization
- SKIR intrinsics for LLVM
  - stream communication
  - static or dynamic stream graph manipulation

[Figure: application source code flows through a standard offline compiler into the SKIR compiler & runtime, which targets parallel & heterogeneous hardware: CPU and GPU now; FPGA and others as future work]

SKIR Example

Construct and execute a 4-stage pipeline (SKIR pseudo-code). The resulting stream graph is PRODUCER → ADDER → ADDER → CONSUMER:

    bool PRODUCER (int *state, void *ins[], void *outs[]) {
      *state = *state + 1
      skir.push(0, state)
      return false
    }

    bool ADDER (int *state, void *ins[], void *outs[]) {
      int data
      skir.pop(0, &data)
      data += *state
      skir.push(0, &data)
      return false
    }

    bool CONSUMER (int *state, void *ins[], void *outs[]) {
      int data
      if (*state == 0) return true
      skir.pop(0, &data)
      print(data)
      *state = *state - 1
      return false
    }

    main() {
      int counter = 0
      int limit = 20
      int one = 1
      int neg = -1

      stream Q1[2], Q2[2], Q3[2]
      kernel K1, K2, K3, K4

      Q1[0] = skir.stream(sizeof(int)); Q1[1] = 0
      Q2[0] = skir.stream(sizeof(int)); Q2[1] = 0
      Q3[0] = skir.stream(sizeof(int)); Q3[1] = 0

      K1 = skir.kernel(PRODUCER, &counter)
      K2 = skir.kernel(ADDER, &one)
      K3 = skir.kernel(ADDER, &neg)
      K4 = skir.kernel(CONSUMER, &limit)

      skir.call(K1, NULL, Q1)
      skir.call(K2, Q1, Q2)
      skir.call(K3, Q2, Q3)
      skir.call(K4, Q3, NULL)
      skir.wait(K4)
    }
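To sanity-check the example's behavior without a SKIR runtime, the same 4-stage pipeline can be emulated in plain C with an in-memory FIFO standing in for each stream (the FIFO and helper names below are illustrative, not SKIR API):

```c
#include <assert.h>
#include <stdbool.h>

/* A tiny bounded FIFO standing in for a SKIR stream. */
typedef struct { int buf[64]; int head, tail; } fifo;
static void fifo_push(fifo *q, int v) { q->buf[q->tail++ % 64] = v; }
static int  fifo_pop(fifo *q)         { return q->buf[q->head++ % 64]; }
static bool fifo_empty(fifo *q)       { return q->head == q->tail; }

/* Run PRODUCER -> ADDER(+1) -> ADDER(-1) -> CONSUMER for `limit`
 * items, recording what CONSUMER would print into `out`. */
static void run_pipeline(int limit, int *out)
{
    fifo q1 = {0}, q2 = {0}, q3 = {0};
    int counter = 0, one = 1, neg = -1;
    for (int n = 0; n < limit; n++) {
        counter = counter + 1;                    /* PRODUCER  */
        fifo_push(&q1, counter);
        fifo_push(&q2, fifo_pop(&q1) + one);      /* ADDER(+1) */
        fifo_push(&q3, fifo_pop(&q2) + neg);      /* ADDER(-1) */
        out[n] = fifo_pop(&q3);                   /* CONSUMER  */
    }
    assert(fifo_empty(&q1) && fifo_empty(&q2) && fifo_empty(&q3));
}
```

The two adders cancel, so the consumer sees the producer's counter values 1..limit in order.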

SKIR as a compiler target: C

    int subtracter_work(void *state, skir_stream_ptr_t *ins, skir_stream_ptr_t *outs)
    {
        float f0;
        float f1;
        __SKIR_pop(0, &f0);
        __SKIR_pop(0, &f1);
        f0 = f1 - f0;
        __SKIR_push(0, &f0);
        return 0;
    }

    skir_stream_ptr_t build_band_pass_filter(skir_stream_ptr_t src, float rate,
                                             float low, float high, int taps)
    {
        skir_stream_ptr_t ins[2] = {0};
        skir_stream_ptr_t outs[2] = {0};
        src = build_bpf_core(src, rate, low, high, taps);
        skir_kernel_ptr_t sub = __SKIR_kernel((void*)subtracter_work, 0);
        ins[0] = src;
        outs[0] = __SKIR_stream(sizeof(float));
        __SKIR_call(sub, ins, outs);
        return outs[0];
    }

One-to-one mapping of SKIR operations to C intrinsics.

SKIR as a compiler target: C++

A high-level C++ library maps object-oriented stream parallelism onto SKIR intrinsics; the stream parallelism is embedded deep inside the application.

    class CalculateForces : public Kernel
    {
    public:
        float m_pos_rd[4*NBODIES];
        float m_softeningSquared;

        CalculateForces(float &softeningSquared)
            : m_softeningSquared(softeningSquared)
        { }

        void interaction(float *accel, int pos0, int pos1)
        {
            // compute acceleration
            ...
        }

        static int work(CalculateForces *me, StreamPtr ins[], StreamPtr outs[])
        {
            Stream in(ins, 0);
            Stream out(outs, 0);
            float force[3] = {0.0f, 0.0f, 0.0f};
            int i = in.pop()*4;
            int N = in.pop()*4;

            for (int j = 0; j < N; j += 16) {
                me->interaction(force, j, i);
                me->interaction(force, j+4, i);
                me->interaction(force, j+8, i);
                me->interaction(force, j+12, i);
            }

            float f = i/4;
            out.push(f);
            out.push(force[0]);
            out.push(force[1]);
            out.push(force[2]);
            return 0;
        }
    };

The kernels are wired together and driven from ordinary application code:

    update(..) {
        (*m_pairs)(nul0, s0);
        (*m_calc)(s0, s1);
        (*m_integ)(s1, nul1);
        m_integ->wait();
        m_calc->wait();
        m_pairs->wait();
    }

    display() {
        ...
        nbody->update(∆t);
        renderer->setPositions(...);
        ...
    }

Example: N-Body Simulation from the CUDA SDK.

SKIR as a compiler target: StreamIt

- Stream language from MIT
- Independent Filters
- FIFO Streams
- Synchronous Data Flow
  - Fixed I/O rates
  - Fixed stream graph structure

Toolchain: StreamIt source code → StreamIt-to-SKIR front-end compiler (offline, with a bitcode library) → SKIR bitcode → SKIR JIT & scheduler (runtime) → parallel hardware.

    float->float pipeline BandPassFilter(float rate, float low, float high, int taps) {
        add BPFCore(rate, low, high, taps);
        add Subtracter();
    }

    float->float splitjoin BPFCore(float rate, float low, float high, int taps) {
        split duplicate;
        add LowPassFilter(rate, low, taps, 0);
        add LowPassFilter(rate, high, taps, 0);
        join roundrobin;
    }

    float->float filter Subtracter {
        work pop 2 push 1 {
            push(peek(1) - peek(0));
            pop();
            pop();
        }
    }
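The `split duplicate` / `join roundrobin` structure can be mimicked in plain C to see the data flow: every input goes to both branches, and the round-robin join interleaves branch outputs so the subtracter sees them as adjacent pairs. An illustrative sketch (scaling branches stand in for the two LowPassFilters; all names are hypothetical):

```c
#include <assert.h>

/*
 * Duplicate-split into two branches, round-robin join, then a
 * subtracter that pops 2 and pushes 1, computing peek(1) - peek(0).
 * Branch 0 scales by `a`, branch 1 by `b`, so each output is b*x - a*x.
 */
static void splitjoin_subtract(const float *in, int n, float a, float b,
                               float *out)
{
    for (int i = 0; i < n; i++) {
        float branch0 = a * in[i];   /* duplicate: both branches see in[i] */
        float branch1 = b * in[i];
        /* roundrobin join emits branch0 then branch1; the subtracter
         * then computes peek(1) - peek(0) = branch1 - branch0 */
        out[i] = branch1 - branch0;
    }
}
```

With real low-pass filters in the branches, this subtraction is exactly what turns two low-pass responses into a band-pass response.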

SKIR as a compiler target: JavaScript

Sluice: SKIR-based acceleration of JavaScript. StreamIt-style stream parallelism for the node.js/V8 JavaScript environment, with dynamic recompilation and type inference (JS → SKIR → LLVM).

    function Adder(arg) {
        this.a = arg;
        this.work = function() {
            var e = this.pop();
            e = e + this.a;
            e = e + this.a;
            this.push(e);
            return false;
        }
    }

    var a0 = new Adder(1);
    var a1 = new Adder(1);
    var a2 = new Adder(1);
    var sj = Sluice.SplitRR(1,a0,a1,a2).JoinRR(1);
    var p = Sluice.Pipeline(new Count(10),
                            sj,
                            new Printer());

    > p.run()
    1 2 3 4 5 6 7 8 9 10

[Figure: kernel graph K1 → split → Adder instances → join → K3; input and output streams live in shared memory, and an offloaded K2 runs across a process boundary with a copy of its state]

J. Fifield and D. Grunwald. A Methodology for Fine-Grained Parallelization of JavaScript Applications. LCPC 2011.

Compiling SKIR: Overview

- Performance
  - Kernel Analysis
  - Dynamic Batching
  - Coroutine Elimination
  - Kernel Specialization
- Portability
  - Stream Graph Transforms: fission, fusion
- Hardware
  - Compile for GPU

Compiling SKIR: Kernel Analysis

Attempt to extract SDF semantics from arbitrary kernels:

- push, pop, peek rates
- data parallel vs. stateful

    int high_pass_filter_work(high_pass_filter_t *state, ...)
    {
        /* FIR filtering operation as convolution sum. */
        float sum = 0;
        for (int i = 0; i < 64; i++) {      /* peek rate: 64 */
            float f;
            __SKIR_peek(0, &f, i);
            sum += state->h[i]*f;           /* read-only state */
        }
        __SKIR_push(0, &sum);               /* push rate: 1 */
        float e;
        __SKIR_pop(0, &e);                  /* pop rate: 1 */
        return 0;
    }

C version of a kernel from the StreamIt channel vocoder benchmark.

Leverage existing LLVM analyses: Dominator Trees, Loop Info, Scalar Evolution, Def/Use Information, Stack/alloca Information.

Scheduling SKIR: Demand and data driven execution

Strategy: schedule kernels by following demand for data and buffer space.

1) Run a kernel until data or buffer space runs out
2) Locate the kernel causing the blockage
3) Switch to the blocking kernel

Advantages:
- Doesn't require global task queues
- Doesn't access global program structure
- Attempts to preserve locality

Challenges:
- Avoiding unnecessary execution
- Making it fast
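The three steps above can be sketched as a single-threaded pull loop: try to fire the downstream kernel, and when its input is empty, switch to the upstream kernel that caused the blockage (a minimal sketch with hypothetical producer/consumer kernels, not the SKIR scheduler):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int buf[16]; int head, tail; } chan;
static bool chan_empty(chan *c) { return c->head == c->tail; }

/* Producer kernel: push the next integer.  Consumer kernel: pop and
 * accumulate.  The loop runs the consumer on demand and falls back to
 * the producer whenever the consumer would block on empty input. */
static int demand_driven_sum(int nitems)
{
    chan q = {0};
    int next = 0, sum = 0, fired = 0;
    while (fired < nitems) {
        if (chan_empty(&q)) {
            /* consumer is blocked: locate and run the kernel
             * causing the blockage (the producer) */
            q.buf[q.tail++ % 16] = ++next;
        } else {
            sum += q.buf[q.head++ % 16];   /* switch back to consumer */
            fired++;
        }
    }
    return sum;
}
```

No global task queue is consulted; each scheduling decision only inspects the blocked kernel's local stream.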

Coroutine Scheduling: How SKIR creates demand driven execution

During code generation, we transform kernels into coroutines by specializing stream communication. In general, we cannot know when a kernel will block until a stream push/pop expression is executing; this is especially true when kernels execute in parallel.

Before:

    ...
    e = pop();
    if (e > N)
        push(e);
    ...

After specialization:

    ...
    while (input.is_empty())
        yield input.src
    e = input.read()
    if (e > N) {
        while (output.is_full())
            yield output.dst
        output.write(e)
    }
    ...

The runtime scheduler resumes kernels that yielded on a stream's source or destination as the set of active kernels changes.

Scheduling SKIR: Obtaining parallel execution with task stealing

TBB+SKIR work-stealing algorithm. Each worker thread chooses its next kernel by the first applicable rule:

(1) Run the blocking kernel. This rule does not apply if the task could not be spawned or stolen.
(2) Pop a kernel from the bottom of its own deque. This rule does not apply if the deque is empty.
(3) Steal a kernel from the top of another randomly chosen deque. If the chosen deque is empty, the thread tries this rule again until it succeeds.

Within a kernel: 1) run until data or buffer space runs out; 2) locate the kernel causing the blockage; 3) steal or spawn the blocking kernel; 4) push the blocked kernel to the bottom of the deque.
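The rules above rely on the usual work-stealing deque asymmetry: the owner pops the newest kernel from the bottom (good locality), while thieves steal the oldest from the top. A minimal single-threaded sketch of that discipline (illustrative; the real implementation uses TBB's concurrent deques):

```c
#include <assert.h>

typedef struct { int items[32]; int top, bottom; } deque;

static void push_bottom(deque *d, int k) { d->items[d->bottom++] = k; }

/* Owner: pop the newest kernel; -1 if empty. */
static int pop_bottom(deque *d)
{
    return d->bottom > d->top ? d->items[--d->bottom] : -1;
}

/* Thief: steal the oldest kernel; -1 if empty. */
static int steal_top(deque *d)
{
    return d->bottom > d->top ? d->items[d->top++] : -1;
}
```

Owners and thieves operate on opposite ends, which is what keeps contention low and recently-touched kernels on the core that touched them.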

Compiling SKIR: Dynamic Batching

Without batching (high overhead for small kernels), every work invocation returns to the scheduler:

    call → work {
               skir.pop(...)
               ...
               skir.push(...)
           } → return / yield → Scheduler

With dynamic batching, the work function is wrapped in a loop and runs as long as data and buffer space are available:

    call → work {
               while (1) {
                   skir.pop(...)
                   ...
                   skir.push(...)
               }
           } → return / yield → Scheduler
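The payoff can be illustrated by counting scheduler round trips. In this sketch, one dispatch drains up to `chunk` items; `chunk == 1` models the unbatched kernel and larger chunks model batching (illustrative model, not measured SKIR behavior):

```c
#include <assert.h>

/* Number of scheduler dispatches needed to process `total` items when
 * each dispatch runs the work function until `chunk` items are done
 * (or data runs out). */
static int dispatches(int total, int chunk)
{
    int calls = 0;
    for (int done = 0; done < total; done += chunk)
        calls++;            /* one call/return round trip per batch */
    return calls;
}
```

For a fine-grained kernel, the per-dispatch overhead dominates, so cutting dispatches from one-per-item to one-per-batch is a direct win.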

Compiling SKIR: Coroutine Elimination

The default coroutine code transformation is fine for coarse-grained kernels, but it has high overhead for fine-grained kernels: the per-iteration blocking checks are not good if `do_actual_work` is small. We can be smarter for kernels with fixed I/O rates.

Before:

    work(...) {
        while (1) {
            while (input.is_empty())
                yield input.src
            e = input.read()

            do_actual_work

            while (output.is_full())
                yield output.dst
            output.write(e)
        }
    }

After:

    work(...) {
        while (1) {
            n = niters(input, output)
            while (n--) {
                e = input.read()

                do_actual_work

                output.write(e)
            }
        }
    }
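The hoisted check comes down to computing how many iterations are safe up front. A sketch of the `niters` computation for a fixed-rate kernel, assuming pop rate `r_in` and push rate `r_out` (names illustrative):

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Safe iteration count for a fixed-rate kernel: limited both by the
 * tokens available on the input stream and by the free space on the
 * output stream. */
static int niters(int in_avail, int out_space, int r_in, int r_out)
{
    return min_int(in_avail / r_in, out_space / r_out);
}
```

The inner loop can then run `niters` times with no emptiness or fullness checks at all.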

Impact of Coroutine Elimination

Measured on a fine-grained kernel:

    int -> int filter adder {
        work {
            push(pop() + pop());
        }
    }

[Figure: x86-64 disassembly of the adder work loop with and without coroutine elimination. Without elimination, each iteration pays for yield checks and scheduler bookkeeping around the stream reads and writes; with elimination, the iteration count is calculated once (by the scheduler from profiling input) and the loop body reduces to the stream read/write address calculations, the loads/stores, and the add itself.]

Compiling SKIR: Kernel Fusion

Fusion rewrites an internal stream's pushes as stores and its pops as loads, then inlines the two work functions into one.

Original graph (producer feeding consumer over an internal stream):

    work(...) {                 work(...) {
        pop(0, a)                   pop(0, e)
        pop(1, b)                   ...
        ...                         push(0, f)
        push(0, c)              }
        push(0, d)
    }

After renumbering, the producer's pushes and the consumer's pops on the internal stream get fresh indices (push(M, c), push(M, d); pop(N, e0), pop(N, e1)). After inlining, the final fused kernel replaces them with stores and loads:

    work(...) {
        pop(0, a)
        pop(1, b)
        ...
        store c
        store d
        load c
        ...
        push(0, f0)
        load d
        ...
        push(0, f1)
    }

- Developed a fusion algorithm for SKIR
- Dynamic fusion shows performance benefits
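The store/load rewrite can be checked for equivalence on a toy pipeline. This sketch (hypothetical stages: double, then add 3; not the SKIR algorithm's output) shows the fused form computing the same result with the internal queue replaced by a local:

```c
#include <assert.h>

/* Two-kernel pipeline: stage 1 doubles, stage 2 adds 3, connected by
 * a one-slot internal stream. */
static int pipeline_unfused(int x)
{
    int queue[1];             /* internal stream between the kernels */
    queue[0] = x * 2;         /* kernel 1: push(0, c) */
    int c = queue[0];         /* kernel 2: pop(0, e)  */
    return c + 3;
}

/* Fused kernel: the push became a store to a local, the pop a load. */
static int pipeline_fused(int x)
{
    int c = x * 2;
    return c + 3;
}
```

Beyond removing queue traffic, fusing exposes both stages to the same LLVM optimizer pass, so the store/load pair itself is typically eliminated.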

Compiling SKIR: Kernel Fission

Before fission, a single kernel K processes the whole stream. After fission, a splitter distributes work across copies K0, K1, K2, K3, and a joiner merges their outputs.

- Kernel fission is easy to implement for SKIR
- Automatic fission by the SKIR runtime
- Manual fission by the programmer or language
- One of many methods to exploit data parallelism
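Round-robin fission of a stateless kernel preserves output order because the joiner reads the copies in the same rotation as the splitter wrote them. An illustrative sequential sketch with a hypothetical `square` kernel and four copies:

```c
#include <assert.h>

static int square(int x) { return x * x; }

/* Apply `square` via 4-way round-robin fission: the splitter deals
 * items to lanes 0..3, the joiner drains the lanes in the same order,
 * so the output order matches the input order. */
static void fission_map(const int *in, int n, int *out)
{
    enum { K = 4 };
    int pending[K];
    int lane = 0, base = 0;
    for (int i = 0; i < n; i++) {
        pending[lane] = square(in[i]);     /* splitter -> copy `lane` */
        if (++lane == K || i == n - 1) {   /* joiner drains the lanes */
            for (int j = 0; j < lane; j++)
                out[base + j] = pending[j];
            base += lane;
            lane = 0;
        }
    }
}
```

In the real runtime each lane is an independent kernel instance that can run on its own core; statelessness is what makes the copies safe.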

Compiling SKIR: OpenCL Backend

[Figure: a CPU-side pipeline K1 → K2 → K3 where the data-parallel K2 is offloaded to the GPU. Input stream data is copied into an OpenCL buffer; many iterations of K2's work function (pop a; pop b; ...; push c; push d; push e) run in parallel as an OpenCL kernel, each with access to a copy of the read-only state; results are copied back through an OpenCL buffer to the output stream. OpenCL control and API calls run on the CPU.]

- Transparent execution on the GPU via OpenCL
- Modified version of the LLVM C backend to emit OpenCL kernels
- Any data-parallel kernel with decidable state
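The offload works because each work invocation of a data-parallel kernel is independent: iteration i is a pure function of its own slice of the input plus read-only state, so iterations can become GPU work items. A host-side sketch of that iteration-space mapping, using the pop/pop/push/push/push shape from the figure (illustrative emulation, not OpenCL API):

```c
#include <assert.h>

/* Each "work item" consumes 2 floats and produces 3; `scale` stands
 * in for the read-only state that would be copied to the device once.
 * Because no iteration touches another's slice, every `i` could run
 * as an independent GPU thread. */
static void data_parallel_batch(const float *in, int iters,
                                float scale, float *out)
{
    for (int i = 0; i < iters; i++) {
        float a = in[i * 2 + 0];            /* pop(0, a) */
        float b = in[i * 2 + 1];            /* pop(0, b) */
        out[i * 3 + 0] = a * scale;         /* push(0, c) */
        out[i * 3 + 1] = b * scale;         /* push(0, d) */
        out[i * 3 + 2] = (a + b) * scale;   /* push(0, e) */
    }
}
```

The fixed pop/push rates are what let the backend compute each work item's input and output offsets from its index alone.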

Summary

- Optimized stream parallelism using LLVM
- Dynamic scheduling
- Performance: good!
- Future work:
  - Use for ongoing network & signal processing research
  - Better GPU support
  - Vectorization
  - Open source soon

Contact:

[email protected]
http://systems.cs.colorado.edu