Computer Systems @ Colorado: Computer and Cyberphysical Systems Research at Colorado (http://systems.cs.colorado.edu)
SKIR: Just-in-Time Compilation for Parallelism with LLVM
Jeff Fifield, University of Colorado
2011 LLVM Developers Meeting, November 18, 2011
The SKIR Project is:
- Stream and Kernel Intermediate Representation (SKIR)
- SKIR = LLVM + Streams/Kernels
- Stream language front-end compilers
- SKIR code optimizations for stream parallelism
- Dynamic scheduler for shared-memory x86
- LLVM JIT back-end for CPU
- LLVM-to-OpenCL back-end for GPU
What is Stream Parallelism?

- Regular, data-centric computation
- Independent processing stages with explicit communication
- Pipeline, data, and task parallelism

Examples:
- Digital signal processing
- Encoding/decoding
- Compression, cryptography
- Video, audio
- Network processing
- Real-time "Big Data" services

[Figure: a stream graph of stateful and data-parallel kernels connected by streams through split/scatter and join/gather nodes, with pipeline, data, and task parallelism arranged along a granularity axis from fine to coarse.]
Formal Models of Stream Parallelism
Kahn Process Networks (KPN) [1]
- computation as a graph of independent processes
- communication over unidirectional FIFO channels
- block on read from an empty channel; never block on write
- deterministic kernels ⇒ deterministic network
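The KPN channel discipline can be sketched in plain C. This is a hypothetical ring-buffer channel of our own, not SKIR's API: writes always succeed, while a read from an empty channel reports failure so a runtime could suspend the reading process. (A true KPN channel is unbounded; the fixed-size array is a simplification of the sketch.)

```c
#include <stddef.h>

/* One unidirectional FIFO channel carrying ints (names are ours). */
typedef struct {
    int    data[1024];   /* backing store for the channel */
    size_t head, tail;   /* head: next read slot; tail: next write slot */
} channel_t;

static void chan_init(channel_t *c) { c->head = c->tail = 0; }

static int chan_empty(const channel_t *c) { return c->head == c->tail; }

/* Writer side: in a KPN, a write never blocks. */
static void chan_write(channel_t *c, int v) {
    c->data[c->tail++ % 1024] = v;
}

/* Reader side: returns 0 (reads nothing) when the channel is empty;
   a real runtime would block/suspend the reading process here. */
static int chan_read(channel_t *c, int *out) {
    if (chan_empty(c)) return 0;   /* caller must block or yield */
    *out = c->data[c->head++ % 1024];
    return 1;
}
```

Because reads block on empty channels and kernels are deterministic, the order in which a scheduler interleaves kernels cannot change the values flowing through the channels.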
Synchronous Data Flow Networks (SDF) [2]
- restriction of KPN where kernels have fixed I/O rates
- allows better compiler analysis
- allows static scheduling techniques
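The fixed-rate restriction is exactly what enables static scheduling: for a single edge where the producer pushes p tokens per firing and the consumer pops q, the smallest balanced firing counts follow from the gcd. A minimal sketch (function names are ours, not from any SDF tool):

```c
/* Greatest common divisor, iterative Euclid. */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Smallest repetition counts that leave the channel balanced:
   p * fire_prod == q * fire_cons tokens per schedule period. */
static void sdf_repetitions(unsigned p, unsigned q,
                            unsigned *fire_prod, unsigned *fire_cons) {
    unsigned g = gcd_u(p, q);
    *fire_prod = q / g;
    *fire_cons = p / g;
}
```

For example, a producer pushing 2 tokens feeding a consumer popping 3 yields the periodic schedule "fire the producer 3 times, the consumer 2 times", after which the channel holds exactly as many tokens as it started with.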
[1] G. Kahn, "The semantics of a simple language for parallel programming," Information Processing (1974), 471–475.
[2] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Transactions on Computers 36 (1987), no. 1, 24–35.

Why Stream Parallelism? It can target increasingly diverse parallel hardware
- Intel Knights Corner: 50 cores
- AMD Bulldozer: shared FPU
- AMD Llano: on-die GPU
- Intel Sandy Bridge: shared L3 between Gfx and CPU
- NVIDIA Tegra 3: asymmetric multicore
Why Stream Parallelism? It supports a variety of data-centric applications
Audio processing, image processing, compression, encryption, data mining, 2-D and 3-D graphics, physics simulations, financial applications, network processing, computational biology, game physics, software radio, Twitter parsing, marmot detection, and many more...
Why SKIR? Embedded domain-specific parallelism is useful
- Embed domain-specific knowledge into the language, compiler, or runtime system
- Parallel programming model tailored to your problem
- Higher level of abstraction ⇒ higher productivity
- Restricted programming model ⇒ higher performance
- Your favorite language, with better parallelism

Examples:
- PLINQ: optimizable embedded query language
- ArBB: vector computation on parallel arrays
- CUDA: memory and execution models for GPUs
- SKIR: language-independent stream parallelism

[Figure: application source code (algorithm + parallel programming model) flows through a standard compiler and a domain-specific compiler with an offline runtime down to parallel hardware.]
SKIR: Overview
Organized as a JIT compiler:
- for performance portability
- for dynamic program graphs
- for dynamic optimization

SKIR intrinsics for LLVM:
- stream communication
- static or dynamic stream graph manipulation

Targets parallel and heterogeneous hardware: CPU and GPU now; FPGA and others as future work.
SKIR Example

SKIR pseudo-code: construct and execute a 4-stage pipeline (PRODUCER → ADDER → ADDER → CONSUMER).

    bool PRODUCER(int *state, void *ins[], void *outs[]) {
        *state = *state + 1
        skir.push(0, state)
        return false
    }

    bool ADDER(int *state, void *ins[], void *outs[]) {
        int data
        skir.pop(0, &data)
        data += *state
        skir.push(0, &data)
        return false
    }

    bool CONSUMER(int *state, void *ins[], void *outs[]) {
        int data
        if (*state == 0) return true
        skir.pop(0, &data)
        print(data)
        *state = *state - 1
        return false
    }

    main() {
        int counter = 0
        int limit = 20
        int one = 1
        int neg = -1

        stream Q1[2], Q2[2], Q3[2]
        kernel K1, K2, K3, K4

        Q1[0] = skir.stream(sizeof(int)); Q1[1] = 0
        Q2[0] = skir.stream(sizeof(int)); Q2[1] = 0
        Q3[0] = skir.stream(sizeof(int)); Q3[1] = 0

        K1 = skir.kernel(PRODUCER, &counter)
        K2 = skir.kernel(ADDER, &one)
        K3 = skir.kernel(ADDER, &neg)
        K4 = skir.kernel(CONSUMER, &limit)

        skir.call(K1, NULL, Q1)
        skir.call(K2, Q1, Q2)
        skir.call(K3, Q2, Q3)
        skir.call(K4, Q3, NULL)
        skir.wait(K4)
    }

SKIR as a compiler target: C
    int subtracter_work(void *state, skir_stream_ptr_t *ins, skir_stream_ptr_t *outs)
    {
        float f0;
        float f1;
        __SKIR_pop(0, &f0);
        __SKIR_pop(0, &f1);
        f0 = f1 - f0;
        __SKIR_push(0, &f0);
        return 0;
    }

    skir_stream_ptr_t build_band_pass_filter(skir_stream_ptr_t src, float rate,
                                             float low, float high, int taps)
    {
        skir_stream_ptr_t ins[2] = {0};
        skir_stream_ptr_t outs[2] = {0};
        src = build_bpf_core(src, rate, low, high, taps);
        skir_kernel_ptr_t sub = __SKIR_kernel((void*)subtracter_work, 0);
        ins[0] = src;
        outs[0] = __SKIR_stream(sizeof(float));
        __SKIR_call(sub, ins, outs);
        return outs[0];
    }

One-to-one mapping of SKIR operations to C intrinsics.
SKIR as a compiler target: C++
Stream parallelism is embedded deep inside the application:

    class CalculateForces {
        CalculateForces(float &softeningSquared)
            : m_softeningSquared(softeningSquared) { }

        void interaction(float *accel, int pos0, int pos1) {
            // compute acceleration
            ...
        }

        static int work(CalculateForces *me, StreamPtr ins[], StreamPtr outs[]) {
            ...
            float f = i/4;
            out.push(f);
            out.push(force[0]); out.push(force[1]); out.push(force[2]);
            ...
            return 0;
        }
    };

    update(...) {
        (*m_pairs)(nul0, s0);
        (*m_calc)(s0, s1);
        (*m_integ)(s1, nul1);
        m_integ->wait();
    }

    display() {
        ...
        nbody->update(∆t);
        renderer->setPositions(...);
        ...
    }

Example: N-Body simulation from the CUDA SDK.
SKIR as a compiler target: StreamIt
StreamIt: a stream language from MIT
- independent filters
- FIFO streams
- synchronous data flow
- fixed I/O rates
- fixed stream graph structure

[Figure: StreamIt source code passes through the StreamIt-to-SKIR front-end compiler (linked against a bitcode library) to produce SKIR bitcode, which runs offline or under the SKIR JIT & scheduler on parallel hardware.]
    float->float pipeline BandPassFilter(float rate, float low, float high, int taps) {
        add BPFCore(rate, low, high, taps);
        add Subtracter();
    }

    float->float splitjoin BPFCore(float rate, float low, float high, int taps) {
        split duplicate;
        add LowPassFilter(rate, low, taps, 0);
        add LowPassFilter(rate, high, taps, 0);
        join roundrobin;
    }

    float->float filter Subtracter {
        work pop 2 push 1 {
            push(peek(1) - peek(0));
            pop();
            pop();
        }
    }
SKIR as a compiler target: JavaScript
Sluice: SKIR-based acceleration of JavaScript. StreamIt-style stream parallelism for the node.js/V8 JavaScript environment.
Dynamic recompilation and type inference: JS → SKIR → LLVM.

    function Adder(arg) {
        this.a = arg;
        this.work = function() {
            var e = this.pop();
            e = e + this.a;
            this.push(e);
            return false;
        }
    }

    var a0 = new Adder(1);
    var a1 = new Adder(1);
    var a2 = new Adder(1);
    var sj = Sluice.SplitRR(1, a0, a1, a2).JoinRR(1);
    var p = Sluice.Pipeline(new Count(10), sj, new Printer());

    > p.run()
    1 2 3 4 5 6 7 8 9 10

[Figure: the pipeline runs as kernels K1–K3 over shared-memory input and output streams; one Adder replica is offloaded across a process boundary with a copy of its state.]
J. Fifield and D. Grunwald. A Methodology for Fine-Grained Parallelization of JavaScript Applications. LCPC 2011.

Compiling SKIR: Overview
- Kernel analysis
- Dynamic batching
- Coroutine elimination
- Kernel specialization
- Stream graph transforms: fission, fusion
- Compile for GPU

(goals: performance, portability, targeting diverse hardware)
Compiling SKIR: Kernel Analysis
Attempt to extract SDF semantics from arbitrary kernels:
- push, pop, peek rates
- data parallel vs. stateful

    int high_pass_filter_work(high_pass_filter_t *state, ...) {
        /* FIR filtering operation as convolution sum. */
        float sum = 0;
        for (int i=0; i<64; i++) {      /* peek rate: 64 */
            float f;
            __SKIR_peek(0, &f, i);
            sum += state->h[i]*f;       /* read-only state */
        }
        __SKIR_push(0, &sum);           /* push rate: 1 */
        float e;
        __SKIR_pop(0, &e);              /* pop rate: 1 */
        return 0;
    }

C version of a kernel from the StreamIt channel vocoder benchmark.

Leverage existing LLVM analyses: dominator trees, loop info, scalar evolution, def/use information, stack/alloca information.
Scheduling SKIR: Demand- and data-driven execution
Strategy: schedule kernels by following demand for data and buffer space.

1) Run a kernel until data or buffer space runs out
2) Locate the kernel causing the blockage
3) Switch to that blocking kernel

Advantages:
- doesn't require global task queues
- doesn't access global program structure
- attempts to preserve locality

Challenges:
- avoiding unnecessary execution
- making it fast
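The dispatch strategy above reduces to a small loop: run a kernel, and when it blocks, jump to whichever kernel must run to unblock it. A toy sketch of ours, where the `blocks_on` link stands in for the runtime's analysis of which kernel caused the blockage:

```c
#include <stddef.h>

/* Minimal kernel record: after running, a kernel either finished
   (blocks_on == NULL) or names the kernel that must run next to
   free up data or buffer space. */
typedef struct kernel {
    struct kernel *blocks_on;
} kernel_t;

/* Follow the demand chain; returns how many kernels were "run".
   A real scheduler would execute each kernel's work function here. */
static int demand_driven_schedule(kernel_t *k) {
    int ran = 0;
    while (k) {            /* 1) run kernel until it blocks        */
        ran++;
        k = k->blocks_on;  /* 2)+3) switch to the blocking kernel  */
    }
    return ran;
}
```

Because the next kernel is found by chasing a local link rather than consulting a global queue, the scheduler touches only the neighborhood of the blocked kernel, which is where the locality advantage comes from.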
Coroutine Scheduling: How SKIR creates demand-driven execution
During code generation, we transform kernels into coroutines by specializing stream communication. In general, we cannot know when a kernel will block until a stream push/pop expression is executing; this is especially true when kernels execute in parallel.

Original kernel:

    ...
    e = pop();
    if (e > N)
        push(e);
    ...

After coroutine transformation:

    ...
    while (input.is_empty())
        yield input.src
    e = input.read()
    if (e > N) {
        while (output.is_full())
            yield output.dst
        output.write(e)
    }
    ...

The runtime scheduler resumes active kernels at their yield points.
Scheduling SKIR: Obtaining parallel execution with task stealing
TBB+SKIR work-stealing algorithm. When a worker thread needs a kernel to run, it tries, in order:

(1) Run the blocking kernel. This rule does not apply if the task could not be spawned or stolen.
(2) Pop a kernel from the bottom of its own deque. This rule does not apply if the deque is empty.
(3) Steal a kernel from the top of another randomly chosen deque. If the chosen deque is empty, the thread tries this rule again until it succeeds.

As in the sequential case: 1) run a kernel until data or buffer space runs out; 2) locate the kernel causing the blockage; 3) steal or spawn the blocking kernel; 4) push the blocked kernel to the bottom of the deque.
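The per-thread deque discipline behind rules (2) and (3) can be sketched as follows. This is our own illustration, not the TBB implementation: the owner pushes and pops blocked kernels at the bottom (LIFO, rule 2), while thieves take from the top (rule 3); all synchronization is omitted.

```c
#define DEQ_CAP 64

/* Per-thread deque of runnable kernels (kernel ids for the sketch). */
typedef struct {
    int items[DEQ_CAP];
    int top, bottom;     /* entries live at indices top <= i < bottom */
} deque_t;

static void deq_init(deque_t *d) { d->top = d->bottom = 0; }

/* Owner pushes a blocked kernel at the bottom (rule 4). */
static void deq_push_bottom(deque_t *d, int k) {
    d->items[d->bottom++ % DEQ_CAP] = k;
}

/* Owner side (rule 2): LIFO pop; returns -1 when empty. */
static int deq_pop_bottom(deque_t *d) {
    if (d->top == d->bottom) return -1;
    return d->items[--d->bottom % DEQ_CAP];
}

/* Thief side (rule 3): take from the top; returns -1 when empty. */
static int deq_steal_top(deque_t *d) {
    if (d->top == d->bottom) return -1;
    return d->items[d->top++ % DEQ_CAP];
}
```

Owners working LIFO keeps recently blocked kernels (whose data is likely still in cache) on the same core, while thieves stealing the oldest entries take the work least likely to benefit from that locality.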
Compiling SKIR: Dynamic Batching
Without batching, the scheduler pays a call and a return/yield for every firing, which is high overhead for small kernels:

    call → work {
               skir.pop(...)
               ...
               skir.push(...)
           } → return / yield

With dynamic batching, the work function runs as long as data and buffer space are available:

    call → work {
               while (1) {
                   skir.pop(...)
                   ...
                   skir.push(...)
               }
           } → return / yield
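The batched loop can be sketched concretely. This sketch is ours: the kernel's rates and queue shapes are hypothetical (pop 1, push 1 per firing), and `fire_once` stands in for the real work function.

```c
/* Sketch of a kernel instance with its stream occupancy counters. */
typedef struct {
    int in_avail;    /* tokens waiting on the input stream  */
    int out_space;   /* free slots on the output stream     */
    int fired;       /* number of work-function invocations */
} batch_kernel_t;

/* One firing: pop 1 token, push 1 token (fixed rates assumed). */
static void fire_once(batch_kernel_t *k) {
    k->in_avail--;
    k->out_space--;
    k->fired++;
}

/* Batched entry point: amortize call/yield overhead over many firings
   by looping until data or buffer space runs out. */
static void run_batched(batch_kernel_t *k) {
    while (k->in_avail >= 1 && k->out_space >= 1)
        fire_once(k);
    /* only now return/yield to the scheduler */
}
```

One scheduler dispatch now covers min(in_avail, out_space) firings instead of one, which is where the win for small kernels comes from.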
Compiling SKIR: Coroutine Elimination
The default coroutine code transformation is fine for coarse-grained kernels, but it has high overhead for fine-grained kernels: it is not good if "do_actual_work" is small. For kernels with fixed I/O rates we can be smarter.

Default transformation:

    work(...) {
        while (1) {
            while (input.is_empty())
                yield input.src
            e = input.read()
            do_actual_work
            while (output.is_full())
                yield output.dst
            output.write(e)
        }
    }

After coroutine elimination:

    work(...) {
        while (1) {
            n = niters(input, output)
            while (n) {
                e = input.read()
                do_actual_work
                output.write(e)
            }
        }
    }
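The `niters` computation is the key: with fixed I/O rates, the number of safe iterations is known before the loop starts, so the per-element empty/full checks and yields disappear. A sketch under our own naming (SKIR's actual runtime helper may differ):

```c
/* How many firings are safe, given current stream occupancy and the
   kernel's fixed rates? The answer is bounded by whichever side runs
   out first. */
static int niters(int in_avail, int pop_rate, int out_space, int push_rate) {
    int by_input  = in_avail  / pop_rate;   /* firings the input can feed  */
    int by_output = out_space / push_rate;  /* firings the output can hold */
    return by_input < by_output ? by_input : by_output;
}
```

This is exactly the analysis that kernel analysis (earlier) enables: without known pop/push rates, `niters` cannot be computed and the yield-per-element coroutine form must be kept.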
Impact of Coroutine Elimination

Disassembly of the compiled scheduling loop for a trivial kernel:

    int->int filter adder {
        work {
            push(pop() + pop());
        }
    }

Without elimination, each firing executes stream reads and writes, read and write address calculation, profiling, and scheduler input handling around the actual kernel work (the add). With elimination, the loop first calculates the number of safe iterations and then runs a much tighter body containing little more than the reads, the add, and the write.
Compiling SKIR: Kernel Fusion
Kernel fusion merges adjacent kernels into one. Original graph:

    work(...) {                work(...) {
        pop(0, a)                  pop(0, e)
        pop(1, b)                  ...
        ...                        push(0, f)
        push(0, c)             }
        push(0, d)
    }

After renumbering, the producer's pushes and the consumer's pops refer to the shared internal edge (push(M, c), push(M, d); pop(N, e0), pop(N, e1)). After inlining, those become plain stores and loads in the final fused kernel:

    work(...) {
        pop(0, a)
        pop(1, b)
        ...
        store c
        store d
        load c
        ...
        push(0, f0)
        load d
        ...
        push(0, f1)
    }

Developed a fusion algorithm for SKIR; dynamic fusion shows performance benefits.
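What fusion buys can be seen in a concrete sketch: the internal channel between the two fused kernels degenerates into local variables, with no queue and no scheduler involvement. The arithmetic below is hypothetical (our stand-in for the elided kernel bodies); only the store/load structure mirrors the fused kernel above.

```c
/* Fused form of a producer that pushed two values (c, d) per firing and
   a consumer that popped each and emitted a result. The former channel
   traffic is now just local variables c and d. */
static void fused_work(int x, int *f0, int *f1) {
    int c = x + 1;   /* was: push(M, c) in the producer  */
    int d = x * 2;   /* was: push(M, d) in the producer  */
    *f0 = c - 1;     /* was: pop(N, e0); push(0, f0)     */
    *f1 = d - 1;     /* was: pop(N, e1); push(0, f1)     */
}
```

Besides removing queue overhead, exposing both bodies in one function lets ordinary LLVM scalar optimizations (here, folding `x + 1 - 1` back to `x`) work across the former kernel boundary.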
Compiling SKIR: Kernel Fission
Before fission, a single kernel K processes the whole stream. After fission, a splitter distributes elements across replicas K0, K1, K2, K3, and a joiner merges their outputs:

    splitter → K0 | K1 | K2 | K3 → joiner

- Kernel fission is easy to implement for SKIR
- Automatic fission by the SKIR runtime
- Manual fission by the programmer or language
- One of many methods to exploit data parallelism
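For a stateless kernel, round-robin splitting and joining preserve the original output order, which is what makes fission transparent. A small index-arithmetic sketch of ours (not SKIR's splitter/joiner implementation):

```c
#define NWAY 4   /* number of kernel replicas */

/* Splitter: input element i goes to replica i % NWAY. */
static int split_rr(int index) { return index % NWAY; }

/* Joiner: replica r's j-th result belongs at global position
   j*NWAY + r, restoring the original stream order. */
static int join_rr(int replica, int j) { return j * NWAY + replica; }
```

Since element i becomes the (i / NWAY)-th element seen by replica i % NWAY, `join_rr(split_rr(i), i / NWAY)` maps back to i, so the fissed graph emits exactly the sequence the single kernel would have.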
Compiling SKIR: OpenCL Backend
Transparent execution on the GPU via OpenCL:
- A modified version of the LLVM C backend emits OpenCL kernels
- Works for any data-parallel kernel with decidable state
- A data-parallel kernel K2 is replicated across OpenCL work items (one per iteration), while K1 and K3 stay on the CPU
- Input and output streams are staged through OpenCL buffers; read-only state is shared, and mutable state is copied per replica

[Figure: K1 (CPU) feeds an input stream buffer copied to an OpenCL buffer; replicated K2 instances run on the GPU under OpenCL API control; results flow back through an OpenCL output buffer to K3 (CPU).]
Summary
- Optimized stream parallelism using LLVM
- Dynamic scheduling
- Performance: good!

Future work:
- Use for ongoing network & signal processing research
- Better GPU support
- Vectorization
- Open source soon
Contact:
[email protected]
http://systems.cs.colorado.edu