Customizable Domain-Specific Computing

Jason Cong Center for Domain-Specific Computing UCLA Computer Science Department [email protected] http://cadlab.cs.ucla.edu/~cong


The Power Barrier …

Source: Shekhar Borkar, Intel

Focus: New Transformative Approach to Power/Energy-Efficient Computing

Current Solution: Parallelization


Source: Shekhar Borkar, Intel

Cost and Energy are Still a Big Issue …

Cost of computing:
• HW acquisition
• Energy bill
• Heat removal
• Space
• …

Next Significant Opportunity -- Customization

Parallelization → Customization: adapt the architecture to the application domain.

Source: Shekhar Borkar, Intel

Motivation

A few facts:
• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only selected tasks in its domain
• Application-specific integrated circuits (ASICs) can deliver 10,000X+ better power/performance efficiency, but are too expensive to design and manufacture

Our proposal:
• A general, customizable platform for the given domain(s)
  - Can be customized to a wide range of applications in the domain
  - Can be mass-produced with cost efficiency
  - Can be programmed efficiently with novel compilation and runtime systems
Goal:
• A "supercomputer-in-a-box" with 100X performance/power improvement via customization for the intended domain(s)
Analogy:
• The advance of civilization via specialization/customization

Example Application Domain: Healthcare

Medical imaging has transformed healthcare:
• An in vivo method for understanding disease development and patient condition
• Estimated at $100 billion/year
• More powerful and efficient computation can help:
  - Fewer exposures using compressive sensing
  - Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms

Hemodynamic simulation:
• Very useful for surgical procedures involving blood flow and vasculature

[Figures: magnetic resonance (MR) angiograph of an aneurysm; intracranial aneurysm reconstruction with hemodynamics]

Both may take hours to days to construct:
• Clinical requirement: 1-2 min
• Cloud computing won't work: communication, real-time requirements, privacy
• A megawatt datacenter for each hospital?


Medical Image Processing Pipeline

Medical images exhibit sparsity and can be sampled at a rate far below the classical Shannon-Nyquist rate. Compressive sensing reconstruction:

$\min_u \sum_{\text{sampled points}} |ARu - S|^2 + \lambda \sum_{\forall \text{voxels}} |\mathrm{grad}(u)|$

Total variational denoising algorithm: for each voxel $i$,

$u(i) = \frac{1}{Z(i)} \sum_{j \in \text{volume}} w_{i,j}\, f(j), \qquad w_{i,j} = e^{-\frac{1}{h^2}\left(\frac{1}{S}\sum_{k=1}^{S}(y_k - z_k)^2 - 2\sigma^2\right)}$

(a minimal code sketch of this stage follows)
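Below is a minimal C++ sketch of this stage for a flattened volume. This is our illustration, not code from the slides: the S-term patch distance is collapsed to a single-voxel difference, and the function name and parameters are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal sketch of the denoising stage above for a flattened 3-D volume:
// every output voxel u(i) is a normalized, weighted average of all input
// voxels f(j). The patch distance is simplified to a single-voxel
// difference; sigma (noise level) and h (decay) are illustrative.
std::vector<float> denoise(const std::vector<float>& f, double sigma, double h) {
    const std::size_t n = f.size();
    std::vector<float> u(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        double Z = 0.0, acc = 0.0;                 // Z(i) and sum of w_ij f(j)
        for (std::size_t j = 0; j < n; ++j) {
            const double d2 = (f[i] - f[j]) * (f[i] - f[j]);
            const double w =
                std::exp(-std::max(d2 - 2.0 * sigma * sigma, 0.0) / (h * h));
            Z += w;
            acc += w * f[j];
        }
        u[i] = static_cast<float>(acc / Z);        // u(i) = (1/Z(i)) * sum
    }
    return u;
}
```

The all-pairs inner loop makes the stage quadratic in the number of voxels, which is why denoising at clinical resolutions is compute-bound and a natural acceleration target.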

Fluid registration:

$v = \frac{\partial u}{\partial t} + v \cdot \nabla u, \qquad \mu \Delta v + (\mu + \eta)\,\nabla(\nabla \cdot v) = -\left[T(x-u) - R(x)\right]\nabla T(x-u)$

Level set methods (segmentation):

$\frac{\partial \varphi}{\partial t} = |\nabla \varphi|\left[F(\mathrm{data}, \varphi) + \lambda\, \mathrm{div}\!\left(\frac{\nabla \varphi}{|\nabla \varphi|}\right)\right], \qquad \mathrm{surface}(t) = \{\text{voxels } x : \varphi(x,t) = 0\}$

Navier-Stokes equations (analysis):

$\frac{\partial v}{\partial t} + (v \cdot \nabla)v = -\nabla p + \upsilon \Delta v + f(x,t), \qquad \frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \upsilon \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x,t)$

Application Domains: Medical Image Processing Pipeline

The stages of the pipeline have diverse computation and communication patterns:

• Compressive sensing reconstruction: iterative, local or global communication; dense and sparse linear algebra, optimization methods
• Total variational denoising: non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
• Fluid registration: parallel, global communication; dense linear algebra, optimization methods
• Level set segmentation: local communication; dense linear algebra, spectral methods, MapReduce
• Navier-Stokes analysis: local communication; sparse linear algebra, n-body methods, graphical models

A single homogeneous system cannot perform very well on all of these algorithms.

Need for Customization in the Medical Image Processing Pipeline

These algorithms have diverse computation and communication patterns, and a single, homogeneous system cannot perform well on all of them. We need architecture customization and hardware-software co-optimization. The pipeline includes many common computation kernels ("motifs") and the approach is therefore applicable to other domains.

Bi-harmonic registration (using the same algorithm on all platforms):

Platform | Speedup | Power
CPU (Xeon 2.0 GHz) | 1x | ~100 W
GPU (Tesla C1060) | 93x | ~150 W
FPGA (xc4vlx100) | 11x | ~5 W

3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels):

Platform | Algorithm | Speedup | Power
CPU (Xeon 2.0 GHz) | Quick select | 1x | ~100 W
GPU (Tesla C1060) | Median of medians | 70x | ~140 W
FPGA (xc4vlx100) | Bit-by-bit majority voting | 1200x | ~3 W

(a sketch of the bit-by-bit majority-voting median follows)
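To make the FPGA entry concrete, here is a hedged C++ sketch of a bit-by-bit majority-voting median over the 27 neighboring voxels. The slides do not give the implementation, so the 8-bit voxel width and all names are our assumptions. Each bit of the median is resolved by a popcount, which maps to shallow combinational logic in hardware.

```cpp
#include <cstdint>

// Bit-serial majority-voting median of 27 8-bit values (hardware-friendly
// sketch). The median is resolved one bit at a time, MSB first: its bit at
// each position is the majority vote; a sample that diverges from the
// median's bit prefix is latched to a constant vote for all lower bits.
static uint8_t bitSerialMedian27(const uint8_t v[27]) {
    bool    live[27];      // still matches the median's bit prefix
    uint8_t latched[27];   // vote contributed once a sample has diverged
    for (int i = 0; i < 27; ++i) { live[i] = true; latched[i] = 0; }

    uint8_t median = 0;
    for (int b = 7; b >= 0; --b) {
        int ones = 0;
        for (int i = 0; i < 27; ++i)
            ones += live[i] ? ((v[i] >> b) & 1) : latched[i];
        const uint8_t maj = (ones >= 14) ? 1 : 0;   // majority of 27 votes
        median |= maj << b;
        for (int i = 0; i < 27; ++i) {
            const uint8_t bit = (v[i] >> b) & 1;
            if (live[i] && bit != maj) {   // sample leaves the median prefix
                live[i] = false;
                latched[i] = bit;          // it now always votes on its side
            }
        }
    }
    return median;
}
```

Because each bit needs only a 27-input popcount and comparison, all eight bit-steps pipeline naturally, which is one plausible reason the FPGA version reaches 1200x at ~3 W.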

Center for Domain-Specific Computing (CDSC) Organization

• A diversified and highly accomplished faculty team: 8 in CS&E, 1 in EE, 2 in the medical school, 1 in applied math
• 15-20 postdocs and graduate students across four universities: UCLA, Rice, Ohio State, and UC Santa Barbara

Faculty: Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director, UCLA), Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA)

Overview of the Proposed Research

Customizable Heterogeneous Platform (CHP)

[Figure: the CHP combines fixed cores, custom cores, and programmable fabric with caches ($), DRAM, and I/O, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceivers/receivers, optical interfaces); multiple CHPs connect to DRAM and to each other.]

Research flow: domain-specific modeling (healthcare applications) feeds application characterization and domain characterization into architecture modeling. CHP creation provides customizable computing engines and customizable interconnects; CHP mapping provides a source-to-source CHP mapper, a reconfiguring & optimizing backend, and an adaptive runtime.

Design once, invoke many times.

CHP Creation – Design Space Exploration

Core parameters:
• Frequency & voltage
• Datapath bit width
• Instruction window size
• Issue width
• Cache size & configuration
• Register file organization
• # of thread contexts
• …

NoC parameters:
• Interconnect topology
• # of virtual channels
• Routing policy
• Link bandwidth
• Router pipeline depth
• Number of RF-I enabled routers
• RF-I channel and bandwidth allocation
• …

Custom instructions & accelerators:
• Amount of programmable fabric
• Shared vs. private accelerators
• Custom instruction selection
• Choice of accelerators
• …

Key questions: What is the optimal trade-off between efficiency and customizability? Which options should be fixed at CHP creation, and which should be set by the CHP mapper?

Customization for Cores

Example of the core customization space: ROB size, instruction queue size, register file size, number and type of FUs, branch predictor, BTB size and complexity, LSQ size, cache sizes, cache associativity, memory latency.


Existing Studies on Core Customization (not domain-specific)

Reference | Feature | Impact
[Folegnani & Gonzalez, ISCA 2001] | Issue logic and issue queue (43/58) | 16% total energy saving
[Ponomarev et al., MICRO 2001] | Instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) | 59% power saving for the three components
[Hughes et al., MICRO 2001] | Issue width (8,4,2), issue queue (128,64,32), function units (4,2) | Up to 78% total energy saving with combined DVS and architectural adaptation
[Yeh et al., MICRO 2007] | Reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 cores per FPU), eliminating trivial FP operations, lookup table | Up to 50% power reduction and 55% performance improvement
[Cong et al., Trans. on PDS 2007] | Core spilling: spill from 1 core up to 8 cores | Less than 50% worse than an ideal 8x more powerful core; up to 40% improvement for changing workloads
[Ipek et al., ISCA 2007] | Core fusion: 2-issue cores fused to emulate 4- and 6-issue cores | Less than 30% (sequential) and 20% (parallel) worse, respectively
[Mai et al., ISCA 2007] | Memory system: streaming register files or cache hierarchy; communication: broadcast or routed; processor: SIMD or RISC superscalar | Only 2x worse than a domain-optimized system
[Lee and Brooks, ASPLOS 2008] | Issue queue, issue width, branch predictor, LSQ, ROB, registers; I-L1, D-L1, L2 cache size and latency; memory latency; temporal sensitivity | 1.6X performance gain and 0.8X power: a 5.1x efficiency improvement

Energy-Effective Issue Logic [Folegnani & Gonzalez, ISCA'01]

Inefficiencies of conventional instruction issue logic and the issue queue (IQ):
A) Energy waste from empty entries and ready operands
B) The effectively used IQ varies across different applications
C) The effectively used IQ varies across different periods of one application

Adaptation of Multiple Resources (cont'd) [Ponomarev et al., MICRO 2001]

• Dynamically adapt through multi-partitioned resources:
  - Instruction queue (IQ): avg 17, max 32
  - Reorder buffer (ROB): avg 57, max 128
  - Load/store queue (LSQ): avg 18, max 32
• The three resources are independently adjusted at run time:
  - Downsize a resource based on sampling statistics of its effective usage history
  - Upsize a resource based on its resource-miss record (see the sketch below)
• Total power saving for the three resized components: 59%
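A hedged C++ sketch of such a periodic resizing policy; all constants, names, and the exact conditions are illustrative, not the paper's values, but the mechanism (downsize on low sampled usage, upsize on resource misses) follows the description above.

```cpp
// Sketch of the periodic resizing control loop: shrink a partitioned queue
// when a whole partition goes unused, grow it when allocation failures
// ("resource misses") accumulate within the adaptation interval.
struct AdaptiveResource {
    static constexpr int kPartition = 8;       // entries per partition
    static constexpr int kMissThreshold = 4;   // upsize trigger (illustrative)

    int activePartitions;                      // current size, in partitions
    int maxPartitions;
    long occupancySum = 0;                     // sampled effective usage
    long samples = 0;
    int misses = 0;                            // failed allocations

    void sample(int entriesInUse) { occupancySum += entriesInUse; ++samples; }
    void onAllocationFail()       { ++misses; }

    void adapt() {                             // called at each interval end
        const long avg = samples ? occupancySum / samples : 0;
        if (misses >= kMissThreshold && activePartitions < maxPartitions)
            ++activePartitions;                // upsize on resource misses
        else if (avg <= (activePartitions - 1) * kPartition &&
                 activePartitions > 1)
            --activePartitions;                // downsize on low usage
        occupancySum = 0; samples = 0; misses = 0;
    }
};
```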

Architectural and Frequency Adaptations for Multimedia Applications [Hughes et al., MICRO 2001]

• Dynamic adaptation:
  - Architecture: issue width & issue queue; # of function units
  - Dynamic voltage scaling (DVS): continuous DVS (CDVS) and discrete DVS (DDVS)
• Adaptation method:
  - Initial profiling: a multimedia application has similar performance and power statistics for the same frame type
  - Dynamic adaptation: choose the optimal configuration for each frame type by table lookup over the history statistics
• Energy saving:
  - DDVS alone: 73%; Arch alone: 22%; CDVS alone: 75%; Arch + DDVS: 77%; Arch + CDVS: 78%


Architectural and Frequency Adaptations for Multimedia Applications (cont’d)

Important conclusions:
• DVS provides most of the energy reduction
• Architectural adaptation further reduces energy when layered on top of DVS
• Without DVS, less aggressive architectures are more energy-efficient
• With DVS, more aggressive architectures are often more energy-efficient: their higher IPC means they can run at a lower frequency to save energy
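A first-order model (our sketch, ignoring static power and the aggressive core's extra switched capacitance) makes the last point concrete. Dynamic power scales as $P \propto C V^2 f$, and under DVS the voltage scales roughly with frequency, so $P \propto f^3$. Executing $W$ instructions then costs

$E = P \cdot t \propto f^3 \cdot \frac{W}{\mathrm{IPC} \cdot f} = \frac{W f^2}{\mathrm{IPC}}$

and meeting a frame deadline $T$ requires $f = \frac{W}{\mathrm{IPC} \cdot T}$, giving $E \propto \frac{W^3}{\mathrm{IPC}^3 T^2}$: a higher-IPC architecture can run at a lower frequency and voltage, which is why aggressive cores win once DVS is available.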

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

• Examines two main questions:
  - Spatial adaptivity: which parameters to tune?
  - Temporal adaptivity: how often to tune?

• Studies the effects of tuning 15 parameters at different adaptation intervals


Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]

Architectural parameters studied: ROB size, instruction queue size, register file size, number and type of FUs, branch predictor, BTB size and complexity, LSQ size, cache sizes, cache associativity, memory latency.

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

• Key findings:
  - Up to 5.3x improvement in efficiency through adaptation
  - Relatively frequent adaptation (80K-instruction intervals) is needed to achieve maximum efficiency


Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]

• Key findings:
  - On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain; however, which 3 parameters depends on the application and phase
  - DVFS provides relatively smaller efficiency benefits once architecture adaptation is applied

Existing Studies on Core Customization (recap)

Recap of the core-customization study table shown earlier, from Folegnani & Gonzalez [ISCA 2001] through Lee and Brooks [ASPLOS 2008].

CHP Creation – Design Space Exploration (recap)

Recap of the CHP design-space parameters listed earlier (core, NoC, and custom instruction/accelerator parameters) and the key questions: the trade-off between efficiency and customizability, and which options to fix at CHP creation versus set by the CHP mapper.

Customization of Programmable Fabrics

• FPGA-based acceleration has shown a lot of promise:
  - Many applications in bio-informatics, financial engineering, image processing, scientific computing, …
  - Many publications in FCCM, FPGA, FPL, FPT, …

• Two significant barriers:
  - Communication between the CPU and the FPGA accelerator: the overhead of using a peripheral bus is too high
  - Automatic compilation: real programmers do not use VHDL/Verilog

• But a lot of encouraging progress has been made recently


Customization of Programmable Fabrics (cont'd)

• Recent enablers:
  - Communication between the CPU and the FPGA accelerator: high-speed connections (HyperTransport bus, FSB, QPI, …) and on-chip integration
  - Automatic compilation: maturing of C/C++-to-RTL synthesis tools

Acceleration of Lithographic Simulation [FPGA'08]

Imaging equation:

$I(x,y) = \sum_{\kappa} \lambda_{\kappa} \left| \sum \tau \left[ \psi_{\kappa}(x - x_1, y - y_1) - \psi_{\kappa}(x - x_2, y - y_1) + \psi_{\kappa}(x - x_2, y - y_2) - \psi_{\kappa}(x - x_1, y - y_2) \right] \right|^2$

• Lithography simulation:
  - Simulates the optical imaging process
  - Computationally intensive; very slow for full-chip simulation
  - Flow: algorithm in C, synthesized through the AutoPilot(TM) synthesis tool

• 15X+ performance improvement vs. an AMD Opteron 2.2GHz processor

• Close to 100X improvement in energy efficiency: 15W for the FPGA compared with 86W for the Opteron (see the arithmetic below)
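The energy-efficiency claim follows directly from the speedup and power numbers: $\frac{E_{\mathrm{CPU}}}{E_{\mathrm{FPGA}}} = \frac{P_{\mathrm{CPU}} \cdot T_{\mathrm{CPU}}}{P_{\mathrm{FPGA}} \cdot T_{\mathrm{FPGA}}} \approx \frac{86\,\mathrm{W}}{15\,\mathrm{W}} \times 15 \approx 86$, i.e., close to 100X.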

• XtremeData X1000 development system (AMD Opteron + Altera Stratix II EP2S180)

xPilot: Behavioral-to-RTL Synthesis Flow

Input: a behavioral specification in C/C++/SystemC plus a platform description, consumed by the frontend compiler and held in an intermediate representation (SSDM).

• Advanced transformations/optimizations:
  - Loop unrolling/shifting/pipelining (see the sketch below)
  - Strength reduction / tree-height reduction
  - Bitwidth analysis
  - Memory analysis, …
• Core behavior-synthesis optimizations:
  - Scheduling
  - Resource binding, e.g., functional-unit binding and register/port binding
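To make two of these transformations concrete, here is a plain source-to-source C++ illustration (our sketch, not xPilot syntax) of 4x loop unrolling combined with tree-height reduction of the accumulation:

```cpp
// Illustration of two transformations named above (not xPilot syntax):
// 4x loop unrolling exposes parallel multiplies, and accumulating into
// four partial sums reduces the final addition from a serial chain to a
// balanced tree that schedules in fewer cycles.
float dotProduct(const float* a, const float* b, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = 0;
    for (; i + 3 < n; i += 4) {            // unrolled by 4
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];  // remainder loop
    return (s0 + s1) + (s2 + s3);          // balanced reduction tree
}
```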

• μArch generation & RTL/constraints generation:
  - Verilog/VHDL/SystemC output (RTL + constraints)
  - FPGAs: Altera, Xilinx
  - ASICs: Magma, Synopsys, …

Some Recent Studies -- Efficient Identification of Approximate Patterns [Cong & Wei, FPGA'08]

• Programs may contain many recurring patterns
• Prior work can only identify exact patterns
• We can efficiently identify "approximate" patterns in large programs, covering structure variation, bitwidth variation, and ports variation:
  - Based on the concept of editing distance
  - Uses data-mining techniques
  - Efficient subgraph enumeration and pruning
• Highly scalable: can handle programs with 100,000+ lines of code
• Applications:
  - Behavioral synthesis: 20+% area reduction due to sharing of approximate patterns
  - ASIP synthesis: identify & extract customized instructions

Some Recent Studies -- Automatic Memory Partitioning

• To appear in ICCAD 2009
• The memory system is critical for high-performance, low-power design:
  - The memory bottleneck limits the maximum parallelism
  - The memory system accounts for a significant portion of total power consumption
• Goal: given platform information (memory ports, power, etc.), a behavioral specification, and throughput constraints:
  - Partition memories automatically
  - Meet the throughput constraints
  - Minimize power consumption

Example from the slide figure: (a) the C code `for (int i = 0; i < n; i++) … = A[i] + A[i+1];` (b) its schedule needs A[i] and A[i+1] in the same cycle; (c) after partitioning, banks A[0, 2, 4, …] and A[1, 3, 5, …] behind a decoder (with registers R1, R2) serve both accesses in parallel (a code sketch follows).
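A minimal C++ sketch of the even/odd partitioning from the figure; the array names, sizes, and the two-bank choice are illustrative:

```cpp
// Cyclic (modulo-2) partitioning of A into even/odd banks: A[i] and A[i+1]
// always fall into different banks, so both reads can happen in the same
// cycle from two single-port memories.
constexpr int N = 1024;

void kernelPartitioned(const int A_even[N / 2 + 1], const int A_odd[N / 2],
                       int out[N - 1]) {
    for (int i = 0; i < N - 1; ++i) {
        // one of the two reads hits the even bank, the other the odd bank
        const int ai  = (i % 2 == 0) ? A_even[i / 2] : A_odd[i / 2];
        const int ai1 = ((i + 1) % 2 == 0) ? A_even[(i + 1) / 2]
                                           : A_odd[(i + 1) / 2];
        out[i] = ai + ai1;
    }
}
```

With the conflict removed, the loop can be pipelined at an initiation interval of 1, matching the partitioned II of 1 in the results table below.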

Automatic Memory Partitioning (AMP)

• Techniques:
  - Capture array-access conflicts in a conflict graph for throughput optimization
  - Model the loop kernel in parametric polytopes to obtain array access frequencies
• Contributions:
  - Automatic approach to design space exploration
  - Cycle-accurate
  - Handles irregular array accesses
  - Lightweight profiling for power optimization

Flow: memory platform information and the loop nest feed array-subscript analysis; throughput optimization generates partition candidates and, for each candidate C_i, minimizes accesses on each bank; once the port limitation is met, power optimization and then loop pipelining and scheduling produce the pipelined result; otherwise the next candidate is tried.

Automatic Memory Partitioning (AMP)

• About 6x throughput improvement on average, with 45% area overhead
• In addition, power optimization further reduces power by about 30% after throughput optimization

Benchmark | Original II | Partitioned II | Original SLICEs | Partitioned SLICEs | Area ratio | Power reduction
fir | 3 | 1 | 241 | 510 | 2.12 | 26.82%
idct | 4 | 1 | 354 | 359 | 1.01 | 44.23%
litho | 16 | 1 | 1220 | 2066 | 1.69 | 31.58%
matmul | 4 | 1 | 211 | 406 | 1.92 | 77.64%
motionEst | 5 | 1 | 832 | 961 | 1.16 | 10.53%
palindrome | 2 | 1 | 84 | 65 | 0.77 | 0.00%
avg | 5.67x II improvement | | | | 1.45 | 31.80%

(II = initiation interval)

AutoPilot Compilation Tool (based on the UCLA xPilot system)

• Platform-based C-to-FPGA synthesis
• ESL synthesis: synthesizes pure ANSI C and C++ with a GCC-compatible compilation flow
• Full support of IEEE-754 floating-point data types & operations
• Efficiently handles bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality of results

Flow: a design specification (C/C++/SystemC with user constraints and a common testbench) goes through compilation & elaboration, presynthesis optimizations, and behavioral & communication synthesis and optimizations, guided by a platform characterization library and timing/power/layout constraints; the output is RTL HDLs and RTL SystemC targeting an FPGA co-processor, with simulation, verification, and prototyping against the common testbench.

Some Other Uses of AutoPilot (Microsoft)

• On John Cooley's DeepChip, 6/30/09: http://www.deepchip.com/items/0482-06.html

• "We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements… 1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines… 2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…"

CHP Creation – Design Space Exploration (recap)

Recap of the CHP design-space parameters listed earlier and the key questions about efficiency versus customizability.

Current On-Chip Interconnect Technology

• Optimized RC lines with repeaters:
  - Wire sizing, buffer insertion, buffer sizing, …
  - E.g., the UCLA Tio and IPEM packages

• Reconfigurable interconnects:
  - For FPGAs: RC busses with pass-transistors or bi-directional buffers
  - For CMPs (chip multiprocessors): mesh-like networks-on-chip (NoC)
  - Both pay a large performance penalty

38 Used vs. Available Bandwidth in Modern CMOS

• At the 45nm CMOS technology node:
  - Data rate: 4 Gbit/s
  - The fT of 45nm CMOS can be as high as 240GHz
  - The baseband signal bandwidth is only about 4GHz
  - So roughly 98.4% of the available bandwidth is wasted (4 GHz / 240 GHz ≈ 1.7% utilized)
• Question: how can we take advantage of the full bandwidth of modern CMOS?

UCLA 90nm CMOS VCO at 324GHz [ISSCC 2008]

[Plot: output power Pout (dBm) vs. frequency, peaking at 323.5GHz.]

A CMOS VCO designed by Frank Chang's group at UCLA, fabricated in a 90nm process.

The CMOS voltage-controlled oscillator was measured with a subharmonic mixer driven by an 80 GHz synthesizer local oscillator. The mixing relation $f_{VCO} - 4 f_{LO} = f_{IF}$, i.e., $f_{VCO} - 4 \times 80\,\mathrm{GHz} = 3.5\,\mathrm{GHz}$, yields $f_{VCO} = 323.5\,\mathrm{GHz}$.

[Photo: on-wafer VCO test setup at JPL.]

*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb. 2008, San Francisco, CA.

Multiband RF-Interconnect

[Figure: signal spectrum with multiple frequency channels sharing one transmission medium.]

• In the TX, each mixer up-converts an individual baseband stream into a specific frequency band (channel)
• N different data streams (N = 6 in the example figure) can be transmitted simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In the RX, individual signals are down-converted by a mixer and recovered after a low-pass filter (a toy model follows)
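A toy discrete-time C++ model of the scheme; this is our sketch, and the carriers, sample rate, and crude moving-average low-pass filter are illustrative stand-ins for the chip's circuits:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

// TX side: each baseband stream is up-converted by its own carrier (one
// mixer per band) and summed onto the shared transmission line.
std::vector<double> transmit(const std::vector<std::vector<double>>& streams,
                             const std::vector<double>& carrierHz, double fs) {
    std::vector<double> line(streams.at(0).size(), 0.0);
    for (std::size_t c = 0; c < streams.size(); ++c)
        for (std::size_t n = 0; n < line.size(); ++n)
            line[n] += streams[c][n] * std::cos(2.0 * kPi * carrierHz[c] * n / fs);
    return line;                                   // shared medium
}

// RX side: mix with the same carrier to down-convert, then low-pass
// filter (here a crude moving average) to recover the baseband stream.
std::vector<double> receive(const std::vector<double>& line, double carrierHz,
                            double fs, int taps) {
    std::vector<double> mixed(line.size()), out(line.size(), 0.0);
    for (std::size_t n = 0; n < line.size(); ++n)
        mixed[n] = 2.0 * line[n] * std::cos(2.0 * kPi * carrierHz * n / fs);
    for (std::size_t n = 0; n + taps <= mixed.size(); ++n) {
        double acc = 0.0;
        for (int k = 0; k < taps; ++k) acc += mixed[n + k];
        out[n] = acc / taps;
    }
    return out;                                    // recovered stream
}
```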


Tri-band On-Chip RF-I Test Results

Process | IBM 90nm CMOS digital process
Channels | 3 total: 30GHz, 50GHz, and baseband
Data rate | RF bands: 4Gbps per channel; baseband: 2Gbps
Total data rate | 10Gbps
Bit error rate (all bands) | < 10^-9
Latency | 6 ps/mm
Energy per bit (RF) | 0.09* pJ/bit/mm
Energy per bit (baseband) | 0.125 pJ/bit/mm

*The VCO power (5mW) can be shared by all (many tens of) parallel RF-I links in the NoC, so it does not significantly burden an individual link.

[Figure: data output waveforms for the 30GHz, 50GHz, and baseband channels, and the output spectrum of the RF bands at 30GHz and 50GHz.]

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:
1. 32nm node; 30x repeaters, FO4 = 8ps, R_wire = 306 Ω/mm, C_wire = 315 fF/mm, wire pitch = 0.2µm, bus length = 2cm, f_bus = 1GHz, bus width = 96 bytes
2. Repeater area = 0.022mm²
3. Bus physical width = 160µm
4. In that width we can fit 13 transmission lines, each with 7 carriers carrying 8Gbps

Metric | RF-I | Repeated bus
# of wires | 13 | 448
Data rate per carrier (Gbit/s) | 8 | N/A
# of carriers | 7 | N/A
Data rate per wire (Gbit/s) | 56 | 1
Aggregate data rate (Gbit/s) | 728 | 768
Bus physical width (µm) | 160 | 160
Transceiver/repeater area (mm²) | 0.27 | 0.022
Power (mW) | 455 | 6144
Energy per bit (pJ/bit) | 0.63 | 8

Interconnect length = 2cm
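As a consistency check, the energy-per-bit rows follow from power divided by aggregate data rate: repeated bus, $6144\,\mathrm{mW} / 768\,\mathrm{Gbit/s} = 8\,\mathrm{pJ/bit}$; RF-I, $455\,\mathrm{mW} / 728\,\mathrm{Gbit/s} \approx 0.63\,\mathrm{pJ/bit}$.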

Architectural Impact Using RF-I

• High-bandwidth communication:
  - Data distribution across many-core topologies
  - Vital to keeping many-core designs active
• Low-latency communication:
  - Enables users to apply parallel computing to a broader range of applications through faster synchronization and communication
  - Faster cache-coherence protocols
• Reconfigurability:
  - Adapt the NoC topology/bandwidth to the needs of the individual application
• Power-efficient communication

Simple RF-I Topology

• Four NoC components connected by an RF-I transmission line bundle
• Tunable Tx/Rx's enable arbitrary topologies and arbitrary bandwidths
• One physical topology can be configured into many virtual topologies: bus, multicast, crossbar, fully connected, pipeline/ring

Mesh Overlaid with RF-I [HPCA’08]

• 10x10 mesh of pipelined routers
  - NoC runs at 2GHz
  - XY routing
• 64 4GHz 3-wide processor cores (aqua in the figure)
  - 8KB L1 data cache and 8KB L1 instruction cache each
• 32 L2 cache banks (pink), 256KB each, organized as a shared NUCA cache
• 4 main memory interfaces (green)
• RF-I transmission line bundle: the thick black line spanning the mesh

RF-I Logical Organization

• Logically, RF-I behaves as a set of N express channels, each assigned to a (source, destination) router pair (s, d)
• Reconfiguration remaps these shortcuts to match the needs of different applications (a routing sketch follows)
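A hedged C++ sketch of shortcut-aware routing; the greedy policy and data structures are our illustration, not the paper's exact router logic:

```cpp
#include <cstdlib>
#include <vector>

// Shortcut-aware routing on the mesh: take an RF-I express channel when it
// strictly reduces the remaining hop count, else fall back to XY routing.
struct Node { int x, y; };
struct Shortcut { Node src, dst; };        // one configured (s, d) channel

static int hops(Node a, Node b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

Node nextHop(Node cur, Node dest, const std::vector<Shortcut>& shortcuts) {
    if (cur.x == dest.x && cur.y == dest.y) return cur;   // already there
    for (const Shortcut& s : shortcuts)    // try the express channels first
        if (s.src.x == cur.x && s.src.y == cur.y &&
            1 + hops(s.dst, dest) < hops(cur, dest))
            return s.dst;                  // one "hop" across the RF-I line
    if (cur.x != dest.x)                   // XY: route X first, then Y
        return Node{cur.x + (dest.x > cur.x ? 1 : -1), cur.y};
    return Node{cur.x, cur.y + (dest.y > cur.y ? 1 : -1)};
}
```

Because the shortcut table is just configuration state, remapping the express channels for a new application does not change the router logic, only the table contents.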

Power Savings [MICRO’08]

• We can thin the baseline mesh links from 16B to 8B to 4B
• RF-I makes up the difference in performance while saving overall power:
  - RF-I provides bandwidth where it is most needed (e.g., a node that requires high bandwidth to communicate with a distant node in the figure)
  - The baseline RC wires supply the rest

RF-I Enabled Multicast

[Figure: a cache-fill request scenario comparing a conventional NoC, where the fill reaches multiple requesters over several hop-by-hop steps, with an RF-I enabled NoC, where the Tx/Rx pairs on the transmission line deliver the fill as a multicast.]

Impact of Using RF-Interconnects [MICRO’08]

• An adaptive RF-I enabled NoC is cost-effective in terms of both power and performance

Overview of the Proposed Research (recap)

Recap of the CHP overview: domain-specific modeling (healthcare applications) drives CHP creation (customizable computing engines and interconnects) and CHP mapping (source-to-source mapper, reconfiguring & optimizing backend, adaptive runtime); design once, invoke many times.

CHP Mapping – Compilation and Runtime Software Systems for Customization

Goals: efficient mapping of domain-specific specifications to customizable hardware; adapt the CHP to a given application for drastic performance/power efficiency improvement.

Flow:
• Programmers write domain-specific applications against a domain-specific programming model (a domain-specific coordination graph plus domain-specific language extensions) with abstract execution semantics
• A source-to-source CHP mapper, guided by application characteristics and CHP architecture models, emits C/C++ code with analysis annotations and a C/SystemC behavioral specification
• A C/C++ front-end produces binary code for the fixed & customized cores; an RTL synthesizer (xPilot) produces customized RTL for the programmable fabric; performance feedback drives the reconfiguring and optimizing back-end
• An adaptive runtime (lightweight threads and adaptive configuration) executes the target code on CHP architectural prototypes (CHP hardware testbeds, a CHP simulation testbed, and ultimately the full CHP)

FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009)

• Use CUDA in tandem with high-level synthesis (HLS) to:
  - Enable high-level abstraction for FPGA programming
  - Exploit the massively parallel compute capabilities of FPGAs
  - Facilitate a single interface for GPU and FPGA kernel acceleration
• CUDA: a C-based parallel programming model for GPUs
  - Concise expression of coarse-grained parallelism
  - Very popular (wide range of existing applications)
  - Explicit partitioning and transfer of data between off-chip and on-chip memory
• AutoPilot: an advanced HLS tool (from AutoESL)
  - Platform-specific (i.e., FPGA/ASIC) C-to-RTL mapping
  - Fine-grained and loop-iteration parallelism extraction
  - Annotated coarse-grained parallelism extraction (requires explicit expression and annotation from the programmer)

CUDA-to-AutoPilot C Translation

• Identify off-chip data transfers; aggregate multi-thread off-chip accesses into DMA bursts
• Split the kernel into computation and data-communication tasks
• Use thread-block granularity for splitting kernel threads into parallel FPGA cores
• Allocate data storage based on the following GPU-to-FPGA memory-space mapping (a translation sketch follows):
  - Global memory → off-chip DRAM
  - Shared memory → on-chip BRAMs
  - Constant/texture memory → registers
  - Registers / local memory → registers within each thread-block kernel task
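A hedged C++ sketch of what this translation produces for a simple vector-add kernel; the names, the block size, and the use of memcpy to stand in for DMA bursts are our assumptions, not FCUDA's actual output:

```cpp
#include <cstring>

// The implicit CUDA thread grid of a kernel such as
//   __global__ void vadd(float* c, const float* a, const float* b)
//   { int i = blockIdx.x * blockDim.x + threadIdx.x; c[i] = a[i] + b[i]; }
// becomes an explicit C task at thread-block granularity: off-chip accesses
// are aggregated into burst transfers, data lives in on-chip buffers
// (BRAM), and the thread loop is the compute task that HLS can replicate
// into parallel FPGA cores.
constexpr int kBlockDim = 256;

void vadd_block(int blockIdx, float* c, const float* a, const float* b) {
    float a_buf[kBlockDim], b_buf[kBlockDim], c_buf[kBlockDim]; // on-chip
    const int base = blockIdx * kBlockDim;

    std::memcpy(a_buf, a + base, sizeof(a_buf));   // burst read (DMA)
    std::memcpy(b_buf, b + base, sizeof(b_buf));

    for (int t = 0; t < kBlockDim; ++t)            // explicit thread loop
        c_buf[t] = a_buf[t] + b_buf[t];            // former per-thread body

    std::memcpy(c + base, c_buf, sizeof(c_buf));   // burst write back
}
```

Separating the communication (memcpy) from the computation (the thread loop) is what lets the HLS tool overlap transfers with compute and instantiate many block-level cores in parallel.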

54 Results

Assume the FPGA has a high-bandwidth bus to off-chip DDR.

Kernel descriptions:

Kernel | Configuration | Description
Matrix multiply (matmul) | 1024x1024 | Common kernel in many imaging, simulation, and scientific applications
Coulombic potential (cp) | 4000 atoms, 512x512 grid | Computation of electric potential in a volume containing charged atoms
RSA encryption (rc5-72) | 4 billion keys | Brute-force encryption key generation and matching

FPGA configurations:

Benchmark | Cores | DRAM bandwidth | Limiting resource
matmul 32-bit | 128 | 3.5GB/s | DSP
matmul 16-bit | 176 | 1.6GB/s | BRAM
matmul 8-bit | 176 | 0.8GB/s | BRAM
cp 32-bit | 25 | 0.128GB/s | DSP
cp 16-bit | 96 | 0.19GB/s | DSP
cp 8-bit | 96 | 0.1GB/s | DSP
rc5-72 32-bit | 80 | ≈ 0GB/s | LUT

Power comparison:

Benchmark | GPU (GeForce 8800) | FPGA (Virtex-5 xc5vfx200t) | FPGA-over-GPU benefit
matmul 32-bit | ≈ 100 W | 10.622 W | 9.41X
matmul 16-bit | ≈ 100 W | 10.559 W | 9.47X
matmul 8-bit | ≈ 100 W | 9.954 W | 10.05X

[Chart: FPGA speedup relative to the GPU across the matmul, cp, and rc5-72 configurations at 32/16/8-bit precisions.]

Speedup is comparable to the GPU in several configurations, and the FPGA is much more power-efficient than the GPU.


Concluding Remarks

• We believe that domain-specific customization is the next transformative approach to energy-efficient computing
  - Beyond parallelization?

• Many research opportunities and challenges:
  - Domain-specific modeling/specification
  - Novel architectures for customization
  - Compilation and runtime software to support intelligent customization
  - New research in testing, verification, and reliability for customizable computing

• CDSC is taking a highly integrated approach:
  - Coordinated cross-layer customization in modeling, hardware, software, and application development

Acknowledgements

• A highly collaborative effort: thanks to all my co-PIs at the four universities (UCLA, Rice, Ohio State, and UC Santa Barbara)
• Thanks also to the National Science Foundation for its support

[Faculty photos, as on the CDSC organization slide above.]
