Customizable Domain-Specific Computing
Jason Cong Center for Domain-Specific Computing UCLA Computer Science Department [email protected] http://cadlab.cs.ucla.edu/~cong
The Power Barrier …
Source: Shekhar Borkar, Intel

Focus: A New Transformative Approach to Power/Energy-Efficient Computing
Current Solution: Parallelization
Parallelization
Source: Shekhar Borkar, Intel 3
Cost and Energy are Still a Big Issue …
Cost of computing •HW acquisition •Energy bill •Heat removal •Space •…
Next Significant Opportunity -- Customization
Parallelization
Customization
Adapt the architecture to Application domain
Source: Shekhar Borkar, Intel 5
Motivation
A few facts: We have sufficient computing power for most applications. Each user/enterprise needs high computing power for only selected tasks in its domain. Application-specific integrated circuits (ASICs) can deliver 10,000X+ better power/performance efficiency, but are too expensive to design and manufacture.
Our proposal: A general, customizable platform for the given domain(s) • Can be customized to a wide range of applications in the domain • Can be mass-produced cost-efficiently • Can be programmed efficiently with novel compilation and runtime systems. Goal: A “supercomputer-in-a-box” with 100X performance/power improvement via customization for the intended domain(s). Analogy: the advance of civilization via specialization/customization.
6 Example Application Domain: Healthcare
Medical imaging has transformed healthcare An in vivo method for understanding disease development and patient condition Estimated to be $100 billion/year More powerful & efficient computation can help • Fewer exposures using compressive sensing • Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms
Hemodynamic simulation Very useful for surgical procedures involving blood flow and vasculature Magnetic resonance (MR) angiograph of an aneurysm Both may take hours to days to construct Clinical requirement: 1-2 min Cloud computing won’t work – • Communication, real-time requirement, privacy A megawatt-datacenter for each hospital? Intracranial aneurysm reconstruction with hemodynamics
Medical Image Processing Pipeline
Medical images exhibit sparsity and can be sampled at a rate far below the classical Shannon-Nyquist rate. Compressive sensing reconstruction solves

$$\min_u \;\sum_{\text{sampled points}} \lVert ARu - S \rVert^2 \;+\; \lambda \sum_{\forall \text{voxels}} \lvert \mathrm{grad}(u) \rvert$$
Total variational denoising algorithm:

$$\forall \text{voxel } i:\quad u(i) = \frac{1}{Z(i)} \sum_{\text{voxel } j \in \text{volume}} w_{i,j}\, f(j), \qquad w_{i,j} = e^{-\sum_{k=1}^{S} \frac{\lVert y_k - z_k \rVert^2}{2\sigma^2}}$$
Fluid registration:

$$v = \frac{\partial u}{\partial t} + (v \cdot \nabla)u, \qquad \mu \Delta v + (\mu + \eta)\,\nabla(\nabla \cdot v) = -\big[T(x-u) - R(x)\big]\,\nabla T(x-u)$$

Level set methods for segmentation:

$$\frac{\partial \varphi}{\partial t} = \lvert\nabla \varphi\rvert \left[ F(\text{data}, \varphi) + \lambda\, \mathrm{div}\!\left( \frac{\nabla \varphi}{\lvert\nabla \varphi\rvert} \right) \right], \qquad \text{surface}(t) = \{ \text{voxels } x : \varphi(x,t) = 0 \}$$

Navier-Stokes equations for analysis:

$$\frac{\partial v}{\partial t} + (v \cdot \nabla)v = -\nabla p + \upsilon \Delta v + f(x,t), \qquad \frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \upsilon \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x,t)$$

Application Domains: Medical Image Processing Pipeline
Compressive sensing reconstruction: iterative, local or global communication; dense and sparse linear algebra, optimization methods
Total variational denoising: non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
Fluid registration: parallel, global communication; dense linear algebra, optimization methods
Level set segmentation: local communication; dense linear algebra, spectral methods, MapReduce
Navier-Stokes analysis: local communication; sparse linear algebra, n-body methods, graphical models

• These algorithms have diverse computation & communication patterns
• A single homogeneous system cannot perform very well on all of these algorithms
Need for Customization in the Medical Image Processing Pipeline

• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of them
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms):
  CPU (Xeon 2.0 GHz):  1x,    ~100 W
  GPU (Tesla C1060):   93x,   ~150 W
  FPGA (xc4vlx100):    11x,   ~5 W

3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels):
  CPU (Xeon 2.0 GHz), quick select:              1x,     ~100 W
  GPU (Tesla C1060), median of medians:          70x,    ~140 W
  FPGA (xc4vlx100), bit-by-bit majority voting:  1200x,  ~3 W
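The FPGA's 1200x result relies on the bit-by-bit majority-voting median named in the table. A minimal C sketch of that technique (function name and word width are ours; a real design would feed all 27 neighbor voxels through this logic in hardware, one bit position per cycle):

```c
#include <stdint.h>

/* Bit-serial majority-voting median of n (odd, n <= 64) w-bit values.
   Scan bits from MSB to LSB; the median's bit is the majority bit among
   the inputs, and inputs whose bit disagrees are "clamped" so they keep
   voting on the losing side for all remaining bit positions. */
static uint32_t majority_median(const uint32_t *v, int n, int w) {
    uint32_t med = 0;
    int8_t clamp[64] = {0};              /* 0 = live, +1 = stuck-at-1, -1 = stuck-at-0 */
    for (int b = w - 1; b >= 0; b--) {
        int ones = 0;
        for (int i = 0; i < n; i++) {
            int bit = (clamp[i] > 0) ? 1 :
                      (clamp[i] < 0) ? 0 : (int)((v[i] >> b) & 1u);
            ones += bit;
        }
        int maj = ones > n / 2;          /* majority vote for this bit */
        med |= (uint32_t)maj << b;
        for (int i = 0; i < n; i++) {    /* clamp inputs that lost the vote */
            if (clamp[i] == 0 && (int)((v[i] >> b) & 1u) != maj)
                clamp[i] = maj ? -1 : 1;
        }
    }
    return med;
}
```

Unlike quick select, this needs no data-dependent branching or sorting network, only counters and comparators, which is why it maps so well to an FPGA.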
Center for Domain-Specific Computing (CDSC) Organization
• A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math • 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara
Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director, UCLA)
Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio-State), Sarkar (Associate Director, Rice), Vese (UCLA)
Overview of the Proposed Research

[Figure: a Customizable Heterogeneous Platform (CHP) with fixed cores, custom cores, and programmable fabric, interconnected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface); multiple CHPs connect to DRAM and I/O]

Research flow: domain-specific modeling (healthcare applications) → architecture modeling → CHP creation (customizable computing engines, customizable interconnects) → CHP mapping (source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime), with domain characterization and application modeling closing the loop.

Design once, invoke many times.
CHP Creation – Design Space Exploration
Core parameters: frequency & voltage, datapath bit width, instruction window size, issue width, cache size & configuration, register file organization, # of thread contexts, …

NoC parameters: interconnect topology, # of virtual channels, routing policy, link bandwidth, router pipeline depth, number of RF-I enabled routers, RF-I channel and bandwidth allocation, …

Custom instructions & accelerators: amount of programmable fabric, shared vs. private accelerators, custom instruction selection, choice of accelerators, …

[Figure: CHP with fixed cores, custom cores, programmable fabric, and reconfigurable RF-I / optical buses]

Key questions: What is the optimal trade-off between efficiency & customizability? Which options should be fixed at CHP creation, and which should be set by the CHP mapper?

Customization for Cores
Example of the core customization space: instruction queue size, register file size, number and type of FUs, ROB size; branch predictor (BTB size, BTB complexity); LSQ size; memory hierarchy and configuration (cache sizes, cache associativity, memory latency)
Existing Studies on Core Customization (not domain-specific)
[Folegnani & Gonzalez, ISCA 2001]: issue logic and issue queue (43/58) → 16% total processor energy saving
[Ponomarev et al., MICRO 2001]: instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) → 59% power saving for the three components
[Hughes et al., MICRO 2001]: issue width (8,4,2), issue queue (128,64,32), function units (4,2), dynamic voltage scaling → up to 78% total energy saving with combined DVS and architectural adaptation
[Yeh et al., MICRO 2007]: reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 cores per FPU), eliminating trivial FP operations, lookup table → up to 50% power reduction and 55% performance improvement
[Cong et al., Trans. on PDS 2007]: core spilling, from 1 core up to 8 cores → less than 50% worse than an ideal 8x-powerful core; up to 40% improvement for changing workloads
[Ipek et al., ISCA 2007]: core fusion, 2-issue cores fused to simulate 4- and 6-issue cores → less than 30% and 20% worse for sequential and parallel benchmarks, respectively
[Mai et al., ISCA 2007]: memory system (streaming register files or cache hierarchy), communication (broadcast or routed), processor (SIMD or RISC superscalar) → only 2x worse than a domain-optimized system
[Lee and Brooks, ASPLOS 2008]: issue queue, issue width, branch predictor, LSQ, ROB, registers, I-L1/D-L1/L2 cache size and latency, memory latency; temporal sensitivity → 1.6X performance gain and 0.8X power reduction, a 5.1x efficiency improvement

Energy-Effective Issue Logic [Folegnani & Gonzalez, ISCA’01]
Inefficiencies of conventional instruction issue logic & issue queue (IQ):
A) Energy is wasted on empty entries and on entries whose operands are already ready
B) The effectively used IQ size varies across applications
C) The effectively used IQ size varies across different periods of one application
Adaptation of Multiple Datapath Resources (cont’d)
Dynamic adaptation through multi-partitioned resources:
  Instruction queue (IQ): avg 17; max 32
  Reorder buffer (ROB): avg 57; max 128
  Load/store queue (LSQ): avg 18; max 32
The three resources are adjusted independently at run time: downsize a resource based on sampling statistics of its effective-usage history; upsize it based on its resource-miss record. Total power saving for the three resized components: 59%
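A toy sketch of this sampling-driven resizing policy (partition size, thresholds, and names are our assumptions, not the paper's):

```c
/* Occupancy-driven resizing sketch: a queue is built from fixed-size
   partitions; periodic samples of average occupancy drive downsizing
   (a whole partition sits idle), while overflow ("resource miss")
   events drive upsizing. */
enum { PARTITION = 8, MAX_PARTS = 4 };

static int adapt_size(int cur_parts, double avg_occupancy, int overflow_events) {
    if (overflow_events > 0 && cur_parts < MAX_PARTS)
        return cur_parts + 1;                              /* upsize on resource misses */
    if (cur_parts > 1 && avg_occupancy < (double)(cur_parts - 1) * PARTITION)
        return cur_parts - 1;                              /* downsize: top partition unused */
    return cur_parts;
}
```

Disabled partitions can be clock- or power-gated, which is where the reported savings come from.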
18 Architectural and Frequency Adaptations for Multimedia Applications [Hughes, et al, MICRO 2001]
Dynamic adaptation:
  Architecture: issue width & issue queue; # of function units
  Dynamic voltage scaling (DVS): continuous DVS (CDVS); discrete DVS (DDVS)
Adaptation method:
  Initial profiling: a multimedia application has similar performance and power statistics for frames of the same type
  Dynamic adaptation: choose the optimal configuration via table lookup, based on history statistics for the same frame type
Energy saving:
  DDVS alone: 73%; CDVS alone: 75%; Arch alone: 22%; Arch + DDVS: 77%; Arch + CDVS: 78%
19
Architectural and Frequency Adaptations for Multimedia Applications (cont’d)
Important conclusions:
  DVS gives most of the energy reduction
  Architectural adaptation further reduces energy when layered on top of DVS
  Without DVS, less aggressive architectures are more energy-efficient
  With DVS, more aggressive architectures are often more energy-efficient
    • The higher IPC of a more aggressive architecture means it can be run at a lower frequency to save energy
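A back-of-the-envelope argument for that last point, under standard CMOS scaling assumptions (not from the paper):

$$\text{Perf} = \mathrm{IPC} \times f \;\Rightarrow\; f = f_0 \,\frac{\mathrm{IPC}_0}{\mathrm{IPC}}, \qquad P \propto C V^2 f,\;\; V \propto f \;\Rightarrow\; \frac{E_{\text{task}}}{E_0} \propto \left(\frac{V}{V_0}\right)^2 = \left(\frac{\mathrm{IPC}_0}{\mathrm{IPC}}\right)^2$$

So an architecture with higher IPC can hold performance constant while lowering both frequency and voltage, cutting energy per task roughly quadratically in the IPC gain.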
20 Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]
Examine two main questions: Spatial adaptivity - which parameters to tune? Temporal adaptivity – how often to tune?
Study the effects of tuning 15 parameters, at different adaptation time intervals
Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]
Architectural parameters studied: instruction queue size, register file size, number and type of FUs, ROB size; branch predictor (BTB size, BTB complexity); LSQ size; cache sizes, cache associativity, memory latency
22 Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]
Key findings: Up to 5.3x improvement in efficiency through adaptation. Relatively frequent adaptation (80K-instruction intervals) is needed to achieve maximum efficiency.
Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]
Key findings: On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain; however, which 3 parameters matter depends on the application and phase. DVFS provides relatively little additional benefit (in terms of efficiency) once architectural adaptation is applied.
24 Existing Studies on Cores Customization (not domain-specific)
[Folegnani & Gonzalez, ISCA 2001]: issue logic and issue queue (43/58) → 16% total processor energy saving
[Ponomarev et al., MICRO 2001]: instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) → 59% power saving for the three components
[Hughes et al., MICRO 2001]: issue width (8,4,2), issue queue (128,64,32), function units (4,2), dynamic voltage scaling → up to 78% total energy saving with combined DVS and architectural adaptation
[Yeh et al., MICRO 2007]: reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 cores per FPU), eliminating trivial FP operations, lookup table → up to 50% power reduction and 55% performance improvement
[Cong et al., Trans. on PDS 2007]: core spilling, from 1 core up to 8 cores → less than 50% worse than an ideal 8x-powerful core; up to 40% improvement for changing workloads
[Ipek et al., ISCA 2007]: core fusion, 2-issue cores fused to simulate 4- and 6-issue cores → less than 30% and 20% worse for sequential and parallel benchmarks, respectively
[Mai et al., ISCA 2007]: memory system (streaming register files or cache hierarchy), communication (broadcast or routed), processor (SIMD or RISC superscalar) → only 2x worse than a domain-optimized system
[Lee and Brooks, ASPLOS 2008]: issue queue, issue width, branch predictor, LSQ, ROB, registers, I-L1/D-L1/L2 cache size and latency, memory latency; temporal sensitivity → 1.6X performance gain and 0.8X power reduction, a 5.1x efficiency improvement
CHP Creation – Design Space Exploration
Core parameters: frequency & voltage, datapath bit width, instruction window size, issue width, cache size & configuration, register file organization, # of thread contexts, …

NoC parameters: interconnect topology, # of virtual channels, routing policy, link bandwidth, router pipeline depth, number of RF-I enabled routers, RF-I channel and bandwidth allocation, …

Custom instructions & accelerators: amount of programmable fabric, shared vs. private accelerators, custom instruction selection, choice of accelerators, …

[Figure: CHP with fixed cores, custom cores, programmable fabric, and reconfigurable RF-I / optical buses]

Key questions: What is the optimal trade-off between efficiency & customizability? Which options should be fixed at CHP creation, and which should be set by the CHP mapper?

Customization of Programmable Fabrics
FPGA-based acceleration has shown a lot of promise Many applications in bio-informatics, financial engineering, image processing, scientific computing, … Many publications in FCCM, FPGA, FPL, FPT, …
Two significant barriers Communication between CPU and FPGA accelerator • Overhead of using peripheral bus is too high Automatic compilation • Real programmers do not use VHDL/Verilog
But … a lot of encouraging progress made recently
Customization of Programmable Fabrics
Recent enablers Communication between CPU and FPGA accelerator • High-speed connections – HyperTransport bus, FSB, QPI, … • On-chip integration Automatic compilation • Maturing of C/C++ to RTL synthesis tools
28 Acceleration of Lithographic Simulation [FPGA’08]
Lithography simulation: simulate the optical imaging process. Computationally intensive; very slow for full-chip simulation. The aerial image intensity is

$$I(x,y) = \sum_{\kappa} \lambda_{\kappa} \,\Big|\, \sum \tau \big[ \psi_{\kappa}(x-x_1,\, y-y_1) - \psi_{\kappa}(x-x_2,\, y-y_1) + \psi_{\kappa}(x-x_2,\, y-y_2) - \psi_{\kappa}(x-x_1,\, y-y_2) \big] \Big|^2$$

Flow: algorithm in C → AutoPilot™ synthesis tool
15X+ performance improvement vs. an AMD Opteron 2.2 GHz processor
Close to 100X improvement in energy efficiency: 15 W for the FPGA vs. 86 W for the Opteron
XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180) 29
xPilot: Behavioral-to-RTL Synthesis Flow
Behavioral spec in C/C++/SystemC → front-end compiler (SSDM), guided by a platform description
Advanced transformations/optimizations: loop unrolling/shifting/pipelining; strength reduction / tree-height reduction; bitwidth analysis; memory analysis; …
Core behavioral synthesis optimizations: scheduling; resource binding (e.g., functional-unit binding, register/port binding)
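One of these transformations, tree-height reduction, is easy to see in miniature (our own toy example, not taken from xPilot):

```c
/* Tree-height reduction sketch: re-associating a 4-input addition chain
   into a balanced tree cuts the critical path from 3 dependent adds to
   2 (ceil(log2 4)), shortening the schedule after behavioral synthesis. */
static int sum_chain(int a, int b, int c, int d)    { return ((a + b) + c) + d; } /* depth 3 */
static int sum_balanced(int a, int b, int c, int d) { return (a + b) + (c + d); } /* depth 2 */
```

Both forms compute the same value; only the dependence depth (and hence the achievable clock period or latency) differs.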
μArch generation & RTL/constraints generation: Verilog/VHDL/SystemC plus constraints, targeting FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, …)

Some Recent Studies -- Efficient Identification of Approximate Patterns [Cong & Wei, FPGA’08]
Programs may contain many recurring patterns
Prior work can only identify exact patterns; we can efficiently identify “approximate” patterns in large programs, covering structure, bitwidth, and port variations:
  Based on the concept of edit distance
  Uses data-mining techniques
  Efficient subgraph enumeration and pruning
  Highly scalable: can handle programs with 100,000+ lines of code
Applications:
  Behavioral synthesis: 20+% area reduction due to sharing of approximate patterns
  ASIP synthesis: identify & extract customized instructions
Some Recent Studies -- Automatic Memory Partitioning
To appear in ICCAD 2009
The memory system is critical for high-performance, low-power design: the memory bottleneck limits maximum parallelism, and the memory system accounts for a significant portion of total power consumption.
Goal: given platform information (memory ports, power, etc.), a behavioral specification, and throughput constraints: partition memories automatically, meet the throughput constraints, and minimize power consumption.
Example:
  (a) C code:
      for (int i = 0; i < n; i++)
        ... = A[i] + A[i+1];
  (b) Scheduling: A[i] and A[i+1] must be read in the same cycle
  (c) Memory architecture after partitioning: A[0, 2, 4, …] and A[1, 3, 5, …] are placed in separate banks behind a decoder, read out through registers R1 and R2
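The even/odd split in this example can be sketched in plain C (array size and helper names are illustrative):

```c
/* Two-bank cyclic partitioning sketch for "... = A[i] + A[i+1]":
   A[i] and A[i+1] always land in different banks, so both reads can
   issue in the same cycle even with single-ported memories. */
enum { N = 8 };
static int bank[2][N / 2 + 1];           /* bank[0]: A[0,2,4,...], bank[1]: A[1,3,5,...] */

static void write_A(int i, int x) { bank[i % 2][i / 2] = x; }
static int  read_A(int i)         { return bank[i % 2][i / 2]; }

/* One iteration of the loop body: the two reads hit different banks. */
static int loop_body(int i) { return read_A(i) + read_A(i + 1); }
```

The decoder in the slide corresponds to the `i % 2` bank select and `i / 2` intra-bank address computed here in software.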
Automatic Memory Partitioning (AMP)

Techniques:
  Capture array access conflicts in a conflict graph
  Array subscript analysis for throughput optimization
  Model the loop kernel in parametric polytopes to obtain array access frequency
Flow: memory platform information + loop nest → throughput optimization → partition candidate generation → try partition candidate Ci, minimizing accesses on each bank → meets port limitation? (if not, try the next candidate) → power optimization → loop pipelining and scheduling → pipeline results
Contributions:
  Automatic approach for design space exploration
  Cycle-accurate
  Handles irregular array accesses
  Lightweight profiling for power optimization
Automatic Memory Partitioning (AMP)
About 6x throughput improvement on average, with 45% area overhead
In addition, power optimization further reduces power by 30% after throughput optimization
Benchmark    Original II  Partitioned II  Original SLICEs  Partitioned SLICEs  Area ratio  Power reduction
fir          3            1               241              510                 2.12        26.82%
idct         4            1               354              359                 1.01        44.23%
litho        16           1               1220             2066                1.69        31.58%
matmul       4            1               211              406                 1.92        77.64%
motionEst    5            1               832              961                 1.16        10.53%
palindrome   2            1               84               65                  0.77        0.00%
avg          5.67x II improvement                                              1.45        31.80%
AutoPilot Compilation Tool (based on the UCLA xPilot system)
Flow: design specification (C/C++/SystemC, user constraints, common testbench) → compilation & elaboration → presynthesis optimizations → behavioral & communication synthesis and optimizations (driven by a platform characterization library) → RTL HDLs & RTL SystemC with timing/power/layout constraints → FPGA co-processor; simulation, verification, and prototyping throughout

Platform-based C-to-FPGA synthesis:
  Synthesizes pure ANSI C and C++; GCC-compatible compilation flow
  Full support of IEEE-754 floating-point data types & operations
  Efficiently handles bit-accurate fixed-point arithmetic
More than 10X design productivity gain; high quality of results
Other Uses of AutoPilot (Microsoft)
On John Cooley’s DeepChip 6/30/09 http://www.deepchip.com/items/0482-06.html
“We purchased AutoESL's AutoPilot in 2008 to implement some of the time- consuming cores in our software into FPGA hardware for the runtime speed-up improvements… 1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines… 2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…
CHP Creation – Design Space Exploration
Core parameters: frequency & voltage, datapath bit width, instruction window size, issue width, cache size & configuration, register file organization, # of thread contexts, …

NoC parameters: interconnect topology, # of virtual channels, routing policy, link bandwidth, router pipeline depth, number of RF-I enabled routers, RF-I channel and bandwidth allocation, …

Custom instructions & accelerators: amount of programmable fabric, shared vs. private accelerators, custom instruction selection, choice of accelerators, …

[Figure: CHP with fixed cores, custom cores, programmable fabric, and reconfigurable RF-I / optical buses]

Key questions: What is the optimal trade-off between efficiency & customizability? Which options should be fixed at CHP creation, and which should be set by the CHP mapper?
Current On-Chip Interconnect Technology
Optimized RC lines with repeaters: wire sizing, buffer insertion, buffer sizing, … (e.g., the UCLA TRIO and IPEM packages)
Reconfigurable interconnects:
  For FPGAs: RC buses with pass transistors or bidirectional buffers
  For CMPs (chip multiprocessors): mesh-like network-on-chip (NoC)
  Both pay a large performance penalty
38 Used vs. Available Bandwidth in Modern CMOS
[Figure: fT vs. technology node at 45nm CMOS; used data rate ≈ 4 Gbit/s]

The fT of 45nm CMOS can be as high as 240 GHz, but the baseband signal bandwidth is only about 4 GHz: 98.4% of the available bandwidth is wasted.
Question: how can we take advantage of the full bandwidth of modern CMOS?
UCLA 90nm CMOS VCO at 324GHz [ISSCC 2008]

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in a 90nm process.
[Figure: measured output spectrum, Pout (dBm) from roughly -70 to -100 over 323.0-324.0 GHz, peaking at 323.5 GHz]
The CMOS voltage-controlled oscillator was measured with a subharmonic mixer driven by an 80 GHz synthesized local oscillator. The mixing frequency is f_IF = f_VCO - 4*f_LO, i.e., f_VCO - 4*(80 GHz) = 3.5 GHz, yielding f_VCO = 323.5 GHz.
[Figure: on-wafer VCO test setup at JPL]
*Huang, D., LaRocca, T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb. 2008, San Francisco, CA.

Multiband RF-Interconnect
[Figure: spectrum of N frequency channels sharing one transmission medium]
• In the TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
• N different data streams (N = 6 in the example figure) can be transmitted simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In the RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
41 41
Tri-band On-Chip RF-I Test Results
Process: IBM 90nm CMOS digital process
Channels: 3 total (30 GHz RF, 50 GHz RF, baseband)
Data rate: 4 Gbps in each RF channel; 2 Gbps baseband; 10 Gbps total
Bit error rate across all bands: < 10^-9
Latency: 6 ps/mm
Energy per bit: 0.09 pJ/bit/mm (RF)*; 0.125 pJ/bit/mm (baseband)

*VCO power (5 mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not burden an individual link significantly.

[Figure: data output waveforms for the 30 GHz, 50 GHz, and baseband channels, and the output spectrum of the 30 GHz and 50 GHz RF bands]
Assumptions:
1. 32nm node; 30x repeaters, FO4 = 8 ps, Rwire = 306 Ω/mm, Cwire = 315 fF/mm, wire pitch = 0.2 μm, bus length = 2 cm, f_bus = 1 GHz, bus width = 96 bytes
2. Repeater area = 0.022 mm²
3. Bus physical width = 160 μm
4. In that width we can fit 13 transmission lines, each with 7 carriers carrying 8 Gbps

                               RF-I    Repeated bus
# of wires                     13      448
Data rate per carrier (Gb/s)   8       N/A
# of carriers                  7       N/A
Aggregate data rate (Gb/s)     728     768
Bus physical width (μm)        160     160
Transceiver area (mm²)         0.27    0.022
Power (mW)                     455     6144
Energy per bit (pJ/bit)        0.63    8

Interconnect length = 2 cm
Architectural Impact Using RF-I
High-bandwidth communication: data distribution across many-core topologies; vital to keeping many-core designs busy
Low-latency communication: enables users to apply parallel computing to a broader range of applications through faster synchronization and communication; faster cache-coherence protocols
Reconfigurability: adapt NoC topology/bandwidth to the needs of the individual application
Power-efficient communication
44 44 Simple RF-I Topology
Four NoC components (C) are connected by an RF-I transmission line bundle with tunable Tx/Rx’s.
Arbitrary topologies, arbitrary bandwidths: one physical topology can be configured into many virtual topologies, e.g., bus, multicast, fully connected crossbar, or pipeline/ring.
Mesh Overlaid with RF-I [HPCA’08]
10x10 mesh of pipelined routers NoC runs at 2GHz XY routing 64 4GHz 3-wide processor cores Labeled aqua 8KB L1 Data Cache 8KB L1 Instruction Cache 32 L2 Cache Banks Labeled pink 256KB each Organized as shared NUCA cache 4 Main Memory Interfaces Labeled green RF-I transmission line bundle Black thick line spanning mesh
46 46 RF-I Logical Organization
• Logically, RF-I behaves as a set of N express channels, each assigned to a (source, destination) router pair (s, d)
• Reconfiguration remaps the shortcuts to match the needs of different applications
Power Savings [MICRO’08]
We can thin the baseline mesh links from 16 bytes to 8 bytes, or even to 4 bytes. (In the example, node A requires high bandwidth to communicate with node B.)
RF-I makes up the difference in performance while saving overall power: RF-I provides bandwidth where it is most needed, and the baseline RC wires supply the rest.
Request scenario: a Get reaches source S, which must deliver a FILL to multiple requesters. In a conventional NoC the fill is forwarded and replicated hop by hop (multiple serialized transmissions). In an RF-I enabled NoC, multiple Rx’s tune to the same Tx channel, so a single transmission serves all destinations.
Impact of Using RF-Interconnects [MICRO’08]
• An adaptive RF-I enabled NoC is cost-effective in terms of both power and performance
Overview of the Proposed Research

[Figure: a Customizable Heterogeneous Platform (CHP) with fixed cores, custom cores, and programmable fabric, interconnected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface); multiple CHPs connect to DRAM and I/O]

Research flow: domain-specific modeling (healthcare applications) → architecture modeling → CHP creation (customizable computing engines, customizable interconnects) → CHP mapping (source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime), with domain characterization and application modeling closing the loop.

Design once, invoke many times.
CHP Mapping – Compilation and Runtime Software Systems for Customization
Goals: Efficient mapping of domain-specific specification to customizable hardware – Adapt the CHP to a given application for drastic performance/power efficiency improvement
Flow:
  Domain-specific applications → domain-specific programming model (domain-specific coordination graph and domain-specific language extensions); the programmer works against an abstract execution model
  Source-to-source CHP mapper, guided by application characteristics and CHP architecture models
    → C/C++ code with annotations → C/C++ front end → binary code for fixed & customized cores
    → C/SystemC behavioral spec → RTL synthesizer (xPilot) → customized RTL for the programmable fabric
  Reconfiguring & optimizing back end, with performance feedback to the mapper
  Adaptive runtime: lightweight threads and adaptive configuration
  CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)
52 FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009) Use CUDA in tandem with High-Level Synthesis (HLS) to: enable high-level abstraction for FPGA programming exploit massively parallel compute capabilities of FPGA facilitate single interface for GPU and FPGA kernel acceleration
CUDA: a C-based parallel programming model for GPUs
  Concise expression of coarse-grained parallelism
  Very popular (wide range of existing applications)
  Explicit partitioning and transfer of data between off-chip and on-chip memory
AutoPilot: Advanced HLS tool (from AutoESL) Platform-specific (i.e. FPGA/ASIC) C-to-RTL mapping Fine-grained and loop iteration parallelism extraction Annotated coarse-grained parallelism extraction • Requires explicit expression and annotation from programmer 53
CUDA-to-AutoPilot C Translation
Identify off-chip data transfers aggregate multi-thread off-chip accesses into DMA bursts
Split kernel into computation and data communication tasks
Use thread-block granularity for splitting kernel threads into parallel FPGA cores
Allocate data storage based on the following memory-space mapping (GPU → FPGA):
  Global → off-chip DRAM
  Shared → on-chip BRAMs
  Constant/texture → registers
  Registers / local memory → registers
[Figure: kernel split into thread-block data-communication and computation tasks]
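As a hedged illustration of this translation (kernel, names, and block size are invented for the sketch; real FCUDA output also carries AutoPilot pragmas), a CUDA kernel that scales a vector might become plain C like this:

```c
/* FCUDA-style translation sketch. The CUDA thread-block
     __global__ void scale(float *g, float k) { g[threadIdx.x] *= k; }
   becomes a C function an HLS tool can synthesize: explicit burst loops
   stand in for global-memory accesses, an on-chip buffer stands in for
   shared memory/registers, and a thread loop (unrollable across FPGA
   cores) replaces the thread-block. */
enum { BLOCK = 4 };                      /* assumed thread-block size */

static void scale_block(float *g, float k) {
    float buf[BLOCK];                    /* on-chip BRAM buffer */
    for (int t = 0; t < BLOCK; t++)      /* DMA burst in (data task) */
        buf[t] = g[t];
    for (int t = 0; t < BLOCK; t++)      /* thread loop (compute task) */
        buf[t] *= k;
    for (int t = 0; t < BLOCK; t++)      /* DMA burst out (data task) */
        g[t] = buf[t];
}
```

Splitting the function into separate data and compute loops is what lets the tool overlap DMA transfers of one block with computation on another.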
Results

Assume the FPGA has a high-bandwidth bus to off-chip DDR.

Kernel configurations:
  Benchmark      Core #  DRAM bandwidth  Limiting resource
  matmul 32-bit  128     3.5 GB/s        DSP
  matmul 16-bit  176     1.6 GB/s        BRAM
  matmul 8-bit   176     0.8 GB/s        BRAM
  cp 32-bit      25      0.128 GB/s      DSP
  cp 16-bit      96      0.19 GB/s       DSP
  cp 8-bit       96      0.1 GB/s        DSP
  rc5-72 32-bit  80      ≈ 0 GB/s        LUT

Benchmark descriptions:
  Matrix multiply (matmul), 1024x1024: common kernel in many imaging, simulation, and scientific applications
  Coulombic potential (cp), 4000 atoms, 512x512 grid: computation of electric potential in a volume containing charged atoms
  RSA encryption (rc5-72), 4 billion keys: brute-force encryption key generation and matching

Power and efficiency (GPU: GeForce 8800, ≈ 100 W; FPGA: Virtex-5 xc5vfx200t):
  matmul 32-bit: 10.622 W on FPGA, 9.41X power benefit over GPU
  matmul 16-bit: 10.559 W on FPGA, 9.47X
  matmul 8-bit:  9.954 W on FPGA, 10.05X

[Chart: FPGA speedup vs. GPU (roughly 0-2.5x) across the matmul, cp, and rc5-72 configurations]
Speedup comparable to GPU in several configurations Much more power efficient than GPU!
Concluding Remarks
We believe that domain-specific customization is the next transformative approach to energy efficient computing Beyond parallelization?
Many research opportunities and challenges Domain-specific modeling/specification Novel architecture & microarchitecture for customization Compilation and runtime software to support intelligent customization New research in testing, verification, reliability in customizable computing
CDSC is taking a highly integrated effort – Coordinated cross-layer customization in modeling, HW, SW, & application development
56 Acknowledgements
• A highly collaborative effort
  • Thanks to all my co-PIs at four universities: UCLA, Rice, Ohio-State, and UC Santa Barbara
  • Thanks for the support from the National Science Foundation
Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director, UCLA)
Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio-State), Sarkar (Associate Director, Rice), Vese (UCLA)