Design of Application Specific Processor Architectures
Rainer Leupers RWTH Aachen University Software for Systems on Silicon [email protected]
Institute for Integrated Signal Processing Systems Overview: Geography
Berlin
Aachen
2005 © R. Leupers 2 About ISS
¾ RWTH Aachen is a top-ranked technical university in Germany ¾ ISS institute 3 professors (Meyr, Ascheid, Leupers) 18 Ph.D. students 5 staff ¾ Research on wireless communication systems tight industry cooperations origin of several EDA spin-off companies (e.g. Cadis, Axys, LISATek) SSS group focuses on embedded processor design tools
2005 © R. Leupers 3 Overview
1. Introduction 2. ASIP design methodologies 3. Software tools 4. ASIP architecture design 5. Case study 6. Advanced research topics
2005 © R. Leupers 4 1. Introduction
2005 © R. Leupers 5 Embedded system design automation
¾ Embedded systems Special-purpose electronic devices Very different from desktop computers ¾ Strength of European IT market Telecom, consumer, automotive, medical, ... Siemens, Nokia, Bosch, Infineon, ... ¾ New design requirements Low NRE cost, high efficiency requirements Real-time operation, dependability Keep pace with Moore´s Law
2005 © R. Leupers 6 What to do with chip area ?
2005 © R. Leupers 7 Example: wireless multimedia terminals
¾ Multistandard radio UMTS GSM/GPRS/EDGE WLAN Bluetooth UWB … ¾ Multimedia standards Key issues: MPEG-4 Key issues: MP3 ••TimeTime toto marketmarket((≤≤1212 months)months) AAC • Flexibility (ongoing standard GPS • Flexibility (ongoing standard updates) DVB-H updates) … ••EfficiencyEfficiency(battery(batteryoperation)operation)
2005 © R. Leupers 8 Application specific processors (ASIPs)
„As the performance of conventional microprocessors improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer...“
[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]
design budget = (semiconductor revenue) × (% for R&D) growth ≈ 15% ≈ 10%
# IC designs = (design budget) / (design cost per IC) growth ≈ 15% growth ≈ 50-100% [Keutzer05] → Customizable application specific processors as reusable, programmable platforms
2005 © R. Leupers 9 Efficiency and flexibility
General Purpose Digital Signal Processors Application Processors Specific Instruction Set Processors Field
SW Programmable 6 Devices WhyDesignuse ASIPs? . . . 10
• Higher efficiency for given range Application 5 D I S S I P A T I O N of applications Specific 10 ICs • IP protection • Cost reduction (no HWroyalties) Design Physically
Log F L E X I B I L I T Y Optimized • Product differentiation ICs Log P O W E R
Log P E R F O R M A N C E
103 . . . 104 Source: T.Noll, RWTH Aachen
2005 © R. Leupers 10 Standard-CPU vs. ASIP
2005 © R. Leupers 11 2. ASIP design methodologies
2005 © R. Leupers 12 ASIP architecture exploration
Application initial processor architecture Profiler Compiler
Simulator Assembler
optimized Linker processor architecture
2005 © R. Leupers 13 Expression (UC Irvine)
2005 © R. Leupers 14 Tensilica Xtensa/XPRES
Source: Tensilica Inc.
2005 © R. Leupers 15 MIPS CorXtend/CoWare CorXpert
Synthesize HW and profile with Replace MIPSsim Profile critical code and and 2 with special extensions H identify instruction User Defined Instruction o custom 3 t instructions + s p CorExtend Module o 1 t User Defined Instruction
2005 © R. Leupers 16 CoWare LISATek ASIP architecture exploration
¾ Integrated embedded processor development environment ¾ Unified processor model in LISA 2.0 architecture description language (ADL) ¾ Automatic generation of: SW tools HW models
2005 © R. Leupers 17 LISA operation hierarchy
main Reflects hierarchical organization of ISAs decode
addr cond opcode opnds
imm linear cycl control arithm move short long
add sub mul and or
2005 © R. Leupers 18 LISA operations structure
LISA operation References to other operations DECLARE Binary coding CODING Assembly syntax SYNTAX Resource access, e.g. registers
EXPRESSION Initiate “downstream” operations in pipe ACTIVATION Computation and processor state update BEHAVIOR C compiler generation SEMANTICS
2005 © R. Leupers 19 LISA operation example
OPERATION ADD { DECLARE { GROUP src1, src2, dest = { Register } } CODING { 0b1011 src1 src2 dest } SYNTAX { “ADD” dest “,” src1 “,” src2 }
BEHAVIOR { dest = src1 + src2; } C/C++ Code } OPERATION Register { DECLARE { LABEL index; ADD } CODING { index } src1 src2 dest SYNTAX { “R” index } EXPRESSION{ R[index] } } Register Register Register
2005 © R. Leupers 20 Exploration/debugger GUI
••ApplicationApplicationssimulationimulation ••DebuggingDebugging ••ProfilingProfiling ••ResourceResourceutilizationutilizationaanalysisnalysis ••PipelinePipeline analysisanalysis ••ProcessorProcessormodelmodelddebuggingebugging ••MemoryMemoryhierarchyhierarchyeexplorationxploration ••CodeCode coveragecoverageanalysisanalysis ••......
2005 © R. Leupers 21 Some available LISA 2.0 models
¾ DSP: •VLIW: Texas Instruments – Texas Instruments TMS320C54x TMS320C6x Analog Devices – STMicroelectronics ADSP21xx ST220 Motorola 56000 •µC: ¾ RISC: –MHS80C51 MIPS32 4K • ASIP: ESA LEON SPARC 8 – Infineon PP32 NPU ARM7100 – Infineon ICore ARM926 – MorphICs DSP
2005 © R. Leupers 22 3. Software tools
2005 © R. Leupers 23 Tools generated from processor ADL model
Application
Profiler Compiler
Simulator Assembler
Linker
2005 © R. Leupers 24 Instruction set simulation
Interpretive:
•flexible Decode Execute Application Instruction • slow (~ 100 KIPS) Memory
RunRun--TimTimee Compiled: • fast (> 10 MIPS) Instruction Behavior Simulation • inflexible Application Compiler Execute Compiled Program • high memory Simulation Memory
consumption CCoompilmpilee--TimeTime RunRun--TimTimee JIT-CCS™: • „just-in-time“ Instruction Instruction Behavior Application Decode Compiled Execute compiled Program Simulation Memory Cache • SW simulation cache RunRun--TimTimee • fast and flexible
2005 © R. Leupers 25 JIT-CC simulation performance
• Dependent on simulation cache size • 95% of compiled simulation performance @ 4096 cache blocks (10% memory consumption of compiled sim.) • Example: ST200 VLIW DSP 9 100 8 90 C ]
80 a
S 7 c h IP 70 6 eM [M 60 i
5 s ce sR
n 50
a 4 a m 40 r t i o
3 o 30 rf [ 2 % Pe 20 ] 1 10 0 0
d e 8 6 6 2 4 8 6 2 4 8 e iv 1 32 64 9 9 il t 128 25 51 02 38 76 p e 1 204 40 81 m rpr 16 32 Co te In Cache size [records]
2005 © R. Leupers 26 Why care about C compilers?
¾ Embedded SW design becoming predominant manpower factor in system design ¾ Cannot develop/maintain millions of code lines in assembly language ¾ Move to high-level programming languages
2005 © R. Leupers 27 Why care about compilers?
¾ Trend towards heterogeneous multiprocessor systems-on- chip (MPSoC) ¾ Customized application specific instruction set processors (ASIPs) are key MPSoC components ¾ How to achieve efficient compiler support for ASIPs?
ASICASIC CPUCPU ASIPASIP
ASICASIC CPUCPU
ASIPASIP CPUCPU ASIPASIP
MemMem MemoryMemory MemoryMemory MemoryMemory
2005 © R. Leupers 28 C compiler in the exploration loop
„Compiler/Architecture Co-Design“ ¾ Efficient C-compilers cannot be designed for ARBITRARY architectures!
Application Compiler Processor Results Software
¾ Compiler and processor form a UNIT that needs to be optimized! ¾ “Compiler-friendliness“ needs to be taken into account during the architecture exploration!
2005 © R. Leupers 29 Retargetable compilers
Classical compiler Retargetable compiler source source processor code code model
Compiler processor Compiler model Compiler
asm asm code code
2005 © R. Leupers 30 GNU C compiler (gcc)
• Probably the most widespread retargetable compiler • Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too • Support for C/C++, Java, and other languages • Comes with comprehensive support software, e.g. runtime and standard libraries, debug support • Portable to new architectures by means of machine description file and C support routines
“The“The main main goal goal of of GCC GCC was was to to make make a a good, good, fast fast compiler compiler for for machinesmachines in in the the clas classs that that the the GN GNUU system system aims aims to to run run on: on: 32-bit 32-bit macmachhinineses that that address address 8-bit 8-bit bytes bytes and and have have several several general general registers. registers. Elegance,Elegance, theoretical theoretical power power and and simplicity simplicity are are only only secondary.” secondary.”
2005 © R. Leupers 31 LANCE (Univ. of Dortmund/RWTH Aachen)
¾ Modular retargetable C compiler system ¾ www.lancecompiler.com
LANCE library LANCE tools C frontend header file
lance2.h IR optimization 1
IR-C used by C++ library IR optimization n liblance2.a backend interface
2005 © R. Leupers 32 Executable C based IR in the LANCE compiler
C source code int A[10]; char *t1,*t3; symbol void f() int i,t2,t5,t6,t7,t8; table { int *t4; int i,A[10]; t3 = (char *)A; // cast base to char* i = A[2]++ t2 = 2 * 4; // compute offset > 1 ? t1 = t3 + t2; // compute eff addr t4 = (int *)t1; // cast back to int* 2 : 3; t5 = *t4; // load value from memory } array
t6 = t5 + 1; // increment access *t4 = t6; // store back into A[2] increment
t7 = t5 > 1; // compare if (t7) goto L1; // jump if > t8 = 3; // load 3 if <= conditional goto L2; // goto join point L1: t8 = 2; // load 2 if > expression L2: i = t8; // move result into i
2005 © R. Leupers 33 CoSy compiler system (ACE)
• Universal retargetable C/C++ compiler • Extensible intermediate representation (IR) • Modular compiler organization • Generator (BEG) for code selector, register allocator, © ACE - Associated scheduler Compiler Experts
2005 © R. Leupers 34 ACE CoSy system structure
.c .edl Control Flow through Compiler Parser Engine
IR-Description .sdl IR Optimizations
Code Generator Description datagen .cgd lower Architecture BEG match Specific Architecture Parameters sched Backend .tdf gra Engines emit
Backend components .asm
2005 © R. Leupers 35 LISATek C compiler generation
LISA processor model GUI
SYNTAXSYNTAX {{ ““AADDDD““ddsstt,, ssrrc1c1,, src2src2 } } Autom. analyses CODINCODINGG {{ 00bb00100100 dsdsttssrc1rc1 src2src2 }} BEHABEHAVIVIOROR {{ AALLU_U_readread(src1(src1,, src2)src2);; ALUALU__addadd();(); UpdateUpdate_fl_flaaggss();(); wwrriittebebaacckk(d(dst);st); Manual refinement }} SEMANTICSSEMANTICS {{ src1src1 ++ src2src2 ÆÆdsdstt;; }} …… CoSyCoSysystemsystem
CC CompilerCompiler
2005 © R. Leupers 36 LISATek compiler generation
C-Code ASM-Code
int a,b,c; LD R1, [R2] a = b+1; Code- Register- ADD R1, #1 Frontend Opt SchedulerBackend c = a<<3; Selector Allocator SHL R1, #3 … … FE DE EX Instruction- ADDALU … WB Fetch …#1 Write- Jump SUB MUL ADDSUBJMP Back DecodJMPer 2 1
PipelineADD Control 2 3Mem
Decoder
Prog Data Registers R[0..31]…R[i] … RAM RAM
2005 © R. Leupers 37 Compiled code quality: MIPS example
¾¾LISATekLISATekgeneratedgeneratedC-CompilerC-Compiler gccgccCC-Compiler-Compiler ¾gcc with MIPS32 4kc backend ¾¾Out-of-the-boxOut-of-the-boxC-CompilerC-Compiler ¾gcc with MIPS32 4kc backend ¾Used by most MIPS users ¾¾NoNo manual manualoptimizoptimizationsations ¾Used by most MIPS users ¾Large group of developers, ¾¾DevelopmentDevelopmenttimetime of of model model ¾Large group of developers, severalseveralmman-yearsan-yearsoof foptimization optimization approxapprox. .2 2 weeks weeks
Cycles Size
140.000.000 80.000
120.000.000 70.000
60.000 100.000.000
50.000 80.000.OverheadOverhead000 ofof 10%10% inin cyclecycleccountountaandnd 17%17% inin codecodedensitydensity Cycles 40.000 Size 60.000.000 30.000
40.000.000 20.000
20.000.000 10.000
0 0 gcc,-O4 gcc,-O2 cosy,-O4 cosy,-O2 gcc,-O4 gcc,-O2 cosy,-O4 cosy,-O2
2005 © R. Leupers 38 Demands on code quality
¾ Compilers for embedded processors have to generate extremely efficient code
Code size: » system-on-chip » on-chip RAM/ROM
Performance: » real-time constraints
Power/energy consumption: » heat dissipation » battery lifetime
2005 © R. Leupers 39 Compiler flexibility/code quality trade-off
variety of embedded unification processors
retargetable compilation
dedicated optimization specialization techniques
DSP NPU VLIW
2005 © R. Leupers 40 Adding processor-specific code optimizations
¾ High-level (compiler IR) Enabled by CoSy´s engine concept ¾ Low-level (ASM):
LISA C .C LISA C Unscheduled .C Compiler Unscheduled Compiler .asm.asm
Assembly API
Scheduled & Scheduled & OptimizationOptimization11 Optimized OptimizationOptimization22 Optimized OptimizationOptimization33 .asm.asm
Binary Code Generation
AssemblerAssembler LinkerLinker .out
2005 © R. Leupers 41 Embedded processors are not compiler-friendly
¾ Designed for efficiency ¾ E.g. fixed-point DSP data paths:
P D
TR MEM AX AY MX MY AF MF mult
PR ALU +,- * +,- AR ACCU MR
TI C25 ADSP210x ¾ Special purpose registers, constrained parallelism, ... ¾ Challenge for compilers that usually prefer orthogonal „compiler-friendly“ architectures
2005 © R. Leupers 42 DSP address code optimization
• Based on address generation unit (AGU) support • Address arithmetic in parallel to central data path instruction field
1 Support for auto-increment and auto-modify addressaddress registersregisters Examples: • TI C2x/C5x • Motorola 56000 +/-+/- modifymodify registersregisters • ADSP-210x • AMS Gepard core
MEMMEM
2005 © R. Leupers 43 DSP address code optimization
variable set: { a, b, c, d } access sequence: b, d, a, c, d, a, c, b, a, d, a, c, d
access graph: alphabetic optimized AR = 1 AR = 3 1 layout layout a b AR += 2 AR - - 4 0 a AR -= 3 0 c AR - - 3 1 AR += 2 AR - - 1 b 1 a 1 AR ++ AR += 2 c d 2 c AR -= 3 2 d AR - - 2 3 d AR += 2 3 b AR - - max. AR - - AR += 3 Hamiltonian path: AR - - AR -= 2 AR += 3 AR ++ a b AR -= 3 AR - - 4 AR += 2 AR - - 3 1 cost: 9 AR ++ cost: 5 AR += 2 c d www.address-code-optimization.org
2005 © R. Leupers 44 Source-level code optimization
Algorithm ––betterbetterperformanceperformance design ––llowerowercodecodesizesize ––hhighlyighlyrreusableeusable
C code generation
PoorPoorcodecode „Software „Software washing washing“ quality!quality! machine“ Compilation to assembly
2005 © R. Leupers 45 CoWare SPW example
Source (*buffer_45) = inst__9_coef * (*buffer_56); (*buffer_44) = inst__4_coef * (*buffer_56); (*buffer_43) = inst__8_coef * (*buffer_56); K1 K2(*buffer_58) = (*buffer_51)K3 + (*buffer_43); (*buffer_46) = inst__0_coef * (*buffer_58); (*buffer_59) = (*buffer_45) + (*buffer_46); (*buffer_47) = inst__10_coef * (*buffer_58);
-1 {/* RO: spb/dly.symbol-1 / 52 */ ∑ z ∑ (*buffer_52)z = inst__11__past_value;∑ Sink } {/* RI: spb/dly.symbol / 52 */ inst__11__past_value = (*buffer_59); K5 K4} (*buffer_60) = (*buffer_44) + (*buffer_52)+ (*buffer_47); inst__2__past_value = (*buffer_60);
¾ Very conservative, block-oriented code generation scheme ¾ Even simple code transformations have great impact, e.g. Variable localization If-statement merging Optimized data type selection
2005 © R. Leupers 46 4. ASIP architecture design
2005 © R. Leupers 47 ASIP implementation after exploration
2005 © R. Leupers 48 Unified Description Layer
L I S A
HDL Generation
Register-Transfer-Level Gate–Level Synthesis (e.g. SYNOPSYS design compiler) G a t e – L e v e l
2005 © R. Leupers 49 Challenges in Automated ASIP Implementation
ADL: HDL: Instructions
Arithmetic Control Multiplier (MUL)
Mul JMP 1:1 Mapping Mac BRC Multiplier (MAC)
Independent description of Independent mapping to instruction behavior: hardware blocks:
+ Efficient Design Space Exploration - Insufficient architectural efficiency by 1:1 mapping
2005 © R. Leupers 50 Unified Description Layer
L I S A
Structure & Mapping (incl. JTAG/DEBUG)
Optimizations Unified Description Layer
Backend (VHDL, Verilog, SystemC) Register-Transfer-Level Gate–Level Synthesis (e.g. SYNOPSYS design compiler) G a t e – L e v e l
2005 © R. Leupers 51 Optimization strategies
LISA: separate descriptions Instruction A Instruction B for separate instructions Goal: share hardware for separate instructions LISA LISA Operation A Operation B Mutual Exclusiveness ab cd
+ + Possible Optimizations • ALU Sharing x y
ac bd
+
x,y
2005 © R. Leupers 52 Optimization strategies
LISA: separate descriptions Instruction A Instruction B for separate instructions Goal: same hardware for separate instructions LISA LISA Operation A Operation B Mutual Exclusiveness
AddressA
…… Path PA DataA Possible Optimizations … • ALU Sharing Resource DataB Register Array Path P • Path Sharing Sharing B AddressB • ... …
DataA, DataB
AddressA Register Array AddressB 2005 © R. Leupers 53 5. Case study
2005 © R. Leupers 54 Motorola 6811
Project Goals:
• Performance (MIPS) must be increased
• Compatibility on the assembly level for reuse of legacy code (Integration into existing tool flow)
• Royalty free design
Î compatible architecture developed with LISA using RTL processor synthesis
2005 © R. Leupers 55 Motorola 6811
legacy code
compiler
assembly
assembler
0100101010011010 e 1110010110101111 reaasse InInccre e!!!!!! aanncce 0000110110110100 errfoformrm PPe PSS)) (M(MIIP
6812?6811
2005 © R. Leupers 56 Motorola 6811
Bluetooth app.
6811 compiler
assembly
assembly level assembler compatible
0100101010011010 1110010110101111 0000110110110100 LISA
Synthesized Architecture
2005 © R. Leupers 57 Architecture Development
original 6811 Processor LISA 6811 Processor
8 bit instructions 16 bit instructions 16 bit instructions 32 bit instructions 24 bit instructions 32 bit instructions 40 bit instructions
InstructionInstruction isis fetchedfetched byby 88 bitbit blocks:blocks: 1616 bitbit areare fetchedfetched simultaneously:simultaneously:
ÎÎupup toto 55 cyclescyclesforfor fetching!fetching! ÎÎmaxmax 22 cyclescyclesforfor fetching!fetching!
++ pipelinedpipelinedarchitecturearchitecture ++ possibilitypossibility forfor specialspecial instructionsinstructions
2005 © R. Leupers 58 Tools Flow and RTL Processor Synthesis
C-Application
6811 compiler LISA tools
Assembly LISA model LISA assembler 6811 compatible architecture generated completely in VHDL 1) VLSI Implementation: Area: <17kGates Executable Clock Speed: ~154 MHz 2) Mapped onto XILINX FPGA
2005 © R. Leupers 59 Retinex ASIP (digital image enhancement)
IN OUT
ASICASIC (0.18 (0.18 µm): µm): ••102102 kGates, kGates, 93 93 MHz MHz Virtex II FPGA board ASIPASIP (0.13 (0.13 µm): µm): ••124124 kGates, kGates, 154 154 MHz MHz ••SWSW programmable! programmable!
2005 © R. Leupers 60 6. Advanced research topics
2005 © R. Leupers 61 Generalized ASIP architecture design flow
Algorithm design (Matlab, SPW, ...)
C code generation or implementation Profiling
3% 14% 26%
0% 27%
1% 1% 20% 4% 3% 0% 1% initial architecture
architecture optimization
2005 © R. Leupers 62 A closer look at profilers: ASM level
Algorithm design (Matlab, SPW, ...)
C code generation or implementation
l initial ve le architecture M S A – requires full architecture description – slow (~1000x native C) architecture optimization
2005 © R. Leupers 63 A closer look at profilers: source level
Algorithm design sample app: image corner detection (Matlab, SPW, ...) gprof
C code generation or implementation hot spot
initial architecture Which instructions implement this code efficiently?
architecture optimization
2005 © R. Leupers 64 Do not neglect compiler optimizations
Algorithm design (Matlab, SPW, ...)
C code generation exec gcov or implementation count
p = in + (corner_list[n].y-1)*x_size + corner_list[n].x - 1; initial architecture • 5x ADD • 5x ADD • 2x SUB • 2x SUB • 3x MUL • 1x MUL architecture •2x LOAD •2x LOAD optimization Wrong• ... ISA decisions may• ... result! real code (optimized)
2005 © R. Leupers 65 µ-profiling approach
instrument code generate low-level C code count ADD count MUL
count LOAD
compile & execute
perform compiler optimizations
2005 © R. Leupers 66 Profiler features summary
C source assembly Micro- level (e.g. level (e.g. profiler gprof) LISATek)
primary Source code ISA and ISA and application optimization architecture architecture optimization optimization
needs No Yes No architectural details
speed High Low Medium
Profiling coarse fine fine granularity
2005 © R. Leupers 67 Micro-profiler in the design flow
¾Precise profiling of: Operator execution count Data type use Variable word lengths Memory access patterns ¾Guide designer in basic architecture decisions: Inclusion of dedicated function units (floating point, address gen.) Leave out unused instructions Optimal data path word lengths Design of memory hierarchy (cache, scratchpad) ¾Export profiling data for automatic ISA synthesis tool Effects of compiler optimizations predicted early
2005 © R. Leupers 68 Custom instruction set synthesis: recent approaches
¾ J. Yang, C. Kyung et al.: MetaCore: An Application Specific DSP Development System, DAC 1998 ¾ M. Gschwind: Instruction Set Selection for ASIP Design, CODES 1999 ¾ K. Kücükcakar: An ASIP Design Methodology for Embedded Systems, CODES 1999 ¾ H. Choi, C. Kyung et al.: Synthesis of Application Specific Instructions for Embedded DSP Software, IEEE Trans. CAD 1999 ¾ F. Sun, S. Ravi et al.: Synthesis of Custom Processors based on Extensible Platforms, ICCAD 2002 ¾ D. Goodwin, D. Petkov: Automatic Generation of Application Specific Processors, CASES 2003 ¾ K. Atasu, L. Pozzi, P. Ienne: Automatic Application-Specific Instruction-Set Extensions under Microarchitectural Constraints, DAC 2003 ¾ N. Clark, H. Zhong, S. Mahlke: Processor Acceleration through Automated Instruction Set Customization, MICRO 2003 ¾ ... and many others
2005 © R. Leupers 69 Atasu/Pozzi/Ienne (EPFL)
• Branch-and-Bound like algorithm for custom pattern identification under I/O constraints • Capable of identifying optimal complex and disconnected patterns • Requires fast and accurate cost estimation of speedup due to candidate patterns
2005 © R. Leupers 70 Custom ISA synthesis: ISS approach
PerformPerformaancence estimestimaationtion
optimal custom instr
PerformPerformaancence evalevaluatiuationon
ApplicationApplicationccodeode ProcessorProcessor profiling HotHot spot(s) spot(s) Semi-automSemi-automatiaticc generation profiling graph model ISA synthesis (GUI) generation (gprof,(gprof, µprof) µprof) graph model ISA synthesis (GUI) (e.g.(e.g. CorXpert) CorXpert)
Architectural UserUser Architectural constraints preferencespreferences constraints
2005 © R. Leupers 71 Abstract modeling: optimal data flow graph covering
+ *
instr 1 instr 3
>> - - * << ••PrecisePrecisemmathematicalathematicalproblemproblemfformulationormulation • Maximize application speedup under given architecture • Maximize application speedup under given arch+itecture instr 2andand area areaconstraintsconstraints • Solve optimization problem with combination of integer • Solve* optimization problem with combination of integer linearlinear pr programmingogrammingandand heur heuristicsistics - ••UsingUsingCCoWareoWareCCorXpertorXpertaass backend backend +
HOT SPOT /
2005 © R. Leupers 72 Outlook: reconfigurable ASIPs
¾ Currently: (one-time) configurable ASIPs ¾ Combination of ASIP and FPGA: “field-configurable ASIPs” Fast adaptations w/o HW modifications
Application specific Embedded FPGA components
Base processor architecture reconfigurable
2005 © R. Leupers 73 Outlook: compilation for heterogeneous MPSoC´s
¾ Urgently needed, but virtually not present TaskTask a a today TaskTask b b TaskTask d d ¾ No tools even for Task e simplest platforms Task e Task c (e.g. TI OMAP) Task c TaskTask f f
¾ Need for automated TaskTask g g spatial and temporal task-to-processor mapping
ASIASICC CPUCPU ASIASIPP
a.ca.c b.cb.c d.cd.c f.cf.c ASIASIPP CPCPUU ASIASIPP
g.cg.c c.cc.c e.ce.c MemMemoorryy MemMemoorryy MemMemoorryy
2005 © R. Leupers 74 References
¾ R. Leupers: Code Optimization Techniques for Embedded Processors - Methods, Algorithms, and Tools, Kluwer, 2000 ¾ R. Leupers, P. Marwedel: Retargetable Compiler Technology for Embedded Systems - Tools and Applications, Kluwer, 2001 ¾ A. Hoffmann, H. Meyr, R. Leupers: Architecture Exploration for Embedded Processors with LISA, Kluwer, 2002 ¾ C. Rowen, S. Leibson: Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors, Prentice Hall, 2004 ¾ M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal Methodology, Springer, 2005 ¾ P. Ienne, R. Leupers (eds.): Customizable and Configurable Embedded Processor Cores, Morgan Kaufmann, to appear 2006
2005 © R. Leupers 75 Thank you !
Institute for Integrated Signal Processing Systems