Design of Application Specific

Rainer Leupers RWTH Aachen University Software for Systems on Silicon [email protected]

Institute for Integrated Signal Processing Systems Overview: Geography

Berlin

Aachen

2005 © R. Leupers 2 About ISS

¾ RWTH Aachen is a top-ranked technical university in Germany ¾ ISS institute ƒ 3 professors (Meyr, Ascheid, Leupers) ƒ 18 Ph.D. students ƒ 5 staff ¾ Research on wireless communication systems ƒ tight industry cooperations ƒ origin of several EDA spin-off companies (e.g. Cadis, Axys, LISATek) ƒ SSS group focuses on embedded processor tools

2005 © R. Leupers 3 Overview

1. Introduction 2. ASIP design methodologies 3. Software tools 4. ASIP design 5. Case study 6. Advanced research topics

2005 © R. Leupers 4 1. Introduction

2005 © R. Leupers 5 design automation

¾ Embedded systems ƒ Special-purpose electronic devices ƒ Very different from desktop ¾ Strength of European IT market ƒ Telecom, consumer, automotive, medical, ... ƒ Siemens, Nokia, Bosch, Infineon, ... ¾ New design requirements ƒ Low NRE cost, high efficiency requirements ƒ Real-time operation, dependability ƒ Keep pace with Moore´s Law

2005 © R. Leupers 6 What to do with chip area ?

2005 © R. Leupers 7 Example: wireless multimedia terminals

¾ Multistandard radio ƒ UMTS ƒ GSM/GPRS/EDGE ƒ WLAN ƒ Bluetooth ƒ UWB ƒ … ¾ Multimedia standards Key issues: ƒ MPEG-4 Key issues: ƒ MP3 ••TimeTime toto marketmarket((≤≤1212 months)months) ƒ AAC • Flexibility (ongoing standard ƒ GPS • Flexibility (ongoing standard updates) ƒ DVB-H updates) ƒ … ••EfficiencyEfficiency(battery(batteryoperation)operation)

2005 © R. Leupers 8 Application specific processors (ASIPs)

„As the performance of conventional improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer...“

[M.J. Bass, C.M. Christensen: The Future of the Business, IEEE Spectrum 2002]

design budget = (semiconductor revenue) × (% for R&D) growth ≈ 15% ≈ 10%

# IC = (design budget) / (design cost per IC) growth ≈ 15% growth ≈ 50-100% [Keutzer05] → Customizable application specific processors as reusable, programmable platforms

2005 © R. Leupers 9 Efficiency and flexibility

General Purpose Digital Signal Processors Application Processors Specific Instruction Set Processors Field

SW Programmable 6 Devices WhyDesignuse ASIPs? . . . 10

• Higher efficiency for given range Application 5 D I S S I P A T I O N of applications Specific 10 ICs • IP protection • Cost reduction (no HWroyalties) Design Physically

Log F L E X I B I L I T Y Optimized • Product differentiation ICs Log P O W E R

Log P E R F O R M A N C E

103 . . . 104 Source: T.Noll, RWTH Aachen

2005 © R. Leupers 10 Standard-CPU vs. ASIP

2005 © R. Leupers 11 2. ASIP design methodologies

2005 © R. Leupers 12 ASIP architecture exploration

Application initial processor architecture Profiler Compiler

Simulator Assembler

optimized Linker processor architecture

2005 © R. Leupers 13 Expression (UC Irvine)

2005 © R. Leupers 14 Tensilica Xtensa/XPRES

Source: Tensilica Inc.

2005 © R. Leupers 15 MIPS CorXtend/CoWare CorXpert

Synthesize HW and profile with Replace MIPSsim Profile critical code and and 2 with special extensions H identify instruction User Defined Instruction o custom 3 t instructions + s p CorExtend Module o 1 t User Defined Instruction

2005 © R. Leupers 16 CoWare LISATek ASIP architecture exploration

¾ Integrated embedded processor development environment ¾ Unified processor model in LISA 2.0 architecture description language (ADL) ¾ Automatic generation of: ƒ SW tools ƒ HW models

2005 © R. Leupers 17 LISA operation hierarchy

main Reflects hierarchical organization of ISAs decode

addr cond opcode opnds

imm linear cycl control arithm move short long

add sub mul and or

2005 © R. Leupers 18 LISA operations structure

LISA operation References to other operations DECLARE Binary coding CODING Assembly syntax SYNTAX Resource access, e.g. registers

EXPRESSION Initiate “downstream” operations in pipe ACTIVATION Computation and processor state update BEHAVIOR C compiler generation SEMANTICS

2005 © R. Leupers 19 LISA operation example

OPERATION ADD { DECLARE { GROUP src1, src2, dest = { Register } } CODING { 0b1011 src1 src2 dest } SYNTAX { “ADD” dest “,” src1 “,” src2 }

BEHAVIOR { dest = src1 + src2; } C/C++ Code } OPERATION Register { DECLARE { LABEL index; ADD } CODING { index } src1 src2 dest SYNTAX { “R” index } EXPRESSION{ R[index] } } Register Register Register

2005 © R. Leupers 20 Exploration/ GUI

••ApplicationApplicationssimulationimulation ••DebuggingDebugging ••ProfilingProfiling ••ResourceResourceutilizationutilizationaanalysisnalysis ••PipelinePipeline analysisanalysis ••ProcessorProcessormodelmodelddebuggingebugging ••MemoryMemoryhierarchyhierarchyeexplorationxploration ••CodeCode coveragecoverageanalysisanalysis ••......

2005 © R. Leupers 21 Some available LISA 2.0 models

¾ DSP: •VLIW: ƒ Texas Instruments – Texas Instruments TMS320C54x TMS320C6x ƒ Analog Devices – STMicroelectronics ADSP21xx ST220 ƒ Motorola 56000 •µC: ¾ RISC: –MHS80C51 ƒ MIPS32 4K • ASIP: ƒ ESA LEON SPARC 8 – Infineon PP32 NPU ƒ ARM7100 – Infineon ICore ƒ ARM926 – MorphICs DSP

2005 © R. Leupers 22 3. Software tools

2005 © R. Leupers 23 Tools generated from processor ADL model

Application

Profiler Compiler

Simulator Assembler

Linker

2005 © R. Leupers 24 Instruction set simulation

Interpretive:

•flexible Decode Execute Application Instruction • slow (~ 100 KIPS) Memory

RunRun--TimTimee Compiled: • fast (> 10 MIPS) Instruction Behavior Simulation • inflexible Application Compiler Execute Compiled Program • high memory Simulation Memory

consumption CCoompilmpilee--TimeTime RunRun--TimTimee JIT-CCS™: • „just-in-time“ Instruction Instruction Behavior Application Decode Compiled Execute compiled Program Simulation Memory • SW simulation cache RunRun--TimTimee • fast and flexible

2005 © R. Leupers 25 JIT-CC simulation performance

• Dependent on simulation cache size • 95% of compiled simulation performance @ 4096 cache blocks (10% memory consumption of compiled sim.) • Example: ST200 VLIW DSP 9 100 8 90 C ]

80 a

S 7 c h IP 70 6 eM [M 60 i

5 s ce sR

n 50

a 4 a m 40 r t i o

3 o 30 rf [ 2 % Pe 20 ] 1 10 0 0

d e 8 6 6 2 4 8 6 2 4 8 e iv 1 32 64 9 9 il t 128 25 51 02 38 76 p e 1 204 40 81 m rpr 16 32 Co te In Cache size [records]

2005 © R. Leupers 26 Why care about C compilers?

¾ Embedded SW design becoming predominant manpower factor in system design ¾ Cannot develop/maintain millions of code lines in ¾ Move to high- programming languages

2005 © R. Leupers 27 Why care about compilers?

¾ Trend towards heterogeneous multiprocessor systems-on- chip (MPSoC) ¾ Customized application specific instruction set processors (ASIPs) are key MPSoC components ¾ How to achieve efficient compiler support for ASIPs?

ASICASIC CPUCPU ASIPASIP

ASICASIC CPUCPU

ASIPASIP CPUCPU ASIPASIP

MemMem MemoryMemory MemoryMemory MemoryMemory

2005 © R. Leupers 28 C compiler in the exploration loop

„Compiler/Architecture Co-Design“ ¾ Efficient C-compilers cannot be designed for ARBITRARY architectures!

Application Compiler Processor Results Software

¾ Compiler and processor form a UNIT that needs to be optimized! ¾ “Compiler-friendliness“ needs to be taken into account during the architecture exploration!

2005 © R. Leupers 29 Retargetable compilers

Classical compiler Retargetable compiler source source processor code code model

Compiler processor Compiler model Compiler

asm asm code code

2005 © R. Leupers 30 GNU C compiler (gcc)

• Probably the most widespread retargetable compiler • Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too • Support for C/C++, Java, and other languages • Comes with comprehensive support software, e.g. runtime and standard libraries, debug support • Portable to new architectures by means of machine description file and C support routines

“The“The main main goal goal of of GCC GCC was was to to make make a a good, good, fast fast compiler compiler for for machinesmachines in in the the clas classs that that the the GN GNUU system system aims aims to to run run on: on: 32- 32-bit macmachhinineses that that address address 8-bit 8-bit bytes bytes and and have have several several general general registers. registers. Elegance,Elegance, theoretical theoretical power power and and simplicity simplicity are are only only secondary.” secondary.”

2005 © R. Leupers 31 LANCE (Univ. of Dortmund/RWTH Aachen)

¾ Modular retargetable C compiler system ¾ www.lancecompiler.com

LANCE library LANCE tools C frontend header file

lance2.h IR optimization 1

IR-C used by C++ library IR optimization n liblance2.a backend interface

2005 © R. Leupers 32 Executable C based IR in the LANCE compiler

C source code int A[10]; char *t1,*t3; symbol void f() int i,t2,t5,t6,t7,t8; table { int *t4; int i,A[10]; t3 = (char *)A; // cast base to char* i = A[2]++ t2 = 2 * 4; // compute offset > 1 ? t1 = t3 + t2; // compute eff addr t4 = (int *)t1; // cast back to int* 2 : 3; t5 = *t4; // load value from memory } array

t6 = t5 + 1; // increment access *t4 = t6; // store back into A[2] increment

t7 = t5 > 1; // compare if (t7) goto L1; // jump if > t8 = 3; // load 3 if <= conditional goto L2; // goto join point L1: t8 = 2; // load 2 if > expression L2: i = t8; // move result into i

2005 © R. Leupers 33 CoSy compiler system (ACE)

• Universal retargetable C/C++ compiler • Extensible intermediate representation (IR) • Modular compiler organization • Generator (BEG) for code selector, register allocator, © ACE - Associated scheduler Compiler Experts

2005 © R. Leupers 34 ACE CoSy system structure

.c .edl Control Flow through Compiler Parser Engine

IR-Description .sdl IR Optimizations

Code Generator Description datagen .cgd lower Architecture BEG match Specific Architecture Parameters sched Backend .tdf gra Engines emit

Backend components .asm

2005 © R. Leupers 35 LISATek C compiler generation

LISA processor model GUI

SYNTAXSYNTAX {{ ““AADDDD““ddsstt,, ssrrc1c1,, src2src2 } } Autom. analyses CODINCODINGG {{ 00bb00100100 dsdsttssrc1rc1 src2src2 }} BEHABEHAVIVIOROR {{ AALLU_U_readread(src1(src1,, src2)src2);; ALUALU__addadd();(); UpdateUpdate_fl_flaaggss();(); wwrriittebebaacckk(d(dst);st); Manual refinement }} SEMANTICSSEMANTICS {{ src1src1 ++ src2src2 ÆÆdsdstt;; }} …… CoSyCoSysystemsystem

CC CompilerCompiler

2005 © R. Leupers 36 LISATek compiler generation

C-Code ASM-Code

int a,b,c; LD R1, [R2] a = b+1; Code- Register- ADD R1, #1 Frontend Opt SchedulerBackend c = a<<3; Selector Allocator SHL R1, #3 … … FE DE EX Instruction- ADDALU … WB Fetch …#1 Write- Jump SUB MUL ADDSUBJMP Back DecodJMPer 2 1

PipelineADD Control 2 3Mem

Decoder

Prog Data Registers R[0..31]…R[i] … RAM RAM

2005 © R. Leupers 37 Compiled code quality: MIPS example

¾¾LISATekLISATekgeneratedgeneratedC-CompilerC-Compiler gccgccCC-Compiler-Compiler ¾gcc with MIPS32 4kc backend ¾¾Out-of-the-boxOut-of-the-boxC-CompilerC-Compiler ¾gcc with MIPS32 4kc backend ¾Used by most MIPS users ¾¾NoNo manual manualoptimizoptimizationsations ¾Used by most MIPS users ¾Large group of developers, ¾¾DevelopmentDevelopmenttimetime of of model model ¾Large group of developers, severalseveralmman-yearsan-yearsoof foptimization optimization approxapprox. .2 2 weeks weeks

Cycles Size

140.000.000 80.000

120.000.000 70.000

60.000 100.000.000

50.000 80.000.OverheadOverhead000 ofof 10%10% inin cyclecycleccountountaandnd 17%17% inin codecodedensitydensity Cycles 40.000 Size 60.000.000 30.000

40.000.000 20.000

20.000.000 10.000

0 0 gcc,-O4 gcc,-O2 cosy,-O4 cosy,-O2 gcc,-O4 gcc,-O2 cosy,-O4 cosy,-O2

2005 © R. Leupers 38 Demands on code quality

¾ Compilers for embedded processors have to generate extremely efficient code

ƒ Code size: » system-on-chip » on-chip RAM/ROM

ƒ Performance: » real-time constraints

ƒ Power/energy consumption: » heat dissipation » battery lifetime

2005 © R. Leupers 39 Compiler flexibility/code quality trade-off

variety of embedded unification processors

retargetable compilation

dedicated optimization specialization techniques

DSP NPU VLIW

2005 © R. Leupers 40 Adding processor-specific code optimizations

¾ High-level (compiler IR) ƒ Enabled by CoSy´s engine concept ¾ Low-level (ASM):

LISA C .C LISA C Unscheduled .C Compiler Unscheduled Compiler .asm.asm

Assembly API

Scheduled & Scheduled & OptimizationOptimization11 Optimized OptimizationOptimization22 Optimized OptimizationOptimization33 .asm.asm

Binary Code Generation

AssemblerAssembler LinkerLinker .out

2005 © R. Leupers 41 Embedded processors are not compiler-friendly

¾ Designed for efficiency ¾ E.g. fixed-point DSP data paths:

P D

TR MEM AX AY MX MY AF MF mult

PR ALU +,- * +,- AR ACCU MR

TI C25 ADSP210x ¾ Special purpose registers, constrained parallelism, ... ¾ Challenge for compilers that usually prefer orthogonal „compiler-friendly“ architectures

2005 © R. Leupers 42 DSP address code optimization

• Based on (AGU) support • Address arithmetic in parallel to central data path instruction field

1 Support for auto-increment and auto-modify addressaddress registersregisters Examples: • TI C2x/C5x • Motorola 56000 +/-+/- modifymodify registersregisters • ADSP-210x • AMS Gepard core

MEMMEM

2005 © R. Leupers 43 DSP address code optimization

variable set: { a, b, c, d } access sequence: b, d, a, c, d, a, c, b, a, d, a, c, d

access graph: alphabetic optimized AR = 1 AR = 3 1 layout layout a b AR += 2 AR - - 4 0 a AR -= 3 0 c AR - - 3 1 AR += 2 AR - - 1 b 1 a 1 AR ++ AR += 2 c d 2 c AR -= 3 2 d AR - - 2 3 d AR += 2 3 b AR - - max. AR - - AR += 3 Hamiltonian path: AR - - AR -= 2 AR += 3 AR ++ a b AR -= 3 AR - - 4 AR += 2 AR - - 3 1 cost: 9 AR ++ cost: 5 AR += 2 c d www.address-code-optimization.org

2005 © R. Leupers 44 Source-level code optimization

Algorithm ––betterbetterperformanceperformance design ––llowerowercodecodesizesize ––hhighlyighlyrreusableeusable

C code generation

PoorPoorcodecode „Software „Software washing washing“ quality!quality! machine“ Compilation to assembly

2005 © R. Leupers 45 CoWare SPW example

Source (*buffer_45) = inst__9_coef * (*buffer_56); (*buffer_44) = inst__4_coef * (*buffer_56); (*buffer_43) = inst__8_coef * (*buffer_56); K1 K2(*buffer_58) = (*buffer_51)K3 + (*buffer_43); (*buffer_46) = inst__0_coef * (*buffer_58); (*buffer_59) = (*buffer_45) + (*buffer_46); (*buffer_47) = inst__10_coef * (*buffer_58);

-1 {/* RO: spb/dly.symbol-1 / 52 */ ∑ z ∑ (*buffer_52)z = inst__11__past_value;∑ Sink } {/* RI: spb/dly.symbol / 52 */ inst__11__past_value = (*buffer_59); K5 K4} (*buffer_60) = (*buffer_44) + (*buffer_52)+ (*buffer_47); inst__2__past_value = (*buffer_60);

¾ Very conservative, block-oriented code generation scheme ¾ Even simple code transformations have great impact, e.g. ƒ Variable localization ƒ If-statement merging ƒ Optimized data type selection

2005 © R. Leupers 46 4. ASIP architecture design

2005 © R. Leupers 47 ASIP implementation after exploration

2005 © R. Leupers 48 Unified Description Layer

L I S A

HDL Generation

Register-Transfer-Level Gate–Level Synthesis (e.g. SYNOPSYS design compiler) G a t e – L e v e l

2005 © R. Leupers 49 Challenges in Automated ASIP Implementation

ADL: HDL: Instructions

Arithmetic Control Multiplier (MUL)

Mul JMP 1:1 Mapping Mac BRC Multiplier (MAC)

Independent description of Independent mapping to instruction behavior: hardware blocks:

+ Efficient Design Exploration - Insufficient architectural efficiency by 1:1 mapping

2005 © R. Leupers 50 Unified Description Layer

L I S A

Structure & Mapping (incl. JTAG/DEBUG)

Optimizations Unified Description Layer

Backend (VHDL, , SystemC) Register-Transfer-Level Gate–Level Synthesis (e.g. SYNOPSYS design compiler) G a t e – L e v e l

2005 © R. Leupers 51 Optimization strategies

LISA: separate descriptions Instruction A Instruction B for separate instructions Goal: share hardware for separate instructions LISA LISA Operation A Operation B Mutual Exclusiveness ab cd

+ + Possible Optimizations • ALU Sharing x y

ac bd

+

x,y

2005 © R. Leupers 52 Optimization strategies

LISA: separate descriptions Instruction A Instruction B for separate instructions Goal: same hardware for separate instructions LISA LISA Operation A Operation B Mutual Exclusiveness

AddressA

…… Path PA DataA Possible Optimizations … • ALU Sharing Resource DataB Register Array Path P • Path Sharing Sharing B AddressB • ... …

DataA, DataB

AddressA Register Array AddressB 2005 © R. Leupers 53 5. Case study

2005 © R. Leupers 54 Motorola 6811

Project Goals:

• Performance (MIPS) must be increased

• Compatibility on the assembly level for reuse of legacy code (Integration into existing tool flow)

• Royalty free design

Î compatible architecture developed with LISA using RTL processor synthesis

2005 © R. Leupers 55 Motorola 6811

legacy code

compiler

assembly

assembler

0100101010011010 e 1110010110101111 reaasse InInccre e!!!!!! aanncce 0000110110110100 errfoformrm PPe PSS)) (M(MIIP

6812?6811

2005 © R. Leupers 56 Motorola 6811

Bluetooth app.

6811 compiler

assembly

assembly level assembler compatible

0100101010011010 1110010110101111 0000110110110100 LISA

Synthesized Architecture

2005 © R. Leupers 57 Architecture Development

original 6811 Processor LISA 6811 Processor

8 bit instructions 16 bit instructions 16 bit instructions 32 bit instructions 24 bit instructions 32 bit instructions 40 bit instructions

InstructionInstruction isis fetchedfetched byby 88 bitbit blocks:blocks: 1616 bitbit areare fetchedfetched simultaneously:simultaneously:

ÎÎupup toto 55 cyclescyclesforfor fetching!fetching! ÎÎmaxmax 22 cyclescyclesforfor fetching!fetching!

++ pipelinedpipelinedarchitecturearchitecture ++ possibilitypossibility forfor specialspecial instructionsinstructions

2005 © R. Leupers 58 Tools Flow and RTL Processor Synthesis

C-Application

6811 compiler LISA tools

Assembly LISA model LISA assembler 6811 compatible architecture generated completely in VHDL 1) VLSI Implementation: Area: <17kGates Executable Clock Speed: ~154 MHz 2) Mapped onto XILINX FPGA

2005 © R. Leupers 59 Retinex ASIP (digital image enhancement)

IN OUT

ASICASIC (0.18 (0.18 µm): µm): ••102102 kGates, kGates, 93 93 MHz MHz Virtex II FPGA board ASIPASIP (0.13 (0.13 µm): µm): ••124124 kGates, kGates, 154 154 MHz MHz ••SWSW programmable! programmable!

2005 © R. Leupers 60 6. Advanced research topics

2005 © R. Leupers 61 Generalized ASIP architecture

Algorithm design (Matlab, SPW, ...)

C code generation or implementation Profiling

3% 14% 26%

0% 27%

1% 1% 20% 4% 3% 0% 1% initial architecture

architecture optimization

2005 © R. Leupers 62 A closer look at profilers: ASM level

Algorithm design (Matlab, SPW, ...)

C code generation or implementation

l initial ve le architecture M S A – requires full architecture description – slow (~1000x native C) architecture optimization

2005 © R. Leupers 63 A closer look at profilers: source level

Algorithm design sample app: image corner detection (Matlab, SPW, ...) gprof

C code generation or implementation hot spot

initial architecture Which instructions implement this code efficiently?

architecture optimization

2005 © R. Leupers 64 Do not neglect compiler optimizations

Algorithm design (Matlab, SPW, ...)

C code generation exec gcov or implementation count

p = in + (corner_list[n].y-1)*x_size + corner_list[n].x - 1; initial architecture • 5x ADD • 5x ADD • 2x SUB • 2x SUB • 3x MUL • 1x MUL architecture •2x LOAD •2x LOAD optimization Wrong• ... ISA decisions may• ... result! real code (optimized)

2005 © R. Leupers 65 µ-profiling approach

instrument code generate low-level C code count ADD count MUL

count LOAD

compile & execute

perform compiler optimizations

2005 © R. Leupers 66 Profiler features summary

C source assembly Micro- level (e.g. level (e.g. profiler gprof) LISATek)

primary Source code ISA and ISA and application optimization architecture architecture optimization optimization

needs No Yes No architectural details

speed High Low Medium

Profiling coarse fine fine granularity

2005 © R. Leupers 67 Micro-profiler in the design flow

¾Precise profiling of: ƒ Operator execution count ƒ Data type use ƒ Variable word lengths ƒ Memory access patterns ¾Guide in basic architecture decisions: ƒ Inclusion of dedicated function units (floating point, address gen.) ƒ Leave out unused instructions ƒ Optimal data path word lengths ƒ Design of (cache, scratchpad) ¾Export profiling data for automatic ISA synthesis tool ƒ Effects of compiler optimizations predicted early

2005 © R. Leupers 68 Custom instruction set synthesis: recent approaches

¾ J. Yang, C. Kyung et al.: MetaCore: An Application Specific DSP Development System, DAC 1998 ¾ M. Gschwind: Instruction Set Selection for ASIP Design, CODES 1999 ¾ K. Kücükcakar: An ASIP Design Methodology for Embedded Systems, CODES 1999 ¾ H. Choi, C. Kyung et al.: Synthesis of Application Specific Instructions for Embedded DSP Software, IEEE Trans. CAD 1999 ¾ F. Sun, S. Ravi et al.: Synthesis of Custom Processors based on Extensible Platforms, ICCAD 2002 ¾ D. Goodwin, D. Petkov: Automatic Generation of Application Specific Processors, CASES 2003 ¾ K. Atasu, L. Pozzi, P. Ienne: Automatic Application-Specific Instruction-Set Extensions under Microarchitectural Constraints, DAC 2003 ¾ N. Clark, H. Zhong, S. Mahlke: Processor Acceleration through Automated Instruction Set Customization, MICRO 2003 ¾ ... and many others

2005 © R. Leupers 69 Atasu/Pozzi/Ienne (EPFL)

• Branch-and-Bound like algorithm for custom pattern identification under I/O constraints • Capable of identifying optimal complex and disconnected patterns • Requires fast and accurate cost estimation of speedup due to candidate patterns

2005 © R. Leupers 70 Custom ISA synthesis: ISS approach

PerformPerformaancence estimestimaationtion

optimal custom instr

PerformPerformaancence evalevaluatiuationon

ApplicationApplicationccodeode ProcessorProcessor profiling HotHot spot(s) spot(s) Semi-automSemi-automatiaticc generation profiling graph model ISA synthesis (GUI) generation (gprof,(gprof, µprof) µprof) graph model ISA synthesis (GUI) (e.g.(e.g. CorXpert) CorXpert)

Architectural UserUser Architectural constraints preferencespreferences constraints

2005 © R. Leupers 71 Abstract modeling: optimal data flow graph covering

+ *

instr 1 instr 3

>> - - * << ••PrecisePrecisemmathematicalathematicalproblemproblemfformulationormulation • Maximize application speedup under given architecture • Maximize application speedup under given arch+itecture instr 2andand area areaconstraintsconstraints • Solve optimization problem with combination of integer • Solve* optimization problem with combination of integer linearlinear pr programmingogrammingandand heur heuristicsistics - ••UsingUsingCCoWareoWareCCorXpertorXpertaass backend backend +

HOT SPOT /

2005 © R. Leupers 72 Outlook: reconfigurable ASIPs

¾ Currently: (one-time) configurable ASIPs ¾ Combination of ASIP and FPGA: ƒ “field-configurable ASIPs” ƒ Fast adaptations w/o HW modifications

Application specific Embedded FPGA components

Base processor architecture reconfigurable

2005 © R. Leupers 73 Outlook: compilation for heterogeneous MPSoC´s

¾ Urgently needed, but virtually not present TaskTask a a today TaskTask b b TaskTask d d ¾ No tools even for Task e simplest platforms Task e Task c (e.g. TI OMAP) Task c TaskTask f f

¾ Need for automated TaskTask g g spatial and temporal task-to-processor mapping

ASIASICC CPUCPU ASIASIPP

a.ca.c b.cb.c d.cd.c f.cf.c ASIASIPP CPCPUU ASIASIPP

g.cg.c c.cc.c e.ce.c MemMemoorryy MemMemoorryy MemMemoorryy

2005 © R. Leupers 74 References

¾ R. Leupers: Code Optimization Techniques for Embedded Processors - Methods, , and Tools, Kluwer, 2000 ¾ R. Leupers, P. Marwedel: Retargetable Compiler Technology for Embedded Systems - Tools and Applications, Kluwer, 2001 ¾ A. Hoffmann, H. Meyr, R. Leupers: Architecture Exploration for Embedded Processors with LISA, Kluwer, 2002 ¾ C. Rowen, S. Leibson: the Complex SoC: Fast, Flexible Design with Configurable Processors, Prentice Hall, 2004 ¾ M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal Methodology, Springer, 2005 ¾ P. Ienne, R. Leupers (eds.): Customizable and Configurable Embedded Processor Cores, Morgan Kaufmann, to appear 2006

2005 © R. Leupers 75 Thank you !

Institute for Integrated Signal Processing Systems