<<

MPSoC with multi Configurable Processors and Design Environment

Jack, Kazutoshi Wakabayashi EDA R&D center, Central Res. Labs. NEC corp. and chief architect, CWB business promotion, NEC System Technology, Ltd. System IP Core Res. Labs, NEC Corp, Visiting Professor, JAIST

URL:www.cyberworkbench.com Email: [email protected] June 28 (Th) 2008 Agenda One feature of NEC’s ASSP (DAV,Mobile) is wide usage of C-based synthesis and verification. This impacts ASSP architecture and these five years. (We applied Behavior synthesis even for CPUs) Keywords: 1. Multiple Configurable Processors (custom CPU + custom DSP) 2. Dynamic Reconfigurable Processors (flexible Hardware and controller) 3. C-based Synthesis and Verification 4. Reverse Trend : SW to HW HW solution for RT operation, power efficiency

K. Wakabayashi © NEC Corporation 2007 2 Global trend: Roll of SW(CPU) is increasing

ROMROM IFIF SDRAMSDRAM IFIF MPEG2MPEG2 A/VA/V GraphicsGraphics DecoderDecoder TimerTimer UARTUART TSTS DEMUXDEMUX

SmartCardSmartCard VideoVideo EncoderEncoder CPUCPU DescramblerDescrambler HighSpeedPortHighSpeedPort VideoDACVideoDAC For Low Power Multi Custom Processors

SW: just controlling HW blocks uPD61030 (1998) CPU SW: DSP MPEG2MPEG2 StreamStream MPEG2MPEG2 MIPSMIPS DisplayDisplay VideoVideo function Enc.Enc. Proc.Proc. Dec.Dec. DSPDSP Proc.Proc. EncoderEncoder

DSP DSP

Custom CPU,DSP

1998 2000 2002 2004 2007

K. Wakabayashi © NEC Corporation 2007 3 Global trend: MPSoC: Software-oriented system

V850E General CPU: for embedded SW : a few MPSoC Custom CPU,DSP configurable Proc. : ~20

MIPS: MIPS: 4KE 4KE V850E

EMMA2RL 12

EMMA2R Custom V_Bus 14 9 DSP 12 EMMA2 . 10 8 7 6 SIF EMMA1 4 Custom 2 Custom 0 3 CPU DSP 1998年11999 2000 1999年 2001 2001 2000年 2003 2001年 2004 2005 2002年 2003年 2004年 2005年 Custom CPU EMMArchitecture Roadmap

K. Wakabayashi © NEC Corporation 2007 4 Custom CPUs exist usually inside HW module # of “custom CPU” is 20 or more, but they are not Main CPUs. Main CPU is just a few… (not so-called MPSoC) (though, custom CPU communicates with Main CPU)

Main General CPU CPU OS (RTOS)

HI/F OUTPUT HI/F OUTPUT HI/F OUTPUT HI/F OUTPUT

INPUT INPUT INPUT INPUT Local ローカルCustom CPU Local Local Mem メモリ Mem. Mem CPU CPU CPU CPU

アクセラ Accele- DMAC accelerator DMAC HW DMAC Accelerator DMAC レータ rator module

K. Wakabayashi © NEC Corporation 2007 5 Our ASSP architecture A few General CPUs + Multi Custom Processors + HW modules (+ DRP ) General General CPU CPU

MIPS custom General CPU DRP: DSP Dynamic (every 1ns) reconfigurable (multi context) custom DRP FPGA or Programable DSP . array MEM Custom Custom CPU CPU Custom Custom CPU CPU

Custom CPU, DRP, HW modules are based on C-synthesis. So, they can be replace into another. e.g. C source for DRP can be synthesized into pure HW!! CPU tasks are implemented in DRP (really happens often)!!

K. Wakabayashi © NEC Corporation 2007 6 “All-in-C” : All Modules with C-based Synthesis

FLASH DMAC HOSTHOST CPUCPU DMAC ROM ARM,V850ARM,V850等等 (( HOST)HOST) CPU

HIFHIF GPIOGPIO DSPDSP SRAM SPX,TISPX,TI等等 DSP BUS

BRIDGEBRIDGE (HOST- (HOST- SPRAM DSP)DSP) DMACDMAC (( APPLIAPPLI ))

APPLI BUS SDRAM BRIDGEBRIDGE ROTATORROTATOR RESIZERRESIZER LCDCLCDC BRIDGEBRIDGE (( DSP-DSP- (( HOST-HOST- APPLIAPPLI )) APPLIAPPLI ))

Control Dominated circuit

K. Wakabayashi © NEC Corporation 2007 7 1. Multi Custom Processors Types of CPUs on our ASSP

RTOS, C developing environment ARM (, Performance analysis, etc) MIPS General CPU , V850 C Open for “SW engineer” or ASSP user

General CPUs Instr. Set Semi- CPUs for + Special Inst. Set Custom -Tel com, CPU (Environment: same as above) -Digital AV Assembler C-synthesis (video, audio, No OS, Assembler lang. graphic, etc ) Custom CPU Full Custom High Performance -Encryption CPU / DSP Less Memory size etc. “HW engineer” or ASSP supplier “screw&nail”

K. Wakabayashi © NEC Corporation 2007 8 Merits of Custom CPU/DSP 1) HW control (sequencer)->SW control more flexible, debuggable after chip fab. 2) Special Instruction Set : higher Performance Low Power and low clock frequencies, memory save Requirements for our favorite custom CPU/DSP - Real time function : No , avoid bus collision - HW-like Performance for special procedures - Smaller size of Memory (Instruction & Data) - Bit handing, Branch with bit - read/write of I/O reg. - Easy to design/ Modify ( few weeks for design)

K. Wakabayashi © NEC Corporation 2007 9 Types of Additional Instruction Set zALU extension :Easy, but very limited Additional IS CPU ●Co-Processor Co-processor core Pros: Easy even with RTL based Design Cons: no resource sharing among co-processors and core-CPU slow data transfer CPU-core and Co-processor(e.g. memory mapped IO) ●Complex Instruction into CPU core Pro: sharing registers, FUs with core CPU and Extended Instructions Direct read of multi public register -> higher performance CPU Cons: CPU core has to be changed core

(difficult for RTL design, timing closure ) Additonal (CPU core should be described in C, C-synthesis!) IS “C-base behavior synthesis” supports all types

K. Wakabayashi © NEC Corporation 2007 10 Tool Flow for Custom CPUs

C behavior of Custom CPU

C (BDL)

Behevioral Synthesizer Cyber

Cycle Accurate Models For other HW modules RTL C++ + General CPU models , VHDL Cycle Accurate++ C++ Sim.Model HW-SW co-sim FPGA emu. (ISS)

K. Wakabayashi © NEC Corporation 2007 11 Configurable Processors Generation RTL-based vs Behavioral-synthesis-based

Logic Synthesis based Behavioral Synthesis based

Adding Instructions : RTL Co-processors Adding Instructions: Melted into Core Behavioral C Processor A B C D A B core Processor core Smaller Beh.Syn. Lower power C D

Special IS: Limited Special IS: less limited Base Processor is not flexible Base Processor is flexible

K. Wakabayashi © NEC Corporation 2007 12 EffectsEffects ofof ResourceResource SharingSharing

Base I.S. : 81 Adding I.S. : 24 Base Processor : 33.9K Gate For After Adding IS: 42.6K Gate e.g.CRC, check-sum, only 8.7K(25%) increase start-code check...

Kind Before After Necessary Bit width Diff. Effects of FU sharing among adding adding FUs in new IS >> 32 1 3 2 8 >> 33 1 1 0 0 Base IS’s and Adding IS’s + 3 1 1 0 0 + 12 2 2 0 0 + 14 0 1 1 4 + 16 4 5 1 3 + 32 3 3 0 6 <= 14 4 4 0 0 <= 32 1 1 0 0 Adding IS’s include - 3 1 1 0 0 - 5 0 1 1 2 Six Add Operations(32bit), - 32 2 2 0 0 << 32 1 1 0 0 But No FU increases << 33 1 1 0 0 * 8 1 1 0 0 > 16 0 2 2 2 > 32 1 1 0 0 Adding extra Instractions : >= 32 1 1 0 0 < 16 0 2 2 2 Not big area impact like co-processors < 32 1 1 0 0

K. Wakabayashi © NEC Corporation 2007 13 Summary for our custom CPU/DSP design

We are “C-maniac” all modules are C-synthesis: this changes ASSP platform 1) CPU is described in behavioral C, and RTL is synthesized with our behavioral synthesizer (Cyber) 2) Additional Instruction set is also described in behavioral C 1)CPU basic Instractions and 2)additional Instructions share resources. 3) ISS of the CPU is also generated with Cyber, used for entire SoC including HW modules is constructed 4) C or C-like structured assembler ( semi-custom CPU case, RTOS and development env. can be used. Additional IS is handled as macro instruction.)

K. Wakabayashi © NEC Corporation 2007 14 What size of additional Instruction is appropriate?

Fine grain Instruction : Flexible, but low performance Coarse grain Instruction: HW like performance, but low flexibility

1) Find bottleneck contains many general CPU instructions.

2) Hierarchical Instruction (for flexibility) ex. Level 3 : function level instruction (DES encryption) Level 2: SIMD, Protocol, combination (2bit shift&mask) Level 1: Reg access, single ALU, testing, status checks

Bug, Spec. Change : Insted of ECO, HW bugs in level 3, can be realized with level 2 or 1. ( use current chip until next ES)

K. Wakabayashi © NEC Corporation 2007 15 Example of Additional Instruction basic example:several codes -> one instruction e.g. Huffman Decoding Instruction Only with Basic Instractions Additional Instruction which Original CPU core has tmp32A = FAST_LEN_TABLE; tmp32A = memr32(tmp16a + tmp32A); tmp16b = tmp32A >> 16; tmp16a = tmp32A; Len = VLD_CalcLenDist (tmp32A, TableData1, 1); if (tmp16b != 0) { tmp16b = VLD_GetBitReverse (tmp16b); Huffman } CPU Len = tmp16a + tmp16b; Instruction

7 instructions → 1 instructions 7 cycles 2 cycles ☆Large procedure containing hundreds instructions, might be synthesized as a HW accelerator. (e.g. highly parallel=many FUs, low power, routability)

K. Wakabayashi © NEC Corporation 2007 16 Additional IS containing control flow: special branch, loop instruction branch with bit calculation: common for DAV, mobile

Function: If 25th bit of R3 is 0, then goto L1

Base Instructions New Instruction For custom CPU mov r2,0xfeffffff

and r3,r2 If(bit(r3,25)==0) goto L1 jz L1 1 instruction 3 instruction (1 cycle ) (3 cycle)

K. Wakabayashi © NEC Corporation 2007 17 Result: “ Screw & snail” CPU

C(BDL) 2,267 Line Basic:81, RTL(verilog) 12,985 Line Special:24 # of IS 105 clock 108 MHz Gate size 42.6 K Gate Less than 1% area -> dozens is fine, but for 1000, 42KG is too large. Man Power (design) 3.0 MM First Trial (C++ simulation) 3.5 MM Next version took Much less Effects of C-based design Just inserting Specials 1. Flexible enough to follow spec changes 2.HW/SW co-design SW and HW guys discussed on additional ISs 3 Smaller circuits with resource sharing

K. Wakabayashi © NEC Corporation 2007 18 Comparison of Custom Processors with C-Synthesis vs. Manual RTL Configurable Behavior C RTL Rate Processor Code Size 1.3KL 9.2KL 7.6X

Simulation 61Kc/s 0.3K 203X (*1) Gate Size 19KG 18KG ~5%+

“Screw&nail” processor should be with C-synthesis, since custom CPU design effort should be small enough!

*1 Pentium3@1GHz, Testbench, debug included

K. Wakabayashi © NEC Corporation 2007 19 Dynamic Program Loading for custom CPU and DRP

Off chip memory Custom Instructions Loading with CPU DMA Data A

Instruction/Data Memory Process B

Configurable DRP Memory

Process C Instruction/DATA memory is dynamically reloaded upon request. Memory size can be reduced. With C++ cycle accurate simulation, good size of memory should be Determined.

K. Wakabayashi © NEC Corporation 2007 20 2. DRP Architecture (NECEL) DRP Tile (variable array size) MemMem MemMem MemMem MemMem ™™ ArrayArray ofof byte-orientedbyte-oriented MemMem PE PE PE PE PE PE PE PE MemMem processingprocessing elementselements MemMem PE PE PE PE PE PE PE PE MemMem ™™ FullyFully programmableprogrammable Mem PE PE PE PE PE PE PE PE Mem Mem Mem inter-PEinter-PE wiringwiring resourcesresources MemMem PE PE PE PE PE PE PE PE MemMem State Transition Controller MemMem PE PE PE PE PE PE PE PE MemMem ™™ STC:STC: aa simplesimple sequencersequencer

MemMem PE PE PE PE PE PE PE PE MemMem

MemMem PE PE PE PE PE PE PE PE MemMem ™ Array of configurable MemMem PE PE PE PE PE PE PE PE MemMem ™ Array of configurable

MemMem MemMem MemMem MemMem datadata memoriesmemories

™ Fine-grained processor-based programmable array architecture ™ Architected for stream data processing, such as NW packet, motion/still picture, and wireless data streams, etc.

K. Wakabayashi © NEC Corporation 2007 21 Mechanism of Dynamic Reconfiguration

Multi-context data-paths #4 #3 #2 ReconfigurationReconfiguration donedone inin #1 lessless thanthan 11 nsns

NoNo additionaladditional cyclecycle neededneeded + + - forfor reconfigurationreconfiguration

+ *

Add. Ctrl

p(i+1 , j-1) p(i , j-1) j-1 line + pixel data + Event sig. p(i-1 , j) pn(i , j) Context ID =3 p(i , j) j line + pixel data ー p(i+1 , j)

#1 状態遷移コントローラ#2 #3 #4 * p(i+1 , j+1) p(i , j+1) p(i-1 , j+1) 5 State Transition Controller: STC

K. Wakabayashi © NEC Corporation 2007 22 DRP’s roll in ASSP 1) Original Roll : programmable HW in ASSP (faster than DSP) for DSP accelerator, Encryption, etc. several rolls of different sized procedure (Dynamic context , Dynamic program re-loading)

2) Proxy of Main CPU controlling chip, software procedure instead of main CPUs

Main CPU is not powerful enough to perform several jobs in terms of performance and size -> several tasks for main CPU are transferred to DRP after chip fab. (dynamic program re-loading: very fast for DRP)

DRP gives flexibility to ASSP (CPU is not flexible enough in terms of performance)

K. Wakabayashi © NEC Corporation 2007 23 DRP: Tool Driven Architecture!

CC descriptiondescription Architecture is derived from C- ConstraintConstraint • synthesis “Cyber” DRPDRP CompilerCompiler –“FSMD” directly mapped onto “STC and PE array”

FSMFSM Data-pathData-path • C-source code Debugging Data-pathData-pathData-path Data-pathData-path – Same as Pentium, Arm, MIPS!

Watch Reg/Mem PE array IN value look-up 1 with C variable Download 2 - + configuration code 3 < + to each context 4 + + Test Data OUT Input/Output ALU State Control State Transition Controller Stop/Pause/ On-chip debugger Step/Run

K. Wakabayashi © NEC Corporation 2007 24 DRP : Fully Integrated Design Environment

Variety of graphical feedbacks provided for designers Behavioral synthesizer in order to support C source code optimization IDE C source code State transition diagram

Synthesis Reports: status PE, Delay, Throughput, Power Estimation

Mem/Reg/Port Access Info showing Easy, effective, On-chip debugger and intuitive

C-sourceC-source levellevel debuggingdebugging forfor HWHW isis finallyfinally supportedsupported justjust likelike IDEIDE forfor CPU.CPU. DRP-1 board

K. Wakabayashi © NEC Corporation 2007 25 3. “All-in-C” : all tools work for original C source code

CPU Bus I/F generator BehavioralBehavioral CPU Bus I/F generator Software IP library IP library C Behavioral description RTL IP SystemC/SpecC ANSI-C (BDL) Veirlog/VHDL SystemC

Formal Verifier High-speed simulator C-RTLC-RTL Equivalence Equivalence Behavioral Bit-accurateBit-accurate Prover Behavioral Cycle Behavioral Simulator Prover Synthesizer Behavioral Simulator Synthesizer Accurate Cycle-accurate HW/SW Property/AssertionProperty/Assertion Cycle-accurate HW/SW Control-intensive Co-simulator CheckerChecker C++ Co-simulator Data-dominant FPGAFPGA Accelerator Accelerator LibraryLibrary Control-flow-intensive SystemC Characterizer QoR Characterizer QoR Analyzer SystemC simulator Analyzer

RT Power RT Power RT Testbench Estimator Verilog/VHDL RT Testbench Estimator FloorPlannerFloorPlanner GeneratorGenerator

FPGA Base ISSP DRP & Back-end implementation

K. Wakabayashi © NEC Corporation 2007 26 How to connect modules and CPUs? CPU Bus and BUS I/F generator: ex. AMBA AHB,AXI

AMBA AHB(ARM) M0 Address & control S0 M0 M1 M2 MUX HADDR[31:0] bus1 M1 Arbiter S1 MUX

HWDATA[31:0] S0 S1 S2 S3 Generate Write data M2 S2 Read data

•Kind=AMBA_AHB MUX HRDATA[31:0] •Bus Master {M0, M1, M2} S3 •Bus Slave { S1,S2,S3,S4} Address •Arbiter : {RoundRobin | Decoder FixedPriority} Sequential Combinational behavior Config. Bus beh. C RTL Bus and I/F 30L Gen. 4KL Syn. 12KL

K. Wakabayashi © NEC Corporation 2007 27 Communication via BUS

BUS CWB generates bus I/Fs, a bus, and an arbiter from our bus access APIs and a bus definition. filter_core

bus definition (for BUS synthesis) API for BUS I/F (for BUS I/F synthesis) defbus AMBA_AHB { /* on-chip bus protocol(AMBA AHB) */ void read_stream(uint24 *p, ureg32 *idx) width address = 32; /* address bus bit width */ { width data = 32; /* data bus bit width */ module master ={pci2master, filter_core}; /* master UDL list for this bus */ *p = CBM_single_read(*idx) ; module slave = {slave2sdram,filter_ctrl}; /* slave UDL list for this bus */ *idx += 4 ; mode arbiter_rule =RoundRobin; /* arbitration rule */ return ; } bus1; /* bus instance name */ } module AMBA_AHB_MASTER { /* master spec. declaration */ void write_stream(uint24 v, ureg32 *idx) mode burst = Enable; /* burst transfer capability */ { mode data_transfer =Direct; /* UDL-I/F communication type */ mode clock = Enable; /* source declaration */ CBM_single_write(*idx, v) ; mode reset = Enable; /* reset signal source declaration */ return ; } pci2master,filter_core; /* master instance name */ } module AMBA_AHB_SLAVE { filter_core() mode burst = Enable; /* burst transfer capability */ { map address = 0x00000000-0x00ffffff & 0xffffff00; /* address map & decoder mask */ .... } filter_ctrl; /* slave instance name */ /* Line 0 (odd=0) */ for (j = 0; j < w_size; j++) { read_stream(&val,&read_index); line[0][j] = val ; } ... } K. Wakabayashi © NEC Corporation 2007 28 4. Reverse(?) trend with C-synthesis Change SW control into Hardware control

Global trend: more SW, but still we have minor trend of SW to HW solution

1) After several series of versions - For Low Power - For low cost, area, performance Very complex control can be implemented as a HW

2) Hard Real Time Operation in HW rather than SW For easiness of design and reliability of RT operation

K. Wakabayashi © NEC Corporation 2007 29 moremore HardwareHardware forfor PowerPower eefficiency,fficiency, EasyEasy toto designdesign

SW control (more CPUs) More HW

Smaller S-DSP SRAM Area HW 2KWSRAM H/W H/W S-DSP S-DSP 2KWSRAM S-DSP 2KWSRAM 2KW Viterbi size: M gates Decoder

Viterbi NEC MPU Decoder DSP MPU C-synthesis is “must” 1)Power(Area) Reduction 2) SW control into HW control -Less custom DSP, SRAM - Waiting Procedure: S/W into H/W -> Pure Hardware => Power Reduction - Real time constrained S/W into H/W (less active power & leakages) => Easy to design , Reliable

K. Wakabayashi © NEC Corporation 2007 30 Summary 1) Introduced our ASSP approach, a few main CPUs + multi custom CPUs + DRPs 2) C-based synthesis affects our architecture (without C-based synthesis, 1) does not work well) -tasks in main CPU, custom CPU, HW modules, DRP could be exchangeable (that actually happens ) 3) Our Global trend is “HW to SW” , but “SW to HW” trend is also realized with C-synthesis

☆Followings are all C-synthesis HW modules. Designer can select realization with similar MonPower - custom CPU : very flexible & programmable HW - DRP : flexible and medium performance - HW modules : hardwired high performance

K. Wakabayashi © NEC Corporation 2007 31 Email: [email protected]..co.jp URL:http://www.cyberworkbench.com/

K. Wakabayashi © NEC Corporation 2007 32