® ®

Simulink®-based Programming Environment for Heterogeneous MPSoC

Katalin Popovici [email protected]

Software Engineer, The MathWorks © 2009 TheMathWorks, Inc. DATE 2009, Nice, France

® ® Summary . MPSoC (Multi-Processor System On Chip) integrates different components (hardware and software) on a single

. Context: . Heterogeneous MPSoC are required by current embedded applications . TI OMAP, ST Nomadik, NXP Nexperia, Atmel Diopsis . DSP + µC + sophisticated communication infrastructure . Multiple software (SW) stacks

. Problem: . Classic programming environments do not fit: . High level programming environments are not efficient to handle specific architecture capabilities (e.g., C/C++, Simulink) . HW (Virtual) prototypes are too detailed and time consuming for SW debug

. Challenge: . Efficient and fast programming environment for heterogeneous MPSoC 2

1 ® ®

Proposal . Simulink based SW development and validation environment . Combined architecture and application model in Simulink . Architecture markers to ggyenerate executable SystemC models . Performance analysis with different levels of details . Computation and communication mapping exploration

DPCD Variable IDCT Zigzag Length IQ Scan JPEG Decoding Images RLD Quantification Huffman Tables Tables Regenerate

SystemC C\C++ Virtual Architecture

Transaction Model

Virtual Prototype 3

® ®

Outline

. Introduction

. Software Design and Validation . System Architecture . Virtual Architecture . Transaction Accurate Architecture

. Conclusions

4

2 ® ®

MPSoC Architecture

. Heterogeneous MPSoC HW-SS SW-SS . SW subsystems for flexibility . HW subsyypstems for performance Interconnect Component . Complex communication network . Bus based architectures . Network on Chip (NoC) architectures Application . SW subsystem Software Hardware dependent Software (HdS) . Specific CPU subsystem: . CPUs: GPP, DSP, ASIP .CPU . I/O + memoryypp architecture + other peripherals CPU Memory . Layered SW architecture: Hardware . Application code (tasks) Network Other Interface Periph. . Hardware dependent Software (HdS) . Specific to architecture / application to achieve efficiency . Programming heterogeneous MPSoC . Generate SW efficiently by using HW resources for communication & synchronization 5

® ® Example of Heterogeneous MPSoC: Reduced Atmel Diopsis RDT MEM SS ARM9 SS . ARM9 SS SRAM DXM . DSP SS Bridge Bridge ARM9 ROM . MEM SS AMBA AHB . POT SS (Periph. On Tile) . I/O peripherals Bridge POT SS Bridge DMA DSP SS

. System peripherals Timer mailbox PIC mailbox PMEM . Interconnect: AMBA bus AIC SPI REG DMEM DSP

. Local & global memories accessible by both processing units . Different communication schemes between CPUs

. Require multiple software stacks (ARM + DSP) 6

3 ® ® Software Stack Organization SW Subsystem Task 1 Task 2 Task n

Memory CPU Application HdS HDS API Comm OS InterfInterf.. PeriphPeriph.. HAL API HAL . SW stack is organized into layers SW Stack . Application code . SW code of tasks mapped on the CPU . HdS (Hardware dependent Software) made of different components . OS ((pOperatin g S ystem ) . Comm (Communication Primitives) . HAL (Hardware Abstraction Layer) . API (Application Programming Interface) . Different SW components need to be validated incrementally . Different abstraction levels corresponding to the different SW components . SW development platforms (HW abstraction models) to allow specific SW components debug and communication refinement 7

® ® Software Development Platform

User SW code Development Platform . User SW Code . C/C++, Simulink functions, binary,… Executable . Development platform to abstract Model architectures Generation . Runtime library, simulator (ISS, Simulink) HW Executable Model Abstraction . Executable model generation . Compile, Link Debug & . Debug Performance Hardware Platform . Iterative process Validation . Different SW components need different detail levels . Requirements for MPSoC executable models: . Speed . Easily experiment with several mapping schemes . Multiple SW stacks . Accuracy . Evaluate the effect on performance by using specific HW resources . Debug low level SW code 8

4 ® ®

Software Design Flow ARM7 Mem. XTENSA Mem. Platform Independent

Local Bus Local Bus F1 F2 Architecture Application F0 F5 F Interface Periph. Interface Periph. 6 F3 F4

Global Bus HdHardware AhittArchitecture ARM-SS XTENSA-SS Simulink Template T1 T2 T3 F F 1 Partitioning2 & Mapping F 0 F5 F6

F3 F4

Simulink System Architecture Combined Architecture / Application Model

Development Platform Software Stack Generation Generation

HW Library SW Library

SystemC Virtual Prototype & SW Binary

9

® ® Software Abstraction Levels SW-SS1 SW-SS2 SW Development T1 T2 T3 SW Architecture Platform

Simulink Sy stem Architect u re Abstract CPU-SS1 Abstract CPU-SS2 T1 MEM T2 T3 SystemC TLM T2

HDS API

Abstract Interconnect Component Virtual Architecture

Abstract CPU-SS1 Abstract CPU-SS2 T2 T3 MEM-SS CPU1 Memory CPU2 Memory SystemC TLM HDS API Interface Periph Interface Periph Comm OS HAL API Transaction Accurate Interconnect Component (Bus/NoC) Architecture

CPU1 CPU-SS1 CPU2 CPU-SS2 T2 T3 MEM-SS SystemC TLM ISS Memory ISS Memory HDS API Periph Interface Periph Interface Comm OS Virtual Prototype HAL API Interconnect Component (Bus/NoC) HAL 10

5 ® ® Software Abstraction Levels SW-SS1 SW-SS2 SW Development T1 T2 T3 SW Architecture Platform

Simulink Sy stem Architect u re Abstract CPU-SS1 Abstract CPU-SS2 T1 MEM T2 T3 SystemC TLM T2

HDS API

Abstract Interconnect Component Virtual Architecture Models to debug the SW GAPSW Design Abstract CPU-SS1 Abstract CPU-SS2 T2 T3 MEM-SS CPU1 Memory CPU2 Memory SystemC TLM HDS API Interface Periph Interface Periph Comm OS HAL API Transaction Accurate Interconnect Component (Bus/NoC) Architecture

CPU1 CPU-SS1 CPU2 CPU-SS2 T2 T3 MEM-SS SystemC TLM ISS Memory ISS Memory HDS API Periph Interface Periph Interface Comm OS Virtual Prototype HAL API Interconnect Component (Bus/NoC) HAL 11

® ®

Outline

. Introduction

. Software Design and Validation . System Architecture . Virtual Architecture . Transaction Accurate Architecture

. Conclusions

12

6 ® ® System Architecture Model SW-SS1 SW-SS2 SW Development T1 T2 T3 SW Architecture Platform

Simulink Sy stem Architect u re Abstract CPU-SS1 Abstract CPU-SS2 T1 MEM T2 T3 SystemC TLM T2

HDS API

Abstract Interconnect Component Virtual Architecture

Abstract CPU-SS1 Abstract CPU-SS2 T2 T3 MEM-SS CPU1 Memory CPU2 Memory SystemC TLM HDS API Interface Periph Interface Periph Comm OS HAL API Transaction Accurate Interconnect Component (Bus/NoC) Architecture

CPU1 CPU-SS1 CPU2 CPU-SS2 T2 T3 MEM-SS SystemC TLM ISS Memory ISS Memory HDS API Periph Interface Periph Interface Comm OS Virtual Prototype HAL API Interconnect Component (Bus/NoC) HAL 13

® ®

System Architecture Design

. Captures application and mapping to Subsystems Tasks architecture Functions SW-SS1 SW-SS2 . Task T1 T2 T3 . Set of functions:

. Simulink blocks, S-functions F1 F2 F3 F4

. Subsystem COMM3 . Set of tasks COMM1 . Abstract communication types . Communication Intra-SS COMM2 . Communication Inter-SS Param1 = Value1 Param2 = Value2 . Communication protocol Comm. . Generic Simulink I/Os Inter-SS Comm. . Explicit annotation for implementation Intra-SS

. Algorithm validation through simulation 14

7 ® ® Application Example: M-JPEG Decoder Mapped on Diopsis RDT T1 DC Coefficients T2 T3 T4 . Application JPEG Variable DPCD Images IDCT specification Length Zigzag S can IQ Decoding (~ 1700 LOC) 01101… RLD Huffman Tables AC Coefficients Quantification Tables Bitmap Images

MEM SS ARM9 SS SRAM . Target architecture: DXM Diopis RDT with AMBA Bridge Bridge ARM9 ROM

AMBA AHB

Bridge POT SS Bridge DMA DSP SS . Partitioning & mapping Timer mailbox PIC mailbox PMEM

REG DMEM DSP AIC SPI 15

® ® System Architecture Level Software Development Platform for Diopsis RDT ARM9-SS DSP-SS POT-SS (I/O) T1 T2 T3 T4

ch1_T1T2 … … ch5_T1T2

ch1_T2T3 ch_T3T4 ResourceType = DXM ch2_T2T3 InterconnectType = AMBA ModuleName = DXM0 Communication units: Simulink Signals Architecture markers annotating the model: . 5 Intra SS communication units . ResourceType . depends on application . ARM9, DSP, POT, Task . Communication: swfifo, dmem, sram, reg, dxm . 3 Inter SS communication units . NetworkType: AMBA_AHB, NoC . depends on application . AccessType: DMA, direct . Generic channels to be mapped on resources . MemName . Execution model in Simulink  Validation of application functionality 16

8 ® ®

MJPEG System Architecture in Simulink . 7 S-Functions . Algorithm validation,10 frames QVGA YUV 444 . Simulation time: 15s on PC 1.73GHz, 1GBytes RAM

ch1_T2T3

ch_T3T4

ARM9 ch2_T2T3 DSP POT

Task1 Task2

Decoded Microblock

Simulation running

17

® ®

Capture of Low Level Architecture Features in Simulink for MJPEG . 8 communication units: MEM SS . 3 Inter-SS + 5 Intra-SS SRAM ARM9 SS DXM . Communication schemes: . Chang ing mar ks o f Simu lin k mo de l Bridge Bridge ARM9 ROM

AMBA AHB . Nb AMBA cycles REG to transfer “N” words

Bridge POT SS Bridge DMA DSP SS . ARM wr: N/4+8+(N-1) Timer mailbox PIC mailbox PMEM . DSP rd: N/4 ResourceType = REG AIC SPI REG DMEM DSP

For example, REG

ch1_T2T3

ch_T3T4

ch2_T2T3 REG

18

9 ® ®

Capture of Low Level Architecture Features in Simulink for MJPEG

MEM SS . Nb AMBA cycles DXM to transfer “N” words SRAM ARM9 SS DXM . ARM wr: 2*N . DSP rd: 14+(N-1) Bridge Bridge ARM9 ROM

AMBA AHB ResourceType = DXM

Bridge POT SS Bridge DMA DSP SS AccessType = DMA Timer mailbox PIC mailbox PMEM

AIC SPI REG DMEM DSP

For example, DXM

ch1_T2T3

ch_T3T4

ch2_T2T3 DXM

19

® ®

Capture of Low Level Architecture Features in Simulink for MJPEG MEM SS . Nb AMBA cycles SRAM to transfer “N” words SRAM ARM9 SS DXM . ARM wr: N/2

Bridge Bridge ARM9 ROM . DSP r d: 5+ (N-1)

AMBA AHB ResourceType = SRAM Bridge POT SS Bridge DMA DSP SS AccessType = DMA Timer mailbox PIC mailbox PMEM

AIC SPI REG DMEM DSP

For example, SRAM

ch1_T2T3

ch_T3T4

ch2_T2T3

SRAM

 Easy to experiment with different communication schemes 20

10 ® ® Virtual Architecture Model SW-SS1 SW-SS2 SW Development T1 T2 T3 SW Architecture Platform

Simulink Sy stem Architect u re Abstract CPU-SS1 Abstract CPU-SS2 T1 MEM T2 T3 SystemC TLM T2

HDS API

Abstract Interconnect Component Virtual Architecture

Abstract CPU-SS1 Abstract CPU-SS2 T2 T3 MEM-SS CPU1 Memory CPU2 Memory SystemC TLM HDS API Interface Periph Interface Periph Comm OS HAL API Transaction Accurate Interconnect Component (Bus/NoC) Architecture

CPU1 CPU-SS1 CPU2 CPU-SS2 T2 T3 MEM-SS SystemC TLM ISS Memory ISS Memory HDS API Periph Interface Periph Interface Comm OS Virtual Prototype HAL API Interconnect Component (Bus/NoC) HAL 21

® ® Virtual Architecture Design . Hardware Architecture Abstract CPU-SS1 Abstract CPU-SS2 T2 . SystemC TLM, message accurate model T1 MEM T2 T3 . Tasks encapsulated in SC_THREADS . . . . HDS API . Inter-SS communication units partially mapped on the resources Abstract Interconnect . Abstract interconnect component . Intra-SS communication units become software communication channels

. Simulation . Software Architecture . Task scheduled by the SystemC scheduler . Task C code based on HdS APIs . Task code validation and partitioning Hardware platform Task code of T2 CPU-SS2 (SC_MODULE) Communication channel C code of Task_T2 void recv (fifo_ch* ch, void* dst, int size) { SC_MODULE (CPU_SS2) { int B[10], C[20], D[10]; dst = ch->read (size); Task_T2 * T2; // tasks } void task_T2( ) { in_port *in; // ports while(1) { out_port *out; class fifo_ch : public sc_prim_channel { recv(CH1, B, 10); // Comm. API fifo_ch*ch1; // channels word *buffer; F1(B,C); // Computation Task_T2 (SC_THREAD) public: F2(C,D); word * read (int size) { SC_MODULE (Task_T2){… send(CH2, D, 10); // Comm. API for (i=0; i

11 ® ® Application Example: M-JPEG Decoder mapped on Diopsis RDT Abstract ARM9-SS MEM-SS . Abstract SS . Inter-SS communication ppyartially T1 T2 mapped on explicit resources (dxm, SRAM ch1_T1T2 sram, reg, dmem) … … ch5_T1T2 . Intra-SS communication becomes (swfifo) swfifo channels . Abstract AMBA bus: virtual arbiter + address decoder

Abstract Bus

Virtual Arbiter

Decoder

T4 DMEM REG .T3 T3

Abstract POT-SS

Abstract DSP-SS 23

® ® Application Example: M-JPEG Decoder mapped on Diopsis RDT Abstract ARM9-SS Software MEM-SS T1 T2

T1 T2 SRAM ch1_T1T2 HdS API … HdS API … ch5_T1T2 (swfifo) Tasks code on ARM9

. Tasks C code based on HdS API Abstract Bus . Tasks scheduled by SystemC scheduler

T3

T4 DMEM REG .T3 T3 HdS API Task code on DSP Abstract POT-SS

Abstract DSP-SS 24

12 ® ® Results for the M-JPEG Decoder mapped on Diopsis RDT at the Virtual Architecture Level

. 3 inter-SS communication mapping schemes: total messages through AMBA, execution time

Comm. ch1_ T2T3 ch2_ T2T3 ch_ T3T4 ch1_ T1T2- Total messages Execution Time [ns] Unit (256 bytes) (4 bytes) (64 bytes) ch5_T1T2 AMBA (1 clock cycle 20ns)

DXM DXM DXM SWFIFO 216000 4464060 MJPEG DXM REG DMEM SWFIFO 144000 3720060 SRAM SRAM DMEM SWFIFO 108000 2232020

. Simulation time 14s (DXM+REG+DMEM), 10 frames QVGA YUV 444 format Simulation Screenshot

 Validation of task code and partitioning SystemC

25

® ® Transaction Accurate Architecture Model SW-SS1 SW-SS2 SW Development T1 T2 T3 SW Architecture Platform

Simulink Sy stem Architect u re Abstract CPU-SS1 Abstract CPU-SS2 T1 MEM T2 T3 SystemC TLM T2

HDS API

Abstract Interconnect Component Virtual Architecture

Abstract CPU-SS1 Abstract CPU-SS2 T2 T3 MEM-SS CPU1 Memory CPU2 Memory SystemC TLM HDS API Interface Periph Interface Periph Comm OS HAL API Transaction Accurate Interconnect Component (Bus/NoC) Architecture

CPU1 CPU-SS1 CPU2 CPU-SS2 T2 T3 MEM-SS SystemC TLM ISS Memory ISS Memory HDS API Periph Interface Periph Interface Comm OS Virtual Prototype HAL API Interconnect Component (Bus/NoC) HAL 26

13 ® ® Transaction Accurate Architecture Design CPU-SS1 CPU-SS2 . Hardware Architecture Abstract Abstract T2 T3 MEM-SS . SystemC TLM model CPU1 Memory CPU2 Memory . Detailed CPU-SS local architecture HDS API Interface Periph Interface Periph Comm OS . Abstract CPU cores HAL API . Explicit communication protocol Interconnect Component (Bus/NoC) . Explicit interconnect component (bus, NoC)

. Simulation . Software Architecture . Task scheduled by the OS scheduler . Task code + OS + Comm . Validation of OS & Comm integration . Based on HAL APIs TA platform SW Stack code on CPU–SS2 main Communication SW TOP (SC_MODULE) SC_MODULE (TOP){ extidtern void tkT2()task_T2 (); void recv(chdth, dst, s ize){) { CPU2_SS *cpu2_ss; // Subsystems void __start (void) {… switch (ch.protocol){ HW_SS *hw_ss; create_task (task_T2); case FIFO: BUS *bus; // Interconnection if (ch.state==EMPTY) _schedule(); CPU2-SS (SC_MODULE) C code of Task_T2 OS SC_MODULE (CPU_SS2) { void task_T2( ) { Memory *mem; // local components void __schedule(void) { Periph *periph; while (1) { int old_tid = cur_tid; CPU2 * cpu2; … recv (CH1, B, 10); cur_tid = new_tid; … __cxt_switch( old_tid.cxt, cur_tid.cxt ); send (CH2, C, 20); Bus_port *busport; // Ports declation …} 27

® ® Application Example: M-JPEG Decoder mapped on Diopsis RDT MEM SS ARM9 SS Software SRAM T1 T2 DXM HDS API Abstract Comm OS Bridge Bridge ARM9 Masters HAL API Control & Write Data SW Stack on ARM9 MUX AMBA AHB Addr Arbiter MUX Read Decoder Bridge POT SS Bridge DMA Data MUX T3 DSP SS Slaves Timer mailbox PIC mailbox HDS API Comm OS HAL API AIC SPI REG DMEM Abstract DSP SW Stack on DSP . Software Stack (Tasks code + OS + Comm) . Local SS architectures detailed based on HAL API . Abstract CPU execution models . Tasks scheduled by OS scheduler . AMBA bus protocol fully modeled SystemC ARM9 . Inter-SS communication fully mapped on explicit resources . Intra-SS communication managed by OS DSP . Simulation time 5m10s (DXM+REG+SRAM), 10 frames QVGA POT  Validation of OS and Comm integration Simulation Screenshot 28

14 ® ® Results for the M-JPEG Decoder at the Transaction Accurate Architecture Level MEM SS ARM9 SS SRAM DXM . Different communication mapping schemes

Bridge Bridge ARM9 SRAM DXM AMBA AHB REG Bridge POT SS Bridge DMA

Timer mailbox PIC mailbox DSP SS

AIC SPI REG DMEM DSP

Communication mapping exploration Communication scheme Transactions to memories [Bytes] Total AMBA ___ DXM SRAM REG DMEM cycles DXM+DXM+DXM 5256k 0 0 0 8856k 100% DXM+REG+DMEM 4608k 0 72k 576k 7884k 89% SRAM+SRAM+DMEM 0 4680k 0 576k 3960k 45%

 Improvement up to 55% in communication performance by using HW resources 29

® ®

Outline

. Introduction

. Software Design and Validation . System Architecture . Virtual Architecture . Transaction Accurate Architecture

. Conclusions

30

15 ® ®

Conclusions

. Definition of the different abstraction levels and the HW & SW models . System Architecture (SA) in Simulink . Virtual Architecture (VA) in SystemC . Transaction Accurate Architecture (TA) in SystemC . Structuring the SW stack into layers allows: . Flexibility in terms of SW components reuse (OS, Communication) . Portability to other platforms (HAL) . Incremental generation and validation of the different SW components by using SW development platforms (HW abstraction models) . HW abstraction models: . Using markers in the Simulink model to allow automatic generation of mapping schemes . Allow early performance estimation at different levels of details . Allow the efficient use of architecture resources

. Easily experiment with several computation and communication mapping schemes 31

® ®

Thank you!

Katalin Popovici {[email protected]} The MathWorks

32

16 ® ®

Acknowledgement

. Dr. Ahmed Jerraya, CEA-LETI, France

. Prof. Frederic Petrot, TIMA Labs, France

. Prof. Frederic Rousseau, TIMA Labs, France

. Pieter Mosterman, The MathWorks, USA

33

17