MEH420 Intro. To Embedded Systems

ARM Processors
• ARM was developed at Acorn Computer Limited of Cambridge, England between 1983 & 1985
• RISC concept introduced in 1980 at Stanford and Berkeley
• ARM Limited founded in 1990
• ARM Cores
  • Licensed to partners to develop and fabricate new
  • Soft core

ARM Processors
• Based upon RISC Architecture with enhancements to meet requirements of embedded applications.
  • A large uniform register file
  • Load-store architecture, where data processing operations operate on register contents only
  • Uniform and fixed length instructions
• 32-bit processor
• Instructions are 32 bits long
• Good speed / power consumption ratio
• High code density


ARM Processors
• Version 1 (1983-1985) (obsolete)
  • 26-bit addressing, no multiply or coprocessor
• Version 2 (obsolete)
  • Includes 32-bit result multiply and coprocessor support
• Version 3
  • 32-bit addressing
• Version 4
  • Adds signed and unsigned half-word and signed byte load and store instructions
• Version 4T: Thumb compressed form of instructions introduced.

ARM Processors
• Version 5T
  • Superset of 4T adding new instructions
• Version 5TE
  • Adds signal processing extension
  • Examples:
    • ARM9E-S: v5TE (Sony Ericsson K-W series, TI OMAPs)
    • XScale: v5TE (Samsung Omnia, Blackberry)
• Version 6
  • ARM11: ARMv6 (iPhone, E90, N95 etc)
  • Cortex-M0-M1: ARMv6 (STM32, NXP LPC, FPGA Softcore)

ARM Processors
• Enhancements to Basic RISC Features:
  • Control over ALU and barrel shifter for every data processing operation to maximize their usage
  • Auto-increment and auto-decrement addressing modes to optimize program loops
  • Load and Store multiple instructions to maximize data throughput
  • Conditional execution of instructions to maximize execution throughput

ARM Processors
• ARM v7: (M, E-M, R, A): Cortex-M3-M4, Cortex-R4-R5-R7, Cortex-A5-A7-A8-A9, A12, A15.
  • A: Applications processors are intended for use with open OS and feature a memory management unit (MMU) providing for virtual addressing
  • R: Real-time processors focus on more deeply embedded applications. They feature a memory protection unit (MPU) which protects regions of memory but does not provide for virtual addressing.
  • M: Microcontrollers will generally not have memory protection, and focus on providing very low latency responses to interrupts and including features such as flash memory controllers and interrupt controllers
  • NXP, Atmel, Cypress, ST, TI, OMAP Samsung, (20MHz-2.5GHz)
• ARM v8: 32/64 bit version (A-R), Cortex-A53-A57
  • New instruction set, Advanced SIMD (NEON), Crypto instructions, Linux kernel version 3.7 and iOS 7 support.

ARM Processors: Common Features (till v5)
• Data items are placed in the register file
• No data processing instructions directly manipulate data in memory
• Instructions typically use two source registers and a single result or destination register
• A barrel shifter on the data path can process data before it enters the ALU (no clock: combinational circuit)
• Increment/decrement logic can update register content for sequential access independent of the ALU (auto increment-dec. modes)

ARM Processors: Basic ARM Organization

ARM Processors: Registers
• General purpose registers hold either data or address
• All registers are of 32 bits
• In user mode 16 data registers and 2 status registers are visible
• Data registers: r0 to r15
  • r13: stack pointer
  • r14: link register (where return address is put whenever a subroutine is called)
  • r15: program counter

ARM Processors: Registers
• Depending upon context, registers r13 and r14 can also be used as GPR
• Any instruction which uses r0 can as well be used with any other GPR (r1-r13)
• In addition, there are two status registers:
  • CPSR: current program status register
  • SPSR: saved program status register

ARM Processors: Registers (r15)
• When the processor is executing in ARM state
  • All instructions are 32-bit wide
  • All instructions are word aligned
  • PC value is stored in bits [31:2] with bits [1:0] undefined


ARM Processors: Registers
• N: Negative, Z: Zero, C: Carry, V: Overflow
• I: 1 disable IRQ, F: 1 disable FIQ
• T: 0 ARM state, 1: Thumb state
• Q: Overflow, saturation arithmetic (v5TE)

ARM Processors: Processor Modes
• Processor modes determine
  • Which registers are active
  • Access rights to CPSR register itself
• Each processor mode is either
  • Privileged: full read-write access to the CPSR
  • Non-privileged: only read access to the control field of CPSR but read-write access to the condition flags.

ARM Processors: Processor Modes
• The processor enters abort mode when there is a failed attempt to access memory.
• Fast interrupt request (FIQ) and interrupt request modes correspond to the two interrupt levels available on the ARM processor.
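
The CPSR fields and processor modes listed above can be illustrated by decoding a raw register value. The following is a minimal sketch in Python, used here only as a modelling language. The helper name decode_cpsr is hypothetical; the bit positions (N=31, Z=30, C=29, V=28, Q=27, I=7, F=6, T=5, mode in bits [4:0]) and mode encodings are the standard ones for these architecture versions.

```python
# Bit positions of the CPSR flag and control bits.
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "Q": 27, "I": 7, "F": 6, "T": 5}

# Encodings of the mode field, CPSR[4:0].
MODES = {
    0b10000: "user",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "supervisor",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}


def decode_cpsr(cpsr):
    """Hypothetical helper: split a 32-bit CPSR value into named fields."""
    fields = {name: (cpsr >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["mode"] = MODES.get(cpsr & 0x1F, "reserved")
    # Every defined mode except user mode is privileged.
    fields["privileged"] = fields["mode"] not in ("user", "reserved")
    return fields


if __name__ == "__main__":
    # 0xD3: IRQ and FIQ disabled, ARM state, supervisor mode.
    print(decode_cpsr(0x000000D3))
```

For example, 0x000000D3 decodes to supervisor mode with both interrupt levels masked, which matches the state described for the processor after reset.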

ARM Processors: Processor Modes
• Supervisor mode is the mode that the processor is in after reset and is generally the mode that an operating system kernel operates in.
• System mode is a special version of user mode that allows full read-write access to the CPSR.
• Undefined mode is used when the processor encounters an instruction that is undefined or not supported by the implementation.
• User mode is used for programs and applications.

ARM Processors: Banked Registers
• ARM has 37 registers in the register file. Of those, 20 registers are hidden from a program at different times. These registers are called banked registers and are identified by the shading in the diagram.
• They are available only when the processor is in a particular mode; for example, abort mode has banked registers r13_abt, r14_abt and spsr_abt.

ARM Processors: Banked Registers
• When we enter FIQ mode we have a fresh copy of r8-r14.
• You normally should store (typically to stack) the status register before entering an interrupt.
• CPSR is copied to SPSR. When going to user mode SPSR is copied to CPSR.

ARM Processors: Mode Changing
• The mode is changed by writing directly to the CPSR, or by hardware when the processor responds to an exception or interrupt.
• To return to user mode a special return instruction is used that instructs the core to restore the original CPSR and banked registers.

ARM Processors: Pipeline
• A pipeline is the mechanism a RISC processor uses to execute instructions. Using a pipeline speeds up execution by fetching the next instruction while other instructions are being decoded and executed.


ARM Processors: Pipeline
• As the pipeline length increases, the amount of work done at each stage is reduced, which allows the processor to attain a higher operating frequency. This in turn increases the performance.
• The system latency also increases because it takes more cycles to fill the pipeline before the core can execute an instruction.
• The increased pipeline length also means there can be data dependency between certain stages. You can write code to reduce this dependency by using instruction scheduling.

ARM Processors: Pipeline
The ARM9 adds a memory and writeback stage, which allows the ARM9 to process on average 1.1 Dhrystone MIPS per MHz, an increase in instruction throughput by around 13% compared with an ARM7. The maximum core frequency attainable using an ARM9 is also higher.

ARM Processors: Pipeline and Memory Organization
Processor family   # of pipeline stages   Memory organization    Clock rate   MIPS/MHz
ARM6               3                      Von Neumann            25 MHz
ARM7               3                      Von Neumann            66 MHz       0.9
ARM8               5                      Von Neumann            72 MHz       1.2
ARM9               5                      Harvard                200 MHz      1.1
ARM10              6                      Harvard                400 MHz      1.25
StrongARM          5                      Harvard                233 MHz      1.15
ARM11              8                      Von Neumann/Harvard    550 MHz      1.2


ARM Processors: Instructions
• Instructions process data held in registers and access memory with load and store instructions.
• Classes of instructions:
  • Data processing
  • Branch instructions
  • Load-store instructions
  • Software interrupt instructions
  • Program status register instructions

ARM Processors: Instructions
• 3-address data processing instructions, 2 for inputs and 1 for output
• Conditional execution of each instruction
• Load and store multiple registers
• Shift and ALU operation in a single instruction (barrel shifter)
• Open instruction set extension through the coprocessor instructions

ARM Processors: Data Types
• Word is 32 bits long
• Word can be divided into four 8-bit bytes
• ARM address can be 32 bits long
• Addresses refer to bytes
  • Address 4 starts at byte 4
• Can be configured at power-up as either little- or big-endian mode.
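
The byte addressing and endianness points above can be made concrete with a short sketch. It uses Python's standard struct module to lay out words in a byte-addressed memory image; the variable names and sample values are illustrative assumptions, not taken from the slides.

```python
import struct

WORD = 0x11223344

# The same 32-bit word stored in the two byte orders.
little = struct.pack("<I", WORD)  # least significant byte at the lowest address
big = struct.pack(">I", WORD)     # most significant byte at the lowest address

# A byte-addressed memory image holding two little-endian words.
# The word at address 4 starts at byte 4.
memory = struct.pack("<II", 0x11223344, 0xAABBCCDD)
word_at_4 = struct.unpack_from("<I", memory, 4)[0]
byte_at_4 = memory[4]

if __name__ == "__main__":
    print(little.hex(), big.hex())
    print(hex(word_at_4), hex(byte_at_4))
```

In the little-endian image, the byte at address 4 is the least significant byte of the second word; in a big-endian image it would be the most significant one.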

ARM Processors: Data Processing
• Manipulate data within registers
  • MOVE instructions
  • Arithmetic instructions (multiply)
  • Logical instructions
  • Comparison instructions
• Suffix S on data processing instructions updates flags in CPSR

ARM Processors: Data Processing
• Operands are 32-bit wide: come from registers or specified as literal in the instruction itself
• Second operand sent to ALU via barrel shifter
• 32-bit result placed in register; long multiply instructions produce 64-bit results

ARM Processors: Move Instruction
• MOV Rd, N
  • Rd: destination register
  • N: can be an immediate value or a source register
  • Example: mov r7,r5
• MVN Rd, N
  • Move into Rd "not" of the 32-bit value from source


ARM Processors: Barrel Shifter
• Enables shifting 32-bit operand in one of the source registers left or right by a specific number of positions within the cycle time of instruction.
• Available Operations: shift left and right, rotate right.
• Facilitates fast multiply, division and increases code density.
• Example: mov r7, r5, LSL #2
  • Multiplies content of r5 by 4 and puts result in r7

ARM Processors: Arithmetic Instructions
• Implements 32-bit addition and subtraction
• 3-operand form possible
• Examples:
  • SUB r0, r1, r2
    • Subtract value stored in r2 from that of r1 and store in r0
  • SUBS r1, r1, #1
    • Subtract 1 from r1, store result in r1 and update the condition flags (N, Z, C, V)

ARM Processors: Arithmetic Instructions
• Use of barrel shifter with arithmetic and logical instructions increases the set of available operations.
• Examples:
  • ADD r0, r1, r1, LSL #1
    • Register r1 is shifted to the left by 1, then it is added to r1 and the result (3 times r1) is stored in r0
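
The barrel shifter and flag-setting examples above can be checked with a small model. This is a sketch in Python under the assumption that registers are plain 32-bit unsigned values; the function names lsl, lsr, ror and subs are hypothetical helpers that mirror the instruction names.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right on a 32-bit value."""
    return (value & MASK32) >> amount


def ror(value, amount):
    """Rotate right within 32 bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def subs(a, b):
    """Model SUBS: return (result, flags) for a - b on 32-bit registers."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),  # on ARM, C is set when no borrow occurs
        "V": (((a ^ b) & (a ^ result)) >> 31) & 1,  # signed overflow
    }
    return result, flags


if __name__ == "__main__":
    r5 = 3
    r7 = lsl(r5, 2)                 # mov r7, r5, LSL #2
    r1 = 7
    r0 = (r1 + lsl(r1, 1)) & MASK32  # ADD r0, r1, r1, LSL #1
    print(r7, r0, subs(1, 1))
```

A loop counter decremented with SUBS reaches zero with Z set, which is what a following BNE would test.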


ARM Processors: Arithmetic Instructions
• Multiply contents of a pair of registers
  • Long multiply generates 64-bit result
• Examples
  • MUL r0, r1, r2 (32-bit multiplication)
    • Content of r1 and r2 multiplied and put in r0
  • UMULL r0, r1, r2, r3 (Long multiplication)
    • Unsigned multiply of r2 and r3 with result stored in r0 (low word) and r1 (high word)
• Number of cycles taken for execution of these instructions depends upon processor implementation.

ARM Processors: Arithmetic Instructions
• Multiply and Accumulate is especially used in DSP operations
• Result of multiplication can be accumulated with content of another register
• MLA Rd, Rm, Rs, Rn
  • Rd = (Rm*Rs)+Rn
• UMLAL Rdlo, Rdhi, Rm, Rs (Product is 64-bit)
  • [Rdhi,Rdlo]=[Rdhi,Rdlo]+ (Rm*Rs)
• Example operation: convolution!

ARM Processors: Logical Instructions
• Bit-wise logical operations on the two source registers
  • AND, OR, Ex-Or, bit clear
• Example: BIC r0, r1, r2
  • r2 contains a binary pattern where every binary 1 in r2 clears the corresponding bit location of r1; the result is stored in r0
  • Useful in manipulating status flags and interrupt masks.
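
The multiply, multiply-accumulate and bit-clear operations above can be modelled in a few lines. This is an illustrative Python sketch assuming 32-bit unsigned register values; the function names follow the mnemonics, and dot_product is a hypothetical helper showing how UMLAL accumulates the sum of products that convolution needs.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul(rm, rs):
    """MUL: low 32 bits of the product."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """MLA: Rd = (Rm * Rs) + Rn, truncated to 32 bits."""
    return (rm * rs + rn) & MASK32


def umull(rm, rs):
    """UMULL: unsigned 64-bit product returned as (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def umlal(rdlo, rdhi, rm, rs):
    """UMLAL: [RdHi,RdLo] = [RdHi,RdLo] + Rm * Rs, as (RdLo, RdHi)."""
    acc = (((rdhi << 32) | rdlo) + (rm & MASK32) * (rs & MASK32)) & MASK64
    return acc & MASK32, acc >> 32


def bic(rn, rm):
    """BIC: clear in Rn every bit that is set in Rm."""
    return rn & ~rm & MASK32


def dot_product(xs, hs):
    """Sum of products, accumulated with one UMLAL per term."""
    lo, hi = 0, 0
    for x, h in zip(xs, hs):
        lo, hi = umlal(lo, hi, x, h)
    return (hi << 32) | lo


if __name__ == "__main__":
    print(umull(0xFFFFFFFF, 2), dot_product([1, 2, 3], [4, 5, 6]))
```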

ARM Processors: Compare Instructions
• Enables comparison of 32-bit values
• Updates CPSR flags but does not affect other registers
• Examples
  • CMP r0, r9
    • Flags set as a result of r0 - r9
  • TEQ r0, r9
    • Flags set as a result of r0 ex-or r9
  • TST r0, r9
    • Flags set as a result of r0 & r9

ARM Processors: Load-Store Instructions
• Transfers data between memory and processor registers. RISCs use Load-Store instructions.
• Single register transfer
  • Data types supported are signed and unsigned words (32-bits), half-words, bytes
• Multiple-register transfer
  • Transfer multiple registers between memory and the processor in a single instruction
• Swap
  • Swaps content of a memory location with the content of a register

ARM Processors: Load-Store Instructions
• Single Transfer Instructions
  • Load & Store data on a boundary alignment
  • LDR, LDRH, LDRB (load word, half-word, byte)
  • STR, STRH, STRB (store word, half-word, byte)
• Supports different addressing modes
  • Register indirect: LDR r0, [r1]
  • Immediate: LDR r0, [r1,#4]
    • 12-bit offset added to the base register
  • Register operation: LDR r0, [r1,-r2]
    • Address calculated using base register and another register.


ARM Processors: Load-Store Instructions
• Example: Pre-indexing with write back
  • LDR r0, [r1,#4]!
  • Before instruction execution
    • r0 = 0x00000000, r1 = 0x00009000
    • Mem32[0x00009000] = 0x01010101
    • Mem32[0x00009004] = 0x02020202
  • After instruction execution
    • r0 = 0x02020202
    • r1 = 0x00009004

ARM Processors: Load-Store Instructions
• More Addressing Modes:
  • Scaled Mode:
    • Address is calculated using the base register and a barrel shift operation.
  • Pre & Post Indexing Mode:
    • Pre-index with write back: LDR r0, [r1,#4]!
      • Updates the address base register with new address
    • Post-index: LDR r0, [r1], #4
      • Updates the address register after address is used

ARM Processors: Load-Store Instructions
• Multiple Transfer Instructions:
  • Load-store multiple instructions transfer multiple register contents between memory and processor in a single instruction
  • More efficient for moving blocks of memory and saving and restoring context and stack
  • These instructions can increase interrupt latency
    • Usually instruction executions are not interrupted by ARM
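
The pre-index example above, and its offset and post-index variants, can be reproduced with a small model. This is a sketch in Python; the ldr function, its mode names ("offset", "pre", "post") and the dictionary-based register file and memory are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def ldr(regs, mem, rd, rn, offset=0, mode="offset"):
    """Model LDR with three addressing modes.

    "offset": LDR rd, [rn, #offset]    base register unchanged
    "pre":    LDR rd, [rn, #offset]!   base updated before the access
    "post":   LDR rd, [rn], #offset    base updated after the access
    """
    base = regs[rn]
    updated = (base + offset) & MASK32
    address = base if mode == "post" else updated
    regs[rd] = mem[address]
    if mode in ("pre", "post"):
        regs[rn] = updated  # write back the new base address


def fresh_state():
    """Register and memory contents from the slide's example."""
    regs = {"r0": 0x00000000, "r1": 0x00009000}
    mem = {0x00009000: 0x01010101, 0x00009004: 0x02020202}
    return regs, mem


if __name__ == "__main__":
    for mode in ("offset", "pre", "post"):
        regs, mem = fresh_state()
        ldr(regs, mem, "r0", "r1", 4, mode)
        print(mode, hex(regs["r0"]), hex(regs["r1"]))
```

With the same starting state, only the pre-indexed and post-indexed forms change r1, and only the post-indexed form loads from the original base address.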


ARM Processors: Load-Store Instructions
• Any subset of current bank of registers can be transferred to memory or fetched from memory
  • LDM
  • STM
• The base register Rn determines source or destination address
• LDMIA|IB|DA|DB, STMIA|IB|DA|DB are available instructions for this purpose
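
The four addressing variants (IA: increment after, IB: increment before, DA: decrement after, DB: decrement before) differ only in which addresses are touched and where the base register ends up. The sketch below models this in Python; block_addresses, ldm and stm are hypothetical helper names, registers are a dictionary keyed by name, and the lowest-numbered register is always placed at the lowest address.

```python
def block_addresses(base, count, mode):
    """Return (ascending word addresses, written-back base) for a transfer."""
    size = 4 * count
    if mode == "IA":
        start, new_base = base, base + size
    elif mode == "IB":
        start, new_base = base + 4, base + size
    elif mode == "DA":
        start, new_base = base - size + 4, base - size
    elif mode == "DB":
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [start + 4 * i for i in range(count)], new_base


def _ordered(reglist):
    # Lowest register number goes to the lowest address.
    return sorted(reglist, key=lambda name: int(name[1:]))


def stm(regs, mem, rn, reglist, mode, writeback=False):
    """Model STM<mode> rn{!}, {reglist}."""
    addresses, new_base = block_addresses(regs[rn], len(reglist), mode)
    for name, address in zip(_ordered(reglist), addresses):
        mem[address] = regs[name]
    if writeback:
        regs[rn] = new_base


def ldm(regs, mem, rn, reglist, mode, writeback=False):
    """Model LDM<mode> rn{!}, {reglist}."""
    addresses, new_base = block_addresses(regs[rn], len(reglist), mode)
    for name, address in zip(_ordered(reglist), addresses):
        regs[name] = mem[address]
    if writeback:
        regs[rn] = new_base


if __name__ == "__main__":
    # Block copy of two words: LDMIA r9!, {r0,r1} then STMIA r10!, {r0,r1}
    regs = {"r0": 0, "r1": 0, "r9": 0x100, "r10": 0x200}
    mem = {0x100: 11, 0x104: 22}
    ldm(regs, mem, "r9", ["r0", "r1"], "IA", writeback=True)
    stm(regs, mem, "r10", ["r0", "r1"], "IA", writeback=True)
    print(mem, regs)
```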

ARM Processors: Stack Processing
• A stack is implemented as a linear data structure which grows up (ascending) or down (descending)
• Stack pointer holds the address of the current top of the stack
• Because of the orthogonal structure of the instructions, load and store instructions can be used to effectively utilize the stack instead of separate pop and push instructions.

ARM Processors: Stack Operation Modes
• ARM multiple register transfer instructions support
  • Full ascending: grows up, SP points to the highest address containing a valid item
  • Empty ascending: grows up, SP points to the first empty location above stack
  • Full descending: grows down, SP points to the lowest address containing a valid item
  • Empty descending: grows down, SP points to the first empty location below stack

ARM Processors: Stack Operation Modes
• Instructions used in Full Ascending
  • LDMFA: translates to LDMDA (POP)
  • STMFA: translates to STMIB (PUSH)
  • SP points to last item in stack
• Instructions used in Empty Descending
  • LDMED: translates to LDMIB (POP)
  • STMED: translates to STMDA (PUSH)
  • SP points to first unused location
• LDMFA, STMFA, LDMED, STMED are translated by the assembler
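
The four stack types above can be summarised as a lookup from stack suffix to the underlying load/store multiple variant, together with a one-word push and pop. This is a sketch in Python; STACK_ALIASES, push and pop are illustrative names, and memory is modelled as a dictionary from address to value.

```python
# Stack-oriented suffixes and the addressing variants the assembler emits.
STACK_ALIASES = {
    "FD": {"push": "STMDB", "pop": "LDMIA"},
    "FA": {"push": "STMIB", "pop": "LDMDA"},
    "ED": {"push": "STMDA", "pop": "LDMIB"},
    "EA": {"push": "STMIA", "pop": "LDMDB"},
}


def push(mem, sp, value, kind):
    """Push one word; return the new SP. kind is "FD", "FA", "ED" or "EA"."""
    step = 4 if kind[1] == "A" else -4
    if kind[0] == "F":   # full: SP moves first, then the store
        sp += step
        mem[sp] = value
    else:                # empty: store at SP, then SP moves
        mem[sp] = value
        sp += step
    return sp


def pop(mem, sp, kind):
    """Pop one word; return (value, new SP)."""
    step = 4 if kind[1] == "A" else -4
    if kind[0] == "F":   # full: SP already points at the top item
        value = mem[sp]
        sp -= step
    else:                # empty: step back to the top item first
        sp -= step
        value = mem[sp]
    return value, sp


if __name__ == "__main__":
    for kind in STACK_ALIASES:
        mem = {}
        sp = push(mem, 0x1000, 1, kind)
        sp = push(mem, sp, 2, kind)
        print(kind, STACK_ALIASES[kind], sorted(hex(a) for a in mem), hex(sp))
```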


ARM Processors: SWAP Instruction
• It is a special case of load and store instructions
• Swap instructions available:
  • SWP: swap a word between memory and register
  • SWPB: swap a byte between memory and register
• These instructions are useful for implementing synchronization primitives like semaphores

ARM Processors: Control Flow Instructions
• Branch Instructions
• Conditional Branches
• Conditional Execution
• Branch and Link Instructions
• Subroutine Return Instructions

ARM Processors: Control Flow Instructions
• Branch Instruction: B label
  • Example: B forward
  • Address label is stored in the instruction as a signed pc-relative offset
• Conditional Branch: B{cond} label
  • Example: BNE loop
  • Branch has a condition associated with it and is executed if the condition codes have the correct value


ARM Processors: Control Flow Instructions
• Branch & Link Instruction:
  • Perform a branch, save the address following the branch in the link register, r14
  • Example: BL subroutine
• For nested subroutines, push r14 and any work registers that need to be saved onto a stack in memory
• Example:
        BL sub1                      ; return address is stored in r14
  sub1: ……
        STMFD r13!, {r0-r2,r14}      ; store multiple registers, r13: SP
        BL sub2                      ; nested

ARM Processors: Control Flow Instructions
• Conditional Execution
  • An unusual feature of ARM instruction set is that conditional execution applies not only to branches but to all ARM instructions
  • Example: ADDEQ r0,r1,r2
    • This instruction will only be executed when the zero flag is set to 1
    • EQ is the condition code

ARM Processors: Control Flow Instructions
• Advantages of Conditional Execution:
  • Reduces the number of branches
  • Reduces the number of pipeline flushes
  • Improves performance of the code
  • Increases code density
• Whenever the conditional sequence is 3 instructions or fewer, it is smaller and faster to exploit conditional execution than to use a branch
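
Conditional execution can be illustrated by evaluating condition codes against the N, Z, C and V flags. The sketch below, in Python, lists the standard condition codes and models the well-known branch-free greatest common divisor loop (CMP, SUBGT, SUBLT, BNE). The names CONDITIONS, cmp_flags and gcd are hypothetical, and gcd assumes strictly positive inputs that fit in a signed 32-bit register.

```python
MASK32 = 0xFFFFFFFF

# Condition codes expressed as predicates over the CPSR flags.
CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "CS": lambda f: f["C"] == 1,
    "CC": lambda f: f["C"] == 0,
    "MI": lambda f: f["N"] == 1,
    "PL": lambda f: f["N"] == 0,
    "VS": lambda f: f["V"] == 1,
    "VC": lambda f: f["V"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "AL": lambda f: True,
}


def cmp_flags(a, b):
    """Flags produced by CMP a, b (that is, by a - b)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),
        "V": (((a ^ b) & (a ^ result)) >> 31) & 1,
    }


def gcd(r1, r2):
    """Model of: gcd: CMP r1,r2 / SUBGT r1,r1,r2 / SUBLT r2,r2,r1 / BNE gcd."""
    while True:
        flags = cmp_flags(r1, r2)        # CMP   r1, r2
        if CONDITIONS["GT"](flags):      # SUBGT r1, r1, r2 (flags unchanged)
            r1 = (r1 - r2) & MASK32
        if CONDITIONS["LT"](flags):      # SUBLT r2, r2, r1
            r2 = (r2 - r1) & MASK32
        if not CONDITIONS["NE"](flags):  # BNE   gcd
            return r1


if __name__ == "__main__":
    print(gcd(48, 18))
```

The loop body contains a single branch; the two subtractions are skipped or executed according to the flags set by the one CMP, which is the saving the slide describes.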

ARM Processors: Control Flow Instructions
• Subroutine return instructions:
  • No specific instructions
  • Example (1):
    sub1: ……
          ……
          MOV r15,r14                ; r15: PC
  • Example (2): when return address has been pushed to stack
    sub1: ……
          ……
          LDMFD r13!, {r0-r12,PC}

ARM Processors: SWI
• Software Interrupt Instructions:
  • An SWI causes a software interrupt exception, which provides a mechanism for applications to call OS routines
  • Instruction: SWI{cond} SWI_number
  • When the processor executes an SWI instruction, it sets the program counter PC to the offset 0x8 in the vector table
  • In case of subroutine it is a part of user program, in the case of SWI it is a part of OS.
  • Instruction also forces the processor mode to SVC (supervisor), which allows an OS routine to execute

ARM Processors: SWI
• SWI is typically executed in user mode
• Instruction forces processor mode to supervisor (SVC) and this allows an OS routine to be executed in privileged mode
• Each SWI has an associated SWI number which is used to represent a particular function call or feature
• Parameter passing through registers; return value is also passed using registers

ARM Processors: SWI
(lr=r14_SVC)

ARM Processors: Program Status Register Instructions
• Two instructions to control PSR directly
  • MRS: transfers contents of either CPSR or SPSR into a register
  • MSR: transfers contents of register to CPSR or SPSR

ARM Processors: Program Status Register Instructions
• Example: enabling IRQ interrupt


ARM Processors: Coprocessor Instructions
• Used to extend the instruction set
  • Used by cores with a coprocessor
  • Coprocessor specific operations
• Syntax: coprocessor data processing
  • CDP{cond} cp, opcode1, Cd, Cn, Cm {,opcode2}
    • cp represents coprocessor number between p0 and p15
    • Opcode field describes coprocessor operation
    • Cd, Cn, Cm: coprocessor registers
• Coprocessor register transfer and memory transfer instructions are also available

ARM Processors: Thumb
• Thumb encodes a subset of the 32-bit instruction set into a 16-bit subspace
• Thumb has higher performance than ARM on a processor with a 16-bit data bus, because the memory transfer becomes much more efficient.
• Thumb has higher code density, which is quite important for memory constrained embedded systems.

ARM Processors: Thumb
• Only low registers r0 to r7 fully accessible in Thumb instructions.
  • Higher registers accessible with MOV, ADD, CMP instructions
• Only branch instructions can be conditionally executed
• Barrel shift operations are separate instructions

ARM Processors: Thumb
• ARM-Thumb Interworking
  • To call a Thumb routine from an ARM routine the core has to change state
    • Changing T bit in CPSR
  • BX and BLX instructions can be used for the switch
    • Example: BX r0 (like branch); BLX r0 (like branch and link)
    • Enters Thumb state if bit 0 of the address in Rn is set to binary 1; otherwise it enters ARM state
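
The state switch performed by BX can be sketched as a function of the target address and the CPSR. This Python model is an illustration only: the name bx and the tuple it returns are assumptions, and the ARM-state case assumes a word-aligned target.

```python
MASK32 = 0xFFFFFFFF
T_BIT = 1 << 5  # Thumb state bit in the CPSR


def bx(target, cpsr):
    """Model BX Rn: return (new PC, new CPSR, state name)."""
    if target & 1:
        # Bit 0 set: enter Thumb state; bit 0 itself is not part of the PC.
        return target & ~1 & MASK32, cpsr | T_BIT, "Thumb"
    # Bit 0 clear: enter ARM state (target assumed word aligned).
    return target & MASK32, cpsr & ~T_BIT & MASK32, "ARM"


if __name__ == "__main__":
    print(bx(0x00008001, 0x10))
    print(bx(0x00008000, 0x30))
```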


ARM Processors: ARMv5E Extensions
• Extensions to facilitate signal processing operations. It supports:
  • Signed multiply accumulate instructions
  • Saturation arithmetic
  • Greater flexibility and efficiency when manipulating 16-bit values for applications such as 16-bit digital audio processing.

ARM Processors: ARMv5E Extensions
• Saturation Arithmetic:
  • Normal ARM arithmetic instructions wrap around when there is an overflow of an integer value
  • Using ARMv5E instructions you can saturate the result
    • Once the highest number is exceeded the result remains at the maximum value
    • Minimum value does not change on underflow
  • Example instructions: QADD, QSUB
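
The difference between wrapping and saturating addition can be shown directly. This is a minimal Python sketch for signed 32-bit operands; wrap_add and qadd are illustrative names, and qadd returns a second value standing in for the sticky Q flag mentioned earlier.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret the low 32 bits of value as a two's complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def wrap_add(a, b):
    """Ordinary ADD: the result wraps around on overflow."""
    return to_signed(a + b)


def qadd(a, b):
    """Model QADD: return (saturated result, True if saturation occurred)."""
    total = a + b
    saturated = max(INT32_MIN, min(INT32_MAX, total))
    return saturated, saturated != total


if __name__ == "__main__":
    print(wrap_add(INT32_MAX, 1), qadd(INT32_MAX, 1))
```

For audio samples, the saturated result clips at full scale, whereas the wrapped result would flip sign and produce an audible glitch.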

ARM Processors: Exceptions
• An exception causes a mode change and
  • Saves the CPSR to the SPSR of the exception mode
  • Saves the PC to the LR of the exception mode
  • Sets the CPSR to the exception mode
  • Sets PC to the address of the exception handler

ARM Processors: Exceptions
The memory map address 0x00000000 is reserved for the vector table, a set of 32-bit words. On some processors the vector table can be optionally located at a higher address in memory (starting at the offset 0xffff0000). Operating systems such as Linux and Microsoft's embedded products can take advantage of this feature.
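
The exception entry steps above can be traced in a small model. The Python sketch below assumes the standard vector offsets and mode encodings; enter_exception and the dictionary-based processor state are hypothetical, reset is left out because no return state is saved for it, and computing the correct return address is left to the caller.

```python
MASK32 = 0xFFFFFFFF
I_BIT, F_BIT, T_BIT = 0x80, 0x40, 0x20

# Offset of each exception's entry in the vector table.
VECTORS = {
    "undefined": 0x04,
    "swi": 0x08,
    "prefetch_abort": 0x0C,
    "data_abort": 0x10,
    "irq": 0x18,
    "fiq": 0x1C,
}

# Mode entered by each exception, and the CPSR[4:0] encoding of each mode.
EXCEPTION_MODE = {
    "undefined": "und",
    "swi": "svc",
    "prefetch_abort": "abt",
    "data_abort": "abt",
    "irq": "irq",
    "fiq": "fiq",
}
MODE_BITS = {"fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
             "abt": 0b10111, "und": 0b11011}


def enter_exception(state, kind, return_address, high_vectors=False):
    """Apply the exception entry steps to a state dictionary in place."""
    mode = EXCEPTION_MODE[kind]
    state["spsr_" + mode] = state["cpsr"]      # save CPSR to SPSR_<mode>
    state["lr_" + mode] = return_address       # save return address to LR_<mode>
    cpsr = (state["cpsr"] & ~0x1F) | MODE_BITS[mode]
    cpsr &= ~T_BIT                             # handlers start in ARM state
    cpsr |= I_BIT                              # IRQ is disabled on every entry
    if kind == "fiq":
        cpsr |= F_BIT                          # FIQ entry also disables FIQ
    state["cpsr"] = cpsr & MASK32
    base = 0xFFFF0000 if high_vectors else 0x00000000
    state["pc"] = base + VECTORS[kind]         # branch to the vector entry


if __name__ == "__main__":
    state = {"cpsr": 0x00000010}               # user mode, interrupts enabled
    enter_exception(state, "swi", 0x00008004)
    print({key: hex(value) for key, value in state.items()})
```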


ARM Processors: Vector Table
• Vector table: a table of addresses that the ARM core branches to when an exception occurs
• Fixed offset for each type of exception
• These addresses contain instructions of one of the following forms:
  • B address: branching relative to PC
  • LDR PC, [PC, #offset]: loads handler address from memory to PC (increases interrupt latency)
  • MOV PC, #immediate: loads immediate value into PC

ARM Processors: Exception Priorities
[Figure: exception priority table. Annotations: "Occur because of external devices"; "1: disable"]

ARM Processors: Exception Handlers
• Reset handler
  • Initializes the system, setting up stack pointers, memory, external interrupt source before enabling IRQ or FIQ
  • Code should be designed to avoid further triggering of exceptions within the reset handler
• Data Abort
  • Occurs when memory controller indicates that an invalid memory address has been accessed.
  • An FIQ exception can be raised within data abort handler


ARM Processors: Exception Handlers
• FIQ
  • Occurs when an external peripheral generates the FIQ input signal
  • Core disables both FIQ and IRQ interrupts
• IRQ
  • Occurs when an external device generates the IRQ input signal
  • IRQ handler will be entered if neither an FIQ exception nor a Data abort exception occurs
  • On entry IRQ exception is disabled and should remain disabled for the handler if not enabled by the handler

ARM Processors: Exception Handlers
• Prefetch Abort
  • Occurs when an attempt to fetch an instruction results in memory fault
  • FIQ exception can be served
• Undefined instruction
  • Occurs when an instruction is not in the ARM or Thumb instruction set
  • SWI and undefined instruction have the same level of priority because they can not occur together

ARM Processors: Exception Handlers
• Returning from Exception Handler:
  • Exception handler must not corrupt LR
  • After servicing is complete, return to normal execution occurs
    • By moving the correct value of LR into PC
    • By restoring CPSR from SPSR

ARM Processors: Interrupt Assignment
• An interrupt controller connects multiple external interrupts to either FIQ or IRQ
• IRQs are normally assigned to general purpose interrupts
  • Example: periodic timer interrupt to force a context switch (concurrent processes)
• FIQ is reserved for an interrupt source which requires fast response time (short latency)

ARM Processors: Interrupt Latency
• Hardware (FIQ) and software latency
• Software methods to reduce latency
  • Nested handler which allows further interrupts to occur even when servicing an existing interrupt, by re-enabling the interrupts inside the service routine
  • Program interrupt controller to ignore interrupts of same or lower priority
    • Higher priority interrupts will have lower average latency

ARM Processors: Stack Organization
• For each processor mode a stack has to be set up
  • To be done every time processor is reset
  • Change to each mode by storing CPSR bit pattern and initialize SP
• Design decisions
  • Location and mode (descending stack is common)
  • Size
    • Nested interrupt handler requires larger stack (register save)


ARM Processors: I/O System
• Handle all I/O devices using memory mapped I/O
• Interrupt support
  • Fast interrupt
  • Normal interrupt
• DMA Support
  • Large bandwidth data transfer

ARM Processors
• ARM CPU CORE: Processor Core + Cache + MMU (will be discussed later)

ARM Processors: ARM7 Core
• Low-end ARM core for applications like entry-level mobile phones
• TDMI
  • T: Thumb
  • D: On-chip debug support, enabling processor to halt in response to debug request
  • M: Enhanced multiplier, yields a full 64-bit result
  • I: Embedded ICE Hardware
• Von Neumann architecture
• 3 stage pipeline, CPI (Cycle Per Inst.) ~ 1.9


ARM Processors: ARM7TDMI Organization

ARM Processors: ARM7 Pipeline
• At any time slice, 3 different instructions may occupy each of these stages.
• When the processor is executing data processing instructions, the latency = 3 cycles and the throughput = 1 instruction/cycle
• Because of prefetching PC will be 8+current address, this will be a problem in the case of interrupt! LR = PC+8
  • So, LR should be corrected before returning from the interrupt

ARM Processors: ARM7 Pipeline
• Not always one cycle per instruction completion
• Example: LDMIA r0, {r2,r3} (multiple load)
  • 2 registers to load, instruction in execution for two cycles
  • Execution of pre-fetched instruction delayed
• Branch, subroutine call and exceptions affect pipeline efficiency.

ARM Processors: ARM7TDMI Core Interface Signal

ARM Processors: Interface Signals
• Clock control
  • All state changes within the processor are controlled by memory clock (MCLK)
  • ECLK: clock output reflects the clock used by the core
• Memory interface
  • 32-bit address A[31:0], bidirectional data bus D[31:0], separate Dout[31:0] and Din[31:0]
  • MREQ indicates a processor cycle which requires a memory access
  • SEQ indicates that the memory address will be sequential to that used in the previous cycle

ARM Processors: Interface Signals
• Memory interface (cont.)
  • LOCK indicates that the processor should keep the bus to ensure the atomicity of the read and write phase of a SWAP instruction
  • MAS [1:0] encode memory access size: byte, half-word, word
• MMU interface
  • TRANS: (translation control) 0: user mode 1: privileged mode
  • MODE [4:0] bottom 5 bits of the CPSR (inverted)
  • ABORT: disallow access
• State and Configuration
  • TBIT: whether the processor is currently executing ARM or Thumb instructions
  • Bigend: big-endian or little-endian


ARM Processors: Interface Signals
• Interrupt
  • FIQ: fast interrupt request, higher priority
  • IRQ: normal interrupt request
  • ISYNC: allow the interrupt synchronizer to be bypassed
• Initialization
  • RESET: start the processor from a known state, executing from address 0x00000000

ARM Processors: ARM9TDMI
• Harvard Architecture
  • Increases available memory bandwidth
    • Instruction memory interface
    • Data memory interface
  • Simultaneous access to instruction and data memory
• 5 stage pipeline
• Changes implemented to
  • Improve CPI to ~ 1.5
  • Improve maximum clock frequency

ARM Processors: ARM9TDMI
• 5 stage pipeline
  • Fetch
  • Decode
  • Execute
  • Buffer/Data: access data memory or buffer
  • Write back: to register file


ARM Processors: DSP Enhancements in ARM9E
• New instruction additions give architecture v5TE
• New 32x16 and 16x16 multiply and MAC instructions
  • SMLAxy, SMLAWy, SMLALxy, SMULxy, SMULWy
  • Allows independent access to 16-bit halves of registers: allows efficient use of 32-bit bandwidth for packed 16-bit operands
• Single cycle 32x16 multiplier array
  • Speeds up all ARM9E multiply instructions

ARM Processors: DSP Enhancements in ARM9E
• Count leading zeros instruction
  • CLZ for faster normalization and division (for floating point numbers). It counts the number of zeroes between the MSB and the first bit set to 1
    R1 = 0x00000010
    CLZ R0,R1
    R0 = 27
• Zero overhead fractional saturating arithmetic
  • QADD, QSUB, QDADD, QDSUB
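
The count-leading-zeros operation and its use for normalization can be sketched as follows. This is an illustrative Python model for 32-bit values; clz and normalize are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF


def clz(value):
    """Number of zero bits above the most significant set bit (32 for zero)."""
    return 32 - (value & MASK32).bit_length()


def normalize(value):
    """Shift left until bit 31 is set; return (mantissa, shift amount)."""
    shift = clz(value)
    if shift == 32:
        return 0, 0  # zero cannot be normalized
    return (value << shift) & MASK32, shift


if __name__ == "__main__":
    print(clz(0x00000010), normalize(0x00000010))
```

For example, 0x00000010 has its highest set bit at position 4, so 27 zero bits precede it, and normalization shifts it up to 0x80000000.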

ARM Processors: ARM920T
• CPU core around ARM9TDMI

ARM Processors: AMBA
The Advanced Microcontroller Bus Architecture (AMBA) was introduced in 1996 and has been widely adopted as the on-chip bus architecture used for ARM processors. The first AMBA buses introduced were the ARM System Bus (ASB) and the ARM Peripheral Bus (APB). Later ARM introduced another bus design, called the ARM High Performance Bus (AHB). Using AMBA, peripheral designers can reuse the same design on multiple projects.

ARM Processors: Simple ARM based System
• On-chip there will be an ARM core together with a number of system dependent peripherals
• Also required will be some form of interrupt controller which receives interrupts from the peripherals and raises the IRQ or FIQ input to the ARM as appropriate
• This interrupt controller may also provide hardware assistance for prioritizing interrupts

ARM Processors: Simple ARM based System
• As far as memory is concerned there is likely to be some (cheap) narrow off-chip ROM (or flash) used to boot the system from
• There is likely to be some 16-bit wide RAM used to store most of the run-time data and perhaps some code copied out of the flash
• Then on-chip there may well be some 32-bit memory used to store the interrupt handlers and perhaps stack.


ARM Processors: ARM v5TEJ
• J: supports implementation of Java virtual machine
  • High code density and low power
  • Offering hardware and software acceleration for optimized "byte code" execution

ARM Processors: ARM v6 Architecture
• SIMD (Single Instruction Multiple Data) for exploiting data parallelism
  • By slicing up the existing 32-bit datapath into four 8-bit and two 16-bit slices
  • Example: QADD8 Rd, Rn, Rm
    • Signed saturating 8-bit SIMD add
  • Example: USAD8 Rd, Rn, Rm
    • Sum of absolute differences between corresponding 8-bit values
• Dual 16x16 multiply
• Cryptographic multiplication
  • A new 64 + 32x32 multiply accumulate operation
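
The two SIMD examples above operate on four 8-bit lanes packed into one 32-bit register. The following Python sketch models them lane by lane; the helper names lanes8, pack8, qadd8 and usad8 are illustrative, with lane 0 taken as the least significant byte.

```python
def lanes8(word):
    """Split a 32-bit word into four unsigned bytes, lane 0 = bits [7:0]."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack8(values):
    """Pack four byte values (low 8 bits of each) back into a 32-bit word."""
    return sum((value & 0xFF) << (8 * i) for i, value in enumerate(values))


def signed8(byte):
    """Interpret an unsigned byte as a two's complement value."""
    return byte - 256 if byte & 0x80 else byte


def qadd8(rn, rm):
    """Model QADD8: per-lane signed addition saturated to [-128, 127]."""
    results = []
    for a, b in zip(lanes8(rn), lanes8(rm)):
        total = signed8(a) + signed8(b)
        results.append(max(-128, min(127, total)))
    return pack8(results)


def usad8(rn, rm):
    """Model USAD8: sum of absolute differences of the unsigned lanes."""
    return sum(abs(a - b) for a, b in zip(lanes8(rn), lanes8(rm)))


if __name__ == "__main__":
    print(hex(qadd8(0x7F01FF80, 0x0101FF80)), usad8(0x01020304, 0x04030201))
```

A sum of absolute differences over pixel bytes is the inner step of block matching in video motion estimation, which is one common use of this kind of instruction.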

ARM Processors
• ARM based products brought to market by manufacturers: Atmel, Cirrus Logic, Intel, Samsung
• In the past, most products were based upon the ARM7TDMI and ARM920T cores; now Cortex based cores are available.
• ARM is mostly used as a processor core in SoCs and ASICs
• There are a number of ASSPs (application specific standard products) available, for example for communication applications: Philips VWS22100, an ARM7 based GSM baseband chip

ARM Processors: Intel's ARM Derivative

Digital Signal Processors

DSP
• An embedded system is situated in an external environment
• Sensors provide input about external environment
• Input signal processes by the
• Sensors can be designed for virtually every physical and chemical quantity
  • Weight, velocity, acceleration, electrical current, voltage, temperature, etc.

DSP
• Example sensor: Light sensitive silicon solid-state device composed of many cells: CCDs
• When exposed to light, each cell becomes electrically charged
• CMOS:

DSP
• Charges shifted out using CCD shift register
  • Serial access of pixels
• Analogue current converted to digital form using ADC operating at pixel rate
• Current at each site integrated over a period of time (e.g. 15ms) to get reasonable SNR
• Exposure control

-40- -41- -42-

DSP

Example: Fingerprint sensor (© Siemens, VDE): matrix of 256 x 256 elements. Voltage ~ distance. Resistance also computed.

DSP

• Processing of digitally represented signals
  • Signals represented digitally as sequences of samples
  • Digital signals obtained from physical signals via transducers (e.g. microphones) and analog-to-digital converters (ADC)
• Digital Signal Processor (DSP)
  • Electronic system that processes digital signals

Features of DSPs

• Most DSP tasks require
  • Repetitive numeric computations
  • Attention to numeric fidelity
  • High memory bandwidth, mostly via array accesses
  • Real-time processing
• DSPs must perform these tasks efficiently while minimizing
  • Cost
  • Power
  • Memory use
  • Development time

DSP Applications

• Audio
  • Coding, decoding, surround-sound
• Communication
  • Scrambling, cellular phones, software radios
• Control
  • Robotics, disk drive control, motor control
• Medical
  • Diagnostic equipment, hearing aids
• Defense
  • Radar & sonar processing, missile guidance

Common DSP Algorithms

• Filters
  • FIR, IIR, Adaptive
• Transformation
  • Time domain to frequency domain, or space domain to frequency domain: FFT, DCT, etc.
• Correlation
  • Signal classification, for example

Common DSP Algorithms: FIR Filtering

4-tap FIR filter

• Each tap (M+1 taps total) requires
  • Two data fetches
  • Multiply
  • Accumulate
  • Memory write-back to update
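The per-tap work listed for FIR filtering (two fetches, a multiply, an accumulate and a write-back) can be sketched as follows. It assumes the write-back refers to shifting the delay line by one sample; the function names `fir_step` and `fir_filter` are illustrative only, not taken from the slides.

```python
def fir_step(x_new, delay, coeffs):
    """Process one input sample.

    delay  holds x(n-1) ... x(n-M)
    coeffs holds h(0)   ... h(M)
    Returns the output y(n) and the updated delay line.
    """
    samples = [x_new] + delay            # x(n), x(n-1), ..., x(n-M)
    acc = 0
    for h, x in zip(coeffs, samples):    # two data fetches per tap: h(k) and x(n-k)
        acc += h * x                     # multiply, then accumulate
    return acc, samples[:-1]             # write-back: delay line shifted by one sample


def fir_filter(signal, coeffs):
    """Run a whole signal through the filter, starting from an empty delay line."""
    delay = [0] * (len(coeffs) - 1)
    out = []
    for x in signal:
        y, delay = fir_step(x, delay, coeffs)
        out.append(y)
    return out


# A unit impulse reproduces the coefficients of a 4-tap filter.
impulse_response = fir_filter([1, 0, 0, 0, 0], [1, 2, 3, 4])
```

A DSP performs the loop body in a single MAC instruction per tap; the Python loop only makes the individual steps visible.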

Simple DSP (1982): TI’s TMS32010

• 16-bit fixed-point
• Harvard architecture
• Accumulator
• Specialized instruction set
  • Load and Accumulate
• 390 ns MAC time

TMS32010 FIR Filter Code

• Let X4, H4, … be direct (absolute) memory addresses

    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4 * X4
    LTD X3    ; Load T with next sample x(n-3); Acc = Acc + P
    MPY H3    ; P = H3 * X3
    LTD X2
    MPY H2

• Two instructions per tap

Common Features of DSPs

• Datapath configured for DSP
• Specialized instruction set
• Multiple memory banks and buses
• Specialized addressing modes
• Specialized execution control
• Specialized peripherals for DSP

Features of DSPs: Datapath

• Specialized hardware to perform key arithmetic operations in one cycle
• Hardware support for managing
  • Shifters: adjustment of mantissas of floating point numbers
  • Guard bits: additional bits to increase accuracy of results (32-bit data but 40-bit accumulator)
  • Saturation: preventing wrap around on overflow or underflow

Guard Bits and Rounding Error

Take 2 numbers: 2.56*10^0 and 2.34*10^2
We bring the first number to the same power of 10 as the second one: 0.0256*10^2
The addition of the 2 numbers is:
0.0256*10^2 + 2.3400*10^2 = 2.3656*10^2
(after padding the second number with 0; notice that the digit after the 4 is the guard digit and the digit after that is the round digit)

Guard Bits and Rounding Error

The result after rounding is 2.37, as opposed to 2.36 without the extra digits (guard and round digits), i.e. by considering only 0.02 + 2.34 = 2.36. The error therefore is 0.01.
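The decimal example above can be reproduced with Python's `decimal` module. The sketch keeps three significant digits at a scale of 10^2, as in the slide; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

STEP = Decimal("0.01")      # three significant digits at a scale of 10^2

a = Decimal("0.0256")       # 2.56 * 10^0 aligned to the power 10^2
b = Decimal("2.34")         # mantissa of 2.34 * 10^2

# Without guard and round digits: the aligned operand is chopped before the add.
without_guard = a.quantize(STEP, rounding=ROUND_DOWN) + b

# With guard and round digits: add at full width, then round once at the end.
with_guard = (a + b).quantize(STEP, rounding=ROUND_HALF_EVEN)

error = with_guard - without_guard
```

The same idea carries over to binary datapaths, where the extra accumulator bits play the role of the guard and round digits.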

Features of DSPs: Precision

• Word size affects the precision of fixed point (Qm.n format) numbers
• Precision is defined as the smallest step between two consecutive numbers that can be obtained for a given number of bits
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating point DSPs cost 2X to 4X versus fixed point, are slower than fixed point, and have higher power consumption
• Floating point support simplifies development

Features of DSPs: Saturation

• Saturation:
  • Set to most positive (2^(N-1) - 1) or
  • Most negative values (-2^(N-1))
  • No wrap around
• Special arithmetic instructions
• Useful for signal processing operations

Features of DSPs: Multiplier

• Specialized hardware performs all key arithmetic operations in 1 cycle
• More than 50% of instructions can involve the multiplier (single cycle latency multiplier)
• Need to perform multiply-accumulate (MAC)
• n-bit multiplier => 2n-bit product
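The precision and saturation slides can be illustrated with a short sketch. It assumes two's complement N-bit words and defines the Qm.n step as 2^-n; the helper names `q_precision`, `wrap` and `saturate` are made up for this example.

```python
def q_precision(n_frac_bits):
    """Smallest step between two consecutive Qm.n values."""
    return 2.0 ** -n_frac_bits


def wrap(value, n_bits):
    """Two's complement wrap-around to n_bits, as a plain adder would produce."""
    half = 1 << (n_bits - 1)
    return (value + half) % (1 << n_bits) - half


def saturate(value, n_bits):
    """Clamp to the most positive (2^(N-1) - 1) or most negative (-2^(N-1)) value."""
    hi = (1 << (n_bits - 1)) - 1
    lo = -(1 << (n_bits - 1))
    return max(lo, min(hi, value))


total = 30000 + 10000            # exceeds the 16-bit signed range
wrapped = wrap(total, 16)        # sign flips: a loud click in an audio signal
clamped = saturate(total, 16)    # stays at the positive limit
```

Saturation keeps the error bounded when a sum overflows, which is why DSPs provide it in hardware.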

Features of DSPs: Rounding

• Even with guard bits, need to round
• There are three DSP standard options
  • Chopping: remove guard bits with no changes in the retained bits (biased approximation; in two's complement it biases results down)
  • Von Neumann rounding: if the bits to be removed are all 0, the retained bits are unchanged; if any of the removed bits is 1, then the LSB of the retained bits is set to 1 (unbiased approximation)
  • Rounding: round to the nearest number, or to the even number in case of a tie (higher complexity, better accuracy)
    • A “1” is added at the LSB position of the retained bits if the MSB of the removed bits is “1” and at least one of the other removed bits is “1”; if only the MSB of the removed bits is “1” (a tie), round so that the LSB of the retained bits becomes zero

Features of DSPs: Accumulator in Datapath

• Option 1: accumulator wider than product: guard bits
  • Example: Motorola DSP
  • 24b x 24b => 48b product, 56b ACC
• Option 2: shift right and round product before adder

DSP Memory

• FIR tap implies multiple memory accesses
• DSPs want multiple data ports
• Some DSPs have ad hoc techniques to reduce bandwidth demand
  • Instruction repeat buffer: do 1 instruction 256 times
  • Often disables interrupts, thereby increasing interrupt latency
• Some recent DSPs have instruction caches
  • May allow programmer to “lock in” instructions into cache
  • Option to turn cache into fast program memory
• No DSPs have data caches
• May have multiple data memories
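The three rounding options can be compared on integers, where the low `k` bits are the ones being removed. This is a sketch with illustrative function names; values are treated as two's complement integers, which Python's `>>` and `&` already emulate.

```python
def chop(value, k):
    """Chopping: drop the low k bits, retained bits unchanged."""
    return value >> k


def von_neumann(value, k):
    """Von Neumann rounding: force the retained LSB to 1 if any removed bit is 1."""
    kept = value >> k
    removed = value & ((1 << k) - 1)
    return (kept | 1) if removed else kept


def round_nearest_even(value, k):
    """Round to nearest; on an exact tie pick the neighbour with LSB = 0."""
    kept = value >> k
    removed = value & ((1 << k) - 1)
    half = 1 << (k - 1)
    if removed > half or (removed == half and kept & 1):
        kept += 1
    return kept


# Remove two bits from 0b1011, 0b1010 and 0b1110 with each method.
results = {v: (chop(v, 2), von_neumann(v, 2), round_nearest_even(v, 2))
           for v in (0b1011, 0b1010, 0b1110)}
```

The tie cases `0b1010` and `0b1110` go in opposite directions under round-to-nearest-even, which is what removes the bias.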

DSP Addressing

• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy
  • Any extra instructions imply clock cycles of overhead in the inner loop
• Complex addressing is good
  • Does not use the datapath to calculate addresses
• Auto-increment/decrement register indirect
  • lw r1, 0(r2)+ => r1 <- M[r2] and r2 <- r2+1
  • Option to do it before addressing, positive or negative

DSP Addressing: Buffers

• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, the buffer is often organized as a circular buffer
• What can be done to avoid the overhead of address checking instructions for a circular buffer?
  • Option-1: Keep a start register and an end register per address register for use with auto-increment addressing; reset to start at the end of the buffer
  • Option-2: Keep a buffer length register, assuming buffers start on an aligned address; reset to start when the end is reached
• Every DSP has “modulo” or “circular” addressing
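Both circular-buffer options can be written as address-update functions. This is a software sketch of what the address generation hardware does; the function names are hypothetical, and Option-2 assumes the buffer starts on a boundary aligned to the next power of two at or above its length.

```python
def next_addr_start_end(addr, start, end):
    """Option-1: post-increment, wrap to the start register after the end register."""
    addr += 1
    return start if addr > end else addr


def next_addr_length(addr, length):
    """Option-2: only a length register; the base is implied by alignment."""
    size = 1
    while size < length:         # smallest power of two that holds the buffer
        size <<= 1
    base = addr & ~(size - 1)    # aligned start address, recovered from addr itself
    offset = (addr - base + 1) % length
    return base + offset


# Walk a 5-word buffer that starts at the aligned address 0x100.
addr = 0x100
visited = []
for _ in range(7):
    visited.append(addr)
    addr = next_addr_length(addr, 5)
```

Because the wrap check happens in the address unit, the inner MAC loop needs no compare or branch instructions for it.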

Execution Control

• Hardware support for fast looping
  • Zero overhead looping
  • All loop operations in hardware
• Fast interrupts for I/O handling
• Real time constraints
  • Stream processing: operations completed within the sampling period
  • Block processing

Specialized Peripherals

• Synchronous serial ports
• Parallel ports
• Bit I/O ports
• Timers
• On-chip A/D, D/A converters
• On-chip DMA controller
• Debugging support

DSPs

• TI 5x family
• TI 6x family
• Analog Devices SHARC

DSPs: TI C5x Family

• Fixed-point DSP
• Modified Harvard architecture
  • 1 program memory bus
  • 3 data memory buses
• 40-bit ALU
• Multiple implementations:
  • 1, 2 instructions/cycle

DSPs: TI C5x Family

DSPs: TI C54x Architectural Features

• 40-bit ALU + barrel shifter
• Multiple internal buses: 1 instruction, 3 data, 4 address
• 17x17 multiplier
• Single-cycle exponent encoder
• Two address generators with dedicated registers
• Compare/select/store unit (CSSU)

DSPs: TI C54x ALU

• 40-bit ALU performs 2’s complement arithmetic and logical functions
• ALU supports two 16-bit operations in one cycle
• Supports saturation and sign extension
• 2 accumulators partitioned into (i) lower 16 bits (ii) upper 16 bits (iii) 8 guard bits
• 17-bit x 17-bit hardware multiplier coupled with a 40-bit adder is linked to the accumulator to form the MAC unit
• Output of the adder is passed through a unit that detects a zero or an overflow
• Performs saturation or rounding according to the mode

DSPs: TI C5x ISA Features

• Repeat and block repeat instructions
• Instructions that read 2 or 3 operands simultaneously
• Conditional store
• Fast return from interrupt
• Some 54x registers: circular buffer size register (ARM has no equivalent), block-repeat registers, interrupt registers, processor mode status register

DSPs: TI C54x Pipeline

• Program pre-fetch: send PC address on program address bus
• Fetch: load instruction from program bus to IR
• Decode
• Access: put operand addresses on buses
• Read: get operands from buses
• Execute
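The accumulator partition (lower 16 bits, upper 16 bits, 8 guard bits) can be sketched as bit-field extraction. The names `split_accumulator` and `mac` are illustrative, and the model ignores the C54x status flags and saturation mode.

```python
MASK40 = (1 << 40) - 1


def split_accumulator(acc):
    """Return the (guard, high, low) fields of a 40-bit accumulator value."""
    acc &= MASK40
    return (acc >> 32) & 0xFF, (acc >> 16) & 0xFFFF, acc & 0xFFFF


def mac(acc, x, y):
    """One multiply-accumulate step on 16-bit signed operands, kept to 40 bits."""
    return (acc + x * y) & MASK40


# Accumulate the largest positive 16-bit product eight times.
acc = 0
for _ in range(8):
    acc = mac(acc, 32767, 32767)
guard, high, low = split_accumulator(acc)
```

After eight products the sum no longer fits in 32 bits, and the overflow ends up in the guard field instead of corrupting the result.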

DSPs: TI C54x Power Down Modes

• Three IDLE instructions:
  • IDLE1: shuts down CPU
  • IDLE2: shuts down CPU and on-chip peripherals
  • IDLE3: shuts down chip completely (including PLL)

DSPs: TI C54x Buses

• PB: program read bus
• CB, DB: data read buses
• EB: data write bus
• PAB, CAB, DAB, EAB: address buses
• Can generate two data memory addresses per cycle
  • Stored in auxiliary register address units ARAU0, ARAU1

DSPs: TI C54x Buses and Accesses

MEH420 Intro. To Embedded Systems

TI TMS320C55X

• Power efficient DSP processor targeted at feature-rich personal and portable applications (e.g. cellular phones)
• High performance through increased parallelism
• More built-in instructions and user-programmable parallel functions
• Automatic power management for all peripherals, memory arrays and CPU
• CISC

TMS320C55X - Architecture

• Provides dual MAC units, each capable of 17x17 multiplication and 40-bit addition and subtraction with optional saturation in a single cycle
• A 40-bit ALU with an additional 16-bit ALU for optimizing parallel operations
• Compare, Select and Store Unit
• Four general purpose 40-bit accumulators
• Supports 12 internal buses to perform up to three data reads and two data writes in a single cycle

TMS320C55X – Architecture Overview

• Instruction buffer performs 32-bit program fetches and stores instructions for the program flow unit (PU)
• PU decodes and directs the address-data flow unit (AU) and the data-computation unit (DU)
• Program fetches are performed using the 24-bit program read address bus (PAB) and the 32-bit program read data bus to deliver code to the IU
• Functional units read data from memory or I/O space via three 16-bit data read buses (A,B,C) with associated 24-bit data read address buses

TMS320C55X – Architecture Overview

• CPU uses a software stack that supports 16-bit and 32-bit push and pop
• Supports variable length instructions for improving code density
  • Instruction length can be 8, 16, 24, 32, 40, or 48 bits
• CPU consists of four functional units:
  • IU (instruction buffer unit)
  • PU (program flow unit)
  • AU (address-data flow unit)
  • DU (data-computation unit)

TMS320C55X – Architecture Overview

TMS320C55X – Architecture Overview

• Instruction Buffer Unit
  • CPU fetches 32-bit packets from memory
  • Instruction buffer queue of size 64 bytes
  • Decodes 1 to 6 bytes of code using the instruction decoder
  • Passes data to the PU, AU, DU for the execution of instructions
  • Instruction buffer queue supports a local block-repeat instruction that executes a block of code within the queue

TMS320C55X – CPU

[Figure: C55x CPU block diagram. Instruction unit, program flow unit, address unit and data unit connected by three 16-bit data read buses (B, C, D), three 24-bit data read address buses, a 24-bit program address bus, a 32-bit program read bus, two 16-bit data write buses and two 24-bit data write address buses. Labelled accesses: instruction fetch, single operand read, dual operand read, dual-multiply coefficient read, writes, data read from memory.]

TMS320C55X – Program Flow Unit

• PU receives instructions from the IU, generates all program space addresses and also controls the sequence of instructions
  • Interpreting conditions for conditional instructions
  • Determining go-to addresses
• PU initiates interrupt servicing, manages single repeat and block repeat operations, and manages execution of parallel instructions

TMS320C55X – Address-Data Flow Unit

• AU contains DAGEN and all registers needed to generate addresses for reads and writes to address space
• There are 8 auxiliary registers to be used as address pointers, and a coefficient data pointer register
• DAGEN supports both linear and circular addressing
• Uses a 16-bit ALU for addition, subtraction, comparison, arithmetic/logical shifts, etc.
• Includes four general purpose temporary registers
• Simpler operations run in parallel with the DU

TMS320C55X – Data Computation Unit

• 40-bit barrel shifter, 40-bit ALU, two MAC units, four 40-bit accumulators
• 40-bit ALU performs addition, subtraction, comparison, rounding, saturation, Boolean logic operations and absolute-value calculations
• It can perform two arithmetic operations simultaneously when a dual 16-bit instruction is executed
• Accumulators partitioned into a low word, a high word and 8 guard bits

TMS320C55X – Pipeline

• It has two main segments
• (multi-stage pipeline)

TMS320C55X – Pipeline

TMS320C55X – Pipeline

• If there is no pipeline flush, instructions are executed at very high speed
• Protected pipeline: introduces inactive cycles between instructions that would cause conflicts, such as reading and writing the same location out of the intended order
• Two ready-to-execute instructions in the pipeline might have data dependencies

TMS320C55X – Memory Space

• Uses a unified program/data space and a separate I/O space
• All 16 Mbytes of memory are addressable as program or data space (modified Harvard)
• CPU reads instructions from memory using program space addresses
• Data space consists of general purpose memory and MMRs (memory mapped registers, similar to PICs)
• CPU reads instructions using 24-bit addresses that are assigned to individual bytes

TMS320C55X – Data Access

• When the processor accesses data space, it uses 23-bit word addresses, so the LSB on the address bus is forced to zero and 16-bit words are accessed at even addresses
• Data space is divided into 128 main data pages and each data page has 64k words
• On data page 0, the first 96 words are reserved for MMRs

TMS320C55X – Instruction Sets

• Six types of instructions
  • Arithmetic
  • Bit manipulation
  • Logical
  • Move
  • Program control
  • Parallel instructions
    • Execution of two instructions in a single execution cycle: built-in, user-defined, combined

TMS320C55X – Addressing Modes for Parallelism

• Indirect addressing mode
  • Dual AR indirect-addressing mode for accessing two data memory words or for executing two instructions in parallel
• Coefficient indirect addressing mode
  • e.g. MPY Xmem, Cmem, ACx :: MPY Ymem, Cmem, ACy
  • This command allows two multiplications to be performed in parallel; the operand Cmem is common to both multiplications
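The data-space numbers above fit together as 128 pages x 64k words = 2^23 words, and a 23-bit word address shifted left by one gives the 24-bit byte address with a zero LSB. A small sketch, using hypothetical helper names:

```python
MMR_WORDS = 96                   # words reserved for MMRs on data page 0


def word_to_byte_address(word_addr):
    """23-bit word address -> 24-bit byte address (LSB forced to zero)."""
    assert 0 <= word_addr < (1 << 23)
    return word_addr << 1


def data_page(word_addr):
    """Split a word address into (main data page, offset within the page)."""
    return word_addr >> 16, word_addr & 0xFFFF


def is_mmr(word_addr):
    """True if the word lies in the MMR region at the start of page 0."""
    page, offset = data_page(word_addr)
    return page == 0 and offset < MMR_WORDS


byte_addr = word_to_byte_address(0x012345)
page, offset = data_page(0x012345)
```

The highest word address, 0x7FFFFF, falls on page 127, the last of the 128 main data pages.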

TMS320C55X – Parallel Instructions

• Built-in:
  • MPY *AR0, *CDP, AC0 :: MPY *AR1, *CDP, AC1 (exploiting dual MAC)
• User defined:
  • MPYM *AR+, *CDP, AC1 || XOR AR2, T1 (16-bit)
  • First instruction is executed by the DU and the second instruction by the AU, in parallel
  • Executed in parallel if the length of the parallel instruction does not exceed 6 bytes (because of the instruction buffer)
• Combination

TMS320C55X – On Chip Memory & Peripherals

• Two timers (16-bit), three serial ports, 8 general purpose digital I/O pins (5510)
• 16-bit Enhanced Host Port Interface (EHPI)
• On-chip memory: 32k ROM, 256k single access memory, 64k dual access memory
• EMIF for interfacing to different types of memory with 8-bit, 16-bit and 32-bit accesses
• DMA controller
• McBSPs are high speed full duplex serial ports with double buffered data registers that allow direct interfacing with other devices in the system
• Configurable instruction cache

Block Diagram of TMS320C5510

TMS320C55X – EMIF

• EMIF is configured by the CPU
• EMIF services data transfer requests from three internal resources
  • Program fetches from the CPU
  • Data accesses from the CPU
  • Data accesses from the on-chip DMA controller

TMS320C55X – EHPI

• 16-bit wide parallel port through which a host processor can directly access the DSP’s memory space
• DSP can be used as a co-processor with the help of the EHPI
• Communicates with memory via a dedicated auxiliary DMA channel and internal DMA buses that provide connectivity to the DSP’s entire internal memory and part of its external memory

TMS320C55X – McBSP

• Designed to interface to devices such as other DSP processors and CODECs
  • T1/E1 framers, MVIP framers, etc.
• Full duplex, double-buffered transmission, triple-buffered reception
• Works with interrupts or DMA
• Highly programmable

TMS320C55X – McBSP

• In order to make the application less real-time critical, the input is double buffered
• These buffers are called ping-pong buffers
• The configuration is that of a two-frame circular buffer
• First fill one buffer, then fill the other, then switch back to the first

[Figure: DMA filling the ping_RX and pong_RX buffers in turn]

TMS320C55X – McBSP Block Diagram

[Figure: McBSP block diagram. 16-bit peripheral bus, interrupt and DMA connections; registers for sync control and monitoring; registers for multichannel control and monitoring; compand; clock/frame sync; data rcv, xmit, clock.]

TMS320C55X – Power Management

• Six software programmable power domains: CPU, DMA, peripherals, clock generator, instruction cache, external memory interface
• Each domain can be put into low-power idle mode when its capabilities are not required
• User has the capability to control power consumption dynamically
• User-configurable IDLE domains allow programmer control of which hardware unit is shut down
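The ping-pong scheme can be sketched as a two-frame buffer in which one frame is being filled while the other is handed to the application. The class below is an illustration only: on the real device the DMA controller does the filling, and the name `PingPong` is made up.

```python
class PingPong:
    """Two-frame receive buffer: fill one frame while the other is processed."""

    def __init__(self, frame_len):
        self.frames = [[0] * frame_len, [0] * frame_len]   # ping and pong
        self.fill = 0      # index of the frame currently being filled
        self.pos = 0       # write position inside that frame

    def receive(self, sample):
        """Store one sample; return the completed frame when it is full, else None."""
        frame = self.frames[self.fill]
        frame[self.pos] = sample
        self.pos += 1
        if self.pos == len(frame):
            self.pos = 0
            self.fill ^= 1         # switch to the other frame
            return frame           # must be consumed before it is refilled
        return None


pp = PingPong(2)
done = []
for s in range(1, 6):
    full = pp.receive(s)
    if full is not None:
        done.append(list(full))    # process (here: copy) the finished frame
```

The application has one whole frame period to process a finished frame, which is what makes the timing less critical than handling each sample as it arrives.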

SIMD Instructions

• Multimedia data types are narrow: e.g. 8 bits per color, 16 bits per audio sample per channel
• 2-8 values can be stored per 32-bit or 64-bit register and operated on in parallel

SIMD Instructions

• HP Precision Architecture (old)
• Pentium MMX
  • 64-bit vectors representing 8 byte-encoded, 4 word-encoded or 2 double-word-encoded numbers
  • Wrap around/saturation options
• Ultra Sparc (by Sun Microsystems)
  • Visual Instruction Set (VIS)
• ARM v6

Architectural Variants

Very Long Instruction Word (VLIW) Processors

• VLIW: parallel operations (instructions) encoded in one long word (instruction packet), each instruction controlling one functional unit

Simple VLIW Architecture

• Large register file feeds multiple function units
• Increased compiler complexity

Clustered VLIW Architecture

• Register file and function units divided into clusters (e.g. 2 clusters)

Partitioned Register Files

• Many memory ports are required to supply enough operands per cycle
• Memories with many ports are expensive
• Registers are partitioned into, typically, 2 sets, e.g. for TI C60x

VLIW: TI C62/C67

• Up to 8 instructions/cycle
• 32 32-bit registers
• Functional units:
  • Two multipliers
  • Six ALUs
• Data operations:
  • 8/16/32-bit arithmetic
  • 40-bit operations
  • Bit manipulation operations
• All instructions execute conditionally

VLIW: TI C62/C67

[Figure: C62/C67 block diagram. Program RAM/cache (512K bits), data RAM (512K bits), JTAG, bus, timers, DMA, serial ports, PLL, and an execute stage with data path 1/register file 1 and data path 2/register file 2.]

C6x Datapath

• General purpose register files (A and B, 16 words each)
• Eight function units
  • .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2
• Two load units (LD1, LD2)
• Two store units (ST1, ST2)
• Two register file cross paths (1X and 2X)
• Two data address paths (DA1 and DA2)

C6x Function Units

• .L
  • 32/40-bit arithmetic
  • Leftmost 1 counting
  • Logical ops
• .S
  • 32-bit arithmetic
  • 32/40-bit shift and 32-bit field
  • Branches
  • Constants
• .M
  • 16x16 multiply
• .D
  • 32-bit add, subtract, circular address
  • Load, store with 5/15-bit constant offset

C6x System

SHARC (Analog Devices DSP)

• Harvard architecture
  • Internal memory split between program memory (PM) and data memory (DM)
• Memory organized as 32-bit words
• Large internal memory
• Floating point arithmetic
• 40-bit data registers
• Functional units: ALU, multiplier, shifter

SHARC Features

• More RISC-like
• Load-store architecture
  • Operands in registers before operation
• Two data address generators
  • Program memory, data memory
• Special loop instruction: zero overhead
• Loop counter stack for nested loops
• SHARC has a PC stack, 30 deep, for storing subroutine return as well as loop addresses

SHARC Features

• SHARC can support operations in parallel. It is a VLIW processor.
• Memory as well as operation parallelism
• Instruction set allows the CPU’s function units to be used in various combinations, as in
  • Fixed-point multiply-accumulate and add, subtract or average
  • Floating point multiplication and ALU operation
  • Multiplication and dual add-subtract

Philips TM-1 Philips Trimedia TM-1 Philips Trimedia TM-1 VLIW CPU

memory interface Characteristics register file • Special purpose DSP video in video out •VLIW audio in audio out • Floating point support read/write crossbar 2 • Sub-word parallelism I C serial FU1 ... FU27 support timers VLD co-p • Additional custom operations image co-p VLIW CPU slot 1 slot 2 slot 3 slot 4 slot 5 PCI

SoC

• An SoC consists of two or more complex micro-electronic macro components that were previously integrated on different single dies
• Complex functionalities that previously required heterogeneous components to be connected on a PCB are integrated within one single silicon chip

SoC

• Technologies implementing embedded systems evolved from microcontrollers and discrete components to fully integrated SoCs
• Reason: advances in silicon process technology enabling a complete system to be designed into one or a few integrated devices
  • Space and power reductions
  • Increased performance

SoC (System on Chip)

Features of SoC

• Typically an SoC incorporates
  • A programmable processor
  • On-chip memory
  • Accelerated function units (e.g. DES block, MPEG-2 decoder)
  • Peripheral devices
• Often mixed technology designs integrating
  • Analog, RF components
  • MEMS
  • Optical input/output

SoC Design

• Time and design effort required to integrate different types of components on a chip: a bottleneck for SoC evolution
• Design reuse to reduce time to market
  • Use of parts from previous designs
  • Making use of parts designed by third parties
• Hardware and software component model
  • All for PROVEN and tested solutions, avoiding re-design and re-verification of real-time hardware and real-time software

IP Based Design

• Intellectual Property Cores
  • Parameterized components with standard interfaces facilitating high level synthesis (e.g. ARM)
• Cores available in three forms
  • Hard: black box in optimized layout form and encrypted simulation model. Example:
  • Firm: synthesized netlist which can be simulated and changed if needed
  • Soft: register transfer level (RTL) HDLs; user is responsible for synthesis and layout

Platforms

• Embedded applications built using
  • Common architectural blocks and
  • Customized application specific components
• Common architectures
  • Processor, memory, peripherals, bus structures
• Common architectures and supporting technologies (IP libraries and tools) are called platforms and platform based designs

Platform based SoC

• Platform based SoCs are systems that contain
  • IP blocks like embedded CPU, embedded memory
  • Real world interfaces (e.g. PCI, USB)
  • Mixed signal blocks and
  • Software components: device drivers, RTOS and application code

Classes of Platforms

• Full Application Platform
  • Platforms that let derivative product designers create complete applications on top of hardware-software architectures
  • A set of hardware modules: e.g. complex dual processor architecture with hierarchical bus system tailored to a specific product’s requirements
  • A layer of firmware and driver software
  • Examples: Philips Nexperia, TI’s OMAP

Classes of Platforms

• Processor Centric Platforms
  • Typically centered on specific processors
  • Key software services like RTOS kernel made available through libraries
  • Examples: ARM Micropack, ST’s ST100
• Communication Centric Platforms
  • Communication fabric optimized for specific applications
  • Fabrics often bundled with specific processors
  • Examples: ARM AMBA, IBM CoreConnect bus architecture

Classes of Platforms

• Configurable (Programmable) platform
  • Programmable logic added to the platform allows consumers to customize using both hardware and software
  • Field programmable gate array (FPGA) added to hard-coded processor centric platforms
  • Examples: Altera Excalibur platform with ARM cores, Virtex II Pro (PowerPC)

Multi-processor SoC (MPSoC)

• Full application platform
• Multiple processors
  • CPUs, DSPs, etc.
  • Hardwired blocks
• Mixed-signal
• Custom memory systems
• Lots of software

Philips Nexperia TI OMAP ST Nomadik

• Targets mobile • Targets OMAP 5910: multimedia. • Multimedia communications, ARM9 application: multimedia. C55x DSP • A multiprocessor-of- STB, etc MPU multiprocessors. • Multiprocessor with bridge system

• 2 CPU, 3 interface Memory

DSP, RISC. MMU I/O bridges busses, • It is aimed at 2.5G/3G several Memory ctrl System I/O mobile phones, Audio Video accelerators, DMA personal digital accelerator accelerator I/O devices control assistants and other ARM9 portable wireless heterogeneous products with multiprocessors multimedia capability

-58- -59- -60-

TI OMAP Design Chain for OMAP OMAP Hardware Architecture

• OMAP: Open Multimedia Applications Platform • OMAP application processor has dual core architecture ARM9 + TMS320C55 • OMAP design chain includes • Software IP: OMAP supports several RTOS’s to suit different applications • Application and Middleware: Ported applications and middleware like MPEG-4 decoding and audio playback.

OMAP Hardware Architecture

• Both processors utilize an instruction cache to minimize external accesses
• Both cores use an MMU for virtual to physical memory translation and task-to-task memory protection
• Uses two external memory interfaces and one internal memory port
  • External interfaces support synchronous (DRAMs) or asynchronous (SRAM, FLASH) memories
  • Internal memory port for on-chip memory access for critical OS routines or LCD frame buffer
• Allows concurrent access from either processor or the DMA unit

Example Application

• Video conferencing
  • C55x DSP can process full video conferencing applications in real time (audio and video at 15 fps) using only 40% of the available computational capability
  • ARM processor can handle OS operations and other OS applications
  • Less power consumption on the whole

How the Architecture Works

• ARM RISC core is well suited for control code (OS, user interface, OS applications)
• DSP is best suited for signal processing applications like video, speech processing, audio
• Power efficient because a signal processing task on the DSP consumes much less power than on the ARM

Peripherals

• Includes numerous interfaces to connect peripherals or external devices from either the DSP or the GPP
• Some interfaces
  • Camera and display: serial unidirectional compact camera port, 8-bit parallel interface, 8/16-bit bi-directional display interface, OMAP internal LCD controller
  • Several serial interfaces: SPI, McBSP, I2C, USB, UART

Software Architecture

• Defines an interface scheme that allows the GPP to be the system master
  • Called the DSP/BIOS Bridge
• DSP/BIOS Bridge provides communications between GPP tasks and DSP tasks
• High level application developers use a set of DLLs and drivers

OMAP2

• Includes multiple engines executing multiple tasks
  • An ARM11-based processor runs the OS and performs supervisory control
  • DSP core focuses on audio codecs, echo cancellation and noise suppression
  • 3D graphics engine enables sophisticated graphics rendering
  • Video/imaging accelerator handles streaming MPEG4 video and megapixel-resolution camera
  • Digital baseband processor implements communications as a cellular modem handling voice and data

OMAP2 OMAP2

• All blocks operate simultaneously • No degradation in quality of any service • Devices remain highly responsive • To conserve power each of these subsystems can be shut down when not used • SoC suited for implementation of Smart Phones

Digital Media Processor

• Functionalities expected in a portable media system
  • Live preview: captures, processes, displays
  • Live image/video/audio capture: compresses
  • Image/video/audio decode/playback
  • Photo printing
• Several of these modes operate concurrently

DM 310 Media Processor

DM 310 Architecture

• Four subsystems: imaging/video, DSP, coprocessor, ARM core
• Imaging/video system: CCD controller, preview engine, OSD, video encoder
• DSP: TMS320C54x operating at 72 MHz performs the bulk of audio/video processing operations
• Co-processors: SIMD engine (8/16-bit), quantization, VLC (e.g. Huffman), working concurrently
• ARM core: manages system level tasks, controls on-chip components except the DSP and its co-processors

Application: Still Camera Engine Configurable SoC

• Consisting of • Processor • Memory • On-chip reconfigurable hardware parts for Reconfigurable Platforms customization to application • Fine-grained and coarse-grained reconfigurability • FPGA vs network of processors • Towards application specific programmable products

-76- -77- -78-

Reconfigurable Computing (RC)

• Compute by building a circuit rather than executing instructions
• Efficient for long running computations
  • Video and image processing
  • DSP
  • Network processing

Advantages of RC

Example: Z[i] = a.X[i] + b.Y[i]

    //program
    Load rx, X
    Mpy r1, rx, ra
    Load ry, Y
    Mpy r2, ry, rb
    Add r3, r1, r2
    Store r3, Z

• Program
  • No instruction fetch
  • No I-cache, etc.
• Bit width and constraints
  • Assume X and Y are 8 bits
  • Assume a=0.25 and b=0.5
  • Much smaller circuit
• Delay
  • From two shift operations and one addition, all on 32 bits
  • To one 8-bit addition (shifts are free in hardware)
• Structure

[Figure: dataflow circuit. 8-bit X and Y are scaled by a and b to 6-bit and 7-bit values and added to give the 8-bit result Z.]

FPGA-based RC

• Programmable fabric that can be dynamically reconfigured
• In the last 10 years the growth of FPGA speed and density has exceeded that of CPUs
• Mapping to FPGA
  • Only the time consuming computations are mapped
  • Computation expressed in HDL (VHDL/Verilog)
• FPGA + memory on a peripheral board
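The bit-width argument in the example (a = 0.25, b = 0.5, 8-bit X and Y) can be checked with a short sketch. It assumes unsigned 8-bit inputs and truncating multiplication; the function names are illustrative.

```python
def z_generic(x, y, a=0.25, b=0.5):
    """What the instruction sequence computes: two multiplies and an add."""
    return int(a * x) + int(b * y)


def z_shift_add(x, y):
    """What the circuit needs: shifts are only wiring, so one adder remains."""
    assert 0 <= x < 256 and 0 <= y < 256
    return (x >> 2) + (y >> 1)     # 6-bit value + 7-bit value


# Exhaustive check over all 8-bit inputs.
same = all(z_generic(x, y) == z_shift_add(x, y)
           for x in range(256) for y in range(256))
largest = z_shift_add(255, 255)
```

Since the largest possible result is below 256, the single adder in the circuit never needs more than 8 bits.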

-79- -80- -81- FPGA-based RC Triscent A7 SoC Xilinx Virtex II Pro

PowerPC based Up to 16 serial transceivers • 420 Dhrystone MIPS • 622 Mbps to 3.125 Gbps at 300 MHz • Several products CSL: PowerPCs incorporate performs • 1 to 4 PowerPCs microprocessor basic • 4 to 16 gigabit combinational transceivers and FPGA on one and • 12 to 216 multipliers chip sequential • 3,000 to 50,000 logic logic Config. logic functions cells • 200k to 4M bits RAM • 204 to 852 I/O Courtesy of Xilinx $100-$500 (>25,000 units) -82- -83- -84-

Course Grained RC: Multiple ALUs Connected XPP: eXtreme Processing Platform Configurable Processors

• Operand routing with a hierarchical • Configurability: connection network • Processor parameters (cache size, • Registers are distributed registers, etc.) • Configure once and then run (no I-cache) • Instructions • Potentially an instruction level parallelism of • Result: 100 and more • HDL model for processor • No branch instruction • Software development environment • Adaptive reconfigurable data processing architecture • Processing array elements organized as processing arrays

-85- -86- -87-

Application-Specific Instruction Processors (ASIP)

• An ASIP is stored-memory CPU whose architecture is tailored for a particular set of applications • Programmability allows changes to implementation, use in several different products, high datapath utilization • Application-specific architecture provides smaller silicon area, higher speed • Retargetable compilation

STM32 – 32-bit Cortex™-M MCUs

What does a developer want in an MCU?

• Rich choice of tools
• Leading edge core
• Software libraries
• Advanced peripherals
• Scalable device portfolio
• Cost sensitive
• Ultra-low-power

STM32 product family key features

STM32 – a comprehensive platform

Select your fit product inside a wide, compatible portfolio

• Cortex™-M3/M4/M0 – High performance
• Over 300 pin-to-pin compatible devices
• Flash size: 16 Kbytes to 1 Mbyte
• Packages: 32 pins to 176 pins

STM32 product lines – 6 product series

STM32 F4 series block diagram

The full range of Cortex-M processors

• Bypass traditional 8/16/32-bit classifications and get:
  • Seamless architecture across all applications
  • Every product optimized for ultra-low power and ease of use

• Cortex-M0: 8/16-bit applications
• Cortex-M3: 16/32-bit applications
• Cortex-M4: 32-bit/DSC applications
• Binary and tool compatible

Cortex-M processors binary compatible

STM32 applications

• Industrial: PLC; inverters; printers, scanners; industrial networking; solar inverters
• Medical: glucose meters; portable medical care; VPAP, CPAP; patient monitoring
• Buildings and security: alarm systems; access control; HVAC; power meters
• Appliances: 3-phase motor drive; application control; user interfaces; induction cooking
• Consumer: home audio; gaming; PC peripherals; digital cameras, GPS

STM32 tools

• Starter and promotion kits
• Numerous boards
• More than 15 different development IDE solutions
• More than 25 different RTOS and stack solution providers

STM32 promotion kits

• Motor control kit
• STM32 W evaluation kit: STM32W-SK
• STM32-ComStick
• 4 Discovery kits
• EvoPrimer
• Evaluation boards: STM320518-EVAL, STM3240G-EVAL and STM32L152D-EVAL

Free software solutions from ST

• Standard peripheral library
• USB device library
• Motor control library
• Self-test routines for EN/IEC 60335-1 Class B
• Encryption library
• DSP library
• SPEEX codec
• STM32 audio engine

STM32 today: a comprehensive platform

• High-performance MCUs with DSP and FPU: Cortex-M4, 168 MHz / 210 DMIPS
• High-performance MCUs: Cortex-M3, 120 MHz / 150 DMIPS
• Mainstream MCUs: Cortex-M3, 72 MHz / 61 DMIPS
• Entry-level MCUs: Cortex-M0, 48 MHz / 38 DMIPS
• Ultra-low-power MCUs: Cortex-M3, 32 MHz / 33 DMIPS
• Wireless MCUs: Cortex-M3, 24 MHz / 30 DMIPS

Thank you

www.st.com/stm32