The Anatomy of the ARM Cortex-M0+ Processor

The Anatomy of the ARM Cortex-M0+ Processor Joseph Yiu Embedded Technology Specialist 1 What is the Cortex-M0+ Processor? . 2009 – ARM® Cortex™-M0 processor released . Low gate count . High performance . Easy to use . Debug features . 2012 – Cortex-M0+ processor released . Same instruction set . Supports all existing features of Cortex-M0 . New features . Higher energy efficiency . Ready for future applications 2 What’s new? . Even better power efficiency . Clean sheet design – 2 stage pipeline . Better performance at the same frequency . Unprivileged execution level . 8 region Memory Protection Unit (MPU) . Faster I/O accesses . Vector table relocation . Low cost trace solution available . Various silicon integration features (e.g.16-bit flash support) 3 Why a New Design? Energy is the Key . Embedded products need even longer battery life . Need to have lower active power . But not compromise on performance . Low power control applications . Need to have faster I/O capability . But not higher operating frequency . Smarter designs . Need more sophisticated features . But not bigger silicon 4 Overview of the Cortex-M0+ Processor . Processor . ARMv6-M architecture . Easy to use, C friendly . Cortex-M series compatibility . Nested Vectored Interrupt Controller (NVIC) . Flexible interrupt handling . WIC support . Memory Protection Unit (MPU) . Debug from just 2 pins 5 Compact Instruction Set . Only 56 Instructions . 100% compatible with existing Cortex-M0 processor . Mostly 16-bit instructions . All instructions operate on the 32-bit registers . Option for single cycle 32x32 Maximum reuse of multiply Upwardexisting compatibility tools to andthe ARM Cortexecosystem-M3/Cortex -M4 6 Interrupt Handling . Nested Vectored Interrupt Cortex-M0+ Controller (NVIC) NMI NVIC Core . Interrupt prioritization . Interrupt masking Up to 32 IRQs System . Nested interrupt handling exceptions SysTick . Ease of use ARMv6-M 7 6 0 . Interrupt handlers in C Priority level register . Processor hardware Higher priority handles stacking -2 NMI . No hidden software -1 HardFault 0x00 overhead 0x40 . CMSIS-Core functions for 0x80 IRQs NVIC control 0xC0 7 Low Power Processor Design . Minimizing power at every opportunity . Small silicon area (from 12K gates) . Various low power techniques (clock gating, power gating, SRPG, etc) . 2-stage pipeline processor for maximum energy efficiency . Reduce ratio between flip-flops and combinatorial logic . Lower average CPI (Cycles Per Instruction) Pre-decode Main decode Clock Instruction Fetch Decode Execute #N Instruction #N+1 Fetch Decode Execute 8 Sleep Modes . Architecture defined sleep modes . Normal sleep . Deep sleep . Deep sleep with SRPG support (using WIC) – nW power profile with instant wakeup (processor power down with state retention) . Can be extended with MCU specific power control registers State Leakage Power Leakage Retention + some Active only Off only dynamic Leakage + Deep Sleep Sleep dynamic (WIC) Deep Sleep Power Power consumption (WIC+SRPG) Power Off Not To scale 9 Low Power Features . Wakeup Interrupt Controller (WIC) . Detect interrupt while the processor is powered down . Enables SRPG deep sleep operation with instant wakeup . Sleep-on-exit . Enables the processor to sleep automatically when all interrupt services are ISR exit complete Active ISR . Ideal for interrupt driven Mode Sleep Mode Sleep Mode application SLEEPONEXIT bit set WFI or WFE 10 Minimizing Flash Accesses . Program memory access (e.g. Flash) generally on alternate cycles . Typical execution of contiguous code shown here . A 32-bit fetch gives two 16-bit instructions 11 Smaller Branch Shadow . In pipelined processors, subsequent instructions are fetched while executing current instructions (“prefetching”) Maximum branch shadow Branch Taken is 2 instructions (1 word) and minimum is 0 instruction label BGE label ADD CMP (branch) Fetches as 16-bit Program flow Branch transfer if branch target shadow is unaligned . Branch shadow means energy wasted if branch is taken . Length of branch shadow depends on the alignment of branch instructions . By moving to a two stage pipeline, the branch shadow is reduced 12 Cortex-M0+ : The Ultimate in Low Power . The most energy efficient 32-bit processor ever designed . Bringing down processor consumption as low as 9µA/MHz* Up to 30% lower power than Cortex-M0 *TSMC90LP, 1.2V, . min. configuration 180ULL 90LP 40G (7-track, typical 1.8v, 25C) (7-track, typical 1.2v, 25C) (9-track, typical 0.9v, 25C) Area Power Area Power Area Power (Dhrystone loop) (Dhrystone loop) (Dhrystone loop) Minimum Config* 0.13mm2 52µW/MHz 0.04mm2 11µW/MHz 0.01mm2 3µW/MHz . Enabling our partners to develop smaller, smarter and energy friendly solutions for : . Pervasive embedded intelligence . The upcoming “Internet of Things” . Ultimately more electronics and a reduced energy footprint 13 Small and Powerful . Benchmarks . 1.77 CoreMark/MHz . 0.93 DMIPS/MHz . Lightweight DSP capable Function Block size 50MHz Runtime FIR Q15 32 (32 taps) 0.59 ms FIR Fast Q15 32 (32 taps) 0.45 ms Biquad Cascade Q15 32 (4 stages) 0.30 ms Biquad Cascade Fast Q15 32 (4 stages) 0.20 ms CFFT Radix4 Q15 64 0.24 ms . Good fit for entry-level real-time control 14 High-Code Density . Smaller flash size needed, leads to: CoreMark Code in kB Optimized cost 12 . 11.022 . Lower power consumption 10 Smaller chip packages in low . 8 pin count devices (sensors, 6.446 6 5.418 mobile equipments, medical 4.896 applications, etc) 4 2 0 Cortex-M0+ PIC24 PIC18 RL78 CSP16 (2x2mm) QFN20 (3x3mm) 15 Single Cycle I/O Interface Address . 32-bit simple bus protocol decoder . Supports 32-/16-/8-bit GPIO transfers Address belongs to I/ O I /O Interface . Best suited for accessing interface Peripherals . GPIO (Fast Accesses) . Peripheral registers Address outside I/ O . Memory mapped – interface Peripherals programmed just like normal AHB ROM peripherals . Address range(s) defined by RAM silicon designers . Optional 16 Faster I/O Accesses . Advantages for applications: . Faster GPIO operations, save precious cycles . Better energy efficiency in I/O intensive applications . Faster response in FSM replacement applications . Example : Controlling an LCD module with GPIO Character output loop – 1 letter/iteration Data Clock cycles per loop Cortex-M0+ Strobe . MSP430 - 12 cycles . 78K - 10 cycles Data “A” “B” “C” . 8051* - 15 cycles Strobe . Cortex-M0+ - 9 cycles * Optimized assembly code 17 Memory Protection Unit (MPU) . Prevents application task MPU Memory from corrupting OS and other tasks’ data I/O #2 I/O #1 . Improves system reliability I/O #0 Data for Up to 8 configurable Task B OS kernel . (unprivileged) regions Data for . Address Task C . Size Data for Task B . Memory attributes Data for Access permissions . OS kernel Task A (Privileged) MPU configuration 18 Vector Table Relocation . Vector Table Offset Register (VTOR) Memory 0xFFFFFFFF . Easier system level design (no need to use System Control Space (e.g. NVIC) memory remapping) . Flexible vector table configuration . Relocate vector table to Peripherals other locations in flash / SRAM SRAM . Boot loader . Exception vector reconfiguration at runtime Flash 0x00000000 Vector table 19 Easier Silicon Integration . Highly configurable . Verilog parameters to include features you want e.g. Number of IRQ . Debug features . I/O interface ... New features . Power . Start-up delay control Management Application Subsystem (CPUWAIT) Processor CPUWAIT 16-bit flash support (e.g. Cortex-A9) Cortex -M0+ . Cortex-M System Design Kit SRAM Interconnect (CMSDK) support available Control System logic soon Memory control Power control 20 Cortex-M System Design Kit (CMSDK) . Fast track your design process with CMSDK . Easy to use design kit with example system designs . Designer can simply plug-in their processor and go! . Essential AMBA® interconnects and peripherals . Software support - Keil™ examples and CMSIS drivers Optimised for low area, low power, low latency 21 Summary – Cortex-M0+ Maximizing energy efficiency Low-cost, powerful debug and trace Debug Microcontroller connector Program execution info Processor AHB interface MTB controller RAM interface Application SRAM Data + Trace Data Faster I/O accesses Maintaining 100% compatibility to enable an immediate ecosystem Cortex-M0+ Strobe Data Data “A” “B” “C” Strobe 22 Thank you 23 .

The Anatomy of the ARM Cortex-M0+ Processor

08-02-2021 Agenda Packet.Pdf

45-Year CPU Evolution: One Law and Two Equations

Cuda C Best Practices Guide

8-Bit All Flash 78K0 Microcontrollers 78K0S

Multi-Cycle Datapathoperation

CS2504: Computer Organization

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗

A Performance Analysis Tool for Intel SGX Enclaves

Familias De Microcontroladores

ESC-470: ARM 9 Instruction Set Architecture with Performance

A Characterization of Processor Performance in the VAX-1 L/780

Autotuning GPU Kernels Via Static and Predictive Analysis