The Cortex-M Chapter Series: Hardware 2 and Software

Introduction In this chapter the real-time DSP platform of primary focus for the course, the Cortex M4, will be introduced and explained. in terms of hardware, software, and development environments. Beginning topics include: • ARM Architectures and Processors – What is ARM Architecture – ARM Processor Families – ARM Cortex-M Series – Cortex-M4 Processor – ARM Processor vs. ARM Architectures • ARM Cortex-M4 Processor – Cortex-M4 Processor Overview – Cortex-M4 Block Diagram – Cortex-M4 Registers

ECE 5655/4655 Real-Time DSP 2–1 Chapter 2 • The Cortex-M Series: Hardware and Software What is ARM Architecture • ARM architecture is a family of RISC-based processor archi- tectures – Well-known for its power efficiency; – Hence widely used in mobile devices, such as smart phones and tablets – Designed and licensed to a wide eco-system by ARM • ARM Holdings – The company designs ARM-based processors; – Does not manufacture, but licenses designs to semiconduc- tor partners who add their own Intellectual Property (IP) on top of ARM’s IP, fabricate and sell to customers; – Also offer other IP apart from processors, such as physical IPs, interconnect IPs, graphics cores, and development tools

2–2 ECE 5655/4655 Real-Time DSP ARM Processor Families ARM Processor Families

• Cortex-A series (Application) Cortex-A57 Cortex-A53 – High performance processors Cortex-A15 Cortex-A9 Cortex-A Cortex-A8 capable of full Operating Sys- Cortex-A7 Cortex-A5

tem (OS) support; Cortex-R7 Cortex-R5 Cortex-R – Applications include smart- Cortex-R4 Cortex-M4 New!: Cortex-M7, Cortex-M33 phones, digital TV, smart Cortex-M3 Cortex-M1 Cortex-M Cortex-M0+ books, home gateways etc. Cortex-M0 New!: Cortex-M23 (no DSP)

SC000 • Cortex-R series (Real-time) SC100 SecurCore SC300 – High performance for real- ARM11 ARM9 Classic time applications; ARM7

– High reliability AsAs of of Jan Dec 2017 2013 – Applications include automotive braking system, power- trains etc. • Cortex-M series () – Cost-sensitive solutions for deterministic microcontroller applications; – Applications include , mixed signal devices, smart sensors, automotive body electronics and airbags; more recently IoT • SecurCore series – High security applications. • Previous classic processors: Include ARM7, ARM9, ARM11 families

ECE 5655/4655 Real-Time DSP 2–3 Chapter 2 • The Cortex-M Series: Hardware and Software

ARM Families and Architecture Over Time1

1. J. Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd edi- tion, Newnes 2014.

2–4 ECE 5655/4655 Real-Time DSP Relative Performance

Relative Performance1

Design an ARM-based SoC • Select a set of IP cores from ARM and/or other third-party IP vendors • Integrate IP cores into a single chip design • Give design to semiconductor foundries for chip fabrication

IP libraries SoC

Cortex-A9 Cortex-R5 Cortex-M4 ARM ROM RAM processor ARM7 ARM9 ARM11 System bus ARM-based DRAM ctrl FLASH ctrl SRAM ctrl SoC Peripherals AXI bus AHB bus APB bus

External Interface GPIO I/O blocks Timer

Licensable IPs SoC Design

1. https://www.arm.com/products/processors/cortex-m

ECE 5655/4655 Real-Time DSP 2–5 Chapter 2 • The Cortex-M Series: Hardware and Software ARM Cortex-M Series • Cortex-M series: Cortex-M0, M0+, M1, M3, M4, M7, M33. • Energy-efficiency – Lower energy cost, longer battery life • Smaller code – Lower silicon costs • Ease of use – Faster software development and reuse • Embedded applications – Smart metering, human interface devices, automotive and industrial control systems, white goods, consumer prod- ucts and medical instrumentation, IoT

ARM Processors vs. ARM Architectures • ARM architecture – Describes the details of instruction set, programmer’s model, exception model, and memory map – Documented in the Architecture Reference Manual • ARM processor – Developed using one of the ARM architectures – More implementation details, such as timing information

2–6 ECE 5655/4655 Real-Time DSP Companies Making ARM Chips

– Documented in processor’s Technical Reference Manual

ARMv4/v4T ARMv5/ v4E ARMv6 ARMv7 ARMv8 Architecture Architecture Architecture Architecture ARMv7-A Architecture ARMv8-A e.g. Cortex-A9 e.g. Cortex-A53 Cortex-A57 ARMv7-R e.g. Cortex-R4 ARMv8-R

ARM v6-M ARMv7-M e.g. Cortex-M0, M1 e.g. Cortex-M4

e.g. ARM7TDMI e.g. ARM9926EJ-S e.g. ARM1136 As of Dec 2013

Companies Making ARM Chips • Apple, AppliedMicro, Microchip (), Broadcom, , , NXP, Samsung Electronics, ST Microelectronics, and (http://en.wiki- pedia.org/wiki/ARM_architecture) • More — (Zynq), ...

ECE 5655/4655 Real-Time DSP 2–7 Chapter 2 • The Cortex-M Series: Hardware and Software ARM Cortex-M Series Family Point Floating Optional DSP Extensions Math Saturated NoNo No SoftwareNo No No Software No No Software No Divide Hardware Multiply Hardware -2 ® Thumb ® Entire EntireEntire 1 cycle Entire Yes 1 cycle Yes Yes Software Yes No Hardware Thumb Core Architecture ARM rchitecture ARMv6-M Neumann Von Most Subsetcycle or 32 1 ARMv6-M Neumann Von ARMv6-M Most Neumann Von SubsetARMv7-M Mostcycle or 32 1 Harvard Subsetcycle or 33 3 ARMv7E-M Harvard A

2–8 ECE 5655/4655 Real-Time DSP Cortex-M4 Processor Overview Cortex-M4 Processor Overview • Cortex-M4 Processor – Introduced in 2010 – Designed with a large variety of highly efficient signal pro- cessing features – Features extended single-cycle multiply accumulate instructions, optimized SIMD arithmetic, saturating arith- metic and an optional Floating Point Unit. • High Performance Efficiency – 1.25 DMIPS/MHz (Dhrystone Million / MHz) at the order of µWatts / MHz • Low Power Consumption – Longer battery life – especially critical in mobile products • Enhanced Determinism – The critical tasks and interrupt routines can be served quickly in a known number of cycles

ECE 5655/4655 Real-Time DSP 2–9 Chapter 2 • The Cortex-M Series: Hardware and Software Cortex-M4 Processor Features • 32-bit Reduced Instruction Set Computing (RISC) processor • Harvard architecture – Separated data bus and instruction bus • Instruction set – Include the entire Thumb®-1 (16-bit) and Thumb®-2 (16/ 32-bit) instruction sets1

• 3-stage + branch speculation pipeline • Performance efficiency – 1.25 – 1.95 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz)

1. J. Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd edition, Newnes, 2014.

2–10 ECE 5655/4655 Real-Time DSP Cortex-M4 Processor Features

• Supported Interrupts – Non-maskable Interrupt (NMI) + 1 to 240 physical inter- rupts – 8 to 256 interrupt priority levels • Supports Sleep Modes – Up to 240 Wake-up Interrupts – Integrated WFI (Wait For Interrupt) and WFE (Wait For Event) Instructions and Sleep On Exit capability (to be covered in more detail later) – Sleep & Deep Sleep Signals – Optional Retention Mode with ARM Power Management Kit • Enhanced Instructions – Hardware Divide (2-12 Cycles) – Single-Cycle 16, 32-bit MAC, Single-cycle dual 16-bit MAC – 8, 16-bit SIMD arithmetic • Debug – Optional JTAG & Serial-Wire Debug (SWD) Ports – Up to 8 Breakpoints and 4 Watchpoints • Memory Protection Unit (MPU) – Optional 8 region MPU with sub regions and background region

ECE 5655/4655 Real-Time DSP 2–11 Chapter 2 • The Cortex-M Series: Hardware and Software

• Cortex-M4 processor is designed to meet the challenges of low dynamic power constraints while retaining light foot- prints – 180 nm ultra low power process –157 µW/MHz – 90 nm low power process – 33 µW/MHz – 40 nm G process – 8 µW/MHz

ARM Cortex-M4 Implementation Data

180ULL 90LP 40G Process (7-track, typical 1.8v, 25C) (7-track, typical 1.2v, 25C) 9-track, typical 0.9v, 25C)

Dynamic Power 157 µW/MHz 33 µW/MHz 8 µW/MHz

Floorplanned Area 0.56 mm2 0.17 mm2 0.04 mm2

2–12 ECE 5655/4655 Real-Time DSP Cortex-M4 Block Diagram Cortex-M4 Block Diagram Optional Embedded Wire Viewer Wire Optional Serial Optional Trace Macrocell SRAM and SRAM peripheral interface peripheral Data Optional Optional watchpoints Bus matrix Bus Optional FPU Optional protection unit protection Processor core Processor Optional Memory Memory Optional Flash patch Optional Code interface Code Debug (NVIC) Interrupt Optional Controller Access Port Access Nested Vector Nested WIC Optional ARM Cortex-M4 ARM Cortex-M4

ECE 5655/4655 Real-Time DSP 2–13 Chapter 2 • The Cortex-M Series: Hardware and Software Cortex-M4 Block Diagram (cont.) • Processor core – Contains internal registers, the ALU, data path, and some control logic – Registers include sixteen 32-bit registers for both general and special usage • Processor pipeline stages – Three-stage pipeline: fetch, decode, and execution – Some instructions may take multiple cycles to execute, in which case the pipeline will be stalled – The pipeline will be flushed if a branch instruction is exe- cuted – Up to two instructions can be fetched in one transfer (16- bit instructions)

Instruction 1 Fetch Decode Execute

Instruction 2 Fetch Decode Execute

Instruction 3 Fetch Decode Execute Instruction 4 Fetch Decode Execute

Time

• Nested Vectored Interrupt Controller (NVIC) – Up to 240 interrupt request signals and a non-maskable interrupt (NMI)

2–14 ECE 5655/4655 Real-Time DSP Cortex-M4 Block Diagram (cont.)

– Automatically handles nested interrupts, such as compar- ing priorities between interrupt requests and the current priority level • Wakeup Interrupt Controller (WIC) – For low-power applications, the microcontroller can enter sleep mode by shutting down most of the components. – When an interrupt request is detected, the WIC can inform the power management unit to power up the system. • Memory Protection Unit (optional) – Used to protect memory content, e.g. make some memory regions read-only or preventing user applications from accessing privileged application data • Bus interconnect – Allows data transfer to take place on different buses simul- taneously – Provides data transfer management, e.g. a write buffer, bit- oriented operations (bit-band) – May include bus bridges (e.g. AHB-to-APB bus bridge) to connect different buses into a network using a single global memory space – Includes the internal bus system, the data path in the pro- cessor core, and the AHB LITE interface unit

ECE 5655/4655 Real-Time DSP 2–15 Chapter 2 • The Cortex-M Series: Hardware and Software

• Debug subsystem – Handles debug control, program breakpoints, and data watchpoints – When a debug event occurs, it can put the processor core in a halted state, where developers can analyse the status of the processor at that point, such as register values and flags

Cortex-M4 Registers • Processor registers – The internal registers are used to store and process tempo- rary data within the processor core – All registers are inside the processor core, hence they can be accessed quickly – Load-store architecture – To process memory data, they have to be first loaded from memory to registers, processed inside the processor core using register data only, and then written back to memory if needed • Cortex-M4 registers – Register bank – Sixteen 32-bit registers (thirteen are used for general-pur- pose); – Special registers

2–16 ECE 5655/4655 Real-Time DSP Cortex-M4 Registers (cont.) Cortex-M4 Registers (cont.)

Register bank R0 R1 R2 R3 Low R4 Registers R5 General purpose R6 register R7 R8 R9 R10 High Registers R11 R12 MSP Stack Pointer (SP) R13(banked) Main Stack Pointer Link Register (LR) R14 PSP Program Counter (PC) R15 Process Stack Pointer

Special registers Program Status Registers (PSR) x PSR APSR EPSR IPSR

PRIMASK Application Execution Interrupt PSR PSR PSR Interrupt mask register FAULTMASK BASEPRI Stack definition CONTROL

• R0 – R12: general purpose registers – Low registers (R0 – R7) can be accessed by any instruction – High registers (R8 – R12) sometimes cannot be accessed e.g. by some Thumb (16-bit) instructions • R13: Stack Pointer (SP) – Records the current address of the stack – Used for saving the context of a program while switching between tasks

ECE 5655/4655 Real-Time DSP 2–17 Chapter 2 • The Cortex-M Series: Hardware and Software

– Cortex-M4 has two SPs: Main SP, used in applications that require privileged access e.g. OS kernel, and exception handlers, and Process SP, used in base-level application code (when not running an exception handler) • Program Counter (PC) – Records the address of the current instruction code – Automatically incremented by 4 at each operation (for 32- bit instruction code), except branching operations – A branching operation, such as function calls, will change the PC to a specific address, meanwhile it saves the current PC to the Link Register (LR)

Data Data

PUSH POP

Low

Stack Address

SP High PC Heap

Code

2–18 ECE 5655/4655 Real-Time DSP Cortex-M4 Registers (cont.)

• R14: Link Register (LR) – The LR is used to store the return address of a subroutine or a function call – The program counter (PC) will load the value from LR after a function is finished

Current PC Current LR PC LR

1. Save current Main Main PC to LR C Program o Program C code d code o LR e Load PC with the d r e e address in LR to g er oi return to the main g 2. Load PC with n program oi the starting n address of the subroutine subroutine subroutine Current PC PC

Call a subroutine Return from a subroutine to the main program • xPSR, combined Program Status Register – Provides information about program execution and ALU flags – Application PSR (APSR) – Interrupt PSR (IPSR) – Execution PSR (EPSR)

APSR N Z C V Q Reserved

IPSR Reserved ISR number

EPSR ICI/IT T Reserved ICI/IT

xPSR N Z C V Q ICI/IT T Reserved ICI/IT ISR number

bit31 bit24 bit16 bit8 bit0

ECE 5655/4655 Real-Time DSP 2–19 Chapter 2 • The Cortex-M Series: Hardware and Software

• APSR – N: negative flag – set to one if the result from ALU is neg- ative – Z: zero flag – set to one if the result from ALU is zero – C: carry flag – set to one if an unsigned overflow occurs – V: overflow flag – set to one if a signed overflow occurs – Q: sticky saturation flag – set to one if saturation has occurred in saturating arithmetic instructions, or overflow has occurred in certain multiply instructions •IPSR – ISR number – current executing interrupt service routine number •EPSR – T: Thumb state – always one since Cortex-M4 only sup- ports the Thumb state (more on processor states in the next module) – IC/IT: Interrupt-Continuable Instruction (ICI) bit, IF- THEN instruction status bit • Interrupt mask registers – 1-bit PRIMASK – Set to one will block all the interrupts apart from nonmas- kable interrupt (NMI) and the hard fault exception – 1-bit FAULTMASK

2–20 ECE 5655/4655 Real-Time DSP Cortex-M4 Registers (cont.)

– Set to one will block all the interrupts apart from NMI – 1-bit BASEPRI – Set to one will block all interrupts of the same or lower level (only allow for interrupts with higher priorities) • CONTROL: special register – 1-bit stack definition – Set to one: use the process stack pointer (PSP) – Clear to zero: use the main stack pointer (MSP)

PRIMASK Reserved

FAULTMASK Reserved

BASEPRI Reserved

CONTROL Reserved

bit31 bit24 bit16 bit8 Stack definition

ECE 5655/4655 Real-Time DSP 2–21 Chapter 2 • The Cortex-M Series: Hardware and Software Software Development Overview • The software development flow1:

• At the file level the build process for Keil MDK is:

1. J. Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd edition, Newnes, 2014.

2–22 ECE 5655/4655 Real-Time DSP Software Development Overview

• When using the GNU tool chain compilation and linking are merged1

• A debugging process that we will follow with the limited capability of the STM32F4 on-board emulator, is the use of a UART

– The level shifter is not required with the USB to serial rx/ tx wire device found in the lab

1. J. Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd edition, Newnes, 2014.

ECE 5655/4655 Real-Time DSP 2–23 Chapter 2 • The Cortex-M Series: Hardware and Software

• The preferred IDE for this first offering of the Cortex-M4 version of the course, we will focus on the use of the ARM Keil integrated development environment

Keil Specific Debugging Examples • In this section a few simple debugging examples are provided to help you get started with writing and debugging code • The absolute starting point for working with hardware soft- ware IDE combination is Lab0 on the course Web Site

Setting Breakpoints • First build the code in your project and launch the debugger

Toggle for debugging

First rebuild the target

• The red d button is toggle for starting and stopping the debugger

2–24 ECE 5655/4655 Real-Time DSP Keil Specific Debugging Examples

Run!

Following the debugger launch, the code is loaded and you are brought to main

Starting debug from main

Clicking in the gray regions allows you to set breakpoints

• Set break points at code lines 54 and 55 • Now click the run button and verify that the debugger stops at line 54 • The State (cycle) counter and Sec (total session run time stopwatch) will advance as you move from the main entry point down to line 54

Before After

ECE 5655/4655 Real-Time DSP 2–25 Chapter 2 • The Cortex-M Series: Hardware and Software

Profiling or Code Timing • You just saw the code run from main to the first breakpoint • You also saw that CPU cycle count and clock stopwatch each increment • Being able to count cycle and/or observe run time in seconds is very valuable in characterizing the performance of real- time code • If you now clock the run button again, the debugger will advance to line 55 and the cycle count and CPU time will advance by some amount; don’t do it just yet • The middle field of the lower right status bar of the Keil app contains two resettable stop watches:

Context menu seen when you right-click

Right-click to set/reset the time display Hover to just see the three displays • Reset the t1 stop watch and now click run • Stop watch now displays the elapsed time in running the dot_p function • The States in the registers view has also incremented

2–26 ECE 5655/4655 Real-Time DSP Keil Specific Debugging Examples

• The debugger has knowledge of the CPU clock frequency, so the time increment is a true estimate of the elapsed process- ing time to enter-compute-return from dot_p; NICE!

1034000 to 1034127 so 127 cycles elapsed

Why a difference? 0.27 us at fclk = 168 MHz => ~45 clks

Using the Command Line • By hovering over variables or looking at the Command Stack + Locals view (lower right above the stop watch display), you can look at variables that are within scope of where the debugger has stopped • You can expand array variables for example

ECE 5655/4655 Real-Time DSP 2–27 Chapter 2 • The Cortex-M Series: Hardware and Software

• The command line also allows you to type commands and obtain debug information

command line response area

command line entries

Setting the Compiler Optimization Level • With the debugger stopped you can go to the Options for Tar- get button

click here to open Options for Target Then choose C/C++

2–28 ECE 5655/4655 Real-Time DSP Keil Specific Debugging Examples

Writing to GPIO for Real-Time Timing • Writing to a GPIO for scope/logic analyzer timing measure- ments

ECE 5655/4655 Real-Time DSP 2–29 Chapter 2 • The Cortex-M Series: Hardware and Software Useful Resources • Architecture Reference Manual: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html • Cortex-M4 Technical Reference Manual: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_proces- sor_r0p1_trm.pdf • Cortex-M4 Devices Generic User Guide: http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf • The Designer's Guide to the Cortex-M Processor Family: A Tutorial Approach (Martin) http://www.amazon.com/Designers-Guide-Cortex-M-Processor-Family/dp/0080982964/ ref=sr_1_4?ie=UTF8&qid=1423511335&sr=8-4&keywords=Cortex-M • Embedded Systems: Real-Time Operating Systems for Arm Cortex M Microcontrollers (Valvano) http://www.amazon.com/Embedded-Systems-Real-Time-Operating-Microcontrollers/dp/ 1466468866/ref=sr_1_1?s=books&ie=UTF8&qid=1423511893&sr=1-1&keywords=Real- Time+Operating+Systems+for+Cortex-M

2–30 ECE 5655/4655 Real-Time DSP