CEC 320 and 322 Microprocessor Systems Class and Lab

Lecture 13 - MCU Platforms

Exam #2 Results and Solutions
Ave=68.2, High=94
– Exam-1 (Ave=75) and Exam-2 (Ave=65) = 30% (15% each)
– Final Exam = 20% (40% Part-1, 40% Part-2, 20% Ch. 4)
– Ex #1…5 = 30% (Ex #6 canceled - more review & lab-10 time)
– 6 Quizzes = 20% (2 left - Ch. 4 and Final Review)
– Canvas weights and policies, curve applied after final (help only)

Solutions Posted on Canvas – Solutions Walk-through – Q&A

Remaining Grade Events (32.67%)
– 2 Quizzes (6.7%) - Complete before 11/26, 12/9 (on Canvas)
– Ex #5 (6%) - Due 12/2
– Final (20%) - 12:00-02:30pm, Tues, 12/10 (Schedule 2019)

Lab #9 Demo - Video

Re-do Lab #4 Test ISR
– Board Configuration & OLED Display
– Verify ASM and C Version of ISR
– 3+ Clock Rates
– Re-compile, Reset between tests

PD7 Analog Input
Goals:
1) Hand optimize ASM
2) Compare C and ASM
3) ARM Example
Menu Procedure Call and Commands

Standard ARM ISA and Platform Documents

ARM Architecture (like x86, MIPS, PowerPC, etc.) – ARM Infocenter – ARM Developer – ARM ABI – Azeria Labs - ARM Platform Security – ARM University - Overview of Resources – ARM Ltd.

Platform Documentation is Vendor Specific – E.g. Broadcom - bcm58712 (BCM2837, BCM2711) – TI Sitara (A-Series) and Tiva TM4C (M-Series used in CEC 320) – Series – Marvell (XSC) – Cypress MCUs – ST Micro MCUs – Altera FPGA SoC – Silicon Labs MCUs

© Sam Siewert

Continuation of MCU Related Studies

[Diagram: CE program and SE program course paths through CEC 470 Comp. Arch., CEC 450 RT Systems, and CEC 460 Telecomm. Topics shown: Purpose-built MCUs; Processor Scale-Up, Scale-Down; RTOS; OS + RT extensions; Network processors]

Life-long Study of Embedded Systems

SoC platforms and/or CPU core design - ALU with an FPGA or Sim

System on a Chip and embedded MCU platforms
– Altera DE SoC (DE2-115) - Nios II Soft Core
– Digilent - MicroBlaze Soft Core
– Nano, Xavier NX
– Launch Pads (TM4C123GXL LP, TM4C1294XL Connected LP)

Useful for real-time systems (CEC 450)
– Concepts such as WCET (pipeline performance) for RMA
– Resource view of Platforms for HAL or OS (CPU, I/O, Memory, Power)
– RTOS introduction (e.g. FreeRTOS, Zephyr, TI RTOS, VxWorks, ARM Univ., Micrium, etc.)
– RT Services

1. FPGA VHDL or SoC (CEC 330, CEC399 Special Topics)
2. Bare Metal CE / Main+ISR (CEC 320/322)
3. RTOS or IoT (URI, CEC399 Special Topics)
4. OS+RT extensions (CEC 450)

[Sidebar: IMSAI 8080 “Workstation” - one 2 MHz core, 64KB, von Neumann Arch., $931 in 1979 assembled (40 years)]

Self-Study Continuation after Micro, before CEC450 (Real-Time)
– MIPS with Simulation of the ALU that is Cycle Accurate or Approximate
– Hennessy and Patterson - MIPS Comp Org Book, 5th Ed., Cortex-A8, NVIDIA, ARM v7, v8, x86
– ARM MCU or SoC with ETM / KEIL CoreSight (IAR Tools, p. 262, Code Composer)
– Intel x86, x64 PMU with VTune (to see -level events in Windows or Linux)
– Tools shown: QtSpim, MARS, QEMU, Quartus-II and ModelSim

URI to learn and work on Comp Org for ICARUS (or CEC330 DB, CEC330 PC)
– Between CEC 320 and CEC 450 with embedded FPGA, SoC, IoT, GP-GPU and RTOS/OS experience
– Participate in research as an option before/after industry internships (e.g. summer after 2nd year)

[Sidebar: Jetson Xavier NX - six 1.4 GHz ARM Cortex A cores, 8GB, 384 Co-processors, $399 in 2019]

Recall ARM M & A Series

ARM M Series - MCU (ARM Cortex-M4)
– TIVA TM4C123G (M4), NXP, Cypress, Silicon Labs
– The Cortex-M4 processor is developed to address digital signal control markets that demand an efficient, easy-to-use blend of control and signal processing capabilities.

ARM A Series - Adv. Mobile (ARM Cortex-A15)
– Smart Phone
– Qualcomm, Broadcom, NVIDIA
– Harvard Split L1, Unified L2, L3, Multi-core
– The processor cluster has one to four cores. Each core has its own L1 instruction and data caches, together with a single shared L2 unified cache.

Recall ARM R Series

ARM R Series - Real-Time (ARM Cortex-R52)
– Redundancy (no SPOFs) - Lock-step MISD
– Predictable / Deterministic response (TCM)
– Resilience - recovery and fail-safe
– ECC memory
– Flash memory with data protection
– Software sanity monitoring
– RT critical services
– Best-effort services
– The Cortex-R52 processor meets the rising performance needs of advanced real-time embedded systems.

Assignment #5 Final Assignment - Ex1 … Ex5

Explore ARM MCU Platforms (Do, Observe, Explain) – Jetson TK1 - King 112 lab – Raspberry Pi 3B+ (Broadcom) - borrow, remote login

Compare ARM MCU SoC Platforms (on paper) – Jetson Nano - remote login – DE2-115

Provides concrete examples to motivate CAC Ch. 4

Bridge to CEC 450, CS 415, Capstone

From MCUs to Platforms

1980’s - early 1990’s - Multi-chip, TTL logic, complex PCBs
– von Neumann (no split L1 cache), no pipeline, zero or low wait-state memory
– predictable - ASM clocks per instruction in x86 86/88, 186/188 User’s Manual - HW Ref
– Introduction of 32-bit MCUs
– 8-bit, 16-bit MCUs common (still widely used for deeply embedded today)
– E.g. 8051, 68HC11 used in robotics, automotive, etc. (IEEE 485, RS232, Token Ring, etc.)
– Today - Microchip/Atmel 8-bit, 16-bit AVR, TI for Scale-down (subsumption - CAN, I2C, SPI, BLE)

1990’s - MIPS, ARM, PowerPC, Alpha (3.3v) – Introduction of Pipelines and L1 cache (split cache Harvard architecture) – 32-bit MCUs common, 64-bit for Workstations (e.g. DEC) – Vector processing (SIMD) introduced - Altivec (PPC), MMX (Intel), (ARM NEON - 2009)

Early 2000 - Super-pipelines (XSC 7/8 stage, ARM-11), Superscalar (AMD Opteron, Intel P6/Xeon x64), Dual-core (ARM, XSC)

Current Decade (2010’s) - Many Core, MICA, FPGA & GP-GPU SoC – ARM Cortex M-Series (embedded), A-Series (mobile), R-Series (real-time) – Many new ARM SoCs

Next - IoT (Scale down), Visual (Scale-up), Neuromorphic (Purpose built)
– TPU (Machine learning)
– NVIDIA GP-GPU (Visual processing and ML)
– Intel Neural Compute Stick (ML)
– ARM NXP, TI, ST-Micro, Cypress, Silicon Labs, etc. (IoT)

Scaling MCUs - 8, 16, to 32-bit

Early MCUs were not pipelined and had zero wait-state memory access (or single-wait state worst case)
– Today, this is Tightly Coupled Memory
– TCM can be emulated with a pipelined modern MCU with cache load and lock in L1
– Examples: Motorola 68000 (Mac, 16/32-bit), Intel 8088 (IBM PC, 16-bit with an 8-bit data bus)
– L1 split cache (Harvard) and unified L2/L3 minimize wait-state slow down today

[Figure: Motorola 68K, https://en.wikipedia.org/wiki/Motorola_68000]

Cady, Frederick M. Microcontrollers and Microcomputers: Principles of Software and Hardware Engineering. Oxford University Press, Inc., 2009.

Simplify, Speed-Up - RISC Pipelined MCUs

MIPS - R2000, R3000 (Late 1980’s - Early 1990’s)
– Harris Radiation Hardened RH3000 (NASA New Horizons)
– Mongoose-V (1993)
– AAS 2017 Presentation on Modern RH MCUs (Siewert)
– Competition in 1993 was 64-bit DEC Alpha, 32-bit PowerPC (Mac), 32-bit 80486/Pentium P5 (Wintel PC), 32-bit ARM7 (von Neumann arch.)
https://en.wikipedia.org/wiki/R3000

Other RISC MCUs - ARM, PowerPC

Board level solutions became System on Chip and MCU Solutions on Chip
– Lower part count, fewer issues with signal integrity, simpler, but sometimes more than you need

Current Scale-down, Scale-up MCUs

Scale-up (e.g. Cavium MIPS) - ARM A/R Series
– Many new 64-bit MCUs: MIPS 64 (Cavium Octeon, etc.), ARM 64 A-Series
– Multi-core and Many-core MCUs
– Co-processor SoCs - FPGA and GP-GPU

Scale-down (e.g. Microchip/Atmel AVR) - ARM M Series – Simple 32-bit IoT (BLE, 802.11, 5G) for predictive maintenance and consumer IoT (e.g. smart home) – Continuation of 8-bit and 16-bit MCUs (Sensor networks, robotics) – Subsumption architecture, Sensor networks

Cortex M-Series is Scale Down

TIVA TM4C123G Dev Board

TM4C123G Dev board uses the TM4C123GH6PGE MCU
Includes a number of demonstration devices
– I2C devices (e.g. MPU9150 Motion Tracker)
– GPIO LED, Switches, Pins (Multi-function)
– Analog inputs (Temp sensor)
– CAN interface
– 96x64 color OLED (Synch. Serial Interface)
– MicroSD (Synch. Serial)

TM4C123G has lower part count with MCU

Computers as Components 4e © 2016 Marilyn Wolf, Updated by SBS

Computing Platforms

Platform organization.
– MCU processor, peripherals (on-chip), peripherals (off-chip), on-chip memory, off-chip memory, on-chip/off-chip NAND flash, etc.
Busses.
– Local bus (AMBA) and I/O bus (e.g. PCIe)
Memory devices.
– Don’t confuse a Microcontroller (MCU) with a Memory Controller Unit (MCU) – Overloaded acronym
– MMU - Memory Management Unit used for memory mapping and access control

Computing platform architecture

DMA provides direct memory access.
Timers used by OS, devices.
Multiple busses connect CPU, memory to devices.

DMA Request queue - each Request carries:
• Src starting address
• Dst starting address
• Length
• Interrupt on done
• Return request tag

DMA Completion queue - each Completion carries:
• Request tag
• Status

For TIVA TM4C123G we used Programmed MMIO
– Read, Write FIFO or MMIO Registers (e.g. 16x8 UART FIFO)
– ADC Channel Reads
– GPIO Reads and Writes
– I2C Bus Writes (Function Generator)
– Exception is Motion Tracker - Data Filled in and Completion indicated by Call-back

Platform software

Platform software provides core functions, utilities.

Low-level functions depend on architecture - TI interrupt vectors, PDL, etc.

CE Main+ISR - e.g. Texas Instruments PDL
RTOS - e.g. Wind River VxWorks Wind kernel, Zephyr micro-kernel, FreeRTOS
OS + Extensions - e.g. Embedded Linux with POSIX RT

Example 4 GB System Memory Map

Address range                Size          Region              Notes
0xFFF0_0000 - 0xFFFF_FFFF    1 Mbyte       Boot ROM (Flash)    Boot ROM device (reset vector address @ high address)
0x0500_0000 - 0xFFEF_FFFF    4015 Mbytes   unused
0x0400_0000 - 0x04FF_FFFF    16 Mbytes     MMIO                Memory Mapped IO (PCI BARs for Device Function Registers)
0x0200_0000 - 0x03FF_FFFF    32 Mbytes     unused              space left for memory upgrades
0x0000_0000 - 0x01FF_FFFF    32 Mbytes     Working Memory      Main Working Memory for OS/Apps (e.g. 32 Mbytes SRAM, SDRAM, DDR)
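The memory map above can be sketched as a small address decoder. This is a minimal illustration only: the region boundaries are copied from the slide, while the names `region_t`, `classify_address`, and `region_size_mbytes` are hypothetical helpers invented for this example.

```c
#include <stdint.h>

/* Regions of the example 32-bit memory map (boundaries copied from the slide). */
typedef enum {
    REGION_WORKING_MEMORY,  /* 0x0000_0000 - 0x01FF_FFFF, 32 Mbytes   */
    REGION_UNUSED_UPGRADE,  /* 0x0200_0000 - 0x03FF_FFFF, 32 Mbytes   */
    REGION_MMIO,            /* 0x0400_0000 - 0x04FF_FFFF, 16 Mbytes   */
    REGION_UNUSED,          /* 0x0500_0000 - 0xFFEF_FFFF, 4015 Mbytes */
    REGION_BOOT_ROM         /* 0xFFF0_0000 - 0xFFFF_FFFF, 1 Mbyte     */
} region_t;

/* Hypothetical helper: return the region that contains a physical address. */
region_t classify_address(uint32_t addr)
{
    if (addr <= 0x01FFFFFFu) return REGION_WORKING_MEMORY;
    if (addr <= 0x03FFFFFFu) return REGION_UNUSED_UPGRADE;
    if (addr <= 0x04FFFFFFu) return REGION_MMIO;
    if (addr <= 0xFFEFFFFFu) return REGION_UNUSED;
    return REGION_BOOT_ROM;
}

/* Size of a region in Mbytes, computed from its inclusive bounds.
 * The 64-bit intermediate avoids overflow when top is 0xFFFF_FFFF. */
uint32_t region_size_mbytes(uint32_t base, uint32_t top)
{
    return (uint32_t)((((uint64_t)top - base) + 1u) >> 20);
}
```

Calling `region_size_mbytes(0x05000000u, 0xFFEFFFFFu)` is a quick way to check the 4015 Mbyte figure for the large unused hole.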

RTOS and App Use of 32 MB Memory

Address                      Region                  Contents
0x01FF_FFFF (top)            App Code Heap           Loadable App modules (.text, .data, .bss, .rom)
                             ISR Stack               ISR_STACK_SIZE
_end+1                       WDB Pool                WDB_POOL_SIZE
0x0010_8000 to _end          System Code             Loadable VxWorks image (.text, .data, .bss, .rom)
0x0010_0000                  System Stack            32K for kernel – grows down
0x0000_8000 - 0x0009_FFFF    Bootrom Image           608 Kbyte Boot_rom image
0x0000_5000                  Bootrom Stack           12K stack for boot – grows down
0x0000_1200                  Boot Parameters         256 - bootDev, unitNum, procNum, flags
0x0000_0000 - 0x0000_07FF    Interrupt Vector Table  2 Kbytes for IRQ 0-15 Handlers (128 Bytes, 32 Dwords of Code per handler)

CPU buses

Which ones are faster?

Bus allows CPU, memory, devices to communicate.
– Shared communication medium.

A bus is: – A set of wires. – A communications protocol. – Address, Data (Multiplexed or Dedicated), Control

CPCI

Parallel buses were dominant until 2002 - e.g. ISA, VME, PCI

Lane-based serial buses emerged in 2002 - e.g. PCI Express, Infiniband

Bus protocols

Bus protocol determines how devices communicate.

Devices on the bus go through sequences of states. – Protocols are specified by state machines, one state machine per actor in the protocol.

May contain asynchronous logic behavior.

DSP has allowed PCIe and Infiniband serial buses to surpass parallel

Many x N byte serial lanes

SERDES (serialize, de-serialize)

Microprocessor busses

Clock provides synchronization.

R/W is true when reading (R/W’ is false when reading).

Address is an a-bit bundle of address lines.

Data is an n-bit bundle of data lines.

Data ready signals when n-bit data is ready.

Posted writes - writes to slow devices are posted to a FIFO and output later when the device is ready.

Split-transaction reads - reads from slow devices are requested with an I/O tag; when the device finally responds, the read data is matched to the tag.
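The tag matching used by split-transaction reads can be sketched as a small table of outstanding requests. This is an illustrative model, not the mechanism of any particular bus: `MAX_OUTSTANDING`, `read_tracker_t`, `issue_read`, and `complete_read` are hypothetical names, and the limit of four in-flight reads is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4   /* assumed limit on in-flight reads */

typedef struct {
    bool     in_use;
    uint32_t addr;          /* address the read was issued to */
} pending_read_t;

typedef struct {
    /* The slot index doubles as the I/O tag sent with the request. */
    pending_read_t slot[MAX_OUTSTANDING];
} read_tracker_t;

/* Issue a read: record it and return its tag, or -1 if all tags are busy. */
int issue_read(read_tracker_t *t, uint32_t addr)
{
    for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (!t->slot[tag].in_use) {
            t->slot[tag].in_use = true;
            t->slot[tag].addr = addr;
            return tag;
        }
    }
    return -1;
}

/* The device has responded with a tag: look up the matching request,
 * report its address, and free the tag. Returns false for an unknown tag. */
bool complete_read(read_tracker_t *t, int tag, uint32_t *addr_out)
{
    if (tag < 0 || tag >= MAX_OUTSTANDING || !t->slot[tag].in_use) {
        return false;
    }
    *addr_out = t->slot[tag].addr;
    t->slot[tag].in_use = false;
    return true;
}
```

Because each response carries its tag, completions can arrive in any order, and the bus is free for other traffic while a slow device prepares its data.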

Timing diagrams - Analog Discovery

On-Off Keying or Amplitude

Eye diagram - margin of error in digital signal

Logic errors on a bus

Encoding will minimize potential for and possibly detect errors (e.g. 8b/10b)

Bus read

Buses can be read or written with PIO (Programmed I/O) one word at a time - but rarely are
• CPU involved in each transfer
• Address, Data, Address, Data, …

Normally Block transfer with DMA is used
• Address, Length, Data, Data, Data, …
• DMA handles the transfer and interrupts the CPU upon completion
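A rough way to compare the two patterns is to count bus phases. The sketch below assumes a simplified model in which every address, length, or data item occupies exactly one bus phase; real buses differ. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* PIO copy: the CPU itself moves every word, one load and one store each.
 * Returns the number of transfers the CPU had to perform. */
size_t pio_copy(volatile uint32_t *dst, const volatile uint32_t *src, size_t words)
{
    size_t cpu_transfers = 0;
    for (size_t i = 0; i < words; i++) {
        dst[i] = src[i];
        cpu_transfers++;
    }
    return cpu_transfers;
}

/* PIO pattern "Address, Data, Address, Data, ...": two phases per word. */
size_t pio_bus_phases(size_t words)
{
    return 2 * words;
}

/* Block pattern "Address, Length, Data, Data, ...": two phases of setup,
 * then one phase per word. */
size_t block_bus_phases(size_t words)
{
    return words ? words + 2 : 0;
}
```

Under this model a 64-word transfer takes 128 phases with PIO and 66 as a block transfer, and the block transfer leaves the CPU free until the completion interrupt.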

State diagrams for bus read

[Figure: two state machines. CPU states: start, Adrs, Wait, Get data, Done. Device states: Wait, See adrs, Send data, Release ack. Transitions are labeled Adrs and Ack.]
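The state diagrams for a bus read can be sketched as two cooperating state machines. The state names follow the figure labels, but the exact transitions and the handshake signals (`adrs_valid`, `ack`) are assumptions made for this illustration, modeled on a simple four-cycle handshake.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { CPU_START, CPU_ADRS, CPU_WAIT, CPU_GET_DATA, CPU_DONE } cpu_state_t;
typedef enum { DEV_WAIT, DEV_SEE_ADRS, DEV_SEND_DATA, DEV_RELEASE_ACK } dev_state_t;

/* Shared bus wires (assumed signal set for this sketch). */
typedef struct {
    bool     adrs_valid;
    uint32_t adrs;
    bool     ack;
    uint32_t data;
} bus_t;

/* One step of the CPU side of the read. */
cpu_state_t cpu_step(cpu_state_t s, bus_t *bus, uint32_t adrs, uint32_t *result)
{
    switch (s) {
    case CPU_START:    return CPU_ADRS;
    case CPU_ADRS:     bus->adrs = adrs; bus->adrs_valid = true; return CPU_WAIT;
    case CPU_WAIT:     return bus->ack ? CPU_GET_DATA : CPU_WAIT;
    case CPU_GET_DATA: *result = bus->data; bus->adrs_valid = false; return CPU_DONE;
    default:           return CPU_DONE;
    }
}

/* One step of the device side; mem models the device's readable words. */
dev_state_t dev_step(dev_state_t s, bus_t *bus, const uint32_t *mem, uint32_t mem_words)
{
    switch (s) {
    case DEV_WAIT:      return bus->adrs_valid ? DEV_SEE_ADRS : DEV_WAIT;
    case DEV_SEE_ADRS:  return DEV_SEND_DATA;
    case DEV_SEND_DATA: bus->data = mem[bus->adrs % mem_words]; bus->ack = true;
                        return DEV_RELEASE_ACK;
    case DEV_RELEASE_ACK:
        if (!bus->adrs_valid) { bus->ack = false; return DEV_WAIT; }
        return DEV_RELEASE_ACK;
    }
    return DEV_WAIT;
}

/* Run both machines in lock step until the CPU is done.
 * Returns the number of steps taken, or -1 on bad input or timeout. */
int bus_read(const uint32_t *mem, uint32_t mem_words, uint32_t adrs, uint32_t *result)
{
    bus_t bus = { false, 0, false, 0 };
    cpu_state_t c = CPU_START;
    dev_state_t d = DEV_WAIT;
    if (mem_words == 0) return -1;
    for (int step = 1; step <= 32; step++) {
        c = cpu_step(c, &bus, adrs, result);
        d = dev_step(d, &bus, mem, mem_words);
        if (c == CPU_DONE) return step;
    }
    return -1;
}
```

The steps the CPU spends in `CPU_WAIT` are the wait states: a slower device would simply hold off `ack` for more steps.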

Bus wait state

Buses may need to wait on slow devices

Wait states allow devices time to respond

Many modern buses use split transaction reads and posted writes to allow for higher bus throughput

Bus burst read

PCI Express handles all bus transactions as burst

Programmed I/O is handled as a burst of 1

Common high-rate I/O devices • Network interfaces • Disk drives • SSD/Flash • Cameras • USB devices

Bus multiplexing

PCI and PCI Express ended dedicated Address and Data lines for buses to reduce trace count on PCBs

Most often, we address once and do block transfer anyway

[Figure: CPU and device sharing multiplexed data/adrs lines, with data enable and Adrs enable signals]

DMA

Direct memory access (DMA) performs data transfers without executing instructions. – CPU sets up transfer. – DMA engine fetches, writes. DMA controller is a separate unit.

Bus mastership

By default, CPU is bus master and initiates transfers.

DMA must become bus master to perform its work. – CPU can’t use bus while DMA operates.

Bus mastership protocol: – Bus request. – Bus grant.
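The request/grant protocol can be sketched as a two-master arbiter in which the CPU owns the bus by default and the DMA engine takes it only while its request is asserted. The type and function names here are hypothetical, and granting the DMA request unconditionally at the next cycle boundary is a simplifying assumption.

```c
#include <stdbool.h>

typedef enum { MASTER_CPU, MASTER_DMA } bus_master_t;

typedef struct {
    bus_master_t owner;        /* current bus master (CPU by default) */
    bool         dma_request;  /* bus request line from the DMA engine */
} arbiter_t;

/* DMA asserts or drops its bus request line. */
void dma_request_bus(arbiter_t *a) { a->dma_request = true; }
void dma_release_bus(arbiter_t *a) { a->dma_request = false; }

/* Called at each bus-cycle boundary: issue the grant for the next cycle. */
bus_master_t arbitrate(arbiter_t *a)
{
    a->owner = a->dma_request ? MASTER_DMA : MASTER_CPU;
    return a->owner;
}

/* The CPU can't use the bus while DMA holds mastership. */
bool cpu_may_use_bus(const arbiter_t *a)
{
    return a->owner == MASTER_CPU;
}
```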

DMA operation

CPU sets DMA registers for start address, length. DMA status register controls the unit. Once DMA is bus master, it transfers automatically. – May run continuously until complete. – May use every nth bus cycle.
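The register-level view of DMA operation can be sketched with a software model. This is not the register layout of any real controller: the `dma_t` fields, `dma_start`, and `dma_cycle` are hypothetical, and `every_nth` models the "every nth bus cycle" option (1 means run continuously).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated DMA controller registers (hypothetical layout). */
typedef struct {
    const uint32_t *src;       /* source start address register */
    uint32_t       *dst;       /* destination start address register */
    size_t          length;    /* words remaining */
    unsigned        every_nth; /* 1 = continuous, n = one word every nth cycle */
    bool            busy;      /* status bit: transfer in progress */
    bool            done_irq;  /* set on completion, models the interrupt */
} dma_t;

/* CPU side: program the registers and start the transfer. */
void dma_start(dma_t *d, const uint32_t *src, uint32_t *dst,
               size_t words, unsigned every_nth)
{
    d->src = src;
    d->dst = dst;
    d->length = words;
    d->every_nth = every_nth ? every_nth : 1;
    d->busy = (words > 0);
    d->done_irq = false;
}

/* One bus cycle: move a word if it is the DMA engine's turn.
 * Returns true when the DMA engine used the bus this cycle. */
bool dma_cycle(dma_t *d, unsigned long cycle)
{
    if (!d->busy || (cycle % d->every_nth) != 0) {
        return false;
    }
    *d->dst++ = *d->src++;
    if (--d->length == 0) {
        d->busy = false;
        d->done_irq = true;
    }
    return true;
}
```

With `every_nth` set to 2, a 4-word transfer uses bus cycles 0, 2, 4, and 6, leaving the odd cycles for the CPU.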

Bus transfer sequence diagram

System bus configurations

Multiple busses allow parallelism:
– Slow devices on one bus.
– Fast devices on separate bus.
A bridge connects two busses.

[Figure: CPU, memory, and high-speed device on one bus; a bridge connects it to a second bus with slow devices]

ARM AMBA bus - Internal

Two varieties: – AHB is high-performance. – APB is lower-speed, lower cost. AHB supports pipelining, burst transfers, split transactions, multiple bus masters. All devices are slaves on APB.

Memory components

Several different types of memory: – DRAM. – SRAM. – Flash.

Each type of memory comes in varying: – Capacities. – Widths.

Random-access memory

Dynamic RAM is dense, requires refresh. – SDRAM: synchronous DRAM. – EDO DRAM: extended data out. – FPM DRAM: fast page mode. – DDR DRAM: double-data rate.

Static RAM is faster, less dense, consumes more power.

ECC - SECDED (Single Error Correction, Double Error Detection) is critical for Mission Critical RT Systems
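The SECDED principle can be shown on a small scale with an extended Hamming(8,4) code: three Hamming parity bits locate a single flipped bit, and one overall parity bit distinguishes single errors from double errors. Real ECC memory typically protects much wider words; the 4-bit data width and all names below are choices made for this sketch.

```c
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_DOUBLE_ERROR } ecc_status_t;

/* Parity of an 8-bit value: 1 if an odd number of bits are set. */
static unsigned parity8(uint8_t v)
{
    v ^= (uint8_t)(v >> 4);
    v ^= (uint8_t)(v >> 2);
    v ^= (uint8_t)(v >> 1);
    return v & 1u;
}

/* Encode 4 data bits. Bit positions 1..7 form a Hamming(7,4) code
 * (parity at 1, 2, 4; data at 3, 5, 6, 7); bit 0 is overall parity. */
uint8_t secded_encode(uint8_t nibble)
{
    unsigned d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
    unsigned d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
    unsigned p1 = d1 ^ d2 ^ d4;  /* covers positions 1, 3, 5, 7 */
    unsigned p2 = d1 ^ d3 ^ d4;  /* covers positions 2, 3, 6, 7 */
    unsigned p4 = d2 ^ d3 ^ d4;  /* covers positions 4, 5, 6, 7 */
    uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d1 << 3) | (p4 << 4) |
                           (d2 << 5) | (d3 << 6) | (d4 << 7));
    cw |= (uint8_t)parity8(cw);  /* make the total parity of all 8 bits even */
    return cw;
}

/* Decode: correct any single-bit error, detect any double-bit error.
 * nibble_out is written only when the status is not ECC_DOUBLE_ERROR. */
ecc_status_t secded_decode(uint8_t cw, uint8_t *nibble_out)
{
    unsigned syndrome = 0;
    ecc_status_t status = ECC_OK;
    for (unsigned pos = 1; pos <= 7; pos++) {
        if ((cw >> pos) & 1u) syndrome ^= pos;  /* XOR of set bit positions */
    }
    unsigned overall = parity8(cw);             /* 0 when even parity holds */
    if (syndrome != 0 && overall == 0) return ECC_DOUBLE_ERROR;
    if (syndrome != 0) {                        /* single error at 'syndrome' */
        cw ^= (uint8_t)(1u << syndrome);
        status = ECC_CORRECTED;
    } else if (overall != 0) {                  /* error in the parity bit itself */
        cw ^= 1u;
        status = ECC_CORRECTED;
    }
    *nibble_out = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                            (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return status;
}
```

For a mission-critical system the important property is the third outcome: a double-bit error cannot be repaired, but it is reported rather than silently returned as good data.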

SDRAM read operation

DRAM - high density

RAS and CAS (Row and Column address strobe)

Burst Data Out • E.g. Cache line size or more • High throughput • Potentially higher latency than SRAM for specific byte access
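Row and column addressing can be sketched by splitting a word address into the part sent with RAS and the part sent with CAS. The 13-bit row and 10-bit column widths are assumptions for this example, not the geometry of any specific SDRAM part, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define COL_BITS 10u   /* assumed column address width */
#define ROW_BITS 13u   /* assumed row address width */

typedef struct {
    uint32_t row;   /* latched with RAS (row address strobe) */
    uint32_t col;   /* latched with CAS (column address strobe) */
} dram_addr_t;

/* Split a word address into its row and column parts. */
dram_addr_t dram_split(uint32_t word_addr)
{
    dram_addr_t a;
    a.col = word_addr & ((1u << COL_BITS) - 1u);
    a.row = (word_addr >> COL_BITS) & ((1u << ROW_BITS) - 1u);
    return a;
}

/* Two accesses to the same row need only a new column address, which is
 * why a burst of consecutive words (e.g. a cache line) is fast. */
bool same_row(uint32_t addr_a, uint32_t addr_b)
{
    return dram_split(addr_a).row == dram_split(addr_b).row;
}
```

An access that changes the row pays the full RAS-then-CAS sequence, which is the source of the higher latency for an isolated byte access compared with SRAM.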

Memory packaging

SIMM: single in-line memory module. DIMM: dual in-line memory module.

Memory systems and memory controllers

Memory has complex internal organization.

Memory controller hides details of memory interface, schedules transfers to maximize performance.

Channels and banks

Channels provide separate connections to parts of memory. Banks are separate memory arrays.
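One common way a memory controller uses channels and banks is to interleave consecutive blocks across them so that neighboring accesses can proceed in parallel. The sketch below assumes 64-byte blocks, 2 channels, and 8 banks per channel; these numbers and the mapping itself are illustrative, since real controllers use vendor-specific mappings.

```c
#include <stdint.h>

#define LINE_BYTES   64u  /* assumed interleave granularity */
#define NUM_CHANNELS 2u   /* assumed channel count */
#define NUM_BANKS    8u   /* assumed banks per channel */

typedef struct {
    uint32_t channel;  /* which separate connection to memory */
    uint32_t bank;     /* which memory array on that channel */
} mem_route_t;

/* Hypothetical mapping: consecutive 64-byte blocks alternate channels,
 * then rotate through the banks of each channel. */
mem_route_t route_address(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    mem_route_t r;
    r.channel = line % NUM_CHANNELS;
    r.bank = (line / NUM_CHANNELS) % NUM_BANKS;
    return r;
}
```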

Single Board Computer SoCs

SBC = Single Board Computer (Instead of Backplane)

For RT Systems 2 Boards are Used for High Rate I/O (with Co-Processing) – Jetson TK-1 – Multi-Core CPU + GPU Co-Processor – DE1-SoC – Multi-Core CPU + FPGA Co-Processor

For Low Rate, Texas Instruments Tiva TM4C is also an Option

SBCs are Less Scalable than a CPCI or VXS/VXI Backplane, But SoC Packs Multiple Cores and I/O onto a Single Chip!

Embedded GP-GPU SoCs - Jetson TK1

Jetson TK1 CPU+GPU
– NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15
– NVIDIA Kepler "GK20a" GPU with 192 SM3.2 CUDA cores (up to 326 GFLOPS)

Jetson Nano
– Competitive with R-Pi, TI OMAP, etc. in terms of price, fanless, etc. ($99)
– Tegra X1-class SoC (quad-core Cortex-A57, 128-core Maxwell GPU), not the TK1's Tegra K1
– Much more compact
– Good for student projects involving machine vision, AI
https://developer.nvidia.com/embedded/jetson-nano-developer-kit

Embedded FPGA SoC Devices
– DE1-SoC
Reconfigurable SoC with FPGA Co-processing
Dual-Core ARM Cortex A9, Linux or FreeRTOS