CEC 320 and 322 Microprocessor Systems Class and Lab
Lecture 13 - MCU Platforms
Exam #2 Results and Solutions
Ave=68.2, High=94
– Exam-1 (Ave=75) and Exam-2 (Ave=65) = 30% (15% each)
– Final Exam = 20% (40% Part-1, 40% Part-2, 20% Ch. 4)
– Ex #1…5 = 30% (Ex #6 canceled - more review & lab-10 time)
– 6 Quizzes = 20% (2 left - Ch. 4 and Final Review)
– Canvas weights and policies, curve applied after final (help only)
Solutions Posted on Canvas – Solutions Walk-through – Q&A
Remaining Grade Events (32.67%)
– 2 Quizzes (6.7%) - Complete before 11/26, 12/9 (on Canvas)
– Ex #5 (6%) - Due 12/2
– Final (20%) - 12:00-02:30pm, Tues, 12/10 (Schedule 2019)
Lab #9 Demo - Video
Re-do Lab #4 Test ISR
Board Configuration & OLED Display - Verify ASM and C Version of ISR
3+ Clock Rates
Re-compile, Reset between tests
PD7 Analog Input
Goals: 1) Hand optimize ASM 2) Compare C and ASM 3) ARM Example
Menu Procedure Call and Commands
Standard ARM ISA and Platform Documents
ARM Architecture (like x86, MIPS, PowerPC, etc.) – ARM Infocenter – ARM Developer – ARM ABI – Azeria Labs - ARM Platform Security – ARM University - Overview of Resources – ARM Ltd.
Platform Documentation is Vendor Specific – E.g. Broadcom - bcm58712 (Raspberry Pi BCM2837, BCM2711) – TI Sitara (A-Series) and Tiva TM4C (M-Series used in CEC 320) – NVIDIA Tegra Series – Marvell (XSC) – Cypress MCUs – ST Micro MCUs – Altera FPGA SoC – Silicon Labs MCUs
© Sam Siewert
Continuation of MCU Related Studies
Purpose-built MCUs
Processor Scale-Up, Scale-Down
RTOS, OS + RT extensions
Network processors
SE program – CEC 470 Comp. Arch. – CEC 450 RT Systems – CEC 460 Telecomm
CE program – CEC 460 Telecomm – CEC 450 RT Systems – CEC 470 Comp. Arch.
Life-long Study of Embedded Systems
SoC platforms and/or CPU core design - ALU with an FPGA or Sim
System on a Chip and embedded MCU platforms
– Altera DE SoC (DE2-115) - Nios II Soft Core
– Xilinx Digilent - MicroBlaze Soft Core
– NVIDIA Jetson Nano, Xavier NX
– Texas Instruments - Launch Pads (TM4C123GXL LP, TM4C1294XL Connected LP)
Useful for real-time systems (CEC 450)
– Concepts such as WCET (pipeline performance) for RMA
– Resource view of Platforms for HAL or OS (CPU, I/O, Memory, Power)
– RTOS introduction (e.g. FreeRTOS, Zephyr, TI RTOS, VxWorks, ARM Univ., Micrium, etc.)
– RT Services
1. FPGA VHDL or SoC (CEC 330, CEC399 Special Topics), 2. Bare Metal CE / Main+ISR (CEC 320/322), 3. RTOS or IoT (URI, CEC399 Special Topics), 4. OS+RT extensions (CEC 450)
[Figure: IMSAI “Workstation” - Intel 8080, one 2 MHz core, 64KB, von Neumann Arch., $931 in 1979 assembled; 40 years]
Self-Study Continuation after Micro, before CEC450 (Real-Time)
– MIPS with Simulation of the ALU that is Cycle Accurate or Approximate (QtSpim, MARS)
– Hennessy and Patterson - MIPS Comp Org Book, 5th Ed., Cortex-A8, NVIDIA, ARM v7, v8, x86
– ARM MCU or SoC with ETM / KEIL CoreSight (IAR Tools, p. 262, Code Composer), QEMU, Quartus-II and ModelSim
– Intel x86, x64 PMU with VTune (to see chip-level events in Windows or Linux)
URI to learn and work on Comp Org for ICARUS (or CEC330 DB, CEC330 PC)
– Between CEC 320 and CEC 450 with embedded FPGA, SoC, IoT, GP-GPU and RTOS/OS experience
– Participate in research as an option before/after industry internships (e.g. summer after 2nd year)
[Figure: Jetson Xavier NX - six 1.4 GHz ARM Cortex A cores, 8GB, 384 Co-processors, $399 in 2019]
Recall ARM M & A Series
ARM M Series - MCU
– TIVA TM4C123G (M4), NXP, Cypress, Silicon Labs
– The Cortex-M4 processor is developed to address digital signal control markets that demand an efficient, easy-to-use blend of control and signal processing capabilities.
[Figure: ARM Cortex-M4]
ARM A Series - Adv. Mobile
– Smart Phone
– Qualcomm, Broadcom, NVIDIA
– Harvard Split L1, Unified L2, L3, Multi-core
– The processor cluster has one to four cores. Each core has its own L1 instruction and data caches, together with a single shared L2 unified cache.
[Figure: ARM Cortex-A15]
Recall ARM R Series
ARM R Series - Real-Time
– Redundancy (no SPOFs) - Lock-step MISD
– Predictable / Deterministic response (TCM)
– Resilience - recovery and fail-safe
– ECC memory
– Flash memory with data protection
– Software sanity monitoring
– RT critical services
– Best-effort services
– The Cortex-R52 processor meets the rising performance needs of advanced real-time embedded systems.
[Figure: ARM Cortex-R52]
Assignment #5
Final Assignment - Ex1 … Ex5
Explore ARM MCU Platforms (Do, Observe, Explain)
– Jetson TK1 - King 112 lab
– Raspberry Pi 3B+ (Broadcom) - borrow, remote login
Compare ARM MCU SoC Platforms (on paper) – Jetson Nano - remote login – DE2-115
Provides concrete examples to motivate CAC Ch. 4
Bridge to CEC 450, CS 415, Capstone
From MCUs to Platforms
1980’s - early 1990’s - Multi-chip, TTL logic, complex PCBs
– von Neumann (no split L1 cache), no pipeline, zero or low wait-state memory
– Predictable - ASM clocks per instruction listed in the x86 86/88, 186/188 User’s Manual - HW Ref
– Introduction of 32-bit MCUs
– 8-bit, 16-bit MCUs common (still widely used for deeply embedded today)
– E.g. 8051, 68HC11 used in robotics, automotive, etc. (RS-485, RS-232, Token Ring, etc.)
– Today - Microchip/Atmel 8-bit, 16-bit AVR, TI for Scale-down (subsumption - CAN, I2C, SPI, BLE)
1990’s - MIPS, ARM, PowerPC, Alpha (3.3v) – Introduction of Pipelines and L1 cache (split cache Harvard architecture) – 32-bit MCUs common, 64-bit for Workstations (e.g. DEC) – Vector processing (SIMD) introduced - Altivec (PPC), MMX (Intel), (ARM NEON - 2009)
Early 2000’s - Super-pipelines (XSC 7/8 stage, ARM-11), Superscalar (AMD Opteron, Intel P6/Xeon x64), Dual-core (ARM, XSC)
Current Decade (2010’s) - Many Core, MICA, FPGA & GP-GPU SoC – ARM Cortex M-Series (embedded), A-Series (mobile), R-Series (real-time) – Many new ARM SoCs
Next - IoT (Scale down), Visual (Scale-up), Neuromorphic (Purpose built)
– Google TPU (Machine learning)
– NVIDIA GP-GPU (Visual processing and ML)
– Intel Neural Compute Stick (ML)
– ARM NXP, TI, ST-Micro, Cypress, Silicon Labs, etc. (IoT)
Scaling MCUs - 8, 16, to 32-bit
Early MCUs were not pipelined and had zero wait-state memory access (or single wait-state worst case)
– Today, this is Tightly Coupled Memory (TCM)
– TCM can be emulated on a pipelined modern MCU with cache load and lock in L1
– 16/32-bit Examples: Motorola 68000 (Mac; 32-bit registers, 16-bit data bus), Intel 8088 (IBM PC; 16-bit registers, 8-bit data bus)
– L1 split cache (Harvard) and unified L2/L3 minimize wait-state slowdown today
[Figure: Motorola 68K, https://en.wikipedia.org/wiki/Motorola_68000]
Cady, Frederick M. Microcontrollers and Microcomputers: Principles of Software and Hardware Engineering. Oxford University Press, Inc., 2009.
Simplify, Speed-Up - RISC Pipelined MCUs
MIPS - R2000, R3000 (Late 1980’s - Early 1990’s)
– Harris Radiation Hardened RH3000 (NASA New Horizons)
– Mongoose-V (1993)
– AAS 2017 Presentation on Modern RH MCUs (Siewert)
– Competition in 1993 was 64-bit DEC Alpha, 32-bit PowerPC (Mac), 32-bit 80486/Pentium P5 (Wintel PC), 32-bit ARM7 (von Neumann arch.)
Other RISC MCUs - ARM, PowerPC
Board level solutions became System on Chip and MCU Solutions on Chip
– Lower part count, fewer issues with signal integrity, simpler, but sometimes more than you need
[Figure: R3000, https://en.wikipedia.org/wiki/R3000]
Current Scale-down, Scale-up MCUs
Scale-up (e.g. Cavium MIPS) - ARM A/R Series
– Many new 64-bit MCUs: MIPS 64 (Cavium Octeon, etc.), ARM 64 A-Series
– Multi-core and Many-core MCUs
– Co-processor SoCs - FPGA and GP-GPU
Scale-down (e.g. Microchip/Atmel AVR) - ARM M Series – Simple 32-bit IoT (BLE, 802.11, 5G) for predictive maintenance and consumer IoT (e.g. smart home) – Continuation of 8-bit and 16-bit MCUs (Sensor networks, robotics) – Subsumption architecture, Sensor networks
Cortex M-Series is Scale Down
TIVA TM4C123G Dev Board
TM4C123G Dev board uses the TM4C123GH6PGE MCU
Includes a number of demonstration devices
– I2C devices (e.g. MPU9150 Motion Tracker)
– GPIO LED, Switches, Pins (Multi-function)
– Analog inputs (Temp sensor)
– CAN bus interface
– 96x64 color OLED (Synch. Serial Interface)
– MicroSD (Synch. Serial)
TM4C123G has lower part count with MCU
Computers as Components 4e © 2016 Marilyn Wolf, Updated by SBS Computing Platforms
Platform organization.
– MCU processor, peripherals (on-chip), peripherals (off-chip), on-chip memory, off-chip memory, on-chip/off-chip NAND flash, etc.
Busses.
– Local bus (AMBA) and I/O bus (e.g. PCIe)
Memory devices.
– Don’t confuse a Memory Controller (MCU) with a Microcontroller Unit (MCU) – Overloaded acronym
– MMU - Memory Management Unit used for memory mapping and access control
Computing platform architecture
DMA provides direct memory access.
Timers used by OS, devices.
Multiple busses connect CPU, memory to devices.
DMA Request queue - each Request holds:
• Src starting address • Dst starting address • Length • Interrupt on done • Return request tag
DMA Completion queue - each Completion holds:
• Request tag • Status
For TIVA TM4C123G we used Programmed MMIO
– Read, Write FIFO or MMIO Registers (e.g. 16x8 UART FIFO)
– ADC Channel Reads
– GPIO Reads and Writes
– I2C Bus Writes (Function Generator)
– Exception is Motion Tracker - Data Filled in and Completion indicated by Call-back
Platform software
Platform software provides core functions, utilities.
Low-level functions depend on architecture - TI interrupt vectors, etc. (PDL)
CE Main+ISR - e.g. Texas Instruments PDL
RTOS - e.g. Wind River VxWorks Wind kernel, Zephyr micro-kernel, FreeRTOS
OS + Extensions - e.g. Embedded Linux with POSIX RT
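A minimal sketch of the Main+ISR layer, assuming a software ring buffer between an interrupt handler and the main loop. The names uart_rx_isr, uart_rx_poll and RX_BUF_SIZE are hypothetical and are not taken from the TI PDL; on real hardware the ISR would read the byte from a device register instead of receiving it as an argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive buffer shared by an ISR and the main loop.
   One slot stays empty to tell "full" from "empty", so it holds 15 bytes. */
#define RX_BUF_SIZE 16u

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint32_t rx_head; /* next slot to write; changed only by the ISR */
static volatile uint32_t rx_tail; /* next slot to read; changed only by main */

/* ISR side: store one received byte.
   Returns false when the buffer is full and the byte is dropped. */
bool uart_rx_isr(uint8_t byte)
{
    uint32_t next = (rx_head + 1u) % RX_BUF_SIZE;
    if (next == rx_tail) {
        return false;
    }
    rx_buf[rx_head] = byte;
    rx_head = next;
    return true;
}

/* Main-loop side: returns true and stores one byte in *out if data is waiting. */
bool uart_rx_poll(uint8_t *out)
{
    assert(out != NULL);
    if (rx_tail == rx_head) {
        return false;
    }
    *out = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1u) % RX_BUF_SIZE;
    return true;
}
```

The main loop would call uart_rx_poll() each pass and do its other work when it returns false. Each index has a single writer, which is what lets this pattern run without disabling interrupts on a single-core MCU.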
Example 4 GB System Memory Map
0xFFF0_0000 - 0xFFFF_FFFF   1 Mbyte       Boot ROM (Flash) device (reset vector address @ high address)
0x0500_0000 - 0xFFEF_FFFF   4015 Mbytes   unused
0x0400_0000 - 0x04FF_FFFF   16 Mbytes     Memory Mapped IO (MMIO) - PCI BARs for Device Function Registers
0x0200_0000 - 0x03FF_FFFF   32 Mbytes     unused (space left for memory upgrades)
0x0000_0000 - 0x01FF_FFFF   32 Mbytes     Main Working Memory for OS/Apps (e.g. SRAM, SDRAM, DDR)
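The example memory map can be written down as a table and checked in code. This is only a sketch: the region boundaries come from the map above, while the names region_id_t, mem_map, region_of and region_mbytes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    REGION_WORKING,   /* main working memory for OS/apps */
    REGION_UPGRADE,   /* unused, space left for memory upgrades */
    REGION_MMIO,      /* memory mapped I/O */
    REGION_UNUSED,    /* large unused hole */
    REGION_BOOT_ROM,  /* boot ROM at the top of the address space */
    REGION_COUNT
} region_id_t;

typedef struct {
    uint32_t first;   /* lowest address in the region */
    uint32_t last;    /* highest address in the region (inclusive) */
} region_t;

static const region_t mem_map[REGION_COUNT] = {
    [REGION_WORKING]  = { 0x00000000u, 0x01FFFFFFu },
    [REGION_UPGRADE]  = { 0x02000000u, 0x03FFFFFFu },
    [REGION_MMIO]     = { 0x04000000u, 0x04FFFFFFu },
    [REGION_UNUSED]   = { 0x05000000u, 0xFFEFFFFFu },
    [REGION_BOOT_ROM] = { 0xFFF00000u, 0xFFFFFFFFu },
};

/* Return the region that contains addr. */
region_id_t region_of(uint32_t addr)
{
    for (int id = 0; id < REGION_COUNT; id++) {
        if (addr >= mem_map[id].first && addr <= mem_map[id].last) {
            return (region_id_t)id;
        }
    }
    return REGION_UNUSED; /* not reached: the five regions cover all 32-bit addresses */
}

/* Size of a region in Mbytes, computed in 64 bits so the +1 cannot overflow. */
uint32_t region_mbytes(region_id_t id)
{
    assert((unsigned)id < REGION_COUNT);
    uint64_t bytes = (uint64_t)mem_map[id].last - mem_map[id].first + 1u;
    return (uint32_t)(bytes >> 20);
}
```

Summing region_mbytes() over all five regions gives 4096 Mbytes, which is the full 32-bit address space.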
RTOS and App Use of 32 MB Memory
0x01FF_FFFF                 (top of working memory)
                            App Code Heap - Loadable App modules (.text, .data, .bss, .rom)
                            ISR Stack - ISR_STACK_SIZE
                            WDB Pool - WDB_POOL_SIZE
_end+1
_end
                            System Code - Loadable VxWorks image (.text, .data, .bss, .rom)
0x0010_8000
                            System Stack - 32K for kernel, grows down
0x0010_0000
0x0009_FFFF
                            Bootrom Image - 608 Kbyte Boot_rom image
0x0000_8000
                            Bootrom Stack - 12K stack for boot, grows down
0x0000_5000
                            Boot Parameters - 256 Bytes: bootDev, unitNum, procNum, flags
0x0000_1200
0x0000_07FF
                            Interrupt Vector Table - 2 Kbytes for IRQ 0-15 Handlers (128 Bytes, 32 Dwords of Code each)
0x0000_0000
CPU buses
Which ones are faster?
Bus allows CPU, memory, devices to communicate.
– Shared communication medium.
A bus is: – A set of wires. – A communications protocol. – Address, Data (Multiplexed or Dedicated), Control
CPCI
Parallel buses were dominant until 2002 - e.g. ISA, VME, PCI
Byte lane serial buses emerged in 2002 - e.g. PCI Express, Infiniband
Bus protocols
Bus protocol determines how devices communicate.
Devices on the bus go through sequences of states. – Protocols are specified by state machines, one state machine per actor in the protocol.
May contain asynchronous logic behavior.
DSP has allowed PCIe and Infiniband serial buses to surpass parallel
Many x N byte serial lanes
SERDES (serializer/deserializer)
Microprocessor busses
Clock provides synchronization.
R/W’ is true when reading and false when writing.
Address is an a-bit bundle of address lines.
Data is an n-bit bundle of data lines.
Data ready signals when n-bit data is ready.
Posted writes - writes to slow devices are posted to a FIFO and output later when the device is ready
Split-transaction reads - reads from slow devices are requested with an I/O tag; when the device finally responds, the read data is matched to the tag
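The tag matching behind split-transaction reads can be sketched with a small table of outstanding requests. This is an illustration under assumptions, not any specific bus standard: the table size MAX_OUTSTANDING and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4u  /* hypothetical number of tags */

typedef struct {
    bool     in_use;  /* tag has been handed out */
    bool     done;    /* device has returned data for this tag */
    uint32_t addr;
    uint32_t data;
} pending_read_t;

static pending_read_t pending[MAX_OUTSTANDING];

static bool tag_valid(int tag)
{
    return tag >= 0 && (unsigned)tag < MAX_OUTSTANDING && pending[tag].in_use;
}

/* Requester: issue a read and get a tag back, or -1 if all tags are outstanding.
   The bus is free for other traffic while the device works. */
int split_read_issue(uint32_t addr)
{
    for (unsigned tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (!pending[tag].in_use) {
            pending[tag].in_use = true;
            pending[tag].done   = false;
            pending[tag].addr   = addr;
            pending[tag].data   = 0;
            return (int)tag;
        }
    }
    return -1;
}

/* Device: return data for a tag, possibly out of issue order. */
bool split_read_complete(int tag, uint32_t data)
{
    if (!tag_valid(tag) || pending[tag].done) {
        return false;
    }
    pending[tag].data = data;
    pending[tag].done = true;
    return true;
}

/* Requester: collect the data if it has arrived; this frees the tag. */
bool split_read_collect(int tag, uint32_t *out)
{
    assert(out != NULL);
    if (!tag_valid(tag) || !pending[tag].done) {
        return false;
    }
    *out = pending[tag].data;
    pending[tag].in_use = false;
    return true;
}
```

Two reads can be issued back to back and completed in either order; the tag, not arrival order, pairs each response with its request.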
Timing diagrams - Analog Discovery
On-Off Keying or Amplitude
Eye diagram - margin of error in digital signal
Logic errors on a bus
Encoding will minimize potential for and possibly detect errors (e.g. 8b/10b)
Bus read
Buses can be read or written with PIO (Programmed I/O) one word at a time - but rarely are
• CPU involved in each transfer
• Address, Data, Address, Data, …
Normally Block transfer with DMA is used
• Address, Length, Data, Data, Data, …
• DMA handles the transfer and interrupts the CPU upon completion
State diagrams for bus read
[Figure: two state diagrams, one for the CPU (entered on start) and one for the device. State labels visible in the figure: Wait, Adrs, Ack, See ack, Get data, Send data, Release ack, Done.]
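A bus read with one state machine per actor can be sketched as two step functions that share a set of bus signals. The state names below are modeled loosely on the labels in the figure; the exact states, the signal names in bus_t and the one-step-per-cycle timing are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CPU_ADRS, CPU_SEE_ACK, CPU_GET_DATA, CPU_DONE } cpu_state_t;
typedef enum { DEV_WAIT, DEV_SEND_DATA, DEV_RELEASE_ACK } dev_state_t;

/* Shared wires (hypothetical signal set). */
typedef struct {
    bool     adrs_enable; /* CPU is driving a valid address */
    uint32_t adrs;
    bool     ack;         /* device is driving valid data */
    uint32_t data;
} bus_t;

static void cpu_step(cpu_state_t *s, bus_t *b, uint32_t want, uint32_t *got)
{
    switch (*s) {
    case CPU_ADRS:     b->adrs = want; b->adrs_enable = true; *s = CPU_SEE_ACK; break;
    case CPU_SEE_ACK:  if (b->ack) { *s = CPU_GET_DATA; } break;
    case CPU_GET_DATA: *got = b->data; b->adrs_enable = false; *s = CPU_DONE; break;
    case CPU_DONE:     break;
    }
}

static void dev_step(dev_state_t *s, bus_t *b, const uint32_t *mem, size_t words)
{
    switch (*s) {
    case DEV_WAIT:        if (b->adrs_enable) { *s = DEV_SEND_DATA; } break;
    case DEV_SEND_DATA:   b->data = mem[b->adrs % words]; b->ack = true; *s = DEV_RELEASE_ACK; break;
    case DEV_RELEASE_ACK: if (!b->adrs_enable) { b->ack = false; *s = DEV_WAIT; } break;
    }
}

/* Run both machines until the handshake finishes; *cycles counts the steps taken. */
uint32_t bus_read(const uint32_t *mem, size_t words, uint32_t adrs, unsigned *cycles)
{
    assert(mem != NULL && words > 0 && cycles != NULL);
    bus_t       bus = { false, 0, false, 0 };
    cpu_state_t cpu = CPU_ADRS;
    dev_state_t dev = DEV_WAIT;
    uint32_t    got = 0;
    *cycles = 0;
    while (!(cpu == CPU_DONE && dev == DEV_WAIT)) {
        cpu_step(&cpu, &bus, adrs, &got);
        dev_step(&dev, &bus, mem, words);
        (*cycles)++;
    }
    return got;
}
```

Each machine only looks at the shared signals, never at the other machine's state, which is the point of specifying a protocol as one state machine per actor.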
Bus wait state
Buses may need to wait on slow devices
Wait states allow devices time to respond
Many modern buses use split transaction reads and posted writes to allow for higher bus throughput
Bus burst read
PCI Express handles all bus transactions as burst
Programmed I/O is handled as a burst of 1
Common high-rate I/O devices • Network interfaces • Disk drives • SSD/Flash • Cameras • USB devices
Bus multiplexing
PCI and PCI Express ended dedicated Address and Data lines for buses to reduce trace count on PCBs
Most often, we address once and do block transfer anyway
[Figure: CPU and device sharing multiplexed data/adrs lines, with data enable and adrs enable signals]
DMA
Direct memory access (DMA) performs data transfers without executing instructions.
– CPU sets up transfer.
– DMA engine fetches, writes.
DMA controller is a separate unit.
Bus mastership
By default, CPU is bus master and initiates transfers.
DMA must become bus master to perform its work. – CPU can’t use bus while DMA operates.
Bus mastership protocol: – Bus request. – Bus grant.
DMA operation
CPU sets DMA registers for start address, length.
DMA status register controls the unit.
Once DMA is bus master, it transfers automatically.
– May run continuously until complete.
– May use every nth bus cycle.
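The DMA operation can be sketched as a small simulation: the CPU fills in start addresses and a length, then the engine copies one byte on each bus cycle it owns. The sim_dma_t fields and function names are hypothetical and do not model the TM4C µDMA or any other real controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA register set. */
typedef struct {
    const uint8_t *src;         /* source start address */
    uint8_t       *dst;         /* destination start address */
    size_t         remaining;   /* bytes still to transfer */
    unsigned       every_nth;   /* 1 = run continuously, n = use every nth bus cycle */
    bool           busy;        /* status: transfer in progress */
    bool           irq_pending; /* status: completion interrupt raised */
} sim_dma_t;

/* CPU side: program the registers and start the transfer. */
void dma_start(sim_dma_t *d, const uint8_t *src, uint8_t *dst, size_t length, unsigned every_nth)
{
    assert(d != NULL && src != NULL && dst != NULL);
    d->src         = src;
    d->dst         = dst;
    d->remaining   = length;
    d->every_nth   = every_nth ? every_nth : 1u;
    d->busy        = (length > 0);
    d->irq_pending = (length == 0);
}

/* One bus cycle. Returns true if the DMA engine was bus master on this cycle. */
bool dma_bus_cycle(sim_dma_t *d, unsigned long cycle)
{
    if (!d->busy || (cycle % d->every_nth) != 0) {
        return false;           /* bus stays with the CPU */
    }
    *d->dst++ = *d->src++;
    if (--d->remaining == 0) {
        d->busy        = false;
        d->irq_pending = true;  /* interrupt the CPU upon completion */
    }
    return true;
}

/* Run until the transfer completes; returns the bus cycles left to the CPU. */
unsigned long dma_run(sim_dma_t *d)
{
    unsigned long cpu_cycles = 0;
    for (unsigned long cycle = 0; d->busy; cycle++) {
        if (!dma_bus_cycle(d, cycle)) {
            cpu_cycles++;
        }
    }
    return cpu_cycles;
}
```

With every_nth set to 1 the CPU gets no bus cycles until the block is done; with a larger value the transfer takes longer but the CPU keeps most of the bus.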
Bus transfer sequence diagram
System bus configurations
Multiple busses allow parallelism:
– Slow devices on one bus.
– Fast devices on separate bus.
A bridge connects two busses.
[Figure: CPU, memory, high-speed device, bridge, slow devices]
ARM AMBA bus - Internal
Two varieties:
– AHB is high-performance.
– APB is lower-speed, lower cost.
AHB supports pipelining, burst transfers, split transactions, multiple bus masters.
All devices are slaves on APB.
Memory components
Several different types of memory: – DRAM. – SRAM. – Flash.
Each type of memory comes in varying: – Capacities. – Widths.
Random-access memory
Dynamic RAM is dense, requires refresh. – SDRAM: synchronous DRAM. – EDO DRAM: extended data out. – FPM DRAM: fast page mode. – DDR DRAM: double-data rate.
Static RAM is faster, less dense, consumes more power.
ECC - SECDED (Single Error Correction, Double Error Detection) is critical for Mission Critical RT Systems
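SECDED can be illustrated with an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits and 1 overall parity bit. This is a teaching-sized sketch; real ECC memory typically protects much wider words (for example 64 data bits with 8 check bits), and the function names and bit layout below are assumptions chosen for clarity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SECDED_OK, SECDED_CORRECTED, SECDED_DOUBLE_ERROR } secded_status_t;

/* Returns 1 if an 8-bit value has an odd number of 1 bits. */
static unsigned parity8(unsigned x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Codeword layout by bit position:
   0 = overall parity, 1 = p1, 2 = p2, 3 = d1, 4 = p4, 5 = d2, 6 = d3, 7 = d4. */
uint8_t secded_encode(uint8_t nibble)
{
    unsigned d1 = nibble & 1u;
    unsigned d2 = (nibble >> 1) & 1u;
    unsigned d3 = (nibble >> 2) & 1u;
    unsigned d4 = (nibble >> 3) & 1u;
    unsigned p1 = d1 ^ d2 ^ d4;   /* covers positions 1, 3, 5, 7 */
    unsigned p2 = d1 ^ d3 ^ d4;   /* covers positions 2, 3, 6, 7 */
    unsigned p4 = d2 ^ d3 ^ d4;   /* covers positions 4, 5, 6, 7 */
    unsigned cw = (p1 << 1) | (p2 << 2) | (d1 << 3) | (p4 << 4) |
                  (d2 << 5) | (d3 << 6) | (d4 << 7);
    cw |= parity8(cw);            /* make the whole byte even parity */
    return (uint8_t)cw;
}

/* Decode a codeword. On SECDED_DOUBLE_ERROR, *nibble is left untouched. */
secded_status_t secded_decode(uint8_t codeword, uint8_t *nibble)
{
    assert(nibble != NULL);
    unsigned cw = codeword;
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; pos++) {
        if ((cw >> pos) & 1u) {
            syndrome ^= pos;      /* XOR of set positions is 0 for a valid word */
        }
    }
    unsigned odd = parity8(cw);
    secded_status_t status = SECDED_OK;
    if (syndrome != 0 && !odd) {
        return SECDED_DOUBLE_ERROR;   /* two flips: detectable, not correctable */
    }
    if (odd) {
        cw ^= 1u << syndrome;         /* single flip; syndrome 0 means the parity bit itself */
        status = SECDED_CORRECTED;
    }
    *nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                        (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return status;
}
```

Any single flipped bit is repaired and any pair of flipped bits is reported, which matches the "single error correction, double error detection" name. Three or more flips can be miscorrected, which is one reason mission-critical systems pair ECC with other checks.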
SDRAM read operation
DRAM - high density
RAS and CAS (Row and Column address strobe)
Burst Data Out • E.g. Cache line size or more • High throughput • Potentially higher latency than SRAM for specific byte access
Memory packaging
SIMM: single in-line memory module.
DIMM: dual in-line memory module.
Memory systems and memory controllers
Memory has complex internal organization.
Memory controller hides details of memory interface, schedules transfers to maximize performance.
Channels and banks
Channels provide separate connections to parts of memory. Banks are separate memory arrays.
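One way a memory controller can hide this organization is by slicing a physical address into channel, bank, row and column fields. The geometry below (8-byte bus, 2 channels, 1024 columns, 8 banks) and the field order are assumptions picked for illustration; real controllers choose their own mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry, listed from the least significant address bits up. */
enum { BYTE_BITS = 3, CHAN_BITS = 1, COL_BITS = 10, BANK_BITS = 3 };
static_assert(BYTE_BITS + CHAN_BITS + COL_BITS + BANK_BITS < 32,
              "no address bits left for the row");

typedef struct {
    uint32_t channel;
    uint32_t bank;
    uint32_t row;     /* selected with RAS */
    uint32_t column;  /* selected with CAS */
    uint32_t byte;    /* byte offset within one bus-width word */
} dram_addr_t;

/* Remove and return the low `bits` bits of *addr. */
static uint32_t take_bits(uint32_t *addr, unsigned bits)
{
    uint32_t field = *addr & ((1u << bits) - 1u);
    *addr >>= bits;
    return field;
}

dram_addr_t dram_decode(uint32_t addr)
{
    dram_addr_t a;
    a.byte    = take_bits(&addr, BYTE_BITS);
    a.channel = take_bits(&addr, CHAN_BITS);
    a.column  = take_bits(&addr, COL_BITS);
    a.bank    = take_bits(&addr, BANK_BITS);
    a.row     = addr;   /* whatever remains selects the row */
    return a;
}
```

Placing the channel bit just above the byte offset sends consecutive 8-byte words to alternating channels, so a sequential burst can keep both channels busy.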
Single Board Computer SoCs
SBC = Single Board Computer (Instead of Backplane)
For RT Systems 2 Boards are Used for High Rate I/O (with Co-Processing)
– Jetson TK-1 – Multi-Core CPU + GPU Co-Processor
– DE1-SoC – Multi-Core CPU + FPGA Co-Processor
For Low Rate, Texas Instruments Tiva TM4C is also an Option
SBCs are Less Scalable than a CPCI or VXS/VXI Backplane, But SoC Packs Multiple Cores and I/O onto a Single Chip!
Embedded GP-GPU SoCs - Jetson TK1
Jetson TK1 CPU+GPU (Tegra K1)
– NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15
– NVIDIA Kepler "GK20a" GPU with 192 SM3.2 CUDA cores (up to 326 GFLOPS)
Jetson Nano
– Competitive with R-Pi, TI OMAP, etc. in terms of price, fanless, etc. ($99)
– Newer Tegra X1 class SoC (Maxwell GPU) rather than the Tegra K1
– Much more compact
– Good for student projects involving machine vision, AI
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
Embedded FPGA SoC Devices – DE1-SoC
Reconfigurable SoC with FPGA Co-processing
Dual-Core ARM Cortex A9, Linux or FreeRTOS