Embedded Processors and CPU Cores

EE8205: Embedded Computer Systems

http://www.ee.ryerson.ca/~courses/ee8205/

Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering

Ryerson University

Overview • CPU Basics Operation • Embedded CPUs • Nios-II CPU Organization • Nios-II Cores • MPS32K CPU Core

Text by M. Wolf: part of Chapters/Sections 2.1, 2.2, 2.3 and 3.1-3.5

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:1 Processor Architecture von Neumann Architecture • Memory holds data, instructions. • Central processing unit (CPU) fetches instructions from memory. • Separate CPU and memory distinguishes programmable computer. • CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.

address PC Memory data CPU 200 ADD r5, r1, r3 IR

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:2 Harvard Architecture

address

Data memory data PC

CPU address

Program memory data

• Harvard architecture cannot use self-modifying code. • Harvard allows two simultaneous memory fetches. • Most DSPs use Harvard architecture for streaming data: . greater memory bandwidth . more predictable bandwidth

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:3 Instruction Execution Process

Instruction Fetch: Reads next instruction into instruction register (IR). Instruction address is in the program counter (PC). Instruction Interpretation: Decodes the op-code, gets the required operands and routes them to ALU. Sequencing Determines the address of next instruction and loads it into the PC. Execution: Generates control signals of ALU for execution.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:4 System Organization

Memory and I/O having Separate Bus

Memory Bus

I/O Bus

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:5 Memory Mapped Peripherals

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:6 Single Accumulator Architecture

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:7 Cache Memory

• Cache: Expensive but very fast memory directly connected to CPU interacting with slower but much larger main memory. • Invisible to OS and user programs but interact with memory management hardware. • Processor first checks if the addresses word is in cache. • If the word is not found in cache, a block of memory containing the word is moved to the cache.

address data

Cache

Main CPU cache controller address Memory data data

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:8 Separate Data and Instruction Caches FF..FF copies of 16 instructions address

instructions cache

address instructions instructions

registers

processor

address data data

address

copies of data data memory cache 00..0016

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:9 A Unified Instruction and Data Cache

FF..FF 16 registers instructions

processor

instructions address and data data copies of instructions address copies of data memory cache 00..00 instructions 16 and data

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:10 CPU Pipelining

Improve performance by increasing instruction throughput. Ideal speedup is equal to number of stages in the pipeline.

1st Instruction (Adrs: PC) Fetch Decode Execute

2nd Instruction (Adrs: PC+1) Fetch Decode Execute

3rd Instruction (Adrs: PC+2) Fetch Decode Execute

Time

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:11 CPU Pipelining

What makes pipelining easy? • When all instructions are of the same length. • Few instruction formats. • Memory operands appear only in loads and stores.

What makes pipelining hard? • Structural Hazards: Single memory • Control Hazards: Branch instructions • Data Hazards An instruction depends on a previous instruction

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:12 CPU Power Consumption

• Most modern CPUs are designed with power consumption in mind to some degree. • Power vs. energy: . Heat depends on power consumption; . Battery life depends on energy consumption.

Power Saving Strategies • Reduce power supply voltage. • Run at lower clock frequency. • Disable the CPU functional units with control signals when they are not in use. • Disconnect parts from power supply when not in use.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:13 RISC Architecture

RISC: Reduce Instruction Set Computer CISC: Complex Instruction Set Computer

RISC Features • RISC instruction set attributes includes a load/store model of execution. (CISC has memory/register model) • A non-destructive triadic register file that provides a distinct and highly efficient data preservation model. (CISC has destructive accumulator/register file) • Normalized fixed length instructions. (CISC has variable length contextual-field instructions)

Load/Store Model of Execution • The only instructions that can access main memory are load and store instructions. • All the other instructions operate on internal register files. • Increases concurrency, raising performance by de-coupling the loading and storing operation from data processing operations.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:14 Non-Destructive Model

Non-destructive architecture, Data Preserving Model, is fundamental to minimize the load-store traffic.

For example in an addition instruction: RISC Model: AR <= BR + CR

A combined load/store and non-destructive register model provides a dramatic boost in RISC performance.

• RISC ability to minimize the load/store traffic from/to memory. • De-coupling load/store and processing operations. • Allow optimizing compilers to fill the stall slots.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:15 Leading RISC Architectures

SPARC, MIPS, ARM, ARM Cortex, Nios-II, µBlaze SPARC: Scalable Processor Architecture • The first and open RISC architecture • Created by Sun Microsystems in response to the declining performance enhancement of CISC microprocessors • RISC work is based on Patterson research at UC Berkeley MIPS32 4Kx based soft processor core. • Supported by 5 semiconductor licenses Cypress, , LSI Logic, Fujitsu, and Bit • A flexible, scalable RISC architecture.

Current implementations are in CMOS, ECL and GaAs. A complete architecture having: • Integer unit, Floating point unit • Memory Management and Cache Controller • Multiprocessing

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:16 Processor Operation Modes

User mode . A user program is running. . Certain instructions are not allowed. . Memory mapping (base and bound) is enabled.

Supervisor mode . The operating system is running. . All instructions are allowed. . Memory mapping (base and bound) is disabled.

A single PSW (processor status word) bit sets the above two modes: For instance: PSW-bit =1 for Supervisor mode PSW-bit =0 for User mode

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:17 Interrupts

A computer program has only two ways to determine the conditions that exist in internal and external circuits. • One method uses software instructions that jump to subroutine on some flag status. • The second method responds to hardware signals called interrupts that force the program to call interrupt-handling subroutines. • Interrupts take processor time only when action is required. • Processor can respond to an external event much faster by using interrupts.

The whole programming of microcomputers and micro- controller by using interrupts is called real-time programming. Interrupts are often the only way in which real-time programming can be done successfully.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:18 Interrupt based I/O

Computers permit I/O modules to INTERRUPT a CPU during its normal operation. • I/O module asserts an interrupt request on the control bus. • CPU transfers the control to an interrupt handler routine. • Interrupt handler is generally part of the operating system.

Interrupts . Allows the processor to execute other instructions while an I/O operation is in progress. . Suspension of a processing caused by an event external to a processor in such a way that the computation can be resumed. . Improves processing efficiency.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:19 Instruction Cycle with Interrupts

CPU check for interrupts at the end of each instruction and executes the interrupt handler if required.

Interrupt Handler program identifies the nature/source of an interrupt and performs whatever actions are needed. • It takes over the control after the interrupt. • Control is transferred back to the interrupted program that will resume execution from the point of interruption. • Point of interruption can occur anywhere in a program. • State of the program is saved. (PC + PSW + relevant registers + …) User program

1 i Interrupt handler i+1

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:20 Interrupt Processing

Hardware Device controller or other system h/w Software issues an interrupt

Processor finishes Save remainder of execution of current process state instruction information

Processor signals acknowledgment of interrupt Process interrupt

Processor pushes PSW and PC onto Restore process control stack state information

Processor loads new PC value based on the interrupt Restore old PSW, PC, etc.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:21 Multiple Interrupts (Sequential Order) • Disable interrupts to complete the interrupting task at hand. • Additional interrupts remain pending until interrupts are enabled. Then interrupts are considered in order • After completing the interrupt handler routine, the processor checks for additional interrupts.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:22 Multiple Interrupts (Nested)

• Higher priority interrupts cause lower-priority interrupts to wait. • A lower-priority interrupt handler is interrupted.

For example, when input arrives from a communication line, it needs to be absorbed quickly to make room for additional inputs.

Interrupt the interrupting task!

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:23 Microcontrollers and Embedded CPUs

Microcontrollers are available in 4 to 32-bit word sizes. 8-bit or 16-bit micro-controllers are widely used.

Main/Popular Embedded CPUs and Microcontrollers

Model Pins RAM ROM Counter Remarks I/O Timer 8051 40:32 128 4K 2 128K External, Serial port Motorola 6811 52:40 256 8K 2 A/D, Watch Dog timer, Serial Port ARM7 CPUs Soft 64KB- IP 16/32-bit RISC TMS470-ARM HW 8-64KB 1MB DMA, Watch Dog timer, CAN, I2C Intel, 80C196 68:40 232 8K 2 16-bit 64K External, Serial, A/D, WD Intel, 80960 132 512 Ins -- 20MHz 32-bit bus, FPU, Interrupt control (i960) Cache Clock No I/O on-chip MIPS32 4K Soft I/D -- IPs 32-bit RISC, EJTAG and On-chip bus core Cache MC68360 240/241 4 or 32-bit, WD, Fault-tolerant, 4-Ethernet, QUICC 16-bit 2 serial, Integrated Com Controller Nios II Soft On-chip IPs 32-bit RISC, Soft Core for Embedded core Ram CPUs on FPGA ARM11, Soft/ Cache IPs 64-bit Databus, Harvard and support Cortex-MPCore HW multiple cores on a chip.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:24 NIOS-II: A Soft-Core CPU

An Embedded CPU for FPGAs • 32-bit Instruction set, data path, and address space • Harvard Architecture • 32 general-purpose registers • 32 external interrupt sources • Single-instruction 32 × 32 multiply & divide • Dedicated instructions for computing 64-bit and 128-bit products of multiplication • Floating-point instructions for single-precision operations • Single-instruction barrel shifter • Hardware-assisted debug module enabling processor start, stop, step and trace under integrated development environment (IDE) control. • Configurable High performance CPU cores

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:25 Nios-II Soft Processor Core

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:26 Memory and I/O Organization

• Instruction master port: An Avalon master port that connects to instruction memory via Avalon switch fabric • Instruction cache: Fast cache memory internal to the Nios II core • Data master port: An Avalon master port that connects to data memory and peripherals via Avalon switch fabric • Data cache: Fast cache memory internal to the Nios II core • Tightly coupled instruction or data memory port-interface to fast memory outside the Nios II core

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:27 Memory and I/O Organization

M Avalon Bus Master Port

S Avalon Bus Slave Port

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:28 Nios II Data Bus

Nios II with a Harvard architecture supports separate instruction and data buses. The data bus implemented as 32-bit Avalon master port connects to both memory and peripheral components.

Data bus performs two functions: • Read data from memory or a peripheral when the processor executes a load instruction. • Write data to memory or a peripheral when the processor executes a store instruction.

Memory-mapped I/O access Both data memory and peripherals are mapped into the address space of the data master port.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:29 Nios-II Instruction Bus

Instruction bus connects only to memory components.

It is implemented as a 32-bit Avalon master port. • Fetch instructions to be executed by the processor • No memory write • Always retrieves 32 bits of data

Shared Memory for Instructions and Data • Instruction and data master ports can share a single memory that contains both instructions and data. • The overall Nios II processor system might present a single, shared instruction and data bus to the outside world.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:30 Nios-II Memory Support

Nios II architecture supports tightly coupled memory that provides low latency access to on-chip memory.

Cache Memory • The Nios II architecture supports cache memories on both the instruction master port (instruction cache) and the data master port (data cache). • Cache memory resides on-chip. • Nios core might include one, both, or no cache memories. • Optimal cache configuration is application specific For example: If a Nios II system includes only fast, on-chip memory then an instruction or data cache makes no sense.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:31 Memory Address Map

The address map for memories and peripherals in a Nios II processor system is design dependent.

Specify the address map at the system generation time.

There are three addresses that are part of the CPU Core • Reset address • Exception address • Break handler address

Reset Address ● We can select the memory module where the reset code (boot loader) resides, and the location of reset address. ● Normally, a nonvolatile memory module is selected for the reset code.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:32 Memory Address Map

Exception Address ● We can select a memory module where the exception vector resides, and its location. ● In a typical system, you select a low-latency memory module for the exception code.

Break Location ● For Nios II cores containing a JTAG debug module. ● Memory Module is always the JTAG debug module. Offset is fixed at 0x20, and Address is determined by the base address of the JTAG debug module. ● You cannot modify any of the Break Location fields.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:33 NIOS Exception & Interrupt Control

Exception Controller A simple, non-vectored exception control with single address exception handling capabilities.

Integral Interrupt Controller • 32 level-sensitive interrupt request (IRQ) inputs, irq0 through irq31. • The software can enable and disable any interrupt source individually through the ienable control register. • Software can also enable and disable interrupts globally using the PIE bit of the status control register.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:34 A Typical Nios II System

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:35 JTAG Debug Module

The Nios II architecture supports a JTAG debug module. Host-based tools communicate with the JTAG debug module: • Downloading programs to memory • Starting and stopping execution • Setting breakpoints and watch points • Analyzing registers and memory • Collecting real-time execution trace data • The debug module connects to the JTAG circuitry in an FPGA.

 The debug module connects to signals inside the processor core (on the processor side).

 It has non-maskable control over the processor under test.

 Its functionality can be reduced, or removed altogether.

All system resources visible to the processor in supervisor mode are available to the debug module.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:36 JTAG Debug Module Levels

(2) Not including the dedicated JTAG pins on the Altera FPGA.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:37 Instruction Set Format

Three types of inst format: I-type, R-type, and J-type

I-type instruction-word format contains an immediate value embedded within the instruction word. I-type instructions words contain: ● A 6-bit opcode field OP ● Two 5-bit register fields A and B ● A 16 bit immediate data field IMM16 In most cases, fields A and IMM16 specify the source operands, and field B specifies the destination register. IMM16 is considered signed except for logical operations and unsigned comparisons.

I-type instructions include arithmetic and logical operations such as addi and andi; branch operations; load and store operations;

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:38 Nios-II Instruction Formats

R-Type instructions contain: ● A 6-bit opcode field OP ● Three 5-bit register fields A, B, and C ● An 11-bit opcode-extension field OPX In most cases, fields A and B specify the source operands, and field C specifies the destination register. R-type instructions include arithmetic and logical operations.

J-type instructions contain: ● A 6-bit opcode field and a 26-bit immediate data field ● The only J-type instruction is call.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:39 Nios-II Processor Cores

Three Nios II cores are available:

Nios II/f “fast” core for high performance. It presents the most configuration options allowing us to fine-tune the processor performance.

Nios II/s “standard” core is designed for small size while maintaining performance.

Nios II/e “economy” core is designed to achieve the smallest possible core size. This core has a limited feature set, and many settings are not available when the Nios II/e core is selected.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:40 Nios-II Multiply & Divide Settings

Nios II cores offer different multiply or divide options.

Choose the best option to balance embedded multiplier or logic element (LE) usage, and performance.

Hardware Multiply setting provides the following options: ● Include embedded multipliers (e.g., the DSP blocks in Stratix devices) in the arithmetic logic unit (ALU). ● Include LE-based multipliers in the ALU. ● Omit hardware and have multiply operations in software.

Hardware Divide setting includes LE-based divide hardware in the ALU that achieves much greater performance than emulated software divide operations.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:41 Nios-II Core Performance and Features

(1) DMIPS performance for the Nios II/s and Nios II/f cores depends on the hardware multiply option. (2) Using the fastest hardware multiply option, and targeting a Stratix II FPGA in the fastest speed grade.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:42 Nios-II Processor Core Features

(3) Multiply and shift performance depends on which hardware multiply option is used.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:43 The Nios II/f Core

● Separate instruction and data caches. ● Can access up to 2 GBytes of external address space. ● Supports optional tightly coupled memory for instructions and data. ● Employs a 6-stage pipeline to achieve maximum DMIPS/MHz. ● Performs dynamic branch prediction. ● Provides hardware multiply, divide, and shift options to improve arithmetic performance. ● Supports the addition of custom instructions. ● Supports the JTAG debug module. ● Supports optional JTAG debug module enhancements, including hardware breakpoints and real-time trace.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:44 Nios II/f Core

● The Nios II/f cache size for each is user-definable, between 512 bytes and 64 Kbytes. ● It supports the bit-31 cache bypass method for accessing I/O on the data master port. ● Addresses are 31 bits wide to accommodate the bit-31 cache bypass method. ● The Nios II/f core employs a 6-stage pipeline.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:45 Nios II/s Core

Nios II/s “standard” core is designed for small core size. On- chip logic and memory resources are conserved at the expense of execution performance. ● Uses approximately 20% less logic than Nios II/f ● Its execution performance also drops by roughly 40%

Main design goals include ● Do not cripple performance for the sake of size. ● Remove hardware features that have the highest ratio of resource usage to performance impact. ● Optimal core for cost-sensitive, medium-performance applications.

Main applications are with large amounts of code and/or data, such as systems running an operating system where performance is not the highest priority.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:46 Overview: Nios II/s

● Has instruction cache (512 bytes to 64 Kbytes) but no data cache ● Can access up to 2 GBytes of external address space ● Supports optional tightly coupled memory for instructions ● Performs static branch prediction. ● Does not support bit-31 data cache bypass ● Provides hardware multiply, divide, and shift options to improve arithmetic performance ● Supports the addition of custom instructions ● Supports the JTAG debug module ● Supports optional JTAG debug module enhancements, including hardware breakpoints and real-time trace

● Employs a 5-stage pipeline of Fetch, Decode, Execute, Memory and Writeback

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:47

Nios II/e Core

Nios II/e “economy” core is designed to achieve the smallest possible core size.

Singular design goal Reduce resource utilization, while maintaining compatibility with the Nios II instruction set architecture.

The core is roughly half the size of the standard core, but the performance is substantially lower.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:48 Overview: Nios II/e Core

● Executes at most one instruction per six clock cycles ● Can access up to 2 GBytes of external address space ● Supports the addition of custom instructions ● Supports the JTAG debug module ● Does not provide hardware support for potential unimplemented instructions ● Has no instruction cache or data cache ● Does not perform branch prediction ● No hardware support for any of the potential unimplemented instructions. ● Employs dedicated shift circuitry to perform shift and rotate operations.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:49 Nios- II Programming Model

General-Purpose Registers ● Some registers have names recognized by the assembler. The (r0) always returns the value 0 ● The ra register (r31) holds the return address used by procedure calls and is implicitly accessed by call and ret instructions. ● C and C++ compilers use a common procedure-call convention, assigning specific meaning to registers r1 through r23 and r26 through r28.

Control Registers ● Six 32-bit control registers ● These registers are accessed differently than the general-purpose registers.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:50 Nios II Registers

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:51 Nios-II CPU Operating Modes

The Nios II processor has two operating modes:

Normal Mode ● System and application code execute in normal mode. ● Registers bt (r25), ba (r30) & bstatus (ctl2) are not available.

Debug Mode ● Software debugging tools use it to implement breakpoints and watch-points. ● System and application code never execute in debug mode.

Changing Modes ● The processor starts in normal mode after reset. ● It enters debug mode only as directed by debugging tools. ● System and application code have no control over when the processor enters debug mode. Processor returns to its prior state on exiting from debug mode

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:52 MIPS32 4K CPU Soft Core

Main blocks of the core are: Execution Unit, Multiply-Divide Unit (MDU), System Control Coprocessor (CP0), Memory Management Unit (MMU), Cache Controller, Bus I/F Unit (BIU), I-Cache, D-Cache, Enhanced JTAG Controller, Power Management

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:53 MIP32 4K Features

● 32-bit Address and Data Paths ● Programmable Cache Sizes (0 to 16 Kbytes) ● Supports for ScratchPad RAM Max 20-bit index (1M address) ● Supports Multiply-Divide Unit: 32x16 multiply per clock 32x32 multiply every other clock ● Power Control ● Supports EJTAG debug module: Test Access Port (TAP) facilitates high-speed download of application code

● Employs a 5-stage pipeline of Instruction, Execute, Memory, Align/Accum & Writeback ● CP0 is responsible for: Virtual-to-physical address translation, cache protocols, exception control system, processor’s diagnostics, operating mode selection (kernel vs. user mode) and enabling/disabling of interrupts.

© G.N. Khan Embedded Processors and Cores – EE8205: Embedded Computer Systems Page:54