Topic III – Microcontrollers and

Roberto Gutiérrez Mazón

¨ Introduction

¨ Processor Architectural Features. Datapath & pipeline.

¨ Data Representation: Fixed-point vs Floating-point

¨ Interrupts, Exceptions, Watch-Dog, …

¨ 32-bit microcontroller. ARM Cortex-M3
¤ ARM Cortex-M3 Architecture. Programmers Model.

¨ 32/64bit microprocessor.
¤ Intel x86, UltraSparc Architecture. Programmers Model

Introduction

What is “Computer Architecture”?

[Figure: abstraction layers, from top to bottom]
¤ Applications
¤ Operating System
¤ Compiler, Firmware
¤ Instruction Set Architecture
¤ Instr. Set Proc., I/O system
¤ Datapath & Control
¤ Digital Design
¤ Circuit Design
¤ Layout & fab
¤ Semiconductor Materials

¨ Moore's Law

¨ “Cramming More Components onto Integrated Circuits”
¤ Gordon Moore, Electronics, 1965

¨ The number of transistors on a cost-effective integrated circuit doubles about every 18 months

¨ Prehistoric Computer Architecture:

¤ The Z1 was the first freely programmable mechanical computer in the world that used Boolean logic and binary floating-point numbers

¤ Memory: 64 words of 22 bits.

¤ Clock frequency: 1 Hz

¤ Registers: two 22-bit floating-point registers.

¤ ALU: add (5 s), sub, mult. (16 s), div (18 s).

¤ Weight: 1000 kg

¨ The zEC12 zSeries IBM Microprocessor:
¤ 5.5 GHz in IBM 32nm PD-SOI CMOS technology
¤ 2.75 billion transistors in 597 mm²
¤ 64-bit virtual addressing
▪ original S/360 was 24-bit, and S/370 was a 31-bit extension
¤ Six-core design
¤ Three-issue out-of-order superscalar pipeline
¤ Out-of-order memory accesses
¤ Redundant datapaths
▪ every instruction is performed in two parallel datapaths and the results are compared
¤ 64KB L1 I-cache, 128KB L1 D-cache on-chip
¤ 1MB private L2 unified instruction and data cache per core, on-chip
¤ On-chip 48MB eDRAM L3 cache
¤ Scales to a 120-core multiprocessor with 384MB of shared L4 eDRAM

[Figure: timeline of computing technology]
¤ Babbage's Difference Engine (1832)
¤ ENIAC (1946)
¤ First transistor (Shockley, Bardeen, Brattain) (1947)
¤ Intel 4004 IC (1971)
¤ Intel 486DX2 IC (1989)
¤ MEMS (2000)
¤ Cell (2005)
¤ Intel Quad (2007)
¤ Nanotechnology (?)
¤ Optical processors (?)

Wireless Sensor Network (University of Michigan)

[Figure: wireless sensor node, with parts labelled A, B and C]
¤ Solar cells and a 12 µAh Li-ion battery
¤ Sensors, timers
¤ Processor, SRAM and PMU: Cortex-M0 + 16KB RAM, 65nm
¤ 10 kB storage memory, ~3 fW/bit
¤ UWB radio and antenna
¤ Wirelessly networked into large scale sensor arrays
¤ Cortex-M0; 65¢

4200 ARM powered Neutrino Detectors

¤ 70 bore holes, 2.5 km deep
¤ 60 detectors per string, starting 1.5 km down
¤ 1 km³ of active telescope

Work supported by the National Science Foundation and University of Wisconsin-Madison

Processor Architectural Features

Programming Model

¨ Microprocessors can be programmed directly using an assembly language.
¨ Differences with high-level languages:
¤ Use commands to execute data movements, arithmetic, logic and program control operations.
¤ Use registers to hold data for operation.
¨ Programmers need to know not only the assembly language for the microprocessor, but also the internal configuration of the microprocessor.

[Figure: abstraction levels]
¤ Level 5: High-Level Language
¤ Level 4: Assembly Language
¤ Level 3: Operating System
¤ Level 2: Instruction Set Architecture
¤ Level 1: Microarchitecture
¤ Level 0: Digital Logic

A Basic Processor

¨ The basic components:
¤ Processor with its associated temporary memory (registers and cache if available) for code execution
¤ Main memory and secondary memory where code and data are temporarily and permanently stored
¤ Input and output modules that provide the interface between the processor and the user
¨ Connected through an interface bus that consists of
¤ Address, data, and control signals, e.g. the AMBA bus for ARM-based processors

[Figure: processor core with registers and cache/SRAM, main memory, storage memory and I/O interface, linked by the address bus, data bus and bus control signals]

The gap widens between DRAM, disk, and CPU speeds.

[Figure: disk seek time, DRAM access time, SRAM access time and CPU cycle time in ns (log scale, 1 to 100,000,000) plotted against year, 1980 to 2000]

Level | Access time (cycles)
register | 1
cache | 1-10
memory | 50-100
disk | 20,000,000

Memory Hierarchy

¨ A typical processor is supported by:
¤ on-board main memory (e.g. SDRAM up to GB)
¤ on-chip or on-die cache memory (e.g. SRAM KB to MB)
¤ on-die registers
¨ Some processors also provide general purpose on-chip
¤ SRAM (e.g. embedded processor) which may be configured as SRAM/Cache combination (e.g. TI’s DSP)
¨ Typically, a processor also utilizes secondary non-volatile memory
¤ For permanent code and data storage like Flash-based memory and hard disk

[Figure: memory hierarchy; storage devices are smaller, faster and more expensive (per byte) towards L0, and larger, slower and cheaper (per byte) towards L5]
¤ L0: registers
¤ L1: on-chip L1 cache (SRAM)
¤ L2: off-chip L2 cache (SRAM)
¤ L3: main memory (DRAM)
¤ L4: local secondary storage (virtual memory) (local disks)
¤ L5: remote secondary storage (tapes, distributed file systems, Web servers)

¨ Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.

[Figure: Pentium III cache hierarchy]
¤ Regs. (processor chip)
¤ L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency
¤ L1 Instruction: 16 KB, 4-way, 32B lines
¤ L2 Unified: 128KB--2 MB, 4-way assoc, write-back, write allocate, 32B lines
¤ Main Memory: up to 4GB

Address Space

¨ The address space of a processor depends on its address decoding mechanism.
¤ Its size depends on the number of address bits used.

¨ Depending on the processor design, there may be two types of address space:
¤ One is used by normal memory access.
¤ Another one is reserved for I/O peripheral registers (control, status, and data).
¤ An extra control signal or a special means of access is needed to reach the alternate address space.
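The relation between address width and address-space size can be checked with a few lines of Python. This is a minimal sketch that assumes byte-addressable memory; the function names are illustrative and do not come from the slides.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    if address_bits < 0:
        raise ValueError("address_bits must be non-negative")
    return 1 << address_bits  # 2 ** address_bits


def highest_address(address_bits: int) -> int:
    """Largest address that fits in the given number of address bits."""
    return address_space_bytes(address_bits) - 1


for bits in (16, 20, 32):
    size = address_space_bytes(bits)
    print(f"{bits} address bits: {size} bytes, top address 0x{highest_address(bits):X}")
```

With 32 address bits the top address is 0xFFFFFFFF, which matches the upper limit drawn in the address-space figures of this section.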

Address Space

¨ The address space is the range of addresses that the processor can access, determined by the number of address bits used in the processor architecture.

¨ Some processor families (e.g. ARM) use only one address space for both memory and I/O devices

¤ i.e. everything is mapped in the same address space

[Figure: a single processor address space from 0x00000000 to 0xFFFFFFFF holding code, data and I/O registers]

Memory mapped vs I/O mapped

¨ Some processor families have two address spaces.

¨ E.g., for the x86 processor, memory and I/O devices can be mapped in two different address spaces:
¤ Memory address space and I/O address space

[Figure: an I/O address space from 0x0000 to 0xFFFF holding the I/O registers, next to a memory address space from 0x00000000 to 0xFFFFFFFF holding code and data]

Memory System Architectures

¨ Two types of information are found in typical program code:
¤ Instruction codes for execution
¤ Data that is used by the instruction codes

¨ Two classes of memory system design store this information:
¤ Von Neumann architecture
¤ Harvard architecture

[Figure: Von Neumann: a single path (bus) between processor and memory for both code and data; Harvard: separate buses for code and for data]

Processor Size

¨ The processor size is described in terms of ‘bits’ (e.g. an 8-bit or a 32-bit processor).

¤ Corresponds to the data size that can be manipulated at a time by the processor.

¤ Typically reflected in the size of the processor (internal) data path and register bank.

¨ Hence an 8-bit processor can only manipulate byte-size data at a time, while a 32-bit processor can handle 32-bit word-size data at a time.

• Even though the data content may only be of single-byte size.

Registers

¨ The most fundamental storage area in the processor
¤ is closely located to the processor
¤ provides very fast access, operating at the processor clock
¤ but is of limited amount (less than 100 typical)
¨ Most are of the general purpose type and can store any type of information:
¤ data – e.g. timer value, constants
¤ address – e.g. ASCII table, stack
¨ Some are reserved for a specific purpose
¤ program counter (PC or IP).
¤ program status register (SR).

[Figure: the program counter (PC) addresses program memory; fetched instructions I-1 to I-4 wait in the instruction queue; the instruction register holds I-1 for decode; operands op1 and op2 are read from registers or memory; the ALU executes and writes the output and the flags]

Data Organization in Memory

¨ A typical memory consists of storage locations, each storing data of a certain fixed size (most commonly 8 bits, i.e. one byte). Each location has a unique address.

¨ Depending on the data path size of the processor, the memory content is accessible as an 8-bit byte, a 16-bit half word, a 32-bit word, or even a 64-bit double word.

¨ A 32-bit datum consists of four bytes, stored in four successive memory locations. Data and code must be aligned to the address boundary of their size.

¤ E.g. for a 32-bit system, align to the word boundary, with the lowest two address bits equal to zero

¨ But what is the order of the four bytes of data? It depends on the endianness adopted.
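The word-alignment rule (lowest two address bits equal to zero for a 32-bit access) generalizes to any power-of-two size. The sketch below is illustrative only; `is_aligned` and `align_down` are hypothetical helper names, not part of any processor toolchain.

```python
def is_aligned(address: int, size_bytes: int) -> bool:
    """True when address is a multiple of size_bytes (a power of two)."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("size_bytes must be a power of two")
    # For a 4-byte word this tests that the lowest two address bits are zero.
    return address & (size_bytes - 1) == 0


def align_down(address: int, size_bytes: int) -> int:
    """Round address down to the nearest boundary of size_bytes."""
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("size_bytes must be a power of two")
    return address & ~(size_bytes - 1)


print(is_aligned(0x1000, 4))       # word access at a word boundary
print(is_aligned(0x1002, 4))       # same access, two bytes off
print(hex(align_down(0x1003, 4)))  # nearest word boundary below 0x1003
```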

Data Endianness

¨ In the Little Endian format, the least significant byte (LSB) is stored at the lowest address of the memory, with the most significant byte (MSB) stored at the highest address.
¨ In the Big Endian format, the least significant byte (LSB) is stored at the highest address of the memory, with the most significant byte (MSB) stored at the lowest address.

[Figure: the same word, MSB to LSB, laid out in a Big Endian and in a Little Endian memory address space, both starting at 0x000000]
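The two byte orders can be observed directly with Python's built-in `int.to_bytes`. The value 0x12345678 is an arbitrary example chosen for this sketch; index 0 of each result stands for the lowest memory address.

```python
value = 0x12345678  # MSB is 0x12, LSB is 0x78

# Byte at index 0 is the one stored at the lowest address.
little = value.to_bytes(4, "little")
big = value.to_bytes(4, "big")

print("little endian:", [f"0x{b:02X}" for b in little])
print("big endian:   ", [f"0x{b:02X}" for b in big])

# Reading the bytes back with the matching byte order recovers the value;
# reading them with the wrong order gives a byte-swapped result.
assert int.from_bytes(little, "little") == value
swapped = int.from_bytes(little, "big")
print(f"little-endian bytes read as big endian: 0x{swapped:08X}")
```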

Top Boot and Bottom Boot

¨ Different processor families use different locations for the reset vector used at boot-up.

¨ Examples:

¤ x86 boots up from the top of the memory space

¤ ARM boots up from the bottom of its memory space

[Figure: memory maps from 00..00h to FF..FFh; on x86 the reset vector sits at the top, above data and program; on ARM the reset vector sits at the bottom, below program and data]

CISC – Complex Instruction Set Computer.

Philosophy: Hardware is always faster than software.
Objective: The instruction set should be as powerful as possible.
With a powerful instruction set, fewer instructions (and less memory) are needed to complete the same task than with RISC.
CISC was developed at a time (early 60’s) when memory technology was not so advanced. Memory was small (in terms of kilobytes) and expensive.
But for embedded systems, especially Internet Appliances, memory efficiency comes into play again, especially in chip area and power.

¨ Many instructions
¨ Complex instructions
¤ Each instruction can execute several low level operations
¨ Complex addressing modes
¤ Smaller number of registers needed
¨ A semantically rich instruction set is accommodated by allowing instructions of variable length

RISC – Reduced Instruction Set Computer.

By reducing the number of instructions that a processor supports, and thereby reducing the complexity of the chip, it is possible to make individual instructions execute faster and achieve a net gain in performance even though more instructions might be required to accomplish a task.
RISC trades off instruction set complexity for instruction execution timing.
Large register set: having more registers allows memory access to be minimized.
Load/Store architecture: operating on data in memory directly is one of the most expensive operations in terms of clock cycles.
Fixed length instruction encoding: this simplifies instruction fetching and decoding logic and allows easy implementation of pipelining.

¤ All instructions are in register-to-register format except Load/Store, which access memory

¤ All instructions execute in a single cycle, save branch instructions, which require two.

¤ Almost all instructions have a single size and the same format.

Advantages of CISC

¨ As each instruction can execute several low level operations, the code size is reduced to save on memory requirement. Less main memory access is required and hence execution is faster.
¨ Backward code compatibility is maintained.
¤ Can add new (and more powerful) instructions while retaining the ‘old’ instruction set for code compatibility (i.e. the legacy program can still run)
¨ Easy to program.
¤ direct support of high-level language constructs.
¤ complex instructions that fit well with high-level language expression.

Limitations of CISC

¨ A highly encoded instruction set needs to be decoded by hardwired microcode electronic circuitry.
¤ More complex hardware design
¤ Slower instruction decoding/execution
¨ Variable length instructions: different execution times among instructions affect pipelined operations.

Advantages of RISC

¨ Simpler instructions:
¤ One clock per instruction gives faster execution than on a CISC processor with the same clock speed
¨ Simpler addressing mode:
¤ Faster decoding
¨ Fixed length instructions:
¤ Faster decoding and better pipeline performance
¨ Simpler hardware:
¤ Less silicon area
¤ Less power consumption

Limitations of RISC

¨ Fewer instructions than CISC:
¤ Compared to CISC, RISC needs more instructions to execute one task.
¤ code density is less.
¤ need more memory.
¨ No complex instruction:
¤ No hardware support for division, floating-point arithmetic operation.
¤ Need a more complex compiler and a longer compiling time

But ARM also adds DSP-like instructions to support commonly used signal processing functions.

CISC | RISC
Any instruction may reference memory | Only load/store references memory
Many instructions & addressing modes | Few instructions & addressing modes
Variable instruction formats | Fixed instruction formats
Single register set | Multiple register sets
Multi-clock cycle instructions | Single-clock cycle instructions
Micro-program interprets instructions | Hardware (FSM) executes instructions
Complexity is in the micro-program | Complexity is in the compiler
Less to no pipelining | Highly pipelined
Program code size small | Program code size large

RISC vs CISC

¨ RISC machines: SUN SPARC, SGI Mips, HP PA-RISC, ARM

¨ CISC machines: Intel 80x86, Motorola 680x0

¨ What really distinguishes RISC from CISC these days lies in the architecture and not in the instruction set.

¨ CISC occurs whenever there is a disparity in speed between CPU operations and memory accesses due to technology or cost.

¨ What about combining both ideas?

¤ Intel 8086 / Pentium P6 architecture is externally CISC but internally RISC & CISC!

¤ Intel IA-64 executes many instructions in parallel.

Feature | CISC (Intel 486) | RISC (MIPS R4000)
#instructions | 235 | 94
Addr. modes | 11 | 1
Inst. size (bytes) | 1-12 | 4
GP registers | 8 | 32

Instruction Code Format

¨ Opcode encoding depends on the number of bits used.

¤ Example: For ARM, all instructions are 32 bits long, but only 8 bits (bits 20 to 27) are used to encode the instruction. Hence a total of $2^{8} = 256$ different instructions are possible.

¨ A typical instruction is encoded with a specific bit pattern that consists of the following:

¤ Opcode field specifying the operation to be performed.
¤ Operand(s) identification (address) field that depends on the modes of addressing;
▪ this provides the address of the register/memory location(s) that store the operand(s), or the operand itself.
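Splitting an instruction word into its fields is a matter of shifting and masking. The sketch below is an illustration under stated assumptions: `extract_field` is a hypothetical helper, the example word 0xE0810002 is the classic ARM encoding of `ADD r0, r1, r2`, and the field positions (condition in bits 31-28, the 8-bit operation field in bits 27-20, Rn in bits 19-16, Rd in bits 15-12, Rm in bits 3-0) follow the ARM data-processing layout rather than anything stated on the slide.

```python
def extract_field(word: int, low_bit: int, width: int) -> int:
    """Return the unsigned value of `width` bits of `word` starting at `low_bit`."""
    return (word >> low_bit) & ((1 << width) - 1)


instruction = 0xE0810002  # assumed example: ADD r0, r1, r2 (ARM state)

fields = {
    "cond": extract_field(instruction, 28, 4),       # condition code
    "operation": extract_field(instruction, 20, 8),  # the 8-bit field, bits 27..20
    "rn": extract_field(instruction, 16, 4),         # first operand register
    "rd": extract_field(instruction, 12, 4),         # destination register
    "rm": extract_field(instruction, 0, 4),          # second operand register
}

for name, value in fields.items():
    print(f"{name:9s} = 0x{value:X}")

# An 8-bit field can distinguish 2**8 patterns.
print("patterns in an 8-bit field:", 1 << 8)
```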

Instruction Opcode Types:

¨ General categories of instruction operations:
¤ Data transfer
▪ E.g. move, load, and store
¤ Data manipulation
▪ E.g. add, subtract, logical operation
¤ Program control
▪ E.g. branch, subroutine call

Operand Addressing Types:

¨ Immediate addressing.
¤ Operand is given in the instruction.
¨ Register addressing.
¤ Operand is stored in a register.
¨ Direct addressing.
¤ Operand is stored in memory, with the address given in the instruction.
¨ Indirect (Index) addressing.
¤ Operand is stored in memory, with the address given in a register (the address is added to an offset given in the instruction).
¨ Implied addressing
¤ Implicit location like stack and program counter.

Instruction Execution

¨ Multiple stages are involved in executing an instruction. Example:

1) Fetching the instruction code. Reads the instruction from the memory.
2) Decoding the instruction code. Determines which instruction is to be executed.
3) Executing the instruction code. Performs the operations necessary to complete what the instruction is supposed to do: read data from memory, write data to memory or an I/O device, perform operations only within the CPU, or a combination of those.

¨ Hence multiple processor clock cycles are needed to execute one single instruction.

[Figure: timeline in which the 1st instruction is fetched, decoded and executed before the 2nd instruction is fetched, decoded and executed]

Instruction Pipeline

¨ A pipeline allows several different instructions to be in execution at the same time

¨ During normal operation

• While one instruction is being executed,
• the next instruction is being decoded,
• and a third instruction is being fetched from memory.
• This allows effective throughput to increase to one instruction per clock cycle.

Pipeline Architecture:

¨ A longer pipeline can also be used to further break down the operation carried out in each individual stage.
¤ Simpler logic for each stage allows the system clock to increase.

$$\text{Maximum speedup} \le \text{Number of stages}$$

$$\text{Speedup} \approx \frac{\text{Time for unpipelined operation}}{\text{Time for longest stage}}$$

Example: A 5-stage instruction pipeline

[Figure: the 1st to 5th instructions overlapped in time, each passing through Fetch Instruction, Decode Instruction, Fetch Operand, Execute Instruction and Store Result, one stage behind its predecessor. Parallel execution of multiple instructions. Assume instructions are completely independent!]
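The speedup bound can be illustrated by counting cycles. This is a simplified sketch under assumptions that are not part of the slides: every stage takes exactly one clock cycle, there are no stalls, and all instructions are independent. The function names are illustrative.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


for n in (1, 5, 100, 1_000_000):
    print(f"{n:>9} instructions on 5 stages: speedup {speedup(n, 5):.4f}")
```

For long instruction streams the ratio approaches the number of stages without reaching it, which is the sense in which the maximum speedup is bounded by the stage count.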

[Figure: ARM Cortex-A15]

Data representation. Fixed-point vs Floating-point

¨ Numerical values represented as binary fractions: -1.0 ≤ value < 1.0

¨ Why a fractional representation?
¤ Multiplying a fraction by a fraction always results in a fraction and will not produce an overflow (e.g., 0.99 x 0.9999 = less than 1). Successive additions may cause overflow.
¤ Normalized representation is convenient. Signal processing is multiplication-intensive.
¤ Coefficients from digital filter designs are typically already in fractional form.

[Figure: n-bit fractional word; the sign bit has weight $-2^{0}$, the radix point follows it, and the remaining bits have weights $2^{-1}, 2^{-2}, 2^{-3}, \dots, 2^{-(n-1)}$]

¨ Fixed-point Notation:
¤ Decimal point is always in a fixed location (e.g., 0.74, 0.34, etc.).

¤ Fixed-point notation prevents overflow (useful with a small dynamic range).

¤ Fixed-point notation is less expensive.

¨ How is fixed-point notation realized in a DSP?
¤ Most fixed-point DSPs are 16 bits.

¤ The range of numbers that can be represented is $2^{15}-1$ to $-2^{15}$.

¤ The most common fixed-point format is Q15.

Q15 Notation
¤ Bit 15: sign
¤ Bits 14 to 0: two’s complement number

Dynamic range in Q15

Number | Biggest | Smallest
Fractional number | 0.999 | -1.000
Scaled integer for Q15 | 32767 | -32768

Number representations in Q15

Decimal | Q15 = Decimal x $2^{15}$ | Q15 Integer
0.5 | 0.5 x 32768 | 16384
0.05 | 0.05 x 32768 | 1638
0.0012 | 0.0012 x 32768 | 39

Addition

Decimal | Q15 | Scale back (Q15 / 32768)
0.5 + 0.05 = 0.55 | 16384 + 1638 = 18022 | 0.55
0.5 – 0.05 = 0.45 | 16384 – 1638 = 14746 | 0.45

Multiplication

Decimal | Q15 | Back to Q15 (Product / 32768) | Scale back (Q15 / 32768)
0.5 x 0.45 = 0.225 | 16384 x 14746 = 241598464 | 7373 | 0.225
2 x 0.5 x 0.45 = 0.225 + 0.225 = 0.45 | 7373 + 7373 | 14746 | 0.45

Rules for operations
¤ Avoid operations with numbers larger than 1: 2.0 x (0.5 x 0.45) = (0.2 x 0.5 x 0.45) x 10 = (0.5 x 0.45) + (0.5 x 0.45)
¤ Scale numbers before the operation: 0.5 in Q15 = 0.5 x 32768 = 16384
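The Q15 operations worked through above can be sketched in Python. The helpers below are illustrative, not a DSP library API. They assume a scale factor of $2^{15} = 32768$, saturation to the range -32768..32767, and rounding of the product before it is scaled back; real devices may truncate instead.

```python
Q15_SCALE = 1 << 15   # 32768
Q15_MAX = 32767
Q15_MIN = -32768


def saturate(x: int) -> int:
    """Clamp an integer to the 16-bit two's complement range."""
    return max(Q15_MIN, min(Q15_MAX, x))


def to_q15(x: float) -> int:
    """Scale a fraction in [-1.0, 1.0) to a Q15 integer."""
    return saturate(round(x * Q15_SCALE))


def from_q15(q: int) -> float:
    """Scale a Q15 integer back to a fraction."""
    return q / Q15_SCALE


def q15_add(a: int, b: int) -> int:
    """Addition can overflow, so the sum is saturated."""
    return saturate(a + b)


def q15_mul(a: int, b: int) -> int:
    """The raw product carries 30 fraction bits; round, then shift back to 15."""
    return saturate((a * b + (1 << 14)) >> 15)


half, small, other = to_q15(0.5), to_q15(0.05), to_q15(0.45)
print(half, small, other)
print(q15_add(half, small), from_q15(q15_add(half, small)))
product = q15_mul(half, other)
print(product, from_q15(product))
print(q15_add(product, product), from_q15(q15_add(product, product)))
```

Doubling is done by adding the product to itself, which follows the rule of avoiding operands larger than 1.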

¨ Floating-point Notation: single-precision floating-point format

Bit No | 31 ... 24 | 23 | 22 ...... 0
Field | e | s | f
Width | 8 bits | 1 bit | 23 bits

e = exponent, a signed two’s complement 8-bit field that determines the location of the binary point
s = sign of mantissa (s = 0 positive, s = 1 negative)
f = fractional part of the mantissa; an implied 1.0 is added to this fraction but is not allocated in the bit field since this value is always present

Conversion equations

Sign | Binary | Decimal | Equation
s = 0 | $X = 01.f \times 2^{e}$ | $X = 01.f \times 2^{e}$ | 1
s = 1 | $X = 10.f \times 2^{e}$ | $X = (-2 + 0.f) \times 2^{e}$ | 2

Special case: s = 0 and e = -128 give X = 0

Exponent (e)

Decimal | 0 | 1 | 127 | -1 | -128
Hex two’s comp. | 00 | 01 | 7F | FF | 80

¨ Floating-point Numbers:

Calculate 1.0e0
¤ In hex: 00 00 00 00
¤ In binary: 00000000000000000000000000000000
¤ s = 0, so Equation 1 applies: $X = 01.f \times 2^{e}$
¤ e = 0, f = 0
¤ $X = 01.0 \times 2^{0} = 1.0$

Calculate -2.0e0
¤ In hex: 00 80 00 00
¤ In binary: 00000000100000000000000000000000
¤ s = 1, so Equation 2 applies: $X = (-2.0 + 0.f) \times 2^{e}$
¤ e = 0, f = 0
¤ $X = (-2.0 + 0.0) \times 2^{0} = -2.0$

Calculate 1.5e01
¤ In hex: 03 70 00 00
¤ In binary: 00000011011100000000000000000000
¤ s = 0, so Equation 1 applies: $X = 01.f \times 2^{e}$
¤ e = 3 (0000 0011), f = 0.5 + 0.25 + 0.125 = 0.875 (s111 ...)
¤ $X = 01.875 \times 2^{3} = 15.0$ decimal

Addition: 1.5 + (-2.0) = -0.5

Multiplication: 1.5e00 x 1.5e01 = 2.25e01 = 22.5
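The two conversion equations and the special case can be turned into a small decoder. This is a sketch of the format exactly as described in this section (8-bit two's complement exponent in bits 31-24, sign in bit 23, fraction in bits 22-0); it is not IEEE 754, and `decode_float` is a hypothetical name. As a simplification, any word with e = -128 is treated as zero.

```python
def decode_float(word: int) -> float:
    """Decode a 32-bit word in the e/s/f format described above."""
    e = (word >> 24) & 0xFF
    if e >= 0x80:            # interpret the exponent as two's complement
        e -= 0x100
    s = (word >> 23) & 1
    f = (word & 0x7FFFFF) / (1 << 23)  # fraction 0.f

    if e == -128:            # special case: zero (simplified, see lead-in)
        return 0.0
    if s == 0:
        return (1.0 + f) * 2.0 ** e    # Equation 1: 01.f x 2^e
    return (-2.0 + f) * 2.0 ** e       # Equation 2: (-2 + 0.f) x 2^e


for word in (0x00000000, 0x00800000, 0x03700000, 0x00400000, 0x80000000):
    print(f"0x{word:08X} -> {decode_float(word)}")
```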

¨ Dynamic Range
¤ Ranges of number systems

Numbers | Base 2 | Decimal | Two’s Complement Hex
Largest Integer | $2^{31} - 1$ | 2 147 483 647 | 7F FF FF FF
Smallest Integer | $-2^{31}$ | -2 147 483 648 | 80 00 00 00
Largest Q15 | $2^{15} - 1$ | 32 767 | 7F FF
Smallest Q15 | $-2^{15}$ | -32 768 | 80 00
Largest Floating Point | $(2 - 2^{-23}) \times 2^{127}$ | $3.402823 \times 10^{38}$ | 7F 7F FF FF
Smallest Floating Point | $-2 \times 2^{127}$ | $-3.402823 \times 10^{38}$ | 7F 80 00 00

¤ The dynamic range of floating-point representation is very large
¤ Conclusion
▪ Largest integer x $(1.5 \times 10^{29})$ ≈ largest floating point
▪ Largest Q15 x $(1.03 \times 10^{34})$ ≈ largest floating point

¨ DSP devices are designed as floating point or fixed point.
¨ Floating-point devices usually have a full set of fixed-point instructions.
¨ Floating-point devices are easier to program.
¨ Fixed-point devices can emulate floating point in software.

Characteristic | Floating point | Fixed point
Dynamic range | much larger | smaller
Resolution | comparable | comparable
Speed | comparable | comparable
Ease of programming | much easier | more difficult
Compiler efficiency | more efficient | less efficient
Power consumption | comparable | comparable
Chip cost | comparable | comparable
System cost | comparable | comparable
Design cost | less | more
Time to market | faster | slower

¨ Applications which require:
¤ High precision.
¤ Wide dynamic range.
¤ High signal-to-noise ratio.
¤ Ease of use.
need a floating-point processor.

¨ Drawbacks of floating-point processors:
¤ Higher power consumption.
¤ Can be more expensive.
¤ Can be slower than fixed-point counterparts and larger in size.

Interrupt, Exceptions, Watch-Dog, …

¨ Exceptions:

¤ Exception handling is a combination of hardware behaviors and software constructs designed to manage a unique condition.
▪ Related to the current program flow.
▪ Result of unexpected error conditions (such as a bus error).
▪ Result of illegal operations (guarded memory access).
▪ Some exceptions can be programmed to occur (FIT, PIT).
▪ A software routine could not execute properly (divide by 0).
¤ Exception handling changes the normal flow of software execution.

¨ Interrupts:

¤ A hardware interrupt is an asynchronous signal from hardware, either originating outside the SoC or from the programmable logic within the SoC, indicating a peripheral's need for attention.
▪ Embedded processor peripheral (FIT, PIT, for example).
▪ External bus peripheral (UART, EMAC, for example).
▪ External interrupts enter via hardware pin(s).
▪ Multiple hardware interrupts can utilize the general interrupt controller of the PS.
¤ A software interrupt is a synchronous event in software, often referred to as an exception, indicating the need for a change in execution.
▪ Examples: divide by zero, illegal instruction, user-generated software interrupt.

¨ Cortex-A9 Modes and Registers:

Cortex-A9 has seven execution modes
¤ Five are exception modes.
¤ Each mode has its own stack space and a different subset of registers.
¤ System mode will use the user mode registers.

Cortex-A9 has 37 registers
¤ Up to 18 visible at any one time.
¤ Execution modes have some private registers that are banked in when the mode is changed.
¤ Non-banked registers are shared between modes.

¨ Cortex-A9 Exceptions:

In the Cortex-A9 processor, interrupts are handled as exceptions

¤ Each Cortex-A9 processor core accepts two different levels of interrupts.
▪ nFIQ interrupts from secure sources (serviced first).
▪ nIRQ interrupts from either secure sources or non-secure sources.

Interrupt Servicing in Cortex-A9:

¨ When an interrupt is received, the currently executing instruction completes.

¨ Save processor status

¤ Copies CPSR into SPSR_irq.

¤ Stores the return address in LR_irq.

¨ Change processor status for exception

¤ Mode field bits.

¤ ARM or Thumb (T2) state.

¤ Interrupt disable bits (if appropriate).

¤ Sets PC to vector address (either FIQ or IRQ).

¨ The above steps are performed automatically by the core
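The entry sequence above can be modelled as a toy state update. This sketch is only an illustration: the CPU is a plain dictionary, `take_irq` is a hypothetical name, and several details are assumptions that do not come from the slides, namely the CPSR bit positions (mode in bits 4-0, T in bit 5, F in bit 6, I in bit 7), the IRQ mode encoding 0b10010, the low-vector IRQ address 0x18, and entry in ARM state. The return address is passed in rather than derived.

```python
MODE_MASK = 0x1F
MODE_IRQ = 0b10010     # assumed IRQ mode encoding
T_BIT = 1 << 5         # Thumb state bit
I_BIT = 1 << 7         # IRQ disable bit
VECTOR_IRQ = 0x18      # assumed low-vector IRQ address


def take_irq(cpu: dict, return_address: int) -> dict:
    """Return the CPU state after IRQ entry; the input dictionary is not modified."""
    new = dict(cpu)
    # Save processor status.
    new["spsr_irq"] = cpu["cpsr"]
    new["lr_irq"] = return_address
    # Change processor status for the exception.
    cpsr = (cpu["cpsr"] & ~MODE_MASK) | MODE_IRQ  # mode field bits
    cpsr &= ~T_BIT                                # assumed: handler starts in ARM state
    cpsr |= I_BIT                                 # disable further IRQs
    new["cpsr"] = cpsr
    new["pc"] = VECTOR_IRQ                        # jump to the IRQ vector
    return new


# User mode (0b10000) with Z and C flags set, interrupted while running at 0x8000.
before = {"cpsr": 0x60000010, "pc": 0x8000}
after = take_irq(before, return_address=0x8008)
for key in ("cpsr", "spsr_irq", "lr_irq", "pc"):
    print(f"{key:8s} = 0x{after[key]:08X}")
```

Returning from the handler would reverse these steps by restoring CPSR from SPSR_irq and PC from LR_irq; that part is left out of the sketch.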

Generic Interrupt Controller (GIC)

¨ Supports interrupt prioritization

¨ Handles up to 16 software-generated interrupts (SGI)

¨ Supports 64 shared peripheral interrupts (SPI) starting at ID 32

¨ Processes both level-sensitive interrupts and edge-sensitive interrupts

¨ Five private peripheral interrupts (PPI) dedicated for each core
¤ The global timer, private watchdog timer, private timer, and FIQ/IRQ from the PL.
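The interrupt ID ranges can be captured in a small classifier. This is a sketch under assumptions: the SGI range (IDs 0 to 15) and the SPI range (64 IDs starting at 32) follow the counts given above, while the PPI range 16 to 31 is the usual GIC convention and is not stated on the slide. `classify_interrupt_id` is a hypothetical helper, not a driver API.

```python
SGI_COUNT = 16     # software-generated interrupts, IDs 0..15
PPI_FIRST = 16     # assumed: private peripheral interrupts use IDs 16..31
SPI_FIRST = 32     # shared peripheral interrupts start at ID 32
SPI_COUNT = 64


def classify_interrupt_id(interrupt_id: int) -> str:
    """Map a GIC interrupt ID to its class: 'SGI', 'PPI' or 'SPI'."""
    if 0 <= interrupt_id < SGI_COUNT:
        return "SGI"
    if PPI_FIRST <= interrupt_id < SPI_FIRST:
        return "PPI"
    if SPI_FIRST <= interrupt_id < SPI_FIRST + SPI_COUNT:
        return "SPI"
    raise ValueError(f"interrupt ID {interrupt_id} is outside the modelled range")


for interrupt_id in (0, 15, 27, 32, 95):
    print(interrupt_id, classify_interrupt_id(interrupt_id))
```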

32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.

A microcontroller combines onto the same microchip:
¨ The CPU core
¨ Memory (both ROM and RAM)
¨ I/O – parallel, serial, analog, digital

ARM Ltd

¨ Founded in November 1990
¤ Spun out of Acorn Computers
¤ Initial funding from Apple, Acorn and VLSI
¨ Designs the ARM range of RISC processor cores
¤ Licenses ARM core designs to semiconductor partners who fabricate and sell to their customers
¤ ARM does not fabricate silicon itself
¨ Also develops technologies to assist with the design-in of the ARM architecture
¤ Software tools, boards, debug hardware
¤ Application software
¤ Bus architectures
¤ Peripherals, etc

[Figure: Equipment Adopting 32-bit ARM Microcontrollers: IR fire detector, exercise machines, utility meters, intelligent toys, energy efficient appliances, intelligent vending, tele-parking]

Cortex Family

¨ ARM Cortex-A family (v7-A):
¤ Applications processors for full OS and 3rd party applications

¨ ARM Cortex-R family (v7-R):
¤ Embedded processors for real-time signal processing, control applications

¨ ARM Cortex-M family (v7-M):
¤ Microcontroller-oriented processors for MCU and SoC applications

[Figure: the Cortex range, from Cortex-M0 (12k gates...) through Cortex-M1, Cortex-M3, SC300, Cortex-M4, Cortex-R4, Cortex-A5 (x1-4), Cortex-A8 and Cortex-A9 (x1-4) up to Cortex-A15 (x1-4, ...2.5GHz)]

ARM Cortex Family

Cortex-A8 | Cortex-R4 | Cortex-M3
Architecture v7A | Architecture v7R | Architecture v7M
MMU | MPU (optional) | MPU (optional)
AXI | AXI | AHB Lite & APB
VFP & NEON support | Dual Issue |

Relative Performance

[Figure: bar chart of maximum frequency (MHz), axis from 0 to 2500, per processor]

Processor | Max Freq (MHz) | Min Power (mW/MHz)
Cortex-M0 | 50 | 0.012
Cortex-M3 | 150 | 0.06
ARM7 | 184 | 0.35
ARM926 | 470 | 0.235
ARM1026 | 540 | 0.36
ARM1136 | 610 | 0.335
ARM1176 | 750 | 0.568
Cortex-A8 | 1100 | 0.43
Cortex-A9 Dual-core | 2000 | 0.5

ARM architecture

¨ Load/store architecture

¨ A large array of uniform registers

¨ Fixed-length 32-bit instructions

¨ 3-address instructions

Data Sizes and Instruction Sets

¨ The ARM is a 32-bit architecture.
¨ When used in relation to the ARM:
¤ Byte means 8 bits
¤ Halfword means 16 bits (two bytes)
¤ Word means 32 bits (four bytes)
¨ Most ARMs implement two instruction sets
¤ 32-bit ARM Instruction Set
¤ 16-bit Thumb Instruction Set
¨ Jazelle cores can also execute Java bytecode

[Figure: bit positions 31 to 0 showing the 8-bit byte, 16-bit half word and 32-bit word]

[Figure: ARM and Thumb Performance; Dhrystone 2.1/sec @ 20MHz (0 to 30000) for ARM and Thumb against memory width (zero wait state): 32-bit, 16-bit, and 16-bit with 32-bit stack]

The Thumb-2 instruction set

¨ Variable-length instructions

¤ ARM instructions are a fixed length of 32 bits.

¤ Thumb instructions are a fixed length of 16 bits.

¤ Thumb-2 instructions can be either 16-bit or 32-bit.

¨ Thumb-2 gives approximately 26% improvement in code density over ARM

¨ Thumb-2 gives approximately 25% improvement in performance over Thumb.


Cortex-M Programmer’s Model

¨ Fully programmable in C
¨ Stack-based exception model
¨ Only two processor modes
¤ Thread Mode for User tasks
¤ Handler Mode for OS tasks and exceptions
¨ Vector table contains addresses

[Figure: 32-bit register set r0 to r12, sp (Main and Process), lr, r15 (pc) and xPSR]

Address Space

65 Cortex-M3 Processor Privilege

ARM Cortex-M3 Supervisor

Handler Mode

Privileged Aborts Interrupts Reset

OS

System Call User Non-Privileged (SVCall) Undefined Application code Thread Mode Instruction

Memory

Instructions & Data 32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.

Cortex-M3 Interrupt Handling

¨ One Non-Maskable Interrupt (INTNMI) supported
¨ 1-240 prioritizable interrupts supported
  ¤ Interrupts can be masked
  ¤ Implementation option selects number of interrupts supported
¨ Nested Vectored Interrupt Controller (NVIC) is tightly coupled with processor core
¨ Interrupt inputs are active HIGH

[Figure: INTNMI and INTISR[239:0] (1-240 interrupts) enter the NVIC, which is attached to the Cortex-M3 processor core]

Cortex-M3 Exception Handling

¤ Reset : power-on or system reset
¤ NMI : cannot be stopped or preempted by any exception other than reset
¤ Faults
  ▪ Hard Fault : default Fault or any fault unable to activate
  ▪ Memory Manage : MPU violations
  ▪ Bus Fault : prefetch and memory access violations
  ▪ Usage Fault : undefined instructions, divide by zero, etc.
¤ SVCall : privileged OS requests
¤ Debug Monitor : debug monitor program
¤ PendSV : pendable request for system service (typically used by an OS for context switching)
¤ SysTick Interrupt : internal system timer, e.g., used by an RTOS to periodically check resources or peripherals
¤ External Interrupt : e.g., external peripherals

Cortex-M3 Program Status Register

¨ One Status Register consisting of
  ¤ APSR - Application Program Status Register – ALU flags
  ¤ IPSR - Interrupt Program Status Register – Interrupt/Exception No.
  ¤ EPSR - Execution Program Status Register
    ▪ IT field – If/Then block information
    ▪ ICI field – Interruptible-Continuable Instruction information
¨ xPSR
  ¤ Composite of the 3 PSRs
  ¤ Stored on the stack on exception entry

Bit layout of xPSR:
  [31] N   [30] Z   [29] C   [28] V   [27] Q
  [26:25] IT/ICI   [24] T   [15:10] IT/ICI   [8:0] ISR Number
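The composite register can be unpacked with shifts and masks. The following is a minimal sketch, assuming the ARMv7-M field positions (IT/ICI bits split across [26:25] and [15:10], exception number in [8:0]); `decode_xpsr` and the dictionary keys are illustrative names, not part of any ARM API.

```python
def decode_xpsr(xpsr):
    """Split a 32-bit xPSR value into APSR, EPSR and IPSR fields."""
    xpsr &= 0xFFFFFFFF
    return {
        # APSR: ALU flags
        "N": (xpsr >> 31) & 1,
        "Z": (xpsr >> 30) & 1,
        "C": (xpsr >> 29) & 1,
        "V": (xpsr >> 28) & 1,
        "Q": (xpsr >> 27) & 1,
        # EPSR: Thumb state bit plus the 8 IT/ICI bits,
        # reassembled from bits [15:10] (upper six) and [26:25] (lower two)
        "T": (xpsr >> 24) & 1,
        "IT_ICI": (((xpsr >> 10) & 0x3F) << 2) | ((xpsr >> 25) & 0x3),
        # IPSR: number of the exception being handled (0 = Thread Mode)
        "ISR": xpsr & 0x1FF,
    }
```

For example, `decode_xpsr(0x6100000F)` reports Z and C set, the Thumb bit set, and exception number 15, which is SysTick on Cortex-M3.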

Conditional Execution

¨ If – Then (IT) instruction added (16 bit)
  ¤ Up to 3 additional “then” or “else” conditions may be specified (T or E)
  ¤ Makes up to 4 following instructions conditional

      ITTET EQ
      MOVEQ        ; Inst 1
      ADDEQ        ; Inst 2
      SUBNE        ; Inst 3
      ORREQ        ; Inst 4

¨ Any normal ARM condition code can be used
¨ 16-bit instructions in block do not affect condition code flags
  ¤ Apart from comparison instruction.
  ¤ 32 bit instructions may affect flags (normal rules apply)
¨ Current “if-then status” stored in the program status register (the IT field of the EPSR on Cortex-M3, the CPSR on other ARM cores)
  ¤ Conditional block may be safely interrupted and returned to
  ¤ Must NOT branch into or out of ‘if-then’ block
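The way the T/E letters of an IT mnemonic map onto the following instructions can be sketched as below. This is an illustrative helper only (the name `it_block_conditions` and the table `INVERSE` are assumptions, not assembler features): the first instruction always takes the stated condition, each extra `T` repeats it, and each `E` takes its inverse.

```python
# Each ARM condition code paired with its logical inverse.
INVERSE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def it_block_conditions(mnemonic, cond):
    """Return the condition applied to each instruction of an IT block."""
    mnemonic = mnemonic.upper()
    cond = cond.upper()
    suffix = mnemonic[2:]  # the optional T/E letters after "IT"
    if not mnemonic.startswith("IT") or len(suffix) > 3 or set(suffix) - set("TE"):
        raise ValueError(f"not a valid IT mnemonic: {mnemonic}")
    if cond not in INVERSE:
        raise ValueError(f"unsupported condition code: {cond}")
    # First instruction uses cond; T keeps it, E inverts it.
    return [cond] + [cond if flag == "T" else INVERSE[cond] for flag in suffix]
```

With the slide's example, `it_block_conditions("ITTET", "EQ")` yields EQ, EQ, NE, EQ, matching MOVEQ, ADDEQ, SUBNE, ORREQ.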

Classes of Instructions (v4T)

[Figure: instruction classes with example mnemonics. Load/Store: LDR, STR, SWP. Data Operations: ADD, MUL, LSL, AND, CMP. Change of Flow: MOV PC, Rm; Bcc; BL; BLX. Miscellaneous: ADR, SWI.]

Data Processing Instructions

¨ Consist of :
  ¤ Arithmetic: ADD ADC SUB SBC RSB RSC
  ¤ Logical: AND ORR EOR BIC
  ¤ Comparisons: CMP CMN TST TEQ
  ¤ Data movement: MOV MVN
¨ These instructions only work on registers, NOT memory.
¨ Syntax: <Operation>{<cond>}{S} Rd, Rn, Operand2
  ▪ Comparisons set flags only - they do not specify Rd
  ▪ Data movement does not specify Rn
  ▪ Second operand is sent to the ALU via barrel shifter.

Using a Barrel-shifter: The 2nd Operand

Register, optionally with shift operation
• Shift value can be either:
  • 5 bit unsigned integer
  • Specified in bottom byte of another register.
• Used for multiplication by constant

Immediate value
• 8 bit number, with a range of 0-255.
  • Rotated right through even number of positions
• Allows increased range of 32-bit constants to be loaded directly into registers

[Figure: Operand 1 goes straight to the ALU; Operand 2 passes through the Barrel Shifter before reaching the ALU, which produces the Result]
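Whether a given 32-bit constant fits the "8-bit value rotated right by an even amount" form can be checked by trying all sixteen rotations. The sketch below follows the classic ARM-mode rule described on this slide (the Thumb-2 encoding used by Cortex-M3 has a different set of allowed constants); `ror32` and `encode_arm_immediate` are hypothetical helper names.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(value):
    """Return (imm8, rotation) with value == ror32(imm8, rotation), or None."""
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):  # only even rotations are allowed
        # Rotating left by `rotation` undoes a rotate-right by `rotation`.
        imm8 = ror32(value, 32 - rotation)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None  # constant cannot be expressed as a single immediate
```

For instance, `0xFF000000` is encodable as `0xFF` rotated right by 8, while `0x101` is not, because its set bits span nine bit positions.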

Single Register Data Transfer

  LDR     STR     Word
  LDRB    STRB    Byte
  LDRH    STRH    Halfword
  LDRSB           Signed byte load
  LDRSH           Signed halfword load

¨ Memory system must support all access sizes
¨ Syntax:
  ¤ LDR{<cond>}{<size>} Rd, <address>
  ¤ STR{<cond>}{<size>} Rd, <address>
  e.g. LDREQB

Cortex-M3 Datapath

[Figure: Cortex-M3 datapath. Blocks: Instruction Decode, Register Bank, Barrel Shifter, ALU, Mul/Div, Address Registers with Address Incrementers, Write Data and Read Data registers, Writeback. Signals: I_HADDR, I_HRDATA, D_HADDR, D_HWDATA, D_HRDATA, INTADDR; ALU operand buses A and B.]

Cortex-M3 Pipeline

¨ Cortex-M3 has 3-stage fetch-decode-execute pipeline
  ¤ Similar to ARM7
  ¤ Cortex-M3 does more in each stage to increase overall performance

[Figure: 1st Stage - Fetch: Instruction Fetch (Prefetch). 2nd Stage - Decode: Instruction Decode & Register Read, AGU, Branch. 3rd Stage - Execute: Load/Store Address Phase & Write Back, Data Phase, Multiply & Divide, Shift, ALU & Branch, Write. Branch forwarding & speculation from the decode stage; execute stage branch (ALU branch & Load Store Branch).]

SW Development

¨ GNU compiler and binutils
  ¤ gcc: GNU C compiler
  ¤ as: GNU assembler
  ¤ ld: GNU linker
  ¤ gdb: GNU project debugger
  ¤ COFF (common object file format)
  ¤ ELF (Executable and Linkable Format)
  ¤ Segments in the object file
    ▪ Text: code
    ▪ Data: initialized global variables
    ▪ BSS: uninitialized global variables

Tool flow: C source (.c) → gcc → asm source (.s) → as → object file (.coff) → ld → executable (.elf) → Simulator / Debugger

Keil™ uVision® example

Source code (Editor):

  Start   ; direction register
          LDR R1,=GPIO_PORTD_DIR_R
          LDR R0,[R1]
          ORR R0,R0,#0x0F    ; make PD3-0 output
          STR R0,[R1]

Object code after Build Target (F7):

  Address       Data
  0x00000142    4912
  0x00000144    6808
  0x00000146    F040000F
  0x0000014A    6008

[Figure: the object code is either run in a debug session on a simulated microcontroller (processor, memory, I/O) or downloaded to a real microcontroller (processor, memory, I/O) for a debug session]

¨ Introduction

¨ Processor Architectural Features. Datapath & pipeline.

¨ Data Representation: Fixed-point vs Floating-point

¨ Interrupts, Exceptions, Watch-Dog, …

¨ 32-bit microcontroller. ARM Cortex-M3 ¤ ARM Cortex-M3 Architecture. Programmers Model.

¨ 32/64bit microprocessor. ¤ Intel x86, UltraSparc Architecture. Programmers Model 32/64bit microprocessor. Intel x86, UltraSparc

Intel x86 Processor Evolution:

  Name          Date    Transistors    MHz
  8086          1978    29K            5-10
    ▪ First 16-bit processor. Basis for IBM PC & DOS
    ▪ 1MB address space
  386           1985    275K           16-33
    ▪ First 32 bit processor, referred to as IA32
    ▪ Added “flat addressing”
    ▪ Capable of running Unix
    ▪ Until recently, 32-bit Linux/gcc used no instructions introduced in later models
  Pentium 4F    2005    230M           2800-3800
    ▪ First 64-bit Intel x86 processor
    ▪ Meanwhile, Pentium 4s (Netburst arch.) phased out in favor of “Core” line

Intel x86 Processor Evolution:

Machine Evolution
  ▪ 486            1989    1.9M
  ▪ Pentium        1993    3.1M
  ▪ Pentium/MMX    1997    4.5M
  ▪ PentiumPro     1995    6.5M
  ▪ Pentium III    1999    8.2M
  ▪ Pentium 4      2001    42M
  ▪ Core 2 Duo     2006    291M

Added Features
  ▪ Instructions to support multimedia operations
    • Parallel operations on 1, 2, and 4-byte data, both integer & FP
  ▪ Instructions to enable more efficient conditional operations

Linux/GCC Evolution
  ▪ Very limited, needs to get better – trying to maintain compatibility

Intel x86 Processor Evolution:

  Name                   Date    Transistors
  Itanium                2001    10M
    ▪ First shot at 64-bit architecture: first called IA64
    ▪ Radically new instruction set designed for high performance
    ▪ Can run existing IA32 programs
      • On-board “x86 engine”
    ▪ Joint project with Hewlett-Packard - Boat Anchor
  Itanium 2              2002    221M
    ▪ Big performance boost
  Itanium 2 Dual-Core    2006    1.7B

¤ Itanium has not taken off in marketplace
  ▪ Lack of backward compatibility, no good compiler support, Pentium 4 too good.

¨ IA-32 architecture
  ¤ Lots of architecture improvements: pipelining, superscalar, branch prediction, hyperthreading and multi-core.
  ¤ From programmer’s point of view, IA-32 has not changed substantially except the introduction of a set of high-performance instructions

¨ Modes of operation
  ¤ Protected mode
    ▪ Native mode (Windows, Linux), full features, separate memory
    • Virtual-8086 mode
      • hybrid of Protected
      • each program has its own 8086 computer
  ¤ Real-address mode
    ▪ Native MS-DOS
  ¤ System management mode
    ▪ Power management, system security, diagnostics

¨ Addressable Memory
  ¤ Protected mode
    ▪ 4 GB
    ▪ 32-bit address
  ¤ Real-address and Virtual-8086 modes
    ▪ 1 MB space
    ▪ 20-bit address

¨ General Purpose Registers

  32-bit General-Purpose Registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI
  16-bit Segment Registers: CS, SS, DS, ES, FS, GS
  EFLAGS
  EIP

¨ Accessing parts of registers
  ¤ Use 8-bit name, 16-bit name, or 32-bit name
  ¤ Applies to EAX, EBX, ECX, and EDX
  ¤ The 16-bit registers are usually used only in real-address mode.

[Figure: EAX (32 bits) contains AX (16 bits) in its low half; AX is split into AH and AL (8 bits + 8 bits)]
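The relationship between the 8-, 16- and 32-bit names is plain bit slicing. A minimal sketch, treating a Python integer as the register contents (`split_eax` is an illustrative name):

```python
def split_eax(eax):
    """Return the sub-register views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF          # low 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # high byte of AX
        "AL": ax & 0xFF,         # low byte of AX
    }
```

With EAX = 12345678h, AX is 5678h, AH is 56h and AL is 78h; the upper 16 bits of EAX have no name of their own.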

¨ Floating-point, MMX, XMM registers.
  ¤ Eight 80-bit floating-point data registers
    ▪ ST(0), ST(1), . . . , ST(7)
    ▪ arranged in a stack
    ▪ used for all floating-point arithmetic
  ¤ Eight 64-bit MMX registers.
  ¤ Eight 128-bit XMM registers for single-instruction multiple-data (SIMD) operations.

[Figure: FPU registers. 80-bit Data Registers: ST(0) to ST(7). 48-bit Pointer Registers: FPU Instruction Pointer, FPU Data Pointer. 16-bit Control Registers: Tag Register, Control Register, Status Register. Opcode Register.]

¨ Programmer’s Model


¨ IA-32 addressing Modes


¨ IA-32 Memory Management
  ¤ Real-address mode
    ▪ 1 MB RAM maximum addressable (20-bit address)
    ▪ Application programs can access any area of memory
    ▪ Single tasking
    ▪ Supported by MS-DOS operating system

Segmented memory addressing: the absolute (linear) address is the 16-bit segment value multiplied by 10h, added to a 16-bit offset

[Figure: linear addresses from 00000 to F0000 in steps of 10000; one addressable segment (64K) runs from 8000:0000 to 8000:FFFF; example address 8000:0250 with seg = 8000 and ofs = 0250]
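The segment:offset arithmetic can be written out directly. A minimal sketch, assuming the original 8086 behaviour in which the result wraps to 20 bits (later processors can carry into bit 20); `real_mode_linear` is an illustrative name.

```python
def real_mode_linear(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    segment &= 0xFFFF
    offset &= 0xFFFF
    # Segment is shifted left by 4 bits (multiplied by 10h), then the
    # offset is added; the mask models the 20-bit address bus of the 8086.
    return ((segment << 4) + offset) & 0xFFFFF
```

For the address in the figure, `real_mode_linear(0x8000, 0x0250)` gives 80250h, and the segment starting at 8000:0000 ends at 8FFFFh.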

¨ IA-32 Memory Management
  ¤ Protected mode
    ▪ 4 GB addressable RAM (32-bit address)
    ▪ (00000000 to FFFFFFFFh)
    ▪ Each program assigned a memory partition which is protected from other programs
    ▪ Designed for multitasking
    ▪ Supported by Linux & MS-Windows
    ▪ Segment descriptor tables
    ▪ Program structure
      ▪ code, data, and stack areas
      ▪ CS, DS, SS segment descriptors
      ▪ global descriptor table (GDT)
    ▪ MASM Programs use the Microsoft flat memory model.

[Figure: Flat segmentation model and Multi-segment model. A Local Descriptor Table with base / limit / access columns holds the entries 00026000 / 0010, 00008000 / 000A and 00003000 / 0002, pointing to segments at 26000, 8000 and 3000 in RAM; limits are multiplied by 1000h.]

¨ IA-32 Memory Management
  ¤ Translating Addresses
    ▪ The IA-32 processor uses a one- or two-step process to convert a variable's logical address into a unique memory location.
    ▪ The first step combines a segment value with a variable’s offset to create a linear address.
    ▪ The second optional step, called page translation, converts a linear address to a physical address.

[Figure: a logical address (Selector, Offset) selects a Segment Descriptor in the descriptor table; the descriptor's base is added to the offset to give the linear address. GDTR/LDTR contains the base address of the descriptor table.]

¨ IA-32 Memory Management
  ¤ Indexing into a Descriptor Table
    ▪ Each segment selector (the value held in a segment register) indexes into the program's local descriptor table (LDT). Each table entry is mapped to a linear address.

[Figure: logical addresses SS:ESP = 0018:0000003A, DS:offset = 0010:000001B6 and CS:IP = 0008:00002CD3 index the Local Descriptor Table, located through the LDTR register. Table entries (index, base): 18 → 001A0000, 10 → 0002A000, 08 → 0001A000, 00 → 00003000. The bases point into the linear address space (DRAM); part of it is unused.]
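The first translation step can be illustrated with the values from the figure. This is a deliberately simplified sketch: it models the descriptor table as a dictionary from index to base address and ignores segment limits, access rights and the privilege/table-indicator bits that real selectors carry. `LDT` and `logical_to_linear` are illustrative names.

```python
# Descriptor table from the figure: index -> segment base address.
LDT = {
    0x00: 0x00003000,
    0x08: 0x0001A000,
    0x10: 0x0002A000,
    0x18: 0x001A0000,
}


def logical_to_linear(selector, offset, table=LDT):
    """Add the selected segment's base to the offset (32-bit result)."""
    base = table[selector]  # KeyError models an invalid selector
    return (base + offset) & 0xFFFFFFFF
```

Using the figure's registers, DS:000001B6 with DS = 0010 maps to linear address 0002A1B6, and SS:ESP = 0018:0000003A maps to 001A003A.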

¨ IA-32 Memory Management
  ¤ Paging
    ▪ Virtual memory uses disk as part of the memory, thus allowing the total size of all programs to be larger than physical memory. Only part of a program must be kept in memory, while the remaining parts are kept on disk.
    ▪ The memory used by the program is divided into small units called pages (4096 bytes).
    ▪ As the program runs, the processor selectively unloads inactive pages from memory and loads other pages that are immediately required.
    ▪ OS maintains page directory and page tables
    ▪ Page translation: CPU converts the linear address into a physical address
    ▪ Page fault: occurs when a needed page is not in memory, and the CPU interrupts the program
    ▪ Virtual memory manager (VMM) – OS utility that manages the loading and unloading of pages
    ▪ OS copies the page into memory, program resumes execution

[Figure: the 32-bit linear address is split into Directory (10 bits), Table (10 bits) and Offset (12 bits). CR3 locates the Page Directory; the Directory Entry locates a Page Table; the Page-Table Entry locates the Page Frame; the Offset selects the byte, giving the Physical Address.]
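The 10/10/12 split of the linear address and the two table lookups can be sketched as follows. The page directory is modelled as nested dictionaries rather than real 32-bit table entries, and a missing entry raises an exception standing in for the page fault; `split_linear_address`, `translate` and `PageFault` are illustrative names.

```python
class PageFault(Exception):
    """Raised when the needed page is not present in memory."""


def split_linear_address(linear):
    """Return (directory index, table index, byte offset) for 4 KB pages."""
    linear &= 0xFFFFFFFF
    return (linear >> 22) & 0x3FF, (linear >> 12) & 0x3FF, linear & 0xFFF


def translate(linear, page_directory):
    """Walk directory -> page table -> frame and add the offset."""
    directory, table, offset = split_linear_address(linear)
    page_table = page_directory.get(directory)
    if page_table is None:
        raise PageFault(hex(linear))
    frame = page_table.get(table)  # physical frame number
    if frame is None:
        raise PageFault(hex(linear))
    return (frame << 12) | offset
```

For example, linear address 00403025h splits into directory 1, table 3 and offset 025h; with that page mapped to frame 12345h the physical address is 12345025h. In a real system the handler for the fault would load the page and restart the access.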

¨ Interrupt Handling

¨ Processor generates interrupts that index into an Interrupt Descriptor Table (IDT), whose base is stored in IDTR and loaded using the privileged instruction LIDT.

¨ The descriptors in the IDT can be
  ¤ Interrupt gate: ISR handled as a normal subroutine call; uses the interrupted processor stack to save EIP, CS (and SS, ESP in case of a stack switch, with the new stack taken from the TSS).
  ¤ Task gate: ISR handled as a task switch
    ▪ Needed for stack fault in CPL = 0 and double faults.

Intel® Core® Micro-architecture Blocks

[Figure: Intel Core micro-architecture block diagram. Fetch / Decode: Next IP, 32 KB Instruction Cache, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT). Execute: Reservation Stations (RS, 32 entry) with Scheduler / Dispatch Ports feeding integer arithmetic, shift/rotate, FP add, FP div/mul and SIMD units, plus Load, Store Addr and Store Data ports, Memory Order Buffer (MOB) and 32 KB Data Cache. Retire: Re-Order Buffer (ROB, 96 entry) and IA Register Set. Bus Unit to L2 Cache.]

Intel® Core® Micro-architecture Blocks

¨ Intel® Wide Dynamic Execution
  ¤ 14-stage efficient pipeline
    ▪ Wider decoding capacity
    ▪ Advanced branch prediction
    ▪ Wider execution path
  ¤ 64-Bit Support
    ▪ Merom, Conroe, and Woodcrest support EM64T
¨ Intel® Advanced Smart Cache
  ¤ Multi-core optimization
    ▪ Shared between the two cores
    ▪ Advanced Transfer Cache architecture
    ▪ Reduced bus traffic
    ▪ Both cores have full access to the entire cache
    ▪ Dynamic Cache sizing
  ¤ Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache

Execution Unit Overview
• Execute 6 operations/cycle
  • 3 Memory Operations
    • 1 Load
    • 1 Store Address
    • 1 Store Data
  • 3 “Computational” Operations
• Unified Reservation Station
  • Schedules operations to Execution units
  • Single Scheduler for all Execution Units
  • Can be used by all integer, all FP, etc.

[Figure: Unified Reservation Station dispatching to Ports 0-5. Units shown: Integer ALU & Shift, Integer ALU & LEA, Load, Store Address, Store Data, FP Multiply, FP Add, Branch, Divide, Complex Integer, FP Shuffle, SSE Integer ALU, SSE Integer Multiply, Integer Shuffles.]

Intel® Core® Micro-architecture Blocks

¨ Instruction Decode
  ¤ Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

Intel® Core® Micro-architecture Blocks

¨ Intel® Advanced Digital Media Boost
  ¤ Single Cycle SSE
    ▪ 8 Single Precision Flops/cycle
    ▪ 4 Double Precision Flops/cycle
  ¤ Wide Operations
    ▪ 128-bit packed Add
    ▪ 128-bit packed Multiply
    ▪ 128-bit packed Load
    ▪ 128-bit packed Store
  ¤ Support for Intel® EM64T instructions

[Figure: SSE Operation (SSE/SSE2/SSE3). A 128-bit SOURCE (X4 X3 X2 X1) and DEST (Y4 Y3 Y2 Y1) are combined element by element by the SSE/2/3 OP. Core™ µarch: X4opY4, X3opY3, X2opY2, X1opY1 all complete in clock cycle 1. Previous: X2opY2 and X1opY1 in clock cycle 1, X4opY4 and X3opY3 in clock cycle 2.]

Intel® Core® Micro-architecture Blocks

¨ Hyperthreading
  ¤ Ability of processor to run multiple threads
    ▪ Duplicate architecture state creates illusion to SW of Dual Processor (DP).
    ▪ Execution unit shared between two threads, but dedicated if one stalls.
  ¤ Almost two Logical Processors.
  ¤ Architecture state (registers) and APIC duplicated.
  ¤ Share execution units, caches, branch prediction, control logic and buses.

[Figure: two Architecture State blocks, each with its own Adv. Programmable Interrupt Control, share the Processor Execution Resource and On-Die Caches, connected to the System Bus]

Intel® Core® Micro-architecture Blocks

¨ Power Efficient Support
  ¤ Advanced power gating & Dynamic power coordination
    ▪ Multi-point demand-based switching
    ▪ Voltage-Frequency switching separation
    ▪ Supports transitions to deeper sleep modes
    ▪ Event blocking
    ▪ Clock partitioning and recovery
    ▪ Dynamic Bus Parking
    ▪ During periods of high performance execution, many parts of the chip core can be shut off

[Figure: four Cores, each with its own Vcc, Freq. PLL and Sensors, coordinated by the PCU; the Uncore and LLC have a separate Vcc; BCLK drives the PLL]

X86-64 Architecture

¨ Full support for 64-bit integers
  ¤ All general-purpose registers are expanded from 32 bits to 64 bits
  ¤ All arithmetic and logical operations, memory-to-register, and register-to-memory operations are now directly supported for 64-bit integers
  ¤ Pushes and pops on the stack are always in eight-byte strides, and pointers are eight bytes wide

¨ Additional registers
  ¤ The number of named registers is increased from 8 (i.e. eax, ebx, ecx, edx, ebp, esp, esi, edi) to 16.
  ¤ Compilers can keep more local variables in registers rather than on the stack.
  ¤ Can use registers for frequently accessed constants.
  ¤ Arguments for small and fast subroutines may also be passed in registers to a greater extent.

X86-64 Architecture

¨ Larger virtual address space

¤ Current models can address up to 256 terabytes

¤ Expandable in the future to 16 exabytes

¤ Compared to just 4 gigabytes for 32-bit x86

¨ Larger physical address space

¤ Current models can address up to 1 terabyte

¤ Expandable in the future to 4 petabytes
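The sizes quoted for the address spaces follow directly from the number of address bits. A small sketch of the arithmetic, using binary units; the function name `address_space` and the bit widths passed to it (48, 64, 32, 40, 52) are inferred from the quoted sizes rather than stated on the slide.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def address_space(bits):
    """Return the size of a `bits`-wide address space as a readable string."""
    size = 1 << bits  # number of addressable bytes
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"
```

With these widths, 48 bits gives 256 TiB and 64 bits gives 16 EiB of virtual address space, against 4 GiB for 32 bits; 40 bits gives 1 TiB and 52 bits gives 4 PiB of physical address space.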


UltraSparc (RISC)

¨ Sun (ORACLE)
¨ Sparc = Scalable Processor Architecture. Open processor architecture
¨ SUN UltraSparc v9:
  ¤ RISC Architecture big-endian.
  ¤ 64 bit address and data.
  ¤ Memory Management Unit (MMU).
  ¤ Superscalar.
  ¤ OpenSparc (open-source)
  ¤ LEON (soft-core). Space rated. VHDL

Timeline:
  Begin developing Sparc – 1984
  First Sparc Processor – 1986
  SuperSparc – 1992
  UltraSparc I – 1995
  UltraSparc II – 1997
  UltraSparc III – 2001
  UltraSparc IV – 2004
  UltraSparc IV+ – 2005
  UltraSparc T1 – 2005
  UltraSparc T2 – 2007
  Sparc T3 – 2010
  Sparc T4 – 2011
  Sparc T5 – 2013

UltraSparc (RISC)

¨ Registers
  ¤ ~160 general-purpose registers
  ¤ Any procedure can access only 32 registers (r0~r31)
    ▪ First 8 registers (r0~r7) are global, i.e. they can be accessed by all procedures on the system (r0 is zero)
    ▪ Other 24 registers can be visualized as a window through which part of the register file can be seen
  ¤ Program counter (PC)
    ▪ The address of the next instruction to be executed
  ¤ Condition code registers
  ¤ Other control registers

¨ Data Formats
  ¤ Integers are 8-, 16-, 32-, 64-bit binary numbers
  ¤ 2’s complement is used for negative values
  ¤ Support both big-endian and little-endian byte orderings
    ▪ (big-endian means the most significant part of a numeric value is stored at the lowest-numbered address)
  ¤ Three different floating-point data formats
    ▪ Single-precision, 32 bits long (23 + 8 + 1)
    ▪ Double-precision, 64 bits long (52 + 11 + 1)
    ▪ Quad-precision, 128 bits long (112 + 15 + 1)

UltraSparc (RISC)

¨ Instruction Set
  ¤ <150 instructions
  ¤ Pipelined execution
    ▪ While one instruction is being executed, the next one is fetched from memory and decoded
  ¤ Delayed branches
    ▪ The instruction immediately following the branch instruction is actually executed before the branch is taken
  ¤ Special-purpose instructions
    ▪ High-bandwidth block load and store operations
    ▪ Special “atomic” instructions to support multi-processor system

¨ Addressing Modes
  ¤ Immediate mode
  ¤ Register direct mode
  ¤ Memory addressing

    Mode                                   Target address calculation
    PC-relative*                           TA = (PC) + displacement {30 bits, signed}
    Register indirect with displacement    TA = (register) + displacement {13 bits, signed}
    Register indirect indexed              TA = (register-1) + (register-2)

    *PC-relative is used only for branch instructions

¨ Input and Output
  ¤ A range of memory locations is logically replaced by device registers
  ¤ Each I/O device has a unique address, or set of addresses
  ¤ No special I/O instructions are needed
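The two register-based target address calculations from the addressing-mode table can be sketched as below. The key detail is that the 13-bit displacement is signed, so it must be sign-extended before the addition. All function names are illustrative, and the 64-bit wrap-around is an assumption based on the 64-bit addresses of the v9 architecture.

```python
MASK64 = (1 << 64) - 1  # addresses are assumed to be 64 bits wide


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def register_indirect_displacement(register, disp13):
    """TA = (register) + displacement, with a 13-bit signed displacement."""
    return (register + sign_extend(disp13, 13)) & MASK64


def register_indirect_indexed(register1, register2):
    """TA = (register-1) + (register-2)."""
    return (register1 + register2) & MASK64
```

A displacement field of 1FFFh therefore means -1, so a base register of 1000h yields target address 0FFFh rather than 2FFFh.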

UltraSparc T2 (RISC)

¨ Codename Niagara2
¨ Member of SPARC family
¨ 2 previous multi-core processors
  ¤ UltraSPARC IV
  ¤ UltraSPARC IV+
¨ UltraSPARC T1 (first multi-core and multi-threaded)
  ¤ Released 14 November 2005
  ¤ 4, 6, or 8 cores with 4 threads each
¨ UltraSPARC T2 Released 7 August 2007
  ¤ Now 8 threads per core (instead of 4)
¨ Multi-threaded(8), multi-core(8) CPU
¨ Frequency ranges from 900MHz to 1.4GHz
¨ Powered by less than 95 watts (nominal) with less than 2 watts per thread
¨ Integrated
  ¤ 10 Gb Ethernet networking
  ¤ PCI Express I/O expansion
  ¤ FPU and cryptographic processing units per core

UltraSparc T2 (RISC)

¨ 8 Fully pipelined FPUs

¨ 8 SPUs

¨ 2 integer ALUs per core, each one shared by a group of four threads

¨ 4MB L2 Cache (8-banks, 16- way associative)

¨ 8 KB data cache and 16 KB instruction cache

¨ Two 10Gb Ethernet ports and one PCIe port

UltraSparc T2. Core Architecture

UltraSparc T2 Pipeline

¨ Eight-stage integer pipeline

Fetch Cache Pick Decode Execute Mem Bypass W

  ¤ Pick is for selecting 2 threads for execution (Added this stage for T2)
  ¤ In the bypass stage, the load/store unit (LSU) forwards data to the integer register files (IRFs) with sufficient write timing margin. All integer operations pass through the bypass stage.

¨ 12-stage floating point pipeline

Fetch Cache Pick Decode Execute Fx1 . . . Fx5 FB FW

  ¤ 6-cycle latency for dependent FP ops
  ¤ Integer multiplies are pipelined between different threads. Integer multiplies block within the same thread.
  ¤ Integer divide is a long latency operation. Integer divides are not pipelined between different threads.

MIPS (ARM) vs x86

                      MIPS (ARM)            x86
  Address:            32/64-bit             32/64-bit
  Page size:          4KB                   4KB
  Data:               aligned               unaligned
  Destination reg:    Left                  Right
                      add $rd,$rs1,$rs2     add %rs1,%rs2,%rd
  Regs:               $0, $1, ..., $31      %r0, %r1, ..., %r7
  Reg = 0:            $0                    (n.a.)
  Return address:     $31                   (n.a.)

MIPS: “Three-address architecture”
• Arithmetic-logic specify all 3 operands
    add $s0,$s1,$s2 # s0=s1+s2
  Benefit: fewer instructions ⇒ performance

x86: “Two-address architecture”
• Only 2 operands, so the destination is also one of the sources
    add $s1,$s0 # s0=s0+s1
  Often true in C statements: c += b;
  Benefit: smaller instructions ⇒ smaller code

MIPS (ARM) vs x86

MIPS: “load-store architecture”
• Only Load/Store access memory; rest operations register-register; e.g.,
    lw $t0, 12($gp)
    add $s0,$s0,$t0 # s0=s0+Mem[12+gp]
  Benefit: simpler hardware ⇒ easier to pipeline, higher performance

x86: “register-memory architecture”
• All operations can have an operand in memory; other operand is a register; e.g.,
    add 12(%gp),%s0 # s0=s0+Mem[12+gp]
  Benefit: fewer instructions ⇒ smaller code

MIPS: “fixed-length instructions”
• All instructions same size, e.g., 4 bytes
• Simple hardware ⇒ performance
• Branches can be multiples of 4 bytes

x86: “variable-length instructions”
• Instructions are multiple of bytes: 1 to 15; ⇒ small code size (30% smaller?)
• More Recent Performance Benefit: better instruction cache hit rates
• Instructions can include 8- or 32-bit immediates