ARM Cortex-A Family (V7-A): General Purpose Processors - Applications Processors for Full OS and 3rd-Party Applications

ARM architecture
Computer architecture M

History
• Acorn Computers: an English company, a Cambridge (UK) spin-off, which had developed an 8-bit microcomputer for the BBC based on the 6502 architecture (Synertek and Rockwell)
• In 1982 Acorn engineers looked for a new microprocessor for more sophisticated applications, but decided against the CISC solutions because they were too slow for the specific requirements, interrupt latency in particular
• They decided to design a totally new architecture. At the same time the Berkeley RISC I and II and the Stanford MIPS (Microprocessor without Interlocked Pipeline Stages) designs appeared, and they decided to follow that philosophy
• ARM (Advanced RISC Machine), whose three-stage pipeline is still in use today
• ARM has been a true industry since 1990 and a «brand» with multiple implementations; it is used by many processor companies (Intel too) in multiple environments, in tailored versions (Intellectual Property, IP, cores)
• The design can be bought as Verilog source: a soft core

ARM (core name suffixes)
• T: Thumb
• D: On-chip debug support
• M: Enhanced multiplier
• I: Embedded ICE hardware
• T2: Thumb-2
• S: Synthesizable code
• E: Enhanced DSP instruction set
• J: Java support
• Z: TrustZone
• F: Floating point unit
• H: Handshake, clockless design for synchronous or asynchronous design

ARM: base concepts
• ARM is a family of RISC processors conceptually similar to DLX
• There are several versions, from very simple to very sophisticated
• Multiple environments (e.g. mobile phones). Examples: Apple iPod Photo and iPod Video 5th gen (2x, @80 MHz), Roomba 500, Lego Mindstorms

ARM: first version
• LOAD/STORE architecture, very simple since the designers had no previous full-custom experience
• 32-bit fixed-length instructions
• Three-address instructions of RISC type (with some CISC-type exceptions)
• Fixed register bank.
  Obviously, in addition to the programmer-visible registers, there are the machine (internal) registers
• Single-cycle instructions where possible, but some take multiple cycles because no Harvard architecture is implemented (one memory serves both instructions and data). When more than a single memory access is required (e.g. a LOAD), the extra cycles are used for useful micro-operations (e.g. auto-indexing the address)

Three-stage ARM
• 16 32-bit general-purpose registers (r0 - r15)
• Three-port register bank (two ports for reading and one for writing), plus an additional port for reading and writing register 15 (PC)
• N-position barrel shifter
• 32-bit ALU
• The address register is provided with an incrementer (for sequential accesses); in practice it is a programmable counter
• Two buffer registers for data to and from the memory (invisible to the programmer). Single-bank memory
• Instruction decoder and control logic
• Status register (CPSR)
• Two interrupts: fast and standard

ARM register set

  User mode    fiq mode    svc mode    abort mode   irq mode    undef. mode
  r0-r7
  r8           r8_fiq
  r9           r9_fiq
  r10          r10_fiq
  r11          r11_fiq
  r12          r12_fiq
  r13 (MSP)    r13_fiq     r13_svc     r13_abt      r13_irq     r13_und
  r14 (LR)     r14_fiq     r14_svc     r14_abt      r14_irq     r14_und
  r15 (PC)
  CPSR         SPSR_fiq    SPSR_svc    SPSR_abt     SPSR_irq    SPSR_und

• r0-r7 are common to user and system modes
• The mode-specific registers (suffixes _fiq, _svc, _abt, _irq, _und) and the SPSRs are available in system modes only
• fiq: fast interrupt; irq: standard interrupt; svc: software interrupt; abt: memory faults (abort); und: undefined instructions
• CPSR: Current Program Status Register
• SPSR: Saved Program Status Register
• MSP: Master Stack Pointer
• LR: Link Register (return register for subroutines)

Current Program Status Register, CPSR (similar to a flag register)

  Bits     31   30   29   28   27..8    7    6    5    4..0
  Field    N    Z    C    V    unused   I    F    T    mode

• Condition codes: N negative, Z zero, C carry, V overflow
• I, F: interrupt masks
• T: Thumb instruction set

  CPSR[4:0]   Mode     Use                                          Register set
  10000       User     Normal user code                             user
  10001       FIQ      Processing fast interrupts                   fiq
  10010       IRQ      Processing standard interrupts               irq
  10011       SVC      Processing software interrupts               svc
  10111       Abort    Processing memory faults                     abt
  11011       Undef    Handling undefined-instruction traps         und
  11111       System   Running privileged operating system tasks    user

Each privileged exception mode has its own Saved Program Status Register (SPSR), where the current CPSR is saved, and its own r14 (Link Register).
NB Thumb instruction set: compactly encoded (16-bit) instructions that save memory

Exceptions
When an exception occurs:
1) The corresponding mode is activated
2) The PC (r15) is saved in r14 (link register) of the new mode
3) The old CPSR is saved in the SPSR of the new mode
4) IRQ is disabled by setting bit 7 of the CPSR; if the exception is a fast interrupt, bit 6 of the CPSR is set as well
5) The PC takes the value given in the following table (fixed addresses)

  Exception                                 Mode    Address
  Reset                                     SVC     0x00000000
  Undefined instruction                     UND     0x00000004
  Software interrupt (SWI)                  SVC     0x00000008
  Prefetch abort (instruction mem. fault)   Abort   0x0000000C
  Data abort (data mem. fault)              Abort   0x00000010
  IRQ                                       IRQ     0x00000018
  FIQ                                       FIQ     0x0000001C

Address 0x00000014 cannot be used (compatibility with old ARMs)

ARM three-stage pipeline
[Figure: datapath of the three-stage ARM; register bank ports labelled Write, Read, Read]

Organisation
• Register bank (= Register File, RF): two read ports and one write port, as in DLX, for the normal data traffic, plus two accesses (read and write) for r15 (PC)
• The memory address register has an incrementer for sequential accesses, which is also used for incrementing the PC
• One extra register for 32-bit multiplication (when multiplying 32-bit data the result can be longer than 32 bits)
• Two transit registers for the memory (data-in and data-out; no Harvard architecture initially)
• No forwarding unit: it is not needed because, with a three-stage pipeline, the following instruction finds the updated value already in the RF (see next slide)

Three-stage pipeline: single-cycle instructions

  1  add r3,r1,r5    fetch    decode   exec. add
  2  sub r2,r3,r6             fetch    decode     exec. sub
  3  cmp r2,#3                         fetch      decode     exec. cmp
                                                                        time →

• An instruction with a single execution clock reads two operands during the execute stage; the datum on bus B is shifted (if required) and combined in the ALU with the datum on bus A. The result is written back to the register bank. The PC is incremented by the incrementer and the result is stored back in r15 and also in the address register for the next instruction access
• Fetch stage: the instruction is read from the memory into the data-in register for decoding
• Decode stage: the instruction in the data-in register is decoded (the datapath is not used); in the meantime the next instruction is read from the memory and is «clocked» (latched) at the end of its fetch stage
• Exec stage: the instruction uses the datapath. In an arithmetic instruction two operands are read; the one on bus B is shifted (combinatorially) if needed and combined with the datum on bus A. The result is written back to the RF in the same clock period
NB An instruction in its exec stage finds the datum it needs already available in the RF. No forwarding unit!
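The five exception-entry steps and the vector table from the Exceptions slide can be sketched in a few lines of Python. This is only an illustrative model, not how any real core or simulator is written: the `cpu` dictionary layout and the names `enter_exception`, `VECTORS` and `MODE_BITS` are hypothetical, and only the actions listed on the slide are modelled.

```python
# Hypothetical model of exception entry, following the slide's steps 1-5.
I_BIT = 1 << 7          # CPSR bit 7: IRQ mask
F_BIT = 1 << 6          # CPSR bit 6: FIQ mask
MODE_MASK = 0b11111     # CPSR[4:0]: mode field

MODE_BITS = {"fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
             "abt": 0b10111, "und": 0b11011}

# Exception kind -> (mode entered, fixed vector address), as in the table.
VECTORS = {
    "reset": ("svc", 0x00),
    "undef": ("und", 0x04),
    "swi": ("svc", 0x08),
    "prefetch_abort": ("abt", 0x0C),
    "data_abort": ("abt", 0x10),
    "irq": ("irq", 0x18),
    "fiq": ("fiq", 0x1C),
}


def enter_exception(cpu, kind):
    """Apply the entry sequence to a dict-based CPU state (an assumed layout)."""
    mode, vector = VECTORS[kind]
    old_cpsr = cpu["cpsr"]
    cpu["r14"][mode] = cpu["pc"]        # 2) PC saved in r14 of the new mode
    cpu["spsr"][mode] = old_cpsr        # 3) old CPSR saved in SPSR of the new mode
    new_cpsr = (old_cpsr & ~MODE_MASK) | MODE_BITS[mode]   # 1) switch mode
    new_cpsr |= I_BIT                   # 4) IRQ disabled
    if kind == "fiq":
        new_cpsr |= F_BIT               #    FIQ disabled too for a fast interrupt
    cpu["cpsr"] = new_cpsr
    cpu["pc"] = vector                  # 5) PC takes the fixed vector address
    return cpu


# Usage: user mode with the Z flag set, interrupted by a fast interrupt.
cpu = {"pc": 0x8008, "cpsr": 0x40000010, "r14": {}, "spsr": {}}
enter_exception(cpu, "fiq")
```

The condition flags are left untouched by the sketch: only the mode field and the two mask bits change, so the handler can later restore the interrupted state from the saved SPSR and r14.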
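The single-cycle timing above (add, sub, cmp) can be reproduced with a small Python sketch. It assumes an ideal pipeline with one instruction fetched per clock and the result written to the register file at the end of the execute cycle; the tuple encoding of the instructions and the function name `run` are hypothetical, chosen only for this example.

```python
# Hypothetical sketch: three single-cycle instructions in a three-stage pipeline.
MASK = 0xFFFFFFFF

# (operation, destination, first operand, second operand or immediate)
PROGRAM = [
    ("add", "r3", "r1", "r5"),
    ("sub", "r2", "r3", "r6"),
    ("cmp", None, "r2", 3),
]


def run(program, regs):
    """Return (trace, flags); trace holds (op, fetch, decode, execute) cycles."""
    trace = []
    flags = {"Z": False}
    for index, (op, rd, rn, operand2) in enumerate(program):
        fetch = index + 1                 # one instruction fetched per clock
        decode, execute = fetch + 1, fetch + 2
        # Operands are read straight from the register file: the previous
        # instruction executed (and wrote back) one clock earlier, so no
        # forwarding path is modelled.
        a = regs[rn]
        b = operand2 if isinstance(operand2, int) else regs[operand2]
        if op == "add":
            regs[rd] = (a + b) & MASK
        elif op == "sub":
            regs[rd] = (a - b) & MASK
        elif op == "cmp":
            flags["Z"] = (a == b)
        else:
            raise ValueError(f"unsupported operation: {op}")
        trace.append((op, fetch, decode, execute))
    return trace, flags


regs = {"r1": 10, "r2": 0, "r3": 0, "r5": 5, "r6": 12}
trace, flags = run(PROGRAM, regs)
```

With these example register values the sub reads the r3 value produced by the add one clock earlier, and the cmp reads the r2 value produced by the sub, which is the situation the NB above describes.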
Three-stage pipeline: multiple-cycle instructions

  1  ADD   fetch   decode   execute
  2  STR           fetch    decode    calc. addr.   data xfer
  3  ADD                    fetch     bubble        decode      execute
  4  ADD                              fetch         bubble      decode    execute
  5  ADD                                                        fetch     decode    execute
                                                                                            time →

Notes on the figure:
• While the STR computes its address, the following ADD cannot be decoded, because the register outputs towards the ALU cannot be opened (bubble)
• While the STR transfers its data, the memory is busy with the store, so no instruction is fetched; the decoder is busy as well, so instruction 4 waits (bubble)

• Multi-cycle instructions are executed more irregularly. In this example an ADD is followed by a STORE and three ADDs
• The greyed stages (in the original figure) are those where the memory is accessed
• The datapath is used by the STORE for the address computation
• Since the PC (r15) is incremented in the first stage, the programmer must be aware that it has already been incremented twice (two instructions, 8 bytes) when it is used in the exec stage

Three-stage pipeline: multiple-cycle instructions
«Register based» load with autoincrement (stages of each instruction in time order):

  ldmia r0!,{r2,r3}    fetch, decode, ex ld r2, ex ld r3
  sub r2,r3,r6         fetch, decode, ex sub
  cmp r2,#3            fetch, decode, ex cmp

ldmia: load multiple registers, incrementing the address
• This instruction loads two registers (here r2 and r3) with data starting from the address in r0. No address computation is needed (the value is already present in r0). The address is incremented by 4 at each load (incrementer)

Branch
[Figure: pipeline timing of bne foo (fetch, decode, execute, linkret, adjust), followed by sub r2,r3,r6 and add r13,r14,r2, and by the branch target foo: add r0,r1,r2 (fetch, decode, ex add)]
• The decision is taken on the third clock
• The branch can be with return: the PC value is saved in the linkret stage. The adjust stage corrects the saved value, which had already been incremented by 8

Register/Register instructions: datapath
[Figure: datapath with address register, incrementer, registers (Rd, Rn, Rm, PC/r15), multiplier, barrel shifter, data out, data in, instr. pipe]
  Reg-Reg instruction:
    Rd <= Rn op Rm
    R15 (PC) <= AR + 4
    AR <= AR + 4
• AR: Address Register
• The operation is as per instruction
• The PC value is incremented by 4 at each instruction, and the same incremented value is stored in the AR

Register/Immediate instructions: datapath
[Figure: datapath with address register, incrementer, registers (Rd, Rn, PC/r15), multiplier, data out, data in, instr. pipe]
  Reg-Imm instruction:
    Rd <= Rn op Imm
    R15 (PC) <= AR + 4
    AR <= AR + 4
• In this case the operand is in the instruction (bits [7:0]); the operation is as per instruction

Store instruction: datapath
[Figure: two datapath panels, (a) and (b), with address register, increment, registers, shifter (lsl #0), ALU functions (= A + B / A - B; = A / A + B / A - B), instruction field [11:0], byte?, data out, data in, i. pipe]
  (a) Compute address:
    AR <= Rn op Disp
    R15 (PC) <= AR + 4
  (b) Store data:
    mem[AR] <= Rd
    AR <= R15 (PC)
    If autoindexing: Rn <= Rn +/- 4
(a) 1st cycle: the STORE address is computed and stored in the AR. In the meantime r15 (PC) is incremented and the value is stored ONLY in the RF, for the next instruction
(b) 2nd cycle: r15 (PC) is copied into the AR while the datum is written into the memory and an autoincrement (if required) is executed.
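The two store cycles can be written out as register transfers in a short Python sketch. It is a reading of the slide rather than a description of the real hardware: the `state` dictionary and the name `store_word` are hypothetical, the displacement is taken as a signed number, and the sketch assumes that both transfers of the first cycle read the AR value held at the start of that cycle.

```python
# Hypothetical sketch of the two-cycle STORE as register transfers.
MASK = 0xFFFFFFFF


def store_word(state, rn, rd, disp, autoindex_step=0):
    """Model a word store; autoindex_step is 0 (none), +4 or -4 as on the slide."""
    regs, mem = state["regs"], state["mem"]

    # 1st cycle: AR <= Rn op Disp, and R15 (PC) <= AR + 4 using the old AR.
    # The incremented PC goes only to the register file in this cycle.
    old_ar = state["ar"]
    state["ar"] = (regs[rn] + disp) & MASK
    regs["r15"] = (old_ar + 4) & MASK

    # 2nd cycle: mem[AR] <= Rd, then AR <= R15 (PC) for the next fetch.
    mem[state["ar"]] = regs[rd]
    state["ar"] = regs["r15"]
    if autoindex_step:
        regs[rn] = (regs[rn] + autoindex_step) & MASK   # Rn <= Rn +/- 4
    return state


# Usage: store r2 at address r1 + 4, with autoindexing of r1.
state = {
    "regs": {"r1": 0x1000, "r2": 0xDEADBEEF, "r15": 0x8008},
    "ar": 0x8008,
    "mem": {},
}
store_word(state, "r1", "r2", disp=4, autoindex_step=4)
```

At the end of the second cycle the AR again holds the same value as r15, so the next instruction fetch proceeds as for the single-cycle instructions.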
