Embedded Processor

WWW.ANDESTECH.COM

Agenda

• CPU architecture
• Pipeline
• Cache
• Memory Management Unit (MMU)
• Direct Memory Access (DMA)
• Bus Interface Unit (BIU)
• Examples

CPU Architecture (1/8)

• Data and instructions are stored in memory; the Control Unit takes instructions and controls data manipulation in the ALU.
• Input/Output is needed to make the machine a practicality.
• Execution in multiple cycles.
• Serial fetch of instructions and data.
• Single memory structure:
  - Can get data/program mixed.
  - Data and instructions are the same size.
• Examples of von Neumann machines: ORDVAC (U-Illinois) at Aberdeen Proving Ground, Maryland (completed Nov 1951); the IAS machine at Princeton University (Jan 1952).

CPU Architecture (2/8)

• The ALU manipulates two binary words according to the instruction decoded in the Control Unit. The result is placed in the Accumulator and may be moved into Memory.

CPU Architecture (3/8)

• The von Neumann architecture introduces a problem that was not solved until much later.
• The CPU works on only one instruction at a time; each must be fetched from memory and then executed. During the fetch the CPU is idle, which wastes time; this is made worse by memory technology that is slow compared with the CPU.
• Time is also lost in the CPU during instruction decode.

[Timing diagram: alternating Fetch (F) and Execute (E) phases, with idle time between them and a Decode (D) phase before Execute.]

CPU Architecture (4/8)

• In a computer using the Harvard architecture, the CPU can read an instruction and perform a data memory access at the same time.
• A Harvard architecture computer can thus be faster for a given circuit complexity, because instruction fetches and data accesses do not contend for a single memory pathway.
• Execution in one cycle.
• Parallel fetch of instructions and data.
• More complex hardware:
  - Instructions and data are always separate.
  - Different code/data path widths (e.g. 14-bit instructions, 8-bit data).
• Examples of Harvard machines: Microchip PIC families, Atmel AVR, AndeScore, Atom.

CPU Architecture (5/8)

• Take a closer look at the action part of the CPU, the ALU, and how data circulates around it.
• An ALU is combinational logic only; no data is stored, i.e. there are no registers. This is the original reason for having CPU registers.

CPU Architecture (6/8)

• In the simplest, minimum-hardware solution, one ALU input, say X, is the accumulator A; the other, Y, comes straight off the memory data bus (this requires a temporary register not visible to the programmer).
• The instruction may be ADDA, which means: add to the contents of A the number on the bus (Y) and put the answer in A.


CPU Architecture (7/8)

• It’s a simple step to add more CPU data registers and extend the instructions to include B, C, … as well as A.
• An internal CPU bus structure then becomes a necessity.

CPU Architecture (8/8)

Architectures: CISC vs. RISC (1/2)

• CISC - Complex Instruction Set Computers:
  - von Neumann architecture.
  - Emphasis on hardware.
  - Includes multi-clock complex instructions.
  - Memory-to-memory operations.
  - Sophisticated arithmetic (multiply, divide, trigonometry, etc.).
  - Special instructions are added to optimize performance with particular compilers.

Architectures: CISC vs. RISC (2/2)

• RISC - Reduced Instruction Set Computers:
  - Harvard architecture.
  - A very small set of primitive instructions.
  - Fixed instruction format.
  - Emphasis on software.
  - All instructions execute in one cycle (fast!).
  - Register-to-register operations (except load/store instructions).
  - Pipelined architecture.

Single-, Dual-, Multi-, Many-Cores

• Single-core:
  - Most popular today.
• Dual-core, multi-core, many-core:
  - Forms of multiprocessors on a single chip.
• Small-scale multiprocessors (2-4 cores):
  - Utilize task-level parallelism.
  - Task examples: audio decode, video decode, display control, network packet handling.
• Large-scale multiprocessors (>32 cores):
  - nVidia’s graphics chips: >128 cores.
  - Sun’s server chips: 64 threads.

Pipeline (1/2)

• An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase instruction throughput (the number of instructions that can be executed in a unit of time).
• The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step.
• This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once.
• The term pipeline refers to the fact that each step carries data at the same time (like water), and each step is connected to the next (like the links of a pipe).
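The throughput claim above can be checked with a small calculation. This is a generic illustration of pipelining, not a model of any particular Andes core: with S independent steps and one instruction issued per cycle, N instructions take S + (N - 1) cycles instead of N * S.

```python
# Cycle counts for N instructions on an S-stage machine,
# with and without pipelining.
def cycles(n_instructions, n_stages, pipelined=True):
    if pipelined:
        # Fill the pipeline once (S cycles), then retire one per cycle.
        return n_stages + (n_instructions - 1)
    # Unpipelined: every instruction performs all S steps serially.
    return n_instructions * n_stages

n, s = 1000, 5
speedup = cycles(n, s, pipelined=False) / cycles(n, s)   # approaches S
```

For large N the speedup approaches the number of stages (here, close to 5).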

Pipeline (2/2)

• Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.

8-stage Pipeline

F1 F2 I1 I2 E1 E2 E3 E4

[Pipeline diagram. Stage names and their pipeline labels:]

• F1/F2: Instruction Fetch, first and second (IF1, IF2)
• I1: Instruction Issue and Instruction Decode (ID)
• I2: Register File Read (RF)
• E1: Data Address Generation (AG); MAC1
• E2: EX; Data Access First (DA1); MAC2
• E3: Data Access Second (DA2)
• E4: Instruction Retire and Result Write Back (WB)

Instruction Fetch Stage

• F1 – Instruction Fetch First
  - Instruction Tag/Data Arrays
  - ITLB Address Translation
  - Branch Target Buffer Prediction
• F2 – Instruction Fetch Second
  - Instruction Cache Hit Detection
  - Cache Way Selection
  - Instruction Alignment


Instruction Issue Stage

• I1 – Instruction Issue First / Instruction Decode
  - 32/16-Bit Instruction Decode
  - Return Address Stack Prediction
• I2 – Instruction Issue Second / Register File Access
  - Instruction Issue Logic
  - Register File Access


Execution Stage

• E1 – Instruction Execute First / Address Generation / MAC First
  - Data Access Address Generation
  - Multiply Operation (if MAC is present)
• E2 – Instruction Execute Second / Data Access First / MAC Second / ALU Execute
  - ALU
  - Branch/Jump/Return Resolution
  - Data Tag/Data Arrays
  - DTLB Address Translation
  - Accumulation Operation (if MAC is present)
• E3 – Instruction Execute Third / Data Access Second
  - Data Cache Hit Detection
  - Cache Way Selection
  - Data Alignment


Write Back Stage

• E4 – Instruction Execute Fourth / Write Back
  - Interruption Resolution
  - Instruction Retire
  - Register File Write Back


Branch Prediction Overview

• Why is branch prediction required?
  - A deep pipeline is required for high speed.
  - Branch cost is a combination of the penalty, the frequency of branches, and the frequency that they are taken.
  - Moving branch detection earlier in the pipeline reduces the penalty.
  - Conditional operations reduce the number of branches.
  - Pipelines offer speedup due to parallel execution of instructions.
  - Full speedup is rarely obtained due to hazards (data, control, structural).
  - Control hazards are handled by avoidance and prediction.
  - Prediction can be static, dynamic, local, global, or hybrid.
  - Prediction reduces the number of "taken" branches.
• Why dynamic branch prediction?
  - Static branch prediction uses a fixed prediction algorithm, or compile-time information.
  - Dynamic branch prediction gathers information as the program executes, using the historical behavior of a branch to predict its next execution.

Branch Prediction Overview

• Simple Static Prediction:
  - If 60% of branches are taken, assume all branches are taken; the prediction is then wrong 40% of the time.
  - Once a choice is made, the compiler can order branches appropriately.
  - Very simple, cheap to implement, moderately effective.
• Complex Static Prediction:
  - Assume backward branches are taken (likely loop returns) and forward branches are not (BTFN).
  - The compiler provides hints by selecting different branch instructions or including flags to indicate "likely taken"; this requires predecoding of the branch.

Branch Prediction Overview

• Simple Dynamic Prediction:
  - Records a portion of the jump address, and the most recent action (taken/not taken).
  - A partial address can result in aliasing (the behavior of multiple branches is combined).
  - A single bit of state can lead to worse behavior in degenerate cases.
• Extended Dynamic Prediction:
  - 2-bit state: none, taken, not-taken, both.
  - 2-bit state: taken, multi-taken, not-taken, multi-not-taken.
  - Not fooled by a branch occasionally behaving in the opposite manner.
  - May be associated with the I-cache, with the BTB, or kept as a separate table.
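The extended (2-bit) scheme above can be sketched as a table of saturating counters. This is a generic illustration, not Andes RTL; the 128-entry size and the PC value are chosen only to echo the BTB described on the next slide.

```python
# 2-bit saturating-counter branch predictor.
# Counter states 0..3: 0 = strongly not-taken, 1 = weakly not-taken,
#                      2 = weakly taken,       3 = strongly taken.
class TwoBitPredictor:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = [1] * entries        # start weakly not-taken

    def predict(self, pc):
        # Indexed by low-order PC bits; True means "predict taken".
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times then not taken once: the single
# "not taken" at loop exit does not flip the stored prediction.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
mispredicts = 0
for _ in range(3):                        # run the loop three times
    for taken in outcomes:
        if p.predict(0x4000) != taken:
            mispredicts += 1
        p.update(0x4000, taken)
```

Over three loop executions only the warm-up and the three loop exits mispredict; a 1-bit predictor would also mispredict every first iteration.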

Branch Prediction Unit

• Branch Target Buffer (BTB):
  - The branch target buffer stores the address of the last target location, to avoid recomputing it.
  - Indexed by low-order bits of the jump address; the tag holds the high-order bits.
  - Enables earlier prefetching of the target based on the prediction.
  - 128 entries of 2-bit saturating counters.
  - 128 entries, each with a 32-bit predicted PC and a 26-bit address tag.
• Return Address Stack (RAS):
  - Four entries.
• The BTB and RAS are updated by committing branches/jumps.

BTB Instruction Prediction

• Because BTB predictions are based on the previous PC instead of actual instruction decoding information, the BTB may make the following two mistakes:
  - Wrongly predicting a non-branch/jump instruction as a branch/jump instruction.
  - Wrongly predicting the instruction boundary (32-bit -> 16-bit).
• If these cases are detected, the IFU triggers a BTB instruction misprediction in the I1 stage and restarts the program sequence from the recovered PC. A 2-cycle penalty is introduced here.

BTB instruction misprediction:

  branch        F1  F2  I1
  PC+4              F1  F2  (killed)
  PC+4                  F1  (killed)
  recovered PC              F1  F2  I1

RAS Prediction

• When return instructions are present in the instruction sequence, RAS predictions are performed and the fetch sequence is redirected to the predicted PC.
• Since the RAS prediction is performed in the I1 stage, there is a 2-cycle penalty for return instructions: the sequential fetches in between will not be used.

RAS prediction:

  return   F1  F2  I1
  PC+4         F1  F2  (killed)
  PC+4             F1  (killed)
  target               F1  F2  I1

Branch Misprediction

• In this core, resolution of branch/return instructions is performed by the ALU in the E2 stage and is used by the IFU in the next (F1) stage. In this case, the misprediction penalty is 5 cycles.

Branch misprediction (branch predicted taken, prediction wrong):

  branch      F1  F2  I1  I2  E1  E2
  wrong path      F1  F2  I1  I2  E1  (killed)
  wrong path          F1  F2  I1      (killed)
  wrong path              F1  F2      (killed)
  wrong path                  F1      (killed)
  wrong path                      F1  (killed)
  redirect                            F1  F2  I1  I2

Cache

• Store copies of data at places that can be accessed more quickly than the original location.
  - Speeds up access to frequently used data.
  - At a cost: slows down access to infrequently used data.

Block Diagram

[Block diagram: the Core (with EDM/JTAG attached) contains the MMU, instruction/data caches, Local Memory, and DMA, all connected through the BIU to the AHB (1 or 2) address/command/data buses and the HSMP (1 or 2) address/data ports.]

Memory Hierarchy - Diagram

Cache (1/3)

• Small amount of fast memory.
• Sits between normal main memory and the CPU.
• May be located on the CPU chip or module.
• Size does matter:
  - Cost: more cache is expensive.
  - Speed: more cache is faster (up to a point); checking the cache for data takes time.

Cache (2/3)

Cache (3/3)

Hierarchy List

• Registers
• L1 Cache
• L2 Cache
• Main memory
• Disk cache
• Disk
• Optical
• Tape

Caching in Memory Hierarchy

• Provides the illusion of GB of storage
  - with register access time.

                      Access Time        Size         Cost
Primary memory:
  Registers           1 clock cycle      ~500 bytes   on chip
  Cache               1-2 clock cycles   <10 MB       on chip
  Main memory         1-4 clock cycles   <4 GB        $0.1/MB
Secondary memory:
  Disk                5-50 msec          <200 GB      $0.001/MB

Caching in Memory Hierarchy

• Exploits two hardware characteristics:
  - Smaller memory provides faster access times.
  - Larger memory provides cheaper storage per byte.
• Puts frequently accessed data in small, fast (and expensive) memory.

Cache Read Operation - Flowchart

Cache and CPU (1/2)

[Diagram: the CPU sends addresses and exchanges data with the cache controller; the cache sits between the CPU and main memory, forwarding addresses and data on a miss.]

Cache and CPU (2/2)

Roughly:

  CPU (registers)   1 cycle
  L1 cache          ~1-5 cycles
  L2 cache          ~5-20 cycles
  Main memory       ~40-100 cycles

Some Cache Specs

            L1 cache (I/D)     L2 cache
PS2         16K/8K† 2-way      N/A
GameCube    32K/32K‡ 8-way     256K 2-way unified
XBOX        16K/16K 4-way      128K 8-way unified
PC          ~32-64K            ~128-512K

Multiple Levels of Cache

[Diagram: CPU connected to the L1 cache, which is backed by the L2 cache.]

Cache Data Flow

[Diagram: instruction fetches are served by the I-Cache, with I-Cache refills from external memory; loads and stores are served by the D-Cache, with D-Cache refills and write-backs between the D-Cache and external memory; uncached instruction/data accesses and uncached or write-through writes go from the CPU directly to external memory.]

Cache Operation

• The CPU requests the contents of a memory location.
• The cache is checked for this data.
• If present, get it from the cache (fast).
• If not present, read the required block from main memory into the cache.
• Then deliver the data from the cache to the CPU.
• The cache includes tags to identify which block of main memory is in each cache slot.
• Many main memory locations are mapped onto one cache entry.
• There may be caches for:
  - instructions;
  - data;
  - data + instructions (unified).

Cache/Main Memory Structure

Typical Cache Organization

Direct Mapping (1/2)

• Each block of main memory maps to only one cache line.
  - i.e. if a block is in the cache, it must be in one specific place.
• The address is in two parts:
  - The least significant w bits identify a unique word.
  - The most significant s bits specify one memory block.
• The MSBs are split into a cache line field r and a tag of s-r bits (most significant).

Direct Mapping (2/2)

• Direct mapping

[Diagram: main memory blocks 0-4095 map onto cache blocks 0-127; main memory blocks 0, 128, 256, … all map to cache block 0, blocks 1, 129, 257, … to cache block 1, and so on up to block 4095. Each cache block carries a tag.]

Main memory address: tag (5 bits) | block (7 bits) | word (4 bits)

Direct Mapping Address Structure

Tag (s-r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits

• 24-bit address.
• 2-bit word identifier (4-byte blocks).
• 22-bit block identifier:
  - 8-bit tag (= 22 - 14)
  - 14-bit slot or line
• No two blocks mapping to the same line have the same tag field.
• Check cache contents by finding the line and checking the tag.
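The field split on this slide can be written out directly. A minimal sketch, using the 8/14/2-bit widths above (the example addresses are made up for illustration):

```python
# Split a 24-bit address into direct-mapped cache fields.
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag  = addr >> (WORD_BITS + LINE_BITS)
    return tag, line, word

# Two addresses 2^16 apart land on the same line with different
# tags, so in a direct-mapped cache they evict each other.
t1, l1, w1 = split(0x000004)
t2, l2, w2 = split(0x010004)
```

This conflict between blocks sharing a line is exactly what set-associative mapping (later slides) relaxes.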

Direct Mapping Cache Organization

Direct Mapping Example

Direct Mapping Summary

• Address length = (s + w) bits.
• Number of addressable units = 2^(s+w) words or bytes.
• Block size = line size = 2^w words or bytes.
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s.
• Number of lines in cache = m = 2^r.
• Size of tag = (s - r) bits.

Associative Mapping (1/2)

• A main memory block can load into any line of the cache.
• The memory address is interpreted as a tag and a word.
• The tag uniquely identifies a block of memory.
• Every line’s tag is examined for a match.
• Cache searching gets expensive.

Associative Mapping (2/2)

Fully associative mapping

[Diagram: any main memory block (0-4095) can be placed in any cache block (0-127); each cache block carries a tag.]

Main memory address: tag (12 bits) | word (4 bits)

Fully Associative Cache Organization

Associative Mapping Address Structure

Tag: 22 bits | Word: 2 bits

• A 22-bit tag is stored with each 32-bit block of data.
• Compare the address's tag field with the tag entries in the cache to check for a hit.
• e.g.:
  Address   Tag      Data       Cache line
  FFFFFC    3FFFFF   24682468   3FFF

Associative Mapping Example

Associative Mapping Summary

• Address length = (s + w) bits.
• Number of addressable units = 2^(s+w) words or bytes.
• Block size = line size = 2^w words or bytes.
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s.
• Number of lines in cache = undetermined.
• Size of tag = s bits.

Set Associative Mapping (1/2)

• The cache is divided into a number of sets.
• Each set contains a number of lines.
• A given block maps to any line in a given set.
  - e.g. block B can be in any line of set i.
• e.g. 2 lines per set:
  - 2-way associative mapping.
  - A given block can be in one of 2 lines in only one set.

Set Associative Mapping (2/2)

[Diagram: main memory blocks 0-4095 map onto a cache of 64 sets with 2 lines each (Set 0: cache blocks 0-1, Set 1: cache blocks 2-3, …, Set 63: cache blocks 126-127). Main memory blocks 0, 64, 128, … map to set 0; blocks 1, 65, 129, … to set 1; and so on. Each cache line carries a tag.]

Main memory address: tag (6 bits) | set (6 bits) | word (4 bits)

Set Associative Mapping Address Structure

Tag: 9 bits | Set: 13 bits | Word: 2 bits

• Use the set field to determine which cache set to look in.
• Compare the tag field to see if we have a hit.
• e.g.:
  Address    Tag   Data       Set number
  1FF 7FFC   1FF   12345678   1FFF
  001 7FFC   001   11223344   1FFF
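The lookup split above can be sketched the same way as the direct-mapped case, now with a 9-bit tag and a 13-bit set index (the addresses are built from the slide's tag/set values):

```python
# Split a 24-bit address into set-associative cache fields.
TAG_BITS, SET_BITS, WORD_BITS = 9, 13, 2

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    s    = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag  = addr >> (WORD_BITS + SET_BITS)
    return tag, s, word

# The slide's two example blocks land in the same set (0x1FFF) with
# different tags, so a 2-way cache can hold both at the same time.
a = split((0x1FF << 15) | (0x1FFF << 2))   # tag 1FF, set 1FFF
b = split((0x001 << 15) | (0x1FFF << 2))   # tag 001, set 1FFF
```

In a direct-mapped cache these two blocks would conflict; 2-way associativity resolves exactly this case.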

Set Associative Mapping Example

• 13-bit set number.
• Block number in main memory is modulo 2^13.
• 000000, 00A000, 00B000, 00C000, … map to the same set.


Two Way Set Associative Cache Organization

Two Way Set Associative Mapping Example

Set Associative Mapping Summary

• Address length = (s + w) bits.
• Number of addressable units = 2^(s+w) words or bytes.
• Block size = line size = 2^w words or bytes.
• Number of blocks in main memory = 2^s.
• Number of lines per set = k.
• Number of sets = v = 2^d.
• Number of lines in cache = kv = k * 2^d.
• Size of tag = (s - d) bits.

Replacement Policy

• Replacement policy:
  - The strategy for choosing which cache entry to throw out to make room for a new memory location.
• Two popular strategies:
  - Random
  - Least-recently used (LRU)
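The LRU strategy above can be sketched in a few lines. This is a software model for illustration only (real caches track recency per set in hardware); the 2-line size and the block names are made up:

```python
# Minimal LRU replacement model: an ordered map whose first entry
# is always the least recently used block.
from collections import OrderedDict

class LRUCache:
    def __init__(self, lines=4):
        self.lines = lines
        self.data = OrderedDict()            # oldest entry first

    def access(self, block, value=None):
        if block in self.data:               # hit: mark most recently used
            self.data.move_to_end(block)
            return self.data[block]
        if len(self.data) >= self.lines:     # miss with a full cache:
            self.data.popitem(last=False)    # evict least recently used
        self.data[block] = value
        return value

c = LRUCache(lines=2)
c.access("A", 1); c.access("B", 2)
c.access("A")        # touch A, so B becomes least recently used
c.access("C", 3)     # evicts B, keeps A
```

Replaying the same sequence with a random policy could just as easily have evicted the frequently used block A.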

Write Operations

• Write policy:
  - Must not overwrite a cache block unless main memory is up to date.
  - Multiple CPUs may have individual caches.
  - I/O may address main memory directly.
• Write-through:
  - Immediately copies the write to main memory.
  - Immediately propagates the update through the various levels of caching (for critical data).
  - All writes go to main memory as well as to the cache.
  - Multiple CPUs can monitor main memory traffic to keep their local caches up to date.
  - Lots of traffic; slows down writes.

Write Operations

• Write-back:
  - Writes to main memory only when a location is removed from the cache.
  - Delays propagation until the cached item is replaced.
    • Goal: spread the cost of update propagation over multiple updates.
    • Less costly.
  - Updates are initially made in the cache only.
  - The update (dirty) bit for the cache slot is set when an update occurs.
  - Other caches can get out of synchronization.
  - If a block is to be replaced, write it to main memory only if the update bit is set.
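The traffic difference between the two policies can be shown with a toy model. This is an illustrative assumption (a single cached line that stays resident until one final eviction), not a model of any real memory system:

```python
# Count main-memory writes for n CPU stores that all hit one cached
# line, under write-through vs. write-back.
def memory_writes(policy, n_stores):
    writes = 0
    dirty = False
    for _ in range(n_stores):
        if policy == "write-through":
            writes += 1        # every store also goes to main memory
        else:                  # write-back: only set the update (dirty) bit
            dirty = True
    if policy == "write-back" and dirty:
        writes += 1            # one write when the line is finally evicted
    return writes

wt = memory_writes("write-through", 10)
wb = memory_writes("write-back", 10)
```

Ten stores cost ten memory writes under write-through but a single eviction write under write-back, which is the "spread the cost over multiple updates" point above.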

Generic Issues in Caching

• Cache hit: the requested instruction/data is found in the cache.
• Cache miss: the requested instruction/data is not in the cache => it is read from main memory (DRAM) and copied into the cache (cache update).

Reasons for Cache Misses

• Compulsory misses: data brought into the cache for the first time.
  - e.g., booting.
• Capacity misses: caused by the limited size of a cache.
  - A program may require a hash table that exceeds the cache capacity:
    • Random access pattern.
    • No caching policy can be effective.

Reasons for Cache Misses

• Misses due to competing cache entries: one cache entry is assigned to two pieces of data; when both are active, each will preempt the other.
• Policy misses: caused by the cache replacement policy, which chooses which cache entry to replace when the cache is full.

Improving Cache Performance

• Goal: reduce the Average Memory Access Time (AMAT):
  - AMAT = Hit Time + Miss Rate * Miss Penalty
• Approaches:
  - Reduce Hit Time.
  - Reduce Miss Penalty.
  - Reduce Miss Rate.
• Notes:
  - There may be conflicting goals.
  - Keep track of clock cycle time, area, and power consumption.

Effective Access Time

• Cache hit rate: 99%
  - Cost: 2 clock cycles.
• Cache miss rate: 1%
  - Cost: 4 additional clock cycles.
• Effective access time:
  - 99% * 2 + 1% * (2 + 4) = 1.98 + 0.06 = 2.04 clock cycles
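The arithmetic on this slide, written out as the AMAT formula from the previous slide:

```python
# Effective (average) memory access time for the slide's numbers.
hit_rate, miss_rate = 0.99, 0.01
hit_cost, miss_extra = 2, 4       # cycles; a miss adds 4 cycles on top
effective = hit_rate * hit_cost + miss_rate * (hit_cost + miss_extra)
# 0.99 * 2 + 0.01 * 6 = 1.98 + 0.06 = 2.04 clock cycles
```

Note how even a 1% miss rate with a modest penalty already adds 2% to the average access time; a memory-sized penalty of tens of cycles would dominate it.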

Tuning Cache Parameters

• Size:
  - Must be large enough to fit the working set (temporal locality).
  - If too big, hit time degrades.
• Associativity:
  - Needs to be large to avoid conflicts, but 4-8 ways perform about as well as fully associative.
  - If too big, hit time degrades.
• Block size:
  - Needs to be large to exploit spatial locality and reduce tag overhead.
  - If too large, there are few blocks ⇒ higher miss rate and miss penalty.

A configurable architecture allows designers to make the best performance/cost trade-offs.

Processor Core Start Procedure

[Diagram: AndeScore core with its Local Memory, Cache, and External Memory.]

Local Memory

• Local memory also describes a portion of memory that a software program or utility has exclusive access to once it is obtained.
• On-chip memory, organized per the unit that uses it.
• The memory used by a single CPU, or allocated to a single program or function.

Local Memory

• Memory model

Cache/Local Memory Design

• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches/Local Memories

Memory Management Unit (MMU)

• Hardware device that maps virtual addresses to physical addresses.
• In the MMU scheme, the value in the relocation register is added to every address generated by a user process at the time it is sent to memory.
• The user program deals with logical addresses; it never sees the real physical addresses.
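The relocation-register scheme above is a one-line translation. A minimal sketch; the base value 0x14000 and the logical address are made-up examples, not real register contents:

```python
# Relocation-register address translation: the MMU adds the
# relocation value to every CPU-generated (logical) address.
RELOCATION = 0x14000

def translate(logical):
    return RELOCATION + logical

physical = translate(0x346)   # program uses 0x346; memory sees 0x14346
```

The program can be moved in physical memory just by reloading the relocation register; no logical address in the program changes.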

Block Diagram

[Block diagram as before: Core with MMU, instruction/data caches, Local Memory, and DMA, connected through the BIU to the AHB and HSMP buses.]

MMU Functionality

• The memory management unit (MMU) translates addresses.

[Diagram: CPU → logical address → memory management unit → physical address → memory]

Virtual Memory

• Virtual address (logical address).
• MMU (built into the CPU).
• Physical address.
• Page table (in main memory).
• Page frame.
• Address translation.
• TLB:
  - A cache built within the CPU holding recently translated addresses.
• Page fault.
• Replacement algorithm:
  - LRU.

Virtual Memory

[Diagram: the processor issues a virtual address to the MMU, which produces the physical address; the cache is accessed with the physical address and returns data; on a miss, main memory is accessed with the physical address; pages move between main memory and disk storage via DMA transfer.]

Virtual Memory

[Diagram: page-table translation. The page-table base register points to the start of the page table. The virtual address from the processor is split into a virtual page number and an offset. Page-table address = base + virtual page number, selecting one page-table entry. The entry holds control bits (valid bit, dirty bit, the program's access rights to the page) and the page frame in memory. The physical address in main memory is formed from the page frame plus the offset, pointing to a byte within the physical page.]

Logical vs. Physical Address Space

• The concept of a logical address space that is bound to a separate physical address space is central to proper memory management.
  - Logical address: generated by the CPU; also referred to as a virtual address.
  - Physical address: the address seen by the memory unit.
• Logical and physical addresses are the same in compile-time and load-time address-binding schemes; logical (virtual) and physical addresses differ in the execution-time address-binding scheme.

Dynamic Relocation Using a Relocation Register

MMU Architecture

[Diagram: MMU architecture. The IFU and LSU first look up 4/8-entry I-uTLB and D-uTLB micro-TLBs; misses go through the M-TLB arbiter to the main TLB, organized as N(=32) sets x k(=4) ways = 128 entries (M-TLB tag and data arrays, 32x4). The M-TLB entry index is formed from the set number (bits log2(N)-1..0) and the way number (bits log2(N*k)-1..log2(N)). The HPTWK (hardware page-table walker) refills the M-TLB through the bus interface unit.]

MMU Functionality

• Virtual addressing:
  - Better memory allocation, less fragmentation.
  - Allows shared memory.
  - Dynamic loading.
• Memory protection (read/write/execute):
  - Different permission flags for kernel/user mode.
  - The OS typically runs in kernel mode.
  - Applications run in user mode.
  - Implemented by associating protection bits with each frame.
  - A valid-invalid bit is attached to each entry in the page table:
    • "valid" indicates that the associated page is in the process' logical address space, and is thus a legal page.
    • "invalid" indicates that the page is not in the process' logical address space.
• Cache control (cached/uncached):
  - Accesses to peripherals and other processors need to be uncached.

Valid (v) or Invalid (i) Bit in a Page Table

Direct Memory Access (DMA)

• Direct memory access (DMA) is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the CPU.
• Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards.
• DMA is also used for intra-chip data transfer, especially in multiprocessor systems-on-chip, where each processing element is equipped with a local memory and DMA is used for transferring data between the local memory and the main memory.
• Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel.

Block Diagram

[Block diagram as before: Core with MMU, instruction/data caches, Local Memory, and DMA, connected through the BIU to the AHB and HSMP buses.]

DMA Overview

• Two channels.
• One active channel at a time.
• Programmed using physical addressing.
• Works for both instruction and data local memory.
• The external address can be incremented with a stride.
• Optional 2-D Element Transfer (2DET) feature, which provides an easy way to transfer two-dimensional blocks from external memory.

LMDMA Double Buffer Mode

[Diagram: double-buffer mode. The core pipeline computes on local-memory bank 0 while the DMA engine moves data between external memory and bank 1; the banks are then swapped between the core and the DMA engine. Width: byte; stride (in the DMA Setup register) = 1.]

Bus Interface Unit (BIU)

• The bus interface unit is the part of the processor that interfaces with the rest of the system.
• It deals with moving information over the processor data bus, the primary conduit for the transfer of information to and from the CPU.
• The bus interface unit is responsible for responding to all signals that go to the processor, and for generating all signals that go from the processor to other parts of the system.
• The bus interface unit is responsible for off-CPU memory access, which includes:
  - System memory access.
  - Instruction/data local memory access.
  - Memory-mapped register access in devices.

Block Diagram

[Block diagram as before: Core with MMU, instruction/data caches, Local Memory, and DMA, connected through the BIU to the AHB and HSMP buses.]

Bus Interface

• Compliance with AHB/AHB-Lite/APB.
• High Speed Memory Port.
• Andes Memory Interface.
• External LM Interface.

HSMP - High Speed Memory Port

• The N12 also provides a high speed memory port interface, which has higher bus protocol efficiency and can run at a higher frequency, to connect to a memory controller.
• The high speed memory port will be AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements.

Examples

N903: Low-power Cost-efficient Embedded Controller

• Features:
  - Harvard architecture, 5-stage pipeline.
  - 16 general-purpose registers.
  - Static branch prediction.
  - Fast MAC.
  - Hardware divider.
  - Fully clock-gated pipeline.
  - 2-level nested interrupts.
  - External instruction/data local memory interface.
  - Instruction/data cache.
  - APB/AHB/AHB-Lite/AMI bus interface.
  - Power management instructions.
  - 45K ~ 110K gate count.
  - 250MHz @ 130nm.
• Applications:
  - MCU
  - Storage
  - Automotive control
  - Toys

[Diagram: JTAG/EDM, N9 uCore, instruction/data LM interfaces and caches, external bus interface to APB/AHB/AHB-Lite/AMI.]

N1033A: Low-power Cost-efficient Application Processor

• Features:
  - Harvard architecture, 5-stage pipeline.
  - 32 general-purpose registers.
  - Dynamic branch prediction.
  - Fast MAC.
  - Hardware divider.
  - Audio acceleration instructions.
  - Fully clock-gated pipeline.
  - 3-level nested interrupts.
  - Instruction/data local memory.
  - Instruction/data cache.
  - DMA support for 1-D and 2-D transfers.
  - AHB/AHB-Lite/APB bus.
  - MMU/MPU.
  - Power management instructions.
• Applications:
  - Portable audio/media player
  - DVB/DMB baseband
  - DVD
  - DSC
  - Toys, games

[Diagram: JTAG/EDM and EPT interfaces, N10 core with audio extensions, ITLB/DTLB and MMU/MPU, instruction/data LM and caches, DMA, external bus interface to AHB/AHB(D)/AHB-Lite/APB and AHB(I).]

N1213 - High Performance Application Processor

• Features:
  - Harvard architecture, 8-stage pipeline.
  - 32 general-purpose registers.
  - Dynamic branch prediction.
  - Multiply-add and multiply-subtract instructions.
  - Divide instructions.
  - Instruction/data local memory.
  - Instruction/data cache.
  - MMU.
  - AHB or HSMP (AXI-like) bus.
  - Power management instructions.
• Applications:
  - Portable media player
  - MFP
  - Networking
  - Gateway/router
  - Home entertainment
  - Smartphone/mobile phone

[Diagram: JTAG/EDM and EPT interfaces, N12 execution core, ITLB/DTLB and MMU, instruction/data caches and LM, DMA, external bus interface to AHB and HSMP.]

Thank You

WWW.ANDESTECH.COM