Digital Design & Computer Arch

Digital Design & Computer Arch. Lecture 21a: Memory Organization and Memory Technology Prof. Onur Mutlu ETH Zürich Spring 2020 14 May 2020 Readings for This Lecture and Next n Memory Hierarchy and Caches n Required q H&H Chapters 8.1-8.3 q Refresh: P&P Chapter 3.5 n Recommended q An early cache paper by Maurice Wilkes n Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. On Electronic Computers, 1965. 2 A Computing System n Three key components n Computation n Communication n Storage/memory Burks, Goldstein, von Neumann, “Preliminary discussion of the logical design of an electronic computing instrument,” 1946. 3 Image source: https://lbsitbytes2010.wordpress.com/2013/03/29/john-von-neumann-roll-no-15/ What is A Computer? n We will cover all three components Processing control Memory (sequencing) (program I/O and data) datapath 4 Memory (Programmer’s View) 5 Abstraction: Virtual vs. Physical Memory n Programmer sees virtual memory q Can assume the memory is “infinite” n Reality: Physical memory size is much smaller than what the programmer assumes n The system (system software + hardware, cooperatively) maps virtual memory addresses to physical memory q The system automatically manages the physical memory space transparently to the programmer + Programmer does not need to know the physical size of memory nor manage it à A small physical memory can appear as a huge one to the programmer à Life is easier for the programmer -- More complex system software and architecture A classic example of the programmer/(micro)architect tradeoff 6 (Physical) Memory System n You need a larger level of storage to manage a small amount of physical memory automatically à Physical memory has a backing store: disk n We will first start with the physical memory system n For now, ignore the virtualàphysical indirection n We will get back to it later, if time permits… 7 Idealism Pipeline Instruction (Instruction Data Supply Supply execution) - Zero latency access - No pipeline stalls - Zero latency access - Infinite capacity -Perfect data flow - Infinite capacity (reg/memory dependencies) - Zero cost - Infinite bandwidth - Zero-cycle interconnect - Perfect control flow (operand communication) - Zero cost - Enough functional units - Zero latency compute 8 Quick Overview of Memory Arrays How Can We Store Data? n Flip-Flops (or Latches) q Very fast, parallel access q Very expensive (one bit costs tens of transistors) n Static RAM (we will describe them in a moment) q Relatively fast, only one data word at a time q Expensive (one bit costs 6 transistors) n Dynamic RAM (we will describe them a bit later) q Slower, one data word at a time, reading destroys content (refresh), needs special process for manufacturing q Cheap (one bit costs only one transistor plus one capacitor) n Other storage technology (flash memory, hard disk, tape) q Much slower, access takes a long time, non-volatile q Very cheap (no transistors directly involved) Array Organization of Memories n Goal: Efficiently store large amounts of data q A memory array (stores data) q Address selection logic (selects one row of the array) q Readout circuitry (reads data out) Address N Array M n An M-bit value can be read or written at each unique N-bit address Data q All values can be accessed, but only M-bits at a time q Access restriction allows more compact organization Memory Arrays n Two-dimensional array of bit cells q Each bit cell stores one bit n An array with N address bits and M data bits: N q 2 rows and M columns q Depth: number of rows (number of words) q Width: number of columns (size of word) N q Array size: depth × width = 2 × M Address Data 11 0 1 0 N 2 Address Array Address Array 10 1 0 0 depth 01 1 1 0 M 3 00 0 1 1 Data Data width Memory Array Example n 22 × 3-bit array n Number of words: 4 n Word size: 3-bits n For example, the 3-bit word stored at address 10 is 100 Address Data 11 0 1 0 2 Address Array 10 1 0 0 depth 01 1 1 0 3 00 0 1 1 Data width Larger and Wider Memory Array Example 1024-word x Address 10 32-bit Array 32 Data Memory Array Organization (I) n Storage nodes in one column connected to one bitline n Address decoder activates only ONE wordline n Content of one line of storage available at output 2:4 Decoder bitline2 bitline1 bitline0 wordline 11 3 2 stored stored stored Address bit = 0 bit = 1 bit = 0 wordline 10 2 stored stored stored wordline bit = 1 bit = 0 bit = 0 01 1 stored stored stored bit = 1 bit = 1 bit = 0 wordline 00 0 stored stored stored bit = 0 bit = 1 bit = 1 Data2 Data1 Data0 Memory Array Organization (II) n Storage nodes in one column connected to one bitline n Address decoder activates only ONE wordline n Content of one line of storage available at output 2:4 Decoder bitline2 bitline1 bitline0 wordline 11 3 2 Active wordline stored stored stored Address bit = 0 bit = 1 bit = 0 wordline 10 10 2 stored stored stored wordline bit = 1 bit = 0 bit = 0 01 1 stored stored stored bit = 1 bit = 1 bit = 0 wordline 00 0 stored stored stored bit = 0 bit = 1 bit = 1 Data Data Data 1 2 0 1 0 0 How is Access Controlled? n Access transistors configured as switches connect the bit storage to the bitline. n Access controlled by the wordline bitline wordline stored bit bitline wordline bitline bitline wordline DRAM SRAM Building Larger Memories n Requires larger memory arrays n Large à slow n How do we make the memory large without making it very slow? n Idea: Divide the memory into smaller arrays and interconnect the arrays to input/output buses q Large memories are hierarchical array structures q DRAM: Channel à Rank à Bank à Subarrays à Mats 18 General Principle: Interleaving (Banking) n Interleaving (banking) q Problem: a single monolithic large memory array takes long to access and does not enable multiple accesses in parallel q Goal: Reduce the latency of memory array access and enable multiple accesses in parallel q Idea: Divide a large array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles) n Each bank is smaller than the entire memory storage n Accesses to different banks can be overlapped q A Key Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?) 19 Memory Technology: DRAM and SRAM Memory Technology: DRAM n Dynamic random access memory n Capacitor charge state indicates stored value q Whether the capacitor is charged or discharged indicates storage of 1 or 0 q 1 capacitor q 1 access transistor row enable n Capacitor leaks through the RC path q DRAM cell loses charge over time q DRAM cell needs to be refreshed _bitline 21 Memory Technology: SRAM n Static random access memory n Two cross coupled inverters store a single bit q Feedback path enables the stored value to persist in the “cell” q 4 transistors for storage q 2 transistors for access row select bitline _bitline 22 Memory Bank Organization and Operation n Read access sequence: 1. Decode row address & drive word-lines 2. Selected bits drive bit-lines • Entire row read 3. Amplify row data 4. Decode column address & select subset of row • Send to output 5. PrecharGe bit-lines • For next access 23 SRAM (Static Random Access Memory) Read Sequence row select 1. address decode 2. drive row select 3. selected bit-cells drive bitlines (entire row is read together) bitline _bitline 4. differential sensing and column select (data is ready) 5. precharge all bitlines (for next read or write) bit-cell array 2n n+m n 2n row x 2m-col Access latency dominated by steps 2 and 3 Cycling time dominated by steps 2, 3 and 5 (n»m to minimize m overall latency) - step 2 proportional to 2 n - step 3 and 5 proportional to 2 m 2m diff pairs sense amp and mux 1 24 DRAM (Dynamic Random Access Memory) row enable Bits stored as charges on node capacitance (non-restorative) - bit cell loses charge when read - bit cell loses charge over time _bitline Read Sequence 1~3 same as SRAM 4. a “flip-flopping” sense amp RAS bit-cell array amplifies and regenerates the bitline, data bit is mux’ed out 2n n 2n row x 2m-col 5. precharge all bitlines (n»m to minimize overall latency) Destructive reads Charge loss over time m 2m sense amp and mux Refresh: A DRAM controller must 1 periodically read each row within A DRAM die comprises the allowed refresh time (10s of CAS of multiple such arrays ms) such that charge is restored 25 DRAM vs. SRAM n DRAM q Slower access (capacitor) q Higher density (1T 1C cell) q Lower cost q Requires refresh (power, performance, circuitry) q Manufacturing requires putting capacitor and logic together n SRAM q Faster access (no capacitor) q Lower density (6T cell) q Higher cost q No need for refresh q Manufacturing compatible with logic process (no capacitor) 26 Digital Design & Computer Arch. Lecture 21a: Memory Organization and Memory Technology Prof. Onur Mutlu ETH Zürich Spring 2020 14 May 2020 We did not cover the following slides in lecture. These are for your preparation for the next lecture. The Memory Hierarchy Memory in a Modern System L2 CACHE 1 L2 L2 CACHE 0 L2 SHAREDCACHEL3 DRAM INTERFACE DRAM DRAM BANKS DRAM CORE 0 CORE 1 DRAM MEMORY CONTROLLER L2 CACHE 2 L2 CACHE 3 L2 CORE 2 CORE 3 30 Ideal Memory n Zero access time (latency) n Infinite capacity n Zero cost n Infinite bandwidth (to support multiple accesses in parallel) 31 The Problem n Ideal memory’s requirements oppose each other n Bigger is slower q Bigger à Takes longer to determine the location n Faster is more expensive q Memory technology: SRAM vs.

Load more