Random-Access Storage Components in Customized Architectures

Lode Nachtergaele

Francky Catthoor

Chidamber Kulkarni

IMEC

This tutorial covers the basic design choices involved in customized memory organizations, including those for register files, local memory, caches, and main memory.

STORAGE TECHNOLOGY TAKES center stage in increasingly more systems because of the eternal push for more complex applications, especially those with larger and more complicated data types.1 In addition, the access speed, size, and power consumption associated with storage form a severe bottleneck in these systems, especially in the embedded context. In this article, we will investigate several building blocks and overall organizations for memory storage, with a focus on customized memory architectures. The main emphasis will be on the generic storage components that are used in modern multimedia- and telecom-oriented processors, of both the digital-signal-processor (DSP) and application-specific types. We will not address the specific memory organizations (using instantiations of these components) that are part of complete system designs.

The content of this article is at the tutorial level; we intend it mainly as an introduction to other articles that explore customized memory organizations. For more details, see the references cited. An article by Kozyrakis et al.,2 for example, discusses the impact of embedded DRAM technology on the global memory organization of more general-purpose programmable multiprocessors.

General principles and storage classification
The goal of a storage device is to store a number of n-bit data words for a short or long term. Under control of the address unit(s), these data words are transferred to the custom processing units (which we'll call processors, for simplicity) at the appropriate point in time (cycle), and the results of the operations are then written back in the storage device for future use. Because of the different characteristics of storage and access, different styles of devices have been developed.

Among memories for frequent and repetitive use, we can see a very important difference between short- and long-term storage. Short-term memories are normally located very close to the operators and require a very short access time so that they can be accessed in the same cycle as the arithmetic or logic operation applied to the resulting data. Consequently, they should be limited to a relatively small capacity (typically less than 128 words), and are usually taken up in feedback loops over the operators (for example, the loop RegfA-BufA-BusA-EXU-RegfA in Figure 1) or at the input of execution units. This kind of short-term memory is typically called foreground memory.

Figure 1. Register file illustration in a feedback loop of a simple customized data path.

Devices for longer-term storage generally cater to far larger capacities (from 256 to 32 million 32-bit words) and take a separate cycle for read or write access. They are connected through a data bus (which can be partly off chip) with the execution units.

We can make six other important distinctions using the tree-like "genealogy" of storage devices presented in Figure 2. The tree on the left focuses on the external usage behavior of the storage approach. It contains three layers.

The first layer addresses read-only versus read/write (R/W) access. Some memories (ROMs, for example) are used only to store constant data. When the addresses are sparsely spread or there are multilevel logic circuits, especially when the amount of data is relatively small, good storage choices are devices such as programmable logic arrays (PLAs). In most cases, however, data must be overwritable at high speed, meaning read and write are treated with the same priority (R/W access), such as in random-access memories (RAMs). In some cases, ROMs can be made electrically alterable (EAROM) or erasable ("write-few") or programmable (PROM) using means such as fuses. This article covers only R/W memories.

The next layer of Figure 2a concerns whether or not the memory is volatile. For R/W memories, the data is usually removed once the power goes down. In some cases, this problem can be avoided, but these nonvolatile options are expensive and slow. Examples are magnetic media and tapes, which are intended for slow access of mass data, or flash RAMs, which also allow higher speeds. We will restrict ourselves to what is common on most chips, namely volatile memories.

The last layer of Figure 2a concerns the address mechanism. Some devices require only sequential addressing, such as first-in, first-out (FIFO) queues or first-in, last-out (FILO) stack structures. Sequential addressing puts a tight restriction on the order in which the data are read out. However, for many applications, this restriction is acceptable. A more general, but still sequential, access order is available in pointer-addressed memory (PAM). The main limitation of PAM is that each data value is written and read once in any statically predefined order.


Figure 2. Storage classification trees: usage (a) and memory realization (b).


Figure 3. Hierarchical memory pipeline: foreground register file (two-port), L1 cache (16K-64K SRAM, one-port), L2 cache (512K-4M S(D)RAM, one-port), main memory (32M-512M (S)DRAM, one-port), and disk level (2G-16G DRAM bank or hard disk).

However, in most cases the address sequence should be allowed to be random (including repetition). Usually these devices are implemented with a direct-addressing scheme, typically called a RAM. An important requirement is that the access time be independent of the selected address. Many programmable processors use a special case of random-access-based buffering, exploiting comparisons of tags and usually also including associativity (in what is called a cache buffer).

The tree on the right in Figure 2 focuses on the way the memory is realized as a circuit or architecture component. It also contains three layers:

1. Number of independent addresses and corresponding gateways (buses) for access. This parameter can be one (single-port device), two (dual-port device), or even more (multiport device). Any of these ports can be for reading only, writing only, or R/W. Of course, the area occupied increases considerably with the number of ports.
2. Internal organization. The memory can be meant for capacities that remain small or that can become large. Here, designers usually must trade off between speed and area efficiency. The register files we describe in the next section constitute an example of fast, small-capacity organizations, which are usually also dual- or even multiported. The queues and stacks are meant for medium-sized capacities. The RAMs can become extremely large (up to 1 Gbit in state-of-the-art devices), but they are also far slower in random access.
3. Static or dynamic. For R/W memories, the data can remain valid as long as VDD is on (static cell, as in SRAM), or the data should be regularly refreshed (dynamic cell, as in DRAM). Within the dynamic class, high-throughput synchronous DRAMs (SDRAMs) have recently become available. For circuit-level issues, see overview articles such as Evans and Franzon3 (for SRAMs), and Itoh4 and Prince5 (for DRAMs).

Custom processors for data-dominated applications are driven from a customized memory organization. In programmable instruction set processors, this organization is usually constructed on a rigorous hierarchy, almost like a bidirectional pipeline, from the disk (possibly with a disk cache to hide the long latency) over the main memory and the L2/L1 cache to the multiport register file (see Figure 3). Still, the pipeline becomes increasingly saturated and blocked because of the large latencies introduced compared to the CPU clock cycles in current process technologies.

Register files and local memory organization
Figure 4 shows an illustrative organization for a dual-port register file with two address buses, where the separate read and write addresses are generated from an address calculation unit (ACU). This organization uses two data buses (A and B), but only in one direction, so the write and read addresses directly control the port access. In general, the number of different address words can be smaller than the number of ports when they are shared (for example, for either read or write), and the buses can be bidirectional. Additional control signals decide whether to write or read and for which port the address applies. The number of address bits per word is log2(N).

Figure 4. Two-port register file with a capacity of N words, with both read and write address supplied in parallel from an address calculation unit.

The register file of Figure 4 can be used very efficiently in the feedback loop of a data path, as already illustrated in Figure 1. In general, it is used only for the storage of temporary variables in the application running on the data path (sometimes also referred to as the execution unit). Such register files are also used heavily in most modern general-purpose RISC chips and especially in modern multimedia-oriented signal processors, which have register files with up to 128 locations. (In this case, it becomes doubtful whether a register file is really a good option, because of the very high area and power penalty.)

Multimedia-oriented VLIW processors or recent superscalar processors give register files very large access bandwidth and many ports; examples are the 17-port iWarp register file and the 15-port TriMedia register file.6 Application-specific instruction set processors (ASIPs) and custom processors make heavy use of register files for the same purpose. Although these register files have the clear advantage of very fast access, the number of data words to be stored should be minimized as much as possible because of their power- and area-intensive structure, due to both the decoder and the cell overhead.7

Cache memory organization
Nearly every modern programmable processor, especially those intended as general-purpose devices, uses caches.

Basic cache architecture
Caches are mainly intended to exploit the temporal and spatial locality of memory accesses.8 Here, we summarize the different steps involved in cache operation. For our explanation, we'll use a direct-mapped cache as shown in Figure 5 (see also the "Cache architecture design choices" sidebar), but the basic principles remain the same for other types of cache memories.8 The following steps occur whenever there is a read/write from or to a cache:

1. address decoding;
2. selection based on index, offset, or both;
3. tag comparison; and
4. data transfer.

In Figure 5 and Figure 6, these steps are highlighted in gray circles.

Figure 5. Eight-Kbyte, direct-mapped cache with 32-byte blocks. The CPU address splits into a 21-bit tag, an 8-bit index selecting one of 256 blocks, and a 5-bit block offset; a valid bit and a tag comparator decide hits, and a write buffer sits between the cache and the lower-level memory.
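To make the four steps concrete for the organization of Figure 5, the following C sketch (our own illustration; the type and function names are not taken from any particular library) decodes an address into tag, index, and block offset and performs a direct-mapped lookup:

#include <stdint.h>

/* Parameters of the Figure 5 organization: 8-Kbyte cache, 32-byte blocks,
   so 256 blocks, a 5-bit block offset, an 8-bit index, and the remaining
   address bits forming the tag. */
#define BLOCK_BYTES 32u
#define NUM_BLOCKS  256u
#define OFFSET_BITS 5u
#define INDEX_BITS  8u

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_BLOCKS];

/* Returns 1 on a hit and copies the requested byte into *out;
   returns 0 on a miss (the block would then be fetched from the
   lower-level memory and the valid/tag fields refilled). */
int cache_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (BLOCK_BYTES - 1);                  /* step 1: address decoding   */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    cache_line_t *line = &cache[index];                          /* step 2: selection by index */

    if (line->valid && line->tag == tag) {                       /* step 3: tag comparison     */
        *out = line->data[offset];                               /* step 4: data transfer      */
        return 1;
    }
    return 0;
}

On a miss, a real controller would fetch the 32-byte block from the lower-level memory and refill the line before completing the access.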


Cache architecture design choices

Several important decisions are involved in the architecture design of a cache. Here, we discuss the choice of line size, degree of associativity, updating policy, and replacement policy.

Line size
The line size is the unit of data transfer between the cache and the main memory. In many publications, line size is also called block size. As the line size increases from very small to very large, the miss ratio initially decreases, because a miss fetches more data at a time. Further increases then cause the miss ratio to increase, as the probability of using the newly fetched (additional) information becomes less than the probability of reusing the data that is replaced, and also as fewer lines fit into the cache.
The optimal line size is completely algorithm dependent. In general, very large line sizes are not desirable, because they contribute to larger load latencies and increased cache pollution. This is true for both general-purpose and embedded applications.

Mapping and associativity
The process of transferring data elements from main memory to the cache is called mapping. Associativity refers to the process of retrieving several cache lines and then determining if any of them is the target. The degree of associativity and the type of mapping significantly affect cache performance. Most caches are set associative; an address maps to a set, and then an associative search is made of that set (see Figures 5 and 6 in the main text).
Empirically, and as one would expect, increasing the degree of associativity decreases the miss ratio. The highest miss ratios are observed for direct-mapped caches; two-way associativity is significantly better, and four-way is slightly better still. Further increases in associativity only slowly decrease the misses. Nevertheless, a cache with greater associativity requires more tag comparisons, and these comparators constitute a significant amount of the total power consumption in the memories. Thus, for embedded applications, where power consumption is an important consideration, associativities greater than four or eight are not commonly observed. Articles by Su and Despain1 and Ko, Balsara, and Nanda2 include some architectural techniques for low-power caches.

Updating policy
The process of maintaining coherency between two consecutive memory levels is termed the updating policy. There are two basic approaches to updating the main memory: write through and write back. With write through, all writes are immediately transmitted to the main memory (apart from writing to the cache); with write back, most writes are written to the cache and are then copied back to the main memory as those lines are replaced.
Initially, write through was the preferred updating policy, because it is very easy to implement and solves the problems of data coherence. But it also generates a lot of traffic between the various levels of memory. Hence, most current state-of-the-art media and DSP processors use a write-back policy. Lately, we have seen a trend toward giving control of the updating policy to the user (or compiler). This can be effectively exploited to reduce power, because of reduced write backs, by compile-time analysis.

Replacement policy
A cache's replacement policy refers to the type of protocol used to replace a partial or complete line in the cache on a cache miss. Typically, a least recently used (LRU) policy is preferable, because it has acceptable results.
Current state-of-the-art general-purpose and DSP processors have hardware-controlled replacement policies; hardware counters monitor the least recently used data in the cache. In general, we have observed that a policy based on compile-time analysis always has significantly better results than fixed statistical decisions.

References
1. C. Su and A. Despain, "Cache Design Trade-Offs for Power and Performance Optimization: A Case Study," Proc. IEEE Int'l Symp. Low-Power Design, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 63-68.
2. U. Ko, P. Balsara, and A. Nanda, "Energy Optimization of Multi-Level Processor Cache Architectures," Proc. IEEE Int'l Workshop on Low Power Design, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 45-50.

Figure 6. Eight-Kbyte, two-way associative cache with 32-byte blocks. Compared with Figure 5, the index shrinks to 7 bits (128 sets), the tag grows to 22 bits, and two tag comparators operate in parallel on the selected set.
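Under the same assumptions as the direct-mapped sketch above, a two-way lookup for the organization of Figure 6 differs only in using one fewer index bit and in comparing the tags of both ways of the selected set (again our own illustration):

#include <stdint.h>

#define BLOCK_BYTES 32u
#define NUM_SETS    128u   /* 256 blocks arranged as 128 sets of 2 ways        */
#define OFFSET_BITS 5u
#define INDEX_BITS  7u     /* one bit fewer than in the direct-mapped case     */
#define WAYS        2u

typedef struct {
    int      valid;
    uint64_t tag;          /* one bit wider than in the direct-mapped case     */
    uint8_t  data[BLOCK_BYTES];
} way_t;

static way_t sets[NUM_SETS][WAYS];

int cache2_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (BLOCK_BYTES - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    for (unsigned w = 0; w < WAYS; w++) {      /* n tag comparisons, done in parallel in hardware */
        way_t *line = &sets[index][w];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];
            return 1;                          /* hit in way w                                     */
        }
    }
    return 0;                                  /* miss: a replacement policy (for example, LRU) picks the way to refill */
}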

The first step performs the task of decoding the address, supplied by the CPU, into the block address and the block offset. The block address is further divided into the cache's tag and index addresses. Once the address decoding is complete, the index part of the block address obtains the particular cache line demanded by the CPU. This line is chosen only if the valid bit is set and the tag portion of the data cache and the block address match each other. This involves comparing all the individual bits. For a direct-mapped cache, we have only one tag comparison. After this process, the block offset obtains the particular data element in the chosen cache line. This data element is now transferred to or from the CPU.

Figure 5 shows a direct-mapped cache; an n-way set-associative cache would result in the following differences:

- Rather than just one, n tag comparisons are used.
- Fewer index bits and more tag bits are present. Figure 6, which shows a two-way set-associative cache, illustrates this circumstance.

In the sidebar, we briefly discuss some of the issues that must be considered during cache memory design.

Caches are prime candidates for a realization with embedded-memory technology. The L1 cache, which supplies data directly to the processor data paths, requires an access speed that is still too high to realize with DRAM components, so SRAM is preferred for that level. For the much larger L2 cache, where the access requirements are less stringent, eDRAM is a very good candidate technology, especially when based on the synchronous DRAM principles. (For a discussion of these issues in the context of programmable processors, see also Kozyrakis et al.2)


for (i=3 ; i<11 ; i++) b[i-1] = b[i-3] + a[i] + a[i-3] ;


Figure 7. Initial algorithm and cache states corresponding to different iterations i of that algorithm for a fully associative cache. Diagonal bars represent cache misses. This example has 26 cache misses: 26/32, an 81% miss rate.

Designers use three main types of architectural enhancement techniques to reduce compulsory misses, conflict misses, and write backs: victim buffers (or similar buffers that store a limited number of words evicted from the cache at its different ports),9 hardware or software prefetching,10 and cache locking.6 Cache locking used for reduced write backs and software prefetching both need extensive compiler support (see "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., in this issue). Other researchers have written about interesting techniques related to cache and memory hierarchies in the context of low power and reduced memory bandwidth.11

Classification of cache misses
Whenever data requested by the CPU is not found in the cache, a cache miss has occurred. Essentially, three types of cache misses exist:

- Compulsory cache misses. The first access to the block is not in the cache, so the block must be brought into the cache.
- Capacity cache misses. If the cache cannot contain all the blocks needed during execution of a program (even when it is fully associative), capacity misses will occur because of blocks being discarded and later retrieved. (An alternative definition for capacity misses is the total number of misses in a fully associative cache.)
- Conflict misses. If the block placement strategy is set-associative or direct-mapped, conflict misses occur (in addition to compulsory and capacity misses), because a block can be discarded and later retrieved if too many blocks map to its set. Lower associativity results in more conflicts. The difference between the total number of cache misses in a direct-mapped or set-associative cache and that of a fully associative cache is defined as the number of conflict misses.

46 IEEE Design & Test of Computers for (i=3 ; i<11 ; i++) b[i−1] = b[i−3] + a[i] + a[i−3] ;


Figure 8. Initial algorithm and cache states corresponding to different iterations i of that algorithm for a direct- mapped cache. Diagonal bars represent cache misses. This example has a 100% miss rate: 32/32.
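The miss counts cited for Figures 7 and 8 can be reproduced with a small, self-contained simulation. The sketch below assumes, as stated in the accompanying discussion, an 8-entry cache with 1-byte lines, array a[] at addresses 0 to 10, and array b[] at addresses 11 to 21; the per-iteration access order a[i-3], a[i], b[i-3], b[i-1] and the LRU replacement used for the fully associative case are our own assumptions, chosen to match the figures:

#include <stdio.h>

#define LINES 8

static int fully_assoc_misses(const int *trace, int n)
{
    int addr[LINES], age[LINES], used = 0, misses = 0;
    for (int t = 0; t < n; t++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (addr[j] == trace[t]) hit = j;
        if (hit >= 0) { age[hit] = t; continue; }        /* hit: refresh LRU age        */
        misses++;
        if (used < LINES) { addr[used] = trace[t]; age[used] = t; used++; }
        else {                                           /* evict the least recently used */
            int lru = 0;
            for (int j = 1; j < LINES; j++)
                if (age[j] < age[lru]) lru = j;
            addr[lru] = trace[t]; age[lru] = t;
        }
    }
    return misses;
}

static int direct_mapped_misses(const int *trace, int n)
{
    int addr[LINES], valid[LINES] = {0}, misses = 0;
    for (int t = 0; t < n; t++) {
        int line = trace[t] % LINES;                     /* direct mapping: address modulo 8 */
        if (!valid[line] || addr[line] != trace[t]) {
            misses++;
            valid[line] = 1;
            addr[line] = trace[t];
        }
    }
    return misses;
}

int main(void)
{
    int trace[32], n = 0;
    for (int i = 3; i < 11; i++) {          /* b[i-1] = b[i-3] + a[i] + a[i-3] */
        trace[n++] = i - 3;                 /* read  a[i-3] at address i-3     */
        trace[n++] = i;                     /* read  a[i]   at address i       */
        trace[n++] = 11 + (i - 3);          /* read  b[i-3] at address 11+i-3  */
        trace[n++] = 11 + (i - 1);          /* write b[i-1] at address 11+i-1  */
    }
    printf("fully associative: %d/32 misses\n", fully_assoc_misses(trace, n));
    printf("direct mapped:     %d/32 misses\n", direct_mapped_misses(trace, n));
    return 0;
}

Run as written, this reproduces the 26 and 32 misses quoted in the captions.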

In most real-life applications, capacity and conflict misses are dominant. Hence, reducing these misses is vital to achieving better performance and reducing power consumption.

Now let's look at an example that illustrates these three types of cache misses. Figure 7 and Figure 8 show the detailed cache states for a fully associative cache and a direct-mapped cache.


A diagonal bar on an element in these figures indicates that the particular element is replaced by the next element (in the next column of the same row without a bar) because of the (hardware) cache-mapping policy. Hence, every diagonal bar represents a cache miss. The main memory layout is assumed to be single and contiguous: array a[] resides in locations 0 to 10, and array b[] in locations 11 to 21. For this illustration, we use a cache of size 8 bytes and a line size of 1 byte.

We observe from Figure 7 that the algorithm needs 32 data accesses. To complete these 32 data accesses, the fully associative cache requires 26 cache misses. Out of these 26 misses, 21 are compulsory and the remaining five are capacity misses. The direct-mapped cache requires 32 misses, as seen in Figure 8. This means that all 32 data accesses result in misses, and the on-chip direct-mapped cache is not exploiting the available data reuse at all (for the given access order). Thus, for the algorithm in our example, we have 21 compulsory misses, five capacity misses, and six conflict misses.

Hardware- and software-controlled caches
Until recently, the embedded hardware cache controller steered most cache behavior, a reasonable setup if little can be analyzed about the behavior of the application at compile time. This is the case for programs that are heavily dependent on user interaction, such as office automation tools and debuggers. Over the years, however, the cache controller became more and more costly in terms of both area and power.

For many modern applications such as multimedia and telecommunications, cache behavior can be analyzed at compile time far more easily, even when irregular but manifest access patterns are present. Still, the code that is initially written by designers usually lacks locality (temporal, spatial, or both); in that case, the hardware controller no longer does a good job. However, recently developed, advanced compiler technology is simplifying analysis of such applications (see "Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors," by F. Catthoor et al., and "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., both in this issue). Therefore, the newer cache architectures make more software control available. An example is the software-lockable cache in the Philips TriMedia.6

Table 1 lists the major differences between hardware- and software-controlled caches for state-of-the-art multimedia and DSP processors.

Table 1. Differences between hardware and software caches for current state-of-the-art multimedia and DSP processors.

Characteristic        Hardware control    Software control
Basic concepts        Lines and sets      Lines
Data transfer         Hardware            Partly software
Updating policy       Hardware            Software
Replacement policy    Hardware            Software

First, the hardware-controlled caches rely on the hardware to do the data cache management. Hence, to perform this task, the hardware uses basic concepts of cache lines and sets. On the other hand, for software-controlled caches, the compiler (or the user) performs cache management. The complexity of designing a software cache is far lower and requires only a basic concept of unit data transfer equivalent to a cache line.

Second, the hardware performs the data transfer based on the execution order of the algorithm at runtime, using fixed statistical measures. In contrast, for the software-controlled cache, either the compiler or the user performs this task. This is currently possible using high-level compile-time directives such as ALLOCATE() and link-time options such as LOCK to lock certain data in part of the cache, through the compiler or linker.6

Third, the most important difference between hardware- and software-controlled caches is in the way the next higher level of memory is updated, namely the way the caches maintain coherence of data. For a hardware-controlled cache, the hardware writes data to the next higher level of memory either every time a write occurs or when the particular cache line is evicted. In contrast, for the software-controlled cache, the compiler decides when and whether to write back a particular data element.6 This results in a large reduction in the amount of data transfers between the different memory levels, contributing to lower power and reduced bandwidth usage by the algorithm.

Finally, the hardware-controlled cache needs an extra bit for every cache line to determine the least recently used data, which will be replaced on a cache miss. For the software-controlled cache, the compiler manages the data transfer, so no need exists for additional bits or a particular replacement policy.
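As a minimal sketch of what such software-steered transfer amounts to, assume a data-path kernel process_block() and a local buffer standing in for the cache lines; the names and the copy-in/copy-back calls below are our own illustration, not the ALLOCATE()/LOCK directives mentioned above:

#include <string.h>

#define BLOCK 64

void process_block(int *buf, int n);   /* assumed data-path kernel */

void software_managed_pass(int *big_array, int nblocks)
{
    int local[BLOCK];                                         /* plays the role of the cache lines   */
    for (int b = 0; b < nblocks; b++) {
        memcpy(local, &big_array[b * BLOCK], sizeof local);   /* explicit fetch                      */
        process_block(local, BLOCK);                          /* all further accesses stay local     */
        memcpy(&big_array[b * BLOCK], local, sizeof local);   /* explicit write back                 */
        /* If compile-time analysis shows the block is read-only, the copy-back is
           simply omitted, which is the write-back reduction referred to in the text. */
    }
}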
Main and global-hierarchical memory organization
As mentioned earlier, the access bottleneck is especially severe for main memories in the hierarchical pipeline (see Figure 3). This bottleneck is due to the large sizes of these main memories, and partly to their location (until now) off chip, which causes a highly capacitive on/off-chip bus. Because of the use of embedded (S)DRAM technology, the latter problem can be solved for several new applications.

However, in many cases, this new solution will work at the L2 cache level but, because of the large size requirements, not yet for the main memory. Kozyrakis et al.2 discuss this issue, indicating that modern multiprocessors still rely on a (large) central shared main memory (at least in terms of the address space). Even with eDRAM-based main memory solutions, the central shared memory would still impose heavy restrictions on the bandwidth; nor would eDRAM sufficiently solve the large energy consumption issues. So the customization benefits available with an on-chip eDRAM solution are not really suited for the main memory stage in the hierarchy. Hence, the bottlenecks still largely remain.

SRAMs are excluded for the large sizes required for main memories, because of their significantly lower density, more than an order of magnitude lower than that of (S)DRAMs.

The literature on storage includes proposals for a large variety of possible types of RAMs for use as main memories, and research on RAM technology is still very active, as demonstrated by interest in meetings such as the International Solid-State Circuits (ISSCC) and Custom Integrated Circuits (CICC) conferences. Several summary articles on this research are available.1,5,12 The general organization depends on the number of ports, but usually single-port structures are present in high-density memories. This is also the restriction here.

Most other distinguishing characteristics relate to the circuit design (and the technological issues) of the RAM cell, the decoder, and the auxiliary R/W devices. One main conclusion of a literature study on the recent evolution of main memory is that the main emphasis, in almost all avenues of research, is on the access structure and not on the way the actual storage takes place in the "cell." This access structure, which includes the selection part and the input/output part, starts to dominate the overall cost in terms of power, speed, and throughput (assuming, of course, that leakage power in the memory matrix is kept under control). When power-sensitive memories are constructed, the area contribution also becomes more balanced between the heavily distributed memory matrices and the access structures.

The external data access bottleneck
In most cases, a processor requires one or more large external memories to store the long-term data (mostly of the DRAM type). For data-dominated applications in the past, total system power cost was for a large part due to the presence of these external memories on the board.

Until recently, most DRAMs were of the pipelined page-mode type. Pipelined memory implementations improve a memory's throughput, but they don't improve the latency. For instance, in an extended data-out (EDO) RAM, the address decoding and the actual access to the storage matrix (of the previous memory access) occur in parallel. In that case, the access sequence to the DRAM is also important.

Figure 9 shows a DRAM's principal blocks. The address bus is divided into a column address and a row address, and these two parts are latched. This allows the device to pipeline the transfer of the address in two stages. To make this work, two control pins are necessary: the row address strobe (RAS) pin controls the row address latch, and the column address strobe (CAS) pin controls the column address latch.

Figure 9. Illustration of DRAM operation.
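As a small illustration of this address multiplexing (the row/column geometry below is an assumption made only for this sketch, not a value from the article), a linear address is simply split into the two halves that are driven onto the shared address pins one after the other:

#include <stdint.h>

/* Assumed geometry for illustration: 4,096 rows by 1,024 columns. */
#define COL_BITS 10u
#define COL_MASK ((1u << COL_BITS) - 1u)

typedef struct { uint32_t row; uint32_t col; } dram_addr_t;

/* The row part is latched first (with RAS), the column part second (with CAS). */
dram_addr_t split_dram_address(uint32_t addr)
{
    dram_addr_t a;
    a.row = addr >> COL_BITS;   /* driven on the address pins while RAS is asserted */
    a.col = addr & COL_MASK;    /* driven on the same pins while CAS is asserted    */
    return a;
}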


To read data from DRAM, the following steps must take place (see also Figure 9):

1. The row address is placed on the address pins via the address bus.
2. The RAS pin is activated, placing the row address onto the row address latch.
3. The row address decoder selects the proper row to be sent to the sense amplifiers.
4. The write-enable is deactivated so that the DRAM knows that it's not being written to.
5. The column address is placed on the address pins via the address bus.
6. The CAS pin is activated, placing the column address on the column address latch.
7. The CAS pin also serves as the output enable, so once the CAS signal is stable, the sense amplifiers place the data from the addressed row and column on the data-output pins, and the data can be transferred to the processor.
8. The RAS and CAS are both deactivated so that the cycle can begin again.

Fast page-mode DRAMs. FPM DRAMs are currently the most used category of DRAMs; they are faster than previous-generation DRAMs because of their ability to work within a page (the section of memory available within a row address). Within one specific row are several columns of memory bits. With FPM DRAMs, you need only specify the row address once for accesses within the same page. Successive accesses to the same page of memory require selecting only a column address, thus saving time in accessing the memory. The electronic standards agency, JEDEC, has specified standards for FPM DRAM.

Extended data-out DRAMs. EDO DRAMs work very similarly to FPM DRAMs, also having the ability to work within a page. The primary advantage of EDO over FPM is that EDO DRAMs hold the data valid even after the signal that strobes the column address goes inactive. Faster microprocessors can thus manage time more efficiently, performing many tasks without having to attend to slower memory. That is, while the EDO DRAM is retrieving an instruction for the microprocessor, the microprocessor can perform other tasks without worrying that the data will become invalid. JEDEC has also specified standards for the EDO DRAM.

Example. Consider the following C program:

const int N=100;
int A[N][N];
int r,c;
for (c=0; c < N; c++) {
  for (r=0; r < N; r++) {
    ... = foo(A[r][c]);
  }
}

The physical address of array A, assuming a row-wise organization, is baseaddress(A) + 100 × r + c. Hence, every time the inner iterator r increments, the DRAM address is increased by 100. This means the row address must be reapplied to the DRAM and latched into its latch; hence several cycles are lost.

This problem can easily be alleviated by a loop interchange:

const int N=100;
int A[N][N];
int r,c;
for (r=0; r < N; r++) {
  for (c=0; c < N; c++) {
    ... = foo(A[r][c]);
  }
}

This is very obvious in the preceding example, but in general it takes a systematic global approach to trade off all possibilities. "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., addresses this topic. The "EDO DRAM access" sidebar provides a more elaborate example.
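A rough model of why the interchange helps is to count how often the row (page) address seen by the DRAM changes during the traversal; the 1-Kbyte page size and 4-byte element size below are assumptions made only for this sketch:

#include <stdio.h>

#define N          100
#define PAGE_BYTES 1024

static int row_activations(int row_major)
{
    long last_page = -1;
    int activations = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int r = row_major ? i : j;               /* which index is the inner iterator   */
            int c = row_major ? j : i;
            long byte_addr = (long)(r * N + c) * 4;  /* row-wise layout, 4 bytes per element */
            long page = byte_addr / PAGE_BYTES;      /* row address seen by the DRAM         */
            if (page != last_page) { activations++; last_page = page; }
        }
    }
    return activations;
}

int main(void)
{
    printf("column-major traversal: %d row activations\n", row_activations(0));
    printf("row-major traversal:    %d row activations\n", row_activations(1));
    return 0;
}

With these assumed numbers, the row-major traversal opens each page only once (about 40 activations in total), whereas the column-major traversal re-opens pages throughout the inner loop, roughly a hundred times more often.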
EDO DRAM access example

Let's work out a brief example, directly based on the material in Coelho and Hawash.1 The data sheet of an enhanced data-out (EDO) memory chip specifies a sequence such as {10-2-2-2}{3-2-2-2}, where the numbers represent clock cycles. Each curly bracket indicates four bus cycles of 64 bits each, that is, one cache line. The first sequence, {10-2-2-2}, specifies the timing if the page is first opened and accessed four times. The second sequence, {3-2-2-2}, specifies the timing if the page was already open and accessed four additional times; that means you did not open and access any other memory page in the meantime. The last sequence repeats as long as you access memory within the same page.

The data sheet used relates to a memory bus running at 66 MHz. For another processing speed, such as a 233-MHz bus, the timing becomes {35-7-7-7}{11-7-7-7} in processor clocks. Table A provides the timing in the different parts of the overall memory organization. Each bandwidth figure follows from transferring one 32-byte cache line in the listed total number of 233-MHz CPU clocks (for example, 32 bytes x 233 MHz / 56 cycles, or about 133 Mbytes/s, for the first EDO access).

Table A. Memory architecture and timing for a system using the Pentium II processor (at 233 MHz) and EDO memory.

Memory level   Bus clock cycles   CPU clock cycles   Total CPU clocks   Bandwidth (Mbytes/s)
L1 cache       {1-1-1-1}          {1-1-1-1}          4                  1,864
L2 cache       {5-1-1-1}          {10-2-2-2}         16                 466
EDO memory     {10-2-2-2}         {35-7-7-7}         56                 133
               {3-2-2-2}          {11-7-7-7}         32                 233
SDRAM          {11-1-1-1}         {39-4-4-4}         51                 146
               {2-1-1-1}          {7-4-4-4}          19                 392

Reference
1. R. Coelho and M. Hawash, DirectX, RDX, RSX, and MMX Technology: A Jumpstart Guide to High Performance APIs, Addison-Wesley, Reading, Mass., 1997.

Synchronous DRAMs. SDRAMs pipeline further and can provide a huge theoretical bandwidth. Basically, the internal state machine enables the enlargement of the pipeline. An SDRAM can sustain this high throughput rate by data-access interleaving over several banks. However, the address must be known several cycles before the data is actually needed. Otherwise the data path will stall, canceling the advantage of the pipelined memory.

Rambus DRAM. RDRAMs exploit a high-speed interface technology to provide transfer speeds of 1 Gbyte/s or more. RDRAMs are connected via a single 16-bit-wide data bus. Hence, each RDRAM chip can drive the 16-bit data bus by itself. This data bus is part of a larger shared Rambus channel, to which every RDRAM chip in the system is attached. This interconnectivity allows multiple memory controllers, theoretically providing multiple times the Rambus channel bandwidth. For example, Sony's PlayStation 2 uses two RDRAM channels, each with a single RDRAM, to achieve a total of 3.2 Gbytes/s memory bandwidth.

The negative side of this single-bus system is that the system latency can be as great as the latency to the RDRAM farthest away from the memory controller. An advantage that RDRAM has over SDRAM is that it has separate row and column control, allowing pipelining of the addressing with the data transfers. This reduces latency when data is needed from the same bank but in another row.

As with SDRAMs, bank conflicts are still possible. However, the high bank count, typical for RDRAMs, reduces the probability of conflicts. For example, a RIMM of eight devices, 16 banks each, has 128 banks.



Figure 10. External data access bottleneck illustration with SDRAM.

The cost to pay for RDRAM's high bandwidth is high power consumption. Power management techniques keep RDRAM chip dissipation under control. Six states of activity are defined, ranging from a power-down mode to the highest attention state. The higher the activity state, the smaller the latency for access, but the greater the power consumption. For power consumption control, usually only a few chips can be active at the same time. Hence, when data is distributed across several RDRAMs, high latencies can be observed when accessing the data sequentially. This is because each chip has to alternate between a low-power mode and a high active state.

Data-memory power bottleneck
Because of the heavy push toward lower-power solutions to keep package costs low, and recently also for mobile applications or to address reliability issues, the power consumption of such external DRAMs has been reduced significantly. Apart from circuit and internal organization techniques for achieving this,4,13 technology modifications such as switching to a silicon-on-insulator (SOI) approach have been considered. With all these techniques for distributing the power consumption from a few hot spots to all parts of the architecture, the end result is indeed a very optimized design for power, where every part of the memory organization consumes a similar amount.4,14

However, little more can be gained in this area, because the bag of tricks now contains only the more complex solutions with a smaller return on investment. Nevertheless, the combination of all these approaches indicates a very advanced circuit technology that still outperforms the current state of the art in data path and logic circuits for low-power design (at least in industry). Hence, the relative power in the nonstorage parts can be reduced still more drastically, if this goal receives similar investments.

Combined with the advance in process technology, all this has led to a remarkable reduction of DRAM-related power: from several watts for the 16- to 32-Mbit generation to about 100 mW for 100-MHz operation in a 256-Mbit DRAM.

Hence, modern stand-alone DRAM chips, which are often of this SDRAM type, also offer low-power solutions, but at a price: Internally, they contain banks and a small cache with a very wide width connected to the external high-speed bus (see Figure 10).15 So low-power operation per bit is only feasible when they operate in burst mode, with entire memory columns, or parts of them, transferred over the external bus. This is not directly compatible with the actual use of the data in the processor data paths; so without a buffer to the processors, most of the data words exchanged would be useless and discarded. Obviously, in this case the effective energy consumption per useful bit becomes very high, so the effective bandwidth is low. Therefore, a hierarchical and typically far more power-hungry intermediate memory organization is needed to match the central DRAM to the data-ordering and bandwidth requirements of the processor data paths.

The reduction of power consumption in fast random-access memories is not as advanced yet as in DRAMs, but this is also an area that is becoming saturated; many circuit- and technology-level tricks have already been applied in SRAMs. As a result, fast SRAMs continue to consume on the order of watts for high-speed operation around 500 MHz. Thus, the memory-related system power bottleneck remains a very critical issue for data-dominated applications.

FROM THE PROCESS technology perspective, the importance of the memory-related system power bottleneck is not surprising, especially for submicron technologies.

The relative power cost of interconnections is increasing rapidly compared with the transistor-related (active circuit) components. Clearly, local data paths and controllers themselves contribute little to this overall interconnect compared to the major data/instruction buses and the internal connections in the large memories. Hence, if all other parameters remain constant, the energy consumption, as well as the delay or area, in the storage and transfer organization will become even more dominant, especially for deep-submicron technologies. The remaining basic limitations lie in transporting the data and the control (such as addresses and internal signals) over large on-chip distances, and in storing them.

One last technological recourse for alleviating the energy-delay bottleneck is to embed the memories as much as possible on chip. This has been the focus of several recent activities, such as the Mitsubishi announcement of an SIMD processor with a large distributed DRAM in 199616 (followed by the offering of embedded DRAM technology by several other vendors) and the Intelligent RAM (IRAM) initiative of Dave Patterson's group at UC Berkeley.17 The results show that the option of embedding logic on a DRAM process leads to a reduced power cost and an increased bandwidth between the central DRAM and the rest of the system. This is indeed true for applications where the increased processing cost is acceptable.18 However, it is a one-time drop, after which the widening energy-delay gap between the storage and the logic will continue to progress, because of the unavoidable evolution of the relative interconnect contributions.

Thus, it is clear that for midterm (and even short-term) projects, the access and power bottlenecks should be broken by other, nontechnological means. This is feasible with quite spectacular effects through a more optimal design of the memory organization and system-level code transformations applied to the initial application specification. (See "Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors," by F. Catthoor et al., and "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al.) The price paid for these solutions is increased system design complexity, which can be offset with appropriate design methodology support tools. This requires looking at platform architecture design and mapping methodologies from a new perspective. In addition, architectures will have to become more compiler friendly.19

References
1. G. Lawton, "Storage Technology Takes the Center Stage," Computer, vol. 32, no. 11, Nov. 1999, pp. 10-13.
2. C. Kozyrakis et al., "Scalable Processors in the Billion-Transistor Era: IRAM," Computer, vol. 30, no. 9, Sept. 1997, pp. 75-78.
3. R. Evans and P. Franzon, "Energy Consumption Modeling and Optimization for SRAMs," IEEE J. Solid-State Circuits, vol. SC-30, no. 5, May 1995, pp. 571-579.
4. K. Itoh, "Low Voltage Memory Design," in tutorial on "Low Voltage Technologies and Circuits," Proc. IEEE Int'l Symp. Low-Power Design, IEEE CS Press, Los Alamitos, Calif., 1997.
5. B. Prince, "Memory in the Fast Lane," IEEE Spectrum, Feb. 1994, pp. 38-41.
6. G. Slavenburg, S. Rathnam, and H. Dijkstra, "The Trimedia TM1 PCI VLIW Media Processor," 8th Hot Chips Symp., 1996.
7. N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Reading, Mass., 1993.
8. D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, 1996.
9. N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1990, pp. 364-373.
10. T.F. Chen and J.L. Baer, "Effective Hardware-Based Prefetching for High-Performance Processors," IEEE Trans. Computers, May 1995, pp. 609-623.
11. T. Johnson and W. Hwu, "Run-Time Adaptive Cache Hierarchy Management via Reference Analysis," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1997, pp. 315-326.
12. R. Comerford and G. Watson, eds., "Memory Catches Up," IEEE Spectrum, Oct. 1992, pp. 34-57.


13. T. Yamagata et al., "Circuit Design Techniques for Low-Voltage Operating and/or Giga-Scale DRAMs," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 248-249.
14. T. Seki et al., "A 6-ns 1-Mb CMOS SRAM with Latched Sense Amplifier," IEEE J. Solid-State Circuits, vol. SC-28, no. 4, Apr. 1993, pp. 478-483.
15. T. Kirihata et al., "A 220 mm2, Four- and Eight-Bank, 256 Mb SDRAM with Single-Sided Stitched WL Architecture," IEEE J. Solid-State Circuits, vol. SC-33, Nov. 1998, pp. 1711-1719.
16. T. Tsuruda et al., "High-Speed, High Bandwidth Design Methodologies for On-Chip DRAM Core Multi-Media System LSIs," Proc. IEEE Custom Integrated Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1996, pp. 265-268.
17. D.A. Patterson et al., "Intelligent RAM (IRAM): Chips that Remember and Compute," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1997, pp. 224-225.
18. N. Wehn and S. Hein, "Embedded DRAM Architectural Trade-Offs," Proc. First Design and Test in Europe Conf., IEEE Press, Piscataway, N.J., 1998, pp. 704-708.
19. N. Mitchell, L. Carter, and J. Ferrante, "A Compiler Perspective on Architectural Evolutions," IEEE TC on Computer Architecture Newsletter, June 1997, pp. 7-9.

Lode Nachtergaele is a senior research engineer at IMEC, responsible for the technical coordination of IMEC's Multimedia Image Compression Systems (MICS) group. His research interests include distilling an operational system design methodology that improves design times of embedded multimedia systems, using the resulting design flow as a stepping stone to future application challenges. Nachtergaele has a BS in industrial engineering from the Katholieke Industriele Hogeschool, Oostende, Belgium.

Chidamber Kulkarni is a PhD student in DESICS at IMEC. His research interests include design methods and tools for cost-efficient embedded implementation of multimedia applications and architectural support for software systems. Kulkarni has a BE in electronics and communication engineering from Karnataka University, India, and an ME and PhD in electrical engineering from the Katholieke Universiteit, Leuven.

The biography of Francky Catthoor appears on page 4 in this issue.

Direct questions and comments about this article to Francky Catthoor, IMEC, Kapeldreef 75, B3001 Leuven, Belgium; [email protected].
