Random-Access Storage Components in Customized Architectures

Lode Nachtergaele

Francky Catthoor

Chidamber Kulkarni

IMEC

This tutorial covers the basic design choices involved in customized memory organizations, including those for register files, local memory, caches, and main memory.

STORAGE TECHNOLOGY TAKES center stage in increasingly more systems because of the eternal push for more complex applications, especially those with larger and more complicated data types.1 In addition, the access speed, size, and power consumption associated with storage form a severe bottleneck in these systems, especially in the embedded context. In this article, we will investigate several building blocks and overall organizations for memory storage, with a focus on customized memory architectures. The main emphasis will be on the generic storage components that are used in modern multimedia- and telecom-oriented processors, of both the digital-signal-processor (DSP) and application-specific types. We will not address the specific memory organizations (using instantiations of these components) that are part of complete system designs.

The content of this article is at the tutorial level; we intend it mainly as an introduction to other articles that explore customized memory organizations. For more details, see the references cited. An article by Kozyrakis et al.,2 for example, discusses the impact of embedded DRAM technology on the global memory organization of more general-purpose programmable multiprocessors.

General principles and storage classification
The goal of a storage device is to store a number of n-bit data words for a short or long term. Under control of the address unit(s), these data words are transferred to the custom processing units (which we'll call processors, for simplicity) at the appropriate point in time (cycle), and the results of the operations are then written back in the storage device for future use. Because of the different characteristics of storage and access, different styles of devices have been developed.

Among memories for frequent and repetitive use, we can see a very important difference between short- and long-term storage. Short-term memories are normally located very close to the operators and require a very short access time so that they can be accessed in the same cycle as the arithmetic or logic operation applied to the resulting data. Consequently, they should be limited to a relatively small capacity (typically less than 128 words), and are usually taken up in feedback loops over the operators (for example, the loop RegfA-BufA-BusA-EXU-RegfA in Figure 1) or at the input of execution units. This kind of short-term memory is typically called foreground memory.

Figure 1. Register file illustration in a feedback loop of a simple customized data path.

Devices for longer-term storage generally cater to far larger capacities (from 256 to 32 million 32-bit words) and take a separate cycle for read or write access. They are connected through a data bus (which can be partly off chip) with the execution units.

We can make six other important distinctions using the tree-like "genealogy" of storage devices presented in Figure 2. The tree on the left focuses on the external usage behavior of the storage approach. It contains three layers.

The first layer addresses read-only versus read/write (R/W) access. Some memories (ROMs, for example) are used only to store constant data. When the addresses are sparsely spread or there are multilevel logic circuits, especially when the amount of data is relatively small, good storage choices are devices such as programmable logic arrays (PLAs). In most cases, however, data must be overwritable at high speed, meaning read and write are treated with the same priority (R/W access), such as in random-access memories (RAMs). In some cases, ROMs can be made electrically alterable (EAROM) or erasable ("write-few") or programmable (PROM) using means such as fuses. This article covers only R/W memories.

The next layer of Figure 2a concerns whether or not the memory is volatile. For R/W memories, the data is usually removed once the power goes down. In some cases, this problem can be avoided, but these nonvolatile options are expensive and slow. Examples are magnetic media and tapes, which are intended for slow access of mass data, or flash RAMs, which also allow higher speeds. We will restrict ourselves to what is common on most chips, namely volatile memories.

The last layer of Figure 2a concerns the address mechanism. Some devices require only sequential addressing, such as first-in, first-out (FIFO) queues or first-in, last-out (FILO) stack structures. Sequential addressing puts a tight restriction on the order in which the data are read out. However, for many applications, this restriction is acceptable. A more general, but still sequential, access order is available in pointer-addressed memory (PAM). The main limitation of PAM is that each data value is written and read once in any statically predefined order.


Figure 2. Storage classification trees: usage (a) and memory realization (b).


Figure 3. Hierarchical memory pipeline: foreground register file (two-port), L1 cache (16K-64K SRAM, one-port), L2 cache (512K-4M S(D)RAM, one-port), main memory (32M-512M (S)DRAM, one-port), and disk level (2G-16G DRAM bank or hard disk).

However, in most cases the address sequence should be allowed to be random (including repetition). Usually these devices are implemented with a direct-addressing scheme, typically called a RAM. An important requirement is that the access time be independent of the selected address. Many programmable processors use a special case of random-access-based buffering, exploiting comparisons of tags and usually also including associativity (in what is called a cache buffer).

The tree on the right in Figure 2 focuses on the way the memory is realized as a circuit or architecture component. It also contains three layers:

1. Number of independent addresses and corresponding gateways (buses) for access. This parameter can be one (single-port device), two (dual-port device), or even more (multiport device). Any of these ports can be for reading only, writing only, or R/W. Of course, the area occupied increases considerably with the number of ports.
2. Internal organization. The memory can be meant for capacities that remain small or that can become large. Here, designers usually must trade off between speed and area efficiency. The register files we describe in the next section constitute an example of fast, small-capacity organizations, which are usually also dual- or even multiported. The queues and stacks are meant for medium-sized capacities. The RAMs can become extremely large (up to 1 Gbit in state-of-the-art devices), but they are also far slower in random access.
3. Static or dynamic. For R/W memories, the data can remain valid as long as VDD is on (static cell, as in SRAM), or the data should be regularly refreshed (dynamic cell, as in DRAM). Within the dynamic class, high-throughput synchronous DRAMs (SDRAMs) have recently become available. For circuit-level issues, see overview articles such as Evans and Franzon3 (for SRAMs), and Itoh4 and Prince5 (for DRAMs).

Custom processors for data-dominated applications are driven from a customized memory organization. In programmable instruction set processors, this organization is usually constructed on a rigorous hierarchy, almost like a bidirectional pipeline, from the disk (possibly with a disk cache to hide the long latency) over the main memory and the L2/L1 cache to the multiport register file (see Figure 3). Still, the pipeline becomes increasingly saturated and blocked because of the large latencies introduced compared to the CPU clock cycles in current process technologies.

Register files and local memory organization
Figure 4 shows an illustrative organization for a dual-port register file with two address buses, where the separate read and write addresses are generated from an address calculation unit (ACU). This organization uses two data buses (A and B), but only in one direction, so the write and read addresses directly control the port access. In general, the number of different address words can be smaller than the number of ports when they are shared (for example, for either read or write), and the buses can be bidirectional. Additional control signals decide whether to write or read and for which port the address applies. The number of address bits per word is log2(N).

Figure 4. Two-port register file with a capacity of N words, with both read and write address supplied in parallel from an address calculation unit.

The register file of Figure 4 can be used very efficiently in the feedback loop of a data path, as already illustrated in Figure 1. In general, it is used only for the storage of temporary variables in the application running on the data path (sometimes also referred to as the execution unit). Such register files are also used heavily in most modern general-purpose RISC chips and especially in modern multimedia-oriented signal processors, which have register files with up to 128 locations. (In this case, it becomes doubtful whether a register file is really a good option, because of the very high area and power penalty.)

Multimedia-oriented VLIW processors or recent superscalar processors give register files very large access bandwidth and many ports; examples are the 17-port iWarp register file and the 15-port TriMedia register file.6 Application-specific instruction set processors (ASIPs) and custom processors make heavy use of register files for the same purpose. Although these register files have the clear advantage of very fast access, the number of data words to be stored should be minimized as much as possible because of their power- and area-intensive structure, due to both the decoder and the cell overhead.7

Cache memory organization
Nearly every modern programmable processor, especially those intended as general-purpose devices, uses caches.

Basic cache architecture
Caches are mainly intended to exploit the temporal and spatial locality of memory accesses.8 Here, we summarize the different steps involved in cache operation. For our explanation, we'll use a direct-mapped cache as shown in Figure 5 (see also the "Cache architecture design choices" sidebar), but the basic principles remain the same for other types of cache memories.8 The following steps occur whenever there is a read/write from or to a cache:

1. address decoding;
2. selection based on index, offset, or both;
3. tag comparison; and
4. data transfer.

In Figure 5 and Figure 6, these steps are highlighted in gray circles.

Figure 5. Eight-Kbyte, direct-mapped cache with 32-byte blocks. The CPU address splits into a 21-bit tag, an 8-bit index selecting one of 256 blocks, and a 5-bit block offset; a valid bit and a tag comparator decide hits, and a write buffer sits between the cache and the lower-level memory.
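To make the four steps concrete for the organization of Figure 5, the following C sketch (our own illustration; the type and function names are not taken from any particular library) decodes an address into tag, index, and block offset and performs a direct-mapped lookup:

#include <stdint.h>

/* Parameters of the Figure 5 organization: 8-Kbyte cache, 32-byte blocks,
   so 256 blocks, a 5-bit block offset, an 8-bit index, and the remaining
   address bits forming the tag. */
#define BLOCK_BYTES 32u
#define NUM_BLOCKS  256u
#define OFFSET_BITS 5u
#define INDEX_BITS  8u

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_BLOCKS];

/* Returns 1 on a hit and copies the requested byte into *out;
   returns 0 on a miss (the block would then be fetched from the
   lower-level memory and the valid/tag fields refilled). */
int cache_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (BLOCK_BYTES - 1);                  /* step 1: address decoding   */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    cache_line_t *line = &cache[index];                          /* step 2: selection by index */

    if (line->valid && line->tag == tag) {                       /* step 3: tag comparison     */
        *out = line->data[offset];                               /* step 4: data transfer      */
        return 1;
    }
    return 0;
}

On a miss, a real controller would fetch the 32-byte block from the lower-level memory and refill the line before completing the access.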


Cache architecture design choices

Several important decisions are involved in the architecture design of a cache. Here, we discuss the choice of line size, degree of associativity, updating policy, and replacement policy.

Line size
The line size is the unit of data transfer between the cache and the main memory. In many publications, line size is also called block size. As the line size increases from very small to very large, the miss ratio initially decreases, because a miss fetches more data at a time. Further increases then cause the miss ratio to increase, as the probability of using the newly fetched (additional) information becomes less than the probability of reusing the data that is replaced, and also as fewer lines fit into the cache.
The optimal line size is completely algorithm dependent. In general, very large line sizes are not desirable, because they contribute to larger load latencies and increased cache pollution. This is true for both general-purpose and embedded applications.

Mapping and associativity
The process of transferring data elements from main memory to the cache is called mapping. Associativity refers to the process of retrieving several cache lines and then determining if any of them is the target. The degree of associativity and the type of mapping significantly affect cache performance. Most caches are set associative; an address maps to a set, and then an associative search is made of that set (see Figures 5 and 6 in the main text).
Empirically, and as one would expect, increasing the degree of associativity decreases the miss ratio. The highest miss ratios are observed for direct-mapped caches; two-way associativity is significantly better, and four-way is slightly better still. Further increases in associativity only slowly decrease the misses. Nevertheless, a cache with greater associativity requires more tag comparisons, and these comparators constitute a significant amount of the total power consumption in the memories. Thus, for embedded applications, where power consumption is an important consideration, associativities greater than four or eight are not commonly observed. Articles by Su and Despain1 and Ko, Balsara, and Nanda2 include some architectural techniques for low-power caches.

Updating policy
The process of maintaining coherency between two consecutive memory levels is termed the updating policy. There are two basic approaches to updating the main memory: write through and write back. With write through, all writes are immediately transmitted to the main memory (apart from writing to the cache); with write back, most writes are written to the cache and are then copied back to the main memory as those lines are replaced.
Initially, write through was the preferred updating policy, because it is very easy to implement and solves the problems of data coherence. But it also generates a lot of traffic between the various levels of memory. Hence, most current state-of-the-art media and DSP processors use a write-back policy. Lately, we have seen a trend toward giving control of the updating policy to the user (or compiler). This can be effectively exploited to reduce power, because of reduced write backs, by compile-time analysis.

Replacement policy
A cache's replacement policy refers to the type of protocol used to replace a partial or complete line in the cache on a cache miss. Typically, a least recently used (LRU) policy is preferable, because it has acceptable results.
Current state-of-the-art general-purpose and DSP processors have hardware-controlled replacement policies; hardware counters monitor the least recently used data in the cache. In general, we have observed that a policy based on compile-time analysis always has significantly better results than fixed statistical decisions.

References
1. C. Su and A. Despain, "Cache Design Trade-Offs for Power and Performance Optimization: A Case Study," Proc. IEEE Int'l Symp. Low-Power Design, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 63-68.
2. U. Ko, P. Balsara, and A. Nanda, "Energy Optimization of Multi-Level Processor Cache Architectures," Proc. IEEE Int'l Workshop on Low Power Design, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 45-50.

Figure 6. Eight-Kbyte, two-way associative cache with 32-byte blocks. Compared with Figure 5, the index shrinks to 7 bits (128 sets), the tag grows to 22 bits, and two tag comparators operate in parallel on the selected set.
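Under the same assumptions as the direct-mapped sketch above, a two-way lookup for the organization of Figure 6 differs only in using one fewer index bit and in comparing the tags of both ways of the selected set (again our own illustration):

#include <stdint.h>

#define BLOCK_BYTES 32u
#define NUM_SETS    128u   /* 256 blocks arranged as 128 sets of 2 ways        */
#define OFFSET_BITS 5u
#define INDEX_BITS  7u     /* one bit fewer than in the direct-mapped case     */
#define WAYS        2u

typedef struct {
    int      valid;
    uint64_t tag;          /* one bit wider than in the direct-mapped case     */
    uint8_t  data[BLOCK_BYTES];
} way_t;

static way_t sets[NUM_SETS][WAYS];

int cache2_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (BLOCK_BYTES - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    for (unsigned w = 0; w < WAYS; w++) {      /* n tag comparisons, done in parallel in hardware */
        way_t *line = &sets[index][w];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];
            return 1;                          /* hit in way w                                     */
        }
    }
    return 0;                                  /* miss: a replacement policy (for example, LRU) picks the way to refill */
}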

The first step performs the task of decoding the address, supplied by the CPU, into the block address and the block offset. The block address is further divided into the cache's tag and index addresses. Once the address decoding is complete, the index part of the block address obtains the particular cache line demanded by the CPU. This line is chosen only if the valid bit is set and the tag portion of the data cache and the block address match each other. This involves comparing all the individual bits. For a direct-mapped cache, we have only one tag comparison. After this process, the block offset obtains the particular data element in the chosen cache line. This data element is now transferred to or from the CPU.

Figure 5 shows a direct-mapped cache; an n-way set-associative cache would result in the following differences:

- Rather than just one, n tag comparisons are used.
- Fewer index bits and more tag bits are present. Figure 6, which shows a two-way set-associative cache, illustrates this circumstance.

In the sidebar, we briefly discuss some of the issues that must be considered during cache memory design.

Caches are prime candidates for a realization with embedded-memory technology. The L1 cache, which supplies data directly to the processor data paths, requires an access speed that is still too high to realize with DRAM components, so SRAM is preferred for that level. For the much larger L2 cache, where the access requirements are less stringent, eDRAM is a very good candidate technology, especially when based on the synchronous DRAM principles. (For a discussion of these issues in the context of programmable processors, see also Kozyrakis et al.2)


for (i=3 ; i<11 ; i++) b[i-1] = b[i-3] + a[i] + a[i-3] ;


Figure 7. Initial algorithm and cache states corresponding to different iterations i of that algorithm for a fully associative cache. Diagonal bars represent cache misses. This example has 26 cache misses: 26/32, an 81% miss rate.

Designers use three main types of architectural enhancement techniques to reduce compulsory misses, conflict misses, and write backs: victim buffers (or similar buffers that store a limited number of words evicted from the cache at its different ports),9 hardware or software prefetching,10 and cache locking.6 Cache locking used for reduced write backs and software prefetching both need extensive compiler support (see "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., in this issue). Other researchers have written about interesting techniques related to cache and memory hierarchies in the context of low power and reduced memory bandwidth.11

Classification of cache misses
Whenever data requested by the CPU is not found in the cache, a cache miss has occurred. Essentially, three types of cache misses exist:

- Compulsory cache misses. The first access to the block is not in the cache, so the block must be brought into the cache.
- Capacity cache misses. If the cache cannot contain all the blocks needed during execution of a program (even when it is fully associative), capacity misses will occur because of blocks being discarded and later retrieved. (An alternative definition for capacity misses is the total number of misses in a fully associative cache.)
- Conflict misses. If the block placement strategy is set-associative or direct-mapped, conflict misses occur (in addition to compulsory and capacity misses), because a block can be discarded and later retrieved if too many blocks map to its set. Lower associativity results in more conflicts. The difference between the total number of cache misses in a direct-mapped or set-associative cache and that of a fully associative cache is defined as the number of conflict misses.

46 IEEE Design & Test of Computers for (i=3 ; i<11 ; i++) b[i−1] = b[i−3] + a[i] + a[i−3] ;


Figure 8. Initial algorithm and cache states corresponding to different iterations i of that algorithm for a direct- mapped cache. Diagonal bars represent cache misses. This example has a 100% miss rate: 32/32.
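The miss counts cited for Figures 7 and 8 can be reproduced with a small, self-contained simulation. The sketch below assumes, as stated in the accompanying discussion, an 8-entry cache with 1-byte lines, array a[] at addresses 0 to 10, and array b[] at addresses 11 to 21; the per-iteration access order a[i-3], a[i], b[i-3], b[i-1] and the LRU replacement used for the fully associative case are our own assumptions, chosen to match the figures:

#include <stdio.h>

#define LINES 8

static int fully_assoc_misses(const int *trace, int n)
{
    int addr[LINES], age[LINES], used = 0, misses = 0;
    for (int t = 0; t < n; t++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (addr[j] == trace[t]) hit = j;
        if (hit >= 0) { age[hit] = t; continue; }        /* hit: refresh LRU age        */
        misses++;
        if (used < LINES) { addr[used] = trace[t]; age[used] = t; used++; }
        else {                                           /* evict the least recently used */
            int lru = 0;
            for (int j = 1; j < LINES; j++)
                if (age[j] < age[lru]) lru = j;
            addr[lru] = trace[t]; age[lru] = t;
        }
    }
    return misses;
}

static int direct_mapped_misses(const int *trace, int n)
{
    int addr[LINES], valid[LINES] = {0}, misses = 0;
    for (int t = 0; t < n; t++) {
        int line = trace[t] % LINES;                     /* direct mapping: address modulo 8 */
        if (!valid[line] || addr[line] != trace[t]) {
            misses++;
            valid[line] = 1;
            addr[line] = trace[t];
        }
    }
    return misses;
}

int main(void)
{
    int trace[32], n = 0;
    for (int i = 3; i < 11; i++) {          /* b[i-1] = b[i-3] + a[i] + a[i-3] */
        trace[n++] = i - 3;                 /* read  a[i-3] at address i-3     */
        trace[n++] = i;                     /* read  a[i]   at address i       */
        trace[n++] = 11 + (i - 3);          /* read  b[i-3] at address 11+i-3  */
        trace[n++] = 11 + (i - 1);          /* write b[i-1] at address 11+i-1  */
    }
    printf("fully associative: %d/32 misses\n", fully_assoc_misses(trace, n));
    printf("direct mapped:     %d/32 misses\n", direct_mapped_misses(trace, n));
    return 0;
}

Run as written, this reproduces the 26 and 32 misses quoted in the captions.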

In most real-life applications, capacity and conflict misses are dominant. Hence, reducing these misses is vital to achieving better performance and reducing power consumption.

Now let's look at an example that illustrates these three types of cache misses. Figure 7 and Figure 8 show the detailed cache states for a fully associative cache and a direct-mapped cache.


A diagonal bar on an element in these figures indicates that the particular element is replaced by the next element (in the next column of the same row without a bar) because of the (hardware) cache-mapping policy. Hence, every diagonal bar represents a cache miss. The main memory layout is assumed to be single and contiguous: array a[] resides in locations 0 to 10, and array b[] in locations 11 to 21. For this illustration, we use a cache of size 8 bytes and a line size of 1 byte.

We observe from Figure 7 that the algorithm needs 32 data accesses. To complete these 32 data accesses, the fully associative cache requires 26 cache misses. Out of these 26 misses, 21 are compulsory and the remaining five are capacity misses. The direct-mapped cache requires 32 misses, as seen in Figure 8. This means that all 32 data accesses result in misses, and the on-chip direct-mapped cache is not exploiting the available data reuse at all (for the given access order). Thus, for the algorithm in our example, we have 21 compulsory misses, five capacity misses, and six conflict misses.

Hardware- and software-controlled caches
Until recently, the embedded hardware cache controller steered most cache behavior, a reasonable setup if little can be analyzed about the behavior of the application at compile time. This is the case for programs that are heavily dependent on user interaction, such as office automation tools and debuggers. Over the years, however, the cache controller became more and more costly in terms of both area and power.

For many modern applications such as multimedia and telecommunications, cache behavior can be analyzed at compile time far more easily, even when irregular but manifest access patterns are present. Still, the code that is initially written by designers usually lacks locality (temporal, spatial, or both); in that case, the hardware controller no longer does a good job. However, recently developed, advanced compiler technology is simplifying analysis of such applications (see "Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors," by F. Catthoor et al., and "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., both in this issue). Therefore, the newer cache architectures make more software control available. An example is the software-lockable cache in the Philips TriMedia.6

Table 1 lists the major differences between hardware- and software-controlled caches for state-of-the-art multimedia and DSP processors.

Table 1. Differences between hardware and software caches for current state-of-the-art multimedia and DSP processors.

Characteristic        Hardware control    Software control
Basic concepts        Lines and sets      Lines
Data transfer         Hardware            Partly software
Updating policy       Hardware            Software
Replacement policy    Hardware            Software

First, the hardware-controlled caches rely on the hardware to do the data cache management. Hence, to perform this task, the hardware uses basic concepts of cache lines and sets. On the other hand, for software-controlled caches, the compiler (or the user) performs cache management. The complexity of designing a software cache is far lower and requires only a basic concept of unit data transfer equivalent to a cache line.

Second, the hardware performs the data transfer based on the execution order of the algorithm at runtime, using fixed statistical measures. In contrast, for the software-controlled cache, either the compiler or the user performs this task. This is currently possible using high-level compile-time directives such as ALLOCATE() and link-time options such as LOCK to lock certain data in part of the cache, through the compiler or linker.6

Third, the most important difference between hardware- and software-controlled caches is in the way the next higher level of memory is updated, namely the way the caches maintain coherence of data. For a hardware-controlled cache, the hardware writes data to the next higher level of memory either every time a write occurs or when the particular cache line is evicted. In contrast, for the software-controlled cache, the compiler decides when and whether to write back a particular data element.6 This results in a large reduction in the amount of data transfers between the different memory levels, contributing to lower power and reduced bandwidth usage by the algorithm.

Finally, the hardware-controlled cache needs an extra bit for every cache line to determine the least recently used data, which will be replaced on a cache miss. For the software-controlled cache, the compiler manages the data transfer, so no need exists for additional bits or a particular replacement policy.
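As a minimal sketch of what such software-steered transfer amounts to, assume a data-path kernel process_block() and a local buffer standing in for the cache lines; the names and the copy-in/copy-back calls below are our own illustration, not the ALLOCATE()/LOCK directives mentioned above:

#include <string.h>

#define BLOCK 64

void process_block(int *buf, int n);   /* assumed data-path kernel */

void software_managed_pass(int *big_array, int nblocks)
{
    int local[BLOCK];                                         /* plays the role of the cache lines   */
    for (int b = 0; b < nblocks; b++) {
        memcpy(local, &big_array[b * BLOCK], sizeof local);   /* explicit fetch                      */
        process_block(local, BLOCK);                          /* all further accesses stay local     */
        memcpy(&big_array[b * BLOCK], local, sizeof local);   /* explicit write back                 */
        /* If compile-time analysis shows the block is read-only, the copy-back is
           simply omitted, which is the write-back reduction referred to in the text. */
    }
}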
Main and global-hierarchical memory organization
As mentioned earlier, the access bottleneck is especially severe for main memories in the hierarchical pipeline (see Figure 3). This bottleneck is due to the large sizes of these main memories, and partly to their location (until now) off chip, which causes a highly capacitive on/off-chip bus. Because of the use of embedded (S)DRAM technology, the latter problem can be solved for several new applications.

However, in many cases, this new solution will work at the L2 cache level but, because of the large size requirements, not yet for the main memory. Kozyrakis et al.2 discuss this issue, indicating that modern multiprocessors still rely on a (large) central shared main memory (at least in terms of the address space). Even with eDRAM-based main memory solutions, the central shared memory would still impose heavy restrictions on the bandwidth; nor would eDRAM sufficiently solve the large energy consumption issues. So the customization benefits available with an on-chip eDRAM solution are not really suited for the main memory stage in the hierarchy. Hence, the bottlenecks still largely remain.

SRAMs are excluded for the large sizes required for main memories, because of their significantly lower density, more than an order of magnitude lower than that of (S)DRAMs.

The literature on storage includes proposals for a large variety of possible types of RAMs for use as main memories, and research on RAM technology is still very active, as demonstrated by interest in meetings such as the International Solid-State Circuits (ISSCC) and Custom Integrated Circuits (CICC) conferences. Several summary articles on this research are available.1,5,12 The general organization depends on the number of ports, but usually single-port structures are present in high-density memories. This is also the restriction here.

Most other distinguishing characteristics relate to the circuit design (and the technological issues) of the RAM cell, the decoder, and the auxiliary R/W devices. One main conclusion of a literature study on the recent evolution of main memory is that the main emphasis, in almost all avenues of research, is on the access structure and not on the way the actual storage takes place in the "cell." This access structure, which includes the selection part and the input/output part, starts to dominate the overall cost in terms of power, speed, and throughput (assuming, of course, that leakage power in the memory matrix is kept under control). When power-sensitive memories are constructed, the area contribution also becomes more balanced between the heavily distributed memory matrices and the access structures.

The external data access bottleneck
In most cases, a processor requires one or more large external memories to store the long-term data (mostly of the DRAM type). For data-dominated applications in the past, total system power cost was for a large part due to the presence of these external memories on the board.

Until recently, most DRAMs were of the pipelined page-mode type. Pipelined memory implementations improve a memory's throughput, but they don't improve the latency. For instance, in an extended data-out (EDO) RAM, the address decoding and the actual access to the storage matrix (of the previous memory access) occur in parallel. In that case, the access sequence to the DRAM is also important.

Figure 9 shows a DRAM's principal blocks. The address bus is divided into a column address and a row address, and these two parts are latched. This allows the device to pipeline the transfer of the address in two stages. To make this work, two control pins are necessary: the row address strobe (RAS) pin controls the row address latch, and the column address strobe (CAS) pin controls the column address latch.

Figure 9. Illustration of DRAM operation.
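As a small illustration of this address multiplexing (the row/column geometry below is an assumption made only for this sketch, not a value from the article), a linear address is simply split into the two halves that are driven onto the shared address pins one after the other:

#include <stdint.h>

/* Assumed geometry for illustration: 4,096 rows by 1,024 columns. */
#define COL_BITS 10u
#define COL_MASK ((1u << COL_BITS) - 1u)

typedef struct { uint32_t row; uint32_t col; } dram_addr_t;

/* The row part is latched first (with RAS), the column part second (with CAS). */
dram_addr_t split_dram_address(uint32_t addr)
{
    dram_addr_t a;
    a.row = addr >> COL_BITS;   /* driven on the address pins while RAS is asserted */
    a.col = addr & COL_MASK;    /* driven on the same pins while CAS is asserted    */
    return a;
}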


To read data from DRAM, the following steps must take place (see also Figure 9):

1. The row address is placed on the address pins via the address bus.
2. The RAS pin is activated, placing the row address onto the row address latch.
3. The row address decoder selects the proper row to be sent to the sense amplifiers.
4. The write-enable is deactivated so that the DRAM knows that it's not being written to.
5. The column address is placed on the address pins via the address bus.
6. The CAS pin is activated, placing the column address on the column address latch.
7. The CAS pin also serves as the output enable, so once the CAS signal is stable, the sense amplifiers place the data from the addressed row and column on the data-output pins, and the data can be transferred to the processor.
8. The RAS and CAS are both deactivated so that the cycle can begin again.

Fast page-mode DRAMs. FPM DRAMs are currently the most used category of DRAMs; they are faster than previous-generation DRAMs because of their ability to work within a page (the section of memory available within a row address). Within one specific row are several columns of memory bits. With FPM DRAMs, you need only specify the row address once for accesses within the same page. Successive accesses to the same page of memory require selecting only a column address, thus saving time in accessing the memory. The electronic standards agency, JEDEC, has specified standards for FPM DRAM.

Extended data-out DRAMs. EDO DRAMs work very similarly to FPM DRAMs, also having the ability to work within a page. The primary advantage of EDO over FPM is that EDO DRAMs hold the data valid even after the signal that strobes the column address goes inactive. Faster microprocessors can thus manage time more efficiently, performing many tasks without having to attend to slower memory. That is, while the EDO DRAM is retrieving an instruction for the microprocessor, the microprocessor can perform other tasks without worrying that the data will become invalid. JEDEC has also specified standards for the EDO DRAM.

Example. Consider the following C program:

const int N=100;
int A[N][N];
int r,c;
for (c=0; c < N; c++) {
  for (r=0; r < N; r++) {
    ... = foo(A[r][c]);
  }
}

The physical address of array A, assuming a row-wise organization, is baseaddress(A) + 100 × r + c. Hence, every time the inner iterator r increments, the DRAM address is increased by 100. This means the row address must be reapplied to the DRAM and latched into its latch; hence several cycles are lost.

This problem can easily be alleviated by a loop interchange:

const int N=100;
int A[N][N];
int r,c;
for (r=0; r < N; r++) {
  for (c=0; c < N; c++) {
    ... = foo(A[r][c]);
  }
}

This is very obvious in the preceding example, but in general it takes a systematic global approach to trade off all possibilities. "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al., addresses this topic. The "EDO DRAM access" sidebar provides a more elaborate example.
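A rough model of why the interchange helps is to count how often the row (page) address seen by the DRAM changes during the traversal; the 1-Kbyte page size and 4-byte element size below are assumptions made only for this sketch:

#include <stdio.h>

#define N          100
#define PAGE_BYTES 1024

static int row_activations(int row_major)
{
    long last_page = -1;
    int activations = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int r = row_major ? i : j;               /* which index is the inner iterator   */
            int c = row_major ? j : i;
            long byte_addr = (long)(r * N + c) * 4;  /* row-wise layout, 4 bytes per element */
            long page = byte_addr / PAGE_BYTES;      /* row address seen by the DRAM         */
            if (page != last_page) { activations++; last_page = page; }
        }
    }
    return activations;
}

int main(void)
{
    printf("column-major traversal: %d row activations\n", row_activations(0));
    printf("row-major traversal:    %d row activations\n", row_activations(1));
    return 0;
}

With these assumed numbers, the row-major traversal opens each page only once (about 40 activations in total), whereas the column-major traversal re-opens pages throughout the inner loop, roughly a hundred times more often.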
EDO DRAM access example

Let's work out a brief example, directly based on the material in Coelho and Hawash.1 The data sheet of an enhanced data-out (EDO) memory chip specifies a sequence such as {10-2-2-2}{3-2-2-2}, where the numbers represent clock cycles. Each curly bracket indicates four bus cycles of 64 bits each, that is, one cache line. The first sequence, {10-2-2-2}, specifies the timing if the page is first opened and accessed four times. The second sequence, {3-2-2-2}, specifies the timing if the page was already open and accessed four additional times; that means you did not open and access any other memory page in the meantime. The last sequence repeats as long as you access memory within the same page.

The data sheet used relates to a memory bus running at 66 MHz. For another processing speed, such as a 233-MHz bus, the timing becomes {35-7-7-7}{11-7-7-7} in processor clocks. Table A provides the timing in the different parts of the overall memory organization. Each bandwidth figure follows from transferring one 32-byte cache line in the listed total number of 233-MHz CPU clocks (for example, 32 bytes x 233 MHz / 56 cycles, or about 133 Mbytes/s, for the first EDO access).

Table A. Memory architecture and timing for a system using the Pentium II processor (at 233 MHz) and EDO memory.

Memory level   Bus clock cycles   CPU clock cycles   Total CPU clocks   Bandwidth (Mbytes/s)
L1 cache       {1-1-1-1}          {1-1-1-1}          4                  1,864
L2 cache       {5-1-1-1}          {10-2-2-2}         16                 466
EDO memory     {10-2-2-2}         {35-7-7-7}         56                 133
               {3-2-2-2}          {11-7-7-7}         32                 233
SDRAM          {11-1-1-1}         {39-4-4-4}         51                 146
               {2-1-1-1}          {7-4-4-4}          19                 392

Reference
1. R. Coelho and M. Hawash, DirectX, RDX, RSX, and MMX Technology: A Jumpstart Guide to High Performance APIs, Addison-Wesley, Reading, Mass., 1997.

Synchronous DRAMs. SDRAMs pipeline further and can provide a huge theoretical bandwidth. Basically, the internal state machine enables the enlargement of the pipeline. An SDRAM can sustain this high throughput rate by data-access interleaving over several banks. However, the address must be known several cycles before the data is actually needed. Otherwise the data path will stall, canceling the advantage of the pipelined memory.

Rambus DRAM. RDRAMs exploit a high-speed interface technology to provide transfer speeds of 1 Gbyte/s or more. RDRAMs are connected via a single 16-bit-wide data bus. Hence, each RDRAM chip can drive the 16-bit data bus by itself. This data bus is part of a larger shared Rambus channel, to which every RDRAM chip in the system is attached. This interconnectivity allows multiple memory controllers, theoretically providing multiple times the Rambus channel bandwidth. For example, Sony's PlayStation 2 uses two RDRAM channels, each with a single RDRAM, to achieve a total of 3.2 Gbytes/s memory bandwidth.

The negative side of this single-bus system is that the system latency can be as great as the latency to the RDRAM farthest away from the memory controller. An advantage that RDRAM has over SDRAM is that it has separate row and column control, allowing pipelining of the addressing with the data transfers. This reduces latency when data is needed from the same bank but in another row.

As with SDRAMs, bank conflicts are still possible. However, the high bank count, typical for RDRAMs, reduces the probability of conflicts. For example, a RIMM of eight devices, 16 banks each, has 128 banks.



Figure 10. External data access bottleneck illustration with SDRAM.

The cost to pay for RDRAM's high bandwidth is high power consumption. Power management techniques keep RDRAM chip dissipation under control. Six states of activity are defined, ranging from a power-down mode to the highest attention state. The higher the activity state, the smaller the latency for access, but the greater the power consumption. For power consumption control, usually only a few chips can be active at the same time. Hence, when data is distributed across several RDRAMs, high latencies can be observed when accessing the data sequentially. This is because each chip has to alternate between a low-power mode and a high active state.

Data-memory power bottleneck
Because of the heavy push toward lower-power solutions to keep package costs low, and recently also for mobile applications or to address reliability issues, the power consumption of such external DRAMs has been reduced significantly. Apart from circuit and internal organization techniques for achieving this,4,13 technology modifications such as switching to a silicon-on-insulator (SOI) approach have been considered. With all these techniques for distributing the power consumption from a few hot spots to all parts of the architecture, the end result is indeed a very optimized design for power, where every part of the memory organization consumes a similar amount.4,14

However, little more can be gained in this area, because the bag of tricks now contains only the more complex solutions with a smaller return on investment. Nevertheless, the combination of all these approaches indicates a very advanced circuit technology that still outperforms the current state of the art in data path and logic circuits for low-power design (at least in industry). Hence, the relative power in the nonstorage parts can be reduced still more drastically, if this goal receives similar investments.

Combined with the advance in process technology, all this has led to a remarkable reduction of DRAM-related power: from several watts for the 16- to 32-Mbit generation to about 100 mW for 100-MHz operation in a 256-Mbit DRAM.

Hence, modern stand-alone DRAM chips, which are often of this SDRAM type, also offer low-power solutions, but at a price: Internally, they contain banks and a small cache with a very wide width connected to the external high-speed bus (see Figure 10).15 So low-power operation per bit is only feasible when they operate in burst mode, with entire memory columns, or parts of them, transferred over the external bus. This is not directly compatible with the actual use of the data in the processor data paths; so without a buffer to the processors, most of the data words exchanged would be useless and discarded. Obviously, in this case the effective energy consumption per useful bit becomes very high, so the effective bandwidth is low. Therefore, a hierarchical and typically far more power-hungry intermediate memory organization is needed to match the central DRAM to the data-ordering and bandwidth requirements of the processor data paths.

The reduction of power consumption in fast random-access memories is not as advanced yet as in DRAMs, but this is also an area that is becoming saturated; many circuit- and technology-level tricks have already been applied in SRAMs. As a result, fast SRAMs continue to consume on the order of watts for high-speed operation around 500 MHz. Thus, the memory-related system power bottleneck remains a very critical issue for data-dominated applications.

FROM THE PROCESS technology perspective, the importance of the memory-related system power bottleneck is not surprising, especially for submicron technologies.

The relative power cost of interconnections is increasing rapidly compared with the transistor-related (active circuit) components. Clearly, local data paths and controllers themselves contribute little to this overall interconnect compared to the major data/instruction buses and the internal connections in the large memories. Hence, if all other parameters remain constant, the energy consumption, as well as the delay or area, in the storage and transfer organization will become even more dominant, especially for deep-submicron technologies. The remaining basic limitations lie in transporting the data and the control (such as addresses and internal signals) over large on-chip distances, and in storing them.

One last technological recourse for alleviating the energy-delay bottleneck is to embed the memories as much as possible on chip. This has been the focus of several recent activities, such as the Mitsubishi announcement of an SIMD processor with a large distributed DRAM in 199616 (followed by the offering of embedded DRAM technology by several other vendors) and the Intelligent RAM (IRAM) initiative of Dave Patterson's group at UC Berkeley.17 The results show that the option of embedding logic on a DRAM process leads to a reduced power cost and an increased bandwidth between the central DRAM and the rest of the system. This is indeed true for applications where the increased processing cost is acceptable.18 However, it is a one-time drop, after which the widening energy-delay gap between the storage and the logic will continue to progress, because of the unavoidable evolution of the relative interconnect contributions.

Thus, it is clear that for midterm (and even short-term) projects, the access and power bottlenecks should be broken by other, nontechnological means. This is feasible with quite spectacular effects through a more optimal design of the memory organization and system-level code transformations applied to the initial application specification. (See "Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors," by F. Catthoor et al., and "Data Memory Organization and Optimizations in Application-Specific Systems," by P. Panda et al.) The price paid for these solutions is increased system design complexity, which can be offset with appropriate design methodology support tools. This requires looking at platform architecture design and mapping methodologies from a new perspective. In addition, architectures will have to become more compiler friendly.19

References
1. G. Lawton, "Storage Technology Takes the Center Stage," Computer, vol. 32, no. 11, Nov. 1999, pp. 10-13.
2. C. Kozyrakis et al., "Scalable Processors in the Billion-Transistor Era: IRAM," Computer, vol. 30, no. 9, Sept. 1997, pp. 75-78.
3. R. Evans and P. Franzon, "Energy Consumption Modeling and Optimization for SRAMs," IEEE J. Solid-State Circuits, vol. SC-30, no. 5, May 1995, pp. 571-579.
4. K. Itoh, "Low Voltage Memory Design," in tutorial on "Low Voltage Technologies and Circuits," Proc. IEEE Int'l Symp. Low-Power Design, IEEE CS Press, Los Alamitos, Calif., 1997.
5. B. Prince, "Memory in the Fast Lane," IEEE Spectrum, Feb. 1994, pp. 38-41.
6. G. Slavenburg, S. Rathnam, and H. Dijkstra, "The Trimedia TM1 PCI VLIW Media Processor," 8th Hot Chips Symp., 1996.
7. N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Reading, Mass., 1993.
8. D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, 1996.
9. N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1990, pp. 364-373.
10. T.F. Chen and J.L. Baer, "Effective Hardware-Based Prefetching for High-Performance Processors," IEEE Trans. Computers, May 1995, pp. 609-623.
11. T. Johnson and W. Hwu, "Run-Time Adaptive Cache Hierarchy Management via Reference Analysis," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1997, pp. 315-326.
12. R. Comerford and G. Watson, eds., "Memory Catches Up," IEEE Spectrum, Oct. 1992, pp. 34-57.


13. T. Yamagata et al., "Circuit Design Techniques for Low-Voltage Operating and/or Giga-Scale DRAMs," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 248-249.
14. T. Seki et al., "A 6-ns 1-Mb CMOS SRAM with Latched Sense Amplifier," IEEE J. Solid-State Circuits, vol. SC-28, no. 4, Apr. 1993, pp. 478-483.
15. T. Kirihata et al., "A 220 mm2, Four- and Eight-Bank, 256 Mb SDRAM with Single-Sided Stitched WL Architecture," IEEE J. Solid-State Circuits, vol. SC-33, Nov. 1998, pp. 1711-1719.
16. T. Tsuruda et al., "High-Speed, High Bandwidth Design Methodologies for On-Chip DRAM Core Multi-Media System LSIs," Proc. IEEE Custom Integrated Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1996, pp. 265-268.
17. D.A. Patterson et al., "Intelligent RAM (IRAM): Chips that Remember and Compute," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1997, pp. 224-225.
18. N. Wehn and S. Hein, "Embedded DRAM Architectural Trade-Offs," Proc. First Design and Test in Europe Conf., IEEE Press, Piscataway, N.J., 1998, pp. 704-708.
19. N. Mitchell, L. Carter, and J. Ferrante, "A Compiler Perspective on Architectural Evolutions," IEEE TC on Computer Architecture Newsletter, June 1997, pp. 7-9.

Lode Nachtergaele is a senior research engineer at IMEC, responsible for the technical coordination of IMEC's Multimedia Image Compression Systems (MICS) group. His research interests include distilling an operational system design methodology that improves design times of embedded multimedia systems, using the resulting design flow as a stepping stone to future application challenges. Nachtergaele has a BS in industrial engineering from the Katholieke Industriele Hogeschool, Oostende, Belgium.

Chidamber Kulkarni is a PhD student in DESICS at IMEC. His research interests include design methods and tools for cost-efficient embedded implementation of multimedia applications and architectural support for software systems. Kulkarni has a BE in electronics and communication engineering from Karnataka University, India, and an ME and PhD in electrical engineering from the Katholieke Universiteit, Leuven.

The biography of Francky Catthoor appears on page 4 in this issue.

Direct questions and comments about this article to Francky Catthoor, IMEC, Kapeldreef 75, B3001 Leuven, Belgium; [email protected].
