Non-Uniform Power Access in Large Caches with Low-Swing Wires

Aniruddha N. Udipi (University of Utah), Naveen Muralimanohar (HP Labs), and Rajeev Balasubramonian (University of Utah)

Abstract

Modern processors dedicate more than half their chip area to large L2 and L3 caches, and these caches contribute significantly to total processor power. A large cache is typically split into multiple banks, and these banks are connected either through a bus (uniform cache access – UCA) or an on-chip network (non-uniform cache access – NUCA). Irrespective of the cache model (NUCA or UCA), the complex interconnects that must be navigated within large caches are found to be the dominant component of cache access power. While there have been a number of proposals to minimize energy consumption in the inter-bank network, very little attention has been paid to the optimization of intra-bank network power, which contributes more than 50% of the total cache dynamic power in large cache banks. In this work we study various mechanisms that introduce low-swing wires inside cache banks as energy-saving measures. We propose a novel non-uniform power access design which, when coupled with simple architectural mechanisms, provides the best power-performance tradeoff. The proposed mechanisms reduce cache bank energy by 42% while incurring a minor 1% drop in performance.

1. Introduction

To alleviate the growing gap between processors and main memory, contemporary processors have begun to provide large multi-megabyte last-level caches, often occupying upwards of half the total die area. For example, the Montecito processor from Intel has 24 MB of L3 cache [1], and Intel's consumer desktop Nehalem processors have 8 MB of L3 cache [2]. With the memory wall showing no signs of breaking down, these trends are likely to continue, with future caches only growing larger.

Large caches are likely to use a Non-Uniform Cache Access (NUCA) architecture, with the cache split into multiple banks connected by an on-chip network [3]. The CACTI 6.0 tool computes the characteristics of large NUCA caches and shows that optimal behavior is exhibited when the cache is partitioned into large banks, where each bank can accommodate a few megabytes of data [4]. About 50% of the NUCA cache's dynamic power is dissipated within these large banks (the rest is mostly within the inter-bank network). Processors that do not employ NUCA will implement monolithic multi-megabyte private or shared L2s or L3s (the Nehalem and Montecito serving as examples). In addition to consuming silicon area, these caches will also contribute significantly to the energy consumed by the processor. Large last-level shared caches often serve as the cache coherence interface on multi-core processors, and multi-threaded applications will likely make frequent, expensive look-ups into the L2 to access shared data. There already appears to be a trend toward simplifying cores and improving their power efficiency: Intel's shift from the Netburst to the Core microarchitecture is a sign of things to come, and Sun's Niagara and Rock processors are also designed for low energy-per-instruction, among other things. In the meantime, if SRAM cache arrays remain stagnant in design, their contribution to overall chip power will continue to grow. Hence, this paper attempts to provide circuit/architecture innovations to improve energy dissipation within large cache banks.
We show that energy dissipation in a large cache is dominated by the H-tree network within each bank. To address this bottleneck, we propose various designs that leverage low-swing wiring within the cache. Low-swing wires are an attractive choice from the power perspective but are inherently slow. Pipelining them is inefficient and requires additional transceivers at intermediate points along the bus. If employed judiciously, however, their performance penalty can be mitigated while exploiting their low-power characteristics. We discuss these trade-offs in detail for a variety of designs. We finally show that limited use of low-swing wiring provides the best balance between performance and power. This leads us to introduce the notion of non-uniform power access, with certain regions of the cache being accessible at low energy with low-swing wires. Architectural mechanisms are required to exploit the low-power region of a cache bank, and we explore novel block placement policies to maximize use of the low-power region. Our results show significant cache energy savings at a very modest performance penalty, the penalty primarily arising from the non-pipelined nature of our low-swing transfers.

The paper is organized as follows. Section 2 provides basics on cache bank organization. Section 3 discusses different novel designs that employ low-swing wiring within a cache bank. Architectural innovations that support non-uniform power access are described in Section 4. Section 5 shows results. We discuss related work in Section 6 and draw conclusions in Section 7.

(This work was supported in parts by NSF grants CCF-0430063, CCF-0811249, CCF-0916436, NSF CAREER award CCF-0545959, Intel, SRC grant 1847.001, and the University of Utah.)

2. Background

Large caches of the future are expected to experience increasing disparity between access delays to different parts of the cache depending on their proximity to the processor core or cache controller. This mandates a Non-Uniform Cache Access (NUCA) architecture [3], with large caches being divided into multiple banks connected by an on-chip network for data and address transfer between banks. A recent study [5] has shown that due to the high power and latency overheads associated with on-chip network routers, NUCA caches will implement a few banks, each of non-trivial size. The count and size of the banks are determined by the relative contributions of the banks and the network to the total delay and energy consumption of the cache, and by associated design constraints. According to CACTI 6.0 [4], a NUCA modeling tool that identifies an optimal trade-off point between bank and network components, a 64 MB NUCA cache will likely be partitioned into large 2 MB or 4 MB banks. Up to 50% of cache dynamic power is dissipated within these large banks. Some processors may also adopt a tiled architecture where every core is associated with a large L2 bank (either a private cache or a slice of a large shared cache) [1]. Thus, regardless of whether future processors adopt private or shared caches, UCA or NUCA architectures, or tiled or contiguous L2 caches, it is evident that several large cache banks will be found on chip. This work focuses on reducing the significant component of dynamic power within these large cache banks. As per estimates from CACTI 5.3 [6], leakage in a 4 MB cache contributes about 20% to total power consumption. However, there are well-studied circuit ([7], [8], [9]) and microarchitectural ([10], [11], [12], [13]) techniques in the current literature to tackle leakage in caches. We assume that several of these can continue to be applied to the cache, orthogonal to our optimizations, and focus on dynamic power for the rest of the paper.

We now briefly describe the factors that influence the organization, power, and delay of a cache bank. A naive cache takes the form of a single array of memory cells and employs centralized decoder and logic circuitry to store and retrieve data from the cells. Such a monolithic model, however, has serious scalability issues. First, wordlines (due to a lack of available silicon area) and bitlines (due to differential signaling) cannot be repeated at regular intervals, causing their delay to increase quadratically with the size of the cache. Second, the bandwidth of the cache is a function of its cycle time (the cycle time of a cache is the sum of the wordline, bitline, and senseamp delays), and a single-array cache's bandwidth deteriorates quickly as the array size grows. The performance of the cache is thus limited by the long bitlines and wordlines that span the array. To reduce this quadratic delay impact, the cache is divided into multiple segments referred to as subarrays. These subarrays need to be connected through an interconnect fabric to transfer addresses and data within the cache. In order to reduce design complexity, an interconnect with easily predictable timing characteristics is essential. A balanced H-tree network (Figure 1) provides uniform pipelined access without complex switching circuits and proves to be an ideal choice.
The number of subarrays that the bank is split into, the height and width of the grid of subarrays (the aspect ratio of the bank), and the aspect ratio of the subarrays themselves are defined by three fundamental parameters of cache bank design (a worked sketch follows this list):

• NDWL – the number of vertical partitions in the array, i.e., the number of segments that a single wordline is partitioned into. This determines the number of columns of subarrays.
• NDBL – the number of horizontal partitions in the array, i.e., the number of segments that a single bitline is partitioned into. This determines the number of rows of subarrays.
• NSPD – the number of sets stored in each row of a subarray. For given Ndwl and Ndbl values, this decides the aspect ratio of the subarray.

An example organization with NDWL = 4 and NDBL = 4 is illustrated in Figure 1. The design space of a bank is defined by variations of these parameters and the resultant complex interaction of various internal power and delay components. For example, a small subarray count would enable tighter packing of cells, leading to increased area efficiency (cell area/total area), but would result in increased delay due to longer wordlines and bitlines. A large subarray count would give better delay characteristics but result in increased silicon area consumption.
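As an illustration of how these parameters carve a bank into subarrays, the minimal Python sketch below (our own construction, not part of CACTI; the function name and the example configurations are ours) computes the subarray grid and per-subarray cell dimensions for the 4 MB, 16-way bank used later in the paper:

```python
# Illustrative sketch of CACTI-style array partitioning: NDWL splits the
# wordline (columns of subarrays), NDBL splits the bitline (rows of
# subarrays), and NSPD packs multiple sets into one subarray row.

def subarray_dimensions(capacity_bytes, assoc, line_bytes, ndwl, ndbl, nspd):
    """Return (rows, cols) of SRAM cells in each subarray."""
    num_sets = capacity_bytes // (assoc * line_bytes)
    cells_per_set = line_bytes * 8 * assoc     # all ways of one set, in bits
    cols = cells_per_set * nspd // ndwl        # wordline segment length
    rows = num_sets // (nspd * ndbl)           # bitline segment length
    return rows, cols

# A 4 MB, 16-way bank with 64 B lines, for a few sample partitionings.
for ndwl, ndbl, nspd in [(4, 4, 1), (8, 16, 2), (32, 32, 1)]:
    r, c = subarray_dimensions(4 * 2**20, 16, 64, ndwl, ndbl, nspd)
    print(f"Ndwl={ndwl:2}, Ndbl={ndbl:2}, Nspd={nspd}: "
          f"{ndwl * ndbl:4} subarrays of {r} x {c} cells")
```

The (32, 32, 1) point reproduces the 32×32 grid of 1024 subarrays assumed in Section 5.1, while the (4, 4, 1) point corresponds to Figure 1.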

[Figure 1 here: an example cache bank organization — a grid of subarrays with NDWL = 4 and NDBL = 4, connected by an H-tree to the cache controller.]
Figure 1. An example cache bank organization

The effect of this design space exploration on the energy consumption of the cache is described in more detail in the upcoming subsection. A typical cache modeling tool like CACTI iterates over a large range of NDBL, NDWL, NSPD, etc., and determines the optimal configuration of the cache for a given set of design constraints, specified by the weight given to area, power, and delay and the maximum permissible deviation of these values from the optimum.

Understanding Bank Energy Consumption: The total energy spent on a cache access is the sum of the energies consumed by the decoder circuitry, bitlines, sense-amps, multiplexers, and the H-tree network. For various large-bank CACTI layouts that optimize delay, power, area, and their combinations, we consistently observe a large dynamic power contribution from the H-tree network. Figure 2 shows a representative dynamic energy breakdown across the various cache components (H-tree, decoder, wordlines, bitline mux and drivers, senseamp mux and drivers, bitlines, sense amplifiers, and subarray output drivers) for banks designed for low latencies, resulting in a relatively large number of small subarrays. The work in this paper targets the dominant contributor in this breakdown, the H-tree network.

[Figure 2 here: breakdown of total cache access energy across the components listed above.]
Figure 2. Contributors to total cache access energy

3. Low-swing Wires in Cache Banks

The array of SRAM cells constituting a bank is typically not one monolithic structure but is split into multiple smaller units called subarrays.
This helps keep capacitances within the bank low, reducing delay and allowing faster access. The subarrays in a bank are typically connected using a balanced H-tree structure that serves to provide uniform-time access to every subarray in the bank and keeps complexity low. A key insight of our proposal is that, in the process, all accesses also become uniform power, which is an unnecessary constraint placed on the access. Combined with the fact that the H-tree is a major contributor to the total cache energy, this is clearly an area to target for significant reductions in energy.

3.1. Low-Swing Signalling

One of the primary reasons for the high power dissipation of global wires is the full-swing requirement imposed by repeaters. While this can be somewhat mitigated by reducing repeater size and increasing repeater spacing, the overhead is still relatively high. Low voltage swing alternatives represent another mechanism to vary the wire power/delay/area trade-off. There is little silicon overhead to using low-swing wires since there are no repeaters and the wires themselves occupy zero silicon area. A low-swing pair does require special transmitter and receiver circuits for signal generation and amplification, but these are easily amortized over moderately long wires. Reducing the voltage swing on global wires can result in a linear reduction in power (we assume a voltage swing of 100 mV [14], [4]). In addition, assuming a separate voltage source for low-swing drivers will result in quadratic savings in power (we assume a supply voltage of 0.4 V [14], [4]). However, these power savings are accompanied by some big caveats. Low-swing wires suffer significantly greater delays than repeated full-swing wires and thus cannot be used over very long distances. There is also a non-trivial cost associated with pipelining low-swing wires. In spite of these problems, low-swing wiring is appropriate in some settings, and a few studies ([15], [16], [17]) have considered it when connecting cache banks to CPUs. This paper is the first to consider the use of low-swing signalling within a cache bank to address the power bottleneck posed by intra-bank wires in large caches. We study several ways of utilizing low-swing wires for this purpose (discussed next) and consider their architectural ramifications in Sections 4 and 5.
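The "linear times quadratic" savings argument can be made concrete with a back-of-envelope calculation. In the sketch below, the 100 mV swing and 0.4 V low-swing supply are the values assumed above; the 0.9 V full-swing supply and the per-millimeter wire capacitance are our own illustrative assumptions:

```python
# Dynamic energy to switch a wire is roughly C * V_drive * V_swing per
# transition; a full-swing repeated wire drives and swings at VDD.

C_WIRE = 300e-15   # F/mm, illustrative wire capacitance (assumed)
VDD = 0.9          # full-swing supply for a 32 nm process (assumed)
VDD_LOW = 0.4      # separate low-swing supply, per the text [14], [4]
V_SWING = 0.1      # 100 mV differential swing, per the text [14], [4]

e_full = C_WIRE * VDD * VDD        # ~243 fJ/mm
e_low = C_WIRE * VDD_LOW * V_SWING # ~12 fJ/mm: linear x quadratic savings

print(f"full-swing: {e_full * 1e15:5.1f} fJ/mm per transition")
print(f"low-swing : {e_low * 1e15:5.1f} fJ/mm per transition "
      f"({e_full / e_low:.0f}x lower, before transceiver overheads)")
```

The ratio shrinks once transmitter and receiver energy is charged to the wire — Section 3.4 quantifies this for pipelined segments — but this wire-only gap is what all of the following designs try to exploit.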

3.2. Single Low-Swing Bus

A simple way to exploit differential low-swing signaling would be to build the H-tree entirely as a single low-swing bus covering the bank. This cache model provides excellent energy savings, but they come at the cost of very significant drops in performance. Such a cache is, first of all, slow due to the higher latency of the long low-swing bus. Because the wire is not pipelined, accesses are essentially serial in nature, with the cache cycle time becoming equal to the access time, leading to drastically increased contention to access the bank. Such a scheme is not worth considering except in niche cases where power is a far more important consideration than performance.

3.3. Multiple Low-Swing Buses

To address the contention in non-pipelined low-swing buses, we consider an alternate scheme with multiple low-swing pairs, as shown in Figure 3. There is one low-swing bus per row of subarrays in the bank, connected to the controller by a vertical low-swing bus. This essentially seeks to spread the contention around by throwing more resources at the problem. Despite the increased cost and complexity, however, we find that there is still a contention problem. Aggressively throwing even more resources at the problem would simply lead to tremendous complexity and is not considered a feasible design option.

[Figure 3 here: the multi-bus organization — one low-swing bus per row of subarrays, joined by a vertical low-swing bus to the controller.]
Figure 3. Multi low-swing interconnect structure

3.4. Fully Pipelined Low-Swing H-Tree

Pipelining low-swing wires requires the use of differential transmitter and receiver circuitry at every pipeline stage. These circuits consume non-trivial amounts of energy; a single transmitter-receiver pair, amortized over a 1 mm low-swing wire, causes a 58% energy overhead, calculated from CACTI 6.0 values. Considering that the bus is likely to be at least 128 bits wide, pipelining low-swing wires carries a significant energy cost. Despite this overhead, there is still a significant energy gap between full-swing and low-swing wires, and good energy savings to be had. The bigger problem, however, is the inherently slower low-swing wires, which cause a large IPC drop relative to regular wires, especially in applications sensitive to L2 latency.

3.5. Low-Swing H-Tree Trunk

As a novel alternative to the various schemes discussed so far, we propose a structure as shown in Figure 4. We overlay a low-swing bus over the central trunk of the H-tree. The central rows of subarrays adjacent to this trunk are connected to the low-swing bus and not to the H-tree, and can therefore be accessed with greatly reduced energy consumption. The remaining rows continue to use the balanced H-tree network. The introduction of this low-swing interconnect does not affect the basic design of the H-tree in any significant way; a simple switch directly connects either the low-swing interconnect or the regular H-tree to the cache controller, depending on the subarray being accessed. Limiting the low-swing pair to just the central pair of subarray rows ensures that the delay of the low-power region is the same as that of the high-power regions connected by the default H-tree, as computed by CACTI. For example, in a 4 MB bank, the H-tree delay is approximately 0.32 ns. The width of the bank (and thus the length of the low-swing bus along the center) is 1.53 mm, which corresponds to a low-swing wire delay of approximately 0.26 ns. We can thus maintain uniform access time, and the scheme can be kept transparent to the processor. Our experiments show that contention at the cache controller and the low-swing bus is low enough to see little drop in overall IPC. The energy savings obtained are only proportional to the percentage of rows that are accessible by this low-swing bus, which is typically quite small (1/16th in our case). We believe, however, that the performance advantages of this design point make it worth considering very seriously. We next propose simple architectural mechanisms to increase the percentage of accesses hitting the low-power region through dynamically reconfigurable data placement in the cache.

[Figure 4 here: the low-swing trunk organization — high-power subarrays on the regular H-tree, low-power subarrays on the central low-swing bus.]
Figure 4. Low-swing H-tree trunk enabling non-uniform power access
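A quick check, using the numbers above, that the trunk preserves the bank's uniform access time (the per-millimeter low-swing delay is simply the figure implied by 0.26 ns over 1.53 mm):

```python
# The low-power rows stay transparent to the processor as long as the
# low-swing trunk is no slower than the default H-tree path.

HTREE_DELAY_NS = 0.32      # 4 MB bank H-tree delay (CACTI 6.0)
TRUNK_LENGTH_MM = 1.53     # bank width = low-swing trunk length
LOW_SWING_NS_PER_MM = 0.17 # implied by the 0.26 ns figure above

trunk_delay = TRUNK_LENGTH_MM * LOW_SWING_NS_PER_MM
assert trunk_delay <= HTREE_DELAY_NS, "low-power rows would be the slow path"
print(f"trunk {trunk_delay:.2f} ns <= H-tree {HTREE_DELAY_NS:.2f} ns: "
      f"uniform access time holds")
```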
4. Exploiting Non-Uniform Power Access

4.1. Smart Data Placement

Energy estimates from CACTI 6.0 show that an access over the low-swing interconnect could be as much as five to seven times cheaper than an access over the default H-tree (depending on cache size and H-tree wire type). Despite this, since the low-power wires span only a small fraction of the total cache, the total energy savings that can be obtained via the default data placement are marginal. It is clear that the default data placement will not effectively exploit the low-power resources at our disposal. We will need a smarter mechanism to maximize the number of accesses occurring over the low-swing pair.

The bank is split into NDBL rows of subarrays, only the two central ones of which are accessible via the low-power interconnect structure, thus typically representing only a small fraction of the total cache capacity. We propose assigning a fraction of the ways of the cache to the low-power region; the remaining ways form the high-power region. Since this is the last-level cache, we assume a sequential tag and data lookup scheme, as is the norm. It must be stressed that, in the interest of maintaining low complexity, we do not alter the design or access mechanism of the tag array. On every access, the tags in all ways are looked up in parallel to determine the location of the required block. This does not affect energy consumption significantly, since the tag arrays typically contribute less than a tenth of the total access energy in a cache bank. The real savings are to be had by optimizing data array accesses. A hit in a low-power way is a "low-power access" with immediate energy savings. A miss in the low-power way results in the block being brought into the low-power region from the high-power region or main memory. Assuming good temporal reuse of data, the next time that block is requested, it will hit in the low-power region, saving energy. Having set up the above basics for low-power cache access, we now describe two policies for managing blocks in the low- and high-power regions.

In the first policy, Swap, blocks are brought into the low-power region on touch (either from the high-power region or main memory), and the blocks they replace are swapped out into the high-power region, evicting the LRU way there. The low-power ways thus always represent the MRU ways for that set. On a high-power hit, the block is moved into the low-power region, and the block in the low-power region is moved into the high-power region, thus earning the Swap moniker. The most recently used ways (say, W) of every set in the cache are in the low-power region; in our experiments, W is 1 out of 16 ways. As already stated, there is no change to the UCA latency, so any change in performance is because of greater contention for the single unpipelined low-swing bus. Cache miss rates should also be the same as the baseline, because replacement continues to be governed by LRU.

The more a block is touched (re-used) per fetch into the low-power region, the greater the energy savings. To estimate the level of re-use needed to make this fetch worthwhile, consider the following analytical estimates. For now, we will assume that the energy of a high-power access H is 7 times the energy of a low-power access L (actual numbers are presented in Table 1). If a block is touched N times before slipping out of the MRU position, the conventional cache would have incurred the energy cost of N high-powered accesses. With the proposed scheme, there would have been N − 1 hits in the low-power way and one swap at the start, resulting in N + 1 low-power accesses and 2 high-power accesses. For the proposed model to consume less energy than the baseline,

N × H > 2 × H + (N + 1) × L, i.e., N > 2.5.

While this policy is effective, it is expensive in that every low-power miss incurs a swap that requires two low-power and two high-power accesses.
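The following toy model of a single 16-way set (our own illustration; the class and method names are hypothetical, and high-power recency is only approximated by list position) shows the bookkeeping the Swap policy implies, with way 0 as the low-power way:

```python
# Way 0 is the low-power (LP) way on the low-swing pair; ways 1..15 are
# high-power (HP), ordered roughly by recency so the last entry is the
# set's LRU victim.

class SwapSet:
    WAYS = 16

    def __init__(self):
        self.lines = [None] * self.WAYS   # lines[0] is the LP/MRU way

    def access(self, tag):
        """Return the access type the energy model must charge."""
        if self.lines[0] == tag:
            return "lp_hit"          # 1 LP access
        if tag in self.lines:
            i = self.lines.index(tag)
            # Swap: requested block into the LP way, old LP block out.
            self.lines[0], self.lines[i] = self.lines[i], self.lines[0]
            return "swap"            # 2 LP + 2 HP accesses
        # Miss: fetch into the LP way; the displaced LP block moves into
        # the HP region and the set's LRU block is evicted.
        displaced = self.lines[0]
        self.lines = [tag, displaced] + self.lines[1:-1]
        return "miss"

s = SwapSet()
print([s.access(t) for t in ["A", "A", "B", "A", "C"]])
# -> ['miss', 'lp_hit', 'miss', 'swap', 'miss']
```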
Now consider an alternative policy, referred to as Duplicate. On an L2 miss, the block is fetched from memory and placed in both the low- and the high-power region (thus allowing duplicate copies of a block in the L2). When a block in the low-power region is evicted, it is written into its copy in the high-power region if the block is dirty; if the block is clean, it is simply dropped. On a high-power hit, the block is copied into the low-power region, and the previously resident block is evicted following the rules above. Thus, if the block that is brought into the low-power region is written to, its eviction results in a swap and therefore incurs the cost of two high-power accesses and two low-power accesses (just as in the Swap policy). On an L2 miss, the block also incurs one more high-power access than under the Swap policy. However, if the block fetched into the low-power region is only read, its eviction incurs one less high-power and one less low-power access, as the block is quietly discarded. Even though this policy initially seems wasteful because it maintains a duplicate copy of the block, it ends up being more efficient than the Swap policy: blocks are frequently read-only, and having the duplicate copy means that a new block is simply fetched instead of a swap being required. Our results show that the Duplicate policy consumes less power than the Swap policy.

Forming equations similar to those developed for Swap: the first fetch that brings the block into the low-power way consumes H (reading from the high-power region) plus L (writing into the low-power region); subsequent low-power hits consume L each; and a copy back into the high-power region costs one L (reading from the low-power region) plus one H (writing into the high-power region). If the block is evicted clean at the end of the reuse run,

N_clean × H > H + L + (N_clean − 1) × L.

If the block is dirty and has to be written back to the high-power region on eviction,

N_dirty × H > H + L + (N_dirty − 1) × L + H + L,

giving N_clean > 1.16 and N_dirty > 2.6. If writes are not very frequent, Duplicate is clearly better than Swap even though it initially appears space-inefficient.
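A direct numerical check of these break-even conditions, under the same H = 7L assumption (the helper names are ours):

```python
# Energy per block "generation": fetched into the LP way, reused N times,
# then evicted. The baseline charges every touch as a high-power access.

H, L = 7.0, 1.0

def baseline(n):  return n * H
def swap(n):      return 2 * H + (n + 1) * L          # swap-in + LP hits
def dup_clean(n): return H + L + (n - 1) * L          # fetch + hits, dropped
def dup_dirty(n): return H + L + (n - 1) * L + H + L  # ... + copy-back

for name, cost in [("Swap", swap), ("Duplicate, clean", dup_clean),
                   ("Duplicate, dirty", dup_dirty)]:
    n = next(n for n in range(1, 100) if cost(n) < baseline(n))
    print(f"{name:16s}: wins once a block is reused {n}+ times per fetch")
```

The smallest winning reuse counts (3, 2, and 3 touches, respectively) are the integer versions of the thresholds above, and match the observation in Section 4.3 that the swap cost is amortized over as few as three accesses.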
4.2. Dynamic Reconfiguration

The block placement scheme described in the previous subsection gives excellent energy savings, provided a modestly high number of accesses can be satisfied by the low-power way. Below a certain threshold hit rate, we begin to see negative effects from bringing blocks into the low-power way on every touch: the extra energy required to move blocks to the low-power region starts to overshadow the energy savings obtained through the low-swing pair when there is insufficient reuse before eviction. It may be the case that certain phases of the application show very poor data reuse in the low-power region, leading to negative energy savings. To handle application phases with low reuse, we propose a dynamic reconfiguration scheme where the cache, if necessary, is able to switch off the placement scheme described above. In this mode, the L2 cache simply behaves like a conventional cache, and blocks are not brought into the low-power way(s) on access.

To facilitate such a mechanism, we would need to accurately characterize the local behavior of the application at any point in time. We simply maintain a saturating counter that increments every time an access hits in the low-power way and decrements on a miss. When the counter value falls below a certain threshold, the L2 starts to operate in the conventional manner. While in this mode, the counter increments on a hit in the most recently used (MRU) way and decrements on a miss. When the counter value goes above a threshold, the cache moves back into smart placement mode. We empirically found that a skewed 5-bit counter going between −15 and +15, with increments in steps of 2 and decrements in steps of 1 and with the threshold value being 0, effectively captured various reuse scenarios. Note that there is a single global counter, not one per cache line.
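A minimal sketch of this controller (the class and field names are ours; the constants are exactly those given above):

```python
# Global mode controller for dynamic reconfiguration: +2 on a hit in the
# low-power/MRU way, -1 on a miss, saturating at [-15, +15], switching
# modes as the count crosses the threshold of 0.

class ReconfigCounter:
    def __init__(self):
        self.count = 0
        self.copy_on_touch = True    # start in smart-placement mode

    def update(self, hit_in_mru_way):
        self.count += 2 if hit_in_mru_way else -1
        self.count = max(-15, min(15, self.count))       # saturate
        if self.copy_on_touch and self.count < 0:
            self.copy_on_touch = False   # behave like a conventional cache
        elif not self.copy_on_touch and self.count > 0:
            self.copy_on_touch = True    # resume copying blocks on touch

ctrl = ReconfigCounter()
for hit in [False] * 5 + [True] * 4:   # a low-reuse burst, then reuse returns
    ctrl.update(hit)
print(ctrl.copy_on_touch)              # True: the counter recovered above 0
```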
4.3. Discussion

We now discuss some of the finer points of our proposals and a few associated overheads. Since our Duplicate scheme deviates slightly from the LRU replacement policy, a minor fluctuation in miss rates is observed. The duplicate entries also mean that capacity is slightly lowered: in most cases, the cache behaves like a 15-way cache (for our 16-way cache with 1 way in the low-power region). This is, however, not always true, because the LRU policy may evict a high-power cache line while its copy is still resident in the low-power cache. This does not violate correctness, as a duplicate copy is not a requirement.

The look-up for this scheme is also complicated slightly compared to the baseline. If we are looking for block A, the tag search may reveal that block A is in way-6. This block must now be brought into the MRU way-1. This means that block B, which is currently resident in way-1 and dirty, has to be written back into its high-power copy. The tags must again be searched to locate the way that houses B's duplicate copy. Additional look-ups of the tags are not expensive, considering that high-power accesses are the bottleneck and this policy minimizes high-power accesses. The presence of duplicate copies also does not lead to coherence issues: if there are duplicate copies, there will be a hit in way-1, and this automatically causes the copy in the high-power way to be disregarded. When the low-power block is evicted, it over-writes the copy in the high-power way.

Also, under the smart line placement scheme, every access to a non-MRU way forces a swap in the L2. This swap requires accesses to a low- and a high-power way, unlike the baseline, which would have simply required one high-power way access. However, we show with our analytical models that the energy savings obtained on every cache hit in the low-power way easily amortize this swap cost over as few as three accesses.

Our data placement and mapping schemes bear resemblance to an L2/L3 hierarchy or a filter cache [18] based hierarchy. However, we believe our approach is orthogonal to the actual hierarchy and can continue to be used for the largest last-level cache structure. Further, we eliminate the need for interconnects between multiple physical cache structures. Our experiments show that our non-uniform scheme provides on average 25% more energy savings than a filter cache model with similar capacities, i.e., it is more efficient to access a portion of a large cache with low-swing wires than it is to access a small cache with conventional full-swing wiring.

The dynamic reconfiguration scheme is simply a decision between bringing the most recently touched block into the low-power way or not. It suffers practically no additional performance or energy overheads over the smart placement scheme. There is also little hardware overhead, since we only use a single five-bit saturating counter for the entire cache.

5. Results

5.1. Methodology

All of our architectural analysis was carried out using the SimpleScalar-3.0 [19] out-of-order simulator for an 8-issue Alpha AXP with an ROB size of 80 and an LSQ size of 40. The baseline processor is assumed to have separate 2-way, 1-cycle, 32 KB I- and D-L1 caches with a unified 16-way 4 MB L2 cache. The L1 and L2 have cache line sizes of 32 and 64 bytes, respectively. Main memory latency is assumed to be 300 cycles. The L2 cache is organized as a 32×32 grid of equally sized subarrays, based on Ndwl and Ndbl values obtained from CACTI 6.0. The two central rows are accessible via the low-swing pair, allowing 64 of the 1024 subarrays to be low-power. The low-power region is therefore one-sixteenth of the cache, i.e., one way. As a workload, we employ the SPEC2k programs, executed for a 100 million instruction window identified by the Simpoint toolkit [20]. The caches are warmed up for 25 million cycles before making measurements.

All delay and power calculations are for a 32 nm process technology and a clock frequency of 5 GHz, as computed by CACTI 6.0. The baseline wiring is assumed to be 10% delay-penalty repeated full-swing wire. (There are inherent tradeoffs between delay and power in the design of repeated full-swing wires: smaller and fewer repeaters result in slow wires but decreased power consumption, while more heavily repeated wires are fast but burn more power. A "10% wire" incurs a 10% delay penalty relative to the delay-optimal design in exchange for reduced power consumption [21].) We also show a sensitivity analysis for other baseline wiring assumptions. We measure in detail the cache access statistics in the low-power and high-power ways, including hit rates and writebacks.

Bank access characteristics and IPC values for the various models are shown in Table 1. Every high-power access consumes the energy of one default H-tree access plus the remaining components (bitlines, wordlines, senseamps, etc.). Every low-power access consumes a reduced H-tree access energy, with all other components assumed to be identical to the default case. A hit in the low-power way simply consumes one low-power access energy. A hit in the high-power way requires a swap, thus incurring the energy of two low-power accesses and two high-power accesses. A cache miss requires the block to be placed in the low-power way; the current resident of the low-power way is copied to a high-power way, and the LRU block is written back to memory if dirty. Thus, up to two memory accesses, two low-power accesses, and up to two high-power accesses are performed.

Table 1. Access characteristics for a 4 MB bank

  Model                 Latency (cycles)   Access Energy (nJ)   IPC
  Baseline              5                  0.185                1.456
  Single Low-swing      12                 0.016                1.181
  Pipelined Low-swing   12                 0.040                1.376
  Multi Low-swing       8                  0.015                1.337
  Non-uniform model     5                  0.014 (LP)           1.430
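Table 1's per-access energies make the stakes of the placement policies clear. The sketch below (our own, and deliberately pessimistic: every access that misses the low-power way is charged as a full swap) estimates the average access energy as a function of the low-power hit rate:

```python
# Average L2 access energy vs. low-power hit rate, using Table 1's 4 MB
# bank numbers. Charging every LP miss as a full swap is the worst case;
# the Duplicate policy drops clean blocks and does better.

E_HP, E_LP = 0.185, 0.014        # nJ per access (Table 1)
E_SWAP = 2 * E_HP + 2 * E_LP     # HP hit: block moved into the LP way

for rate in (0.3, 0.5, 0.7, 0.9):
    avg = rate * E_LP + (1 - rate) * E_SWAP
    print(f"LP hit rate {rate:.0%}: {avg:.3f} nJ/access, "
          f"baseline {E_HP:.3f} nJ, savings {1 - avg / E_HP:+.0%}")
```

Even this worst-case accounting turns positive once well over half the accesses are served by the low-power way, and it goes negative at the poor hit rates Section 5.3 observes for lucas and art — precisely the situations the dynamic reconfiguration scheme guards against.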
5.2. Analysis of Low-Swing Design Points

Figures 5 and 6 show the energy savings obtained and the performance degradation suffered by the various models that introduce low-swing wiring inside cache banks. Building the H-tree entirely out of low-swing wires provides more than 90% savings in energy compared to the baseline full-swing case. However, we see that this is accompanied by a 17% drop in IPC due to the increased delay and contention of the low-swing bus. By pipelining the low-swing bus with additional transmitters and receivers, an energy penalty of 12% is incurred, but IPC degradation is greatly reduced. Though the average IPC drop relative to the baseline in this case is just over 5%, there is a subset of benchmarks (not shown separately due to space constraints) that are sensitive to L2 latency and suffer as much as a 17% decrease in IPC due to the increased delay of the low-swing wires. The multiple low-swing model gives mediocre energy savings with moderate IPC degradation and does not represent an appealing design point. The non-uniform power access model displays IPC drops that are within error margins (just over 1% on average, with 3% in the worst case) and is the best from the performance viewpoint. We note that not all SPEC2k benchmarks have large enough working-set sizes to be significantly impacted by the capacity reduction of roughly 1/16 in our scheme. However, since even the largest program in SPEC2k is impacted by just 3%, we expect performance trends to be similar even for other, larger benchmark suites.

The energy savings of our scheme by itself are very marginal, typically less than 5%. When supported by simple architectural schemes, however, we see a considerable 42% energy saving, making it an attractive choice if both energy and performance are considered important. When we consider the overall processor ED² metric, the non-uniform access model provides a 5% improvement on average over the baseline, with a best-case improvement of up to 25% (this assumes that the L2 cache contributes 20% of total chip power). The pipelined low-swing model has the next best ED², yielding an average 3% improvement over the baseline. Clearly, these two models represent the most compelling design points, with the proposed non-uniform power model having the best performance and ED² while incurring the cost of block movements. Details of the block swap/duplication mechanisms are discussed next.

[Figure 5 here: average energy savings (%) for the design points — single low-swing, pipelined low-swing, multi low-swing, and the proposed scheme with and without architectural support.]
Figure 5. Average energy savings for the different design points

[Figure 6 here: average IPC drop (%) for the same design points.]
Figure 6. Average IPC degradation for the different design points

5.3. Architectural Mechanisms

Copying the most recently used block of every set into the low-power way is an effective way of maximizing the number of accesses hitting the low-power region. As shown earlier, the Duplicate scheme is more energy-efficient, and we pick it for all of our analysis; we also see only marginal fluctuations in miss rates and IPC when moving between Duplicate and Swap. Figure 7 shows the energy savings obtained by this kind of smart block placement. There is a 42% saving on average, with the best benchmark showing an 88% energy advantage. Table 2 details the access statistics for all applications in the L2 cache. We see that even with the moderate hit rates seen in the low-power way, energy savings are substantial. This is explained simply by the huge energy difference between low-power and high-power accesses (7x–10x). The energy savings achieved are also proportional to the reuse count per fetch, N, since the copy cost is amortized over multiple accesses and we begin to see additional savings. For example, eon and gcc have reuse counts greater than 35 and show energy savings of over 80%.

[Figure 7 here: per-benchmark energy savings (%) across the SPEC2k suite.]
Figure 7. Energy savings with smart data placement but no dynamic reconfiguration

There are a couple of applications that show an increase in energy consumed when data is copied on first touch. These are programs that have poor hit rates in the low-power way of the cache, typically under 40%, due to low reuse counts (lucas and art). Data is repeatedly copied into the low-power way, only to be thrown out without being reused sufficiently to amortize the cost of the copy. For example, lucas and art have reuse counts less than 0.5, which means that more often than not, blocks are brought into the low-power way and evicted without reuse. Such blocks are best left in the high-power way and accessed directly from there.

Our dynamic reconfiguration scheme keeps track of the application's access patterns at run time and turns off block copying when required. Figure 8 shows the energy savings obtained in this case. We see that the energy savings of applications that were already benefiting from the smart placement scheme remain practically untouched, while the previously negative numbers are now positive, though relatively small. Our experiments show that applications in the former class remain in "copy-on-touch" mode upwards of 95% of the time, whereas the latter are in this mode less than 30% of the time. This indicates that our simple counter effectively captures the behavior of the application in a small time window.

[Figure 8 here: per-benchmark energy savings (%) with the proposed scheme and architectural support.]
Figure 8. Energy savings with dynamic reconfiguration

The energy savings obtained are clearly a function of the capacity of the cache, since this determines the baseline access energy and the percentage of accesses that can be satisfied by the low-power way for a given workload. As shown in Figure 9, the larger the cache, the better the savings, making this scheme even more attractive as cache sizes increase. We see that beyond a point, simply employing smart data placement actually increases energy consumption due to insufficient reuse; the dynamic reconfiguration scheme detects such situations and reverts to the default behavior in program phases with poor locality.

[Figure 9 here: average energy savings (%) vs. bank capacity (4096, 1024, 256, and 64 kB), for smart data placement alone and with dynamic reconfiguration.]
Figure 9. Sensitivity of energy savings to bank capacity

Table 2. Cache access statistics (LP – low-power way, HP – high-power way; *access counts ×100,000)

  Benchmark   L2 Accesses*   LP Hits*   LP Misses*   LP-to-HP WrtBk*   LP Hit Rate (%)   Dirty Misses (%)   Avg Reuse Count
  ammp        33.2           19.3       13.9         3.6               58.1              26.1               1.4
  applu       24.0           12.8       11.2         0.5               53.4              4.9                1.2
  apsi        18.7           13.4       5.3          2.1               71.8              40.6               2.5
  art         172.2          54.7       117.5        31.2              31.7              26.6               0.5
  bzip        23.3           12.0       11.2         2.3               51.7              20.8               1.1
  crafty      13.4           12.2       1.3          0.1               90.7              10.5               9.7
  eon         5.9            5.7        0.1          0.0               97.5              14.0               39.7
  equake      83.8           55.1       28.7         3.8               65.7              13.2               1.9
  fma3d       34.1           17.9       16.2         6.2               52.5              38.4               1.1
  galgel      39.6           27.0       12.6         4.0               68.2              31.6               2.2
  gap         5.7            4.1        1.6          1.4               71.7              84.9               2.5
  gcc         57.7           46.4       11.3         9.0               80.5              80.1               4.1
  gzip        31.2           30.4       0.8          0.4               97.4              49.4               36.7
  lucas       70.8           19.5       51.4         20.1              27.5              39.1               0.4
  mcf         212.6          86.5       126.1        28.5              40.7              22.6               0.7
  mesa        4.1            3.2        0.9          0.7               78.8              81.9               3.7
  mgrid       28.4           16.8       11.7         2.7               59.0              22.9               1.4
  parser      22.1           15.6       6.5          2.3               70.6              35.1               2.4
  swim        74.4           51.5       22.8         9.6               69.3              42.1               2.3
  twolf       40.8           24.8       16.0         5.7               60.7              35.6               1.6
  vortex      11.6           9.2        2.4          0.3               79.3              11.1               3.8
  vpr         36.3           18.2       18.2         6.5               50.0              35.7               1.0
  wupwise     7.5            3.8        3.8          0.6               49.9              15.6               1.0
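Table 2's counts can be plugged into the access-cost accounting of Section 5.1 to estimate a benchmark's bank energy. The sketch below (our reconstruction of that bookkeeping, simplified to Duplicate-style fills and copy-backs; the function name is ours) uses the ammp row:

```python
# LP hits cost one low-power access; LP misses read the HP copy and write
# the LP way; dirty LP evictions copy back into the HP way. Counts are in
# units of 100,000 accesses, energies in nJ (Table 1).

E_HP, E_LP = 0.185, 0.014

def bank_energy(lp_hits, lp_misses, writebacks):
    return (lp_hits * E_LP                    # served over the low-swing pair
            + lp_misses * (E_HP + E_LP)       # fill: HP read + LP write
            + writebacks * (E_HP + E_LP))     # copy-back: LP read + HP write

accesses, lp_hits, lp_misses, wrtbk = 33.2, 19.3, 13.9, 3.6   # ammp row
ours, base = bank_energy(lp_hits, lp_misses, wrtbk), accesses * E_HP
print(f"ammp: {ours:.2f} vs {base:.2f} (x1e5 nJ) -> {1 - ours / base:.0%} saved")
```

This first-order estimate lands near the ~40% average of Figure 7 despite ammp's modest 58% low-power hit rate, illustrating why even moderate hit rates translate into large savings.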

The savings are also a function of the kind of wiring assumed for the default H-tree network. As the wires pay a larger delay penalty, they are themselves cheaper to operate in terms of energy, and the room for energy improvement shrinks. Figure 10 shows the results of this sensitivity analysis, normalized to the 10% case.

[Figure 10 here: per-benchmark energy savings normalized to the 10%-delay-penalty baseline, for global, 10%, 20%, and 30% wires.]
Figure 10. Sensitivity of energy savings to baseline wire type

6. Related Work

Low-swing wires are gaining popularity as energy consumption becomes a major issue in electronics design. The Smart Memories project ([15], [16], [17]) explored using a low-swing crossbar to connect processing elements and cache subarrays in a reconfigurable environment. CACTI 6.0 [4], the latest version of the popular cache modeling tool, now incorporates support for low-swing wiring inside the cache. Ho et al. recently proposed a capacitively driven low-swing wire design for high speed and low energy [22]. There have been numerous other proposals to reduce energy consumption in caches. Flautner et al. [11] proposed putting a subset of cache lines into a state-preserving low-power drowsy mode based on cache activity. Agarwal et al. [23] utilize the concept of Gated-Ground (an NMOS transistor inserted between the ground line and the SRAM cell) to achieve reductions in leakage energy. Yang et al. [24] propose an L0 instruction cache before the L1 to achieve lower capacitance and thus lower energy without affecting performance. Ishihara et al. [25] advocate using non-uniformity in the number of ways per set to achieve energy advantages. All of the above schemes do little to reduce energy in the H-tree, a major contributor to cache energy. Our H-tree optimizations can work in tandem with all of these schemes to reduce the power consumed in the wiring connecting the cells.

7. Conclusions

Future processors will accommodate large multi-megabyte caches. The energy consumed by large cache banks will continue to be a growing problem, especially as CPU cores become more power-efficient. This paper isolates the H-tree within a bank as the major energy bottleneck in a large cache. We study various ways of introducing low-swing wires within the cache to address this bottleneck. Using a pipelined low-swing bus for the entire H-tree provides the best energy savings but can lead to as much as 17% drops in IPC. We show that the use of a single low-swing bus, providing low-power access to a small part of the cache, gives the best energy-performance tradeoff. We thus introduce the notion of non-uniform power access within a cache bank. Since the low-power region is a very small fraction of the total cache, architectural mechanisms are required to boost its access frequency. We advocate a policy that is based on MRU access and block duplication within the L2. We see overall cache energy reductions of 42% on average with just over a 1% drop in IPC. We believe that non-uniform power access (NUPA) has much potential, just as non-uniform cache access (NUCA) has opened up several opportunities in recent years.

References

[1] C. McNairy and R. Bhatia, "Montecito: A Dual-Core, Dual-Thread Itanium Processor," IEEE Micro, vol. 25, no. 2, March/April 2005.
[2] "First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)," Intel Whitepaper, Tech. Rep., 2008.
[3] C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches," in Proceedings of ASPLOS, 2002.
[4] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in Proceedings of MICRO, 2007.
[5] N. Muralimanohar and R. Balasubramonian, "Interconnect Design Considerations for Large NUCA Caches," in Proceedings of ISCA, 2007.
[6] "CACTI: An Integrated Cache and Memory Access Time, Cycle Time, Area, Leakage, and Dynamic Power Model," http://www.hpl.hp.com/research/cacti/.
[7] "High-k and Metal Gate Research," http://www.intel.com/technology/silicon/high-k.htm.
[8] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, B. Cherkauer, J. Stinson, J. Benoit, R. Varada, J. Leung, R. Lim, and S. Vora, "A 65-nm Dual-Core Multithreaded Processor With 16-MB L3 Cache," IEEE Journal of Solid State Circuits, vol. 42, no. 1, pp. 17–25, January 2007.
[9] K. Zhang, U. Bhattacharya, Z. Chen, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, "A SRAM Design on 65nm CMOS Technology with Integrated Leakage Reduction Scheme," in Proceedings of VLSI, 2004.
[10] S. Heo, K. Barr, M. Hampton, and K. Asanovic, "Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines," in Proceedings of ISCA, 2002.
[11] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy Caches: Simple Techniques for Reducing Leakage Power," in Proceedings of ISCA, 2002.
[12] S. Kaxiras, Z. Hu, and M. Martonosi, "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power," in Proceedings of ISCA, 2001.
[13] S. Yang, M. Powell, B. Falsafi, K. Roy, and T. Vijaykumar, "An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches," in Proceedings of HPCA, 2001.
[14] R. Ho, "On-Chip Wires: Scaling and Efficiency," Ph.D. dissertation, Stanford University, August 2003.
[15] R. Ho, K. Mai, and M. Horowitz, "Efficient On-Chip Global Interconnects," in Proceedings of VLSI, 2003.
[16] K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz, "Architecture and Circuit Techniques for a Reconfigurable Memory Block," in Proceedings of ISSCC, 2004.
[17] K. Mai, T. Paaske, N. Jaysena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A Modular Reconfigurable Architecture," in Proceedings of ISCA, 2000.
[18] J. Kin, M. Gupta, and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," in Proceedings of MICRO, 1997.
[19] D. Burger and T. Austin, "The Simplescalar Toolset, Version 2.0," University of Wisconsin-Madison, Tech. Rep. TR-97-1342, June 1997.
[20] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proceedings of ASPLOS, 2002.
[21] K. Banerjee and A. Mehrotra, "A Power-Optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs," IEEE Transactions on Electron Devices, vol. 49, no. 11, pp. 2001–2007, November 2002.
[22] R. Ho, T. Ono, F. Liu, R. Hopkins, A. Chow, J. Schauer, and R. Drost, "High-Speed and Low-Energy Capacitively-Driven On-Chip Wires," in Proceedings of ISSCC, 2007.
[23] A. Agarwal, H. Li, and K. Roy, "DRG-Cache: A Data Retention Gated-Ground Cache for Low Power," in Proceedings of the 39th Design Automation Conference, June 2002.
[24] C.-L. Yang and C.-H. Lee, "HotSpot Cache: Joint Temporal and Spatial Locality Exploitation for I-Cache Energy Reduction," in Proceedings of ISLPED, 2004.
[25] T. Ishihara and F. Fallah, "A Non-Uniform Cache Architecture for Low Power System Design," in Proceedings of ISLPED, 2005.