
Energy behaviour of NUCA caches in CMPs

Alessandro Bardine, Pierfrancesco Foglia, Francesco Panicucci, Marco Solinas
Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Pisa, Italy
{a.bardine, foglia, f.panicucci, marco.solinas}@iet.unipi.it

Julio Sahuquillo
Departamento de Ingeniería de Sistemas y Computadores, Universidad Politécnica de Valencia, Valencia, Spain
[email protected]

Abstract—Advances in semiconductor technology nowadays make it possible to design Chip Multiprocessor systems equipped with huge on-chip Last Level Caches. Due to the wire delay problem, the use of traditional cache memories with a uniform access time would result in unacceptable response latencies. The NUCA (Non Uniform Cache Access) architecture has been proposed as a viable solution to hide the adverse impact of wire delay on performance. Many previous studies have focused on the effectiveness of NUCA architectures, but the study of the energy and power aspects of NUCA caches is still limited. In this work, we present an energy model specifically suited for NUCA-based CMP systems, together with a methodology to employ the model to evaluate the NUCA energy consumption. Moreover, we present a performance and energy dissipation analysis for two 8-core CMP systems with an S-NUCA and a D-NUCA, respectively. Experimental results show that, as in the monolithic processor case, the static power dominates the total power budget also in the CMP system.

Keywords: CMP, NUCA, energy dissipation, wire delay, power consumption

I. INTRODUCTION

The use of large on-chip Last-Level Caches (LLC) [1][2][3] and of multiple homogeneous cores (Chip Multi-Processor, CMP) [3][4] are two of the main features that characterize current processor design. Both are a simple yet effective way of leveraging the ever growing number of transistors available on the die: continuous manufacturing advances in semiconductor technologies still satisfy the famous Moore's prediction [5]. The feature size shrinking of manufacturing processes reduces transistor switching time, enabling higher clock frequencies for the microprocessor pipeline that, together with growing integration capabilities, translate into performance improvements. While increasing the clock frequency leads to an improvement in processor performance, it also increases power consumption. At the same time, the performance of conventional on-chip caches is jeopardized by the wire delay problem [6][7][8]: as the dimensions of on-chip wires are reduced, the signal propagation delay on such wires grows, so the higher the clock frequency, the higher the number of clock cycles needed to propagate a signal on the on-chip communication lines [9]. Consequently, the access time of large conventional on-chip caches increases due to wire delay, since most of this time is spent for signal routing on interconnections. In order to reduce the effects of the long cache latencies on overall system performance, NUCA (Non-Uniform Cache Access) caches have been proposed for both single-core and CMP systems [10][11][12][13][14].

In a NUCA architecture, the total cache space is partitioned into many independent banks, usually interconnected by a switched network. In this model the access latency is proportional to the physical distance of the banks from the requesting element (typically one core or a single cache controller). The mapping between cache lines and banks can be performed either in a static or in a dynamic way (namely S-NUCA and D-NUCA): in the former, each line can exclusively reside in a single predetermined bank, while in the latter a line can dynamically migrate from one bank to another. When adopting the dynamic approach, blocks are mapped to a set of banks, called a bankset, and they are able to migrate among the banks belonging to their bankset. The migration mechanism allows data to move towards the most frequently referring CPUs, thus reducing the average cache latency by storing the most frequently accessed blocks in banks close to the referring CPUs.

Various previous works have shown that NUCA-based systems are effective in hiding delay effects on performance [14][15][16][17][18], and several works have shown interest in the energy aspects of such architectures. In [19][20] techniques are proposed for the reduction of power consumption, while in [21] a model for NUCA interbank network elements is devised. In [22] an energy model is defined for a single-core NUCA-based architecture, taking into account both the static and dynamic contributions of NUCA caches. In [23][24][25] the model proposed in [22] is employed to evaluate energy aspects of solutions proposed in a CMP environment.

This work extends the model proposed in [22], presenting an energy model specifically suited for NUCA-based CMP systems and a methodology to employ the model in order to evaluate NUCA energy consumption. In addition, we propose an analysis of both performance and energy dissipation for two 8-core CMP systems adopting, respectively, an S-NUCA and a D-NUCA as LLC. We show that, as in the monolithic processor case [22], also for multicore NUCA systems the static component dominates the total power budget, despite the increase in the number of traffic sources (i.e., many processors accessing the shared LLC) and the communication overhead due to the need of managing coherence.

The rest of this work is organized as follows. Section II describes the reference architecture considered in our study. Section III extensively describes our energy model. Section IV exposes the adopted evaluation methodology, while in Section V performance and energy results are presented and discussed. Section VI presents some related works, and Section VII concludes this paper.

II. REFERENCE ARCHITECTURE

Figure 1 shows the architecture of our reference NUCA CMP system, organized in a dancehall configuration. We consider 8

processors, each with its own private L1 Instruction and Data caches, and a shared L2 NUCA as LLC. Processor nodes (CPUs with private L1 caches) are placed at the two opposite sides of the L2 cache: four at the top and four at the bottom.

Figure 1 The reference architecture. Small squares are the NUCA banks, black circles are NoC switches, white big squares represent CPU+L1 cache nodes. Bank columns are the banksets.

The NUCA is organized as a matrix of 64 banks, eight rows and eight columns. All the banks are connected among themselves, and with the processor nodes, via a highly scalable Network-on-Chip (NoC) [26][27], composed of 64 wormhole switches, and horizontal and vertical links connecting all the nodes, resulting in a partial 2D mesh network topology.

The architecture adopts a directory version of MESI as the baseline coherence protocol, in which the directory is non-blocking [28][29] and distributed. The cache hierarchy is inclusive, so there is no need for a centralized directory, as directory information is held in the TAG field of the L2 copy.

In the case of S-NUCA, we use the interleaved mapping strategy [30] (low-order bits of the index field of the physical address) to statically map each memory block to a single NUCA bank. In the case of D-NUCA, each bankset is made up of all the banks belonging to the same column, and we adopt the same mapping policy to map each block to a single bankset. Thus, in a D-NUCA a block will be able to migrate among all the banks that are part of the bankset the block is mapped to. We consider the migration scheme proposed in [31], which exploits per-hit [10] block migration: a data promotion is triggered whenever a block request from an L1 results in a hit in any of the NUCA banks belonging to the block's bankset. As blocks can be stored in any of the bankset's banks, we assume, as the search policy, that L1-to-L2 requests are broadcast to all the banks of the reference bankset.

III. ENERGY MODEL FOR CMP NUCA CACHES

This section presents the proposed energy model for CMP NUCA caches, which extends the one introduced in [22] for single-core NUCA architectures.

A. Main components

The energy model takes into account both the dynamic and static energy dissipated by the cache, and the extra dynamic energy consumed when accessing the lower levels of the memory hierarchy on cache misses. The total energy dissipated by the memory subsystem during the run of an application can be calculated according to the following formula:

E_total = E_dyn + E_static + E_off-cache    (1)

where E_dyn and E_static are the dynamic and the static components of the energy dissipated at run time, respectively, and E_off-cache is the energy dissipated to solve the cache misses. The basic elements of a NUCA architecture are: memory banks, network links and network switches (see Figure 1). So, both the static and dynamic components of the energy can be calculated by taking into account the contribution of each element.

B. Dynamic component

We can calculate the dynamic component of the cache by considering the dynamic contribution of the various elements as follows:

E_dyn = E_dyn_banks + E_dyn_links + E_dyn_switches    (2)

For bank accesses the model distinguishes read miss, read hit and write operations, as they incur different energy dissipations. So E_dyn_banks can be calculated as follows:

E_dyn_banks = Σ_{i=1..Num_banks} ( n_read_hit_i · E_read_hit_i + n_read_miss_i · E_read_miss_i + n_write_i · E_write_i + n_dir_read_i · E_dir_read_i + n_dir_write_i · E_dir_write_i )    (3)

where E_read_hit_i, E_read_miss_i, E_write_i, E_dir_read_i and E_dir_write_i are respectively the dynamic energies dissipated when accessing bank i in a read hit, a read miss, a write operation and a directory access (either a read or a write to the part of the TAG holding the directory information) related to the coherence protocol. As previously exposed, each bank of the NUCA cache is a self-contained traditional UCA cache memory, thus its energy parameter values can be derived from available existing tools that model these aspects of a cache memory. In particular, by using the CACTI 5.1 tool [37], we obtain the energy consumption for every kind of access. Num_banks is the total number of cache banks present in the NUCA cache. n_read_hit_i, n_read_miss_i, n_write_i, n_dir_read_i and n_dir_write_i are the total numbers of each type of operation performed on bank i by all the cores during the application execution. n_dir_read and n_dir_write explicitly take into account directory coherence operations in our model, as a cache access implies one or more directory accesses depending on the current request type (e.g. L1 read miss or writeback) and block state (e.g. shared/private, clean/dirty). These directory accesses translate into extra TAG reads or writes in addition to the canonical TAG search needed to detect whether the access is a hit or a miss. For example, in the considered coherence protocol, when a Read Hit occurs in L2, an extra TAG write to update the directory information follows the TAG lookup and the block read.

For network links, the overall dynamic energy consumption can be calculated by accumulating the energy dissipated by each link during each flit traversal:

E_dyn_links = Σ_{i=1..Num_links} n_flit_i · E_link_i    (4)

where Num_links is the total number of network links, E_link_i is the energy required to transmit a flit on link i, and n_flit_i is the total number of flits traveling on link i during the whole application execution as a result of the operations performed by all the processor cores. The value of E_link can be calculated according to the following formula:

E_link = α · num_of_link_wires · E_wire    (5)

where α is the switching activity factor, num_of_link_wires is the number of lines that constitute the link (i.e. the link width in bits), and E_wire is the energy spent to drive a single line. E_wire was calculated adopting a simple capacitance model according to the following formula:

E_wire = (1/2) · V² · C_wire    (6)

where V is the drive voltage of the wire and C_wire is the capacitance of the wire, calculated as:

C_wire = c_unit · wire_length    (7)

where c_unit is the capacitance per unit length and wire_length is the length of the wire. We referred to the Berkeley Predictive Technology Model [38] to derive wire resistance and capacitance per unit length. The length of each link, and thus its total resistance and capacitance, was calculated according to the physical dimensions of the cache banks derived from CACTI, assuming that each link surrounds a memory bank.

For network switches, we referred to the model presented in [40] and we calculate the overall dynamic energy as follows:

E_dyn_switches = Σ_{i=1..Num_switches} n_flit_switch_i · E_switch_i    (8)

where Num_switches is the total number of network switches present in the system, n_flit_switch_i is the total number of flits that travel through switch i, and E_switch_i is the energy consumed by switch i when elaborating a single flit. Its value can be determined with Orion.

C. Static component

The static energy dissipated by the cache is due to leakage currents. The elements that contribute to dissipating static energy are banks and network switches, so E_static can be calculated as follows:

E_static = ( Σ_{i=1..Num_banks} P_static_bank_i + Σ_{i=1..Num_switches} P_static_switch_i ) · T_execution    (9)

where P_static_bank_i and P_static_switch_i are the static power consumptions of each single element, and can be determined with the CACTI tool and the Orion 2.0 tool [40], respectively.

D. Off-cache energy

E_off-cache represents the dynamic energy dissipated during off-cache accesses due to cache misses. It can be calculated as:

E_off-cache = n_off_cache_accesses · E_off-cache_access    (10)

where E_off-cache_access is the dynamic energy spent for each off-cache access (including main memory and off-chip access) and n_off_cache_accesses is the total number of accesses to the DRAM performed to solve cache misses. For off-cache energy, we assumed that the L2 cache is backed by off-chip DRAM memory; the term E_off-cache corresponds to the energy dissipated by the DRAM memory during off-chip accesses. We derived the energy dissipated on each main memory access (E_off-cache_access) assuming a modern DDR2 system [45] and the related bus [45]. With the same approach taken by previous works [43][44], we focused our evaluation on the energy dissipated during active cycles (read/write cycles), isolating it from the background energy that is associated to each DRAM power state.

IV. METHODOLOGY

A. Considered architecture

The architecture considered for this study is depicted in Figure 1. It is a CMP system based on 8 UltraSparc II processors, where each core has its private L1 Instruction and Data caches and is connected to a shared L2 cache that is a NUCA (used either as an S-NUCA or a D-NUCA). We assume the system is backed by a 2GB DDR memory. Architectural details of the system are reported in Table I.

B. Simulation tools

The simulation platform adopted for this study is based on Simics [35], which we employed to perform full-system simulations of the CMP system. We used GEMS [36] in order to simulate the cache and memory hierarchy. We used both simulators to produce the execution statistics needed by the proposed energy

model (e.g. type and respective number of bank accesses, number of flit transmissions, execution time, total number of flit traversals, ...). The resulting execution statistics and the energy parameters that have been previously described are taken as inputs by our energy model to derive the energy and power consumption. Figure 2 depicts the adopted methodology, while Table II reports the values adopted for energy and power dissipations. We estimated power and energy with the tools [37][38][40] and the datasheet [45] already mentioned in Section III. We considered a temperature of 80°C, which is quite representative of the typical working conditions of the processor [22].

TABLE II. ENERGY PARAMETERS OPERATING AT A 80°C TEMPERATURE

E_link (per flit): 12,3 pJ
P_static_switch (5-port switches): 93,5 mW
P_static_switch (4-port switches): 84,6 mW
P_static_switch (3-port switches): 76,3 mW
E_switch (5-port switches, per flit): 129 pJ
E_switch (4-port switches, per flit): 98,5 pJ
E_switch (3-port switches, per flit): 78,5 pJ
P_static_bank: 355,958 mW
E_read_hit (per bank access): 434,705 pJ
E_read_miss (TAG-only bank access): 86,941 pJ
E_dir_read / E_dir_write (directory TAG access): 86,941 pJ
E_off-cache_access: 31836 pJ
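The per-element accounting of Sections III and IV can be sketched in a few lines of Python. This is a minimal illustration of Eqs. (1)-(10), not the authors' tool chain: all function and variable names are ours, and the example values are round numbers of the same order of magnitude as Table II.

```python
# Minimal sketch of Eqs. (1)-(10) of Section III. Names and example
# values are ours (illustrative, Table II-like magnitudes; SI units).

def link_flit_energy(alpha, link_wires, v, c_unit, wire_length):
    """Eqs. (5)-(7): energy (J) to move one flit across one link."""
    c_wire = c_unit * wire_length          # Eq. (7)
    e_wire = 0.5 * v * v * c_wire          # Eq. (6)
    return alpha * link_wires * e_wire     # Eq. (5)

def bank_dynamic_energy(counts, energies):
    """Inner term of Eq. (3) for one bank: sum of n_op * E_op over the
    operation types (read hit, read miss, write, directory read/write)."""
    return sum(counts[op] * energies[op] for op in counts)

def total_energy(bank_dyn, link_flits, e_link, switch_flits, e_switch,
                 n_banks, p_bank, n_switches, p_switch, t_exec,
                 n_off, e_off):
    e_dyn = bank_dyn + link_flits * e_link + switch_flits * e_switch  # Eqs. (2), (4), (8)
    e_static = (n_banks * p_bank + n_switches * p_switch) * t_exec    # Eq. (9)
    return e_dyn + e_static + n_off * e_off                           # Eqs. (1), (10)

# Example run with assumed counts (J, W, s):
e_link = link_flit_energy(alpha=0.5, link_wires=128, v=1.1,
                          c_unit=0.2e-12, wire_length=1.0)  # ~8 pJ/flit
bank_dyn = bank_dynamic_energy(
    {"read_hit": 1_000_000, "read_miss": 200_000, "dir_write": 1_000_000},
    {"read_hit": 435e-12, "read_miss": 87e-12, "dir_write": 87e-12})
e_tot = total_energy(bank_dyn, link_flits=5_000_000, e_link=e_link,
                     switch_flits=5_000_000, e_switch=100e-12,
                     n_banks=64, p_bank=0.356, n_switches=64, p_switch=0.09,
                     t_exec=0.1, n_off=50_000, e_off=32e-9)
```

Even with these invented activity counts, the static term (about 28,5 W of leakage integrated over 0,1 s, i.e. roughly 2,9 J) dwarfs the dynamic and off-chip terms (a few mJ), which anticipates the qualitative result of Section V.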

Figure 2 Methodology workflow adopted in the evaluation: execution statistics for the chosen benchmarks are collected by Simics and GEMS and then combined with the energy parameters by the energy model to finally obtain the energy consumption.

TABLE I. CONFIGURATION PARAMETERS

Tech. node: 65nm
CPUs: 8 CPUs (UltraSparc II), in-order
Clock Frequency: ~5 GHz
Private L1 cache: 16 Kbytes I + 16 Kbytes D, 2-way s.a., latency: 1 cycle to TAG, 2 cycles to TAG+Data
L2 cache: 16 Mbytes, 64 banks, 256-Kbyte banks, 4-way, latency: 5 cycles to TAG, 8 cycles to TAG+Data
Block Size: 64 bytes
NoC configuration: Partial 2D Mesh Network, flit size: 128 bits, switch latency: 1 cycle, link latency: 1 cycle
Main Memory: 2 GByte, 300 cycles latency

V. EXPERIMENTAL RESULTS

We simulate the execution of ocean, barnes, radix, radiosity and raytrace from the SPLASH-2 suite, and blackscholes, bodytrack, canneal, streamcluster and swaptions from the PARSEC 2.0 suite. We also run a set of applications from the SPEC2K suite (mcf, bzip2, art and parser, each loaded twice in order to have eight processes). We choose the Cycles-per-Instruction (CPI) as the reference performance indicator, and the bytes-per-instruction as the reference NoC traffic indicator. We also propose a detailed evaluation of both static and dynamic energy dissipation.

A. Performance Evaluation

Figure 3 shows the normalized CPI for the considered benchmarks. As observed, for the adopted configuration, the D-NUCA performs better than the S-NUCA for the considered applications, with the exception of radiosity, radix and streamcluster, for which the S-NUCA outperforms the D-NUCA by up to 5%. On average, adopting the dynamic mapping policy leads to a performance increase of about 4%.

Figure 3 Normalized CPI

It is worth noting that for three applications the D-NUCA performance is better than the S-NUCA by a noticeable percentage: raytrace (+9%), blackscholes (+10%) and spec2K (+9%). This effect is due to the combination of different factors. As shown in Figure 4, the D-NUCA hit distribution for these applications is strongly unbalanced, resulting in 70-80% of total hits in ways number 1 and 8, which are the closest to the processor nodes. This hit distribution has a direct impact on the L2 hit latency, as shown in Figure 5: the NUCA response time, in case of an L2 hit, is reduced by up to 20% for blackscholes. This is a consequence of the migration mechanism, which is able to move the most frequently accessed blocks near the referring processors. We observe that other applications present an excellent reduction in the L2 hit latency, thanks to the changes of the hit distribution among NUCA banks: ocean, bodytrack and streamcluster.
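The mechanics behind these effects can be made concrete in a few lines: how blocks map to banks and banksets (Section II, with the 64-bank, 8-column, 64-byte-block configuration of Table I), and how messages split into Table I's 128-bit flits. The helper names and the bit arithmetic below are ours, an illustrative sketch rather than the simulated implementation.

```python
# Illustrative sketch of the Section II mapping policies and the flit
# accounting, for the Table I configuration; helper names are ours.
NUM_BANKS, NUM_COLS, BLOCK_BITS, FLIT_BITS = 64, 8, 6, 128   # 2^6 = 64 B blocks

def snuca_bank(addr):
    """S-NUCA: low-order bits of the index field statically pick one bank."""
    return (addr >> BLOCK_BITS) % NUM_BANKS

def dnuca_bankset(addr):
    """D-NUCA: the same policy picks a bankset, i.e. one bank column."""
    return (addr >> BLOCK_BITS) % NUM_COLS

def dnuca_search_targets(addr):
    """Broadcast search: an L1-to-L2 request goes to every bank of the
    block's bankset, since migration may have moved the block anywhere
    along its column."""
    col = dnuca_bankset(addr)
    return [row * NUM_COLS + col for row in range(NUM_BANKS // NUM_COLS)]

def flits_needed(msg_bytes):
    """Flits per message: ceiling of message bits over flit width."""
    return -(-msg_bytes * 8 // FLIT_BITS)

assert snuca_bank(0x40) == 1                  # adjacent blocks interleave
assert dnuca_search_targets(0x40) == [1, 9, 17, 25, 33, 41, 49, 57]
assert flits_needed(8) == 1                   # 8-byte control message
assert flits_needed(72) == 5                  # 72-byte data message (576 bits)
```

Under these assumptions, a D-NUCA lookup costs up to eight single-flit control messages (one per bank of the bankset) before a single five-flit data reply is produced, which is consistent with the growth of the control component of the NoC traffic discussed later for Figure 8.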
However, such benchmarks present little performance improvement (ocean, bodytrack), or even degradation (streamcluster), due to the strong increase in the number of L2 misses (ocean and streamcluster) and/or to the small number of L2 accesses that is a consequence of a reduced L1 miss rate (less than 0.5% for bodytrack and streamcluster). L2 miss rate and L1 miss rate are reported in Figure 6 and Figure 7.

Figure 4 Hit distribution. The label of each series (Way 1, ..., Way 8) is chosen according to the reference architecture of Figure 1.

Other applications, like barnes and canneal, present a moderate L2 hit latency improvement (6% and 9%, respectively) and a slight L2 miss rate degradation. Consequently, the CPI improvement is also limited. This is due to the fact that such applications, similarly to radiosity and radix, present a very high number of blocks accessed in a shared fashion by all the threads of the application. In other words, when adopting the migration mechanism, blocks never succeed in being stored in the low-latency ways (Way 1 and Way 8), since they are alternatively accessed by processor nodes that stay at opposite sides of the shared NUCA, each alternatively moving the shared blocks in opposite directions. This is a well-known phenomenon of D-NUCA caches, called ping-pong, whose impact on performance can be mitigated by adopting more sophisticated techniques such as a replication mechanism [18].

Figure 5 Normalized average L2 hit latency

Figure 6 L2 miss rate

Figure 7 L1 miss rate

Figure 8 Total NoC traffic

Figure 8 shows the total amount of NoC traffic. In particular, the figure highlights how the control and data traffic impact the overall traffic generated during the execution. In our implementations, the control message size is 8 bytes (header only), while the data message size is 72 bytes (8 bytes for the header, 64 bytes for the data). The total bandwidth utilization increases for all the applications when moving from S-NUCA to D-NUCA. In detail, the control message component of the NoC traffic strongly increases, whereas the data message component decreases or stays constant. The increase in the control component of the NoC traffic is due to the extra network traffic needed by the multicast search mechanism and to the migration mechanism, although

to a lesser extent. The reduction of the path length of data messages from L2 banks to the cores is also effective in reducing the corresponding component of the total NoC bandwidth utilization, for those applications that are able to take advantage of the migration mechanism. In fact, as most of the accesses are to shared blocks, and considering that the ping-pong effect on the access pattern does not let such blocks be stored in faster banks, the number of hops (i.e. the path length) data messages have to traverse to be delivered is not reduced much; this prevents the data component of the total NoC traffic from being reduced in all the considered cases, and it even increases for some benchmarks (e.g. canneal).

B. Evaluation of the energy dissipation

Figure 9 shows the total energy consumption in a single run for all the considered applications and for the different NUCA architectures. First, we notice that for every architecture the dominating component is the static energy; the dynamic consumption varies from about 0,6%, in the case of streamcluster with S-NUCA, to about 11% for radiosity with D-NUCA. The high impact of the static component over the dynamic one can be explained by considering the low number of operations performed in L2 and off-chip with respect to the total application runtime, as a consequence of the high L1 hit rate. If we consider the applications that take advantage of the adoption of the migration mechanism (like raytrace or the spec2K workload), we observe that the dynamic component, even if it does not decrease or even increases, is always overwhelmed by the reduction of the static component. This is because static energy savings derive from the shorter execution time, which we can see for example in the reduction of the CPI. In fact, if we consider applications (like radix) whose performance is degraded when adopting the D-NUCA, we observe that the D-NUCA dissipates more static energy because the execution time is higher with respect to the S-NUCA. We also observe that ocean slightly reduces the static component, but the total energy budget grows because the dynamic component increases.

In Figures 10 and 11 we show the static and dynamic consumption breakdown, respectively, for all the configurations and applications. As for the static energy, we observe that the most important component of leakage power comes from the cache banks, whereas the contribution of switches is very low, about 15%. The small switch component of the static energy comes from the adoption of simple NoC switches, which also rely on small input and output buffers (1-flit buffers). The cache banks component is dominant, and is the most influenced by the CPI reduction: the lower the execution time, the higher the reduction of the energy dissipation. It is interesting to analyze the breakdown of the dynamic energy dissipation. Figure 11 shows that the most relevant dynamic component is the switch activity in the forwarding process. The contribution of the wires is always very small, while the energy consumed in the accesses to NUCA banks is higher in the case of the D-NUCA, because the mechanisms introduced in this architecture increase the number of requests to L2 banks. From the point of view of the dynamic energy consumption, we observe that D-NUCAs are the most energy-hungry configurations: this is due to the increase of the NoC traffic, which is a consequence of all the extra messages exchanged (migration protocols, multicast block search) among network nodes. In any case, thanks to the speed-up of the execution time, the increase in the dynamic component does not cancel the total energy saving achieved by the reduction of the static energy component.

VI. RELATED WORK

The NUCA cache architecture was firstly proposed in [10] as a viable solution to mitigate the effects of wire delays [6] in last-level cache memories, showing that NUCA structures outperform a traditional Uniform Cache Architecture (UCA) of the same size and manufacturing technology. Subsequent works have extended this design in order to optimize efficiency [6][13] and performance [6], or have proposed shared NUCA designs in the context of multiprocessors [12][13][14][32][33]. While some of those works also take into account some energy-related aspects, none of them has considered power and energy consumption as a main design issue, and none of them exposes an explicit energy model suitable to be used with NUCA cache memories.

Nevertheless, a major concern in modern microprocessor design is the rise of total power consumption, a large portion of which, for CMOS processes at 65 nm and below, is due to static

Figure 9 Total energy (static + dynamic)

Figure 10 Static energy
Figure 11 Dynamic energy

power dissipated by leakage currents, in particular by large SRAM structures like the ones implemented in last-level on-chip caches [34]. In this context, techniques to reduce NUCA power consumption have been proposed in [19][20].

An energy model for S-NUCA and D-NUCA caches has been proposed in [22]: the cache is decomposed into various elements (banks, links, network switches) and the static and dynamic energy consumption of each element is determined by referring to pre-existing models and tools. In the same work, an analysis of the balance between static and dynamic power for NUCA architectures and an energy/performance trade-off evaluation for NUCA, including a comparison with UCA, are also given, concluding that the static components dominate energy dissipation in NUCA designs, making the reduction of static power consumption a substantial issue for NUCA caches in deep sub-micron CMOS processes. The energy model is presented only for single-core systems; however, some works [23][24][25] have proposed the use of that model also for CMP NUCA systems, showing the relevance of the need for a CMP NUCA energy model. In [21] a model for NUCA interbank network elements is devised.

The energy model we propose extends the one proposed in [22], being applicable to CMP NUCA systems and taking into account specific CMP system aspects, such as coherence protocol traffic in the cache memory. Besides, it proposes an improved methodology based on validated tools.

VII. CONCLUSIONS

NUCA caches are a promising solution for tolerating the negative effects of wire delay on performance. In addition to performance, it is important to take into account also energy aspects when dealing with such architectures. This can be done by adopting the model proposed in this paper, which takes into account the cache physical organization as well as its usage for data access and coherence management, and the off-chip access component. The model can be used in a simple methodology based on existing tools.

The application of the proposed model and methodology to reference S-NUCA and D-NUCA architectures has shown that the energy consumption is always dominated by the static component. Consequently, being the static consumption tied to the execution time, the overall energy consumption can benefit from any technique aimed at enhancing performance, although it will turn into an increase of the dynamic energy consumption.

ACKNOWLEDGMENTS

This work has been partially supported by the PRIN 2008 project "Design methodologies for high performance systems for scientific and multimedia applications" (contract no. 200855LRP2_001) and by the FIRB project "PHOTONICA - Photonic Interconnect Technology for Chip Multiprocessing Architectures" (contract no. RBFR08LE6V), both funded by the Italian MIUR. This work has been also partially supported by the Spanish CICYT under Grant TIN2009-14475-C04-01.

REFERENCES

[1] R. Kumar and G. Hinton, "A Family of 45nm IA Processors", Proc. of the 56th Int.l Solid State Circuits Conference (ISSCC), Feb. 2009.
[2] N. A. Kurd, S. Bhamidipati, C. Mozak, J. L. Miller, T. M. Wilson, P. Mosalikanti, A. M. El-Husseini, M. Neidengard, R. E. Aly, M. Nemani, M. Chowdhury, R. Kumar, "A Family of 32nm IA Processors", IEEE Journal of Solid-State Circuits, Vol. 46(1), pp. 119-130, Dec. 2010.
[3] R. Agny, E. DeLano, M. Kumar, M. Nachimutu and R. Shiveley, "The Intel Itanium Processor 9300 Series", Intel White Paper, 2010.

[4] J. Casazza, "Intel Core i7-800 Processor Series and the Intel Core i5-700 Processor Series Based on Intel Microarchitecture (Nehalem)", Intel White Paper, 2009.
[5] G. Moore, "Excerpts from A Conversation with Gordon Moore: Moore's Law", Intel Transcript, Intel.com.
[6] V. Agarwal, M. S. Hrishikesh, S. W. Keckler and D. Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures", Proc. of the 27th Int. Symp. on Computer Architecture (ISCA), June 2000.
[7] R. Ho, K. W. Mai and M. A. Horowitz, "The future of wires", Proceedings of the IEEE, Vol. 89(4), pp. 490-504, Apr. 2001.
[8] R. Ho and M. A. Horowitz, "On Chip Wires, Scaling and Efficiency", PhD Thesis, Aug. 2003.
[9] D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer, Vol. 30(9), pp. 37-39, Sept. 1997.
[10] C. Kim, D. Burger and S. W. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches", Proc. of the 10th Int.l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 2002.
[11] A. Bardine, P. Foglia, C. A. Prete and M. Solinas, "NUMA Caches", Encyclopedia of Parallel Computing, D. Padua Ed., Springer, June 2011.
[12] J. Hu, C. Kim, H. Shafi, L. Zhang, D. Burger and S. W. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing", IEEE Transactions on Parallel and Distributed Systems, Vol. 18(8), pp. 1028-1040, Aug. 2007.
[13] Z. Chishti, M. D. Powell and T. N. Vijaykumar, "Optimizing Replication, Communication, and Capacity Allocation in CMPs", Proc. of the 32nd Int.l Symp. on Computer Architecture (ISCA), June 2005.
[14] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches", Proc. of the 37th Int.l Symp. on Microarchitecture (MICRO), Portland, pp. 319-330, Dec. 2004.
[15] A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli and C. A. Prete, "Impact of On-Chip Network Parameters on NUCA Cache Performance", IET Computers & Digital Techniques, Vol. 3(5), pp. 501-512, Aug. 2009.
[16] J. Merino, V. Puente and J. A. Gregorio, "ESP-NUCA: A Low-cost Adaptive Non-Uniform Cache Architecture", Proc. of the 16th Int.l Symp. on High-Performance Computer Architecture (HPCA), Jan. 2010.
[17] S. Bartolini, P. Foglia, C. A. Prete and M. Solinas, "Feedback Driven Restructuring of Multi-Threaded Applications for NUCA Cache Performance in CMPs", Proc. of the 22nd Int.l Symp. on Computer Architecture and High Performance Computing (IEEE SBAC-PAD 2010), Oct. 2010.
[18] P. Foglia, G. Monni, C. A. Prete and M. Solinas, "Re-NUCA: Boosting CMP performances through block replication", Proc. of the 13th Euromicro Conf. on Digital System Design, Architectures, Methods and Tools (DSD 2010), Sept. 2010.
[19] A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli and C. A. Prete, "Way-Adaptable D-NUCA Caches", Int.l J. of High Performance Systems Architecture, Vol. 2(3), pp. 215-228, Inderscience, Aug. 2010.
[20] P. Foglia, D. Mangano and C. A. Prete, "A NUCA Model for Embedded Systems Cache Design", IEEE 2005 Workshop on Embedded Systems for Real-Time Multimedia (ESTIMEDIA), New York Metropolitan Area, USA, pp. 41-46, Sept. 2005.
[21] N. Muralimanohar and R. Balasubramonian, "Interconnect Design
[24] "in Non-Uniform Cache Architectures", Proc. of the 24th ACM Int.l Conf. on Supercomputing (ICS 10), ACM, New York, NY, USA, pp. 37-47, 2010.
[25] J. Lira, C. Molina and A. González, "HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors", Proc. of the 25th IEEE Int. Parallel & Distributed Processing Symp., May 16-20, 2011.
[26] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, Elsevier, 2003.
[27] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, San Francisco, CA, Morgan Kaufmann, Elsevier, 2004.
[28] K. Gharachorloo, M. Sharma, S. Steely and S. Van Doren, "Architecture and Design of AlphaServer GS320", Proc. of the 9th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, USA, pp. 13-24, Nov. 2000.
[29] P. Foglia, F. Panicucci, C. A. Prete and M. Solinas, "An evaluation of behaviors of S-NUCA CMPs running scientific workload", Proc. of the 12th Euromicro Conf. on Digital System Design, Patras, Greece, Aug. 2009.
[30] Q. Lu, U. Bondhugula, S. Krishnamoorthy, P. Sadayappan, J. Ramanujam, Y. Chen, H. Lin and T. Ngai, "A compile-time data locality optimization framework for NUCA chip multiprocessors", Technical Report OSU-CISRC-6/08-TR29.
[31] P. Foglia, F. Panicucci, C. A. Prete and M. Solinas, "Analysis of performance dependencies in NUCA-based CMP systems", Proc. of the 21st Int.l Symp. on Computer Architecture and High Performance Computing, Sao Paulo, Brazil, pp. 49-56, Oct. 2009.
[32] J. Chang and G. S. Sohi, "Cooperative Caching for Chip Multiprocessors", Proc. of the 33rd Annual Int. Symp. on Computer Architecture, Boston, MA, USA, pp. 264-276, June 2006.
[33] B. M. Beckmann, M. R. Marty and D. A. Wood, "ASR: Adaptive Selective Replication for CMP Caches", Proc. of the 39th Annual IEEE/ACM Int. Symp. on Microarchitecture, Orlando, FL, pp. 443-454, Dec. 2006.
[34] N. S. Kim, et al., "Leakage Current: Moore's Law Meets Static Power", IEEE Computer, Vol. 36(12), pp. 68-75, Dec. 2003.
[35] Virtutech Simics: http://www.virtutech.com
[36] Wisconsin Multifacet GEMS Simulator: http://www.cs.wisc.edu/gems/
[37] S. Thoziyoor, N. Muralimanohar, J. H. Ahn and N. P. Jouppi, "CACTI 5.1", Technical Report, HP Laboratories, Palo Alto, CA, April 2008.
[38] Predictive Technology Model (PTM): http://www.eas.asu.edu/~ptm/
[39] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations", Proc. of the 22nd Int. Symp. on Computer Architecture, S. Margherita Ligure, Italy, pp. 24-36, June 1995.
[40] A. Kahng, B. Li, L.-S. Peh and K. Samadi, "ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration", Proc. of Design Automation and Test in Europe (DATE), Nice, France, April 2009.
[41] C. Bienia, S. Kumar, J. Pal Singh and K. Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications", Proc. of the 17th Int. Conf. on Parallel Architectures and Compilation Techniques, Toronto, Canada, pp. 72-81, Oct. 2008.
[42] Standard Performance Evaluation Corporation: http://www.spec.org/

Considerations for Large NUCA Caches”, Proc. of the 34th Int.l Symp. on [43] N. S. Kim, D. Blaauw, and T. Mudge, “Quantitative Analysis and Computer Architecture (ISCA), pp. 369-380, San Diego, CA, June 2007. Optimization Techniques for On-Chip Cache Leakage Power”, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 13 (10), pp. [22] A. Bardine. P. Foglia, G. Gabrielli and C. A. Prete: “Analysis of Static and 1147-1156, Oct. 2005. Dynamic Energy Consumption in NUCA Caches: Initial Results”, ACM [44] V. Delaluz, et al., “Compiler-Directed Array Interleaving for Reducing SIGARCH & Proceedings of the MEDEA 2007 Worskhop on Memory Energy in Multi-Bank Memories”, Proc. of Asia and South Pacific Design Performance: Dealing with Applications, Systems and Architecture, Automation Conference, pp. 288-293, Bangalore, India, January 2002. Brasov, Romania, pp. 113-120. ACM Press, Sept. 2007 [45] Micron Designline, vol.10 (2), 2001 [23] J. Lira, C. Molina, A. González, "LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors" IEEE Int.l [46] J.Trajkovic, A. Veidenbaum and A.Kejariwal, “Improving SDRAM access Conf. on Computer Design, 2009. ICCD 2009., pp.275-281, 4-7 Oct. 2009. energy efficiency for low-power embedded systems” ACM Trans. Embed. Comput. Syst, Vol 7 (3), May 2008 [24] J. Lira, C. Molina, and A. González. “The auction: optimizing banks usage