Energy Behaviour of NUCA Caches in CMPs
Alessandro Bardine, Pierfrancesco Foglia, Francesco Panicucci, Marco Solinas
Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Pisa, Italy
{a.bardine, foglia, f.panicucci, marco.solinas}@iet.unipi.it

Julio Sahuquillo
Departamento de Ingeniería de Sistemas y Computadores, Universidad Politécnica de Valencia, Valencia, Spain
[email protected]

Abstract—Advances in semiconductor technology nowadays make it possible to design Chip Multiprocessor systems equipped with huge on-chip Last Level Caches. Due to the wire delay problem, the use of traditional cache memories with a uniform access time would result in unacceptable response latencies. The NUCA (Non Uniform Cache Access) architecture has been proposed as a viable solution to hide the adverse impact of wire delay on performance. Many previous studies have focused on the effectiveness of NUCA architectures, but the study of the energy and power aspects of NUCA caches is still limited. In this work, we present an energy model specifically suited for NUCA-based CMP systems, together with a methodology to employ the model to evaluate the NUCA energy consumption. Moreover, we present a performance and energy dissipation analysis for two 8-core CMP systems with an S-NUCA and a D-NUCA, respectively. Experimental results show that, similarly to the monolithic processor case, the static power also dominates the total power budget in the CMP system.

Keywords: CMP, NUCA, energy dissipation, wire delay, power consumption

I. INTRODUCTION

The use of large on-chip Last-Level Caches (LLC) [1][2][3] and multiple homogeneous cores (Chip Multi-Processor, CMP) [3][4] are two of the main features that characterize current microprocessor design. Both of these aspects are a simple yet effective way of leveraging the ever-growing number of transistors available on the die: in fact, continuous manufacturing advances in semiconductor technologies still satisfy the famous Moore's prediction [5]. The feature size shrinking of manufacturing processes reduces transistor switching time, enabling higher clock frequencies for the microprocessor pipeline that, together with growing integration capabilities, translate into performance improvements. While increasing the clock frequency leads to an improvement in processor performance, it also increases power consumption. At the same time, the performance of conventional on-chip caches is jeopardized by the wire delay problem [6][7][8]: as the dimensions of on-chip wires are reduced, the signal propagation delay on such wires grows, so the higher the clock frequency, the higher the number of clock cycles needed to propagate a signal on the on-chip communication lines [9]. Consequently, the access time of large conventional on-chip caches increases due to wire delay, since most of this time is spent routing signals on the interconnections. In order to reduce the effects of long cache latencies on overall system performance, NUCA (Non-Uniform Cache Access) caches have been proposed for both single-core and CMP systems [10][11][12][13][14].

In a NUCA architecture, the total cache space is partitioned into many independent banks, usually interconnected by a switched network. In this model the access latency is proportional to the physical distance of the banks from the requesting element (typically one core or a single cache controller). The mapping between cache lines and banks can be performed either in a static or a dynamic way (namely S-NUCA and D-NUCA): in the former, each line can exclusively reside in a single predetermined bank, while in the latter a line can dynamically migrate from one bank to another. When adopting the dynamic approach, blocks are mapped to a set of banks, called a bankset, and they are able to migrate among the banks belonging to their bankset. The migration mechanism allows data to move towards the most frequently referring CPUs, thus reducing the average cache latency by storing the most frequently accessed blocks in banks close to the referring CPUs.

Various previous works have shown that NUCA-based systems are effective in hiding delay effects on performance [14][15][16][17][18], and several works have shown interest in the energy aspects of such architectures. In [19][20], techniques dealing with the reduction of power consumption are proposed, while in [21] a model for NUCA interbank network elements is devised. In [22], an energy model is defined for a single-core NUCA-based architecture, taking into account both the static and dynamic contributions of NUCA caches. In [23][24][25], the model proposed in [22] is employed to evaluate the energy aspects of solutions proposed in a CMP environment.

This work extends the model proposed in [22], presenting an energy model specifically suited for NUCA-based CMP systems and a methodology to employ the model in order to evaluate NUCA energy consumption. In addition, we propose an analysis of both performance and energy dissipation for two 8-core CMP systems adopting, respectively, an S-NUCA and a D-NUCA as LLC. We show that, similarly to the monolithic processor case [22], also for multicore NUCA systems the static component dominates the total power budget, despite the increase in the number of traffic sources (i.e. many processors accessing the shared LLC) and the communication overhead due to the need of managing coherence.

The rest of this work is organized as follows. Section II describes the reference architecture considered in our study. Section III extensively describes our energy model. Section IV exposes the adopted evaluation methodology, while in Section V performance and energy results are presented and discussed. Section VI presents some related works, and Section VII concludes this paper.

II. REFERENCE ARCHITECTURE

Figure 1 shows the architecture of our reference NUCA CMP system, organized in a dancehall configuration. We consider 8 processors, each with its own private L1 instruction and data caches, and a shared L2 NUCA as LLC. Processor nodes (CPUs with private L1 caches) are placed at two opposite sides of the L2 cache: four at the top and four at the bottom.

Figure 1. The reference architecture. Small squares are the NUCA banks, black circles are NoC switches, and big white squares represent CPU+L1 cache nodes. Bank columns are the banksets.

The NUCA is organized as a matrix of 64 banks, with eight rows and eight columns. All the banks are connected among themselves, and with the processor nodes, via a highly scalable Network-on-Chip (NoC) [26][27], composed of 64 wormhole switches and of the horizontal and vertical links connecting all the nodes, resulting in a partial 2D mesh network topology.

The architecture adopts a directory version of MESI as the baseline coherence protocol, in which the directory is non-blocking [28][29] and distributed. The cache hierarchy is inclusive, so there is no need for a centralized directory, as directory information is held in the TAG field of the L2 copy.

In the case of S-NUCA, we use the interleaved mapping strategy [30] (low-order bits of the index field of the physical address) to statically map each memory block to a single NUCA bank.

In the case of D-NUCA, each bankset is made up of all the banks belonging to the same column, and we adopt the same mapping policy to map each block to a single bankset. Thus, in a D-NUCA a block is able to migrate among all the banks that are part of the bankset the block is mapped to. We consider the migration scheme proposed in [31], which exploits per-hit [10] block migration: a data promotion is triggered whenever a block request from an L1 results in a hit in any of the NUCA banks belonging to the block's bankset. As blocks can be stored in any of the bankset's banks, we assume, as the search policy, that L1-to-L2 requests are broadcast to all the banks of the reference bankset.

III. ENERGY MODEL

A. Main components

The energy model takes into account both the dynamic and static energy dissipated by the cache, and the extra dynamic energy consumed when accessing the lower levels of the memory hierarchy on cache misses. The total energy dissipated by the memory subsystem during the run of an application can be calculated according to the following formula:

E_total = E_dyn + E_static + E_miss_penalty    (1)

where E_dyn and E_static are the dynamic and the static components of the energy dissipated at run time, respectively, and E_miss_penalty is the energy dissipated to solve the cache misses.

The basic elements of a NUCA architecture are memory banks, network links and network switches (see Figure 1), so both the static and dynamic components of the energy can be calculated by taking into account the contribution of each element.

B. Dynamic component

We can calculate the dynamic component of the cache by considering the dynamic contribution of the various elements as follows:

E_dyn = E_bank_dyn + E_link_dyn + E_switch_dyn    (2)

For bank accesses, the model distinguishes read miss, read hit and write operations, as they incur different energy dissipation. So E_bank_dyn can be calculated as follows:

E_bank_dyn = Σ_i (n_read_hit_i · E_read_hit_i + n_read_miss_i · E_read_miss_i + n_write_i · E_write_i + n_dir_read_i · E_dir_read_i + n_dir_write_i · E_dir_write_i)    (3)

where E_read_hit_i, E_read_miss_i, E_write_i, E_dir_read_i and E_dir_write_i are, respectively, the dynamic energies dissipated when accessing bank i in a read hit, a read miss, a write operation and a directory access (either a read or a write to the part of the TAG holding the directory information) related to the coherence protocol, and the n terms are the corresponding numbers of accesses of each type to bank i. As previously exposed, each bank of the NUCA cache is a self-contained traditional UCA cache memory. Thus its energy parameter values can be derived from available existing tools that model these aspects of a cache memory. In particular, by using the CACTI 5.1 tool [37], we obtain the energy consumption for every kind of access.
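As a minimal illustration of how the model composes, the following Python sketch aggregates per-bank access counts and per-access energies into the bank dynamic energy and the total energy, following the structure of equations (1)-(3). All names and numeric values here (e.g. E_READ_HIT, the per-access energies in nJ) are hypothetical placeholders for illustration only, not CACTI outputs or experimental figures from this work.

```python
from dataclasses import dataclass

@dataclass
class BankActivity:
    """Per-bank access counts of each type, gathered during a run."""
    read_hits: int
    read_misses: int
    writes: int
    dir_reads: int
    dir_writes: int

# Hypothetical per-access energies in nJ (placeholders, not CACTI output).
E_READ_HIT, E_READ_MISS, E_WRITE, E_DIR_READ, E_DIR_WRITE = 0.30, 0.18, 0.32, 0.05, 0.06

def bank_dynamic_energy(banks):
    """Eq. (3): sum over banks of (access count x per-access energy) per access type."""
    return sum(
        b.read_hits * E_READ_HIT
        + b.read_misses * E_READ_MISS
        + b.writes * E_WRITE
        + b.dir_reads * E_DIR_READ
        + b.dir_writes * E_DIR_WRITE
        for b in banks
    )

def total_energy(banks, e_link_dyn, e_switch_dyn, e_static, e_miss_penalty):
    """Eqs. (1) and (2): dynamic (banks + links + switches) + static + miss penalty."""
    e_dyn = bank_dynamic_energy(banks) + e_link_dyn + e_switch_dyn
    return e_dyn + e_static + e_miss_penalty

# Example: 64 identically loaded banks, as in the 8x8 NUCA of Figure 1.
banks = [BankActivity(1000, 50, 400, 120, 80) for _ in range(64)]
print(total_energy(banks, e_link_dyn=5.0, e_switch_dyn=7.5, e_static=900.0, e_miss_penalty=40.0))
```

In a real evaluation the access counts would come from the simulator and the per-access energies from CACTI; the sketch only shows how the three components of equation (1) are assembled.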