
Heterogeneous Architecture Design with Emerging 3D and Non-Volatile Memory Technologies

Qiaosha Zou∗, Matthew Poremba∗, Rui He†, Wei Yang†, Junfeng Zhao†, Yuan Xie‡
∗ Computer Science and Engineering, The Pennsylvania State University, USA
† Huawei Shannon Lab, China
‡ Electrical and Computer Engineering, University of California, Santa Barbara, USA
Email: ∗{qszou, mrp5060}@cse.psu.edu, †{ray.herui, william.yangwei, junfeng.zhao}@huawei.com, ‡[email protected]

Abstract—Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, the increasing power density is the obstacle to continuous performance gains. Recent studies show that the heterogeneous multi-core is a competitive and promising solution to optimize energy efficiency. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of the heterogeneous system and the potential benefits towards future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration on heterogeneous architectures. With 3D die stacking, disparate technologies can be integrated on the same chip, such as CMOS logic and emerging non-volatile memory, enabling a new paradigm of architecture design.

I. INTRODUCTION

As Moore's law predicted, the number of transistors doubles every 18 months. However, performance has not scaled exponentially because it is restricted by the mismatch between the scaling speed of power consumption and that of memory bandwidth. Furthermore, in today's computing systems, energy efficiency has become the primary concern during system design. The traditional scale-out strategy of packing more cores into a single chip is no longer power sustainable. The dark silicon concept emerged as a 2× shortfall occurs when powering a chip at its native frequency [1]. Therefore, the utilization rate of cores in the same chip area drops exponentially across generations. Fortunately, the single-chip heterogeneous multi-core is one potential solution to balance power consumption and performance enhancement. The heterogeneous multi-core combines conventional processors with emerging computing fabrics, such as GPGPUs and NVMs.

Meanwhile, the slowly growing pin count poses another challenge for both homogeneous and heterogeneous systems. Compared to the exponential growth rate of transistors, pin counts increase only linearly, resulting in scarce bandwidth resources. Moreover, because of the application variety and the blooming of media and streaming processing, bandwidth is a crucial factor for performance and throughput. Expanding chips in the third dimension with three-dimensional integrated circuits (3D ICs) can alleviate the pin count limitation with low latency, high bandwidth vertical connections. Thanks to the processing isolation between chips, 3D ICs further accelerate the adoption of heterogeneous systems in a cost-efficient fashion. Figure 1 shows a sketch of a future 3D computing system in which layers of CPUs, GPUs, and accelerators are stacked with on-chip memory, and hybrid memories are placed by the side of the computing fabrics on an interposer [2, 3]. TSVs (through-silicon vias) are used as the vertical connections between tiers, providing high bandwidth, low latency interconnects.

Fig. 1. Overview of future 3D computing system combining CPUs, GPUs, accelerators, on-chip memory, and off-chip hybrid memory.

In the 3D heterogeneous system, digital and analog fabrics can be integrated in a single chip while each component can be optimized separately with its own optimal clock frequency, supply voltage, and even technology node. In the digital part, several layers of CPUs are built containing multiple cores with diverse computing capabilities and pipeline depths for power efficiency. Tiers of GPUs are stacked to provide massive processing elements for high parallelism. Numerous accelerators dedicated to target applications are designed by requirement and integrated at low cost. Moreover, on-chip memory is stacked to satisfy the increasing memory bandwidth demand through short-latency TSV arrays. Off-chip hybrid memory moves into closer proximity with the processing chip thanks to the interposer. The CPUs and GPUs connect with the off-chip memory through specific memory interfaces and high bandwidth on-chip interconnects.

In this paper, three major heterogeneous integration strategies in the multi-core field are summarized.

(Zou, Poremba, and Xie were supported in part by NSF 1218867, 1213052, 1409798, and 1017277 and SRC grants.)


The first one is the single-ISA heterogeneous multi-core architecture in Section II. Specifically, we focus on the integration of cores of different technology nodes utilizing 3D stacking. This kind of heterogeneity can maximize performance within a given cost and power budget. The second comes with the most popular heterogeneous-ISA architecture, which integrates conventional CPUs and other unconventional processing elements (GPUs, FPGAs, and accelerators). We mainly focus on the integration of CPU and GPU in Section III. The performance of both latency-aware and throughput-aware applications can benefit from this integration. This strategy is advocated by numerous industrial products, such as AMD Fusion, Intel Sandy Bridge, and NVIDIA Tegra [4, 5, 6]. The last one, in Section IV, is the combination of memories built from different materials, namely the integration of traditional DRAM/SRAM technology and the emerging non-volatile memory (NVM). Leveraging the low standby power of NVMs and the short access latency of traditional memories, the memory bandwidth can be guaranteed with affordable power consumption.

II. SINGLE-ISA HETEROGENEOUS MULTI-CORE ARCHITECTURES

Applications nowadays show enormous diversity in their demands for computing resources. Furthermore, the requirement varies across execution phases even within the same program. Therefore, providing a uniform multi-core architecture with general-purpose computing capability is over-provisioned and stresses the power budget. Instead of packing more and more powerful cores, designers are seeking solutions that use cores with various computing capabilities to provide just enough service. The most efficient design without involving much re-design effort is the single-ISA (instruction set architecture) multi-core architecture, which keeps the same ISA across all cores and varies only the computing resources.

The prevalent single-ISA architecture design is the combination of high performance cores (big cores) and low power cores (small cores). Big cores have higher performance to handle compute-intensive and latency-sensitive applications at the cost of higher power consumption. Small cores are simple, low power processors (i.e., in-order processors) with lower throughput; however, they are more energy-efficient for applications that are memory intensive. The performance and power modeling of this multi-core architecture was first conducted in an exploration on Alpha cores [7]. The researchers examined the energy saving and performance degradation of SPEC benchmarks on an architecture containing four cores: two in-order small cores and two out-of-order big cores. The results show that a 39% average energy reduction is achieved with only 3% performance degradation. The energy saving is even better than applying dynamic voltage/frequency scaling.

The idea of big core and small core integration has been promoted by industry, as ARM announced its heterogeneous multi-core named big.LITTLE [8]. In this design, the Cortex-A15, a triple-issue out-of-order processor, works as the big core, while the Cortex-A7, an in-order, non-symmetric dual-issue processor, is the little core. In general, the Cortex-A15 has 2-3× higher performance than the A7; however, the A7 is about 3-4× more energy efficient due to the different pipeline lengths. Because of the varied energy and performance characteristics in the heterogeneous system, scheduling applications onto the appropriate core is crucial. Moreover, since the application requirement changes along the execution phases, dynamic application mapping and migration are necessary. Recently, substantial research efforts have been made to accurately model application performance on different cores [9] and to conduct application mapping/scheduling and migration [10, 11, 12].

An analytical performance model is used to study the fundamental design tradeoffs of single-ISA heterogeneous multi-cores [9]. In addition to the whole-system throughput, which was the focus of previous work, per-program performance is also considered as one of the design metrics. From the determination of the frontier of Pareto-optimal architectures, it is found that there is no optimal configuration in a heterogeneous system that can balance per-program performance and system throughput. Fundamentally, single-program performance is traded for system throughput. Moreover, the effectiveness of heterogeneity depends heavily on the job mapping.

In general, task scheduling can be performed statically or dynamically. The application characteristics can be extracted before execution and used to determine the suitable core statically [13]. For dynamic application mapping, characteristics of tasks and processors, such as power, utilization, and bandwidth requirement, are monitored. The appropriate core is then selected for better speedup or power-efficiency. For instance, the balance of power and performance guides the task-to-core mapping utilizing price theory [10]. The application characteristics vary over time; thus, a predictive trace-based controller predicts the upcoming phase and migrates the execution accordingly [11]. The core utilization affects the system performance and energy; thus, a utilization-based load balancing, demonstrated on Linux, is proved to reduce energy by up to 11.35% [12].
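The utilization-driven mapping just described can be illustrated with a minimal Python sketch: each task is steered to a big or small core based on a monitored utilization metric, and the resulting power is estimated. This is only an illustration of the policy style used in [10, 11, 12]; the threshold, the per-core power numbers, and the task fields are hypothetical placeholders, not values reported in those studies.

# Minimal sketch of utilization-driven big/small core mapping (Section II).
# Threshold and per-core power numbers are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu_utilization: float   # fraction of time spent computing (vs. stalled on memory)

BIG_CORE_POWER_W = 2.0       # assumed per-core power, not measured values
SMALL_CORE_POWER_W = 0.5
UTIL_THRESHOLD = 0.6         # hypothetical cut-off between compute- and memory-bound

def map_task(task: Task) -> str:
    """Map compute-bound tasks to the big core, memory-bound tasks to the small core."""
    return "big" if task.cpu_utilization > UTIL_THRESHOLD else "small"

def schedule(tasks):
    placement = {t.name: map_task(t) for t in tasks}
    power = sum(BIG_CORE_POWER_W if core == "big" else SMALL_CORE_POWER_W
                for core in placement.values())
    return placement, power

if __name__ == "__main__":
    tasks = [Task("fft", 0.85), Task("graph_walk", 0.30), Task("stream_copy", 0.20)]
    placement, power = schedule(tasks)
    print(placement)                      # e.g. {'fft': 'big', 'graph_walk': 'small', ...}
    print(f"estimated power: {power:.1f} W")

A real controller would, as in [11, 12], refresh the utilization samples periodically and migrate running tasks when their phase changes, rather than deciding once at launch.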
In addition to the basic integration of big cores and small cores, heterogeneous multi-core design with different technology nodes is another possible solution for power efficient architecture. Moreover, this kind of design can be cost-efficient, as we can reduce cost by integrating components from earlier and mature fabrication processes. As shown in Figure 2, at the beginning of a new technology node, the cost is extremely high compared to older technologies. However, the new technology guarantees integration density and transistor speed, driving the industry to adopt it. On the other hand, in addition to the cost, the ever growing power density and leakage power at smaller feature sizes make the previous technologies appealing. Therefore, instead of building all the cores with the same technology, we can have some cores using the older technologies and use the price gap to integrate more cores. For example, if the transistor cost in technology N is four times the cost in technology N-1, we can integrate up to four cores at node N-1 for the same cost as integrating only one core at node N.
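The cost argument above can be made concrete with a small calculation. The sketch below takes normalized per-core costs for two nodes (the 4:1 ratio used in the example) and a hypothetical per-core performance ratio, then reports how many older-node cores a fixed silicon-cost budget buys and the resulting ideal aggregate throughput. The performance ratio and the budget are illustrative assumptions, not measured data.

# Illustration of the technology-node cost tradeoff from Section II.
# The 4:1 cost ratio follows the example in the text; the per-core performance
# ratio and the cost budget are hypothetical assumptions for illustration.
COST_PER_CORE = {"node_N": 4.0, "node_N-1": 1.0}    # normalized silicon cost per core
PERF_PER_CORE = {"node_N": 1.0, "node_N-1": 0.7}    # assumed relative single-core speed

def cores_within_budget(node: str, budget: float) -> int:
    """How many cores of a given node fit into a fixed silicon-cost budget."""
    return int(budget // COST_PER_CORE[node])

def aggregate_throughput(node: str, budget: float) -> float:
    """Ideal aggregate throughput, assuming perfectly parallel work."""
    return cores_within_budget(node, budget) * PERF_PER_CORE[node]

if __name__ == "__main__":
    budget = 4.0   # enough for exactly one node-N core
    for node in COST_PER_CORE:
        print(f"{node}: {cores_within_budget(node, budget)} core(s), "
              f"throughput {aggregate_throughput(node, budget):.1f}x")

Under these assumptions the four older-node cores deliver higher aggregate throughput than the single newer-node core, provided the workload has enough parallelism; the tradeoff reverses for serial, latency-sensitive work.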

For cores in different technology generations, the clock frequency and power supply become the major challenges, since the intrinsic threshold voltage of the corresponding technology node [14] determines the minimum supply voltage and maximum clock frequency. Therefore, building cores with different technology nodes on a single die requires sophisticated power and clock designs. Fortunately, the emerging 3D integration can solve this problem by constructing the cores on different dies [15]. With 3D integration, each die can have its own power domain and clock network without interfering with the others across dies, and the interface between power domains can be simplified. Moreover, because each die is designed and processed separately, designers can focus on their own domain and different companies can cooperate easily.

Fig. 2. The normalized transistor cost of different technology nodes (40nm, 28nm, and 20nm) over time, from 1Q10 to 3Q14 [14].

III. HETEROGENEOUS ARCHITECTURES CONTAINING CPUS AND GPUS

Energy efficiency is forcing designers to exploit new architectures with sustainable performance per watt. In the previous section, the single-ISA heterogeneous multi-core architecture was illustrated. However, single-ISA heterogeneity still has the limitation that all the cores are only good at handling latency-aware applications. Therefore, recent research has found that combining the advantages of CPU and GPU can sustain the performance improvement of both throughput-aware and latency-aware applications in an energy efficient way [16]. The CPU is powerful for sequential data computing and low latency processing, while the GPU provides higher throughput for massively parallel applications.

The general purpose GPU was first brought out for throughput-oriented graphics applications because it contains a large number of processing elements. These elements can operate in parallel on simple instruction types. However, the GPU cannot yet be a substitute for the CPU, because the operation frequency of GPUs is much lower than that of CPUs, while the CPU with its higher frequency can handle complex instructions and latency-sensitive work at the highest priority. For example, the clock rate of the Intel Core i7 is about 3GHz while the frequency of a recent GPU is 1300MHz [17, 18]. Moreover, even though the GPU provides higher throughput, for applications that have little parallelism there is no significant margin for performance gain.

Due to the distinct bandwidth requirements, CPUs and GPUs usually use different memories. When combining CPUs and GPUs for cooperation, the data sharing and data movements become the bottleneck for single application performance improvement. In both serial and parallel sections, the necessary data exchanges between CPUs and GPUs go through an interconnect, PCIe for example. However, the bandwidth of PCIe (about 20GB/s [19]) is dramatically smaller than the memory bandwidth of the GPU (more than 200GB/s), resulting in large data sharing latency.

Fortunately, integrated CPUs and GPUs with shared memory have been proposed. The extensive data duplications between CPU and GPU are eliminated thanks to the support of a unified memory address space. Nevertheless, the introduction of the unified memory space burdens the shared memory bandwidth and the memory request scheduling. CPUs are latency sensitive, with little tolerance for memory latency. Even though GPUs are designed to hide the long memory latency through high thread level parallelism, they occupy the memory bandwidth for relatively long periods. The integration therefore causes bandwidth loss due to sharing, and the performance is degraded.

There are two directions for solving the shared bandwidth problem. The first is to provide adequate bandwidth resources for both CPUs and GPUs. The 3D stacked memory is one competitive solution, as demonstrated by the Hybrid Memory Cube (HMC) [20] and High Bandwidth Memory (HBM) [21]. According to the specifications, HMC can provide up to 320 GB/s of bandwidth, while HBM, dedicated to graphics applications, provides 256 GB/s of bandwidth.

Another direction is to intelligently schedule the memory requests for fair and efficient bandwidth sharing [22, 23, 24]. The bandwidth pressure can be relieved at either the cache level or the main memory level. Staged memory scheduling decouples the scheduling task into three stages: the first stage groups requests based on row-buffer locality, the second stage schedules the inter-application requests, and the last stage deals with DRAM commands and timing operations [22]. Unlike previous studies that focus on single parallel application scenarios, the memory scheduling in [24] targets multiple co-running applications whose behaviors are dramatically different, and the proposed optimization techniques improve row-buffer locality and enhance the overall throughput by up to 8%. At the cache level, the last level cache is effective for GPUs only when latency hiding through multi-threading fails, so it is not always necessary to increase the cache hit rate of GPGPUs; consequently, a core sampling mechanism is proposed to detect this behavior, and two new thread-level parallelism aware cache management strategies share the cache capacity accordingly [23]. Coherence requests can also consume the available bandwidth, and traditional directory-based coherence design can be even worse in the heterogeneous scenario. A region directory replaces the traditional directory, region permission buffers are added to the L2 caches to track the coherent requests for whole regions, and the incoherent requests are moved onto a direct-access bus; as a result, the bandwidth to the directory is reduced by an average of 94% [25].

In addition to the bandwidth allocation, another challenging task is how to efficiently execute tasks on such a system, including the programming model and task scheduling from both the software and hardware perspectives [26, 27, 28, 29]. In the heterogeneous platform, the ISA and functionality of GPUs are fundamentally different from those of general purpose CPUs. Therefore, existing and developing applications must be tailored to fit the new architecture with great effort, given that developers may not be familiar with the new features. OpenCL [26] is emerging as the first open standard for cross-platform, parallel programming of modern architectures. In addition to the open standard, several studies vary the basic programming model for their own specific target systems. An integrated C/C++ programming environment supporting specialized cores is proposed for heterogeneous ISA-based MIMD architectures [27]. An OpenCL-based framework, SnuCL, is proposed to make OpenCL applications portable between compute devices in heterogeneous clusters [29].

Performance modeling is also an interesting topic in heterogeneous systems, as it can predict the scalability at the early design stage. One of the simple analytical models that can be applied is Amdahl's Law, and previous studies have extended Amdahl's Law into the heterogeneous computing era [30, 31]. The future computer will integrate various unconventional computing units (GPGPUs, FPGAs, and ASICs), as suggested by the measurements and predictions in [30]. In addition, the study also shows that sufficient parallelism is the prerequisite for significant performance gains from heterogeneous computing. Moreover, bandwidth is a first-order concern in developing an efficient system.

Two processing modes are available for CPU and GPU integration: asymmetric and simultaneous asymmetric. The first mode divides a program into three segments: serial execution on one CPU, parallel execution on the multi-cores, and parallel execution on the GPUs. The second mode schedules different programs onto CPUs and GPUs, which compute simultaneously. Amdahl's Law is revised to capture the system configuration leading to the optimal speedup [31]. The study is based on a structure in which CPU and GPU share the same memory space. We extend the work to take the data sharing delay into consideration and compare the speedups between unified and separate memory spaces.

Similar to the definitions in previous work [31], the total numbers of CPUs and GPUs are denoted by c and g. However, due to the power constraint, which is measured in units of a single CPU's power consumption, the number of GPUs is limited to (PB − c)/wg, where PB is the power budget and wg is the power ratio between a GPU and a CPU. Assume the portion of the program that can be parallelized is f and the portion of the parallel program running on the multi-cores is α. β is the execution time ratio of GPU to CPU. γ is the data sharing latency normalized to the program computation time on a single CPU. For memory intensive applications, the value of γ can be larger than 1. Note that we assume the latency of data sharing between CPUs is negligible. The theoretical asymmetric speedup is represented as follows:

Speedup = 1 / ((1 − f) + αf/c + γ + (1 − α)f/(gβ))        (1)

Figure 3 shows the speedups for a program running on the CPU+GPU system. We vary the CPU count from 1 to 99 and examine the speedup when γ equals 0, 0.1, 0.5, 1.0, and 2.5. The result curves indicate that under the power constraint, the speedup increases at first with more CPUs; then, after a turning point, integrating more CPUs results in performance degradation because of the reduced parallel components (one CPU's power consumption equals that of four GPUs). Furthermore, when the data sharing delay is sufficiently large, there is no performance benefit from CPU+GPU integration. The significant benefit of a unified memory space is indicated by this result.

Fig. 3. Speedups considering the data sharing delay, for γ = 0, 0.1, 0.5, 1.0, and 2.5. The x-axis (CPU count, 1 to 100) is in logarithmic scale; the y-axis is the speedup (0 to 4). f = 0.7, α = 0.5, β = 0.5, and wg = 0.25.
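The following Python sketch reproduces the model behind Equation (1) and Figure 3: for a given CPU count c, the affordable GPU count is g = (PB − c)/wg, and the speedup follows Equation (1). The parameters match the Figure 3 caption; the power budget PB = 100 (in units of one CPU's power) is an assumption chosen so that c can sweep from 1 to 99.

# Speedup of a power-constrained CPU+GPU system following Equation (1).
# Parameters follow the Fig. 3 caption (f=0.7, alpha=0.5, beta=0.5, w_g=0.25);
# the power budget p_b = 100 is an assumption chosen so c can range from 1 to 99.
def speedup(c, f=0.7, alpha=0.5, beta=0.5, w_g=0.25, p_b=100.0, gamma=0.0):
    g = (p_b - c) / w_g                        # GPUs the remaining power budget affords
    if g <= 0:
        return 0.0
    serial  = 1.0 - f                          # sequential fraction on one CPU
    cpu_par = alpha * f / c                    # parallel fraction run on c CPUs
    gpu_par = (1.0 - alpha) * f / (g * beta)   # parallel fraction run on g GPUs
    return 1.0 / (serial + cpu_par + gamma + gpu_par)

if __name__ == "__main__":
    for gamma in (0.0, 0.1, 0.5, 1.0, 2.5):    # data-sharing delay, normalized
        best_c = max(range(1, 100), key=lambda c: speedup(c, gamma=gamma))
        print(f"gamma={gamma}: best CPU count {best_c}, "
              f"speedup {speedup(best_c, gamma=gamma):.2f}")

Sweeping c with this script shows the same trend described above: the speedup peaks at an intermediate CPU count and then declines as CPUs crowd out GPUs, and for large γ the speedup stays below 1, so the integration brings no benefit without a unified memory space.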

IV. HETEROGENEOUS MEMORY ARCHITECTURES

The memory and storage system is a critical component in various computer systems. More and more applications are shifting from being compute bound to being data bound. A hierarchy of memory and storage components is used to efficiently store and manipulate large amounts of data. However, the performance and energy scaling of main-stream memory technologies cannot catch up with the requirements of current computing systems. The commodity memory technologies, such as SRAM and DRAM, are facing scalability challenges due to the limitations of their device size and power dissipation. In particular, the leakage power of SRAM and DRAM and the refresh power of DRAM are increasing, and they contribute a significant portion of overall system energy [32]. Therefore, an energy efficient memory subsystem is urgently needed to continue the system improvement.

Recently, emerging byte-addressable non-volatile memory technologies have been studied as replacements for traditional memories due to their promising characteristics: higher density, lower leakage power, and non-volatility. The representative technologies include spin-transfer torque memory (STT-RAM), phase-change memory (PCM), and resistive memory (ReRAM) [33, 34, 35]. Nevertheless, several shortcomings of NVMs, such as high write latency/power and low endurance, impede their direct adoption as a full replacement for traditional memories.


Heterogeneous integration of SRAM/DRAM and NVMs is one potential solution to these design challenges. Most NVM technologies are not compatible with the CMOS technology traditionally used to implement digital logic. Consequently, with most types of NVMs, a silicon interposer or 3D stacking is leveraged for the implementation [36]. Layers of different NVM technologies and traditional SRAM/DRAM can be integrated. Moreover, the 3D stacked memory can increase the memory capacity and bandwidth in a cost and energy efficient fashion. For instance, Sun et al. [33] proposed a 3D cache architecture design with STT-RAM as the L2 cache. As demonstrated by the study, the system power is reduced by 70% and performance is moderately improved thanks to the non-volatility of STT-RAM. CMPs are vulnerable to soft errors, which can be eliminated by stacking all levels of the cache hierarchy implemented with STT-RAM; the system performance is improved by 14.5% with a 13.44% power reduction [37]. The non-volatility and soft-error resistivity of NVM also make it attractive for FPGAs. The 3D PCM-based FPGA exhibits advantages over 3D FPGAs with traditional memory technologies in terms of power consumption, wirelength, and critical path delay [38].

Main memory needs to be sufficiently large to hold most of the data in applications. Commodity computer systems leverage DRAM as main memory, which may not be sustainable due to the inherent scalability limits of DRAM. Among various NVMs, PCM is believed to be the best candidate for main memory. However, using PCM alone as main memory has endurance and power problems, as indicated by Lee et al. [34]. Their study shows that a pure PCM-based main memory can be 1.6× slower and consume 2.2× more energy than DRAM-based memory due to the high write latency and energy consumption. In a DRAM and PCM hybrid memory, DRAM can work as a buffer to serve memory writes with low latency, or it can be placed in parallel with PCM to share a portion of the memory requests. Qureshi et al. [39] propose a main memory design with a PCM region and a small DRAM buffer. Their study shows that a 3× speedup can be achieved due to the benefit from both the short latency of DRAM and the high capacity of PCM. Ramos et al. [40] study the energy-delay2 (ED2) of the hybrid system when pages are migrated between DRAM and PCM by monitoring access patterns. Their system is more robust and has lower ED2 through the simulation of 27 workloads (SPEC, SPEC2006, and Stream suites).

Lately, big data applications have emerged as a mainstream workload in datacenter and warehouse-scale computing. Their enormous memory footprint and energy consumption make the DRAM+PCM hybrid memory more attractive. In this section, we preliminarily explore the latency and energy of the two different hybrid memory styles (DRAM as a buffer or as part of main memory) under big data applications. We extract the memory traces of six applications (aggregation, join, kmeans, pagerank, select, terasort) from a big data benchmark [41] and two traditional applications (volrend, radiosity) from SPLASH-2 [42]. The traces are then replayed using the non-volatile memory simulator NVMain [43] for latency and power estimation. The total off-chip memory capacity is kept constant at 4GB and the memory frequency is 800MHz. When the system contains both DRAM and PCM as main memory, the DRAM capacity is 1GB and the PCM capacity is 3GB. When the DRAM is used as a cache, its capacity is 32MB.

Figure 4 shows the average latency and power consumption of four memory configurations: pure DRAM, pure PCM, DRAM+PCM, and DRAM as cache. Due to the long write latency of PCM, pure PCM has the highest latency of the four cases, while pure DRAM obviously has the lowest latency. The latency of the DRAM cache is better in three applications (pagerank, volrend, radiosity), especially in the traditional applications, due to their relatively higher cache hit rates. The largest latency of volrend occurs in the DRAM+PCM case because most of its memory accesses go to PCM, which is also suggested by the power consumption. In the power consumption results, the low hit rate of the DRAM cache in the big data benchmarks leads to high power consumption. Moreover, because the metadata are located in DRAM, the DRAM cache has almost the same level of power consumption as the pure DRAM case. The power consumption of DRAM+PCM is very close to pure PCM because of the relatively small DRAM capacity.

Fig. 4. The comparison of the average latency (in clock cycles) and the average power (in watts) for four memory configurations, pure DRAM, pure PCM, DRAM+PCM (parallel), and DRAM cache, over the benchmarks aggregation, join, kmeans, pagerank, select, terasort, volrend, and radiosity. (a) Average latency. (b) Power consumption.
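A first-order way to see why the two hybrid organizations behave differently is to estimate their average access latency directly: the DRAM-cache style is dominated by the hit rate, while the parallel DRAM+PCM style is dominated by the fraction of accesses served by the PCM region. The Python sketch below computes such estimates; the device latencies, write fraction, and hit rates are hypothetical round numbers, not the NVMain results plotted in Figure 4.

# First-order average-latency estimates for the two hybrid organizations in Section IV.
# Latencies (in controller clock cycles) and hit/placement rates are hypothetical,
# not the NVMain simulation results plotted in Figure 4.
DRAM_LATENCY = 200          # assumed DRAM access latency
PCM_READ_LATENCY = 400      # assumed PCM read latency
PCM_WRITE_LATENCY = 1000    # assumed PCM write latency (the dominant penalty)

def pcm_latency(write_fraction):
    """Average PCM access latency given the fraction of writes in the access stream."""
    return write_fraction * PCM_WRITE_LATENCY + (1 - write_fraction) * PCM_READ_LATENCY

def dram_cache_latency(hit_rate, write_fraction=0.3):
    """DRAM acts as a cache in front of PCM; a miss pays the DRAM probe plus PCM access."""
    return hit_rate * DRAM_LATENCY + (1 - hit_rate) * (DRAM_LATENCY + pcm_latency(write_fraction))

def parallel_latency(dram_fraction, write_fraction=0.3):
    """DRAM and PCM are both main memory; a page lives either in DRAM or in PCM."""
    return dram_fraction * DRAM_LATENCY + (1 - dram_fraction) * pcm_latency(write_fraction)

if __name__ == "__main__":
    # Big-data workloads with poor locality: low cache hit rate, small hot DRAM footprint.
    print(f"DRAM cache, 40% hit rate : {dram_cache_latency(0.4):7.1f} cycles")
    print(f"DRAM cache, 90% hit rate : {dram_cache_latency(0.9):7.1f} cycles")
    print(f"Parallel,   25% in DRAM  : {parallel_latency(0.25):7.1f} cycles")

Under these assumptions, the DRAM cache wins only when the hit rate is high, which matches the observation above that the cache organization helps the traditional applications more than the big data ones.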

There is no single winner, as illustrated by the results. However, we can apply optimizations to the hybrid memory design to mitigate the long latency of PCM and the high energy of DRAM. For example, we can use hierarchical metadata placement to reduce the number of DRAM accesses in an energy efficient way [44]. We can also develop an intelligent data placement to balance the workloads between DRAM and PCM [45] and increase the cache hit rate.

V. CONCLUSION

The heterogeneous multi-core architecture promises power sustainable performance scaling in future computer systems. In addition to the traditional CPUs and memories, emerging computing fabrics and non-volatile memory are engaged to improve the performance per watt. In this paper, three typical heterogeneous systems are introduced: the single-ISA multi-core, the integrated CPU and GPU architecture, and the hybrid memory system. Moreover, the heterogeneous system can leverage 3D integration to further expand the design space. Despite the promising features of heterogeneity, several key challenges remain to be tackled, such as task mapping/scheduling and application-specific design optimizations.

REFERENCES
[1] M. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Design Automation Conference, 2012.
[2] Y. Koizumi, N. Miura, E. Sasaki, Y. Take, H. Matsutani, T. Kuroda, H. Amano, R. Sakamoto, M. Namiki, K. Usami, M. Kondo, and H. Nakamura, "A scalable 3D heterogeneous multi-core processor with inductive-coupling ThruChip interface," IEEE Micro, vol. 33, pp. 6-15, 2013.
[3] S. Borkar, "3D integration for energy efficient system design," in Design Automation Conference, 2011.
[4] A. Branover, D. Foley, and M. Steinman, "AMD Fusion APU: Llano," IEEE Micro, vol. 32, pp. 28-37, 2012.
[5] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32nm processor," in IEEE International Solid-State Circuits Conference, 2011.
[6] "NVIDIA Tegra," http://www.nvidia.com/object/white-papers.html.
[7] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in International Symposium on Computer Architecture, 2004.
[8] "ARM big.LITTLE technology," http://www.arm.com/products/processors/technologies/biglittleprocessing.php.
[9] K. Van Craeynest and L. Eeckhout, "Understanding fundamental design choices in single-ISA heterogeneous multicore architectures," ACM Trans. Archit. Code Optim., vol. 9, pp. 1-32, 2013.
[10] T. Somu Muthukaruppan, A. Pathania, and T. Mitra, "Price theory based power management for heterogeneous multi-cores," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[11] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, "Trace based phase prediction for tightly-coupled heterogeneous cores," in International Symposium on Microarchitecture, 2013.
[12] M. Kim, K. Kim, J. R. Geraci, and S. Hong, "Utilization-aware load balancing for the energy efficient operation of the big.LITTLE processor," in Design, Automation and Test in Europe, 2014.
[13] J. Chen and L. John, "Efficient program scheduling for heterogeneous multi-core processors," in Design Automation Conference, 2009.
[14] "Cost scaling trend," http://www.extremetech.com/computing/123529-nvidia-deeply-unhappy-with-tsmc-claims-22nm-essentially-worthless.
[15] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, "Design space exploration for 3D architectures," Journal of Emerging Technologies in Computing Systems, vol. 2, pp. 65-103, 2006.
[16] "AMD heterogeneous system architecture," http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/.
[17] "Intel Core i7," http://ark.intel.com/products/37148/.
[18] "NVIDIA GeForce GTX 980," http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications.
[19] "PCI Express 3.0," https://www.pcisig.com/specifications/pciexpress/base3/.
[20] J. Jeddeloh and B. Keeth, "Hybrid Memory Cube new DRAM architecture increases density and performance," in Symposium on VLSI Technology, 2012.
[21] JEDEC, "High Bandwidth Memory," http://www.jedec.org/category/technology-focus-area/3d-ics-0.
[22] R. Ausavarungnirun, K.-W. Chang, L. Subramanian, G. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in International Symposium on Computer Architecture, 2012.
[23] J. Lee and H. Kim, "TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture," in International Symposium on High Performance Computer Architecture, 2012.
[24] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, "Memory scheduling towards high-throughput cooperative heterogeneous computing," in International Conference on Parallel Architectures and Compilation, 2014.
[25] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in IEEE/ACM International Symposium on Microarchitecture, 2013.
[26] "OpenCL," https://www.khronos.org/opencl/.
[27] P. H. Wang, J. D. Collins, G. N. Chinya, H. Jiang, X. Tian, M. Girkar, N. Y. Yang, G.-Y. Lueh, and H. Wang, "EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system," in Conference on Programming Language Design and Implementation, 2007.
[28] A. Kerr, G. Diamos, and S. Yalamanchili, "Modeling GPU-CPU workloads and systems," in Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[29] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, "SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters," in International Conference on Supercomputing, 2012.
[30] E. Chung, P. Milder, J. Hoe, and K. Mai, "Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs?" in International Symposium on Microarchitecture, 2010.
[31] A. Marowka, "Extending Amdahl's law for heterogeneous computing," in International Symposium on Parallel and Distributed Processing with Applications, 2012.
[32] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. Keller, "Energy management for commercial servers," Computer, vol. 36, no. 12, pp. 39-48, Dec 2003.
[33] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in International Symposium on High Performance Computer Architecture, 2009, pp. 239-249.
[34] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2-13.
[35] C. Xu, X. Dong, N. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in Design, Automation and Test in Europe Conference, 2011.
[36] Y. Xie, "Future memory and interconnect technologies," in Design, Automation and Test in Europe Conference and Exhibition, 2013.
[37] G. Sun, E. Kursun, J. Rivers, and Y. Xie, "Exploring the vulnerability of CMPs to soft errors with 3D stacked non-volatile memory," in International Conference on Computer Design, 2011.
[38] Y. Chen, J. Zhao, and Y. Xie, "3D-NonFAR: Three-dimensional non-volatile FPGA architecture using phase change memory," in International Symposium on Low Power Electronics and Design, 2010.
[39] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in International Symposium on Computer Architecture, 2009, pp. 24-33.
[40] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in International Conference on Supercomputing, 2011.
[41] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in International Symposium on High Performance Computer Architecture, 2014.
[42] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in International Symposium on Computer Architecture, 1995.
[43] M. Poremba and Y. Xie, "NVMain: An architectural-level main memory simulator for emerging non-volatile memories," in IEEE Computer Society Annual Symposium on VLSI, 2012.
[44] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, "Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management," Computer Architecture Letters, vol. 11, pp. 61-64, 2012.
[45] M. Pavlovic, N. Puzovic, and A. Ramirez, "Data placement in HPC architectures with heterogeneous off-chip memory," in International Conference on Computer Design, 2013.
