
Attaining Single-Chip, High-Performance Computing Through 3D Systems with Active Cooling

This article explores the benefits and the challenges of 3D design and discusses novel techniques to integrate predictive cooling control with chip-level thermal-management methods such as job scheduling and voltage-frequency scaling. Using 3D liquid-cooled systems with intelligent runtime management provides an energy-efficient solution for designing single-chip many-core architectures.

BIG CHIPS

Ayse K. Coskun and Jie Meng, Boston University
David Atienza and Mohamed M. Sabry, École Polytechnique Fédérale de Lausanne

IEEE Micro, July/August 2011. 0272-1732/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

Performance demands are increasing in data centers and high-performance computing clusters, which today run various applications from document and media processing to scientific computing and complex modeling. In tandem, the industry has moved to building many-core systems, where a single chip has dozens of cores. Intel's Single-Chip Cloud Computer1 and Tilera's 64-core processors are recent examples of such systems. Although many-core systems could provide immense computational capacity, achieving high performance in such systems is highly challenging. Communication latency and memory bandwidth limit many-core performance. Many-core systems have other challenges as well. Because of the larger die sizes, the manufacturing yield is lower, high reliability is more difficult to achieve, process variations are more severe, and the production cost is higher compared to smaller chips. These challenges accelerate with smaller technology nodes.

An emerging design technique called 3D stacking addresses these challenges. First, because a 3D stacked system has a smaller footprint (that is, per-chip area), the manufacturing yield is higher, reducing the overall design cost.2,3 The on-chip interconnects are shorter in 3D systems, reducing the wire delay and capacitance. Because the through-silicon vias (TSVs) connecting the layers do not adhere to the chip's pin-out restrictions, we can build interconnect architectures with higher bandwidth between cores and memory blocks. However, 3D design accelerates the thermal challenges because of the higher thermal resistivity resulting from vertical stacking. In fact, high temperatures and cooling challenges are among the major gating factors in building high-performance 3D systems.4

Prior research has addressed the thermal challenges in 3D systems through design techniques such as temperature-aware floorplanning5 and dynamic management techniques such as temperature-aware job scheduling and dynamic voltage and frequency scaling (DVFS).6 Although these approaches provide substantial benefits in reducing temperatures below critical levels, they are not always sufficient or energy efficient for managing 3D many-core systems. Temperature can increase dramatically in 3D systems (for example, above 100°C), making well-known thermal-management techniques inadequate for controlling temperature without considerably hurting performance.

Active cooling—in which liquid (for example, water) flowing through built-in microchannels (or cold plates) cools the chip—has emerged as a viable cooling alternative for high-performance 3D systems in the past decade.7 Recently, a prototype 3D system with built-in microchannels was manufactured8 (see Figures 1 and 2). Liquid cooling removes heat more efficiently than conventional heat sinks and fans, and therefore can address the pressing thermal challenges in 3D systems. Liquid-cooled 3D systems, however, bring novel challenges in cooling control and in efficient integration with chip-level thermal-management techniques.

Figure 1. Liquid-cooled 3D chip prototype, built by IBM Zürich and École Polytechnique Fédérale de Lausanne (EPFL).8 Liquid cooling can more effectively remove heat compared to conventional heat sinks and fans.

Many-core 3D design with active cooling is highly complex, with a number of constraints such as cost, peak power, energy, reliability, and yield. Solving these challenges through design-time optimization adds to the area overhead, increases time-to-market and cost, and often results in suboptimal operation owing to dynamic variations in workload. We propose integrating active-cooling control with thermally aware job scheduling and DVFS to optimize the energy efficiency of high-performance 3D systems while maintaining low and stable temperature profiles. Temperature and energy benefits as compared to conventional cooling systems are remarkable: we reduce the peak temperature to 50°C from critically high levels, while achieving over 2× cooling-energy savings on a 64-core 3D system.

Figure 2. Side view of the 3D liquid-cooled system prototype showing the built-in microchannels. The liquid arrives at the chip through a pipe, gets distributed among the microchannels, flows through the chip, and is collected into another pipe at the outlet.

A manufacturing perspective on 3D design

Recent chip sizes for many-core systems reach 500 to 600 mm². An important advantage of 3D stacking comes from silicon economics: individual chip yield, which is inversely correlated with area, increases when a larger number of chips with smaller areas are manufactured.2 We can estimate the cost of manufacturing a 3D system, C, using the wafer cost C_wafer; the wafer utilization U_3D (that is, how many 3D systems can be cut from the wafer, considering the chip, TSV, and scribe area); and the yield Y_system (see Equation 1).3

    C = C_wafer / (U_3D × Y_system)    (1)

We calculate a 3D system's yield by extending the negative binomial-distribution model.

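As a sanity check on this silicon-economics argument, the cost model of Equation 1 can be sketched in a few lines of Python. The yield expression, the nominal TSV area overhead, and the dies-per-wafer estimate below are simplifying assumptions for illustration (no wafer-edge loss), so the output approximates the trends of the article's analysis rather than its exact numbers.

```python
# Sketch of Equation 1, C = C_wafer / (U_3D * Y_system), with an
# illustrative negative-binomial-style yield model. A_TSV and the
# dies-per-wafer estimate are assumptions, not the article's values.
import math

C_WAFER = 3000.0                       # estimated 300-mm wafer cost ($)
WAFER_AREA = math.pi * 150.0 ** 2      # mm^2, ignoring edge loss
A_TOTAL = 321.0                        # total chip area split into n tiers (mm^2)
D = 0.001                              # defect density per mm^2
ALPHA = 4.0                            # defect clustering ratio
A_TSV = 1.0                            # assumed TSV area overhead per die (mm^2)
P_STACK = 0.99                         # probability of a successful stacking step

def die_yield(area_mm2):
    """Negative-binomial die yield: (1 + D*A/alpha)^-alpha."""
    return (1.0 + D * area_mm2 / ALPHA) ** (-ALPHA)

def system_cost(n):
    """Cost per n-tier system: wafer cost / (systems per wafer * yield)."""
    die_area = A_TOTAL / n + (A_TSV if n > 1 else 0.0)
    u3d = (WAFER_AREA / die_area) / n          # systems cut from one wafer
    y_system = die_yield(die_area) * P_STACK ** (n - 1)
    return C_WAFER / (u3d * y_system), y_system

for n in (1, 2, 4):
    cost, y = system_cost(n)
    print(f"{n}-tier: yield {y:.2f}, cost per system ${cost:.2f}")
```

With these assumptions, the four-tier system shows a higher per-die yield and a lower cost per system than the 2D baseline, mirroring the direction (though not the exact magnitude) of the cost savings reported for the 64-core chip.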
Equation 2 shows the yield computation for 3D systems with known-good-die (KGD) bonding, where the dies are tested prior to bonding. In the equation, D is the defect density, with a typical range between 0.001 per mm² and 0.005 per mm² (D = 0.001/mm² in this work); A is the total chip area to be split into n layers; a is the defect clustering ratio (we set a to 4); A_TSV is the area overhead of the TSVs; P_stack is the probability of having a successful stacking operation for KGDs; and n is the number of stacked layers. In wafer-to-wafer bonding, the yield loss is higher because individual dies are not tested prior to bonding.

    Y_system = (1 + (D × (A/n + A_TSV)) / a)^(−a) × P_stack^(n−1)    (2)

Figure 3 shows the manufacturing yield and cost per 3D system for building a 64-core chip using 45-nm technology. We assume a standard 300-mm wafer with an estimated cost of $3,000, and we compare 3D designs with two and four layers to a single-layered many-core system with a total area of 321 mm². Table 1 provides the system properties, which are based on the sizes and architectures of the cores and caches in the Intel 48-core Single-Chip Cloud Computer (SCC).1 This example illustrates the benefits of 3D stacking for designing big chips with respect to cost and yield: building the same many-core chip as a four-tier 3D system instead of a 2D system decreases the per-system manufacturing cost by 24 percent. This analysis assumes a mature, reliable bonding process with a probability-of-success value (P_stack) of 0.99.3,9 If the stacking success is lower, system yield drops with the number of layers, increasing the cost of 3D design (see Equation 2). Therefore, a key aspect for making 3D design cost-effective is the deployment of a mature, reliable bonding process, which is being pursued by manufacturers in both high-performance computing and embedded markets.

Figure 3. Comparing the cost and yield for building a 64-core chip using 2D and 3D design, assuming a mature stacking process (see Table 1 for chip properties). Building the 64-core chip as a four-tier 3D system decreases the manufacturing cost by 24 percent compared to the 2D chip. This decrease is mainly due to the increase in yield. For the four-tier system, wafer utilization increases slightly as well (by 9 percent), because smaller chips use the wafer area more efficiently.

Table 1. Properties of the 64-core big chip.
  Process technology: 45 nm
  No. of cores: 64
  No. of private Level 2 (L2) caches: 64
  L2 cache size and area: 256 Kbytes, 0.74 mm²
  Core architecture: similar to Pentium-class cores in the Intel Single-Chip Cloud Computer (SCC)1
  Core area: 4.3 mm²
  Through-silicon via (TSV) dimensions: 50 µm × 50 µm, 100-µm pitch
  On-chip bus width: 128 bits

Building liquid-cooled systems requires an additional etching phase for the microchannels. This process is similar to the existing process for the TSV design, which includes etching and filling the etched space with copper. Based on our discussions with industry, the additional etching phase for the microchannels incurs negligible additional cost with respect to a 3D system without liquid cooling. The bonding phase for liquid-cooled 3D systems introduces additional complexity, because the microchannels remove part of the surface touching the chip, and the gluing of the layers must be performed more carefully. This complexity translates to around 20 percent additional manufacturing costs compared to 3D systems with TSVs only (without microchannels). Recently, researchers have built and tested a practical implementation of silicon-based microchannel coolers for high-power chips.10

Not only does 3D design improve the manufacturing yield for big chips, but it also helps integrate a larger number of transistors within the same footprint without scaling the technology node. In addition, 3D design allows for putting tested KGDs together to build a large chip. Moving to process nodes beyond 45 nm will be prohibitively expensive for some manufacturers. Therefore, 3D design is both a promising method to pack more functionality on a single chip and a key enabler for the design of big chips.

Thermal challenges of 3D design

3D systems improve manufacturing yield, reduce design cost, and enable high-density integration. Stacking, however, makes it difficult to cool systems effectively through conventional heat sinks and fans, because the thermal resistivity is high for the layers away from the heat sink. According to International Technology Roadmap for Semiconductors (ITRS) predictions, junction-to-air thermal resistance must drop dramatically below 0.18°C/W to cool future high-performance 3D chips. In practice, such cooling efficiency is nearly impossible to achieve at acceptable packaging and cooling costs and dimensions. Even though conventional heat-sink-based cooling is inadequate for high-performance 3D systems, it is viable for lower-power 3D designs. Therefore, 3D systems developed for embedded-computing environments, where a liquid-cooling infrastructure is not feasible, rely on thermal-management methods to find reliable, energy-efficient operating points.

Thermally aware floorplanning plays a crucial role in controlling peak temperature in 3D systems. Fundamental layout guidelines for temperature-aware 3D design include avoiding placing power-hungry components close together and making effective use of low-power components (such as memory units) to help spread the heat. Thermal modeling is a key component of thermally aware design and runtime management. Today, well-known tools such as HotSpot include 3D-modeling features.11 In recent work, we extended the HotSpot simulator to model the thermal impact of TSVs.12 Because copper has higher thermal conductivity than silicon or the interlayer glue material, chip areas with a high TSV density observe a reduction in temperature. This temperature decrease, although usually limited to a few degrees, can be effective in reducing the peak temperature in hot zones.

Although temperature-aware design has considerable benefits in reducing peak and average temperatures, it is often insufficient for eliminating the thermal hot spots in 3D systems, especially when there are more than two layers in the stack. Thus, dynamic thermal-management policies, such as job scheduling and DVFS, are essential in 3D systems, especially in the absence of active cooling.

In Figure 4, we show the effects of dynamic thermal-management policies on 3D systems with conventional cooling infrastructures (that is, only heat sinks and fans, no liquid cooling). Table 2 summarizes the package and floorplan parameters used in the thermal simulations. Table 3 includes the liquid-cooling-related design assumptions (we discuss results with active cooling in the next section). We modeled a heat sink with a convection resistance of 0.1 K/W, representing the heat sinks in modern CPUs. In these experiments, we leveraged a 64-core processor manufactured at 45 nm as our target system. Each core has two-way-issue out-of-order execution, two integer units and one floating-point unit, a 16-Kbyte private Level 1 (L1) instruction cache, a 16-Kbyte private L1 data cache, and a 256-Kbyte private L2 cache. The core architecture is based on the cores used in the Intel SCC1 (see Table 1).

We run eight benchmarks from the PARSEC13 (blackscholes, bodytrack, canneal, and fluidanimate) and NAS14 (cg, dc, mg, and ua) parallel benchmark suites on the M5 performance simulator15 to collect performance traces for a range of workloads running on the 64-core system. We use McPAT to derive the power values corresponding to the performance traces.16

Figure 4. Comparing the peak and average temperatures for the 64-core 2D and two-tier 3D systems without active cooling for no thermal management (No DTM), temperature-aware load balancing (TALB), and TALB combined with dynamic voltage and frequency scaling (TALB + DVFS). TALB reduces the peak temperature below the critical value of 85°C. TALB + DVFS reduces the temperatures further. Peak and average temperatures exceed 100°C and 90°C, respectively, for the four-tier 3D system even with TALB + DVFS, making that design infeasible.
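To make the TALB idea concrete, here is a minimal greedy sketch: each incoming job is placed on the core whose thermally weighted utilization (utilization multiplied by a per-core thermal-stress factor) would remain lowest. The thermal factors and the job stream are illustrative assumptions, not values from our experiments.

```python
# Sketch of temperature-aware load balancing (TALB): jobs go to the
# core with the lowest *weighted* utilization, i.e., utilization scaled
# by a thermal factor reflecting how hard the core's location is to
# cool (e.g., tiers far from the heat sink). Values are illustrative.

def talb_assign(jobs, thermal_factor):
    """Greedy balancer: place each job's utilization on the core that
    would have the smallest weighted utilization after placement."""
    n = len(thermal_factor)
    util = [0.0] * n
    placement = []
    for load in jobs:
        # weighted utilization = actual utilization * thermal factor
        core = min(range(n), key=lambda c: (util[c] + load) * thermal_factor[c])
        util[core] += load
        placement.append(core)
    return util, placement

# Four cores: two on an easy-to-cool tier (factor 1.0), two on a
# harder-to-cool tier (assumed factor 1.4).
factors = [1.0, 1.0, 1.4, 1.4]
util, placement = talb_assign([0.2] * 10, factors)
print(util, placement)
```

Cores with the larger thermal factor (the harder-to-cool tier) end up with less load, which is exactly the bias TALB introduces.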

In McPAT, we set the supply voltage to 1.14 V and the frequency to 1 GHz.1 We calibrate the power results from McPAT to match the reported power values for the cores in the Intel SCC. We assume the cores have the following additional voltage-frequency pairs available: (0.75 V, 366 MHz), (0.9 V, 590 MHz), and (1.05 V, 866 MHz). The average core power at the highest speed (dynamic and leakage) is 1.8 W, and the leakage component is 0.7 W at 75°C. Each L2 cache consumes 0.3 W dynamic power and 0.17 W leakage power at 75°C. We compute the temperature dependence of leakage power using a common exponential formulation.17

Table 2. Floorplan and package parameters used in the thermal simulations.
  Silicon conductivity: 130 W/(m·K)
  Silicon capacitance: 1,635,660 J/(m³·K)
  Wiring-layer conductivity: 2.25 W/(m·K)
  Wiring-layer capacitance: 2,174,502 J/(m³·K)
  Heat sink conductivity: 10 W/K
  Heat sink capacitance: 140 J/K
  Total area of each tier in the two-tier, 64-core 3D stack: 176 mm²
  Total area of each tier in the four-tier, 64-core 3D stack: 88 mm²

In the 2D floorplan, each core is next to its private L2 cache on an 8 × 8 grid. In the two-tier and four-tier 3D systems, each layer has 32 and 16 core-cache pairs, respectively. In this 64-core system, the core and cache areas differ greatly, so we do not place the cores and caches onto separate layers, to avoid wasting substantial area. When core and cache sizes are similar, however, it is beneficial to separate the core and memory layers to exploit the heat dissipation from hotter cores to cooler caches. In fact, we expect temperature increases to be within manageable ranges for low-power cache-layer stacking options.

The temperature-aware load balancing (TALB) in Figure 4 balances the weighted utilization among the cores,12 which is the actual utilization of the cores multiplied by a thermal factor representing the average thermal stress on the core. In this way, cores at locations more prone to hot spots (such as cores on the layers away from a heat sink) receive a lower workload compared to cores at easier-to-cool locations on the chip. The DVFS policy we implement adjusts the voltage-frequency level according to the estimated utilization of the cores. TALB has a negligible performance cost and can reduce the temperature below 85°C for all workloads. Combining TALB with DVFS reduces the temperatures further. Such dynamic-management techniques are essential for providing low-cost thermal management of air-cooled 3D systems, and they also offer opportunities to improve cooling efficiency in active-cooling environments.

Table 3. Liquid-cooling design assumptions used in the thermal simulations.
  Water conductivity: 0.6 W/(m·K)
  Water capacitance: 4,183 J/(kg·K)
  Channel width: 50 µm
  Channel thickness: 100 µm
  Channel pitch: 150 µm
  Channel length: 11 mm
  Number of channels per cavity in the two-tier, 64-core 3D stack: 106
  Number of channels per cavity in the four-tier, 64-core 3D stack: 53

High-performance 3D systems with active cooling

Researchers have studied the use of convection in microchannels to cool high-power-density chips since the initial work by Tuckerman and Pease.18 Their liquid-cooling system can remove 1,000 W/cm²; however, the volumetric flow rate and pressure drop are large, making the approach unsuitable for chip-level implementation. Recent work shows that backside liquid cold plates, such as staggered-microchannel and distributed-return-jet plates, can handle up to 400 W/cm² in single-chip applications.19 Using pin-fin structures at a chip size of 1 cm², the heat-removal performance is more than 200 W/cm² for TSV pitches larger than 50 µm.7 Although the cooling capacity of active cooling is remarkable, several key challenges must be addressed to enable reliable and efficient operation.

One such challenge is dynamic flow-rate control. The advanced cooling capacity comes with a pumping-energy cost to push the fluid into the microchannels. For example, for a cluster with 60 computing stacks, as proposed in the Aquasar data center design, the cooling infrastructure consumes up to 70 W (similar to the power consumption of a many-core chip). Workload changes significantly over time in real-life systems, and there are many idle cycles—especially for data centers, which must operate at medium utilization levels to handle unexpected rises in service requests. Thus, adjusting the liquid flow dynamically to meet the cooling demand saves cooling energy compared to setting a sufficiently high flow rate for handling the worst-case temperatures. Adjusting the flow rate of pumps, however, typically has an overhead in the range of several hundred milliseconds. This overhead requires implementing a predictive control strategy to adjust the cooling before thermal emergencies occur. We also must minimize flow-rate changes to ensure stable thermal profiles and reliable pump operation.

Another challenge is integrating flow-rate control with chip-level thermal-management techniques. Chip-level techniques such as DVFS and job scheduling successfully reduce and balance the temperature,6 improving cooling efficiency. Combining various control knobs in a single low-overhead controller, however, is challenging because the control parameters differ in their time constants, performance and energy overheads, and benefits. For example, changing the DVFS setting typically takes tens to hundreds of microseconds, whereas flow-rate changes take hundreds of milliseconds. The overhead for workload scheduling is typically low. However, when the system is highly utilized, job scheduling is not sufficient to control the temperature, requiring more aggressive techniques such as DVFS or flow-rate control. Simultaneously using distinct control knobs at runtime requires a controller capable of making intelligent decisions quickly.

To address these challenges, we propose a dynamic thermal-management strategy to enable high-performance, energy-efficient, and reliable operation of liquid-cooled 3D systems. The dynamic approach uses a

design-time thermal analysis of the liquid-cooled 3D system with respect to flow rate, DVFS, and task-scheduling decisions, exploring the thermal impact of tuning each thermal-control knob on the temperature profile. Then, the integrated controller monitors input variables (such as temperature and core utilization), allowing a degree of uncertainty, and tunes the control knobs through a set of predefined rules.

Figure 5 shows the schematic diagram of the combined design-time and runtime thermal-management approach. This approach starts with identifying the input parameters to the controller, the control knobs, and each knob's reaction time. Using this initial analysis, we derive a set of management rules for each knob and combine the rules in a superposition phase to create a global lookup table of control decisions. The thermal controller uses this set of rules at runtime to dynamically set each core's DVFS setting and the chip's coolant flow rate. The objective is to minimize the energy consumption of the cooling infrastructure and the performance degradation while keeping the 3D stack's temperature within safe bounds.

Figure 5. Our proposed design-time and runtime thermal-management approach using a fuzzy-logic controller for DVFS and flow-rate control. The design-time phase covers thermal-control parameter identification, thermal-response analysis, derivation of control policies, and a superposition-based lookup table; the runtime phase maps measurements and real-time requirements to control decisions through the runtime control actuators. Our goal is to minimize energy consumption while maintaining the desired performance and keeping temperature within safe bounds.

To enable the 3D system's design-time thermal analysis, we use an automated thermal-modeling tool, 3D-ICE,20 which can model the microchannels and stacked layers. The modeling principles are similar to those in compact automated thermal models such as HotSpot,11 except that the interlayer material in our model is heterogeneous, to address the differences in the thermal-resistivity values of the microchannels carrying the liquid and the glue material. Additionally, depending on the flow rate provided to push water into the microchannels, the junction temperature of the cells adjacent to the channels changes according to the heat absorption along the channel. We compute the total temperature rise on the junction, ΔT_j, as

    ΔT_j = ΔT_conduction + ΔT_convection + ΔT_sensible heat

The thermal gradient due to heat conduction through the back-end-of-line (BEOL) layer is computed using the BEOL resistivity. The convection rate depends on the heat-transfer coefficients of the materials and is independent of the flow rate when the system reaches boundary conditions. The temperature change resulting from the absorption of sensible heat along the microchannel depends on the volumetric flow rate and is computed iteratively along the channel. In fact, the heat absorption along the channels typically causes a noticeable thermal gradient across the die, as shown in Figure 6. Our prior work provides a detailed treatment of the thermal model for liquid-cooled 3D systems.12,20

We leverage a multiple-input, multiple-output fuzzy controller to integrate flow-rate control with DVFS for enabling energy-efficient temperature management on liquid-cooled 3D systems. The controller is


based on the Takagi-Sugeno fuzzy model,21 and its inputs include the distance from the inlet port, the predicted maximum chip temperature, and the utilization of each core. Thermal prediction is performed through regression-based methods.12 The outputs are the flow rate of the liquid going into the stack and the DVFS settings for each core. The controller makes decisions using the rule base formed during the offline-analysis stage. For example, if the distance from the inlet port is long (that is, cores are in the hotter zone), core utilization is high, and temperature is high, we apply a high flow rate and a medium DVFS level. If a core is close to the inlet port, we can maintain the temperature at a low level using a high DVFS level (incurring no performance overhead) and a low flow-rate setting (please refer to Sabry et al. for the complete set of rules21). The global controller can explore energy-efficient solutions and trade-offs that would not be possible using independent control mechanisms.

Figure 6. Temperature change across the microchannels on the five-tier liquid-cooled prototype developed by IBM Zürich and EPFL.7 The temperature change across the channel reaches 15°C for this system. 3D-ICE explicitly models the gradients across the channels.

Figure 7 compares the thermal profiles in the liquid-cooled 64-core chip with dynamic control (Fuzzy + TALB) against liquid-cooled systems that have a static flow-rate setting to handle worst-case temperatures (No DTM and TALB). Table 3 provides the characteristics of the microchannels we used in these experiments. In addition, because each 3D system's footprint determines the total number of microchannels, we include the number of available microchannels per cavity in the table (a cavity is the cooling layer, including a set of microchannels, between two adjacent active layers). The number of microchannels per cavity is lower in the four-tier system, whereas the number of cavities is larger compared to the two-tier system.

We use a maximum flow-rate setting of 46.9 ml/minute and 23.5 ml/minute per cavity for the two-tier and four-tier 3D systems, respectively. We selected these flow rates on the basis of the pumps suitable for liquid-cooled 3D systems (such as 12-V miniature gear pumps), with a pressure drop of 1 bar. The per-cavity flow rate is lower in the four-tier system because we keep the overall flow rate coming into the stack the same in both cases. We use straight microchannels, each with a cross section of 50 µm × 100 µm and a 150-µm pitch to account for TSV placement in the channel walls.

Notice from Figure 7 that all the experiments with liquid cooling reduce the peak and average temperatures to below 50°C. The fuzzy controller combines dynamic flow-rate adjustment with DVFS and saves cooling energy by letting the temperature rise within safe margins. We could reduce the temperature to a similar level as Fuzzy + TALB by aggressively applying DVFS or other

thermal-management techniques, at the cost of severe performance degradation. Fuzzy + TALB applies liquid-cooling management with performance-aware scheduling and DVFS, limiting the performance cost to less than 1 percent of the original (No DTM) case.

Figure 7. Comparing maximum and average temperatures between liquid-cooled 3D systems and the 2D air-cooled baseline. Liquid cooling dramatically reduces temperatures across all the workloads. The No DTM and TALB policies on the liquid-cooled systems use a static flow rate to cool the chip even under worst-case temperatures. Our fuzzy controller combined with TALB trades off temperature reduction with cooling energy; we allow temperature to increase within safe margins by reducing the flow rate when the cooling demand is lower. The 0.5-bar setting reduces the liquid flow to relax temperature constraints further for larger energy savings.

Fuzzy + TALB prevents overcooling and reduces the associated cooling energy by adjusting the flow rate to match the system's cooling needs. For example, for the four-tier 3D liquid-cooled system, cooling energy is reduced by 61 percent compared to the cooling energy spent in No DTM and TALB. Because fuzzy control includes DVFS, the chip-level energy drops as well—for example, by 15 percent on average for the four-tier system. The overall energy savings (chip and cooling) on the two-tier and four-tier systems are 16 and 21 percent, respectively. To reduce cooling energy further, we also experiment with a lower flow rate (0.5-bar pressure drop with a flow rate of 16.45 ml/minute) to let the temperature rise without exceeding 80°C. In comparison to the 1-bar setting, the low flow-rate setting reduces cooling energy by 23 percent. However, we see a substantial increase in thermal gradients: while the original setting maintains gradients below 15°C, the 0.5-bar setting causes spatial gradients on the chip to grow significantly higher. Large spatial gradients decrease the cooling efficiency and are undesirable because they can lead to design complexity or timing problems on the die.

3D systems with active cooling, when intelligently managed by adaptive techniques aware of the physical properties and constraints, offer a cost-efficient solution for building high-performance many-core big chips. Many interesting research problems remain in this area. Our results indicate the need for developing novel temperature-aware floorplanning methods for 3D chips with liquid cooling, addressing the thermal impact of both TSVs and the thermal heterogeneity caused by the liquid flow. We have experimented with straight microchannels in our work, following the recent developments and prototypes in the industry. Two-phase cooling and different microchannel geometries bring several new areas to explore, along with new challenges in manufacturing and control. In addition, studying the reliability of 3D systems is crucial for developing future 3D stacks. For liquid-cooled systems, we expect interesting reliability challenges due to the presence of water. Furthermore, DRAM stacking is a promising feature for enabling higher performance. To thoroughly understand the performance and temperature impact of memory stacking, we need various levels of design-space exploration and potentially new modeling techniques. Finally, for


future data centers containing many liquid-cooled stacks, we need approaches to enable reuse of the wasted cooling energy in the hot water coming out of chip stacks. Recent research contributions in this direction include the IBM Aquasar liquid-cooled server rack (including 2D nodes), which recovers a significant part of the cooling energy by reusing the hot water for building heating.

Acknowledgments
This research has been partially funded by the Nano-Tera RTD project CMOSAIC (ref. 123618), financed by the Swiss Confederation and scientifically evaluated by SNSF, and the PRO3D European Union FP7-ICT-248776 project. We thank Thomas Brunschwiler and Bruno Michel from the Advanced Packaging Group of IBM Zürich, as well as Yusuf Leblebici from the Microelectronic Systems Laboratory of EPFL, for their contributions in experimentally validating the thermal model for 3D ICs with intertier liquid cooling.

References
1. J. Howard et al., "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE Press, 2010, pp. 108-109.
2. S. Reda, G. Smith, and L. Smith, "Maximizing the Functional Yield of Wafer-to-Wafer 3D Integration," IEEE Trans. Very Large Scale Integration Systems, vol. 17, no. 9, 2009, pp. 1357-1362.
3. A.K. Coskun, A.B. Kahng, and T. Rosing, "Temperature- and Cost-Aware Design of 3D Multiprocessor Architectures," Proc. 12th Euromicro Conf. Digital System Design, Architectures, Methods, and Tools, IEEE CS Press, 2009, pp. 183-190.
4. G.H. Loh and Y. Xie, "3D Stacked Microprocessor: Are We There Yet?" IEEE Micro, vol. 30, no. 3, 2010, pp. 60-64.
5. M. Healy et al., "Multiobjective Microarchitectural Floorplanning for 2D and 3D ICs," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, 2007, pp. 38-52.
6. C. Zhu et al., "Three-Dimensional Chip-Multiprocessor Runtime Thermal Management," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 8, 2008, pp. 1479-1492.
7. T. Brunschwiler et al., "Interlayer Cooling Potential in Vertically Integrated Packages," Microsystem Technologies, vol. 15, no. 1, 2008, pp. 57-74.
8. T. Brunschwiler et al., "Validation of the Porous-Medium Approach to Model Interlayer-Cooled 3D-Chip Stacks," Proc. IEEE Int'l Conf. 3D System Integration, IEEE CS Press, 2009, pp. 1-10.
9. X. Dong and Y. Xie, "System-Level Cost Analysis and Design Exploration for Three-Dimensional Integrated Circuits (3D ICs)," Proc. Asia and South Pacific Design Automation Conf., IEEE Press, 2009, pp. 234-241.
10. E.G. Colgan et al., "A Practical Implementation of Silicon Microchannel Coolers for High Power Chips," IEEE Trans. Components and Packaging Technologies, vol. 30, no. 2, 2007, pp. 218-225.
11. K. Skadron et al., "Temperature-Aware Microarchitecture," Proc. 30th Ann. Int'l Symp. Computer Architecture, ACM Press, 2003, pp. 2-13.
12. A.K. Coskun et al., "Energy-Efficient Variable-Flow Liquid Cooling in 3D Stacked Architectures," Proc. Design Automation and Test in Europe, IEEE Press, 2010, pp. 111-116.
13. C. Bienia, "Benchmarking Modern Multiprocessors," doctoral dissertation, Computer Science Dept., Princeton University, 2011.
14. D. Bailey et al., The NAS Parallel Benchmarks, tech. report RNR-94-007, NASA Ames Research Center, 1994.
15. N.L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, July/Aug. 2006, pp. 52-60.
16. S. Li et al., "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Many-Core Architectures," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM Press, 2009, pp. 469-480.
17. S. Heo, K. Barr, and K. Asanovic, "Reducing Power Density through Activity Migration," Proc. Int'l Symp. Low Power Electronics and Design, ACM Press, 2003, pp. 217-222.
18. D.B. Tuckerman and R.F.W. Pease, "High-Performance Heat Sinking for VLSI," IEEE Electron Device Letters, vol. 2, no. 5, 1981, pp. 126-129.
19. T. Brunschwiler et al., "Direct Liquid-Jet Impingement Cooling with Micron-Sized Nozzle Array and Distributed Return Architecture," Proc. 10th Intersociety Conf. Thermal and Thermomechanical Phenomena in Electronics Systems, IEEE Press, 2006, pp. 196-203.
20. A. Sridhar et al., "3D-ICE: Fast Compact Transient Thermal Modeling for 3D-ICs with Intertier Liquid Cooling," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design, IEEE Press, 2010, pp. 463-470.
21. M. Sabry, A.K. Coskun, and D. Atienza, "Fuzzy Control for Enforcing Energy Efficiency in High-Performance 3D Systems," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design, IEEE Press, 2010, pp. 642-648.

Ayse K. Coskun is an assistant professor in the Electrical and Computer Engineering Department at Boston University. Her research interests include energy-efficient computing, multicore architectures, 3D stacked architectures, embedded systems, and software. Coskun has a PhD in computer science and engineering from the University of California, San Diego. She's a member of IEEE and the ACM.

Jie Meng is a PhD student in electrical and computer engineering at Boston University. Her research interests include many-core and 3D stacked architectures, focusing on energy awareness and performance improvement for future systems. Meng has an MS in electrical engineering from McMaster University in Canada.

David Atienza is a professor of electrical engineering and the director of the Embedded Systems Laboratory at the École Polytechnique Fédérale de Lausanne, and an adjunct professor in the Computer Architecture Department at the Complutense University of Madrid (UCM). His research interests include system-level design methodologies for high-performance multiprocessor systems on chips (MPSoCs) and embedded systems, including 2D and 3D thermally aware design, wireless sensor networks, hardware and software reconfigurable systems, dynamic-memory optimizations, and network-on-chip design. Atienza has a PhD in computer science from the Inter-University Microelectronics Center, Belgium, and the UCM. He's an associate editor of IEEE Transactions on CAD, IEEE Embedded Systems Letters, and Integration: The VLSI Journal.

Mohamed M. Sabry is a PhD student in the Electrical Engineering Department and a member of the Embedded Systems Laboratory at the École Polytechnique Fédérale de Lausanne. His research interests include system design and resource management methodologies in embedded systems and MPSoCs, especially temperature and reliability management of 2D and 3D MPSoCs. Sabry has an MS in electrical engineering from Ain Shams University, Egypt.

Direct questions or comments about this article to Ayse K. Coskun, Electrical and Computer Engineering Department, Boston University, 8 Saint Mary's Street, Boston, MA 02215; [email protected].

