
Exploring Energy and Scalability in Coprocessor-Dominated Architectures for Dark Silicon Regime
By Zheng Qiaoshi
Under the Supervision of Professor Gao Deyuan
A Dissertation Submitted to Northwestern Polytechnical University in partial fulfillment of the requirement for the degree of Doctor of Computer Science and Technology
Xi'an, P. R. China
January 2015

Abstract

As silicon technology approaches its physical limits, the traditional scaling theory guided by Moore's Law and Dennard scaling is about to fail. Under a limited power budget, the Utilization Wall and the Dark Silicon phenomenon appear in current chip designs of the post-Dennard scaling era. Furthermore, the dark silicon phenomenon will worsen precipitously with each process generation, pushing chip design into the dark silicon regime. In this regime, the fraction of a chip that can switch at full frequency drops precipitously, leaving more and more on-chip transistors that cannot be utilized. Silicon area therefore becomes less expensive relative to power and energy consumption. This shift calls for new architectural techniques that trade dark silicon area for energy efficiency. One such technique is the use of heterogeneous specialized coprocessors.
Because specialized coprocessors can reduce energy consumption by more than 10x relative to general-purpose processors, employing them can improve a system's energy efficiency for a single application. However, most common systems run a great number of diverse workloads. To improve the energy efficiency of such systems, architects need to employ hundreds or even thousands of specialized coprocessors and schedule software to run on them. As the number of coprocessors scales up, these designs will transform from coprocessor-enabled systems into Coprocessor-Dominated Architectures (CoDAs). The author wrote this dissertation at UCSD as a member of the UCSD GreenDroid group and the Dark Silicon Center. It focuses on the theory, scalability, energy efficiency, and potential issues of CoDAs. The contributions are as follows:
(1) The dissertation studies the feasibility of CoDAs and demonstrates that they are suitable for the dark silicon regime. It analyzes the Android mobile software stack and finds that most applications run on native libraries and the virtual machine. If coprocessors are built for this shared code, most of the software will run on coprocessors. It then analyzes the web browser and maps it to silicon. The results show that, in a 22nm process, only 7 mm² of chip area is needed to cover 90% of the web browser's dynamic execution with specialized coprocessors. Because an acceptable amount of silicon area can cover most of the application's execution, CoDAs are suitable for the dark silicon regime.
(2) To explore the CoDA design space at acceptable speed, the dissertation proposes a CoDA analysis model. The CoDA architecture in the model employs a multi-dimensional scalable structure: a CoDA can be composed of a varying number of tiles, each tile can contain a varying number of coprocessors, and the coprocessors can be heterogeneous.
The analysis model evaluates the energy, performance, and area of each specific CoDA design, and its design parameters include both high-level architecture configurations and low-level implementation configurations.
(3) The dissertation explores CoDA energy efficiency under different cache configurations, tile sizes, coarse-grained energy management strategies, and transistor libraries. Under the optimal configuration, a CoDA design approach that delivers 5.3× improvements in energy efficiency and 5.0× improvements in energy-delay product for small workloads continues to yield improvements of 3.7× in energy and 3.5× in energy-delay product for designs covering over 100 applications. A scalable CoDA design can thus continue to deliver superior efficiency even for large workloads, which means that CoDAs scale. In addition, the dissertation finds that even with aggressive power management, leakage remains a sizable fraction of CoDA energy and grows with coprocessor count.
(4) The dissertation explores the influence of concurrent execution on CoDA energy efficiency. The effects are both positive and negative. On the positive side, running multiple threads on a CoDA increases overall energy efficiency because it amortizes fixed energy costs, including those due to leakage, across the work of multiple threads. On the negative side, when the target workloads that drive CoDA generation do not match the real workloads, concurrent threads raise the possibility of competition for c-cores, which can reduce the average energy efficiency of the system. The dissertation proposes integrating merged QsCores into the CoDA to reduce such conflicts. The results show that using QsCores to provide twice the number of c-cores of each type costs only 41% additional area and improves energy efficiency by 11.1% to 22.1% in the non-uniform case.
(5) Because a single FPGA chip built in the current process technology does not have enough resources to emulate a CoDA chip targeting the next process generation, the dissertation proposes an inter-chip scalable 2D-mesh network whose chips are connected by a cross-chip ring network. The inter-chip design also provides flow control for each physical channel of the 2D mesh. The ring network design provides two types of chip-crossing connectors: an ASIC-to-FPGA (MURN IO) connector and an FPGA-to-FPGA (FMC) connector. Using the inter-chip 2D-mesh network, the dissertation builds the first CoDA prototype system on two Virtex 6 FPGA boards.

Key words: Dark Silicon, Coprocessor-Dominated Architecture (CoDA), Massively Heterogeneous, Coprocessor, Energy Efficiency, Scalability
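
The energy-delay figures in item (3) and the amortization argument in item (4) can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names and the amortization numbers are hypothetical, and the delay calculation assumes that the reported energy and EDP improvements are baseline-to-CoDA ratios measured on the same runs, which the abstract does not state explicitly.

```python
"""Illustrative arithmetic for the energy-delay and amortization claims.

Only the 5.3x/5.0x and 3.7x/3.5x ratios come from the abstract; every
function name and every other number here is a hypothetical example.
"""


def energy_delay_product(energy: float, delay: float) -> float:
    """EDP is energy multiplied by execution time (lower is better)."""
    return energy * delay


def implied_delay_ratio(energy_gain: float, edp_gain: float) -> float:
    """Return baseline_delay / coda_delay implied by the two gains.

    Assumes both gains are baseline/CoDA ratios taken on the same runs,
    so that edp_gain == energy_gain * delay_gain.
    """
    return edp_gain / energy_gain


def energy_per_thread(fixed_energy: float, dynamic_energy: float, threads: int) -> float:
    """Energy attributed to each thread when a fixed cost (such as leakage)
    is shared by `threads` concurrent threads, each adding `dynamic_energy`."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return fixed_energy / threads + dynamic_energy


if __name__ == "__main__":
    # Ratios quoted in the abstract for small and large workloads.
    for label, e_gain, edp_gain in [("small", 5.3, 5.0), ("large", 3.7, 3.5)]:
        ratio = implied_delay_ratio(e_gain, edp_gain)
        print(f"{label} workloads: implied delay ratio = {ratio:.3f}")

    # Hypothetical units: 8 units of fixed energy, 2 units of dynamic energy per thread.
    for n in (1, 2, 4):
        print(f"{n} thread(s): {energy_per_thread(8.0, 2.0, n):.1f} units per thread")
```

Under the stated assumption, both implied delay ratios come out slightly below 1, which would mean the CoDA designs run marginally slower than the baseline while using far less energy. The second function shows why concurrency helps: the fixed term shrinks as it is divided across more threads, while the per-thread dynamic term stays constant.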