kl

n CoDA , 二

n yn x n

2015 1

Exploring Energy and Scalability in Coprocessor-Dominated Architectures for Dark Silicon Regime

By Zheng Qiaoshi

Under the Supervision of Professor Gao Deyuan

A Dissertation Submitted to Northwestern Polytechnical University

In partial fulfillment of the requirement For the degree of Doctor of Computer Science and Technology

Xi’an, P. R. China January 2015

! 些wm产 g些vmm 二gpmm wmg mu些亿v丽v亚m ugwm xg丽 ,wm产,ypg! yx 10 um产 ygwm m,享产,y业y ug, CoDA k Coprocessor-Dominated Architecturelg! GreenDroid 互乘 于m七 CoDA ,fff CoDA ,二mvn k1l CoDA m且 CoDA g m乱m乱 乱yugm ,书yg仍 22nm v 7mm2 ,书y 90%g m CoDA ,g k2l CoDA 于享m CoDA ,m CoDA ,g,wmp wympy ,g万p CoDA ,fo ,mg k3l CoDA ,w Cache ff v二gvmx, CoDA 5.3 5 kenergy-delay product, EDPlou CoDA m 3.7 3.5 EDP g且 CoDA g

I 七

m CoDA m CoDA g k4l CoDA g mg m介 CoDA 产产wm书 ym亚g CoDA ,产 QsCore g仍 QsCore m 41% 2 ym亮 11.1%~22.1%g k5l FPGA vp CoDA m FPGA 个w二m 2D-mesh ug 乔m p 2D-mesh g ASIC FPGA FPGA 于乔g 之m Virtex 6 FPGA CoDA , g

上nmCoDAm,mmm

II

Abstract

FD6F

As silicon technology approaching its physical limitation, the traditional scaling theory guided by Moore’s Law and Dennard’s Law is about to fail. Under the limited power budget, we find the Utilization Wall and Dark Silicon phenomenon exist in current chip designs in the Post-Dennard scaling era. Furthermore, the dark silicon phenomenon will worsen precipitously with each process generation, making the chip design go into the dark silicon regime. In the dark silicon regime, the percentage of a chip that can switch at full frequency is dropping precipitously, which leaves more and more on chip transistors couldn’t utilize. So silicon area becomes less expensive relative to power and energy consumption. This shift calls for new architectural techniques that trade dark silicon area for energy efficiency. One such technique is the use of heterogeneous specialized coprocessors. Because the specialized coprocessors could save more than 10x energy than general-purpose processors, so employing coprocessors could improve the energy efficiency of the system for single application. However, most of the common systems have a great number of diverse workloads, in order to improve the energy efficiency of such system, architects need to employ hundreds or even thousands of specialized coprocessors and schedule software to run on these coprocessors. As the number of coprocessors scales up, these designs will transform from coprocessor-enable systems to Coprocessor-Dominated Architectures (CoDAs). As a member of UCSD GreenDroid group and Dark Silicon Center, the author writes this paper at UCSD. This paper focus on the theory, scalability, energy efficiency, and some potential issues about the CoDA. The innovations include the followings: (1) This paper makes the feasibility study of CoDA, and demonstrates CoDA is suitable for the Dark Silicon regime. This paper analyzes the Android mobile software stack, and finds most applications running on native libraries and virtual machine. If we build coprocessors for these shared codes, most of the software will run on coprocessors. Then, the paper analyzes web browser, and uses silicon to build it. The results show that it only need 7mm2 chip area to cover 90% web browser dynamic execution on specialized coprocessors under 22nm process. Its only take an acceptable piece of silicon area could cover most of the application execution, which indicate that CoDA is suitable for the Dark Silicon regime. (2) In order to explore CoDA’s design space under acceptable speed, the paper proposes a CoDA analysis model. The CoDA architecture in analysis model could employ a

III 七 multi-dimension scalable structure. In this paper, CoDA could compose by different number of tiles, each tile can contain different number of coprocessors and each coprocessor could be heterogeneous. The analysis model could evaluate the energy, performance and area of each specific CoDA design, and the design parameters includes both high level architecture configurations and low level implementation configurations. (3) Exploring the CoDA energy efficiency under different Cache configurations, tile sizes, coarse-grained energy management strategies and the transistor libraries. Under the optimal configuration, CoDA design approach that can deliver 5.3× improvements in energy efficiency and 5.0× improvements in energy-delay product for small workloads could continue to yield improvements of 3.7× in energy and 3.5× in energy-delay for designs covering over 100 applications. A scalable CoDA design can continue to deliver superior efficiency even for large workloads, which means CoDA could scale. In addition, the paper finds even with aggressive power management, leakage is still a sizable fraction of CoDA energy that grows with coprocessor count. (4) Exploring the influence of concurrent execution on CoDA’s energy efficiency. The effects are divided into positive and negative sides. On the positive side, running multiple threads on a CoDA increases overall energy efficiency because it amortizes fixed energy costs, including those due to leakage, across the work from multiple threads. On the negative side, when the target workloads drive CoDA generation mismatched the real workloads, concurrent threads raise the possibility of competition for c-cores, which could reduce the average energy efficiency of the system. The paper proposes to integrate the merged QsCore into CoDA to reduce the competition conflicts. The results show that using QsCore to provide twice number of C-cores for each type, only cost 41% additional area, but it could improve the energy efficiency from 11.1% to 22.1% in the non-uniform case. (5) Because single FPGA chip implemented on current technology process does not have enough recourse to emulate the CoDA chip target on next generation process. The paper proposes an inter-chip scalable 2D-mesh network, which connected by cross chip ring network. The inter-chip design also provides flow control for each physical channel of the 2D-mesh. The ring network design provides two types of connectors for crossing the chip. One is ASIC to FPGA (MURN IO) connector; the other one is FPGA to FPGA (FMC) connector. By using the inter-chip 2D-mesh network, the paper uses two Virtex 6 FPGA boards to set up the CoDA prototype system at the first time.

IV

Abstract

Key words: Dark Silicon, Coprocessor-Dominated Architecture (CoDA), Massively Heterogeneous, Coprocessor, Energy Efficiency, Scalability

V

8

0L][M]888

D88

F8

F888

(,

(2Y30 /

(2Y30 ,/

(()

)

七,-

(2Y30 ,/

(y 2MY[O/

(2MY[O /

((2MY[O ,(

()产 2MY[O ,((

(2MY[O 为丽()

(2MY[O 万(

(( AMY[O(

(((,

(((AMY[O ,(-

(()AMY[O ,(.

((AMY[O x万(/

() 6[OOX3[YSN)

(),)(

()(6[OOX3[YSN ))

( AS2R[YWO)

VII 七

()

((2Y30 ),

()-

)2Y30 )/

)2Y30 ,)/

) 2Y30 ,)/

)(2Y30

))2Y30 )

)(2Y30 )

)(x)

)((

))2Y30 ,

)),

))(-

)))f/

)2Y30 ,于

)

)(x(

))

2Y30

2Y30

(2Y30 于,

(2Y30 ,

((2Y30 x产/

()2Y30 x/

)2Y30 ,

)2Y30 ,

)(y:x,(

))x,)

2Y30 ,)

,,

,,-

VIII

2Y30 x,/

1OTWZ ,/

1OTWZ -(

(于 C8% -)

)-

(3WOR-.

28VQ -/

,.(

(2Y30 .)

(.)

((6[OOX3[YSN560 .

()仍.-

)2Y30 ./

)个./

)(A2Y30 个享/

))/)

/-

,//

,//

,(p

)

)

于七

IX

1-1ITRS 2012 [15] ...... 3 1-2 :[10] ...... 4 1-3 亿于 ...... 5 1-4 40nm , ...... 11 1-5ITRS CoDA [15] ...... 13 1-6 , ...... 18 2-1C-core , ...... 21 2-2 产 C-core GreenDroid , ...... 23 2-3 C-core 于为丽 ...... 24 2-4 ...... 25 2-5 ...... 26 2-6S-core , ...... 27 2-7S-core ,五乔 ...... 28 2-8 ,, ...... 29 2-9FPGA ...... 30 2-10 ...... 31 2-11Gartner PC ...... 31 2-12Gartner w ...... 32 2-13 ...... 33 2-14GreenDroid p ...... 33 2-15GreenDroid 产 C-core ...... 34 2-16 于 ...... 35 2-17 x ...... 36 2-18 乱x ...... 36 3-1(a) CoDA (b) y产 ...... 40 3-2 ...... 42 3-3 ...... 49 3-4 ...... 49 3-5 ...... 50 4-1 (EDP) vs. vs. ...... 57

XI 七

4-2EDP ...... 59 4-3 ...... 61 4-4 ...... 61 4-5 C-core ...... 62 4-6 ...... 63 5-1Basejump ...... 70 5-2 ...... 71 5-32D mesh ...... 73 5-4MURN IO ...... 75 5-5 久 ...... 76 5-6 ...... 77 5-72D mesh x丽 BDIOM ...... 78 5-8 久 2D-mesh ...... 79 5-9PCI Plug ...... 80 5-10 PCI Plug , ...... 81 5-11Basejump 之 ...... 82 5-12 , ...... 85 5-13GreenDroid ...... 86 5-14FPGA 仍 cjpeg ...... 88 5-15 : IP ...... 93 5-16 : ...... 94 5-17 klkl ...... 95 5-18 I/O 了 I/O ...... 96 5-19 亚kIR Dropl ...... 97

XII

1-1 x[8] ...... 2 2-1 ,产 C-core ...... 22 2-2S-core ...... 30 2-3 R 16 :v ...... 30 3-1 ...... 45 3-2 ...... 47 3-3CoDA 于 ...... 51 3-4 wm ...... 52 4-1 ...... 58 5-1PIO ...... 81 5-2FPGA 仍 RTL ...... 84 5-3 仍享 ...... 85 5-4 ...... 87 5-5 ...... 87 5-6S-CoDA 个、个 ...... 90 5-7S-CoDA ...... 91 5-8 亮久 ...... 92 5-9S-CoDA 争 ...... 92 !

XIII

1 七

Equation Section (Next) 1

uoP 30 mVLSI [1] [2]mg 30 m CMOS m串串f串串 亚gpmpm丽 mv亚丽享 v亚g乎wvmu产 2 g 30 m串串m串串 f乱w:g fffg 乎mgu产 乐w予m丽享w亚v 亚gvm亿v丽 v亚g二[3-6]m[7-13]g vm七f云二 g 且二kUtilization Walll些mp mu丽亚g二 书g DRAM kRobert H. Dennardl[2] 1974 p S2( Sm 1.4)m亿 S mv S3k2.8lgxm 亚 1/Sm:亚 1/Sg uwg云二n1l w串串m串串丽 g2lmwg k1-1lgt乱m pmmtg 130nm mm亚 1/S2gu产 S2 mw

1 七

g七g之 1-1 g 1 pCVfIVIV=++α 2 (1-1) 2 dd leakage dd sc dd 90nm mg t乱nfg um乱mw亚 [14]g

2 fVVVmax ∝−()/dd t dd (1-2) k1-2l丽m丽x

g丽mVdd 亚 Vt 亚g

IVTsubthreshold∝−exp( th / ) (1-3) k1-3lm g且亚m 予g k1-2lk1-3lm mw亚享wg vm丽些m w亚g亚wm umg 1-1 x[8]

CMOS 1 1 1 1 kW, Ll 1/S 1/S S2 S2 亿 S S 1/S 1/S : 1/S 1 =!FCV2 1/S2 1 = x 1 S2 =1/ 1 1/S2 1-1 tvmg wmu丽 S2gvmu亚 1/S2g二七mg mg且pm u个 2.8 kmS2xSlg 2

1 七

些vm 1.4 g m,g vmu丽享 wm;m8 uk之 4 l 93.75%g 1-1 ITRS[15] 2012 1 2001 gu七gm CMOS mm乐 vmw2g

1-18BA(( [15]

,享个gm ,om ,kenergy effciencylm u个g 2005 丽:g : 1-2g65nm 32nm 丽之 2 m S=65/32 2g七umu产 4 :m享 4 久gw些um

1 ITRS 2012 2013 m 2013 七乐 k2015 1 之lm 2012 g http://www.itrs.net/Links/2012Winter/1205%20Presentation/DesignSD_12052012.pdf 2 demdDark Silicon EraefdDark Silicon Regimee dDark Silicon Ageeg d Dark Silicon Regimeem 么 w m 享,g上m乐且d egm且二g

3 七

wv1m 65nm 32nm ,: 2 m亿wk乱lm gp:wm亿g乐p 于,:pm亿pgwmu乱 个m个k付乱lg

1-2 :[10]

pgm Intel 4 I7 NehalemfSandy BridgefIvy Bridge Haswell mI7 , 七g亿予m :m 1-2 于p于gm Mac Pro [16] Intel :串m 亿串g 6 :亿 3.5GHzm 12 :亿 2.7GHz2g m仔 920 big.LITTLE ,[17],,;享 4 : Cortex-A15 m 4 : Cortex-A7 m亿v 8 :g 1-3 亿于k ISSCC 2014 [18]lm 2005 m亿w予g 七 1-2 m于pm u串串g:,p, 于mwgp七:,

1 w主fmg 4W m 50Wklg 2 :亿wpgu:m乐t Cachemm乱g 4

1 七

[7-11]gm 12nmf8nm ,二m享,p举mp举享 二wm mg

1-3 亿于

wfop 些亿v[8]gmk 久lg Cache 久m Cache 丽亿 1%o 乱g mw串串g ,u, v享v1g 且d二e,个g

亮p仅m两f g二f之 g pvfg二m 且二gxf wmp串u些mu些 ff,g u些些vmuw丽g

1 二g

5 七

muf亚丽亿 g,m, 个m些1g m,m 14nmf8nm 么些,二g 二 2006 乘kUCSDl Michael Taylor Steven Swanson m 2010 ASPLOS u UCSD Michael Taylor Steven Swanson 互[5]g 2010 HotChips u UCSD 二pkDark Siliconl[6]2g 2011 ISCA um人 Hadi [11]m, mpu:g g 2012 ISCA mUCSD Michael Taylor f Steven Swanson Jack Sampson() p Dark Silicon kDaSilgxm DAC 2012 umUCSD Michael Taylor p,[10]mp gu Michael 乐 CoDAkCoprocessor-Dominated Architecturel m CoDA ,g2013 UCSD Michael Taylor Steven Swanson 久m久p IEEE Micro ymy 七gg m 2010 2013 m 2014 u 且m,v3n1l之产 y久o2l(Dim Silicon)m:亚乱亿o3l MOSFETo4l乎g ygUCSD Michael Taylor 4fSteven Swanson 5 Jack Sampson 互》k互lg UCSD 互 2010 ASPLOS u[5]m七 二m七七m仍

1 Intel Die m且些v 乱g 2 Dark Silicon 上w UCSD mARM CTO Mike Muller ARM u 上mg http://www.eetimes.com/document.asp?doc_id=1172049m 上 Prof. Michael Taylor g 3 x Prof. Michael Taylor wm Die m 乎g 4 他 Raw ,g,:,mRaw ,g 5 人 WaveScalar , 6

1 七

七二g七 Conservation-core ( C-core)“,mypg 七[13, 19] C-core ,mp mg[20] C-core g m 2010 HotChips SoC 乱pp七mUCSD 七[6]二kDark Siliconlm产 C-core GreenDroid “g七[4]m七[3]m 七 GreenDroid m仍gm 七乐“ GreenDroid 仍g GreenDroid g 2012 DAC DaSi u七[10] wgm 亦g七 Michael 七[7-9]gm七 CoDA ,g Doug Burger 1f也 Karu 2人 Hadi Esmaeilzadehkl互yg 互七 2011 ISCA g七[11]七 vm:,g七w :,mg CPU ,m产之 35 :产: g m乐m:, k99%u乱l且m 8nm 3.7 g [21, 22]也 Karu gDySER m之 。五ug 2012 1st DaSi u Lisa Wu Martha A.Kim 七[23] w产mvt七n1l之 SPEC2006 产m C-core DySER 亮, C o2l Java 产w

1 也x SimpleScalar m于 TRIPS g 2 x TRIPS g

7 七

o3lym产 g七w C-core m七 CoDA g七g七 y久g [24]yg七 ,久g七wm ,久g,久w于 mg Cadence p, y二[25]gyg [26]m, x乔gyg 亚亚亿g; [27-30]mgpm !mp于之 g于么u些g m于m mmug ;七[31]g七, 产:m:o m亚乎knear threshold voltagelm gmm享 :g七七g 互[32, 33]:u, ,,kTopologically Homogeneous Power-Performance Heterogeneous, THPHlg :,p,:mp:w -亿kvoltage-frequencylg享m义 亿:uoxmw享m义 亿:umg Dark Silicon g g》g[34] 》ktunnel field-effect transistors, TFETl久gx MOSFET TFET mg TFET MOSFET g七: MOSFET

8

1 七

: TFET :g 乎什m享乎 乎w亚g[35-38]人 乎g乎kapproximate computingl 2012 ITRS m且享乎享 vm亚mg乎 [39][37]g,u乎享f下f ,g p乐n pu[40]g klkl u[41]g klfkl [42]fpg kl[43]u:, 业g big.LITTLE 亮:,二[44]g ,pv g m, 4 g y久fw-:w久 g 久,pmp, [8]gm g 800 100 sm 100 vgf久,m乐 亿(Dim)mymxgx 亿mv么 1KHzg 两m之de m,gp 享亿f予g B

B CoDA(Coprocessor-Dominated Architecture)乎 UCSD GreenDroid 互 Prof.

9 七

Michael Taylor 个,g, ffym产 ,g些ppyg m为丽yug 些mv CoDA :m久乱gmp y:事klg CoDA ,xgymw wgwf乔wm,m 于乔乐gCoDA py p产g 亿md亿e 1000 g CoDA ,ym亚久亿g CoDA pyg t乱久demp乱 乱d久eg CoDA ymp ym乱yg 互 CoDA ,ug,w m乐ym,y 之g CoDA ,my于为 丽m为丽之ug g 22nm vm 个, CoDAm CoDA ,uy g! 产y CoDA 乱 yum之w产yy ugy串m CoDA ,u 串g CoDA ,vpgmmu 串串mg CoDA ,个y享m之y乱 gxmuklm pkylg m,ygCoDA

10

1 七

pym[45]m k10-1000 l[5]g乎 ,ympy亚 10 u[6,! 12]gy于g tmmpm 享业p业g, ym业为 丽yug之my mg ymgy mo o乎m串串m, yg

1-4 XW ,

11 七

1-4 2011 [46]m产yg ymx wumg 七u CoDA ,m CoDA ,产y 乐v二n 1lxgx于 pm ugm产g 串串wympy gw二nmy y于m CUDA AMD Nvidia GPUomy um CELL ug 2ly亮m享 个gx:f:wmu产 py,fwg享p yg 3l乱gx七 Amdahl [47]mm乐 um亮些g UCSD GreenDroid 互[5, 13] Conservation CoreskC-corel yuyut二m CoDA ,g my二gy ymx gmw 串串ym乎x于 g mC-core 业 ABIm u业p业 C-corem为丽y gy C-core w享m 且习gwy g tm C-core g产

12

1 七

C-core m乱um乱 C-core umy g C-core CoDA ,g 于y乱g g C-core m互 GreenDroid gGreenDroid 享y C-core 产,gm GreenDroid 乐ymgm 互七 CoDA um CoDA ,g 互m ITRS 2012 Dark Silicon1 CoDA k2013 lm 1-52g 互 CoDA ,乱(DARPA)之v, (Center for Future Architectures Research, C-FAR)g

1-5ITRS CoDA [15]

y guwmCoDA ,享 vm且,享uw

1 ITRS Dark Silicon p power off 久g 2 ITRS 2012 2013 m 2013 七乐 m 2012 g http://www.itrs.net/Links/2012Winter/1205%20Presentation/DesignSD_12052012.pdf

13 七

ymy产 CoDA ,g CoDA , p七gw CoDA ,产y v二g m CoDA ,m产y m CoDA gxmp ym产yp ,u乔m享g! mxmCoDA ,u乱 kn争lmm 二g! tmy于为丽亿f为丽 k乔ml Cache g myg! m产ym产w om C/core 乱m EDA 享g! u两二享 CoDA ,二m二 CoDA ,:二gCoDA ,二vn! 1lCoDA g CoDA ,产ym pyg, p二g CoDA ,fm g CoDA ugmCoDA ,享g二g! 2lCoDA ,g CoDA 产uym享 CoDA ,gmwm CoDA ,wum CoDA , gmmCoDA ,乐享,f gt二g! 3lCoDA m CoDA ,:二gCoDA g乔fvm CoDA ,x,乐m CoDA :二g CoDA mCoDA ,m CoDA g于g! 4lCoDA g仍,p[48]m

14

1 七

,享东g GreenDroid/CoDA ,f m享 FPGA g ,g CoDA g! CoDA ,y 二gyp乱kn亿f lm CoDA ,享ym ygym 丢yum /gm亚 CoDA g m享pg! p享七二乐nal CoDA m obl CoDA ,mx ,o! ! o UCSD UCSD GreenDroid Dark Silicon tg C-core GreenDroid ym 产uy CoDA , 二gm UCSD GreenDroid m GreenDroid/CoDA FPGA f 介f下 f乱 C-core , 2 RTL 45nm 仍g 于, ,五:,m API g mvg k1l S-core g C-core 亮 mf乱 S-core gS-coreImagineMASA m,oS-core 乱,五m 之,gmm 介m、pfff API g k2l CoDA g互 mpg 二ff业g

15 七

GreenDroid y万m 90%享ym乱 g k3l CoDA ,万g互 ump CoDA ,万g CoDA fg万 CoDA ,,u mgf,业 CoDA ,g万于p,m fCache o乐pm fg k4l万 7200 w CoDA g CoDA ,w y,m乐产 352 y CoDA ,g万 CoDA fm CoDA g之 CoDA CoDA 乱m CoDA 享二g乐 CoDA x CoDA 七g k5l GreenDroid FPGA g GreenDroid 互 GreenDroid FPGA g Basejumpm ff介fPCB 、m w互互g Basejump FPGA gGreenDroid GreenDroid 产 Basejump g乐乱g乐云 nf严m 4-6 互g umvn k1l CoDA m且 CoDA g m乱m乱 乱yugm ,书yg仍 22nm v 7mm2 ,书y 90%g m CoDA ,g k2l CoDA 于享m CoDA ,m CoDA ,g,wmp wympy

16

1 七

,g万p CoDA ,fo ,mg k3l CoDA ,w Cache ff v二gvmx, CoDA 5.3 5 kenergy-delay product, EDPlou CoDA m 3.7 3.5 EDP g且 CoDA g m CoDA m CoDA g k4l CoDA g mg m介 CoDA 产产wm书 ym亚g CoDA ,产 QsCore g仍 QsCore m 41% 2 ym亮 11.1%~22.1%g k5l FPGA vp CoDA m FPGA 个w二m 2D-mesh ug 乔mp 2D-mesh g ASIC FPGA FPGA 于乔g 之m Virtex 6 FPGA CoDA , g 二 CoDA ,m CoDA ,七mg p七mfm f,g CoDA CoDA gCoDA y C-core S-coreg 七 CoDA ,gm um CoDA ,g CoDA 七g七dGreenDroid: An architecture for dark silicon ageefdSiChromenA Silicon Browser”dAn Evaluation of the Many-core Longtium SP Computer Systemeg

17 七

t CoDA ,ffff 于g CoDA ,m,f mp万mt CoDA fg CoDA 二 CoDA g七dExploring Energy Scalabiligy in Coprocessor-Dominated Architectures for Dark Siliconeg CoDA m七 CoDA ,g CoDA 七g七dExploring Energy Scalabiligy in Coprocessor-Dominated Architectures for Dark Siliconeg 乱 七 dTolerating memory latency: L2 cache actively push architectureedAAP and AAPM: improved prefetching structures of the L2 cacheeg

iuCoDAhorD Gj

B CoDAho CoDA CoDA CoDAa Dj rDe j A

p S h C o s d h e a S o a C n d m

rD horD

rD rD

1-6 ,

CoDA g, Basejumpo CoDA GreenDroid FPGA 仍o CoDA g GreenDroid/CoDA , g mvpg 七, 1-6 g

18

2 CoDA ,

Equation Section (Next) 2 B

CoDA ,个,gCoDA ,u个m乐y g CoDA ,七,g y Conservation Core(C-core)gx wmC-core 亮o S-corem乱,五gS-core 乱m S-core om 之 GreenDroid SiChrome 七 CoDA ,mu CoDA 1g,m,ym 万享个g万 m CoDA ,g ep BD y C-core m 亮m乱 乱g C-core CoDA m C-core g C-core C-core o C-core ,o,产 C-coreo C-core 为丽 之o C-core 万g

BD b ywmC-core wg tn1l2010 ASPLOS u二[5]m Conservation CoreskC-corelm C-core m, m下丽g2l2011 HPCA u(selective Depipelining, SDP) Cachelet [13]m C-core m C-core g u C-coreg3l2011 MICRO u QsCores [19]mpym C-core C-core pgy /mpm C-coreg m互乐 C-core m

1 CoDA wm享个g

19 七

w下kGCCfLLVMfCodeSurfer[49-51]fOpenIMPACT[52] Raw-GCC[53]lm 丽 LLVM1p下g C-core n 1l C-core 于乔o2l mo3lm m C-core o4l》 (software pipelining)g C-core 互丽g p从mg fpm乱于 乱m丽 C-coregm LLVM Clang 下 C/C++丽 LLVM 于(IR)g m乐于(C-Core IR Arsenal IR)m LLVM kf么” lkm亮ulg LLVM C-core Verilog f g MIT Raw BTL g产 GreenDroid/CoDA g

BD y C-core p g业 ABImC-core ,gp C-core pg 2-1 C-core pmf久f 久g C-core v C 之gu乱 kfast clockl C-core 久(load store), 亿亿ox于 g两upmp 于wm业; gm亚 C-core m 乱久 load m乱 g久(CFG)m久, vgxmm

1 LLVM 下mw mgpmLLVM 串串m m下g 20

2 CoDA ,

vpg

2-1C-core ,

C-core p 2-1 久mC-core 久;于为丽gm 业kABIlm享 C-core p久 m C-core pg C-core 乱 load Store m乱ymC-core 乱w下 久g ,产 C-corem享 C-core pgu 乱n:x C-core m C-core f C-core gp乱 C-core x m C-core g:x C-core 于之pmt乱n1lC-core o2l C-core o3lg之m C-core 乱pg C-core m之 patch kflg C-core 习mw享um享 C-core CachegfC-core x Cache 于 g Cache pm乱ppm p C-coreg 2-1g

21 七

2-1 ,产 C-core

乔 tree_addr 32 In : 二 tree_store_value 32 In : C-core tree_re 1 In : tree_we 1 In : tree_load_value 32 Out : C-core clk 1 In reset 1 In mem_valid 1 In Cache Cache mem_load_value 32 In Cache Cache mem_addr 32 Out Cache m0 8 m1 16 mem_store_mode 2 Out Cache m2 32 mem_store_value 32 Out Cache Cache o2 g1 mem_access_type 2 Out Cache m0 m attention 1 Out : : C-core 享 专m err_flag 32 Out : : C-core 专

BD 产ym享,gMIT Raw 2D-mesh u乔七u 些m,g 2-2 C-core 产 GreenDroid ,g 2-2(a) GreenDroid ,mGreenDroid 2D-mesh 乔,gp 2-2(b)mp(MIT Raw)f 8-15 C-coref Cachef C-core Cache (on-chip network, OCN)g 2-2(c)乱乔g, wmw享 C-core ugmC-core 业 ABIom C-core

22

2 CoDA ,

Cacheg Cache w于义m 业g

OCN C OCN C C C L1 L1 L1 L1 CPU CPU CPU CPU C I-Cache D-Cache

L1 L1 L1 L1 C

I-cache CPU CPU CPU CPU

1 1 mm C L1 L1 L1 L1 CPU CPU CPU CPU CPU Internal State Internal Tree Interface CPU C C C FPU L1 L1 L1 L1 CPU CPU CPU CPU 1 mm !!!!!!!!!!(a)!GreenDroid !!!!!!!!!!! (b)! !!!!!!!!!(C)!CPU o C1core O! 2-2 产 C-core GreenDroid ,

BD C-core C-core m乐 g业 C-core C-core m C-core g C-core 为丽 C-core m C-core ug 为丽 C-core m享ff C-coregm C-core g C-core 业 MIT Raw ABImRaw ABI x MIPS O32 ABI g ABI 乱npmpg Raw ABI $v0f$v1 m$a0-$a3 m y争 SPkl GPklg C-core m业 C-core 之x C-core g pm C-core pm C-core 之 GP m gx业w C-core 乱m C-core g 享mC-core ; g C-core m attention m C-core C-core g mC-core 业业m享 C-core ,g 且 C-core 于为丽mp

23 七

dem为丽g 2-3 mp Hello C-core 业 3 mt 3 C-core 产, g 1 m为丽 C-core_f1 um为丽 om之 2-3 g

Hello C-core { OCN

I-Cache D-Cache

1 2 3 4 …… CPU 5 Internal State Internal double c = function_3(args) + a + b; Tree Interface FPU 6 }

2-3 C-core 于为丽

BD C-core ,ffC-core 业g C-core mg 2-4 七[13]m g In-SW u 产 C-coreg C-core m ECOcoregtum于于mv g ECOcore 于 33%x :于亚g么 59%m 乱m之 么 70%g 七[13] 45nm MIPS 24KE m 1.5GHzm Cache Cache CACTI5.3g y ECOcore 万 Verilog m Synopsys DC(C-2009.06-SP2)fIC Compiler(C-2009.06-SP2)g万 TSMC 45nm GS g 享且 C-core ,y:m于m gwf 之予( 4 4 32 )g C-core ,wmpy: g 24

2 CoDA ,

2-4 s 4BD C-core 亮k为丽l

25 七 um S-core ,gm互享 CoDA ,yg[21]》产 ,五1之。五ug S-core ,五m mffgm[54]亿, m,五, 乱二m乱之产 SIMDfVLIW g mp,mwm乐 p,[55-58]gm: ,g S-core m ,g

os 且mgw于f乔 f一互m互予予w予[59-62]g一 mg之亮 m-乱gx,wmvn m之mm[59-62]g mg: :m主乱gkStreamCl :kKernelClm业一m 之:m 2-5 g: p产mm:一 一gm:于g MIT StreamIt[63]m Brook[64]m StreamC/KernelC[65]g

StreamI 1 StreamO 1 StreamI 2 Kernel 1 Kernel 2 StreamO 2 StreamI 3

2-5

1 七,久 FPGAm两up,五g 26

2 CoDA ,

4BD p且mg 2-6 S-core ,g S-core 乱n,五gm 乱nkStream Register File, SRFlo kStream Bufferl乔,五mp pg,五v乱n1l:o:乱 o2l乔o乔k 2-7lm g五之p mgx 2D-mesh m oxm m享乔个[66]g3l,o,五 :于乔,m,p: vpg,,m 2 :m p乱,g,g

2-6S-core ,

Imagine MASA m,g p API mm u SDR SRF 于m:g 七hAn Evaluation of the Many-core Longtium SP Computer Systemihi1g

1 vv http://pan.baidu.com/s/1pJAzk1dmS-core 乔m 五p,mg乐 m S-core ug

27 七

2-7S-core ,五乔

4BD p ,五乱:pm fffff 7 gm :,g,m g,享 :m丽g ,五,m ,五,g ,之丽m A÷B 丽 A×(1/B)m亮g kSymmetric Table Addition MethodmSTAMl[67-71]m 人乘kNewton-Raphson iterationlg之 STAM m享m个oSTAM 人乘 m个g STAMm 五g人乘 k2-1lm x0 STAM mx1 人乘 mB g之p人乘pg

xxx100=+×−×(1.0 xB 0 ) k2-1l m:,p亮 mt:pp人乘m,, 2-8gp

: x0o:p

1.0 - x0 * Bot:p人乘 x1op: pgm付 乱m付乱 g 五,w m 4 :, FFT g

28

2 CoDA ,

X0 A A A

* * A/B X1 B B B 1.0 MUX 1.0 MUX ± ± C C

MUX MUX 1.0 – X0 * B

a b c d

X0 A A

* * NoC X1 1.0 – X0 * B B B B X0 OutA MUX MUX ± 1.0 ± MUX C C X0

MUX MUX X0+X0*(1.0 – X0 * B)

NoC (a) OutB (b) (c) - (d)

2-8 ,,

4BD 仍万m Altera Stratix II FPGA m NIOS m S-core Avalon ug FPGA 个些m 4x4 ,五m32KB m 69%久个 100MHz 亿vg 2-2 S-core 乱g 2-9 g m FPGA um 万gn1024 FFT 10 m 1000 16 FIR m2048 2048 g C++ Intel um乱 k Intel 80 m S-core 32 mm 专lg万,五 16 :m

29 七

: PowerPC 750 R m:乱g u于之m R g 2-3 2-10 g 2-2 S-core

Top Level 1089 2741 36562 7.54% Stream Buffer 991 2498 34995 6.87% SRF 4571 9919 153888 27.29% Computing core 5906 13318 131121 36.63% Star-Mesh 2086 4429 45731 12.18%

Software API 1485 3450 33900 9.49% Total 16128 36355 436197 100%

2-9FPGA

2-3 R 16 :v

FFTkR2 FIR(R16 n=1024l n=1000) kn=2048l kn=2014l : 41040 29465 2072 8182 16 : 3470 1370 326 552 11.83 14.21 6.36 14.0

30

2 CoDA ,

14.21 14

11.83

6.36

FFT FIR

2-10 bp -DADB wm mgk2-2l mgpm 一m 二g 2-11 , Gartner 乎m 7 ug 2-12 Gartner wm(Android) pum乐wg g互 p产y GreenDroid1g battery_ capacity Power_( Budget = ) (2-2) battery #_hrs active __ use between _ recharges

2-11Gartner PC

1 www.greendroid.org

31 七

2-12Gartner w

ym,g 2-13 g:p C C++ kNative Librarieslm乱享mff 2D 3D gp乐 Dalvik (DVM)1gDalvik Java 下m Java Native Interface(JNI) Java 之业g de业mde Dalvik ug Dalvik g之m u乱g 乱,pm 乐pmf亿f gy C-core 产 GreenDroid 亚gy 2 n1l Dalvik my亚 m仍 C-core 72% g2lvmp yg;m C-core 80% 95%g C-core m m 2 m, C-core m C-core g u产 C-core u乱

1 Dalvik 乐下丽mp;m 2014 6 I/O umDalvik ART m Android , m“乐g 32

2 CoDA ,

gmwmu个串串m, 个 C-corempg GreenDroid m, 2-2(a)g

2-13

-DADB GreenDroid p万m 2-14gp:f Cache Cache u g乐 9 C-coreg C-core 2D k7 m libskialfJPEG (1 )(1 )g TSMC 45nm m C-core 0.58mm2 m 58%g

2-14GreenDroid p

33 七

2-15 GreenDroid 产 C-core g um C-core u乱g n1l C-core uw享f下 久m 56%o2l 35% C-core g C-core u 91pJ 8pJg

2-15GreenDroid 产 C-core

C-core m,w C-core u g产 C-coremg vp之p万g p 4DB m;kNielsenl2011 m 31%于m且 pgwu于予 mfm C-core g

p m(profiler)g QEMU mfDalvik Linux :g Trace 一 CPU upg fDalvik Linux :g 之w享m享p g 之二m二 乎 20 m GooglemgmailmGoogle NewsmGoogle Financem Amazon, Docs gvn1lm

34

2 CoDA ,

pm C-core o2lm C-core o3lm C-core o4l业fm C-core g 2-16 于gpu于 libwebcore f libskia 10%于flibc 9%fDalvik libdvm 8%flinux : 7%gm 乱于f:g, 乱m C-core 乱m g与且m产 C-core m CoDA ,七ug

libcrypto,!libz,!! Other! libchromium,! libwebcore)58% libcuNls! ) Kernel,!7%!

V8,!32%! libdvm,!8%!

HTML/! libc,!9%! Layout,!21%! Other,!4%!

libskia,!10%!

2-16 于

2-17 x于g mg mg mm予o m予g 90% 享m 84000 g之 2.1 C-core p享个m万 8 s享 个g万 22nm v享 7mm2m 100mm2 Die mm个 享gmf :m个g

35 七

2-17 x

2-18 乱x

mxm 90%m享 1200 o 80%享 625 g C-core g 2-18 乱x于m 乱享g享 g

B 之 GreenDroid ,m乱 g乱m yu于于 72%gm乐

36

2 CoDA ,

APP APP ym 乱gy 80% 95%g u七mm 享个g之m 个且g产 1000 C-core 个p乱乱m 乱m享g CoDA gpm pm乱产 CoDA ,乱 m CoDA ,,gm之万享 个mg CoDA m个g产 C-core 产 m乔fg g my C-core f ,f产f为丽 C-core g万 m,产 C-core 70%gm ff乱 S-core ,g m些mg 乱m乱 y乱gm m CoDA g 乐 SiChrome 享个mu个 ,m CoDA g p[5, 13, 19]mx C-core , mp C-core ff gx 863 七dAn Evaluation of the Many-core Longtium SP Computer Systemegt 七dGreenDroid: An architecture for dark silicon ageed SiChromen A Silicon Browser”g

37

3 CoDA

Equation Section (Next) 3 B lo

m,串串二m, 产y串串gy nf且m亚之m 享,产yg CoDA m CoDA ,产uy 产y二g! CoDA ,y 二g享 CoDA ,g CoDA ,m CoDA mRTL f CoDA wgp CoDA ,f 万m于mvp CoDA ,g! vnp CoDA ,f o CoDA ot CoDA o CoDA 于gg! B CoDA ,u ymp,gvn 1l CoDA wmCoDA ,o2lCoDA ,mp: o3l CoDA ,业g!

l B CoDA ,wg享 ym享yoy mmw于gy CoDA ,fgm:,享 g享vmCoDA ,享wu mw产mg 3/1kal CoDA ,mv,g! kTile/core! Scalinglum CoDA w ,gw:产m Cacheg 3/1kalmCoDA ,w乐kVoltage!Domainlgp

39 七

pp Cache g于 f 2D/mesh 乔mwg 2D/mesh u :m CoDA 产些产 些ym于g!

3-1(a) CoDA (b) y产

乱kIntra/Tile!ScalinglumCoDA p wymyg wmw享y业 ABI 乐yx Cache w业g乱y之 p乔 Cachempp二 Cachem 乱:,g产y串mCache 二u串f 串予m产yu些g产y mp产m享g 3-1kbl p乱乱乔gm乐p L1 Cachemp乔gxy乔g! ykIntra/Coprocessor!Scalinglump产 CoDA ,y,m,y wgmgm ,乐;享之下yg m之kinlinelymy 于为丽g! pupyo享

40

3 CoDA

ym享义y ugwy 七m之mpp g! CoDA mp:二p Cache x 产ypyxp Cache 于m L1 Cache 二g产y m L1 Cache 二m些g myw, wg CoDA m产ym之 Profile y享m享 yp Cache 乎m享p Cache gw予m亚 gxm 两g! CoDA ,wum g且mw. m g MIPS MIT Raw [72]g 之乔二y乱m之 g!

B !!!! CoDA ,tmf争 争gm CoDA x,: CoDA g! 之 CoDA ,k 3-1(a)乱l五 五gpkpower raillm kvoltage regulatorlg之 w享mw Cache 个g CoDA ,于m Cache u 享于m业 g! ,争u享于 业于gCoDA ,享uyg

41 七

u个mvm ;个g享y y;m之 享ywg乱享业 g业于x争[73]争 [74/76]于g! CoDA 乐争gpy mgym 业争og big-LITTLE ,[17]g之m ug! !!!!m CoDA ,pmp乱p mpyg!

3-2!

CoDA m么 CoDA x:/:,mpfCoDA , 于m且g! 3/2 中g于 CoDA :um付亮m CoDA :g乐p m乱 CoDA ,u争争 g! CoDA u之x之g mp付m CoDA ,uvg mp乱临m业 享y:ug之享 y乱m: m争guw :享争争vgCoDA ,wy 42

3 CoDA

于为丽m:争mv 争g为丽vp:m:争 mvp业g之uw g!

B CoDA y业 ABI x Cachem享g! CoDA ,my于为丽g 为丽mCoDA 下享y业 ABIm pgyg m为丽ow vmug乱 m业乱业pg CoDA 且pgp CoDA 下m 且习gCoDA yf乱g! CoDA ,my g二m享 义之gwm业业 uug CoDA , uvk Conservation cores [5, 13]lm CoDA 下享 y享g! y丢um vp业yg!

B epo y C/corem乱 gym 产om乐万 CoDA , !

epP CoDA ,mwy fwgy mm CoDA y

43 七

gym, 产ywg CoDA ,些m亮些亮 ygmyum w丽vm gm七 CoDA ,m 3-1kblg! C-core [5]puy g C-corem, C-core yg C-core yg[3/6,!13]x mC-core 亚 10 k亮亚 30 [3,!4,!6]m亚 10 l 23 ugC-core 亚wgm C-core 亚m些vmCoDA , g! p C-core p乱g C-coregu mfpwg m C-core 于为丽g xm C-core m Amdahl’s m 丽 C-core um串串 业 C-core ug CoDA ,m产 C-core yumg! C-core v从 C-cores:! m profile gmp k switch l产gC-core 于kspatial computation approachlmppgxm C-core 久g久于gm 久p C-corem C-core pgpmC-core C-core m乱享之 mg C-core wu m乱乱ug

44

3 CoDA

C-core mug! 3-1(b) mC-core m C-core pg C-core x于m profile 享m C-core CoDA yg C-core cachelet C/core [13]m g!

CoDA ympg CoDA m些y uyg享 g! pdem“ SPEC 2006[77],!SPEC 2000[78] EEMBC[79]m乱 g 3-1 C-core g“乱 22 C-core m C-core Synopsys 之 争m之 FPGA 仍g! ! 3-1

22nm C-core 产 kmm2l astar (SPEC 2006) path finding 0.044 bzip2 (SPEC 2000) data compression 0.329 cjpeg (Independent JPEG Group 2000) jpeg encoding 0.076 crafty (SPEC 2000) chess 0.580 djpeg (Independent JPEG Group 2002) jpeg decoding 0.118 gzip (SPEC 2000) compression/decompression 0.190 mcf (SPEC 2000) multi-commodity flow 0.056

Viterbi (Embedded Microprocessor Benchmark convolutional decoding 0.039 Consortium 2002)

!!!!! !!!! C-core m p 2 f4 f8 16 m业

45 七

C-core 50%g之m“ 8 16 m32 m64 128 g m享 352 C-core y 97%gCoDA ,产 352 C-core 享 22nm v 73mm2 g! B s u CoDA m 享二g CoDA ,m CoDA m CoDA w gp CoDA ,fm CoDA ,g

vt介n1l C-corefCoDA 乱g C-core f oCache xfoo争o2l trace mg 义mCache pmCache /o3l 产 C-core m 45nm 40nm v 22nm gp CoDA vmf g! Synopsys Design CompilerfIC Compiler PrimeTime 万 3/1 C-coreg C-core fm互 g, C-core g mfm gukTSMCl45nm 40nm m 22nm g! m LLVM[80]一pm w乱yu乐u。g CoDA ,mpp。pm yugp一 mnkpmCache Cache 义lm u义k C-core 亮 C-core 于义m于义 lgumpg 一之um

46

3 CoDA

g!

si n Wire_2 length Component _, Area (3-1) = ∑ i !!!!!!!!!!!!!!!!!! i=1 ! ∈ !"!#$!!"#$"%&%'!!"#$%!!ℎ!!!"#ℎ.! mp 40nm 45nm v万 22nm g 3-2 g 乔x予m乔gk3-1lm 乔予m ASIC 乔个m 予于人kManhattan-distancelg y L1 予my 享m享y乎g! mvg1l之 C-core u 万mp为丽 y享 30 g2l MIT Raw 万[72]m 于义享 300 m于u 之u义享于g! 3-2

Bill Dally 2009DAC [6]m k3-3lk3-4l22nmg C-core[6]m y k3-3lk3-4l22nmg p享y NoC pg Cachem二于 CACTI[81]m22nmg

LPDDR2m3.2GB/sg kHPlkLSTPl 万m ITRSg(LP) gnHP 1mLP1.25mLSTP2.5g 享 万m22nmg x于 30m万 于义于 300m万

! 2 Areanew= Area old*(λλ new / old ) k3-2l

47 七

k3-3l LEnergy__ per square _ mmnew= ( LEnergy __ per suqare _ mm old 2 *3DFacter _ *(λλold / new ) )

k l Dynamic__*(/) energynew= Dynamic energy oldλλ new old !!!!!!! 3-4 ! k3-2lfk3-3lk3-4l 45nm / 40nm vf

vg λold λnew gk3-2lm个些且k IO 些l uwgk3-3lpg m之业g mm些gk3/3l g 32nm 22nm 3D (FinFET)丽mpn3D g 3D 0.7m Intel 22nm [82] gt乱gm 享亮m gtwm乐 gmk3-4lgm 之亚m 亚g亚k3-4lug! m Trace p Cache Cache gfm CACTI[81]g CACTI 32nm m 22nmgfk3-2lf k3-3lk3-4lg! m些m享 gp亮m ppgpg 义pmg pm享二p Cache pm pmpmp 于g! 一 C-core gm gp C-core 且m 0.0015mm2 48

3 CoDA

0.28mm2 wg C-core 万m C-core mx SiChrome gpwm 万mffff g乱x七[13] C-core gm七[13]p于 仍gpm产 C-core C-core pg! g万 C-core 40nm 45nm gm C-core 亮久万m 22nmmf 0.25mW/ mm2, 1.02 mW/mm2, 29.28 mW/mm2g k3-4lgwm CoDA wm ITRS gf 2.0f1.0 0.8g! 3GHzg;七[6] C-core m C-core 亮 30 g万 22nm v享 43pJ g[13] C-core 27%g!

os

3-3

3-4

49 七

万tnfgt g 3-3 g CPI mg CPI m CPI xvnm享 C-core x乎my L1 m予mgm 乱于 CPIm CPI g!

3-5

50

3 CoDA

3-4 乱gmL2 w L2 mgm pp L2 g! uwgw pmmp pg 3-5! p乱 gvpmyg! CoDA ,u乱gm 乐乱mp乱gm DRAM mkthrough-silicon viaslmpackage-on-package gm wgmw七m [83,!84]g! B i 3/1 , CoDA ,乐g CoDA 乐wmmw Cache g mwvkmwml wg! w CoDA ,m CoDA 七gwm CoDA kflg!

i 3-3CoDA 于

kl 8m16m32m64m128

L2 Cache kKBl 512m2048m8192

L1 Cache kKBl 8m16m32m64 kmm2l 0.5m2m8m32m些

1m4

争 0m90%m95%m98% LSTPmLPmHP

! 之wmuw CoDA g

51 七 wmw m CoDA ,于g! 3-3 CoDA 于gw于 w CoDA m Cache 产y m,:产yg 些于mw于w Cache 个g 且mw万 7200 w CoDA g!

is 3-4 wm

y mm m亮 乔予 乔 Cache Cache CPI于 CPI于 m亮 乔予 乔

L1 Cachem

L2 Cache 享 乔NoC u乔予 CoDAmCachemmMUX 争 亮 享 CPI 亮于

! 3-4 ptg 亮gpmy 98%mm享业享y g于w L1 L2 gw fy乔予g 于g gy于为k lm乱 C-core 予kylg

52

3 CoDA

CoDA 亮o乔予 乔o CoDA pp L1 m L1 o CoDA ,m 于于亿g! tgnu g业x乔 mgd 争 e之争 亮gpu km争[73]lm g CoDA ,pp Cachem Cache gpg nkLSTPlmkLPlmkHPlg 3.3 xg! CoDA ,f产g CoDA m CoDA , g CoDA wm乱产w ypy,gCoDA f争争g! CoDA ,于m万 w于g CoDA ,m ,mmgt 乱nmgvp CoDA 于g! !!!!七dExploring Energy Scalability in Coprocessor-Dominated Architectures for Dark Siliconeg

53

4 CoDA

Equation Section (Next) 4 B

CoDA ,产串串ym产w予m CoDA gy m CoDA x,:乐 g CoDA ,: mCoDA g! CoDA ,于m CoDA , vg 7200 w CoDA m Cache f,f CoDA g! 乐介 CoDA ,w yfm书y CoDA mg! 仍m CoDA m产 ym 3.7 o 5.3 g CoDA ,x,: 乐m CoDA ,g! 乐七 CoDA CoDA 书g! !!!!vnp CoDA 二o CoDA w于vot 二op亚 CoDA o og B j CoDA ,m些uy 乱乱gvn! 1lkm争lm CoDA g乐mm yg CoDA ,m乱 之乱gm仍享产 ymCoDA , 3.7 g! ! 2lCoDA ,f乔m

55 七

CoDA ,g仍 f乔mx CoDA 于g m仍 ! gm仍y书 mpg! 3l CoDA ,亚g仍 mmCoDA 5.3 m 5 kenergy-delay product, EDPlgumCoDA 3.7 m 3.5 EDP g! B CoDA ,mgw CoDA ,m CoDA 于gt 介w CoDA gp C/core m乐 C/core : CoDA g! umtf mmvg乐 七x产于m CoDA g!

B 4-1 于m争 vmp 4 m争 0%m90%m95% 98%g 90%m95% 98%m m 0% 98.1%g 4-1kal wgm kEDPlmwy 32KB L1 512KB L2 g 4-1kblx 4-1kalm 4 gkalkblpppv万 gw一w争mw一产g 4-1kcl! 4-1kflwk8m32m64 128l1 些vkPareto-frontierlmp w争mw X% PGE, Y VD kX 争mY lgp

1 16 x 32 mw 16 g 56

4 CoDA

wklklg 与且muvg!

4-1(EDP) vs. vs.

4-1kcl-kfl付 争x于予串串m 且 CoDA ,g m争m gu予mm 串串g且m gxmm争m gu予m u串串g CoDA ,uy

57 七

mymy g! m且m且争 g L2 m 争g L2 4-1kalx 4-1kbl klugu争u mp争g 4-1kel 4-1kfl 于付于m争g! 4-1kcl 4-1kflmkPareto curvelv 些kv于x于乎lmv 争gu gumm享 guwmpg w争m 争tgvm 争gm mCoDA my 亚g L2 w些m 且m L2 之g! 4-1 wm CoDA , g乱m 享于义于g ,mg! 4-1

L2 L1 / C-core EDP Cache Cache pJ/inst vs SW (mm2) vs SW (KB) (KB) (mm2) 512 32 - 1 0 1.0 10.29 51.05 1.0 HP 1

8 512 32 0.5 5 22 0.200 43.20 10.40 0.941 LP 1-2

16 512 16 0.5 9 44 0.206 45.20 9.97 0.857 LP 2-3

32 512 32 0.5 20 88 0.218 50.70 11.50 0.930 LP 5 64 512 32 0.5 40 176 0.270 60.70 13.39 0.892 LP 10

128 512 32 2.0 16 352 0.286 72.70 14.80 0.926 LP 4

! 58

4 CoDA

B PRdc

4-2EDP

4-2 upm于v CoDA 乱gmk lm乔乱m享亚m 乱义yugm CoDA C-core 亚 94% m享p乱g乱g 98%争[73]m乎 50%乐 g乱乔u L2 二 p乱g乔之久享g! 4-2 mvgpn CoDA m享o享m 98%争m亮wgn m 4-2 CoDA ,mw予g CoDA m且 CoDA , m七gpm 128 CoDA ,xgm Android m128 80%于[3]m CoDA 享g!

B CoDA 予g up临七与gp umuw享 C-core m

59 七

CoDA 乱u且o乱m wgum C-core 乱ugmCoDA pu 二mpmCoDA u乐m 乐亚g CoDA C-core 且习m w交gm CoDA SoC m CoDA u乱乐 g! B oj umwm CoDA klg七 CoDA ,m二g mCoDA ,g vn1lo2l乔 o3lg C-coreg二y争knglibcl C-core pm C-coreg vm丢umo m C-coremg业业 丢ug业 乐g

B CoDA ,m C-core C-core 享 C-core g Profile C-core g产v 享 C-core g profile x享wm g仍mw gpm 于o亮m 10% 90%于g 4-3 4-4 yxyw g仍p CoDA o产 128 m v于g

60

4 CoDA

myg 4-1 p g CoDA 16 m 16 g

4-3

4-3 CoDA ,ym gym 于g 4-3 m 乱n1lkm lo2lkmpy丢 uw C-core ulgm 4-3 upu g

4-4

3 亚 9%g 之 4 m之g 16 mx 5%g x 4-3 wm 4-4 CoDA m 亮gxywm ymgvm

61 七

mg 16 m 2 g予x CoDA ,产wg

mM 乱vm CoDA , profile w mywgm, p亚gy g CoDA m 亚 23.4%g乎p乔 g 亚ym,享n vm C-core g,之 C-corem C-core w C-core g[19] C-coregx 乱mpy之gy wwm之 C-core pm pg[19]仍n C-core vm C-core 23%mx C-core 亚 27%g y久且p乱m CoDA ,g

4-5 C-core

mp CoDA ,g C-core u 2 g CoDA ,m 41%亚 15%kxlg 4-5 x 4-3 4-4 m C-core gwkvl亮k于

62

4 CoDA

lg且m 16 C-core 7.4%g亮且mx C-core wm 么 22.1%k 7 lm 16 11.1%g

j CoDA ,p二 个g C-core mCoDA , 些gpm 4-1 128 CoDA EDP 16 mv 16 g

4-6

。m 16 p CoDA , 16m乱乐 gMIT Raw[72] 16 mx uk Raw lg Raw 16 u SPEC 7%g mpmCoDA ,x mp亿g 万享m串串 享mg 4-6 m 仍mvmCoDA , 享mpf LPDDR2 gm wwg 16 MCF m m享 3 ug 2D-mesh g Sd B r v 4 wum

63 七

享gum Intel 3D kFinFETlmp享g ITRS 2013 m亚享pm vn乱ukPDSOIl[85-89]fu kFDSOIl[90-93]fBULK Multi-GatekMGl[94-96]gm MOSFET m亮kn MOSFETlmw;u二g um》[97, 98]f CMOS kDynamic Multi-threshold CMOSl[99]fkDouble-Volagel[100-102]f 久u久[103-108]g m之,书g ,g Cache m乱 享g 乱[109-114]g, pm二m乱 之m Cache 二g久fm p亚 CoDA Cache g享 m [115-118]o从二[119]mw二 g:乱m CoDA g Cache w享乱 Cachem享 Cache ,u乱,g Cache ,vn ,[120]m Cache o,[121-124]o,[125]g p Cache g 乔un1l乔u m m。 [126-129]o2l丽mp[130]g m/亿业 kDVFSl[131-133]f[134, 135]f[136, 137] Intel [46]g CoDA ,m x乐pgm 4 m 2011 SoC 25 m

64

4 CoDA

om仔 920 gp CoDA ,m乐享m g 4-2 vm1l Cache Cache m CoDA , Cache g2l 乔mm乔 g3lpyg Cache Cache , g乔mvg kPrefetchingl[70, 138-140] Cache g 书m Cache 乐书 Cache 二gkPushl m之w享:m m Cache m Cache 书[141, 142]g :m m乔k乔o lm Cache 二 34.6%m乔u gp Cache 74.6%mp Cache 39.6%g u IPC 6.6%g个x 0.2%g pm L0 Cache mm 亚 Cache gmIPC 且m 于m亚g七dTolerating memory latency: L2 cache actively push architectureedAAP and AAPM: improved prefetching structures of the L2 cacheeg 2D-mesh 乔 CoDA ,mL1 L2 mg vmp二享 pg pdem de Intel [46]g 》之m HP m亮 LP m PrimeTime g仍m

65 七

乐mumpg m,,产串串yg GPU pm AMD GPU 产 pug乎[143-145]》今, m GPU CUDA[146]m Brook [64]m )g ,,m EXOCHI[144]mp 1000 wygEXOCHI 。,享p下g 之乱,uoxwm 乱yugmTartan[147] 。,,ug,久m Mishra[54] 万。享m g 予mpm MIT Raw [72]mUT Austin TRIPS [148, 149]人 WaveScalar [150]g ,m CoDA ,,g, 于gGreenDroid[3, 4, 6]七 ,m GreenDroid 二g Hannig[151]p之。,: SoC g CoDA ,个m CoDA 产亚gm CoDA m享下x 于。m CoDA 习g CoDA , mppg x CoDA m[152]p:g ,m:ww亿,m 七:gxwmCoDA m七 ,gw七:m七乔,g m[54]m乱乱 于gvmCoDA , C-core pmw享乱gC-core 之[19]

66

4 CoDA

个gw C-core mwg [5, 13]乱争um 争g 4-1 m CoDA ,乱于乱 m争v CoDA ,g享 CoDA ,vpm 乱g《[153]亮 亮mwg uy产, 二gm CoDA ,乐 g之 CoDA 于m 128 CoDA m 3.7 3.5 EDP g且 CoDA ,m CoDA ,g乐 , CoDA ,m乱f 乔g 仍y些m CoDA pdye二g m乔vm CoDA x 3.8 g 七dExploring Energy Scalability in Coprocessor-Dominated Architectures for Dark Siliconeg 4.4 乱七dTolerating memory latency: L2 cache actively push architectureed AAP and AAPM: improved prefetching structures of the L2 cacheeg

67

5 CoDA x

Equation Section (Next) 5 B hs

CoDA ,fC-core ,fS-core ,um CoDA ,二g CoDA ,w仍 m仍 C-core C-core x g乱np乱 FPGA m 乱 CoDA ASIC gvn p Basejump m CoDA FPGA g FPGA f介f乱g m MaXentric g 丁mBasejump pfFPGA 乔m Linux u介g互 Basejump 仍仍gmBasejump 之 2D-mesh 久m FPGA p仍 ASICm ASIC FPGA p仍 ASIC g CoDA , GreenDroid 产 Basejump , FPGA 仍m仍g仍p 仍仍 C-core g t CoDA gnp p C-coreop S-corem产 ygfIO fg 6GC hsj m享产 之乔mm m享产g产w于m 业亦gu享m Basejump gm介 于m互ygm CoDA ,之m FPGA 个仍 ASIC 享g 2D-mesh 乔 :,乔久 2D-mesh um p 2D-mesh g Basejump 5-1m Basejump g Northbridge x于x于乔mPlug Board 乔

69 七

wg互k l乔 Northbridge umpm 仍g 5-2 m Basejump gutnk FPGA PCB lfkFPGA lukLinux lgt 于g

5-1Basejump

u久nkNorthbridgelp FPGA m FPGA u于xm且于 产于久gx FMC m之 LVDS o于 MURN kMulti-University Research Networkl m 4 mp kBondinglpg xmgup ASIC 仍 FPGA g互 2 PCB gBasejump 乐pm产 Basejump 享乔ugv n1lpmwm IOm PCB g2l之乔 2D-mesh um 2D-mesh 久um FPGA 仍 ASIC g3l产g

70

5 CoDA x

5-2

p FPGA m产之乔 m之 PCIE 乔um FPGA 个w乱 ASIC 久g业 DDR3 fEthernet ffFMC fPCIE LCD g FPGA

71 七

五久g五五k lm五k 2D mesh 乔lg FPGA 个仍 ASIC 久mu产五m 五u产yg五 Plug Boardm乔 gw享乔w 2D mesh um FMC x享久乔m乔 ug up PC mu Basejump 介 k:介于介lfIO 介 g Basejump 介u PCIE 介oIO 2D mesh mwo mp业业g 仍gu Linux g Basejump 乔,mu 2D-mesh fu m于乔乔mgu Basejump m Basejump pm乱 于mpm 2D-mesh g

6GC Basejump 乎f PCB y争 kflow controllmg 于 m于乔 mg 于mg 于mv FIFOm久 FIFO 于g 于v 3 n1l FIFO “o2l 享于o3l FIFO 于gp mtg ;vmBasejump 3 n1l 乎于g2l乎 PCB 于g3l亮乎久g 乎y争 FIFO 于

72

5 CoDA x

m Basejump x 2D mesh 乔m于 乔g乎 PCB my争 IO FIFO 于m Basejump MURN FMC g亮乎 久于m于享m 之之丽my m于xp 于gmpm 于之pg Basejump m 2D-mesh BDIOM 于亮乎g 2D mesh x MIT Raw g g 5-3 m乔kdatam32 lkvalidlm FIFO 于g FIFOm FIFO FIFO p于k之 thanks lg

valid Credit data Tracker thanks

thanks Credit valid Tracker data

5-32D mesh

MURN BDIOM g

/53.1 e MURN I/O pm g MURN I/O mp 之 MURN 享g MURN I/O m 久mvpg MURN I/O 乔 ASIC FPGAm 、些 Die I/O pads 些vmMURN I/O 久ksingle-ended LVDS lg MURN I/O m I/O 二 mp g PCB u予wpm w亿vg MURN I/O 4 m 1 f 4 gvmw

73 七

vg North Bridge m gpmp 8 g 乱乐mkbit alignmentlg FMC LVDS m享g 5-4 MURN I/O ,m 4 MURN I/O mp 1 gm 亿vgm enabled kmurnio assembler cbclpm 久gkmurnio assembler outl久之 80 m 久kmurnio assembler inlg IO 亿乱久亿wmp MURN I/O gpksend fifolmp krecv fifolg p MURN I/O pmp v久n8 f1 f1 Token 1 Command gmToken mCommand 仍乐g仍 MURN I/O pgp 11 m享 22 o4 享 88 久gu丽p k乱久lm Token 丽pk0->1 1->0l FIFO 2 于g MURN I/O 之vn 1lmu MURN 128 m 之g 2l MURN I/O p 128 m 16 w 8 g g 3l MURN I/O g 4lu MURN I/O gx 128 mgm亚亿为丽 2 128 gp于mp 么m丢g

74

5 CoDA x

丢mpkpendinglm之 Basejump gpm 从g

5-4MURN IO

Basejump m之乔w fk 2D-meshlf业 g久ut于gp乔 Pin 乔mp乔pg 久 5-5g u乔wm ou乔p业m m IP gu乔五乔g n1lmw 乔k lg乔久m 5-5 n0 n8g2l pm 丽gmvp g之m享 RTL g

75 七

n4 Your Ring n12 Design 3 n15 Ring Ring n0 Ring n5 n8 n11 MURN FMC Your Plug

Design 1 ASIC FPGA FPGA FPGA Board Ring Ring Ring

n6 n10 n14

Your Ring Design 2 n7 Northbridge n9

Chip FPGA on Motherboard daughter board

ring network switch (snooping) ring network switch (normal)

ring network node

5-5 久

5-6g 80 g 4 k source IDlmgv 4 kdestination IDl g 1 m 1 且 okm 乱lgv 7 m 1 7 m 0 7 g 64 g 32 2D mesh mg 5-6 tg 0 mv 7 4 m FIFO 于g 2D mesh 丽 mp 4 2D mesh mpm pm之 1 kcreditl g32 64 丽书 1 m u 2 32 2D mesh pg 于mp 2D mesh pm gm p 32 gug7 3 w 111 m 64 mt 2D mesh gt 111 m 32 kword 0lm word 1 3 2D mesh g

76

5 CoDA x

5-6

2D mesh m 2D mesh m享 2D mesh 产p 2D mesh g 2D mesh 于g 6 wm 5-6 v乱g fg RNDISABLE_CMD 乔mw mwgRNENABLE_CMD 乔m xgRNDOWN_CMD 乔gRNUP_CMD 乔gRNRESET_ENABLE_CMD 乔g RNRESET_DISABLE_CMD 乔g kpower offl亮gpn1l RNUP_CMD o2l RNRESET_ENABLE_CMD o3l RNRESET_DISABLE_CMD o4l RNENABLE_CMD g mg vumm 互gug

77 七

5-72D mesh x丽 BDIOM

5-7 2D mesh x丽 BDIOM gp BDIOM 4 2D mesh g 2D mesh m recore_bdiom_out_aggregated 32 64 丽orecore_ioport_out 2 n1l举kround-robinlp 2D mesh g2l mmo w举vpgrecore_bdiom_encoder g 之krecore_bdiom_decoderlmp 2D mesh krecore_ioport_inlm互g产 2D mesh p 2D mesh 久 recore_bdiom_in_deaggregated g

CoDA ,久m FPGA 仍 个享m》 FPGA 仍gm CoDA , 2D-mesh um 2D-meshm p 2D-mesh g 5-8 久 2D-mesh ,gup ox乔up 2D-mesh x BDIOMmBDIOM w 2D-mesh 丽m 2D-mesh pku BDIOM 于lg 之 BDIOM 久u乔 2D-mesh g m享m n5 x n11 fn6 n10 g 5-8 mu 2x2 2D-mesh mu 2x2 2D-mesh m之 BDIOM 乔久up

78

5 CoDA x

4x2 2D-meshm产 8 :g

n4 n12

Ring Ring n5 n0 n8 n11

CPU CPU CPU CPU Ring Ring n6 n10

CPU CPU CPU CPU n7 n9

daughter board Motherboard

ring network switch (snooping) ring network switch (normal)

Ring BDIOM

5-8 久 2D-mesh

乔 FPGA 久仍 28nm ASIC g互 PCB FPGA ASIC 仍 vp ASICg

tN 2.2G Basejump 业nkRS232 lf fDDR3 fDDR2 PCIE g m中kXilinxlIPmwgDDR3 DDR2 IP :m 2D mesh DDR 丽m oFMC wgPCI m IP :m乐 I/O 于kPIOlm u介g乐 PIO pu Basejump m PCI Plug g Basejump PCI Plug u之 PCIE 乔 m之gvtn1lf u享业g2lum fFMC g3lpmm ug乐p g

79 七

5-9PCI Plug

5-9 PCI Plug mPCI Plug u Basejump ug Basejump umPCI Plug x 2D mesh 乔m之p乔 IO 于m业 IP :g乐 PIO g uum乱产m享介g PCI Plug 2D mesh m PIO gm PIO p 2 互kFIFOlm互um互 ug PIO p互pm u之g u于m PIO p互p mu且m 5-10g互 一互乐于gm介 互“介乱m乔 m PIO mm wguf gpum PIO 互一 互享gm介互 m“介mp介 互m介互m介互 介mpg mm介举 mwg

80

5 CoDA x

PCIE PIO

Recv Status Send Status R Channel 0 Channel R Channel 0 Number Recv Status Send Status R Channel 1 R Channel 1

Recv Status Send Status R Channel 2 R Channel 2

5-10 PCI Plug ,

5-1PIO

Channel Regs add_i[5:4] add_i[3:0] Description receive_status 2’b00 i 一互于 receive_fifo 2’b01 i 互 transmit_fifo 2’b10 i 互 transmit_status 2’b11 i 一互 Special Regs Address Address (Linux) (PIO) channel_number 0x7fc 0x1ff m 0xffff_ffff m 500 host_reset 0x7f8 0x1fe m mp test_reg 0x7f4 0x1fd m status_reg 0x7f0 0x1fc m i :

互互m PIO 乐p 32 gkchannel numberl一 PCI Plug p m介“互gu khost resetlpg 0xffff_ffffmp 500 klgktest regl

81 七

m介pmg kstatus reglu享乱m DDR fFMC fg介之 g PIO ppm 5-1g PIO 16 gLinux u IO 于 g u介:介于介乱g:介 f IO 于:于x g于介:于。于mp 互、m。g 业、g

Uonb u Basejump mm 串 PCB mp且 于g

Host Reset

Host Reset Host MURN Host MURN NB Host Reset FMC_NB FMC_MD sys_bootgen Reset chip Reset 1 wait the reset 1 Wait the signal start_boot set to 1. Your wait the 2 calibration begin 2 Send ring network Design start_boot boot sequence. signal, then set the 3 calibration done, start_boot set the Northbridge start_boot generate the 3 set the northbridge ready signal to 1 sys_enabled start_boot signal sys_enabled to 1 ready to 1 Chip Northbridge FPGA PCI Plug

Daughter Board Motherboard FPGA Host Status Reset reg Packet

Host Machine

5-11Basejump 之

5-11 之g之uv从n 1lu PCI Plugm PCI Plug PIO gu举 PIO m g 2lsys_bootgen m FMC kstart_bootlg FMC

82

5 CoDA x

g 3l FMC_MD mm FMC 之g 4l FMC_NB m FMC_NB MURN_NBg 5lMURN_NB mkchiplm MURN 之g MURN 之mp FMC_NBg 6lFMC_NB FMC 之g FMC sys_bootgen g 7lmsys_bootgen m享 g sys_bootgen g 8lu举mgu g B hs p,w享仍 m乐享 RTL m FPGA 仍m ,mpg FPGA u个 些mp仍产uy CoDA ,m FPGA p仍产 20 y CoDA , GreenDroidg y之产wy仍g up Basejump ,仍m GreenDroid 产 Basejump 仍g m RTL ff从o o丁仍之mf g

hs GreenDroid 仍mpvx之 m RTL f 二g 5-2 5-3gvmm Verilog SystemVerilog g且mGreenDroid 40181 RTL mw C-core kC-core w享ml业 IP :klg RTL mm

83 七

,mm 业于gm乐享 143 TCL 343 Makefileg TCL RTL FPGA vv享 mMakefile 之享g 2172 Python 乱业m ChipScope 临 FPGA m Python 0/1 下g乐 Python m乐 Python C-core 产 GreenDroid g2818 C/C++uu 介m GreenDroid m乱 MIT Raw g IO Master 介 C++m PCIfm之 g 5-2FPGA 仍 RTL

Main processor - Control 8998 28538 321422 22.96% FPU 3104 11349 92806 9.13% Static Network 3175 9433 108366 7.59% Main processor - Datapath 2590 7731 94774 6.22% Data Cache 1947 5954 69429 4.79% I/O Ports 2039 5701 57262 4.59% Dynamic Network 1693 5163 64131 4.15% Test Network 703 1998 17463 1.61% Integer Divider 457 1101 10700 0.89%

Datapath Module Wrappers 3526 11936 134185 9.61% GreenDroid Top Level Glue 234 697 8885 0.56% FPGA Top Level Glue 1359 3652 56476 2.94%

Virtual Tile Array 528 1893 24168 1.52% FMC adapter 2450 8016 89069 6.45% BDIOM 1411 4653 61471 3.75% Ring Network 372 1132 12567 0.91% MURN 3113 7918 95208 6.37% PCI Plug 1000 2882 40802 2.32% Memory Plug 1482 4522 63110 3.64% Total 40181 124269 1422294 100%

m之w ,wmwfw Cache fw 2D-mesh 乔g五m FPGA u之 1x1f2x1f 2x2 4x2 五g五 2D mesh 之乔m 享klg

84

5 CoDA x

之gm PCI Plug moFMC g乱 SystemVerilog m Verilog g m Verilog p,m 5-12gw SystemVerilog kinterfacel乐w FPGA ASIC 乱g 5-3 仍享

TCL 143 472 4199 Makefile 343 927 11202 Python 2172 6178 62884 Total 2496 7172 73362 Linux :介 712 1703 17622 Linux 于介 311 1055 8911 IO Master 介 1795 5868 50505 Total 2818 8626 77038

之 SystemVerilogm FPGA Synopsys Synplifyg Xilinx g Makefile 从业p享m业w RTL FPGA vf从m乐 下gmww kwlmw五m ww C-corem仍g

5-12,

f些m 于mw于,于g 4~6 互g 乐mu m享于g

-DADB2- hs pmp Virtex 5 XUPV5 m仍p C-core GreenDroidg Virtex 5 个wm义

85 七

Virtex 6 ML605 m产p Virtex-6 XC6VLX240T FPGA m 240K 久 14976KB RAM g义 ML605 m p久p FPGA ug uffPCIE g FMC 乔mg

5-13GreenDroid

5-13 GreenDroid gup Sun mu CentOS 5.8 g之 PCIE u ML605 Basejump guu 5-2 ugFMC 之 FMC HPC 乔m FMC LVDS 400MHzg FPGA IO 个gu mxgu 5-2 um MURN IO 于乔pw 乔gmw MURN w亿g FMC 乔享u FPGA 仅m享v仅g 享m乱u g 互 PCB mp PCB u Xilinx FPGA m 86

5 CoDA x

pvmpv享gp PCB p FPGA klpg PCB u之 FMC mw享g PCB g

仍享gmp op g 5-4

alu 3 host 12 static network 1 integer 11 cache 10 interrupts 7 control transfer 6 io mux 21 Dcache miss 5 memory counter 19 dynamic network 14 pins 1 event counters 20 instructions 26 floating point 10 streaming dram 1 Total 167

5-4 m MIT Rawk MIPSl mp C gpm alu op两m Cache g 5-5 g仍m ffuufC-core g SPEC 2000fSPEC 2006fJPEG Group 2000 g 5-5

autocorrelation djpeg radix bitmnp fbital rgbcmy btree fft rgbhpg bzip2 Hello world rgbyiq cjpeg host viterbi conven mcf vpr

之vm仍之gm, pmp产mp之p uugm 产g三 乐专mgm 仍m

87 七

g之mpm pm g之p之m 丢m丢g之mp 之g

5-14FPGA 仍 cjpeg

仍 RTL fFPGA 仍 ASIC g ,书 RTL ksimulationl Basejump m RTL f IP 于乔 MURN IOg MURN IO w 于 PCB u书wg RTL m RTL mp SPEC 享g 享 FPGA 仍g FPGA u 业m享m FPGA u Chipscope 业 业g业 RTL FPGA u pm二g 5-14 FPGA u cjpeg m PASSED DONE 之m PASSED mDONE g FPGA 个些m p C-core v FPGA um C-core u gp MCF t C-coremvp bzip2

88

5 CoDA x

享 C-coregp 2D-mesh 久 2 FPGA um p产享m仍 28nm GreenDroid g 万 Virtex 6 FPGA 4 :m个些 Block RAMg Cachek32KB I-cache, 32KB D-cachel 4 : RAW m 87% Block RAMg ASIC mw之乱m 乱乐 RTL g Makefile m 业w make 业wm make snakeboot 业 RTL make greendroid 业 FPGA 仍m RTL 乐争 之 Makefile g 之m gu Checkout p 下mgm RTL m o FPGAmug g互二m mg B 仍,pgm p Global Foundry 28nm GreenDoridm 4 产 20 C-coreop之 MOSIS kTSMCl250nm m p 1 om fMURN I/O g二些m 250nm 2 CoDAkS-CoDAlmp MIT Raw p C-corek autocorrelationlmp S-coregMIT Raw 2D-mesh m S-core ug仍wg pv个、o S-CoDA 个享o g

个 个m个kfl gm享个,, g之f享个m GF 28nm 之m 7-track

89 七

12-track g 个u乔f久争g 5-6 S-CoDA 乔f久争g S-CoDAm TSMC 250nm 5 g CoDA ,m享m乔个k lm 28nm 11 乔g nmg 5 m kM1mM3 M5lkM2mM4lgw mup乔p享享w m享于kviasl乔g t于人gp pmgm M1m串乔 M1gkSRAM Register Filel M1 M4 k乱享lm u M5 gM5 M4 m mkpower gridlg乐 g 5-6S-CoDA 个、个

Die 6 mm x 10 mm 5 、 356 争个 kDiel 6 mm x 10 mm 久k I/O l 5.46mm x 9.46mm 1379 nand2 1778 久争knand2l 2.5M 个 5 kM5, 乱 M3, 乱 M1l 2+ k乱 M4, 乱 M2l 1+ 个 、 356 Die Pad 680 I/O 96 : 2.5V IO Pads 3.3V

;m久g kutilizationl仍 0.7mg乐;

90

5 CoDA x

mgpm m,:m予g mm 5-15 予gfm m争g 、mS-CoDA mx仍 250nm p、gp 356 BGA 、mkwire bondl、 Die 乔g了 I/O Padm p Pad 享 40 kuml于g之 ICC uv p 4800umk、 Pad pl I/Om uvp 120 I/O Padmuvp 240 I/O Padg p 8800um I/O Padmp 220 I/O Padmp 440 I/O Padg 680 I/O Padg、m享、 I/O Pad bonding p Pin ug g

4B 5-7S-CoDA

Cell Cell Cell Cell ADDF - CLKBUF () MXI2 ,)- OAI221 ( ADDH . CLKINV .(- NAND2 (.)( OAI222 /. AND2 /)) DFF ,/) NAND2B . OAI2BB1 -( AND3 - DFFNS , NAND3 (,, OAI2BB2 ,,)( AND4 )-- DFFR ,- NAND3B ,.) OAI31 -( AOI21 (-. DFFRHQ (//( NAND4 ) OAI32 , AOI211 . DFFS /() NAND4B ( OAI33 - AOI22 ). DFFSHQ ,) NOR2 .).. OR2 ,() AOI221 ) DFFTR - NOR2B ( OR3 AOI222 (-, EDFF .)) NOR3 .( OR4 //- AOI2BB1 (,/ EDFFTR // NOR3B / TLAT ). AOI2BB2 . INV (( NOR4 )/ TLATN ,( AOI31 (( JKFF NOR4B , TLATR , AOI32 - JKFFS ( OAI21 // XNOR2 )- AOI33 / MX2 ,. OAI211 ). XOR2 -/ BUF () MX4 ( OAI22 , Total )-(

m g 5-7 S-CoDA g之w介 gp 37 sg 5-8 亮久 gm

91 七

, MURN I/O 乱 FIFOkmw SRAMlm gBuffers 乱m gm 1.86%g 乱 S-core 产五gxwm klatchlm MURN gkmuxesl乱u k2D meshm Ring Networklg 5-8亮久

争 flip-flops kD, JKl 67245 18.10% buffers 63103 16.99% adder kADDFmADDHl 6915 1.86% latch 3134 0.84% inverters 2215 0.60% muxes 1707 0.46%

5-9 m久争g 争 NAND2 g S-CoDA 久 争 230 s争gt S-Core pum GreenDroid :g BDIOM 互m g乱乔g 5-9S-CoDA 争

kum2l 争 BY]V )/,),./ ((./)- AMY[O ().((,(- //- ,, 6[OOX3[YSN ASXQVO2Y[O .,)(/-.)) /// (. LNSYW )-)/,). .),-. . (3WOR /,/ .)- )- C8 -)- (- ( SXQO]Y[U ((().-.() (.- , ]RO[ (), -(-) )

S-CoDA 享 I/O Pad vt乱n I/O Padf: I/O Pad I/O I/O Padg I/O Pad vnk2 lfk1lfMURN I/Ok22x4m88l JTAG(5 )m 96 g : I/O Pad 享g;万mS-CoDA 10Wm 2.5V v享 4Ag;且m p 25mAm享享 160 mup享 320 PadgI/O Pad

92

5 CoDA x

pm且享 13 I/O mup 26 g Pad 享 Padkcornerlgum S-CoDA 享 254 m S-CoDA g

5-15: IP

Synopsys DCm RTL 丽争k.v

93 七

m.ddc lg争w VCS m Formality 仍g之p之m 仍w仍 RTL Verilog 争pm乐仍 Verilog 争 DDC 争pg

5-16 :

94

5 CoDA x

之仍mg ARM IPm Artisan 享m IP nSRAM kRegister Filelg Milkway 丽 ICCkIC Compilerl享g ICC m: 5-15g

5-17 klkl

u 2 mu乱 S-coremv乱p GreenDroid g 2 m S-core gu乱 32 32bit x 128 SRAM S-core m SRAM 于 12 SRAM m享k lg 8 128 x 66 BDIOM FIFOg SRAM 2D mesh gv乱 SRAM GreenDroid g 5-16 5-15 u M1m M1 g 5-17 M4 M5 m乱g I/O 享二n I/O 乐久gI/O p久 I/O 享m I/O mv I/Og久 久 I/O 享m久g S-CoDA

95 七

久mm乐 I/O g I/O pad m bond 于g I/O Pad m 5-18gp I/O(inline bonding)m 且久gm Pad 于 p于m Bond 于于gp了 (stagger bonding)m之予 bond m 享w享 Pad 于gS-CoDA I/O g I/O 乐m乐 I/O g mDie uv I/O Pads 、 Pin m 2 Pads 乔p Pin ug S-CoDA 了 I/O mv I/O Padm 于mg Pad 96 Padm 4 Padm: I/O Pad 350 m I/O Pad 150 g 5-19 亚gp亚 g 9.28mV 亚gm 100MHzmmg

General pad Long bond Short bond

Inline bonding Stagger bonding

5-18 I/O 了 I/O 96

5 CoDA x

5-19亚kIR Dropl 互享于产 二m互 Basejump g之 CoDA GreenDroid x Basejump 乔m, CoDA FPGA 仍g p仍 CoDA , GreenDroid 久m 2D-mesh 久 2 FPGA um g之仍mf 七,gm》 TSMC 250nm CoDA ASIC g m乐云pg nf严g mm 享m业g下 ff产g些 mop 互m互 于gu互g 之 FPGA 互 28nm m CoDA ,mgm CoDA ,m且,w

97 七

g Basejump FPGA 乱mBasejump FMCfPCB 、互gg 乐x仍m S-CoDA g

98

6

6

(Metal-Oxide-Semiconductor Field-Effect Transistor, MOSFET)两mgm g些vm二 mgpm,:, m乎u g(AP)f丽 g且,享g ,之gy pg乎产串串 ym且产ypg 乱ym互y C-core gy乱mw,产 串串ygm, CoDA ,m,y 之g CoDA ,g CoDA ,f于f二 CoDA m七 CoDA ,fff CoDA 二mvn k1l CoDA m且 CoDA g m乱m乱 乱yugm ,书yg仍 22nm v 7mm2 ,书y 90%g m CoDA ,g k2l CoDA 于享m CoDA ,m CoDA ,g,wmp wympy ,g万p CoDA ,fo ,mg k3l CoDA ,w Cache ff v二gvmx, CoDA 5.3 5 kenergy-delay product,

99 七

EDPlou CoDA m 3.7 3.5 EDP g且 CoDA g m CoDA m CoDA g k4l CoDA g mg m介 CoDA 产产wm书 ym亚g CoDA ,产 QsCore g仍 QsCore m 41% 2 ym亮 11.1%~22.1%g k5l FPGA vp CoDA m FPGA 个w二m 2D-mesh ug 乔mp 2D-mesh g ASIC FPGA FPGA 于乔g 之m Virtex 6 FPGA CoDA , g I CoDA ,m ,g CoDA ,n k1l C-core ,mp亚 C-core ,g享 C-core gm vf亮个f》 ()g k2lp亚 CoDA ,,gp 亚mgm CoDA Cache m享p Cache CoDA g pg乔mm 享亚乔mg C-core m C-core ,久 C-core m 亚 C-core g k3l CoDA ,万乎 1 s Python , mwg CoDA ,m m互

100

6

m于g k4lu个m CoDA ,产 C-corem CoDA g享,g k5l,m, ,g。um之, C-coremgm乐享 ,五,万g k6lyyg ,wmg

101

i [1]G E Moore. Cramming More Components Onto Integrated Circuits[J]. Proceedings Of The Ieee, 1998,86(1):82-85. [2]R H Dennard, F H Gaensslen, V L Rideout, et al. Design of ion-implanted MOSFET's with very small physical dimensions[J]. Solid-State Circuits, IEEE Journal of, 1974,9(5):256-268. [3]N Goulding-Hotta, J Sampson, Z Qiaoshi, et al. GreenDroid: An architecture for the Dark Silicon Age[C]. 2012:100-105. [4]N Goulding-Hotta, J Sampson, G Venkatesh, et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future[J]. Micro, IEEE, 2011,31(2):86-95. [5]G Venkatesh, J Sampson, N Goulding, et al. Conservation cores: reducing the energy of mature computations[C]. ACM, 2010:205-218. [6]N Goulding-Hotta, J Sampson, G Venkatesh, et al. GreenDroid: A mobile application processor for a future of dark silicon[C]. 2010. [7]M B Taylor. A landscape of the new dark silicon design regime[C]. 2014:1. [8]M B Taylor. A Landscape of the New Dark Silicon Design Regime[J]. Micro, IEEE, 2013,33(5):8-19. [9]M B Taylor. A landscape of the new dark silicon design regime[C]. 2013:1. [10]M B Taylor. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse[C]. 2012:1131-1136. [11]H Esmaeilzadeh, E Blem, R St. Amant, et al. Dark silicon and the end of multicore scaling[C]. 2011:365-376. [12]N Hardavellas, M Ferdman, B Falsafi, et al. Toward Dark Silicon in Servers[J]. Micro, IEEE, 2011,31(4):6-15. [13]J Sampson, G Venkatesh, N Goulding-Hotta, et al. Efficient complex operators for irregular codes[C]. 2011:491-502. [14] N H E Weste, D Harris. CMOS VLSI Design: a Circuits and Systems Perspective[M].Addison-Wesley, 2005. [15]Semiconductor Industries Association. International technology roadmap for semiconductors. http://www.itrs.net/Links/2012ITRS/Home2012.htm. [16]Apple Online Shop, Mac Pro processor selection.

103 七

http://store.apple.com/us_smb_78313/buy-mac/mac-pro. [17]A Frumusanu, J Ho. Huawei Honor 6 Review. http://www.anandtech.com/show/8425/huawei-honor-6-review/4. [18]M Horowitz. 1.1 Computing's energy problem (and what we can do about it)[C]. 2014:10-14. [19]G Venkatesh, J Sampson, N Goulding-Hotta, et al. QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores[C]. ACM, 2011:163-174. [20]F Jia. The Arsenal Tool Chain for the GreenDroid Mobile Application Processor[D]. San Diego: University of California at San Diego, 2013. [21]V Govindaraju, H Chen-Han, T Nowatzki, et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing[J]. Micro, IEEE, 2012,32(5):38-51. [22]V Govindaraju, H Chen-Han, K Sankaralingam. Dynamically Specialized Datapaths for energy efficient computing[C]. 2011:503-514. [23]L Wu, M A Kim. Acceleration Targets: A Study of Popular Benchmark Suits[C]. 2012. [24]W Liang, K Skadron. Implications of the Power Wall: Dim Cores and Reconfigurable Logic[J]. Micro, IEEE, 2013,33(5):40-48. [25]E G Cota, P Mantovani, M Petracca, et al. Accelerator Memory Reuse in the Dark Silicon Era[J]. Computer Architecture Letters, 2014,13(1):9-12. [26]J Cong, X Bingjun. Optimization of interconnects between accelerators and shared memories in dark silicon[C]. 2013:630-637. [27]A Raghavan, L Yixin, A Chandawalla, et al. Designing for Responsiveness with Computational Sprinting[J]. Micro, IEEE, 2013,33(3):8-15. [28]A Raghavan, L Emurian, S Lei, et al. Utilizing Dark Silicon to Save Energy with Computational Sprinting[J]. Micro, IEEE, 2013,33(5):20-28. [29]A Raghavan, L Emurian, L Shao, et al. Computational sprinting on a hardware/software testbed[C]. ACM, 2013:155-166. [30]A Raghavan, L Yixin, A Chandawalla, et al. Computational sprinting[C]. 2012:1-12. [31]N Pinckney, R G Dreslinski, K Sewell, et al. Limits of Parallelism and Boosting in Dim Silicon[J]. Micro, IEEE, 2013,33(5):30-37. [32]J M Allred, S Roy, K Chakraborty. Dark Silicon Aware Multicore Systems: Employing Design Automation With Architectural Insight[J]. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2014,22(5):1192-1196.

104

[33]J Allred, S Roy, K Chakraborty. Designing for dark silicon: a methodological perspective on energy efficient systems[C]. ACM, 2012:255-260. [34]K Swaminathan, E Kultursay, V Saripalli, et al. Steep-Slope Devices: From Dark to Dim Silicon[J]. Micro, IEEE, 2013,33(5):50-59. [35]A Sampson, J Nelson, K Strauss, et al. Approximate Storage in Solid-State Memories[J]. ACM Trans. Comput. Syst., 2014,32(3):1-23. [36]A Sampson, J Nelson, K Strauss, et al. Approximate storage in solid-state memories[C]. ACM, 2013:25-36. [37]H Esmaeilzadeh, A Sampson, L Ceze, et al. Neural Acceleration for General-Purpose Approximate Programs[C]. 2012:449-460. [38]H Esmaeilzadeh, A Sampson, L Ceze, et al. Architecture support for disciplined approximate programming[C]. ACM, 2012:301-312. [39]H Jie, M Orshansky. Approximate computing: An emerging paradigm for energy- efficient design[C]. 2013:1-6. [40]Z Jia, X Yuan, S Guangyu. NoC-sprinting: Interconnect for fine-grained sprinting in the dark silicon era[C]. 2014:1-6. [41]H Bokhari, H Javaid, M Shafique, et al. darkNoC: Designing energy-efficient network- on-chip with multi-Vt cells for dark silicon[C]. 2014:1-6. [42]M Shafique, S Garg, J Henkel, et al. The EDA challenges in the dark silicon era[C]. 2014:1-6. [43]M Haghbayan, A Rahmani, P Liljeberg, et al. Online testing of many-core systems in the Dark Silicon era[C]. 2014:141-146. [44] T S Muthukaruppan, M Pricopi, V Venkataramani, et al. Hierarchical power management for asymmetric multi-core in dark silicon era[C]. 2013:1-9. [45]N Clark, A Hormati, S Mahlke. VEAL: Virtualized Execution Accelerator for Loops[C]. 2008:389-400. [46]Y Kikuchi, M Takahashi, T Maeda, et al. A 40 nm 222 mW H.264 Full-HD Decoding, 25 Power Domains, 14-Core Application Processor With x512b Stacked DRAM[J]. Solid-State Circuits, IEEE Journal of, 2011,46(1):32-41. [47]J L Gustafson. Reevaluating Amdahl's law[J]. Commun. ACM, 1988,31(5):532-533. [48]京,IJ [49]G Balakrishnan, T Reps, N Kidd, et al. Model checking x86 executables with codesurfer /x86 and WPDS++[C]. Springer-Verlag, 2005:158-163.

105 七

[50]P Anderson, M Zarins. The CodeSurfer software understanding platform[C]. 2005:147-148. [51]T Teitelbaum. CodeSurfer[J]. SIGSOFT Softw. Eng. Notes, 2000,25(1):99. [52]R E Kidd. The OpenIMPACT Whole Program Optimization Framework[D]. Urbana- Champaign: University of Illinois at Urbana-Champaign, 2007. [53]M B Taylor, J Kim, J Miller, et al. The Raw microprocessor: a computational fabric for software circuits and general-purpose programs[J]. Micro, IEEE, 2002,22(2):25-35. [54]M Vuletic, P Ienne, C Claus, et al. Multithreaded virtual-memory-enabled reconfigurable hardware accelerators[C]. 2006:197-204. [55]J H Ahn, W J Dally, B Khailany, et al. Evaluating the Imagine stream architecture[C]. 2004:14-25. [56]U J Kapasi, W J Dally, S Rixner, et al. The Imagine Stream Processor[C]. 2002:282-288. [57]J D Owens, S Rixner, U J Kapasi, et al. Media processing applications on the Imagine stream processor[C]. 2002:295-302. [58]B Khailany, W J Dally, U J Kapasi, et al. Imagine: media processing with streams[J]. Micro, IEEE, 2001,21(2):35-46. [59] xIJ(/ [60] ,I3J(. [61] ,I3J(, [62] 0A0:I3J( [63]W Thies, M Karczmarek, S Amarasinghe. StreamIt: A Language for Streaming Applications[M].Springer Berlin Heidelberg, 2002. [64]I Buck, T Foley, D Horn, et al. Brook for GPUs: stream computing on graphics hardware[C]. ACM, 2004:777-786. [65]P Mattson. A Programming System for the Imagine Media Processor[D]. Palo Alto: Stanford University, 2001. [66] ,:,I3J ((

[67] 串I3J(- [68]J E Stine, N Naresh. Compressed symmetric for accurate function approximation of reciprocals[C]. 2006:4-802. [69] —乱I3J

106

kl() [70]D Joseph. Prefetching Using Markov Predictors[C]. 1997:252-263. [71]M J Schulte, J E Stine. Symmetric bipartite tables for accurate function approximation[C]. 1997:175-183. [72]M B Taylor, J Psota, A Saraf, et al. Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams[C]. 2004:2-13. [73]R Jotwani, S Sundaram, S Kosonocky, et al. An x86-64 core implemented in 32nm SOI CMOS[C]. 2010:106-107. [74]M B Henry, R Lyerly, L Nazhandali, et al. MEMS-Based Power Gating for Highly Scalable Periodic and Event-Driven Processing[C]. 2011:286-291. [75]M B Henry, L Nazhandali. From transistors to MEMS: Throughput-aware power gating in CMOS circuits[C]. 2010:130-135. [76]H F Dadgour, K Banerjee. Design and Analysis of Hybrid NEMS-CMOS Circuits for Ultra Low-Power Applications[C]. 2007:306-311. [77]Standard Performance Evaluation Corporation. SPEC CPU 2006 benchmark specifications. http://www.spec.org. [78]Standard Performance Evaluation Corporation. SPEC CPU 2000 benchmark specifications. http://www.spec.org. [79]Eembc benchmark suite. http://www.eembc.org. [80]C Lattner, V Adve. LLVM: a compilation framework for lifelong program analysis & transformation[C]. 2004:75-86. [81]S Thoziyoor, N Muralimanohar, J H Ahn, et al. CACTI 5.1. Tech report. http://www.hpl.hp.com/techreports/2008/HPL-2008-20.html. [82]M Bohr, K Mistry. Intel's Revolutionary 22nm Transistor Technology. http://download.intel.com/newsroom/kits/22nm/pdfs/22nm-Details_Presentation.pdf. [83]B C Lee, E Ipek, O Mutlu, et al. Architecting phase change memory as a scalable dram alternative[C]. ACM, 2009:2-13. [84]IMOD Technology Overview. IMOD technology overview. http://www.qualcomm.com/common/documents/whitepapers/QMT_Technology_Overv iew_12-07.pdf. [85]M L Alles, H L Hughes, D R Ball, et al. Total-Ionizing-Dose Response of Narrow, Long Channel 45nm PDSOI Transistors[J]. Nuclear Science, IEEE Transactions on, 2014,61(6):2945-2950.

107 七

[86]R V Joshi, R Kanj, S Saroop. Novel 4 GHz Interleaved SRAM Cells With Asymmetrical Precharge in 45 nm PDSOI Technology[J]. Semiconductor Manufacturing, IEEE Transactions on, 2014,27(2):278-286. [87]P Chao, H Zhiyuan, Z Zhengxuan, et al. A New Method for Extracting the Radiation Induced Trapped Charge Density Along the STI Sidewall in the PDSOI NMOSFETs[J]. Nuclear Science, IEEE Transactions on, 2013,60(6):4697-4704. [88]S Sirohi, S Khandelwal. Modeling low frequency noise in PDSOI MOSFETs for analog and RF applications[C]. 2010:1940-1942. [89]D Guo, A Bryant, W Xinlin, et al. Gate-dielectric permitivity and metal-gate work-

function tradeoff in Lmet=25nm PDSOI device characteristics[J]. Electron Device Letters, IEEE, 2006,27(6):505-507. [90]F Abouzeid, A Bienfait, K C Akyel, et al. Scalable 0.35 V to 1.2 V SRAM Bitcell Design From 65 nm CMOS to 28 nm FDSOI[J]. Solid-State Circuits, IEEE Journal of, 2014,49(7):1499-1505. [91]X Chuan, K Banerjee. Physical Modeling of the Capacitance and Capacitive Coupling Noise of Through-Oxide Vias in FDSOI-Based Ultra-High Density 3-D ICs[J]. Electron Devices, IEEE Transactions on, 2013,60(1):123-131. [92]J P Noel, O Thomas, M Jaud, et al. Multi-UTBB FDSOI Device Architectures for Low-Power CMOS Circuit[J]. Electron Devices, IEEE Transactions on, 2011,58(8):2473-2482. [93]J Mazurier, O Weber, F Andrieu, et al. On the Variability in Planar FDSOI Technology: From MOSFETs to SRAM Cells[J]. Electron Devices, IEEE Transactions on, 2011,58(8):2326-2336. [94]S H Rasouli, K Endo, J F Chen, et al. Grain-Orientation Induced Quantum Confinement Variation in FinFETs and Multi-Gate Ultra-Thin Body CMOS Devices and Implications for Digital Design[J]. Electron Devices, IEEE Transactions on, 2011,58(8):2282-2292. [95]Y Choi, H Jin-Woo, L Hyunjin. Reliability Issues in Multi-Gate FinFETs[C]. 2006:1101-1104. [96]G Knoblinger, C Pacha, F Kuttner, et al. Multi-Gate MOSFET Design[C]. 2006:65-68. [97]M C Johnson, D Somasekhar, C Lih-Yih, et al. Leakage control with efficient use of transistor stacks in single threshold CMOS[J]. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2002,10(1):1-5.

108

[98]M C Johnson, D Somasekhar, K Roy. Leakage control with efficient use of transistor stacks in single threshold CMOS[C]. 1999:442-445. [99]K Nii, H Makino, Y Tujihashi, et al. A low power SRAM using auto-backgate-controlled MT-CMOS[C]. 1998:293-298. [100]F Hamzaoglu, Y Yibin, A Keshavarzi, et al. Analysis of dual-V/sub T/ SRAM cells with full-swing single-ended bit line sensing for on-chip cache[J]. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2002,10(2):91-95. [101]N Azizi, A Moshovos, F N Najm. Low-leakage asymmetric-cell SRAM[C]. 2002:48-51.

[102]F Hamzaoglu, Y Yibin, A Keshavarzi, et al. Dual-VT SRAM cells with full-swing single-ended bit line sensing for high-performance on-chip cache in 0.13 μm technology generation[C]. 2000:15-19. [103]T Bjerregaard, J Sparsoe. Scheduling discipline for latency and bandwidth guarantees in asynchronous network-on-chip[C]. 2005:34-43. [104]T Bjerregaard, S Mahadevan, R G Olsen, et al. An OCP Compliant Network Adapter for GALS-based SoC Design Using the MANGO Network-on-Chip[C]. 2005:171-174. [105]T Bjerregaard, J Sparso. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip[C]. 2005:1226-1231. [106]F Feliciian, S B Furber. An asynchronous on-chip network router with quality-of- service (QoS) support[C]. 2004:274-277. [107]T Felicijan, J Bainbridge, S Furber. An asynchronous low latency arbiter for Quality of Service (QoS) applications[C]. 2003:123-126. [108]J Bainbridge, S Furber. Chain: a delay-insensitive chip area interconnect[J]. Micro, IEEE, 2002,22(5):16-23. [109] J Lee, S Kim. Filter Data Cache: An Energy-Efficient Small L0 Data Cache Architecture Driven by Miss Cost Reduction: Computers, IEEE Transactions on[Z]. 2014: PP, 1. [110]A Bardizbanyan, M Sj, Lander, et al. Towards a performance- and energy-efficient data filter cache[C]. ACM, 2013:21-28. [111]Y Chia-Lin, L Chien-hao. HotSpot cache: joint temporal and spatial locality exploitation for I-cache energy reduction[C]. 2004:114-119. [112]W Tang, R Gupta, A Nicolau. Design of a predictive filter cache for energy savings in high performance processor architectures[C]. 2001:68-73. [113]M Huang, J Renau, Y Seung-Moon, et al. L1 data cache decomposition for energy

109 七

efficiency[C]. 2001:10-15. [114]J Kin, M Gupta, W H Mangione-Smith. The filter cache: an energy efficient memory structure[C]. 1997:184-193. [115]Z Chuanjun, F Vahid, Y Jun, et al. A way-halting cache for low-energy high-performance systems[C]. 2004:126-131. [116]Z Chuanjun, F Vahid, Y Jun, et al. A Way-Halting Cache for Low-Energy High- Performance Systems[J]. Computer Architecture Letters, 2003,2(1):5. [117]M D Powell, A Agarwal, T N Vijaykumar, et al. Reducing set-associative cache energy via way-prediction and selective direct-mapping[C]. 2001:54-65. [118]S Kaxiras, H Zhigang, M Martonosi. Cache decay: exploiting generational behavior to reduce cache leakage power[C]. 2001:240-251. [119]X Cuiping, Z Ge, H Shouqing. Fast Way-Prediction Instruction Cache for Energy Efficiency and High Performance[C]. 2009:235-238. [120]T Adegbija, A Gordon-Ross, A Munir. Dynamic phase-based tuning for embedded systems using phase distance mapping[C]. 2012:284-290. [121]Z Chuanjun. A Low Power Highly Associative Cache for Embedded Systems[C]. 2006:31-36. [122]Z Chuanjun, F Vahid, W Najjar. A highly configurable cache architecture for embedded systems[C]. 2003:136-146. [123]A Malik, B Moyer, D Cermak. A low power unified cache architecture providing power and performance flexibility[C]. 2000:241-243. [124] D H Albonesi. Selective cache ways: on-demand cache resource allocation[C]. 1999:248-259. [125]P Gi-Ho, L Kil-Whan, L Jang-Soo, et al. A low-power cache system for embedded processors[C]. 2000:316-319. [126]H Jingcao, R Marculescu. Energy- and performance-aware mapping for regular NoC architectures[J]. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2005,24(4):551-562. [127]H Jingcao, R Marculescu. Application-specific buffer space allocation for networks-on- chip router design[C]. 2004:354-361. [128]S Banerjee, N Dutt. FIFO power optimization for on-chip networks[C]. ACM, 2004:187-191. [129]C Xuning, P Li-Shiuan. Leakage power modeling and optimization in interconnection

110

networks[C]. 2003:90-95. [130]L Kangmin, L Se-Joong, Y Hoi-Jun. SILENT: serialized low energy transmission coding for on-chip interconnection networks[C]. 2004:448-451. [131]C Yu-Liang, L Shaoshan, C Eui-Young, et al. An Energy and Performance Efficient DVFS Scheme for Irregular Parallel Divide-and-Conquer Algorithms on the Intel SCC[J]. Computer Architecture Letters, 2014,13(1):13-16. [132]A B Kahng, K Seokhyeong, R Kumar, et al. Enhancing the Efficiency of Energy- Constrained DVFS Designs[J]. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2013,21(10):1769-1782. [133]S Eyerman, L Eeckhout. A Counter Architecture for Online DVFS Profitability Estimation[J]. Computers, IEEE Transactions on, 2010,59(11):1576-1583. [134]T Yoneda, K Masuda, H Fujiwara. Power-Constrained Test Scheduling for Multi-Clock Domain SoCs[C]. 2006:1-6. [135]J Schmid, J Knablein. Advanced synchronous scan test methodology for multi clock domain ASICs[C]. 1999:106-113. [136]C Sitik, B Taskin. Multi-voltage domain clock mesh design[C]. 2012:201-206. [137]L Souef, C Eychenne, E Alie. Architecture for Testing Multi-Voltage Domain SOC[C]. 2008:1-10. [138]R Cooksey, S Jourdan, D Grunwald. A stateless, content-directed data prefetching mechanism[C]. ACM, 2002:279-290. [139]D Joseph, D Grunwald. Prefetching using Markov predictors[J]. Computers, IEEE Transactions on, 1999,48(2):121-133. [140]N P Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers[C]. 1990:364-373. [141]S Xian-He, S Byna, C Yong. Improving Data Access Performance with Server Push Architecture[C]. 2007:1-6. [142]W Hassanein, Jos, Fortes, et al. Data forwarding through in-memory precomputation threads[C]. ACM, 2004:207-216. [143]L Chi-Keung, H Sunpyo, K Hyesoon. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping[C]. 2009:45-55. [144]P H Wang, J D Collins, G N Chinya, et al. EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system[C]. ACM, 2007:156-166. [145]J Owens, D Luebke, N Govindaraju, et al. A survey of general-purpose computation on

111 七

graphics hardware[C]. 2005:21-51. [146]J Nickolls, I Buck, M Garland, et al. Scalable parallel programming with CUDA[C]. ACM, 2008:1-14. [147]M Mishra, T J Callahan, T Chelcea, et al. Tartan: evaluating spatial computation for whole program execution[C]. ACM, 2006:163-174. [148]K Sankaralingam, R Nagarajan, L Haiming, et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture[C]. 2003:422-433. [149]K Sankaralingam, R Nagarajan, L Haiming, et al. Exploiting ILP, TLP, and DLP with the polymorphous trips architecture[J]. Micro, IEEE, 2003,23(6):46-51. [150]S Swanson, K Michelson, A Schwerin, et al. WaveScalar[C]. 2003:291-302. [151]F Hannig, S Roloff, G Snelting, et al. Resource-aware programming and simulation of MPSoC architectures through extension of X10[C]. ACM, 2011:48-55. [152]J Allred, S Roy, K Chakraborty. Designing for dark silicon: a methodological perspective on energy efficient systems[C]. ACM, 2012:255-260. [153]S Mingoo, S Hanson, L Yu-Shiang, et al. The Phoenix Processor: A 30pW platform for sensor applications[C]. 2008:188-189.

112

七mf之fj g于mfm m串oumw两 g东ff gm一g gmf 七g m乱七,g g 东fm g仍mff七 世不g C2A3 SMROVBbVY[ g[YPBbVY[ mw串仍pg[YPBbVY[ m 下ff,f,享 m仍g仍 tm亦g C2A3 A]OOXAXYX g亦m [YPAXYX g[YPAXYX mp g 9MUAWZYX g C2A30[OXV%6[OOX3[YSN%3[U ASVSMYX xm9MU ,g m9MU p 2MY[Of4Cf 2Y30 二g 予mfm 二gfff g什ffff什 fgffmf mx七po亲ff m于mg 6[OOX3[YSN 3[6XORDOXU]ORf3[]RX6YVNSXQ7Y]]f 9YO0[SMMRSY DSU[W1R]] f [YPSVOf 6[OOX3[YSN ,丁七g 1A6

113 七

3[HRXQfSDOQ 3[6XSXE 丁g DA 3[9YOV2YL[Xf3[0N[SX2VPSOVNf3[[6[ZZfB[OY[f ANR[X 560 fSXa 介 SXa :g 产 3[SMROVEOSf3[AXN[W m g 5OS9S仍 ( m 下mg 1OTWZ 之m,g f仔,丁g m乔 ( gffff][Saf ffAROXQGXQ7XQff于 m之g产 mmmm g—fffff(ffff g mm书 g j

114

于七

gUjoia

p七n [1] Qiaoshi ZhengmNathan Goulding-Hotta, Scott Ricketts, Steven Swanson, Michael Taylor, Jack Sampson. Exploring Energy Scalability in Coprocessor-Dominated Architectures for Dark Silicon[J]. ACM Transactions on Embedded Computing Systems(TECS)1, July 2014, 13(4), 130:1-130:24 (SCI Index, WOS N.O: 000341390100013) (EI Index, IDS N.O: 20143418087903) [2] Qiaoshi Zheng, Deyuan Gao, Xiaoya Fan, Meng Zhang, Tao Yao and Limin Han. An Evaluation of the Many-core Longtium SP Computer System[C]. The 2013 IEEE International Conference on Signal Processing, Communications and Computing (IEEE ICSPCC 2013), Aug. 2013, Kunming, China, IEEE Computer Society: 1694:1-1694:4 (EI Index, IDS N.O: 20140417228975) [3] Qiaoshi Zheng, Deyuan Gao, Liwen Shi, Jie Chen. Tolerating memory latency: L2 cache actively push architecture[C]. The 2009 International Conference on Advanced Computer Theory and Engineering (ICACTE 2009), Cairo, Egypt, ASME: 509-516. (EI Index, IDS N.O: 20104813425974) [4] Qiaoshi Zheng, Deyuan Gao, Jie Chen, Chao Wang. AAP and AAPM: improved prefetching structures of the L2 cache[C]. The 2009 International Conference on Advanced Computer Theory and Engineering (ICACTE 2009), Cairo, Egypt, ASME: 569-576. (EI Index, IDS N.O: 20104813425982)

亮p七n [5] Nathan Goulding-Hotta, Jack Sampson, Qiaoshi Zheng, Vikram Bhatt, Joe Auricchio, Steven Swanson and Michael Taylor. Greendroid: An architecture for the dark silicon age[C]. 17th Asia and South Pacific Design Automation Conference (ASP-DAC), February 2012, Sydney, Australian, IEEE Computer Society: 100-105. (invited paper)(EI Index, IDS N.O: 20121714967941) [6] Vikram Bhatt, Nathan Goulding-Hotta, Qiaoshi Zheng, Jack Sampson, Steven Swanson and Michael B. Taylor. SiChrome: A Silicon Browser[C]. 1st Dark Silicon Workshop (DaSi), in conjunction with ISCA, 2012, Portland, USA.

1 CCF x B http://www.ccf.org.cn/sites/ccf/biaodan.jsp?contentId=2567814757412m Domain-Specific Multicore Computing y 6 七七g

115 七

[7] mm(mm. TLB [J]. , 2011, 38(5): 301-305. [8] Liwei Shi, Xiaoya Fan, Qiaoshi Zheng, Jie Chen. Exploiting the Character of Memory Accesses to Achieve Lower Power Consumption of the Data TLB[C]. The 2010 International Conference on Computer Engineering and Technology, April 2010, Cheng Du, China, IEEE Computer Society: V2-271-275. (EI Index, IDS N.O: 20104313316750) [9] Chao Wang, Zhanhuai Li, Xiaoyu Li, Qiaoshi Zheng. An error handling method for primary-backup replication protocol[C]. The 2009 International Conference on Advanced Computer Theory and Engineering (ICACTE 2009), Cairo, Egypt, ASME: 569-576. (EI Index, IDS N.O: 20104813425963) [10] Jie Chen, Deyuan Gao, Qiaoshi Zheng. A Research on an Optimized Adaptive Dynamic Power Management[C]. The 2009 IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), August, 2009, Beijing, China, IEEE Computer Society: 52-55. (EI Index, IDS N.O: 20094612454531) [11] mm. p[J].xm2010, 46(14): 69-74.

yn [1] mmmmmmmmm . . ym 101697146A k l. [2] mmmm亲mmmm. x业. m 202267954U kl.

n “Altera ”r中 中mt() 2010.08m 中(p)mp 2010.06m dekpl 2010.07m予 de() 2009.08m予 : [1] d产ek乱(DARPA)个v, (Center for Future Architectures Research, C-FAR)mn2384.006l

116

于七

[2] “Prototyping Platform to Enable Power-Centric Multicore Research” ( mn1059333) [3] dEnergy-Efficient Parallel Architectures for Computer Visione k mn0846152l [4] dArsenal: Extending Moore’s Law through the Design, Synthesis and Use of Massively Heterogeneous Systemsekmn0811794l [5] d,ekdte mn2009AA01Z110l [6] duekmn60736012l [7] du:,ekmn60773223l [8] dComputer Organization and DesignccThe Hardware/Software InterfacemFourth Editionm David A. Patterson, John L. Hennessyek!l下

117

mn于七 g乱争,七 g七g七乱乱 mf 七gm七p

g

七g

七n n

cccccccccccccccccccccccccccccccccccc

ccc

东仅mn七m vgm m七w产 之mw之g

产mg

七x个wmpg

七n