Execution Drafting: Energy Efficiency Through Computation Deduplication

Michael McKeown, Jonathan Balkind, David Wentzlaff
Princeton University, Princeton, NJ, USA
{mmckeown,jbalkind,wentzlaf}@princeton.edu

Abstract—Computation is increasingly moving to the data center. Thus, the energy used by CPUs in the data center is gaining importance. The centralization of computation in the data center has also led to much commonality between the applications running there. For example, there are many instances of similar or identical versions of the Apache web server running in a large data center. Many of these applications, such as bulk image resizing or video transcoding, favor increasing throughput over single-stream performance. In this work, we propose Execution Drafting, an architectural technique for executing identical instructions from different programs or threads on the same multithreaded core, such that they flow down the pipe consecutively, or draft. Drafting reduces switching and removes the need to fetch and decode drafted instructions, thereby saving energy. Drafting can also reduce the energy of the execution and commit stages of a pipeline when drafted instructions have similar operands, such as when loading constants. We demonstrate Execution Drafting saving energy when executing the same application with different data, as well as different programs operating on different data, as is the case for different versions of the same program. We evaluate hardware techniques to identify when to draft and analyze the hardware overheads of Execution Drafting implemented in an OpenSPARC T1 core. We show that Execution Drafting can result in substantial performance per energy gains (up to 20%) in a data center without decreasing throughput or dramatically increasing latency.

Keywords—Energy Efficient Computing; Multithreading; Data Center Computing; Cloud Computing; Microarchitecture; Computation Deduplication; Energy Efficiency

I. INTRODUCTION

The computing industry has seen a huge shift of computation into large data centers and the Cloud [1], [2]. Reducing data centers' energy use can not only save the data center operator money, but also reduce the environmental impact of computing and enable additional computation within the same power budget. Data center computing is typically different than desktop, mobile, or small-scale server computing in that applications are often optimized for throughput over absolute performance, and the application mix in the data center can be more homogeneous. Optimizing processor energy is important because the processor consumes 20-30% of server power [3].

There are many examples of application commonality in a data center. Large websites typically scale out many exact replicas of a server configuration across a data center, utilizing only a modest number of applications, which might include a load balancer, front-end webserver, business logic, memory caching layer, and a database. In more data-intensive applications such as MapReduce [4] or Dryad [5], it is common for all of the Mappers to be executing the same map program and all of the Reducers to be using the same reduction program. There is also significant batch/bulk computation in the form of resizing of images or compressing of video. These bulk workloads have significant commonality and are not overly latency sensitive. Even in heterogeneous settings such as Infrastructure as a Service (IaaS) Clouds [6], [7] or server consolidation workloads, it is common for the platform (OS) and applications to be the same or similar, thereby exhibiting a large amount of commonality.

Much work has gone into reducing energy in computing and in particular in the data center. Techniques such as DVFS [8], [9], reducing energy use of idle machines [3], making computing systems energy proportional [10], [11], exploiting Dark Silicon [12]–[14], and energy-aware processor design [12], [15]–[20] all have had significant impact. In this work, we investigate a new way to reduce energy by deduplicating computation. In particular, we propose a technique called Execution Drafting (ExecD) which identifies and exploits commonality between applications executing on a multithreaded processor. Execution Drafting works by finding two or more applications which are executing the same or similar instruction sequences, but potentially have different data. A hardware synchronizer delays execution of one or more processes or threads to allow the execution points in the candidate programs to align. Once aligned, instructions flow down the processor pipeline such that the first instruction trailblazes while subsequent instructions draft behind it, taking advantage of significant energy savings, as shown in Figure 1a. These energy savings can be achieved because the instruction, opcode, and register identifiers for drafted instructions are known to be the same as the trailblazer's, thereby saving energy for instruction fetch, instruction decode, and pipeline control. In addition to shutting down large portions of the front-end of a pipelined processor, Execution Drafting aligns programs, which can potentially save energy in the register file, execution units, and commit portions of the pipeline when instructions that execute share the same operand values, such as when loading constants. If the control flow of the drafting programs diverges, an Execution Drafting processor simply reverts to efficient time multiplexing of execution resources, as is done in a multithreaded or time-shared processor.
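The front-end savings described above can be illustrated with a toy software model: when two threads' instruction streams are aligned, only the leader is fetched and decoded, and the follower reuses the decoded instruction. This is an illustrative sketch of the counting argument only, not the hardware mechanism; the instruction words are arbitrary examples.

```python
# Toy model of drafting's front-end savings: when two threads'
# streams are aligned, the follower's fetch/decode is skipped and
# the leader's decoded instruction is reused ("double-stepped").
def front_end_ops(leader, follower):
    """Count fetch/decode operations with and without drafting.

    leader/follower are equal-length lists of instruction words.
    """
    baseline = 2 * len(leader)       # every instruction fetched/decoded
    drafted = 0
    for a, b in zip(leader, follower):
        drafted += 1                 # leader is always fetched/decoded
        if a != b:                   # diverged: follower fetched too
            drafted += 1
    return baseline, drafted

# Two threads running identical code (same machine words):
stream = [0x9DE3BFA0, 0x82102000, 0x81C7E008]
base, drafted = front_end_ops(stream, list(stream))
print(base, drafted)  # 6 3: half the front-end work in this model
```

In this model, fully-drafted execution does half the front-end work; each divergent instruction claws one fetch/decode back.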
[Figure 1: High-level Execution Drafting Concept. (a) Execution Drafting Threads Synchronized; (b) Execution Drafting Diverged. Each panel depicts Program A and Program B instructions flowing through the pipeline stages: Fetch, Thread Select, Decode, Execute, Memory, Writeback.]

Unlike many previous energy saving and parallelization techniques, Execution Drafting is able to exploit commonality across programs even if they do not contain the same control flow, data, or even code. For instance, Execution Drafting is capable of finding and exploiting commonality in order to save energy across applications within different Virtual Machines (VMs), across different versions of the same program, and even across completely different programs when they execute similar code, as in a common shared library.

The energy efficiency gains of Execution Drafting come at the expense of potential additional program latency. Execution Drafting will delay a program in order to synchronize execution. While this introduces latency, throughput is not drastically affected, as the non-waiting thread/program will continue to execute; throughput can, however, be slightly positively or negatively affected due to memory system access reordering. We leverage an insight from previous research [21] by investigating how the introduction of a small and boundable amount of latency can lead to significantly improved energy efficiency in the data center. The addition of latency, without degradation of throughput, can be appropriate for batch applications or even in interactive jobs where the additional latency remains within the Service Level Agreement (SLA), as in many web serving applications where the network latency dominates.

Our work is closely related to previous work in exploiting identical instructions in a processor [22]–[26] (MMT, MIS, DRR, DIR, Thread Fusion), with some key differences. Unlike previous work, Execution Drafting exploits identical instructions among the same applications (or threads within the same application), but is broader and can also draft different programs. This enables Execution Drafting to synchronize code at different locations in the same program or completely different programs. Synchronization mechanisms invented in previous work rely solely on program counter (PC) matching, are not scalable, and are limited by the sizes of their structures. Execution Drafting includes synchronization mechanisms that overcome this. Unlike previous work, Execution Drafting does not track and match operand values to reuse instruction results. We are able to achieve the energy savings previous works achieve with operand matching without expensive RAM lookups, operand comparison, and load address checks. In contrast to Execution Drafting, which optimizes for energy efficiency, previous work trades extra energy to achieve performance gains. Last, we focus our study on small, in-order, low-power, throughput-oriented cores, while previous work uses large out-of-order (OoO) cores. This is because we target data centers, and our approach echoes the recent interest in utilizing ARM cores and Microservers in the data center.

The contributions of this work include the formulation of computation deduplication. We demonstrate this idea through the example of Execution Drafting. We present a detailed architecture of how to implement Execution Drafting in the OpenSPARC T1 multithreaded processor. We evaluate the energy benefit of Execution Drafting, different synchronization schemes, and the overheads. We show that Execution Drafting can achieve up to 20% performance per energy gain on real world applications with a low latency increase and area overhead.

II. COMPUTATION DEDUPLICATION

Much work has been done in the area of data deduplication for storage, back-up, replication, and network transfer systems [27]–[42]. In the process of storing or transmitting data, unique chunks are identified and stored or transmitted accordingly. As subsequent chunks enter the system, if they are found to duplicate a previous chunk, the data in the chunk is not replicated; instead, a pointer to the previous identical chunk is stored or transmitted. The key insight that makes this technique beneficial is that recurring chunks occur frequently. For example, with nightly back-ups, only a small portion of the data may change in any given day.

While data deduplication is used widely in storage and network systems, the analog in computation lacks attention. The idea behind computation deduplication is to identify identical or similar pieces of computation, and exploit their similarity to reduce the work required to perform it. While it is not a novel idea, we attempt to identify it as a general concept common to existing energy saving or performance enhancing techniques. In doing so, we hope to stimulate further research in this area, as cloud environments show great potential for computation deduplication.

In an Infrastructure as a Service (IaaS) setting, such as Amazon's Elastic Compute Cloud (EC2) [6], there are many VMs hosted within the same data center. However, many customers utilize similar (or identical) versions of Linux or other commonly used software such as Apache [43] or MySQL [44]. In MapReduce [4] jobs, each Map and Reduce task, by definition, executes the same code. Many social networking websites store multiple resolutions of user images to prevent dynamic resizing when serving webpages [45]. Thus, when users upload images, they must be resized and stored. The resultant large amount of identical code is ripe for computation deduplication.
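The chunk-and-pointer scheme described above can be sketched in a few lines. This is a minimal sketch using fixed-size chunks and SHA-256 digests; production systems typically use content-defined chunking, but the store-once/point-later structure is the same.

```python
# Sketch of data deduplication: fixed-size chunks are hashed, and a
# repeated chunk is recorded as a pointer (its digest) rather than
# being stored a second time.
import hashlib

def dedup(data: bytes, chunk_size: int = 4):
    store = {}    # digest -> chunk bytes, each unique chunk stored once
    recipe = []   # ordered digests from which the data can be rebuilt
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk     # unique chunk: keep the bytes
        recipe.append(digest)         # duplicate: pointer only
    return store, recipe

def restore(store, recipe):
    return b"".join(store[d] for d in recipe)

store, recipe = dedup(b"AAAABBBBAAAACCCC")
print(len(recipe), len(store))  # 4 3: four chunks referenced, three stored
assert restore(store, recipe) == b"AAAABBBBAAAACCCC"
```

The benefit grows with the recurrence rate: in the nightly back-up example, most chunks hit the store and contribute only a digest-sized pointer.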
Commonality also exists between threads within a program. For example, many web servers launch a new thread for each client. Each thread performs the exact same computation, serving the same webpage, but to different clients. This type of parallelism also occurs in database servers.

Computation deduplication can enable performance enhancements, energy savings, or both. Generally, these benefits can be achieved by doing less work to execute identical copies of a computation. In this work, we focus on one instance of computation deduplication, Execution Drafting.

III. EXECUTION DRAFTING ARCHITECTURE

Execution Drafting (ExecD) is an example of computation deduplication which targets maximizing performance per energy (similar to reducing energy-delay product (EDP), but accounting for throughput). More specifically, it aims to reduce energy consumption while minimizing the performance impact. Note that this metric allows us to trade performance degradation for larger energy savings, resulting in a net gain in performance per energy. Performance can be either single-threaded latency or overall throughput. We optimize for both interpretations but emphasize throughput, since many data center applications favor throughput over single-threaded performance [46]. ExecD does not considerably impact throughput, but does have the potential to increase single-threaded latency.

A. Processor Architecture

The high-level concept of ExecD is illustrated in Figure 1. In the figure, instructions from different threads are represented by different shades (dark, light) and different instruction types are represented by different types of bicycles (road, mountain, cruiser). ExecD attempts to synchronize similar programs or threads on a multithreaded core such that duplicate instructions are issued consecutively down the pipe, as shown in Figure 1a.

Duplicate instructions fall into one of two categories: full duplicates or partial duplicates. Full duplicates have identical machine code. Partial duplicates have the same opcode, but the full machine code does not match. When duplicate instructions are executed consecutively, transistor switching is reduced. Decoding of a duplicate instruction is identical to the lead instruction; thus it only needs to be performed once. All control logic, i.e., ALU opcodes, multiplexer select signals, etc., will not switch. Only the data may differ between the execution of identical instructions. Thus, everything other than data buses remains static. In fact, it is possible that some data may be shared between threads: immediate constants may be identical, threads may share memory, virtual memory addresses may be the same, and some data values (zero) are common.

This concept of reducing switching, thereby reducing energy, by executing duplicate instructions consecutively is called drafting, as indicated in Figure 1a. This term comes from the analog in racing, where one racer reduces drag by following closely behind another racer. We call the first instruction or thread in a set of drafted instructions the leader instruction or thread, and the instructions that follow behind, the follower instructions or threads (see Figure 1a).

Reducing switching is not the only energy saving component of ExecD. Knowing that identical threads are synchronized means that only one stream of instructions needs to be fetched. Thus, the fetch stage can be disabled for all follower threads. This is depicted in Figure 1a, where instructions are only fetched for the leader thread but are executed for the leader thread and each follower thread. We call this double-stepping or repeating the instruction.

Data dependent control flow may cause threads to diverge. These divergences require the fetch stage to be re-enabled for all diverged threads. This case is illustrated in Figure 1b, showing the cycle after Figure 1a. In the figure, mountain bikes represent branch instructions. The two successfully drafted branch instructions take different branch paths. ExecD must detect this case and notify the thread select and fetch stages. The thread select stage stops double-stepping instructions and the fetch stage begins fetching for the follower threads again. Note that interrupts and other control flow altering events also cause divergences.

Data dependent control flow thus limits the energy savings ExecD is able to achieve. Consequently, the key to ExecD's energy savings is the amount of time the processor spends drafting. Thus, robust synchronization mechanisms are integral to ExecD's success.
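The full/partial duplicate distinction above can be sketched as a simple classifier over instruction words. The opcode field position here (the top 6 bits of a 32-bit word) is a simplifying assumption for illustration, not the actual SPARC encoding.

```python
# Classifying a pair of instruction words per the full/partial
# duplicate definition. Assumes a hypothetical 32-bit encoding with
# the opcode in the top 6 bits (not the real SPARC field layout).
OPCODE_MASK = 0x3F << 26

def classify(leader: int, follower: int) -> str:
    if leader == follower:
        return "full"       # identical machine code: maximum savings
    if (leader ^ follower) & OPCODE_MASK == 0:
        return "partial"    # same opcode, operand fields differ
    return "none"           # not duplicates: cannot draft this pair

print(classify(0x8A102005, 0x8A102005))  # full
print(classify(0x8A102005, 0x8A102007))  # partial (immediate differs)
```

Full duplicates keep everything but the data buses static; partial duplicates still avoid switching in the opcode-driven control logic.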
B. Thread Synchronization Techniques

In this section we describe three thread synchronization methods (TSMs) that can be implemented in hardware: stall, random, and hybrid. The methods are simple heuristics; however, results show they perform well.

1) Stall Thread Synchronization Method: The stall thread synchronization method (STSM) relies on thread PCs for synchronization. When STSM encounters a control flow divergence, it compares threads' PCs. It selects the thread with the lowest PC and stalls the others in an attempt to accelerate the lagging thread to synchronize with the others. This method works well on common code structures such as if-else and loop structures. Figure 2 shows the execution of an if/else structure, where the two threads take different paths. The thread that takes the path earlier in program order (T0 in Figure 2a) will be selected to execute and exit the if-else structure first, jumping to a larger PC than the other thread. STSM will now select the thread that took the other path (T1 in Figure 2a), as it now has the minimum PC, until it exits the if-else structure. The threads will synchronize at the end of the if-else structure. Similar logic explains why STSM also performs well on loops. If two threads execute different numbers of iterations of a loop, say 5 iterations and 10 iterations, the threads should stay synchronized for the first 5 iterations, after which one thread will exit the loop. This thread will now have a PC larger than the other thread, and STSM will select the other thread until it finishes its iterations. The threads will synchronize at the end of the loop structure.

[Figure 2: If/Else Structure Execution Drafting Example. (a) STSM; (b) RTSM; (c) Basic Block Diagram. The tables show per-cycle PCs for threads T0 and T1, marking drafting and diverged cycles.]

There are a couple of caveats to STSM. Consider the case in Figure 3 where one thread takes a function call and the other does not. Assume the function code has a larger PC than any other threads' PC, i.e., it is in a shared library or is placed at higher memory addresses, as shown in Figure 3d. STSM will stall the thread in the function call (T1 in Figure 3) until the other thread reaches a PC larger than it. Since the function code is placed at a large PC, this may not occur until the other threads terminate, depicted in Figure 3a. In order to handle this case, we set a threshold value for STSM: if the minimum PC has been chosen the threshold number of times without synchronizing, select the maximum PC once and then go back to selecting the minimum PC. This guarantees forward progress.

[Figure 3: Function Call Execution Drafting Example. (a) STSM w/o Threshold; (b) STSM w/ Threshold; (c) HTSM; (d) Basic Block Diagram.]

A second caveat involves locks: if one thread holds a lock and is executing the critical section, the other will continue to spin. STSM will always select the thread that is spinning, since its PC is lower, resulting in livelock since the lock-holding thread is stalled. To avoid this, we implement a ratioed thread selection policy.

STSM breaks down when drafting different code. The same PC in two threads executing different code may not point to the same instruction. Thus, STSM may not be able to draft or disable fetch despite having matched up the PCs, as we verify that the instructions and the physical addresses match before having them draft. This reliance on PCs is a shortcoming of STSM that other thread synchronization mechanisms attempt to overcome.

2) Random Thread Synchronization Method: The random thread synchronization method (RTSM) is a simple mechanism. When RTSM encounters a divergence in the control flow of threads, it chooses a thread to execute at random until threads are synchronized again. This effectively varies thread execution points by sliding the threads by each other until a convergence point is found. The intuition behind this is to model random changes to code, for example between two different versions of the same program. RTSM attempts to fix the shortcoming that STSM has with drafting different programs or the same program with different versions, and works well in practice. However, it does not necessarily perform optimally on common code structures such as if/else, evident from Figure 2b, and thus suffers when drafting the same program with different inputs.

3) Hybrid Thread Synchronization Method: The hybrid thread synchronization method (HTSM) attempts to inherit the positive properties from STSM and RTSM. HTSM is identical to STSM with two key differences. Recall the function call example in Figure 3 which motivated the addition of the threshold to STSM. In the case of being outside the threshold, HTSM chooses a random thread to issue instead of alternating the issue of instructions from each thread, as shown in Figure 3c. This is part of the effort to hybridize STSM and RTSM. The other key difference is that HTSM adds some flexibility to the PC-based synchronization in STSM.
This This is is part part of of the the effort effort to to hybridize hybridize STSM STSM and and atwhere higherorder one to handle memory thread this takes addresses,case, a function we set a as threshold call shown and thevalue in other Figure for STSM. does3d. Thus,additionforfor the the caseof caseing the where where the threshold issue the the PCs PCs to of ofSTSM. of instructions the the threads threads In the matchcase from of (STSM’s each being thread, as shown in until theIf the other difference thread between reaches PCs a is PC outside larger this than threshold, it. Since STSM the RTSM. untilnot.If the the Assume difference other the thread functionbetween reaches PCs code is a hasoutside PC a larger larger this threshold, than PC than it. Since STSM any theoutsidegoal),goal),RTSM. but butthe the the threshold, instructions instructions HTSM and chooses instruction a random physical thread addresses to functionSTSMresorts code will to alternatingstall is placed the thread atthe a issue large in of the PC, instructions function this may from call not each (T1occur thread. in until Figure 3) Figure 3c. This is part of the effort to hybridize STSM and functionotherresorts threads’ code to alternating is PC, placed i.e. theit at issue is a large a ofshared instructions PC, library this may from or sitsnot each higheroccur thread. untilissuedodo not not insteadThe match. match.The other other of This This alternating key key indicates indicates difference difference the the issue threads is is that ofthat are instructionsHTSM HTSM executing adds adds differ-from some some flexibil- flexibil- untilThe the threads other will thread alternate reaches until the a thread PC largerthat took than the function it. Since the RTSM. thethein otherThe other memory, threads threads threads as will shown terminate, terminate,alternate in Figure until depicted depicted the 3d. thread Thus, in in that Figure Figure STSM took the3a will3a. function. 
Thus, stall Thus, in ineachententity code.thread, code.ity to to the HTSM HTSM the as PC shown PCdetects baseddetects based in Figure synchronization this this synchronization case 3c. and This tries hybridizes in to in STSM.model STSM. STSM small HTSM HTSM looks looks functioncall returns, code inis whichplaced case at a the large PCs PC, are within this may the threshold not occur until orderorderthecall to thread to handlereturns, handle inthisthe in this which function case, case, case we we call theset set (T1PCs a a threshold threshold in are Figure within value value 3) the until threshold for for STSM. the STSM.andchangeschanges RTSM.for the to to code, code, caseThe such such where other as as additionskey additions the differencePCs or of removals. the is threads that This HTSM matchis mod- adds (STSM’s some flexibil- again and STSM works as described previously, as shown in for the case where the PCs of the threads match (STSM’s Ifthe theotheragain differenceother thread and threads STSM reaches between works terminate, a PC PCs as larger described is outside depicted than previously,it. this Since in threshold, Figure the as function shown3a STSM. in Thus,eledeledThe in using using otherity an an key offsetto offset difference the that that PC is is based added added is thatto synchronization HTSM all PC adds comparisons some in flex- and STSM. HTSM looks If theFigure difference3b. Choosing between the PCs value is for outside this threshold this threshold, is ad hoc, STSM but goal),goal), but but the the instructions instructions and and instruction instruction physical physical addresses addresses ordercodeFigure is to placed3b handle. Choosing at this a large thecase, value PC, we this for set this may a threshold threshold not occur is ad valueuntil hoc, thefor but STSM.ibilitydifferences.differences. to the WhenWhen PC based HTSM HTSM synchronization encounters the in case STSM. 
just described, HTSM resortsresortsempirically to to alternating alternating 100 works the the issue issue well of and of instructions instructions is used in this from from work. each each thread. thread. dodo not notfor match. match. the caseThis This indicates where indicates the the the PCs threads threads of the are are executingthreads executing match differ- differ- (STSM’s otherempirically threads terminate, 100 works depicted well and in is Figure used in 3a. this Thus, work. in order itit attempts attempts to to find find the the correct correct offset offset to to synchronize synchronize to. It does TheTheIf threadsthe threads difference will will alternate alternate between until until PCs the the threadis thread outside that that thistook took threshold, the the function function STSMlooksentent for code. thecode.goal), PCs HTSM HTSMbut of the the detects threads detects instructions this matching this case case andand (STSM’s and instruction tries tries goal), to to model model physical small small addresses to handleTheThe other other this case,caveat caveat we to to setSTSM STSM a threshold involves involves value a a livelock livelock for STSM. case. case. Con- Con- If butthisthis the by by instructions alternating alternating theand the execution execution instruction of of physical code code from addresses each thread. not callcallresorts returns, returns,sider to two alternating in in drafting which which threads case casethe issuethe the that PCs PCs are of instructions tryingare are within within to grab thethe from a spin threshold threshold each lock thread.However,changesdo instead to not code, ofmatch. just such executing This as additions indicates each thread or the removals. 
once, threads HTSM This are executingis mod- differ- thesider difference two drafting between threads PCs isthat outside are trying this threshold,to grab a spin STSM lock matching.However,changes This instead toindicates code, of just suchthe executing threads as additions each are executingthread or removals.once, different HTSM This is mod- againThethat and threads protects STSM will a works criticalalternate as section. described until theOne previously, thread thread thatwill acquireastook shown the the function in executeseled using each thread an offset at exponentially that is added increasing to all PC rates. comparisons It ex- and againresortsthat and protects to STSM alternating a critical works the section.as issue described of One instructions thread previously, will from acquire as showneach the incode.executes HTSMeled eachent using detects threadcode. an this offset at HTSM exponentiallyand that tries detects is to added model increasing this changesto all case rates. PC to and comparisons code, It ex- tries to model and small Figurelock3b. and Choosing enter the the critical value section, for this and threshold the other will is ad continue hoc, but ecutes thread 0 for 1 instruction, thread 1 for 2 instructions, Figurecallthread.lock returns,3b and The. Choosing enter threads in the which critical will the alternatevalue case section, thefor until andthis PCs the the threshold are other thread within will thatis continue ad the took hoc, threshold butsuchecutesdifferences. asdifferences. thread additionschanges 0 for When or When 1 to removals. instruction, code, HTSM HTSM such This encounters thread encounters as is additions modeled1 for the 2 the instructions, case using case or removals.just anjust described, described, This is mod- to spin. STSM will always select the thread that is spinning, thread 0 for 4 instructions, thread 1 for 8 instructions, etc. empiricallyempiricallyagaintheto function spin. 
and 100STSM STSM100 call works works returns, will works always well well in which asandand select described is caseis used the used the thread in PCsin previously, this this that are work. work. within is spinning, as the shownoffsetthread initit that attempts attempts 0 for iseled added 4 instructions, tousing to find to find all an the the PC offset correct thread correct comparisons that 1 offset offset for is 8 added toand instructions, to synchronize synchronizedifferences. to all etc. PC to. to. comparisons It It does does and since its PC is lower, resulting in livelock since the lock hold- This effectively slides the threads past each other, looking for Thethresholdsince other its again PC caveat is and lower, to STSM STSM resulting works involves in as livelock described a livelock since previously, the lockcase. hold- as Con-WhenThisthis effectively HTSM by alternating encounters slides the thethis threads execution case, past it each attempts of other, code to looking findfrom the for each thread. FigureTheing other thread3b. Choosing caveat is stalled. to STSMthe In order value involves to for avoid this this, a thresholdlivelock we implement case. is ad Con- a hoc,the but correctthis bydifferences. offset alternating that models When the the execution HTSM change to encounters of the code code. from HTSM the each case thread.just described, sidershowning two thread in drafting Figure is stalled. 3b. threads Choosing In order that this are to avoid threshold trying this, to value we grab implement is a ad spin hoc, lock a correcttheHowever, correct offset offset to synchronizeinstead that models of just to. the It executing does change this to by each the alternating code. thread HTSM the once, HTSM siderempiricallyratioed two drafting thread 100 selection works threads well policy. that and are If the istrying usedminimum to in grab this PC a haswork. 
spin been lock continuesHowever,it to attempts do instead this until to of find it just finds the executing a correct point where each offset the thread to threads synchronize once, HTSM to. It does butratioed empirically thread 100 selection works policy. well and If the is minimum used in this PC work. has been executioncontinues of to code do this from until each it finds thread, a point executing where each the threadsthread thatthatThe protectschosen protects other so a manya critical caveat critical times section.to section. without STSM synchronizing, One One involves thread thread a will select livelockwill acquire acquire the maxi- case. the the Con-synchronize,executesexecutesthis at each each by which alternatingthread thread it updates at at exponentially exponentially the the offset execution with increasing the increasing difference of code rates. rates. from It It eachex- ex- thread. lockchosen andThe otherenter so many caveat the critical times to STSM without section, involves synchronizing, and livelock. the other selectConsider will the continue twomaxi- atsynchronize, exponentiallyecutes thread at which increasing 0 for it updates 1 instruction,rates. the It offset executes thread with thread the 1 difference for 0 2 for instructions, locksidermummum and two PC enter PC draftingonce once the and and critical go gothreads back back section, to to thatselecting selecting areand the trying the minimum minimum other to will grab PC. PC. continue This aThis spin lockinin the theecutes PCs PCsHowever, at at thread that that point. point. 0 instead for All All 1 synchronization synchronization instruction, of just executing thread from 1this each for point 2 thread instructions, once, HTSM to spin.drafting STSM threads will trying always to grab select a spinthe thread lock that that protects is spinning, a 1 instruction,thread 0thread for 4 instructions, 1 for 2 instructions, thread 1thread for 8 0 instructions, for 4 etc. tothatcritical spin. 
protects STSM section. a will One critical always thread section. will select acquire the One thread the thread lock that and will is enter spinning, acquireinstructions, thethreadexecutes thread 0 for 1 4 for each instructions, 8 instructions, thread at thread exponentially etc. This 1 for effectively 8 instructions, increasing rates. etc. It ex- sincesince its its PC PC is is lower, lower, resulting resulting in in livelock livelock since since the the lock lock hold- hold- ThisThis effectively effectively slides slides the the threads threads past past each each other, other, looking looking for for lock and enter the critical section, and the other will continue44 ecutes thread 0 for 1 instruction, thread 1 for 2 instructions, inging thread thread is is stalled. stalled. In In order order to to avoid avoid this, this, we we implement implement a a thethe correct correct offset offset that that models models the the change change to to the the code. code. HTSM HTSM to spin. STSM will always select the thread that is spinning, thread 0 for 4 instructions, thread 1 for 8 instructions, etc. ratioedratioed thread thread selection selection policy. policy. If If the the minimum minimum PC PC has has been been continuescontinues to to do do this this until until it it finds finds a a point point where where the the threads threads since its PC is lower, resulting in livelock since the lock hold- This effectively slides the threads past each other, looking for chosenchosen so so many many times times without without synchronizing, synchronizing, select select the the maxi- maxi- synchronize,synchronize, at at which which it it updates updates the the offset offset with with the the difference difference mummuming PC thread PC once once is and and stalled. go go back back In to toorder selecting selecting to avoid the the minimum minimum this, we PC. implement PC. This This in ain the thethe PCs PCs correct at at that that point.offset point. 
thatAll All synchronization models synchronization the change from from to this this the point code. point HTSM ratioed thread selection policy. If the minimum PC has been continues to do this until it finds a point where the threads chosen so many times without synchronizing, select the maxi- synchronize, at which it updates the offset with the difference 4 mum PC once and go back to selecting the minimum PC. This4 in the PCs at that point. All synchronization from this point
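The STSM selection rules above (minimum-PC selection, the PC-distance threshold, and the ratioed anti-livelock policy) can be sketched in software. This is our illustration, not the hardware implementation: the names `STSM`, `PC_THRESHOLD` (set to the empirical value of 100 reported in the text), and `MAX_MIN_PICKS` (the ratio, whose actual value the text does not give) are assumptions.

```python
# Behavioral sketch of STSM thread selection for two threads.
PC_THRESHOLD = 100   # empirical threshold value from the text
MAX_MIN_PICKS = 64   # assumed ratio for the ratioed selection policy

class STSM:
    def __init__(self):
        self.alternate = 0   # round-robin state for the out-of-threshold case
        self.min_picks = 0   # consecutive min-PC picks without synchronizing

    def select(self, pc0, pc1):
        """Return the thread id (0 or 1) that should issue this cycle."""
        if pc0 == pc1:                        # threads synchronized
            self.min_picks = 0
            return 0
        if abs(pc0 - pc1) > PC_THRESHOLD:     # e.g. one thread in a far-away
            self.alternate ^= 1               # function call: alternate issue
            return self.alternate
        if self.min_picks >= MAX_MIN_PICKS:   # ratioed selection: let the
            self.min_picks = 0                # max-PC (lock-holding) thread
            return 0 if pc0 > pc1 else 1      # make progress once
        self.min_picks += 1
        return 0 if pc0 < pc1 else 1          # default: select the minimum PC
```

In the spin-lock scenario, the spinning thread's PC stays minimal, so without the `MAX_MIN_PICKS` escape the lock holder would never issue.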

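HTSM's exponential sliding search for a PC offset can be illustrated on instruction traces. The sketch below is a simplified software model of our own (the trace tuples and the `find_offset` helper are invented for illustration); the real HTSM executes instructions in hardware rather than scanning recorded traces.

```python
def find_offset(trace0, trace1):
    """Slide the threads past each other exponentially: thread 0 runs 1
    instruction, thread 1 runs 2, thread 0 runs 4, and so on, checking for
    matching instructions after every step. Traces are lists of
    (pc, instruction) tuples; returns the PC offset at the first point
    where both threads sit on an identical instruction, or None."""
    i0 = i1 = 0
    turn, burst = 0, 1
    while i0 < len(trace0) and i1 < len(trace1):
        if trace0[i0][1] == trace1[i1][1]:          # instructions match:
            return trace1[i1][0] - trace0[i0][0]    # record the PC offset
        for _ in range(burst):
            if turn == 0:
                i0 += 1
                if i0 >= len(trace0):
                    return None
            else:
                i1 += 1
                if i1 >= len(trace1):
                    return None
            if trace0[i0][1] == trace1[i1][1]:
                return trace1[i1][0] - trace0[i0][0]
        turn ^= 1
        burst *= 2          # 1, 2, 4, 8, ... instructions per turn
    return None
```

For example, if a newer version of the code inserts two 4-byte instructions, the search discovers a PC offset of 8, which is then added to all later PC comparisons.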
Figure 5: DraftSim Architecture (Pin tool traces for App 0 and App 1 with PCs, instructions, decoded info, TIDs/ASIDs, and memory ops; an XML config file specifying the memory system and energy models; outputs are synchronization, performance, and energy simulation statistics)

Figure 4: ExecD Changes to OpenSPARC T1 Front-End

4) Note on Thread Synchronization Methods: While TSMs try to achieve high synchronization, energy savings can still be had when threads are not synchronized. When threads are diverged, duplicates still occur. In these cases, we prefer to draft duplicates over synchronizing (i.e. the PCs do not match, but the instruction is a duplicate), leading to energy savings.

IV. EXECUTION DRAFTING IMPLEMENTATION

Figure 4 depicts modifications to the front-end of the OpenSPARC T1 [47] core for ExecD. The OpenSPARC T1 is an open-source version of Sun's UltraSPARC T1 [48]–[50]. The core is reduced from four to two threads for this work. This core is used since it is multithreaded, making it amenable to ExecD. We modified the Verilog RTL of the core to implement ExecD, and we plan to tape out a chip utilizing the modified core. The additions to the OpenSPARC T1 processor are small; however, we evaluate the overheads in our results.

The OpenSPARC T1 fetches two instructions per cycle and places them into the instruction buffer for their corresponding thread in the fetch stage. Instructions are always fetched from the thread that is issuing. In the thread select stage, one instruction from one thread is selected to issue. This decision is based on a weighted round-robin thread selection policy giving priority to the least recently executed thread. In addition, threads may become unavailable due to long-latency instructions, cache misses, traps, or resource conflicts [51].

Modifying the OpenSPARC T1 to implement ExecD requires replacing the round-robin thread selection policy with a policy that attempts to synchronize the threads, discussed in Section III-B. As shown in Figure 4, a multiplexer (ExecD Sync Enable Mux) is added to the thread selection decision path which selects between the default OpenSPARC T1 thread selection policy and the ExecD thread synchronization policy. This allows ExecD to control which threads get executed when. The multiplexer is controlled by an ExecD enable bit.

The ExecD Thread Synchronization Methods (TSMs) require multiple signals from different portions of the pipeline, as shown in the bottom right of Figure 4. The reason these signals are needed can be understood from the discussion in Section III-B. ExecD requires knowledge of when threads diverge in order to enable fetching for follower threads and stop the double stepping of instructions. This signal includes branch divergences, interrupts, and hitting page boundaries, as we rely on threads executing from the same code page to determine that threads are synchronized. ExecD also requires the thread PCs for synchronization purposes. Thread instructions and the corresponding physical addresses are required to determine when threads are synchronized. Only when the instructions' physical addresses and the instructions themselves match can ExecD declare being synchronized. This requires that the physical addresses be piped forward (not shown in Figure 4), as they are usually dropped after the tag comparison. Recall that ExecD does not require threads to be synchronized to achieve energy savings, as duplicate instructions that arise when diverged are also drafted; only the fetch stage cannot be disabled.

In order for ExecD to share the fetching of instructions for two synchronized threads and disable fetch for the follower thread, the physical code page must be shared. Physical code pages are shared in most operating systems when multiple instances of the same program are executing. In addition, most virtual machine monitors deduplicate code pages between VMs, enabling ExecD to draft threads between VMs. VMware ESX Server and KVM with Kernel Samepage Merging (KSM) enabled both perform this optimization.

The ExecD TSM logic outputs a few signals in order to control the pipeline when drafting: stall, sync, and repeat. The stall signal is wired to the fetch stage. This allows ExecD to prevent fetching for follower threads when drafting. The sync signal goes to the PC logic in order to keep the PCs of follower threads up to date when drafting. When stalling fetch for a follower thread, the PC of the leader thread gets copied into the PC of the stalling thread in fetch.

Table I: Core Model

    Parameter                        Value
    CPI (Non-Memory Instructions)    1
    L1 Data Cache Size               8KB
    L1 Data Cache Associativity      4
    L1 Data Cache Line Size          32B
    L1 Data Cache Access Latency     1 cycle
    L2 Unified Cache Size            384KB
    L2 Unified Cache Associativity   12
    L2 Unified Cache Line Size       64B
    L2 Unified Cache Access Latency  22 cycles
    DRAM Access Latency              52 cycles

Table II: Energy Model

    Pipeline Stage    Energy per Cycle (nJ/cycle)
    Fetch             0.60571875
    Thread Select     0.03583125
    Decode            0.020475
    Execute           0.4675125
    Memory            0.3241875
    Writeback         0.252525
    Whole Pipeline    1.70625
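The thread-selection override and the stall/sync control behavior described in this section can be sketched behaviorally. This is our model under invented names (`ThreadSelect`, `drafting_control`), not the OpenSPARC T1 RTL, and the default policy is simplified to plain least-recently-issued round-robin rather than the full weighted policy.

```python
class ThreadSelect:
    """Thread select stage with the ExecD Sync Enable Mux added."""
    def __init__(self, execd_enable):
        self.execd_enable = execd_enable
        self.last_issued = 1          # so thread 0 wins the first round

    def default_policy(self, ready):
        # Simplified default: prefer the least recently issued ready thread.
        for tid in (self.last_issued ^ 1, self.last_issued):
            if ready[tid]:
                return tid
        return None                   # both threads unavailable

    def select(self, ready, tsm_choice):
        # ExecD Sync Enable Mux: the enable bit picks the selection policy.
        tid = tsm_choice if self.execd_enable else self.default_policy(ready)
        if tid is not None:
            self.last_issued = tid
        return tid

def drafting_control(pcs, leader, synchronized):
    """TSM outputs while drafting: stall follower fetch, sync its PC."""
    follower = leader ^ 1
    stall = synchronized              # follower needs no fetch of its own
    if synchronized:
        pcs[follower] = pcs[leader]   # sync: copy leader PC into follower PC
    return stall, pcs
```

With the enable bit cleared, behavior falls back to the baseline round-robin; with it set, the TSM dictates issue order and the follower's fetch is suppressed while its PC tracks the leader.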

The PC of follower threads needs to stay up to date so that we fetch the correct instructions when re-enabling fetch after encountering a divergence. The repeat signal controls an instruction repeat multiplexer that is added to the datapath to support double stepping instructions. The instruction repeat multiplexer selects between the output of the thread select multiplexer and a flip-flop that latches the previously issued instruction. Thus, when double stepping, the leader instruction is chosen from the thread select multiplexer and follower instructions are selected from the flip-flop; only the thread identifier is changed.

Figure 6: Reference Commonality for MediaBench ((a) adpcm, (b) epic, (c) g721, (d) ghostscript, (e) gsm, (f) jpeg, (g) jpg2000, (h) mpeg2, (i) mpeg4)
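The double-stepping datapath just described can be modeled in a few lines. The sketch is our illustration (the `Insn` record and `RepeatMux` class are invented names), not the Verilog.

```python
from collections import namedtuple

# A decoded instruction: opcode plus thread identifier.
Insn = namedtuple("Insn", "op tid")

class RepeatMux:
    """Instruction repeat multiplexer with its latching flip-flop."""
    def __init__(self):
        self.latched = None                    # flip-flop on the select output

    def issue(self, selected, repeat, follower_tid=None):
        """selected: Insn from the thread select mux; repeat: control bit."""
        if repeat:
            # Follower cycle: replay the latched leader instruction,
            # changing only the thread identifier.
            return self.latched._replace(tid=follower_tid)
        # Leader cycle: pass through and latch for the follower's cycle.
        self.latched = selected
        return selected
```

On the leader's cycle the mux passes the selected instruction through and latches it; on the follower's cycle the repeat bit replays the latch with the follower's thread id, so no second fetch or decode is needed.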

V. RESULTS

We first evaluate commonality in applications. This reference result represents an estimate for the amount of commonality Thread Synchronization Methods (TSMs) can find. Energy savings, performance overheads, performance/energy gains, and area overheads of ExecD are evaluated, as well as the effect that different numbers of drafted threads have on energy savings.

A. Evaluation Methodology

To evaluate ExecD, we built a trace-based simulator, DraftSim, shown in Figure 5. DraftSim accepts traces generated by Pin [52] and models synchronization of the instruction traces using a specified TSM, reporting synchronization results. It also accepts an XML configuration file that specifies a memory system model and an energy model, and it simulates the execution of instructions to obtain performance and energy results. The core model assumes a clocks per instruction (CPI) of one for all non-memory instructions. We model the latency and occupancy of the L1 data cache, L2 cache, and memory. The parameters used in the core model for all results are shown in Table I and are modeled after the OpenSPARC T1 core.

We use a simple energy model to calculate energy consumption. The energy model is derived from the power breakdown of the OpenSPARC T1 core in [53] and the power measurements of the UltraSPARC T1 chip running at 1.2 GHz / 1.2V in 90nm technology in [49]. The resulting energy model is presented in Table II. The energy model is used in the simulator as follows. If the threads are synchronized, an instruction is drafted behind another, and the instructions are full duplicates, the energy from the fetch, thread select, and decode stages is not added for the duplicate instruction. If two partial duplicates are drafted, the energy from the decode stage is not added for the duplicate instruction. Otherwise, the energy for all stages is added. When the decode stage energy is not added, the register file energy is still added, as it is always accessed. The static power of unused stages is not modeled; however, this should be small and should not dramatically impact results. The simulator's energy model keeps track of whether threads are synchronized and when instructions are drafted.

Applications from the SPECint CPU2006 [54], MediaBench I [55], and MediaBench II [56] benchmarks are used throughout the evaluation. In addition, different versions of the Apache web server [43] and GCC [57] are used as benchmarks. In order to evaluate ExecD, we need multiple input sets for each benchmark. Unfortunately, not all applications within a benchmark suite have multiple input sets, so we restrict our evaluation to applications with multiple input sets.

B. Peak Commonality Available in Applications

We start by evaluating how much commonality exists between applications: between the same program with different inputs, as well as between different versions of the same program with different inputs. We implemented a reference TSM in DraftSim to obtain an estimate of the available commonality. The reference TSM operates as follows: when a divergence is encountered, the TSM scans future instructions in the trace until it finds the earliest convergence point. It executes the instructions up until that point, after which the traces are synchronized again. While this greedy TSM is not practical to implement in real hardware, it provides a good estimate of the peak available commonality. We limit the number of future instructions searched to 50 million instructions in the reference to bound the memory requirements.
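The reference TSM's greedy reconvergence scan can be sketched as follows. This is our simplified illustration (operating on lists of instruction identifiers rather than full DraftSim traces); the scan limit defaults to the paper's 50 million but is exercised with tiny traces here.

```python
def earliest_convergence(trace0, trace1, limit=50_000_000):
    """On a divergence, scan forward in both traces (bounded by `limit`)
    for the earliest point where they meet again. Returns (i, j), the
    indices of the convergence point minimizing i + j, or None."""
    best = None
    seen = {}                                  # insn -> earliest index in trace0
    for i, insn in enumerate(trace0[:limit]):
        seen.setdefault(insn, i)
    for j, insn in enumerate(trace1[:limit]):
        if insn in seen:
            i = seen[insn]
            if best is None or i + j < best[0] + best[1]:
                best = (i, j)
    return best
```

Unlike the hardware TSMs, this oracle sees the future of both traces, which is why it serves only as an upper bound on achievable commonality.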

Figure 7: Reference Commonality for SPECint ((a) 400.perlbench, (b) 401.bzip2, (c) 403.gcc; numbers on axes represent different inputs)

Figure 8: Reference Commonality for Apache and GCC ((a) Apache, (b) GCC)

Figure 9: Average Achieved Synchronization (percent synchronized for Ref, STSM, RTSM, HTSM, and MMT; panels (a) Mediabench, (b) SPECint, (c) Apache, (d) GCC). White lines indicate average excluding same program, same input (SPSI)

15.00% 1) Same Program, Different Inputs: Figures 6 and 7 show 10.00% the reference commonality results for MediaBench I, II, and 5.00% SPECint, respectively. These graphs show the percentage 0.00% of execution which is drafted; lighter is better. The actual Savings % Energy values are also annotated in each cell, rounded to the nearest (a) Mediabench integer. On each axis is a benchmark mode (for benchmarks 15.00% with modes) and input combination. Each box represents the 10.00% commonality found between drafting the two corresponding 5.00% 0.00% benchmark mode and input combinations. The numbers Savings % Energy 400.perlbench 401.bzip2 403.gcc amean gmean on the axes for SPECint represent different inputs. Some (b) SPECint 20.00% applications, such as adpcm, epic, ghostscript, and gsm, 15.00% exhibit high commonality when executed in the same mode 10.00% (i.e. encode/decode), even for different inputs. 5.00% 0.00% % Energy Savings % Energy However, other applications show sporadic or poor com- Same Version Same Minor Version amean gmean (c) Apache monality. This is mainly due to irregular or high amounts of 15.00% data dependent control flow. However, Section V-E shows 10.00% that, despite the lack of energy benefits from drafting, the 5.00% performance penalty from doing so is minuscule. 0.00% % Energy Svaings % Energy Same Version Same Minor Version amean gmean 2) Different Program Versions, Different Inputs: We (d) GCC study the commonality in different versions of the same Figure 10: HTSM Average Energy Savings. White lines program with different inputs. Figure 8a compares differ- indicate average excluding SPSI ent versions of Apache serving different pages. Figure 8b compares different versions of GCC compiling different applications. All experiments show available high commonality be- versions exhibits poor commonality. 
This is expected, as tween the same version with different inputs and versions differing major and minor versions generally indicate large with the same major and minor version but with different changes to the code, whereas different patch version num- inputs. However, comparing versions across major or minor bers usually indicate a bug fix or small change [58]. C. Hardware Synchronization Method Performance In this section, we determine how much commonality can be exploited by our Thread Synchronization Methods (a) Throughput Gain (TSMs) and the TSM presented in Minimal Multithreading (MMT) [22]. For a qualitative comparison to MMT, see Section VI. We ran experiments identical to the previous sec- (b) Single-Threaded Speedup tion, with each TSM. Figure 9 shows the average achieved synchronization across different input cross products for each TSM. The white lines indicate averages excluding the same program, same input (SPSI) combinations. Overall, the TSMs achieve comparable synchronization to the reference. STSM achieves close to the same, if not more, synchro- nization versus the reference for the same program with different inputs (SPDI). However, it falls short on different (c) Performance/Energy Gains versions of Apache or GCC. This is because STSM relies Figure 11: Performance for MediaBench using HTSM. Ex- on PCs for synchronization, but in different versions of cludes SPSI for visibility. Averages: w/o SPSI (w/ SPSI) an application, the same PCs may not point to the same instruction. Nevertheless, it can still find a significant amount of commonality. For instance, drafting Apache 2.4.6 hosting cnn.com against top500.org achieves 94.84% commonality. RTSM in part makes up for STSM’s shortcomings. It (a) Throughput Gain achieves close to the same synchronization as the reference for the different program, different input (DPDI) experi- ments, but generally falls short with SPDI. 
This is due to RTSM's lack of reliance on PCs, which is beneficial for the DPDI case. However, it may not perform optimally on common code structures like if-else and loops, causing it to perform worse in general than STSM in the SPDI case.

HTSM is successful in hybridizing STSM and RTSM, achieving synchronization comparable to the reference in almost all cases, whether drafting different inputs for the same program or different programs. Note that HTSM does not perform better than other synchronization methods in all cases, such as gsm, where STSM performs better. Thus, HTSM is not the obvious choice in all cases.

The MMT simulations used a Fetch History Buffer (FHB) with 32 entries (as in the original work) and assumed only taken branch targets were recorded in the FHB. In addition, every instruction's PC was checked against the other thread's FHB for potential convergence points.

MMT performs similarly to STSM on the SPSI experiments. It performs well on common code structures like STSM, but may miss opportunities to synchronize short code snippets. In addition, MMT may not reconverge until shortly after optimal reconvergence points (the first common instruction after a divergence), while STSM achieves tighter synchronization. MMT performs relatively poorly in SPDI experiments, as it relies on PC matching (like STSM) and is limited by its FHB size.

[Figure 12: Performance for SPECint using HTSM. Excludes SPSI for visibility. Averages: w/o SPSI (w/ SPSI). Panels: (a) Throughput Gain, (b) Single-Threaded Speedup, (c) Performance/Energy Gains.]
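The MMT lookup described above — record taken-branch targets in a 32-entry FHB per thread, and check each fetched PC against the other thread's FHB — can be sketched in a few lines. This is our own software model of the mechanism (the real design is a hardware CAM, not a Python deque):

```python
from collections import deque

FHB_ENTRIES = 32  # FHB size used in the original MMT work

class FetchHistoryBuffer:
    """FIFO of recently recorded taken-branch targets for one thread."""
    def __init__(self):
        # Oldest entries fall out once capacity is reached.
        self.entries = deque(maxlen=FHB_ENTRIES)

    def record_taken_branch(self, target_pc: int):
        self.entries.append(target_pc)

    def contains(self, pc: int) -> bool:
        # Hardware performs this as a parallel CAM match over all entries.
        return pc in self.entries

# Thread 0 checks its current PC against thread 1's history for a
# potential convergence point:
fhb1 = FetchHistoryBuffer()
fhb1.record_taken_branch(0x4000)
print(fhb1.contains(0x4000))  # True: candidate reconvergence point
```

The sketch also makes the limitations noted above visible: only taken-branch targets are recorded, so a join point reached by fall-through (as in a short if/else hammock) never enters the buffer, and common code more than 32 recorded branches apart has already been evicted.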
A larger FHB may increase MMT's ability to find synchronization points, especially for SPDI cases, where common code may be spaced further apart. STSM is not limited by the size of any structures, and therefore may perform better here.

D. Energy Savings

We present energy savings from ExecD in this section. For brevity, only HTSM will be used for the remainder of the results; however, energy savings are similar for STSM and RTSM. The average energy savings from HTSM for each benchmark are presented in Figure 10, where the white lines indicate averages without the SPSI combinations. Results are presented as a percentage saved versus the energy needed to execute the benchmarks using fine-grain multithreading (FGMT), which cannot disable fetch, but may save energy by issuing coincidental identical instructions consecutively.

In general, the energy savings HTSM achieves is correlated with the amount of synchronization it achieves: between 5% and 20% energy savings for most benchmarks. This is evident when comparing Figures 9 and 10. Note that there can be exceptions: if many of the drafted instructions are partial duplicates, drafting will not save as much energy as with full duplicates. While there are achievable energy savings from drafting partial duplicates, they are comparably small, as we model ExecD as only saving energy for the decode stage, which does not contribute a large portion to the core energy (see Table II).

[Figure 13: Performance for Apache using HTSM. Excludes SPSI for visibility. Averages: w/o SPSI (w/ SPSI). Panels: (a) Throughput Gain, (b) Single-Threaded Speedup, (c) Performance/Energy Gains.]

The energy results from DraftSim are conservative. DraftSim does not take into account energy savings from the execute, memory, or writeback stages if data happens to be shared between drafted instructions. In addition, energy saved from the control logic, i.e. ALU opcode, multiplexer control signals, etc., is not considered. However, we do not model the energy penalty from our TSM logic, but do take into account energy from the thread select stage with the OpenSPARC T1 thread select logic. Our TSM logic should not consume more power than the OpenSPARC T1 thread select logic (based on the area results in Section V-G), thus it is conservative to use the thread select logic energy as a substitute for the TSM logic energy. As such, the actual energy savings may be higher than what is presented.
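The conservative accounting described in this section can be made concrete. The sketch below mirrors the model's rules (full duplicates skip fetch and decode; partial duplicates are credited with decode savings only; execute/memory/writeback and control-logic savings are ignored), but the per-stage energy weights and drafting fractions are illustrative placeholders, not values from the paper's Table II:

```python
# Illustrative per-instruction stage energies (arbitrary units; the paper's
# real numbers come from its power model, which we do not reproduce here).
STAGE_ENERGY = {"fetch": 3.0, "decode": 2.0, "execute": 3.0,
                "memory": 1.0, "writeback": 1.0}
FULL = sum(STAGE_ENERGY.values())  # energy of one undrafted instruction

def follower_energy(n: int, full_dup: float, partial_dup: float) -> float:
    """Energy for n follower-thread instructions: full duplicates skip
    fetch+decode; partial duplicates (same opcode, different operands)
    save only decode energy, per the conservative model."""
    saved = (n * full_dup * (STAGE_ENERGY["fetch"] + STAGE_ENERGY["decode"])
             + n * partial_dup * STAGE_ENERGY["decode"])
    return n * FULL - saved

# Two drafted threads: the leader pays full energy; the follower drafts
# 60% of its instructions as full duplicates and 20% as partial duplicates.
n = 1000
total = n * FULL + follower_energy(n, full_dup=0.6, partial_dup=0.2)
savings = 1 - total / (2 * n * FULL)
print(f"{savings:.1%}")  # -> 17.0%
```

Because partial duplicates are credited with only the decode share, shifting the mix from full toward partial duplicates shrinks the savings, which is the exception noted above for benchmarks with many partial duplicates.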

E. Performance Results

This section examines throughput and single-stream performance, but we emphasize that throughput is more significant for data center applications with QoS bounds. Only HTSM results are presented; STSM and RTSM are similar.

[Figure 14: Performance for GCC using HTSM. Excludes SPSI for visibility. Averages: w/o SPSI (w/ SPSI). Panels: (a) Throughput Gain, (b) Single-Threaded Speedup, (c) Performance/Energy Gains.]

1) Throughput: Throughput gains as a percentage of FGMT throughput for each benchmark using HTSM are shown in Figures 11a, 12a, 13a, and 14a, along with averages. Two averages are presented: excluding SPSI (first average) and including SPSI (second average, in parentheses). The data for SPSI is excluded from the plots for visibility, but is trivial to infer since the execution is exactly the same when drafting SPSI.

Throughput gains are in the range of [-5%, 4%] across all benchmarks. We were surprised to find that ExecD can achieve throughput gains, even for the majority of drafting combinations within a certain benchmark suite, as shown in Figures 12a and 14a. Changes in throughput are mainly due to cache performance, as for both ExecD (regardless of the TSM) and FGMT, an instruction is selected to be executed every cycle if one is ready. ExecD TSMs will never stall both threads on a given cycle (unless stalling for a different reason). However, in some cases, ExecD TSMs align instructions such that cache performance is improved and, in other cases, they do not. Nevertheless, throughput is not largely affected by ExecD TSMs. Thus, we can be aggressive in our attempts to draft applications since, even if drafting does not achieve any energy savings, the performance will not be significantly degraded.

An interesting future technique to explore is to prefetch data for drafted threads based on cache misses of duplicate instructions. This could lead to improved cache performance, giving ExecD a throughput advantage over FGMT. This idea is related to previous work in Helper Threads or pre-execution [59]–[63].

2) Single-Threaded Performance: The single-threaded performance results are presented as a percentage speedup compared to FGMT. They are shown for each benchmark using HTSM in Figures 11b, 12b, 13b, and 14b, with the same averages and caveats as the throughput figures. Results for both drafted threads are shown. Our TSMs can negatively affect single-threaded performance as they stall threads to keep them synchronized. This increases latency over FGMT. Single-threaded speedup is in the range of [-5%, 4%] across all benchmarks. Again, we were surprised to see ExecD achieve speedups, even for the majority of drafting combinations within a certain benchmark suite, as shown in Figures 12b and 14b. This could be a result of cache performance improvements or the way HTSM schedules instructions. HTSM and STSM could stall one thread for many cycles while executing instructions from the other thread. This means the thread that was not stalled has a lower latency. RTSM shows single-threaded latency closer to that of FGMT, as it chooses instructions to execute at random. Thus, it averages out to FGMT in the long term.

The takeaway: single-threaded performance is not dramatically impacted by ExecD. As with throughput, this provides some freedom in scheduling threads to be drafted.

F. Performance/Energy Results

Energy is the key metric that ExecD optimizes for. In this section we interpret performance as throughput, since it is more pertinent to data center applications. Figures 11c, 12c, 13c, and 14c show Throughput/Energy gains for each benchmark using HTSM. Throughput/Energy gains are up to 20% for drafting threads that exhibit high commonality. For threads that do not exhibit high commonality, there is nominal gain or loss, meaning it is not detrimental to draft these types of programs. For example, drafting Apache instances within the same major and minor versions achieves high Throughput/Energy gains; however, drafting instances with different minor versions leads to nominal gains.

G. Execution Drafting Area Overheads

We synthesized the Verilog RTL for a 2-threaded OpenSPARC T1 core with ExecD modifications with Synopsys Design Compiler using the IBM 32nm SOI process at 1.2GHz/0.9V. The SRAM area for caches was estimated using the IBM RAM compiler. Note, the L2 cache area is excluded. In addition, the areas of the register files, TLBs, and store buffer are excluded, as they are either highly ported or use CAMs (making it difficult to estimate the area). The area overheads for ExecD are small: 2.12% of the core area, as shown in Figure 15. Most of the area overheads are due to comparators for PCs and instructions, a subtractor for the PCs, and pipe registers for physical addresses. Note, this implementation includes all three TSMs, which allows many of the hardware resources to be shared between them, e.g. HTSM and RTSM can share a linear feedback shift register (LFSR). The area result correlates well with the power model we used, lending credence to its validity.

[Figure 15: ExecD Area Overhead in 2-threaded OpenSPARC T1 Core. Excludes L2 Cache. ExecD overheads total 2.12% of core area.]

H. Execution Drafting with Higher Thread Counts

Thus far, we have only presented results for drafting two threads. However, it is possible to draft higher thread counts. Figure 16 shows the energy savings from ExecD as a function of drafted thread count. This was generated by drafting 100 instructions from each thread and comparing against the energy from executing the instructions from each thread sequentially. For two threads, the energy savings is ~20%, which is our experimental upper bound. Yet, greater energy savings could be had by drafting more threads. Energy savings increase dramatically until 32 threads, after which we see diminishing returns. This is because there is more opportunity to find commonality, and greater energy savings can be harvested with larger blocks of drafted threads. The benefits diminish since we only save front-end energy. We leave developing scalable TSMs and evaluating the energy savings to future work.

[Figure 16: Energy Savings as a Function of Thread Count Executing 100 Instructions per Thread.]
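The diminishing returns with thread count can be captured by a first-order model of our own (not the paper's simulator): when n identical threads draft as one block, only the leader pays front-end (fetch and decode) energy and each of the n-1 followers skips it. The front-end fraction below is an assumed value, chosen so the two-thread case reproduces the ~20% figure; it is not a measured number:

```python
FRONT_END_FRACTION = 0.4  # assumed fetch+decode share of per-instruction energy

def drafting_savings(n_threads: int, f: float = FRONT_END_FRACTION) -> float:
    """Energy saved by drafting n identical threads as one block, relative to
    running them back-to-back: one leader pays front-end energy, and each of
    the n-1 followers skips it (perfect duplication assumed)."""
    return f * (n_threads - 1) / n_threads

for n in (2, 4, 8, 16, 32, 64):
    print(f"{n:2d} threads: {drafting_savings(n):.1%}")
```

In this model, savings climb quickly at small thread counts and then flatten, asymptotically approaching the front-end fraction itself — consistent with the observation that benefits diminish because only front-end energy is saved.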

VI. RELATED WORK

Minimal Multithreading (MMT) [22] is the work most related to ExecD, but has many key differences. Unlike MMT, ExecD not only drafts the same applications, but is broader and can draft different programs, as we presented. The MMT synchronization scheme relies only on PC matching, while ExecD is able to synchronize common instruction sequences at different PCs. This enables ExecD to synchronize code at different locations in the same program or in completely different programs, which MMT is unable to do.

Our design is simpler and more scalable than MMT. As we show in Figure 16, drafting larger numbers of threads leads to larger energy savings. When drafting 64 threads, MMT's synchronization method requires each thread to CAM match (compare) against 63 fetch history buffers (FHBs), which is not scalable. Our approach does not require all-to-all comparisons of PC histories. MMT is limited by the size of the FHBs, which also limits scalability. In fact, MMT's synchronization approach is similar to our reference.

Our approach is different from MMT. Unlike MMT, we do not match operand values to reuse (memoize) instruction results. Instead, we always execute the instruction for each thread. If the operand values are the same, wires do not toggle in the wake of the leader thread, saving energy. Thus, we achieve the energy savings MMT achieves without expensive RAM lookups, operand comparisons, and load address checks. MMT trades energy for performance, while ExecD optimizes for energy efficiency. The MMT synchronization mechanism, which uses an FHB to CAM thread PCs against PCs previously fetched from other threads, is not robust. MMT only records branch targets in the FHB, which means that reconvergence can only occur at branch targets. This prevents MMT from reconverging in short if/else code hammocks. ExecD handles these cases. Last, we focus our study on small, low-power, throughput-oriented cores, while MMT uses large OoO cores. We intentionally chose lean cores like those used in current data center architectures such as Applied Micro, Calxeda [64], Tilera [65], [66], and the new AMD ARM chip for ExecD, since we target data centers.

Multithreaded Instruction Sharing (MIS) [23] is also related to ExecD and takes a similar approach to MMT. MIS uses small, throughput-oriented cores, like ExecD, but does not perform active synchronization of the threads. MIS relies on small control flow divergences and finding duplicate instructions within a limited window of instructions. Thus, MIS can try to deduplicate code from different programs, but will not be very successful in doing so. MIS performs operand comparisons in order to reuse (memoize) instruction results from different threads, like MMT. However, MIS claims performance benefits, not energy savings. In fact, they claim energy losses. This is at odds with MMT's findings, which claim both performance and energy benefits. Considering MIS and MMT take the same approach and have similar mechanisms, this is a strange result, but it may have to do with MMT using OoO cores and MIS using small, throughput-oriented cores.

Thread Fusion [26], unlike drafting, "fuses" threads, meaning the core fetches and decodes only a single copy of an instruction (requiring full duplicates) but issues it to two functional units in the same cycle when possible. Dynamic value reuse may enable using a single functional unit, deduplicating computation and thus saving energy. This requires comparators and tracking in register files and scoreboards, but enables energy savings in the presented workloads. Thread Fusion uses synchronization points (SPs), added manually, by the compiler, or by OpenMP. The authors also suggest an algorithm for finding new SPs. Thread Fusion relies on PCs, so it is unlikely to work well on different programs or versions.

Molina et al. [24] and Sodani and Sohi [25] also exploit identical instructions in order to reuse (memoize) instruction results, but only within a single thread in a superscalar. They claim performance benefits, but do not study energy.

Our thread synchronization methods are similar to thread-management techniques [67] like thread delaying (to save energy) and thread balancing (to improve performance). In a controlled environment like the one those techniques utilize, they could be complementary to our own. Since ExecD targets different programs, it can save energy without tracking all threads' progress.

There are many other areas of work related to ExecD. Variations on multithreading architectures, such as the vector-thread (VT) architecture [68], are similar to ExecD, combining independent execution with deduplication when possible. However, automatically finding commonality in data center workloads and factoring out redundant work is not a focus of architectures like VT. In addition, VT requires software changes, which ExecD does not. Temporal Single Instruction Multiple Thread (T-SIMT) [69] is another example in this class, which employs syncwarp instructions to bring divergent execution paths back together. We considered a similar solution, but found that our hardware synchronization mechanisms achieve similar results without software changes.

Warp reconvergence [70], [71] is an active area of GPGPU research, as warp divergence is bad for both energy and performance. These works aim to reconverge warps as early as possible, making them similar to ExecD. However, GPGPU requires a different programming model, and thus requires software changes, making it less general than ExecD.

There are, of course, many previous works on reducing energy consumption of data center servers. These include DVFS [8], [9], PowerNap [3], VM consolidation [72]–[75], energy-proportional computing [10], [11], MapReduce energy efficiency [76]–[81], configurable accelerators exploiting Dark Silicon [14] such as Conservation Cores [13], and general energy-aware processor design techniques [12], [15]–[20]. Most of these methods are complementary to ExecD and do not attempt to deduplicate computation.

VII. CONCLUSION

Execution Drafting is an architectural method for reducing energy consumption and optimizing Performance/Energy in data center servers. ExecD synchronizes duplicate instructions from different programs or threads on the same multithreaded core, such that they flow down the pipe consecutively, or draft. ExecD is an instance of a broader class of work in computation deduplication, in which common code is exploited to reduce the work required to execute it. We evaluated the commonality available in the same applications with different inputs and the same applications with different version numbers, the effectiveness of thread synchronization methods, and the energy-performance trade-off. Results indicate ExecD can achieve up to 20% Performance/Energy gain over fine-grain multithreading with small area overheads, and the potential for larger gains when drafting larger thread counts.

ACKNOWLEDGMENTS

This work was partially supported by the NSF under Grant No. CCF-1217553 and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

REFERENCES

[1] R. Buyya, C. S. Yeo, and S. Venugopa, "Market-oriented cloud computing: Vision, hype, and reality for delivering it services as computing utilities," in Proceedings of the IEEE International Conference on High Performance Computing and Communications, 2008.
[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: A berkeley view of cloud computing," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb 2009.
[3] D. Meisner, B. T. Gold, and T. F. Wenisch, "Powernap: eliminating server idle power," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS 2009, pp. 205–216.
[4] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," in Proceedings of the Symposium on Operating Systems Design & Implementation, ser. OSDI '04. Berkeley, CA, USA: USENIX Association, pp. 10–10.
[5] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[6] "Amazon elastic compute cloud (amazon ec2)," 2013. [Online]. Available: http://aws.amazon.com/ec2
[7] "Microsoft azure," 2013. [Online]. Available: http://www.microsoft.com/azure
[8] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, pp. 38–43.
[9] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott, "Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling," in High-Performance Computer Architecture, 2002. Proceedings. International Symposium on. IEEE, pp. 29–40.
[10] X. Fan, W.-D. Weber, and L. A. Barroso, "Power provisioning for a warehouse-sized computer," in Proceedings of the International Symposium on Computer Architecture, ser. ISCA '07, pp. 13–23.
[11] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile dram," in Proceedings of the International Symposium on Computer Architecture, ser. ISCA '12. IEEE, pp. 37–48.
[12] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, "Composite cores: Pushing heterogeneity into a core," in Proceedings of the IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '12. IEEE, pp. 317–328.
[13] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. Taylor, "Conservation cores: reducing the energy of mature computations," in Proceedings of Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '10, pp. 205–218.
[14] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the International Symposium on Computer Architecture, ser. ISCA '11, pp. 365–376.
[15] J. H. Tseng and K. Asanović, "Ringscalar: A complexity-effective out-of-order superscalar microarchitecture," MIT CSAIL Technical Report MIT-CSAIL-TR-2006-066, September 2006.
[16] S. Das and W. J. Dally, "Static cache way allocation and prediction," Concurrent VLSI Architectures Group, Stanford University, Tech. Rep. 132, August 2012.
[17] R. C. Harting, V. Parikh, and W. J. Dally, "Energy and performance benefits of active messages," Concurrent VLSI Architectures Group, Stanford University, Tech. Rep. 131, February 2012.
[18] V. Parikh, R. C. Harting, and W. J. Dally, "Configurable memory hierarchies for energy efficiency in a many core processor," Concurrent VLSI Architectures Group, Stanford University, Tech. Rep. 130, February 2012.
[19] E. Witchel, S. Larsen, C. S. Ananian, and K. Asanović, "Direct addressed caches for reduced power consumption," in Proceedings of the ACM/IEEE International Symposium on Microarchitecture, ser. MICRO '01, pp. 124–133.
[20] K. Asanović, "Energy-exposed instruction set architectures," in In Progress Session, International Symposium on High Performance Computer Architecture, January 2000.
[21] D. Meisner and T. F. Wenisch, "Dreamweaver: Architectural support for deep sleep," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS 2012, pp. 313–324.
[22] G. Long, D. Franklin, S. Biswas, P. Ortiz, J. Oberg, D. Fan, and F. T. Chong, "Minimal multi-threading: Finding and removing redundant instructions in multi-threaded processors," in Proceedings of the IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '10. IEEE Computer Society, pp. 337–348.
[23] M. Dechene, E. Forbes, and E. Rotenberg, "Multithreaded instruction sharing," Department of Electrical and Computer Engineering, North Carolina State University, Tech. Rep., December 2010.
[24] C. Molina, A. González, and J. Tubella, "Dynamic removal of redundant computations," in Proceedings of the International Conference on Supercomputing, ser. ICS '99. ACM, pp. 474–481.
[25] A. Sodani and G. S. Sohi, "Dynamic instruction reuse," in Proceedings of the International Symposium on Computer Architecture, ser. ISCA '97. ACM, pp. 194–205.
[26] R. Rakvic, J. González, Q. Cai, P. Chaparro, G. Magklis, and A. González, "Energy efficiency via thread fusion and value reuse," IET Computers & Digital Techniques, vol. 4, no. 2, pp. 114–125, 2010.
[27] J. C. Mogul, "A trace-based analysis of duplicate suppression in HTTP," Compaq Computer Corporation Western Research Laboratory, Tech. Rep. 99/2, November 1999.
[28] N. T. Spring and D. Wetherall, "A protocol-independent technique for eliminating redundant network traffic," in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ser. SIGCOMM '00. ACM, pp. 87–95.
[29] A. Muthitacharoen, B. Chen, and D. Mazières, "A low-bandwidth network file system," in Proceedings of the ACM Symposium on Operating Systems Principles, ser. SOSP '01. ACM, pp. 174–187.
[30] S. Quinlan and S. Dorward, "Venti: A new approach to archival data storage," in Proceedings of the USENIX Conference on File and Storage Technologies, ser. FAST '02, Berkeley, CA, USA.
[31] P. Kulkarni, F. Douglis, J. LaVoie, and J. M. Tracey, "Redundancy elimination within large collections of files," in Proceedings of the USENIX Annual Technical Conference, ser. ATEC '04, Berkeley, CA, USA, pp. 5–5.
[32] N. Jain, M. Dahlin, and R. Tewari, "Taper: Tiered approach for eliminating redundancy in replica synchronization," in Proceedings of the USENIX Conference on File and Storage Technologies, ser. FAST '05, Berkeley, CA, USA, pp. 21–21.
[33] D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki, "Improving duplicate elimination in storage systems," Trans. Storage, vol. 2, no. 4, pp. 424–448, Nov. 2006.
[34] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in Proceedings of the USENIX Conference on File and Storage Technologies, ser. FAST '08, Berkeley, CA, pp. 18:1–18:14.
[35] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse indexing: Large scale, inline deduplication using sampling and locality," in Proceedings of the Conference on File and Storage Technologies, ser. FAST '09. Berkeley, CA, USA: USENIX Association, pp. 111–123.
[36] L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein, "The design of a similarity based deduplication system," in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. ACM, pp. 6:1–6:14.
[37] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, IEEE International Symposium on, ser. MASCOTS '09, pp. 1–9.
[38] B. Debnath, S. Sengupta, and J. Li, "Chunkstash: Speeding up inline storage deduplication using flash memory," in Proceedings of the USENIX Annual Technical Conference, ser. USENIXATC '10, Berkeley, CA, USA, pp. 16–16.
[39] F. Guo and P. Efstathopoulos, "Building a high-performance deduplication system," in Proceedings of the USENIX Annual Technical Conference, ser. USENIXATC '11, Berkeley, CA, USA, pp. 25–25.
[40] J. Min, D. Yoon, and Y. Won, "Efficient deduplication techniques for modern backup operation," Computers, IEEE Transactions on, vol. 60, no. 6, pp. 824–840, 2011.
[41] W. Xia, H. Jiang, D. Feng, and Y. Hua, "Silo: A similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput," in Proceedings of the USENIX Annual Technical Conference, ser. USENIXATC '11, Berkeley, CA, USA, pp. 26–28.
[42] P. Shilane, M. Huang, G. Wallace, and W. Hsu, "Wan-optimized replication of backup datasets using stream-informed delta compression," Trans. Storage, vol. 8, no. 4, pp. 13:1–13:26, Dec. 2012.
[43] "The apache http server project," 2012. [Online]. Available: http://httpd.apache.org/
[44] "Oracle mysql," 2013. [Online]. Available: http://www.mysql.com/
[45] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, "Finding a needle in haystack: Facebook's photo storage," in Proceedings of the USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '10, Berkeley, CA, USA, pp. 1–8.
[46] L. A. Barroso, J. Dean, and U. Hölzle, "Web search for a planet: The google cluster architecture," IEEE Micro, vol. 23, no. 2, pp. 22–28, Mar. 2003.
[47] "OpenSPARC T1." [Online]. Available: http://www.oracle.com/technetwork/systems/opensparc/opensparc-t1-page-1444609.html
[48] A. Leon, J. Shin, K. Tam, W. Bryg, F. Schumacher, P. Kongetira, D. Weisner, and A. Strong, "A power-efficient high-throughput 32-thread sparc processor," in International Solid-State Circuits Conference. Digest of Technical Papers, 2006, pp. 295–304.
[49] A. Leon, B. Langley, and J. Shin, "The ultrasparc t1 processor: Cmt reliability," in Custom Integrated Circuits Conference, 2006, pp. 555–562.
[50] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: a 32-way multithreaded sparc processor," Micro, IEEE, vol. 25, no. 2, pp. 21–29, 2005.
[51] OpenSPARC T1 Microarchitecture Specification, Santa Clara, CA, 2006.
[52] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. ACM, pp. 190–200.
[53] J. Sartori, B. Ahrens, and R. Kumar, "Power balanced pipelines," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pp. 1–12.
[54] Standard Performance Evaluation Corporation, 2013. [Online]. Available: http://www.spec.org/
[55] C. Lee, M. Potkonjak, and W. Mangione-Smith, "Mediabench: a tool for evaluating and synthesizing multimedia and communications systems," in Microarchitecture. Proceedings., IEEE/ACM International Symposium on, 1997, pp. 330–335.
[56] "Mediabench ii," 2006. [Online]. Available: http://euler.slu.edu/~fritts/mediabench/
[57] "Gcc, the gnu compiler collection," October 2013. [Online]. Available: http://gcc.gnu.org/
[58] "Semantic versioning 2.0.0." [Online]. Available: http://semver.org/
[59] D. Kim, S. S.-w. Liao, P. H. Wang, J. d. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen, "Physical experimentation with prefetching helper threads on intel's hyper-threaded processors," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, ser. CGO '04. IEEE Computer Society, pp. 27–.
[60] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS 2002. ACM, pp. 159–170.
[61] M. Kamruzzaman, S. Swanson, and D. M. Tullsen, "Inter-core prefetching for multicore processors using migrating helper threads," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '11. ACM, pp. 393–404.
[62] S. S. Liao, P. H. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. P. Shen, "Post-pass binary adaptation for software-based speculative precomputation," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '02. ACM, pp. 117–128.
[63] K. Sundaramoorthy, Z. Purser, and E. Rotenburg, "Slipstream processors: Improving both performance and fault tolerance," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS 2000. ACM, pp. 257–268.
[64] R. Courtland, "The high stakes of low power," Spectrum, IEEE, vol. 49, no. 5, pp. 11–12, May 2012.
[65] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "Tile64 processor: A 64-core soc with mesh interconnect," in Solid-State Circuits Conference. Digest of Technical Papers. IEEE International, Feb 2008, pp. 88–598.
[66] M. Berezecki, E. Frachtenberg, M. Paleczny, and K. Steele, "Many-core key-value store," in Green Computing Conference and Workshops (IGCC), International, July 2011, pp. 1–8.
[67] R. Rakvic, Q. Cai, J. González, G. Magklis, P. Chaparro, and A. González, "Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors," ACM Trans. Archit. Code Optim., vol. 7, no. 2, pp. 9:1–9:25, Oct. 2010.
[68] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanović, "The vector-thread architecture," in Computer Architecture, 2004. Proceedings. International Symposium on, pp. 52–63.
[69] Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanović, "Convergence and scalarization for data-parallel architectures," in IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2013, pp. 1–11.
[70] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, "Simd re-convergence at thread frontiers," in Proceedings of the IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44 '11. ACM, pp. 477–488.
[71] S. Collange, "Stack-less SIMT reconvergence at low cost," Tech. Rep. [Online]. Available: http://hal.archives-ouvertes.fr/hal-00622654
[72] K. Ye, D. Huang, X. Jiang, H. Chen, and S. Wu, "Virtual machine based energy-efficient data center architecture for cloud computing: A performance perspective," in Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 2010, pp. 171–178.
[73] A. Beloglazov and R. Buyya, "Energy efficient resource management in virtualized cloud data centers," in Proceedings of the IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, ser. CCGRID '10. IEEE Computer Society, pp. 826–831.
[74] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, "No "power" struggles: Coordinated multi-level power management for the data center," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS 2008. ACM, pp. 48–59.
[75] A. Beloglazov, J. Abawajy, and R. Buyya, "Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing," Future Gener. Comput. Syst., vol. 28, no. 5, pp. 755–768, May 2012.
[76] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, "Energy efficiency for large-scale mapreduce workloads with significant interactive analysis," in Proceedings of the ACM European Conference on Computer Systems, ser. EuroSys '12, pp. 43–56.
[77] Y. Chen, L. Keys, and R. H. Katz, "Towards energy efficient mapreduce," EECS Department, University of California, Berkeley, Tech. Rep., Aug 2009.
[78] W. Lang and J. M. Patel, "Energy management for mapreduce clusters," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 129–139, Sep. 2010.
[79] T. Wirtz and R. Ge, "Improving mapreduce energy efficiency for computation intensive workloads," in Green Computing Conference and Workshops (IGCC), 2011 International, pp. 1–8.
[80] J. Leverich and C. Kozyrakis, "On the energy (in)efficiency of hadoop clusters," SIGOPS Oper. Syst. Rev., vol. 44, no. 1, pp. 61–65, Mar. 2010.
[81] Y. Chen, A. S. Ganapathi, A. Fox, R. H. Katz, and D. A. Patterson, "Statistical workloads for energy efficient mapreduce," EECS Department, University of California, Berkeley, Tech. Rep., Jan 2010.