QSCORES: Trading Dark Silicon for Scalable Energy Efficiency With
Total Page:16
File Type:pdf, Size:1020Kb
QSCORES: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores Ganesh Venkatesh+, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata+, Michael Bedford Taylor and Steven Swanson http://greendroid.org Department of Computer Science and Engineering University of California, San Diego ABSTRACT General Terms Transistor density continues to increase exponentially, but power Design, Experimentation, Measurement dissipation per transistor is improving only slightly with each gen- eration of Moore’s law. Given the constant chip-level power bud- Keywords gets, this exponentially decreases the percentage of transistors that QSCORE, Conservation Core, Merging, Dark Silicon, Utilization can switch at full frequency with each technology generation. Hence, Wall, Heterogeneous Many-Core, Specialization while the transistor budget continues to increase exponentially, the power budget has become the dominant limiting factor in processor design. In this regime, utilizing transistors to design specialized 1. INTRODUCTION cores that optimize energy-per-computation becomes an effective Although transistor density continues to scale, nearly constant approach to improve system performance. per-transistor power and fixed chip-level power budget place tight To trade transistors for energy efficiency in a scalable manner, we constraints on how much of a chip can be active at full frequency at propose Quasi-specific Cores, or QSCORES, specialized processors one time. Hence, as transistor density increases with each process capable of executing multiple general-purpose computations while generation, so does the fraction of the chip that must be under- providing an order of magnitude more energy efficiency than a clocked or under-utilized because of power concerns [27]. Re- general-purpose processor. The QSCORE design flow is based on searchers have termed these passive expanses of silicon area dark the insight that similar code patterns exist within and across appli- silicon [14, 11, 15, 17, 21]. Dark silicon is forcing designers to find cations. Our approach exploits these similar code patterns to ensure new solutions to convert transistors into performance. that a small set of specialized cores support a large number of com- Recent work [3, 16, 27] has shown that specialization is an ef- monly used computations. fective solution to the problem of using the increasing transistor We evaluate QSCORE’s ability to target both a single applica- budget to extend performance scaling. Specialization-based ap- tion library (e.g., data structures) as well as a diverse workload proaches trade a cheaper resource, dark silicon (i.e., area), for a consisting of applications selected from different domains (e.g., more valuable resource, power efficiency, to scale system perfor- SPECINT, EEMBC, and Vision). Our results show that QSCORES mance. Specialized processors improve system performance by can provide 18.4 better energy efficiency than general-purpose optimizing per-computation power requirements and thereby allow processors while reducing⇥ the amount of specialized logic required more computations to execute within a given power budget. to support the workload by up to 66%. Conventional accelerators reduce power for regular applications containing ample data-level parallelism, but recent work [27, 21] has shown that specialized hardware can reduce power for irreg- Categories and Subject Descriptors ular codes as well. Moreover, these designs use specialized com- pute elements even for relatively short running computations. This C.1.3 [ ]: Heterogeneous (hybrid) systems; Processor Architectures increases the fraction of program execution on specialized, power- C.3 [ ] Special-Purpose and Application-based systems efficient hardware, but also raises questions about area efficiency and re-usability, since each specialized core targets a very specific + Now at Intel Corporation, Hillsboro, Oregon. computation. In addition, supporting a large number of tasks in hardware would require designing a large number of specialized This research was funded by the US National Science Foun- cores, which in turn would make it difficult to tightly integrate all dation under NSF CAREER Awards 06483880 and 0846152, and of them with the general-purpose processor. Therefore, for spe- under NSF CCF Award 0811794. cialization to be effective at targeting a general-purpose system’s workload, the specialized processors must ideally be able to sup- port multiple computations, such that a relatively small number of Permission to make digital or hard copies of all or part of this work for them can together support large fractions of system execution. personal or classroom use is granted without fee provided that copies are To trade silicon for specialization in a scalable manner, we pro- not made or distributed for profit or commercial advantage and that copies pose a new class of specialized processors, Quasi-specific Cores bear this notice and the full citation on the first page. To copy otherwise, to (QSCORES). Unlike traditional ASICs, which target one specific republish, to post on servers or to redistribute to lists, requires prior specific task, QSCORES can support multiple tasks or computations. Be- permission and/or a fee. MICRO’11, December 3-7, 2011, Porto Alegre, Brazil cause of this improved computational power, a relatively small num- Copyright 2011 ACM 978-1-4503-1053-6/11/12 ...$10.00. ber of QSCORES can potentially support a significant fraction of a system’s execution. Moreover, QSCORES are an order of magni- FlipTrackChangesFCL BestLookAheadScore tude more energy-efficient than general-purpose processors. The litWasTrue = GetTrueLit(iFlipCandidate); litWasTrue = GetTrueLit(iLookVar); litWasFalse = GetFalseLit(iFlipCandidate); litWasFalse = GetFalseLit(iLookVar); QSCORE design flow is based on the insight that “similar” code aVarValue[iFlipCandidate] = segments exist within and across applications, and these code seg- 1 - aVarValue[iFlipCandidate]; ments can be represented by a single computation pattern. The S ORE pClause = pLitClause[litWasTrue]; pClause = pLitClause[litWasTrue]; Q C design flow exploits this insight by varying the required for (j=0;j<aNumLitOcc[litWasTrue];j++) { for (j=0;j<aNumLitOcc[litWasTrue];j++) { number of QSCORES as well as their computational power (i.e., the number of distinct computations they can support) based on the rel- aNumTrueLit[*pClause]--; ative importance of the applications and the area budget available to if (aNumTrueLit[*pClause]==0) { if (aNumTrueLit[*pClause]==1) { optimize them. This enables QSCORES to significantly reduce the aFalseList[iNumFalse] = *pClause; total area and the number of specialized processors compared to a aFalseListPos[*pClause] = iNumFalse++; UpdateChange(iFlipCandidate); fully-specialized logic approach supporting the same functionality. aVarScore[iFlipCandidate]--; In this paper, we address many of the challenges involved in pLit = pClauseLits[*pClause]; pLit = pClauseLits[*pClause]; designing a QSCORE-enabled system. The first challenge lies in for (k=0;k<aClauseLen[*pClause];k++) { for (k=0;k<aClauseLen[*pClause];k++) { identifying similar code patterns across a wide range of general- iVar = GetVarFromLit(*pLit); iVar = GetVarFromLit(*pLit); purpose applications. The hotspots of a typical general-purpose UpdateChange(iVar); UpdateLookAhead(iVar,-1); application tend to be many hundreds of instructions long and con- aVarScore[iVar]--; tain complex control-flow and irregular memory access patterns, pLit++;} pLit++;} making the task of finding similarity both computationally inten- } } if (aNumTrueLit[*pClause]==1) { if (aNumTrueLit[*pClause]==2) { sive and algorithmically challenging. The second challenge lies pLit = pClauseLits[*pClause]; pLit = pClauseLits[*pClause]; for (k=0;k<aClauseLen[*pClause];k++) { for (k=0;k<aClauseLen[*pClause];k++) { in exploiting these similar code patterns to reduce hardware re- if (IsLitTrue(*pLit)) { if (IsLitTrue(*pLit)) { dundancy across specialized cores by designing generalized code- iVar = GetVarFromLit(*pLit); iVar = GetVarFromLit(*pLit); structures that can execute all these similar code patterns. The third UpdateChange(iVar); if (iVar != iLookVar) { challenge involves making the right area-energy tradeoffs to ensure aVarScore[iVar]++; UpdateLookAhead(iVar,+1); aCritSat[*pClause] = iVar; that the QSCORES fit within the area budget, while minimally im- pacting their energy efficiency. Addressing this challenge requires break; break; pLit++;}}} pLit++;}}} finding efficient heuristics that approximate an exhaustive search of pClause++; pClause++; the design space but avoid the exponential cost of that search. The } } final challenge involves modifying the application code/binary ap- propriately to enable the applications to offload computations onto the QSCORES at runtime. Figure 1: SAT Solver hotspot code comparison These two func- We evaluate our toolchain by designing QSCORES for the op- tions in the UBC SAT Solver tool [24] share significant code pat- erator functions — find, insert, delete — of commonly used terns (highlighted). data structures, including a linked-list, binary tree, AA tree, and hash table. Our results show that designing just four QSCORES application’s characteristics [19, 25, 28]. Our work differs from can support all these data structure operations and provide 13.5 these approaches in that it does not center on the customization energy savings over a general-purpose processor. On a more di-⇥ of pre-engineered components. Instead, our work