Imposing Coarse-Grained Reconfiguration to General Purpose Processors

M. Duric*, M. Stanic*, I. Ratkovic*, O. Palomar*†, O. Unsal*, A. Cristal*†‡, M. Valero*†, A. Smith§
*Barcelona Supercomputing Center, {first.last}@bsc.es   †Universitat Politècnica de Catalunya   ‡IIIA-CSIC   §Microsoft Research, [email protected]

Abstract—Mobile devices execute applications with diverse compute and performance demands. This paper proposes a general purpose processor that adapts the underlying hardware to a given workload. Existing mobile processors need to utilize more complex heterogeneous substrates to deliver the demanded performance. They incorporate different cores and specialized accelerators. On the contrary, our processor utilizes only modest homogeneous cores and dynamically provides an execution substrate suitable to accelerate a particular workload. Instead of incorporating accelerators, the processor reconfigures one or more cores into accelerators on-the-fly. It improves performance with minimal hardware additions. The accelerators are made of general purpose ALUs reconfigured into a compute fabric and the general purpose pipeline that streams data through the fabric. To enable reconfiguration of the ALUs into the fabric, the floorplan of a 4-core processor is changed to place the ALUs in close proximity on the chip. A configurable switched network is added to couple and dynamically reconfigure the ALUs to perform the computation of frequently repeated regions, instead of executing general purpose instructions. Through this reconfiguration, the mobile processor specializes its substrate for a given workload and maximizes the performance of the existing resources. Our results show that reconfiguration accelerates a set of selected compute intensive workloads by 1.56x, 2.39x and 3.51x when configuring the accelerator out of 1, 2 or 4 cores respectively.

Keywords-reconfigurable computing, dynamic processors

I. INTRODUCTION

Mobile devices are becoming ubiquitous and the mobile market segment is starting to dominate the computer industry. The competitive market requires constant improvement in the quality of these devices. Modern mobiles enable new compelling user experiences such as speech and gesture interfaces. Mobile applications manage an ever-increasing amount of data such as photos, videos, and a world of content available in the cloud. On such inputs, the applications repeat their computation many times and offer potential for acceleration. To harness this potential, recent mobile processors have shifted toward heterogeneity.

Heterogeneous architectures [1], [2] integrate cores with different power and performance characteristics, GPUs and specialized accelerators in a single chip. These architectures are more complex, but yield extra performance by choosing the appropriate unit for each workload, depending on the workload characteristics and particular needs.

Accelerators usually increase performance by using fixed hardware structures that match the computation of hot code regions. Fixed-function hardware limits the applicability of accelerators. Hence, multiple accelerators are typically incorporated in the design. They incur additional area overheads, as well as increase the design and verification cost of the mobile processors. Minimizing their costs directly affects the final cost of the mobile processor. To moderate the costs and mitigate the limited applicability, various designs propose reconfigurable compute accelerators [3], [4], [5], [6], [7], [8], [9], [10]. They increase performance over different workloads, but require including a remarkable amount of compute resources to enable specialized acceleration for each workload. To avoid such an addition of resources in mobile processors, we propose reconfiguration of the general purpose cores into accelerators.

This work contributes a novel reconfigurable and yet low cost chip multiprocessor (CMP) which: 1) maximizes the computational capabilities of the existing general purpose cores; 2) minimizes the amount of extra hardware for accelerators.1 The processor is made of four composable lightweight cores [12], [13]. Composability avoids static placement of the resources and allows a workload to dynamically optimize its execution substrate by composing one or more cores into a large processor. Our approach goes one step further than composability and reconfigures one or more composable cores into accelerators. The accelerators consist of the cores' ALUs reconfigured into a compute fabric and the cores' pipeline that streams data through the fabric. The fabric resembles the power-efficient computing substrate introduced by [8]. However, instead of integrating such a substrate, our fabric places all the ALUs available in the processor at the center of the chip and connects them with a configurable circuit-switched network. The network permits the fabric to be configured to perform various commonly repeated computations. The pipeline of the accelerator can be configured as well, by composing the resources of one or more cores around the fabric to tune the memory capabilities of the accelerator and further increase its performance.

This work has been partially funded by the Spanish Government (TIN2012-34557), the European Research Council under the EU's 7th FP (FP/2007-2013) / ERC GA n. 321253, and Microsoft Research.
1 A sketch of this idea [11] with a very basic implementation and simplified evaluation showed promising results.

II. BACKGROUND AND RELATED WORK

We build our design on composable cores based on an Explicit Data Graph Execution (EDGE) architecture [14], [15]. The EDGE architecture is an efficient research vehicle for low power mobile computing. Compiler analysis is used to divide a program into blocks of instructions that execute atomically (Atomic Instruction Block or AIB). An AIB consists of a sequence of dataflow instructions. EDGE compilers statically generate the dataflow and encode it in the EDGE ISA. The encoded dataflow defines producer-consumer relationships between EDGE instructions and avoids power hungry out-of-order hardware. Instructions inside the block communicate directly. Each instruction leverages two reservation stations, which hold the left and right operands respectively. Producer instructions encode targets that route their outputs to the appropriate reservation stations of consumer instructions. Register operations are used only for handling less-frequent inter-block communication, by keeping the temporary results between the AIBs. Composing EDGE cores increases performance and efficiency [12], [13]. Rather than fixing the size of cores at design time, one or more lightweight physical cores can be composed at runtime to form a larger, wider-issue logical core by using an on-chip routed network between the cores. Reconfiguration of composable cores is an advanced dynamic feature that adds a circuit-switched network between the ALUs of composable cores to specialize one or more cores into an accelerator and further increase their performance and efficiency.

There is previous work that proposes to reconfigure existing resources of a processor into some kind of accelerator. In the context of EDGE architectures, vector execution on EDGE cores [16], [17] dynamically repurposes the general purpose cores into a vector processor. The vector processor executes vector AIBs, which allocate the existing compute resources and repeat the computation by streaming the values of large vectors. A vector memory unit is incorporated to decouple vector memory accesses from computation and arrange the vector values in an efficient manner. The vector processor customizes the memory resources for vector access patterns, while the reconfigurable cores proposed in this paper customize the compute resources for frequently executed computations. These two dynamic approaches have different trade-offs and it may be interesting to have both features on a single chip, applying them selectively depending on the workload or even combining them. For example, when the reconfigured cores compute values with regular access patterns, the vector memory unit may be used to efficiently arrange data and increase performance.

Various reconfigurable compute accelerators [18], [3], [5], [4], [6], [7], [19], [8], [9], [10] have been proposed to accelerate compute intensive code regions. They are added to the baseline processor to enable compute acceleration. The accelerators are based on coarse grained reconfigurable architectures, which incorporate a set of configurable data processing units. Each unit performs one compute task of the parallel workload (e.g. matrix arithmetic, signal or image processing), while data passes from one unit to another in a pipeline. The units are coupled by using a configurable interconnect in a grid-like compute fabric. Such a fabric is integrated into a processor pipeline like a back-end processor. The pipeline arranges the compute data, while the fabric computes the data. The fabric is configured by mapping compute instructions to it. It improves performance by "spatially" executing multiple compute instructions in different stages of the fabric. The fabric also improves efficiency by repeating the once configured computation many times, which avoids per compute instruction fetch and decode overheads [19]. Instead of integrating such a fabric into a lightweight mobile processor, we propose reconfiguration of the processor compute resources into a fabric. Our fabric is designed to resemble the previously proposed computing substrates, their performance and efficiency, while avoiding as much as possible of their area overheads.

III. RECONFIGURATION OF CORES

The reconfiguration of composable cores extends their capabilities beyond general purpose processing. Bulk resources can be allocated and used either as general purpose processors or accelerators. Each application running on this processor dynamically tunes the amount of allocated resources and their configuration to achieve the desired performance and efficiency. For example, one application may be executed by using one or more cores of the CMP for general purpose processing and another one accelerated by using the cores reconfigured into an accelerator.

Reconfiguration makes an accelerator of one or more general purpose cores. The accelerator is composed of the general purpose ALUs reconfigured to perform like a compute fabric and the general purpose pipeline that streams compute data through the fabric. The reconfiguration feature is implemented on a 4-core CMP. The floorplan design of the CMP is slightly modified to provide a suitable execution substrate for the fabric. The original design assumes 4 identical tiles, one per core. The new floorplan places the existing ALUs of each core close to each other by shifting them from their original location to the appropriate corner, as shown in Figure 1. A configurable circuit-switched network is added to connect all the ALUs into the fabric. Although in this work we focus on a 4-core CMP, an even larger compute fabric can be created by extending the switched network to connect ALUs in a CMP that incorporates more than four cores. This network extension would include longer metal interconnects and repeaters positioned along the network to enable connections between ALUs on non-neighbor cores. With such a network, the resources of the fabric scale with the number of cores on a chip.
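For illustration only, the following C sketch shows how an application might request the two uses of the bulk resources described in this section: general purpose execution on composed cores and accelerated execution on reconfigured cores. The call names (compose_cores, reconfigure_accelerator, release_substrate) are hypothetical; the design only specifies that composition and reconfiguration are requested through runtime calls backed by memory-mapped system registers (Section VII).

    /* Hypothetical runtime interface; the real system exposes composition and
     * reconfiguration through memory-mapped system registers (Section VII). */
    #include <stdio.h>

    typedef int substrate_t;

    /* Assumed helpers, not part of the paper's API. */
    static substrate_t compose_cores(int n_cores) {
        printf("composing %d core(s) into one logical processor\n", n_cores);
        return n_cores;                      /* handle to the logical processor */
    }
    static substrate_t reconfigure_accelerator(int n_cores) {
        printf("reconfiguring the ALUs of %d core(s) into a compute fabric\n", n_cores);
        return -n_cores;                     /* handle to the accelerator */
    }
    static void release_substrate(substrate_t s) { (void)s; }

    int main(void) {
        /* A control/memory intensive phase runs on a 2-core logical processor. */
        substrate_t cpu = compose_cores(2);
        /* ... general purpose phase ... */
        release_substrate(cpu);

        /* A frequently repeated compute region runs on a 4-core accelerator:
         * the cores' ALUs form the fabric, the pipeline streams data through it. */
        substrate_t acc = reconfigure_accelerator(4);
        /* ... accelerated phase ... */
        release_substrate(acc);
        return 0;
    }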

Figure 1. The dynamic reconfiguration of a 4-core CMP into accelerators. The accelerators are composed of the CMP’s FUs connected into a compute fabric and one or more cores that stream data through the fabric. The fabric adds a circuit switched network to couple the FUs in 4 neighboring cores.

The compute fabric is utilized in the pipeline to execute groups of commonly repeated compute instructions. The instructions are mapped to the fabric, while configuring the appropriate datapath among the ALUs. The configured datapath eliminates passing the data between the ALUs via the register file. Once configured, the fabric can receive and process new data without reconfiguration. If the amount of data is large, the fabric significantly increases efficiency by eliminating per instruction overheads such as fetch, decode and register file access [19]. To hide propagation delay in the fabric, the accelerators execute multiple iterations of the configured computation in a pipelined fashion; different stages of the fabric simultaneously compute different values, while the network controls the progress of data through the connected ALUs.

The fabric is easily integrated into the pipeline of the existing general purpose cores. The accelerators use the pipeline to execute memory instructions as well as to send and receive data processed by the fabric. Since the fabric performs most of the computation, the pipeline mainly executes non-compute instructions, which makes memory processing more efficient. Composing cores further increases the amount of resources available for memory processing and allows for issuing more in-flight memory requests. Figure 1 shows examples of reconfigured accelerators, which compose one or two cores to stream data through the fabric.

IV. COMPUTE FABRIC DESIGN

The compute fabric (Figure 1) is made by reconfiguring the ALUs of a CMP. Each ALU supports various operations by using a set of functional units (FUs)2 and control logic that selects which one is dispatched to perform the specific operation. At most a single FU per ALU is active simultaneously. The ALUs in one core are connected with a shared bypass network. The network enables the ALUs to operate conventionally, while bypassing the data between the ALUs and a shared register file. To support reconfiguration, we add an extra configurable circuit-switched network that connects all the FUs of a 4-core CMP and enables their reconfiguration into the compute fabric. In the fabric, one FU performs one compute instruction and the network passes data between FUs that perform dependent instructions. It permits using multiple FUs per ALU simultaneously, as opposed to conventional operation dispatch.

2 In this work, functional units refer to ALU sub-units such as shifter or multiplier.

The network is based on a lightweight 2D mesh of nodes, as shown in Figure 1. Each node of the mesh may be: a) an extended FU, b) a configurable switch, or c) a fifo queue that holds input or output values. Each FU extends the compute logic by incorporating two data registers and two status registers. The data registers hold the values that are input operands of the operation to be computed by the FU. Each status register indicates whether the content of the associated data register is valid or not. Each FU is connected to four neighboring switches, from where it receives its inputs and sends its outputs. The switches are connected to fifo queues, FUs or other switches. The switches include a crossbar, an associated configuration memory, and data and status registers. The crossbar provides one or more outputs by selecting from one or more inputs according to its configuration bits, which are stored in the configuration memory.
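As a concrete reading of the switch description above, the following C sketch models one switch node with a per-output selection entry in its configuration memory and data/status registers on its ports. It is a minimal sketch under our own naming and field widths, not the actual hardware structure.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SWITCH_PORTS 4   /* links to neighboring FUs, switches or fifo queues */

    typedef struct {
        uint8_t  select[SWITCH_PORTS];  /* configuration memory: input chosen per output */
        uint32_t data[SWITCH_PORTS];    /* data registers of the node */
        bool     valid[SWITCH_PORTS];   /* status registers of the node */
    } switch_node;

    /* Forward a value to output 'out' if its configured input is valid and the
     * downstream register is free. Returns true when a value moves this cycle. */
    static bool switch_forward(switch_node *s, int out,
                               uint32_t *down_data, bool *down_valid) {
        int in = s->select[out];
        if (!s->valid[in] || *down_valid)
            return false;                 /* nothing ready, or the next node is busy */
        *down_data  = s->data[in];        /* crossbar selects the configured input */
        *down_valid = true;
        s->valid[in] = false;             /* input register freed for the next value */
        return true;
    }

    int main(void) {
        switch_node s = { {2, 0, 1, 3}, {10, 20, 30, 40}, {false, false, true, false} };
        uint32_t down = 0; bool busy = false;
        if (switch_forward(&s, 0, &down, &busy))   /* output 0 is configured to take input 2 */
            printf("forwarded %u\n", down);        /* prints 30 */
        return 0;
    }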

The data registers hold the data that passes through a switch, before the switch is able to forward it to the next node. The switch is configured by writing configuration bits into the configuration memory. The configuration bits are transmitted by reusing the network, like in [8]. When configured, the switches provide the specialized datapath between the fifo queues and the various FUs.

Computation on the datapath may be driven either statically or dynamically. Statically driven computation uses a schedule provided by a specialized compiler to issue the operations. The compiler calculates the delay of each operation and generates the static execution schedule for the fabric. Input values are usually injected into the fabric at once and results are available after a known delay provided by the compiler. Such a scenario simplifies the design of the network, which requires nothing beyond connections and routing control. On the other hand, dynamically driven computation requires extra data and status registers in the network nodes to dynamically arrange the execution schedule. The schedule is arranged according to the dataflow control of values arriving at the FUs. An FU performs an operation when all its inputs are indicated as ready and the next node on the path is free to receive a new value. Dataflow control enables performing the computation in a pipelined fashion simply by streaming ready values through the fabric. On the other hand, statically driven computation requires complex static scheduling to pipeline the computation. The static scheduling may be jeopardized by events such as cache misses, which result in variable latencies that are hard to predict at compile time. This resembles the choice between in-order and out-of-order instruction issue. In-order issue logic is much simpler than out-of-order, but it relies heavily on the compiler to produce highly optimized schedules; this is especially complex due to variable latency instructions (e.g. loads that miss in the cache). Unlike an out-of-order implementation, in our case the complexity of the additional hardware required for dynamically driven computation is small. For these reasons, in this work we chose dynamically driven computation. It provides a more flexible substrate with small hardware additions and does not require complex compiler support to generate the static execution schedule.
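The dataflow firing rule of the dynamically driven computation can be summarized with the following C sketch. It assumes the two data/status register pairs per FU described in Section IV and reduces the configured operation and the downstream hand-off to a single add; all names are ours.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint32_t data[2];   /* left and right operand registers */
        bool     valid[2];  /* status registers: operand present? */
    } fu_node;

    /* Fire when both operands are valid and the next node can accept a value. */
    static bool fu_try_fire(fu_node *fu, uint32_t *next_data, bool *next_valid) {
        if (!(fu->valid[0] && fu->valid[1]) || *next_valid)
            return false;                        /* keep waiting: a bubble this cycle */
        *next_data  = fu->data[0] + fu->data[1]; /* the configured operation */
        *next_valid = true;
        fu->valid[0] = fu->valid[1] = false;     /* registers freed for the next iteration */
        return true;
    }

    int main(void) {
        fu_node fu = { {3, 4}, {true, true} };
        uint32_t out = 0; bool out_valid = false;
        if (fu_try_fire(&fu, &out, &out_valid))
            printf("fired: %u\n", out);          /* prints 7 */
        return 0;
    }

When an FU does not fire, the cycle is a bubble; as described next, such bubbles can be filled with general purpose compute instructions.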
By configuring the datapath, the fabric allocates a set of the FUs available on the CMP, possibly leaving a number of them off the datapath. The dynamically driven computation uses the allocated FUs only when a FU has ready input values. Hence, the fabric may not utilize all the allocated FUs every cycle. There is a bubble in each cycle in which a FU, allocated or not, is not occupied. On the other hand, an accelerated workload may still use some general purpose compute instructions, which are not mapped to the compute fabric (e.g. instructions that reduce results generated by the fabric). These instructions may "fill the bubbles" and utilize the FUs during such cycles, thus increasing the utilization of the FUs. To achieve this, the wakeup and select logic is extended to check if there is a bubble and issue a compute instruction if the bubble exists. The extended logic interacts with the dynamically controlled dataflow in the fabric to check if a particular FU is unoccupied. When a general purpose compute instruction is issued into an unoccupied FU, the fabric is not affected nor are any structural hazards created. This extension to the processor logic enables the FUs to operate in a hybrid mode, by performing general purpose instructions and frequent computations through the fabric at once. The hybrid mode increases the FU utilization and provides a more flexible compute substrate.

V. COMPUTE FABRIC IN PIPELINE

The fabric is integrated into the existing configurable pipeline of composable cores as shown in Figure 2. In a pipeline that composes one or more cores, the fabric performs like a deeply pipelined functional unit. The input/output fifo queues of the fabric facilitate its integration into the general purpose pipeline. Each queue utilizes one dedicated bank of entries and, by using input or output buses, it is connected to each composable core of the pipeline. The pipeline executes AIBs of EDGE instructions. The AIBs include instructions that send data to the queues or receive it from them (send/receive instructions). Send/receive instructions pass the data between the pipeline and the fabric by accessing the fifo queues. Before the fabric is utilized, it has to be configured to perform a particular computation.

A. Fabric Configuration

The fabric is configured by executing a number of configuration instructions, which occupy one or more AIBs. Each configuration instruction has an immediate field with the identifier of a node in the switched network and the data used to configure the node. Rather than broadcasting this data through the network until it reaches its destination node, the data is sent to a fifo queue connected to the row or column where the destination node resides. From the fifo queue, the data is inserted into the associated node and each node forwards the data to the next node, until the data reaches its destination. North nodes forward the data to south nodes and west nodes forward the data to east nodes. By configuring various switched nodes, the fabric forms a specialized datapath that connects a heterogeneous set of FUs that perform a particular computation (Figure 3). The number of blocks used for the configuration defines the configuration overhead, which is analyzed in Section VII. If the entire application exploits a single configuration of the fabric, the configuration overhead does not affect performance remarkably. On the other hand, if the application exploits different configurations, the configuration overheads may reduce the performance improvements.
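To make the configuration path concrete, the sketch below follows one configuration word from the fifo queue of its destination column to its destination node, using the north-to-south forwarding rule described above. The word layout and node numbering are illustrative assumptions, not the actual instruction encoding.

    #include <stdint.h>
    #include <stdio.h>

    #define ROWS 5
    #define COLS 11

    typedef struct {
        uint8_t  row, col;   /* identifier of the destination node */
        uint32_t bits;       /* configuration bits for that node   */
    } config_word;

    /* Inject a word at the fifo of the destination column (row 0) and let each
     * node forward it southwards until the destination row is reached. */
    static void configure_node(uint32_t config_mem[ROWS][COLS], config_word w) {
        for (uint8_t r = 0; r < ROWS; r++) {
            if (r == w.row) {                 /* word arrived at its destination */
                config_mem[r][w.col] = w.bits;
                return;
            }
            /* otherwise the node at (r, w.col) forwards the word to the south */
        }
    }

    int main(void) {
        static uint32_t config_mem[ROWS][COLS];
        config_word w = { 3, 7, 0x5u };       /* hypothetical node (3,7), bits 0x5 */
        configure_node(config_mem, w);
        printf("node(3,7) configured with %u\n", config_mem[3][7]);
        return 0;
    }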


Figure 2. The compute fabric integrated into the general purpose pipeline that is composed of two general purpose cores.

For such cases, an extra memory (RAM) could be used to keep various configurations. By storing the configuration data into a specialized memory that instantly sends the data to the fabric, the configuration overhead is reduced. Instead of executing one or more blocks with configuration instructions, the memory instantly sends the configuration data to the fabric. The memory could also enable the sharing of fabric configurations (e.g. FFT, used both in graphics and speech recognition). In our experiments we do not include such a memory, but we measure its potential impact.

B. Fabric Utilization

The fabric is utilized within the pipeline of composable cores, which executes AIBs appended with send/receive instructions (Figure 3). AIBs are composed of EDGE instructions, which explicitly encode producer-consumer relationships. Send instructions act like producers by forwarding input values into the fabric. The targets of send instructions route their values to the appropriate fifo queues. The values are provided by general purpose memory or register-read instructions. Receive instructions act like producers as well, by forwarding output values from the fifo queues encoded in the instructions to their specified consumers.

We rely on the general purpose pipeline to access memory and arrange input and output values. Such a design can accelerate compute regions with any kind of memory access pattern: regular, irregular or a combination of them. To efficiently arrange the operand values and maintain the correctness of the targeted application, the pipeline leverages the sophisticated features typically found in memory subsystems, such as LSQs, low-latency caches and prefetch mechanisms. Core composition tunes the pipeline resources (e.g. data cache size) on-the-fly to further increase the efficiency of memory accesses, which is evaluated in Section VII.

After the value has been loaded by the general purpose pipeline, a send instruction is used to insert the value into the fabric. Similarly, a receive instruction is used to collect the output value produced by the fabric, which is later stored to memory. Alternatively, the value can be read from/written to a general purpose register. Each send/receive instruction is associated with a fifo queue which is encoded in the instruction. The instruction requires a free entry in the queue to perform its send or receive operation. If there are no free entries in the associated queue, the core stalls the execution of the instruction due to the structural hazard. When the AIB commits, the entry is released and can be used by the next send/receive instruction. It is possible to further improve the performance of memory accesses by integrating a programmable memory controller [24] to arrange values according to the access patterns, when known at compile time. Due to the additional complexity of implementing such a memory controller, we do not evaluate it in this work.

The send/receive instructions included in the AIB control the entire iteration(s) of the configured computation. The AIB sends all the values to be computed and receives all the results. To increase performance, the AIBs with limited communication with the fabric (e.g. compute intensive regions) can apply loop unrolling to control multiple iterations of the computation. The send/receive instructions in the AIB are replicated per each iteration of the loop. An identifier specifies the iteration and the ordering of the send/receive requests in the fifo queues. The iteration identifier implies the same semantics with respect to the ordering of send/receive requests as load-store identifiers do for the ordering of memory instructions. While executing a given send or receive instruction, its iteration number is used to access the appropriate entry in the fifo queue and maintain the correctness of the computation.
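The following host-side C sketch illustrates how a compute region can be unrolled so that one AIB controls two iterations, using the _sendf/_receivef intrinsics of the example in Figure 3. The software fifo arrays and the fabric_step() stand-in only emulate the hardware so that the sketch runs standalone; in the real design an iteration identifier orders the requests in the queues, while the emulation simply relies on fifo order.

    #include <stdio.h>

    #define QUEUES  4
    #define ENTRIES 16
    static float q[QUEUES][ENTRIES];
    static int   wr[QUEUES], rd[QUEUES];

    /* Host stand-ins for the send/receive intrinsics of Figure 3. */
    static void  _sendf(int f, float v) { q[f][wr[f]++] = v; }
    static float _receivef(int f)       { return q[f][rd[f]++]; }

    /* Stand-in for a fabric configured as res = in0 * in1 (queue 2 holds results). */
    static void fabric_step(void) {
        while (rd[0] < wr[0] && rd[1] < wr[1])
            q[2][wr[2]++] = _receivef(0) * _receivef(1);
    }

    int main(void) {
        float src[4] = {1, 2, 3, 4}, res[4], w = 0.5f;
        /* The modified region unrolled by two: the send/receive intrinsics are
         * replicated per iteration; queue order stands in for the iteration id. */
        for (int i = 0; i < 4; i += 2) {
            _sendf(0, w); _sendf(1, src[i]);      /* iteration i   */
            _sendf(0, w); _sendf(1, src[i + 1]);  /* iteration i+1 */
            fabric_step();                        /* in hardware this overlaps the sends */
            res[i]     = _receivef(2);
            res[i + 1] = _receivef(2);
        }
        for (int i = 0; i < 4; i++) printf("%.1f ", res[i]);   /* 0.5 1.0 1.5 2.0 */
        printf("\n");
        return 0;
    }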
If the configured computation has a long latency, software pipelining may be applied to relieve the problem. In this optimization, multiple AIBs are used. The first AIB only sends the data to be received by the next block.



Each of the following AIBs then sends data to be received by the next AIBs and receives the computed results of the data sent by previous AIBs to avoid long delays. The last AIB only receives the results of the data sent by previous AIBs. We evaluate the impact of these two optimizations in Section VII.

C. Example of Fabric Utilization

An example of the compute code acceleration by using the CMP's execution resources reconfigured into the compute fabric is shown in Figure 3. The example contains: 1) the original code of a compute intensive loop that executes on a general purpose processor; 2) the modified code of the loop that executes on the accelerator composed of the general purpose pipeline and the compute fabric; 3) a simplified diagram of the fabric configured to perform the computation of the loop. The original code contains compute, memory and register instructions, which in each iteration load two input variables, read two input constants, compute them and store two output results. The modified code avoids all of the compute instructions inside the loop, while the remaining instructions only perform the bookkeeping operations that iterate the loop. The modified code, in addition to removing the compute instructions, appends send/receive instructions, which pass the input/output data to/from the fabric. In this example, the send/receive instructions are written by using send/receive intrinsics, where each intrinsic specifies the associated fifo queue and the operand location per send/receive instruction. The shown fabric is composed of six functional units (one unit per compute operation inside the loop) and eight switches that enable a specialized datapath from four input to two output data buffers.

Figure 3. An example of the compute code acceleration by utilizing the simplified fabric of the CMP's ALUs. The fabric is configured to perform the computation of the code and the code is modified to send input data to the fabric and receive output data from the fabric.

1) C CODE COMPUTE REGION:
    for (int i = 0; i < num; i++)
    {
        res0[i] = wx*src[i] - wy*src[i+1];
        res1[i] = wy*src[i] + wx*src[i+1];
    }

2) MODIFIED COMPUTE REGION:
    for (int i = 0; i < num; i++)
    {
        _sendf(0, wx);          // IN0
        _sendf(1, src[i+1]);    // IN1
        _sendf(2, src[i]);      // IN2
        _sendf(3, wy);          // IN3
        res0[i] = _receivef(0); // OUT0
        res1[i] = _receivef(1); // OUT1
    }

3) CONFIGURED FABRIC: the input queues IN0-IN3 feed four MUL units through the configured switches; their products feed one SUB and one ADD unit, whose results drive the output queues OUT0 and OUT1.

VI. SPECULATIVE COMPUTATION IN FABRIC

Managing values that pass through the reconfigurable compute fabric is quite challenging in conventional out-of-order processors. It is important to keep the original order in which values should enter the fabric, maintain memory correctness and support the speculative computation necessary to increase the performance of out-of-order processors. EDGE architectures provide a more advantageous substrate to integrate the fabric, because they execute atomic blocks of instructions. Block atomicity facilitates arranging speculative values for the fabric and enables integrating the fabric with modest hardware additions, such as the fifo queues.

Each AIB that arranges the values sends all the inputs and receives all the outputs for one or more iterations of the configured computation. Due to atomicity, the AIB either 1) commits all its results to registers or memory, or 2) squashes (discards) all the results. The fabric consumes a value from a fifo queue as soon as it is ready. Speculative values are computed by the fabric and the results are stored to the fifo queues, which hold the outputs of the computation. The AIB may receive the results, but if the AIB squashes, all the results are discarded. If the block squashes before sending all the values required to perform the computation, the dynamically driven computation in the fabric could stall waiting for an input value that will never be sent. To avoid this, an AIB that squashes marks the allocated entries in the fifo queues as invalid-but-ready. This way, the fabric can always finish its computation and eventually discard the results if the AIB is being squashed. Once the computation is finished and the AIB committed or squashed, the allocated entries in the fifo queues are released to be used for the next computation. This mechanism enables the composable cores to simultaneously execute one or more speculative AIBs, which leverage the fabric for speculative computation.

A more complex hardware would be required to support the software pipelining optimization introduced in Section V and speculative computation simultaneously. In software pipelining, the AIBs send values to be received by the following AIBs and receive values sent by previous AIBs. Speculative computation may provide values which need to be discarded, including the case in which the next AIB, which will use such values, has not started yet. Since in this case there is no AIB to mark the corresponding fifo queue entries as invalid, the fabric would require additional hardware to discard these results. Such a design seems to be excessively complex and we do not incorporate it in this work. Instead, the workloads optimized with software pipelining do not leverage speculative computation. The fabric consumes the values when the AIBs commit. This makes all the computation in the fabric non-speculative and the results are never discarded. The results may be received by an AIB that squashes, but the fifo queue entries that hold the results are released only when the AIB commits. This guarantees the correctness of the software pipelined accelerations. In this case, the accelerators do not leverage speculative computations, but they avoid idling of AIBs which wait for results with long delays. It is a trade-off between two optimizations and each one of them requires no complex hardware additions.
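As an illustration of the squash handling described earlier in this section, the following C sketch models fifo entries with ready and invalid flags and shows how the entries of a squashed AIB are marked invalid-but-ready so the fabric can drain while the corresponding results are discarded. The structure and function names are ours, chosen only for this sketch.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        float value;
        bool  ready;    /* the fabric may consume this entry            */
        bool  invalid;  /* set on squash: consume, then drop the result */
    } fifo_entry;

    /* On a squash, entries the AIB had allocated but not yet filled are marked
     * ready-but-invalid so the configured dataflow can still drain. */
    static void squash_aib(fifo_entry *entries, int n) {
        for (int i = 0; i < n; i++)
            if (!entries[i].ready) { entries[i].ready = true; entries[i].invalid = true; }
    }

    /* The fabric consumes a ready entry; an invalid one still flows through the
     * datapath, but its result is discarded instead of being written back. */
    static void fabric_consume(const fifo_entry *e) {
        if (!e->ready) return;
        float result = e->value * 2.0f;          /* placeholder computation */
        if (e->invalid)
            printf("computed %.1f, discarded (squashed AIB)\n", result);
        else
            printf("computed %.1f, kept\n", result);
    }

    int main(void) {
        fifo_entry in[2] = { {3.0f, true, false}, {0.0f, false, false} };
        squash_aib(in, 2);                       /* the second entry was never sent */
        for (int i = 0; i < 2; i++) fabric_consume(&in[i]);
        return 0;
    }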

Table I. SIMULATOR CONFIGURATION

  Per core:
    ALUs [FUs]          2 [FMul, FAdd, IAdd, IMul, Logical]
    Instruction window  64 entry
    B. Predictor        tournament
    L/S Queue           32 entry unordered LSQ
    L1 I/D-cache        32 kB each, 1 cycle (hit), 8x MSHR
  4x CMP:
    L2 cache            4 banks x 512 KB, 15 cycles (hit), 8x MSHR
    DRAM                latency: 256 cycles
    On-chip network     1 cycle/hop, Manhattan routing distance
  Fabric:
    FUs                 4x10 (4 cores x 2 ALUs [5 FUs])
    Switches            5x11, 1 cycle/hop
    Fifos               30, 8 entries/fifo

Table II. COMPUTING WORKLOADS

  Name     Description                     Domain
  fft      Fast Fourier Transform          Signal Processing
  kmeans   K-Means Clustering              Data Mining
  mm       Dense Matrix-Matrix Multiply    Scientific Computing
  mriq     Magnetic Resonance Imaging - Q  Image Reconstruction
  nbody    N-Body Simulation               Physics Simulation
  spmv     Sparse-Matrix Vector Multiply   Scientific Computing
  stencil  3-D Stencil Operation           Fluid Dynamics

VII. EVALUATION

In this section we evaluate the reconfiguration feature proposed in this paper on a dynamic EDGE architecture. We use an in-house, detailed cycle-level simulator that models a composable 4-core EDGE processor with the parameters shown in Table I. To model the memory system properly, the simulator is coupled with the Ruby memory model (see http://www.m5sim.org/Ruby). The EDGE simulator dynamically composes cores through calls to the runtime system, via memory mapped system registers. To model the processor reconfiguration feature, we integrate an independent cycle-level compute fabric simulator [8] that simulates a fabric with the parameters shown in Table I. The fabric utilizes the existing processor FUs (4x10), adding extra switches (5x11) and fifo queues (30), in a similar fashion to the design shown in Figure 1. The fabric simulator is based on a configurable switching network that models dynamically driven computation. The network connects and dynamically customizes the subset of available FUs for a particular computation. For some workloads that require more FUs than what the processor already provides, the fabric simulator is configured to model the amount of available and extra FUs added to enable acceleration. The processor and fabric simulators are connected over the banked fifo data queues of the fabric, by using connection buses between each core of the processor simulator and the fifo queues of the fabric simulator. The fifo queues are accessed by executing send/receive instructions in the processor. In the rest of this section, we characterize the acceleration parameters of various workloads and show the performance benefits of processor reconfiguration.

A. Accelerated Workloads Characterization

To perform the evaluation of the reconfiguration feature, we select seven computing workloads shown in Table II. The workloads are selected from various benchmark suites [20], [21], [22], [23]. Each workload utilizes a different compute algorithm and represents a commonly used algorithm in emerging applications, which are used in various segments of the computer industry, including mobiles.

Table III shows configuration parameters, overhead and the percentage of code accelerated. The most frequently executed compute regions of each workload are selected to run in the accelerator; the rest is executed in the general purpose pipeline. The selected regions are compiled with an in-house compiler [8] to provide the configuration bits for the fabric. The configuration bits are written by a set of configuration instructions, placed in one or more AIBs. By executing these AIBs, each workload configures the fabric to match the computation of its compute algorithm. The configuration overhead is shown in instructions, AIBs and cycles, and it differs depending on whether the configuration bits are cached. Once the fabric is configured, it allocates various nodes of the network to perform an entire computation. The configuration requirements of each workload are presented in Table III. Since the workloads perform floating-point computation, the fabric mostly allocates floating-point FUs. The most commonly used FUs are multipliers and adders. A floating-point adder performs two operations: add and subtract. Since it performs at most one operation at a time, the number of allocated adders is presented separately for each operation. The number of allocated multipliers goes up to 8, which is the amount provided by leveraging the existing FUs on 4 dual-issue cores. On the other hand, the number of allocated adders goes over 8 in two cases: 15 in kmeans and 12 in stencil. To accelerate these two "adder-hungry" workloads it is necessary to incorporate more FUs than what is provided by the reconfiguration of general purpose cores. Those workloads are examples where the fabric needs extra FUs to match the computation of a workload that intensively uses some operation. Please note that only 2 of the 7 selected workloads find it necessary to incorporate the modest set of extra FUs, while the others can accelerate their computations on the existing compute resources. To find out what the most preferable set of extra FUs may be, it would be required to comprehensively analyze mobile applications. Due to lack of space, we avoid that step in this work and, when necessary, we incorporate only the FUs which are required to accelerate the workload. The number of configured switches varies per workload and depends on the number of allocated FUs and their locations. The input and output switches of the configured fabric are connected to the input and output fifo queues.

Table III. ACCELERATED WORKLOADS CHARACTERISTICS

                            Configuration overhead             FUs required          FIFOs req.    Loop-unrolling
  Workload  % of code acc.  Instr.  AIBs  cycles  cyc.-cached  FMul FAdd FSub Or     Input Output  n iterations
  fft            76           277     3    3094      163         4    3    3   0       6     4          2
  kmeans         19           357     4    3775      209         8    7    8   0      16     1          2
  mm             82           269     3    3085      159         8    7    0   0      16     1          2
  mriq           64           269     3    3098      159         6    4    0   1       7     2          4
  nbody          81           281     3    3092      164         7    5    3   0       7     3          4
  spmv           20           277     3    3082      163         8    6    0   0      16     2          1
  stencil        85           317     3    3320      183         2   10    2   0      14     2          1

Figure 4. Reconfiguration performance and efficiency results: (a) reconfiguration speedup over the 1C-Composed processor for the 1C/2C/4C-Composed and +Reconf configurations; (b) reconfiguration reduction of overheads per instruction (FetchDecode, MemExe, ALUExe, ControlTransferExe, TotalExe).

The set of selected workloads uses up to 8 input fifo queues (kmeans, mm, spmv) and up to 4 output fifo queues (fft).

The compute regions of each workload are modified by hand to perform their computation in the configured fabric. We modify the code of the compute region by writing send and receive intrinsics. The workloads that use the fabric to perform a limited amount of computation per iteration are optimized using loop unrolling. The AIBs of the optimized workloads replicate send/receive intrinsics to send/receive the data of up to 4 compute iterations (mriq, nbody). The fabric does not need to be modified or reconfigured when changing the number of iterations unrolled per AIB.

We next report the speedup of the reconfiguration feature over general purpose computation on the composable cores and its efficiency due to reduced per instruction fetch and decode overheads, while executing the entire workloads (including the non-accelerated parts). We evaluate the effect of composing cores to scale the resources of general purpose processors and accelerators. Each of the workloads uses static inputs, but repeats its computation many times to warm up the caches. Repeated computation minimizes the reconfiguration overhead, since the fabric is configured once per workload. We show the benefits of two optimization techniques: loop unrolling and software pipelining. Loop unrolling is included by default in the rest of the experiments, while software pipelining is used only in the experiment that shows the benefits of the software pipelining optimization.

B. Performance Results

Figure 4a reports the performance improvements obtained by dynamically composing and reconfiguring cores. kC-Composed indicates that k cores are composed into one larger logical processor and +Reconf indicates that the cores are reconfigured into an accelerator. The 2-core composed processor outperforms the 1-core processor by 1.51 times on average. The 4-core composed processor scales the performance even further by achieving over 2.2x average speedup compared to the 1-core logical processor. Composing cores provides significant benefits for the selected workloads. This is due to their compute intensive algorithms, where abundant computation enables the enlarged substrate with more execution resources to simultaneously perform more computation and increase performance.

Reconfiguring the cores into accelerators increases performance even more by specializing the computing substrate. Each accelerator utilizes the fabric configured per workload and one or more cores composed to stream data through the fabric. Compared to the 1-core processor, the accelerators increase the average performance by 1.56x, 2.39x and 3.51x, when 1, 2 or 4 cores are used in the accelerator respectively. The results scale when the accelerators utilize more cores. More cores in the accelerator execute more speculative AIBs, which perform speculative computation and increase the utilization of the fabric, as explained in Section VI. The results of specialized computing on the accelerators scale in a similar way as the results of general purpose computing.

Figure 5. Performance benefits of various optimizations: speedup of the 1-core accelerator over the 1C-Composed processor with No-Optimization, Loop-Unrolling and Software-Pipelining.

With such similar scaling trends, the specialized computing always outperforms the general purpose computing with the same number of cores. The 1-, 2- and 4-core accelerators outperform the 1-, 2- and 4-core general purpose processors by 1.56x, 1.58x and 1.6x respectively.

The speedup of the accelerators over the general purpose processor varies per workload and goes from 1x (stencil) to 2.57x (mm), when using the 1-core configuration in each case. Stencil is an extreme workload that has a modest amount of computation over the stencil values, which are accessed following complex memory access patterns. Without extra memory optimizations [24], which are not analyzed in this work, the stencil workload does not obtain any benefit from specializing its relatively small computation. On the other hand, mm intensively computes sequentially accessed data of a dense matrix. The modified mm code for the accelerator removes most of the compute instructions. It enables an optimization of the memory accesses by refilling the AIBs with extra memory instructions. The modified mm performs much faster on the accelerator. The accelerator uses the pipeline mostly for memory processing and leverages the configured fabric to simultaneously perform multiple operations in different FUs. Some workloads (kmeans) implement a compute intensive algorithm, but include non-compute instructions (e.g. library functions such as malloc) in the frequently executed code region. In such a case, the accelerator improves performance, but the non-compute instructions (81%) limit the speedup. Other workloads (spmv) implement an algorithm that has a few different computations in a single loop. Although one iteration of the loop combines various computations, the accelerator may be configured for only one of them, because of the configuration overhead. The rest of the computation (80%) is performed using the general purpose pipeline, which limits the performance.

Figure 4b shows the reduction of instruction fetch, decode and execution overheads, comparing the 1-core accelerator with the 1-core processor. Additionally, the execution overhead is presented individually for: memory instructions; ALU instructions, including the instructions executed by the general purpose pipeline and the fabric; and control-transfer instructions, including branch, register read/write and the aforementioned send/receive instructions. The results show that the accelerator moderately reduces the number of executed memory and ALU instructions, but increases the number of executed control-transfer instructions by 2x. This happens mainly because of the code modifications. The workloads are modified by mapping the floating-point compute instructions to the fabric and refilling the AIBs with extra memory and send/receive instructions. The modified AIBs encode more extensive computation on the configured substrate. This may diminish the number of temporary results stored in memory and the bookkeeping complexity. In some workloads, the number of executed memory and ALU instructions is reduced (kmeans, mm, spmv), because they perform memory accesses and address calculations for temporary results. In other workloads, the number of bookkeeping ALU instructions is reduced (fft, nbody). The appended send/receive instructions in the AIBs incur an extra overhead in the execution of control and transfer instructions. In some cases the workloads execute over 3 times more control and transfer instructions (mriq). In total, the accelerator does not notably reduce the average number of total executed instructions, which is the expected result. On the other hand, the accelerator significantly reduces the instruction fetch and decode overheads. This happens because the accelerator maps the compute instructions to the fabric once, but executes them many times while avoiding the fetch and decode stages. The general purpose pipeline incorporated into the accelerator still performs the fetch and decode stages with the same amount of resources. But instead of processing the compute instructions, the pipeline fetches and decodes more memory and transfer instructions in the refilled AIBs. By including more memory instructions per AIB, the pipeline more efficiently saturates the available memory bandwidth, tolerates memory latency and increases the utilization of the ALUs connected into the fabric. By increasing the utilization of the ALUs, the accelerator effectively leverages the configured compute fabric and improves performance.

Figure 5 shows the benefits of the loop unrolling and software pipelining optimizations. We present the speedup of the 1-core accelerator over the 1-core general purpose processor. The general purpose computing in the processor uses loop unrolling, but not software pipelining. The results of the accelerator are presented by incrementally applying loop unrolling and software pipelining. Loop unrolling is applied to workloads that perform computation over an unsubstantial amount of per iteration input values (see Table III). Loop unrolling increases the size of the AIBs that arrange the compute values and reduces the inter-block communication overhead. Loop unrolling increases performance by about 40% (fft, mm, mriq and nbody benefit from it). Since the general purpose computing uses loop unrolling by default, the reconfiguration feature sometimes does not yield any extra speedup without applying the loop unrolling optimization

(mriq and nbody experience slowdown with "No-Optimization"). Software pipelining provides substantial additional performance when the workloads perform complex and long latency computations (mm, mriq, nbody). The AIBs avoid this latency by receiving the results of previous AIBs. Software pipelining is not applied to the general purpose computing, because the processor executes operations individually, unlike the compute fabric, which fuses multiple operations and produces results with long delays. Software pipelining may have a negative impact on performance when the computation is not extensively long or when an outer loop that repeats the computation is complex (fft). Since the software pipelining evaluated in this work does not require extra hardware support, it may be selectively applied only to the workloads which find it beneficial. By choosing the best optimization, performance increases by over 10%.

VIII. CONCLUSIONS AND FUTURE WORK

This paper presents a novel way to dynamically adapt the existing resources of a general purpose quad-core processor to diverse workload requirements. The processor reconfigures its available general purpose cores on-the-fly into accelerators of frequently executed compute regions. The general purpose cores efficiently execute control and memory intensive applications. We expect their reconfiguration to substantially accelerate compute intensive applications, when the computation repeats over a large amount of data. The results presented in this work are very promising. They show that reconfiguration yields a 56% average speedup on a set of selected compute workloads. Only two of the seven selected workloads require extra adder units to enable the processor to adapt its execution substrate to their computations. Future work will involve extending the design to more powerful 8- and 16-core reconfigurable processors, as well as estimating their energy and area savings in the domain of low power mobile devices.

REFERENCES

[1] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in MICRO'03, pp. 81-92.
[2] N. Brookwood, "AMD Fusion family of APUs: enabling a superior, immersive PC experience," White Paper, 2010.
[3] N. Clark et al., "Application-specific processing on a general-purpose core via transparent instruction set customization," in MICRO'04, pp. 30-40.
[4] ——, "VEAL: Virtualized execution accelerator for loops," in ISCA'08, pp. 389-400.
[5] M. Mishra et al., "Tartan: evaluating spatial computation for whole program execution," SIGOPS Oper. Syst. Rev., vol. 40, no. 5, pp. 163-174, Oct. 2006.
[6] S. Khawam et al., "The reconfigurable instruction cell array," IEEE Trans. VLSI Systems, vol. 16, no. 1, pp. 75-85, 2008.
[7] H. Park et al., "Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications," in MICRO'09.
[8] V. Govindaraju et al., "Dynamically specialized datapaths for energy efficient computing," in HPCA'11, pp. 503-514.
[9] ——, "DySER: unifying functionality and parallelism specialization for energy efficient computing," IEEE Micro, vol. 32, no. 5, pp. 38-51, 2012.
[10] A. Parashar et al., "Triggered instructions: a control paradigm for spatially-programmed architectures," in ISCA'13, pp. 142-153.
[11] M. Duric, O. Palomar, and A. Smith, "ReCompAc: Reconfigurable compute accelerator," in Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on, Dec 2013, pp. 1-4.
[12] C. Kim et al., "Composable lightweight processors," in MICRO'07.
[13] M. S. S. Govindan et al., "Scaling power and performance via processor composability," IEEE Transactions on Computers, March 2013.
[14] D. Burger et al., "Scaling to the end of silicon with EDGE architectures," IEEE Computer, vol. 37, no. 7, pp. 44-55, July 2004.
[15] A. Smith et al., "Compiling for EDGE architectures," in CGO'06, pp. 185-195.
[16] M. Duric et al., "EVX: Vector execution on low power EDGE cores," in DATE'14, March 2014, pp. 1-4.
[17] ——, "Dynamic-vector execution on a general purpose EDGE chip multiprocessor," in SAMOS'14.
[18] Mei et al., "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in FPL, 2003, pp. 61-70.
[19] S. Gupta et al., "Bundled execution of recurring traces for energy-efficient general purpose processing," in MICRO-44, 2011, pp. 12-23. [Online]. Available: http://doi.acm.org/10.1145/2155620.2155623
[20] S. C. Woo et al., "The SPLASH-2 programs: Characterization and methodological considerations," in ACM SIGARCH Computer Architecture News, vol. 23, no. 2, 1995, pp. 24-36.
[21] S. Che et al., "Rodinia: A benchmark suite for heterogeneous computing," in IISWC'09, pp. 44-54.
[22] D. P. Playne et al., "Benchmarking GPU devices with n-body simulations," in CDES'09.
[23] J. A. Stratton et al., "Parboil: A revised benchmark suite for scientific and commercial throughput computing," Univ. of Illinois at Urbana-Champaign, IMPACT-12-01.
[24] T. Hussain et al., "Advanced pattern based memory controller for FPGA based HPC applications," in HPCS'14, pp. 287-294.
