Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA

Ashish Venkat (University of Virginia, [email protected])
Harsha Basavaraj, Dean M. Tullsen (University of California, San Diego, {hbasavar | [email protected]})

Abstract—Heterogeneous multicore architectures are comprised of multiple cores of different sizes, organizations, and capabilities. These architectures maximize both performance and energy efficiency by allowing applications to adapt to phase changes by migrating execution to the most efficient core. Heterogeneous-ISA architectures further take advantage of the inherent ISA preferences of different application phases to provide additional performance and efficiency gains. This work proposes composite-ISA cores that implement composite feature sets made available from a single large superset ISA. This architecture has the potential to recreate, and in many cases supersede, the gains of multi-ISA heterogeneity by leveraging a single composite ISA, exploiting greater flexibility in ISA choice. Composite-ISA CMPs enhance existing performance gains due to hardware heterogeneity by an average of 19%, and have the potential to achieve an additional 31% energy savings and a 35% reduction in Energy Delay Product, with no loss in performance.

Keywords: heterogeneous architectures; multicore processors; ISA; RISC; CISC; energy efficiency; single-thread performance

I. INTRODUCTION

Modern processors increasingly employ specialization to improve the execution efficiency of domain-specific workloads [1]–[6]. Additionally, some take advantage of microarchitectural heterogeneity by combining large high-performance cores and small power-efficient cores on the same chip, to create efficient designs that cater to the diverse execution characteristics of general-purpose mixed workloads [7]–[14]. Heterogeneous-ISA architectures [15]–[17] further expand the dimensions of hardware specialization by synergistically combining microarchitectural heterogeneity with ISA heterogeneity, exploiting the inherent ISA affinity of an application.

ISA heterogeneity provides the hardware architect and the compiler designer with finer control over features such as register pressure, predication support, and addressing mode availability, each of which can significantly impact the overall throughput of the code [16], [18]–[20]. Multiple independent explorations have shown that ISA heterogeneity substantially improves execution efficiency in both chip multiprocessor (CMP) [15]–[22] and datacenter environments [23], [24], and these effects are particularly pronounced in scenarios where microarchitectural heterogeneity alone provides diminishing returns [16], [19]. However, in prior work, the ISA-dependent features had to be selected at a very coarse level. For example, the hardware designer might either choose x86, which comes with perhaps one key feature of interest but a lot of other overheads, or Thumb, which forgoes several non-essential features but also a few others that are of significance. Such coarse selection not only restricts the ability to assign threads to the best core, but even more so restricts the processor architect in identifying the best combination of hardware features to assign to a globally optimal set of heterogeneous cores.

Furthermore, despite their potential for greater performance and energy efficiency, the deployment of heterogeneous ISAs on a single chip is non-trivial. First, integration of multiple vendor-specific commercial ISAs on a single chip is fraught with significant licensing, legal, and verification costs and barriers. Second, process migration in a heterogeneous-ISA CMP necessitates the creation of fat binaries, and involves expensive binary translation and state transformation costs due to the difference in encoding schemes and application binary interfaces (ABIs) of fully disjoint ISAs.

This work proposes composite-ISA architectures that employ compact cores implementing fully custom ISAs derived from a single large superset ISA. If such an architecture can just match the performance and energy of the multi-ISA designs, this is a significant win due to the elimination of multi-vendor licensing, testing and verification issues, fat binaries, and high migration overheads. However, our results show that this design can, in fact, significantly outperform fully heterogeneous-ISA designs, due to greatly increased flexibility in creating cores that mix and match specific sets of features.

To derive fully custom ISAs with diverse capabilities, we first construct a well-defined baseline superset ISA that offers a wide range of customizable features: register depth (programmable registers supported), register width, addressing mode complexity, predication, and vector support. Ignoring some permutations of these features that are not viable or unlikely to be useful, this still gives the processor designer 26 custom ISAs that are potentially useful. Thus, by starting with a single superset ISA, the designer has far more choice and control than selecting from among existing commercial ISAs.

This work features a massive design space exploration that sweeps through all viable combinations of the customizable ISA features, along with an extensive set of customizable microarchitectural parameters, to identify optimal designs. A major contribution of this work is the isolation of specific ISA features that improve single-thread performance and/or increase multi-programmed workload throughput/efficiency, and an extensive study of their effect on important architectural design choices that enable efficient transistor investment on the available silicon real estate.

In this work, we construct the superset ISA using extensions and mechanisms completely consistent and compatible with the existing Intel x86 ISA. However, we note that greater levels of customization can be achieved by creating a new (superset) ISA from scratch. We start with x86 because it not only already employs a large set of the features we want, but also has a clear history and process for adding extensions. In this work, we present detailed compiler techniques to extend the x86 backend to both support and exploit the extra features that we add and to customize existing features. We also describe migration strategies (with negligible binary translation costs) to switch between the composite ISAs that implement overlapping feature sets. In addition, this work features a comprehensive analysis of the hardware implications of the custom feature set options, including a full synthesized RTL design of multiple versions of the x86 decoder.

Due to the constraint of a single baseline superset ISA, the derived custom ISAs can never incorporate all traits of distinct vendor-specific ISAs (such as the code compression of Thumb). However, the greater flexibility and composability of our design results in greater overall efficiency. In designs optimized for multithreaded mixed workloads, we gain 18% in performance and reduce the energy-delay product (EDP) by 35% over single-ISA heterogeneous designs, without sacrificing most of the benefits of a single ISA. In designs optimized for single-thread performance/efficiency, we achieve a speedup of 20% and an EDP reduction of 28% on average, over designs that only employ hardware heterogeneity. Further, we match and in fact supersede (by 15%) the gains of a heterogeneous-ISA CMP with vendor-specific ISAs, while effectively eliminating vendor-specific ISA licensing issues, fat binary requirements, binary translation, and state transformation costs.

II. RELATED WORK

Kumar, et al. [10]–[12] introduced single-ISA heterogeneous multicore architectures. These architectures employ cores of different sizes, organizations, and capabilities, allowing an application to dynamically identify and migrate to the most efficient core, thereby maximizing both performance and energy efficiency. Single-ISA heterogeneity has been well studied and applied broadly in embedded [4], [8], [14], general-purpose [10], [11], [25], GPU [3], [7], autonomous driving [26], and datacenter [27]–[29] environments. Architects have further explored several microarchitectural [30]–[38] and scheduling techniques [13], [39]–[55] to better harness the gains due to single-ISA heterogeneity.

Heterogeneous-ISA architectures [15]–[17], [20]–[22], [56] explore yet another dimension of heterogeneity. These architectures allow cores that are already microarchitecturally heterogeneous to further implement diverse instruction sets. By exploiting ISA affinity, where different code regions within an application inherently favor a particular ISA, they realize substantial performance and efficiency gains over hardware heterogeneity alone. Prior work shows that ISA affinity is beneficial not just in general-purpose environments, but could potentially enable significant energy efficiency in datacenter environments [23], [24], [57]. Wang, et al. [58] further enable offloading of binary code regions in a heterogeneous-ISA client/server environment. Furthermore, considerable prior work implements replicated OS kernel support for heterogeneous-ISA architectures [24], [59], [60].

Blem, et al. [61] claim that modern ISAs such as ARM and x86 have a similar impact in terms of performance and energy efficiency, but that work only compares similarly register pressure-constrained ISAs (ARM-32 and x86-32), turns off machine-specific tuning, ignores feature set differences (e.g., Thumb), and makes homogeneous hardware assumptions, unlike the work on heterogeneous-ISA architectures [16], [20], [21], [23], [24]. Akram and Sawalha [18], [19] perform extensive validation of the conflicting claims and conclude that the ISA does indeed have a significant impact on performance.

The RISC-V architecture provides extensive support in terms of both instruction set design and customizations for hardware accelerators [62], [63]. Although we choose our baseline ISA to be x86, the techniques we describe are not limited to x86. The RISC-V ISA allows enough flexibility to carve out similar axes of customization to those we explore in this work, and thus would also be a reasonable host ISA for a composite-ISA architecture. On a fixed-length ISA like RISC-V, we expect to retain most of the benefits due to diversity in register depth/width, predication, and addressing-mode complexity. However, there may be additional effects due to the difference in code density, some of which manifested in the multi-vendor heterogeneous-ISA work that considered both fixed-length and variable-length ISAs [16].

There have been significant design space exploration studies in related work [12], [16], [64]–[67]. Intel's QuickIA [68], FabScalar [69], [70], OpenPiton [71], [72], the heterogeneous-ISA JuxtaPiton [73], and Aladdin [74], [75] further allow the exploration of heterogeneous architectures of varying complexity. We do not include the exploration of GPUs and other accelerator designs in our search space since we target diversity in execution characteristics and our primary goal is to achieve the gains of multi-vendor ISA heterogeneity [16], [19], [20], [24] using a single ISA.

III. ISA FEATURE SET DERIVATION

In this section, we describe our superset ISA. It resembles x86, but with an additional set of features that can be customized along 5 different dimensions: register depth, register width, opcode and addressing mode complexity, predication, and data-parallel execution. We further study the code generation impact, processor performance, power, and area implications of each dimension.

Register Depth. The number of programmable registers exposed by the ISA to the compiler/assembly programmer constitutes an ISA's register depth. The importance of register depth as an ISA feature is well established due to its close correlation to the actual register pressure (number of registers available for use) in any given code region [76]–[79]. While most compiler intermediate representations allow for a large number (potentially infinite) of virtual registers, the number of architectural registers is limited, resulting in spills and refills of temporary values into memory, and limiting the overall instruction-level parallelism [18], [19].

Register depth not only affects efficient code generation, but significantly impacts machine-independent code optimizations due to (register pressure sensitive) redundancy elimination and re-materialization (re-computation of a temporary value to avoid spills/refills) [80]–[84]. For example, decreasing the register depth from 32 to 16 registers in our custom feature sets results in an increase of 3.7% in stores (spills), 10.3% in loads (refills), 3.5% in integer instructions, and 2.7% in branch instructions (rematerialization) on the SPEC CPU2006 benchmarks compiled using the methodology described in Section IV.

Furthermore, the backend area and power are strongly correlated with the ISA's register depth, impacting the nature and size of structures such as the reorder buffer and the physical register file: even with register renaming (i.e., dynamically scheduled cores), the physical register file still scales partially with ISA register depth. In our superset ISA, we allow register depth to be customized to 8, 16, 32, and 64 registers. A composite-ISA design that customizes each core with a different register depth alleviates the register pressure of impacted code regions by migration to a core with greater register depth, and at the same time saves power by enabling smaller microarchitectural structures in other cores.
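The spill/refill effect of shrinking register depth can be illustrated with a toy model (ours, not the paper's toolchain) that treats the architectural register file as an LRU cache over a virtual-register access trace. The function name, trace shape, and eviction policy are all illustrative assumptions; a real allocator works over live ranges, not a flat trace.

```python
from collections import OrderedDict

def spill_traffic(trace, num_regs):
    """Toy model: the register file as an LRU cache over virtual-register
    accesses. Returns (refills, spills): loads and stores added because the
    code touches more live values than the ISA exposes registers for."""
    regs = OrderedDict()          # vreg -> dirty flag
    refills = spills = 0
    for vreg, is_write in trace:  # (virtual register id, write access?)
        if vreg in regs:
            regs.move_to_end(vreg)
        else:
            refills += 1          # value must be (re)loaded from its slot
            if len(regs) == num_regs:
                _, dirty = regs.popitem(last=False)   # evict LRU entry
                if dirty:
                    spills += 1   # evicted value written back (spill store)
            regs[vreg] = False
        if is_write:
            regs[vreg] = True

    return refills, spills

# A round-robin trace over 24 live values: heavy traffic at depth 16,
# only compulsory refills once register depth covers the working set.
trace = [(i % 24, i % 3 == 0) for i in range(240)]
few = spill_traffic(trace, 16)
many = spill_traffic(trace, 32)
```

Under this toy model, depth 32 incurs only the 24 compulsory refills and no spills, while depth 16 thrashes, mirroring the direction (though not the magnitude) of the SPEC CPU2006 numbers above.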

[Figure 1. Derivation of Composite Feature Sets from a Superset ISA: a tree over instruction complexity and SIMD (x86+SSE vs microx86), register width (32-bit vs 64-bit), register depth (8, 16, 32, 64; at least 16 for 64-bit sets), and predication (partial vs full), yielding the 26 composite feature sets.]
Register Width. Like register depth, the register width of an architecture impacts performance and efficiency in several different ways. First, wider data types imply wider pointers, allowing access to larger virtual memory and avoiding unnecessary memory mapping to files. However, wide pointers potentially expand the cache working set, thereby negatively impacting performance [85]. Second, wider registers can often be addressed as individual sub-registers, enhancing the overall register depth of the ISA. Most compilers' register allocators take advantage of sub-registers (e.g., eax, ax, al, etc.) and perform aggressive live range splitting and sub-register coalescing [81], [83], [86]–[89]. Third, emulating data types wider than the underlying ISA/core's register width not only requires more dynamic instructions, but could potentially use up more registers and thereby adversely impact register pressure.

Finally, wider registers imply larger physical register files in the pipeline, which impacts both core die area and overall power consumption. In our experiments, doubling the register width from 32 bits to 64 bits impacts processor power by as much as 6.4% across different register depth organizations. Our superset ISA supports both 32-bit and 64-bit wide registers like x86, but we modify the instruction encoding to eliminate any restrictions on the addressing of a particular register, sub-register, or combination thereof.

Instruction Complexity. The variety of opcodes and addressing modes offered by an instruction set controls the mix of dynamic instructions that enter and flow through the pipeline. Incorporating a reduced set of opcodes and addressing modes into the instruction set could significantly simplify the instruction decode engine, if chosen carefully. In fact, by excluding instructions that translate to more than one micro-op, we can save as much as 9.8% in peak power and 15.1% in area. However, for some code regions, such a scheme could increase the overall code size, potentially impacting both the overall instruction cache accesses and the instruction fetch energy.

To derive such a reduced feature set, we carve out a subset of opcodes and addressing modes from our superset ISA that can be implemented using a single micro-op, essentially creating custom cores that implement the x86 micro-op ISA, albeit with variable-length encoding. The reduced feature set, called microx86 in this work, adheres to the load-compute-store philosophy followed by most RISC architectures. As a result, we could view this option as RISC vs. CISC support. While one could conceive a more aggressive low-power implementation of microx86 that implements fewer opcodes and further recycles opcodes for a more compact representation [90]–[94], we keep all the same opcodes, and thus follow x86's existing variable-length encoding and 2-phase decoding scheme. This not only maintains consistency with existing implementations of x86, but prevents us from incurring the binary translation costs associated with multi-vendor heterogeneous-ISA designs. However, this does mean we cannot completely replicate the instruction memory footprint of a theoretically minimal representation.

Predication. Predication converts control dependences into data dependences in order to eliminate branches from the instruction stream and consequently take pressure off the branch predictor and associated structures [95]–[99], while also removing constraints on the compiler's instruction scheduler. Modern ISA implementations of predication can be classified into three categories: (a) partial predication, which allows only a subset of the ISA's instructions to be predicated, (b) full predication, which allows any instruction to be predicated using a predefined set of predicate registers, and (c) conditional execution, which allows any instruction to be predicated using one condition code register.

The x86 ISA already implements partial predication via CMOVxx instructions that are predicated on condition codes. In this work, we add full predication support to our superset ISA, allowing any instruction to be predicated on any available general-purpose register using the if-conversion strategy described in Section IV. While predication eliminates branch dependences, aggressive if-conversion typically increases the number of dynamic instructions, thus placing more pressure on the instruction fetch unit and the instruction queue. In our custom feature sets that offer predication, we observe an average increase of 0.6% in the number of dynamic instructions with a reduction of 6.5% in branches.

Data-Parallel Execution. Most modern instruction sets offer primitives to perform SIMD operations [85], [100]–[102] to take advantage of the inherent data-level parallelism in specific code regions. The x86 ISA already supports multiple feature sets that implement a variety of SIMD operations. We include in our superset ISA the SSE2 feature set, which can compute on data types as wide as 128 bits, as implemented in the gem5 simulator [103]. Furthermore, we constrain our microx86 implementations to not include SSE2, since more than 50% of SIMD operations rely on 1:n encoding of macro-op to micro-op. In our composite-ISA design, cores that do not implement SSE2 save 7.4% in peak power and 17.3% in area. They execute a precompiled scalarized version of the code when available, and in most cases, migrate code regions that enter intense vector activity to cores with full vector support.

In summary, we derive 26 different custom feature sets along the five dimensions described above. We exclude full predication from our 32-bit feature sets that have only 8 registers, since they suffer from high register pressure. In fact, LLVM's predication profitability analysis seldom turns on predication with 8 registers. Similarly, we constrain 64-bit ISAs to support a register depth of at least 16.

[Figure 2. SPEC2006 Instr. Mix (normalized to x86-64)]

Figure 2 shows the dynamic instruction (micro-op) breakdown for the SPEC CPU2006 benchmarks on three different custom ISAs: (a) the 32-bit version of microx86 with a register depth of 8 and no additional features (the smallest feature set in our exploration), (b) the x86-64 ISA with SSE and no other customizations, and (c) the superset ISA, which implements all the features described above. Due to the high register pressure in microx86-32, it incurs an average of 28% more memory references than x86-64 and an overall expansion of 11% in the number of micro-ops. Compared to x86-64, we also find that the superset ISA, owing to the diverse set of custom features added, sees an average reduction of 8.5% in loads (spill elimination), 6.3% in integer instructions (aggressive redundancy elimination), and 3.2% in branches (predication).

IV. COMPILER AND RUNTIME STRATEGY

In this section, we describe our compilation strategy that generates code to efficiently take advantage of the underlying custom feature sets, and our runtime migration strategy that allows code regions to seamlessly migrate back and forth between different custom feature sets, without the overhead of full binary translation and/or state transformation.

A. Compiler Toolchain Development

Compilation to a superset ISA or a combination of custom feature sets allows different code regions to take advantage of the variety of custom feature sets implemented by the underlying hardware. For example, code regions with high register pressure could be compiled to execute on a feature set with greater register depth, and code regions with too many branches could be compiled to execute predicated code. We leverage the LLVM MC infrastructure [89] to efficiently encode the right set of features supported by the underlying custom design and further propagate it through various instruction selection, code generation, and machine-dependent optimization passes. We further take advantage of the MC code emitter framework to encode feature sets, such as register depth and predication, that require an extra prefix.

To convert the existing x86 backend to that of the superset ISA, we first include the additional 48 registers in the ISA's target description and further associate code density costs with them. This enables the register allocator to always prioritize the allocation of a register that requires fewer prefix bits to encode. Furthermore, we allow each of these registers to be addressed as a byte, a word, a doubleword, and a quadword register, thereby smoothly blending into the register pressure tracking and subregister coalescing framework.

We next implement full predication in x86 by re-purposing the existing machine-dependent if-conversion framework of LLVM, which implements if-conversion in three scenarios: (a) diamond, when a true basic block and a false basic block split from an entry block and rejoin at the tail, (b) triangle, when the true block falls through into the false block, and (c) simple, when the basic blocks split but do not rejoin, such as a break statement within a conditional. For every such pattern in the control flow graph, if-conversion is performed if determined profitable. The profitability of if-conversion is based on branch probability analysis, the approximate latency of machine instructions that occur in each path, and the configured pipeline depth. We further add the if-conversion pass as a pre-scheduling pass for the X86 target; this allows the scheduling pass to take full advantage of the large blocks of unscheduled code created by the if-conversion.

To implement microx86, we modify the existing x86 backend to exclude all opcodes, addressing modes, and prefixes that decode into more than one micro-op during machine instruction selection. While LLVM's instruction selector, for the most part, identifies a replacement ld-compute-st combination for the excluded addressing mode, certain IR instructions such as tail jumps and tail call returns require explicit instruction lowering.

For each code region (where we roughly define a region as the set of code that dominates a simpoint phase), the compiler must now make a global (or regional) decision about which features to use and which to skip. Further, the compiler should make these decisions with some knowledge of the features of the cores of the processor on which it will run. While we find that LLVM already makes profitable decisions for several features we explore (e.g., register depth and predication), near-optimal instruction selection (microx86 vs. x86) would still need a well-defined heuristics approach. Because we examine so many core combinations in our design space exploration, we assume the compiler makes good instruction selection decisions about which features to include in compilation in all cases. So despite the fact that features included in compilation are not static for a region (across experiments), we can still see clear trends in code affinity for features.

We find that the highly register-constrained benchmark hmmer is consistently compiled to use all 64 registers across all code regions. In contrast, only one phase of the benchmark bzip2 is compiled with a register depth of 64, with the rest of the seven regions typically compiled with a register depth of 32. The benchmark lbm exhibits low register pressure and prefers a register depth of 16. While most code regions prefer microx86, when constrained by fewer than 32 registers, x86 is preferred, since the complex addressing modes somewhat alleviate the register pressure, as evidenced in the benchmarks sjeng and mcf. Similarly, the compiler employs predication in four regions of the benchmark milc, but does not find it to be profitable in two other regions of the same benchmark. Furthermore, due to the high register pressure and lack of free registers in hmmer, the compiler harnesses the full suite of complex addressing modes offered by x86 and seldom employs predication. These decisions propagate through the rest of the optimization passes, resulting in carefully optimized code for the underlying feature set.

B. Migration Strategy

Process migration across overlapping custom feature sets could involve two scenarios. In an upgrade scenario, a process is compiled to use only a subset of the features implemented by the core to which it is migrated, and therefore can resume native execution immediately after migration (no binary translation or state transformation). Conversely, in a downgrade scenario, the core to which a process is migrated implements only a subset of the features being used by the running code, which necessitates minimal binary translation of unimplemented features. We outline the following low-overhead mechanisms to handle feature downgrades.

Owing to the overlapping nature of the feature sets (same opcode and instruction format), feature emulation entails only a small set of binary code transformations, in contrast to full-blown cross-ISA binary translation. First, when we downgrade from x86 to microx86, we perform simple addressing mode transformations by translating any instruction that directly operates on memory into a set of simpler instructions that adhere to the ld-compute-st format. Second, we downgrade to a feature set with a smaller register depth by translating higher (unimplemented) registers into memory operands using a register context block [15], [104], [105], a commonly used technique during binary translation to register pressure-constrained ISAs. Third, we perform long-mode emulation [16], [105] and use fat pointers implemented using xmm registers in order to emulate wider types on a 32-bit core. Finally, we employ simple reverse if-conversions to translate predicated code back to control dependences. As demonstrated in Section VII, feature downgrades are extremely rare and often inexpensive.
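The first downgrade mechanism, the addressing-mode transformation from x86 to microx86, can be sketched as a small rewrite over a hypothetical three-address textual form. The tuple representation and the reserved scratch register r63 are our illustrative assumptions, not the paper's actual translator.

```python
def downgrade_to_microx86(insn):
    """Toy sketch of the x86 -> microx86 downgrade: split an instruction
    that computes directly on memory into a ld-compute-st sequence using
    a scratch register. The textual form here is hypothetical."""
    op, dst, src = insn                # e.g. ("add", "[rbx+8]", "rax")
    if not (dst.startswith("[") and dst.endswith("]")):
        return [insn]                  # register-only: already legal microx86
    scratch = "r63"                    # assume a reserved scratch register
    return [
        ("mov", scratch, dst),         # ld: fetch the memory operand
        (op, scratch, src),            # compute: single-micro-op reg-reg op
        ("mov", dst, scratch),         # st: write the result back
    ]

seq = downgrade_to_microx86(("add", "[rbx+8]", "rax"))
```

Because the opcodes and instruction format are shared across feature sets, this is the entire transformation; no cross-ISA state mapping is needed.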
For each code region (where we roughly define a region as Finally, we employ simple reverse if-conversions to translate Legacy Predicate REXBC REX Opcode ModR/M SIB Displacement Immediate Prefixes Prefix Prefix Prefix Grp 1, Grp 2, 2-byte 2-byte (optional) 1-byte or 2-byte 1-byte 1-byte Address Immediate Grp 3, Grp 4 (optional) (optional) or 3-byte (if required) (if required) displacement of data of 1-, 2- or (optional) 1-, 2- or 4-bytes 4-bytes or none

Reg R/M Index Base 0xF1 Reg 0xD6 W R X B W R X B 0x4 W R X B Mod Scale rrr bbb xxx bbb

predicated code back to control-dependences. As demonstrated in Section VII, feature downgrades are extremely rare and often inexpensive.

Figure 3. Customizations to x86 Encoding

V. DECODER DESIGN

In this section, we describe our customizations to the x86 instruction encoding and decoder implementation to support the 26 feature sets derived in Section III. We show that, due to the extensible nature of the x86 ISA, the decoder implementation requires minimal changes to support the new features and has a small overall peak power and area impact.

A. Instruction Encoding

Feature extensions to the x86 instruction set are not uncommon. In accordance with its code density and backward compatibility goals, major feature set additions to x86 (e.g., REX, VEX, and EVEX) have been encoded by exploiting unused opcodes and/or by adding new (optional) prefix bytes. We use similar mechanisms to encode the specific customizations we propose, as shown in Figure 3.

To double/quadruple the register depth of x86-64, we add a new prefix, REXBC, much as the REX (register extension) prefix doubled both the register width and depth of x86-32, giving rise to the x86-64 ISA. In particular, the REXBC prefix encodes 2 extra bits for each of the 3 register operands (input/output register, base, and index), which are further combined with 4 bits from the REX, MODRM, and SIB bytes to address any of the 64 programmable registers. Furthermore, we use the remaining 2 bits of the REXBC prefix to lift restrictions in x86 that disallow certain combinations of registers and subregisters from being used as operands in the same instruction. Finally, we exploit an unused opcode, 0xd6, to mark the beginning of a REXBC prefix. Similarly, to support predication, we use a combination of an unused opcode, 0xf1, and a predicate prefix. To efficiently support diamond predication (described in Section IV), we use the predicate prefix to encode both the nature (true/not-true) of the conditional (bit 7) and the register (bits 0-6) the instruction is predicated on.

All of the insights in this paper apply equally (if not more so) to a new superset ISA designed from scratch – such an ISA would allow much tighter encoding of these options.

B. Decoder Analysis

Figure 4. x86 Fetch/Decode Engine

Figure 4 shows the step-by-step decoding process of a typical x86 instruction. Owing to its variable-length encoding, x86 instructions go through a 2-phase decode process. In the first phase, an instruction-length decoder (ILD) fetches and decodes raw bytes from a prefetch buffer, performs length validation, and queues the decoded macro-ops into a macro-op buffer. These macro-ops are fused into a single macro-op when viable, and further fed as input to one of the instruction decoders, which decode each into one or more micro-ops. The decoded micro-ops are subjected to fusion again, to store them in the micro-op queue and the micro-op cache [106]–[108] in a compact representation, and are later unfused and dispatched as multiple micro-ops. The micro-op cache is both a performance and a power optimization that allows the engine to stream decoded (and potentially fused) micro-ops directly from the micro-op cache, turning off the rest of the decode pipeline until there is a micro-op miss.

In our RTL implementation, we model an ILD based on the parallel instruction length decoder described by Madduri et al. [109]. The ILD has 3 components: (a) an instruction decoder that decodes each byte of an incoming 8-byte chunk as the start of an instruction, decoding prefixes and opcodes, (b) a length calculator that speculatively computes the length of the instruction based on the decoded prefixes and opcodes, and (c) an instruction marker that checks the validity of the computed lengths, marks the beginning and end of an instruction, and detects overflows into the next chunk.

Since our customizations affect the prefix part of the instruction, we modify the eight decode subunits of the instruction decoder to include comparators that generate extra decode signals to represent the custom register depth and predicate prefixes. These decode signals propagate through the speculative instruction length calculator and the instruction marker, requiring wider multiplexers in the eight length subunits, the length control select unit, and the valid begin unit. These modifications to the instruction length decoder result in an increase of 0.87% in total peak power and 0.65% in area for our superset ISA.

Furthermore, we increase the width of the macro-op queue by 2 bytes to account for the extra prefixes. Since predication support and greater register depth in our superset ISA could potentially require wider micro-op encodings, we increase the width of the micro-op cache and the micro-op queue by 2 bytes. Finally, for our microx86 implementations, we replace the complex 1:4 decoder with another simple 1:1 decoder and forgo the microsequencing ROM. From our analysis, a decoder that implements our simplest feature set, microx86-32, consumes 0.66% less peak power and occupies 1.12% less area than the x86-64 decoder, while our superset decoder consumes 0.3% more peak power and occupies 0.46% more area than the x86-64 decoder. These variances do not include the increases or savings from the ILD.

VI. METHODOLOGY

Table I shows our design space, which consists of 5 dimensions of ISA customization and 19 microarchitectural dimensions.
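As a concrete illustration of the bit budget described above, the following sketch packs the new prefixes the way the text specifies: 2 REXBC bits per register operand (plus 2 flag bits), each pair concatenated with the 4 bits already available from the REX/MODRM/SIB bytes to form a 6-bit specifier, and a predicate prefix carrying the conditional sense in bit 7 and the predicate register in bits 0-6. The field order within the REXBC byte is our own assumption for illustration; the paper fixes only the bit budget, not the exact placement.

```python
# Sketch of the composite-ISA prefix packing described in the text.
# Assumption: field positions within the REXBC byte are illustrative only.

REXBC_ESCAPE = 0xD6      # unused x86 opcode repurposed to introduce REXBC
PREDICATE_ESCAPE = 0xF1  # unused x86 opcode introducing the predicate prefix

def pack_rexbc(reg_hi2, base_hi2, index_hi2, flags2):
    """Pack the 2 high-order bits of the three register operands plus
    2 flag bits (used to lift subregister-combination restrictions)
    into a single REXBC prefix byte."""
    assert all(0 <= b < 4 for b in (reg_hi2, base_hi2, index_hi2, flags2))
    return (reg_hi2 << 6) | (base_hi2 << 4) | (index_hi2 << 2) | flags2

def full_register_number(hi2, low4):
    """Combine 2 REXBC bits with the 4 bits from REX/MODRM/SIB:
    6 bits address any of the 64 programmable registers."""
    assert 0 <= hi2 < 4 and 0 <= low4 < 16
    return (hi2 << 4) | low4

def pack_predicate_prefix(not_true, pred_reg):
    """Predicate prefix: bit 7 encodes the true/not-true sense of the
    conditional, bits 0-6 the register the instruction is predicated on."""
    assert not_true in (0, 1) and 0 <= pred_reg < 128
    return (not_true << 7) | pred_reg

# Example: register 53 = 0b110101 splits into hi2 = 0b11, low4 = 0b0101.
assert full_register_number(0b11, 0b0101) == 53
```

The point of the sketch is that the extension is purely additive: legacy decode paths are untouched, and only instructions that actually need registers 16-63 or predication pay for the extra prefix bytes.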
Table I. FEATURE EXPLORATION SPACE

ISA parameters:
- Register depth: 8, 16, 32, 64 registers
- Register width: 32-bit, 64-bit registers
- Instruction/addressing-mode complexity: 1:1 macroop-microop encoding (load-store x86 micro-op ISA) vs. 1:n macroop-microop encoding (fully CISC x86 ISA)
- Predication support: full predication (like IA-64/Hexagon) vs. partial (cmov) predication
- Data parallelism: scalar vs. vector (SIMD) execution

Microarchitectural parameters:
- Execution semantics: inorder vs. out-of-order designs
- Fetch/issue width: 1, 2, 4
- Decoder configurations: 1-3 1:1 decoders, 1 1:4 decoder, MSROM
- Micro-op optimizations: micro-op cache, micro-op fusion
- Instruction queue sizes: 32, 64
- Reorder buffer sizes: 64, 128
- Physical register file configurations: (96 INT, 64 FP/SIMD), (64 INT, 96 FP/SIMD)
- Branch predictors: 2-level local, gshare, tournament
- Integer ALUs: 1, 3, 6
- FP/SIMD ALUs: 1, 2, 4
- Load/store queue sizes: 16, 32
- Instruction cache: 32KB 4-way, 64KB 4-way
- Private data cache: 32KB 4-way, 64KB 4-way
- Shared last-level (L2) cache: 4-banked 4MB 4-way, 4-banked 8MB 8-way

Table II. X86-IZED VERSIONS OF THUMB, ALPHA, AND X86-64

- microx86-8D-32W (Thumb-like features): load/store architecture; register depth 8; register width 32; no SIMD support. Exclusive feature over Thumb: FP support. Thumb-specific features not carried over: code compression, 2-address instructions, fixed-length instructions (one-step decoding).
- microx86-32D-64W (Alpha-like features): load/store architecture; register depth 32; register width 64; no SIMD support; CMOV support. Exclusive features over Alpha: none. Alpha-specific features not carried over: fixed-length instructions (one-step decoding), more FP registers.
- x86-16D-64W (x86-64-like features): CISC architecture; register depth 16; register width 64; SIMD support; CMOV support. Exclusive features: none. x86-specific features not carried over: none.

After careful pruning of configurations that are not viable (e.g., 4-issue cores with a single INT/FP ALU) or unlikely to be useful (full predication with 8 registers), this results in 26 different custom ISA feature sets, 180 microarchitectural configurations, and 4680 distinct single-core design points, spread across a wide range of per-core peak power (4.8W to 23.4W) and area (9.4mm2 to 28.6mm2). The goal of our feature set exploration is to find an optimal 4-core multicore configuration using the fully custom feature sets derived from the superset ISA. Our objective functions that evaluate optimality include both performance and energy delay product (EDP), for both multithreaded and single-threaded workloads. Our workloads include 8 SPEC CPU2006 benchmarks, further broken down into 49 different application phases using the SimPoint [110], [111] methodology.

We use the gem5 [103] simulator to measure performance on both our inorder and out-of-order cores. We modify the gem5 simulator to include micro-op cache and micro-op fusion support, in order to measure the impact of our customizations in light of existing micro-op optimizations. However, we do not employ micro-op fusion in our microx86 ISA, because each instruction decomposes into only one micro-op and the micro-op fusion unit does not yet combine micro-ops from different macro-ops. Our implementations of the micro-op cache and fusion are consistent with the guidelines in the Intel Architecture Optimization Manual [112].

We perform a full RTL synthesis using the Synopsys Design Compiler to measure the decoder area and power overheads of each of the customizations we employ in our feature sets, as described in Section V. We also use the McPAT [113] power modeling framework to evaluate the power and area of the rest of the pipeline. The peak power and area measurements we obtain from McPAT simulations are key parameters to our exploration of this massive design space, which involves 196,560 gem5+McPAT simulations (approx. 28,080 simulations per benchmark), resulting in 49,733 core hours on the 2-petaflop XSEDE Comet cluster at the San Diego Supercomputing Center [114]. Furthermore, our multicore design search involves finding an optimal 4-core multicore out of 102.5 trillion combinations that run all permutations of simpoint regions, for several different objective functions and constraints. To again make the search tractable, the results we report for the fully custom feature set design are local optima, and thus conservative estimates.

Finally, process migration in a composite-ISA architecture could potentially involve binary translation of unimplemented features in the case of a feature downgrade. We measure this cost by performing feature emulation for each checkpointed code region that represents a simpoint, on artificially-constrained cores that only implement a subset of the features the simpoint was compiled to. We also report the overall cost of migration on our designs optimized for multiprogrammed workload throughput, where threads often contend for the cores of their preference, and therefore may not always run on the core of their choice.

VII. RESULTS

A. Performance and Energy Efficiency

This section elaborates on the findings of our design space exploration. We identify custom multicore designs that benefit from both hardware heterogeneity and feature set diversity, providing significant gains over designs that exploit only hardware heterogeneity. Furthermore, these designs recreate the effects of, or in most cases surpass, the gains offered by a fully heterogeneous-ISA CMP that implements a completely disjoint set of vendor-specific ISAs and requires sophisticated OS/runtime support.

We conduct multiple searches through our design space to model different execution scenarios and budget constraints. In each search, we identify three optimal 4-core multicore designs: (1) homogeneous x86-64 CMPs that employ cores implementing the same ISA and the same microarchitecture, (2) x86-64 CMPs that exploit hardware heterogeneity alone, and (3) composite-ISA x86-64 CMPs that exploit hardware heterogeneity and full ISA feature diversity. In addition, we also identify two intermediate design points of interest: (1) heterogeneous-ISA CMPs [16] that implement three fixed, disjoint, vendor-specific ISAs (x86-64, Alpha, Thumb), and (2) composite-ISA CMPs that exploit hardware heterogeneity and a limited form of feature diversity via three x86-based fixed feature sets that resemble the above vendor-specific ISAs. We use the latter design as a vehicle to demonstrate that vendor-specific ISA heterogeneity can be recreated to a large extent by carving out custom feature sets from a single sufficiently-diverse superset ISA. Table II offers a more detailed comparison.

Thus, we compare against a number of interesting configurations, but two are most revealing. Since we seek to replicate the advantage of multi-ISA heterogeneity over single-ISA heterogeneity, the single-ISA heterogeneous result is our primary baseline for comparison. However, the multi-vendor (x86, Thumb, Alpha) result represents the "goal" that we strive to match.
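The multicore search just described can be sketched as a brute-force optimization over unordered 4-core combinations of single-core design points, ranked by one of the two objective functions named above (performance or EDP). The design points and numbers below are invented for illustration; the real search additionally enforces power/area budgets and considers all permutations of simpoint regions.

```python
# Illustrative sketch (not the authors' tool): ranking candidate 4-core
# multicores by energy-delay product (EDP) or performance, the two
# objective functions used in the design space search. All numbers
# are placeholders.
from itertools import combinations_with_replacement

# (name, delay in seconds, energy in joules) on a fixed reference workload
designs = [
    ("microx86-8D-32W-inorder", 2.0, 10.0),
    ("microx86-32D-64W-ooo",    1.2, 16.0),
    ("x86-16D-64W-ooo",         1.0, 22.0),
    ("x86-64D-64W-ooo-simd",    0.9, 30.0),
]

def edp(delay, energy):
    # Energy-delay product: lower is better.
    return energy * delay

def best_multicore(metric):
    """A 4-core multicore is an unordered combination of design points
    (cores may repeat); pick the one minimizing the summed metric."""
    return min(combinations_with_replacement(designs, 4),
               key=lambda cores: sum(metric(d, e) for _, d, e in cores))

best_edp = best_multicore(edp)                  # EDP-optimized multicore
best_perf = best_multicore(lambda d, e: d)      # performance-optimized
```

With heterogeneous workloads and per-thread affinities, the summed metric becomes a scheduling problem over thread-to-core assignments, which is what pushes the real search to trillions of combinations and forces the local-optima approximation described above.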

Figure 8. Single Thread Performance (higher is better) and EDP (lower is better) comparison under Area Budget

…for multiprogrammed and 12.8% for single-threaded workloads, against a uniform-ISA (x86-64) and uniform-microarchitecture baseline.

Table III gives the derived optimal designs for the experiments from Figure 5. As expected, we see the hardware heterogeneity of previous work, including variation in dynamic scheduling, queue sizes, ROB size, ALU count, and branch predictor (local (L), gshare (G), or tournament (T)). However, we also see significant (perhaps even greater) use of the ISA design space, with variation in complexity, register width, register depth, and predication, not only as constraints vary, but within a single multicore design. For example, with a 20W power budget, we have designs with both the simple (microx86) and complex decoders, two register widths, 3 register depths, and both full and partial predication. It is interesting to note that, for that tight power budget, the x86 design has a large number of registers but is otherwise conservative (scalar issue, narrow registers), while the microx86 cores use the saved power budget for more aggressive features like wider issue and tournament predictors. Overall, we see no clear dominant choice for any of our ISA features, but rather each optimal design taking good advantage of the choice of ISA parameters.

In Figure 6, we compare designs optimized to provide multi-threaded workload energy efficiency, measured as Energy Delay Product (EDP), while being constrained under different peak power and area budgets. We observe significant energy savings due to full ISA customization – an average of 31% savings in energy and a 34.6% reduction in EDP over single-ISA heterogeneous multicore designs. This result was not necessarily expected, as the Thumb architecture still provides significant advantages over our most conservative microx86 core. However, many codes cannot use Thumb because of its limited features, while the composite-ISA architecture can combine microx86 with a variety of other features and make use of it far more often.

Table IV gives the derived optimal designs for the experiments from Figure 6. We again see the designs taking good advantage of both hardware heterogeneity and ISA heterogeneity. While the choice of decoder complexity is clearly still a boon in this design space, we see more designs that use the full decoder. So while the microx86 ISA coupled with superscalar issue is a good tradeoff for performance-optimized designs, for example, the EDP-optimized designs favored more x86 scalar designs. In nearly all designs, however, having the choice was the key to reaching the optimal design point.

Figure 9. Performance Degradation over Composite-ISA Designs optimized for multi-programmed workload throughput at 48mm2 budget, and under different Feature Constraints

We next evaluate our designs optimized to provide high single-thread performance and energy efficiency. That is, while the prior results optimized for four threads running on four cores, this exploration optimizes for one thread utilizing four cores via migration. When the designs are constrained by peak power budgets, we model them after the dynamic multicore topology [115], where only one core is active at any given point of time, while the rest of the cores are powered off.
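The execution model just described – one thread, four cores, only the chosen core powered on in each phase – can be sketched as a per-phase core selection with an explicit migration cost. Core names, phase runtimes, and the migration cost below are invented for illustration.

```python
# Sketch of single-thread execution on a dynamic multicore: in each
# application phase, power on only the core whose feature set runs that
# phase fastest, and migrate to it. All numbers are placeholders.

# runtime (ms) of each phase on each core's feature set
phase_runtimes = [
    {"microx86-8D-32W": 9.0, "microx86-32D-64W": 7.5,
     "x86-16D-64W": 7.0, "x86-64D-64W": 6.8},
    {"microx86-8D-32W": 4.0, "microx86-32D-64W": 4.2,
     "x86-16D-64W": 5.0, "x86-64D-64W": 5.1},
]

MIGRATION_COST_MS = 0.1  # assumed fixed cost per core switch

def run(phases, start_core):
    """Greedily execute each phase on its fastest core, paying a
    migration cost whenever the chosen core changes."""
    total, current = 0.0, start_core
    for phase in phases:
        best = min(phase, key=phase.get)   # core with the lowest runtime
        if best != current:
            total += MIGRATION_COST_MS
            current = best
        total += phase[best]
    return total, current
```

A real scheduler would also weigh the migration cost against the expected phase length before moving; the greedy policy here is only meant to show why per-phase ISA affinity translates into speedup in this topology.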
Figure 7 compares designs optimized to provide high single-thread performance and energy efficiency. Note that the peak power budgets are tighter in this case, since we assume only one core to be powered on at a time. Due to the low power constraints, hardware heterogeneity provides only marginal improvements in performance and EDP. However, we observe that every feature of the superset ISA again manifests in at least one of the feature sets implemented by the composite-ISA design, allowing applications to migrate across different cores in order to take advantage of any required ISA feature. We observe an average speedup of 19.5% and an average EDP reduction of 27.8% over single-ISA heterogeneous designs. Moreover, owing to the many low-power and feature-rich microx86 options available, we manage to outperform the fully heterogeneous-ISA design that implements vendor-specific ISAs by 14.6%, and reduce EDP by 3.5%, under tight (5W) power constraints. The x86-ized versions of the fully heterogeneous-ISA designs again trail, but generally match up well to, the performance levels offered by vendor-specific ISA heterogeneity. Once again, the composite-ISA designs overall match the performance results of the fully heterogeneous-ISA design, while trailing slightly behind in terms of EDP.

When designs are constrained by an area budget, the optimal multicore is typically composed of multiple small cores and one large core that maximizes single-thread performance.
"800 "800 $5,+/*)7/32 $5,+/*)7/32 "8009 "8009 ",)785,9 ",)785,9 !/9,56/7; !/9,56/7; ",7*. !,*3+, D5)2*.9$5,+/*735 ",7*. !,*3+, D5)2*.9$5,+/*735 &*.,+80,5 %,-/67,59"/0, "82*7/32)09'2/76 &*.,+80,5 %,-/67,59"/0, "82*7/32)09'2/76 Figure 10. Transistor Investment by Processor Area normalized over that of Figure 11. Processor Energy Breakdown normalized over that of Fully Custom Composite-ISA Designs optimized for multi-programmed workload throughput at Designs optimized for multi-programmed workload throughput at 48mm2 budget, 48mm2 budget, and under different Feature Constraints and under different Feature Constraints available, we manage to outperform the fully heterogeneous-ISA exploration is to identify specific ISA features that contribute design that implements vendor-specific ISAs, by 14.6% and toward these benefits, and further help architects make more reduce EDP by 3.5% under tight (5W) power constraints. The efficient design choices. However, since ISA features typically x86-ized versions of the fully heterogeneous-ISA designs again manifest as components of a larger feature set a core implements, trail, but generally match up well, to the performance levels it is generally non-trivial to measure the effect of a specific offered by vendor-specific ISA-heterogeneity. Once again, the feature in isolation. composite-ISA design overall match the performance results of In this section, we perform additional searches through our the fully heterogeneous-ISA design, while trailing slightly behind design space in order to understand the impact an ISA feature in terms of EDP. has over performance, energy, and transistor investment, by When designs are constrained by area budget, the optimal removing one axis of feature diversity at a time. As an example, multicore is typically composed of multiple small cores and one consider the search for an optimal composite-ISA CMP that large core that maximizes single thread performance. 
In Figure 8, optimizes for multi-programmed throughput under an area we evaluate the single thread performance and energy efficiency budget of 48mm2, but constrained to only include designs that of the optimal designs under different area budgets. We observe limit the number of architectural registers to 16 in all cores. If that the composite-ISA designs sport two out-of-order microx86 register depth is an important feature, the optimal design from cores even under the most tightly area-constrained environment this search is expected to perform worse than the one chosen implementing different register depth and width features in each through an unconstrained search. of them, allowing applications to migrate across cores and Figure 9 shows the result of this experiment. We make several take advantage of the specific ISA features. While the fully inferences. First, constraining all cores to implement fewer heterogeneous-ISA design offers similar capabilities due to the than 32 architectural registers negates a significant chunk of area-efficient thumb cores, we note that migration across thumb the performance gain due to feature diversity. Most optimal and x86-64 cores is non-trivial and incurs significant overhead in designs typically employ two or more cores with a register comparison to the simpler migration across the two overlapping depth of at least 32, and seldom employ cores with fewer than x86-based ISAs in our case. Moreover, under tight constraints, 16 registers. Second, the best performing designs typically we are able to design more effective composite-ISA architectures include a mix of both 32-bit and 64-bit cores. While 64-bit due to the greater design options available (e.g., more efficient cores are more efficient at computing on wider data types, combination of 32-bit and 64-bit designs), saving an extra 13.2% 32-bit cores employ smaller hardware structures, saving area for in EDP when compared to fully heterogeneous-ISA designs. other features. 
Designs that exclude any one of them incur Overall, the composite-ISA design consistently outperforms 3-7% loss in performance. Third, most optimal designs employ the single-ISA heterogeneous design, resulting in an average both microx86 and x86 cores. While constraining cores to speedup of 20% and an EDP reduction of 21%. only include microx86 cores marginally affects performance, excluding them limits performance considerably. Finally, most B. Feature Sensitivity Analysis optimal designs include both partially predicated and fully Owing to their feature-rich nature, composite-ISA CMPs predicated cores. consistently offer significant performance and energy efficiency We will examine these 10 constrained-optimal designs further. benefits, even in scenarios where hardware heterogeneity provides For example, this will allow us to compare the four-core design diminishing returns. One of the major goals of this design space where all cores are microx86 with the design where all cores are %#!(&+75,053,214,' ()# $ %#!(&+75,533,534,' ()# $ x86. Figure 10 shows the transistor investment for the processor +75,73,214,' ()# $ +75,213,214,"*$$ 0//; part of the real estate for each of the best designs from the 8/; above experiment. These designs were all optimized for the 7/; 6/; same area budget, but here we plot combined core area, without 5/; 4/; caches; therefore, longer bars imply that the design needed to 3/; spend more transistors on cores and sacrifice cache area to get #$" !)"!('#&%* 2/; 1/; maximum performance. We make the following observations. 0/; First, the design that constrains all cores to microx86 takes /; up the least combined core area. However, owing to the area efficiency of microx86, it is the only design among the 10 that Figure 12. Execution Time Breakdown on the best composite-ISA CMP employs all out-of-order cores, each sporting a tournament optimized for Single Thread Performance under 10W Peak Power Budget branch predictor. 
Second, the design constrained to exclude %#!(&+75,053,214,' ()# $ %#!(&+75,213,214,"*$$ microx86 takes up the highest processor area, investing most of %#!(&+75,213,534,' ()# $ +75,533,214,"*$$ 0//; its transistors on functional units. Note that we always combine 8/; 7/; SIMD units with x86 cores, and that the microx86 cores lack 6/; any SIMD units. Third, the 64-bit-only optimal design spends 5/; 4/; more transistors on the register file and the scheduler than any 3/; other design. In that design, two out of four 64-bit cores are #$" !)"!('#&%* 2/; 1/; configured with a register depth of 64, with the remaining two 0/; configured with 32. /; Figure 11 shows the processor energy breakdown by stages for each of the best designs from the above experiment. We find Figure 13. Execution Time Breakdown on the best composite-ISA CMP that the energy breakdown shows significant deviation from the optimized for Multi-programmed Throughput under 48mm2 Area Budget corresponding area breakdown. While the decoder requires a most applications migrate to a core with a different feature set at greater portion of processor area than the fetch unit, it is the least once. Fourth, the benchmark hmmer that exhibits significant fetch unit that expends more energy during run-time since the register pressure tends to always execute on a feature set with a decode pipeline is only triggered upon a micro-op cache miss, register depth of 64. Interestingly, we found that phases with and instructions are streamed out of the micro-op cache for the considerable irregular branch activity, due to indirect branches most part. Interestingly, the design that constrains all cores to and function pointer calls, prefer full predication since it eases be configured with a register depth of 8 spends significant the pressure on the branch predictor by converting some of the energy in the Fetch stage. This is due to the artificial instruction control flow into data flow. 
This is evidenced by the benchmarks bloat caused by spills, refills, and rematerializations – a direct sjeng and gobmk. consequence of high register pressure. Furthermore, although the In the second scenario, we consider a composite-ISA design x86-only designs invested significantly in SIMD units, the energy optimized for multi-programmed workload throughput under an spent by the functional units is not nearly as proportional. area budget of 48mm2, in which case applications typically This is due to relatively infrequent vector activity. Finally, the contend for the best feature set preference, and may sometimes 64-bit-only design continues to have high register file and execute on feature sets of second preference. Figure 13 shows scheduler energy. the results of this experiment. In sharp contrast to the design C. Feature Affinity optimized for single thread performance, where applications had clear preferences, we find that all applications in the In this section, we study the feature affinity of the eight multi-programmed workload execute on all feature sets at applications we benchmark, in two specific execution scenarios. some point of time. However, we are still able to make some In the first scenario, we consider a composite-ISA heterogeneous high level inferences about feature affinity. For example, the design optimized for single thread performance under a peak benchmark sjeng continues to show a clear preference to x86 power budget of 10W. Recall that in such a design, hardware over microx86, and both benchmarks sjeng and gobmk prefer to heterogeneity provides only marginal benefits, and most of the execute on fully predicated ISAs during phases of irregular gains come from the fact that an application is free to migrate branch activity. across different cores that each implement a diverse feature set. Therefore, such a design captures the true ISA affinity of an D. Migration Cost Analysis application. 
Figure 12 shows the feature affinity in terms of the fraction of time an application spends executing on a particular feature set. First, we find that the multicore design that optimizes single thread performance exhibits significant feature diversity. In fact, by analyzing the component features of the multicore, we find that all features from our superset ISA have been used. Second, we find that there is significant variance in feature set preference across different applications. There is no single best feature set that is preferred by all applications. Third, we observe that there is some variance in feature set preference even within a single application's internal phases. We find that most applications migrate to a core with a different feature set at least once. Fourth, the benchmark hmmer, which exhibits significant register pressure, tends to always execute on a feature set with a register depth of 64. Interestingly, we found that phases with considerable irregular branch activity, due to indirect branches and function pointer calls, prefer full predication, since it eases the pressure on the branch predictor by converting some of the control flow into data flow. This is evidenced by the benchmarks sjeng and gobmk.

Figure 13. Execution Time Breakdown on the best composite-ISA CMP optimized for Multi-programmed Throughput under 48mm2 Area Budget

In the second scenario, we consider a composite-ISA design optimized for multi-programmed workload throughput under an area budget of 48mm2, in which case applications typically contend for their best feature set preference, and may sometimes execute on feature sets of second preference. Figure 13 shows the results of this experiment. In sharp contrast to the design optimized for single thread performance, where applications had clear preferences, we find that all applications in the multi-programmed workload execute on all feature sets at some point of time. However, we are still able to make some high-level inferences about feature affinity. For example, the benchmark sjeng continues to show a clear preference for x86 over microx86, and both sjeng and gobmk prefer to execute on fully predicated ISAs during phases of irregular branch activity.

D. Migration Cost Analysis

Process migration across composite-ISA cores can involve two scenarios. In a feature upgrade scenario, the core to which a process migrates already implements a superset of the features the process was compiled to, in which case there are zero binary translation or state transformation costs. On the other hand, in a feature downgrade scenario, the core to which the process migrates implements only a subset of the features the process is compiled to, necessitating minimal translation of unimplemented features. We first discuss the cost of a feature downgrade for an arbitrary code region, and then measure its performance impact on a design optimized for multi-programmed workload throughput.

Figure 14. Feature Downgrade Cost

We measure feature downgrade costs by running each code region that corresponds to a simpoint on an artificially constrained core that only implements a subset of the features the simpoint was compiled to. Figure 14 shows the result of this experiment. We make several important observations. First, when we downgrade from 64-bit to 32-bit cores, most of the emulation cost is negated by the cache-efficient 32-bit cores. In fact, we achieve a speedup for some applications when we downgrade them from 64-bit to 32-bit feature sets. Second, since most applications use 32 or fewer registers, there is little emulation cost incurred due to a register depth downgrade from 64 to 32 registers. While we incur some overhead (an average of 2.7%) when we downgrade to a feature set that implements only 16 registers, there is significant overhead (an average of 33.5%) in migrating to a feature set that implements only 8 registers. In all cases, we find that the benchmark hmmer incurs the highest emulation overhead due to a register depth downgrade, concurring with our prior feature affinity analysis. Third, we incur an average of 5.5% overhead when we downgrade to a feature set without full predication. Finally, a downgrade from x86 to microx86 comes at a cost of 4.2% on average. Note that we turn on full vectorization and our compiler emits SSE code for all floating-point benchmarks. However, any reasonable scheduler will not schedule simpoints with significant vector activity on microx86 cores without SIMD support, so the lack of vector support is not significantly exposed in these experiments. We limit the x86-to-microx86 translation to simple addressing-mode transformations (as described in Section 4.2).

In our design space analysis, we needed to assume optimal selection of compiler features to make the search tractable. In the next experiment, we explore a single instantiation of one of our optimal core configurations, and a single instance (single set of features) of each compiled binary. The set of features chosen for the binary is the most common one selected for that application (among all possible scheduling permutations). We then run (again, for all permutations of our benchmark set) an experiment with four cores and four applications for 500 billion instructions. Every time one of the applications experiences a phase change that would cause us to re-shuffle job-to-core assignments, we assume a migration cost for each application that moves, and a possible downgrade cost over the next interval for each job that moves to a core that doesn't fully support the compiled features. Migration cost is measured via simulation for each binary and set of features not supported.

Figure 15. Multi-threaded Workload Throughput with Downgrade Cost

Figure 15 compares the designs optimized for multi-programmed workload throughput with migration cost included. Recall that in such a design, threads often contend for the best core and may not always run on their core of first preference. We observe that the performance degradation due to migrations across composite ISAs is a negligible 0.42% on average (max 0.75%), virtually preserving all of the performance gains due to feature diversity. We attribute this to the fact that feature downgrades are infrequent, and when one does occur, the cost of software emulation is minimal, due to the overlapping nature of the feature sets. In fact, out of the 1863 migrations that occur in this experiment, only 125 migrations required a downgrade from 64-bit to 32-bit, 171 from a register depth of 64 to 32, 177 from a register depth of 64 to 16, and 8 from x86 to microx86.

In summary, composite-ISA heterogeneous designs consistently outperform and use far less energy than single-ISA heterogeneous designs, generally matching or exceeding multi-vendor heterogeneous-ISA designs.
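The per-interval accounting used in the experiment above can be sketched as follows: a moved job always pays a migration cost, and additionally pays a downgrade-emulation penalty over the next interval if its new core lacks features its binary was compiled for. The cost constants and the set-based feature test are placeholders, not the paper's measured values.

```python
# Sketch of the migration-cost accounting described above.
# Assumption: costs and feature names are illustrative placeholders.

MIGRATION_COST = 1.0       # cost units charged per job move
DOWNGRADE_PENALTY = 0.05   # fractional slowdown while emulating features

def interval_cost(base_time, moved, compiled_features, core_features):
    """Cost of one interval for a job after a (possible) reshuffle.
    A feature downgrade occurs when the destination core implements
    only a subset of the features the binary was compiled to."""
    cost = base_time
    if moved:
        cost += MIGRATION_COST
        if not compiled_features <= core_features:   # downgrade case
            cost += DOWNGRADE_PENALTY * base_time
    return cost

# A job compiled for 64 registers moving to a 32-register-depth core
# takes a downgrade penalty; a job that does not move pays nothing extra.
moved_cost = interval_cost(100.0, True,
                           {"depth64", "width64"},
                           {"depth32", "width64"})
```

Because composite feature sets overlap by construction, the subset test usually passes (a feature upgrade), which is why the measured degradation stays under 1% despite thousands of migrations.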
VIII. CONCLUSION

This paper presents a composite-ISA architecture and compiler/runtime infrastructure that extends the advantages of multi-vendor heterogeneous-ISA architectures. It enables the full performance and energy benefits of multi-ISA design, without the issues of multi-vendor licensing, cross-ISA binary translation, and state transformation. It also gives both the processor designer and the compiler a much richer set of ISA design choices, enabling them to select and combine features that match the expected workload. This provides richer gains in efficiency than highly optimized but inflexible existing-ISA based designs. This architecture gains an average of 19% in performance and over 30% in energy savings over single-ISA heterogeneous designs. Further, it matches and in many cases outperforms multi-vendor heterogeneous-ISA designs, essentially with a single ISA.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their helpful insights. This research was supported in part by NSF Grant CNS-1652925, NSF/Intel Foundational Microarchitecture Research Grant CCF-1823444, and DARPA under agreement number HR0011-18-C-0020. In addition, this work used the Extreme Science and Engineering Discovery Environment (XSEDE) Comet cluster at the San Diego Supercomputing Center through allocation CCR170018, which is supported by NSF Grant ACI-1548562.

REFERENCES

[1] "2nd Generation Intel Core vPro Processor Family," tech. rep., Intel, 2008.
[2] "The Future is Fusion: The Industry-Changing Impact of Accelerated Computing," tech. rep., AMD, 2008.
[3] "The Benefits of Multiple CPU Cores in Mobile Devices," tech. rep., NVidia, 2010.
[4] "Qualcomm Snapdragon 810 Processor: The Ultimate Connected Mobile Computing Processor," tech. rep., Qualcomm, 2014.
[5] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pp. 13–24, IEEE Press, 2014.
[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, pp. 1–12, ACM, 2017.
[7] "Variable SMP: A Multi-Core CPU Architecture for Low Power and High Performance," tech. rep., NVidia, 2011.
[8] P. Greenhalgh, "big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7," tech. rep., ARM, 2011.
[9] M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," Computer, July 2008.
[10] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction," in International Symposium on Microarchitecture, Dec. 2003.
[11] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance," in International Symposium on Computer Architecture, June 2004.
[12] R. Kumar, D. M. Tullsen, and N. P. Jouppi, "Core Architecture Optimization for Heterogeneous Chip Multiprocessors," in International Conference on Parallel Architectures and Compilation Techniques, Sept. 2006.
[13] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, "Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE)," in International Symposium on Computer Architecture, June 2012.
[14] "Apple 2017: The iPhone X (ten) announced," 2017.
[15] M. DeVuyst, A. Venkat, and D. M. Tullsen, "Execution migration in a heterogeneous-ISA chip multiprocessor," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.
[16] A. Venkat and D. M. Tullsen, "Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor," in Proceedings of the International Symposium on Computer Architecture, 2014.
[17] A. Venkat, S. Shamasunder, H. Shacham, and D. M. Tullsen, "HIPStR: Heterogeneous-ISA program state relocation," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016.
[18] A. Akram and L. Sawalha, "The impact of ISAs on performance," in Proceedings of the 14th Annual Workshop on Duplicating, Deconstructing and Debunking (WDDD), held with ISCA, June 2017.
[19] A. Akram, "A study on the impact of instruction set architectures on processor performance," 2017.
[23] … Storage Conference, SYSTOR '16, pp. 24:1–24:1, ACM, 2016.
[24] A. Barbalace, R. Lyerly, C. Jelesnianski, A. Carno, H.-R. Chuang, V. Legout, and B. Ravindran, "Breaking the boundaries in heterogeneous-ISA datacenters," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pp. 645–659, 2017.
[25] "2nd Generation Intel Core vPro Processor Family," tech. rep., Intel, 2008.
[26] "Intel GO Autonomous Driving Solutions," tech. rep., Intel, 2017.
[27] D. Wong and M. Annavaram, "KnightShift: Scaling the energy proportionality wall through server-level heterogeneity," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 119–130, IEEE Computer Society, 2012.
[28] J. Mars and L. Tang, "Whare-map: Heterogeneity in homogeneous warehouse-scale computers," ACM SIGARCH Computer Architecture News, vol. 41, pp. 619–630, ACM, 2013.
[29] V. Petrucci, M. A. Laurenzano, J. Doherty, Y. Zhang, D. Mosse, J. Mars, and L. Tang, "Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 246–258, IEEE, 2015.
[30] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, "Composite cores: Pushing heterogeneity into a core," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 317–328, IEEE Computer Society, 2012.
[31] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, "DynaMOS: Dynamic schedule migration for heterogeneous cores," in Proceedings of the 48th International Symposium on Microarchitecture, pp. 322–333, ACM, 2015.
[32] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. G. Dreslinski, T. F. Wenisch, and S. Mahlke, "Exploring fine-grained heterogeneity with composite cores," IEEE Transactions on Computers, vol. 65, no. 2, pp. 535–547, 2016.
[33] A. Lukefahr, S. Padmanabha, R. Das, R. Dreslinski Jr., T. F. Wenisch, and S. Mahlke, "Heterogeneous microarchitectures trump voltage scaling for low-power cores," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp. 237–250, ACM, 2014.
[34] M. Gupta, V. Sridharan, D. Roberts, A. Prodromou, A. Venkat, D. Tullsen, and R. Gupta, "Reliability-aware data placement for heterogeneous memory architecture," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 583–595, IEEE, 2018.
[35] S. Srinivasan, N. Kurella, I. Koren, and S. Kundu, "Exploring heterogeneity within a core for improved power efficiency," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 1057–1069, 2016.
[36] H. R. Ghasemi, U. R. Karpuzcu, and N. S. Kim, "Comparison of single-ISA heterogeneous versus wide dynamic range processors for mobile applications," in 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 304–310, IEEE, 2015.
[37] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, "Trace based phase prediction for tightly-coupled heterogeneous cores," in Proceedings of the 46th Annual IEEE/ACM International …
Symposium on Microarchitecture, pp. 445–456, ACM, 2013. [20] S. K. Bhat, A. Saya, H. K. Rawat, A. Barbalace, and B. Ravindran, [38] M. A. Kinsy, S. Khadka, M. Isakov, and A. Farrukh, “Hermes: “Harnessing energy efficiency of heterogeneous-isa platforms,” Secure heterogeneous multicore architecture design,” in Hardware ACM SIGOPS Operating Systems Review, vol. 49, no. 2, Oriented Security and Trust (HOST), 2017 IEEE International pp. 65–69, 2016. Symposium on, pp. 14–20, IEEE, 2017. [21] A. Venkat, Breaking the ISA Barrier in Modern Computing. [39] K. Van Craeynest and L. Eeckhout, “Understanding fundamental PhD thesis, UC San Diego, 2018. design choices in single-isa heterogeneous multicore architectures,” [22] W. Lee, D. Sunwoo, C. D. Emmons, A. Gerstlauer, and L. K. ACM Transactions on Architecture and Code Optimization John, “Exploring heterogeneous-isa core architectures for high- (TACO), vol. 9, no. 4, p. 32, 2013. performance and energy-efficient mobile socs,” in Proceedings of [40] K. Van Craeynest, S. Akram, W. Heirman, A. Jaleel, and the on Great Lakes Symposium on VLSI 2017, pp. 419–422, L. Eeckhout, “Fairness-aware scheduling on single-isa heteroge- ACM, 2017. neous multi-cores,” in Parallel Architectures and Compilation [23] J. Nider and M. Rapoport, “Cross-isa container migration,” in Techniques (PACT), 2013 22nd International Conference on, Proceedings of the 9th ACM International on Systems and pp. 177–187, IEEE, 2013. [41] S. Akram, J. B. Sartor, K. V. Craeynest, W. Heirman, and Mobile Systems, Applications, and Services, pp. 319–331, ACM, L. Eeckhout, “Boosting the priority of garbage: Scheduling 2017. collection on heterogeneous multicore processors,” ACM [59] A. Barbalace, M. Sadini, S. Ansary, C. Jelesnianski, A. Ravichan- Transactions on Architecture and Code Optimization (TACO), dran, C. Kendir, A. Murray, and B. Ravindran, “Popcorn: vol. 13, no. 1, p. 4, 2016. Bridging the programmability gap in heterogeneous-isa platforms,” [42] A. 
Naithani, S. Eyerman, and L. Eeckhout, “Reliability-aware in Proceedings of the Tenth European Conference on Computer scheduling on heterogeneous multicore processors,” in High Systems, EuroSys ’15, (New York, NY, USA), pp. 29:1–29:16, Performance Computer Architecture (HPCA), 2017 IEEE ACM, 2015. International Symposium on, pp. 397–408, IEEE, 2017. [60] T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy, and [43] N. Markovic, D. Nemirovsky, O. Unsal, M. Valero, and A. Cristal, S. Hahn, “Operating System Support for Overlapping-ISA “Thread lock section-aware scheduling on asymmetric single-isa Heterogeneous Multi-Core Architectures,” in International multi-core,” IEEE Computer Architecture Letters, vol. 14, no. 2, Symposium on High Performance Computer Architecture, Jan. pp. 160–163, 2015. 2010. [44] N. Markovic, D. Nemirovsky, V. Milutinovic, O. Unsal, M. Valero, [61] E. Blem, J. Menon, and K. Sankaralingam, “Power Struggles: and A. Cristal, “Hardware round-robin scheduler for single-isa Revisiting the RISC vs. CISC Debate on Contemporary ARM asymmetric multi-core,” in European Conference on Parallel and x86 Architectures,” in International Symposium on High Processing, pp. 122–134, Springer, 2015. Performance Computer Architecture, Feb. 2013. [45] D. Nemirovsky, T. Arkose, N. Markovic, M. Nemirovsky, [62] A. S. Waterman, Design of the RISC-V Instruction Set Architecture. O. Unsal, and A. Cristal, “A machine learning approach for University of California, Berkeley, 2016. performance prediction and scheduling on heterogeneous cpus,” [63] A. Raveendran, V. B. Patil, D. Selvakumar, and V. Desalphine, in Computer Architecture and High Performance Computing “A risc-v instruction set processor-micro-architecture design (SBAC-PAD), 2017 29th International Symposium on, pp. 121–128, and analysis,” in VLSI Systems, Architectures, Technology and IEEE, 2017. Applications (VLSI-SATA), 2016 International Conference on, [46] L. Sawalha, S. Wolff, M. P. Tull, and R. D. 
Barnes, “Phase-guided pp. 1–7, IEEE, 2016. scheduling on single-isa heterogeneous multicore processors,” in [64] J. Huh, D. Burger, and S. W. Keckler, “Exploring the design Digital System Design (DSD), 2011 14th Euromicro Conference space of future cmps,” in Parallel Architectures and Compilation on, pp. 736–745, IEEE, 2011. Techniques, 2001. Proceedings. 2001 International Conference [47] A. Adileh, S. Eyerman, A. Jaleel, and L. Eeckhout, “Mind on, pp. 199–210, IEEE, 2001. the power holes: Sifting operating points in power-limited [65] M. Monchiero, R. Canal, and A. Gonzalez, “Power/per- heterogeneous multicores,” IEEE Computer Architecture Letters, formance/thermal design-space exploration for multicore vol. 16, no. 1, pp. 56–59, 2017. architectures,” IEEE Transactions on Parallel and Distributed [48] C. Delimitrou and C. Kozyrakis, “Paragon: Qos-aware scheduling Systems, vol. 19, no. 5, pp. 666–681, 2008. for heterogeneous datacenters,” in ACM SIGPLAN Notices, [66] M. Monchiero, R. Canal, and A. Gonzalez,´ “Design space vol. 48, pp. 77–88, ACM, 2013. exploration for multicore architectures: a power/performance/ther- [49] A. Naithani, S. Eyerman, and L. Eeckhout, “Optimizing soft mal view,” in Proceedings of the 20th annual international error reliability through scheduling on heterogeneous multicore conference on Supercomputing, pp. 177–186, ACM, 2006. processors,” IEEE Transactions on Computers, vol. 67, no. 6, [67] L. Strozek and D. Brooks, “Energy-and Area-Efficient pp. 830–846, 2018. Architectures through Application Clustering and Architectural [50] J. Chen and L. K. John, “Efficient program scheduling for Heterogeneity,” ACM Transactions on Architecture and Code heterogeneous multi-core processors,” in Design Automation Optimization, 2009. Conference, 2009. DAC’09. 46th ACM/IEEE, pp. 927–930, IEEE, [68] N. Chitlur, G. Srinivasa, S. Hahn, P. K. Gupta, D. Reddy, 2009. D. Koufaty, P. Brett, A. Prabhakaran, L. Zhao, N. Ijih, et al., [51] J. Chen and L. K. 
John, “Energy-aware application scheduling on “Quickia: Exploring heterogeneous architectures on real prototypes,” a heterogeneous multi-core system,” in Workload Characterization, in High Performance Computer Architecture (HPCA), 2012 IEEE 2008. IISWC 2008. IEEE International Symposium on, pp. 5–13, 18th International Symposium on, pp. 1–8, IEEE, 2012. IEEE, 2008. [69] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, [52] J. A. Winter and D. H. Albonesi, “Scheduling algorithms for J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and unpredictably heterogeneous cmp architectures,” in Dependable E. Rotenberg, “Fabscalar: composing synthesizable rtl designs of Systems and Networks With FTCS and DCC, 2008. DSN 2008. arbitrary cores within a canonical superscalar template,” in ACM IEEE International Conference on, pp. 42–51, IEEE, 2008. SIGARCH Computer Architecture News, vol. 39, pp. 11–22, [53] J. A. Winter, D. H. Albonesi, and C. A. Shoemaker, ACM, 2011. “Scalable thread scheduling and global power management for [70] N. Choudhary, S. Wadhavkar, T. Shah, H. Mayukh, J. Gandhi, heterogeneous many-core architectures,” in Parallel Architectures B. Dwiel, S. Navada, H. Najaf-abadi, and E. Rotenberg, and Compilation Techniques (PACT), 2010 19th International “Fabscalar: Automating superscalar core design,” IEEE Micro, Conference on, pp. 29–39, IEEE, 2010. vol. 32, no. 3, pp. 48–59, 2012. [54] E. Forbes and E. Rotenberg, “Fast register consolidation and [71] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, migration for heterogeneous multi-core processors,” in Computer M. Shahrad, A. Fuchs, S. Payne, X. Liang, M. Matl, and Design (ICCD), 2016 IEEE 34th International Conference on, D. Wentzlaff, “Openpiton: An open source manycore research pp. 1–8, IEEE, 2016. framework,” ACM SIGOPS Operating Systems Review, vol. 50, [55] E. Forbes, Z. Zhang, R. Widialaksono, B. Dwiel, R. B. R. no. 2, pp. 217–232, 2016. Chowdhury, V. Srinivasan, S. Lipa, E. 
Rotenberg, W. R. Davis, [72] M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, J. Balkind, A. Lavrov, and P. D. Franzon, “Under 100-cycle thread migration latency in M. Shahrad, S. Payne, and D. Wentzlaff, “Piton: A manycore a single-isa heterogeneous multi-core processor,” in processor for multitenant clouds,” Ieee micro, vol. 37, no. 2, 27 Symposium (HCS), 2015 IEEE, pp. 1–1, IEEE, 2015. pp. 70–80, 2017. [56] A. Venkat, A. Krishnaswamy, K. Yamada, and P. R. Shanmu- [73] K. Lim, J. Balkind, and D. Wentzlaff, “Juxtapiton: Enabling gavelayutham, “Binary translator driven program state relocation,” heterogeneous-isa research with RISC-V and SPARC FPGA Sept. 15 2015. US Patent 9,135,435. soft-cores,” CoRR, vol. abs/1811.08091, 2018. [57] “Ibm’s power chips hit the big time at google,” tech. rep., [74] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “Aladdin: A PCWorld, 2016. pre-rtl, power-performance accelerator simulator enabling large [58] W. Wang, P.-C. Yew, A. Zhai, S. McCamant, Y. Wu, and design space exploration of customized architectures,” in ACM J. Bobba, “Enabling cross-isa offloading for cots binaries,” in SIGARCH Computer Architecture News, vol. 42, pp. 97–108, Proceedings of the 15th Annual International Conference on IEEE Press, 2014. [75] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “The aladdin pp. 45–54, IEEE Computer Society Press, 1992. approach to accelerator design and modeling,” IEEE Micro, [97] S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, vol. 35, no. 3, pp. 58–70, 2015. D. M. Gallagher, and W.-m. W. Hwu, “Characterizing the impact [76] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. of predicated execution on branch prediction,” in Proceedings of Hopkins, and P. W. Markstein, “Register allocation via coloring,” the 27th annual international symposium on Microarchitecture, Computer languages, vol. 6, no. 1, pp. 47–57, 1981. pp. 217–227, ACM, 1994. [77] M. Auslander and M. Hopkins, “An overview of the pl. 8 [98] S. A. 
Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and compiler,” in ACM SIGPLAN Notices, vol. 17, pp. 22–31, ACM, W.-M. W. Hwu, “A comparison of full and partial predicated 1982. execution support for ilp processors,” ACM SIGARCH Computer [78] F. C. Chow and J. L. Hennessy, “The priority-based coloring Architecture News, vol. 23, no. 2, pp. 138–150, 1995. approach to register allocation,” ACM Transactions on [99] D. I. August, W.-m. W. Hwu, and S. A. Mahlke, “A framework Programming Languages and Systems (TOPLAS), vol. 12, no. 4, for balancing control flow and predication,” in Microarchitecture, pp. 501–536, 1990. 1997. Proceedings., Thirtieth Annual IEEE/ACM International [79] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Symposium on, pp. 92–103, IEEE, 1997. Zadeck, “Efficiently computing static single assignment form [100] M. J. Flynn, “Some computer organizations and their effectiveness,” and the control dependence graph,” ACM Transactions on IEEE transactions on computers, vol. 100, no. 9, pp. 948–960, Programming Languages and Systems (TOPLAS), vol. 13, no. 4, 1972. pp. 451–490, 1991. [101] R. Duncan, “A survey of parallel computer architectures,” [80] P. Briggs, K. D. Cooper, and L. Torczon, “Rematerialization,” Computer, vol. 23, no. 2, pp. 5–16, 1990. ACM SIGPLAN Notices, vol. 27, no. 7, pp. 311–321, 1992. [102] Intel, Intel 64 and IA-32 Architectures Software Developers [81] D. Bernstein, M. C. Golumbic, Y. Mansour, R. Y. Pinter, D. Q. Manual. Goldin, H. Krawczyk, and I. Nahshon, “Spill code minimization [103] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. techniques for optimizing compliers,” in ACM SIGPLAN Notices, Saidi, and S. K. Reinhardt, “The M5 Simulator: Modeling vol. 24, pp. 258–263, ACM, 1989. Networked Systems,” Micro, IEEE, 2006. [82] R. Gupta and R. Bod´ık, “Register pressure sensitive redundancy [104] P. Smith and N. C. Hutchinson, “Heterogeneous Process elimination,” CC, vol. 99, pp. 107–121, 1999. 
Migration: The Tui System,” Software-Practice and Experience, [83] K. D. Cooper and L. T. Simpson, “Live range splitting in a 1998. register allocator,” in International Conference on [105] F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” in Compiler Construction, pp. 174–187, Springer, 1998. USENIX Technical Conference, Apr. 2005. [84] P. Zhao and J. N. Amaral, “To inline or not to inline? enhanced [106] B. Solomon, A. Mendelson, R. Ronen, D. Orenstien, and inlining decisions,” in International Workshop on Languages and Y. Almog, “Micro-operation cache: A power aware frontend for Compilers for Parallel Computing, pp. 405–419, Springer, 2003. variable instruction length isa,” IEEE Transactions on Very Large [85] ARM Limited, ARM Cortex-A Series Programmers Guide for Scale Integration (VLSI) Systems, vol. 11, no. 5, pp. 801–811, ARMv8-A. 2003. [86] L. George and A. W. Appel, “Iterated register coalescing,” [107] M. Taram, A. Venkat, and D. Tullsen, “Mobilizing the micro-ops: ACM Transactions on Programming Languages and Systems Exploiting context sensitive decoding for security and energy (TOPLAS), 1996. efficiency,” in 2018 ACM/IEEE 45th Annual International [87] G.-Y. Lueh, T. Gross, and A.-R. Adl-Tabatabai, “Fusion- Symposium on Computer Architecture (ISCA), pp. 624–637, based register allocation,” ACM Transactions on Programming IEEE, 2018. Languages and Systems (TOPLAS), 2000. [108] M. Taram, A. Venkat, and D. M. Tullsen, “Context-sensitive fenc- [88] P. Bergner, P. Dahl, D. Engebretsen, and M. O’Keefe, “Spill code ing: Securing speculative execution via microcode customization,” minimization via interference region spilling,” ACM SIGPLAN in Proceedings of the Twenty-Fourth International Conference Notices, 1997. on Architectural Support for Programming Languages and [89] C. Lattner and V. Adve, “Llvm: A compilation framework for Operating Systems, 2019. lifelong program analysis & transformation,” in Proceedings of [109] V. Madduri, R. 
Segelken, and B. Toll, “Method and apparatus the international symposium on Code generation and optimization: for variable length instruction parallel decoding,” in United feedback-directed and runtime optimization, p. 75, IEEE Computer States Patent Application 10/331,335. Society, 2004. [110] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, [90] B. C. Lopes, R. Auler, L. Ramos, E. Borin, and R. Azevedo, “Automatically Characterizing Large Scale Program Behavior,” “Shrink: Reducing the isa complexity via instruction recycling,” in International Conference on Architectural Support for in ACM SIGARCH Computer Architecture News, 2015. Programming Languages and Operating Systems, Oct. 2002. [91] E. Borin, M. Breternitz, Y. Wu, and G. Araujo, “Clustering-based [111] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, microcode compression,” in Computer Design, 2006. ICCD 2006. and B. Calder, “Using SimPoint for Accurate and Efficient International Conference on, pp. 189–196, IEEE, 2007. Simulation,” in ACM SIGMETRICS Performance Evaluation [92] E. Borin, G. Araujo, M. Breternitz Jr, and Y. Wu, “Structure- Review, June 2003. constrained microcode compression,” in Computer Architecture [112] Intel Corporation, Intel R 64 and IA-32 Architectures Optimization and High Performance Computing (SBAC-PAD), 2011 23rd Reference Manual. No. 248966-018, March 2009. International Symposium on, pp. 104–111, IEEE, 2011. [113] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, [93] E. Borin, G. Araujo, M. Breternitz, and Y. Wu, “Microcode and N. P. Jouppi, “Mcpat: An integrated power, area, and timing compression using structured-constrained clustering,” International modeling framework for multicore and manycore architectures,” Journal of Parallel Programming, vol. 42, no. 1, pp. 140–164, in Proceedings of the 42Nd Annual IEEE/ACM International 2014. Symposium on Microarchitecture, MICRO 42, (New York, NY, [94] A. 
Baumann, “Hardware is the new software,” in Proceedings of USA), pp. 469–480, ACM, 2009. the 16th Workshop on Hot Topics in Operating Systems, HotOS [114] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, ’17, 2017. A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. [95] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, “Conversion Peterson, et al., “Xsede: accelerating scientific discovery,” of control dependence to data dependence,” in Proceedings of Computing in Science & Engineering, vol. 16, no. 5, pp. 62–74, the 10th ACM SIGACT-SIGPLAN symposium on Principles of 2014. programming languages, pp. 177–189, ACM, 1983. [115] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and [96] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. D. Burger, “Dark Silicon and the End of Multicore Scaling,” in Bringmann, “Effective compiler support for predicated execution International Symposium on Computer Architecture, June 2011. using the hyperblock,” in ACM SIGMICRO Newsletter, vol. 23,