Technical Report

Exploring Opportunities for Heterogeneous-ISA Core Architectures in High-Performance Mobile SoCs

Wooseok Lee, Dam Sunwoo, Christopher D. Emmons, Andreas Gerstlauer, and Lizy John

UT-CERC-17-01

March 10, 2017

Computer Engineering Research Center
Department of Electrical & Computer Engineering
The University of Texas at Austin

201 E. 24th St., Stop C8800
Austin, Texas 78712-1234
Telephone: 512-471-8000
Fax: 512-471-8967
http://www.cerc.utexas.edu

Wooseok Lee¹, Dam Sunwoo², Christopher D. Emmons², Andreas Gerstlauer¹, and Lizy K. John¹
¹The University of Texas at Austin, ²ARM Research

ABSTRACT

High-performance processors use more transistors to deliver better performance. This comes at a cost of higher power consumption, often lowering energy efficiency. Conventional approaches mitigate this problem by adding heterogeneous energy-efficient cores for matching tasks. However, reducing high-performance components, such as caches or out-of-order processing capabilities, broadly affects performance. In this paper, we instead explore opportunities to increase energy efficiency by providing cores with restricted functionality, but without necessarily impacting performance. We aim to achieve this by removing support for complex but less frequently executed instructions. Since instruction mixes used by real-world workloads are often heavily biased, the potential to remove unused or less frequently used instructions is relatively high.

To explore such opportunities, we investigate which instructions are worthwhile to remove, how frequently they are used, and how much performance degradation is expected after removing direct hardware support. We analyze a subset of instructions in the ARM ISA and their corresponding logic burden in the microarchitecture. Furthermore, we present instruction profiling results and show performance degradation when candidate instructions are removed. Experimental results show that removing most of the identified complex instructions has negligible impact on performance except for NEON instructions, which result in large performance degradations for floating-point-oriented workloads. We further propose a heterogeneous-ISA system to achieve energy efficiency without performance degradation using a system architecture that combines both full- and reduced-ISA cores. Results show that by providing the flexibility of heterogeneous-ISA cores, the proposed system can improve energy efficiency by 12% on average and up to 15% for applications that do not require NEON support, all without performance overhead.

1. INTRODUCTION

Current mobile systems-on-chips (SoCs) incorporate numerous heterogeneous computing resources in an effort to improve energy efficiency. Specialized processing elements provide energy-efficient computing by offloading various types of computations onto dedicated accelerators. For general-purpose workloads, architects have traditionally spent available transistors to increase processor performance. However, this also leads to increased power consumption, often deteriorating energy efficiency. The conventional approach to handle this problem is to take advantage of heterogeneity at the system level by switching between high-performance and energy-efficient cores [3]. Such heterogeneous systems match application demands with core types to maximize energy efficiency. In such systems, the heterogeneity lies in the microarchitecture. In bigger cores, more transistors are spent on components that improve performance, such as caches, branch predictors, and out-of-order processing capabilities. In small cores, reducing the amount of performance-relevant resources can, however, be detrimental to some workloads.

An alternate approach is to implement energy-efficient cores by restricting functionality instead of giving up performance. Specifically, by reducing resources that have less impact on performance, power dissipation can be alleviated without losing significant performance. The benefit of implementing certain features in the Instruction Set Architecture (ISA) is highly dependent on workloads. Not all instructions are frequently used by every workload. If a particular workload favors specific instructions that are not directly supported by the hardware, performance dramatically decreases. By contrast, there is little performance degradation if those instructions are not frequently used.

In this paper, we explore opportunities for systems comprised of heterogeneous reduced-ISA cores to improve energy efficiency. We first identify a subset of instructions that are complex to implement, but are less frequently used. We find that ISAs are often broadly inclusive in that they incorporate instructions to support a variety of workloads across generations. However, once instructions are defined in an ISA, backwards compatibility requires all future processors to implement them whether they are frequently used or not. These less frequently used components add complexity inside the microarchitecture while contributing little to overall performance.

We identify candidate instructions that are complex to implement and profile how frequently the selected instructions are used in state-of-the-art ARM-based systems. Results show that workloads are biased towards certain instructions depending on the application or program phase. Basic instructions for essential data processing such as add, branch, move, load, and store are frequently used across all benchmarks. However, subsets of instructions are selectively or less frequently used, suggesting opportunities to remove direct hardware support for them. Since instruction selection depends on the compiler, we investigate results from major compilers, where results show minimal variances.

We further evaluate performance degradations if candidate instructions are removed. If performance degradation is large, it is not worthwhile to consider removing them despite the potential power benefit. We evaluate system performance with a reduced instruction set running several benchmarks. Results show that some subsets of instructions are critical to performance while others are not essential since, in most cases, their usage frequency is low enough to not impact performance significantly.

Lastly, we propose a heterogeneous-ISA architecture to obtain energy benefits while maintaining performance across a wide range of workloads. In our proposed system architecture, reduced-ISA cores remove hardware support for complex instructions, which increases energy efficiency but requires trapping and emulating non-supported instructions. When unsupported instructions are infrequent, a workload runs on the reduced-ISA core to reduce energy. By contrast, when unsupported instructions are prevalent, the workload is migrated to a traditional full-ISA core. This dynamic core switching allows the system to avoid the performance degradation and energy inefficiency of software-emulated instructions. As long as they are not performance-critical, a compiler can thereby optimize binaries to remove unsupported instructions and thus maximize residency on the reduced-ISA core. Our results show that workloads without performance-critical instructions spend most of their execution time on reduced-ISA cores, achieving energy savings of up to 15%. Workloads with frequent use of unsupported instructions execute exclusively on full-ISA cores with no change in performance or energy consumption. On average, 12% energy savings at little to no performance cost are observed across a variety of benchmarks, where applications migrate between reduced- and full-ISA cores depending on dynamically varying instruction usage.

2. ISA ANALYSIS

In this section, we discuss the relationship between performance and the versatility of an ISA. Furthermore, we analyze the influence of instructions as a source of logic. As a case study, we select the ARM v7 ISA, and one of the performance-oriented ARM processors, the Cortex-A15, as a baseline. The general concept and methodology can, however, be applied to any target.

2.1 Instructions and Performance

The performance of a processor can be determined from various factors using the so-called iron law of processor performance as follows:

    ExecTime/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)    (1)

To reduce execution time, compacting the instruction sequences is helpful for many workloads. Accordingly, instructions are carefully defined to reduce the code required for specific workloads as much as possible. In other words, removing support for specific instructions may cause an increase in instruction count, potentially hurting performance. However, since the factors are not completely independent, the performance loss of one factor can be compensated by others. For instance, by removing support for certain instructions, the cycle time or the cycles per instruction (CPI) can be improved. Moreover, as most modern processors have multiple-issue widths, any benefit from reducing a specific data path is multiplied. Even if performance benefits and losses break even, it may still be worthwhile to remove instruction support since the processor has reduced logic, leading to better energy efficiency. Exploring this trade-off is one of the goals of this research.

Minimizing the increase in instruction count due to the removal of instructions is a way to improve the energy efficiency further. Applications often show diverse characteristics. Even a single application can have diverse phases throughout its execution. One characteristic is the instruction utilization pattern. Not all instructions are fully utilized by individual applications or phases. Thus, by meticulously selecting the instructions required to support a range of workloads, we can reduce the number of instructions required in a processor.

2.2 Instructions as a Source of Logic

To maximize the benefit of reducing instructions, identifying instructions that greatly impact logic complexity is essential. Removing instructions that do not result in logic reductions unnecessarily increases the danger of code bloat. We find that the overall number of instructions in an ISA is not as crucial for logic reduction as we first expected. While macro-operations are directly related to the logic in the decode stage, the rest of the pipeline is more driven by micro-operations. Basic micro-operations share many common resources, such as ALUs, that cannot be removed. Our study rather focuses on the specific semantics required by particular instructions that contribute to large logic within the microarchitecture. Since the processor is unaware of the instruction until it is decoded, we regard the fetch stage as an irrelevant block to consider for our analysis.

In the decode stage, the number of decodable instructions is a major source of logic. In order to shorten the cycle time, parallel logic is commonly used to determine the fetched instructions. For superscalar machines, the logic reduction effects are multiplied through duplicated decoder blocks. In the case of the ARM ISA, multiple instruction sets, ARM and Thumb, mandate separate decoding blocks.

One specific category of instructions that incurs complexity is load and store multiple (LDM/STM) instructions. Determining possible source or destination registers for corresponding micro-operations is difficult to implement within one pipeline stage in a multiple-issue machine. Instead, collecting possible data values is done speculatively for all instructions in parallel with decoding them. Moreover, there is overhead for propagating decoded instruction semantics through the pipeline and for exception handling of such long-running instructions in the execution back-end.

Another source of logic is predicated instructions [5]. Predicated instructions can be implemented as simple selection logic in the write-back stage. However, to avoid stalls due to data dependencies, modern high-performance processors carry both old and new data values simultaneously through the pipeline. This incurs overhead for pipeline registers and wires. In the case of floating-point operations, the data size is doubled or quadrupled.

DSP-like instructions (QADD, SSAT, etc.) [2] complicate data paths in the integer pipeline. The various types of instructions, which combine consecutive add and multiply operations, help compact the instruction set. However, since normal integer operations are most commonly used, architects tend to keep pipeline stages as short as possible, forcing a parallelized approach to implementing DSP operations. This places additional logic burdens on the pipeline.

The NEON instruction set extensions in ARM support floating-point (FP) and SIMD operations, but are a large source of logic [1, 8]. NEON instructions require extra logic blocks for FP and SIMD functionality. Additionally, predication combined with NEON operations aggravates logic complexity. A register alias table and an extra register file for NEON instructions incur logic in dispatch stages. Data paths, including data passing, write-back, and forwarding of the execution result, also make the routing complex. Lastly, a separate decoding block for NEON instructions is required.
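The iron-law trade-off from Section 2.1 (Equation 1) can be made concrete with a small numeric sketch. The figures below (instruction-count growth, cycle-time reduction) are hypothetical illustrations, not measurements from this study:

```python
# Iron law: ExecTime/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
def exec_time(insts: float, cpi: float, cycle_time_ns: float) -> float:
    """Execution time in seconds, per the iron law of processor performance."""
    return insts * cpi * cycle_time_ns * 1e-9

# Hypothetical full-ISA core.
base = exec_time(insts=1.0e9, cpi=1.0, cycle_time_ns=0.5)

# Hypothetical reduced-ISA core: removing complex instructions grows the
# instruction count by 2%, but the simpler logic shortens the cycle time by 3%.
reduced = exec_time(insts=1.02e9, cpi=1.0, cycle_time_ns=0.5 * 0.97)

print(base, reduced, reduced / base)  # reduced/base < 1: the loss in one factor is compensated
```

Even when the two effects only break even (reduced/base ≈ 1), the reduced logic can still pay off in energy, which is the trade-off explored in this paper.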

[Figure 1 (plots omitted in this text version): Instruction analysis of mobile workloads. (a) Instruction distribution of BBench; (b) instruction distribution of Angry Birds; (c) instruction analysis of BBench; (d) instruction analysis of Angry Birds. Panels report integer and NEON instruction set utilization, NEON coverage (NEON instructions per 1000), predication rates for ARM and Thumb, decodable instructions (T16/T32/A32/NEON), and LD/ST multiple and DSP-like instruction usage.]

3. PROFILING INSTRUCTIONS

After identifying a few candidate categories of instructions to potentially remove (LDM/STM, predicated instructions, DSP-like instructions, and NEON), we provide detailed profiling results in this section to observe the overall usage of our target instructions. For comprehensive exploration, we run mobile Android workloads and SPEC CPU2006 benchmarks for profiling and performance evaluation.

3.1 Instructions in Android

To observe the frequency of our target instructions, we present detailed analyses of instructions for two mobile applications, BBench [15] and Angry Birds, running on Android 4.2 (using the Dalvik JIT compiler). We collect selective instruction information from BBench and Angry Birds, running SimPoints on gem5 with replay of previously recorded user interactions [22]. All of the macro instructions are dumped and parsed with custom Python scripts. Since this approach takes advantage of SimPoints to reduce simulation time, analysis results are appropriately scaled according to SimPoint weights. One minor disadvantage of gem5 is the absence of a GPU model, resulting in software-based OpenGL support.

Figures 1(a) and 1(b) show histograms of individual instruction frequencies, including the overall mix of instruction types separately for the integer and NEON instruction categories. To better interpret the numbers, we also provide dynamic instruction counts that account for more than 0.1% and 0.0001% of all instructions. Interestingly, the amount of instructions that show up frequently and thus may affect performance is small. Less than 30% of integer instructions are used for more than 0.1% of dynamic instructions. In the case of NEON instructions, Angry Birds uses almost none while BBench takes advantage of limited NEON support. This is attributed to the biased selection of instructions by the compilers depending on the workload.

Figures 1(c) and 1(d) show detailed breakdowns for the four sets of candidate instruction categories. We find that only a limited amount of NEON instructions is used. NEON coverage measures how many NEON instructions are used in each 1000-instruction interval (i.e., 5% means 50 out of 1000 instructions are NEON). For BBench, no NEON instruction exists for 72% of the total execution time while, again, Angry Birds uses almost none. For the predication category, we can see that predicated instructions are prevalent in the ARM ISA while a small number of IT (If-Then) instructions exists for Thumb. Since branches have conditional fields but are not classified as predicated, we excluded branch instructions from these results. Without branches, about 3% to 5% of instructions are predicated in total.

Interestingly, the numbers of instruction types (opcodes) used in both benchmarks are similar (Decodable Instructions, Figures 1(c) and 1(d)). We counted the instructions that appeared more than once during the whole execution.
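The NEON-coverage metric above can be computed directly from a dynamic instruction trace. The sketch below is our own illustration, not the paper's actual gem5/Python tooling, and the `is_neon` mnemonic test is a simplifying assumption (ARM v7 NEON/VFP mnemonics start with 'V'):

```python
# Illustrative predicate: real tooling would decode the instruction encoding;
# here we assume NEON mnemonics start with 'V' (VADD, VMUL, ...).
def is_neon(mnemonic: str) -> bool:
    return mnemonic.startswith("V")

def neon_profile(trace, interval=1000):
    """Split a dynamic instruction trace into fixed-size intervals and return
    (fraction of intervals containing no NEON instruction at all,
     per-interval NEON counts, i.e. 'NEON instructions per 1000')."""
    counts = []
    for start in range(0, len(trace), interval):
        window = trace[start:start + interval]
        counts.append(sum(1 for m in window if is_neon(m)))
    no_neon = sum(1 for c in counts if c == 0) / len(counts)
    return no_neon, counts

# Toy trace: three intervals, only the last one touches NEON.
trace = ["ADD"] * 1000 + ["LDR"] * 1000 + ["VADD"] * 50 + ["ADD"] * 950
frac, counts = neon_profile(trace)
print(frac, counts)  # 2/3 of intervals are NEON-free; 50 NEON insts in the last
```

The "no NEON for 72% of execution" figure for BBench corresponds to `frac` here; the per-interval counts give the "NEON instructions per 1000" axis of Figures 1(c)/(d).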

[Figure 2 (plots omitted in this text version): Instruction analysis of SPEC CPU2006 integer benchmarks. (a) Instruction distribution of SPEC CPU2006 INT (GCC); (b) instruction distribution of SPEC CPU2006 INT (LLVM); (c) instruction analysis of SPEC CPU2006 INT (GCC); (d) instruction analysis of SPEC CPU2006 INT (LLVM). Panels report integer and NEON instruction set utilization, NEON coverage, predication rates, decodable instructions, and LD/ST multiple and DSP-like instruction usage.]

This allows us to measure how efficiently the instruction set is used. It turns out that only a limited number of instructions are selected by the compiler on Android. Viewed another way, this implies that the logic for unused instructions can be removed. Compilers might generate some instructions very rarely. In theory, the logic for even rarely used instructions must exist to ensure proper functionality of processing those instructions. However, if we control the compiler or modify the OS to not generate, or to trap and emulate, those instructions, the corresponding logic to support them is unnecessary.

Finally, there are a limited number of LDM/STM instructions. In total, less than 4% of instructions are load/store multiples for BBench. For Angry Birds, only a limited amount is used. DSP-like operations are rarely used in either application. Among them, only the sign/zero-extension variants and bit-manipulation instructions are frequently used. Others are almost never generated by the compiler.

3.2 Instructions in SPEC

We performed a similar profiling and analysis for SPEC CPU2006 [10] benchmarks. Observing the instruction patterns of these benchmarks allows us to learn the general distribution of instructions for the broader spectrum of workloads in SPEC [21]. We ruled out Fortran-based benchmarks because of compiler issues and consequently selected 13 benchmarks (9 INT and 4 FP). We use the test input sets for instruction and power analysis and the reference inputs for performance evaluation.

Compilers translate high-level source code into various instruction patterns. These instruction patterns can depend on and vary between compilers. To measure the variances in generated patterns, we compare the output of the two most prominent compilers, GCC [7] and LLVM [17]. In our paper, all SPEC binaries are generated using the -O3 and -mcpu=cortex-a15 flags with static library linking.

Figure 2 shows detailed results of our analysis. We weighted each benchmark equally using arithmetic means. In addition, to avoid interference from different types of workloads, we show analysis results for integer benchmarks only. Due to space limitations, FP benchmark results are excluded.

The instructions generated by the two compilers are similar but not identical. Even though the instructions accounting for more than 5% are the same in both compilers (Figures 2(a) and 2(b)), the distribution and other minor instructions are different. For instance, while GCC prefers to use more branches, LLVM favors predicated instructions and moves more data. Both compilers use a similar proportion of NEON instructions, where LLVM spreads the NEON instructions more evenly across the execution phases compared to GCC. Interestingly, the numbers of instructions used more than once are similar, showing that both compilers utilize a limited number of instructions in the end.

[Figure 3 (plot omitted): Performance of ISA without NEON instructions — normalized execution time (log scale) and % of 1K intervals with NEON instructions per benchmark.]

[Figure 4 (plot omitted): Performance of ISA without conditional instructions — % of conditional instructions and normalized execution time per benchmark.]

[Figure 5 (plot omitted): Performance of ISA without Load/Store Multiple instructions — % of ldm&pop and stm&push instructions and normalized execution time per benchmark.]

4. REDUCED-ISA PERFORMANCE IMPACT

Based on profiling results in Section 3 and the identified relationship between instructions and logic overhead in Section 2, we define a few reduced instruction sets as case studies and observe the performance overhead of each with SPEC CPU2006 benchmarks. For this study, we modify the compiler to generate binaries in which candidate instructions are removed. In general, rather than relying on subsequent instruction emulation, compilers can be controlled to produce desired instruction patterns. By removing unsupported instructions without otherwise affecting the overall program flow, benefits can be maximized. We modified LLVM to generate all binaries, with some minor changes to limit the generation of specific instructions. A dynamic instruction trace was generated the same way as in Section 3. The execution time was measured on an Arndale board (dual-core Cortex-A15) [6]. We ran the benchmarks as a single thread, one at a time, on Linux. In addition, to avoid possible CPU throttling, the processor was set to run the applications at 1.2GHz while measuring total execution time.

4.1 Performance Trade-off

Since removing instruction support might be detrimental to performance, care should be taken when refining instructions. The benefit should be larger than the loss. The benefit is in logic reduction and lower power consumption, while the loss is in performance.

Measuring the performance loss is a challenging task. Instruction patterns are so delicate that a small change in either the source code or compiler options may result in quite different patterns. Due to the fragility of patterns, the output binaries can vary a lot. Nevertheless, despite such differences, the number of instructions that are frequently used is still limited. Hence, with millions of instructions from several benchmarks, we can measure the overall role of certain instructions and the effects of removing them.

NEON instructions are a large source of logic to sustain performance. In other words, the performance loss from removing NEON instructions is also large if they are heavily used. To measure the performance impact of NEON instructions, we compare the execution time of SPEC benchmarks with and without NEON instructions. Note that since we could not fully remove all NEON code from some of the pre-compiled libraries, a negligible number of NEON instructions still exists. Figure 3 shows the detailed performance loss and original NEON instruction usage for each benchmark. For consistency with Figures 1(c)/(d) and 2(c)/(d), we report NEON coverage here as the fraction of 1K instruction intervals that contain NEON instructions. For floating-point benchmarks, the performance slowdown is significant, ranging from about 6 to 23 times. When removing floating-point support from the ISA, there is no other choice for the compiler than to resort to soft-float emulation. By contrast, for integer benchmarks, which sporadically use NEON instructions, the performance degradation is less than 7%, with the exception of omnetpp. Since omnetpp contains a large number of NEON instructions, its 37% performance degradation is understandable. These profiling results match prior work [11].

Figure 4 shows results for a reduced ISA that excludes predicated instructions. Avoiding the use of conditional instructions is done by disabling the SimplifyCFG optimization pass in LLVM. When removing this optimization pass, the compiler also loses the chance to generate optimal code. Thus, the slowdown is not solely from removing conditional instructions. However, since we observed that only a very small number of instructions are changed compared to full-ISA binaries, we regard such effects as negligible. Overall, the performance loss is less than 5% except for hmmer. Removing conditional instructions results in an increased number of branches in the code. Results indicate that the branch predictor of a modern high-performance processor is good enough to handle the removal of predicated instructions. In the case of hmmer, further analysis showed that the single hot loop in which most of the execution time is spent exhibits significantly worse branch predictor performance, causing the severe performance degradation.

Load and store multiple instructions provide a way to reduce a number of consecutive load/store operations. However, since modern instruction prefetch/fetch and out-of-order logic can hide such an increase in instruction count, the

[Figure 6 plot omitted: sensitivity of the baseline, nocond, noldm, and noneon configurations to component-size changes.]
Figure 6: Microarchitectural sensitivity of reduced ISAs to changes in sizes of different microarchitectural components.

penalty of replacing them with separate load/store instructions is somewhat mitigated. Figure 5 shows the detailed performance with and without load/store multiple instructions. The performance loss is negligible across benchmarks.

We find that current compilers hardly use DSP-like instructions for most types of workloads. Some of them are never used by the LLVM compiler back-end in the first place. Consequently, it is meaningless to conduct a performance comparison with and without these DSP-like instructions. However, despite the fact that we skipped the performance evaluation, a reduced ISA without DSP-like instructions to reduce the logic burden is still valid.

4.2 Microarchitectural Sensitivity

The role of high-performance microarchitecture components such as caches or branch predictors is to streamline the flow of instructions. Normally, architects design and optimize them under the assumption of implementing all instructions defined in the ISA. However, since we limit and change the patterns of the instruction stream, pre-optimized parameters might not be valid with a reduced ISA. The possibility of suboptimal configurations may affect performance as well as energy efficiency. If the reduced ISA increases pressure on the branch predictor due to the additional branches without conditional instructions, the branch predictor should be reinforced, potentially negating the benefits that we get.

To evaluate these effects, we measure the sensitivity of a reduced ISA to changes in microarchitectural components using fractional factorial experiments [22] on SPEC benchmarks in gem5 (Figure 6). Such experiments have been used in the past for identifying microarchitectural components critical to performance. They measure the normalized performance impact of variations in component parameters over a range of experiments. We use this setup to identify how much each reduced instruction set can potentially benefit from an increase in various microarchitectural resources. Our experiments show that the synthetic change of reduced instruction patterns does not benefit from nor require any additional hardware in the high-performance components.

5. HETEROGENEOUS-ISA SYSTEM

In this paper, we propose adding heterogeneity at the ISA level. Reducing the ISA, instead of high-performance components, creates an energy-performance trade-off along a different dimension. Since applications show diverse characteristics, not all instructions in the ISA are required for every application. Some workloads require a very small subset of instructions, while others take advantage of most instructions in the ISA. In particular, frequently executed hot functions often show a preference for a small number of instructions, providing the potential to run them on a reduced-ISA core. Figure 7(a) shows the concept of a heterogeneous system with reduced-ISA cores in comparison to traditional heterogeneous systems. In the remainder of the paper, we define rISA as the reduced ISA with all the previously mentioned instructions removed.

5.1 Full- and Reduced-ISA Cores

We propose a heterogeneous system that contains a combination of both full- and reduced-ISA cores (Figure 7(b)). Traditional full-ISA cores execute applications with no change in performance or energy consumption. By contrast, by cutting down on the instructions supported by the underlying microarchitecture, the logic complexity of reduced-ISA cores is decreased. This allows such cores to achieve lower power consumption. At the same time, unsupported instructions need to be trapped and software-emulated. As discussed in Section 4, a compiler can create binaries that are optimized for execution on the reduced-ISA core. Nevertheless, if removed instructions are critical, as is the case with NEON instructions for floating-point workloads, performance suffers severely.

For workloads with performance-sensitive unsupported instructions, a heterogeneous-ISA system can alleviate the performance degradation. When running an application in which performance-sensitive instructions are not prevalent, it is better to run it on reduced-ISA cores. However, when the application is in a phase where it includes performance-sensitive instructions, the application switches to the full-ISA core, where it does not lose any performance. The core switching overhead depends on the granularity and subsequent frequency of switching. We present evaluation results for varying core switching granularities in Section 6.

5.2 Dynamic Core Switching

Dynamically monitoring the instruction streams and switching cores based on performance is the best way to maximize energy savings. In this approach, the process scheduler can dynamically map applications to an appropriate ISA core based on the information from hardware performance counters, such as the number of NEON instructions executed during the current scheduling period. As mentioned above, we assume that unsupported instructions are executed using emulation after undefined-instruction exceptions on the reduced-ISA core. This will only be acceptable for infrequent use of the software-emulated instructions. Once the number of software-emulated instructions exceeds a certain threshold within a certain time period, the application should be migrated to a full-ISA core to avoid further performance degradation. On a full-ISA core, on the other hand, the same instructions are also monitored. If the instructions are not used during a certain time period, the application is

Performance Performance Reduced Reduced reduced Core Core ISA instructions ISA Core

Core Time Out of Order Complex RISC RISC context switch RISC Complex RISC switching

Performance Reduced Our overhead Approach ISA Performance Energy Core Core Traditional Core Approach In-order Low (a) A heterogeneous system with a reduced-ISA core (b) Full- and reduced-ISA core heterogeneous system Figure 7: Heterogeneous-ISA system.

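The period-based switching policy of Section 5.2 can be sketched as a small Python simulation. This is a minimal sketch, not the paper's implementation: the class name, the threshold values, and the convention of passing the NEON counter value once per period are illustrative assumptions, while the branch structure mirrors the pseudocode of Algorithm 1.

```python
# Minimal sketch of the period-based core-switching policy (Section 5.2).
# Threshold values and calling convention are illustrative assumptions.

NEON_THRESHOLD = 2000    # assumed neon_threshold (NEON instructions per period)
SWITCH_THRESHOLD = 1     # quiet periods tolerated before fISA -> rISA


class CoreSwitcher:
    def __init__(self):
        self.core = "fISA"        # assume we start on the full-ISA core
        self.neon_low_ticks = 0

    def schedule(self, neon_cnt):
        """Pick the core for the next period from this period's NEON count."""
        if neon_cnt <= NEON_THRESHOLD:
            if self.core == "fISA":
                if self.neon_low_ticks >= SWITCH_THRESHOLD:
                    self.core = "rISA"   # few NEON ops: move to the efficient core
                else:
                    self.neon_low_ticks += 1
        else:                            # NEON instructions above threshold
            if self.core == "rISA":
                self.core = "fISA"       # avoid further emulation penalty
                self.neon_low_ticks = 0
        return self.core
```

With SWITCH_THRESHOLD set to one, a single quiet period is tolerated before migrating down to the reduced-ISA core, which dampens the ping-pong switching behavior discussed in Section 5.2.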
For optimal energy and performance balance, dynamic core scheduling is imperative. Algorithm 1 shows an example of core scheduling for heterogeneous systems with and without NEON support.

Algorithm 1 Dynamic Core Switching
  function Schedule()
      if neon_cnt ≤ neon_threshold then
          if fISA core then
              if neon_low_ticks ≥ switch_threshold then
                  next_core ← rISA
              else
                  neon_low_ticks ← neon_low_ticks + 1
              end if
          end if
      else                                ▷ NEON instructions more than threshold
          if rISA core then
              next_core ← fISA
              neon_low_ticks ← 0
          end if
      end if
      neon_cnt ← 0
  end function

At the end of each scheduling period, the scheduler determines which core the application should run on based on NEON instruction counts. If there are more NEON instructions than a predetermined threshold, it is better to run the application on a full-ISA core to reduce performance loss. In the opposite case, it is better to run the application on a reduced-ISA core for energy efficiency. Since we are predicting the next period based on the current one, instant switching from the full- to the reduced-ISA core might induce a ping-pong switching situation. Therefore, we add an optional switch_threshold variable. We evaluate how different switch thresholds affect overall performance in our experiments (see Section 6.1). Core switching happens when there are consecutive periods where NEON instruction counts are above or below the neon_threshold. For these experiments, we assume that there are only two cores, one full- and one reduced-ISA, and that no other programs are running.

5.3 Estimation of Power Reduction

Reducing the ISA will reduce the amount of logic. However, the effectiveness varies depending on the target ISA, target instruction, implementation methodology, microarchitecture, etc. In particular, the benefit of reducing logic and routing complexity depends on how aggressive the performance target is. The higher the targeted clock frequency, the larger the benefit of a reduced-ISA core. Thus, we explore the benefit of a reduced ISA in the context of a high-performance baseline, targeting the reduction of power consumption and proposing a new heterogeneous system that attains the maximum benefit from such reduced-ISA cores.

While implementing a real reduced-ISA core is the best way to prove energy efficiency benefits, designing an actual new processor is extremely challenging. Thus, we use McPAT [18] to estimate power consumption. To quantify the benefits of reduced-ISA cores, we first present implementation details of cells and clock distribution networks.

Cells, either standard or customized, have diverse characteristics depending on the design goal. For less power-hungry processors, high or standard threshold voltage (HVt or SVt) cells can deliver relatively less leaky logic and small area at the expense of higher gate delay. By contrast, low threshold voltage (LVt) cells have relatively faster gates, but incur higher leakage power and larger area. Thus, a simple way to achieve both performance and low power consumption is to implement logic with HVt and SVt cells and later optimize only the timing-critical paths with LVt cells [12]. Unfortunately, LVt cells not only have higher area and power themselves, but also have a negative compounding effect on pre-existing cells. The replaced cells aggravate wire delays since the area expansion of cells in congested areas increases the distance to other cells. Subsequently, with a larger design, disproportionately more LVt cells are needed to compensate for the increased delay, causing significant increases in power during the final stages of physical design.

Clock networks are another source of dynamic power consumption. Since synchronous logic requires a common clock source across the whole design, clock trees must be designed with special care. With larger logic, additional buffers are needed to adjust the jitter for stability and maximum performance. Due to its toggling nature, the clock distribution network is a large source of dynamic power consumption.

In summary, additional logic often comes with a large overhead, requiring additional area, wiring, expensive cell types, and larger clock trees. Conversely, a logic reduction can help reduce static and dynamic power significantly.

Unfortunately, there are limitations when estimating power consumption using McPAT. First, it has no notion of implementation details such as cell types. Also, parameterized models abstract away physical information and the effects of logic reductions after actual place and route is done. We estimate such effects by adjusting numbers under certain assumptions about physical and implementation details.

In general, we consider logic, wire, and clock reduction effects as well as the subsequent replacement of LVt cells for power modeling (Table 1). We use estimates of the affected area and corresponding logic reductions to obtain power reduction factors by which we scale static and dynamic power estimates from McPAT for the instruction decoding (ID), rename/dispatch (RE), and execution (EX) blocks. In addition, we apply estimated power reductions in the clock distribution network (CK) of each block using clock tree area and logic estimates. Finally, we consider savings due to wiring and LVt cell reductions, where we account for additional synergistic savings when all rISA variants are combined.

Table 1: Estimated rISA logic and area reduction.

                     Logic Reduction                     Affected Area
  ISA       ID     RE     EX             CK              ID     RE     EX
  nocond    1%     3%     3%             10%             1      1      0.3
  noldm     10%    10%    5%             5%              0.2    0.4    1
  nodsp     0%     0%     10% (Integer)  10% (Integer)   1      1      0.5
  noneon    40%    50%    10%            10%             0.3    1      1

  Clock distribution network area: 0.2
  LVt cells: 0.1 (pes) / 0.3 (normal) / 0.5 (opt) of total cells with 10% reduction.
  Synergistic combined LVt reduction: 0.02 (pes) / 0.05 (normal) / 0.1 (opt) of total cells.

As accurately quantifying power consumption is challenging, we give a range of power estimates: opt (optimistic) indicates that the effect of the logic reduction is assumed to be relatively large, while pes (pessimistic) means the opposite (Table 1 and Figure 8). The runtime power reduction for removing most instruction types is about 2% each, except for NEON instructions, which show about 9% improvement. Due to the orthogonality of the removed instructions, we assume that the logic related to each instruction type is independent and, thus, that power reductions are additive. Our rISA core shows about 15% power reduction compared to the full-ISA one.

Figure 8: Power estimation of reduced ISA (SPEC CPU2006 INT). (Normalized power of llvm.O3, nodsp, noldm, nocond, noneon, and rISA under pes/normal/opt estimates.)

6. EXPERIMENTS AND RESULTS

6.1 Performance

To observe the performance degradation under the core switching scenario, we evaluate SPEC CPU2006 integer benchmarks using QEMU [9] while collecting a trace of all NEON instructions. We focus on NEON instructions, as other removed instructions have little impact on performance, as previously shown. We simulate core switching using the previously shown scheduling algorithm, which accounts for the number of NEON instructions in each scheduling period. Since switching is not free, each core switch incurs a fixed instruction overhead. Furthermore, since emulating NEON instructions leads to additional instructions, every time a NEON instruction is executed, an associated instruction penalty is added to the total instruction count.

Table 2 shows the detailed parameters used for our evaluation. For optimal scheduling decisions (Algorithm 1), two parameters are important. First, the neon_threshold determines the appropriate core for the next scheduling period. To evaluate how it affects energy and performance, we vary the NEON threshold throughout our experiments.

Figure 9: Runlength of SPEC CPU2006 bzip2 with NEON overhead of 20/40/80 instructions. (Normalized instruction count vs. NEON threshold.)

Figure 10: Core residency of SPEC CPU2006 bzip2. (Scheduling period share on each core and number of core switches vs. NEON threshold.)
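The evaluation model of Sections 5.3 and 6 can be sketched as a short Python function: instruction counts are inflated by NEON emulation and core-switching overheads, then energy is estimated as power times instructions and EDP as power times instructions squared, assuming a constant CPI. This is a sketch, not the authors' tooling; only the roughly 15% rISA power reduction, the 20-instruction NEON penalty, and the 3K-instruction switching overhead are taken from the paper, while the function shape and per-period bookkeeping are assumptions.

```python
# Sketch of the evaluation model (Sections 5.3 and 6) under stated
# assumptions: emulated NEON ops cost extra instructions on the rISA
# core, each core switch costs a fixed instruction overhead, and
# energy/EDP follow from a constant-CPI approximation.

NEON_PENALTY = 20        # instructions per emulated NEON op (Table 2)
SWITCH_OVERHEAD = 3000   # instructions per core switch (Table 2, 3K)
POWER = {"fISA": 1.00, "rISA": 0.85}  # normalized power; ~15% rISA reduction


def evaluate(periods, period_len=1_000_000):
    """periods: iterable of (core, neon_cnt), one entry per scheduling period."""
    instr = 0
    weighted_power = 0.0
    switches = 0
    prev = None
    for core, neon_cnt in periods:
        n = period_len
        if core == "rISA":
            n += neon_cnt * NEON_PENALTY      # software-emulation overhead
        if prev is not None and core != prev:
            n += SWITCH_OVERHEAD              # migration cost
            switches += 1
        instr += n
        weighted_power += POWER[core] * n     # power weighted by instructions
        prev = core
    avg_power = weighted_power / instr
    energy = avg_power * instr                # Energy ~ Pwr * Instr.
    edp = avg_power * instr ** 2              # EDP    ~ Pwr * Instr.^2
    return instr, energy, edp, switches
```

In this model, an application that stays on the rISA core with few emulated instructions approaches the full power saving, while frequent switching or NEON-heavy periods erode both performance and energy efficiency.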
By contrast, we observed negligible performance differences as we change the switch threshold. Thus, we fix its value to one. The core switching overhead is estimated from ARM's big.LITTLE scenario, where we account for the amount of task state to be migrated. We observed negligible performance differences under varying switching overheads.

Table 2: Scheduling Parameters

  Parameter                    Value
  Scheduling Unit              1,000,000 instructions
  Switch Threshold             1
  NEON Instruction Overhead    20/40/80 instructions
  Core Switching Overhead      3K instructions

Reducing the emulation penalty is crucial for removing instructions. Ideally, it is desirable for an application to run entirely on the reduced-ISA core. However, we find that the penalty to emulate NEON instructions is critical to performance (Figure 9). We estimate the NEON instruction penalty in increments of 20 instructions, corresponding to the maximum 20x performance degradation of FP benchmarks in Section 4. An increase in the penalty per NEON instruction from 20 to 80 directly affects overall performance.

Figure 10 details the fraction of scheduling periods bzip2 is running on each core and the corresponding number of core switches. The higher the NEON threshold for scheduling, the longer the application can run on the reduced-ISA core. This, however, results in more NEON instructions requiring emulation and, thus, an increase in instruction counts. However, the increase is relatively small. In addition, the number of core switches is low. As such, the switching overhead is not crucial for overall performance, and a large switching overhead does not result in significant performance degradation. In other words, lowering the penalty allows more NEON instructions to run on the reduced-ISA core. The performance evaluation of other workloads is discussed in Section 6.2.

6.2 Energy Estimation

Based on the results from Sections 6.1 and 5.3, we estimate the energy efficiency of our heterogeneous system. The estimated average power of each benchmark is multiplied by the total number of instructions and by the square of the total instructions to compute energy and energy-delay product (EDP), respectively, assuming a constant CPI per benchmark. Furthermore, the computed energy is normalized to the energy and EDP of the full-ISA core in order to observe how much energy savings our system can achieve. These experiments are conducted with a 20-instruction NEON penalty and a core switching overhead of 3000 instructions.

Figure 11: Performance evaluation and energy estimation. (a) SPEC CPU2006 integer benchmarks with NEON instructions; the compiler statically removes LDM/STM, DSP-like, and conditional instructions. (b) SPEC CPU2006 integer benchmarks; frequently used NEON load/store instructions are replaced with integer load/store instructions. (Bars show fISA/rISA scheduling period share and normalized Instr., Energy (Pwr*Instr.), and EDP (Pwr*Instr.^2) vs. neon_threshold.)

Figure 11(a) shows the estimated performance and energy for SPEC CPU2006 benchmarks. Among the integer benchmarks, we exclude omnetpp and sjeng; given their high portion of NEON instructions, it is evident that they would run mostly on the full-ISA core. Since dynamic scheduling determines which core to run on next, the other applications show diverse scheduling behavior. hmmer runs on the full-ISA core most of the time due to having more NEON instructions than the threshold in each period, resulting in no performance degradation but also no energy benefit. At the opposite end of the spectrum, mcf, astar, and libquantum run most of the time on the reduced-ISA core with negligible performance degradation but significant energy benefits. Interestingly, bzip2 and gcc show unbiased behavior. In these workloads, due to the broad range of NEON instruction counts, which vary between 500 and 5000 per period, our scheduling algorithm switches cores frequently depending on the NEON threshold. This aggravates the performance loss with higher NEON penalties and thresholds, thus reducing energy efficiency.

6.3 Compiler Optimizations

To optimize energy efficiency, it is better to run applications on the reduced-ISA core as much as possible without losing any performance. This suggests that reducing the number of NEON instructions is beneficial not only to maximize the residency of workloads on reduced-ISA cores, but also to reduce performance degradation. Thus, we further limit the generation of NEON instructions by modifying the LLVM back-end. Since we observe that vector load/store instructions are frequently used in instruction optimizations, blocking such optimizations leads to fewer NEON instructions. We measure the performance of the two binaries on full-ISA cores on the real board and confirm that the performance differences of blocking such optimizations are negligible.

Figure 11(b) shows the performance and estimated energy of the modified binaries. The amount of time bzip2 and gcc run on the full-ISA core is reduced to zero. This leads to a total energy efficiency equivalent to running solely on the reduced-ISA core. hmmer still incorporates quite a few NEON instructions, causing it to stay on the full-ISA core. Experimental results show that energy savings of up to 15% are achieved for benchmarks that have negligible NEON instructions. On average, about 12% energy savings are achieved across all evaluated benchmarks. Interestingly, with the manipulation of the compiler, core switches are completely avoided, further increasing energy efficiency. This suggests that if a system has control over the compiler, as is the case with Android Dalvik or ART, it is possible to prefer and only use the limited set of instructions available on the energy-efficient reduced-ISA core.

7. RELATED WORK

Kumar et al. first suggested the possibility of increasing energy efficiency by proposing a single-ISA heterogeneous system [16]. ARM's big.LITTLE architecture [3] subsequently implemented this approach for mobile systems. All of these approaches, however, only focused on the high-performance components in the microarchitecture.

Recently, several notable approaches investigated heterogeneous-ISA architectures. Blem et al. revisited the debate about RISC versus CISC, comparing x86 and ARM ISAs [13]. Their research quantifies each ISA and argues that there is not much difference between the x86 and ARM ISAs. DeVuyst et al. [14] and Venkat et al. [23] proposed a way of harnessing heterogeneous ISAs. However, their research only focused on the diversities of each ISA and argued for the benefits of finding the right ISA depending on the workload. Several prior works [19, 11] investigated OS and software support for task migration between cores with overlapping ISAs. However, their work does not discuss detailed trade-offs in designing actual architectures for such heterogeneous-ISA systems.

The fundamental idea of having a reduced instruction set is the motivation for RISC processors, starting in the 1980s [20]. Simplifying the instruction set in order to obtain the benefit of simplified hardware is the key point. However, that work only focuses on single-processor architectures.

The ARM ISA inherently possesses heterogeneity in its design [4]. The Thumb and ARM ISAs, in 16-bit and 32-bit form, respectively, allow for optimal instruction placement targeting either size or performance. This feature is beneficial for the instruction memory footprint in both volatile and non-volatile form. However, it may increase logic requirements since the processor has to handle multiple ISAs even though their internal semantics are similar.

8. CONCLUSIONS

Reducing the instruction set opens up a chance to run high-performance tasks with better efficiency than current systems can provide. By providing heterogeneity in supported instructions and their logic, the energy efficiency of performance-oriented processors can be improved. In this paper, we demonstrated this potential with a case study of a reduced-ISA core and a heterogeneous system that includes both full- and reduced-ISA cores. Our results show that certain complex instructions can be removed with little performance overhead. We argue that in a heterogeneous system that effectively migrates applications to match ISA requirements, significant benefits in energy efficiency can be obtained. By providing heterogeneity in functionality rather than performance, as in traditional systems, we can improve energy efficiency with virtually no performance degradation. Results show that our proposed system can improve energy by up to 15%, and by 12% on average, all with little to no performance overhead.

9. REFERENCES

[1] ARM Architecture and NEON (page 6). http://xecanson.jp/ARM PDF/ARM NEON STANFORD lect10arm soc.pdf.
[2] ARM Architecture Reference Manual ARMv7-A/R Edition. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/index.html.
[3] ARM big.LITTLE Processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.
[4] ARM Processor Architecture (A32/T32). http://www.arm.com/products/processors/instruction-set-architectures/index.php.
[5] ARMv8 Instruction Set Overview. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.genc010197a/index.html.
[6] Arndale Board. http://www.arndaleboard.org.
[7] Linaro GCC Toolchain 4.7. http://www.linaro.org/downloads/.
[8] Pollack's Rule of Thumb for Performance and Area. http://en.wikipedia.org/wiki/Pollack's Rule.
[9] QEMU. http://www.qemu.org.
[10] SPEC CPU2006. http://www.spec.org/cpu2006/.
[11] A. Aminot, Y. Lhuillier, A. Castagnetti, and H.-P. Charles. FPU Speedup Estimation for Task Placement Optimization on Asymmetric Multicore Designs. In Proc. of Int. Symp. on Embedded MCSoC, 2015.
[12] V. K. Balasubramanian, H. Xu, and R. Vemuri. Design Automation Flow for Voltage Adaptive Optimum Granularity LITHE for Sequential Circuits. In Proc. of IEEE 26th International SOC Conference, 2013.
[13] E. Blem, J. Menon, and K. Sankaralingam. Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures. In Proc. of HPCA, 2013.
[14] M. DeVuyst, A. Venkat, and D. Tullsen. Execution Migration in a Heterogeneous-ISA Chip Multiprocessor. In Proc. of ASPLOS, 2012.
[15] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver. Full-System Analysis and Characterization of Interactive Smartphone Applications. In Proc. of IISWC, 2011.
[16] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proc. of MICRO, 2003.
[17] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In Proc. of CGO, 2004.
[18] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, and D. M. Tullsen. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Proc. of MICRO, 2009.
[19] T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy, and S. Hahn. Operating System Support for Overlapping-ISA Heterogeneous Multi-core Architectures. In Proc. of HPCA, 2010.
[20] D. A. Patterson and D. R. Ditzel. The Case for the Reduced Instruction Set Computer. ACM SIGARCH Computer Architecture News, 1980.
[21] A. Phansalkar, A. Joshi, and L. K. John. Analysis of Redundancy and Application Balance in the SPEC CPU2006 Benchmark Suite. In Proc. of ISCA, 2007.
[22] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Black, C. D. Emmons, and N. C. Paver. A Structured Approach to the Simulation, Analysis and Characterization of Smartphone Applications. In Proc. of IISWC, 2013.
[23] A. Venkat and D. M. Tullsen. Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor. In Proc. of ISCA, 2014.