ARC HS4x and HS4xD CPUs: New Dual-Issue Architecture Boosts Embedded Processor Performance

By Mike Demler, Senior Analyst, The Linley Group

May 2017

www.linleygroup.com

This white paper describes the DesignWare® ARC® HS4x and HS4xD series of licensable CPU cores. These are the company’s newest CPUs for embedded applications requiring 32-bit RISC performance in a small silicon footprint with minimal power consumption. This paper was prepared by The Linley Group and sponsored by Synopsys, but the opinions and analysis are those of the author.

Synopsys’s DesignWare ARC CPUs comprise a family of highly configurable and customizable processor cores, which ship in nearly two billion chips per year. ARC’s popularity in embedded devices makes the company second only to ARM in the number of chips that integrate its licensable CPUs. More than 230 ARC licensees use the cores in products that span a broad range of embedded applications, such as automotive control systems, digital-audio devices, sensors, solid-state drives (SSDs), network-attached storage (NAS), and residential gateways.

Since acquiring ARC as part of its 2010 purchase of Virage Logic, Synopsys has continued to improve the CPU architecture and add new features that expand the product line. In 2011, the company introduced the 32-bit ARCv2 ISA and the ARC EM family for very low power and deeply embedded products. In 2013, it introduced the ARC HS family, implemented with the ARCv2 ISA; these were the first ARC cores to support dual- and quad-core configurations. The ARC HS cores target higher-performance embedded applications, but they are software compatible with the ARC EM family cores.

In 2014, the introduction of ARC HS38 brought extensive enhancements, including the option to integrate a memory-management unit (MMU), which supports running Linux and other higher-level operating systems that use virtual memory. The HS38 supports dual- (HS38x2) and quad-core (HS38x4) configurations, along with offering a shared L2 cache for cache-coherent symmetric multiprocessing (SMP).

In May 2017, Synopsys announced the new HS4x family, which enhances the ARC HS3x architecture by adding dual-issue capability to the 10-stage pipeline. The HS4x CPUs run the same software as previous HS models, but they deliver a 25% boost in CoreMarks per megahertz compared with the ARC HS3x. Designers can alternatively use the new architecture to deliver the same performance at a lower frequency than the previous version, reducing power consumption.
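
To make the equal-performance trade-off concrete, the following minimal sketch computes the clock a dual-issue core would need to match its predecessor's throughput. It assumes throughput scales linearly with clock frequency and takes the 25% per-clock gain at face value; the function name and the 1,000MHz starting point are illustrative assumptions, not Synopsys figures.

```python
def iso_performance_mhz(old_mhz: float, per_clock_gain: float) -> float:
    """Clock needed by the faster core to match the old core's throughput.

    Assumes throughput = (work per clock) x (clock frequency).
    """
    return old_mhz / (1.0 + per_clock_gain)


# Hypothetical HS3x design point: 1,000MHz. The 25% gain comes from the text.
new_mhz = iso_performance_mhz(1000.0, 0.25)
reduction = 1.0 - new_mhz / 1000.0
print(f"{new_mhz:.0f}MHz, {reduction:.0%} lower clock")
```

Under these assumptions, a 25% per-clock gain permits a 20% lower clock for equal throughput. Actual power savings also depend on voltage scaling, which this sketch ignores.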

Like the predecessor ARC HS3x series, the new HS4x offers customers three base configurations, and all enable dual- and quad-core clusters. As Table 1 shows, the smallest new model is the HS44, which includes up to 16MB each of instruction and data closely coupled memory (CCM) but no L1 or L2 caches and no MMU. The HS46 adds as much as 64KB L1 I/D caches along with the CCMs, and it optionally has up to 8MB of L2 cache as well as an MMU. But including all those features is equivalent to licensing the top-of-the-line HS48, which supersedes the ARC HS38. Designers can configure all three cores with an optional IEEE 754-compliant FPU, a memory protection unit (MPU), and a real-time trace (RTT) feature.

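| | ARC HS44 | ARC HS46 | ARC HS48 |
|---|---|---|---|
| Instruction Set | 32-bit ARCv2 | 32-bit ARCv2 | 32-bit ARCv2 |
| CPU Freq (max) | 1.6GHz (worst case), 2.2GHz (typical) in TSMC 28nm HPM | 1.6GHz (worst case), 2.2GHz (typical) in TSMC 28nm HPM | 1.6GHz (worst case), 2.2GHz (typical) in TSMC 28nm HPM |
| Instruction Issue | 2 per cycle | 2 per cycle | 2 per cycle |
| Pipeline Depth | 10 stages | 10 stages | 10 stages |
| L1 Caches (I/D) | None | 2–64KB / 2–64KB | 2–64KB / 2–64KB |
| Closely Coupled Memories (I/D) | 512B–16MB / 512B–16MB | 512B–16MB / 512B–16MB | 512B–16MB / 512B–16MB |
| L2 Cache | None | Optional | 256KB–8MB |
| Memory-Management Unit | None | Optional | Standard |
| Options | FPU, MPU, real-time trace, DMA | FPU, MPU, real-time trace, DMA | FPU, MPU, real-time trace, DMA |
| Interfaces | 32-, 64-, or 128-bit AXI/AHB-Lite | 32-, 64-, or 128-bit AXI/AHB-Lite | 32-, 64-, or 128-bit AXI/AHB-Lite |

Table 1. Key features of the Synopsys DesignWare ARC HS4x family. The company offers three base configurations, but designers can use the ARC Processor Extension (Apex) tools to further customize the CPUs.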

Along with the new base HS4x configurations, Synopsys also developed two new ARC HS4xD processor cores that add DSP extensions previously available only with ARC EM. The ARCv2-DSP ISA adds more than 150 signal-processing instructions for audio, speech, and wireless-baseband applications. The ARC HS45D brings the DSP features to the HS44 base design, and it includes the same 16MB instruction and data CCMs. The ARC HS47D builds on the HS46, adding a DSP along with the CCMs and L1 caches.

The HS4xD designs include a unified 32x32-bit multiplier/multiplier-accumulator (MUL/MAC). Designers can optionally add an L2 cache and MMU, as well as the DMA, FPU, MPU, and RTT features to the base designs. The DSP-equipped cores also support dual- and quad-core configurations.

ARC HS4x Architectural Overview

The ARC HS4x family uses the same 10-stage pipeline as its predecessor, but the addition of a second instruction decoder provides a substantial performance boost by increasing utilization of the functional units. This 32-bit RISC architecture has traditional 32-bit instructions and a subset of 16-bit instructions for greater code density in small-memory embedded systems. As with all ARC cores, designers can use the ARC Processor Extension (Apex) tools to add custom instructions and hardware, including their own RTL, auxiliary registers, and condition and status codes, as well as memory-mapped blocks and closely coupled peripherals.

Using extension core registers, ARC HS processors can support up to 60 core registers. The register file has four read ports and three write ports. Configurability enables designers to add or omit features in order to optimize the core’s performance, size, and power consumption for the target application.

As Figure 1 shows, the ARC HS4x functional units include two ALUs and two late ALUs. The late ALUs defer execution to stage 9 in the 10-stage pipeline, thereby avoiding stalls that would occur when data loaded from memory isn’t available in time for the earlier ALUs. The late ALUs can also resolve branches that depend on load data, so most instructions suffer no load-to-use penalty.

Figure 1. Block diagram of Synopsys DesignWare ARC HS4x CPU. The design adds a second decoder that can simultaneously issue instructions to any two functional units. The dual-issue capability, along with a second ALU and late ALU, enables the ARC HS4x to increase performance by 25% compared with the previous HS3x model.

The other functional units remain the same as in the ARC HS3x, comprising a single divider, multiplier, MAC/SIMD, and optional FPU. The ARC HS4x can issue two ALU instructions in parallel or pair instructions from two different categories, including ALU, load/store, multiply (MPY), divide (DIV), floating point (FPU), and user defined. The HS4x load/store unit executes 64-bit operations, enabling a single instruction to load or store data to or from a register pair.
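
The pairing rule described above can be sketched as a small predicate. This is a simplified model based only on the categories named in the text: it ignores data dependencies and any other hazards the real issue logic checks, and the category labels and function name are hypothetical.

```python
CATEGORIES = {"ALU", "LDST", "MPY", "DIV", "FPU", "USER"}


def can_pair(first: str, second: str) -> bool:
    """Return True if two instruction categories may issue in the same cycle.

    Simplified rule from the text: two ALU instructions may pair (there are
    two ALUs); otherwise the two instructions must come from different
    categories. Data dependencies are ignored in this model.
    """
    if first not in CATEGORIES or second not in CATEGORIES:
        raise ValueError("unknown instruction category")
    if first == "ALU" and second == "ALU":
        return True
    return first != second


print(can_pair("ALU", "ALU"))    # True
print(can_pair("LDST", "MPY"))   # True
print(can_pair("MPY", "MPY"))    # False
```

Because the core has a single multiplier, divider, and load/store unit, the model rejects same-category pairs other than ALU plus ALU.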

The ARC HS compiler automatically generates 64-bit load/stores, and the microarchitecture supports nonaligned access without incurring an additional one-cycle penalty. The load-to-use latency is one cycle for most ALU operations, and the load-to-store delay is also just one cycle. ARC HS is a nonblocking architecture and can handle up to four cache misses without blocking the pipeline. The optional data-memory port enables faster access to memory or peripheral data. The AXI interface supports up to 13 outstanding memory transactions on the system bus, which designers can configure as a 32-, 64-, or 128-bit interface.

In multicore designs, each HS4x core integrates a snoop interface to keep the L1 caches coherent. The snoop unit applies the MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol for cache-to-cache transfers. The optional I/O coherency port keeps I/O traffic coherent with the L1 caches.
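
As a reminder of how MOESI line states evolve, here is a minimal sketch of the textbook transitions for a single cache line. It follows the generic protocol, not any ARC-specific behavior (which this paper does not detail); all names are illustrative.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snoop_read(state: State) -> State:
    """Next state of a line when another core reads the same address."""
    if state is State.MODIFIED:
        return State.OWNED      # keep the dirty data and supply it to the reader
    if state is State.EXCLUSIVE:
        return State.SHARED     # the clean copy is no longer unique
    return state                # O, S, and I are unchanged


def on_snoop_write(state: State) -> State:
    """Next state when another core gains write ownership of the line."""
    return State.INVALID


def on_local_write(state: State) -> State:
    """Next state after this core writes the line (other copies are invalidated)."""
    return State.MODIFIED


print(on_snoop_read(State.MODIFIED).value)
```

The Owned state is what permits cache-to-cache transfers of dirty data without first writing the line back to memory.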

The shared L2 cache operates at the CPU clock frequency, and its Harvard architecture connects independently to the core’s data and instruction buses. Designers can use configuration options to enable cache redundancy and ECC support, as well as to select up to a 16-way least recently used (LRU) policy and a 64- or 128-byte line size. The cache also supports reduced-power operation during processor sleep states.

A More Efficient Pipeline

Although HS4x CPUs issue instructions in order, the execution pipeline skews some operations so they can process and retire out of order, preventing pipeline stalls. This method enhances the instruction parallelism and throughput without needing additional hardware such as reorder buffers, which would increase die area and power.

For example, by delaying an MPY operation to the third execution cycle (Exec3), that execution unit can receive operands from the ALUs while, at the same time, those ALUs are free to begin executing another instruction, as Figure 2 shows. Similarly, rather than allow the pipeline to stall while the ALUs wait for operands, the design will schedule those instructions for the late execution cycle (Exec3) as other instructions proceed to available execution units. ALU operations can incur different latencies depending on whether they are basic or advanced. Advanced ALU functions are long latency, requiring an additional cycle to complete compared with basic ALU operations. An example of an advanced ALU operation is a rounding function.

Figure 2. Block diagram of ARC HS4x execution pipeline. The CPU can issue two instructions per cycle. It employs deferred execution to maximize function-unit utilization and prevent pipeline stalls.

Synopsys expects the ARC HS4x to support the same 2.2GHz maximum operating frequency as the predecessor HS3x, but the new dual-issue capability will increase performance efficiency by roughly 25%. The new design delivers 5.0 CoreMarks per megahertz (or 2.53 Dhrystone mips per megahertz).

The dual-issue hardware adds roughly 50K gates to the design. For the smallest HS44, area increases by roughly 20–25%, so area efficiency stays the same as in the older design. The area increase is proportionally smaller in the HS46 and HS48 with L1 caches, however, so customers will gain both the 25% performance-efficiency boost and greater area efficiency.
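
The area-efficiency argument reduces to a ratio of two growth factors. The sketch below works through it using only the percentages quoted above; the function name is an illustrative assumption.

```python
def efficiency_ratio(perf_gain: float, area_gain: float) -> float:
    """Ratio of new to old performance per unit area."""
    return (1.0 + perf_gain) / (1.0 + area_gain)


# 25% performance gain against the 20-25% area growth quoted for the HS44.
low = efficiency_ratio(0.25, 0.25)
high = efficiency_ratio(0.25, 0.20)
print(f"{low:.2f}x to {high:.2f}x")
```

A result at or just above 1.0 matches the statement that HS44 area efficiency is essentially unchanged. For cores whose area grows by a smaller fraction, the same formula yields a ratio above 1.0.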

Boosting DSP

Synopsys also offers two new dual-issue HS models that include support for ARC DSP extensions. The ARCv2-DSP ISA comprises most of the instructions that the smaller ARC EMxD products support, providing for software compatibility. The ISA has more than 150 DSP instructions, including vector/SIMD operations and complex-math functions. The ARC HS4xD omits the EMxD’s bitstream instructions, which those cores use for audio codecs, but adds support for 64-bit operands.

The smallest ARC HS4xD version is the HS45D, which essentially adds DSP features to the HS44, including up to 16MB CCMs. The ARC HS47D includes I/D caches, which are the same as in the HS46. Customers can separately license an MMU to enable an HS47D to run Linux, but the company isn’t offering that feature combination in a preconfigured product. Other licensable options are a single-precision FPU, a memory-protection unit (MPU), real-time trace, a DMA, and a shared L2 cache that supports dual- and quad-core configurations.

As Figure 3 shows, the HS4xD replaces the integer multiplier with a DSP-specific multiplier that performs both the integer and fixed-point MPY/MAC functions, including fast and slow operations. The slow functions are those that take an additional cycle to set flags. The DSP core also adds signal-processing operations to the advanced ALU. The new dual-issue DSP architecture employs 64-bit source operands, which the ARC EMxD cores don’t support. In the ARC HS4xD, that capability enables quad 16-bit and dual 32-bit SIMD operations. For example, the HS4xD can pack four 16-bit dot-product operations in a single instruction. The new design enables the HS4xD to perform single-cycle 32x32-bit multiply/multiply-accumulate (MAC) operations, two 32x16-bit MAC operations per cycle, or four 16x16 MAC operations per cycle.
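
To illustrate what a quad 16-bit operation on 64-bit source operands computes, the following sketch packs four signed 16-bit lanes into one 64-bit word and forms their dot product. It models the data layout only; the lane order, the absence of saturation and rounding, and the function names are assumptions rather than ARCv2-DSP instruction semantics.

```python
import struct


def pack4(values):
    """Pack four signed 16-bit lanes into one 64-bit integer (lane 0 lowest)."""
    return int.from_bytes(struct.pack("<4h", *values), "little")


def unpack4(word):
    """Split a 64-bit integer into four signed 16-bit lanes."""
    return struct.unpack("<4h", word.to_bytes(8, "little"))


def dot4(a_word, b_word, acc=0):
    """Multiply matching lanes and add all four products to an accumulator."""
    return acc + sum(x * y for x, y in zip(unpack4(a_word), unpack4(b_word)))


a = pack4([1, -2, 3, 4])
b = pack4([5, 6, -7, 8])
print(dot4(a, b))  # 5 - 12 - 21 + 32 = 4
```

Four 16x16 multiplies per packed pair is the arithmetic behind the quoted rate of four 16x16 MAC operations per cycle.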

Figure 3. Block diagram of ARC HS4xD execution pipeline. The HS4xD cores add DSP multipliers to the base set of function units and modify the ALUs for signal-processing operations.

The ARC HS4xD omits the X/Y memories that enhance DSP performance in the ARC EM9/11D, but both designs employ the same address-generation unit (AGU), which supports the bit-reverse and modulo-wrapping modes common in FFTs and digital filters. In the HS45D/47D, the AGUs also generate operands for explicit load/store operations, and they support access to the data cache. The AGU feature is a configuration option that includes a unit with four address pointers and modifiers as well as two offsets.
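
The two addressing modes named above can be sketched in software as follows. These functions show the generic index arithmetic that such hardware replaces; they are not a model of the ARC AGU's registers or modifiers, and the names and example addresses are illustrative.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the low `bits` bits of index (FFT input/output reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def modulo_step(pointer: int, step: int, base: int, length: int) -> int:
    """Advance a pointer by step, wrapping inside [base, base + length)."""
    return base + (pointer - base + step) % length


# Bit-reversed order for an 8-point FFT.
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]

# Circular buffer of 8 entries at a hypothetical base address 0x100.
print(hex(modulo_step(0x106, 4, 0x100, 8)))   # 0x102
```

Performing these updates in an address generator removes the index arithmetic from the inner loops of FFTs and FIR filters.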

Along with the base dual-issue capability, the HS4xD can issue two 32-bit DSP ALU instructions in parallel or execute a DSP multiply command in parallel with a load/store. The base Apex, DIV, FPU, and branch instructions can’t issue in the same cycle as DSP multiply. The 64-bit advanced and DSP ALU operations are restricted to single-issue dispatch.

Competitive Comparisons

The three DesignWare ARC HS4x models give designers the option to include or exclude L1 and L2 caches, an IEEE 754–compliant floating-point unit (FPU), a memory-protection unit (MPU), and a real-time trace block. In their base configurations, the HS44 and HS46 omit the memory-management unit (MMU) and are therefore suited to running an RTOS. The ARM Cortex-M7 is a smaller and lower-power competitor for such tasks, but it delivers less than half the peak performance of the ARC HS4x. The HS48 includes an MMU that supports Linux and other high-level operating systems. Fully configured, the HS48 competes for low-power embedded Linux designs with ARM’s 32-bit Cortex-A7 and Cortex-A32, as well as Imagination Technologies’ 64-bit MIPS I6500, as Table 2 shows.

| | Synopsys ARC HS48 | ARM Cortex-A7 | ARM Cortex-A32 | Imagination MIPS I6500 |
|---|---|---|---|---|
| Instruction Set | 32-bit ARCv2 | 32-bit ARMv7 | 32-bit ARMv8 | 64-bit MIPS R6 |
| SMP/SMT | SMP | SMP | SMP | SMP and SMT |
| CPU Speed (max) | 2.2GHz | 2.0GHz | 2.2GHz† | 2.0GHz |
| Instr-Issue Rate | 2 per cycle | 1 per cycle | 1 per cycle† | 2 per cycle |
| Reordering? | No | No | No | No |
| Pipeline Depth | 10 stages | 8 stages | 8 stages | 9 stages |
| L1 Caches I/D | 0–64KB | 4–64KB | 64KB | 0–64KB with ECC |
| TCM I/D | 0–16MB | None | None | 0/1MB with ECC |
| L2 Cache | 256KB–8MB | 128KB–1MB | 128KB–1MB | 512KB–8MB |
| Optional Extensions | FPU, MPU, real-time trace | DSP, Neon, FPU, TrustZone, SMP, LPAE | DSP, Neon, FPU, TrustZone, SMP, LPAE, ECC on caches, HW virtualization | ECC on L2 cache, 128-bit SIMD, DSP, FPU, H/W virtualization |
| MACs/Cycle | 1x 32x32-bit or 2x 16x16-bit | 1x 32x32-bit | 1x 32x32-bit† | 1x 64x64-bit |
| CoreMarks/MHz | 5.0CM/MHz | 3.3CM/MHz | 3.3CM/MHz† | 3.7CM/MHz (1 thread); 5.6CM/MHz (2 threads) |
| Max Performance | 11,000 CoreMarks | 6,600 CoreMarks | 7,260 CoreMarks | 7,400 / 11,600 CoreMarks (2 threads) |
| Interfaces | 32-, 64-, or 128-bit AXI, AHB-Lite | 128-bit Amba 4 | Amba 4 Ace, AXI4, Amba 5 Chi | 256-bit Amba AXI4 |
| Die Area | 0.25mm2 | 0.33mm2 | 0.45mm2† | 1.00mm2 |
| Perf per Area | 44kCM/mm2 | 20kCM/mm2 | 16kCM/mm2 | 8kCM/mm2 |
| Power (max) | 0.05mW/MHz | 0.12mW/MHz† | 0.07mW/MHz† | 0.15mW/MHz† |
| Perf per Watt | 100CM/mW | 28CM/mW | 47CM/mW | 25CM/mW |
| RTL Release | 2017 | 2011 | 2016 | 2016 |

Table 2. Comparison of DesignWare ARC HS48 with three competing CPUs. The maximum clock frequencies assume speed-optimized synthesis for a 28nm high-k metal-gate (HKMG) process. Area estimates exclude the L1 cache. (Source: vendors, except †The Linley Group estimate)

ARM designed Cortex-A7 to work as the “little” core in its original 32-bit Big.Little configuration. Cortex-A32 implements the 32-bit features of the ARMv8 ISA, offering designers a more area- and power-efficient alternative to the “little” 64-bit Cortex-A35. Imagination’s MIPS I6500 is a low-end 64-bit CPU, but it runs MIPS32 software directly.

The new dual-issue ARC HS4x design boosts maximum performance to 11,000 CoreMarks, which is approximately 50% higher than its nearest single-threaded competitor in this group. Although the dual-threaded MIPS I6500 model delivers slightly higher performance than the ARC HS48, most embedded applications won’t use that feature, which comes at the expense of four times the area and three times the power of the Synopsys core.

The ARC and MIPS cores allow a maximum 8MB L2 cache, but the ARM cores are limited to a 1MB maximum. The larger caches will save additional power required to access off-chip memory. All these CPUs work in clusters of four or more cores. For embedded applications, ARC HS4x designers can include up to 16MB each of instruction and data closely coupled memory (CCM)—a feature the ARM cores lack. The MIPS core supports just a 1MB data scratchpad memory.

Besides its performance advantage, the ARC HS4x also offers superior area and power efficiency compared with the ARM and MIPS CPUs. Excluding the L1/L2 caches and TCMs, the ARC HS48 occupies just 0.25mm2, which is one-fourth the size of the I6500 and approximately 25% smaller than Cortex-A7. The result is 44kCM/mm2, more than twice the area efficiency of the ARM cores and 5.5x better than the I6500. The Synopsys design delivers similar advantages in performance per milliwatt. Its 100CM/mW power efficiency is about twice that of Cortex-A32 and roughly four times that of Cortex-A7 and the MIPS I6500.
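
The efficiency figures for the ARC and ARM cores follow directly from the Table 2 entries. The sketch below recomputes them, assuming performance per area is peak CoreMarks divided by core area and performance per milliwatt is CoreMarks/MHz divided by mW/MHz; the dictionary and function names are illustrative.

```python
# (max CoreMarks, core area in mm2, CoreMarks/MHz, mW/MHz) from Table 2.
CORES = {
    "ARC HS48": (11000, 0.25, 5.0, 0.05),
    "Cortex-A7": (6600, 0.33, 3.3, 0.12),
    "Cortex-A32": (7260, 0.45, 3.3, 0.07),
}


def derived(coremarks, area_mm2, cm_per_mhz, mw_per_mhz):
    """Return (kCM per mm2, CM per mW) for one core."""
    return coremarks / area_mm2 / 1000.0, cm_per_mhz / mw_per_mhz


for name, spec in CORES.items():
    kcm_per_mm2, cm_per_mw = derived(*spec)
    print(f"{name}: {kcm_per_mm2:.0f}kCM/mm2, {cm_per_mw:.0f}CM/mW")
```

Because the frequency cancels out of the power ratio, performance per milliwatt depends only on the per-megahertz figures, not on the maximum clock.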

The ARC HS45D and HS47D lack an MMU, although designers can optionally add one to the HS47D. These cores are thus best suited to advanced applications such as speech processing and wireless basebands. The ARC HS4xD competes for such designs with ARM’s Cortex-R8, which offers optional DSP extensions. It will also compete with Ceva’s configurable X2 DSP core, a design that’s purpose-built for PHY-control tasks in 5G and LTE-Advanced modems.

As Table 3 shows, the ARC HS4xD cores in a 28nm HPM process can run at a higher maximum clock frequency than the ARM and Ceva offerings, although the HS47D and X2 both sport 10-stage pipelines and Cortex-R8 has 11 stages. The Ceva X2 integrates two scalar CPUs (SPUs in the company’s parlance), but they deliver 10% less performance per megahertz than the ARC DSP. Cortex-R8 delivers 4.4 CoreMarks per megahertz, yielding peak performance that is 65% of ARC HS4xD.

| | Synopsys ARC HS47D | ARM Cortex-R8 | Ceva X2 |
|---|---|---|---|
| Instruction Set | 32-bit ARCv2-DSP | ARMv7, Thumb-2 | 32-bit Ceva-X |
| SMP/SMT | SMP | Dual-core AMP/SMP | No |
| CPU Speed (max) | 2.2GHz | 1.6GHz | 1.0GHz† |
| Instr-Issue Rate | 2 per cycle | 2 per cycle | 2 per cycle |
| Pipeline Depth | 10 stages | 11 stages | 10 stages |
| L1 Caches I/D | 0–64KB | 0–64KB | 0–128KB instr / 0–64KB data |
| TCM I/D | 0–16MB | 0–1MB | 0–256KB instr TCM, 0–512KB data TCM |
| L2 Cache | 256KB–8MB | 0–8MB | None |
| Optional Extensions | FPU, MPU, real-time trace | DSP, FPU, MPU, AMP, embedded-trace module (ETM) | Data/instruction cache, dynamic branch predictor, 0–2x FPU |
| MACs/Cycle | 1x 32x32-bit, 4x 16x16-bit | 2x 16x16-bit, 4x 8x8-bit | 2x 32x32-bit, 4x 16x16-bit |
| MACs/s (16-bit) | 8.8GMACs | 3.2GMACs | 4.0GMACs |
| CoreMarks/MHz | 5.0CM/MHz | 4.4CM/MHz | 4.5CM/MHz |
| Max Performance | 11,000 CoreMarks | 7,040 CoreMarks | 4,500 CoreMarks |
| Interfaces | 32-, 64-, or 128-bit AXI (main), optional 32-bit AXI (peripherals), optional 32- or 64-bit (slave) | 4x AXI (64-bit), 5x AXI (32-bit) | Instructions: 128-bit master; AXI data: 128-bit master + 128-bit slave |
| Die Area | 0.22mm2† | 0.16mm2* | 0.18mm2† |
| Perf per Area | 50kCM/mm2 | 44kCM/mm2 | 25kCM/mm2 |
| Power (max) | 0.042mW/MHz | 0.054mW/MHz‡ | Not disclosed |
| Perf per Watt | 119CM/mW | 81CM/mW | Not disclosed |

Table 3. Comparison of DSP-capable CPUs. Maximum clock frequencies assume speed-optimized synthesis for a 28nm HPM process, *power-optimized nine-track library. (Source: vendors, except †The Linley Group estimate)
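
The throughput rows in Table 3 are products of the clock rate and the per-cycle figures. As a quick check, the sketch below recomputes them, assuming peak rates are sustained every cycle at the maximum clock; the names are illustrative.

```python
# (max clock in GHz, 16-bit MACs per cycle, CoreMarks/MHz) from Table 3.
DSP_CORES = {
    "ARC HS47D": (2.2, 4, 5.0),
    "Cortex-R8": (1.6, 2, 4.4),
    "Ceva X2": (1.0, 4, 4.5),
}


def peak_figures(ghz, macs_per_cycle, cm_per_mhz):
    """Return (GMACs per second, peak CoreMarks) at the maximum clock."""
    return ghz * macs_per_cycle, ghz * 1000.0 * cm_per_mhz


for name, spec in DSP_CORES.items():
    gmacs, coremarks = peak_figures(*spec)
    print(f"{name}: {gmacs:.1f}GMACs/s, {coremarks:,.0f} CoreMarks")
```

The HS47D's lead in 16-bit MAC throughput therefore comes from its higher clock rather than from wider MAC hardware, since the Ceva X2 matches its four 16x16 MACs per cycle.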

The ARC HS4xD cores maintain the area and power efficiency of the non-DSP versions. By omitting an MMU, the HS47D is slightly smaller than the HS48, giving it a higher area efficiency of 50kCM/mm2. Although Cortex-R8 is smaller than the HS47D, its area efficiency is 12% less. Ceva’s X2 is also roughly 20% smaller, but it delivers just half the area efficiency.

ARM’s Cortex-R8 SIMD/DSP instructions operate on 16- or 8-bit data values in 32-bit registers. By comparison, the ARC HS4xD supports 64-bit source operations for quad 16-bit or dual 32-bit SIMD, and it includes built-in signal-processing functions such as FFT butterflies, loop transformations, and FIR and IIR filters. Ceva’s design offers a more powerful DSP engine with dual 32x32-bit MACs and a five-way VLIW architecture, but designers looking for more-balanced RISC CPU and DSP operation will find the ARC cores to be a better choice. The Ceva design also has less TCM capability, increasing power consumption for memory transactions. And it lacks support for a shared L2 cache and SMP operation.

Summary

With its new HS4x design, Synopsys has enhanced its ARC lineup with a CPU family that delivers best-in-class area and performance efficiency for low-power embedded systems. By adding another instruction decoder and a second set of ALUs, the company efficiently increased utilization of the execution units to deliver 25% higher per-cycle performance.

The new ARC HS4x dual-issue capability adds just 50K gates, and although the die area is slightly larger (0.25mm2 for the HS48 versus 0.21mm2 for the HS38), both area and power efficiency improve. For the fully configured ARC HS48, performance per square millimeter increases by 14% and performance per milliwatt increases by 29%. The MetaWare compiler automatically optimizes instruction execution to take advantage of the dual-issue scheduler, so the HS4x is a drop-in replacement for its predecessor and is transparent to the programmer.

The HS4xD cores extend the scalability of the ARC family by enabling designers to employ the DSP ISA throughout the lineup. The DSP-equipped CPUs provide upward compatibility for most ARC EM5D/7D/9D/11D software, but the HS-series’ 10-stage pipeline enables higher clock frequencies for up to a 2x DSP-performance boost. Synopsys eases adoption by supporting the ARC DSP cores with a software library that includes common signal-processing functions, such as audio/voice codecs, filters, FFTs, and matrix operations.

Designers can configure the HS4x to run an RTOS such as ARC MQX, or they can add an MMU to run embedded Linux applications. Multicore options provide additional flexibility and scalability, allowing them to configure each HS core in a dual or quad cluster to optimize performance, power, and area. Designers can further customize their cores using Apex to implement user-defined functions.

The ARC HS4x and HS4xD preserve and extend the compactness, configurability, extensibility, and low power of the ARC processor architecture. The new features and performance boost will appeal to embedded-processor designers looking for high-end performance in an exceptionally efficient CPU core.

Mike Demler is a senior analyst at The Linley Group and a senior editor of Microprocessor Report. The Linley Group offers the most comprehensive analysis of the mobile semiconductor industry. We analyze not only the business strategy but also the internal technology. Our in-depth reports also cover topics including embedded processors, network processors, base-station processors, and Ethernet chips. For more information, see our web site at www.linleygroup.com.

©2017 The Linley Group, Inc.