...... HEXAGON DSP: AN ARCHITECTURE OPTIMIZED FOR MOBILE MULTIMEDIA AND COMMUNICATIONS

...... THE QUALCOMM HEXAGON DSP IS USED FOR BOTH MODEM PROCESSING AND

MULTIMEDIA ACCELERATION.BY OFFLOADING MULTIMEDIA TASKS FROM THE CPU TO THE

DSP, SIGNIFICANT POWER SAVINGS CAN BE ACHIEVED.THIS ARTICLE PROVIDES AN

OVERVIEW OF THE HEXAGON ARCHITECTURE.THE PROCESSOR IS DESIGNED TO DELIVER

SUPERIOR ENERGY EFFICIENCY COMPARED TO MOBILE CPU ALTERNATIVES AND THEREBY

HELP ACHIEVE LONG BATTERY LIFE FOR IMPORTANT MOBILE APPLICATIONS...... To be competitive, a modern multimedia DSP, however, is licensed for mobile product must provide a rich user expe- programming by OEMs and third-party soft- rience and long battery life. Chips for these ware vendors. This article provides an over- Lucian Codrescu ecosystems integrate multiple subsystems, view of the multimedia DSP and builds on each customized for a particular application the presentation from HotChips 25.1 Willie Anderson domain. By specializing a subsystem to a task, Figure 2 shows the various Hexagon gen- performance and power can be enhanced erations. Version 2 (V2) was the first pro- Suresh Venkumanhanti beyond what is possible with a homogenous duction version and appeared in the initial CPU-based computing platform. Snapdragon mobile products in 2007. V3 Mao Zeng Figure 1 shows a block diagram of the featured an improved implementation with Snapdragon 800. This chip contains dedicated better power consumption. These early ver- Erich Plondke subsystems for camera, display, video, audio/ sions of Hexagon targeted voice and audio voice, sensors, graphics, cellular modem, Wi- processing. Example functions include wide- Chris Koob Fi, and more. Each subsystem contains dedi- band vocoders, echo cancellation, audio cated hardware, and many contain special- postprocessing filters, MP3/AAC play- Ajay Ingle purpose processing engines and software cus- back, speaker protection algorithms, and tomized to the task. so on. Charles Tabony The Snapdragon 800 has two instances of V4 and V5 expanded the application tar- the Hexagon digital-signal processor (DSP). gets to include image processing for camera Rick Maule The modem (mDSP) is dedicated and and video; computer vision tasks such as customized for modem processing, whereas hand, gesture, and face recognition; and Qualcomm the application DSP (aDSP) is used for mul- processing of sensor input (gyro, accelerome- timedia acceleration. The modem processor ter, fingerprint, and so on). Unless otherwise is a closed subsystem and is programmed noted, this article will focus on the latest only within Qualcomm Technologies. The Hexagon V5 core......

34 Published by the IEEE Computer Society 0272-1732/14/$31.00 c 2014 IEEE • aDSP: Real-time media & sensor processing Snapdragon 800

Audio Camera Krait Krait CPU CPU Sensors Display GPU

JPEG Krait Krait Hexagon Misc. CPU CPU Video aDSP connectivity Other 2-Mbyte L2

Multimedia fabric System fabric

Hexagon Modem Fabric & memory controller mDSP

• mDSP: Dedicated LPDDR3 LPDDR3 modem processing

Figure 1. Snapdragon 800 block diagram. The chip contains dedicated subsystems for camera, display, video, audio/voice, sensors, graphics, cellular modem, and Wi-Fi.

V5 mDSP V3 mDSP V4 mDSP 28 nm 45 nm 28 nm Dec. 2012 June 2009 Dec. 2010

V4 aDSP V3 aDSP Low-tier V1 aDSP V2 aDSP Low-tier 28 nm 65 nm 65 nm 45 nm Apr. 2011 Oct. 2006 Dec. 2007 Nov. 2009 V5 aDSP V3 aDSP V4 aDSP 28 nm 45 nm 28 nm Dec. 2012 Aug. 2009 Dec. 2010

Time

Figure 2. Hexagon digital-signal processor (DSP) evolution. The figure shows the evolution from Version 1 in October 2006 through Version 5 in December 2012...... MARCH/APRIL 2014 35 ...... HOT CHIPS

All user-level registers are replicated per thread. There are two sets of user registers: Instruction general registers and control registers. The cache general registers include thirty-two 32-bit registers that can be accessed either as single Instruction unit registers or as aligned 64-bit register pairs. The general registers contain all pointer, sca- lar, vector, and accumulator data. The con- L2 trol registers include special-purpose registers cache/ such as the program counter, status register, TCM and loop registers. Data unit Data Unit Execution Execution (load/ (load/ unit unit store/ store/ (64-bit (64-bit Data-processing instructions ALU) ALU) vector) vector) There are two identical 64-bit single- instruction, multiple-data (SIMD) execution Data cache units. Each unit supports all multiply, shift, arithmetic logic unit (ALU), and bit manipu- lation instructions. Supported data types include Register file/thread  8-, 16-, 32-, and 64-bit integer;  16- and 32-bit fractional with optional rounding and saturation;  16-bit complex; and Figure 3. Hexagon block diagram. The architecture features a four-wide very  single-precision IEEE-compatible float- long instruction word (VLIW) with dual load/store and dual single-instruction, ing point. multiple-data (SIMD) execution units and supports hardware multithreading. Each unit is capable of supporting: Hexagon is a multithreaded very long  four 16 Â 16 multiplies; instruction word (VLIW) DSP. The design  two 32 Â 16 multiplies; or philosophy is to maximize work per cycle for  one 32 Â 32 multiply, one complex performance, but target the microarchitec- multiply, or one floating-point fused ture to modest clock speeds and low power. multiply-add (FMA). Many of the instructions are complex and Instruction-set architecture overview application specific. Complex instructions The architecture’s foundation is a stati- targeted to a particular application domain cally scheduled four-way VLIW. The VLIW can provide high performance and energy approach puts the burden of instruction par- efficiency. For example, Figure 4 depicts a allelism on the compiler and thereby avoids complex multiply instruction used in a costly and power-hungry dynamic-scheduling 16-bit fixed-point fast-Fourier transform hardware.2 The VLIW approach is popular (FFT). Without such an instruction, it would among commercial DSPs. Figure 3 shows a take four multiplies, four shifts, four adds, block diagram of Hexagon. and two saturates to perform the operation. It should be clear that packing all the work in Registers and memory a single instruction executed in a single pipe- The Hexagon processor features a unified lined provides large efficiency byte-addressable memory. This memory has gains. a single 32-bit virtual address space that holds The Hexagon instruction set architecture both instructions and data. It operates in (ISA) contains numerous special-purpose little-endian mode. A full-featured memory instructions designed to accelerate key multi- management unit (MMU) translates virtual media kernels. Multimedia algorithms with to physical addresses. special instruction support include ...... 36 IEEE MICRO  variable-length encode/decode, such as context-adaptive binary-arithmetic- Rs I R I R Rs coding processing in H.264 video;  features from accelerated segment Rt I R I R Rt test (FAST) corner detection image processing;  FFTalgorithms; ∗∗ ∗∗  sliding-window filters;  linear-feedback shift;  table lookup from an arbitrary bit 32 32 32 32 field index; 0x8000 <<0–1 <<0–1 <<0–1 <<0–1 0x8000  elliptic curve cryptography; and –  cyclic redundancy check (CRC) calculation. Add Add

Load/store instructions Sat_32 Sat_32 Dual load/store units access signed or High 16bits High 16bits unsigned 8-, 16-, 32-, and 64-bit values in memory. There is a rich variety of addressing modes, including I R Rd  absolute 32-bit, Figure 4. Complex multiply instruction. Such an instruction executed in a  base plus scaled immediate and base single pipelined execution unit provides large efficiency gains for plus scaled register, applications that use complex arithmetic.  auto-incrementing by register and immediate,  circular addressing, and  bit reversed. that is generated from it by the compiler. The “.new” suffix implies the source predicate is To increase the number of instruction generated in the same packet. In this exam- combinations allowed in packets, the load/ ple, the dot-new construct enables the work store units also support 32-bit ALU to be done in one instruction packet instead instructions. of two. The C statement is as follows: Conditional execution and program flow if (R2 == 4) The Hexagon ISA includes conditional R3 = *R4; execution. Conditional execution is useful to else remove branches through if-conversion and R5 = 5; is helpful for a VLIW processor. Compare Assembly code with braces delineate instructions target one of four predicate regis- packet boundaries: ters. These predicate registers can then be { used to conditionally execute certain instruc- P0 = cmp.eq(R2,#4) tions. Not all instructions can be condi- if (P0.new) R3 = memw(R4) tional—only the most common load/store if (!P0.new) R5 = #5 and ALU instructions. } A unique feature of Hexagon conditional Similar to many DSP processors, Hexa- execution is that the processor can generate gon includes a zero-overhead hardware and use a predicate in the same VLIW in- counted looping mechanism with support struction packet. This reduces packet count for two levels of nesting. An instruction is and creates denser packets, both of which used to initialize the loop count and the start improve performance and reduce energy con- address. Bits encoded in the last packet of the sumption. Consider the following C state- loop delineate the end of the loop. This ment and the corresponding assembly code architecture allows execution of loops with ...... MARCH/APRIL 2014 37 ...... HOT CHIPS

Duplex instructions Packet of four 32 bits Low code size is advantageous for an 32-bit embedded processor. Hexagon instructions 32 bits instructions are fixed size and 32 bits in length. To 32 bits improve code size, the Duplex feature enables 32 bits some use of 16-bit instructions by creating a 32-bit subpacket containing two 16-bit Packet of two 32- 32 bits bit instructions instructions. These subpackets are called 32 bits and a duplex duplexes. Figure 5 shows a visualization of a 16 bits 16 bits duplex. Each box is an instruction. Because duplexes are always 32 bits, packet sizes continue to be multiples of 32 Figure 5. Visualization of a duplex. A duplex bits. This leads to a simpler and lower-power is a 32-bit subpacket containing two 16-bit implementation as compared to instruction instructions. sets with a mixed 16-/32-bit instruction set. Additionally, duplexes must always end a packet, and are always dispatched to the same no branch mispredicts or stalls, and no hard- two execution units, which further simplifies ware devoted to loop branch prediction. the implementation. The instructions allowed in duplexes, called subinstructions,arethemost Compound and memop instructions common subset of normal Hexagon instruc- Compound instructions combine two or tions, with reduced ranges of registers and more dependent operations in a single instruc- immediate operands. tion. These instructions improve code size and save power by reducing register file and for- warding power. Hexagon includes many such Multithreading and instructions, including shift-add, shift-or, add- The Hexagon processor is multithreaded. add, compare-branch, shift-xor, and many fla- The number of threads varies by implemen- vors of the classic DSP multiply-add. tation. Early implementations included six Another class of instruction performs sim- hardware threads, but more recent cores ple operations directly on memory, including include three hardware threads. There are add, subtract, logical–or, and logical–and. many trade-offs in choosing the number of Without these memory operations (mem- threads. Additional threads provide more ops), three instructions would be necessary to latency tolerance and enable power-saving perform the same task: one to load the value, opportunities in the microarchitecture by one to perform the arithmetic or logical oper- serializing work rather than speculating ation on the value, and one to store the result. work. On the other hand, additional threads Memops improve code size and reduce power increase cache pressure and increase the soft- because intermediate register access is not ware programming burden. Our experience needed. is that three or four threads are a sweet spot in the design space. VLIW instruction grouping Hexagon is designed to look like a multi- VLIW instruction packets are variable core architecture with communication sized and contain one to four instructions. If through shared memory. Figure 6 shows how a packet contains more than one instruction, the processor appears to the programmer. the instructions execute in parallel. The Software threads are mapped to hardware instruction combinations allowed in a packet threads by the operating system. are limited to the instruction types that can In the physical implementation, however, be executed in parallel in the four execution there is only one processor, which the three units. The processor uses parallel execution hardware threads share. Hexagon V1 through semantics. All registers are read, then all V4 implemented a simple round-robin inter- instructions are executed, then all registers leaved multithreading (IMT) approach.3 On are written. every clock tick, a different thread is given a ...... 38 IEEE MICRO Shared instruction cache

Thread 0 Thread 1 Thread 2

L2 DU DU XU XU DU DU XU XU DU DU XU XU cache/ TCM

Register file Register file Register file

Shared data cache

Figure 6. The programmer’s view of multithreading. To the programmer, it appears as three VLIW cores with shared caches. Software threads are mapped to hardware threads by the Hexagon operating system.

Thread 0 dispatch Thread 1 dispatch Thread 2 dispatch

T0: { Ld Ld Add Cmp } T1: { St Ld Mpy Add } T2: { Ld Add Jump }

T0: { Ld Ld Add Cmp } T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

Figure 7. Interleaved multithreading. The figure shows a three-stage execution pipeline with three threads taking turns dispatching packets.

turn at each pipe stage. Figure 7 shows a Because there is no observable latency, the three-stage execution pipeline and with three compiler is not concerned with instruction threads taking turns dispatching packets. latency and scheduling for latency. This With the number of threads matched to yields higher VLIW packet density. When the execution pipe depth, all of a thread’s instructions are no longer needed to hide instructions from a VLIW packet are com- latency, they can be used instead to fill the plete before the next VLIW packet starts. packets...... MARCH/APRIL 2014 39 ...... HOT CHIPS

(DMT) is shown as the additional (striped) 5.0 bar on top of the baseline (solid) IMT bar. IPC_DMT 4.5 In the Snapdragon 800 implementation, IPC_IMT the DSP runs up to 800 MHz. The instruc- 4.0 tion cache is 16 Kbytes, the data cache is 32 3.5 Kbytes, and the level-2 (L2) cache is 256 3.0 Kbytes. Connection to main memory is pro- 2.5 vided over a 64-bit system bus that runs at 240 MHz. 2.0 1.5 1.0 System programming model Instructions per cycle 0.5 Communication between the DSP and CPU is done through a traditional shared- 0 memory-plus-interrupt mechanism. Both the SIFT MP3 DSP and CPU can access the full physical Listen JPEGD JPEGC address space and share the external memory. Coremark Dhrystone 2D-3D Cvt

spec95_go Access to memory is cache based, and there is Face detect VP8 decode H264 decode Wave denoise BM3D denoise

Feature extract no explicit data mover. The DSP includes an extensive prefetching capability to help hide Figure 8. Performance on multimedia applications. The figure shows the cache latency. The CPU and DSP are not instructions per cycle (IPCs) for various multimedia benchmarks, which are cache coherent with each other, so coherency sorted into multithreaded applications (left) and single-threaded applications must be maintained in software with explicit (right). cache maintenance operations. A software remote procedure call (RPC) interface lets a CPU application offload work to the DSP.When an RPC is made, any data The simple IMT model and simple in- associated with the call is flushed to main order pipeline yield a small, low-power pro- memory from the CPU caches and mapped cessor that is critical to meeting the aggressive into the DSP virtual address space. The DSP area and power targets. is then interrupted to process the RPC call, The obvious problem with IMT is that after which any results are flushed from the when threads are idle or stalled, their slice DSP caches back to main memory, and a of the processor goes unused. Starting with completion interrupt is sent to the CPU. Hexagon V5, a more dynamic approach to The overheads of software-managed thread scheduling has been implemented. coherency preclude offloading very small tasks Often, packets contain only simple instruc- to the DSP. Large kernels that run continu- tions and can be completed in fewer than ously or process large data (full image frames) three cycles. With Hexagon V5, the pro- are typically needed to amortize the overhead. cessor will opportunistically execute packets faster if threads are idle or stalled and simple packets are available. The design philosophy Power is not to shoot for the best single-thread Battery life is extremely important in performance, but rather to provide some mobile computing. Offloading applications performance boost when it is easy to do from the CPU to a specialized low-power so and without compromising energy processing engine such as Hexagon is critical efficiency. to achieving the power goals. In addition to Figure 8 shows the instructions per the ISA and microarchitecture for efficient cycle (IPCs) for various multimedia multimedia processing, Hexagon is imple- benchmarks. The benchmarks are sorted mented with aggressive low-power design into multithreaded applications on the techniques, including hierarchical clock gat- left, and single-threaded on the right. The ing with a custom clock tree, voltage scaling boost from the V5 dynamic multithreading with split-grid memories, pulse latches ...... 40 IEEE MICRO instead of flip-flops, and full-custom caches and register files designed for low power. A 3.5 more complete description of the low-power techniques used by Hexagon is available 3.0 elsewhere.4 based chip

An important power benchmark for - mobile phones is MP3 playback. Figure 9 2.5 compares current measured at the battery for various competitive smartphones. Battery 2.0 current includes all components of system power, such as the CPU, DSP, memory, and 1.5 I/O. The data is presented as (battery current in mA for the device divided by battery cur- rent in mA for the Hexagon-based chip). A 1.0 2Â delta on this chart represents twice the hours of music playback, which is a key mar- 0.5 keting and user-experience metric. Google recently announced support for DSP offload of audio playback in the Battery current in mA relative to Hexagon 0 Android 4.4 (KitKat). Quoting from the “Audio Tunneling to DSP” section on Goo- 5 gle’s Android developer webpage: CompetitorA CompetitorB CompetitorC CompetitorDCompetitorECompetitorFCompetitorG Hexagon-based For high-performance, lower-power audio play- back, Android 4.4 adds platform support for Figure 9. MP3 playback battery current for various smartphones. Lower audio tunneling to a digital signal processor (DSP) bars represent more hours of music playback, which is a critical customer in the device chipset. With tunneling, audio metric. decoding and output effects are off-loaded to the DSP, waking the application processor less often and using less battery. Audio tunneling can dramatically improve bat- tery life for use-cases such as listening to music over a headset with the screen off. For example, 1.20 with audio tunneling, offers a total off- network audio playback time of up to 60 hours, 1.00 an increase of over 50% over non-tunneled audio. 0.80 The Nexus 5 device uses Snapdragon 800, and the audio offload is done to the Hexagon 0.60 V5 aDSP. Figure 10 shows another example of off- 0.40 loading a computer-vision-object detection algorithm from the CPU to the Hexagon 0.20 DSP. Initially, the algorithm is run on the CPU. The algorithm is fully optimized with 0.00 Neon SIMD instructions. After offloading to CPU utilization (%) Detection time Total power (mW) the DSP, the same algorithm is called via an (msec) RPC. Because the algorithm is no longer run- ning on the CPU, the CPU load is reduced. ARM/Neon Hexagon app DSP In terms of speed, the algorithm is marginally faster on the DSP. This includes all RPC overheads. But, significantly, the total system Figure 10. An example of feature detection offload. The algorithm is power as measured at the battery has been offloaded from the CPU to DSP, which results in comparable performance reduced by 32 percent. but much reduced battery current...... MARCH/APRIL 2014 41 ...... HOT CHIPS

Canny Fast9 Intensity histogram Energy/pixel(nJ) Energy/pixel(nJ) Energy/pixel(nJ)

Time/pixel(nSec) Time/pixel(nSec) Time/pixel(nSec)

Harris NCC

- CPUx1 CPU @ 1.2 GHz - CPUx4 - DSP DSP @ 690 MHz Energy/pixel(nJ) Energy/pixel(nJ) Time/pixel(nSec) Time/pixel(nSec)

Figure 11. Energy versus latency for multimedia benchmarks. Each chart features a different algorithm used in a computer vision application.

Figure 11 provides additional examples of For readers who would like to explore performance and power, comparing the Hex- further, the Hexagon Software Developer’s agon V5 DSP to the CPU used in the Snap- Kit (https://developer.qualcomm.com/mobile- dragon 200 chip. Each chart features a development/maximize-hardware/multimedia- different algorithm used in a computer vision optimization-hexagon-sdk/multimedia- application. For the CPU, all code is fully optimization-h-2) provides everything optimized using Neon SIMD instructions. needed to program the DSP, including Both single-CPU and quad-CPU data is full documentation, software tools, a shown. The x-axis shows latency in units of cycle-approximate simulator, and example time/pixel, and the y-axis shows energy/pixel. code. MICRO The origin is (0,0) in all charts. Power is measured at the battery and includes all sys- ...... tem power. When compared to the quad References CPU in Snapdragon 200, the Hexagon V5 1. L. Codrescu et al., “Qualcomm Hexagon DSP provides similar or better performance DSP: An Architecture Optimized for Mobile and lower power in all cases. Multimedia and Communications,” Hot Chips 25, 2013. 2. J. Fisher, “VLIW Architectures: An Inevita- his article provides an overview of the ble Standard for the Future?” J. Supercom- Hexagon DSP architecture. The T puter, vol. 7, no. 2, 1990, pp. 29-36. demand for low-power signal processing in mobile applications continues unabated. 3. B. Smith, “Architecture and Applications of Camera and video applications require the HEP Multiprocessor Computer Sys- sophisticated signal processing at ultra-high tem,” SPIE Real Time Signal Processing IV, definition resolution. At the same time, 1981, pp. 241-248. “always-on” voice activation features are 4. M. Saint-Laurent et al., “A 28 nm DSP Pow- pushing power requirements to new lows. ered by an On-Chip LDO for High-Per- Future versions of Hexagon DSP will be formance and Energy-Efficient Mobile enhanced and specialized to tackle these Applications,” to be published in Proc. IEEE upcoming challenges. Int’l Solid-State Circuits Conf., 2014...... 42 IEEE MICRO 5. “Android KitKat: New Media Capabilities,” Rick Maule is a senior director at Qual- Android.com, Jan. 2014, http://developer. comm, where he is responsible for DSP android.com/about/versions/kitkat.html#44- product management. Maule has an MS in media. electrical engineering from the University of Arkansas. Lucian Codrescu is a senior director at Qualcomm, where he leads the Hexagon Direct questions and comments about this Architecture team. Codrescu has a PhD in article to Lucian Codrescu at lucian. computer engineering from the Georgia [email protected]. Institute of Technology.

Willie Anderson is a vice president at Qual- comm, where he runs the DSP engineering organization. Anderson has a BS in physics from the University of Texas at Austin.

Suresh Venkumanhanti is a principal engi- neer at Qualcomm, where he is responsible for the DSP core microarchitecture. Venku- manhanti has an MS in electrical engineer- ing from the University of Florida.

Mao Zeng is a senior staff engineer at Qual- comm and a principal designer of the DSP instruction set architecture. Zeng has a PhD in electrical engineering from the University of Victoria.

Erich Plondke is a senior staff engineer at Qualcomm, where he leads the system architecture team. Plondke has an MS in electrical engineering from the Georgia Institute of Technology.

Chris Koob is a principal engineer at Qual- comm, where he is responsible for the memory system microarchitecture. Koob has an MS in electrical engineering from Virginia Tech.

Ajay Ingle is a principal engineer at Qual- comm, where he is responsible for DSP architecture support of modem applications. Ingle has an MS in electrical engineering from Texas A&M University.

Charles Tabony is a senior engineer at Qualcomm, where he contributes to the instruction set design and performance veri- fication. Tabony has a BS in computer engi- neering from the University of Texas...... MARCH/APRIL 2014 43