Hexagon Dsp: an Architecture Optimized for Mobile Multimedia and Communications
Total Page:16
File Type:pdf, Size:1020Kb
................................................................................................................................................................................................................. HEXAGON DSP: AN ARCHITECTURE OPTIMIZED FOR MOBILE MULTIMEDIA AND COMMUNICATIONS ................................................................................................................................................................................................................. THE QUALCOMM HEXAGON DSP IS USED FOR BOTH MODEM PROCESSING AND MULTIMEDIA ACCELERATION.BY OFFLOADING MULTIMEDIA TASKS FROM THE CPU TO THE DSP, SIGNIFICANT POWER SAVINGS CAN BE ACHIEVED.THIS ARTICLE PROVIDES AN OVERVIEW OF THE HEXAGON ARCHITECTURE.THE PROCESSOR IS DESIGNED TO DELIVER SUPERIOR ENERGY EFFICIENCY COMPARED TO MOBILE CPU ALTERNATIVES AND THEREBY HELP ACHIEVE LONG BATTERY LIFE FOR IMPORTANT MOBILE APPLICATIONS. ......To be competitive, a modern multimedia DSP, however, is licensed for mobile product must provide a rich user expe- programming by OEMs and third-party soft- rience and long battery life. Chips for these ware vendors. This article provides an over- Lucian Codrescu ecosystems integrate multiple subsystems, view of the multimedia DSP and builds on each customized for a particular application the presentation from HotChips 25.1 Willie Anderson domain. By specializing a subsystem to a task, Figure 2 shows the various Hexagon gen- performance and power can be enhanced erations. Version 2 (V2) was the first pro- Suresh Venkumanhanti beyond what is possible with a homogenous duction version and appeared in the initial CPU-based computing platform. Snapdragon mobile products in 2007. V3 Mao Zeng Figure 1 shows a block diagram of the featured an improved implementation with Snapdragon 800. This chip contains dedicated better power consumption. These early ver- Erich Plondke subsystems for camera, display, video, audio/ sions of Hexagon targeted voice and audio voice, sensors, graphics, cellular modem, Wi- processing. Example functions include wide- Chris Koob Fi, and more. Each subsystem contains dedi- band vocoders, echo cancellation, audio cated hardware, and many contain special- postprocessing filters, MP3/AAC play- Ajay Ingle purpose processing engines and software cus- back, speaker protection algorithms, and tomized to the task. so on. Charles Tabony The Snapdragon 800 has two instances of V4 and V5 expanded the application tar- the Hexagon digital-signal processor (DSP). gets to include image processing for camera Rick Maule The modem (mDSP) is dedicated and and video; computer vision tasks such as customized for modem processing, whereas hand, gesture, and face recognition; and Qualcomm the application DSP (aDSP) is used for mul- processing of sensor input (gyro, accelerome- timedia acceleration. The modem processor ter, fingerprint, and so on). Unless otherwise is a closed subsystem and is programmed noted, this article will focus on the latest only within Qualcomm Technologies. The Hexagon V5 core. ....................................................... 34 Published by the IEEE Computer Society 0272-1732/14/$31.00 c 2014 IEEE • aDSP: Real-time media & sensor processing Snapdragon 800 Audio Camera Krait Krait Adreno CPU CPU Sensors Display GPU JPEG Krait Krait Hexagon Misc. CPU CPU Video aDSP connectivity Other 2-Mbyte L2 Multimedia fabric System fabric Hexagon Modem Fabric & memory controller mDSP • mDSP: Dedicated LPDDR3 LPDDR3 modem processing Figure 1. Snapdragon 800 block diagram. The chip contains dedicated subsystems for camera, display, video, audio/voice, sensors, graphics, cellular modem, and Wi-Fi. V5 mDSP V3 mDSP V4 mDSP 28 nm 45 nm 28 nm Dec. 2012 June 2009 Dec. 2010 V4 aDSP V3 aDSP Low-tier V1 aDSP V2 aDSP Low-tier 28 nm 65 nm 65 nm 45 nm Apr. 2011 Oct. 2006 Dec. 2007 Nov. 2009 V5 aDSP V3 aDSP V4 aDSP 28 nm 45 nm 28 nm Dec. 2012 Aug. 2009 Dec. 2010 Time Figure 2. Hexagon digital-signal processor (DSP) evolution. The figure shows the evolution from Version 1 in October 2006 through Version 5 in December 2012. ............................................................. MARCH/APRIL 2014 35 .............................................................................................................................................................................................. HOT CHIPS All user-level registers are replicated per thread. There are two sets of user registers: Instruction general registers and control registers. The cache general registers include thirty-two 32-bit registers that can be accessed either as single Instruction unit registers or as aligned 64-bit register pairs. The general registers contain all pointer, sca- lar, vector, and accumulator data. The con- L2 trol registers include special-purpose registers cache/ such as the program counter, status register, TCM and loop registers. Data unit Data Unit Execution Execution (load/ (load/ unit unit store/ store/ (64-bit (64-bit Data-processing instructions ALU) ALU) vector) vector) There are two identical 64-bit single- instruction, multiple-data (SIMD) execution Data cache units. Each unit supports all multiply, shift, arithmetic logic unit (ALU), and bit manipu- lation instructions. Supported data types include Register file/thread 8-, 16-, 32-, and 64-bit integer; 16- and 32-bit fractional with optional rounding and saturation; 16-bit complex; and Figure 3. Hexagon block diagram. The architecture features a four-wide very single-precision IEEE-compatible float- long instruction word (VLIW) with dual load/store and dual single-instruction, ing point. multiple-data (SIMD) execution units and supports hardware multithreading. Each unit is capable of supporting: Hexagon is a multithreaded very long four 16 Â 16 multiplies; instruction word (VLIW) DSP. The design two 32 Â 16 multiplies; or philosophy is to maximize work per cycle for one 32 Â 32 multiply, one complex performance, but target the microarchitec- multiply, or one floating-point fused ture to modest clock speeds and low power. multiply-add (FMA). Many of the instructions are complex and Instruction-set architecture overview application specific. Complex instructions The architecture’s foundation is a stati- targeted to a particular application domain cally scheduled four-way VLIW. The VLIW can provide high performance and energy approach puts the burden of instruction par- efficiency. For example, Figure 4 depicts a allelism on the compiler and thereby avoids complex multiply instruction used in a costly and power-hungry dynamic-scheduling 16-bit fixed-point fast-Fourier transform hardware.2 The VLIW approach is popular (FFT). Without such an instruction, it would among commercial DSPs. Figure 3 shows a take four multiplies, four shifts, four adds, block diagram of Hexagon. and two saturates to perform the operation. It should be clear that packing all the work in Registers and memory a single instruction executed in a single pipe- The Hexagon processor features a unified lined execution unit provides large efficiency byte-addressable memory. This memory has gains. a single 32-bit virtual address space that holds The Hexagon instruction set architecture both instructions and data. It operates in (ISA) contains numerous special-purpose little-endian mode. A full-featured memory instructions designed to accelerate key multi- management unit (MMU) translates virtual media kernels. Multimedia algorithms with to physical addresses. special instruction support include ............................................................ 36 IEEE MICRO variable-length encode/decode, such as context-adaptive binary-arithmetic- Rs I R I R Rs coding processing in H.264 video; features from accelerated segment Rt I R I R Rt test (FAST) corner detection image processing; FFTalgorithms; ∗∗ ∗∗ sliding-window filters; linear-feedback shift; table lookup from an arbitrary bit 32 32 32 32 field index; 0x8000 <<0–1 <<0–1 <<0–1 <<0–1 0x8000 elliptic curve cryptography; and – cyclic redundancy check (CRC) calculation. Add Add Load/store instructions Sat_32 Sat_32 Dual load/store units access signed or High 16bits High 16bits unsigned 8-, 16-, 32-, and 64-bit values in memory. There is a rich variety of addressing modes, including I R Rd absolute 32-bit, Figure 4. Complex multiply instruction. Such an instruction executed in a base plus scaled immediate and base single pipelined execution unit provides large efficiency gains for plus scaled register, applications that use complex arithmetic. auto-incrementing by register and immediate, circular addressing, and bit reversed. that is generated from it by the compiler. The “.new” suffix implies the source predicate is To increase the number of instruction generated in the same packet. In this exam- combinations allowed in packets, the load/ ple, the dot-new construct enables the work store units also support 32-bit ALU to be done in one instruction packet instead instructions. of two. The C statement is as follows: Conditional execution and program flow if (R2 == 4) The Hexagon ISA includes conditional R3 = *R4; execution. Conditional execution is useful to else remove branches through if-conversion and R5 = 5; is helpful for a VLIW processor. Compare Assembly code with braces delineate instructions target one of four predicate regis- packet boundaries: ters. These predicate registers can then be { used to conditionally execute certain instruc- P0 = cmp.eq(R2,#4) tions. Not all instructions can be condi- if (P0.new) R3 = memw(R4) tional—only