NEON Technology Introduction

Venu Gopal Reddy

ARM Architecture Evolution
[Figure: "Key Technology Additions by Architecture Generation", spanning ARMv5, ARMv6, ARMv7-A&R and ARMv7-M. Labels in the figure: Thumb-EE, Execution Environments, improved memory use, VFPv3, NEON™ Adv SIMD, Thumb®-2, improved Media and DSP, TrustZone™, SIMD, VFPv2, Jazelle®, Low Cost MCU, Thumb-2 Only. Cores shown: ARM9, ARM10, ARM11.]

What is NEON?
- NEON technology is a wide SIMD data processing architecture
  - Extension of the ARM instruction set
  - 32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide)
- NEON instructions perform "Packed SIMD" processing
  - Registers are considered as vectors of elements of the same data type
  - Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, 32-bit float
  - Instructions perform the same operation in all lanes
[Figure: source registers Dn and Dm, each divided into elements, feed one operation per lane and write destination register Dd.]

NEON Coprocessor Registers
- NEON has a 256-byte register file
  - Separate from the core registers (r0-r15)
  - Extension to the VFPv2 register file (VFPv3)
- Two different views of the NEON registers
  - 32 x 64-bit registers (D0-D31)
  - 16 x 128-bit registers (Q0-Q15)
- Enables register trade-offs
  - Vector length can be variable
  - Different registers available
[Figure: register map in which D0 and D1 alias Q0, D2 and D3 alias Q1, and so on up to D30 and D31 aliasing Q15.]

What are the Operations?
- A comprehensive set of data processing instructions
  - Form a general purpose SIMD instruction set suitable for compilers
- NEON operations fall into the following categories
  - Addition / Subtraction (Saturating, Halving, Rounding)
  - MIN, MAX, NEG, MOV, ABS, ABD, …
  - Multiplication (MUL, MLA, MLS, …)
  - Shifts (Saturating, Rounding)
  - Comparison and Selection
  - Logical (AND, ORR, EOR, BIC, ORN, …)
  - Bitfield
  - Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
  - Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP, TRN, …)
  - Many more…

Long, Narrow and Wide Operations
- NEON can utilise both register views in the same instruction
  - Enables instructions to promote or demote elements within operation
[Figure: operands Dn/Qn and Dm combined by an operation into destination Dd/Qd.]
- Long operations promote elements to double the precision
  - Multiply Long (16 x 16 -> 32), Add/Sub Long, Shift Long
- Narrow operations demote data type to half the precision
  - Shift Right and Narrowing Add/Sub, Move
- Wide operations promote the elements of the second operand
  - Add/Sub Wide (16 + 32 -> 32)
- Allows number of lanes of processing to remain constant
  - Enables elements to be efficiently kept at appropriate precision

Pairwise Operations
- NEON also supports pairwise instructions to add across registers
  - ADD, MIN, MAX
  - Normal
  - Long

Load/Store Instructions
- Various memory access patterns are possible with single instructions
[Figure: memory layouts (linear x0..x7, interleaved x/y/z structures, interleaved x0 y0 .. x3 y3 pairs) and the corresponding arrangement of elements in NEON registers.]

NEON Processing Performance
[Figure: bar chart of BDTImark2000 / BDTIsimMark2000 scores (axis 0 to 5000) for Cortex-A8/NEON (600MHz, projected *), PXA27x/WMMX (624MHz, XScale), ARM1176 (335MHz), ARM9E (265MHz) and SH3-DSP (200MHz).]
BDTImark2000™ and BDTIsimMark2000™ are registered trademarks of BDTI. Contact BDTI for more info.
Notes:
- Cortex-A8*: Certified and published as 7.6 BDTIsimMarks/MHz (http://www.bdti.com/bdtimark/cortex_a8.htm). Projected Cortex-A8 result at OMAP35xx baseline frequency. The OMAP35x platform itself is not currently certified.
- PXA27: 2140 BDTImarks measured at 624 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- SH3: 490 BDTImarks measured at 200 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- ARM9, ARM11: Results quoted at (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- BDTIsimMark2000 is calculated in the same manner as BDTImark2000, but with simulated results instead of hardware measurements

Why NEON?
- General purpose SIMD processing useful for many applications
- Supports the widest range of multimedia codecs used for internet applications
  - Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …
  - Ideal solution for "internet streaming" decode of various formats
- Fewer cycles needed
  - NEON will give 1.6x-2.5x performance on complex video codecs
  - Individual simple DSP algorithms can show larger performance boost (4x-8x)
  - Processor can sleep sooner => overall dynamic power saving
- Easy to program
  - Clean orthogonal vector architecture, applicable to a wide range of data intensive computation
  - Not just for codecs: also applicable to 2D & 3D graphics and other processing
  - Off the shelf tools, OS support, and ecosystem support

NEON - Enhancing User Experiences
- Watch any video in any format
- Game processing
- Edit & enhance captured videos
- Video stabilization
- Process megapixel photos quickly
- Voice recognition
- Antialiased rendering & compositing
- Advanced user interfaces
- Powerful multi-channel hi-fi audio processing

NEON in Audio
- FFT: 256-point, 16-bit signed complex numbers
  - FFT is a key component of AAC, voice/pattern recognition etc.
  - Hand optimized assembler in both cases

  FFT time                             No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
  Cortex-A8 500MHz (actual silicon)    15.2 us                 3.8 us (x 4.0 performance)

- Extreme example: FFT in ffmpeg: 12x faster (Cortex-A8)
  - C code -> handwritten asm
  - Scalar -> vector processing
  - Single-precision floating point on Cortex-A8

How to use NEON
- OpenMAX DL library
  - Recommended approach to accelerate AV codecs
  - Status: Released on http://www.arm.com/products/esd/openmax_home.html
- Vectorizing Compilers
  - Exploits NEON SIMD automatically with existing source code
  - Status: Released (in RVDS 3.1 Professional and later)
  - Status: Codesourcery 2007q3 gcc and later
- C Intrinsics
  - C function call interface to NEON operations
  - Supports all data types and operations supported by NEON
  - Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
- Assembler
  - For those who really want to optimize at the lowest level
  - Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

OpenMAX DL v1.0 Library Summary
- Video Domain
  - MPEG-4 simple profile
  - H.264 baseline
- Still Image Domain
  - JPEG
- Image Processing Domain
  - Colorspace conversion
  - De-blocking / de-ringing
  - Rotation, scaling, compositing
- Audio Domain
  - MP3
  - AAC
- Signal Processing Domain
  - FIR
  - IIR
  - FFT
  - Dot Product
Spec from: www.khronos.org/openmax
Opensource implementation for ARM11 & NEON available from: http://www.arm.com/products/multimedia/openmax/
NOTE: OpenMAX DL provides low level data processing functions, not the complete codecs

OpenMAX early H.264 results
- ARM Internal H.264 test codec using OpenMAX function calls
  - For OpenMAX development only - not fully optimized
  - Does not yet use NEON for deblocking (about 20% of cycles)
  - Currently, only about 50% of cycles spent in NEON optimized code (commercial codecs will use more NEON and will be better)
  - Includes YUV-RGB color conversion (could be done in graphics h/w)
- Conditions: SystemBench (256K PL310 L2, 3:1 core:mem ratio, PL340 with mobile LPDDR)
- Input sequence: Foreman VGA at 30fps (1s sequence at bitrate of 512kbps)
- Mcyc directly translates to MHz to decode

  H.264 decode   No NEON    With NEON
  Cortex-A8      716Mcyc    398Mcyc (x 1.80 performance)
  Cortex-A9      633Mcyc    386Mcyc (x 1.64 performance)

Commercial vendor performance results also available under NDA

ARM RVDS & gcc vectorising compiler
Source (test.c):

    int a[256], b[256], c[256];
    void foo(void) {
      int i;
      for (i=0; i<256; i++){
        a[i] = b[i] + c[i];
      }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32  {d0,d1},[r0]!
        SUBS     r3,r3,#1
        VLD1.32  {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32  {d0,d1},[r2]!
        BNE      |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add      r1, r0, ip
        add      r3, r0, lr
        add      r2, r0, r4
        add      r0, r0, #8
        cmp      r0, #1024
        fldd     d7, [r3, #0]
        fldd     d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd     d7, [r1, #0]
        bne      .L2

- armcc generates better NEON code (gcc can use Q-regs with "-mvectorize-with-neon-quad")

ARM RVDS Vectorizing Compiler
- RVDS 4.0 professional includes auto-vectorizing armcc
  - armcc --vectorize --cpu=Cortex-A8 x.c
- Up to 4x performance increase for benchmarks, with no source code changes (no source code changes are permitted for benchmarking)
[Figure: bar chart "ARM vs NEON (Vectorize) on Cortex-A8" for the Telecom and Consumer benchmarks, series ARM and NEON; values shown: 100%, 100%, 135%, 169%; annotation: "Improved vectorization in latest RVDS 4.0".]
- Simple source code changes can yield significant improvements above this
  - Use the C "__restrict" keyword to work around C pointer aliasing issues
  - Make loop counts clearly a multiple of 2^n (e.g. use 4*n as loop end) to aid vectorization

Automatic Vectorizing
- Automatic vectorization can generate code targeted for NEON from ordinary C source code
  - Less effort to produce efficient code
  - Portable - no compiler-specific source code features need to be used
- To enable automatic vectorization, use these options together:

    --vectorize                     enable vectorization
    --cpu 7-A or --cpu Cortex-A8    provide a CPU option with NEON support
    -O2 or -O3                      select high optimization level
    -Otime                          optimize for speed over space

- Selecting optimization level -O3 will optimize more aggressively for speed, but at the expense of increased code size

Tuning C/C++ Code for Vectorizing
- The goal is to try to make the code simple, straightforward, and parallel, so that the compiler can easily convert the code to NEON assembly
- Loops can be modified for better vectorizing:
  - Short, simple loops work the best (even if it means multiple loops in your code)
  - Avoid breaks in loops
  - Try to make the number of iterations a power of 2
  - Try to make sure the number of iterations is known to the compiler
  - Functions called inside a loop should be inlined
- Pointer issues:
  - Using arrays with indexing vectorizes better than using pointers
  - Indirect addressing (multiple indexing or de-referencing) doesn't vectorize
  - Use the __restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory

NEON Vectorizing Example (1)
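The loop and pointer guidelines above can be sketched in C as follows. This is only an illustrative sketch, not an example taken from the slides: the function name `add_arrays` and the constant `N` are hypothetical, and whether a given compiler actually vectorizes the loop depends on its version and options.

```c
/* Iteration count is a compile-time constant and a power of 2,
 * so the compiler knows it and needs no scalar clean-up loop. */
#define N 256

/* Hypothetical helper: element-wise add written to be easy to vectorize.
 * __restrict promises the compiler that a, b and c do not overlap,
 * so loads and stores from different iterations can be combined. */
void add_arrays(int *__restrict a,
                const int *__restrict b,
                const int *__restrict c)
{
    int i;

    /* Short, simple loop: array indexing, no break, no function calls. */
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}
```

With armcc, a file containing this function could be built with the options listed earlier, for example `armcc --cpu Cortex-A8 -O3 -Otime --vectorize`; inspecting the `-S` output is one way to check whether NEON instructions were generated.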
