NEON Technology Introduction

Venu Gopal Reddy

ARM Architecture Evolution

[Figure: Key Technology Additions by Architecture Generation, covering ARMv5, ARMv6, ARMv7A&R and ARMv7M (cores ARM9, ARM10, ARM11). Labelled additions: Jazelle®, VFPv2, Improved Media and DSP, SIMD, TrustZone™, Thumb®-2, Thumb-EE, Execution Environments: Improved memory use, VFPv3, NEON™ Adv SIMD, Low Cost MCU, Thumb-2 Only]

What is NEON?

- NEON technology is a wide SIMD data processing architecture
  - Extension of the ARM instruction set
  - 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
- NEON instructions perform “Packed SIMD” processing
  - Registers are considered as vectors of elements of the same data type
  - Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, 32-bit float
  - Instructions perform the same operation in all lanes

[Figure: source registers Dn and Dm, each divided into elements, feed one operation per lane into destination register Dd]

NEON Coprocessor Registers

- NEON has a 256-byte register file
  - Separate from the core registers (r0-r15)
  - Extension to the VFPv2 register file (VFPv3)
- Two different views of the NEON registers
  - 32 x 64-bit registers (D0-D31)
  - 16 x 128-bit registers (Q0-Q15)
- Enables register trade-offs
  - Vector length can be variable
  - Different registers available

[Figure: register views: Q0 overlays D0 and D1, Q1 overlays D2 and D3, ..., Q15 overlays D30 and D31]

What are the Operations?

- A comprehensive set of data processing instructions
  - Form a general purpose SIMD instruction set suitable for compilers
- NEON operations fall into the following categories
  - Addition / Subtraction (Saturating, Halving, Rounding)
  - MIN, MAX, NEG, MOV, ABS, ABD, …
  - Multiplication (MUL, MLA, MLS, …)
  - Shifts (Saturating, Rounding)
  - Comparison and Selection
  - Logical (AND, ORR, EOR, BIC, ORN, …)
  - Bitfield
  - Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
  - Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP, TRN, …)
  - Many more…

Long, Narrow and Wide Operations

- NEON can utilise both register views in the same instruction
- Enables instructions to promote or demote elements within operation

[Figure: operands Dn or Qn and Dm pass through an operation to produce Dd or Qd]

- Long operations promote elements to double the precision
  - Multiply Long (16 x 16 -> 32), Add/Sub Long, Shift Long
- Narrow operations demote data type to half the precision
  - Shift Right and Narrowing Add/Sub, Move
- Wide operations promote the elements of the second operand
  - Add/Sub Wide (16 + 32 -> 32)
- Allows number of lanes of processing to remain constant
  - Enables elements to be efficiently kept at appropriate precision

Pairwise Operations

- NEON also supports pairwise instructions to add across registers
  - ADD, MIN, MAX
  - Normal
  - Long

Load/Store Instructions

- Various memory access patterns are possible with single instructions

[Figure: memory-to-register access patterns: consecutive elements x0-x7 loaded into NEON registers; interleaved x, y, z data split into registers of x, y and z elements; interleaved x0, y0, x1, y1, ... split into (x3 x2 x1 x0) and (y3 y2 y1 y0)]

NEON Processing Performance

[Chart: BDTImark2000 / BDTIsimMark2000 scores, axis 0 to 5000, for Cortex-A8/NEON (600MHz) (projected) *, PXA27x/WMMX (624MHz) (XScale), ARM1176 (335MHz), ARM9E (265MHz), SH3-DSP (200MHz)]

BDTImark2000™ and BDTIsimMark2000™ are registered trademarks of BDTI. Contact BDTI for more info.

Notes:
- Cortex-A8*: Certified and published as 7.6 BDTIsimMarks/MHz (http://www.bdti.com/bdtimark/cortex_a8.htm). Projected Cortex-A8 result at OMAP35xx baseline frequency. The OMAP35x platform itself is not currently certified.
- PXA27: 2140 BDTImarks measured at 624 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- SH3: 490 BDTImarks measured at 200 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- ARM9, ARM11: Results quoted at (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
- BDTIsimMark2000 is calculated in the same manner as BDTImark2000, but with simulated results instead of hardware measurements

Why NEON?

- General purpose SIMD processing useful for many applications
  - Supports the widest range of multimedia codecs used for internet applications
  - Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …
  - Ideal solution for 'internet streaming' decode of various formats
- Fewer cycles needed
  - NEON will give 1.6x-2.5x performance on complex video codecs
  - Individual simple DSP algorithms can show larger performance boost (4x-8x)
  - Processor can sleep sooner => overall dynamic power saving
- Easy to program
  - Clean orthogonal vector architecture, applicable to a wide range of data intensive computation. Not just for codecs: also applicable to 2D & 3D graphics and other processing
  - Off the shelf tools, OS support, and ecosystem support

NEON - Enhancing User Experiences

- Watch any video in any format
- Game processing
- Edit & Enhance captured videos
- Video stabilization
- Process megapixel photos quickly
- Voice recognition
- Antialiased rendering & compositing
- Advanced User Interfaces
- Powerful multi-channel hi-fi audio processing

NEON in Audio

- FFT: 256-point, 16-bit signed complex numbers
  - FFT is a key component of AAC, Voice/pattern recognition etc.
  - Hand optimized assembler in both cases

    FFT time                            No NEON (v6 SIMD asm)    With NEON (v7 NEON asm)
    Cortex-A8 500MHz, actual silicon    15.2 us                  3.8 us (x 4.0 performance)

- Extreme example: FFT in ffmpeg: 12x faster (Cortex-A8)
  - C code -> handwritten asm
  - Scalar -> vector processing
  - Single-precision floating point on Cortex-A8

How to use NEON

- OpenMAX DL library
  - Recommended approach to accelerate AV codecs
  - Status: Released on http://www.arm.com/products/esd/openmax_home.html
- Vectorizing Compilers
  - Exploits NEON SIMD automatically with existing source code
  - Status: Released (in RVDS 3.1 Professional and later)
  - Status: Codesourcery 2007q3 gcc and later
- C Intrinsics
  - C function call interface to NEON operations
  - Supports all data types and operations supported by NEON
  - Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
- Assembler
  - For those who really want to optimize at the lowest level
  - Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

OpenMAX DL v1.0 Library Summary

- Video Domain
  - MPEG-4 simple profile
  - H.264 baseline
- Still Image Domain
  - JPEG
- Image Processing Domain
  - Colorspace conversion
  - De-blocking / de-ringing
  - Rotation, scaling, compositing
- Audio Domain
  - MP3
  - AAC
- Signal Processing Domain
  - FIR
  - IIR
  - FFT
  - Dot Product

Spec from: www.khronos.org/openmax
Opensource implementation for ARM11 & NEON available from: http://www.arm.com/products/multimedia/openmax/
NOTE: OpenMAX DL provides low level data processing functions, not the complete codecs

OpenMAX early H.264 results

- ARM Internal H.264 test codec using OpenMAX function calls
  - For OpenMAX development only - not fully optimized
  - Does not yet use NEON for deblocking (about 20% of cycles)
  - Currently, only about 50% of cycles spent in NEON optimized code (commercial codecs will use more NEON and will be better)
  - Includes YUV-RGB color conversion (could be done in graphics h/w)
- Conditions: SystemBench (256K PL310 L2, 3:1 core:mem ratio, PL340 with mobile LPDDR)
- Input sequence: Foreman VGA at 30fps (1s sequence at bitrate of 512kbps), so Mcyc directly translates to the MHz needed to decode

    H.264 decode    No NEON     With NEON
    Cortex-A8       716 Mcyc    398 Mcyc (x 1.80 performance)
    Cortex-A9       633 Mcyc    386 Mcyc (x 1.64 performance)

Commercial vendor performance results also available under NDA

ARM RVDS & gcc vectorising compiler

Source (test.c):

    int a[256], b[256], c[256];

    void foo(void)
    {
        int i;

        /* Independent element-wise add: a candidate for vectorization */
        for (i = 0; i < 256; i++) {
            a[i] = b[i] + c[i];
        }
    }

armcc:

    armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
            VLD1.32  {d0,d1},[r0]!
            SUBS     r3,r3,#1
            VLD1.32  {d2,d3},[r1]!
            VADD.I32 q0,q0,q1
            VST1.32  {d0,d1},[r2]!
            BNE      |L1.16|

gcc:

    gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
            add      r1, r0, ip
            add      r3, r0, lr
            add      r2, r0, r4
            add      r0, r0, #8
            cmp      r0, #1024
            fldd     d7, [r3, #0]
            fldd     d6, [r2, #0]
            vadd.i32 d7, d7, d6
            fstd     d7, [r1, #0]
            bne      .L2

- armcc generates better NEON code (gcc can use Q-regs with '-mvectorize-with-neon-quad')

ARM RVDS Vectorizing Compiler

- RVDS 4.0 professional includes auto-vectorizing armcc
  - armcc --vectorize --cpu=Cortex-A8 x.c
- Up to 4x performance increase for benchmarks, with no source code changes (no source code changes are permitted for benchmarking)

[Chart: ARM vs NEON (Vectorize) on Cortex-A8, Telecom and Consumer benchmark groups; ARM bars at 100%, NEON bars labelled 135% and 169%; annotation: improved vectorization in latest RVDS 4.0]

- Simple source code changes can yield significant improvements above this
  - Use C '__restrict' keyword to work around C pointer aliasing issues
  - Make loops clearly a multiple of 2^n (e.g. use 4*n as loop end) to aid vectorization

Automatic Vectorizing

- Automatic vectorization can generate code targeted for NEON from ordinary C source code
  - Less effort to produce efficient code
  - Portable - no compiler-specific source code features need to be used
- To enable automatic vectorization, use these options together:
    --vectorize                      - enable vectorization
    --cpu 7-A or --cpu Cortex-A8     - provide a CPU option with NEON support
    -O2 or -O3                       - select high optimization level
    -Otime                           - optimize for speed over space
- Selecting optimization level -O3 will optimize more aggressively for speed, but at the expense of increased code size

Tuning C/C++ Code for Vectorizing

- The goal is to try to make the code simple, straightforward, and parallel, so that the compiler can easily convert the code to NEON assembly
- Loops can be modified for better vectorizing:
  - Short, simple loops work the best (even if it means multiple loops in your code)
  - Avoid breaks in loops
  - Try to make the number of iterations a power of 2
  - Try to make sure the number of iterations is known to the compiler
  - Functions called inside a loop should be inlined
- Pointer issues:
  - Using arrays with indexing vectorizes better than using pointers
  - Indirect addressing (multiple indexing or de-referencing) doesn't vectorize
  - Use the __restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory

NEON Vectorizing Example (1)
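
A minimal sketch of the tuning guidelines above, written for illustration and not taken from the original slides. The function name `add_i32` and the `quads` parameter are hypothetical. The loop is short, has no breaks, uses array indexing, marks its pointers with `__restrict`, and writes its trip count as `4 * quads` so that it is visibly a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: element-wise sum of two int32 arrays.
 * __restrict promises the compiler that dst, a and b never overlap,
 * which removes the aliasing concern that can block vectorization.
 * The caller passes the length in groups of four elements. */
void add_i32(int32_t *__restrict dst,
             const int32_t *__restrict a,
             const int32_t *__restrict b,
             size_t quads)
{
    assert(dst != NULL && a != NULL && b != NULL);

    /* Simple indexed loop, no break, trip count is 4 * quads */
    for (size_t i = 0; i < 4 * quads; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Built with the options quoted earlier (for example `armcc -O3 -Otime --vectorize --cpu Cortex-A8`, or `gcc -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize`), a loop of this shape is the kind a vectorizing compiler can be expected to handle; whether it actually emits NEON instructions depends on the compiler version, so the generated assembly is worth checking with `-S`.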
