NEON Technology Introduction
Venu Gopal Reddy
1 ARM Architecture Evolution
[Figure: Key Technology Additions by Architecture Generation. Generations shown: ARMv5, ARMv6, ARMv7 A&R, ARMv7M. Cores named: ARM9, ARM10, ARM11. Technologies labelled in the figure: Improved Media and DSP, Jazelle®, VFPv2, SIMD, Thumb®-2, TrustZone™, Improved memory use, NEON™ Adv SIMD, VFPv3, Thumb-EE Execution Environments, Low Cost MCU, Thumb-2 Only.]
2 What is NEON?
. NEON technology is a wide SIMD data processing architecture
  . Extension of the ARM instruction set
  . 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
. NEON instructions perform “Packed SIMD” processing
  . Registers are considered as vectors of elements of the same data type
  . Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, 32-bit float
  . Instructions perform the same operation in all lanes
[Figure: source registers Dn and Dm, each divided into elements; the operation is applied lane by lane and the results are written to the destination register Dd.]
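The lane model above can be sketched with NEON intrinsics. This is a minimal illustration, not code from the presentation: the function name `add_s16_lanes` is hypothetical, the length is assumed to be a multiple of 4, and a scalar fallback is included so the sketch also compiles on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 16-bit elements.
 * Each step handles one 64-bit D register, i.e. four 16-bit lanes. */
void add_s16_lanes(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0); /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 4) {
#if HAVE_NEON
        int16x4_t va = vld1_s16(a + i);      /* load 4 lanes into a D register */
        int16x4_t vb = vld1_s16(b + i);
        vst1_s16(dst + i, vadd_s16(va, vb)); /* the same add in every lane */
#else
        /* Portable fallback: the same per-lane operation, one lane at a time. */
        for (size_t lane = 0; lane < 4; lane++)
            dst[i + lane] = (int16_t)(a[i + lane] + b[i + lane]);
#endif
    }
}
```

Calling `add_s16_lanes(d, a, b, 8)` processes two D registers' worth of data; the NEON path issues one add per four elements instead of one per element.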
3 NEON Coprocessor Registers
. NEON has a 256-byte register file
  . Separate from the core registers (r0-r15)
  . Extension to the VFPv2 register file (VFPv3)
. Two different views of the NEON registers
  . 32 x 64-bit registers (D0-D31)
  . 16 x 128-bit registers (Q0-Q15)
. Enables register trade-offs
  . Vector length can be variable
  . Different registers available
[Figure: register file map. Q0 overlays D0 and D1, Q1 overlays D2 and D3, and so on up to Q15, which overlays D30 and D31.]
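The two register views can be illustrated from C with intrinsics that split a Q-register value into its two D-register halves. This is a sketch under assumptions: `sum4_s32` is a hypothetical name, and the scalar branch exists only so the example builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: sum of four 32-bit integers.
 * The data is loaded as one 128-bit Q value and then viewed as two 64-bit D halves. */
int32_t sum4_s32(const int32_t *p)
{
    assert(p != NULL);
#if HAVE_NEON
    int32x4_t q  = vld1q_s32(p);       /* Q view: 4 x 32-bit lanes */
    int32x2_t lo = vget_low_s32(q);    /* D view, lower half: lanes 0 and 1 */
    int32x2_t hi = vget_high_s32(q);   /* D view, upper half: lanes 2 and 3 */
    int32x2_t s  = vadd_s32(lo, hi);   /* {p0 + p2, p1 + p3} */
    return vget_lane_s32(s, 0) + vget_lane_s32(s, 1);
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```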
4 What are the Operations?
. A comprehensive set of data processing instructions
  . Form a general purpose SIMD instruction set suitable for compilers
. NEON operations fall into the following categories
  . Addition / Subtraction (Saturating, Halving, Rounding)
  . MIN, MAX, NEG, MOV, ABS, ABD, …
  . Multiplication (MUL, MLA, MLS, …)
  . Shifts (Saturating, Rounding)
  . Comparison and Selection
  . Logical (AND, ORR, EOR, BIC, ORN, …)
  . Bitfield
  . Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
  . Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP, TRN, …)
  . Many more…
5 Long, Narrow and Wide Operations
. NEON can utilise both register views in the same instruction
  . Enables instructions to promote or demote elements within an operation
[Figure: D-register operands (Dn, Dm) producing a Q-register result (Qd), and a Q-register operand (Qn) with an immediate (#) producing a D-register result (Dd).]
. Long operations promote elements to double the precision
  . Multiply Long (16 x 16 -> 32), Add/Sub Long, Shift Long
. Narrow operations demote data type to half the precision
  . Narrowing Shift Right, Add/Sub and Move
. Wide operations promote the elements of the second operand
  . Add/Sub Wide (16 + 32 -> 32)
. Allows number of lanes of processing to remain constant
. Enables elements to be efficiently kept at appropriate precision
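A rough sketch of one long, one wide and one narrow operation using intrinsics follows. The function names are hypothetical, each function handles exactly four lanes, and the scalar branches are a fallback for builds without NEON (they assume arithmetic right shift of negative values, as on common compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Long: 16 x 16 -> 32. Two D-register inputs, one Q-register result. */
void mul_long4(int32_t dst[4], const int16_t a[4], const int16_t b[4])
{
    assert(dst != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    vst1q_s32(dst, vmull_s16(vld1_s16(a), vld1_s16(b)));
#else
    for (int i = 0; i < 4; i++)
        dst[i] = (int32_t)a[i] * (int32_t)b[i];
#endif
}

/* Wide: 32 + 16 -> 32. Only the second operand is promoted. */
void add_wide4(int32_t acc[4], const int16_t x[4])
{
    assert(acc != NULL && x != NULL);
#if HAVE_NEON
    vst1q_s32(acc, vaddw_s16(vld1q_s32(acc), vld1_s16(x)));
#else
    for (int i = 0; i < 4; i++)
        acc[i] += x[i];
#endif
}

/* Narrow: shift each 32-bit lane right by 8, keep the low 16 bits. */
void shift_narrow4(int16_t dst[4], const int32_t src[4])
{
    assert(dst != NULL && src != NULL);
#if HAVE_NEON
    vst1_s16(dst, vshrn_n_s32(vld1q_s32(src), 8));
#else
    for (int i = 0; i < 4; i++)
        dst[i] = (int16_t)(src[i] >> 8);
#endif
}
```

In all three cases four lanes go in and four lanes come out; only the element width changes, which is the point made on the slide about keeping the number of lanes constant.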
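6 Pairwise Operations
. NEON also supports pairwise instructions to add across registers
  . ADD, MIN, MAX
. Normal: adjacent pairs of elements are combined; the result elements keep the source width
. Long: adjacent pairs of elements are added; the result elements have double the source width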
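As an illustration of the normal and long pairwise forms, the sketch below uses the `vpadd_s32` and `vpaddl_s16` intrinsics. The wrapper names are hypothetical and the scalar branches are only a fallback for builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Normal pairwise add: dst = {a0 + a1, b0 + b1}, same element width. */
void pairwise_add(int32_t dst[2], const int32_t a[2], const int32_t b[2])
{
    assert(dst != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    vst1_s32(dst, vpadd_s32(vld1_s32(a), vld1_s32(b)));
#else
    dst[0] = a[0] + a[1];
    dst[1] = b[0] + b[1];
#endif
}

/* Long pairwise add: four 16-bit inputs become two 32-bit sums. */
void pairwise_add_long(int32_t dst[2], const int16_t a[4])
{
    assert(dst != NULL && a != NULL);
#if HAVE_NEON
    vst1_s32(dst, vpaddl_s16(vld1_s16(a)));
#else
    dst[0] = (int32_t)a[0] + (int32_t)a[1];
    dst[1] = (int32_t)a[2] + (int32_t)a[3];
#endif
}
```

The long form is useful for reductions, because the sum of two 16-bit values such as 30000 + 30000 does not fit in 16 bits but does fit in the widened 32-bit result.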
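7 Load/Store Instructions
. Various memory access patterns are possible with single instructions
[Figure: memory-to-register access patterns. Contiguous elements x0..x7 in memory loaded into NEON registers; interleaved x, y, z structure data in memory loaded so that each register holds only x, only y or only z elements; interleaved x0, y0, x1, y1, ... data split into one register of x3..x0 and one of y3..y0.]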
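A de-interleaving structure load of the kind shown in the figure can be sketched with the `vld3_u8` intrinsic, which loads 24 interleaved bytes and separates them into three D registers. The function name `deinterleave3` and the multiple-of-8 length restriction are assumptions made for this example; the scalar branch is a fallback for builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: split n packed {x, y, z} byte triples into three planes. */
void deinterleave3(uint8_t *x, uint8_t *y, uint8_t *z,
                   const uint8_t *src, size_t n)
{
    assert(n % 8 == 0); /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 8) {
#if HAVE_NEON
        uint8x8x3_t v = vld3_u8(src + 3 * i); /* one load, three de-interleaved registers */
        vst1_u8(x + i, v.val[0]);
        vst1_u8(y + i, v.val[1]);
        vst1_u8(z + i, v.val[2]);
#else
        for (size_t k = i; k < i + 8; k++) {
            x[k] = src[3 * k];
            y[k] = src[3 * k + 1];
            z[k] = src[3 * k + 2];
        }
#endif
    }
}
```

The same pattern applies to packed RGB pixels or to arrays of 3-component vectors, where the lane-wise instructions then operate on one component at a time.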
8 NEON Processing Performance
[Figure: bar chart of BDTImark2000™ / BDTIsimMark2000™ scores, horizontal axis 0 to 5000, with one bar for each of:
  Cortex-A8/NEON (600MHz) (projected) *
  PXA27x/WMMX (624MHz) (XScale)
  ARM1176 (335MHz)
  ARM9E (265MHz)
  SH3-DSP (200MHz)]
BDTImark2000™ and BDTIsimMark2000™ are registered trademarks of BDTI. Contact [email protected] for more info.
Notes:
. Cortex-A8*: Certified and published as 7.6 BDTIsimMarks/MHz (http://www.bdti.com/bdtimark/cortex_a8.htm). Projected Cortex-A8 result at OMAP35xx baseline frequency. The OMAP35x platform itself is not currently certified.
. PXA27: 2140 BDTImarks measured at 624 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
. SH3: 490 BDTImarks measured at 200 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
. ARM9, ARM11: Results quoted at http://www.bdti.com/bdtimark/chip_fixed_scores.pdf
. BDTIsimMark2000 is calculated in the same manner as BDTImark2000, but with simulated results instead of hardware measurements
9 Why NEON?
. General purpose SIMD processing useful for many applications
  . Supports the widest range of multimedia codecs used for internet applications
  . Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …
  . Ideal solution for 'internet streaming' decode of various formats
. Fewer cycles needed
  . NEON will give 1.6x-2.5x performance on complex video codecs
  . Individual simple DSP algorithms can show larger performance boost (4x-8x)
  . Processor can sleep sooner => overall dynamic power saving
. Easy to program
  . Clean orthogonal vector architecture, applicable to a wide range of data intensive computation
  . Not just for codecs – also applicable to 2D & 3D graphics and other processing
  . Off the shelf tools, OS support, and ecosystem support
10 NEON - Enhancing User Experiences
. Watch any video in any format
. Game processing
. Edit & Enhance captured videos
. Video stabilization
. Process megapixel photos quickly
. Voice recognition
. Antialiased rendering & compositing
. Advanced User Interfaces
. Powerful multi-channel hi-fi audio processing
11 NEON in Audio
. FFT: 256-point, 16-bit signed complex numbers
. FFT is a key component of AAC, Voice/pattern recognition etc.
. Hand optimized assembler in both cases

FFT time                            No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
Cortex-A8 500MHz (actual silicon)   15.2 us                 3.8 us (x 4.0 performance)

. Extreme example: FFT in ffmpeg: 12x faster (Cortex-A8)
  . C code -> handwritten asm
  . Scalar -> vector processing
  . Single-precision floating point on Cortex-A8
12 How to use NEON
. OpenMAX DL library
  . Recommended approach to accelerate AV codecs
  . Status: Released on http://www.arm.com/products/esd/openmax_home.html
. Vectorizing Compilers
  . Exploit NEON SIMD automatically with existing source code
  . Status: Released (in RVDS 3.1 Professional and later)
  . Status: Codesourcery 2007q3 gcc and later
. C Intrinsics
  . C function call interface to NEON operations
  . Supports all data types and operations supported by NEON
  . Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
. Assembler
  . For those who really want to optimize at the lowest level
  . Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)
13 OpenMAX DL v1.0 Library Summary
. Video Domain
  . MPEG-4 simple profile
  . H.264 baseline
. Still Image Domain
  . JPEG
. Image Processing Domain
  . Colorspace conversion
  . De-blocking / de-ringing
  . Rotation, scaling, compositing
. Audio Domain
  . MP3
  . AAC
. Signal Processing Domain
  . FIR
  . IIR
  . FFT
  . Dot Product

Spec from: www.khronos.org/openmax
Opensource implementation for ARM11 & NEON available from: http://www.arm.com/products/multimedia/openmax/

NOTE: OpenMAX DL provides low level data processing functions, not the complete codecs
14 OpenMAX early H.264 results
. ARM Internal H.264 test codec using OpenMAX function calls
  . For OpenMAX development only - not fully optimized
  . Does not yet use NEON for deblocking (about 20% of cycles)
  . Currently, only about 50% of cycles spent in NEON optimized code (commercial codecs will use more NEON and will be better)
  . Includes YUV-RGB color conversion (could be done in graphics h/w)
. Conditions: SystemBench (256K PL310 L2, 3:1 core:mem ratio, PL340 with mobile LPDDR)
. Input sequence: Foreman VGA at 30fps (1s sequence at bitrate of 512kbps)
. Mcyc directly translates to MHz to decode

H.264 decode   No NEON   With NEON
Cortex-A8      716Mcyc   398Mcyc (x 1.80 performance)
Cortex-A9      633Mcyc   386Mcyc (x 1.64 performance)
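The performance factors in the table are cycle-count ratios, and because the test sequence lasts 1 s, a cycle count in Mcyc reads directly as the clock frequency needed for real-time decode. A short worked check, using only the numbers quoted above:

```latex
\begin{align*}
\text{Cortex-A8 speed-up} &= \frac{716\ \text{Mcyc}}{398\ \text{Mcyc}} \approx 1.80 \\
\text{Cortex-A9 speed-up} &= \frac{633\ \text{Mcyc}}{386\ \text{Mcyc}} \approx 1.64 \\
f_{\text{real-time}} &= \frac{398\ \text{Mcyc}}{1\ \text{s}} = 398\ \text{MHz}
  \quad \text{(Cortex-A8 with NEON)}
\end{align*}
```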
Commercial vendor performance results also available under NDA
15 ARM RVDS & gcc vectorising compiler
Source (test.c):

    int a[256], b[256], c[256];
    void foo(void) {
        int i;
        for (i = 0; i < 256; i++) {
            a[i] = b[i] + c[i];
        }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32  {d0,d1},[r0]!
        SUBS     r3,r3,#1
        VLD1.32  {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32  {d0,d1},[r2]!
        BNE      |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add      r1, r0, ip
        add      r3, r0, lr
        add      r2, r0, r4
        add      r0, r0, #8
        cmp      r0, #1024
        fldd     d7, [r3, #0]
        fldd     d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd     d7, [r1, #0]
        bne      .L2

. armcc generates better NEON code (gcc can use Q-regs with '-mvectorize-with-neon-quad')
16 ARM RVDS Vectorizing Compiler
. RVDS 4.0 professional includes auto-vectorizing armcc
  . armcc --vectorize --cpu=Cortex-A8 x.c
. Up to 4x performance increase for benchmarks, with no source code changes (no source code changes are permitted for benchmarking)
[Figure: bar chart "ARM vs NEON (Vectorize) on Cortex-A8" comparing ARM and NEON builds on the Telecom and Consumer benchmark groups. Data labels recoverable from the chart: 100%, 100%, 135%, 169%. Callout: improved vectorization in latest RVDS 4.0.]
. Simple source code changes can yield significant improvements above this
  . Use C '__restrict' keyword to work around C pointer aliasing issues
  . Make the loop iteration count clearly a multiple of 2^n (e.g. use 4*n as loop end) to aid vectorization
17 Automatic Vectorizing
. Automatic vectorization can generate code targeted for NEON from ordinary C source code
  . Less effort to produce efficient code
  . Portable - no compiler-specific source code features need to be used
. To enable automatic vectorization, use these options together:
    --vectorize                     enable vectorization
    --cpu 7-A or --cpu Cortex-A8    provide a CPU option with NEON support
    -O2 or -O3                      select high optimization level
    -Otime                          optimize for speed over space
. Selecting optimization level -O3 will optimize more aggressively for speed, but at the expense of increased code size
18 Tuning C/C++ Code for Vectorizing
. The goal is to try to make the code simple, straightforward, and parallel, so that the compiler can easily convert the code to NEON assembly
. Loops can be modified for better vectorizing:
  . Short, simple loops work the best (even if it means multiple loops in your code)
  . Avoid breaks in loops
  . Try to make the number of iterations a power of 2
  . Try to make sure the number of iterations is known to the compiler
  . Functions called inside a loop should be inlined
. Pointer issues:
  . Using arrays with indexing vectorizes better than using pointers
  . Indirect addressing (multiple indexing or de-referencing) doesn't vectorize
  . Use the __restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory
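The loop and pointer guidelines above might be applied as in the sketch below. It is an illustration only: `scale_add` is a hypothetical function, and it uses the standard C99 `restrict` qualifier in place of the armcc-specific `__restrict` spelling named on the slide. Whether a given compiler actually vectorizes the main loop depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dst[i] += k * src[i].
 * - array indexing rather than pointer walking
 * - restrict tells the compiler that dst and src do not overlap
 * - the main loop count is visibly a multiple of 4; a short tail loop
 *   handles the remaining 0 to 3 elements */
void scale_add(float *restrict dst, const float *restrict src,
               float k, unsigned n)
{
    assert(dst != NULL && src != NULL);
    unsigned main_count = n & ~3u; /* largest multiple of 4 not above n */
    unsigned i;

    for (i = 0; i < main_count; i++)  /* candidate for vectorization */
        dst[i] += k * src[i];

    for (; i < n; i++)                /* scalar tail */
        dst[i] += k * src[i];
}
```

Splitting the work into a simple main loop and a separate tail follows the "short, simple loops" advice, at the cost of a few extra lines of source.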
19 NEON Vectorizing Example (1)
. How does the compiler perform vectorization?

    void add_int(int * __restrict pa,
                 int * __restrict pb,
                 unsigned int n, int x)
    {
        unsigned int i;
        __promise(n > 0 && n % 4 == 0);
        for(i = 0; i < n; i++)
            pa[i] = pb[i] + x;
    }

1. Analyze each loop:
   . Are pointer accesses safe for vectorization?
   . What data types are being used? How do they map onto NEON vector registers?
   . Number of loop iterations
2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization

    void add_int(int *pa, int *pb,
                 unsigned n, int x)
    {
        unsigned int i;
        for (i = ((n & ~3) >> 2); i; i--)
        {
            *(pa + 0) = *(pb + 0) + x;
            *(pa + 1) = *(pb + 1) + x;
            *(pa + 2) = *(pb + 2) + x;
            *(pa + 3) = *(pb + 3) + x;
            pa += 4; pb += 4;
        }
    }

3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions
[Figure: a 128-bit register (bits 127 to 0) holding four pb elements; x is added in each lane to give four pa elements.]
20 NEON Vectorizing Example (2)
With vectorizing compilation (armcc --cpu=Cortex-A8 -O2 -Otime --vectorize):

    add_int PROC
        BICS     r12,r2,#3
        BEQ      |L1.40|
        VDUP.32  q1,r3
        LSRS     r2,r2,#2
        BEQ      |L1.40|
    |L1.20|
        VLD1.32  {d0,d1},[r1]!
        VADD.I32 q0,q0,q1
        SUBS     r2,r2,#1
        VST1.32  {d0,d1},[r0]!
        BNE      |L1.20|
    |L1.40|
        BX       lr
        ENDP

Without vectorizing compilation (armcc --cpu=Cortex-A8 -O2 -Otime):

    add_int PROC
        MOV      r12,#0
        PUSH     {r4}
        BICS     r4,r2,#3
        BEQ      |L1.44|
        BIC      r2,r2,#3
    |L1.20|
        LDR      r4,[r1,r12,LSL #2]
        ADD      r4,r4,r3
        STR      r4,[r0,r12,LSL #2]
        ADD      r12,r12,#1
        CMP      r2,r12
        BHI      |L1.20|
    |L1.44|
        POP      {r4}
        BX       lr
        ENDP
21 Intrinsics
. Include intrinsics header file
    #include <arm_neon.h>
. Use special NEON data types which correspond to D and Q registers, e.g.
    int8x8_t    D-register containing 8x 8-bit elements
    int16x4_t   D-register containing 4x 16-bit elements
    int32x4_t   Q-register containing 4x 32-bit elements
. Use special intrinsics versions of NEON instructions
    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);
. Strongly typed!
  . Use vreinterpret_s16_s32( ) to change the type
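Putting the intrinsics together, the add_int loop from the vectorizing example could be written by hand roughly as follows. This is a sketch, not the presentation's code: the name `add_int_neon` is hypothetical, `int32_t` is used so the pointer types match the intrinsic signatures, n is assumed to be a multiple of 4, and the scalar branch is a fallback for builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical hand-written version of add_int: pa[i] = pb[i] + x. */
void add_int_neon(int32_t *pa, const int32_t *pb, unsigned n, int32_t x)
{
    assert(n % 4 == 0); /* sketch only: no tail handling */
#if HAVE_NEON
    int32x4_t vx = vdupq_n_s32(x);               /* x copied into all 4 lanes */
    for (unsigned i = 0; i < n; i += 4) {
        int32x4_t vin  = vld1q_s32(pb + i);      /* load 4 x 32-bit elements */
        int32x4_t vout = vaddq_s32(vin, vx);     /* lane-wise add */
        vst1q_s32(pa + i, vout);                 /* pointer first, then vector */
    }
#else
    for (unsigned i = 0; i < n; i++)
        pa[i] = pb[i] + x;
#endif
}
```

The NEON branch has the same shape as the vectorized compiler output: one duplicate of x outside the loop, then a load, an add and a store per four elements.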
22 BeagleBoard
. Funded by TI
. Open Source/Creative Commons hardware
. OMAP3 / Cortex-A8 NEON @ 720MHz, 256MB RAM (RevC)
. Price: $149 via Digi-Key
. 2xUSB 2.0 (MUSB + EHCI)
. SDHC
. DVI-D video
. Audio I/O
[Photo: BeagleBoard with a 3” dimension marker.]
. Almost indistinguishable from Linux PC (except runs very cool)
. Runs Angstrom, Ubuntu 09.04 / 09.10, Android, WinCE, etc.
23 NEON in Opensource
. Bluez – official Linux Bluetooth protocol stack
  . NEON sbc audio encoder
. Pixman (part of cairo 2D graphics library)
  . Compositing/alpha blending
  . X.Org, Mozilla Firefox, fennec, & Webkit browsers
  . e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
. ffmpeg – libavcodec
  . LGPL media player used in many Linux distros
  . NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
  . NEON Audio: AAC, Vorbis, WMA
. x264 – Google Summer Of Code 2009
  . GPL H.264 encoder – e.g. for video conferencing
. Android – NEON optimizations
  . Skia library, S32A_D565_Opaque 5x faster using NEON
  . Available in Google Skia tree from 03-Aug-2009
. Eigen2 – C++ vector math / linear algebra template library
. Theorarm – libtheora NEON version (now BSD license)
. Ubuntu – NEON versions of critical shared-libraries
24 Cairo-perf-test Benchmark Suite
[Figure: bar chart "Performance NEON vs. ARM v6 SIMD" across the cairo-perf-test benchmarks, vertical axis 0.00 to 3.00.]
. Optimized assembly code in both cases ('pixman' library)
. Overall NEON benefit +43%
. NEON optimization work ongoing
25 ffmpeg (libavcodec) Performance
. git.ffmpeg.org snapshot 21-Sep-09
. YouTube HQ video decode
  . 480x270, 30fps
  . Including AAC audio
[Figure: bar chart "ffmpeg performance (relative to realtime)", vertical axis 0 to 3, series v7vfp and v7neon, for Cortex-A8 256KB L2 500MHz and Cortex-A9 512KB L2 400MHz.]
. Real silicon measurements
  . OMAP3 Beagleboard
  . ARM Cortex-A9 testchip
. NEON ~2x overall performance
26 Google WebM/VP8: SMP Dual-threaded
[Figure: bar chart of MHz required for decode; lower is better.]
. Lower MHz results in lower power solutions
. Single core Atom only just able to decode
. 40% of one CPU still available
. No optimized MIPS code available in WebM release
Source: ARM benchmarking of opensource libvpx codec running on 400MHz Cortex-A9 testchip with 200MHz DDR
27 Scalability with SMP on Cortex-A9
[Figure: bar chart "x264 encode performance", vertical axis 0 to 25, series ultrafast and veryfast, for the configurations below.]

Core    Code     Threads   Clock
ARM11   v6       1         100MHz
A9TC    v6       1         100MHz
A9TC    v6       2         100MHz
A9TC    v7neon   1         100MHz   (NEON used)
A9TC    v7neon   2         100MHz   (NEON used)
28 NEON Power Efficiency
. NEON unit approx same mW/MHz as integer core
  . Overall power saving from reduced MHz (or core in standby WFI)
. Example: Youtube HQ video 480x270 H264 30fps
  . ffmpeg opensource player on 500MHz Cortex-A8 decoding to YUV420
  . Video only: 64fps at 500MHz, so 30fps needs 234MHz average => 117mW (0.5mW/MHz)
  . Including AAC-LC (audio decode using NEON IMDCT): 60fps at 500MHz, so 30fps needs 250MHz average => 125mW (0.5mW/MHz)
. Total power (including memory accesses) on Beagle ~1mW/MHz
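The figures in the example follow from simple scaling. As a worked check, assuming that decode cost scales linearly with frame rate and that power scales linearly with average clock frequency:

```latex
\begin{align*}
f_{\text{video}} &= 500\ \text{MHz} \times \frac{30\ \text{fps}}{64\ \text{fps}} \approx 234\ \text{MHz} \\
f_{\text{video+audio}} &= 500\ \text{MHz} \times \frac{30\ \text{fps}}{60\ \text{fps}} = 250\ \text{MHz} \\
\frac{117\ \text{mW}}{234\ \text{MHz}} &= \frac{125\ \text{mW}}{250\ \text{MHz}} = 0.5\ \text{mW/MHz}
\end{align*}
```

If the quoted total of about 1 mW/MHz on the Beagle applies to the same workload, the video-only case would draw roughly 234 mW in total; that last figure is an extrapolation, not a number from the slide.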
29 Measured Power Benefits Cortex-A9
. Quad-core 90nm testchip
[Figure: bar chart, vertical axis 0 to 4, series "NEON rel perf" and "Perf/power", for four workloads: OpenMAX colorspace conversion; ffmpeg youtube_matt22.mp4 1280x720+aac; ffmpeg jacob.ogg 320x240; ffmpeg test1.aac.]
30 NEON Codec Capabilities
. Contact commercial vendors for optimized codec results
. 500MHz Cortex-A8 (mobile implementation) approximate performance:
  . MPEG-4 ASP decode 720p @ 24fps
  . H.264 decode D1 @ 30fps
  . H.264 encode CIF (352x288) @ 30fps
. Scales with CPU performance (e.g. 2GHz Cortex-A9)
. Cortex-A9MP can use NEON per thread for N x scaling
31 NEON Third Party Ecosystem
[Figure: logos of NEON software partners. The company logos did not survive extraction; the text labels recovered from the figure are listed below.]
. H.264, VC1, MPEG-4
. MPEG-4
. VP6/7, MPEG-4, VC-1, H.264 (enc+dec), video stabilization
. GUI visual effects
. MPEG-4, MPEG-2, H.263, H.264, WMV9, VC1, DD+
. drawElements (Finland): 2D GUI library
. MPEG-4, H.263, H.264, WMV9, audio
. Espico Ltd: audio (low-bitrate & digital theater), consulting
. H.264, VC1
. CoreAVC ultra fast
. TEAMSpirit voice & video codec
. H.264, MPEG-4, H.263, WMV
. MobiClip Codecs
. Multichannel audio processing
. Adobe Flash products

ARM NEON widely supported by software partners
Full list on www.arm.com NEON ecosystem page
32 NEON Summary
. NEON to become standard on general purpose apps & media-centric devices
  . Able to support future internet multimedia standards
  . NEON option across the entire Cortex-A roadmap
  . NEON technology ideal for use by downloaded apps
. Full enabling infrastructure to support NEON
  . Compilers, profilers, debuggers, libraries all available now
  . Key differentiator: easy to program, popular with software engineers
. Strong ARM NEON ecosystem
  . Broad support from many software vendors
  . OS support from Linux, Symbian, Microsoft, Android…
  . Increasing opensource adoption
33 Thank You
Please visit www.arm.com for ARM related technical details
For any queries contact < [email protected] >