NEON Technology Introduction

Venu Gopal Reddy

1 ARM Architecture Evolution

[Figure: key technology additions by architecture generation, across ARMv5, ARMv6, ARMv7A&R and ARMv7M. Cores shown: ARM9, ARM10, ARM11. Features shown: Jazelle®, VFPv2, SIMD, Thumb®-2, TrustZone™, improved media and DSP, improved memory use, Thumb-EE execution environments, NEON™ Adv SIMD, VFPv3, and Thumb-2 only low cost MCU.]

2 What is NEON?

. NEON technology is a wide SIMD data processing architecture
  . Extension of the ARM instruction set
  . 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
. NEON instructions perform “Packed SIMD” processing
  . Registers are considered as vectors of elements of the same data type
  . Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, 32-bit float
  . Instructions perform the same operation in all lanes

[Figure: source registers Dn and Dm, each divided into elements; the operation is applied lane by lane to produce the destination register Dd.]
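The lane-by-lane behaviour can be sketched in portable C. This is only a scalar model, not NEON code: the type d_reg_u8x8 and the function add_u8x8 are hypothetical names chosen for illustration, and a real NEON add would process all eight lanes in one instruction.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit D register viewed as 8 lanes of 8 bits. */
typedef struct {
    uint8_t lane[8];
} d_reg_u8x8;

/* Apply the same operation (add) in every lane. Each lane wraps modulo 256
 * independently; nothing carries from one lane into the next. */
static d_reg_u8x8 add_u8x8(d_reg_u8x8 dn, d_reg_u8x8 dm)
{
    d_reg_u8x8 dd;
    for (int i = 0; i < 8; i++)
        dd.lane[i] = (uint8_t)(dn.lane[i] + dm.lane[i]);
    return dd;
}
```

The point of the model is that lanes are independent: an overflow in one element never disturbs its neighbour, which is what distinguishes a packed SIMD add from a plain 64-bit integer add.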

3 NEON Coprocessor Registers

. NEON has a 256-byte register file
  . Separate from the core registers (r0-r15)
  . Extension to the VFPv2 register file (VFPv3)

. Two different views of the NEON registers
  . 32 x 64-bit registers (D0-D31)
  . 16 x 128-bit registers (Q0-Q15)
. Enables register trade-offs
  . Vector length can be variable
  . Different registers available

[Figure: register file layout in which Q0 overlays D0 and D1, Q1 overlays D2 and D3, and so on up to Q15 overlaying D30 and D31.]
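A rough way to picture the two views is a C union in which the same 256 bytes are addressed either as 32 doublewords or as 16 quadwords. The names neon_regfile, write_q and read_d are hypothetical, and the sketch assumes that D(2n) holds the low half of Qn.

```c
#include <stdint.h>

/* Hypothetical model of the 256-byte NEON register file. */
typedef union {
    uint64_t d[32];                 /* view 1: D0..D31, 64 bits each   */
    struct {
        uint64_t lo;                /* low half of Qn  = D(2n)         */
        uint64_t hi;                /* high half of Qn = D(2n+1)       */
    } q[16];                        /* view 2: Q0..Q15, 128 bits each  */
} neon_regfile;

/* Write a 128-bit Q register as two 64-bit halves. */
static void write_q(neon_regfile *rf, int n, uint64_t lo, uint64_t hi)
{
    rf->q[n].lo = lo;
    rf->q[n].hi = hi;
}

/* Read a 64-bit D register through the other view of the same storage. */
static uint64_t read_d(const neon_regfile *rf, int n)
{
    return rf->d[n];
}
```

Writing Q1 and then reading D2 and D3 returns the two halves just written, which is the aliasing that lets one instruction mix D and Q operands.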

4 What are the Operations?

. A comprehensive set of data processing instructions
  . Form a general purpose SIMD instruction set suitable for compilers

. NEON operations fall into the following categories
  . Addition / Subtraction (Saturating, Halving, Rounding)
  . MIN, MAX, NEG, MOV, ABS, ABD, …
  . Multiplication (MUL, MLA, MLS, …)
  . Shifts (Saturating, Rounding)
  . Comparison and Selection
  . Logical (AND, ORR, EOR, BIC, ORN, …)
  . Bitfield
  . Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
  . Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP, TRN, …)

. Many more…

5 Long, Narrow and Wide Operations

. NEON can utilise both register views in the same instruction
. Enables instructions to promote or demote elements within operation

[Figure: operand registers Dn, Qn, Dm and an immediate (#) feeding destination registers Dd and Qd.]

. Long operations promote elements to double the precision
  . Multiply Long (16 x 16 -> 32), Add/Sub Long, Shift Long
. Narrow operations demote data type to half the precision
  . Shift Right and Narrowing Add/Sub, Move
. Wide operations promote the elements of the second operand
  . Add/Sub Wide (16 + 32 -> 32)

. Allows number of lanes of processing to remain constant
. Enables elements to be efficiently kept at appropriate precision
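The three forms can be modelled in scalar C to show how the lane count stays at four while the element width changes. The function names are hypothetical; the comments name the NEON instructions (VMULL, VADDW, VSHRN) that each loop is meant to resemble, and the narrowing model assumes a shift between 1 and 16.

```c
#include <stdint.h>

/* Long: 4 lanes of 16 x 16 -> 32, in the spirit of VMULL.S16 Qd, Dn, Dm. */
static void mul_long_s16(int32_t qd[4], const int16_t dn[4], const int16_t dm[4])
{
    for (int i = 0; i < 4; i++)
        qd[i] = (int32_t)dn[i] * (int32_t)dm[i];
}

/* Wide: 4 lanes of 32 + 16 -> 32, in the spirit of VADDW.S16 Qd, Qn, Dm.
 * Only the second operand is promoted. */
static void add_wide_s16(int32_t qd[4], const int32_t qn[4], const int16_t dm[4])
{
    for (int i = 0; i < 4; i++)
        qd[i] = qn[i] + (int32_t)dm[i];
}

/* Narrow: 4 lanes of 32 -> 16 after a right shift, in the spirit of
 * VSHRN.I32 Dd, Qm, #shift. The result keeps the low 16 bits (truncation). */
static void shift_right_narrow(int16_t dd[4], const int32_t qm[4], int shift)
{
    for (int i = 0; i < 4; i++)
        dd[i] = (int16_t)(uint16_t)((uint32_t)qm[i] >> shift);
}
```

A typical fixed-point pattern chains them: multiply long to avoid overflow, accumulate at 32 bits, then shift right and narrow back to 16 bits, all with four lanes in flight.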

6 Pairwise Operations

. NEON also supports pairwise instructions to add across registers
  . ADD, MIN, MAX

. Normal

. Long
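A scalar C model of the two pairwise forms may help. The function names are hypothetical; the normal form is modelled on VPADD.I16 (adjacent pairs of both operands packed into one result) and the long form on VPADDL.S16 (adjacent pairs of one operand, widened to 32 bits).

```c
#include <stdint.h>

/* Normal pairwise add: pairs from dn fill the low half of the result,
 * pairs from dm fill the high half. Element width is unchanged. */
static void padd_s16(int16_t dd[4], const int16_t dn[4], const int16_t dm[4])
{
    dd[0] = (int16_t)(dn[0] + dn[1]);
    dd[1] = (int16_t)(dn[2] + dn[3]);
    dd[2] = (int16_t)(dm[0] + dm[1]);
    dd[3] = (int16_t)(dm[2] + dm[3]);
}

/* Long pairwise add: adjacent pairs of a single operand are summed into
 * elements of twice the width, so the sums cannot overflow. */
static void paddl_s16(int32_t dd[2], const int16_t dm[4])
{
    dd[0] = (int32_t)dm[0] + (int32_t)dm[1];
    dd[1] = (int32_t)dm[2] + (int32_t)dm[3];
}
```

Repeating a pairwise add on its own output is one way to reduce a whole vector to a single sum.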

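7 Load/Store Instructions

. Various memory access patterns are possible with single instructions

[Figure: two access patterns between memory and NEON registers. In the first, consecutive elements x0..x7 in memory are loaded in order into registers holding x3 x2 x1 x0 and x7 x6 x5 x4. In the second, interleaved structure data (x, y, z members; x0 y0 x1 y1 x2 y2 x3 y3 in memory) is de-interleaved so that one register holds x3 x2 x1 x0, another holds y3 y2 y1 y0, and likewise for z.]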
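The de-interleaving pattern can be written out as a scalar C loop. The function name load_deinterleave3 is hypothetical; the loop is a model of what a three-element structure load such as VLD3.16 does in one instruction, assuming an array of four {x, y, z} structures of 16-bit members.

```c
#include <stdint.h>

/* Model of a de-interleaving structure load: memory holds
 * x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3, and each member ends up
 * gathered in its own 4-lane vector. */
static void load_deinterleave3(const int16_t *mem,
                               int16_t x[4], int16_t y[4], int16_t z[4])
{
    for (int i = 0; i < 4; i++) {
        x[i] = mem[3 * i];
        y[i] = mem[3 * i + 1];
        z[i] = mem[3 * i + 2];
    }
}
```

After the load, the same operation can be applied to all x values at once, which is why this pattern suits RGB pixels and 3D coordinates.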

8 NEON Processing Performance

[Figure: bar chart of BDTImark2000 / BDTIsimMark2000 scores, axis 0 to 5000, for the processors listed below.]

BDTImark2000™ and BDTIsimMark2000™ are registered trademarks of BDTI. Contact [email protected] for more info.

Cortex-A8/NEON (600MHz) (projected) *

PXA27x/WMMX (624MHz) (XScale)

ARM1176 (335MHz)

ARM9E (265MHz)

SH3-DSP (200MHz)

Notes:
. Cortex-A8*: Certified and published as 7.6 BDTIsimMarks/MHz (http://www.bdti.com/bdtimark/cortex_a8.htm). Projected Cortex-A8 result at OMAP35xx baseline frequency. The OMAP35x platform itself is not currently certified.
. PXA27: 2140 BDTImarks measured at 624 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
. SH3: 490 BDTImarks measured at 200 MHz (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
. ARM9, ARM11: Results quoted at (http://www.bdti.com/bdtimark/chip_fixed_scores.pdf)
. BDTIsimMark2000 is calculated in the same manner as BDTImark2000, but with simulated results instead of hardware measurements

9 Why NEON?

. General purpose SIMD processing useful for many applications
  . Supports the widest range of multimedia used for internet applications
  . Many soft standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …
  . Ideal solution for "internet streaming" decode of various formats

. Fewer cycles needed
  . NEON will give 1.6x-2.5x performance on complex video codecs
  . Individual simple DSP algorithms can show larger performance boost (4x-8x)
  . Processor can sleep sooner => overall dynamic power saving

. Easy to program
  . Clean orthogonal vector architecture, applicable to a wide range of data intensive computation
  . Not just for codecs – also applicable to 2D & 3D graphics and other processing
  . Off the shelf tools, OS support, and ecosystem support

10 NEON - Enhancing User Experiences

. Watch any video in any format
. Game processing
. Edit & enhance captured videos
. Video stabilization
. Process megapixel photos quickly
. Voice recognition
. Antialiased rendering & compositing
. Advanced user interfaces
. Powerful multi-channel hi-fi audio processing

11 NEON in Audio

. FFT: 256-point, 16-bit signed complex numbers
. FFT is a key component of AAC, Voice/pattern recognition etc.
. Hand optimized assembler in both cases

FFT time                           No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
Cortex-A8 500MHz, actual silicon   15.2 us                 3.8 us (x 4.0 performance)

. Extreme example: FFT in : 12x faster (Cortex-A8)
  . code -> handwritten asm
  . Scalar -> vector processing
  . Single-precision floating point on Cortex-A8

12 How to use NEON

. OpenMAX DL
  . Recommended approach to accelerate AV codecs
  . Status: Released on http://www.arm.com/products/esd/openmax_home.html

. Vectorizing Compilers
  . Exploits NEON SIMD automatically with existing source code
  . Status: Released (in RVDS 3.1 Professional and later)
  . Status: Codesourcery 2007q3 gcc and later

. C Intrinsics
  . C function call interface to NEON operations
  . Supports all data types and operations supported by NEON
  . Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)

. Assembler
  . For those who really want to optimize at the lowest level
  . Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

13 OpenMAX DL v1.0 Library Summary

. Video Domain
  . MPEG-4 simple profile
  . H.264 baseline
. Still Image Domain
  . JPEG
. Image Processing Domain
  . Colorspace conversion
  . De-blocking / de-ringing
  . Rotation, scaling, compositing
. Audio Domain
  . MP3
  . AAC
. Signal Processing Domain
  . FIR
  . IIR
  . FFT
  . Dot Product

Spec from: www.khronos.org/openmax
Opensource implementation for ARM11 & NEON available from: http://www.arm.com/products/multimedia/openmax/

NOTE: OpenMAX DL provides low level data processing functions, not the complete codecs

14 OpenMAX early H.264 results

. ARM Internal H.264 test codec using OpenMAX function calls
  . For OpenMAX development only - not fully optimized
  . Does not yet use NEON for deblocking (about 20% of cycles)
  . Currently, only about 50% of cycles spent in NEON optimized code (commercial codecs will use more NEON and will be better)
  . Includes YUV-RGB color conversion (could be done in graphics h/w)
. Conditions: SystemBench (256K PL310 L2, 3:1 core:mem ratio, PL340 with mobile LPDDR)
. Input sequence: Foreman VGA at 30fps (1s sequence at bitrate of 512kbps)
  => Mcyc directly translates to MHz to decode

H.264 decode   No NEON    With NEON
Cortex-A8      716Mcyc    398Mcyc (x 1.80 performance)
Cortex-A9      633Mcyc    386Mcyc (x 1.64 performance)
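Because the input sequence lasts one second, the cycle counts in the table read directly as the clock rate needed for realtime decode, and the quoted speedups are simple ratios. The helper names below are hypothetical and the sketch only restates the arithmetic behind the table.

```c
/* Clock rate (MHz) needed to decode in realtime: megacycles spent
 * divided by the duration of the sequence in seconds. */
static double required_mhz(double mcycles, double seconds)
{
    return mcycles / seconds;
}

/* Speedup from NEON: cycles without NEON divided by cycles with NEON. */
static double neon_speedup(double mcyc_without, double mcyc_with)
{
    return mcyc_without / mcyc_with;
}
```

For example, neon_speedup(716.0, 398.0) is about 1.80 for Cortex-A8 and neon_speedup(633.0, 386.0) is about 1.64 for Cortex-A9, matching the table.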

Commercial vendor performance results also available under NDA

15 ARM RVDS & gcc vectorising compiler

Source (test.c):

    int a[256], b[256], c[256];
    foo () {
      int i;
      for (i=0; i<256; i++){
        a[i] = b[i] + c[i];
      }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32  {d0,d1},[r0]!
        SUBS     r3,r3,#1
        VLD1.32  {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32  {d0,d1},[r2]!
        BNE      |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add      r1, r0, ip
        add      r3, r0, lr
        add      r2, r0, r4
        add      r0, r0, #8
        cmp      r0, #1024
        fldd     d7, [r3, #0]
        fldd     d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd     d7, [r1, #0]
        bne      .L2

. armcc generates better NEON code (gcc can use Q-regs with '-mvectorize-with-neon-quad')

16 ARM RVDS Vectorizing Compiler

. RVDS 4.0 professional includes auto-vectorizing armcc
  . armcc --vectorize --cpu=Cortex-A8 x.c

. Up to 4x performance increase for benchmarks, with no source code changes (no source code changes are permitted for benchmarking)

[Figure: bar chart "ARM vs NEON (Vectorize) on Cortex-A8" comparing ARM and NEON builds on Telecom and Consumer benchmarks. Bar labels include 100%, 135%, 169% and 170%. Callout: improved vectorization in latest RVDS 4.0.]

. Simple source code changes can yield significant improvements above this
  . Use the C '__restrict' keyword to work around C pointer aliasing issues
  . Make loop counts clearly a multiple of 2^n (e.g. use 4*n as loop end) to aid vectorization
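Both hints can be combined in one small function. This is a sketch only: scale_add and its parameters are hypothetical, and whether a given compiler actually vectorizes it depends on the compiler version and options.

```c
/* dst and src are promised not to overlap (__restrict), and the trip
 * count is written as 4 * n4 so that it is visibly a multiple of 4.
 * n4 is the number of groups of four elements, not the element count. */
void scale_add(float * __restrict dst, const float * __restrict src,
               float k, unsigned n4)
{
    for (unsigned i = 0; i < 4 * n4; i++)
        dst[i] = src[i] * k + dst[i];
}
```

Without __restrict the compiler has to assume that a store to dst[i] might change a later src element, which can block vectorization or force a runtime overlap check. With the 4 * n4 bound, no scalar clean-up loop is needed for leftover elements.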

17 Automatic Vectorizing

. Automatic vectorization can generate code targeted for NEON from ordinary C source code
  . Less effort to produce efficient code
  . Portable - no compiler-specific source code features need to be used

. To enable automatic vectorization, use these options together:
    --vectorize                     enable vectorization
    --cpu 7-A or --cpu Cortex-A8    provide a CPU option with NEON support
    -O2 or -O3                      select high optimization level
    -Otime                          optimize for speed over space

. Selecting optimization level -O3 will optimize more aggressively for speed, but at the expense of increased code size

18 Tuning C/C++ Code for Vectorizing

. The goal is to try to make the code simple, straightforward, and parallel, so that the compiler can easily convert the code to NEON assembly

. Loops can be modified for better vectorizing:
  . Short, simple loops work the best (even if it means multiple loops in your code)
  . Avoid breaks in loops
  . Try to make the number of iterations a power of 2
  . Try to make sure the number of iterations is known to the compiler
  . Functions called inside a loop should be inlined

. Pointer issues:
  . Using arrays with indexing vectorizes better than using pointers
  . Indirect addressing (multiple indexing or de-referencing) doesn't vectorize
  . Use the __restrict keyword to tell the compiler that pointers do not reference overlapping areas of memory

19 NEON Vectorizing Example (1)

. How does the compiler perform vectorization?

    void add_int(int * __restrict pa,
                 int * __restrict pb,
                 unsigned int n, int x)
    {
      unsigned int i;
      __promise(n > 0 && n % 4 == 0);
      for(i = 0; i < n; i++)
        pa[i] = pb[i] + x;
    }

1. Analyze each loop:
  . Are pointer accesses safe for vectorization?
  . What data types are being used? How do they map onto NEON vector registers?
  . Number of loop iterations

2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization

    void add_int(int *pa, int *pb,
                 unsigned n, int x)
    {
      unsigned int i;
      for (i = ((n & ~3) >> 2); i; i--)
      {
        *(pa + 0) = *(pb + 0) + x;
        *(pa + 1) = *(pb + 1) + x;
        *(pa + 2) = *(pb + 2) + x;
        *(pa + 3) = *(pb + 3) + x;
        pa += 4; pb += 4;
      }
    }

3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions

[Figure: four lanes of pb each added to x, producing four lanes of pa in one 128-bit register (bits 127 to 0).]

20 NEON Vectorizing Example (2)

With vectorizing compilation
(armcc --cpu=Cortex-A8 -O2 -Otime --vectorize)

    add_int PROC
        BICS     r12,r2,#3
        BEQ      |L1.40|
        VDUP.32  q1,r3
        LSRS     r2,r2,#2
        BEQ      |L1.40|
    |L1.20|
        VLD1.32  {d0,d1},[r1]!
        VADD.I32 q0,q0,q1
        SUBS     r2,r2,#1
        VST1.32  {d0,d1},[r0]!
        BNE      |L1.20|
    |L1.40|
        BX       lr
        ENDP

Without vectorizing compilation
(armcc --cpu=Cortex-A8 -O2 -Otime)

    add_int PROC
        MOV      r12,#0
        PUSH     {r4}
        BICS     r4,r2,#3
        BEQ      |L1.44|
        BIC      r2,r2,#3
    |L1.20|
        LDR      r4,[r1,r12,LSL #2]
        ADD      r4,r4,r3
        STR      r4,[r0,r12,LSL #2]
        ADD      r12,r12,#1
        CMP      r2,r12
        BHI      |L1.20|
    |L1.44|
        POP      {r4}
        BX       lr
        ENDP

21 Intrinsics

. Include intrinsics header file
    #include <arm_neon.h>

. Use special NEON data types which correspond to D and Q registers, e.g.
    int8x8_t     D-register containing 8x 8-bit elements
    int16x4_t    D-register containing 4x 16-bit elements
    int32x4_t    Q-register containing 4x 32-bit elements

. Use special intrinsics versions of NEON instructions
    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);

. Strongly typed!
  . Use vreinterpret_s16_s32( ) to change the type
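Putting the pieces together, the add_int loop from the vectorizing example could be written with intrinsics roughly as follows. This is a sketch under assumptions: the function name add_int_neon is hypothetical, n is assumed to be a multiple of 4, and a scalar fallback is included so the file still builds on targets without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* pa[i] = pb[i] + x for n elements; n is assumed to be a multiple of 4. */
static void add_int_neon(int32_t *pa, const int32_t *pb, unsigned n, int32_t x)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x4_t vx = vdupq_n_s32(x);            /* x copied into all 4 lanes */
    for (unsigned i = 0; i < n; i += 4) {
        int32x4_t vin  = vld1q_s32(pb + i);   /* load 4 ints into a Q reg  */
        int32x4_t vout = vaddq_s32(vin, vx);  /* lane-wise add             */
        vst1q_s32(pa + i, vout);              /* store: pointer, then data */
    }
#else
    /* Scalar fallback for non-NEON targets; same result, one lane at a time. */
    for (unsigned i = 0; i < n; i++)
        pa[i] = pb[i] + x;
#endif
}
```

The NEON path uses one load, one add and one store per four elements, the same shape as the compiler-generated VLD1 / VADD / VST1 loop shown for the vectorized build.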

22 BeagleBoard

. Funded by TI
. Open Source/Creative Commons hardware
. OMAP3 / Cortex-A8 NEON @ 720MHz, 256MB RAM (RevC)
. Price: $149 via Digi-Key
. 2xUSB 2.0 (MUSB + EHCI)
. SDHC
. DVI-D video
. Audio I/O

[Figure: board photograph with a 3” dimension marking.]

. Almost indistinguishable from PC (except runs very cool)

. Runs Angstrom, Ubuntu 09.04 / 09.10, Android, WinCE, etc.

23 NEON in Opensource

. Bluez – official Linux Bluetooth protocol stack
  . NEON sbc audio encoder
. Pixman (part of cairo 2D graphics library)
  . Compositing/alpha blending
  . X.Org, Mozilla Firefox, fennec, & Webkit browsers
  . e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
. ffmpeg –
  . LGPL media player used in many Linux distros
  . NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3,
  . NEON Audio: AAC, , WMA
  . Summer Of Code 2009
. GPL H.264 encoder – e.g. for video conferencing
. Android – NEON optimizations
  . Skia library, S32A_D565_Opaque 5x faster using NEON
  . Available in Google Skia tree from 03-Aug-2009
. Eigen2 – C++ vector math / linear algebra template library
. Theorarm – libtheora NEON version (now BSD license)
. Ubuntu – NEON versions of critical shared-libraries

24 Cairo-perf-test Benchmark Suite

[Figure: bar chart "Performance NEON vs. ARM v6 SIMD", relative performance axis 0.00 to 3.00, one bar per cairo-perf test.]

. Optimized assembly code in both cases ('pixman' library)
. Overall NEON benefit +43%
. NEON optimization work ongoing

25 ffmpeg (libavcodec) Performance

. ffmpeg.org

[Figure: bar chart "ffmpeg performance (relative to realtime)", snapshot 21-Sep-09, axis 0 to 3, comparing v7vfp and v7neon builds on Cortex-A8 (256KB L2, 500MHz) and Cortex-A9 (512KB L2, 400MHz). Workload: YouTube HQ video decode, 480x270, 30fps, including AAC audio.]

. Real silicon measurements
  . OMAP3 Beagleboard
  . ARM Cortex-A9 testchip

. NEON ~2x overall performance

26 Google WebM/VP8: SMP Dual-threaded

[Figure: bar chart of MHz required to decode, lower is better. Callouts: "Lower MHz results in lower power solutions"; "Single core Atom only just able to decode"; "40% of one CPU still available".]

No optimized MIPS code available in WebM release

Source: ARM benchmarking of opensource codec running on 400MHz Cortex-A9 testchip with 200MHz DDR

27 Scalability with SMP on Cortex-A9

x264 encode performance

[Figure: bar chart, axis 0 to 25, with one bar per x264 preset (ultrafast, veryfast) for each configuration below.]

CPU     Clock    Build     Threads
ARM11   100MHz   v6        1
A9TC    100MHz   v6        1
A9TC    100MHz   v6        2
A9TC    100MHz   v7neon    1   (NEON used)
A9TC    100MHz   v7neon    2   (NEON used)

28 NEON Power Efficiency

. NEON unit approx same mW/MHz as integer core => overall power saving from reduced MHz (or core in standby WFI)

. Example: Youtube HQ video 480x270 H264 30fps
  . ffmpeg opensource player on 500MHz Cortex-A8 decoding to YUV420
  . 64fps video only: 234MHz average (at 0.5mW/MHz => 117mW)
  . 60fps including AAC-LC (audio decode using NEON IMDCT): 250MHz average (at 0.5mW/MHz => 125mW)

. Total power (including memory accesses) on Beagle ~1mW/MHz
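The figures in the example follow from two small calculations, sketched here with hypothetical helper names. The 0.5 mW/MHz rate is not a separate input: it is what the quoted results imply (117 mW / 234 MHz, and 125 mW / 250 MHz).

```c
/* Average clock needed for realtime playback, given the frame rate
 * measured when the core runs flat out at clock_mhz. */
static double realtime_mhz(double clock_mhz, double measured_fps, double target_fps)
{
    return clock_mhz * target_fps / measured_fps;
}

/* Power at a given average clock, for a given mW/MHz figure. */
static double power_mw(double mhz, double mw_per_mhz)
{
    return mhz * mw_per_mhz;
}
```

With a 500MHz core, 64fps measured against a 30fps target gives realtime_mhz(500, 64, 30) of about 234 MHz, and 60fps gives exactly 250 MHz; power_mw(234, 0.5) and power_mw(250, 0.5) then give 117 mW and 125 mW.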

29 Measured Power Benefits Cortex-A9

. Quad-core 90nm testchip

[Figure: bar chart, axis 0 to 4, with two series (NEON rel perf, Perf/power) for these workloads: OpenMAX colorspace conversion; ffmpeg youtube_matt22.mp4; ffmpeg jacob. (labels 320x240 and 1280x720+aac also appear); ffmpeg test1.aac.]

30 NEON Codec Capabilities

. Contact commercial vendors for optimized codec results
. 500MHz Cortex-A8 (mobile implementation) approximate performance:
    MPEG-4 ASP decode    720p @ 24fps
    H.264 decode         D1 @ 30fps
    H.264 encode         CIF (352x288) @ 30fps

. Scales with CPU performance (e.g. 2GHz Cortex-A9)
. Cortex-A9MP can use NEON per thread for N x scaling

31 NEON Third Party Ecosystem

. ARM NEON widely supported by software partners
. Full list on www.arm.com NEON ecosystem page

[Figure: partner logos annotated with their NEON-optimized offerings. Recoverable labels: video codecs (H.264 enc+dec, VC-1, MPEG-4, MPEG-2, H.263, WMV9, VP6/7); audio (DD+, low-bitrate and digital theater audio, multichannel audio processing, TEAMSpirit voice); video stabilization; GUI visual effects; drawElements 2D GUI library (Finland); CoreAVC ultra fast H.264; MobiClip; Espico Ltd consulting; Adobe Flash products.]

32 NEON Summary

. NEON to become standard on general purpose apps & media-centric devices
  . Able to support future internet multimedia standards
  . NEON option across the entire Cortex-A roadmap
  . NEON technology ideal for use by downloaded apps

. Full enabling infrastructure to support NEON
  . Compilers, profilers, debuggers, libraries all available now
  . Key differentiator: easy to program, popular with software engineers

. Strong ARM NEON ecosystem
  . Broad support from many software vendors
  . OS support from Linux, , Microsoft, Android…
  . Increasing opensource adoption

33 Thank You

Please visit www.arm.com for ARM related technical details

For any queries contact < [email protected] >
