Tegra Xavier: Introduction to ARMv8 and Debugging

Kristoffer Robin Stokke, PhD
Dolphin Interconnect Solutions

Goals of Lecture

• To give you
  – Something concrete to start on
  – Some examples from «real life» where you may encounter these topics
• Every year I try to include something new...
  – Which means more freebies for you!
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow

• ARM’s overview and information on NEON instructions
  – https://developer.arm.com/documentation/dui0204/j/neon-and-vfp-programming
• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/ARM-NEON-Intrinsics.html
• Some non-formal calling conventions and snacks
  – https://medium.com/mathieugarcia/introduction-to-arm64-neon-assembly-930c4a48bb2a
• This may also be useful
  – https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-programming-quick-reference

Modern Heterogeneous SoC Architectures

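Device       | Manufacturer    | CPU                         | CPU Cache                                                      | RAM        | GPU             | DSP                  | Hardware Accelerators
-------------|-----------------|-----------------------------|----------------------------------------------------------------|------------|-----------------|----------------------|--------------------------
X1           | Nvidia          | 4 ARM Cortex A57 + 4 A53    | 2 MB L2, 48 kB I$, 32 kB D$ (L1)                               | 4 GB       | 256-core Maxwell | -                    | ISP
Tegra Xavier | Nvidia          | 8 Carmel ARMv8 cores        | 2 MB L3 (shared), 8 MB L2 (shared 2 cores), 128 kB I$, 64 kB D$ | 16 GB      | 512-core Volta  | -                    | CNN blocks, ISP
Myriad X     | Intel Movidius  | 2 SPARC                     | Kilobytes                                                      | 4 GB       | -               | 16 VLIW cores        | ISP, CNN blocks, ++ more
SDA845       |                 | 8 ARM-based                 |                                                                | 8 GB       | Adreno GPU      | Hexagon VLIW + SIMD  | ISP, LTE
IMX6Q        | Freescale (NXP) | 4 ARM Cortex A-9            | 1 MB L2                                                        | Impl. dep. | «3D graphics»   |                      |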

02.02.2021

Tegra Xavier CPU Cache Hierarchy

• ARM 8.2 cores (Carmel)
  – Half-precision floating point!
  – SIMD!
• 64 kB L1 cache
  – Per core
• 4 MB L2 cache
  – Per dual
• 2 MB L3 cache
  – Shared between all cores

[Figure: two cores, each with 128 kB I$ and 64 kB D$, above a 4 MB L2 cache, a 2 MB L3 cache and 16 GB RAM. Levels closer to the core are faster.]

• Least recently used (LRU) eviction strategy

CPU Hierarchies and Performance

• Let’s do an experiment! ☺
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines location of data
• Under ideal conditions (no contention..)
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached

[Figure: CPU core, L1 (64 kB) and L2 (4 MB), with the test cases 20k loops @ 40 kB, 40k loops @ 80 kB and 100 loops @ 8 MB.]
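The experiment above can be sketched in portable C. This is a minimal sketch, not the lecture's actual benchmark: the function names `passes_needed` and `read_back_to_back` are hypothetical, and decimal units (1 kB = 1000 bytes) are assumed so that 800 MB at 40 kB gives the 20k loops shown on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of back-to-back passes needed to touch total_bytes using a
 * buffer of buf_bytes (rounded up). Hypothetical helper name. */
size_t passes_needed(size_t total_bytes, size_t buf_bytes)
{
    return (total_bytes + buf_bytes - 1) / buf_bytes;
}

/* Read the same buffer repeatedly until at least total_bytes have been read.
 * The pointer is volatile and the running sum is returned, so the compiler
 * cannot discard the reads. */
uint64_t read_back_to_back(const volatile uint8_t *buf, size_t buf_bytes,
                           size_t total_bytes)
{
    uint64_t sum = 0;
    size_t passes = passes_needed(total_bytes, buf_bytes);

    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < buf_bytes; i++)
            sum += buf[i];
    return sum;
}
```

Timing `read_back_to_back` for a range of buffer sizes (for example with `clock_gettime`, or with gprof) should show the read time stepping up as the buffer outgrows each cache level.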

Code Example and Profiling

• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out

        Read     Write
L1      20 ms    30 ms
L2      70 ms    70 ms
RAM     220 ms   180 ms

• NB: Prefetch op
  – prfm <type><target><policy>, reg|label
  – Type
    • pld (for load)
    • pst (for store)
    • pli (for instruction)
  – Target
    • L1 or L2 (or L3)
  – Policy
    • keep (normal)
    • stream (use once)
  – Example: prfm pldl1keep, [x0] (address in x0)
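From C, a prefetch hint can be issued without inline assembly through the GCC/Clang builtin `__builtin_prefetch`, which on AArch64 is typically lowered to a `prfm` instruction. The sketch below is illustrative only: the function name, the 64-byte cache line and the 256-byte look-ahead are assumptions to be tuned by measurement, not values from the lecture.

```c
#include <stddef.h>
#include <stdint.h>

/* How far ahead of the current read to prefetch, in bytes (assumed value). */
#define PREFETCH_AHEAD 256

/* Sum a buffer while hinting upcoming cache lines to the memory system. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Once per assumed 64-byte line, prefetch data we will need soon.
         * Arguments: address, 0 = prefetch for load, 3 = keep in cache. */
        if ((i % 64) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(buf + i + PREFETCH_AHEAD, 0, 3);
        sum += buf[i];
    }
    return sum;
}
```

A prefetch is only a hint: the result is the same with or without it, so the benefit has to be confirmed by profiling.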

ARMv8 Registers

• 31 x 64-bit general purpose registers: X0 to X30
• 32 x 128-bit vector registers: V0 to V31
• SP, WSP: stack pointer
• WZR, XZR: zero registers
• PC: program counter

The Vector Registers V0-V31: Packing

• Data in V0-V31 are packed, and you control how they are packed
  – Example: 16 bytes or 8 bytes
  – Example: 8 half-words or 4 half-words

[Figure: a vector register divided into lanes.]

Intrinsics, Inline Assembly or Assembly?

In increasing level of difficulty:
• Intrinsics: goes inside C functions
• Inline assembly: goes inside C functions
• Assembly: goes in a .s file

• You will need to understand assembly to
  – Debug your program
  – Understand how to use intrinsics correctly

Data Types

C World        Assembly World   Meaning
uint8x8_t      v0.8b            8-bit unsigned integer, 8 elements
uint8x16_t     v0.16b           8-bit unsigned integer, 16 elements
float32x2_t    v0.2s            32-bit floating point, 2 elements
float32x4_t    v0.4s            32-bit floating point, 4 elements

Vector registers are specified by the following: 8B/16B/4H/8H/2S/4S/2D
• B: bytes
• H: half-word (16-bit)
• S: word (32-bit)
• D: doubleword (64-bit)

An S-vector can therefore hold signed or unsigned integers, or floating point values. The meaning is encoded in the assembly instruction or intrinsic, when needed.

Example: Loading or Storing Something

Loading 64 bytes with inline assembly:

    void *inp = malloc(64);
    __asm__ volatile(
        "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[inp]]"
        :
        : [inp] "r" (inp)
        : "v0", "v1", "v2", "v3", "memory");

Loading 64 bytes with intrinsics:

    uint8_t *inp = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vectors[i] = vld1q_u8(inp + i * 16);

Storing 64 bytes with inline assembly:

    void *out = malloc(64);
    // Assume we have done something
    // intelligent with v0, v1, v2 and v3
    __asm__ volatile(
        "st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[out]]"
        :
        : [out] "r" (out)
        : "memory");

Storing 64 bytes with intrinsics:

    uint8_t *out = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vst1q_u8(out + i * 16, vectors[i]);

How Do Intrinsics Map to Assembly?

• If you want to write some piece of inline assembly
  – But the compiler spits out errors and you don’t know the syntax
• Try to write it with intrinsics
  – Then objdump -D <binary> | less
  – Type /<pattern> + hit return to search
• Alternatively
  – gdb <binary>
  – break <file>.c:<line>
  – Type run -> enter
  – layout asm -> inspect

Example: Loading or Storing Something

• There are vector types and intrinsics for clustered vectors
• In this case, four 128-bit registers
• Be careful!
  – vld4q_u8 de-interleaves while loading: byte i of memory ends up in vectors.val[i % 4], lane i / 4
  – So the contents are not consecutive in the vectors, which can look unintuitive

    uint8_t *inp = malloc(64);
    for (int i = 0; i < 64; i++)
        inp[i] = i;                  // Contents are consecutive in memory..
    uint8x16x4_t vectors;
    vectors = vld4q_u8(inp);         // Contents are not consecutive in vectors!!
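To see what ends up where after a four-register structure load, it can help to model the layout in plain C. The sketch below is a scalar stand-in that runs on any host; the type `u8x16x4_model` and the function `load4_deinterleave` are hypothetical names, not NEON API. It assumes the de-interleaving behaviour of the LD4 instruction, where consecutive bytes are distributed round-robin over the four registers.

```c
#include <stdint.h>

/* Scalar stand-in for uint8x16x4_t: four registers of sixteen byte lanes. */
typedef struct {
    uint8_t val[4][16];
} u8x16x4_model;

/* Model of a four-register de-interleaving load:
 * memory byte i ends up in register i % 4, lane i / 4. */
u8x16x4_model load4_deinterleave(const uint8_t *mem)
{
    u8x16x4_model v;

    for (int i = 0; i < 64; i++)
        v.val[i % 4][i / 4] = mem[i];
    return v;
}
```

With memory holding 0, 1, 2, ..., 63, register 0 of the model holds 0, 4, 8, ... and register 1 holds 1, 5, 9, ..., which is the pattern that looks rearranged when inspected in a debugger.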

Example: Vector Packing

Data type          Size
Bytes              1 B
Half-words         2 B
Words              4 B
Double words       8 B
Half precision     2 B
Single precision   4 B
Double precision   8 B

Other Examples

Initialise all lanes:

    int8x16_t v0;
    int8_t init = 0;
    v0 = vdupq_n_s8(init);

Addition, subtraction and multiplication:

    int16x4_t v0, v1;
    int16x4_t result;
    result = vadd_s16(v0, v1);
    result = vsub_s16(v0, v1);
    result = vmul_s16(v0, v1);

Get and set a specific lane:

    float32x4_t v0;
    float32_t val;
    val = vgetq_lane_f32(v0, 0);
    val += 42.0F;
    v0 = vsetq_lane_f32(val, v0, 0);

Convert unsigned int to float:

    uint32x4_t v0;
    float32x4_t result;
    result = vcvtq_f32_u32(v0);

Programming With Intrinsics

• More in a bit!

Programming Example: Intrinsics

Inline Assembly

• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
• Not always straightforward to figure out what mnemonics to use
• Tips: disassemble intrinsics and look with objdump or gdb

Operand constraints
  – "m" : memory address
  – "r" : general purpose register
  – "f" : floating point register
  – "w" : floating point / SIMD register (AArch64 GCC)
  – "i" : immediate
  – ++ more
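As a small illustration of how operand constraints are written, the sketch below wraps a single `add` instruction in GCC extended inline assembly. The function name `add_u32` is hypothetical, and the plain C fallback is only there so that the example also builds on non-ARM hosts.

```c
#include <stdint.h>

/* Add two 32-bit values with one AArch64 instruction. */
uint32_t add_u32(uint32_t a, uint32_t b)
{
    uint32_t out;
#if defined(__aarch64__)
    /* %w[...] prints the 32-bit (w) name of the register chosen by the
     * compiler. Only the listed operands are touched, so no clobber list
     * is needed here; code that writes other registers or memory must
     * name them after a third colon. */
    __asm__("add %w[out], %w[a], %w[b]"
            : [out] "=r"(out)          /* output: general purpose register */
            : [a] "r"(a), [b] "r"(b)   /* inputs: general purpose registers */
    );
#else
    out = a + b;  /* fallback for non-ARM hosts */
#endif
    return out;
}
```

Named operands such as `[out]` keep the assembly readable when operands are added or reordered later.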

• Specify dirty (clobbered) registers and more

Programming Example: Inline Assembly

Lookup Tables (LUT)

• Powerful approximation
• Use LUTs to realise complex mathematics!
  – For example prime numbers..
• Some «index» points into a LUT offset that contains precomputed values
• Output stored in a vector

[Figure: a four-element index «vector» selects precomputed values from the LUT into the output «vector».]

Table Lookup in ARM NEON

• Vector table lookup: vtbl v0, {v1, v2, ..., vn}, vm
  – v0: destination vector
  – {v1, v2}: LUT (max 2x128-bit vectors!!)
  – vm: index vector
• Two flavours:
  – vtbl: any element out of range for the LUT returns 0
  – vtbx: any element out of range for the LUT leaves the destination unchanged

[Figure: v1 holds LUT entries 0 to 15 and v2 holds entries 16 to 31; vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19.]

• Let’s try to use LUTs to transpose matrices.
• Don’t go thinking 4x4 or 8x8.
  – Start easy, then let’s see if we can observe any patterns.
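The difference between the two lookup flavours is easy to check with a scalar model. The sketch below is plain C rather than NEON code, and the names `lut_tbl` and `lut_tbx` are hypothetical; it models a 16-lane byte lookup against a table of `table_len` bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* tbl flavour: an out-of-range index produces 0 in that lane. */
void lut_tbl(uint8_t dst[16], const uint8_t *table, size_t table_len,
             const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}

/* tbx flavour: an out-of-range index leaves that destination lane unchanged. */
void lut_tbx(uint8_t dst[16], const uint8_t *table, size_t table_len,
             const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
}
```

The tbx behaviour is what makes it possible to chain several lookups into one destination when the table is larger than a single instruction can address.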

Matrix Transpose (Super Simple)

1x1 matrix, stride = 1
• LUT (matrix): a
• Index vector: 0
• Destination vector: a

2x2 matrix, stride = 2
• Matrix rows: a b / c d
• Transposed rows: a c / b d
• LUT (matrix): a b c d
• Index vector: 0 2 1 3
• Destination vector: a c b d

3x3 matrix, stride = 3
• Matrix rows: a b c / d e f / g h i
• Transposed rows: a d g / b e h / c f i
• LUT (matrix): a b c d e f g h i
• Index vector: 0 3 6 1 4 7 2 5 8
• Destination vector: a d g b e h c f i

How to think?

• For the «first output row» = 0
  – Output element n is taken from n*stride in the «input matrix»
• For the «next row» = 1
  – Output element n is taken from n*stride + 1
• So generally, for output element n in output row i
  – The element is taken from n*stride + i
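The rule n*stride + i can be turned directly into code that generates the index vector for any square matrix. The sketch below is scalar C with hypothetical function names; on NEON the generated indices would feed a table lookup instead of the inner loop of `transpose_with_lut`.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill idx[] for a stride x stride matrix:
 * output row i, element n reads input position n*stride + i. */
void transpose_indices(uint8_t *idx, size_t stride)
{
    for (size_t i = 0; i < stride; i++)
        for (size_t n = 0; n < stride; n++)
            idx[i * stride + n] = (uint8_t)(n * stride + i);
}

/* Apply an index vector as a table lookup: dst[k] = src[idx[k]]. */
void transpose_with_lut(char *dst, const char *src, const uint8_t *idx,
                        size_t count)
{
    for (size_t k = 0; k < count; k++)
        dst[k] = src[idx[k]];
}
```

For stride = 3 this produces the index vector 0 3 6 1 4 7 2 5 8, and applying it to the row-major matrix a..i gives a d g b e h c f i, matching the 3x3 case.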

Matrix Transpose Example

The Gamma Transform

• The human eye is sensitive to variation in luminance
• The gamma transform..
  – «stretches» small variations in luminance
  – Can make it easier to see detail
• Gamma is also used to adjust for non-linearity in old monitors
  – Image data is actually transmitted to the display with gamma applied to it as a form of «backward compatibility»
  – Which is extremely confusing
  – Google «gamma correction explained» and watch the madness

The Gamma Transform

$y = x^{\gamma}, \quad x \in [0, 1]$

where x is the input luminance value and y is the output luminance value.

[Figure: output value against input value on [0, 1], with curves for γ = 1/2.2, γ = 1.0 and γ = 2.2. Where a curve is steep, a small change in input luminance results in a larger change in output luminance. Source: https://wolfcrow.com/what-is-display-gamma-and-gamma-correction/]

• Higher gamma compresses the low-lights, but extends the high-lights
• Lower gamma compresses the high-lights, but extends the low-lights
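A few worked values make the compression and stretching concrete. The numbers below assume the usual power-law form of the transform, with output y, input x in [0, 1] and exponent γ; the sample inputs 0.1, 0.2 and 0.5 are chosen for illustration only.

```latex
% Assumed form of the transform: y = x^{\gamma}, with x \in [0, 1]
\begin{align*}
  \gamma = 2.2:             \quad & 0.1^{2.2} \approx 0.006, & 0.2^{2.2} \approx 0.029, && 0.5^{2.2} \approx 0.218 \\
  \gamma = 1.0:             \quad & 0.1^{1.0} = 0.1,         & 0.2^{1.0} = 0.2,         && 0.5^{1.0} = 0.5 \\
  \gamma = \tfrac{1}{2.2}:  \quad & 0.1^{1/2.2} \approx 0.351, & 0.2^{1/2.2} \approx 0.481, && 0.5^{1/2.2} \approx 0.730
\end{align*}
```

An input step from 0.1 to 0.2 in the low-lights becomes an output step of about 0.13 with γ = 1/2.2, but only about 0.02 with γ = 2.2.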

Non-Temporal Loads and Stores

• Reading remote RAM over PCIe
  – Very slow, because the core hangs while the datapath is fetching the response
• However, ARM provides special memory load & store instructions
  – Non-temporal loads and stores
  – Relaxed memory ordering
• A hardware implementation may be able to improve performance; for example by not waiting for reads to arrive in destination registers

[Figure: two Tegra Xavier boards, each with CPU and RAM, connected over PCIe.]

Non-Temporal Loads and Stores

    ldnp q0, q1, [address]
    stnp q0, q1, [address]

                                  Temporal instructions   Non-temporal instructions
NX reads remote RAM (x2 link)     < 10 MBps               200-300 MBps

[Figure: two Tegra Xavier boards, each with CPU and RAM, connected over PCIe.]
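The two instructions can be tried from C with a small inline assembly wrapper. This is a minimal sketch under stated assumptions: the function name `copy32_nontemporal` is hypothetical, the copy size is fixed at the 32 bytes that one q-register pair holds, and a `memcpy` fallback is included only so that the example builds on non-ARM hosts. It says nothing about the throughput numbers above, which depend on the hardware and the PCIe link.

```c
#include <stdint.h>
#include <string.h>

/* Copy exactly 32 bytes from src to dst using a non-temporal pair load
 * and a non-temporal pair store. */
void copy32_nontemporal(uint8_t *dst, const uint8_t *src)
{
#if defined(__aarch64__)
    __asm__ volatile(
        "ldnp q0, q1, [%[src]]\n\t"   /* load 2 x 16 bytes, non-temporal hint */
        "stnp q0, q1, [%[dst]]\n\t"   /* store 2 x 16 bytes, non-temporal hint */
        :
        : [src] "r"(src), [dst] "r"(dst)
        : "v0", "v1", "memory");      /* dirty vector registers and memory */
#else
    memcpy(dst, src, 32);             /* fallback for non-ARM hosts */
#endif
}
```

A larger transfer would call this in a loop over 32-byte blocks, or unroll it over more register pairs.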

Example: Gamma Correction Using ARM NEON

When Things Go Wrong

GDB Example

Tips

• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, verify with output you know is working

Detecting Adidas Features

Good Luck!

• You’ll be fine.

Matrix Transpose

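tbl v0.4s, {v1.4s}, v2.4s

• Matrix rows: a b / c d
• Transposed rows: a c / b d
• v1.4s (LUT, the matrix): a b c d
• v2.4s (index vector): 0 2 1 3
• v0.4s (destination): a c b d

Think like this: for each output row, stride select increasing column.

Note: the .4s lanes here are conceptual. The tbl instruction itself looks up bytes (.8b/.16b arrangements), so each entry in a real index vector selects one byte.

Code Profiling

• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out

[Figure: bar chart of time to finish 100M computations for matrix multiply (MM) and transpose operations. Bars: Transpose, lazy; Transpose, NEON assembly; MM, NEON intrinsics; MM, NEON assembly.]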