Tegra Xavier
Introduction to ARMv8 and Debugging

Kristoffer Robin Stokke, PhD
FLIR UAS

Goals of Lecture

• To give you something concrete to start on
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow

• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/ARM-NEON-Intrinsics.html
• ARM Infocenter
  – infocenter.arm.com -> developer guides (..) -> software development -> Cortex-A Series Programmer's Guide for ARMv8-A
• This may also be useful
  – https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/
  – https://elinux.org/Jetson_AGX_Xavier
  – https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores
• Last but not least: GDB
  – You will need it

Modern Heterogeneous SoC Architectures

Device | Manufacturer | CPU | CPU Cache | RAM | GPU | DSP | Hardware Accelerators
Tegra X1 | Nvidia | 4 ARM Cortex A57 + 4 A53 | 2 MB L2; 48 kB I$, 32 kB D$ (L1) | 4 GB | 256-core Maxwell | - | ISP
Tegra Xavier | Nvidia | 8 Carmel ARMv8 cores | 2 MB L3 (shared); 8 MB L2 (shared per 2 cores); 128 kB I$, 64 kB D$ | 16 GB | 512-core Volta | - | CNN blocks, ISP
Myriad X | Intel Movidius | 2 SPARC | Kilobytes | 4 GB | - | 16 VLIW cores | ISP, CNN blocks, ++ more
SDA845 | | 8 ARM-based | | 8 GB | Adreno GPU | Hexagon VLIW + SIMD | ISP, LTE

11.02.2020

Tegra Xavier CPU Cache Hierarchy

• ARMv8.2 cores (Carmel)
  – Half-precision floating point!
  – SIMD!
• 64 kB L1 data cache (plus 128 kB L1 instruction cache)
  – Per core
• 4 MB L2 cache
  – Per dual (pair of cores)
• 2 MB L3 cache
  – Shared between all cores

[Figure: two cores, each with 128 kB I$ and 64 kB D$, above a 4 MB L2 cache, a 2 MB L3 cache and 16 GB RAM. Arrows labelled «Faster this way».]

CPU Hierarchies and Performance

• Let's do an experiment! :)
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines the location of the data
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached

[Figure: CPU core with L1 (64 kB) and L2 (4 MB) below it, annotated with example runs: 20k loops @ 40 kB, 10k loops @ 80 kB, 100 loops @ 8 MB.]
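The experiment above can be sketched in portable C. This is a minimal sketch, not the code used for the lecture: the function names, the checksum trick and the use of clock() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a buffer of buf_bytes back-to-back until total_bytes have been read.
 * Returns the number of passes (loops) over the buffer. The checksum depends
 * on every byte, so the compiler cannot optimise the reads away. */
static size_t read_repeatedly(const uint8_t *buf, size_t buf_bytes,
                              size_t total_bytes, uint64_t *checksum)
{
    assert(buf != NULL && buf_bytes > 0 && checksum != NULL);
    size_t loops = total_bytes / buf_bytes;
    uint64_t sum = 0;
    for (size_t l = 0; l < loops; l++)
        for (size_t i = 0; i < buf_bytes; i++)
            sum += buf[i];
    *checksum = sum;
    return loops;
}

/* Time one buffer size. Returns elapsed CPU time in ms, or -1.0 if the
 * allocation fails. */
static double time_read_ms(size_t buf_bytes, size_t total_bytes)
{
    uint8_t *buf = malloc(buf_bytes);
    if (buf == NULL)
        return -1.0;
    memset(buf, 1, buf_bytes);          /* touch the pages before timing */
    uint64_t checksum = 0;
    clock_t t0 = clock();
    size_t loops = read_repeatedly(buf, buf_bytes, total_bytes, &checksum);
    clock_t t1 = clock();
    free(buf);
    printf("%9zu B x %6zu loops, checksum %llu\n",
           buf_bytes, loops, (unsigned long long)checksum);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}

#ifdef DEMO_MAIN  /* build with: gcc -O2 -DDEMO_MAIN cache.c */
int main(void)
{
    const size_t total = 800u * 1000u * 1000u;          /* 800 MB */
    const size_t sizes[] = { 40000, 80000, 8000000 };   /* 40 kB, 80 kB, 8 MB */
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("  -> %.1f ms\n", time_read_ms(sizes[i], total));
    return 0;
}
#endif
```

The three sizes in the demo follow the buffer sizes named on the slide. Absolute timings depend on the compiler flags and on the machine, so treat them as relative numbers only.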

Code Example and Profiling

• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out

Read and write times by where the data resides:

Level | Read | Write
L1 | 20 ms | 30 ms
L2 | 70 ms | 70 ms
RAM | 220 ms | 180 ms

• NB: Prefetch op
  – prfm <type><target><policy>, [reg] | label
  – Type
    • pld (for load)
    • pst (for store)
    • pli (for instruction)
  – Target
    • L1 or L2 (or L3)
  – Policy
    • keep (normal)
    • stream (use once), written strm in the mnemonic
  – Example: prfm pldl1keep, [x0] (address in x0)
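A minimal sketch of how the prefetch op above could be issued from C. On AArch64 it emits prfm pldl1keep through inline assembly; elsewhere it falls back to the GCC/Clang builtin __builtin_prefetch so the code still compiles. The helper names and the prefetch distance of 64 elements are hypothetical and would need tuning by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hint that the cache line at p will be loaded soon and should stay in L1. */
static inline void prefetch_l1_keep(const void *p)
{
#if defined(__aarch64__)
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(p));
#else
    __builtin_prefetch(p, 0, 3);   /* GCC/Clang: read access, keep in cache */
#endif
}

#define PREFETCH_AHEAD 64   /* elements ahead; hypothetical value */

/* Sum an array while prefetching a fixed distance ahead of the read index. */
static uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            prefetch_l1_keep(&data[i + PREFETCH_AHEAD]);
        sum += data[i];
    }
    return sum;
}
```

A prefetch is only a hint: the result is the same with or without it, so the only way to judge it is to time the loop both ways.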

ARMv8 Registers

31 x 64-bit general purpose registers: X0 to X30

32 x 128-bit vector registers: V0 to V31

Special registers (the W names are the 32-bit views):
• SP, WSP: stack pointer
• XZR, WZR: zero registers
• PC: program counter

The Vector Registers V0-V31: Packing

• Data in V0-V31 are packed, and you control how they are packed

Example: 16 bytes (.16b) or 8 bytes (.8b)
Example: 8 half-words (.8h) or 4 half-words (.4h)

[Figure: a vector register divided into lanes.]

Example: Vector Packing

Data type | Size
Bytes | 1 B
Half-words | 2 B
Words | 4 B
Double words | 8 B
Half precision | 2 B
Single precision | 4 B
Double precision | 8 B

Programming With Intrinsics
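A minimal sketch of why packing matters, written with NEON intrinsics from arm_neon.h. The same 16 bytes are added once as 16 byte lanes and once as 8 half-word lanes. The function names are hypothetical. The scalar fallback is only there so the code also builds on machines without NEON, and the half-word result assumes a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Add as 16 byte lanes: every byte wraps on its own, no carry between lanes. */
static void add_as_bytes(const uint8_t a[16], const uint8_t b[16],
                         uint8_t out[16])
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__aarch64__)
    vst1q_u8(out, vaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#else
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}

/* Add the same 16 bytes as 8 half-word lanes: carries cross the byte
 * boundary inside each 2-byte lane. */
static void add_as_halfwords(const uint8_t a[16], const uint8_t b[16],
                             uint8_t out[16])
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__aarch64__)
    uint16x8_t va = vreinterpretq_u16_u8(vld1q_u8(a));
    uint16x8_t vb = vreinterpretq_u16_u8(vld1q_u8(b));
    vst1q_u8(out, vreinterpretq_u8_u16(vaddq_u16(va, vb)));
#else
    for (int i = 0; i < 8; i++) {
        uint16_t x, y, s;
        memcpy(&x, a + 2 * i, sizeof x);
        memcpy(&y, b + 2 * i, sizeof y);
        s = (uint16_t)(x + y);
        memcpy(out + 2 * i, &s, sizeof s);
    }
#endif
}
```

With a[0] = 0xFF and b[0] = 0x01 (all other bytes zero), the byte version leaves bytes 0 and 1 both at zero, while the half-word version carries into byte 1. The bits in the register are identical; only the lane interpretation differs.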

More in a bit!

Programming Example: Intrinsics

Inline Assembly

• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
• Not always straightforward to figure out what mnemonics to use
• Tips: disassemble intrinsics and look with objdump or gdb

Operand constraints
  – «m» : memory address
  – «r» : general purpose register
  – «f» : floating point register (GCC for AArch64 uses «w» for floating point and SIMD registers)
  – «i» : immediate
  – ++ more

Specify dirty registers (clobbers) and more

Programming Example: Inline Assembly

Table Lookup

• Not straightforward to use for any purpose
• Vector table lookup: tbl v0, {v1, v2, ..., vn}, vm (vtbl on ARMv7)
  – v0: destination vector
  – {v1, v2, ..., vn}: source data
  – vm: data selector

[Figure: source data v1 (bytes 0 to 15) and v2 (bytes 16 to 31); data selector vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19; destination v0.]
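A scalar model of the table lookup can make the selector semantics concrete. This is a sketch in plain C, not NEON code: the function names are hypothetical, and it assumes the AArch64 TBL behaviour where the selector holds byte indices and an index outside the table yields 0. The second helper expands a word-level selector such as 0 2 1 3 into the byte indices that pick the same 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a byte table lookup into a 16-byte result:
 * out[i] = table[sel[i]], or 0 when sel[i] is outside the table.
 * out must not overlap table. */
static void tbl_model(uint8_t out[16], const uint8_t *table, size_t table_len,
                      const uint8_t sel[16])
{
    assert(out != NULL && table != NULL && sel != NULL);
    for (int i = 0; i < 16; i++)
        out[i] = (sel[i] < table_len) ? table[sel[i]] : 0;
}

/* Expand four word indices into the sixteen byte indices that select the
 * same 32-bit words, e.g. word 2 becomes bytes 8, 9, 10, 11. */
static void word_selector_to_bytes(const uint8_t words[4], uint8_t bytes[16])
{
    assert(words != NULL && bytes != NULL);
    for (int w = 0; w < 4; w++)
        for (int k = 0; k < 4; k++)
            bytes[4 * w + k] = (uint8_t)(4 * words[w] + k);
}
```

With a 32-byte table (two source registers) and the selector from the figure, every output byte is simply the table byte at the selected index. On AArch64 the single-register form corresponds to the intrinsic vqtbl1q_u8.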

Matrix Transpose

tbl v0.4s, {v1.4s}, v2.4s

[Figure: 2 x 2 matrix (a b / c d) stored row by row in v1.4s as a b c d; selector v2.4s = 0 2 1 3 (stride); result v0.4s = a c b d, i.e. the transposed matrix (a c / b d).]

Think like this: for each output row, select elements one stride apart, starting at the next column.

Note: TBL selects bytes, so the real instruction is written tbl v0.16b, {v1.16b}, v2.16b and each word index expands to four byte indices. The .4s form above shows the idea at word granularity.

Code Profiling

• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out

[Chart: Time to finish 100M computations for matrix multiply (MM) and transpose operations. Bars: Transpose, lazy; Transpose, NEON assembly; MM, NEON intrinsics; MM, NEON assembly. Vertical axis from 0 to 100.]

GDB Example

Tips

• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, and verify against output you know is working
• Many things to investigate
  – Single versus double precision?
  – Different, possibly more ways to implement e.g. transpose?
  – Re-using vector registers across different functional blocks?
• ..but stick to what the assignment says

Detecting Adidas Features

Good Luck!

• You're going to need it :)