Tegra Xavier
Introduction to ARMv8 and Debugging
Kristoffer Robin Stokke, PhD
FLIR UAS

Goals of Lecture
• To give you something concrete to start on
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow
• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/ARM-NEON-Intrinsics.html
• ARM Infocenter
  – infocenter.arm.com -> developer guides (..) -> software development -> Cortex A series Programmer's Guide for ARMv8
• This may also be useful
  – https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/
  – https://elinux.org/Jetson_AGX_Xavier
  – https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores
• Last but not least: GDB
  – You will need it

Modern Heterogeneous SoC Architectures
| SoC | Manufacturer | CPU | CPU Cache | RAM | GPU | DSP | Hardware Accelerators |
|---|---|---|---|---|---|---|---|
| Tegra X1 | Nvidia | 4 ARM Cortex A57 + 4 A53 | 2 MB L2, 48 kB I$, 32 kB D$ (L1) | 4 GB | 256-core Maxwell | - | ISP |
| Tegra Xavier | Nvidia | 8 Carmel ARMv8 cores | 2 MB L3 (shared), 8 MB L2 (shared 2 cores), 128 kB I$, 64 kB D$ | 16 GB | 512-core Volta | - | CNN Blocks, ISP |
| Myriad X | Intel Movidius | 2 SPARC | Kilobytes | 4 GB | - | 16 VLIW cores | ISP, CNN Blocks, ++ more |
| SDA845 | Qualcomm | 8 Kryo ARM-based | | 8 GB | Adreno GPU | Hexagon VLIW + SIMD | ISP, LTE |
11.02.2020

Tegra Xavier CPU Cache Hierarchy
• ARMv8.2 cores (Carmel)
  – Half-precision floating point!
  – SIMD!
• 64 kB L1 cache
  – Per core
• 4 MB L2 cache
  – Per dual-core pair
• 2 MB L3 cache
  – Shared between all cores

[Figure: cache hierarchy. Each core has a 128 kB I$ and a 64 kB D$; below them sit the 4 MB L2 cache, the 2 MB L3 cache and 16 GB RAM. Accesses get faster the closer the data is to the core.]
CPU Hierarchies and Performance

• Let's do an experiment! :)
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines location of data
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached

[Figure: CPU core above the L1 (64 kB) and L2 (4 MB) caches, annotated with 20k loops @ 40 kB, 40k loops @ 80 kB and 100 loops @ 8 MB]
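The experiment above can be sketched in portable C as follows. This is a minimal sketch, not the code shown in the lecture: the function names `passes_needed`, `read_repeatedly` and `seconds_to_read` are hypothetical, and kB/MB are taken as decimal units so that 800 MB at 40 kB gives the 20k loops from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Number of full passes over a buffer needed to touch total_bytes. */
size_t passes_needed(size_t total_bytes, size_t buf_bytes)
{
    assert(buf_bytes > 0);
    return total_bytes / buf_bytes;
}

/* Read the same buffer back-to-back until total_bytes have been read.
 * The checksum is returned so the compiler cannot drop the loads. */
uint64_t read_repeatedly(const uint8_t *buf, size_t buf_bytes,
                         size_t total_bytes)
{
    uint64_t sum = 0;
    size_t passes = passes_needed(total_bytes, buf_bytes);

    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < buf_bytes; i++)
            sum += buf[i];
    return sum;
}

/* CPU time in seconds spent reading total_bytes through a buffer of
 * buf_bytes; small buffers should stay cached and finish faster. */
double seconds_to_read(const uint8_t *buf, size_t buf_bytes,
                       size_t total_bytes, uint64_t *checksum)
{
    clock_t start = clock();
    *checksum = read_repeatedly(buf, buf_bytes, total_bytes);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `seconds_to_read` with the same total but buffer sizes of, say, 40 kB, 80 kB and 8 MB should expose the steps between cache levels, provided the compiler does not vectorise the loop differently per run.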
Code Example and Profiling

• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out
• NB: Prefetch op
  – prfm

[Code listings: Read, Write]
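The prefetch note can be illustrated with the GCC/Clang builtin `__builtin_prefetch`, which compilers typically lower to a `prfm` instruction on AArch64. This is a sketch under assumptions: the function name `sum_with_prefetch`, the 64-byte line size and the 256-byte prefetch distance are illustrative guesses, not values from the lecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     64   /* assumed cache line size */
#define PREFETCH_AHEAD 256  /* assumed prefetch distance, needs tuning */

/* Sum a buffer while hinting the cache about data that will be read soon.
 * Kept as a separate function so gprof can attribute time to it. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Once per assumed cache line, request a line further ahead.
         * Arguments: address, 0 = read access, 3 = keep in all levels. */
        if ((i % LINE_BYTES) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(buf + i + PREFETCH_AHEAD, 0, 3);
        sum += buf[i];
    }
    return sum;
}
```

Building with `-pg`, running the program and then running `gprof ./main gmon.out` should list this function with its share of the run time, which makes it easy to compare against a version without the prefetch hint.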
ARMv8 Registers

• 31 x 64-bit general purpose registers: X0-X30
• 32 x 128-bit vector registers: V0-V31
• Stack pointer: SP, WSP
• Zero registers: XZR, WZR
• PC

The Vector Registers V0-V31: Packing
• Data in V0-V31 are packed, and you control how they are packed
  – Example: 16 bytes or 8 bytes
  – Example: 8 half-words or 4 half-words

[Figure: a vector register divided into lanes]

Example: Vector Packing
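To make the packing idea concrete, here is a small portable C sketch that models one 128-bit register as a union and computes lane counts. The type `vreg128` and the function `lanes` are hypothetical names used only for illustration; real code would use the `arm_neon.h` vector types instead.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for one 128-bit V register: the same 16 bytes can be
 * viewed as 16 bytes, 8 half-words, 4 words or 2 double words.
 * In assembly these views are written v0.16b, v0.8h, v0.4s and v0.2d. */
typedef union {
    uint8_t  b[16];
    uint16_t h[8];
    uint32_t s[4];
    uint64_t d[2];
} vreg128;

/* Number of lanes when reg_bytes of a register (16 for the full register,
 * 8 for the lower half) are packed with elements of elem_bytes each. */
unsigned lanes(unsigned reg_bytes, unsigned elem_bytes)
{
    assert(elem_bytes > 0 && reg_bytes % elem_bytes == 0);
    return reg_bytes / elem_bytes;
}
```

For example, `lanes(16, 1)` and `lanes(8, 1)` give the "16 bytes or 8 bytes" case, and `lanes(16, 2)` and `lanes(8, 2)` give "8 half-words or 4 half-words".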
| Data type | Size |
|---|---|
| Bytes | 1 B |
| Half-words | 2 B |
| Words | 4 B |
| Double words | 8 B |
| Half precision | 2 B |
| Single precision | 4 B |
| Double precision | 8 B |

Programming With Intrinsics
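As a first taste of intrinsics, the sketch below adds two float arrays four lanes at a time with `vld1q_f32`, `vaddq_f32` and `vst1q_f32` from `arm_neon.h`. It is an illustrative example rather than the one from the lecture: the function name `add_f32` is hypothetical, and a scalar fallback is included only so the file also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n floats; n must be a multiple of 4 so that
 * every iteration fills exactly one 128-bit register (4 x 32-bit lanes). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
#ifdef __ARM_NEON
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lanes and store */
#else
        /* Scalar fallback with the same result, for non-NEON builds. */
        for (size_t k = 0; k < 4; k++)
            dst[i + k] = a[i + k] + b[i + k];
#endif
    }
}
```

The intrinsic names encode the packing: `q` selects the full 128-bit register and `f32` the single-precision lanes, so `vaddq_f32` works on four floats at once.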
More in a bit!

Programming Example: Intrinsics

Inline Assembly
• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
  – Not always straightforward to figure out what mnemonics to use
• Tips: disassemble intrinsics and look with objdump or gdb
Operand constraints
• «m»: memory address
• «r»: general purpose register
• «f»: floating point register
• «i»: immediate
• ++ more
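A minimal sketch of GCC extended inline assembly using the «r» constraint from the list above, together with a clobber list for the registers and memory the code dirties. The function name `add4_f32_asm` is hypothetical, the choice of v0 and v1 as scratch registers is arbitrary, and the assembly path is only compiled on AArch64; elsewhere a plain C fallback is used.

```c
#include <assert.h>
#include <stddef.h>

/* dst[0..3] = a[0..3] + b[0..3] using one 128-bit NEON add. */
void add4_f32_asm(float *dst, const float *a, const float *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__aarch64__)
    __asm__ volatile(
        "ld1  {v0.4s}, [%1]          \n\t"  /* v0 = four floats at a */
        "ld1  {v1.4s}, [%2]          \n\t"  /* v1 = four floats at b */
        "fadd v0.4s, v0.4s, v1.4s    \n\t"  /* lane-wise add */
        "st1  {v0.4s}, [%0]          \n\t"  /* store the result to dst */
        : /* no register outputs: the result goes to memory */
        : "r"(dst), "r"(a), "r"(b)          /* pointers in general registers */
        : "v0", "v1", "memory");            /* dirty registers and memory */
#else
    /* Fallback for builds that are not AArch64. */
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

Listing v0 and v1 as clobbers tells the compiler not to keep live values in them across the block, and "memory" tells it that the store to `dst` happens behind its back.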
Specify dirty registers and more

Programming Example: Inline Assembly

Table Lookup
• Not straightforward to use for any purpose
• Vector table lookup: tbl v0, {v1, v2, ..., vn}, vm
  – v0: destination vector
  – {v1, v2, ..., vn}: source data
  – vm: data selector

[Figure: source bytes 0-15 in v1 and 16-31 in v2; selector vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19]

Matrix Transpose
tbl v0.4s, {v1.4s}, v2.4s
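A sketch of this transpose with the `vqtbl1q_u8` intrinsic, which maps to the TBL instruction. TBL selects individual bytes, so the word-level selector 0 2 1 3 is expanded here into sixteen byte indices. The names `transpose_sel` and `transpose2x2_u32` are hypothetical, and the scalar fallback exists only so the example builds on machines that are not AArch64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __aarch64__
#include <arm_neon.h>
#endif

/* Byte-level selector equivalent to the word order 0 2 1 3. */
static const uint8_t transpose_sel[16] = {
    0, 1, 2, 3,   8, 9, 10, 11,   4, 5, 6, 7,   12, 13, 14, 15
};

/* Transpose a row-major 2x2 matrix of 32-bit words:
 * in = {a, b, c, d} becomes out = {a, c, b, d}. */
void transpose2x2_u32(uint32_t out[4], const uint32_t in[4])
{
    assert(out != NULL && in != NULL);
#ifdef __aarch64__
    uint8x16_t table = vreinterpretq_u8_u32(vld1q_u32(in)); /* source data */
    uint8x16_t sel   = vld1q_u8(transpose_sel);             /* selector */
    vst1q_u32(out, vreinterpretq_u32_u8(vqtbl1q_u8(table, sel)));
#else
    /* Scalar model of TBL: each output byte is the source byte named by
     * the selector, or 0 when the index is out of range. */
    uint8_t src[16], dst[16];
    memcpy(src, in, sizeof src);
    for (int i = 0; i < 16; i++)
        dst[i] = transpose_sel[i] < 16 ? src[transpose_sel[i]] : 0;
    memcpy(out, dst, sizeof dst);
#endif
}
```

Because whole 4-byte words are moved together, the result does not depend on byte order within each word.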
[Figure: the 2x2 matrix (a b; c d) is stored as v1.4s = a b c d; the selector v2.4s = 0 2 1 3 (stride) produces v0.4s = a c b d, i.e. the transposed matrix (a c; b d)]

Think like this: for each output row, select the next (increasing) column, stepping through the source with a stride of one row.

Code Profiling
• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out

[Figure: bar chart "Time to Finish 100M computations for Matrix Multiply (MM) and Transpose Operations", y-axis from 0 to 100, with bars for transpose and MM variants (lazy, NEON intrinsics, NEON assembly); one bar is marked «?»]

GDB Example

Tips
• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, and verify against output you know is correct
• Many things to investigate
  – Single versus double precision?
  – Different, possibly more ways to implement e.g. transpose?
  – Re-using vector registers across different functional blocks?
• ..but stick to what the assignment says

Detecting Adidas Features
Good Luck!
• You're going to need it :)