Programming with Intrinsics

Tegra Xavier: Introduction to ARMv8 and Debugging
Kristoffer Robin Stokke, PhD
FLIR UAS
11.02.2020

Goals of Lecture
• To give you something concrete to start on
• Simple introduction to the ARMv8 NEON programming environment
  - Register environment, instruction syntax
  - «Families» of instructions
  - Important for debugging, writing code and general understanding
• Programming examples
  - Intrinsics
  - Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow
• GNU compiler intrinsics list:
  - https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/ARM-NEON-Intrinsics.html
• ARM Infocenter
  - infocenter.arm.com -> developer guides (..) -> software development -> Cortex-A Series Programmer's Guide for ARMv8
• This may also be useful
  - https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/
  - https://elinux.org/Jetson_AGX_Xavier
  - https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores
• Last but not least: GDB
  - You will need it

Modern Heterogeneous SoC Architectures

| SoC | Manufacturer | CPU | CPU Cache | RAM | GPU | DSP | Hardware Accelerators |
|---|---|---|---|---|---|---|---|
| Tegra X1 | Nvidia | 4 ARM Cortex A57 + 4 A53 | 2 MB L2, 48 kB I$, 32 kB D$ (L1) | 4 GB | 256-core Maxwell | - | ISP |
| Tegra Xavier | Nvidia | 8 Carmel ARMv8 | 2 MB L3 (shared), 8 MB L2 (shared 2 cores), 128 kB I$, 64 kB D$ | 16 GB | 512-core Volta | - | CNN blocks, ISP |
| Myriad X | Intel Movidius | 2 SPARC | Kilobytes | 4 GB | - | 16 VLIW cores | ISP, CNN blocks, ++ more |
| SDA845 | Qualcomm | 8 Kryo ARM-based | | 8 GB | Adreno GPU | Hexagon VLIW + SIMD | ISP, LTE |

Tegra Xavier CPU Cache Hierarchy
• ARMv8.2 cores (Carmel)
  - Half-precision floating point!
  - SIMD!
• 64 kB L1 cache
  - Per core
• 4 MB L2 cache
  - Per dual-core pair
• 2 MB L3 cache
  - Shared between all cores
[Figure: two cores, each with 128 kB I$ and 64 kB D$, above a 4 MB L2 cache, a 2 MB L3 cache and 16 GB RAM; levels closer to the cores are faster]

CPU Hierarchies and Performance
• Let's do an experiment!
[Figure: CPU core above L1 (64 kB) and L2 (4 MB); example runs of 20k loops @ 40 kB, 40k loops @ 80 kB and 100 loops @ 8 MB]
• Reading or writing 800 MB
• Vary the size to read back-to-back
  - E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines the location of the data
  - Below 50 kB, all reads are cached in L1
  - Below 4 MB, all reads are cached in L2
  - Above 6 MB... nothing gets cached

Code Example and Profiling
• Compile with -pg
• Run app: ./main
• Analyze
  - gprof ./main gmon.out
• NB: prefetch op
  - prfm <type><target><policy>, [reg] | label
  - Type
    - pld (for load)
    - pst (for store)
    - pli (for instruction)
  - Target
    - L1 or L2 (or L3)
  - Policy
    - keep (normal)
    - strm (streaming, use once)
  - Example: prfm pldl1keep, [x0] (address in x0)

| Data location | Read | Write |
|---|---|---|
| L1 | 20 ms | 30 ms |
| L2 | 70 ms | 70 ms |
| RAM | 220 ms | 180 ms |

ARMv8 Registers
• 31 x 64-bit general purpose registers: X0-X30
• 32 x 128-bit vector registers: V0-V31
• SP, WSP: stack pointer
• XZR, WZR: zero registers
• PC

The Vector Registers V0-V31: Packing
• Data in V0-V31 are packed, and you control how they are packed
• Lanes
  - Example: 16 bytes or 8 bytes
  - Example: 8 half-words or 4 half-words

Example: Vector Packing

| Data type | Size |
|---|---|
| Bytes | 1 B |
| Half-words | 2 B |
| Words | 4 B |
| Double words | 8 B |
| Half precision | 2 B |
| Single precision | 4 B |
| Double precision | 8 B |

Programming With Intrinsics
More in a bit!

Programming Example: Intrinsics

Inline Assembly
• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
• Not always straightforward to figure out what mnemonics to use
• Tips: disassemble intrinsics and look with objdump or gdb
• Operand constraints
  - «m»: memory address
  - «r»: general purpose register
  - «f»: floating point register (GCC on AArch64 uses «w» for floating-point/SIMD registers)
  - «i»: immediate
  - ++ more
• Specify clobbered («dirty») registers, and more

Programming Example: Inline Assembly

Table Lookup
• Not straightforward to use for any purpose
• Vector table lookup: tbl v0, {v1, v2, ..., vn}, vm (vtbl in ARMv7 NEON)
  - v0: destination vector
  - {v1, v2, ..., vn}: source data
  - vm: data selector
[Figure: source bytes 0-15 in v1 and 16-31 in v2; example selector vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19]

Matrix Transpose
• tbl v0.4s, {v1.4s}, v2.4s
  - v1.4s (source): a b c d
  - v2.4s (selector): 0 2 1 3
  - v0.4s (result): a c b d
• The .4s form above is conceptual: the AArch64 tbl instruction indexes bytes, so real code uses .16b operands and byte indices
• Think like this: for each output row, select with a stride through increasing columns

Code Profiling
• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out
[Figure: bar chart, time to finish 100M computations for matrix multiply (MM) and transpose operations; bars labelled transpose (lazy intrinsics), transpose (NEON assembly), MM (NEON) and MM (NEON assembly); one value marked «?»]

GDB Example

Tips
• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, and verify against output you know is working
• Many things to investigate
  - Single versus double precision?
  - Different, possibly more ways to implement e.g. transpose?
  - Re-using vector registers across different functional blocks?
  - ..but stick to what the assignment says

Detecting Adidas Features

Good Luck!
• You're going to need it :)
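
Going back to the Table Lookup and Matrix Transpose slides: a minimal C sketch of what a one-register byte table lookup does, and how a four-lane selector such as 0 2 1 3 could be expanded into the byte indices the instruction consumes. The names tbl16_model, lanes_to_bytes and transpose2x2 are hypothetical helpers made up for this illustration, not part of any library. On an AArch64 build the sketch uses the vqtbl1q_u8 intrinsic; elsewhere it falls back to the scalar model so it can still be compiled and stepped through in GDB. It assumes 4-byte floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of a lookup into a single 16-byte table:
 * out[i] = table[sel[i]], or 0 when sel[i] is out of range (>= 16). */
static void tbl16_model(uint8_t out[16], const uint8_t table[16],
                        const uint8_t sel[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (sel[i] < 16) ? table[sel[i]] : 0;
}

/* Hypothetical helper: expand a selector over four 32-bit lanes
 * (e.g. 0 2 1 3) into 16 byte indices. */
static void lanes_to_bytes(uint8_t sel[16], const uint8_t lanes[4])
{
    for (int lane = 0; lane < 4; lane++) {
        assert(lanes[lane] < 4);
        for (int b = 0; b < 4; b++)
            sel[4 * lane + b] = (uint8_t)(4 * lanes[lane] + b);
    }
}

/* Transpose a row-major 2x2 float matrix [a b; c d] into [a c; b d]. */
static void transpose2x2(float out[4], const float in[4])
{
    static const uint8_t lanes[4] = {0, 2, 1, 3};
    uint8_t sel[16], src[16], dst[16];

    lanes_to_bytes(sel, lanes);
    memcpy(src, in, 16);
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Hardware path: one byte-wise table lookup. */
    vst1q_u8(dst, vqtbl1q_u8(vld1q_u8(src), vld1q_u8(sel)));
#else
    /* Portable path: same result via the scalar model. */
    tbl16_model(dst, src, sel);
#endif
    memcpy(out, dst, 16);
}
```

Calling transpose2x2 on {1, 2, 3, 4} gives {1, 3, 2, 4}. Keeping the scalar model next to the intrinsic follows the tip about verifying against output you know is working: the two paths can be compared lane by lane in the debugger.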
