Tegra Xavier: Introduction to ARMv8 and Debugging
Kristoffer Robin Stokke, PhD
Dolphin Interconnect Solutions

Goals of Lecture
• To give you
  – Something concrete to start on
  – Some examples from «real life» where you may encounter these topics
• Every year I try to include something new...
  – Which means more freebies for you!
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow
• ARM's overview and information on NEON instructions
  – https://developer.arm.com/documentation/dui0204/j/neon-and-vfp-programming
• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/ARM-NEON-Intrinsics.html
• Some non-formal calling conventions and snacks
  – https://medium.com/mathieugarcia/introduction-to-arm64-neon-assembly-930c4a48bb2a
• This may also be useful
  – https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-programming-quick-reference

Modern Heterogeneous SoC Architectures
SoC | Manufacturer | CPU | CPU Cache | RAM | GPU | DSP | Hardware Accelerators
Tegra X1 | Nvidia | 4 ARM Cortex A57 + 4 A53 | 2 MB L2; 48 kB I$, 32 kB D$ (L1) | 4 GB | 256-core Maxwell | - | ISP
Tegra Xavier | Nvidia | 8 Carmel ARMv8 cores | 2 MB L3 (shared); 8 MB L2 (shared by 2 cores); 128 kB I$, 64 kB D$ | 16 GB | 512-core Volta | - | CNN blocks, ISP
Myriad X | Intel Movidius | 2 SPARC | Kilobytes | 4 GB | - | 16 VLIW cores | ISP, CNN blocks, ++ more
SDA845 | Qualcomm | 8 Kryo (ARM-based) | | 8 GB | Adreno GPU | Hexagon VLIW + SIMD | ISP, LTE
IMX6Q | Freescale (NXP) | 4 ARM Cortex A-9 | 1 MB L2 | Impl. dep. | «3D graphics» | |
02.02.2021

Tegra Xavier CPU Cache Hierarchy
• ARMv8.2 cores (Carmel)
  – Half-precision floating point!
  – SIMD!
• 64 kB L1 cache
  – Per core
• 4 MB L2 cache
  – Per dual-core pair
• 2 MB L3 cache
  – Shared between all cores
[Figure: two cores, each with 128 kB I$ and 64 kB D$, sharing a 4 MB L2 cache, above the 2 MB L3 cache and 16 GB RAM. Levels closer to the core are faster.]
• Least recently used (LRU) eviction strategy

CPU Hierarchies and Performance
• Let's do an experiment!
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines the location of the data
• Under ideal conditions (no contention...)
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached
[Figure: CPU core with L1 (64 kB) and L2 (4 MB). Each run reads 800 MB in total: 20k loops @ 40 kB, 10k loops @ 80 kB, 100 loops @ 8 MB.]
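The experiment above can be sketched as a small read loop. This is a minimal sketch and not the lecture's benchmark code: the function name `read_repeatedly` and its parameters are hypothetical, and the timing around the call is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read `buf_size` bytes back-to-back until at least `total` bytes have been
 * read. Returns the number of passes over the buffer. The running checksum is
 * written to *sum_out so the compiler cannot drop the reads. */
size_t read_repeatedly(const uint8_t *buf, size_t buf_size, size_t total,
                       uint64_t *sum_out)
{
    assert(buf != NULL && buf_size > 0 && sum_out != NULL);
    uint64_t sum = 0;
    size_t passes = 0;
    for (size_t done = 0; done < total; done += buf_size) {
        for (size_t i = 0; i < buf_size; i++)
            sum += buf[i];
        passes++;
    }
    *sum_out = sum;
    return passes;
}
```

Timing one call per buffer size (for example 40 kB, 80 kB and 8 MB, each with a total of 800 MB) should show the step in throughput as the buffer stops fitting in each cache level.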
Code Example and Profiling
• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out
• NB: prefetch op
  – prfm
[Figure: code listings for the read and write loops.]
ARMv8 Registers
• 31 x 64-bit general purpose registers: X0-X30
• 32 x 128-bit vector registers: V0-V31
• Special registers
  – SP, WSP: stack pointer
  – XZR, WZR: zero registers
  – PC: program counter

The Vector Registers V0-V31: Packing
• Data in V0-V31 are packed, and you control how they are packed
  – Example: 16 bytes or 8 bytes
  – Example: 8 half-words or 4 half-words
[Figure: vector register divided into lanes.]

Intrinsics, Inline Assembly or Assembly?
Approach          Where it goes
Intrinsics        Goes inside C functions
Inline assembly   Goes inside C functions
Assembly          Goes in .s file
(Level of difficulty increases from intrinsics to assembly.)

• You will need to understand assembly to
  – Debug your program
  – Understand how to use intrinsics correctly
[Figure: one code example per approach. The code is not recoverable from the extracted text.]
Data Types
C World        Assembly World   Meaning
uint8x8_t      v0.8b            8-bit unsigned integer, 8 elements
uint8x16_t     v0.16b           8-bit unsigned integer, 16 elements
float32x2_t    v0.2s            32-bit floating point, 2 elements
float32x4_t    v0.4s            32-bit floating point, 4 elements

Vector registers are specified by the following: 8B/16B/4H/8H/2S/4S/2D
• B: bytes
• H: half-word (16-bit)
• S: word (32-bit)
• D: doubleword (64-bit)
An S-vector can therefore hold signed or unsigned integers, or floating point values. The meaning is encoded in the assembly instruction or intrinsic, when needed.
Example: Loading or Storing Something
(All snippets need #include <arm_neon.h> and <stdlib.h>.)

Loading with inline assembly:
    void *inp = malloc(64);
    __asm__ volatile(
        "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[inp]]"
        :
        : [inp] "r" (inp)
        : "v0", "v1", "v2", "v3", "memory");  /* registers and memory the asm touches */

Loading with intrinsics:
    uint8_t *inp = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vectors[i] = vld1q_u8(inp + i * 16);

Storing with inline assembly:
    void *out = malloc(64);
    __asm__ volatile(
        /* Assume we have done something intelligent with v0, v1, v2 and v3 */
        "st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[out]]"
        :
        : [out] "r" (out)
        : "memory");

Storing with intrinsics:
    uint8_t *out = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vst1q_u8(out + i * 16, vectors[i]);
How Do Intrinsics Map to Assembly?
• If you want to write some piece of inline assembly
  – But the compiler spits out errors and you don't know the syntax
• Try to write it with intrinsics
  – Then disassemble the result with objdump -D
Example: Loading or Storing Something
• There are vector types and intrinsics for clustered vectors
• In this case, four 128-bit registers
• Be careful!
• The load rearranges the contents in a way that can seem unintuitive: it de-interleaves, so byte i of memory ends up in vectors.val[i % 4], lane i / 4

    uint8_t *inp = malloc(64);
    for (int i = 0; i < 64; i++)
        inp[i] = i;             // Contents are consecutive in memory..
    uint8x16x4_t vectors;
    vectors = vld4q_u8(inp);    // Contents are not consecutive in vectors!!
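To see where each byte ends up after a four-register structure load, here is a scalar model of the de-interleaving. It is a sketch that runs without NEON: the type `emu_u8x16x4` and the function `emu_vld4q_u8` are hypothetical stand-ins for `uint8x16x4_t` and `vld4q_u8`, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uint8x16x4_t: four 16-lane byte vectors. */
typedef struct {
    uint8_t val[4][16];
} emu_u8x16x4;

/* Scalar model of a 4-way de-interleaving load of 64 bytes:
 * byte i of memory lands in vector (i % 4), lane (i / 4). */
emu_u8x16x4 emu_vld4q_u8(const uint8_t *src)
{
    assert(src != NULL);
    emu_u8x16x4 out;
    for (int i = 0; i < 64; i++)
        out.val[i % 4][i / 4] = src[i];
    return out;
}
```

With the input 0, 1, 2, ..., 63, the first vector holds 0, 4, 8, ... and the second holds 1, 5, 9, ..., which is handy for interleaved data such as RGBA pixels but surprising if you expected four consecutive 16-byte chunks.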
Example: Vector Packing
Data type          Size
Bytes              1 B
Half-words         2 B
Words              4 B
Double words       8 B
Half precision     2 B
Single precision   4 B
Double precision   8 B

Other Examples

Initialise all lanes:
    int8x16_t v0;
    int8_t init = 0;
    v0 = vdupq_n_s8(init);

Addition, subtraction and multiplication:
    int16x4_t v0, v1;
    int16x4_t result;
    result = vadd_s16(v0, v1);
    result = vsub_s16(v0, v1);
    result = vmul_s16(v0, v1);

Get and set a specific lane:
    float32x4_t v0;
    float32_t val;
    val = vgetq_lane_f32(v0, 0);
    val += 42.0F;
    v0 = vsetq_lane_f32(val, v0, 0);

Convert unsigned int to float:
    uint32x4_t v0;
    float32x4_t result;
    result = vcvtq_f32_u32(v0);
Programming With Intrinsics
• More in a bit!

Programming Example: Intrinsics

Inline Assembly
• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
  – Not always straightforward to figure out which mnemonics to use
• Tip: write it with intrinsics first, then disassemble and inspect with objdump or gdb
• Operand constraints
  – "m": memory address
  – "r": general purpose register
  – "w": floating point and SIMD register (AArch64 GCC)
  – "i": immediate
  – ++ more
• Specify dirty registers (clobbers) and more

Programming Example: Inline Assembly

Lookup Tables (LUT)
• Powerful approximation
• Use LUTs to realise complex mathematics!
  – For example prime numbers...
• Some «index» points into a LUT offset that contains precomputed values
• Output stored in a vector
[Figure: a four-element index «vector» selects entries from a LUT and produces a four-element output «vector».]

Table Lookup in ARM Neon
• Vector table lookup: vtbl v0, {v1, v2, ..., vn}, vm
  – v0: destination vector
  – {v1, v2}: LUT (max 2 x 128-bit vectors!!)
  – vm: index vector
• Two flavours:
  – vtbl: any element out of range for the LUT returns 0
  – vtbx: any element out of range for the LUT leaves the destination unchanged
[Figure: LUT registers v1 (elements 0-15) and v2 (elements 16-31), index vector vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19, and destination v0.]
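The two lookup flavours can be modelled in plain C to check what an index vector will produce before writing the NEON version. This is a scalar sketch, not the instruction itself: `emu_table_lookup` and its `keep_dst` flag are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a byte table lookup over `lut_len` table bytes.
 * keep_dst == 0 models the flavour where an out-of-range index returns 0.
 * keep_dst != 0 models the flavour where an out-of-range index leaves the
 * destination element unchanged. */
void emu_table_lookup(uint8_t *dst, const uint8_t *lut, size_t lut_len,
                      const uint8_t *idx, size_t n, int keep_dst)
{
    assert(dst != NULL && lut != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++) {
        if (idx[i] < lut_len)
            dst[i] = lut[idx[i]];
        else if (!keep_dst)
            dst[i] = 0;
        /* otherwise dst[i] keeps its previous value */
    }
}
```

With a 32-byte table (two 128-bit registers), any index of 32 or more is out of range, so the first flavour writes 0 and the second leaves the old destination byte in place.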
• Let's try to use LUTs to transpose matrices.
• Don't go thinking 4x4 or 8x8.
  – Start easy, then let's see if we can observe any patterns.
Matrix Transpose (Super Simple)

1x1 matrix, stride = 1
    LUT (matrix):        a
    Index vector:        0
    Destination vector:  a

2x2 matrix, stride = 2
    Matrix:     a b      Transposed:  a c
                c d                   b d
    LUT (matrix):        a b c d
    Index vector:        0 2 1 3
    Destination vector:  a c b d

3x3 matrix, stride = 3
    Matrix:     a b c    Transposed:  a d g
                d e f                 b e h
                g h i                 c f i
    LUT (matrix):        a b c d e f g h i
    Index vector:        0 3 6 1 4 7 2 5 8
    Destination vector:  a d g b e h c f i

How to think?
• For the «first output row», i = 0
  – Output element n is taken from n*stride in the «input matrix»
• For the «next row», i = 1
  – Output element n is taken from n*stride + 1
• So generally, for output element n in output row i
  – The element is taken from n*stride + i
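The rule above (output row i, element n reads n*stride + i) can be turned into a small index generator. This is a scalar sketch for checking index vectors by hand. The names `make_transpose_index` and `apply_index` are hypothetical, and the indices are per element, whereas a real byte-wise table lookup would need one index per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill idx[0 .. stride*stride-1] with the lookup indices for transposing a
 * stride x stride matrix: output row i, element n reads n*stride + i. */
void make_transpose_index(uint8_t *idx, size_t stride)
{
    assert(idx != NULL && stride > 0 && stride * stride <= 256);
    for (size_t i = 0; i < stride; i++)
        for (size_t n = 0; n < stride; n++)
            idx[i * stride + n] = (uint8_t)(n * stride + i);
}

/* Apply an index vector to a flattened matrix (the LUT). */
void apply_index(char *dst, const char *lut, const uint8_t *idx, size_t count)
{
    assert(dst != NULL && lut != NULL && idx != NULL);
    for (size_t k = 0; k < count; k++)
        dst[k] = lut[idx[k]];
}
```

For stride = 3 this generates 0 3 6 1 4 7 2 5 8, and applying it to the flattened matrix a b c d e f g h i gives a d g b e h c f i.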
Matrix Transpose Example

The Gamma Transform
• Human eye is sensitive to variation in luminance
• The gamma transform...
  – «Stretches» small variations in luminance
  – Can make it easier to see detail
• Gamma is also used to adjust for non-linearity in old monitors
  – Image data is actually transmitted to the display with gamma applied to it as a form of «backward compatibility»
  – Which is extremely confusing
  – Google «gamma correction explained» and watch the madness
The Gamma Transform
$$V_{\text{out}} = V_{\text{in}}^{\gamma}, \quad V_{\text{in}} \in [0, 1]$$
[Figure: output luminance value against input luminance value, both from 0 to 1, with curves for $\gamma = 1$ and $\gamma = 2.2$. A small change in input luminance results in a larger change in output luminance.]
Source: https://wolfcrow.com/what-is-display-gamma-and-gamma-correction/

• Higher gamma compresses the low-lights, but extends the high-lights
• Lower gamma compresses the high-lights, but extends the low-lights
[Figure: examples for $\gamma = 1/2.2$, $\gamma = 1.0$ and $\gamma = 2.2$.]
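As a quick numeric check of the two statements above, assume the transform is output = input raised to the power gamma, on luminance values normalised to [0, 1]. The symbols $V_{\text{in}}$ and $V_{\text{out}}$ are chosen here for illustration. For a mid-grey input of 0.5:

$$
\begin{aligned}
\gamma = 2.2:&\quad V_{\text{out}} = 0.5^{2.2} \approx 0.218 \\
\gamma = 1.0:&\quad V_{\text{out}} = 0.5^{1.0} = 0.5 \\
\gamma = 1/2.2:&\quad V_{\text{out}} = 0.5^{1/2.2} \approx 0.730
\end{aligned}
$$

The higher gamma pushes mid-grey down towards black, so the low-lights are squeezed into a small output range. The lower gamma lifts it towards white, so the low-lights are spread out and the high-lights are squeezed instead.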
Non-Temporal Loads and Stores
• Reading remote RAM over PCIe
• Very slow, because the core hangs while the datapath is fetching the response
• However, ARM provides special memory load & store instructions
  – Non-temporal loads and stores
  – Relaxed memory ordering
• Hardware implementation may be able to improve performance; for example by not waiting for reads to arrive in destination registers
[Figure: two Tegra Xavier systems, each with CPU and RAM, connected over PCIe.]
Non-Temporal Loads and Stores
    ldnp q0, q1, [address]
    stnp q0, q1, [address]

NX reads remote RAM (x2 link):
    Temporal instructions        < 10 MBps
    Non-temporal instructions    200-300 MBps
[Figure: two Tegra Xavier systems, each with CPU and RAM, connected over PCIe.]
Example: Gamma Correction Using ARM Neon

When Things Go Wrong

GDB Example

Tips
• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, and verify against output you know is working

Detecting Adidas Features

Good Luck!
• You'll be fine.

Matrix Transpose
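    tbl v0.4s, {v1.4s}, v2.4s

    Matrix:     a b      Transposed:  a c
                c d                   b d
    v1.4s (LUT, matrix):   a b c d
    v2.4s (index vector):  0 2 1 3
    v0.4s (destination):   a c b d

Think like this: for each output row, step through the LUT by the stride, starting one column further in for each row.
Note that the instruction above is written per element to show the idea. The real AArch64 tbl indexes bytes (tbl v0.16b, {v1.16b}, v2.16b), so for 32-bit elements each element index expands to four byte indices.

Code Profiling
• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out
[Figure: bar chart, «Time to Finish 100M computations for Matrix Multiply (MM) and Transpose Operations», with bars for the lazy and NEON transposes and for the NEON matrix multiply variants (intrinsics and assembly). The values are not recoverable from the extracted text.]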