Tegra Xavier Introduction to ARMv8 Kristoffer Robin Stokke, PhD Dolphin Interconnect Solutions And Debugging Goals of Lecture qTo give you q Something concrete to start on q Some examples from «real life» where you may encounter these topics qEvery year I try to include something new... q Which means more freebees for you! J qSimple introduction to ARMv8 NEON programming environment q Register environment, instruction syntax q «Families» of instructions q Important for debugging, writing code and general understanding qProgramming examples q Intrinsics q Inline assembly qPerformance analysis using gprof qIntroduction to GDB debugging Keep This Under Your Pillow qARM’s overview and information on NEON instructions q https://developer.arm.com/documentation/dui0204/j/neon-and-vfp-programming qGNU compiler intrinsics list: q https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/ARM-NEON-Intrinsics.html qSome non-formal calling conventions and snacks q https://medium.com/mathieugarcia/introduction-to-arm64-neon-assembly-930c4a48bb2a qThis may also be useful q https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm- neon-programming-quick-reference Modern Heterogeneous SoC Architectures Manufacturer CPU CPU Cache RAM GPU DSP Hardware Accelerators Tegra X1 Nvidia 4 ARM Cortex 2 MB L2, 4 GB 256-core - • ISP A57 + 4 A53 48 kB I$ Maxwell 32 kB D$ (L1) Tegra Xavier Nvidia 8 CarMel 2 MB L3 (shared) 16 GB 512-core Volta - • CNN ARMv8 8 MB L2 (shared 2 Blocks cores) • ISP 128 kB I$ 64 kB D$ Myriad X Intel Movidius 2 SPARC Kilobytes 4 GB - 16 VLIW cores • ISP • CNN Blocks • ++ More SDA845 QualcoMM 8 Kryo ARM- 8 GB Adreno GPU Hexagon VLIW • ISP based + SIMD • LTE IMX6Q Freescale 4 ARM Cortex 1 MB L2 Impl. «3D graphics» (NXP) A-9 Dep. 02.02.2021 5 Tegra Xavier CPU Cache Hierarchy • ARM 8.2 Cores (Carmel) Core Core • (Half-precision floating point!) • SIMD! 128kB $I 128kB $I 64 kB $D 64 kB $D • 64 kB L1 Cache • Per core 4 MB L2 Cache way 2 MB L3 Cache • 4 MB L2 Cache this • Per dual Faster Faster • 2 MB L3 Cache 16 GB RAM • Shared between all cores • Least recently used eviction strategy (LRU) 02.02.2021 6 CPU Hierarchies and Performance • Let’s do an experiment! ☺J CPU Core • Reading or writing 800 MB 20k Loops @ 40 kB • Vary the size to read back-to-back L1 64 kB – E.g. read 24 kB repeatedly from same 40k Loops buffer, until 800 MB have been read @ 80 kB L2 4 MB • Buffersize detemines location of data • Under ideal conditions (no contention..) 100 Loops – Below 50kB, all reads are cached in L1 @ 8 MB – Below 4 MB, all reads are cached in L2 L2 4 MB – Above 6 MB... Nothing gets cached 02.02.2021 7 Code Example and Profiling • CoMpile with –pg • Run app: ./main • Analyze – Gprof ./main gmon.out Read Write • NB:Prefetch op – PrfM <type><target><policy> reg|label L1 20 ms 30 ms – Type • pld (for load) L2 70 ms 70 ms • pst (for store) • pli (for instruction) RAM 220 ms 180 ms • Target – L1 or L2 (or L3) – Policy • keep (normal) • streaM (use once) • prfM pldl1keep[x0] (address in x0) 02.02.2021 8 ARMv8 Registers 31 x 64-bit general purpose registers X0 X8 x16 x24 32 x 128-bit vector registers V0 V8 V16 V24 SP WZR Stack pointer Zero registers PC WSP XZR The Vector Registers V0-V31: Packing q Data in V0-V31 are packed, and you control how they are packed Example: 16 bytes or 8 bytes Lanes Example: 8 half-words or 4 half-words Intrinsics, Inline Assembly or Assembly? Inline Intrinsics Assembly Assembly • You will need to understand !"#$" 40"56$9,3#:"*&; assembly40"56$9,3#:"*&; to !%&' !()*+%),-*.+)#/#)#'#0"1 <<,=*,1".>>,*0,3#:"*& • Debug<<,=*,1".>>,*0,3#:"*& your program -*.+)#/#)#'#0"12 • Understand how to use 3%--!456,37837837, G#:"*&,H,3%--E/156?3#:"*&8,3#:"*&F //%1'//? +$,)& intrinsics@3%-- A3B!1C8,A3B!1C8,A3B!1CD correctly !#0- 2,2,A3#:"*&B,@ED,?3#:"*&F,2,F Goes inside C functions Goes inside C functions Goes in .s file Level of Difficulty 02.02.2021 11 Data Types C World Assembly World uint8x8_t v0.8b Vector registers are specified by the 8-bit unsigned integer 8 elementsfollowing: 8B/16B/4H/8H/2S/4S/2D uint8x16_t v0.16b B: bytes H: half-word (16-bit) 8-bit unsigned integer 16 S:elements word (32-bit) D: doubleword (64-bit) float32x2_t v0.2s An S-vector can therefor occupy signed or unsigned integers, or floating point values. 32-bit floating point 2 elements Meaning is encoded in the assembly float32x4_t v0.4s instruction or intrinsic, when needed. 32-bit floating point 4 elements 02.02.2021 12 Example: Loading or Storing Something void * inp = malloc( 64 ) char * inp = malloc( 64 ) __asm__( uint8x16_t vectors[4]; «ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[inp]]» : : [inp] «r» (inp)) for(i=0; i < 4; i++) vectors[i] = vld1q_u8(inp + i*16) void * out = malloc( 64 ) char * inp = malloc( 64 ) __asm__( uint8x16_t vectors[4]; // Assume we have done something // intelligent with v0, v1, v2 and v3 for(i=0; i < 4; i++) vst1q_u8(inp + i*16, vectors[i]) «st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[out]]» : : [out] «r» (out) : «memory») 02.02.2021 13 How Does Intrinsics Map to Assembly? • If you want to write some piece of inline assembly – But compiler spits out errors and you don’t know the syntax • Try to write it by intrinsics – Then objdump –D <your executable> | less – Type /<insert_your_function_name> + hit return and search • Alternatively – Gdb <your executable> – Break <your_source_file>.c:<your_line_number> – Type run -> enter – Layout asm -> inspect 02.02.2021 14 Example: Loading or Storing Something • There are vector types and intrinsics char * inp = malloc( 64 ) for clustered vectors • In this case, four 128-bit registers For(i=0; i < 64; i++) inp[i] = i; // Contents are consecutive in memory.. • Be careful! Uint8x16x4 vectors; vectors = vld4q(inp); • Compiler seems to rearrange the contents in some unintuitive way // Contents are not consecutive in vectors!! 02.02.2021 15 Example: Vector Packing Data types Size Bytes 1B Half-words 2B Words 4B Double words 8B Half precision 2B Single precision 4B Double precision 8B Other Examples int8x16_t v0; int8_t init = 0; int16x4_t v0, v1; v0 = vdupq_n_s8(init); Int16x4_t result; Initialise all lanes result = vadd_s16(v0, v1) result = vsub_s16(v0, v1) result = vmul_s16(v0, v1) Float32x4_t v0; Addition, subtraction and multiplication Float32_t val; Val = vgetq_lane_f32(v0, 0) Val += 42.0F; Uint32x4_t v0; V0 = vsetq_lane_f32(val, v0, 0) float32_x4_t result; Result = vcvtq_f32_u32( v0 ) Get and set a specific lane Convert unsigned int to float 02.02.2021 17 Programming With Intrinsics More in a bit! Programming Example: Intrinsics Inline Assembly q Mostly harder than using intrinsics qHowever, gives more control (and better performance?) q Not always straightforward to figure out what mnemonics to use qTips: disassemble intrinsics and look with objdump or gdb Operand constraints > «m» : memory address > «r» : general purpose register > «f» : floating point register > «i» : immediate ++ more Specify dirty registers and more Programming Example: Inline Assembly Lookup Tables (LUT) • Powerful approximation • Use LUTs to realise complex mathematics! • For example prime numbers.. • Some “index” points into a LUT offset that contains precomputed values • Output stored in a vector 7 5 3 3 2 5 Index «vector» LUT Output «vector» (four-element) 02.02.2021 22 Table Lookup in ARM Neon qVector table lookup: vtbl v0, {v1, v2, ..., vn}, vm Two flavours: qV0: destination vector q{v1, v2}: LUTvtbl (max 2x128-bit vectors!!) qvm: index vectorAny element out of range for LUT returns 0 vtbx v0 Any element out of range0 for LUT15 leaves16 the 31 destination unchanged v1 v2 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19 vm • Let’s try to use LUTs to transpose matrices. • Don’t go thinking 4x4 or 8x8. – Start easy, then let’s see if we can observe any patterns. 02.02.2021 24 Matrix Transpose (Super Simple) Stride = 1 Destination Vector a a a LUT (matrix) a stride Index Vector 0 2x2 matrix, stride = 2 Destination a c b d a b a c Vector c d b d LUT (matrix) a b c d stride Index Vector 0 2 1 3 stride 3x3 matrix, stride = 3 a b c a d g Destination a d g b e h c f i d e f b e h Vector g h i c f i LUT stride (matrix) a b c d e f g h i Index 0 3 6 1 4 7 2 5 8 Vector stride How to think? • For the «first output row» = 0 – Element output n is taken from n*stride in «input matrix» • For the “next row” = 1 – Element output n is taken from n*stride + 1 • So generally, for output element n in output row i – Element is taken from n*stride + i 02.02.2021 28 Matrix Transpose Example 02.02.2021 29 The Gamma Transform • Human eye is sensitive to variation in luminance • The gamma transform.. – «stretches» small variations in luminance – Can make it easier to see detail • Gamma is also used to adjust for non-linearity in old monitors – Image data is actually transmitted to the display with gamma applied to it as a form of «back compatibility» – Which is extremely confusing – Google «gamma correction explained» and watch the madness 02.02.2021 30 The Gamma Transform � = 1 1 1 � = 2.2 Output luminance value Output value ! � = � , � ∈ [0,1] Resulting larger change in luminance � = 2.2 Input luminance value 0 Input Value 1 Small change in luminance 02.02.2021 https://wolfcrow.com/what-is-display-gamma-and-gamma-correction/ 31 Higher gamma compresses the low-lights, but extends high-lights 1 � = � = 1.0 � = 2.2 2.2 Lower gamma compresses the high-lights, but extends the low-lights 02.02.2021 32 Non-Temporal Loads and Stores • Reading remote RAM from PCIe • Very slow due to core is hanging while datapath is fetching CPU RAM response
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages41 Page
-
File Size-