Gamma Correction Using ARM Neon

Tegra Xavier
Introduction to ARMv8 and Debugging
Kristoffer Robin Stokke, PhD
Dolphin Interconnect Solutions
02.02.2021

Goals of Lecture
• To give you
  – Something concrete to start on
  – Some examples from «real life» where you may encounter these topics
• Every year I try to include something new...
  – Which means more freebies for you! ☺
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow
• ARM's overview and information on NEON instructions
  – https://developer.arm.com/documentation/dui0204/j/neon-and-vfp-programming
• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/ARM-NEON-Intrinsics.html
• Some non-formal calling conventions and snacks
  – https://medium.com/mathieugarcia/introduction-to-arm64-neon-assembly-930c4a48bb2a
• This may also be useful
  – https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-programming-quick-reference

Modern Heterogeneous SoC Architectures
Tegra X1 (Nvidia)
  CPU: 4 ARM Cortex A57 + 4 A53
  CPU cache: 2 MB L2, 48 kB I$, 32 kB D$ (L1)
  RAM: 4 GB
  GPU: 256-core Maxwell
  DSP: -
  Hardware accelerators: ISP
Tegra Xavier (Nvidia)
  CPU: 8 Carmel ARMv8
  CPU cache: 2 MB L3 (shared), 8 MB L2 (shared 2 cores), 128 kB I$, 64 kB D$
  RAM: 16 GB
  GPU: 512-core Volta
  DSP: -
  Hardware accelerators: CNN blocks, ISP
Myriad X (Intel Movidius)
  CPU: 2 SPARC
  CPU cache: kilobytes
  RAM: 4 GB
  GPU: -
  DSP: 16 VLIW cores
  Hardware accelerators: ISP, CNN blocks, ++ more
SDA845 (Qualcomm)
  CPU: 8 Kryo, ARM-based
  RAM: 8 GB
  GPU: Adreno GPU
  DSP: Hexagon VLIW + SIMD
  Hardware accelerators: ISP, LTE
IMX6Q (Freescale, NXP)
  CPU: 4 ARM Cortex A-9
  CPU cache: 1 MB L2
  RAM: Impl. dep.
  GPU: «3D graphics»

Tegra Xavier CPU Cache Hierarchy
• ARMv8.2 cores (Carmel)
  – Half-precision floating point!
  – SIMD!
• 64 kB L1 cache
  – Per core
• 4 MB L2 cache
  – Per dual (pair of cores)
• 2 MB L3 cache
  – Shared between all cores
• Least recently used eviction strategy (LRU)
[Figure: each core has a 128 kB I$ and a 64 kB D$; each pair of cores shares a 4 MB L2 cache; all cores share the 2 MB L3 cache and 16 GB RAM. Access gets faster the closer the data is to the core.]

CPU Hierarchies and Performance
• Let's do an experiment! ☺
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines location of data
• Under ideal conditions (no contention..)
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached
[Figure: CPU core with L1 (64 kB) and L2 (4 MB); 20k loops @ 40 kB, 40k loops @ 80 kB, 100 loops @ 8 MB]

Code Example and Profiling
• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out

         Read     Write
  L1     20 ms    30 ms
  L2     70 ms    70 ms
  RAM    220 ms   180 ms

• NB: Prefetch op
  – prfm <type><target><policy> reg|label
  – Type
    • pld (for load)
    • pst (for store)
    • pli (for instruction)
  – Target
    • L1 or L2 (or L3)
  – Policy
    • keep (normal)
    • strm (streaming, use once)
  – Example: prfm pldl1keep, [x0] (address in x0)

ARMv8 Registers
• 31 x 64-bit general purpose registers: X0-X30
• 32 x 128-bit vector registers: V0-V31
• SP, WSP: stack pointer
• XZR, WZR: zero registers
• PC

The Vector Registers V0-V31: Packing
• Data in V0-V31 are packed, and you control how they are packed
• The packed elements are called lanes
• Example: 16 bytes or 8 bytes
• Example: 8 half-words or 4 half-words

Intrinsics, Inline Assembly or Assembly?
• You will need to understand assembly to
  – Debug your program
  – Understand how to use intrinsics correctly
• Intrinsics: go inside C functions
• Inline assembly: goes inside C functions
• Assembly: goes in a .s file
• Level of difficulty increases from intrinsics to inline assembly to assembly
[Figure: example code for each of the three approaches]

Data Types
  C World        Assembly World   Meaning
  uint8x8_t      v0.8b            8-bit unsigned integer, 8 elements
  uint8x16_t     v0.16b           8-bit unsigned integer, 16 elements
  float32x2_t    v0.2s            32-bit floating point, 2 elements
  float32x4_t    v0.4s            32-bit floating point, 4 elements
• Vector registers are specified by the following: 8B/16B/4H/8H/2S/4S/2D
  – B: bytes
  – H: half-word (16-bit)
  – S: word (32-bit)
  – D: doubleword (64-bit)
• An S-vector can therefore hold signed or unsigned integers, or floating point values.
• Meaning is encoded in the assembly instruction or intrinsic, when needed.

Example: Loading or Storing Something
Loading with inline assembly:
    void *inp = malloc(64);
    __asm__ volatile(
        "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[inp]]"
        :
        : [inp] "r" (inp)
        : "v0", "v1", "v2", "v3", "memory");
Loading with intrinsics:
    uint8_t *inp = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vectors[i] = vld1q_u8(inp + i * 16);
Storing with inline assembly:
    void *out = malloc(64);
    __asm__ volatile(
        // Assume we have done something
        // intelligent with v0, v1, v2 and v3
        "st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[out]]"
        :
        : [out] "r" (out)
        : "memory");
Storing with intrinsics:
    uint8_t *inp = malloc(64);
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vst1q_u8(inp + i * 16, vectors[i]);

How Do Intrinsics Map to Assembly?
• If you want to write some piece of inline assembly
  – But the compiler spits out errors and you don't know the syntax
• Try to write it with intrinsics
  – Then objdump -D <your executable> | less
  – Type /<insert_your_function_name> + hit return and search
• Alternatively
  – gdb <your executable>
  – break <your_source_file>.c:<your_line_number>
  – Type run -> enter
  – layout asm -> inspect

Example: Loading or Storing Something
• There are vector types and intrinsics for clustered vectors
• In this case, four 128-bit registers
• Be careful!
    uint8_t *inp = malloc(64);
    for (int i = 0; i < 64; i++)
        inp[i] = i;
    // Contents are consecutive in memory..
    uint8x16x4_t vectors;
    vectors = vld4q_u8(inp);
    // Contents are not consecutive in vectors!!
• The contents are rearranged in a way that looks unintuitive at first
  – vld4q_u8 is a de-interleaving load: byte n of the buffer ends up in vectors.val[n % 4], lane n / 4

Example: Vector Packing
  Data type          Size
  Bytes              1 B
  Half-words         2 B
  Words              4 B
  Double words       8 B
  Half precision     2 B
  Single precision   4 B
  Double precision   8 B

Other Examples
Initialise all lanes:
    int8x16_t v0;
    int8_t init = 0;
    v0 = vdupq_n_s8(init);
Addition, subtraction and multiplication:
    int16x4_t v0, v1;
    int16x4_t result;
    result = vadd_s16(v0, v1);
    result = vsub_s16(v0, v1);
    result = vmul_s16(v0, v1);
Get and set a specific lane:
    float32x4_t v0;
    float32_t val;
    val = vgetq_lane_f32(v0, 0);
    val += 42.0f;
    v0 = vsetq_lane_f32(val, v0, 0);
Convert unsigned int to float:
    uint32x4_t v0;
    float32x4_t result;
    result = vcvtq_f32_u32(v0);

Programming With Intrinsics
More in a bit!

Programming Example: Intrinsics

Inline Assembly
• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
• Not always straightforward to figure out what mnemonics to use
• Tips: disassemble intrinsics and look with objdump or gdb
• Operand constraints
  > "m": memory address
  > "r": general purpose register
  > "w": floating point or SIMD register (on AArch64)
  > "i": immediate
  ++ more
• Specify dirty registers (the clobber list) and more

Programming Example: Inline Assembly

Lookup Tables (LUT)
• Powerful approximation
• Use LUTs to realise complex mathematics!
• For example prime numbers..
• Some "index" points into a LUT offset that contains precomputed values
• Output stored in a vector
[Figure: a four-element index «vector» selects entries from the LUT into the output «vector»]

Table Lookup in ARM Neon
• Vector table lookup: vtbl v0, {v1, v2, ..., vn}, vm
  – v0: destination vector
  – {v1, v2}: LUT (max 2x128-bit vectors!!)
  – vm: index vector
• Two flavours:
  – vtbl: any element out of range for the LUT returns 0
  – vtbx: any element out of range for the LUT leaves the destination unchanged
[Figure: v1 holds LUT bytes 0-15 and v2 holds LUT bytes 16-31; example index vector vm = 0 8 4 6 0 8 4 6 0 18 4 18 24 25 14 19]
• Let's try to use LUTs to transpose matrices.
• Don't go thinking 4x4 or 8x8.
  – Start easy, then let's see if we can observe any patterns.

Matrix Transpose (Super Simple)
1x1 matrix, stride = 1
  LUT (matrix):        a
  Index vector:        0
  Destination vector:  a
2x2 matrix, stride = 2
  Matrix:              a b / c d
  Transposed:          a c / b d
  LUT (matrix):        a b c d
  Index vector:        0 2 1 3
  Destination vector:  a c b d
3x3 matrix, stride = 3
  Matrix:              a b c / d e f / g h i
  Transposed:          a d g / b e h / c f i
  LUT (matrix):        a b c d e f g h i
  Index vector:        0 3 6 1 4 7 2 5 8
  Destination vector:  a d g b e h c f i

How to think?
• For the «first output row» = 0
  – Output element n is taken from n*stride in the «input matrix»
• For the "next row" = 1
  – Output element n is taken from n*stride + 1
• So generally, for output element n in output row i
  – Element is taken from n*stride + i

Matrix Transpose Example

The Gamma Transform
• The human eye is sensitive to variation in luminance
• The gamma transform..
  – «stretches» small variations in luminance
  – Can make it easier to see detail
• Gamma is also used to adjust for non-linearity in old monitors
  – Image data is actually transmitted to the display with gamma applied to it as a form of «backward compatibility»
  – Which is extremely confusing
  – Google «gamma correction explained» and watch the madness

The Gamma Transform
[Figure: output luminance value versus input luminance value, with curves for γ = 1 and γ = 2.2]
$\text{output} = \text{input}^{\gamma}, \quad \text{input} \in [0, 1]$
• With γ = 2.2, a small change in input luminance can result in a larger change in output luminance
• Figure source: https://wolfcrow.com/what-is-display-gamma-and-gamma-correction/

• Higher gamma compresses the low-lights, but extends the high-lights
• Lower gamma compresses the high-lights, but extends the low-lights
[Figure: transfer curves for γ = 1/2.2, γ = 1.0 and γ = 2.2]

Non-Temporal Loads and Stores
• Reading remote RAM from PCIe
• Very slow because the core hangs while the datapath fetches the data
[Figure: CPU, RAM, response]
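Because the gamma transform is a fixed function of the pixel value, it can be precomputed once into a lookup table for 8-bit images. The sketch below is plain scalar C and is not taken from the lecture. The names gamma_two, build_gamma_lut and apply_gamma_lut are hypothetical, and the transfer function uses γ = 2.0 only so that the example needs no math library; for γ = 2.2 a function wrapping powf(x, 2.2f) from <math.h> could be passed instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer function for this sketch: output = input^2.0.
 * An assumption made to avoid <math.h>; not the gamma used in the slides. */
static float gamma_two(float x)
{
    return x * x;
}

/* Precompute the transform for every possible 8-bit input value. */
static void build_gamma_lut(uint8_t lut[256], float (*transfer)(float))
{
    assert(transfer != NULL);
    for (int i = 0; i < 256; i++) {
        float in = (float)i / 255.0f;              /* normalise to [0, 1] */
        float out = transfer(in);
        if (out < 0.0f) out = 0.0f;                /* clamp to [0, 1] */
        if (out > 1.0f) out = 1.0f;
        lut[i] = (uint8_t)(out * 255.0f + 0.5f);   /* round to nearest */
    }
}

/* Apply the table: one lookup per pixel instead of one power per pixel. */
static void apply_gamma_lut(const uint8_t lut[256], const uint8_t *src,
                            uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}
```

A full 8-bit table has 256 entries, which is larger than the LUT that fits in the table registers described in the table lookup slides, so a NEON version would presumably have to split the table or combine several lookups.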
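Returning to the table lookup and matrix transpose slides: the rule "output element n in output row i is taken from n*stride + i" can be checked with a scalar model before writing any NEON code. The sketch below is portable C, not the lecture's implementation. The function names are hypothetical, and the two lookup functions only imitate the out-of-range behaviour the slides describe for vtbl and vtbx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the index vector for transposing a stride x stride matrix:
 * output element n of output row i is taken from n*stride + i. */
static void build_transpose_indices(uint8_t *idx, unsigned stride)
{
    assert(stride > 0 && stride * stride <= 256);  /* indices must fit in a byte */
    for (unsigned i = 0; i < stride; i++)          /* output row */
        for (unsigned n = 0; n < stride; n++)      /* element within the row */
            idx[i * stride + n] = (uint8_t)(n * stride + i);
}

/* Scalar model of the vtbl flavour: an out-of-range index yields 0. */
static void table_lookup(uint8_t *dst, const uint8_t *lut, size_t lut_len,
                         const uint8_t *idx, size_t n)
{
    for (size_t k = 0; k < n; k++)
        dst[k] = (idx[k] < lut_len) ? lut[idx[k]] : 0;
}

/* Scalar model of the vtbx flavour: an out-of-range index leaves the
 * destination element unchanged. */
static void table_lookup_extend(uint8_t *dst, const uint8_t *lut, size_t lut_len,
                                const uint8_t *idx, size_t n)
{
    for (size_t k = 0; k < n; k++)
        if (idx[k] < lut_len)
            dst[k] = lut[idx[k]];
}
```

For stride = 3 this produces the index vector 0 3 6 1 4 7 2 5 8 from the 3x3 slide, and looking up the matrix a b c d e f g h i through it gives a d g b e h c f i.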
