Introduction to Lab 4

AvDark'11
Andreas Sandberg <[email protected]>
Division of Computer Systems
Dept. of Information Technology
Uppsala University
2011-11-18

Sections: Introduction, Uppmax, SSE, Summary


What is lab 4? Vectors? Who needs vectors anyway?

The purpose of this assignment is to give insights into:
  How vector instructions can be used for floating point code
  How integer operations can be performed using vector instructions
  How memory alignment affects performance and correctness


The Kalkyl Cluster: Specifications

Cluster specifications
  348 nodes interconnected with Infiniband
  2784 CPU cores
  9404 GB RAM
  113 TB disk

Node specifications
  Runs Scientific Linux (Red Hat Enterprise Linux customized for scientific applications)
  2x Quad-Core Intel Xeon 5520 (Nehalem based)
  At least 24 GB RAM


The Kalkyl Cluster: Logging in and transferring files

Connecting with SSH
  Always connect to kalkyl.uppmax.uu.se
  ssh -Y [email protected]
    • -Y: enables X-forwarding

Transferring files
  Transfer files using the scp command
  Use the same server as for normal SSH logins
  scp ./foo [email protected]:bar/
    • Transfers the file ./foo to the directory bar in your home directory on Uppmax


The Kalkyl Cluster: Submitting an interactive job

You won't measure correct results in your experiments unless you allocate an entire node for them.

Use salloc -p node -n 1 -t 15:00 --qos=short -A g2011132 CMD
    • Runs CMD, or a shell if CMD is omitted
    • -p node -n 1: request 1 node
    • -t 15:00: expected runtime for the job
    • --qos=short: use the queue for short jobs
    • -A g2011132: use the course project for accounting

Jobs running longer than the requested runtime will be terminated.


Loading additional software

Uppmax provides optional software in modules that can be easily loaded and unloaded.
  module load gcc: load the latest version of the GCC compiler
  module unload gcc: unload the currently loaded GCC module
  module list: list loaded modules
  module whatis: list available modules


x86 Terminology

Integers
  Byte      8 bits
  Word     16 bits
  DWord    32 bits
  QWord    64 bits

Floating Point
  Single   32 bits
  Double   64 bits
  Extended 80 bits (only available in the x87)


SSE Registers

A 128-bit register (bits 127..0) can be viewed as:
   2x 64-bit (int|FP)
   4x 32-bit (int|FP)
   8x 16-bit int
  16x  8-bit int

16 new 128-bit registers (8 registers in 32-bit mode)
Registers can hold either FP or integer values
Number of elements depends on element type


SSE: Compared to classical x86

  Classical x86/x87                              SSE
  Stack based FP math                            Register based FP math
  Uses extended 80-bit FP precision internally   Uses standard 32-bit or 64-bit FP precision
  Some instructions have fixed operands          All registers are general purpose
  Memory operations can generally be unaligned   Memory operations must generally be aligned


New instructions: Loads and Stores

Several new MOV instructions
    • Most of them can act as both loads and stores

Behavior with respect to memory system:
  Aligned       Requires aligned memory operands
  Unaligned     Allows unaligned memory operands
  Non-temporal  Accesses optimized for streaming data

Different versions depending on data type
    • Can be used to optimize data placement inside the CPU
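The difference between the aligned and unaligned load/store variants can be sketched in C with the SSE intrinsics from xmmintrin.h. This is only an illustration, not part of the lab code: the helper names is_aligned16 and add4 are made up for this example, and the 16-byte check assumes 128-bit SSE registers as described above.

```c
#include <stdint.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: non-zero if p is 16-byte aligned. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}

/* Hypothetical helper: dst[0..3] = a[0..3] + b[0..3]. */
static void add4(float *dst, const float *a, const float *b)
{
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        /* Aligned variants: these fault if an address is not
         * a multiple of 16 bytes. */
        _mm_store_ps(dst, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        /* Unaligned variants: accept any address. */
        _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
}
```

Calling add4 on buffers declared with __attribute__((aligned(16))) takes the aligned path. Passing pointers offset by one float takes the unaligned path and gives the same arithmetic result.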


New instructions

All common arithmetic operations are available
    • Operate on individual elements
    • At least one version per data type (8 versions of add!)

Binary logic operators are available
    • Operate on entire 128-bit registers
    • Different versions for integer and FP

Vector specific instructions
    • Dot-products
    • Horizontal add
    • ...

Hordes of esoteric instructions


New Instructions: Horizontal Add

HADDPS (Horizontal ADD Packed Single fp)

  Input vectors:  a = (a0, a1, a2, a3)   b = (b0, b1, b2, b3)
  Output vector:  c = (a0 + a1, a2 + a3, b0 + b1, b2 + b3)

Can be used to efficiently sum the elements of 4 vectors


New Instructions: Comparisons

PCMPGTW (Parallel CoMpare Greater Than Word)

  Input vectors:  (a0, a1, ..., a7) > (b0, b1, ..., b7)
  Output vector:  (c0, c1, ..., c7), where ci := (ai > bi) ? 0xFFFF : 0x0000

Compares element-wise
An element is binary all 1 if the predicate is true, 0 otherwise
Can be used to generate bit masks


Detecting SSE: How it should be done

  1. Can bit 21 in EFLAGS be toggled? => CPUID is present
  2. Execute CPUID with EAX=0. Check manufacturer and maximum CPUID function number.
  3. Execute CPUID with EAX=1. Check the following bits:
       EDX:25  SSE
       EDX:26  SSE2
       ECX:0   SSE3
       ECX:9   SSSE3
       ECX:19  SSE4.1
       ECX:20  SSE4.2
  4. Check for optional instructions (use CPUID)


Detecting SSE: What we do

This slide is intentionally left blank


C-support: Basics

Several different interfaces, no standard.

Common approaches:
    • Assembler libraries
    • Inline assembler (no standard inline asm syntax)
    • GCC Intrinsics
    • ICC Intrinsics (supported by GCC)

Intrinsic names for ICC are documented in Intel's CPU manuals
GCC's native instructions are "documented" in the GCC manual


C-support: Headers

  xmmintrin.h  SSE
  emmintrin.h  SSE2
  pmmintrin.h  SSE3
  tmmintrin.h  SSSE3
  smmintrin.h  SSE4.1
  nmmintrin.h  SSE4.2
  immintrin.h  AVX

Some header files include the headers from earlier SSE versions.
GCC requires that SSE extensions are enabled through command line switches.
Warning: This normally allows GCC to automatically generate code for those extensions.


C-support: Instruction and Type Naming

  _mm_<op>_<suffix>

  <suffix>  Vector type  Element type
  epi8      __m128i      int8_t
  epi16     __m128i      int16_t
  epi32     __m128i      int32_t
  epi64     __m128i      int64_t
  ps        __m128       float
  pd        __m128d      double


Example: Loading and Storing

Load/store example using unaligned accesses (needs #include <pmmintrin.h>):

    #include <assert.h>
    #include <stddef.h>
    #include <pmmintrin.h>

    static void
    my_memcpy(char *dst, const char *src, size_t len)
    {
        /* Assume that length is an even multiple of the
         * vector size */
        assert((len & 0xF) == 0);
        for (size_t i = 0; i < len; i += 16) {
            /* Unaligned 16-byte load and store */
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_storeu_si128((__m128i *)(dst + i), v);
        }
    }


A Small Vectorization Tutorial

  1. Start with a simple serial version of your algorithm
  2. Remove conditional control flow
  3. Unroll loops
  4. Vectorize!

Step 1: Start with a simple serial version of your algorithm

    static int
    count(const uint32_t *data, size_t len)
    {
        int c = 0;
        for (size_t i = 0; i < len; i++)
            if (data[i] == 0)
                c++;
        return c;
    }

Step 2: Remove conditional control flow (function body only)

    int c = 0;
    for (size_t i = 0; i < len; i++)
        c += (data[i] == 0) ? 1 : 0;
    return c;

Step 3: Unroll loops (function body only)

    int c = 0;
    /* Assume that len is a multiple of 4 */
    assert(!(len & 0x3));
    for (size_t i = 0; i < len; i += 4)
        c += ((data[i + 0] == 0) ? 1 : 0) +
             ((data[i + 1] == 0) ? 1 : 0) +
             ((data[i + 2] == 0) ? 1 : 0) +
             ((data[i + 3] == 0) ? 1 : 0);
    return c;

Step 4: Vectorize! (function body only)

    __m128i c = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi32(1);
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < len; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        /* All-ones in every lane where data is zero */
        const __m128i cond = _mm_cmpeq_epi32(v, zero);
        /* Mask down to 1 or 0 and accumulate per lane */
        c = _mm_add_epi32(c, _mm_and_si128(cond, one));
    }
    /* _mm_extract_epi32 is an SSE4.1 intrinsic (smmintrin.h) */
    return _mm_extract_epi32(c, 0) +
           _mm_extract_epi32(c, 1) +
           _mm_extract_epi32(c, 2) +
           _mm_extract_epi32(c, 3);


Common Error Sources

Unaligned memory accesses
    • Causes a Segmentation fault
    • May be due to an unintentional memory operand
    • Can be hard to spot in memory debuggers

Unsupported SSE instructions
    • Causes an Illegal instruction error
    • GCC may automatically emit SSE instructions if SSE has been enabled on the command line


Where to go from here

Intel
    • C++ Compiler Manual
    • Optimization Reference Manual
    • Intel 64 and IA-32 Architectures Software Developer's Manual (vol. 1 & 2)

AMD
    • Software Optimization Guide for AMD Family 10h
    • AMD64 Architecture Programmer's Manual (vol. 1 & 3)

The GCC manual


Important dates

Groups:
  Prep.  Room 1515, now-17:00
  A      2011-11-21, Room 1412, 08:15-12:00
  B      2011-11-21, Room 1412, 13:15-17:00
  C      2011-11-22, Room 1412, 08:15-12:00

Deadline: 2011-11-28 15:14


Summary

And remember.
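
The vectorization tutorial assumes that len is a multiple of 4 and uses an SSE4.1 intrinsic to read out the lanes. The sketch below is one possible variation, not the lab solution: it stays within SSE2 by storing the accumulator to a small array, and it finishes any leftover elements with a scalar loop. The function names count_scalar and count_sse2 are made up for this example.

```c
#include <emmintrin.h> /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference, same as step 1 of the tutorial. */
static int count_scalar(const uint32_t *data, size_t len)
{
    int c = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == 0)
            c++;
    return c;
}

/* Vectorized count of zero elements; len may be any value. */
static int count_sse2(const uint32_t *data, size_t len)
{
    const __m128i one = _mm_set1_epi32(1);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    uint32_t lanes[4];
    size_t i = 0;
    int c;

    /* Main loop: four elements per iteration, unaligned loads. */
    for (; i + 4 <= len; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i cond = _mm_cmpeq_epi32(v, zero); /* all-ones where zero */
        acc = _mm_add_epi32(acc, _mm_and_si128(cond, one));
    }

    /* Horizontal sum via memory instead of _mm_extract_epi32. */
    _mm_storeu_si128((__m128i *)lanes, acc);
    c = (int)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);

    /* Scalar tail for the last len % 4 elements. */
    for (; i < len; i++)
        c += (data[i] == 0) ? 1 : 0;
    return c;
}
```

Comparing count_sse2 against count_scalar on the same input is a cheap way to catch mistakes while vectorizing.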
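
The CPUID checks listed under "Detecting SSE" can be sketched with the cpuid.h helper header shipped with GCC (also available in Clang). This is a rough illustration for x86 only: the struct sse_support type and the detect_sse function are names invented here, and the bit positions are the standard CPUID function 1 feature bits (EDX:25 SSE, EDX:26 SSE2, ECX:0 SSE3, ECX:9 SSSE3, ECX:19 SSE4.1, ECX:20 SSE4.2).

```c
#include <cpuid.h> /* GCC/Clang helper header, x86 only */

/* Hypothetical result type for this sketch; each field is 0 or 1. */
struct sse_support {
    int sse, sse2, sse3, ssse3, sse4_1, sse4_2;
};

static struct sse_support detect_sse(void)
{
    struct sse_support s = {0, 0, 0, 0, 0, 0};
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested CPUID function is not
     * available, which covers steps 1 and 2 of the slide. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return s;

    /* Step 3: feature bits reported by CPUID function 1. */
    s.sse    = (edx >> 25) & 1;
    s.sse2   = (edx >> 26) & 1;
    s.sse3   = (ecx >> 0) & 1;
    s.ssse3  = (ecx >> 9) & 1;
    s.sse4_1 = (ecx >> 19) & 1;
    s.sse4_2 = (ecx >> 20) & 1;
    return s;
}
```

A program could call detect_sse once at startup and fall back to a scalar code path when the needed extension is missing, which avoids the Illegal instruction error mentioned under "Common Error Sources".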
