
Yet Another Survey on SIMD Instructions

Armando Faz Hernández [email protected]

Computer Science Department, UNICAMP

Outline

Introduction

SIMD for multimedia

ARM architecture

Implementation aspects

Auto vectorization

Concluding remarks


RISC Computers

Around 1980, RISC computers established a milestone in computer architecture design.
• Provide pipelined execution.
• Extract parallelism among instructions.
• Introduced Instruction-Level Parallelism (ILP).
The performance gained by RISC processors was limited by the applications themselves.

Flynn Taxonomy

Flynn categorized computers according to how data and/or instructions are processed. Programs can be seen as a stream of instructions applied over a stream of data.

The four categories:

• Single Instruction Single Data (SISD)
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Single Data (MISD)
• Multiple Instruction Multiple Data (MIMD)

First SIMD approach

Since the 1970s, vector architectures appeared as the first implementation of SIMD processing: the Illiac IV from the University of Illinois, the CDC Star, and the ASC from Texas Instruments.

• Process more than 64 elements per operation.

• Many replicated functional units.

• The Cray series of computers was a successful implementation.

Illiac IV was able to compute 128 32-bit multiplications in 625 ns.

Survey content

1. A description of SIMD instruction sets for multimedia applications, for desktop processors (Intel and AMD) and for low-power architectures such as ARM.
2. We show how to exploit parallelism in a SIMD fashion.
3. Finally, we go further, exploring some tools that enable auto vectorization of code.


What is multimedia?

Multimedia

Image and sound processing frequently involves performing the same instruction on a set of short data types, in particular 8-bit pixels and 16-bit audio samples. Adding a dedicated functional unit is expensive.

A cheaper solution is partitioning the carry chains of a 64-bit ALU.

Multimedia in 1995

[Figure: an IBM computer from 1995 compared with a current computer from 2010]

MultiMedia eXtensions

MMX was released in 1997. It introduced 57 integer instructions and a register set MM0-MM7.

MMX can process operations over 8-bit, 16-bit and 32-bit vectors.

3DNow!

The MMX registers are aliased onto the x87 FPU registers, so MMX can work in integer mode or in FPU mode, but not both at once. Switching between the modes incurs performance losses. In 1998, AMD released the 3DNow! technology, which uses the same register set as MMX but adds floating-point operations.

3DNow! was not popular enough to sustain future development, and in 2010 AMD decided to deprecate it.

Streaming SIMD Extensions

Intel identified the problems of MMX and decided to provide a new set of registers, known as XMM, and a new instruction set called Streaming SIMD Extensions (SSE).
• XMM registers are 128 bits wide.
• SSE contains 70 new instructions for floating-point arithmetic.
• SSE instructions can compute up to four 32-bit floating-point operations at once.
In 2000, AMD announced the x86-64 extension, which doubles the number of XMM registers (XMM0-XMM15).

SSE2

The second iteration of SSE, called SSE2, adds 140 new instructions for double-precision floating-point processing. This release was focused on 3D games and computer-aided design applications.

Although SSE2 operates over four elements, its performance was roughly the same as MMX, which operates on just two elements. This loss of performance is due to accesses to misaligned memory addresses.

SSE3

In order to solve the performance issues caused by misaligned data, SSE3 incorporates new instructions to load data from unaligned memory addresses while minimizing timing penalties.

Supplemental Streaming SIMD Extensions 3 (SSSE3) was released in 2006, adding new instructions such as multiply-and-add and vector alignment.

SSE4

SSE4 comprises 54 new instructions (47 in SSE4.1 and 7 in SSE4.2), including instructions dedicated to string processing. It also includes instructions to perform population count and to compute the CRC-32 error-detecting code.

Advanced Vector Extensions

Intel decided to move computations to wider registers and introduced the Advanced Vector Extensions (AVX). This technology involves:
• Sixteen 256-bit registers, called YMM0-YMM15.
• The ability to write three-operand code in assembly listings.
• The VEX encoding scheme, which enlarges the space of operation codes.
• Support for the legacy SSEx instructions through the VEX prefix.

Advanced Vector Extensions 2

The second version of AVX, released in 2012, extends many integer operations to the 256-bit registers.

AVX2 supports gather operations to load registers from non-contiguous memory locations.

SSE5, XOP, FMA4

There was great confusion about the future direction of new instruction sets: both Intel and AMD repeatedly changed their proposals around SSE5.

Bulldozer, AMD's latest micro-architecture, implements the XOP and FMA4 instruction sets, and is also compatible with AVX.

Piledriver and Haswell are the next micro-architectures from AMD and Intel, respectively. They will provide more multiply-and-add instructions for both floating-point and integer operations.


ARM

ARM is a low-power architecture widely distributed in devices such as routers, tablets and cell phones, and recently integrated alongside GPUs.
• ARM is a 32-bit architecture with a pipelined processor.
• Most instructions are executed in just one clock cycle.
• All instructions are conditionally executed.
• Native ARM instructions have fixed-size encodings.

Thumb and Thumb-2

Fixed-size encodings result in fast instruction decoding but large binary programs.

Thumb is a variable-length encoding scheme proposed in 1994. Thumb can encode a subset of the ARM instructions in 16-bit operation codes.

The shorter encodings left out conditional instruction execution.

Thumb-2 emerged and solved this issue, allowing variable-size encodings and supporting conditional execution through the IT instruction.

SIMD on ARM

The first implementation of SIMD processing on an ARM processor is the Vector Floating Point unit (VFP), a coprocessor that extends the ARM instruction set.

VFP allows the execution of vector instructions; however, the FPU processes each element of the vector sequentially.

This is not real SIMD processing!

NEON

ARM decided to integrate the Advanced SIMD extension, also known as NEON.

The NEON unit contains 16 128-bit registers and processes packed SIMD operations over 8-, 16- and 32-bit elements.

Its registers can be accessed through vector load/store instructions.



Intrinsics

To facilitate the use of SIMD instructions, intrinsics provide a high-level layer of C/C++ function definitions that enables vectorized execution without knowledge of assembler mnemonics.

Operation    Vector intrinsics
Arithmetic   _mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_sqrt_pd
Logical      _mm_and_si128, _mm_or_si128, _mm_xor_si128
Permutation  _mm_shuffle_pd, _mm_unpackhi_pd, _mm_unpacklo_pd

Application 1: Sound Processing

An FIR filter is used to filter speech signals in modern voice coders and in many other signal-processing areas.

An M-tap filter h[0, ..., M−1], applied to an input sequence x[0, ..., N−1], generates an output sequence y[0, ..., N−1], as described by:

    y(n) = Σ_{i=0}^{M−1} h(i) · x(n − i)


Application 2: Conditional Vector Execution

Sequential code:

for (i = 0; i < N; i++) {
    if (a[i] < b[i])
        c[i] = a[i];
    else
        c[i] = a[i] * b[i];
}

Vectorized code:

for (i = 0; i < N; i += 4) {
    A    = _mm_loadu_ps(&a[i]);
    B    = _mm_loadu_ps(&b[i]);
    C    = _mm_mul_ps(A, B);
    mask = _mm_cmplt_ps(A, B);
    C    = _mm_blendv_ps(C, A, mask);  /* SSE4.1 variable blend: take A where a[i] < b[i] */
    _mm_storeu_ps(&c[i], C);
}


Code Generation

Given the diversity of computers and the architectural differences between them, SIMD development should target many of these architectures.

At compile time, compilers provide flags to specify which technology is available:

$ gcc -o program_seq -c program.c
$ gcc -o program_sse3 -c program.c -msse3
$ gcc -o program_sse4.2 -c program.c -msse4.2
$ gcc -o program_avx -c program.c -mavx

At run time, developers should produce architecture-aware programs, i.e. programs that can decide which code path is best for the machine they run on.

Using the CPUID instruction, a program can query the machine's resources and then execute specialized code to get maximal performance.

Advanced Compiler Tools

How do I know if a loop can be vectorized?

The Intel Compiler can produce a report of loop vectorization:

$ icl /Qvec-report MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

There are several reasons why a loop cannot be vectorized, such as vector dependences, overlapping loops, or non-unit-stride access to a vector.

The restrict keyword hints the compiler that the memory referenced by the pointers a, b and c does not overlap.

void foo(float* restrict a, float* restrict b,
         float* restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * c[i];
    }
}

Another way to hint the compiler that a loop may be vectorizable is through #pragma directives.

void vec_always(int *a, int *b, int m) {
    #pragma vector always
    for (int i = 0; i <= m; i++)
        a[32*i] = b[99*i];
}

Here #pragma vector always overrides the compiler's efficiency heuristics when it decides whether to vectorize.

#pragma ivdep tells the compiler to ignore assumed vector dependences; it should be used only when the loop dependences are safe to ignore.

void ignore_vec_dep(int *a, int k, int c, int m) {
    #pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

Without #pragma ivdep, the loop is not vectorized because the value of k is unknown at compile time.


Concluding Remarks

1. Since SIMD processing emerged in the mid-1990s, it has steadily improved multimedia applications.
2. The latest desktop and mobile processors have dedicated functional units to process SIMD instructions.
3. Every new processor release comes with new and faster instructions.
4. The use of intrinsics enables a friendly approach to SIMD operations.
5. Research on compiler engines yields heuristics that detect potentially parallel fragments of code.
