
Yet Another Survey on SIMD Instructions

Armando Faz Hernández [email protected]

Computer Science Department, UNICAMP

Outline

Introduction

SIMD for multimedia

ARM architecture

Implementation aspects

Auto vectorization

Concluding remarks


RISC Computers

Around 1980, RISC computers established a milestone in computer architecture design.
• Provide pipelined execution.
• Extract parallelism among instructions.
• Introduced Instruction-Level Parallelism (ILP).
The performance gained by RISC processors was limited by the applications themselves.

Flynn Taxonomy

Flynn categorized computers according to how data and/or instructions are processed. Programs can be seen as a stream of instructions applied over a stream of data.

The four categories:

• Single Instruction Single Data (SISD)
• Single Instruction Multiple Data (SIMD)
• Multiple Instruction Single Data (MISD)
• Multiple Instruction Multiple Data (MIMD)

First SIMD approach

Since the 1970s, vector architectures appeared as the first implementation of SIMD processing: the Illiac IV from the University of Illinois, the CDC Star, and the ASC from Texas Instruments.

• Process more than 64 elements per operation.

• Many replicated functional units.

• The Cray series of computers was a successful implementation.

Illiac IV was able to compute 128 32-bit multiplications in 625 ns.

Survey content

1. A description of SIMD instruction sets for multimedia applications, for desktop processors (Intel and AMD) and for low-power architectures such as ARM.
2. We show how to exploit parallelism in a SIMD fashion.
3. Finally, we go further, exploring some tools that enable auto vectorization of code.


What is multimedia?

Multimedia

Image and sound processing frequently involves performing the same instruction on a set of short data types, in particular 8-bit pixels and 16-bit audio samples. Adding a dedicated functional unit is expensive.

A cheaper solution is partitioning the carry chains of a 64-bit ALU.

Multimedia in 1995

[Figure: an IBM computer from 1995 compared with a current computer from 2010]

MultiMedia eXtensions

MMX was released in 1997. It introduced 57 integer instructions and a register set MM0-MM7.

MMX can process operations over 8-bit, 16-bit and 32-bit vectors.

3DNow!

The MMX registers are aliased onto the x87 FPU registers, so MMX can work in integer mode or in FPU mode, but not both at once. Switching between the modes incurs performance losses. In 1998, AMD released the 3DNow! technology, which uses the same register set as MMX but adds floating-point operations.

3DNow! was not popular enough to sustain future development, and in 2010 AMD decided to deprecate it.

Streaming SIMD Extensions

Intel identified the problems of MMX and decided to provide a new set of registers, known as XMM, and a new instruction set called Streaming SIMD Extensions (SSE).
• XMM registers are 128 bits wide.
• SSE contains 70 new instructions for floating-point arithmetic.
• SSE instructions can compute up to four 32-bit floating-point operations at once.
In 2000, AMD announced the x86-64 extension, which doubles the number of XMM registers (XMM0-XMM15).

SSE2

The second iteration of SSE, called SSE2, adds 140 new instructions for double-precision floating-point processing. This release was focused on 3D games and computer-aided design applications.

Although SSE2 operates over four elements, its performance was roughly the same as MMX, which operates on just two elements. This loss of performance is due to accesses to misaligned memory addresses.

SSE3

In order to solve the performance issues caused by misaligned data, SSE3 incorporates new instructions to load data from unaligned memory addresses while minimizing timing penalties.

Supplemental Streaming SIMD Extensions 3 (SSSE3) was released in 2006, adding new instructions such as multiply-and-add and vector alignment.

SSE4

SSE4 comprises 54 new instructions (47 in SSE4.1 and 7 in SSE4.2), including instructions dedicated to string processing. It also includes instructions to perform population count and to compute the CRC-32 error-detecting code.

Advanced Vector Extensions

Intel decided to move computations to wider registers and introduced the Advanced Vector Extensions (AVX). This technology involves:
• Sixteen 256-bit registers, called YMM0-YMM15.
• The ability to write three-operand code in assembly listings.
• The VEX encoding scheme, which enlarges the space of operation codes.
• Support for the legacy SSEx instructions through the VEX prefix.

Advanced Vector Extensions 2

The second version of AVX, released in 2012, extends many integer operations to the 256-bit registers.

AVX2 supports gather operations to load registers from non-contiguous memory locations.

SSE5, XOP, FMA4

There was great confusion about the future direction of new instruction sets: both Intel and AMD repeatedly changed their proposals around SSE5.

Bulldozer, AMD's latest micro-architecture, implements the XOP and FMA4 instruction sets, and is also compatible with AVX.

Piledriver and Haswell are the next micro-architectures from AMD and Intel, respectively. They will provide more multiply-and-add instructions for both floating-point and integer operations.


ARM

ARM is a low-power architecture widely distributed in devices such as routers, tablets and cell phones, and recently integrated alongside GPUs.
• ARM is a 32-bit architecture with a pipelined processor.
• Most instructions are executed in just one clock cycle.
• All instructions are conditionally executed.
• Native ARM instructions have fixed-size encodings.

Thumb and Thumb-2

Fixed-size encodings result in fast instruction decoding but large binary programs.

Thumb is a variable-length encoding scheme proposed in 1994. Thumb can encode a subset of the ARM instructions in 16-bit operation codes.

The shorter encodings left out conditional instruction execution.

Thumb-2 emerged and solved this issue, allowing variable-size encodings and supporting conditional execution through the IT instruction.

SIMD on ARM

The first implementation of SIMD processing on an ARM processor is the Vector Floating Point unit (VFP), a coprocessor that extends the ARM instruction set.

VFP allows the execution of vector instructions; however, the FPU processes each element of the vector sequentially.

This is not real SIMD processing!

NEON

ARM decided to integrate the Advanced SIMD extension, also known as NEON.

The NEON unit contains 16 128-bit registers and processes packed SIMD operations over 8-, 16- and 32-bit elements.

Its registers can be accessed through vector load/store instructions.



Intrinsics

To facilitate the use of SIMD instructions, intrinsics provide a high-level layer of C/C++ function definitions that enables vectorized execution without knowledge of assembler mnemonics.

Operation    Vector intrinsics
Arithmetic   _mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_sqrt_pd
Logical      _mm_and_si128, _mm_or_si128, _mm_xor_si128
Permutation  _mm_shuffle_pd, _mm_unpackhi_pd, _mm_unpacklo_pd

Application 1: Sound Processing

An FIR filter is used to filter speech signals in modern voice coders and in many other signal-processing areas.

An M-tap filter h[0, ..., M−1], applied to an input sequence x[0, ..., N−1], generates an output sequence y[0, ..., N−1], as described by:

    y(n) = Σ_{i=0}^{M−1} h(i) · x(n − i)


Application 2: Conditional Vector Execution

Sequential code:

for (i = 0; i < N; i++) {
    if (a[i] < b[i])
        c[i] = a[i];
    else
        c[i] = a[i] * b[i];
}

Vectorized code:

for (i = 0; i < N; i += 4) {
    A    = _mm_loadu_ps(&a[i]);
    B    = _mm_loadu_ps(&b[i]);
    C    = _mm_mul_ps(A, B);
    mask = _mm_cmplt_ps(A, B);
    C    = _mm_blendv_ps(C, A, mask);  /* SSE4.1 variable blend: take A where a[i] < b[i] */
    _mm_storeu_ps(&c[i], C);
}


Code Generation

Given the diversity of computers and the architectural differences between them, SIMD development should target many of these architectures.

At compile time, compilers provide flags to specify which technology is available:

$ gcc -o program_seq -c program.c
$ gcc -o program_sse3 -c program.c -msse3
$ gcc -o program_sse4.2 -c program.c -msse4.2
$ gcc -o program_avx -c program.c -mavx

At run time, developers should produce architecture-aware programs, i.e. programs that can decide which code path is best for the machine they run on.

Using the CPUID instruction, a program can query the machine's resources and then execute specialized code to get maximal performance.

Advanced Compiler Tools

How do I know if a loop can be vectorized?

The Intel Compiler can produce a report of loop vectorization:

$ icl /Qvec-report MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

There are several reasons why a loop cannot be vectorized, such as vector dependences, overlapping loops, or non-unit-stride access to a vector.

The restrict keyword hints the compiler that the memory referenced by the pointers a, b and c does not overlap.

void foo(float* restrict a, float* restrict b,
         float* restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * c[i];
    }
}

Another way to hint the compiler that a loop may be vectorizable is through #pragma directives.

void vec_always(int *a, int *b, int m) {
    #pragma vector always
    for (int i = 0; i <= m; i++)
        a[32*i] = b[99*i];
}

Here #pragma vector always overrides the compiler's efficiency heuristics when it decides whether to vectorize.

#pragma ivdep tells the compiler to ignore assumed vector dependences; it should be used only when the loop dependences are safe to ignore.

void ignore_vec_dep(int *a, int k, int c, int m) {
    #pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

Without #pragma ivdep, the loop is not vectorized because the value of k is unknown at compile time.


Concluding Remarks

1. Since SIMD processing emerged in the mid-1990s, it has steadily improved multimedia applications.
2. The latest desktop and mobile processors have dedicated functional units to process SIMD instructions.
3. Every new processor release comes with new and faster instructions.
4. The use of intrinsics enables a friendly approach to SIMD operations.
5. Research on compiler engines yields heuristics that detect potentially parallel fragments of code.
