A Study on SIMD Architecture

Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian
Department of Electrical Engineering and Computer Science
University of Central Florida
Email: gsolmaz, rrahmati, [email protected]

Abstract: Single instruction, multiple data (SIMD) architectures became popular with the growing demand for data streaming applications such as real-time games and video processing. Since modern processors in desktop computers support SIMD instructions with various implementations, we may use these machines to optimize applications in which we process multiple data with single instructions.

In this project, we study the use of SIMD architectures and their effects on the performance of specific applications. We choose matrix multiplication and Advanced Encryption Standard (AES) encryption algorithms and modify them to exploit SIMD instructions. The performance improvements obtained with the SIMD instructions are analyzed and validated by an experimental study.

I. INTRODUCTION

Performance in a computer system is defined by the amount of useful work accomplished by the computer system compared to the time and the resources used. There are several ways to improve the performance of computer systems. Researchers from several areas, ranging from algorithm and compiler design to OS and hardware design, are striving to achieve higher performance.

Nowadays, most CPU designs contain at least some vector processing instructions, typically referred to as SIMD, which operate on a few vector elements per clock cycle in a pipeline. These vector processors run multiple mathematical operations on multiple data elements simultaneously. Thus, they affect the performance equation. From the Iron law, we know that the execution time of a program, which is the inverse of its performance, is given by the formula $\text{Time} = IC \cdot CPI \cdot CT$, where IC is the number of instructions (instruction count), CPI is the number of cycles per instruction and CT is the cycle time of the processor. The use of a SIMD architecture changes the IC and CPI values of a program.

With the new enhancements in processor architectures, modern processors have started supporting 256-bit vector implementations. Moreover, the interest of the computer architecture research community in SIMD architectures is increasing. In the near future, new powerful machines are expected to make new high-performance multiple-data applications available with the enhanced SIMD architectures.

In this study, we target the SIMD architecture and its effects on the performance of some designed cases. Then, we analyze the performance of our approach by evaluating the speed-up values of the programs using the SIMD architecture. One of the problems we face is configuring compilers to emit vector instructions in the binary executable file. The other challenge is designing a case study for calculating the speed-up. For this reason, matrix multiplication and the AES encryption algorithms are chosen as the best candidates, since in both we need to do several linear binary operations on multiple data. For the implementation part of this research, we use a single machine with minimum load as the test platform for measuring the performance.

As a group of graduate students taking the CDA 5106 course, we try to get involved in this challenge by doing research on the SIMD CPU architecture. In this project, we worked in almost all phases together as a team, having regular meetings before and after the milestones. To sum up, Gürkan worked on the implementation and documentation, prepared the final report, and researched the history and related studies. Rouhollah worked on the implementation of the optimized algorithms and on the documentation phases. He also conducted the experiments. Mohammad proposed the main idea and worked on the proposal, the documentation for the benchmarks, and the analysis of the results.

The rest of the paper is organized as follows. Section II briefly summarizes the history of SIMD architectures and the related work. We describe SIMD and some architectures in Section III. We provide a detailed description of our benchmarks in Section IV. The results of the experiments are presented in Section V. We finally conclude in Section VI.

II. RELATED WORK

Let us briefly discuss the history of the SIMD architectures and the related work in the literature. The first use of SIMD instructions was in the early 1970s [1]. As an example, they were used in the CDC STAR-100 and TI ASC machines, which could do the same operation on a bunch of data. This architecture became especially common when Cray Inc. used it in their supercomputers. However, the vector processing used in these machines is nowadays considered different from that of SIMD machines. The Thinking Machines CM-1 and CM-2, which are considered massively parallel processing-style supercomputers [2], started a new era in using SIMD for processing data in parallel. However, current research focuses on using SIMD instructions in desktop computers. Many tasks that desktop computers perform these days, such as video processing and real-time gaming, need to do the same operation on a bunch of data. So, companies tried to use this architecture in desktops. As one of the earliest attempts, Sun Microsystems introduced SIMD integer instructions in the VIS (Visual Instruction Set) extensions of the UltraSPARC I microprocessor in 1995.

MIPS introduced MDMX (MIPS Digital Media eXtension). Intel made SIMD widely used by introducing the MMX extensions to the x86 architecture in 1996. Then Motorola introduced the AltiVec system in its PowerPC processors, which was also used in IBM's POWER systems. This prompted Intel's response, which was the introduction of SSE. These days SSE and its extensions are used more than the others.

There are various studies in the literature, conducted by research groups or companies, which focus on hardware. Holzer-Graf et al. [3] studied efficient vector implementations of AES-based designs. In that paper, three different vector implementations are analyzed and their performance is compared. The use of chip multiprocessing and the Cell Broadband Engine is described by Gschwind [4].

III. SIMD ARCHITECTURE

To understand the improvements being made on SIMD architectures, let us first start with the older SIMD architectures. Earlier versions of SIMD architectures were proposed for supercomputers [1], which have a number of processing units (elements) and a control unit. In these machines, the processing units (PUs) are the pieces that perform the computation, while the control unit controls this array of processing elements (PEs). The single control unit is generally responsible for reading the instructions, decoding the instructions and sending control signals to the PEs. Data are supplied to the PEs by a memory. In this architecture, the number of data paths from the memory to the PEs is equal to the number of PEs. The supercomputers also have specific interconnection networks, which provide flexible, high-performance data transfer from and to the PEs. They also have an I/O system, which differs from one machine to another. Figure 1 illustrates processor array architectures in supercomputers, including memory modules, interconnection network, PEs, control unit and the I/O system. In some supercomputers, processing elements are controlled by a host computer, which is illustrated in Figure 2. Supercomputers categorized as multiple instruction, multiple data (MIMD) in Flynn's taxonomy became popular, and this reduced the interest in SIMD machines for some period of time.

Fig. 1. Processor array in a supercomputer. Taken from [5].

Fig. 2. The relationship between processor array and the host computer. Taken from [5].

Enhancements in desktop computers led to powerful machines which are strong enough to handle applications such as video processing. Therefore, SIMD architectures became popular again in the 1990s. SIMD architectures exploit a property of the data stream called "data parallelism". SIMD computing is also known as vector processing, considering the rows of data coming to the processor as vectors of data. It is almost impossible to have applications which are purely parallel, so the pure use of SIMD computing is not possible. Hence, in applications of SIMD computing, the programs are written for single instruction, single data (SISD) machines and include SIMD instructions. The proportion of the sequential part and the SIMD part in the program determines the maximum speed-up according to Amdahl's law. Figure 3 shows the data exploitation by a SIMD machine which processes 3 vectors at the same time, compared to an SISD machine which is able to process one row of data. The length of the vectors in a SIMD processor determines the number of elements of a given data type that fit in one vector. For instance, a 128-bit vector implementation in a processor allows us to do four-way single-precision floating-point operations.

Fig. 3. Data processing with SISD vs. SIMD. Taken from [6].

We may categorize the SIMD operations by their types. One obvious operation type is the intra-element arithmetic and non-arithmetic operations. Addition, multiplication, and subtraction are example arithmetic operations, while AND, XOR, and OR are examples of non-arithmetic operations. Figure 4 illustrates the intra-element operations with two source vectors VA and VB and a destination vector VT. Each of these vectors contains four 32-bit elements. This means we can operate on 2 vectors, each of which holds 4 integers or floating-point values.

Fig. 4. Intra-element arithmetic and non-arithmetic operations. Taken from [6].

Fig. 5. Inter-element operations between the elements of a vector. Taken from [6].

Fig. 6. AltiVec architecture with 4 distinct registers. Taken from [6].

Fig. 7. Vector permutation with AltiVec architecture. Taken from [6].

They had a floating point unit along with the SIMD units in their processors. They also had a switch mechanism in their processors, which allows the processor to change its mode
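
As a rough illustration of the intra-element operations of Figure 4, the following minimal Python sketch models two source vectors VA and VB with four 32-bit lanes each and computes a destination vector VT lane by lane. It is a scalar model of the lane-wise semantics only, not real SIMD code. The function name intra_element, the unsigned wrap-around behaviour and the sample values are our own illustrative assumptions and are not taken from any particular instruction set.

```python
# Scalar model of intra-element (lane-wise) SIMD operations.
# Assumption: each vector holds four unsigned 32-bit lanes (one 128-bit register).
import operator

LANES = 4            # four 32-bit elements per 128-bit vector
MASK = 0xFFFFFFFF    # keeps every lane within 32 bits (wrap-around on overflow)


def intra_element(op, va, vb):
    """Apply op to matching lanes of va and vb and return the result vector vt."""
    if len(va) != LANES or len(vb) != LANES:
        raise ValueError("each vector must hold exactly four lanes")
    # Lane i of the result depends only on lane i of each source vector.
    return [op(a, b) & MASK for a, b in zip(va, vb)]


# Hypothetical sample data for the two source vectors.
va = [1, 2, 3, 0xFFFFFFFF]
vb = [10, 20, 30, 1]

vt_add = intra_element(operator.add, va, vb)  # arithmetic operation
vt_xor = intra_element(operator.xor, va, vb)  # non-arithmetic operation

print("VT (add):", vt_add)
print("VT (xor):", vt_xor)
```

In this model a single call stands for a single vector instruction: one operation is named once and applied to all four lanes, which is the property that lets a SIMD machine reduce the instruction count of a data-parallel loop.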
