A Study on SIMD Architecture

Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian
Department of Electrical Engineering and Computer Science, University of Central Florida
Email: {gsolmaz,rrahmati,mohammad}@knights.ucf.edu

Abstract— Single instruction, multiple data (SIMD) architectures became popular with the increasing demand of data-streaming applications such as real-time games and video processing. Since modern desktop processors support SIMD instructions in various implementations, we may use these machines to optimize applications in which a single instruction operates on multiple data. In this project, we study SIMD architectures and their effects on the performance of specific applications. We choose the matrix multiplication and Advanced Encryption Standard (AES) encryption algorithms and modify them to exploit SIMD instructions. The performance improvements obtained using SIMD instructions are analyzed and validated by an experimental study.

I. INTRODUCTION

Performance of a computer system is defined by the amount of useful work accomplished by the system compared to the time and resources used. There are several aspects to improving the performance of computer systems, and researchers from several areas, ranging from algorithm, compiler, and OS designers to hardware designers, are striving to achieve higher performance.

Nowadays, most CPU designs contain at least some vector processing instructions, typically referred to as SIMD, which operate on several vector elements per clock cycle. These vector units run a mathematical operation on multiple data elements simultaneously and therefore affect the performance equation. From the Iron law, we know that the execution time of a program is given by T = IC · CPI · CT, where IC is the number of executed instructions (instruction count), CPI is the average number of cycles per instruction, and CT is the cycle time of the processor. The use of a SIMD architecture changes the IC and CPI values of a program.

With new enhancements in processor architectures, current processors have started supporting 256-bit vector implementations, and the interest of the research community in SIMD architectures is increasing. In the near future, new powerful machines are expected to make new high-performance multiple-data applications available through the enhanced SIMD architectures.

In this study, we target the SIMD architecture and its effects on the performance of selected cases. We analyze the performance of our approach by evaluating the speed-up of programs that use the SIMD architecture. One of the problems we face is configuring the compiler to emit vector instructions in the binary executable; the other challenge is designing case studies for measuring the speed-up. For this reason, the matrix multiplication and AES encryption algorithms are chosen as candidates, since both need to apply the same linear or binary operations to multiple data. For the implementation part of this research, we run the programs on a single machine with minimum load as the test platform for measuring performance.

As a group of graduate students taking the CDA 5106 course, we got involved in this challenge by doing research on SIMD CPU architecture. We worked on almost all phases together as a team, having regular meetings before and after the milestones. To sum up, Gürkan worked on the implementation and documentation, prepared the final report, and researched the history and related studies. Rouhollah worked on the implementation of the optimized algorithms and on documentation, and he also conducted the experiments. Mohammad proposed the main idea and worked on the proposal, the documentation of the benchmarks, and the analysis of the results.

The rest of the paper is organized as follows. Section II briefly summarizes the history of SIMD architectures and the related work. We describe SIMD and some architectures in Section III. We provide a detailed description of our benchmarks in Section IV. The results of the experiments are presented in Section V. We conclude in Section VI.

II. RELATED WORK

Let us briefly discuss the history of SIMD architectures and the related work in the literature. The first use of SIMD instructions was in the early 1970s [1]. For example, they were used in the CDC Star-100 and TI ASC machines, which could perform the same operation on a batch of data. The architecture became especially common when Cray used it in its supercomputers, although the vector processing used in the Cray machines is nowadays considered different from SIMD machines. The Thinking Machines CM-1 and CM-2, massively parallel supercomputers [2], started a new era in using SIMD to process data in parallel.

Current research, however, focuses on using SIMD instructions in desktop computers. Many of the tasks desktop computers perform these days, such as video processing and real-time gaming, need to apply the same operation to a batch of data, so companies tried to bring this architecture to desktops. In one of the earliest attempts, Sun Microsystems introduced SIMD integer instructions in the VIS (Visual Instruction Set) extensions of the UltraSPARC I in 1995. MIPS introduced MDMX (MIPS Digital Media eXtension). Intel made SIMD widely used by introducing the MMX extensions to the x86 architecture in 1996. Motorola then introduced the AltiVec system in its PowerPC processors, which was also used in IBM's POWER systems; this prompted Intel to respond with the introduction of SSE. These days, SSE and its extensions are used more than the others.

Fig. 3. Data processing with SISD vs. SIMD. Taken from [6].

Fig. 1. Processor array in a supercomputer. Taken from [5].

Fig. 2. The relationship between the processor array and the host computer. Taken from [5].

There are various studies in the literature, conducted by research groups or companies, which focus on hardware. Holzer-Graf et al. [3] studied efficient vector implementations of AES-based designs; three different vector implementations are analyzed and the performance of each is compared. The use of chip multiprocessing and the Cell Broadband Engine is described by Gschwind [4].

III. SIMD ARCHITECTURE

To understand the improvements being made to SIMD architectures, let us first start with the older ones. Early SIMD architectures were proposed for supercomputers [1] that have a number of processing units (elements) and a control unit. In these machines, the processing units (PUs) are the pieces that perform the computation, while the control unit controls the array of processing elements (PEs). The single control unit is generally responsible for reading instructions, decoding them, and sending control signals to the PEs. Data are supplied to the PEs by a memory; in this architecture, the number of data paths from the memory to the PEs is equal to the number of PEs. These supercomputers also have specific interconnection networks, which provide flexible, high-performance data transfer to and from the PEs. They also have an I/O system, which differs from one machine to another. Figure 1 illustrates the processor array architecture in supercomputers, including the memory modules, interconnection network, PEs, and I/O system. In some supercomputers, the processing elements are controlled by a host computer, as illustrated in Figure 2.

Supercomputers categorized as multiple instruction, multiple data (MIMD) in Flynn's taxonomy became popular, which reduced the interest in SIMD machines for some period of time. Enhancements in desktop computers then led to machines strong enough to handle applications such as video processing, so SIMD architectures became popular again in the 1990s.

SIMD architectures exploit a property of the data stream called data parallelism. SIMD computing is also known as vector processing, considering the rows of data coming to the processor as vectors of data. It is almost impossible to have applications that are purely parallel, so the pure use of SIMD computing is not possible. Hence, in SIMD computing applications, the programs are written for single instruction, single data (SISD) machines and include SIMD instructions. The proportion of the sequential part and the SIMD part of the program determines the maximum speed-up, according to Amdahl's law. Figure 3 shows the data exploitation by a SIMD machine that processes 3 vectors at the same time, compared to an SISD machine that can process one row of data. The vector length of a SIMD processor determines the number of elements of a given data type it can hold. For instance, a 128-bit vector implementation allows a processor to do four-way single-precision floating-point operations.

We may categorize SIMD operations by their types. One obvious operation type is intra-element arithmetic and non-arithmetic operations. Addition, multiplication, and subtraction are example arithmetic operations, while AND, XOR, and OR are examples of non-arithmetic operations. Figure 4 illustrates the intra-element operations with two source vectors VA and VB and a destination vector VT. Each of these vectors contains four 32-bit registers. This means we can operate on two vectors, each holding four integers or single-precision floating-point values; similarly, we can process two vectors with two double-precision values inside each of them. The other operation type is inter-element arithmetic and non-arithmetic operations, which take place between the elements of a single vector. Vector permutes and logical shifts are example inter-element operations (Figure 5).

Fig. 4. Intra-element arithmetic and non-arithmetic operations. Taken from [6].

Fig. 5. Inter-element operations between the elements of a vector. Taken from [6].

To understand modern SIMD architectures better, let us now briefly discuss some of them. We start with the AltiVec architecture, which is used by many companies including Apple. An illustration of the AltiVec architecture is shown in Figure 6. As one can see from this figure, there are four distinct sets of registers in AltiVec. Vectors VA and VB are the source vectors, while vector VT is the destination vector: the source registers hold the operands, and the destination registers hold the result. The additional vector VC is called the filter or modifier. This vector is useful for many operations, including vector permutation, in which VC holds the values that tell each element of VA and VB its new location in the destination vector VT. An illustration of this procedure can be seen in Figure 7.

Fig. 6. AltiVec architecture with 4 distinct registers. Taken from [6].

Fig. 7. Vector permutation with the AltiVec architecture. Taken from [6].

Intel introduced the MMX/SSE architecture, capable of SIMD operations on integers and floating-point values, with its MMX series of processors, and used the Katmai New Instructions (KNI) for the subsequent processors. As shown in Figure 9, eight new 128-bit registers, called SSE registers, were added. These processors had a floating-point unit along with the SIMD units, and also a mechanism that allows the processor to switch its mode from SISD to SIMD. Figure 8 shows the architecture of a 128-bit Intel MMX/SSE processor, which is considered a combination of two 64-bit processing units; this enables two different operations to be done at the same time. Nowadays, Intel's 256-bit SIMD architecture is available.

Here we finish our discussion of SIMD architectures, having given significant architecture examples, and come to the next level of the study. The main idea of this study is to use these architectures with convenient benchmarks to evaluate their performance improvements for various programs. Therefore, the benchmarks used in the study are described in detail in the next section.

IV. MATRIX MULTIPLICATION AND AES

There is no doubt that future processors will differ significantly from the current designs and will reshape the way we think about programming. SIMD is one of the most important advancements of modern CPUs, and this research deals with SIMD and how it can be used to increase the performance of programs. The main goal of our research is to use SIMD instruction sets to improve program performance. The first problem we encountered is that there are no SIMD-implemented benchmarks for assessing the performance, so we created our own tools to assess the improvement in performance.

Fig. 8. Intel MMX/SSE architecture. Taken from [6].

Fig. 10. Ordinary implementation of matrix multiplication.

Fig. 11. Matrix multiplication with just 3 SIMD instructions.

Fig. 9. 128-bit Intel MMX/SSE architecture with SSE registers. Taken from [6].

We now briefly describe the types of vector operations that were available for the implementation [7]:
• V ← V. Example: complement all elements.
• S ← V. Examples: min, max, sum.
• V ← V x V. Examples: vector addition, multiplication, division.
• V ← V x S. Examples: multiply or add a scalar to a vector.
• S ← V x V. Example: calculate an element of a matrix product.

We chose matrix multiplication and AES for the benchmarks, for two reasons. First, they are used in many applications on desktop computers and also on mobile devices these days. Second, they can be optimized with SIMD instructions because they apply the same instructions to a batch of data. We now briefly describe the two benchmarks.

A. Matrix multiplication

Matrix multiplication is one of the best candidates for optimization with SIMD instructions, because it deals with matrices, which consist of arrays. Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra. It forms the core of many important algorithms, including solvers of linear systems of equations, least-squares problems, and singular value and eigenvalue computations. Figure 10 shows a naive implementation of matrix multiplication with many scalar instructions. By applying SIMD instructions, the number of executed instructions decreases and we achieve a noticeable performance improvement. Figure 11 shows the listing of the code with the SIMD instruction set.

B. AES encryption algorithm

AES is the other candidate for implementation with SIMD instructions. It has vast applications everywhere in computer systems, from mobile devices to distributed data centers. AES is a block cipher algorithm that was selected by the U.S. National Institute of Standards and Technology (NIST) through a contest, from a list of five finalists that were themselves selected from an original list of more than 15 submissions. AES is supplanting the Data Encryption Standard (DES), and the later Triple DES, in many cryptography applications. The algorithm was designed by two Belgian cryptologists, Vincent Rijmen and Joan Daemen [8], whose surnames are reflected in the cipher's name.

The cipher has a variable block length and key length. The authors specify how to use keys with a length of 128, 192, or 256 bits to encrypt blocks with a length of 128, 192, or 256 bits (all nine combinations of key length and block length are possible), and both can easily be extended to further multiples of 32 bits. Figure 12 illustrates the flow chart of the algorithm that we use. AES can be implemented very efficiently on a wide range of processors; we have implemented it using SIMD instructions.

Fig. 12. Flow chart for the AES algorithm.

Fig. 14. Time to do a 512x512 multiplication, in milliseconds.

Fig. 13. Performance gains using streaming SIMD extensions. Taken from [9].

V. EXPERIMENTAL STUDY

A. Implementation

Basically, there are two ways to implement an algorithm using SIMD instructions. First, it is possible to implement the algorithm using normal instructions and use a compiler which automatically converts the instructions to SIMD instructions wherever possible. Second, one can optimize the algorithm by writing the SIMD instructions manually. For the first method, Microsoft Visual Studio is not able to do this task at this time, but Intel offers a compiler which is capable of doing it. However, the most optimized version is achieved by manually implementing the algorithm with the most appropriate instructions, so we chose the second method, which gives the best result.

We implemented the applications in C++ and used Microsoft Visual Studio to compile them. We disabled all the compiler optimization features to make sure that any improvement comes only from using SIMD instructions, not from compiler optimizations such as loop unrolling. To get an acceptable estimate of the number of cycles the algorithms need to execute, we closed all unnecessary processes and ran the implemented function 10,000,000 times. We then divided the total number of cycles by the number of executions of the function, which gives the average number of cycles needed to execute the algorithm.

Fig. 15. AES encryption (cycles/byte).

B. Results

The first benchmark was matrix multiplication. Figure 13 shows the results of its implementation by Intel. Most implementations of matrix multiplication do the task on fixed-dimension matrices. This makes the algorithm faster, because they avoid the loops, which have their own overhead and cannot be written using SIMD instructions. However, our implementation of matrix multiplication handles variable matrix dimensions; considering this point, our optimization is still comparable to the Intel implementation. Figure 14 compares the result of our SIMD implementation of naive matrix multiplication with the original algorithm; it is about 2 times faster. Figure 15 shows the result of our optimization of AES encryption, which is about 23 times faster.

Some explanation of these results is in order. First of all, wherever we replace a normal instruction with a SIMD one, we decrease the number of instructions by a factor of four. However, not all instructions are replaceable by SIMD instructions. In addition, on average, a SIMD instruction needs more cycles to execute than a normal instruction doing the same task. But, considering the Iron law, the decrease in the number of instructions affects the final performance more than the increase in the average number of cycles per instruction. So, successful optimization of an algorithm depends on two factors: first, how many instructions are replaceable by SIMD instructions, and second, what combination of instructions is used in the SIMD implementation. The second factor is important because different SIMD instructions need different numbers of cycles to execute.

VI. CONCLUSION

In this paper, we studied SIMD architectures. We summarized the history of these architectures and explained some of the new ones. We proposed the use of SIMD for two applications, matrix multiplication and AES. The operations performed for matrix multiplication and the AES algorithm were described in detail. The experimental results showed that the use of SIMD significantly improves the performance of both programs.

REFERENCES

[1] Wikipedia, "SIMD," 2013. [Online]. Available: http://en.wikipedia.org/wiki/SIMD [Accessed 15-April-2013].
[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 1998.
[3] S. Holzer-Graf, T. Krinninger, M. Pernull, M. Schläffer, P. Schwabe, D. Seywald, and W. Wieser, "Efficient vector implementations of AES-based designs: A case study and new implementations for Grøstl," Topics in Cryptology – CT-RSA, pp. 145–161, 2013.
[4] M. Gschwind, "Chip multiprocessing and the Cell Broadband Engine," Computing Frontiers, 2006.
[5] H. W. Lawson, B. Svensson, and L. Wanhammar, Parallel Processing in Industrial Real-Time Applications. Prentice Hall, 1992.
[6] J. Stokes, "SIMD architectures," Ars Technica, March 2000.
[7] Intel Corp., "Intel Advanced Vector Extensions Programming Reference," June 2011.
[8] J. Daemen and V. Rijmen, The Design of Rijndael: AES – The Advanced Encryption Standard. Springer, 2002.
[9] Intel Corp., "Streaming SIMD Extensions – Matrix Multiplication," AP-930, June 1999.