
Ujjwal Kumar, Dept. of IT, Gaya College, Gaya

• A vector or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items.
• Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms.

• Early Work
▪ Development started in the early 1960s at Westinghouse
▪ The goal of the Solomon project was to substantially increase arithmetic performance by using many simple co-processors under the control of a single master CPU
▪ Allowed a single algorithm to be applied to a large data set
• Dominated supercomputer design through the 1970s into the 1990s
• Cray platforms were the most notable vector supercomputers
▪ Cray-1: introduced in 1976
▪ Cray-2, Cray X-MP, Cray Y-MP
• Demise
▪ In the late 1990s, the price-to-performance ratio of conventional microprocessor designs increased drastically

• A vector processor is a CPU that implements an instruction set that operates on one-dimensional arrays, called vectors
• Vectors contain multiple data elements
• The number of data elements per vector is typically referred to as the vector length
• Both instructions and data are pipelined to reduce decoding time

• A vector is a set of scalar data items, all of the same type, stored in memory.
• A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations.
• Vector processing occurs when arithmetic or logical operations are applied to vectors.
• Vector processing offers a speedup of roughly 10 to 20 compared with scalar processing.

• A vector instruction allows the CPU to execute a single instruction simultaneously on multiple pieces of data, rather than by repetitive looping.
• Just as superscalar designs take advantage of parallelism in scalar operations, it is possible to take advantage of similar parallelism in vector code; thus, it makes sense to provide multiple vector processors in a system.
• Here the main issue is memory access.

• Memory-to-Memory Architecture (Traditional)
• Register-to-Register Architecture (Modern)

• For all vector operations, operands are fetched directly from main memory and then routed to the functional unit
• Results are written back to main memory
• The major reason for this architecture's demise was its large startup time

• All vector operations occur between vector registers.
• If necessary, operands are fetched from main memory into a set of vector registers (by the load-store unit).
• Includes all vector machines since the late 1980s:
▪ Convex, Cray, Fujitsu, Hitachi, NEC
• SIMD processors are based on this architecture.
• Typical vector operations:
▪ Add two vectors to produce a third
▪ Subtract two vectors to produce a third
▪ Multiply two vectors to produce a third
▪ Divide two vectors to produce a third
▪ Load a vector from memory
▪ Store a vector to memory
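As a minimal sketch (not from the slides): the plain C loop below is the kind of code a single vector add instruction replaces; with auto-vectorization enabled (e.g. gcc -O3), a compiler maps it onto vector load, vector add, and vector store instructions.

#include <stddef.h>

/* Scalar loop: one add per iteration. An auto-vectorizing compiler
   turns this into vector instructions that add many elements at once. */
void vec_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}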

• Vector Registers
▪ Typically 8-32 vector registers, each with 64-128 64-bit elements
▪ Each contains a vector of double-precision numbers
▪ Register size determines the maximum vector length
▪ Each includes at least 2 read ports and 1 write port
• Vector Functional Units (FUs)
▪ Fully pipelined; can start a new operation every cycle
▪ Perform arithmetic and logic operations
▪ Typically 4-8 different units
• Vector Load-Store Units (LSUs)
▪ Move vectors between memory and registers
• Scalar Registers
▪ Hold single elements for interconnecting FUs, LSUs, and registers

• Increased Memory Bandwidth
▪ Memory banks are used to reduce load/store latency
▪ Allow multiple simultaneous outstanding memory requests
• Strip Mining
▪ Generates code to handle vector operands whose size is less than or greater than the size of the vector registers (see the sketch after this list)
• Vector Chaining
▪ The equivalent of data forwarding in vector processors
▪ Results of one pipeline are fed into the operand registers of another pipeline
• Scatter and Gather
▪ Retrieves data elements scattered throughout memory and packs them into sequential vectors in vector registers
▪ Promotes data locality and reduces cache pollution
• Multiple Parallel Lanes, or Pipes
▪ Allows a vector operation to be performed in parallel on multiple elements of the vector.
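A minimal strip-mining sketch in C, assuming a hypothetical maximum vector length MVL of 64 elements: the loop is split into strips no longer than a vector register, with a shorter final strip handling the remainder.

#include <stddef.h>

#define MVL 64  /* assumed maximum vector length; hardware dependent */

void strip_mined_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i += MVL) {
        size_t len = (n - i < MVL) ? (n - i) : MVL; /* last strip may be short */
        for (size_t j = 0; j < len; j++)  /* one strip: maps to one vector op */
            c[i + j] = a[i + j] + b[i + j];
    }
}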

• Requires Lower Instruction Bandwidth
▪ Achieved through fewer instruction fetches and decodes
• Easier Addressing of Main Memory
▪ Load/store units access memory with known patterns
• Elimination of Memory Wastage
▪ Unlike cached access, every data element requested by the processor is actually used – no cache misses
▪ Latency only occurs once per vector during pipelined loading
• Simplification of Control Hazards
▪ Loop-related control hazards are eliminated
• Scalable Platform
▪ Performance can be increased by using more hardware resources
• Reduced Code Size
▪ A short, single instruction can describe N operations

• Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels, and these can be considered vector processors.
• Other CPU designs include multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word).

• Performs the same instruction on multiple data points concurrently.
• Takes advantage of data-level parallelism within an algorithm.
• Commonly used in image and signal processing applications
▪ Large numbers of samples or pixels are calculated with the same instruction
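As one concrete SIMD instance (an illustrative sketch, assuming an x86 CPU with SSE2): the single intrinsic below applies the same saturating add to 16 pixels at once, exactly the image-processing pattern described above.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Brighten a grayscale image: one SIMD instruction adds 16 pixels at a
   time, saturating so values clamp at 255 instead of wrapping around. */
void brighten(unsigned char *pix, size_t n, unsigned char amount)
{
    __m128i add = _mm_set1_epi8((char)amount);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((__m128i *)(pix + i));
        v = _mm_adds_epu8(v, add);                 /* 16 saturating byte adds */
        _mm_storeu_si128((__m128i *)(pix + i), v);
    }
    for (; i < n; i++) {                           /* scalar tail */
        unsigned int s = pix[i] + amount;
        pix[i] = (s > 255) ? 255 : (unsigned char)s;
    }
}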

• These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed.
• Some architectures include separate SIMD units for handling these instructions.
• In an SIMD machine, 'N' processors are connected to a control unit, and all the processors have their individual memory units. All the processors are connected by an interconnection network.

• Developed by Fortune and Wyllie (1978)
• Objective: modeling idealized parallel computers with zero synchronization or memory access overhead
• An n-processor PRAM has a globally addressable memory

• Parallel random access machine: a theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step.
• Processors may execute different instruction streams, but work synchronously.

• A set of processors of a similar type.
• All the processors share a common memory unit. Processors can communicate among themselves through the shared memory only.
• A memory access unit (MAU) connects the processors with the single shared memory.

• The PRAM model can apply to SIMD-class machines if all processors execute identical instructions on the same cycle, or to MIMD-class machines if the processors execute different instructions.
• Load imbalance is the only form of overhead in the PRAM model.

• EREW - Exclusive read, exclusive write: any memory location may only be accessed once in any one step.

• CREW - Concurrent read, exclusive write: any memory location may be read any number of times during a single step, but only written to once, with the write taking place after the reads.

• ERCW - Exclusive read, concurrent write: reads of a memory location are exclusive, but concurrent writes to the same memory location are allowed.

• CRCW - Concurrent read, concurrent write: any memory location may be written to or read from any number of times during a single step.

There are many methods to implement the PRAM model, but the most prominent ones are:
• Shared memory model
• Message passing model
• Data parallel model

• Shared memory emphasizes control parallelism more than data parallelism.
• In this model, multiple processes execute on different processors independently, but they share a common memory space.
• If there is any change in any memory location due to any processor's activity, it is visible to the rest of the processors.
• As multiple processors access the same memory, it may happen that at any particular point of time, more than one processor is accessing the same memory location.
• To avoid this, some control mechanism is implemented to ensure mutual exclusion, as the sketch below illustrates.
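A minimal sketch of this mutual-exclusion mechanism using POSIX threads (one concrete thread library; the function name worker is illustrative): two threads update the same shared location, and the mutex serializes their accesses.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                       /* shared memory location */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* only one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); /* always 200000 with the mutex */
    return 0;
}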

• Shared memory programming has been implemented in the following:
▪ Thread libraries
▪ Distributed Shared Memory (DSM) systems
▪ Program annotation packages

• A thread library allows multiple threads of control to run concurrently in the same memory space.
• The thread library provides an interface that supports multithreading through a library of subroutines.
• It contains subroutines for:
▪ Creating and destroying threads
▪ Scheduling execution of threads
▪ Passing data and messages between threads
▪ Saving and restoring thread contexts
• Examples of thread libraries include: Solaris™ threads for Solaris, POSIX threads as implemented in Linux, Win32 threads available in Windows NT and Windows 2000, and Java™ threads as part of the standard Java™ Development Kit (JDK).
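A hedged companion sketch showing data passing through the Pthreads creation/join interface (the square function is illustrative): a value goes in through the creation argument and the result comes back through the join.

#include <pthread.h>
#include <stdio.h>

void *square(void *arg)
{
    long n = (long)arg;          /* data passed in at thread creation */
    return (void *)(n * n);      /* result handed back via join */
}

int main(void)
{
    pthread_t t;
    void *result;
    pthread_create(&t, NULL, square, (void *)12L);
    pthread_join(t, &result);    /* collects the thread's result */
    printf("12^2 = %ld\n", (long)result);
    return 0;
}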

• DSM systems create an abstraction of shared memory on a loosely coupled architecture in order to implement shared memory programming without hardware support.
• They implement standard libraries and use the advanced user-level memory management features present in modern operating systems.
• Examples include the TreadMarks system, Munin, IVY, Shasta, Brazos, and Cashmere.

• Program annotation packages are implemented on architectures having shared-memory characteristics.
• The most notable example of a program annotation package is OpenMP.
• OpenMP implements functional parallelism.
• It mainly focuses on the parallelization of loops, as in the sketch below.
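A minimal OpenMP sketch of loop parallelization (illustrative; compile with a flag such as gcc -fopenmp): a single annotation asks the compiler to split the loop iterations across threads.

#include <stdio.h>

int main(void)
{
    double a[1000], b[1000], c[1000];
    for (int i = 0; i < 1000; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The annotation below divides the iterations among the threads;
       without -fopenmp it is ignored and the loop runs sequentially. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        c[i] = a[i] + b[i];

    printf("c[999] = %f\n", c[999]);
    return 0;
}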

• Merits of Shared Memory Programming
▪ Global address space gives a user-friendly programming approach to memory.
▪ Due to the closeness of memory to the CPU, data sharing among processes is fast and uniform.
▪ There is no need to specify distinctly the communication of data among processes.
▪ Process-communication overhead is negligible.
▪ It is very easy to learn.
• Demerits of Shared Memory Programming
▪ It is not portable.
▪ Managing data locality is very difficult.

Message passing is the most commonly used parallel programming approach in distributed memory systems. Here, the programmer has to determine the parallelism. In this model, all the processors have their own local memory unit and they exchange data through a communication network.

• Processors use message-passing libraries for communication among themselves.
• Along with the data being sent, the message contains the following components:
▪ The address of the processor from which the message is being sent;
▪ Starting address of the memory location of the data in the sending processor;
▪ Data type of the data being sent;
▪ Data size of the data being sent;
▪ The address of the processor to which the message is being sent;
▪ Starting address of the memory location for the data in the receiving processor.
• Processors can communicate with each other by any of the following methods:
▪ Point-to-Point Communication
▪ Collective Communication
▪ Message Passing Interface
• Point-to-point communication is the simplest form of message passing. Here, a message can be sent from the sending processor to a receiving processor by any of the following transfer modes:
▪ Synchronous mode − The next message is sent only after receiving a confirmation that the previous message has been delivered, to maintain the sequence of messages.
▪ Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous message is not required.
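A minimal MPI point-to-point sketch (illustrative, not from the slides): MPI_Ssend gives the synchronous mode described above, while a plain MPI_Send may complete asynchronously if the library buffers the message. Run with, e.g., mpirun -np 2 ./a.out.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* synchronous send: completes only once the receive has started */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}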

• Collective communication involves more than two processors for message passing. The following modes allow collective communication:
• Barrier − Barrier mode is possible if all the processors included in the communication run a particular block (known as the barrier block) for message passing.
• Broadcast − Broadcasting is of two types:
▪ One-to-all − Here, one processor with a single operation sends the same message to all other processors.
▪ All-to-all − Here, all processors send messages to all other processors.
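A hedged sketch of these collective operations using MPI's built-in calls: a one-to-all broadcast followed by a barrier at which all ranks synchronize.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) data = 7;                          /* root holds the message */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* one-to-all broadcast */
    printf("rank %d now has %d\n", rank, data);

    MPI_Barrier(MPI_COMM_WORLD);                      /* all ranks wait here */
    MPI_Finalize();
    return 0;
}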

• Broadcast messages may be of three types:
▪ Personalized − Unique messages are sent to each destination processor.
▪ Non-personalized − All the destination processors receive the same message.
▪ Reduction − In reduction broadcasting, one processor of the group collects all the messages from all the other processors in the group and combines them into a single message which all other processors in the group can access.
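A minimal reduction sketch with MPI_Reduce (illustrative): every rank contributes a value, and rank 0 collects the combined result in a single collective call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* combine every rank's value into one message held by rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of all ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}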

• Merits of Message Passing
▪ Provides low-level control of parallelism;
▪ It is portable;
▪ Less error prone;
▪ Less overhead in parallel synchronization and data distribution.
• Demerits of Message Passing
▪ As compared to parallel shared-memory code, message-passing code generally needs more software overhead.

• There are many message-passing libraries. Here, we will discuss two of the most-used message-passing libraries:
▪ Message Passing Interface (MPI)
▪ Parallel Virtual Machine (PVM)

• MPI is a universal standard to provide communication among all the concurrent processes in a distributed memory system.
• Most of the commonly used parallel computing platforms provide at least one implementation of the Message Passing Interface.
• It has been implemented as a collection of predefined functions called a library, which can be called from languages such as C, C++, and Fortran.
• MPI is both fast and portable as compared to other message-passing libraries.

• Merits of Message Passing Interface
▪ Runs on both shared memory and distributed memory architectures;
▪ Each process has its own local variables;
▪ As compared to large shared memory computers, distributed memory computers are less expensive.
• Demerits of Message Passing Interface
▪ More programming changes are required for a parallel algorithm;
▪ Sometimes difficult to debug; and
▪ Performance is limited by the communication network between the nodes.

• PVM is a portable message-passing system, designed to connect separate heterogeneous host machines to form a single virtual machine.
• It is a single manageable parallel computing resource.
• Large computational problems like superconductivity studies, molecular dynamics simulations, and matrix algorithms can be solved more cost-effectively by using the aggregate memory and power of many computers.
• It manages all message routing, data conversion, and task scheduling in a network of incompatible architectures.
• Merits of PVM
▪ Very easy to install and configure;
▪ Multiple users can use PVM at the same time;
▪ One user can execute multiple applications;
▪ It is a small package;
▪ Supports C, C++, and Fortran;
▪ For a given run of a PVM program, users can select the group of machines;
▪ It is a message-passing model;
▪ Process-based computation;
▪ Supports heterogeneous architecture.
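A hedged PVM sketch (assumes the program is compiled to an executable named pvm_demo, which is purely illustrative): the parent spawns a child copy of itself and receives one integer from it; PVM handles the routing and any data conversion between architectures.

#include <pvm3.h>
#include <stdio.h>

int main(void)
{
    pvm_mytid();                       /* enroll this task in PVM */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {       /* parent: spawn one worker task */
        int child, n;
        pvm_spawn("pvm_demo", NULL, PvmTaskDefault, "", 1, &child);
        pvm_recv(child, 1);            /* wait for a message with tag 1 */
        pvm_upkint(&n, 1, 1);          /* unpack one int */
        printf("parent got %d\n", n);
    } else {                           /* child: send a value back */
        int n = 99;
        pvm_initsend(PvmDataDefault);  /* PVM handles data conversion */
        pvm_pkint(&n, 1, 1);
        pvm_send(parent, 1);
    }
    pvm_exit();
    return 0;
}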

• The major focus of the data parallel programming model is on performing operations on a data set simultaneously.
• The data set is organized into some structure like an array, hypercube, etc.
• Processors perform operations collectively on the same data structure.
• Each task is performed on a different partition of the same data structure.
• It is restrictive, as not all algorithms can be specified in terms of data parallelism. This is the reason why data parallelism is not universal.
• Data parallel languages help to specify the data decomposition and its mapping to the processors.
• They also include data distribution statements that allow the programmer to have control over data – for example, which data will go on which processor – to reduce the amount of communication within the processors. A sketch of this decomposition idea follows.
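A minimal data-parallel sketch using MPI's scatter/gather collectives (illustrative; assumes the array length divides evenly among the ranks): the array is decomposed into partitions, each rank applies the same operation to its own partition, and the results are gathered back.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double data[16], part[16];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = 16 / size;             /* partition size per rank */
    if (rank == 0)
        for (int i = 0; i < 16; i++) data[i] = i;

    MPI_Scatter(data, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);    /* distribute the partitions */
    for (int i = 0; i < chunk; i++)    /* same operation on each partition */
        part[i] *= 2.0;
    MPI_Gather(part, chunk, MPI_DOUBLE, data, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);     /* collect the results */

    if (rank == 0) printf("data[15] = %f\n", data[15]);
    MPI_Finalize();
    return 0;
}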

• Parallel computers use VLSI chips to fabricate processor arrays, memory arrays, and large-scale switching networks.
• Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip.

• We can calculate the space complexity of an algorithm by the chip area (A) of the VLSI chip implementation of that algorithm.
• If T is the time (latency) needed to execute the algorithm, then A·T gives an upper bound on the total number of bits processed through the chip (or I/O).
• For certain computations, there exists a lower bound f(s) such that A·T² ≥ O(f(s)), where A = chip area and T = time.
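As a worked instance of this bound (a classical result not stated in the slides, so treat it as an assumption): for multiplying two n-bit integers, the bound takes f(n) = n², giving

\[
A \cdot T^{2} = \Omega(n^{2}) \quad\Longrightarrow\quad T = \Omega\!\left(\frac{n}{\sqrt{A}}\right),
\]

so for a fixed area budget A, halving the chip area forces the execution time up by a factor of \(\sqrt{2}\): the area-time product expresses a genuine trade-off, not two independent costs.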
