
Stream Processors and GPUs: Architectures for High Performance Computing

Christos Kyrkou, Student Member, IEEE

Abstract—Architectures for high performance computing are becoming all the more essential with the increasing demands of multimedia, scientific, and engineering applications. These applications require architectures which can scale to meet their real-time constraints, with all current technology limitations in mind, such as the memory gap and the power wall. While current many-core CPU architectures show an increase in performance, they still cannot achieve the necessary real-time performance required by today's applications, as they fail to efficiently utilize a large number of ALUs. New programming models and computer architecture innovations, coupled with advancements in technology, have set the foundations for the development of the next generation of processors for high-performance computing (HPC). At the center of these emerging architectures are Stream Processors and Graphics Processing Units (GPUs). Over the years GPUs have exhibited increased programmability that has made it possible to harvest their computational power for non-graphics applications, while stream processors, because of their programming model and novel design, have managed to utilize a large number of ALUs to provide increased performance. The objective of this survey paper is to provide an overview of the architecture, organization, and fundamental concepts of stream processors and Graphics Processing Units.

Index Terms—High Performance Computing (HPC), Stream Programming Model, Stream Processors, Graphics Processing Units (GPUs), General Purpose computation on a Graphics Processing Unit (GPGPU)

C. Kyrkou is with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus (e-mail: [email protected]).

I. INTRODUCTION

Advancements in modern technology and computer architecture allow today's processors to incorporate enormous computational resources into their latest chips, such as multiple cores on a single chip. The challenge is to translate the increase in computational capability into an increase in performance. Hence, new architectures that are optimized for parallel processing rather than single-thread execution have emerged [1]. Parallel computing architectures emerged as a result of the increasing computation demands of various applications, ranging from multimedia to scientific and engineering fields. All these applications are compute intensive and require up to tens of GOPS (Giga Operations Per Second) or even more. Only a small fraction of their time is spent on memory operations, with the majority of operations using computational resources. Also, most applications come with some type of real-time constraint. These characteristics require parallel and scalable architectures that can efficiently utilize all the processing resources to achieve high performance. Such parallel architectures must consider the following [2]: They need efficient management of communication in order to hide long latencies with useful processing work. Additionally, to keep their processing resources utilized they need a memory hierarchy that provides high bandwidth and throughput. Finally, the increasing number of computational resources will require more power, and it will be challenging to manage it efficiently. Two such parallel architectures are Stream Processors and Graphics Processing Units (GPUs).

Relying on parallel architectures alone is not sufficient to gain high performance. Such architectures require a programming model that can expose the inherent application parallelism and data flow, so that the hardware can be efficiently utilized. Developing such a programming model requires a dramatic shift from the existing sequential model used in today's CPUs to a data-driven model that suits the parallel nature of the underlying hardware. The stream programming model was developed with these considerations in mind [3]. In this model data are grouped together into streams, and computations can be performed concurrently on each stream element. This exposes both the parallelism and locality of the application, yielding higher performance. The introduction of the stream programming model had as a result the development of specialized stream processors [4], optimized for the execution of stream programs, thus combining both high performance and programmability.

Driven by the billion-dollar market of game development, with its ever increasing performance demands, GPUs have evolved into a compute engine. Because of the high performance demands, the architecture of a GPU is drastically different from that of a CPU: transistors are used for computational units instead of caches and branch prediction, and the architecture is optimized for high throughput instead of low latency [2]. Moreover, GPU performance doubles every six months [5], in contrast to CPU performance, which doubles every 18 months. Consequently, GPUs offer order(s) of magnitude greater performance and are widely considered as the computational engine of the future. Since early 2002-2003 there has been massive interest in utilizing GPUs for general purpose computing applications, under the term General Purpose computing on GPU (GPGPU) [1]. This shift was primarily motivated by the evolution of the GPU from just a hardwired implementation for 3D graphics rendering into a flexible and programmable computing engine.

The purpose of this survey paper is to provide an in-depth overview of these emerging parallel architectures. The outline of this paper is as follows: The fundamentals of the stream programming model and a general architecture of a stream processor are discussed in Section II. Section III provides details on four stream processor architectures. An introduction to GPUs and details about their evolution and architectural trends are given in Section IV, while Section V provides a discussion on GPGPUs and the next generation of GPUs that are enhanced for general purpose computing. Finally, Section VI concludes the paper.

II. STREAM PROCESSORS FUNDAMENTALS

A. Stream Programming Model
The stream programming model arranges applications into a set of computation kernels that operate on data streams [3]. Expressing an application in terms of the stream programming model exposes the inherent locality and parallelism of that application, which can be efficiently handled by appropriate hardware to speed up parallel applications. By using the stream programming model to expose parallelism, producer-consumer localities are revealed between kernels, as well as true data localities in applications. These localities can be exploited by keeping data movement local between communicating kernels, which is more efficient than using global communication paths. Fig. 1 shows how kernels are chained together with streams.
- Streams: A collection of data records of the same type, ranging from single numbers to complex elements.
- Kernels: Operations that are applied on the input stream elements. Kernels can perform simple to complex computations and can have one or more input and output streams.

Fig. 1. Stereo Depth Extraction in the stream model [6].

The advantage of expressing the application in the form of a stream program is that it exposes two types of localities. The first is kernel locality. During a kernel execution, all references are to variables local to the kernel, except the values read from the input stream or written to the output stream. The second is producer-consumer locality, which involves streams moving between kernels. One kernel produces a stream which another kernel immediately consumes. By efficiently sequencing such kernels, the stream values are kept local and are consumed soon after they are produced. The stream model ensures that kernel programs will never access the main memory directly.

The stream programming model defines communication and concurrency between streams and kernels at three distinct levels [4]. In this way the locality and parallelism of the application are exposed. These restrictions in communication help make the most efficient use of bandwidth.

Communication:
- Local: Used for temporary results produced by scalar operations within a kernel.
- Stream: For data movement between kernels. All data are expressed in the form of streams.
- Global: Necessary for global data movement, either to and from the I/O devices, or for data that remain constant throughout the application lifespan.

Concurrency:
- Instruction Level Parallelism (ILP): Parallelism exploited between the scalar operations within a kernel.
- Data Level Parallelism (DLP): Applying the same computation pattern on different stream elements in parallel.
- Task Level Parallelism (TLP): As long as no dependencies are present, multiple computation and communication tasks can be executed in parallel.
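To make these ideas concrete, the following minimal sketch (written in CUDA purely for illustration; dedicated stream processors use their own tools, such as the StreamC/KernelC language discussed in Section III) chains two kernels through an intermediate stream. Each thread applies the kernel to one stream element (data-level parallelism), and the intermediate stream written by the first kernel is consumed directly by the second (producer-consumer locality):

#include <cuda_runtime.h>

// Producer kernel: reads the input stream, writes an intermediate stream.
__global__ void kernelA(const float *in, float *mid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mid[i] = 2.0f * in[i];
}

// Consumer kernel: reads the intermediate stream, writes the output stream.
__global__ void kernelB(const float *mid, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mid[i] + 1.0f;
}

// Host code: the two kernels are sequenced so that the stream produced by
// kernelA is consumed by kernelB soon after it is produced.
void runChain(const float *d_in, float *d_mid, float *d_out, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernelA<<<blocks, threads>>>(d_in, d_mid, n);
    kernelB<<<blocks, threads>>>(d_mid, d_out, n);
}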

B. Stream Operations
Typical operations that can be performed on streams are [5]:
- Map-Apply: This operation is used to process all elements of a stream by a function.
- Gather and Scatter: Often used when addressing vectors. Gather is a read operation with an indirect memory reference, while scatter is a write operation with an indirect memory reference. Both types of memory referencing are shown in Table 1, and a parallel formulation is sketched after this list.

TABLE 1: GATHER AND SCATTER MEMORY ADDRESSING
Scatter: for (i = 0; i < N; i++) a[idx[i]] = b[i];
Gather:  for (i = 0; i < N; i++) a[i] = b[idx[i]];

- Reduce: The operation of computing a smaller stream from a larger input stream.
- Filtering: The operation of selecting a subset of stream elements from the whole stream and discarding the rest. The location and number of filtered elements are variable and are not known beforehand.
- Sort: Transforms a stream into an ordered set of data.
- Search: This operation finds a specific element within the stream, or the set of nearest neighbors to a specified element.
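The loops in Table 1 are written sequentially; on stream hardware each element is handled by a separate lane. The following CUDA sketches (hypothetical illustrations assuming single-precision data, not code from any of the surveyed processors) show how gather/scatter, reduce, and filtering map onto one thread per stream element. Production implementations typically use tree-based reductions and prefix-sum compaction; atomics are used here only for brevity:

#include <cuda_runtime.h>

// Gather: each thread performs an indirect read, a[i] = b[idx[i]].
__global__ void gather(float *a, const float *b, const int *idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[idx[i]];
}

// Scatter: each thread performs an indirect write, a[idx[i]] = b[i].
// Assumes idx holds unique locations; duplicates would cause write conflicts.
__global__ void scatter(float *a, const float *b, const int *idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[idx[i]] = b[i];
}

// Reduce: compute a single sum from a larger input stream.
__global__ void reduceSum(const float *in, float *sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, in[i]);
}

// Filtering: keep only the elements that satisfy a predicate. The output
// count is produced at runtime, since it is not known beforehand; element
// order is not preserved by this simple scheme.
__global__ void filterPositive(const float *in, float *out, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        out[atomicAdd(count, 1)] = in[i];
    }
}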

C. Generic Stream Processor Architecture
Stream processors belong to a new class of architectures that are compute intensive and achieve high performance by reducing dynamic control and providing a large number of programmable functional units. The main architectural components of a stream processor are shown in Fig. 2. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Hence, their architecture also takes into account the concurrency and communication levels defined by the stream programming model in order to maintain high locality and parallelism levels. Specifically, their memory hierarchy is partitioned into three distinct levels of storage. Each of the memory levels provides an order of magnitude more bandwidth as it gets closer to the functional units. This maintains temporary data close to the functional units, while only true global data are stored in the external memory. Because of the efficient bandwidth use, most of the work of a stream processor is done on-chip, with only 1% of global data references requiring external memory access. The three levels of the memory hierarchy are given below [4]:
- Local Register File (LRF): Used for local data communication and fast access of temporary data by the functional units.
- Stream Register File (SRF): Used to store streams and transfer data between the LRFs of major components.
- Off-Chip Memory: Stores global data and is used only when necessary.

Better performance can be achieved by partitioning the SRF and LRF as well, which also reduces the number of ports into the register files.

As with the memory hierarchy, the functional units of a stream processor are distributed in ALU clusters. All clusters consist of the same type and amount of functional units. The functional units within a cluster may differ between them, ranging from multipliers, to floating point units, to ALUs.
The ALU clusters execute kernel instructions that they receive from the μController in SIMD fashion. Each of the clusters operates on an element of the input stream, thus providing data parallelism. The SIMD organization helps provide the necessary data bandwidth to feed all ALU clusters.

Because the stream programming model exposes the concurrency and communication in an application, it is desirable that instruction scheduling be handled by the compiler. Hence, it is possible to exploit the instruction level parallelism of a kernel via VLIW instructions. Each VLIW instruction consists of several ALU operations as well as control signals for every LRF in a cluster. Moreover, VLIW control is efficient since additional hardware for register renaming, dependency detection, or operation reordering is not necessary. The freed transistors can be used for additional computational units.

Stream processors usually work as coprocessors for a general purpose host processor. The host processor is responsible for the execution of scalar code. Any kernel instructions and stream loads/stores are dispatched from the host processor to the stream controller. From there on they are assembled and sent to the appropriate unit. Additionally, the host processor manages the system resources. The stream controller manages the communication with the host processor. It keeps track of the stream instructions that it receives from the host processor and sequences the VLIW instructions to the μController. The external memory controller handles global data references from the external DRAM. Various stream processors differ in the number of ALU clusters, the types of ALUs, the number of ALUs in a cluster, the capacity of the SRF and LRF, and their bandwidth.

Fig. 2. General stream processor architecture, with all main components.

III. STREAM PROCESSORS ARCHITECTURES

A. Imagine
Imagine [6, 7, 8] was a project undertaken at Stanford University. It involved the implementation of a stream processor for media applications, and it was the first implementation of a stream processor chip. This project had many contributions, such as the stream processor architecture, which was later employed in other stream processors, and development tools for stream applications, including a stream programming language (StreamC and KernelC) and compiler. A prototype Imagine processor was designed and fabricated in conjunction with Texas Instruments. Imagine contains 21 million transistors and has a die size of 16mm x 16mm in a 0.15 micron standard technology [6]. The architecture of Imagine is illustrated in Fig. 3.
1) Architecture
Imagine operates as a coprocessor to a general purpose host processor, and is geared towards multimedia applications such as encryption, 3D graphics, and video processing. The host processor is responsible for the execution of scalar code, while all stream instructions are sequenced and issued to Imagine via a stream controller. All external kernel inputs and outputs are communicated by data streams which reside in the SRF. Incoming instructions operate on those streams. The input and output streams generated during kernel executions are also stored in the SRF. The Imagine architecture takes advantage of the stream model properties, such as the data-parallel organization of records within a stream, the sequential ordering of accesses to streams, the locality of kernel data, and the simple control flow within kernels.

There are 8 ALU clusters in Imagine, and each is comprised of six 32-bit floating point units (FPUs): 3 adders, two multipliers, and a divide/square-root unit. The clusters operate in an eight-wide SIMD manner. That is, the same instruction is executed on all 8 clusters on different elements of the input data stream. The six FPUs execute VLIW instructions issued by the μController. Local Register Files (LRF) in each cluster are used to provide the ALUs with data.
Mechanisms are present to provide communication between the ALUs and the SRF, as well as inter-cluster communication for data transfer between clusters.

Upon the arrival of a kernel instruction from the host processor, the micro-controller starts issuing VLIW instructions. Each cluster reads a stream element and executes the instruction it has received from the micro-controller. The clusters can read elements from the same or different streams. The clusters then write the output stream elements to one or more streams back in the SRF.

Imagine exploits data-level parallelism (DLP) through SIMD execution of a kernel on all eight arithmetic clusters, one stream element at each one. Instruction level parallelism (ILP) is exploited by using VLIW instructions, and task-level parallelism (TLP) is achieved by overlapping stream data memory accesses and I/O operations with kernel execution.

The bandwidth hierarchy employed in Imagine manages to capture the locality in the stream programming model. The first hierarchy level captures the locality within kernels by passing and storing intermediate results in the LRFs. The producer-consumer relations between kernels are captured by the second level of the hierarchy, which is the SRF. Streams generated during consecutive kernels are stored in the SRF, eliminating the need to access the external memory. Also, the SRF captures the locality of temporary streams. Requests to the off-chip DRAM are left for the truly global data.

The SRF allows 22 read/write accesses by all 22 stream clients simultaneously, and has a total capacity of 128 KB. Within each cluster there are 15 LRFs. These LRFs contain kernel constants, parameters and local variables. On each cluster there are four 32-entry and thirteen 16-entry LRFs, for a total of 272 words per cluster and 2176 words across all 8 clusters. Running at 500 MHz, the SDRAM can transfer data to the processor's streaming memory unit at a rate of about 2.67 GB/s. The SRF can communicate with the ALU clusters at a rate of 32 GB/s, nearly an order of magnitude faster than transfer rates with SDRAM. The ALUs can directly forward their results amongst each other using the Local Register Files, which provide a bandwidth of about 544 GB/s; another order of magnitude faster than the bandwidth of the previous tier.

The stream controller is the final component of Imagine. It stores each instruction in a scoreboard and issues them to the proper module as soon as their dependencies are resolved. Hardware resource dependencies are resolved at runtime. Data dependencies are encoded statically.

2) Imagine stream ISA
All memory references in Imagine are made using 'stream load' and 'stream store' instructions. This is done to simplify the architecture by optimizing for stream accesses instead of individual accesses. This increases throughput and simplifies programming of the Imagine. Communication is performed with 'send' and 'receive' instructions. Imagine's stream ISA consists of the following stream instructions [8]:
- Load: Used for external memory communication. It transfers streams from the external DRAM to the SRF.
- Store: Used for external memory communication. It transfers streams from the SRF to the external DRAM.
- Receive: Used for communication with other I/O devices. Transfers streams from I/O to the SRF.
- Send: Used for communication with other I/O devices. Transfers streams from the SRF to an I/O device.
- Cluster Op: Used for kernel execution. The input streams are first read from the SRF, then their elements are processed, and the resulting output streams are stored back to the SRF.
- Load Microcode: Loads streams consisting of kernel microcode (VLIW instructions) from the SRF into the micro-controller.
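Although Imagine is programmed through StreamC/KernelC rather than any modern GPGPU API, this stream ISA maps closely onto the host-directed execution style later popularized by CUDA. Purely as an analogy (the CUDA code below is illustrative and is not Imagine's actual toolchain), a Load / Cluster Op / Store sequence corresponds roughly to:

#include <cuda_runtime.h>

// ~ Cluster Op: the kernel program applied to every stream element.
__global__ void processStream(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

void runStreamProgram(const float *h_in, float *h_out, int n) {
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // ~ Load: move the input stream from external memory toward the chip.
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // ~ Cluster Op: execute the kernel over all stream elements.
    processStream<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    // ~ Store: move the output stream back out.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}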

Fig. 3. The Imagine stream processor architecture [7].

3) Performance
To evaluate the performance of the Imagine stream processor, three benchmark sets were used [9]. These benchmarks include applications such as MPEG-2 encoding, polygon rendering, depth extraction, and the Fast Fourier Transform; a representative workload of today's popular media applications. The first set was made out of synthetic benchmarks aimed at finding the actual peak performance of each Imagine subsystem and validating the theoretical peak performance. The second, a set of media processing kernels, was used to explore how performance is affected by various kernel characteristics. The final benchmark set tracks how Imagine performance is affected by the stream length (the number of elements in a stream). The results indicate that Imagine sustains about 43% of its theoretical peak performance on the first set, and between 16%-60% on the second set. The performance of Imagine is affected by time spent on kernel prologues/epilogues, unused slots in the VLIW instructions, and load imbalance amongst the functional units. Performance metrics on selected kernels and applications are shown in Table 2. The benchmark results reveal that the Imagine processor and the stream paradigm are well suited for multimedia applications because of the inherent parallelism of such applications.


TABLE 2: PERFORMANCE METRICS FOR IMAGINE FOR MEDIA APPLICATIONS AND KERNELS
Performance (GOPS): 400 MHz | 200 MHz
- Depth Extraction: 11,92 | 4,91
- MPEG-2 Encoding: 15,35 | 7,36
- Polygon Rendering: 5,91 | N/A
- QR Decomposition: 10,46 | 4,81
- 7x7 Convolution (Kernel): 52,6 | 9,76
- 2D Discrete Cosine Transform (Kernel): 22,6 | 6,92

B. Merrimac
Merrimac [10] is a stream-inspired supercomputer that uses stream architecture and advanced interconnection networks to provide up to an order of magnitude more performance than traditional cluster-based supercomputers. Merrimac aims at exploiting the advantages of the stream programming model and stream processor architectures to improve the performance of scientific applications. The main components that comprise Merrimac are illustrated in Fig. 4.
1) Architecture
The core element of Merrimac is a stream processor core. The stream processor can act as a standalone coprocessor to a general purpose processor, or can be interconnected with other identical stream cores via a Network-on-Chip to form a streaming supercomputer. A single stream processing core contains a total of 16 arithmetic clusters. Within each cluster there are 4 FPUs, a SRF bank, and a LRF used by the FPUs. As with Imagine, a host processor is responsible for fetching all instructions and dispatching any stream instructions to the microcontroller, which will eventually dispatch them to the clusters, and stream memory references to the stream memory hierarchy.

The Merrimac board system has a total of 16 stream processing cores interconnected via a high-radix NoC. Each core has a network interface which directs off-chip memory references to the routers. Each processor communicates with a 2GByte DRAM and four router chips. The routers interconnect the 16 processors on the board, providing flat memory bandwidth of 20GBytes/s per node. Routers also act as gateways for external board references; when more than one board is used, each router can sustain a bandwidth of 5GBytes/s per node.

Fig. 4. Floorplan of a Merrimac stream processor core.

2) Performance
Merrimac's performance was evaluated with three typical scientific applications that were converted to stream programs and executed on a cycle-accurate Merrimac simulation model. The performance over the three applications ranged from 11.4 GFLOPS to 33.5 GFLOPS. The FPU utilization ranges from 7 to 50 floating point operations per global memory access. The majority of memory references are satisfied by the first level of the hierarchy, which is the LRF. Only a small portion of the total memory references, about 1%, requires access to the external memory. The performance measures suggest that stream processing architectures are well suited for scientific applications.
3) Fault Tolerance on Merrimac
Fault detection techniques for soft-error fault tolerance were applied on Merrimac as an example of a compute intensive architecture [11]. A hybrid fault tolerant model is proposed, where software or hardware fault tolerant techniques can be used to detect soft-errors. This model uses hardware techniques to conserve off-chip bandwidth and on-chip storage, and a combination of software-hardware techniques (Instruction Replication, Kernel Re-execution, Mirrored Clusters) for redundancy of computational resources. The high amount of logic found in compute intensive architectures such as Merrimac makes hardware replication very costly.
C. FT64 Stream Processor
FT64 (Fei Teng 64) is another stream processor that primarily targets scientific applications [12]. FT64's instruction set architecture is optimized and fine tuned to provide the best performance for scientific applications. It acts as a coprocessor to a host general purpose processor. A novel stream programming language, SF95 (Stream FORTRAN95), and its compiler, SF95Compiler, were also developed to facilitate the development of scientific applications for FT64.
1) Architecture
FT64 is mainly composed of a stream controller (SC), a stream register file (SRF), a micro controller (UC), four ALU clusters, a stream memory controller (SMC), a DDR memory controller (DDRMC), a host interface (HI) and a network interface (NI). The architecture of FT64 is shown in Fig. 5.

The stream controller is responsible for issuing and sequencing the stream instructions received from the host processor. Its main components are an interface with the host, a scoreboard and an instruction issuing unit. The issuing unit can issue an instruction every 9 cycles.

The microcontroller is responsible for the instruction fetch and decode stages of the processor, and controls the kernel execution on the 4 ALU clusters in a SIMD fashion. The microcontroller stores VLIW instructions, which it then issues to the clusters for execution.

The four ALU clusters can be viewed as the read-operands and execute stage of the FT64 stream processor. The clusters communicate with each other through an inter-cluster network. Each cluster is composed of eight functional units, an inter-cluster network and eight local register files. The eight functional units consist of four multiply-accumulate (MAC) units, a divide/square-root (DSQ) unit, a communication unit, and a scratch pad memory.

Fig. 5. FT64 top level block diagram.

The memory hierarchy of FT64 follows that of other stream processors. It consists of three levels: the SRF, the LRF and an external memory. The LRFs are distributed amongst the ALU clusters, having a total capacity of 19KB and a bandwidth of 544GB/s. The SRF buffers derived streams and kernel-level programs and is distributed into 2068 blocks, having a total capacity of 256KB. The total bandwidth provided by the SRF reaches 64GB/s. The external memory is used to store and load streams. It is controlled by both the stream memory controller and the DDR memory controller, which cooperate to handle all transfer requests. The stream memory controller receives access instructions from the stream controller and sends read/write requests to the DDR memory controller, which handles the memory access. The SMC has an 8GB/s bandwidth with the DDR memory controller, while the DDR memory controller communicates with the external DDR with a bandwidth of 6.4GB/s.

The network interface provides a means for combining two or more FT64s into a powerful stream supercomputer. The communication between the stream processors is done by partitioning a stream into 32-bit packets. To facilitate the stream transfer between processors, the ISA is enriched with network stream instructions for sending and receiving streams. The total bandwidth of the network interface is 1.2GB/s.

2) Instruction Set Architecture
FT64's ISA is comprised of 11 stream-level instructions and 140 kernel-level instructions. The design of the kernel-level instructions focuses on scientific computing, with support for double-precision floating-point calculations. The stream instruction types are illustrated in Table 3.

TABLE 3: STREAM LEVEL INSTRUCTIONS OF FT64
- MOVE: Move the data in the source register to the destination register
- WRITE_IMM: Write an immediate value to the destination register
- BARRIER: Ensure the execution order of the instructions
- RESET: Reset FT64
- MEMOP: Transfer a stream between DRAM and the SRF, with addressing modes including sequential, strided, indexed, and bit-reversed
- LOAD_UCODE: Load the microcode from the SRF to the UC storage
- CLUSTOP: Start the execution of the kernel on the clusters
- CLUSTER_RESTART: Restart the kernel with the input or output stream
- SYNCH_UC: Synchronize the execution of the clusters
- NETOP: Transfer a stream between the local SRF and a remote SRF
- NET_RESTART: Restart a stream to continue the network transfer

3) Performance
Nine typical scientific application kernels were tested on FT64, and the performance of all the applications is compared to the Itanium 2, a very popular VLIW general purpose processor. The speedup achieved by using FT64 is illustrated in Table 4. In seven out of nine applications the stream processor outperforms the Itanium 2, and even for the other two applications the difference between them is not large.

TABLE 4: COMPARISON OF PERFORMANCE BETWEEN FT64 AND ITANIUM 2
Test Application: Speedup (FT-64 / Itanium 2)
- Swim: 1.04
- EP: 2.55
- MG: 1.35
- CG: 0.10
- FFT: 8.01
- Laplace: 2.41
- Jacobi: 1.04
- GEMM: 1.97
- NLAG-5: 0.97

D. Storm-1 Stream Processor
A spinoff company named Stream Processors Inc. was started from the Imagine project. The company's current flagship product is the Storm-1 stream processor. The Storm-1 chips are aimed at replacing more expensive and hard-to-design ASIC and FPGA DSP solutions. Targeted applications for Storm-1 include video surveillance, video conferencing and multi-function printers.

Fig. 6. Storm-1 Processor block diagram.

The Storm-1 SP8LP-G220 [13] processor has 16 clusters and 5 individual ALU units per cluster. A detailed block diagram is illustrated in Fig. 6. The ALUs in each cluster are capable of 4x8/2x16/1x32-bit operations per cycle, including MACs. The stream processor's performance can reach up to 30 16-bit GOPS. The stream register file's capacity is 128 KB (16 KB per cluster) and the local register file is 9.5 KB (304 words per cluster). A dedicated 96 KB of memory is reserved for VLIW instructions.

E. Stream Processors and Other Architectures
1) Stream Processors vs. Vector Processors
Stream processors share a lot of similarities with vector processors, such as their ability to expose data parallelism by operating on large amounts of data, and hiding memory latencies by overlapping instruction executions [4]. However, stream processors extend the capabilities of vector processors in two ways. First, stream processors have an additional layer of memory hierarchy, created by splitting the vector register file into the local register file and the stream register file. This allows the LRFs to provide high aggregate bandwidth to support a large number of ALUs, and the SRF to be made large enough to provide coarse grain locality. Second, a stream processor executes all kernel operations on one stream element before moving to the next, thus reducing the number of intermediate variables that need to be stored in the LRFs. These two enhancements can reduce memory bandwidth demands and thus can support more ALUs.
2) Stream Processors vs. DSPs
Although stream processors are related to and may be considered as DSPs, they have significant differences from traditional DSP architectures. Stream processors are highly parallel and keep thousands of operations in flight every cycle. This is possible primarily because of the differences in the memory hierarchy of the two processor architectures. Traditional DSPs rely on large caches to reduce memory latencies; additionally, DSPs require the programmer to manage data movement. In contrast, stream processors use a compiler-managed distributed memory hierarchy with background data movement and execution scheduling. Furthermore, the bandwidth hierarchy of stream processors hides long memory latencies and minimizes requests to the external memory.
F. Scalability and Future of Stream Processors
Stream processors are emerging as a popular solution for high performance parallel computing. They are considered more area and power efficient than traditional processor architectures for applications that exhibit a high degree of parallelism. Since stream processors are a promising architecture for parallel applications, it is worth considering how they will scale with future technologies. This was the focus of the work done in [14], where the scalability of stream processors was explored by looking into intra-cluster and inter-cluster scaling. The former technique is based on increasing the number of ALUs per cluster, while the latter is based on increasing the number of ALU clusters. The performance of these scaling techniques was also studied on a set of multimedia applications.

Intra-cluster scaling was found to be effective in terms of cost and performance for up to 10 ALUs per cluster, and in terms of area and energy for up to 5 ALUs per cluster. Inter-cluster scaling was found to be effective for up to 128 clusters, with only a slight decrease in terms of energy and area. By combining the two techniques, a 640-ALU stream processor with 5 ALUs in each of 128 clusters is feasible in 45nm technology. Such a processor would be able to sustain about 300 GOPS with only 2% degradation in area and 7% degradation in energy per ALU, compared to a conservative 40-ALU stream processor.

Stream processors close the gap between special and general-purpose processors without sacrificing programmability [14]. With competitive energy efficiency, lower recurring costs compared to ASIC solutions, and advantages in flexibility compared to FPGAs, stream processors are expected to become an attractive solution for signal processing applications. The four dedicated stream processors presented in Section III are summarized in Table 5.

TABLE 5: SUMMARY OF STREAM PROCESSOR SPECIFICATIONS AND PERFORMANCE (N/A = not available)
Columns: Imagine [7] | Merrimac [10] | Storm-1 [13] | FT-64 [12]
- LRF bandwidth: 544 GB/s | 2,560 GB/s | N/A | 544 GB/s
- LRF capacity: 17,4 KB | 6,144 KB per cluster | 19 KB | 19 KB
- SRF bandwidth: 32 GB/s | 512 GB/s | N/A | 64 GB/s
- SRF capacity: 128 KB | 1 MB, 128 KB per cluster | 256 KB, 16 KB per cluster | 256 KB
- ALUs per cluster: 6 (32-bit) | 4 (64-bit) | 5 (32-bit) | 8 (64-bit)
- Number of clusters: 8 | 16 | 8 | 4
- Functional unit types: 3 ALUs, 2 multipliers, 1 divide/square-root unit | 4 floating point MADD units | 5 MAC units | 4 MACs, 1 DSQ unit, and communication modules
- Average performance: 6-15 GOPS | 11,4-32,2 GFLOPS | 224 8x8-bit GMACS | 1-9 GFLOPS
- Frequency: 400 MHz | N/A | 700 MHz | 500 MHz

G. Stream Processing Inspired Architectures – Cell BE
A similar architecture to that of stream processors can be found in the Cell Broadband Engine [15]. The Cell Broadband Engine was developed by IBM, Sony and Toshiba. The objective was to design a processor with reconfigurable elements that could adapt to, and excel at, a variety of multimedia applications. Cell can perform stream processing to provide improved performance, given the necessary software support. Cell's main components are 8 SPEs (Synergistic Processing Elements), which are controlled by one PPE (Power Processing Element). Each of the SPEs is assigned an operation (kernel) by the PPE, and they are connected on a bidirectional ring bus to allow data vectors (streams) to be run through multiple SPEs connected in series.

The Cell BE exhibits many similarities to the general architecture of a stream processor. Cell's SPEs are similar to the ALU clusters of a stream processor. Each of the SPEs has its own register file, just as a stream processor cluster has its own LRF. Furthermore, each SPE contains a 256 KB local storage unit that allows for quick local accesses, similar to the local scratch pads in the clusters of most stream processors. The SPEs also have a memory flow controller that behaves much like the inter-cluster communication unit found in stream processors.
However, there are differences between Cell and the architecture of a stream processor. The SPEs are each full processors, so they may each execute a different kernel, and kernel operations can be chained together by connecting the SPEs together; unlike stream processors, where the same kernel (VLIW instructions) is executed across all clusters.

IV. GRAPHICS PROCESSING UNITS
A Graphics Processing Unit (GPU) is a specialized processor that offloads the 3D graphics rendering workload from the CPU. It is equipped with a large amount of special hardware for mathematical operations and fast floating point operations. A GPU is optimized for 3D graphics processing and, more recently, for video processing. The primary goal of GPUs is to provide high performance for a rich 3D experience. Consumer demand for videogames, advances in technology, and the potential for increased performance from the inherent parallelism in the feed-forward graphics pipeline paved the way for the development of this specialized processor.

A. The Graphics Pipeline
The task of any 3D graphics processing system is to synthesize an image from a description of a scene, under some time constraints in order to achieve real time performance [16]. The process of synthesizing the image is traditionally implemented as a pipeline of specialized stages called the graphics pipeline. This is an efficient way of implementing the image synthesis procedure, primarily because of the producer-consumer relationships that are present between stages.

Fig. 7. GPU graphics pipeline abstraction: 1) Transformations; 2) Lighting; 3) Camera View; 4) Rasterization; 5) Texturing; 6) Pixel Operations.

The input to the graphics pipeline is typically a representation of a 3D model or object in the form of a wireframe consisting of a set of triangles. The more triangles that are used to represent the 3D model, the higher the quality; however, the processing time required by the GPU increases. Each triangle is assembled from vertices, and each vertex has information such as its coordinates (x, y, and z) and color. A 3D model scene is forwarded to the graphics pipeline one vertex at a time. Different graphics pipeline implementations may have different stages and operations depending on the targeted quality and performance. However, some fundamental stages are implemented by all graphics pipelines. These steps are shown in the graphics pipeline abstraction in Fig. 7 and can be summarized as follows [16]:
1) Model Transformations
Before the main computation can begin, the GPU must transform all objects into a common coordinate system, using operations such as translations, rotations or scaling. Usually this common coordinate system is the screen space, and this is done in a per-vertex manner. The entire hierarchy of transformations can be done with a single matrix-vector multiply. The output of this stage is a stream of vertices expressed in the same 2D coordinate system.
2) Lighting
The GPU computes each vertex's lighting properties based on the scene lighting conditions. This is done by considering both the camera position and the position of the light source. Multiple lights are handled by summing the contribution of each individual light.
3) Camera Simulation – Viewing Transformation
In this stage the vertices are assembled back into triangles. Each 3D triangle is projected onto a virtual camera's film plane. The output of this procedure is a stream of triangles ready to be converted into pixels.
4) Clipping & Rasterization
The clipping procedure involves discarding vertices and triangles that are not visible in the camera space because they are occluded by pixels with smaller depth values. The procedure of checking the depth of an object is called Z-testing. Rasterization is the process in which the 2D scene is converted into a raster format and fragments are created from triangles. Fragments can be considered candidate pixel values, and there may be multiple fragments per pixel location. In the case where multiple triangles overlap a fragment, its final color is the interpolated value of the overlapping triangles' colors. Since each fragment can be treated independently, this stage of the pipeline has been made extremely parallel over the years.
5) Texturing
To add more realism to the scenes, 1D, 2D or 3D images called textures [17] are mapped over the geometry of an object. Textures are stored in high speed memories, which must be accessed by each pixel in order to determine or modify its color. More than one access to a texture may be required to adjust the visual properties of textures when they appear either smaller or larger than their native size. In order to hide the latencies of accessing the texture memories, specialized cache designs are utilized.
6) Hidden Surfaces & Pixel Operations
In this final pipeline stage the contribution of each pixel is calculated according to its screen position. Also in this stage the visibility of an object is determined. This is done by examining the depth (Z-test) of each object with respect to the viewer. In the case of overlapping pixels, the pixel which has the smaller distance from the viewer is the one that is written to the display.
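The per-vertex stages above are natural data-parallel kernels. As a hedged sketch (hypothetical CUDA, not actual GPU pipeline code), the model transformation stage amounts to one matrix-vector multiply per vertex, as described in stage 1:

#include <cuda_runtime.h>

// One thread per vertex: multiply each homogeneous vertex (x, y, z, w)
// by a single 4x4 transformation matrix m (row-major), as in stage 1.
__global__ void transformVertices(const float4 *in, float4 *out,
                                  const float *m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];
    out[i] = make_float4(
        m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
        m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
        m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
        m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w);
}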


Some additional graphics pipeline operations include:
- Culling: The process of removing part of an object (a group of triangles) based on a set of criteria (depth, transparency).
- Alpha Blending: The process of combining the primitives of two objects in order to create the appearance of partial transparency.

B. Graphics APIs
Graphics APIs assist in the development of graphics applications. There is a close relationship between a graphics API and the underlying hardware implementation, as the hardware is an implementation of the API-specified pipeline. Hence, the API can be viewed as a hardware abstraction level (HAL), which provides an efficient way for developers to take advantage of the hardware of any specific graphics card. However, as general purpose computing on GPUs becomes more prominent, the next generation of APIs is starting to evolve beyond the 3D graphics pipeline into a much more general programming interface [1]. There are two widely adopted APIs for writing graphics rendering applications for the GPU: OpenGL and DirectX. OpenGL is a cross-language, cross-platform open standard API. DirectX is developed by Microsoft and is part of its Windows operating systems.

C. The Evolution of GPUs
Through the history of GPUs, the basic graphics pipeline stages have evolved from purely hardwired implementations to processor-like programmable unit implementations that have revolutionized the way GPUs are used. One such evolution was the introduction of the programmable processors given below [16]:
- Vertex processor: Their purpose is to transform each input vertex's individual coordinates to a global coordinate system. They also determine the depth value of each vertex.
- Geometry processor: They add/remove vertices from a 3D mesh and are responsible for geometric transformations on the input vertices/triangles.
- Pixel/Fragment processor: They are responsible for calculating the color of each individual pixel, as well as adding effects such as lighting, shading and toning.

These programmable processors replaced most of the fixed-function stages in the graphics pipeline. These processors could be programmed with small programs called shader programs. Hence we have vertex, geometry, and pixel shaders; a type of program for each processor. The introduction of this programmable shader model allowed some GPU parts to be programmed according to a specified instruction set. The three programmable processors were responsible for implementing a specific part of the graphics pipeline; thus, each one had its own instruction set and each required its own shader program. Additionally, the programmer had to express the problem in terms of what the specific shader could process (vertices, textures, etc.). Managing three different programmable engines, however, presented a great obstacle to the efficient programming of GPUs. Consequently, GPU developers realized that more emphasis had to be given to increasing not only GPU programmability but also the ease with which the programmer can program a GPU. As a result, the three different shader types were merged into one unified shader model. The unified shader model introduced a consistent instruction set across all three processor types, so each of the programmable processors could perform most of the graphics pipeline related tasks, depending on the shader program it is running. Hence, the programmer had a large pool of programmable shaders to work with, and could adjust each processor's "job" according to the application demands. Additionally, dynamic load balancing/scheduling algorithms could be used to dynamically adjust the work amongst the processors and the distribution of each shader program, to achieve a much better utilization of the available resources. Vertex, geometry, and pixel shaders became threads, running different programs on the programmable cores. These cores are massively parallel programmable units called stream processing cores. The ATI Xenos chip [16] for the Xbox 360 was the first to introduce this concept. This new architecture provided one large pool of programmable floating-point processors, where each could be programmed to perform a certain task from the graphics pipeline. As a result it was now possible to have a much better utilization of the processing resources and load balancing, with the application being able to dynamically change the role of each processor, thus adjusting to the demands of the current task. It is the flexibility offered by these programmable processors, their increased computational power, and floating point support that has turned attention to GPUs for general purpose computing.

D. GPU Architectural Features
1) Fixed Function Processing
Fixed-function hardware is used to accelerate common 3D pipeline tasks that have an abundance of parallelism. These tasks are texturing and rasterization. For high quality images texturing is expensive and is thus handled by fixed function hardware. Interpolation and sampling are the main operations of the rasterization phase, which are very demanding even with optimization methods and thus benefit from fixed-function implementation [17].
2) Memory System
GPUs are optimized for high throughput rather than low latency, contrary to CPUs. Thus, a different approach is needed when implementing the GPU memory system. GPU memory systems must deliver high bandwidth through wide memory busses and specialized graphics DDR memory. GPUs employ small read-only caches that help in reducing the bandwidth requirements on the main memory. Also, GPUs benefit from small caches that capture spatial locality. As the number of programmable processors increases, significant amounts of on-chip storage are used to hold execution context, stream data, and temporary data [17].
3) Pipeline Scheduling and Control
GPUs use hardware logic for mapping and scheduling of computation onto chip processing resources. Hardware scheduling logic handles the assignment of computational resources to threads, keeps track of ordered processing, and discards unnecessary pipeline work. Additionally, thread management is performed entirely in hardware [17].
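To make the per-pixel operations concrete, the following hypothetical CUDA sketch applies the standard alpha blending equation, out = a*src + (1 - a)*dst, independently to every pixel. The complete independence of each pixel is what allows both the programmable pixel processors and the fixed-function blend units to process many pixels in parallel:

#include <cuda_runtime.h>

// One thread per pixel: blend a source color over a destination color
// using the source alpha channel (values assumed normalized to [0,1]).
__global__ void alphaBlend(const float4 *src, const float4 *dst,
                           float4 *out, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;
    float a = src[i].w;  // alpha of the source pixel
    out[i] = make_float4(
        a * src[i].x + (1.0f - a) * dst[i].x,
        a * src[i].y + (1.0f - a) * dst[i].y,
        a * src[i].z + (1.0f - a) * dst[i].z,
        1.0f);
}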

E. GPU Examples
Following is a description of two GPU families: the ATI Radeon X800 and the GeForce 6, both of the same generation. Tables 6 and 7 provide a historical timetable for the two main GPU families from NVIDIA and ATI respectively, along with the main features of each generation [18, 19].

TABLE 6: NVIDIA`S GEFORCE GPU SERIES
GeForce 256 (August 1999)
- First PC graphics chip with hardware implementations of transform, lighting, and shading operations
- Twin texture processor per pipeline design
- 220 nm fabrication process
GeForce2 (April 2000)
- Doubling of textures per clock compared to GeForce 256
- 180 nm fabrication process
GeForce3 (February 2001)
- First programmable Pixel and Vertex Shaders
- 180 nm fabrication process
GeForce4 (February 2002)
- Enhancements to anti-aliasing capabilities
- Improved memory controller
- 150 nm fabrication process
GeForce FX (2003)
- New Shader Model 2 specification
- Optimized version of the GeForce 3
- 130 nm fabrication process
GeForce 6 (April 2004)
- Shader Model 3.0
- High dynamic range imaging
- Scalable Link Interface
- PureVideo capability (hardware support for video processing)
- 130, 110 and 90 nm fabrication processes
GeForce 7 (June 2005)
- Last graphics card designed for the AGP bus
- Widened pipeline and an increase in clock speed
- Mostly 90 nm fabrication process
GeForce 8 (November 2006)
- First ever GPU to fully support Direct3D 10
- First GPU with a fully unified shader architecture
- Introduction of CUDA
- Introduced the single instruction multiple thread (SIMT) execution model
- 80 nm fabrication process
GeForce 9 (February 2008)
- Revisions to existing late 8-series products
- Two G92 GPU cores
- Two separate 256-bit memory busses, one for each GPU
- Support for CUDA
- 65 nm fabrication process
GeForce 100 (March 2009)
- Revisions to 9 series parts
GeForce 200 (16 June 2008)
- New GT200 core, 1.4 billion transistors
- Largest commercial GPU ever constructed; 576mm2 die surface area
- Largest CMOS-logic chip fabricated at the TSMC foundry
- Support for CUDA
- Added double precision floating point support for scientific applications
- 65 nm fabrication process
GeForce 300 – Fermi (End of 2009)
- First NVIDIA chip with DirectX 11 support
- Shader Model 5.0
- GDDR5 memory support
- 45 nm fabrication process

TABLE 7: ATI`S GPU SERIES
R100 (2000)
- New Hyper-Z technology
- Limited pixel shader programmability
- 180 nm low-k fabrication process
R200 (2001)
- ATI's first programmable pixel and vertex shader architecture
- More advanced pixel shader 1.4
- 180 nm low-k fabrication process
R300 (2002)
- First fully DirectX 9 capable consumer graphics chip
- Doubling of the pixel pipelines, and other improvements
- Mostly 150 nm fabrication process
R420 (2004)
- Extensions to the Shader Model 2 feature-set
- Increased shader programmability
- Used GDDR3
- 130 nm low-k fabrication process
R520 (2005)
- Complete Shader Model 3 support
- Enhancements including the floating point render target with anti-aliasing
- Ultra Threaded Dispatch Processor
- 90 nm fabrication process
R600 (2006)
- Direct3D 10.0 support
- Based on a VLIW unified shader architecture
- Programmable tessellation units
- Shader Model 4.0 (unified shader model)
- 80, 65 and 55 nm fabrication processes
R700 (2008)
- First GPU to support GDDR5
- Increase in stream processing cores
- 40 and 55 nm fabrication processes
Evergreen (2009)
- 1400 stream processors
- 40 nm fabrication process

1) ATI Radeon X800 Series
The Radeon X800 [20] series is the high-end graphics card solution from the R420 family of GPUs, developed by ATI in 2004. The Radeon X800 (Fig. 8) series chips were fabricated using a 130nm low-k fabrication process.

Fig. 8. Radeon X800 Block Diagram [20].
The vertex processing engine employed in the X800 series consists of six programmable vertex processors, each of which has 2 ALUs (a 128-bit floating point vector ALU and a 32-bit scalar ALU) and control flow logic. The floating point ALU in each vertex processor is designed to perform transformation operations on the input vertices' position data (coordinates and perspective). The scalar ALU is needed after the vertex shader program has finished, for removing the triangles, or parts of them, that are outside of the viewing window (or excluded by any other criterion). Once all the vertex operations are done, the Setup Engine reassembles the vertices into triangles, lines or points (the 3 primitive forms). It also assigns parameters to the newly formed triangles, such as color, depth, or texture coordinates.

The next step is to convert the triangles into pixels. This is done through the Pixel Processing Engine. There are 16 independent pixel processing pipelines in the pixel processing engine, each with its own pixel and texture processing unit, and they are separated into 4 groups called quad pipelines (each with 4 pixel & texture processors). Each pixel processor is programmable and controls the operations performed on each pixel. Each of the quad pipelines receives triangle data from the setup engine and performs an early hierarchical Z-test to discard the non-visible pixels of the triangle. Pixels that were not discarded by this procedure are caught later during Z-testing in the Hyper-Z HD module.

Rasterization takes place next, where for each pixel Z-tests, anti-aliasing and color value calculations take place. When a frame is complete it is sent to the frame buffer, where it is either used in subsequent rendering procedures or sent to the display engine for output.

The Radeon X800 memory system incorporates a 256-bit GDDR2 memory interface capable of providing up to 32 GB/s of bandwidth.

2) NVIDIA GeForce 6 Series
The GeForce 6 architecture [21] was developed at the same time as the X800, around 2004. Fig. 9 illustrates a high level block diagram for the NVIDIA GeForce 6 series GPUs. The GeForce 6 receives commands, texture and vertex data from the CPU via the host interface. A vertex fetch unit is used to read the vertices referenced by the CPU commands. The programmable vertex processors allow a shader program to be applied to each of the vertices in the object, performing transformations, skinning and other operations. Vertex shader programs can fetch texture data via a texture cache which is shared with the pixel processors. Additionally, there is also a vertex cache that stores vertex data for both the pre and post vertex processing stages. All vertices are then grouped into primitives (lines, triangles, points). The Cull/Clip/Setup unit then performs operations on these primitives to prepare them for the rasterization unit. The rasterization block that follows uses the z-cull block to discard pixels that are occluded by objects closer to the viewpoint. The fragment processor is used for pixel and texture processing. A shader program is applied to each of the pixels in order to determine specific properties such as color and depth. The fragment processor operates on squares of four pixels called quads. Moreover, it executes the same instruction for hundreds of pixels in an SIMD fashion. A texture unit is used to fetch texture data from memory, with an option of filtering the data before returning it to the pixel processor.

Fig. 9. Block Diagram of the GeForce 6 Series Architecture [21].

Pixels leave the pixel processor and move on to the Z-compare and blend unit. There, according to their depth, they are either assigned a final color and passed to the frame buffer (which contains the output frame), or discarded if they are hidden by closer pixels.

The memory system is partitioned into four banks, each with its own DRAM. All rendered surfaces are stored in the DRAM, while textures and input data can be stored either in the DRAMs or in system memory. With the four independent DRAMs the GPU is afforded a wide 256-bit bus, allowing for bandwidth up to 35 GB/s.

GeForce 6 series GPUs support operations in 16-bit and 32-bit floating point formats.
High-end GeForce 6 GPUs have up 2) NVIDIA GeForce 6 Series to six vertex processors whereas lower-end GPUs have two.

V. STREAM PROCESSING AND GENERAL PURPOSE COMPUTATION ON GRAPHICS PROCESSING UNITS
The raw computational power offered by modern GPUs, and the need to use it in computationally demanding applications, led to the evolution of the GPU into a general purpose computational engine. The evolution of the graphics pipeline from a fixed-hardware implementation into a grid of programmable processors enabled many researchers to port their algorithms and applications to the GPU.

A. The GPU as a Stream Processor
Focusing on the demands for general purpose computing, modern GPUs employ grids of massively parallel programmable stream processing cores [1] that are efficient for high performance computing, while using additional fixed function hardware for graphics processing. The incorporation of stream processing in GPUs has expanded their application span and also made it easier to program them.

Expressing the graphics pipeline as a stream program is a rather natural progression because of its inherent localities and parallelism, which make it a very good match for the stream processing model (Fig. 10) [2]. Traditionally the graphics pipeline is implemented as a series of stages with data flowing between them in a specific manner. This is quite similar to how kernels communicate and exchange information using streams. Furthermore, the producer-consumer locality found in the stream processing model is also present in the graphics pipeline, with the output of one stage being the input to the following stages. Finally, the computations performed by each pipeline stage are similar across a large amount of data (be it vertices, triangles or pixels); hence, each stage can be mapped to an individual kernel.

Fig. 10. Mapping the Graphics Pipeline to the Stream Programming Model [2].
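A minimal sketch of this mapping (hypothetical CUDA; the kernel names and the simplified mathematics are illustrative only): each pipeline stage becomes a kernel, and the stream output by one stage is the stream input of the next:

#include <cuda_runtime.h>

// Stage 1 as a kernel: project vertices to screen space (simplified;
// the homogeneous coordinate w is assumed to be nonzero).
__global__ void vertexStage(const float4 *verts, float2 *screen, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) screen[i] = make_float2(verts[i].x / verts[i].w,
                                       verts[i].y / verts[i].w);
}

// Stage 2 as a kernel: consume the screen-space stream produced by the
// previous stage and emit a color stream (a flat color, for brevity).
__global__ void shadeStage(const float2 *screen, uchar4 *color, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) color[i] = make_uchar4(255, 0, 0, 255);
}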


However, there are some differences between a stream program executed on the graphics pipeline and one executed on a general purpose stream processor. First, in the graphics pipeline every element of a stream is processed by the same units, whereas in a stream processor elements of the same stream are executed on different clusters. Second, a stream program on the graphics card has access to much less memory than on a general stream processor: a processing unit in a graphics card has access to only a few fast registers, even though graphics cards have a lot of memory.

Fig. 10. Mapping the Graphics Pipeline to the Stream Programming Model [2].

Some examples of general stream computations performed on GPUs include solving partial differential equations, motion physics simulation, and general distance field calculations. None of these applications, however, makes use of the extensive features offered by the vertex and pixel shaders; they only use basic operations to perform their calculations. One such computation is sketched below.
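For instance, a single Jacobi relaxation step for the 2D Laplace equation can be written as one such kernel. The sketch below (hypothetical code, assuming a row-major nx-by-ny grid) applies the same basic averaging operation independently to every interior grid point:

    // One Jacobi step: each interior point becomes the average of its
    // four neighbors; every grid point is an independent stream element.
    __global__ void jacobiStep(const float* u, float* uNew, int nx, int ny) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1) {
            uNew[y * nx + x] = 0.25f * (u[y * nx + (x - 1)] + u[y * nx + (x + 1)] +
                                        u[(y - 1) * nx + x] + u[(y + 1) * nx + x]);
        }
    }

    // Launched over a 2D grid of thread blocks, e.g.:
    // dim3 block(16, 16);
    // dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    // jacobiStep<<<grid, block>>>(d_u, d_uNew, nx, ny);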
A dedicated GPU stream processor example is the AMD FireStream [22], developed by ATI, which targets the stream processing/GPGPU application domain. The latest generation of these specialized stream processors provides a computational power of up to 1.2 TFLOPS with 2 GB of GDDR5 memory.

B. Graphics Processing Unit Architectures for General Purpose Computing

The graphics processing unit has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outperforms its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU (GPGPU) [1] has positioned the GPU as a compelling alternative to traditional microprocessors in future high-performance computing (HPC) systems.

The main reason for this shift is the increased programmability, computational power, and precision of modern GPUs. There is an expectation amongst the research community that the use of GPUs can potentially offer speedups of 5 up to 20 times compared to the latest generation of CPUs [5]. It must be noted, however, that not all applications benefit from parallel hardware. Even though GPU and CPU architectures are both converging towards a highly parallel multicore platform, there is always a need for a processor that is optimized for fast single thread execution. Applications with intensive control flow and unpredictable memory access patterns are not at all suited for parallel architectures. As a result, the architectures and concepts found in high performance single threaded CPUs will continue to be present in future architectures.

The latest generation of GPUs from all major graphics processor manufacturers, namely Intel, NVIDIA, and AMD, reflects the emphasis on enhanced GPU programmability and enhanced computational power for general purpose computing, in order to allow general purpose applications to harvest the computational power of GPUs. Current GPU architectures are highly optimized for HPC applications, with features that may not necessarily be needed for 3D graphics processing, such as 64-bit floating point arithmetic and flow control units. Nonetheless, these GPUs still have specialized hardware that is only used for 3D graphics processing. The rest of this section describes the latest efforts from Intel, NVIDIA, and AMD on the realization of GPUs with enhanced features for HPC applications.

1) Larrabee

Larrabee [23] is Intel's codename for a future graphics processing architecture which is based on the x86 architecture. It is a multicore architecture that aims to be a high-end GPU on a general purpose platform. Its x86 heritage makes it better suited for HPC applications than a more traditional GPU architecture. Fig. 11(a) shows the architecture of Larrabee, illustrating the ring network and CPU cores.

Larrabee differs significantly from current GPU architectures because, instead of using graphics-optimized hardware, it uses an enhanced general purpose architecture for most of the graphics processing. This gives Larrabee an advantage, as it is more suitable for general purpose computing. Larrabee includes fixed function logic for texture filtering only. Rasterization, interpolation, and alpha blending, which are traditionally implemented in hardware in GPU designs, are implemented in software in Larrabee. The software implementation of these parts allows for optimizations and additions according to the application's requirements. The software rendering pipeline in Larrabee consists of the following steps. First, each primitive set is assigned to a core, which performs operations such as vertex shading, geometry shading, and culling. After rasterization, sets of pixels are formed which are assigned to a single core for the latter stages of the rendering process, which include depth, stencil, and blending operations.

The CPU core (Fig. 11(b)) of Larrabee has two main components, a scalar unit and a vector processing unit, each with its own register file. The scalar unit is an in-order, dual-issue Pentium processor with enhanced features such as multi-threading, 64-bit extensions, and prefetching, with task and instruction scheduling arranged by the compiler. Additionally, it supports features such as page faults, subroutines, and similar mechanisms. The 512-bit vector processing unit is responsible for handling integer and floating point instructions. It operates in SIMD fashion and is fully programmable. The vector unit supports a variety of instructions, from standard logical operations to complex mathematical operations such as fused multiply-add.
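As a rough illustration of what such a wide vector operation computes, the following host-side C++ fragment (illustrative only; this is not Larrabee's actual instruction set or intrinsics) emulates, one lane at a time, a 16-wide fused multiply-add executed under a write mask:

    const int VLEN = 16;  // a 512-bit vector holds 16 32-bit floats

    // dst[lane] = a[lane] * b[lane] + c[lane] wherever the predicate mask
    // is set; masked-off lanes are left untouched.
    void vfmadd_masked(float* dst, const float* a, const float* b,
                       const float* c, const bool* mask) {
        for (int lane = 0; lane < VLEN; ++lane)
            if (mask[lane])
                dst[lane] = a[lane] * b[lane] + c[lane];
    }

In the real hardware all 16 lanes are computed in a single operation, and the mask is what the text below refers to as predication.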

There is also support for gather and scatter memory addressing, and predication for all vector instructions. Larrabee uses a two-level cache hierarchy: a dedicated L1 cache per core, and a shared coherent L2 cache in which each core has its own subset. The instruction set of Larrabee includes instructions for explicit cache control and explicit prefetching.

The communication between the cores, the L2 caches, and any other component is facilitated by a 512-bit wide (per direction) bi-directional ring network. For more than 16 cores, multiple short linked rings are used. The ring network provides a platform for the cores to communicate with their L2 caches and also assists in data coherency. Finally, it allows for communication between the fixed function logic, the cores, the L2 caches, and memory.

Larrabee supports four threads of execution with separate register files. Thread switching occurs in the presence of stalls and long memory latencies.

Fig. 11. (a) Larrabee manycore architecture; the CPU cores are connected by a bidirectional ring network [23]. (b) Block diagram of a Larrabee core, consisting of L1 and L2 caches and the two main processing units, the scalar unit and the vector unit [23].

Performance studies on well known games have shown that rendering performance increases nearly linearly with the number of cores, with dynamic load balancing being very important to achieving high average performance and bandwidth. This, in combination with the increased programmability of Larrabee's cores, provides a great deal of flexibility. Theoretically, Larrabee can achieve up to 2 TFLOPS of computing power with 32 cores. Many C/C++ applications that run on existing x86 PCs can be recompiled and executed on Larrabee with few modifications. This offers high application portability and will assist in boosting the performance of compute intensive applications.

Larrabee provides opportunities in a wide range of applications, both in high performance computing and in graphics rendering. Its x86-based architecture, along with its multicore implementation, provides both high performance and flexibility.

2) NVIDIA CUDA

CUDA (Compute Unified Device Architecture) is a hardware/software architecture from NVIDIA that enables its GPUs to be programmed with high level programming languages such as C or C++. It was first introduced in the GeForce 8800 and has been applied in many scientific applications [1], including medical imaging, fluid dynamics, and numerical linear algebra.

From a software point of view, CUDA organizes programs into a host program, consisting of sequential threads running on the host CPU, and parallel kernels that execute in parallel on the GPU. These kernel threads are organized in thread blocks and grids of parallel thread blocks. A thread block is a set of concurrently executing threads that can cooperate through synchronization mechanisms and shared memory. Each thread in a thread block has its own context (program counter, registers, etc.) and executes an instance of the kernel. A grid is an array of thread blocks that execute the same kernel and read/write results to the same memory [24].

From a hardware point of view, the hierarchy of threads just described maps to a hierarchy of stream processors on the GPU. The hardware arrangement of the CUDA architecture is shown in Fig. 12. The GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores, together with other specialized cores within each streaming multiprocessor, execute threads. Each streaming multiprocessor executes groups of 32 threads called warps. Warps are scheduled by special units in the streaming multiprocessors, the warp schedulers, without any overhead; several warps execute concurrently by interleaving their instructions.

Fig. 12. Illustration of a streaming multiprocessor consisting of 32 CUDA cores and other special purpose hardware. A CUDA core consists of floating point and integer units along with supporting hardware. This streaming multiprocessor architecture is used in the Fermi GPU [26].
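The thread hierarchy and block-level cooperation described above look as follows in practice. This is a minimal hypothetical example (the kernel and variable names are assumptions): a grid of thread blocks, per-block shared memory, and a block-wide barrier:

    // Each block stages a tile of the input in shared memory, synchronizes,
    // then each thread writes one scaled element.
    __global__ void scaleKernel(const float* in, float* out, float s, int n) {
        __shared__ float tile[256];                     // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                                // block-wide barrier
        if (i < n) out[i] = s * tile[threadIdx.x];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        int threads = 256;                              // threads per block
        int blocks  = (n + threads - 1) / threads;      // blocks per grid
        scaleKernel<<<blocks, threads>>>(in, out, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }

On the hardware, each 256-thread block is split into eight 32-thread warps that the warp schedulers interleave on the streaming multiprocessor.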


a) NVIDIA Tesla

Tesla [25] is a unified graphics and general purpose computing family of GPUs from NVIDIA. It is based on the CUDA architecture, and its primary target is to provide high performance computing for applications that require large scale simulations and calculations. The processing capabilities of Tesla GPUs derive from many sources. Some of them, like the texture and rasterization units, serve mainly the purposes of graphics processing. For general processing, the main units are the streaming processor cores (SPs) and the special function units (SFUs) within each streaming multiprocessor (SM). Each SP can perform two floating point operations per cycle, and each SFU is equipped with four multipliers and hence can perform four multiplication instructions per cycle. This gives 16 operations per cycle for the 8 SPs and 8 operations per cycle for the two SFUs in a single streaming multiprocessor. If an application can make both types of units work simultaneously, it can achieve 24 operations per cycle per SM. For a 1.5 GHz GPU with 16 SMs this gives 576 GFLOPS.
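As a worked check of this peak figure, using only the numbers quoted above:

    \begin{align*}
    \text{ops/cycle/SM} &= \underbrace{8~\text{SPs} \times 2}_{16} + \underbrace{2~\text{SFUs} \times 4}_{8} = 24,\\
    \text{peak} &= 24~\text{ops/cycle/SM} \times 16~\text{SMs} \times 1.5~\text{GHz} = 576~\text{GFLOPS}.
    \end{align*}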

The performance obtainable in real applications is, however, much smaller. One of the reasons for lower performance can be the limitations caused by slow memory accesses. The memory system of the Tesla architecture consists of even more layers than that of a CPU. Apart from the main memory of the host system, there is a DRAM card memory, and there are two pools of cache memory: one within each SM, called shared memory, and a second within each texture unit. Both caches can be utilized by scientific applications. An important feature of the architecture is that each thread issues its own memory requests. However, performance is best when requests can be blocked together to access contiguous memory areas, as the sketch below illustrates.
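The contrast between blocked and scattered requests can be shown with two hypothetical kernels. In the first, consecutive threads read consecutive addresses, so a warp's requests can be merged into a few wide transactions; in the second, the stride defeats this merging and each request tends to become a separate transaction:

    // Coalesced: thread i touches element i, so neighboring threads in a
    // warp read a contiguous block of memory.
    __global__ void coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighboring threads touch addresses far apart, so the
    // hardware cannot block the warp's requests together and effective
    // bandwidth drops sharply.
    __global__ void strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }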

b) NVIDIA Fermi

Fermi [26], [27] is the latest generation of graphics processing units from NVIDIA that is based on CUDA, and it is the first GPU architecture designed specifically for supercomputing. Fermi boasts a great set of improvements compared to the previous generations of GPUs. It is implemented with 3 billion transistors; it has 2 times more cores, 8 times the peak double precision floating point performance, and 20 times faster thread context switching. Fermi marks a major departure from past GPU trends towards general purpose programmability: the GPU gains support for advanced control flow mechanisms such as indirect branches and fine-grained exception handling.

In the Fermi architecture (Fig. 13) there are 16 streaming multiprocessors. Each streaming multiprocessor includes 32 CUDA cores, 16 load/store units, four special function units, and a 32K-word register file, along with thread control logic. In total, Fermi features up to 512 CUDA cores. Each CUDA core has a fully pipelined integer ALU and a floating point unit, and executes one floating point or integer instruction per clock for a thread. The Fermi architecture implements the new IEEE 754-2008 floating point standard, providing fused multiply-add capabilities for both single and double precision.

Each streaming multiprocessor has a total of 64KB of on-chip memory. This memory can be configured either as 48KB of shared memory with 16KB of L1 cache, or as 16KB of shared memory with 48KB of L1 cache. An additional 768KB is used as a unified L2 cache that services all load, store, and texture requests. The reconfigurable scheme for the shared memory and L1 cache of each streaming multiprocessor benefits both types of programs: those that make extensive use of shared memory and those that make extensive use of the L1 cache.

Fermi is the first GPU to support ECC protection of data in registers, shared memories, L1 and L2 caches, and DRAM, combining high performance with reliability. This is a highly desired feature for the high performance computing environments for which Fermi is intended. The Fermi architecture could possibly be the first complete GPU computing architecture, delivering a high level of double-precision floating point performance from a single chip with a flexible, error protected memory and support for high level languages.

Fig. 13. Fermi's architecture [26] consists of 16 streaming multiprocessors, a shared L2 cache, and DRAM modules.
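In CUDA terms, a program can hint the shared memory/L1 split described above on a per-kernel basis through the runtime's cudaFuncSetCacheConfig call. The kernel below is a hypothetical placeholder, but the API and enum values are the CUDA runtime's own:

    __global__ void myKernel(float* data) { /* shared-memory-heavy work */ }

    void configureCaches() {
        // Prefer 48KB shared memory + 16KB L1 for this kernel:
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // Alternatively, prefer 16KB shared memory + 48KB L1:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    }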
3) AMD Fusion

Fusion is a future generation of microprocessors developed by AMD that merges general purpose execution with 3D rendering into a single architecture. As such, AMD Fusion is a heterogeneous multicore architecture that combines general purpose processor cores and graphics cores. The Fusion series will introduce a new modular design methodology that will allow future heterogeneous multi-core systems to support a wider range of combinations. It will also feature the implementation of a universal video decoder in hardware. AMD Fusion is expected to be released sometime in 2011 [28].

VI. CONCLUSIONS: LOOKING FORWARD

Stream processors and GPUs are very promising parallel architectures that have been deployed in media and scientific applications with success. By favoring throughput over latency, these processors manage to provide GFLOPS and even TFLOPS of performance, and have thus sped up many parallel applications. As their programming models and development tools mature, more performance benefits will come from these two processor architectures. However, there are many challenges that lie ahead for computer architects in designing the next generation of stream processors and GPUs. How can more resources be used to increase their performance and functionality? How can architectural techniques, combined with circuit techniques, reduce the power consumption of future parallel processors? Can increased performance be accompanied by increased programmability? These are challenges that must be dealt with in order to continue increasing the performance of processors in each generation. At a higher level, programming models, languages, programmers, and evaluation tools also need to adapt and evolve with the parallel architectures in order to make efficient utilization of the underlying hardware.
REFERENCES

[1] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, 2008.
[2] M. Pharr and R. Fernando, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley Professional, 2005.
[3] W. J. Dally, U. J. Kapasi, B. Khailany, J. H. Ahn, and A. Das, "Stream processors: programmability and efficiency," ACM Queue, 2004, pp. 52-62.
[4] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, "Programmable stream processors," IEEE Computer, August 2003, pp. 54-62.
[5] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, "A survey of general-purpose computation on graphics hardware," in Eurographics 2005, State of the Art Reports, August 2005, pp. 21-51.
[6] http://cva.stanford.edu/projects/imagine/, Imagine, Stanford University, retrieved 8 Oct 2009.
[7] U. J. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany, "The Imagine stream processor," in Proc. 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2002, pp. 282-288.
[8] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang, "Imagine: media processing with streams," IEEE Micro, March/April 2001, pp. 35-46.
[9] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das, "Evaluating the Imagine stream architecture," in ISCA 2004, 2004, pp. 14-25.
[10] W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J. H. Ahn, J. Gummaraju, M. Erez, N. Jayasena, I. Buck, T. J. Knight, and U. J. Kapasi, "Merrimac: supercomputing with streams," in SC 2003, November 2003.
[11] M. Erez, N. Jayasena, T. J. Knight, and W. J. Dally, "Fault tolerance techniques for the Merrimac streaming supercomputer," in SC 2005.
[12] X. Yang, X. Yan, Z. Xing, Y. Deng, J. Jiang, and Y. Zhang, "A 64-bit stream processor architecture for scientific applications," in Proc. 34th Annual International Symposium on Computer Architecture, 2007, pp. 210-219.
[13] SPI, "Stream processing: enabling the new generation of easy to use, high-performance DSPs," White Paper, June 2008.
[14] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles, "Exploring the VLSI scalability of stream processors," in Proc. 9th Symposium on High Performance Computer Architecture, Anaheim, California, USA, February 2003, pp. 153-164.
[15] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, vol. 49, no. 4/5, 2005.
[16] D. Luebke and G. Humphreys, "How GPUs work," Computer, vol. 40, no. 2, Feb. 2007, pp. 96-100.
[17] K. Fatahalian and M. Houston, "GPUs: a closer look," Queue, vol. 6, no. 2, pp. 18-28, 2008.
[18] http://www.nvidia.com/object/geforce_family.html, NVIDIA, retrieved 15 Nov 2009.
[19] http://www.amd.com/uk/Pages/AMDHomePage.aspx, ATI, retrieved 15 Nov 2009.
[20] ATI, "ATI Radeon X800 Architecture," White Paper.
[21] E. Kilgariff and R. Fernando, "The GeForce 6 series GPU architecture," in GPU Gems 2, Addison-Wesley Professional, 2005.
[22] http://ati.amd.com/products/streamprocessor/specs.html, AMD, retrieved 13 Nov 2009.
[23] L. Seiler et al., "Larrabee: a many-core x86 architecture for visual computing," ACM Transactions on Graphics, 2008.
[24] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, 2008.
[25] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: a unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[26] NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," White Paper.
[27] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture," White Paper, 2009.
[28] http://sites.amd.com/us/fusion/Pages/index.aspx, AMD, retrieved 16 Nov 2009.