COMPARISON BETWEEN RESEARCH DATA PROCESSING CAPABILITIES OF AMD AND NVIDIA ARCHITECTURE-BASED GRAPHIC PROCESSORS

V. A. Dudnik,* V. I. Kudryavtsev, S. A. Us, M. V. Shestakov
National Science Center "Kharkov Institute of Physics and Technology", 61108, Kharkov, Ukraine
(Received January 26, 2015)

A comparative analysis has been made of the potentialities of the hardware and software tools of the two most widely used modern graphic processor architectures (AMD and NVIDIA). Special features and differences of the GPU architectures are exemplified by fragments of GPGPU programs. Time consumption for program development has been estimated. Some advice is given as to the optimum choice of GPU type for speeding up the processing of scientific research results. Recommendations are formulated for the use of software tools that reduce the time of GPGPU application programming for the given types of graphic processors.

PACS: 89.80.+h, 89.70.+c, 01.10.Hx

*Corresponding author. E-mail address: [email protected]
ISSN 1562-6016. PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, 2015, N3(97). Series: Nuclear Physics Investigations (64), p.148-153.

1. INTRODUCTION

The development of various graphic processor architectures reached its maximal variety in the nineties of the last century. At that time, a great many companies in the computer graphics market (S3 Graphics, Matrox, 3D Labs, Cirrus Logic, Oak Technology, Realtek, XGI Technology Inc., Number Nine Visual Technologies, etc.) offered their own graphical processor architectures. However, by the present time, as a result of stiff competition, out of this variety of proposed architectures the two architectures of the companies NVIDIA and AMD have taken the key positions in the market. Yet, when considering the modern GPU structures of the AMD and NVIDIA processors, one can notice that there is more similarity than difference between them (and this despite severe architectural competition). The reasons for this outcome of GPU development are similar to those in the development of other high-technology products (automobiles, aircraft, etc.): in the process of engineering design improvement there is a constant interchange, borrowing and use of the most successful ideas, with the result that the competing companies arrive at very similar engineering solutions.

2. AMD AND NVIDIA PROCESSOR STRUCTURE

The similarity between the architectures of AMD and NVIDIA processors can be explained by the fact that from the outset the GPU microarchitecture was organized quite differently from that of an ordinary CPU. At the very beginning of GPU development, graphics tasks implied independent parallel data processing, and therefore the GPU architecture, unlike the CPU architecture, was multithreaded from the start. Besides, the fundamental principles of GPU organization were initially common to the video accelerators of all manufacturers, because they had a single target task, namely, shader programs. Therefore, the general GPU structure of different manufacturers differed only slightly; the differences concerned the details of microarchitecture realization.

The internal organization of the GPU is similar for both the AMD architecture and the NVIDIA architecture, i.e., it consists of a few tens (30 for NVIDIA GT200, 20 for Evergreen, 16 for Fermi) of central processing elements, referred to in the NVIDIA nomenclature as Streaming Multiprocessors, and in the ATI terminology as SIMD Engines (miniprocessors). They can simultaneously perform a set of computational processes, i.e., threads. Each miniprocessor has a local storage of size 16 KB for GT200, 32 KB for Evergreen and 64 KB for Fermi (in actual fact, a programmable L1 cache). The local storage is common to all the threads executed in the miniprocessor. Apart from the local storage, the miniprocessor also has another storage area, approximately four times larger in capacity in all the architectures under consideration. It is shared in equal parts by all the executing threads; these are registers for storing temporary variables and intermediate computation results.

Each miniprocessor has a great number of computation modules (8 for GT200, 16 for Radeon and 32 for Fermi), but all of them can carry out only the same instruction, with one and the same software address. The operands, however, can differ in this case, each thread having its own. For example, the instruction "add the contents of two registers" is executed simultaneously by all computing devices, but the registers used are different. If, however, because of program branching, the threads diverge in their path through the code, then so-called serialization takes place. That is, not all computation modules are involved, because the threads supply different instructions for execution, while the block of computation modules can carry out, as stated above, only an instruction with one address. The computational efficiency in this case falls below the maximum capability.

Another peculiarity of the GPU in comparison with the CPU is the absence of a stack. On account of the great many simultaneously executing threads, the GPU does not provide a stack which might store function parameters and local variables (10000 simultaneously running threads would call for an enormous storage). Therefore, there is no recursion in the GPU, and instead of a call, the functions are substituted (inlined) during compilation into the GPGPU program code. AMD obtained full-scale support for general-purpose computations starting with the Evergreen family, where the DirectX 11 specifications were also first realized. Of importance is AMD's use, from the beginning, of the technology named VLIW (Very Long Instruction Word). The AMD graphics processors have the same number of registers as, for example, the GT200, the difference being that these are 128-bit vector registers (e.g., using one single-cycle instruction, the number a1×b1 + a2×b2 + a3×b3 + a4×b4 or the two-dimensional vector (a1×b1 + a2×b2, a3×b3 + a4×b4) can be computed). Until recently, NVIDIA supported only simple scalar instructions operating on scalar registers, realizing a simple classical RISC; but since 2009 the NVIDIA programmers have presented a further development of the CUDA platform, namely, the Fermi architecture and, further, Kepler. The support of single-precision and double-precision floating point computations realized in the new architecture was one of the key demands of the high-performance computing market.

3. DIFFERENCES BETWEEN AMD AND NVIDIA MICROSTRUCTURES

Here we mention the principal potentialities of the advanced technical solutions of the NVIDIA company, which specified the main differences between the GPU architectures proposed at present by AMD and NVIDIA. These differences are due to the initial tendency of using NVIDIA graphic processors not only for processing computer graphics: NVIDIA also marketed its architecture as a means for high-performance computing. That implied both a high speed of computation operations and high reliability combined with high programmability. Therefore, the latest realizations of GPU by NVIDIA have incorporated the possibility of finding and correcting errors in the operating storage and cache-memory subsystems. That provided fault tolerance and a reliability of computational algorithms comparable with the CPU.

Essential modifications have been made in the memory structure of NVIDIA graphic processors. The appearance of the L2 cache in the Fermi architecture has radically (by a factor of a few tens) accelerated operations with memory, thereby enabling the organization of an internal memory manager which is sufficiently rapid to allocate memory with the C-language memory control functions (malloc/free) for each thread in the course of operation. Paralleling of global video memory access has been improved, and a general address space has been realized for all the memory of CPU and GPU, thereby making it possible to unite in a single address space all the memory visible to the computational thread. Besides, the changes in the memory structure have provided a way of using recursive functions (Fig.1).

Fig.1. NVIDIA GPU structure

The simultaneous optimized execution of several kernels, realized in the NVIDIA architecture, permits the organization of simultaneous running of several CUDA functions of the same application, provided that one CUDA function cannot fully load the computational capability of the GPU (the GPU analogue of the multitask mode of multi-core CPUs). Two interfaces for copy operations, realized in the NVIDIA GPU, have provided practically twice as fast data exchange, due to simultaneous copying of data from the CPU memory to the GPU and vice versa (from the GPU memory to the CPU memory).

Unlike the NVIDIA GPU, new models of AMD video adapters differ from the previous versions practically only in quantitative characteristics. They are still based on the dated architecture Graphics Core Next (GCN) Tahiti. This architecture forms the basis for all present-day engineering solutions of the company, and even the latest graphic chip, Hawaii, differs from Tahiti only in a greater number of execution devices and in some computational power modifications (e.g., support of a greater number of simultaneously executed instruction streams), as well as in support of some DirectX 11.2 options and in the improved AMD PowerTune technology. The present-day AMD graphic processors still have only one global read-write memory, and many different sources of texture memory and constant memory, both being read-only. The peculiarity of the constant memory is the availability of caching at access by all GPU streams to one data area (it operates as quickly as the registers do).

...reasonable, considering that the VLIW instructions are rather large in size. The AMD structure also provides sections for the memory access instructions. In the NVIDIA GPU the instructions are arranged ac-