TECHNICAL PAPER

Parallelism, Compute Intensity, and Data Vectorization: The CRAY APP

Bradley R. Carlile

Cray Research Superservers, Inc. 3601 SW Murray Blvd., Beaverton, Oregon 97005 [email protected] (503) 641-3151 (phone); (503) 641-4497 (fax)

Draft submission to: SuperComputing '93, Portland, November 1993.

Abstract

High performance on parallel algorithms requires high delivered memory bandwidth, fast computations, and minimal parallel overheads. These three requirements have far reaching ramifications on complete system design and performance. To satisfy the high computation rates of parallel programs, memory inefficiencies can be avoided by using knowledge of the application's data access patterns and the interaction of computations and data movement. Compute intensity (the ratio of compute operations to memory accesses required) is central to understanding parallel performance. Several other characteristics of parallel programs and techniques to exploit them will be discussed. One of these techniques is data vectorization. Data vectorization focuses vectorization techniques on the data movement in a code section. This and other techniques have been realized in the hardware and software design of the CRAY APP shared-memory system.

The CRAY APP is a general purpose, 84 processor, multiple instruction multiple data (MIMD) shared-memory system [2] [19]. Utilization of commodity processors allows it to be a very cost effective machine. It is a multi-user compute server programmed using autoparallelizing FORTRAN or C in a Unix environment [1]. The peak performance is 6.7 Gflops for 32-bit computations and 3.4 Gflops for 64-bit computations. The CRAY APP was designed as a production machine with an emphasis on ease-of-use.

The CRAY APP uses commercial processors that can issue multiple pipelined instructions to deliver fast computations in parallel programs. Loops are optimized on multiple instruction issue processors using software pipelining techniques [12] [23]. Software pipelining allows multiple instruction issue processors to be viewed as efficient programmable vector processors.

1.0 Introduction

High performance on parallel programs depends on the following requirements:

1) High memory bandwidth
2) Fast computations
3) Minimal parallel overheads

These three requirements have far reaching ramifications for performance, ease of programming, programming model, optimization techniques, and suitable types of applications. An understanding of these requirements and a hardware/software codesign process has led to the development of the CRAY APP shared-memory system. Shared-memory systems do not have split address spaces like distributed memory machines that require careful data distribution for performance. In addition, automatic parallelization for shared-memory machines is a maturing technology.

The key to understanding high performance system design is understanding the characteristics of the important user applications. Memory usage is one of the most critical and often overlooked characteristics of programs. This is becoming more critical as the gap between processor speed and memory speed grows [9]. The memory bandwidth of a system is also a major contributor to the price point of a system. At any particular memory bandwidth, efficient use of memory bandwidth can provide higher performance than higher memory bandwidths that are used inefficiently. This paper will focus on several aspects of memory usage and some parallel issues.

2.0 Memory Bandwidth

Memory bandwidth is directly related to performance. The relationship between compute operations and data required is called Compute Intensity [10] [11]. Others have subsequently also defined the reciprocal of compute intensity as R [6]. Compute intensity is defined as follows:

Compute Intensity = Number of Operations / Number of Data Words Accessed    (1)

For numerical computations, the operation count is usually counted in terms of floating-point operations. Of course, it is equally valid to use an integer operation count for integer dominated computations. Most applications have high compute intensity. High compute intensity is often found in nested loops that reuse data. It is also found in calculations that perform complicated operations on data.

The Compute Intensity of an algorithm can be used to determine the performance bound of an application on a given memory system. This estimate is based on delivered memory bandwidth.

Performance = Intensity × Memory Bandwidth    (2)

or

Operations/Second = (Operations/Data Word) × (Data Words/Second)    (3)

This formula gives the maximum performance that the memory system can sustain for a given application. Even though this formula is completely independent of the floating-point processing capabilities of a given machine, it can often be a better measure. A different focus or a different algorithm implementation can often greatly increase the realized compute intensity of an application. Increases in compute intensity will be reflected in higher execution performance on any memory bandwidth.

Most applications have a great deal of compute intensity. Even for small data sizes, many important algorithms exceed the design point of current small or large scale architectures. Most architectures have a much higher performance potential based on memory bandwidth. For example, one dimensional FFTs of length 2k have a compute intensity of 13.75 (see Table 1.). Using this compute intensity and equation (2) one could support 220 Gflops on the memory bandwidth of a CRAY Y-MP/C90 (16 Gigawords/s to vector units). However, for this algorithm the performance is limited to less than the peak computational rate of 16 Gflops. A compiler could produce a code that has a compute intensity of 1.0 and still achieve maximum performance.

Another way to estimate performance is to base it on the number of memory accesses required. Sometimes it is easier to estimate the required data accesses than the compute intensity. This estimate is most accurate when the memory bandwidth of an application is saturated.

Time = Memory Accesses / Memory Bandwidth    (4)

Either equation (2) or (4) can be used to determine the percentage of memory bandwidth achieved on a given application. The percentage of memory bandwidth delivered is a particularly helpful metric when optimizing the performance of an application. The CRAY APP often delivers 60-90% of total memory bandwidth during the execution of parallel programs.

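As a concrete illustration of equations (2) and (4), the following minimal C sketch turns a compute intensity and a delivered memory bandwidth into a performance ceiling and a data-movement time; the 13.75 ops/word and 16 Gword/s figures are simply the FFT and CRAY Y-MP/C90 numbers quoted above, and the access count is an invented example value.

    #include <stdio.h>

    int main(void)
    {
        double ops_per_word    = 13.75;    /* compute intensity, e.g. a 2k-point complex FFT */
        double words_per_sec   = 16.0e9;   /* delivered memory bandwidth in words/second     */
        double memory_accesses = 4.0e6;    /* data words moved by the loop (example value)   */

        /* Equation (2)/(3): performance bound = intensity x memory bandwidth */
        double flops_bound = ops_per_word * words_per_sec;

        /* Equation (4): time needed just to move the data */
        double move_time = memory_accesses / words_per_sec;

        printf("performance bound  = %.1f Gflops\n", flops_bound / 1.0e9);  /* 220.0 Gflops */
        printf("data movement time = %.3f ms\n", move_time * 1.0e3);
        return 0;
    }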
Relative to the problem size, most algorithms have either constant compute intensity, log growth compute intensity, or linear growth compute intensity. Current cache-based processors have enough on-chip storage to often realize moderate compute intensities of 4 to 30 for a wide variety of applications. Our experience is that half of the applications have loops with constant compute intensity with moderate values. Table 1. contains an example of each of these classes of compute intensities.

Algorithm         Operation Count   Data Words Used   Compute Intensity (Ops/Word)
Sine              23N               2N                11.5
Complex 1D FFT    5N log2 N         4N                (5/4) log2 N
Real Solver       (2/3) N^3         2N^2              (1/3) N

Table 1. Compute Intensities of Basic Algorithms

The compute intensity in an application will often be different for each basic code block (loop, nested loops, conditional, etc.) within an application. The compute intensity of each basic block is dependent on system architecture and the compiler's optimization strategy. If the program consists of a linear sequence of basic blocks with different compute intensities, then a realized compute intensity, IR, for the entire sequence is the weighted average of the compute intensities of each block, Ib, multiplied by the percentage of work in each basic block, Pb.

IR = ∑ (Ib × Pb)  for b = 1, ..., n    (5)

The compute intensity and the percentage of work in each basic block are often dependent on the problem size of an application. Frequently, the compute intensity will grow with an increase in problem size.

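To make equation (5) concrete, the short sketch below computes a realized compute intensity for a hypothetical three-block program; the block intensities and work fractions are invented for illustration only.

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical per-block compute intensities (ops/word) and work fractions */
        double Ib[] = { 2.0, 12.0, 30.0 };   /* e.g. setup loop, stencil loop, dense kernel */
        double Pb[] = { 0.20, 0.50, 0.30 };  /* fraction of total work in each block        */
        int n = 3;

        double IR = 0.0;
        for (int b = 0; b < n; b++)
            IR += Ib[b] * Pb[b];             /* equation (5): weighted average              */

        printf("realized compute intensity = %.1f ops/word\n", IR);   /* 15.4 */
        return 0;
    }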
It is helpful to define another ratio called leverage to quantify the data movement in a particular implementation. Leverage is defined as follows:

Leverage = Compute Time / Data Movement Time    (6)

Leverage is directly related to compute intensity on a given machine. Compute time is related to the operation count and data movement time is related to the number of data points involved in the computation. An algorithm with a high compute intensity will often have a high leverage. However, it is possible to have an algorithm with a low compute intensity and a high leverage. This results either when a calculation takes a long time to perform the floating point operations or when many non-floating-point operations are performed.

Leverage can be used to explain how several processors can work in parallel to saturate the available memory bandwidth. For example, if a particular loop has a leverage of 11 it will spend only 9% of its execution time moving data. If the computation is in a parallel region of code, eleven processors could be computing while one processor is moving data. In this way, twelve processors can saturate the memory bandwidth and maximize the performance achieved on the memory system. Coupled with a software strategy, this kind of arrangement allows multiple processors to effectively share a common bus. This concept is basic to the CRAY APP architecture.

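A small sketch of this bus-sharing argument, with illustrative numbers only: given a loop's leverage, it estimates how many processors can overlap their computation behind one processor's data movement before the bus saturates.

    #include <stdio.h>

    int main(void)
    {
        double compute_time = 11.0;    /* time a processor spends computing on one strip */
        double move_time    = 1.0;     /* time it spends loading/storing that strip      */

        double leverage = compute_time / move_time;     /* equation (6)                  */

        /* While one processor owns the bus, roughly 'leverage' others can be computing,
           so about leverage + 1 processors keep the bus continuously busy.              */
        int procs_per_bus = (int)leverage + 1;

        printf("leverage = %.1f, processors that can share one bus = %d\n",
               leverage, procs_per_bus);                /* 11.0 and 12                   */
        return 0;
    }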
The next sections explore techniques to maximize performance by exploiting the concept of compute intensity on parallel computers composed of commercial microprocessors.

2.1 Caches and Compute Intensity

In order to take advantage of compute intensity, some form of local storage close to the computational unit is required. On current microprocessors, caches are typically used as this local storage. Caches have been shown to be effective on many "cache-friendly" applications and are widely used. The dynamics of the computer marketplace have driven many to optimize cache designs for small dusty-deck codes like the SPECint92 and SPECfp92 benchmarks [22]. These low compute intensity benchmarks are small enough to almost fit in current caches. Therefore, they do not accurately reflect the performance of large-scale programs that require interaction with memory.

The memory access patterns of large scale technical computing have different characteristics that may result in poor memory bandwidth utilization and thereby degrade compute performance when implemented on cache-based systems. The problems can be grouped into the three basic categories of cache miss handling (MISS), bandwidth shortcomings (BW), and latency issues (LAT). These are shown with the associated causes in the table below.

Cache Problem (type)     Line Size   Write Policy   Set-Associativity   Miss Penalty
Non-Stride-1 Slow (BW)   yes         yes            no                  no
Over-fetch (BW)          yes         no             yes                 yes
Write BW Waste (BW)      yes         no             no                  yes
Interference (MISS)      yes         no             yes                 yes
Miss Stalls (MISS)       yes         yes            no                  no
Latency Variance (LAT)   no          yes            no                  yes

Table 2. Cache Problems and Causes

Losing memory bandwidth is a chief concern in these systems since the delivered "cache-friendly" stride-1 data fetching memory bandwidth of current commercial microprocessors is limited to only 30-70% of memory bandwidth. These inefficiencies are due to the intrinsic cache missing mechanism. The i860 processor has a special load and store mechanism that allows it to achieve 99% of memory bandwidth (this will be discussed later).

Processor        Delivered BW (MB/s)   Max BW (MB/s)   % Max   % Max Special Load
Intel i860 XR    91                    160             57%     99%
Intel i860 XP    228                   400             57%     99%
HP 9000/735      264                   528             50%     no
IBM RS6000-950   114                   400             29%     no
DEC Alpha        ?                     ?               ?       no
IBM Power PC     ?                     ?               ?       no
Intel Pentium    ?                     ?               ?       no
Motorola 68060   ?                     ?               ?       no

Delivered Memory to Cache Bandwidth

Realized memory bandwidth is further degraded when strided data is referenced. In this case, a full cache line is brought into the cache even if only one word of the cache line is used. Additionally, this effectively decreases the cache size and increases the cache line replacement penalties.

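As a rough illustration of this over-fetch effect (not a model of any particular cache), the fraction of fetched bandwidth that is actually used falls off directly with the access stride; the 8-word line size below is an assumption.

    #include <stdio.h>

    int main(void)
    {
        int words_per_line = 8;    /* assumed cache line of 8 words (64 bytes)        */
        for (int stride = 1; stride <= 8; stride *= 2) {
            /* a strided loop touches one word out of every 'stride' words, but the
               cache still fetches whole lines                                        */
            int words_used = (stride < words_per_line) ? words_per_line / stride : 1;
            double useful = (double)words_used / (double)words_per_line;
            printf("stride %d: %.0f%% of fetched bandwidth is useful\n",
                   stride, 100.0 * useful);
        }
        return 0;
    }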
Many caches employ write-through policies that can dramatically increase memory bandwidth requirements. Write-back caches tend to decrease this problem at the expense of a more complicated implementation.

Cache missing can be another source of performance loss. Cache missing typically causes a CPU to stall while it waits to access the required data. Caches lose their advantages if the rate of missing increases dramatically. Data accessed in regular patterns might seem to be perfect to avoid interference; however, if these regular accesses interfere, the interference is regular and severe [16]. In addition, items stored in cache have a much lower latency than items stored in memory. It can not be known if a data element is in cache or memory until run time. The difference in latency and its uncertainty makes it difficult for a compiler to optimally schedule instructions at compile time.

The three strategies that have been proposed to augment caches to minimize these problems are prefetch operations, multiple levels of cache, and data vectorization [2] [1] [19]. The code optimization technique of vectorization consists of the two distinct components of Data Vectorization and Code Vectorization. Data Vectorization is the user or compiler insertion of code templates or instructions that deal with data movement between memory and vector registers (local storage). Code vectorization is the user or compiler insertion of code templates or hardware vector instructions to replace computations in scalar loops.

Cache Enhancements   Stride-1 Slow   Write BW Waste   Interference   Miss Stalls   Latency Variance   Over-fetch
Data Vectorize       no              no               no             ?             no                 no
Pre-fetch            ?               no               yes            ?             yes                yes
Multiple Caches      yes             no               yes            ?             yes                yes

Cache Enhancements Effect on Cache Problems

Of these three solutions only data vectorization provides full performance, maximum memory bandwidth, latency tolerance, and is easy to compile. In addition, data vectorization can be implemented without removing the advantages of caches. Data vectorization techniques provide higher performance than relying on caches [19] or adding localization techniques. Localization [15] re-arranges code and data to improve cache performance.

Blocking the data improves the performance of an algorithm when using the normal caching mechanism or when using data vectorization. However, the normal caching version will still suffer performance loss due to the reasons discussed above. Figure 1. shows the performance of matrix multiply with the kernel blocked at 24 by 24 elements with and without data vectorization. Blocking causes the sawtooth in the performance curve. The reduced bandwidth and the interference problems are shown in the normal cache version. Higher performance is seen on the data vectorized version when using an optimal blocking factor.

[Figure 1 plots single processor matrix multiply performance (Mflops, 16-40) against matrix size (50 to 750) for three cases: APP Optimized Data Vectorization, APP Data Vectorization, and APP Normal Caching.]

Figure 1. Single Processor Matrix Multiply, Blocking = 24

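For reference, a minimal sketch of the kind of 24 by 24 blocking referred to in Figure 1; this is a generic blocked matrix multiply in C, not the CRAY APP library code, and the block size is simply the figure's value.

    #define BLK 24   /* blocking factor from Figure 1, sized so the working blocks fit in cache */

    /* C = C + A*B for n x n row-major matrices; the loops are tiled so each BLK x BLK
       block of A, B, and C is reused from cache many times before it is evicted.       */
    void matmul_blocked(int n, const float *A, const float *B, float *C)
    {
        for (int ii = 0; ii < n; ii += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int jj = 0; jj < n; jj += BLK)
                    for (int i = ii; i < ii + BLK && i < n; i++)
                        for (int k = kk; k < kk + BLK && k < n; k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jj + BLK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }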
Figure 2. shows the performance on an ADI integration loop [4] using different caching strategies and data vectorization on different data sizes [19].

[Figure 2 plots percent memory efficiency (0 to 90) against size of matrix (0 to 2200) for four cases: Cache Off, Cache On, DVect-no handoff, and DVect-handoff.]

Figure 2. Bandwidth Utilization & Cache Strategies

In the above legend, "Cache On" charts the performance of multiple processors each using normal caching. "Cache Off" is used to show the performance when all data references are memory references. "DVect: No Handoff" shows the performance of data vectorization without hardware to coordinate the processors on a processor bus. "DVect: Handoff" shows the performance when data vectorization and bus handoff are used together. This gives the best memory utilization and hence the highest performance.

2.2 Data Vectorization

Data vectorization on cache-based systems has many benefits. First, it can guarantee 100% cache hits during the compute intensive portions of an algorithm. Second, the compiler's code scheduling is simplified since the latency for all vectorized data is the same. Third, it separates the latencies of the memory system from the latencies of the cache in a deterministic manner. Fourth, an application's performance can be increased by reducing its data movement.

On the CRAY APP, data vectorization can be applied independently from code vectorization. There are many integer and floating-point applications that can be data vectorized that can not be code vectorized. An analysis of the Livermore Kernels [2] [18] shows that 71% of the loops can be completely code vectorized and 92% can be completely data vectorized. Similar results have been found in other benchmarks and production codes [19]. Data Vectorization can also be applied to non-floating point operations.

Data vectorization can be implemented on cache-based microprocessors. This adds many of the benefits of specialized vector processors to commercial microprocessors. The cache can be thought of as a configurable vector register set or vcache [21]. On the CRAY APP, vcache is implemented by placing a cache-sized array in the data cache. Data vectorization functions are implemented in the form of subroutines that copy data into vcache. Any number of temporary data vectors may be stored in vcache. Vector load routines copy blocks of data from system memory into vcache locations. Computations are performed on data elements in vcache with the results placed back to vcache locations. Vector store routines can then copy the resultant data from vcache to system memory.

The translation of scalar loops into code that contains these data vectorization subroutines is part of the CRAY APP's compiler. This is accomplished without introducing complexity to the programming model [19]. The compiler uses the technique of stripmining to divide arbitrary length data arrays into smaller data vectors stored in vcache. In addition, the data vectorization routines support a variety of memory access patterns: unit-stride, constant stride, and indexed accesses. Unit-stride accesses between consecutive vector elements are the most important and common form of memory access. Constant-stride accesses are typified by traversing the rows in a 2D matrix. Indexed accesses use an index vector to point to the desired vector elements of another vector. Indexed loads are typically not as fast as constant-stride accesses due to their non-uniformity.
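As an illustration of the stripmined, data-vectorized loop structure described above, a hand-written C sketch follows; it is not the compiler's actual output, and vload, vstore, VLEN, and the vcache buffers are hypothetical names standing in for the CRAY APP's data vectorization subroutines (defined here as plain copy loops so the sketch is self-contained).

    #include <stddef.h>

    #define VLEN 1024                     /* strip length sized to fit in the vcache        */
    static float vx[VLEN], vy[VLEN];      /* cache-resident "vcache" buffers                */

    /* Stand-ins for the optimized data vectorization subroutines (hypothetical names). */
    static void vload(float *dst, const float *src, int n, int stride)
    {
        for (int i = 0; i < n; i++) dst[i] = src[(size_t)i * stride];
    }
    static void vstore(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }

    /* y[i] += a * x[i*stride], stripmined so every strip is computed out of the vcache */
    void saxpy_strided(int n, float a, const float *x, int stride, float *y)
    {
        for (int base = 0; base < n; base += VLEN) {
            int len = (n - base < VLEN) ? n - base : VLEN;
            vload(vx, x + (size_t)base * stride, len, stride);   /* strided vector load     */
            vload(vy, y + base, len, 1);                         /* unit-stride vector load */
            for (int i = 0; i < len; i++)                        /* 100%-cache-hit compute  */
                vy[i] += a * vx[i];
            vstore(y + base, vy, len);                           /* vector store            */
        }
    }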
On the CRAY APP, data vectorization subroutines are implemented using the fully pipelined cache-bypass instruction and the "no allocate on store miss" policy of the cache [13]. These features allow fully pipelined memory accesses at the rate of one word (64-bit) every other CPU clock. These optimally coded data vectorization subroutines have an n1/2 of 12, which allows stride-one or arbitrary-stride accesses to deliver up to 99% of peak memory bus bandwidth [2]. The n1/2 is the number of operations required to achieve half of the maximum bandwidth [10].

2.3 Realized Compute Intensity

The realized compute intensity is often dependent on a compiler's optimization strategy. Code optimized for the memory-bandwidth rich CRAY Y-MP focuses on optimal code vectorization, which results in efficient running codes that have low compute intensity [8] in the range of 0.5 to 3.0. This is the proper optimization technique for the CRAY Y-MP. However, many algorithms when compiled with a data vectorization focus can realize a much greater compute intensity on each loop and the entire code.

                               Compute Intensity
Application Kernel             CRAY Y-MP   CRAY APP   Maximum (est)
3D Kirchhoff Migration         2.9         17.3       16146.1
Linpack 1000                   1.9         9.5        666.6
Perfect Club TRFD (kernel)     2.0         16.6       394.5
Electromagnetic Fill           1.3         6 - 12     167.2
Real FFT 1k x 1k               1.3         5.4        22.5
Linpack 100                    0.7         0.4        1.0

Table 3. Compute Intensity Implementations

In the above table, the maximum compute intensity of an algorithm can be estimated roughly by dividing the total number of floating-point operations by the data storage required. This estimate serves as an upper bound compute intensity for the entire run.

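As a check of this estimate (illustrative arithmetic only, using the standard 2/3 n^3 LU operation count rather than any CRAY-specific count), the Linpack 1000 entry of Table 3 follows directly:

    #include <stdio.h>

    int main(void)
    {
        double n = 1000.0;
        double flops = (2.0 / 3.0) * n * n * n;   /* approximate LU factorization ops     */
        double words = n * n;                     /* data storage: the n x n matrix       */
        printf("maximum compute intensity ~ %.1f ops/word\n",
               flops / words);                    /* ~666.7, matching Table 3's 666.6     */
        return 0;
    }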
An example of the different optimization strategies and how they affect compute intensity is the 3D Kirchhoff Seismic Migration (prestack). On a one processor CRAY Y-MP (aggressive code vectorization, 64-bit computation), this migration has a performance of 187 Mflops and a compute intensity of 2.9. However, the same code with a data vectorization focus on an 84 processor CRAY APP (32-bit computation) has a compute intensity of 17.3 and a performance of 1084 Mflops. The kernel loop of this code has several conditional statements that prevent simple code vectorization. Since the compute intensity is greater than 12, all of the processors on each bus of the CRAY APP can be utilized in this computation for a delivered 79x speedup.

The Linpack 100 benchmark [6] has a low compute intensity. For historical reasons, the rules of this benchmark force compilers to use the low compute intensity SAXPY operation as the kernel for this algorithm. The compute intensity of the SAXPY operation is 0.67. If an LAPACK solver [17] were used for this benchmark, the maximum compute intensity would increase to 33.0 and the delivered performance would also increase.
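The 0.67 figure can be read directly off the SAXPY loop itself, shown here in C for illustration: each iteration performs 2 floating-point operations against 3 memory references.

    /* SAXPY: y[i] = a*x[i] + y[i]
       per iteration: 1 multiply + 1 add                  = 2 flops
                      load x[i] + load y[i] + store y[i]  = 3 data words
       compute intensity = 2 / 3 = 0.67 ops/word                          */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }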
3.0 Parallelism

Automatic vectorization and automatic parallelization techniques have been effectively combined on shared-memory systems for many years [14] [11]. Automatic parallelism is important for wide-spread production use. Currently, similar efforts for automatic parallelism are underway for machines with non-uniform access (distributed and shared memory) [4]. However, this is complicated by data placement issues.

3.1 Data Placement

Data placement is required by systems with disparate memory latencies (some of which may be very high). It involves actually placing data in each processor's local memory. In addition, communication routines must be added to move data between processors. This code re-write can be simplified somewhat by taking advantage of the structure of the communication [2]. The split address space and the non-uniform memory accesses drastically complicate the process of effectively programming a distributed-memory machine. Greatly disparate memory latencies on shared memory processors also require the programmer to pay as much attention to data placement as on distributed memory machines. For example, on the Kendall Square KSR1 it takes 24 clocks to load from a local memory, 130 clocks to access memory within a local group of processors, and 570 clocks to access memory outside a local group.

The CRAY APP's shared-memory system has a uniform memory latency of 150 nanoseconds (6 clocks) to any memory location. For comparison, the CRAY Y-MP has a maximum latency of 100 ns. The CRAY APP uses a simple banked memory system. Optimizations for banked memory systems are well known and are part of many compilers. Data distribution is trivial as each processor simply references its required data without regard to placement.

3.2 Granularity and Applicability

Many codes have significant amounts of parallelism that can be exploited. However, these codes may not be programmed in the correct fashion to expose enough of the parallelism. In addition, often the parallel overheads added to run a program in parallel may outweigh the benefits of going parallel. In this context, parallel overheads are any operations that were added to a program to guarantee correct execution. This includes process creation, coordinating functions, and extra data movement (or communication). Overheads often limit the performance in a parallel system.

Granularity can be defined as the size of work that is given to each processor in a parallel system. Low overheads allow a computer to achieve speedups on fine-grained programs. Programs typically have several parallel regions of code, each of which has a distinct level of granularity.

Low overheads enable more sections of code to be parallelized. Parallelizing more of what otherwise would be serial code can have a profound effect on complete job performance. Table 4. shows an example of a code with four code sections (A, B, C, and D) that take a given percentage of the execution time of a serial run. Case 1 shows only code section "A" running 45 times faster using parallelism. In this example, the other code sections could not be run in parallel due to the parallel start-up overheads being too large. On the entire code, this would give a speedup of 8.2x over the serial code. Case 3 shows the effect of being able to speed up code sections "B" and "C" with parallelism to some level. On the entire code this shows a speedup of 20.0x over the serial code. Fine-grain parallelism is also easier to recognize and take advantage of in real codes.

          A      B      C      D
Case     90%    5%     3%     2%     Time    Speedup
1        45x    1x     1x     1x     12.0    8.2x
2        45x    10x    1x     1x     7.5     13.3x
3        45x    10x    6x     1x     5.0     20.0x

Table 4. Low Overheads Speedup More Code Sections

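The Table 4. figures follow from a simple Amdahl-style accounting; the sketch below recomputes the remaining time and overall speedup for each case, with the section percentages and per-section speedups taken straight from the table. Cases 2 and 3 reproduce the table exactly; Case 1 comes out at 8.3x versus the table's 8.2x, presumably a rounding difference in the original.

    #include <stdio.h>

    int main(void)
    {
        double percent[4]    = { 90.0, 5.0, 3.0, 2.0 };   /* sections A, B, C, D          */
        double speedup[3][4] = {
            { 45.0,  1.0, 1.0, 1.0 },                     /* Case 1                       */
            { 45.0, 10.0, 1.0, 1.0 },                     /* Case 2                       */
            { 45.0, 10.0, 6.0, 1.0 },                     /* Case 3                       */
        };

        for (int c = 0; c < 3; c++) {
            double time = 0.0;
            for (int s = 0; s < 4; s++)
                time += percent[s] / speedup[c][s];       /* remaining time per section   */
            printf("Case %d: time = %.1f, overall speedup = %.1fx\n",
                   c + 1, time, 100.0 / time);
        }
        return 0;
    }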
The number of applications that can be executed in parallel is much larger when one can take advantage of fine-grain parallelism. Conversely, there are more parallel computers that can execute large grain jobs than fine grain jobs. A computer designed for fine-grain parallelism can also perform well on large-grain programs. The opposite is not true. It is difficult to design parallel computers with low overheads.

The CRAY APP uses several hardware mechanisms [2] to assist the highly tuned user-level parallel primitives of parallel region entry (pcall), memory protection (critical section), and processor synchronization (barrier). The overheads associated with these functions are very low to allow for effective fine-grain parallelism. The automatic parallelizers guard against simultaneous updates to shared data using critical sections. The compiler guarantees that shared data is consistent between synchronization points. The benefits of focusing on efficient parallelism are evidenced by the delivered performance of 882 Mflops on the Linpack 1000 benchmark [6] [2]. The CRAY APP has the best performance and is also the most cost-effective of the microprocessor-based parallel computers.

3.3 Data Coherency and Data Vectorization

Data vectorization guarantees that each processor's cache only contains the data it needs, so processors cannot update different portions of shared cache lines. Cache coherency hardware does not add any additional functionality for data vectorized code. Data vectorization also allows multiple processors to effectively share a common bus. While one processor is performing a vector load or store, the other processors on the bus are computing on data in their cache. Low overheads make this technique very effective on the CRAY APP.

3.4 Parallel Loops

The work distribution for parallel loops can be calculated quickly at run-time using low-overhead function calls to give the number of processors and a logical processor identity. Notice there is no dependence on a power-of-two number of processors (unlike many distributed memory machines). Load balancing of loops may be improved by changing partitioning of the loop iterations from finely interleaved to blocked. The user can also use directives to coalesce two parallel loops to increase the parallel work and provide better load balancing.

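A minimal sketch of this style of run-time work distribution; num_processors, my_processor, and barrier are hypothetical stand-ins for the low-overhead CRAY APP primitives (the serial fallbacks below exist only to make the sketch self-contained), and the blocked partition shown is just one of the interleaved/blocked choices mentioned above.

    /* Hypothetical stand-ins for the CRAY APP's low-overhead parallel primitives. */
    static int  num_processors(void) { return 1; }   /* processors in this parallel region */
    static int  my_processor(void)   { return 0; }   /* logical identity, 0..np-1          */
    static void barrier(void)        { /* all processors would synchronize here */ }

    /* Each processor computes its own blocked slice of the iteration space; nothing
       assumes a power-of-two processor count.                                        */
    void parallel_scale(int n, float a, float *x)
    {
        int np = num_processors();
        int me = my_processor();
        int lo = (int)((long)n * me / np);          /* start of this processor's block */
        int hi = (int)((long)n * (me + 1) / np);    /* end of block (exclusive)        */

        for (int i = lo; i < hi; i++)
            x[i] *= a;

        barrier();   /* shared data is consistent past this synchronization point */
    }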
3.4 Parallel Loops The maximum bandwidth between processors and mem- ory is 1.12 Gbytes/second. At a given time, only one pro- The work distribution for parallel loops can be calcu- cessing element on each bus can access a memory port. The lated quickly at run-time using low-overhead function calls crossbar allows eight simultaneous accesses to eight differ- to give the number of processors and a logical processor ent memory ports every memory cycle. I/O transfers can be identity. Notice there is no dependence on a power-of-two overlapped with computation. HiPPI disk transfers have a number of processors (unlike many distributed memory measured peak of 69 Megabytes/second (96% of disk max- machines). Load balancing of loops may be improved by imum). changing partitioning of the loop iterations from finely interleaved to blocked. The user can also use directives to The CRAY APP is air-cooled with a power requirement coalesce two parallel loops to increase the parallel work of 6.6KVA. It has a maximum dissipation of 4000 watts. and provide better load balancing. The dimensions of the package are 27" x 62" x 43".

4.2 CRAY APP Software

The CRAY APP's program development software includes a complete set of compilation, debugging, and profiling tools [21]. It is programmable in either standard FORTRAN or C. The FORTRAN and C compilers perform optimizations including data vectorization, code vectorization, and software pipelining. The compiler also performs optimizations to improve the compute intensity of loops, and autoparallelizes code, which utilizes multiple processors on each bus. Auto parallelization is also referred to as autotasking [5].

Users can explicitly parallel program using comment directives (FORTRAN) or pragmas (C) to identify parallel loops and synchronization points within code. Supported directives include CRAY Autotasking directives [5], Parallel Computing Forum directives [1], and KAI Mid-Processor directives. Code is compiled in a manner that allows a single executable image to run on a system with any number of processors.

The CRAY APP microkernel is responsible for managing the execution of CRAY APP programs. The microkernel directly manages functions such as creating and controlling program processes and process threads, handling of exceptions, managing HiPPI data transfers, and switching context between users. The CRAY APP microkernel also supplies virtually all of the system services available on the SPARC processor, including SunOS system calls. When an extended system service such as a read or write is executed on the CRAY APP, it is sent to the front-end for execution. The front-end acts as the I/O processor for the CRAY APP.

4.3 CRAY APP Cluster

Cluster-level parallelism is concerned with the largest granularity of sub-problems or the parallelism between multiple cooperating programs. A cluster describes a system composed of several machines that work on a related task. The CRAY APP Cluster system is a scalable, cost-efficient compute solution for numerous applications, such as 3D seismic migration, beamforming, and electromagnetic simulations. A CRAY APP cluster can currently be configured with up to 12 CRAY APPs. This provides a peak of 81 Gflops for 32-bit performance and a peak of 40 Gflops for 64-bit performance. The CRAY APP Cluster system uses a HiPPI network and a globally accessible central memory system.

Messages can be transferred between CRAY APPs with a measured latency of 9 microseconds, which is lower than most distributed-memory systems. In addition, uni-directional data transfers have been measured at over 92 Mbytes/sec. The cluster is programmed as a distributed-memory system at the machine level, since each CRAY APP has its own memory space. Parallel coordination of tasks is accomplished by communications between CRAY APPs. The CRAY APP cluster can also have a globally accessible memory, which can contain up to 16 Gbytes of data. The total memory in a cluster can be 28 Gbytes.

5.0 CRAY APP Performance

This section will cover the performance of some algorithms on the CRAY APP; see also [2] [19] [20]. The CRAY APP performs well on many of the important 32-bit algorithms required by signal and image processing applications. Table 5. shows the performance of a single CRAY APP on a sample of 32-bit algorithms.

Algorithm               Data Set               Performance
1D Convolution          500 filter, 1M data    6555 Mflops
1D Complex FFT          4096, length 256       2301 Mflops
1D Complex FFT          1M                     847 Mflops
1D Real Radix 3 FFT     10000, length 486      3020 Mflops
2D Complex FFT          1K x 1K                1064 Mflops
3D Real/Complex FFT     243 x 243 x 243        1063 Mflops
Vector Square Root      1M                     646 Mflops

Table 5. Example 32-bit Algorithm Performance

Many applications require computations to be performed on 64-bit data. Table 6. shows the performance of a single CRAY APP on some typical 64-bit kernels.

Algorithm                 Data Set        Performance (1 CRAY APP)
Complex Matrix Multiply   1000 x 1000     2912 Mflops
Real Matrix Multiply      1000 x 1000     1886 Mflops
Complex Solver            50K x 50K       2821 Mflops
Linpack 1000              1000 x 1000     882 Mflops
Linpack 100               100 x 100       33 Mflops
Vector Logarithm          1M              1264 Mflops

Table 6. Example 64-bit Algorithm Performance

The algorithms shown above are involved in many important technical applications. Some of these applications are shown below.

Application                     # of CRAY APPs   Performance
Electromagnetic Simulation      2                5.4 Gflops
Beamforming & Conversion        3                2.3 Gflops
Options Evaluation              1                2.0 Gflops
Portfolio Optimization          1                1.3 Gflops
Image Processing                1                1.2 Gflops
3D Kirchhoff Depth Migration    1                1.1 Gflops

Table 7. CRAY APP Application Performance

6.0 Conclusion

The innovative features of the CRAY APP are due to a hardware-software codesign process. This process used the characteristics of important algorithms to define and optimize the features that are most frequently used. Parallel performance is directly related to high delivered bandwidth, minimal parallel overheads, and efficient computations. Many of these features are also applicable to non-technical applications. The compute intensity of an application or a benchmark is very helpful in determining the achievable performance on an architecture.

On the CRAY APP, vectorization techniques are effectively used on a commodity microprocessor. Data vectorization improves both serial and parallel performance by organizing data movement and avoiding the pitfalls of standard caching mechanisms. Parallel performance is also enhanced by low-overhead parallel support routines which allow more of an application to be parallelized. This increases the number of codes that can benefit from parallel processing.

The CRAY APP has capitalized on the important characteristics of applications and has realized Gflops level performance on a wide variety of applications. The programming effort on the CRAY APP is simplified by the fast commercial processors, low-latency shared memory, data vectorization, and efficient parallel support functions. The CRAY APP was designed to run production-class codes at higher efficiency and lower cost than other currently available parallel processors.

7.0 References

[1] CRAY APP Programmers Guide, Cray Research Superservers, Inc., April 1992.

[2] B. R. Carlile and D. Miles, "Structured Asynchronous Communications Routines for the FPS T Series", in: Proceedings of the Third Conference on Hypercube and Concurrent Computers and Applications 1 (Jan. 1988) 550-559.

[3] B. R. Carlile, "Algorithms and Design: The CRAY APP Shared-Memory System", in: COMPCON '93, (Feb. 1993) 312-320.

[4] A. Choudhary, G. Fox, et al., Compiling Fortran 77D and 90D for MIMD Distributed-Memory Machines, Proceedings of FRONTIERS '92, IEEE Computer Society Press, 1992.

[5] CRAY Y-MP, CRAY X-MP EA, and CRAY X-MP Multitasking Programmers Manual, SR-0222 F-01, Cray Research, Inc., Eagan, Minnesota, 1991.

[6] J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, Oak Ridge National Laboratory, April 10, 1992.

[7] Eagle Project Plan, Cray Research Superservers Inc. (formerly FPS Computing), Beaverton, OR, 1989.

[8] C. M. Grassl, "Parallel Performance of Applications on Supercomputers", Parallel Computing 17 (1991), 1257-1273.

[9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach (Morgan Kaufmann Publishers, San Mateo, CA, 1990).

[10] R. W. Hockney, (r∞, n1/2, s1/2) measurements on the 2-CPU CRAY X-MP, Parallel Computing 2 (1985) 1-14.

[11] R. W. Hockney and C. R. Jesshope, Parallel Computers, Adam Hilger, Philadelphia, Second Edition, 106-108, 1981.

[12] How to Program the FPS AP-120B, Floating Point Systems, Inc., FPS-7303 manual, 1977.

[13] i860 Microprocessor Family Programmer's Reference Manual, Intel Corporation, 1991.

[14] K. Kennedy, "Automatic Translation of Fortran Programs to Vector Form", Technical Report 476-029-4, Rice University, Houston, TX, October 1980.

[15] M. S. Lam and M. E. Wolf, "Automatic Blocking by a Compiler", Stanford University, Fifth SIAM Conference on Parallel Processing, March 25-27, 1991.

[16] M. S. Lam, E. Rothberg, and M. E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms", Fourth Intern. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), Palo Alto, CA, April 9-11, 1991.

[17] E. Anderson, et al., LAPACK User's Guide, Society for Industrial and Applied Mathematics, Philadelphia, 1992.

[18] F. H. McMahon, The Livermore FORTRAN Kernels: A Computer Test of the Numerical Performance Range, Research Report, Lawrence Livermore Laboratories, December 1986.

[19] D. Miles, "Beyond Vector Processing: Parallel Programming on the CRAY APP", in: COMPCON '93, (Feb. 1993), 321-328.

[20] D. Miles, "Compute Intensity and the FFT", submitted to Supercomputing '93 Conference, November 1993.

[21] PGTools - i860 Development Software, The Portland Group, Inc. March 1992.

[22] SPEC Newsletter, Vol 4. Issue 3, NCGA, Suite 200, 2722 Merrilee Drive, Fairfax, VA 22031, September 1992.

[23] B. R. Rau and C. D. Glaeser, Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing, in: Proceedings of the Fourteenth Annual Workshop on Microprogramming (1981) 183-198.
