<<


Chapter 1 Exercises

1. Look up the definition of “parallel” in your favorite dictionary. How does it compare to the definition of “parallelism” given in this chapter?

2. Give reasons why 10 bricklayers would not in real life build a wall 10 times faster than one bricklayer.

3. Ten volunteers are formed into a bucket brigade between a small pond and a cabin on fire. Why is this better than each volunteer individually carrying water to the fire? Analyze the bucket brigade as a pipeline. How are the buckets returned?

4. Using an assembly line, one wants the conveyor belt to move as fast as possible in order to produce the most widgets per unit time. What determines the maximum speed of the conveyor belt?

5. Assume a conveyor belt assembly line of five tasks. Each task takes T units of time. What is the speedup for manufacturing a) 10 units? b) 100 units? c) 1000 units?

6. Plot the graph of the results of problem 5. What is the shape of the curve?

7. Given the assembly line of problem 5, what are the speedups if one of the tasks takes 2T units of time to complete?

8. Assume a conveyor belt assembly line of five tasks. One task takes 2T units of time and the other four each take T units of time. How could the differences in task times be accommodated?

9. Simple Simon has learned that the asymptotic speedup of an n-station pipeline is n. Given a 5-station pipeline, Simple Simon figures he can make it twice as fast if he makes it into a 10-station pipeline by adding 5 “do nothing” stages. Is Simple Simon right in his thinking? Explain why or why not.

10. Select a parallel computer or parallel programming language and write a paper on its history.


Chapter 2 Measuring Performance

This chapter focuses on measuring the performance of parallel computers. Many customers buy a parallel computer primarily for increased performance. Therefore, measuring performance accurately and in a meaningful manner is important. In this chapter, we will explore several measures of performance, assuming a scientific computing environment. After selecting a “good” measure, we discuss the use of benchmarks to gather performance data. Since the performance of parallel architectures is harder to characterize than that of scalar ones, a performance model by Hockney is introduced. The performance model is used to identify performance issues with vector processors and SIMD machines. Next, we discuss several special problems associated with the performance of MIMD machines. Also, we discuss how to measure the performance of the new massively parallel machines. Lastly, we explore the physical and algorithmic limitations to increasing performance on parallel computers.

2.1 Measures of Performance

First, we will consider measures of performance. Many measures of performance could be suggested, for example, instructions per second, disk reads and writes per second, or memory accesses per second. Before we can decide on a measure, we must ask ourselves what we are measuring. With parallel computers in a scientific computing environment, we are mostly concerned with CPU computing speed in performing numerical calculations. Therefore, a potential measure might be CPU instructions per second. However, in the next section we will find that this is a poor measure.

2.1.1 MIPS as a Performance Measure

We all have seen advertisements claiming that such and such company has an X MIPS machine, where X is 50, 100 or whatever. The measure MIPS (Millions of Instructions Per Second) sounds impressive. However, it is a poor measure of performance, since processors have widely varying instruction sets. Consider, for example, the following data for a CISC (Complex Instruction Set Computer) Motorola MC68000 and a RISC (Reduced Instruction Set Computer) Inmos T424 microprocessor.

            Total Number of Instructions    Time in Seconds
MC68000     109,366                         0.11
T424        539,215                         0.03

Fig. 2.1 Performance Data for the Sieve of Eratosthenes Benchmark10

Both are solving the same problem, i. e., a widely used benchmark for evaluating microprocessor performance called the Sieve of Eratosthenes, which finds all the prime numbers up to 10,000. Notice that the T424, with its simpler instruction set, must perform almost five times as many instructions as the MC68000.

    rate = total number of instructions / time to solve problem

    rate_T424 = 539,215 instructions / 0.03 seconds = 18.0 MIPS

10 Inmos Technical Note 3: "Ims T424 - MC68000 Performance Comparison"

    rate_MC68000 = 109,366 instructions / 0.11 seconds = 1.0 MIPS

The T424 running the Sieve program executes instructions at the rate of 18.0 MIPS. In contrast, the MC68000 running the same Sieve program executes instructions at the rate of 1.0 MIPS. Although the T424's MIPS rating is 18 times the MC68000's MIPS rating, the T424 is only 3.6 times faster than the MC68000. We conclude that the MIPS rating is not a good indicator of speed. We must be suspicious when we see performance comparisons stated in MIPS. If MIPS is a poor measure, what is a good measure?

2.1.2 MFLOPS as a Performance Measure

A reasonable measure for scientific computations is Millions of FLoating-point Operations Per Second (MFLOPS or MegaFLOPS). Since a typical scientific or engineering program contains a high percentage of floating-point operations, MFLOPS is a good candidate for a performance measure. Most of the time spent executing scientific programs is spent calculating floating-point values inside of nested loops. Clearly, not all work in a scientific environment is floating-point intensive, e. g., compiling a program. However, the computing industry has found MFLOPS to be a useful measure. Of course, some applications such as expert systems do very few floating-point calculations, and an MFLOPS rating is rather meaningless for them. A possible measure for expert systems might be the number of logical inferences per second.
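Before moving on, the short FORTRAN sketch below recomputes the MIPS ratings of Section 2.1.1 and the actual speed ratio from the instruction counts and times of Figure 2.1. It is only an illustration of the arithmetic added here, not part of the original benchmark.

* Sketch: MIPS ratings versus actual speed for the Sieve data of Fig. 2.1
* (illustrative only; the constants are the published instruction counts
* and run times)
      PROGRAM MIPSEX
      REAL CNT68K, CNTT42, TIM68K, TIMT42, R68K, RT424
      CNT68K = 109366.0
      TIM68K = 0.11
      CNTT42 = 539215.0
      TIMT42 = 0.03
* MIPS = (instructions / time) / 1,000,000
      R68K  = CNT68K / TIM68K / 1.0E6
      RT424 = CNTT42 / TIMT42 / 1.0E6
      PRINT *, 'MC68000 MIPS = ', R68K
      PRINT *, 'T424    MIPS = ', RT424
      PRINT *, 'RATIO OF MIPS RATINGS = ', RT424 / R68K
* the actual speed ratio is the ratio of the run times, not of the MIPS
      PRINT *, 'ACTUAL SPEED RATIO    = ', TIM68K / TIMT42
      STOP
      END

The program prints a MIPS ratio of about 18 but an actual speed ratio of only about 3.7, which is exactly the point of the example.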

2.2 MFLOPS Performance of Supercomputers Over Time

To demonstrate the increase in MFLOPS over the last two decades, the chart below shows some representative parallel machines and their theoretical peak MFLOPS ratings. Each was the fastest machine in its day. The chart also includes the number of processors contained in the machine and the year the first machine was shipped to a customer.

Computer         Year    Peak MFLOPS    Number of Processors
CDC 6600         1966    1              1
ILLIAC IV        1975    100            64
Cray-1           1976    160            1
CDC 205          1981    400            1
Cray X-MP/2      1983    420            2
Cray Y-MP/832    1987    1333           4
Cray Y-MP C90    1992    16000          16
NEC SX-3/44      1992    22000          4

Fig. 2.2 Peak MFLOPS for the Fastest Computer in That Year

From the chart, we see that the MFLOPS rating has risen at a phenomenal rate in the last two decades. To see if there are any trends, we plot the peak MFLOPS on a logarithmic scale versus the year. The result is almost a straight line! This means the performance increases tenfold about every five years. Can the computer industry continue at this rate? The indications are that it can for at least another decade.

[Plot: peak MFLOPS on a logarithmic scale (1 to 100,000) versus year (1965 to 1995)]

Fig. 2.3 Log Plot of Peak MFLOPS in the Last Two Decades

One caveat: the chart and graph use a machine's theoretical peak performance in MFLOPS. This is not the performance measured in a typical user's program. In a later section, we will explore the differences between "peak" and "useful" performance. Building a GFLOPS (GigaFLOPS or 1000 MFLOPS) machine is a major accomplishment. Do we need faster machines? Yes! In the next section, we will discuss why we need significantly higher performance.

2.3 The Need for Higher Performance Computers

We saw in the last section that supercomputers have grown in performance at a phenomenal rate. Fast computers are in high demand in many scientific, engineering, energy, medical, military and research areas. In this section, we will focus on several applications which need enormous amounts of computing power. The first of these is numerical weather forecasting. Hwang and Briggs [Hwang, 1984] is the primary source of the information for this example.

Considering the great benefits of accurate weather forecasting to navigation at sea and in the air, to food production and to the quality of life, it is not surprising that considerable effort has been expended in perfecting the art of forecasting. The weatherman’s latest tool is the computer, which is used to predict the weather based on a simulation of an atmospheric model. For the prediction, the weather analyst needs to solve a general circulation model. The atmospheric state is represented by the surface pressure, the wind field, the temperature and the water vapor mixing ratio. These state variables are governed by the Navier-Stokes fluid dynamics equations in a spherical coordinate system. To solve the continuous Navier-Stokes equations, we discretize both the variables and the equations. That is, we divide the atmosphere into three-dimensional subregions, associate a grid point with each subregion and replace the partial differential equations (defined at infinitely many points in space) with difference equations relating the discretized variables (defined only at the finitely many grid points). We initialize the state variables of each grid point based on the current weather at weather stations around the country.

The computation is carried out on this three-dimensional grid that partitions the atmosphere vertically into K levels and horizontally into M intervals of longitude and N intervals of latitude. It is necessary to add a fourth dimension: the number of time steps used in the simulation. Using a grid size of 270 miles on a side, an appropriate number of vertical levels and time step, a 24-hour forecast for the United States would need to perform about 100 billion data operations. This forecast can be done on a Cray-1 supercomputer in about 100 minutes. However, a grid of 270 miles on a side is very coarse. If one grid point were Washington, DC, then 270 miles north is Rochester, New York on Lake Ontario and 270 miles south is Raleigh, North Carolina. The weather can vary drastically between these three cities! Therefore, we desire a finer grid for a better forecast. If we halve the distance on each side to 135 miles, we also need to halve the vertical level interval and the time step. Halving each of the four dimensions requires at least 16 times more data operations.
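The scaling argument can be checked with a few lines of code. The sketch below, an illustration added here rather than anything from Hwang and Briggs, starts from the 100-billion-operation, 100-minute baseline for the 270-mile grid and multiplies the work by 16 for each halving of the grid spacing, since all four dimensions are refined together.

* Sketch: operation count and Cray-1 run time versus grid refinement
* (illustrative; baseline is the 270-mile grid of the text: 100 billion
* operations and 100 minutes on a Cray-1)
      PROGRAM GRID
      REAL OPS, MINS, SIDE
      INTEGER LEVEL
      SIDE = 270.0
      OPS  = 100.0E9
      MINS = 100.0
      DO 10 LEVEL = 1, 3
         PRINT *, SIDE, ' MILE GRID: ', OPS, ' OPS ', MINS, ' MIN'
* halving the grid spacing also halves the vertical and time steps,
* so the work grows by a factor of 2**4 = 16
         SIDE = SIDE / 2.0
         OPS  = OPS  * 16.0
         MINS = MINS * 16.0
   10 CONTINUE
      STOP
      END

The 135-mile grid comes out at 1600 minutes and the 67-mile grid at 256 times the baseline, the figures used in the next paragraph.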

    time_135-mile grid = 100 minutes x 16 = 1600 minutes = 26.7 hours

Therefore, a Cray-1 would take over 26 hours to compute a 24-hour forecast. We would receive the prediction after the fact; clearly not acceptable! If we want the forecast in 100 minutes, we will need a machine 16 times as fast. If we desire a grid size of 67 miles on a side, we will need a computer 256 times faster. Since weather experts would like to model individual cloud systems, which are much smaller, for example, six miles across, weather and climate researchers will never run out of need for faster computers.

From Figure 2.2, we observe that the Cray-1 is a 1976 machine with a peak performance of 160 MFLOPS. Today’s machines are a factor of 100 faster and do provide a better forecast. However, reliable long-range forecasts require an even finer grid for many more time steps, which is why climate modeling is a Grand Challenge Problem. In 1991, the United States Office of Science and Technology proposed a series of Grand Challenge Problems, i. e., computational areas which require a million MFLOPS (TeraFLOPS). The U. S. Government feels that effective solutions in these areas are critically important to its national economy and well-being. The Grand Challenge Problems are listed in Figure 2.4.

Climate Modeling - weather forecasting; global models.
Fluid Turbulence - air flow over an airplane; reentry dynamics for spacecraft.
Pollution Dispersion - acid rain; air and water pollution.
Human Genome - mapping the human genetic material in DNA.
Ocean Circulation - long term effects; global warming.
Quantum Chromodynamics - particle interaction in high-energy physics.
Semiconductor Modeling - routing of wires on a chip.
Superconductor Modeling - special properties of materials.
Combustion Systems - rocket engines.
Vision and Cognition - remote driverless vehicle.

Fig. 2.4 The Grand Challenge Problems that Require a TeraFLOPS Computer

The U. S. computer industry hopes to provide an effective TeraFLOPS computer by the mid-1990s. Other areas that require extensive computing are structural biology and pharmaceutical design of new drugs, for example, a cure for AIDS. Returning to the weather forecasting example, how close to peak performance did the Cray-1 come on this problem? To compute the actual floating-point operations per second (FLOPS), we divide the number of operations by the time spent.

    rate_Cray-1 = 100 billion operations / 100 minutes = 16.7 MFLOPS

Surprisingly, the Cray-1, a 160 MFLOPS machine, only performed at 16.7 MFLOPS on the weather problem! Why the large discrepancy? First, the FORTRAN compiler can’t utilize the machine fully. Second, the pipelined arithmetic functional units are not kept busy. We will explore this issue fully in Chapter Three when we discuss vector processors such as the Cray-1. Also, in Chapter Three we will derive the Cray-1’s 160 MFLOPS rating and discuss why it rarely achieved anywhere near peak performance. At the moment, we need only understand that sustained MFLOPS on real programs, and not peak MFLOPS, is what is important. Many purchasers of supercomputers have been disappointed when their application programs have run at only a small fraction of the salesman’s quoted peak performance. One way to measure the practical MFLOPS available in a computer is to use benchmark programs.

2.4 Benchmarks as a Measurement Tool

A benchmark is a computer program run on several computers to compare the computers’ characteristics. A benchmark might be an often-run application program which typifies the work load at a company. Using the benchmark, the company can obtain a measure of how well a new computer will perform in its environment. The computing industry uses standard benchmarks, for example, the Sieve of Eratosthenes program used in Section 2.1, to evaluate its products. Performance of a computer is based on many aspects, including the CPU speed, the memory speed, the I/O speed and the compiler’s effectiveness. To incorporate these other effects, we measure the CPU’s overall performance by a benchmark program rather than directly, say with a hardware probe. Devising a benchmark for parallel computers is a little harder because of the wide variety of architectures. However, the computer industry has settled on several standard benchmarks, including the Livermore Loops and LINPACK, for measuring the performance of parallel computers. Here, we will discuss the LINPACK benchmark.

Jack J. Dongarra of Oak Ridge National Laboratory compiles the performance of hundreds of computers using the standard LINPACK benchmark [Dongarra, 1992]. The LINPACK software solves dense systems of linear equations. The LINPACK programs can be characterized as having a high percentage of floating-point arithmetic operations and, therefore, are appropriate as benchmarks for measuring performance in a scientific computing environment. The table in Figure 2.5 reports three numbers for each machine listed (in some cases, numbers are missing because of lack of data). All performance numbers reflect arithmetic performed in full precision (64 bits). The third column lists the LINPACK benchmark for a matrix of order 100 in a FORTRAN environment. No changes are allowed to this code. The fourth column lists the results of solving a system of equations of order 1000, with no restrictions on the method or its implementation. The last column is the theoretical peak performance of the machine, which is based not on an actual program run, but on a paper computation. This is the number manufacturers often cite; the theoretical peak MFLOPS rate represents an upper bound on performance. As Dongarra states, “... the manufacturer guarantees that programs will not exceed this rate -- sort of a ‘speed of light’ for a given computer.”11 The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (64-bit precision) that can be completed during a period of time, usually the cycle time of the machine.
For example, the Cray Y-MP/8 has a cycle time of 6 nanoseconds in which the results of both an addition and a multiplication can be completed on a single processor.

11 Dongarra, Jack J., “LINPACK Benchmark: Performance of Various Computers Using Standard Linear Equations Software," Supercomputing Review, Vol. 5, No. 3, March, 1992, pp. 55.

    (2 operations / 1 cycle) * (1 cycle / 6 ns) = 333 MFLOPS

Since the Cray Y-MP/832 of 1987 could have up to four processors, the peak performance is 4 x 333 = 1333 MFLOPS. The column labeled “Computer” gives the name of the computer hardware and indicates the number of processors and the cycle time of a processor in nanoseconds.

Computer                              Year    Standard LINPACK    Best Effort LINPACK    Theoretical Peak Performance
CDC 6600 (100 ns)                     1966    0.48                --                     1
Cray-1S (12.5 ns)12                   1979    12                  110                    160
CDC 205 (4-pipe, 20 ns)               1981    17                  195                    400
Cray X-MP/416 (2 procs., 8.5 ns)13    1983    143                 426                    470
Cray Y-MP/832 (4 procs., 6 ns)        1987    226                 1,159                  1,333
Cray Y-MP C90 (16 procs., 4.2 ns)     1992    479                 9,715                  16,000
NEC SX3/44 (4 procs., 2.9 ns)         1992    --                  13,420                 22,000

Fig. 2.5 LINPACK Benchmarks in MFLOPS for Some Supercomputers
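The paper computation generalizes easily. The sketch below, a minimal illustration rather than anything from the LINPACK report, computes a theoretical peak from a cycle time, the number of floating-point results completed per cycle, and the processor count; the values shown are the Cray Y-MP figures quoted above.

* Sketch: theoretical peak MFLOPS from cycle time, results per cycle
* and processor count (illustrative; 2 results per 6 ns cycle, 4 CPUs)
      PROGRAM PEAK
      REAL CYCNS, OPCYC, PROCS, PKMFLP
      CYCNS = 6.0
      OPCYC = 2.0
      PROCS = 4.0
* operations per second = results per cycle / cycle time in seconds
      PKMFLP = OPCYC / (CYCNS * 1.0E-9) * PROCS / 1.0E6
      PRINT *, 'THEORETICAL PEAK = ', PKMFLP, ' MFLOPS'
      STOP
      END

Changing CYCNS, OPCYC and PROCS gives the corresponding figure for any entry in Figure 2.5, provided one knows how many results each processor can complete per cycle.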

Glancing at the table, we observe that the Standard LINPACK measurements are a small percentage of the theoretical peak MFLOPS. The Standard LINPACK rating is a good approximation of the performance we would expect from an application written in FORTRAN and directly ported (no changes in the code) to the new machine. Notice that the 12 MFLOPS Standard LINPACK rating of the Cray-1 is comparable to the 16.7 MFLOPS we calculated for the weather code. For FORTRAN code, 10 to 20 MFLOPS was typical on the Cray-1.

The improvement of the “Best Effort” column over the Standard LINPACK column reflects two effects. First, the problem size is larger (a matrix of order 1000), which gives the hardware, especially the arithmetic pipelines, more opportunity to reach near-asymptotic rates. Second, modification or replacement of the algorithm and software is permitted to achieve as high an execution rate as possible. For example, a critical part of the code might be carefully hand coded in assembly language to match the architecture of the machine.

As one might expect, manufacturers have worked hard to improve their LINPACK benchmark ratings. To improve the Standard LINPACK rating for a fixed machine, one enhances the FORTRAN compiler by providing optimizations which better utilize the machine. To illustrate the possible improvement in compiler technology, Cray Research raised the Standard LINPACK rating on the Cray-1S from 12 MFLOPS in 1983 with Cray’s CFT FORTRAN compiler (version 1.12) to 27 MFLOPS on a current run with their cf77 compiler (version 2.1). The LINPACK and other benchmarks provide a valuable way to compare computers in their performance on floating-point operations per second.
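To quantify “a small percentage,” the sketch below (an illustration added here, not part of the LINPACK report) divides the Standard and Best Effort ratings of one Figure 2.5 entry by its theoretical peak.

* Sketch: fraction of theoretical peak achieved by the LINPACK runs
* (illustrative; the constants are the Cray Y-MP/832 row of Fig. 2.5)
      PROGRAM FRAC
      REAL STD, BEST, PEAK
      STD  = 226.0
      BEST = 1159.0
      PEAK = 1333.0
      PRINT *, 'STANDARD    = ', 100.0 * STD / PEAK, ' % OF PEAK'
      PRINT *, 'BEST EFFORT = ', 100.0 * BEST / PEAK, ' % OF PEAK'
      STOP
      END

For the Cray Y-MP/832 this works out to about 17 percent of peak for the untouched FORTRAN code and about 87 percent for the tuned order-1000 solution, which is the gap the rest of this chapter tries to explain.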

2.5 Hockney’s Parameters r∞ and n1/2

Roger Hockney has developed a performance model [Hockney, 1988] which attempts to characterize the effective parallelism of a computer. His original model focused on the performance of vector computers, e. g., the Cray-1, but he has expanded his model to include

12 No data is available for the Cray-1 cited in Figure 2.2. The Cray-1S was an upgrade of the I/O and memory systems that appeared three years later.
13 The Cray X-MP/416 is an upgrade of the Cray X-MP/2 of 1983. The system clock was speeded up from 9.5 ns to 8.5 ns, which accounts for the increase in peak performance from 420 to 470 MFLOPS.

SIMD and MIMD machines as well. First we will explore his original model; then we will explore his extensions.

In the last section, we observed a large disparity between the Standard LINPACK rating and the theoretical peak performance on supercomputers, for example, 12 MFLOPS versus 160 MFLOPS on the Cray-1. For a vector computer, e. g., the Cray-1, the disparity is attributed partly to the FORTRAN compiler’s inability to fully utilize the machine, especially the vector hardware. However, the major source of slowdown is the lack of work to keep the pipelined arithmetic functional units busy. Recall the pipelined floating-point unit of Section 1.10.1. In a vector processor (see Section 1.12.1), special machine instructions route vectors of floating-point values through the pipelined functional units. Only after a pipeline is full do we obtain an asymptotic speedup equal to the number of stages in the pipeline. Therefore, supercomputer floating-point performance depends heavily on the percentage of code with vector operations and on the lengths of those vectors. Scalar code (no vectors available) runs at about 12 MFLOPS on the Cray-1, while highly vectorizable code achieves performance close to the theoretical peak of 160 MFLOPS. To characterize the effects of this vectorization, Hockney's performance model will be derived in the next section.

2.5.1 Deriving Hockney’s Performance Model

In his performance model, Hockney wants to distinguish between the effects of technology, e. g., the clock speed, and the effects of parallelism. All the effects of technology are lumped together into one parameter we will call Q. The effects of parallelism are lumped into a parameter we call P. Let t be the time of a single arithmetic operation, e. g., a vector multiply, on a vector of length n. We assume t is some function of n, Q and P. We call this function F:

    t = F(n, Q, P)

First, we consider Q, the effects of technology. For example, if we double the clock speed of a processor, we would expect to halve the time t. We observe that Q appears as a multiplicative factor, which we write as 1/r, of another function that depends only on n and P. We will call this new function G:

    t = (1/r) * G(n, P)

The performance or rate is related to the reciprocal of the time. In the above equation, r is the rate, or the results per unit of time, e. g., results per second. Now we consider P, the effects of parallelism. If the machine is serial, i. e., with no architectural parallelism (P = 0), the time to compute a vector of n elements should be the time of one element multiplied by n. That is, when the machine is serial, P should have little or no effect:

    t_serial = (1/r) * n        when P = 0

If the machine is very parallel, P should dominate n in the function G. One of the many possible equations that fits this behavior is the following simple equation:

    t = (1/r) * (n + P)

Rewriting the last equation in terms of performance by taking the reciprocal of each side:

    performance = 1/t = r / (n + P)

The maximum rate in a parallel computer occurs asymptotically for vectors of infinite length, hence Hockney gives r the subscript ∞.

    performance = 1/t = r∞ / (n + P)

If we assign P equal to n, then performance is one half of the maximum performance.

    half performance = 1/t = r∞ / (2n)        when n = P

Hockney names our P as n1/2 to recall the one-half factor of performance. Hockney’s performance model is derived by substituting r∞ for r and n1/2 for P in the equation:

    t = (1/r∞) * (n + n1/2)        Hockney’s Performance Equation

He claims that the two parameters r∞ and n1/2 completely describe the hardware performance of his idealized generic computer and give a first-order description of any real computer. These characteristic parameters are called:

r∞ - the maximum or asymptotic performance - the maximum rate of computation in floating-point operations performed per second. This occurs asymptotically for vectors of infinite length, hence the subscript.

n1/2 - the half-performance length - the vector length required to achieve half the maximum performance.

For a particular machine, r∞ and n1/2 are constants. We can compare different machines by measuring and comparing r∞ and n1/2. The next section discusses how to measure the two parameters.
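Before measuring them, it helps to see what the model predicts. The sketch below (an illustration added here, not Hockney's own code) evaluates t = (1/r∞)(n + n1/2) for several vector lengths and prints the average rate n/t, using the Cray-1 values r∞ = 22 MFLOPS and n1/2 = 18 that appear later in Figure 2.8.

* Sketch: predicted time and average rate from Hockney's equation
* t = (1/RINF)*(N + NHALF); rate = N/t = RINF*N/(N + NHALF)
* (illustrative; RINF and NHALF are the Cray-1 values of Fig. 2.8)
      PROGRAM HOCK
      REAL RINF, NHALF, T, RATE
      INTEGER N, K
      RINF  = 22.0E6
      NHALF = 18.0
      N = 1
      DO 10 K = 1, 8
         T    = (REAL(N) + NHALF) / RINF
         RATE = REAL(N) / T / 1.0E6
         PRINT *, 'N =', N, ' T =', T, ' RATE =', RATE, ' MFLOPS'
         N = N * 4
   10 CONTINUE
      STOP
      END

The printed rate passes through half of r∞ near n = n1/2 = 18 and approaches the asymptotic 22 MFLOPS only for vectors of a few hundred elements or more, which is exactly the behavior the two parameters are meant to capture.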

2.5.2 Measuring r∞ and n1/2

The maximum performance r∞ and the half-performance length n1/2 of a computer are best regarded as experimentally determined quantities, measured by timing the performance of the computer on a test program which Hockney calls the (r∞, n1/2) benchmark. Before looking at the benchmark program, we will explore the behavior of Hockney’s performance equation. To do so, we plot t, the time for the vector operation, versus n, the vector length, on a graph.

[Plot of Hockney's equation t = (1/r∞)(n + n1/2): time in seconds (t) versus vector length (n); the line has slope 1/r∞ and crosses the n-axis at -n1/2]

Fig. 2.6 Plot of Hockney’s Performance Equation

Notice that the negative of the intercept of the line with the n-axis gives the value of n1/2 and the reciprocal of the slope of the line gives the value of r∞.

To determine r∞ and n1/2, we collect data points by running a FORTRAN program, plot the points on a graph and draw the best-fit line through them. We can either eyeball the best-fit line through the points or use a linear least-squares approximation. The parameter r∞ is the reciprocal of the slope of the line, and n1/2 is the negative of the n-axis intercept. The following FORTRAN program (see Figure 2.7) varies n over one hundred values and prints out the CPU times. The program is designed for FORTRAN 77 on a UNIX-based system. On other systems, you may have to replace the ETIME routine with a routine that returns a REAL value in seconds of the elapsed CPU time (not wall time!). Depending on the speed of the computer, you should adjust the constant NMAX (the maximum range of N). If NMAX is set too low on a very fast machine, e. g., a Cray Y-MP, the machine will finish most or all of the calculation before the CPU clock advances. The timings will be close to the resolution of the system clock, and the values will be noisy and meaningless. If NMAX is set too high on a slow computer, e. g., an IBM PC XT, the program will run for many hours. We adjust NMAX to give a reasonably straight line without having to wait too long for the results. The program measures the CPU time to perform the FORTRAN code for a vector multiplication as follows:

      DO 10 I = 1, N
         A(I) = B(I) * C(I)
   10 CONTINUE

In a pipelined vector processor, e. g., the Cray-1, the DO 10 loop in the above code would be replaced with a vector instruction by the vectorizing compiler. This vector instruction will utilize the pipelined multiplication unit to give a significant increase in MFLOPS performance over a serial processor.

* Performance Measurement Program for UNIX-based systems
* Computes 32 bit floating point r[infinity] and n[1/2],
* Hockney's performance parameters
* By Dan Hyde, March 18, 1992

* Adjust the NMAX constant for a particular machine.
* NMAX should be large enough to obtain meaningful times.

      PARAMETER (NMAX = 100000)
      INTEGER I, N
      REAL T0, T1, T2, T
      REAL A(NMAX), B(NMAX), C(NMAX)
      REAL TARRAY(2)

* initialize B and C to some realistic values (non zero!)
      DO 5 I = 1, NMAX
         B(I) = 12.3
         C(I) = 11.7
    5 CONTINUE

* find overhead to call ETIME routine
* ETIME returns elapsed execution time since start of program
      T1 = ETIME(TARRAY)
      T2 = ETIME(TARRAY)
      T0 = T2 - T1

* vary N for 100 times
      DO 20 N = (NMAX / 100), NMAX, (NMAX / 100)
         T1 = ETIME(TARRAY)

* start of computation to time
         DO 10 I = 1, N
            A(I) = B(I) * C(I)
   10    CONTINUE
* end of computation

         T2 = ETIME(TARRAY)
         T = T2 - T1 - T0
         PRINT *, 'N = ', N, ' TIME = ', T, ' seconds'
   20 CONTINUE
      STOP
      END

Fig. 2.7 FORTRAN Code to Collect Data for Hockney’s r∞ and n1/2 Parameters
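As a possible follow-on step (not part of the original program), the sketch below fits the best line t = A*n + B to the measured (n, t) pairs by linear least squares and reports r∞ = 1/A and n1/2 = -B/A, i. e., the negative of the n-axis intercept. It assumes the pairs have been stored in arrays XN and TT.

* Sketch: least-squares fit of t = A*n + B to NPTS measured (n, t) pairs,
* then RINF = 1/A and NHALF = -B/A (the negative n-axis intercept).
* Assumes XN(I) holds the vector lengths and TT(I) the measured times.
      SUBROUTINE FITRN(XN, TT, NPTS, RINF, NHALF)
      INTEGER NPTS, I
      REAL XN(NPTS), TT(NPTS), RINF, NHALF
      REAL SX, SY, SXX, SXY, A, B, D
      SX  = 0.0
      SY  = 0.0
      SXX = 0.0
      SXY = 0.0
      DO 10 I = 1, NPTS
         SX  = SX  + XN(I)
         SY  = SY  + TT(I)
         SXX = SXX + XN(I) * XN(I)
         SXY = SXY + XN(I) * TT(I)
   10 CONTINUE
* standard least-squares formulas for slope A and intercept B
      D = REAL(NPTS) * SXX - SX * SX
      A = (REAL(NPTS) * SXY - SX * SY) / D
      B = (SXX * SY - SX * SXY) / D
      RINF  = 1.0 / A
      NHALF = -B / A
      RETURN
      END

Called with the data printed by the program in Figure 2.7, it returns r∞ in floating-point operations per second (divide by one million for MFLOPS) and n1/2 in elements; noisy points near the clock resolution should be discarded first.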

2.5.3 Using r∞ and n1/2

In the table in Figure 2.8, the r∞ and n1/2 parameters were measured by a program similar to the one in Figure 2.7. The Crays and the CDC supercomputers are vector processors. The ICL DAP is an SIMD processor array of 4096 simple processors. First, observe that the maximum performance parameter r∞ is not the same as the theoretical peak MFLOPS. The parameter r∞ is based on a program run while the theoretical peak performance is a paper calculation (see Section 2.4). Also, this r∞ is for a 64-bit floating-point vector multiply and not 32-bit. Other precisions and vector operators have a different r∞. For example, the r∞ is 200 MFLOPS for a 32-bit vector multiply on a CDC 205, or double the 64-bit value. The theoretical peak performance counts the number of 64-bit additions as well as multiplies. For example, the Cray-1 can do an add and a multiply every clock period. Therefore, with a 12.5 nanosecond clock, the peak is 160 MFLOPS.

Computer                       r∞ MFLOPS    n1/2    Theoretical Peak MFLOPS
Cray-1                         22           18      160
ICL DAP (4096 processors)14    16           2048    --
CDC 205 (2-pipes)              100          100     200
Cray X-MP/22 (1 processor)     70           53      210

14 The ICL DAP values are for 32-bit floating-point precision. The rest are for 64-bit.

Fig. 2.8 r∞ and n1/2 Measurements for Some Supercomputers

In the table of Figure 2.8, notice the large range in the n1/2 parameter. Recall that n1/2 is a measure of parallelism or the length of the vector for half maximum performance. Do we desire a large n1/2? Or a low n1/2? A more parallel machine should be faster because of the speedup, which implies we want a large n1/2. However, a computer solves efficiently only those problems with a vector length greater than its n1/2. Therefore, the higher the value of n1/2, the more limited is the set of problems that the computer may solve efficiently. A large n1/2 implies a more special purpose machine. Solving a problem with small vectors on a large n1/2 machine is wasteful of resources, much like carrying one passenger on a city bus. A low n1/2 computer can solve more problems effectively or is more general purpose. In conclusion, we want a high r∞ and a low n1/2 together.

Of the machines in the table of Figure 2.8, which one is best? This depends on the application area, the cost of the machine, the r∞ of the machine and other considerations. If your application area has mostly short vectors or is scalar, the Cray-1 with a small n1/2 would be the best choice. If your application has mostly long vectors, the CDC 205 would be the best choice. With the knowledge of the n1/2 of a particular machine, we can select or design an algorithm which better matches the machine.
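The trade-off can be made concrete with the model itself. The sketch below (an illustration using the Figure 2.8 values, not a published comparison) prints the rate n/t predicted by Hockney's equation for the Cray-1 (r∞ = 22, n1/2 = 18) and the CDC 205 (r∞ = 100, n1/2 = 100) at several vector lengths.

* Sketch: predicted rate RINF*N/(N + NHALF) for two machines of Fig. 2.8
* (illustrative; RINF values are in MFLOPS, so the rates are in MFLOPS)
      PROGRAM COMPAR
      REAL R1, P1, R2, P2, RATE1, RATE2
      INTEGER N, K
      R1 = 22.0
      P1 = 18.0
      R2 = 100.0
      P2 = 100.0
      N = 1
      DO 10 K = 1, 6
         RATE1 = R1 * REAL(N) / (REAL(N) + P1)
         RATE2 = R2 * REAL(N) / (REAL(N) + P2)
         PRINT *, 'N =', N, ' CRAY-1:', RATE1, ' CDC 205:', RATE2
         N = N * 4
   10 CONTINUE
      STOP
      END

With these parameters the two machines break even at a vector length of only a few elements; below that the Cray-1's low n1/2 wins, and above it the CDC 205's higher r∞ takes over, which is the trade-off described above.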

Hockney claims the two parameters r∞ and n1/2 provide us with a quantitative means of comparing the parallelism and maximum performance of all computers. In the next sections, we will explore how effective Hockney’s performance model is on computers other than vector processors.

2.5.4 Extending Hockney’s Performance Model

Recall that SIMD machines have one instruction unit which issues the same instruction to P processors all executing in lockstep. The ICL DAP in the table of Figure 2.8 is of this class. How well does Hockney’s performance model fit SIMD machines? Assume P is the number of processors and n is the length of a vector. If n ≤ P, then we have more processors than we need, and the total time is the time to do one calculation, which is independent of the vector length n.

For the more common case when n > P, let t_pe be the time for one processor to compute the vector operation, e. g., a floating-point multiply. Hockney derives the following for SIMD machines:

    r∞ = P / t_pe

    n1/2 = P / 2

For half performance, we need a vector length of P/2, which makes sense as half of the processors would be used. Notice that, as we increase P, the number of processors in the processor array, both r∞ and n1/2 increase linearly. Therefore, an SIMD machine with a large number of processors has a large n1/2 and tends to be a special purpose machine. For example, the DAP with its 4096 processors has an n1/2 of 2048. Therefore, Hockney’s parameters are useful for SIMD machines as well as vector processors. What about MIMD machines?

2.6 Performance of MIMD Computers

MIMD computers have multiple instruction units issuing instructions to multiple processing units. Recall that the two main subclasses are shared memory and message passing MIMD. Assuming homogeneous processors, replicating a processor P times multiplies the n1/2 and r∞ of an individual processor by P. A real serial processor will have a small but non-zero n1/2 due to loop inefficiencies and other factors. Consequently, a large number of processors implies a large n1/2. However, parameters such as n1/2 and r∞ are only part of the story for MIMD computing. Other problems may dominate for MIMD and produce poor performance. According to Hockney, the three main areas of performance problems (overheads) for MIMD computing are the following [Hockney, 1988]:

1) scheduling work among the available processors (or instruction streams) in such a way as to reduce the idle time of the processors waiting for others to finish.

2) synchronizing the processors so that arithmetic operations take place in the correct order. In most MIMD algorithms the results of one portion of the algorithm are required before starting another portion. If many processors are working on the first portion, the processors must synchronize before they can proceed on to the second portion.

3) accessing arguments from memory or other processors. There are many ways to access arguments or data. For example, overhead costs are associated with memory conflicts in a shared memory machine; with communication between message passing processors; and with misses in a cache-primary memory hierarchy.

In the above three areas, the accessing of arguments is a potential bottleneck for all computers. For example, the slowness of moving vectors of data from memory to the fast pipelined arithmetic functional units is the main cause of the difference between the peak performance rates stated for vector processors and the average performance rates found in realistic user programs. However, the other two areas, scheduling and synchronizing, are new problems introduced by MIMD computing. A vector computer of one processor has no need to synchronize and schedule itself. Synchronization on an SIMD computer is automatic since every instruction is synchronized by the instruction unit. In SIMD computing, all the processors are trivially scheduled to perform the same instruction. Reducing the idle time by scheduling work on MIMD processors is non-trivial. Researchers have studied the problem extensively. We will discuss several techniques later in the book. The effort to balance the work or load across the processors is called load balancing.

Synchronization of MIMD processors may be accomplished by special hardware, e. g., a semaphore on a shared memory machine, or by the arrival of a message on a message passing MIMD machine. Processors waiting for synchronization events may be a major source of overhead. Hockney derives performance models and efficiency parameters for scheduling, communication, and synchronization in MIMD computing, much like n1/2 and r∞. The interested reader should consult his work [Hockney, 1988]. Notice that communication between processors is involved in both the accessing of data and synchronization. Many of the first MIMD machines had relatively slow communication mechanisms and exhibited performance poor enough to incite controversy.

2.6.1 The MIMD Performance Controversy

In the mid 1980s, the research community was debating the practicality of large-scale MIMD systems with more than a dozen processors. At conferences, some researchers were presenting papers on thousands of processors. Others were arguing that such systems would have dismal performance and weren’t worth building.

Minsky’s Conjecture

Many researchers in the debates were influenced by a 1971 paper authored by Marvin Minsky of MIT [Minsky, 1971]. Basing his analysis on models of performance, Minsky conjectured that realizable speedups on MIMD machines were on the order of log2 P, where P is the number of processors. Using Minsky’s conjecture, one could expect a speedup of only about 4 from a system of 16 processors. Experience with existing multiprocessors during this time tended to support Minsky’s point. Several researchers felt that Minsky’s conjecture was overly pessimistic and presented their own estimates. Basing his analysis on statistical modeling, Hwang estimates the realizable speedup as P/ln P, where ln is the natural logarithm [Hwang, 1984].

Number of Processors    Minsky’s Predicted Speedup    Hwang’s Estimate
P                       log2 P                        P/ln P
4                       2                             2.9
8                       3                             3.8
16                      4                             5.8
1024                    10                            147.8
4096                    12                            493.5

Fig. 2.9 Minsky and Hwang’s Predicted Speedup for MIMD Computers

The above analysis explains why in the mid-1980s many computer vendors built multiprocessors consisting of only two or four processors, e. g., the Cray X-MP.

Amdahl’s Law

In 1967, Gene Amdahl [Amdahl, 1967] argued convincingly that one wants fast scalar machines, not MIMD machines. He argued that a small number of sequential operations can effectively limit the speedup of a parallel algorithm. Let f be the fraction of operations in a computation that must be performed sequentially. The maximum speedup achievable by a parallel computer with P processors is the following:

    maximum speedup = 1 / (f + (1 - f)/P)        Amdahl’s Law

To demonstrate Amdahl’s law, consider a program where 10 percent of the operations, i. e., f = 0.1, must be performed sequentially. His law gives the following:

    maximum speedup = 1 / (0.1 + 0.9/P)

As the number of processors P increases, the term 0.9/P goes to zero. Therefore, Amdahl’s law states that the maximum speedup is 10, no matter how many processors are available.

Karp’s Wager

In 1986, Alan Karp [Karp, 1986] proposed his famous wager:

“I have just returned from the Second SIAM Conference on Parallel Processing for Scientific Computing in Norfolk, Virginia. There I heard about 1,000 processor systems, 4,000 processor systems, and even a proposed 1,000,000 processor system. Since I wonder if such systems are the best way to do general-purpose scientific computing, I am making the following offer.

“I will pay $100 to the first person to demonstrate a speedup of at least 200 on a general-purpose, MIMD computer used for scientific computing. This offer will be withdrawn at 11:59 p.m. on December 31, 1995.”15

With Minsky’s conjecture, Amdahl’s law, and his own experience, Karp felt his money was safe. If Minsky is right, one would need 2^200 processors (that’s a lot of processors!) for a speedup of 200. If Hwang is correct, one would need over 2000 processors. Amdahl’s law requires less than 0.5% of the code to be sequential for a speedup of 200. To sweeten the pot, C. Gordon Bell, chief architect of the DEC VAX machines, proposed an additional $1000 prize. This has become known as the Gordon Bell Award, which recognizes the best contributions to parallel processing, either speedup or throughput, for practical, full-scale problems. These prizes are still awarded annually. To the surprise of many, in March 1988, three researchers at Sandia National Laboratory, John L. Gustafson, Gary R. Montry and Robert E. Benner, demonstrated speedups of 1009 to 1020 on an nCUBE hypercube machine with 1024 processors [Gustafson, 1988]. However, the Sandia group altered the definition of speedup slightly. Even with the accepted definition, they still had speedups of 502 to 637, well over the 200 required for the prizes. The Sandia group argued that the accepted definition was unfair to massively parallel machines.
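The 0.5 percent figure follows directly from Amdahl's law. The sketch below (an illustration added here, not taken from any of the cited papers) evaluates the bound 1/(f + (1 - f)/P) for a few sequential fractions f and processor counts P, and prints the limiting speedup 1/f.

* Sketch: Amdahl's law, maximum speedup = 1/(F + (1-F)/P)
* (illustrative; F is the sequential fraction, P the processor count)
      PROGRAM AMDAHL
      REAL F(3), S, P
      INTEGER I, J
      DATA F /0.1, 0.01, 0.005/
      DO 20 I = 1, 3
         P = 1.0
         DO 10 J = 1, 6
            P = P * 10.0
            S = 1.0 / (F(I) + (1.0 - F(I)) / P)
            PRINT *, 'F =', F(I), ' P =', P, ' SPEEDUP =', S
   10    CONTINUE
* the limit as P grows without bound is 1/F
         PRINT *, 'F =', F(I), ' LIMITING SPEEDUP =', 1.0 / F(I)
   20 CONTINUE
      STOP
      END

For f = 0.1 the speedup saturates at 10, as in the example above, and even f = 0.005 caps the speedup at 200, which is why Karp considered his money safe.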

15 Karp, Alan, “What Price Multiplicity?,” Communications of the ACM, Vol. 29, No. 2, Feb. 1986, pp. 87.

    accepted definition of speedup = time for fastest serial algorithm / time for parallel algorithm

For massively parallel machines (typically over a thousand processors), they argued one should scale the problem size as one increases the number of processors. Sandia’s scaled speedup is the same as the accepted definition except that the problem size per processor is fixed. In the accepted definition, the same fixed problem size is run on both the serial processor and the P processors. Many have accepted the Sandia group’s argument that for massively parallel machines the speedup should be defined with the problem size per processor fixed. The Sandia group’s paper is well worth reading.

2.6.2 Performance of Massively Parallel Machines

With the arrival of massively parallel machines in the last couple of years, there is a need to evaluate such machines with benchmarks on problems that make sense. The problem size and rules for the Standard LINPACK benchmark we discussed before do not permit massively parallel computers to demonstrate their potential performance. The basic flaw of the Standard LINPACK benchmark is that solving 100 equations is too small a problem. To provide a forum for comparing such machines, [Dongarra, 1992] proposed a new benchmark. This benchmark involves solving a system of linear equations, as did LINPACK, where the problem size is allowed to increase (as argued by the Sandia group) and the performance numbers reflect the largest problem run on the machine. Dongarra’s parameters are based on Hockney’s r∞ and n1/2 parameters, but they are actually defined and measured differently. Instead of the vector length, Dongarra’s n1/2 is based on the problem size which gives half performance. The definitions of the column headings in Figure 2.10 are the following:

rmax - the performance in GFLOPS for the largest problem run on the machine.

nmax - the size of the largest problem run on the machine.

n1/2 - the problem size where half of the nmax execution rate is achieved.

rpeak - the theoretical peak performance in GFLOPS.

Computer              Cycle Time    No. of Processors    rmax (GFLOPS)    nmax     n1/2     rpeak (GFLOPS)
NEC SX-3/44           2.9 ns        4                    20.0             6144     832      22.0
Intel Delta           40 MHz        512                  13.9             25000    7500     20.0
Cray Y-MP C90         4.2 ns        16                   13.7             10000    650      16.0
TMC CM-200 (16)       10 MHz        2048 (17)            9.0              28672    11264    20.0
TMC CM-2 (18)         7 MHz         2048 (19)            5.2              26624    11000    14.0
Alliant Campus/800    40 MHz        192                  4.8              17024    5768     7.7
Intel iPSC/860        40 MHz        128                  2.6              12000    4500     5.0
nCUBE 2               20 MHz        1024                 1.9              21376    3193     2.4
MasPar MP-1 (20)      80 ns         16,384               0.44             5504     1180     0.58

16 Thinking Machines, Co. (TMC) CM-200 is an SIMD machine.
17 The CM-200 really has 65,536 one-bit processors. However, 32 processors can access a Weitek floating-point processor. Therefore, one way to view the machine is as 2048 floating-point processors.
18 Thinking Machines, Co. (TMC) CM-2 is an SIMD machine.
19 The CM-2 really has 65,536 one-bit processors. However, 32 processors can access a Weitek floating-point processor. Therefore, one way to view the machine is as 2048 floating-point processors.

Fig. 2.10 Results from Dongarra’s Benchmark for Massively Parallel Computers21

With these data, we can compare massively parallel computers. We desire a high rmax and a large nmax. Since the value of nmax is limited by the available memory size, it is an indicator of the size of the memory and the effectiveness of any memory hierarchy. We can use Dongarra’s n1/2, i. e., the size of the problem where half of the nmax execution rate is achieved, much like Hockney’s n1/2. The Cray Y-MP C90’s n1/2 of 650 means that many more problems can be solved efficiently on it than, for example, on the CM-2 with an n1/2 of 11000.

2.7 Limits to Performance

Many of us are amazed at the performance of today’s supercomputers. Today’s fastest computers can compute 20 GFLOPS, or 20 billion floating-point operations per second. The supercomputing research community is planning the designs of TeraFLOPS computers, ones that can compute at a trillion FLOPS. We wonder if there are any limits to achieving higher and higher performance. In this section, we will first discuss several physical limitations, then later, algorithmic limitations.

2.7.1 Physical Limits

One limitation to the design of computers is the speed of light. Light travels 30 centimeters or 11.8 inches in one nanosecond. Since the clock period of current supercomputers is only a couple of nanoseconds, e. g., the NEC SX3/44’s clock is 2.9 nanoseconds, all the wires must be short and components must be physically close together, i. e., a matter of inches, which implies a dense packing of the circuitry. As logic gates are forced to switch faster, they require more energy to switch. Dense packing of circuits and more switching energy imply a lot of energy dissipated in the form of heat in a concentrated volume. Therefore, how to cool supercomputers is a major engineering design problem. Some machines use water for cooling while others, such as the Cray-1, used Freon, the refrigerant commonly found in refrigerators. Observe that the cooling problem exists for both fast scalar processors and parallel machines. Another limitation is the clock speed of integrated circuit chips. This especially limits how fast one can build a single processor. For higher performance, designers are forced to use parallel processors. In 1990, Harold Stone stated the following:

20 MasPar MP-1 is an SIMD machine.
21 Dongarra, Jack J., “LINPACK Benchmark: Performance of Various Computers Using Standard Linear Equations Software,” Supercomputing Review, Vol. 5, No. 3, March, 1992.