Handout # 1 Problem Set # 1


ELEG 652 Principles of Parallel Computer Architectures
Handout # 1 Problem Set # 1
Issued: Wednesday, September 13, 2006
Due: Wednesday, October 4, 2006

Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment, make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.

1.- The LINPACK benchmark (Culler's text, Chapter 1, Figure 1.10, page 22) is often used to report the performance of various computers, including the most powerful parallel computers. In this problem, you will learn how to obtain the LINPACK benchmark program and use it to benchmark two machines. LINPACK is a collection of subroutines that analyze and solve linear equations and linear least squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions (SVD) of rectangular matrices and applies them to least squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. LINPACK was designed for supercomputers that were in use in the 1970's and early 1980's. LINPACK has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory and vector supercomputers. Visit http://www.netlib.org/linpack/index.html for a detailed description of LINPACK and browse the repository at the beginning of this page for many subroutines.

 Please create a subdirectory, say called LINPACK, and store the downloaded LINPACK benchmark code (as explained below).

 Download the benchmark from http://www.capsl.udel.edu/courses/eleg652/2006/homework/clinpack.c

 Print out a code listing and read it, as a part of your homework.

 Compile the program using single precision (the command line is cc -O3 -fallow-single-precision -DROLL -DSP clinpack.c -o clinpack). Run the code on the EECIS main teaching machine mlb.acad.ece.udel.edu. (You should all have accounts; otherwise go to http://www.eecis.udel.edu/.)

 What is the LINPACK performance of your run? Report it carefully, e.g. you should state clearly the following information:

 The experimental test bed: the machine net address, the OS version, compiler version, the machine configuration, processor type, model, speed, cache size, memory capacity, etc. Check the man pages for the UNIX command to get the CPU time.

 The input: the size of the problem, which should be 100, 256 and 512. You should make at least one run per data size (three in total).

 The performance parameters: explain clearly what is being reported. (The relevant chapters of "Fundamentals of Matrix Computations" by D.S. Watkins, and Chapter 4 of "Numerical Linear Algebra for High Performance Computers" by Jack Dongarra et al., will be of use here.) Also check http://www.netlib.org/performance/ and compare how it matches your performance.

 Comment on your results and explain your observations.

2. - Please perform the same steps as in Problem 1 on your personal PC. If you do not have one, try to use one from our labs or a friend's. There are special versions of the benchmark available for PCs; look for the linpack-pc version on the web page (http://www.netlib.org/benchmark/linpack-pc.c). If you do not have access to a PC with a C compiler, use any other machine (workstation, server, etc.) you have access to.

Points and Hints:
- Microsoft Visual Studio users: Go to Project > Settings. In the popup window, choose the C/C++ tab and add -DSP -DUNROLL to the Project Options box.
- For all: You need to change the code a bit. Check the code for a variable n; this tells you how big an array is going to be used. Change that variable to 100, 256 and 512. Please be advised that there are other arrays depending on that value (arrays of 200 and 201), so change them accordingly.

3. - In this problem, you will learn to apply Amdahl's law. Suppose that you are given a large scientific program P, and 8% of its code is not parallelizable.

(a) What is Amdahl's Law?

(b) What is the maximum speedup you can achieve by parallelization?

(c) If you wanted to achieve a speedup of 15 (i.e. to make your program run 15 times faster), what percentage of the code would have to be parallelized?

(d) Assume someone has looked at P and your computation and becomes quite upset with what you said about the speedup. Now, if this person is given a chance to use a massively parallel machine (with the option to upgrade to many more processors than the speedup limit you gave for P), what (hopefully positive) advice would you give in this case? Please justify.

4. - The Top 500 list ranks the fastest scientific machines in the world according to their performance on the LINPACK benchmark. Visit the website at http://www.top500.org/ and look at the top 100 performers (there are many repeats of a particular vendor product, since individual supercomputer sites rather than product lines are counted). Determine the particular types of supercomputers that are in the list. Obtain such data going back to 1993 and make a graph of the changes in computer types across the years. You can obtain this information on the website. Please use the website facilities instead of trying to compile it yourself, since that would take too long.

One important application of high performance computing systems is solving large engineering problems efficiently. Some physical/engineering problems can be formulated as mathematical problems involving a large number of algebraic equations; a million equations with a million unknowns is not a rarity. The core computation in solving such systems of equations is matrix operations, either matrix-matrix multiply or matrix-vector multiply, depending on the problem and the solution method under study. Therefore, a challenge for computer architectures is how to handle these two matrix operations efficiently.

In this problem, you will study the implementation of the matrix-vector multiply (or MVM for short). There are many challenging issues associated with the implementation of MVM. First of all, matrices in many engineering problems are sparse matrices, where the number of non-zero entries is much smaller than the total size of the matrix. Such sparse matrices pose an interesting problem in terms of efficient storage and computation. The use of different compressed formats for such matrices has been proposed and is used in practice. There are several ways of compressing sparse matrices to reduce the storage space; however, identifying the non-zero elements with the right indices is not a trivial task. The following questions deal with three types of sparse matrix storage.

Compressed Row Storage (CRS) is a method that deals with the rows of a sparse matrix. It consists of 3 vectors. The first vector is the "element value vector": a vector which contains all the non-zero elements of the array (collected by rows). The second vector has the column indices of the non-zero values; this vector is called the "column index vector". The third vector consists of row pointers into the first vector. That is, each element of this "row pointer vector" gives the position in the element value vector (for example, the xth element) at which a new row starts.

With the above format in mind, try to compress the following sparse matrix and represent it in CRS format:

        0 1 0 0 0 3 0 0
        0 0 0 4 0 0 0 0
        0 0 0 0 0 0 0 1
A =     0 0 0 1 0 7 0 0
        6 0 8 0 0 0 2 0
        0 0 0 0 0 3 0 0
        0 0 0 0 0 0 0 5
        0 2 0 0 0 0 1 0

An example of CRS is presented below:

         0  0  0  0  6  0
         0  1  0  0  0  0
B =      0  0  0  0  8  0
        12  0  0  5  0  0
         0  0  0 10  0  0
         0  0  5  0  2  0

Vector 1 (non-zero elements): 6 1 8 12 5 10 5 2
Vector 2 (column indices):    4 1 4 0 3 3 2 4
Vector 3 (row pointers):      0 1 2 3 5 6 9   (the last element is defined as the number of non-zero elements plus one)

This is the result if the initial index is zero. If the initial index is one, then every entry in Vectors 2 and 3 is increased by one, except for the last entry of Vector 3, which will be 9 in both cases. Both results (initial index 0 or 1) are acceptable, but please be consistent.

5. - In this problem, you are going to experiment with an implementation of Matrix-Vector Multiply (MVM) using different storage formats. MVM has the following pseudo-code:

for I: 1 to N
    for J: 1 to N
        C[I] += A[I][J] * B[J];

where C and B are vectors of size N and A is a sparse matrix of size N x N [1]. In this specific code, the order is sparse matrix times vector. Create a C code that:

a. Creates a random dense vector B of size N.

b. Reads a sparse matrix A of size N x N and transforms it to CRS format.

c. Calculates the MVM (A*B) of the vector and the sparse matrix in CRS format. Make sure that your code has some type of timing function (e.g. getrusage, clock, gettimeofday, _ftime, etc.) and time the MVM operation using the CRS matrix.

d. Repeats the operation with a Compressed Column Storage [2] format matrix and compares performance numbers. Report your findings as to which method has better performance, and run it for at least five matrix/vector sizes. You can use the same methods that were used in Homework 1 to calculate performance. Please report your machine configuration as you did in Homework 1.

1. Have a look at the following storage representation, known as Jagged Diagonal Storage (JDS):

        1 0 0 2 1
        0 2 0 0 0
C =     3 0 3 0 0
        0 0 0 4 0
        0 1 0 0 5

C (values):          1 3 1 2 4 2 3 5 1
jC (column indices): 1 1 2 2 4 4 3 5 5
Off (offsets):       1 6 9 10
Perm (permutation):  1 3 5 2 4

i. Explain how it is derived. (Hint: think of the sequence E.S.S., which stands for Extract, Shift and Sort.)

ii. Why is this format claimed to be useful for the implementation of iterative methods on parallel and vector processors?

iii. Can you imagine a situation in which this format will perform worse?

iv. Modify your program to create (read) matrices in JDS format and repeat your experiments as in (2). Report your results for this storage method as you did for CRS and CCS.

[1] If you want more information about MVM, take a look at the Fortran implementation of MVM at http://www.netlib.org/slatec/lin/dsmv.f
[2] Compressed Column Storage is the same as CRS, but it uses column pointers instead of row pointers; the non-zero elements are gathered by columns rather than by rows; and Vector 2 contains the row indices instead of the column indices.

6. - You have learned in class that the performance of a vector architecture can often be described by the timing of a single arithmetic operation on a vector of length n. This fits closely the following generic formula for the time of the operation, t, as a function of the vector length n:

t = r∞^(-1) (n + n1/2)

The two parameters r∞ and n1/2 describe the performance of an idealized computer under the architecture model/technology and give a first order description of a real computer. They are defined as:

o The maximum asymptotic performance r∞ - the maximum rate of computation in floating-point operations performed per second (in MFLOPS). For the generic computer, this occurs asymptotically for vectors of infinite length.

o The half-performance length n1/2 – the vector length required to achieve half the maximum possible performance.

The benchmark used to measure (r∞, n1/2) is shown below:

T1 = call Timing_function;
T2 = call Timing_function;
T0 = T2 - T1;
FOR N = 1, NMAX {
    T1 = call Timing_function;
    FOR I = 1, N {
        A[I] = B[I] * C[I];
    }
    T2 = call Timing_function;
    T = T2 - T1 - T0;
}

Please replace the call Timing_function with a C timing function (clock, getrusage, etc.). Please identify and explain your choice.

Assume vector machine X is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 10, and vector machine Y is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 100. What do these numbers mean? How can you judge the performance difference between X and Y? Explain.

Describe how you would use this benchmark (or a variation of it) to derive (r∞, n1/2). Rewrite the code in C, port it onto a Sun workstation, and run it. Tell us which machine you used. Choose a sensible NMAX (suggestion: NMAX = 256). Report your results and make plots of:

o Runtime vs. vector length
o Performance (MFLOPS) vs. vector length

What are the values of the two parameters you obtained? State the machine model and type on which you performed your experiments.

Problem 7. Match each of the following computer systems: KSR-1, Dash, CM-5, Monsoon, Tera MTA, IBM SP-2, Beowulf, SUN UltraSparc-450 with one of the descriptions listed below. The mapping is a one-to-one correspondence in this case.

1. A massively parallel system built with multiple-context processors and a 3-D torus architecture.
2. Linux-based PCs with fast Ethernet.
3. A ring-connected multiprocessor using a cache-only memory architecture.
4. An experimental multiprocessor built with a dataflow architecture.
5. A research scalable multiprocessor built with distributed shared memory and coherent caches.
6. An MIMD distributed-memory computer built with a large multistage switching network.
7. A small-scale shared-memory multiprocessor with a uniform address space.
8. A cache-coherent non-uniform memory access multiprocessor built with a Fat Hypercube network.

You are encouraged to search conference proceedings, related journals, and the Internet extensively for answers. For each computer system, write one paragraph about the system's characteristics, such as the maximum number of nodes.

Hint: You need to learn how to quickly find references in the Proceedings of the International Symposium on Computer Architecture (ISCA), IEEE Transactions on Computers, IEEE Micro, the Journal of Supercomputing, and IEEE Parallel & Distributed Processing.

Problem 8. Please try to fit the machine "Cray Red Storm" into one of the classifications above. If none of them fits, please provide a short architecture description.
