CMP Technical Report No. 7

Does AltiVec of PowerPC G4 work effectively with FORTRAN?

Koun SHIRAI

The Institute of Scientific and Industrial Research Osaka University

10 Sep. 2002

Department of Computational Nanomaterials Design Nanoscience and Nanotechnology Center ISIR, Osaka University

Contents

1 Introduction

2 matrix multiplication
2.1 effects of AltiVec

3 FFT calculation

4 Summary

5 Effects of VAST

Note that this English version is only a partial translation of the original Japanese text, because of the author's limited time and limited linguistic ability.

1 Introduction

It is often said (mainly in advertising articles) that PowerMacs reach supercomputer-level performance by using G4 chips as their main processors. In a general sense, I agree that G4 Macs are much faster than G3 Macs at the same frequency. However, my interest is in scientific applications, in particular Fortran programs. They (I mean not only Apple but also many authors in Mac magazines) especially emphasize the difference between G4 and its predecessors by focusing on the AltiVec unit in G4. They say that a speed of more than 10 gigaflops is achieved. Can you believe it? In our university, the highest-performance supercomputer is an NEC SX-5, whose nominal speed is said to be 10 GFLOPS per single CPU! As I read such articles repeatedly, my expectation grew. On the other hand, my frustration also increased, because these articles discuss only the potential of AltiVec, and few report its actual utility in real applications. In scientific applications, the potential of AltiVec is still unclear.

I have been a user of Absoft Fortran for a long time. When using version 6 with G3, I was almost satisfied with its performance for handy computation. With the advent of version 7 (recently 8) with G4, I am very tempted to use PowerMacs daily for heavy computations. On reading the advertisement for version 7 (or 8) of Absoft ProFortran, in which the potential of AltiVec is emphasized as usual, I almost believed that PowerMacG4 could work like a supercomputer. But, on reading more carefully, I began to doubt the advantage of G4 in the Fortran world. I think that this feeling is shared by many others. Looking into the mailing forum "fortran-dev", you can see that questions about how far Absoft Fortran and VAST cover the AltiVec facilities, among others, are asked repeatedly.

In my understanding, the vector process of AltiVec is valid only for single-precision operations, and this is not due to the way Absoft Fortran is implemented but originates from an intrinsic limitation of the AltiVec architecture. In many scientific applications, double precision is needed, and hence the AltiVec unit of G4 is of no use for them. Further bad news is that Apple is unlikely to adopt a processor with a double-precision version of AltiVec in the near future. I am not familiar with hardware. I do not understand why a double-precision version is so difficult, although some reasons are given in "fortran-dev" [1, 2].

In this report, I examine the reality of the high performance of G4 machines in the Fortran world, particularly whether or not the AltiVec facility is actually implemented in Fortran. I would like to stress that my intention in the present examination is comparison only with G3, not with other CPUs such as Pentium. This is because the main interest is the effect of AltiVec within the PowerPC family. Of course, I know that AltiVec is not the only difference between G3 and G4. G4 has further advantages, such as a superscalar arithmetic unit, data size, large cache memory, etc. Because all I can observe is the final results, what I compare is only the difference in total performance. In writing this report, I benefited greatly from an article [3].

The machines used in this report are as follows.

machine             clock    cache                RAM     HD     note
                    (MHz)    (KB)                 (MB)    (GB)
G3 machine
  PowerMac6100 (66)   215    256                    72           with G3 accelerator
  iMac                450    512                   192
G4 machine
  PowerMacG4          400    1000                  512     20
  PowerBookG4         667    256                   512     30
  PowerMacG4          733    (L2) 256, (L3) 1M    1250     60
  PowerMacG4          867    (L2) 256, (L3) 2M    1250     60

All the machines except PowerMac6100 are used on either OS9 or X.

The versions of the Absoft Fortran compiler are as follows.

f77: ProFortran v. 6.2 on OS9.2
f90: ProFortran v. 7.0 on OSX 1.5

[Plot: calculation time t [s] (10^-2 to 10^5) versus matrix size N (10^1 to 10^4), logarithmic axes]

Figure 1: The size of the matrix N versus the calculation time t. The gradient of the line is about 3.

2 matrix multiplication

We examine the calculation speed of various G3 and G4 machines by using a matrix multiplication. The testing program used here is taken from Ref. [3]. The source code is listed below. The author of this code is Hunter at NASA, who frequently contributes useful writing for scientific computing.

program list 1

      implicit real (a-h, o-z)
      parameter(n=200, nloops=200)
      dimension tarray(2), a(n,n), b(n,n), c(n,n)
      real mflops
      pi=4.*atan(1.)
! fill the input matrices
      do i=1, n
        do j=1, n
          a(i,j)=real(j)*real(i)/pi/1024.**2.
          b(i,j)=real(j)/sqrt(pi)*real(i)/pi/1024.
        end do
      end do
! first call starts the timer
      time=DTIME_(tarray)
      do l=1, nloops
        do i=1, n
          do j=1, n
            cs=0.
            do k=1, n
              cs=cs+a(i,k)*b(k,j)
            end do
            c(i,j)=cs
          end do
        end do
      end do
! second call returns the time elapsed since the first call
      time=DTIME_(tarray)
      mflops=(real(nloops)*2.*real(n**3)/1.0e6)/time
      write(*,*) 'time:', time
      write(*,'(A,F10.2)') ' MFLOPS:', mflops
      end

Before testing performance, I checked how the computation time t is related to the size of the matrix N. Because a multiplication of N x N matrices requires of the order of N^3 arithmetic operations, t is expected to be proportional to N^3. This relationship is shown in Fig. 1.

Although, as stated in the Introduction, comparison with CPUs other than G3 is not my intention, I show the speed of the G3 and G4 family compared with other CPUs, because this makes clear the current status of the PowerPC family in the mainstream of workstations. The result of the benchmark tests is shown in Fig. 2. Compilation was done with almost the highest optimization level of each compiler. Of course, SX5 ranks at the top by far. Its speed is recorded as 9.2 GFLOPS, which is consistent with the nominal value. The speed of PowerMacG4 is about half that of other workstations but, considering the prices, it can be concluded that they have high cost-performance. However, the speed is lower than the advertised value (several GFLOPS) by an order of magnitude. Even though exaggeration is a common method in advertisements, it is rare for the difference to be enlarged to such an extent.
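The MFLOPS figure printed by program list 1 rests on counting 2 N^3 floating-point operations per product (N^3 multiplications and N^3 additions). The following is a minimal sketch of that bookkeeping, not the code used for the measurements: the module and procedure names are hypothetical, and the standard intrinsic cpu_time is swapped in for DTIME_ on the assumption that a reader may not have the Absoft library at hand.

```fortran
module mflops_sketch
  implicit none
contains

  ! MFLOPS for nloops products of n-by-n matrices.
  ! Each product is counted as 2*n**3 operations, as in program list 1.
  ! real(n)**3 is used instead of real(n**3) to avoid integer overflow
  ! for large n (a small departure from the original expression).
  pure function matmul_mflops(n, nloops, seconds) result(rate)
    integer, intent(in) :: n, nloops
    real, intent(in) :: seconds
    real :: rate
    rate = (real(nloops) * 2.0 * real(n)**3 / 1.0e6) / seconds
  end function matmul_mflops

  ! Times nloops triple-loop products with the standard cpu_time
  ! intrinsic (an assumption; the report itself uses DTIME_).
  ! checksum returns c(n,n) so that the result of the loop is used.
  subroutine time_loop_product(n, nloops, seconds, checksum)
    integer, intent(in) :: n, nloops
    real, intent(out) :: seconds, checksum
    real, allocatable :: a(:,:), b(:,:), c(:,:)
    real :: t0, t1, cs, pi
    integer :: i, j, k, l

    allocate(a(n,n), b(n,n), c(n,n))
    pi = 4.0 * atan(1.0)
    do j = 1, n
      do i = 1, n
        a(i,j) = real(j) * real(i) / pi / 1024.0**2
        b(i,j) = real(j) / sqrt(pi) * real(i) / pi / 1024.0
      end do
    end do

    call cpu_time(t0)
    do l = 1, nloops
      do i = 1, n
        do j = 1, n
          cs = 0.0
          do k = 1, n
            cs = cs + a(i,k) * b(k,j)
          end do
          c(i,j) = cs
        end do
      end do
    end do
    call cpu_time(t1)

    seconds = t1 - t0
    checksum = c(n,n)
  end subroutine time_loop_product

end module mflops_sketch
```

With n = 200 and nloops = 200 as in program list 1, a measured time of 1 s would correspond to matmul_mflops(200, 200, 1.0) = 3200 MFLOPS.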

Now, we fix the size of the matrix at 200 and examine the calculation time on various machines. In Fig. 3, results are shown with and without optimization, for both single and double precision.

From this figure, we can see that, in both single- and double-precision calculations without optimization, the speed is proportional to the clock speed of the CPU, regardless of whether it is G3 or G4. This means that the advantage of AltiVec is not utilized at all unless optimization options are added.

When optimization is activated, a significant difference appears between G3 and G4. But, interestingly, the difference is more prominent in double-precision calculations than in single-precision ones. As stated before, the vector process does not work with double-precision calculations. Therefore, the difference must originate from something other than the vector-process unit.

[Bar chart: MFLOPS (axis 0 to 600) for iBook 466, iMac 450, PB G4 667, PM G4 400, PM G4 867, PM G4 733, SGI Origin2000 195, IBM Power3 200, IBM Power3 450, and SX5 (bar labelled 9275)]

Figure 2: Benchmark test for a matrix multiplication of double precision.

[Plots (a) and (b): MFLOPS (0 to 400) versus CPU Clock [MHz] (0 to 1000) for G3 and G4 machines, normal and optimized compilation; labelled points include iBook, iMac, and PowerBook]

Figure 3: Matrix multiplication: (a) single precision, (b) double precision. Both results calculated with and without optimizing in the compilation are shown.

I do not know what the optimization actually does. The speed is about four times that of normal compilation, even though the presence of AltiVec does not contribute. Probably, this is the main advantage that we can normally obtain by introducing G4 with Absoft ProFortran.

2.1 effects of AltiVec

Although I do not know what contributes to the increase in speed, it is true that G4 machines are superior to G3 machines overall. However, I could not obtain a significant improvement on G4 by activating AltiVec of the kind shown in Ref. [3]. There, it is reported that a speed-up of about 10 times is obtained by using AltiVec. Let me describe the calculation process more concretely. On G4 machines (the 400 MHz machine or PowerBookG4), the source code was compiled by

      f90 -O -altiVec

But, even in the case of single precision, no improvement in speed was obtained. What was wrong with this? In program list 1, the sequential calculation of the matrix product was then replaced by an intrinsic function of Fortran 90,

      c = matmul(a, b)

and the code was compiled with the option -altiVec. There was no change. I did not understand why.

When is AltiVec activated? (added 9/13/2002)

Meanwhile, I came to know the following restrictions on the activation of AltiVec.

software: In Absoft ProFortran, the intrinsic functions that utilise the vector process of AltiVec are limited to the following functions only.

• DOT_PRODUCT
• MATMUL
• SUM
• MAXLOC
• MAXVAL
• MINLOC
• MINVAL

Furthermore, AltiVec works only when these functions are applied to single-precision numbers. I had seen this information itself on the website of Absoft [5]. But I thought that, with the option -altiVec, such a simple sequential calculation of a matrix multiplication would be vectorized automatically. This is really not the case. The above limitation is strict. Nothing else, neither other functions nor procedures, works with AltiVec. The coverage of AltiVec is very restricted. This fact is described in [3]. In spite of reading it, I, and probably most people, could not realize its seriousness before suffering a painful experience. In addition, Absoft says that the libraries BLAS, LAPACK90, and IMSL are optimized for AltiVec. This must be true only for single-precision operations.
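To make the restriction concrete, here is a minimal sketch of the same single-precision product written three ways: as an explicit loop, through MATMUL, and element by element through DOT_PRODUCT. The module and function names are hypothetical. Only standard Fortran 90 intrinsics are used, so the sketch shows the source form that the restriction above refers to; whether a given compiler and option actually dispatch the intrinsic calls to AltiVec cannot be seen from the source and is not asserted here.

```fortran
module intrinsic_forms
  implicit none
contains

  ! Explicit triple loop, the form used in program list 1.
  ! According to the restriction above, this form is not vectorized.
  function product_by_loops(a, b) result(c)
    real, intent(in) :: a(:,:), b(:,:)
    real :: c(size(a,1), size(b,2))
    real :: cs
    integer :: i, j, k
    do j = 1, size(b,2)
      do i = 1, size(a,1)
        cs = 0.0
        do k = 1, size(a,2)
          cs = cs + a(i,k) * b(k,j)
        end do
        c(i,j) = cs
      end do
    end do
  end function product_by_loops

  ! Same product through the MATMUL intrinsic on single-precision
  ! (default real) arrays, one of the listed functions.
  function product_by_matmul(a, b) result(c)
    real, intent(in) :: a(:,:), b(:,:)
    real :: c(size(a,1), size(b,2))
    c = matmul(a, b)
  end function product_by_matmul

  ! One element of the product through DOT_PRODUCT,
  ! another of the listed functions.
  function element_by_dot_product(a, b, i, j) result(cij)
    real, intent(in) :: a(:,:), b(:,:)
    integer, intent(in) :: i, j
    real :: cij
    cij = dot_product(a(i,:), b(:,j))
  end function element_by_dot_product

end module intrinsic_forms
```

All three forms return the same numbers; the difference lies only in which of them the compiler is able to map onto the vector unit.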

hardware: There is also a restriction concerning hardware. What I understood was that not all G4 machines benefit from AltiVec. First, I noticed that PowerBookG4 does not (or cannot) utilize the advantage of AltiVec. In this case, there is little wonder, because of the issue of power consumption. But, to my surprise, even some desktop machines do not utilize the AltiVec facility. Actually, this is the case for an early G4 machine of 400 MHz. I do not know why, or from which point in the history of PowerMacG4 AltiVec can be activated.

Now that I know which machines possess the AltiVec facility, I re-examined the effects of AltiVec on those machines that actually possess it. For example, Table 1 shows the effect of AltiVec on the 733 MHz PowerMacG4.

Table 1: Effects of AltiVec. Numerical results are in MFLOPS units.

COMPILE                SINGLE PRECISION    DOUBLE PRECISION
use program-1 (sequential product)
  f90                        35.0                31.9
  f90 -O                    242.1               145.4
use MATMUL
  f90                       150.9               118.1
  f90 -O                    155.6               114.2
  f90 -O -altiVec         1,300.8               115.8

When only the option -O is used, execution of the intrinsic function MATMUL is slower than the sequential execution of the DO loop. When -altiVec was applied to single-precision arrays, I finally obtained a drastic improvement from the use of AltiVec. The speed exceeds 1 GFLOPS! This is what I was looking for. Of course, there is no effect on double-precision numbers.
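For orientation, the gain can be expressed as ratios of the single-precision entries of Table 1 (733 MHz PowerMacG4). The first ratio compares MATMUL with and without -altiVec, the second compares MATMUL with -altiVec against the optimized DO loop, and the third is the corresponding double-precision MATMUL ratio:

```latex
\frac{1300.8}{155.6} \approx 8.4, \qquad
\frac{1300.8}{242.1} \approx 5.4, \qquad
\frac{115.8}{114.2} \approx 1.01
```

The first value is of the same order as the improvement of about 10 times reported in Ref. [3], while the last confirms that nothing changes in double precision.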

3 FFT calculation

A comparison was also made for FFT, because Osaka2000 depends heavily on FFT routines. The FFT routine used is a three-dimensional, double-precision complex version.

The calculation time t required to perform an FFT calculation is related to the size of the array N as

      t ∝ N log N      (1)

In Fig. 4, t is plotted against the size N. Though t increases with N, the increase is not uniform. This is a feature of the FFT algorithm. Here, we examined two different routines; one is our own routine (created by Sasaki), the other is a routine in IMSL. The former is called program a, while the latter is called program b.
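One common reason for a non-uniform increase of this kind is that mixed-radix FFT routines run faster when N factorizes into small primes than when N has a large prime factor. Whether SNFFT3 or the IMSL routine behaves this way is an assumption, since the report does not say how they are implemented. Under that assumption, the following sketch (hypothetical module and function names) gives a quick way to see which sizes are favourable by computing the largest prime factor of N.

```fortran
module fft_size_sketch
  implicit none
contains

  ! Largest prime factor of n by trial division.
  ! Returns 1 for n = 1. Sizes with a small largest prime factor
  ! (for example 60 = 2*2*3*5 or 64 = 2**6) are the ones a
  ! mixed-radix FFT typically handles efficiently.
  pure function largest_prime_factor(n) result(p)
    integer, intent(in) :: n
    integer :: p
    integer :: m, d
    m = n
    p = 1
    d = 2
    do while (d * d <= m)
      if (mod(m, d) == 0) then
        p = d          ! d divides m: record it and divide it out
        m = m / d
      else
        d = d + 1
      end if
    end do
    ! whatever remains above 1 is a prime not smaller than any found so far
    if (m > 1) p = m
  end function largest_prime_factor

end module fft_size_sketch
```

For the size used in program list 2, largest_prime_factor(60) is 5, whereas a neighbouring size such as 61 is itself prime.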

program list 2

      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (LDA=60, MDA=60, NDA=60)
      PARAMETER (ND=NDA)
      PARAMETER (nloops=100)
      COMPLEX*16 A(LDA,MDA,NDA),B(LDA,MDA,NDA),X(LDA,MDA,NDA)
      DIMENSION ARE(LDA,MDA,NDA), AIE(LDA,MDA,NDA)
      DIMENSION WK1(LDA,MDA,NDA), WK2(LDA,MDA,NDA)
      DIMENSION WR(NDA),WI(NDA)
      DIMENSION tarray(2), timeRecord(10)
!
      do i=1,10
        timeRecord(i)=0.0d0
      end do
      N1=LDA
      N2=MDA
      N3=NDA
      write(6,*) 'FFT test (complex version)'
      write(6,102) N1,N2,N3
      write(6,104) nloops
  102 format(3X,'N1=',I3,5X,'N2=',I3,5X,'N3=',I3)
  104 format('nloops=',I5)
      do n=1, N1
        do m=1, N2
          do l=1, N3
            X(n,m,l)=n + 2*(m-1) + 2*3*(l-1)
            ARE(n,m,l)=DBLE(X(n,m,l))
            AIE(n,m,l)=DIMAG(X(n,m,l))
          end do
        end do
      end do
      fval=fnorm(X,LDA,MDA,NDA,n1,n2,n3)
      write(6,105) 0,fval
      time00 = ETIME_(tarray)
      do loop=1, nloops
! forward
        time0 = ETIME_(tarray)
        call SNFFT3(1,ARE,AIE,N1,N2,N3,WK1,WK2,WR,WI,ND)
        time1 = ETIME_(tarray)
        timeRecord(1)=timeRecord(1)+time1-time0
        fval=fnorm(A,LDA,MDA,NDA,n1,n2,n3)
! backward
        time2 = ETIME_(tarray)
        call SNFFT3(-1,ARE,AIE,N1,N2,N3,WK1,WK2,WR,WI,ND)
        time3 = ETIME_(tarray)
        timeRecord(2)=timeRecord(2)+time3-time2
        fval=fnorm(B,LDA,MDA,NDA,n1,n2,n3)
!       write(6,105) fval
      end do !loop
      time01 = ETIME_(tarray)
      timeRecord(3)=timeRecord(3)+time01-time00
      write(6,150) timeRecord(1), timeRecord(2)
      write(6,151) timeRecord(3)

   95 format(/, 13X, 'The unnormalized inverse is')
   96 format(/, 13X, 'The input for FFT3F is')
   97 format(/, 13X, 'The results from FFT3F are')
   98 format(/, ' Face no.', I1)
   99 format(1X, 4('(',F6.2,',',F6.2,')',3X) )
  105 format(I3,2X,'norm=',1PE12.4)
  150 format(/' Timing summary', /10X,'FORWARD :',F15.3,' (s)' &
             /10X,'BACKWARD:',F15.3,' (s)')
  151 format(10X, 'TOTAL :',F14.2,' (s)')
      STOP
      END

In this program list, the subroutine SNFFT3 is the FFT routine created by Sasaki.

In Fig. 5, the calculation times are compared among various machines. In program a, the difference in speed between G3 and G4 comes from the clock speed alone when optimization is not used. When optimization is applied, a difference of about a factor of two appears between G3 and G4. Even though AltiVec is not used, the relative advantage of G4 is evident.

When IMSL is used, the speed is not affected significantly by optimization, probably because IMSL already optimizes such routines very well. Rather than the effect of optimization, what should be noticed is the significant improvement obtained by using IMSL itself. In this case, the difference in speed again seems to come from the clock of the CPU, regardless of whether G3 or G4 is used.

(added on Sep. 13) Although it is almost evident that AltiVec has no effect in this case, it may be worth examining this on the 733 MHz PowerMacG4. See the result in Table 2. The command used is

      f90 -O -altiVec -o ftst ft31.f90 -limsl -limslblas -lU77

As expected, there is no effect of altiVec, because of double precision.

[Plot: CPU time [s] (0 to 1000) versus N (0 to 70)]

Figure 4: 3-dimensional double-precision complex number FFT calculation. The calculation time is plotted against the matrix size N. PowerMacG4 (400 MHz) is used.

[Plots (a) and (b): 1/t versus Clock [x100 MHz] (0 to 5) for G3 and G4 machines, normal and optimized compilation; vertical axis in units of 10^-3 s^-1 for (a) and 10^-2 s^-1 for (b)]

Figure 5: 3-dimensional double-precision complex number FFT calculation. The reciprocal of the calculation time is plotted against the clock of PowerMac. (a) use program a, (b) use program b (IMSL). The matrix size is 60. Both results with and without optimizing in compilation are shown. Note that the scales of abscissa of (a) and (b) are different.

Table 2: Effects of AltiVec on FFT. The numbers represent the calculation time in seconds.

program-a
  f90                    591.8
  f90 -O                 312.7
  f90 -O -altiVec        313.1
program-b (use IMSL)
  f90                    51.12
  f90 -O                 50.96
  f90 -O -altiVec        50.34

4 Summary

In this way, we have seen that the advertised "exceeding GFLOPS" performance of G4 is just a dream. At least in the Fortran world, the performance of G4 can be scaled from that of G3 simply by the clock frequency. There is still a huge difference from true supercomputers. Otherwise, vendors of supercomputers could not survive. I have drawn a negative conclusion with respect to the usefulness of AltiVec. But I think that, overall, the performance of PowerPC G4 is not bad. This is demonstrated by a real calculation with our Osaka2000 [6].

According to N. Strange [4] (see also [1]), routines utilizing AltiVec at the OS level are greatly enhanced in OSX 10.2. These routines are placed in

      /System/Library/Frameworks/vecLib.framework/Headers

According to the header files, not only double-precision but also quad-precision calculation is possible using AltiVec. Ref. [3] also describes ways to use AltiVec more widely by writing C program code. Although these methods may be useful for some people, this is not the way we want. A direct way for AltiVec to handle double-precision numbers is more desirable. I was disappointed to hear that Apple is not interested in doing so.

5 Effects of VAST

I will soon get version 8 with VAST preprocessor. A report of that will come soon...

References

[1] Craig A. Hunter, "", 2002.08.29 in [email protected]

[2] Craig A. Hunter, "IBM Power4", 2002.09.06 in [email protected]

[3] Craig A. Hunter, "An Evaluation of PowerMac G4 Systems for FORTRAN-based Scientific Computing with Application to Computational Fluid Dynamics Simulation" (2000) http://math.wm.edu/~cahunter/NASA G4 Study.pdf

[4] Nathan Strange, "altivec and double precision", 2002.08.27 in fortran- [email protected]

[5] http://www.absoft.com/altivec.html

[6] K. Shirai, CMP Technical Report No. 1, "Performance test of Osaka2002".
