Does AltiVec of PowerPC G4 Work Effectively with FORTRAN?

CMP Technical Report No. 7

Koun SHIRAI
The Institute of Scientific and Industrial Research, Osaka University
10 Sep. 2002

Department of Computational Nanomaterials Design
Nanoscience and Nanotechnology Center
ISIR, Osaka University

Contents
1 Introduction
2 matrix multiplication
2.1 effects of AltiVec
3 FFT calculation
4 Summary
5 Effects of VAST

Note that this English version is only a partial translation of the original Japanese text, because of the author's limited time and limited linguistic ability.

1 Introduction

It is often said (mainly in advertising articles) that PowerMacs reach supercomputer-level performance by using G4 chips as their main processors. In a general sense, I agree that G4 Macs are much faster than G3 Macs at the same clock frequency. However, my interest is scientific applications, in particular Fortran programs. Apple, and also many authors in Mac magazines, emphasize the difference between the G4 and its predecessors by focusing on the AltiVec unit of the G4. They say that a speed of more than 10 gigaflops is achieved. Can you believe it? In our university, the highest-performance supercomputer is the NEC SX-5, whose nominal speed is said to be 10 GFLOPS per CPU!

As I read such articles repeatedly, my expectations grew. On the other hand, so did my frustration, because these articles discuss only the potential of AltiVec, and few report its actual utility in real applications. In scientific applications, the potential of AltiVec is still unclear.

I have long been a user of Absoft Fortran. When using version 6 with a G3, I was almost satisfied with its performance for handy computation. With the advent of version 7 (recently 8) for the G4, I am very tempted to use PowerMacs daily for heavy computations.
From the advertisement for version 7 (or 8) of Absoft ProFortran, in which the potential of AltiVec is emphasized as usual, I almost believed that a PowerMacG4 could work like a supercomputer. But on reading more carefully, I came to doubt the advantage of the G4 in the Fortran world. I think this feeling is shared by many others. Looking into the mailing forum "fortran-dev", you can see that questions about how far Absoft Fortran and VAST cover the AltiVec facilities are asked repeatedly.

In my understanding, the vector processing of AltiVec is valid only for single-precision operations, and this is due not to the way Absoft Fortran is implemented but to an intrinsic limitation of the AltiVec architecture. Many scientific applications need double precision, and for them the G4 offers no such advantage. Further bad news is that Apple is unlikely to adopt a processor with a double-precision version of AltiVec in the near future. I am not familiar with hardware, and I do not understand why a double-precision version is so difficult, although some reasons are given in "fortran-dev" [1, 2].

In this report, I examine the reality of the high performance of G4 machines in the Fortran world, particularly whether or not the AltiVec facility is actually usable from Fortran. I would like to stress that I intend to compare only with the G3, not with other CPUs such as the Pentium, because the main interest is the effect of AltiVec within the PowerPC family. Of course, I know that AltiVec is not the only difference between the G3 and the G4. The G4 has further advantages, such as its superscalar arithmetic units, data size, and large cache memory. Because I can observe only the final results, what I compare is the difference in total performance.

In writing this report, I benefited greatly from an article [3]. The machines used in this report are as follows.
              machine             clock   cache                RAM    HD     note
                                  (MHz)   (KB)                 (MB)   (GB)
  G3 machine  PowerMac6100 (66)   215     256                  72            with G3 accelerator
              iMac                450     512                  192
  G4 machine  PowerMacG4          400     1000                 512    20
              PowerBookG4         667     256                  512    30
              PowerMacG4          733     (L2) 256, (L3) 1M    1250   60
              PowerMacG4          867     (L2) 256, (L3) 2M    1250   60

All the machines except the PowerMac6100 are used on either OS9 or X. The versions of the Absoft Fortran compiler are as follows.

  f77: ProFortran v. 6.2 on OS9.2
  f90: ProFortran v. 7.0 on OSX 1.5

[Figure 1: log-log plot of the calculation time t [s] (axis 10^-2 to 10^5) against the matrix size N (axis 10^1 to 10^4).]
Figure 1: The size of the matrix N versus the calculation time t. The gradient of the line is about 3.

2 matrix multiplication

We examine the calculation speed of various G3 and G4 machines using a matrix multiplication. The test program is taken from Ref. [3], and its source code is listed below. The author of this code is Hunter at NASA, who frequently contributes useful writing on scientific computing.

Program list 1

      implicit real (a-h, o-z)
      parameter(n=200, nloops=200)
      dimension tarray(2), a(n,n), b(n,n), c(n,n)
      real mflops
      pi=4.*atan(1.)
      ! fill a and b with test values
      do i=1, n
        do j=1, n
          a(i,j)=real(j)*real(i)/pi/1024.**2.
          b(i,j)=real(j)/sqrt(pi)*real(i)/pi/1024.
        end do
      end do
      ! first call to DTIME_ starts the timer
      time=DTIME_(tarray)
      ! repeat the n x n matrix product nloops times
      do l=1, nloops
        do i=1, n
          do j=1, n
            cs=0.
            do k=1, n
              cs=cs+a(i,k)*b(k,j)
            end do
            c(i,j)=cs
          end do
        end do
      end do
      ! second call returns the time elapsed since the first call
      time=DTIME_(tarray)
      ! 2*n**3 floating-point operations per matrix product
      mflops=(real(nloops)*2.*real(n**3)/1.0e6)/time
      write(*,*) 'time:', time
      write(*,'(A,F10.2)') ' MFLOPS:', mflops
      end

Before testing performance, I checked how the computation time t depends on the matrix size N. Because multiplying two matrices of size N takes on the order of N^3 arithmetic operations, t is expected to be proportional to N^3. This relationship is shown in Fig. 1.
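
The operation count behind this expectation, and behind the mflops expression in the program, can be written out explicitly. The following is only a sketch of the arithmetic: it counts one multiplication and one addition per pass of the innermost loop, as the program's own formula does, and the symbol n_loops stands for the program's nloops.

```latex
\begin{align*}
  \text{operations per product}
    &= N^2 \times (N \text{ multiplications} + N \text{ additions}) = 2N^3, \\
  t \propto N^3
    &\;\Longrightarrow\; \log t = 3 \log N + \text{const}, \\
  \text{MFLOPS}
    &= \frac{n_{\text{loops}} \cdot 2N^3}{10^6 \, t}
     = \frac{200 \cdot 2 \cdot 200^3}{10^6 \, t}
     = \frac{3200}{t\,[\mathrm{s}]} .
\end{align*}
```

The second line is why the line in Fig. 1 has a gradient of about 3 on logarithmic axes. As a purely hypothetical illustration of the third line (not a measured value), an elapsed time of 10 s for N = 200 and 200 loops would correspond to 320 MFLOPS.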
Although, as stated in the Introduction, comparison with CPUs other than the G3 is not my intention, I show the speed of the G3 and G4 family together with that of other CPUs, because this makes clear the current status of the PowerPC family in the mainstream of workstations. The results of the benchmark tests are shown in Fig. 2. Compilation was done with almost the highest optimization level of each compiler. Of course, the SX5 ranks by far at the top. Its speed is recorded as 9.2 GFLOPS, which is consistent with the nominal value. The speed of the PowerMacG4 is about half that of the other workstations but, considering the prices, it can be concluded that these machines have high cost-performance. However, the speed is lower than the advertised value (several GFLOPS) by an order of magnitude. Even though exaggeration is a common method in advertisements, it is rare for the difference to be enlarged to such an extent.

Now we fix the matrix size at 200 and examine the calculation time on the various machines. In Fig. 3, results are shown with and without optimization, and for single and double precision. From this figure we can see that, without optimization, the speed is proportional to the CPU clock speed in both single- and double-precision calculations, regardless of whether the CPU is a G3 or a G4. This means that the advantage of AltiVec is not used at all unless optimization options are added.

When optimization is activated, a significant difference appears between the G3 and the G4. Interestingly, however, the difference is more prominent in double-precision calculations than in single-precision ones. As stated before, the vector processing does not work for double-precision calculations. Therefore, the difference must originate from something other than the vector-processing unit.

[Figure 2: MFLOPS (axis 0 to 600) for the following machines, labeled as in the original: iBook 466, iMac 450, PB G4 667, PM G4 400, PM G4 867, PM G4 733, SGI Origin2000 195, IBM Power3 200, IBM Power3 450, SX5 9275.]
Figure 2: Benchmark test for a matrix multiplication of double precision.
[Figure 3: two panels, (a) and (b), plotting MFLOPS (axis 0 to 400) against CPU clock [MHz] (axis 0 to 1000) for G3 and G4 machines, each with "normal" and "optimize" series; the iMac, iBook, and PowerBook points are labeled.]
Figure 3: Matrix multiplication: (a) single precision, (b) double precision. Results obtained with and without optimization in the compilation are both shown.

I do not know what the optimization actually does. The speed is about four times that of normal compilation, even though the presence of AltiVec does not contribute. This is probably the main advantage we can normally obtain by introducing a G4 with Absoft ProFortran.

2.1 effects of AltiVec

Although I do not know what contributes to the increase in speed, it is true that G4 machines are superior to G3 machines overall. However, I could not obtain the significant improvement from activating AltiVec that is shown in Ref. [3], where a speedup of about 10 times is reported from using AltiVec.

Let me describe the calculation process more concretely. On the G4 machines (the 400 MHz machine or the PowerBookG4), the source code was compiled by

      f90 -O -altiVec

But even in the single-precision case, no improvement in speed was obtained. What was wrong with this? In source code 1, the sequential calculation of the matrix product was replaced by an intrinsic function of Fortran 90,

      c = matmul(a, b)

and was compiled with the option -altiVec.
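
For concreteness, the modified program might look like the following minimal sketch. It assumes that everything in program list 1 other than the triple loop is kept. Two things in it are assumptions and not the author's code: the program name matmul_test is hypothetical, and the standard intrinsic cpu_time is used in place of DTIME_ so that the sketch does not depend on a vendor routine.

```fortran
program matmul_test
  ! Hypothetical variant of program list 1: the triple loop is
  ! replaced by the Fortran 90 intrinsic matmul.
  implicit none
  integer, parameter :: n = 200, nloops = 200
  real :: a(n,n), b(n,n), c(n,n)
  real :: pi, t0, t1, time, mflops
  integer :: i, j, l

  ! Same test values as in program list 1.
  pi = 4.*atan(1.)
  do j = 1, n
    do i = 1, n
      a(i,j) = real(j)*real(i)/pi/1024.**2
      b(i,j) = real(j)/sqrt(pi)*real(i)/pi/1024.
    end do
  end do

  ! Assumption: standard cpu_time instead of the DTIME_ routine.
  call cpu_time(t0)
  do l = 1, nloops
    c = matmul(a, b)
  end do
  call cpu_time(t1)
  time = t1 - t0

  write(*,*) 'time:', time
  if (time > 0.) then
    ! 2*n**3 floating-point operations per matrix product
    mflops = (real(nloops)*2.*real(n)**3/1.0e6)/time
    write(*,'(A,F10.2)') ' MFLOPS:', mflops
  else
    write(*,*) 'elapsed time below timer resolution'
  end if
  ! Print one element so the compiler cannot discard the product.
  write(*,*) 'c(n,n):', c(n,n)
end program matmul_test
```

Writing the product as a single matmul call hands the whole operation to the compiler's library, so whether a vectorized routine is used depends on the compiler and its options and not on the loop structure of the source.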
