Numerical Applications and Sub-Word Parallelism: the NAS Benchmarks on a Pentium 4
Daniel Etiemble
University of Toronto
10 King's College Road
Toronto, Ontario, Canada, M5S 3G4
PN: 416 946 65784 Fax: 416 971 22 86
[email protected]

Abstract: We examine the impact of the Pentium 4 SIMD instructions on the Fortran and C versions of the NAS benchmarks, either through compiler vectorization or through assembly code in-lining. While few functions generally profit from the SIMD operations, those using complex numbers or random number generators can be efficiently accelerated.

Key words: SIMD extensions, numerical applications

1. INTRODUCTION

With clusters of PCs or multiprocessor PCs, low-cost parallel machines are now available. Most programming efforts concern parallelization according to the two available models (shared memory inside nodes and message passing between nodes). But to exploit every type of available parallelism, it is also worthwhile to investigate the impact on numerical applications of the SIMD parallelism that now exists in most modern microprocessors.

The SIMD extensions to general-purpose microprocessor instruction sets were introduced in 1995 to improve the performance of multimedia and DSP applications. The first extensions were able to process 64-bit integer vectors: Sun VIS [SUN95], HP MAX [LEE95], MIPS MDMX [KIL96] and Intel MMX [INT97]. The introduction of SIMD instructions that process single-precision floating point vectors is more recent: AMD 3DNow! [CAS98] and Intel SSE [INT99]. Motorola introduced Altivec in 1999 [APP99]. With the Pentium 4, Intel introduced SSE2 [INT-AP, INT-IR], which enables double-precision floating point SIMD computation. While multimedia and DSP applications remain the major applications that can profit from SIMD extensions, the double-precision extensions can also be used for numerical applications. In the rest of the paper, we use "double" as an abbreviation for double-precision floating point number. The Pentium 4 XMM registers are 128 bits long and can pack two doubles. The SSE2 instructions provide a large set of floating point SIMD operations, either on 4 packed floats or on 2 packed doubles. Equivalent scalar operations are provided to access or compute the lowest float or the lowest double of an XMM register. As Pentium 4 microprocessors are suitable for building low-cost clusters of PCs, it is worthwhile to examine the impact of the SSE2 instructions on the performance of numerical benchmarks and applications.
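To make the packed-double model concrete, here is a minimal sketch (ours, not taken from the paper) of a daxpy-style loop written with the SSE2 compiler intrinsics declared in emmintrin.h, which map directly onto the packed instructions described above. The function name and the alignment and even-length assumptions are ours.

```c
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d and _mm_*_pd */

/* y[i] += a * x[i], two doubles per iteration.
   Assumes x and y are 16-byte aligned and n is even. */
void daxpy_sse2(int n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);             /* replicate a in both halves */
    int i;
    for (i = 0; i < n; i += 2) {
        __m128d vx = _mm_load_pd(&x[i]);     /* load 2 packed doubles */
        __m128d vy = _mm_load_pd(&y[i]);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx)); /* 2 multiplies, 2 adds */
        _mm_store_pd(&y[i], vy);             /* store 2 packed doubles */
    }
}
```

The scalar variants of the same instructions (_mm_mul_sd, _mm_add_sd, and so on) operate only on the lowest double of an XMM register; these are the "equivalent scalar operations" mentioned above.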
Vectorization is the terminology used by Intel for the compiler's use of SIMD instructions. Intel has published application notes for trivial examples such as the saxpy and daxpy functions [INT-AP]. Bik et al. provide a high-level overview of the parallelization and vectorization methods used in the Intel C++/FORTRAN compilers and give some performance results for dot products, LU factorization and Linpack [BIK01a]. The same authors present further results on automatic vectorization for the Pentium 4 processor in [BIK01b]; these concern dot products, saxpy and daxpy, and preliminary results on SPEC CPU2000. In this paper, we consider the NAS benchmarks [NAS-P], which are kernels or pseudo-applications more representative of numerical applications. They are widely used for testing parallel machines.

To evaluate the potential performance improvement of SSE2 numerical codes, we first measure the impact of compiler vectorization on the execution times of each benchmark, by enabling and disabling the vectorizer. Further optimization requires the use of assembly code for the functions that are not vectorized. This leads to a significant methodology issue. Most numerical codes are FORTRAN codes, and there is no simple way to insert assembly code within FORTRAN code. The only possibilities are to call C functions containing inline assembly code from the FORTRAN code, or to link an assembly object file with the FORTRAN object code. Neither solution can be used for our purpose: the overhead of function calls would preclude meaningful measurements in the first case, and assembly code must be used at loop level inside one function, not at a module or file level, in the second case. To evaluate the impact of assembly-coded optimizations, the only solution is to use a C version of the different benchmarks. This is why we have used the NAS benchmarks, for which both versions are available, rather than SPEC2000.

Vectorization issues are not exactly the same for C and FORTRAN programs. When using C or C++, the compiler has to worry about aliasing: it cannot optimize a loop if it cannot ascertain whether the array references in the loop overlap. This problem does not exist for FORTRAN code. To check whether aliasing issues significantly change the performance of the NAS benchmarks when using the C versions, we have measured the execution times of the two versions for several optimization levels.
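The aliasing issue can be illustrated with a small sketch (ours, not the paper's). With plain pointers the C compiler must assume that dst and src may overlap and may conservatively keep the loop scalar; the C99 restrict qualifier (or a compiler-specific no-alias option) asserts that they do not, giving the compiler the same freedom it has with FORTRAN arrays. The function names are hypothetical.

```c
/* The compiler cannot prove that dst and src do not overlap,
   so it may refuse to vectorize this loop. */
void scale_may_alias(int n, double a, double *dst, const double *src)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = a * src[i];
}

/* 'restrict' promises no overlap, so the loop becomes a
   vectorization candidate, as such loops always are in FORTRAN. */
void scale_no_alias(int n, double a,
                    double * restrict dst, const double * restrict src)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = a * src[i];
}
```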
We have also compared the number of vectorized loops for each version of each benchmark and evaluated the impact of the vectorized loops on the overall execution times. Even though the number of vectorized loops differs between the C and FORTRAN versions, the vectorization performed by the C and FORTRAN compilers has no significant impact on the overall performance of either version. This implies that the vectorized loops are not the time-consuming loops. We will give more details on this disappointing result, but it allows us to claim that the assembly-coded optimization results that we present for the C version of the NAS benchmarks would be roughly the same for the FORTRAN version.

After this introduction, the second section presents the methodology that we use. The third section presents the execution times of the C and FORTRAN versions of the NAS benchmarks and gives detailed results on the compiler vectorization and its efficiency. The fourth section details the most time-consuming functions of each NAS benchmark. The fifth section presents the assembly-coded optimizations and the corresponding performance improvements. Section 6 presents related work. After the conclusion, the appendix gives details on some semantic issues that arise when optimizing random number generator functions.

2. METHODOLOGY

2.1 Measures

All the results correspond to measurements on a Dell PC. The CPU was a 1.4 GHz Pentium 4 with 512 MB of RDRAM, running under Windows 2000 Professional. We have used the Microsoft Visual C++ environment with version 5.0 of the Intel C/C++ compiler, version 5.0 of the Intel FORTRAN compiler, and the Intel VTune profiler [INT-VT]. For each compiler, we have used the "maximize speed" (basically O2+QxW) and "high level optimization" (O3+QxW) settings. The QxW option generates code specialized for the Pentium 4. Each tested program has been measured at least 5 times, and we have taken the averaged value. All the measurements have been done with only one running application (Visual C++). The results are given either as execution times (sec) or as speed-ups, defined as old execution time / new execution time.

2.2 The NAS benchmarks

There is a large spectrum of high-end numerical applications, and choosing a benchmark is always debatable. We have used the NAS benchmarks because they are among the widely used and acknowledged benchmarks for which both FORTRAN and C versions are available. The FORTRAN version is the NPB2.3 serial version [NAS-P]. We have used the C OpenMP version of the NAS benchmarks [NAS-C] and removed all OpenMP directives to get a serial version. Although this version is probably not the best possible C serial code, it can be considered a good reference for comparisons. We used the FP kernels (CG, EP, FT and MG) and the FP pseudo-applications (BT, LU and SP) with class A.

3. INITIAL PERFORMANCE OF THE FORTRAN AND C VERSIONS

Table 1 gives the execution times of the different benchmarks according to three different criteria: language (C versus FORTRAN), optimization level (O3 versus O2) and vectorization (with vectorization referred to as V, without vectorization referred to as NV). All the other options are identical: the compiler generates "Exclusively Streaming SSE2 extensions" (QxW option).

                    BT      CG     EP     FT     LU     MG     SP
FORTRAN-O2-QxW-V    586.0   6.65   147.0  34.41  388.6  11.19  501.4
FORTRAN-O2-QaxW-V   595.5   6.50   148.1  33.83  394.1  11.57  506.8
FORTRAN-O2-QxW-NV   592.5   6.55                 387.8  12.31  494.9
FORTRAN-O3-QxW-V    UNS     6.63   147.3  34.72  368.1  11.32  UNS
FORTRAN-O3-QxW-NV   586.8   6.54                 369.9  12.30  UNS
C-O2-QxW-V          618.7   7.24   179.4  30.64  410.0  11.62  481.7
C-O2-QaxW-V         619.8   7.00   154.5  29.66  410.5  11.61  478.0
C-O2-QxW-NV                 6.60                 409.8  11.63  489.4
C-O2-QaxW-NV                6.47                 410.2  11.61  486.0
C-O3-QxW-V          643.0   7.28   179.3  30.50  412.4  10.11  481.9
C-O3-QxW-NV                 6.60                 412.9  11.62  489.5

Table 1: Execution times (sec) of the different benchmarks according to language, compiler, and vectorization options. O2 and O3 are the standard compiler options, QxW and QaxW are the Pentium 4 specific compiler options, and V and NV stand for the vectorized and non-vectorized options. UNS corresponds to "unsuccessful" runs, and blank entries correspond to benchmarks without any vectorized loop, for which the V and NV execution times are identical.

Table 2 gives the number of vectorized loops when the vectorizer is on, in order to evaluate the efficiency of the compiler vectorization. Figure 1 compares the execution times of the best FORTRAN and C versions.

BT CG EP FT LU MG SP