NCAR/TN-316+STR
NCAR TECHNICAL NOTE

August 1988

Experiences on a CRAY-2 Running UNICOS

M. Pernice

SCIENTIFIC COMPUTING DIVISION

NATIONAL CENTER FOR ATMOSPHERIC RESEARCH
BOULDER, COLORADO

Trademark Information

* CRAY X-MP, CRAY-2, COS, UNICOS, CFT, CFT77, and CFT2 are trademarks of Cray Research, Inc.

* UNIX and System V are trademarks of AT&T Bell Laboratories.

* Amdahl and UTS are registered trademarks of Amdahl Corporation.

* VAX and VMS are registered trademarks of Digital Equipment Corporation.

* IBM is a registered trademark of International Business Machines Corporation.

* Memorex is a registered trademark of Memorex Corporation.

* Macintosh is a trademark of Apple Computer, Inc.

Table of Contents

Preface ...... v

1. Performance of the NCAR Benchmark Suite ...... 1
   1.1 Description of the Suite ...... 1
   1.2 Description of the Hardware ...... 2
   1.3 Description of the Software ...... 2
   1.4 Timing Results ...... 3
      1.4.1 Results for Loop Kernels ...... 3
      1.4.2 Results for Subroutine Kernels ...... 7
      1.4.3 Results for a Climate Model ...... 9
   1.5 Experiences with the Compilers ...... 12
   1.6 Summary and Conclusions ...... 15

2. The NAS Computing Environment ...... 17
   2.1 Documentation ...... 17
   2.2 Front-ends ...... 17
   2.3 Mass Storage System ...... 17
   2.4 Mass Storage System Interface ...... 18
   2.5 Network Interfaces ...... 18
   2.6 UNICOS on the CRAY-2s ...... 19
   2.7 Conclusions ...... 21

Appendix 1 ...... 23
Appendix 2 ...... 27
Appendix 3 ...... 31

Preface

The CRAY X-MP/48 at the National Center for Atmospheric Research (NCAR) was installed and became operational in the fourth quarter of 1986. By the end of the second quarter of 1988, the X-MP was saturated. It is expected that NCAR will replace the X-MP with a next-generation supercomputer sometime in 1990-1992. Manufacturers of general-purpose supercomputers (notably Cray Research, Inc. and ETA Systems) are investing heavily in operating systems based on AT&T System V UNIX, so the replacement for the X-MP will likely have a UNIX-based operating system.

The users of NCAR's computing facilities have been using the batch-oriented Cray Operating System (COS) since the installation of NCAR's first CRAY-1 in 1977. Concerns have been expressed about the impact that a transition to UNIX will have on our user community. Further questions have been raised about the impact of UNIX on supercomputer performance, particularly on the performance of applications that require substantial amounts of input and output (I/O) operations. To address these concerns, the Performance Analysis Project, which is part of the Computational Support Section of the Scientific Computing Division, obtained access to the computing systems maintained by the Numerical Aerodynamics Simulation (NAS) Program at NASA's Ames Research Center at Moffett Field, California. These systems include a pair of CRAY-2 mainframes that run the UNICOS operating system. By running a set of benchmark codes, the performance of these mainframes and of the uniform UNIX computing environment at NASA-Ames was evaluated. This Technical Note presents the results of the experience gained on these systems.

The Scientific Computing Division gratefully acknowledges NAS for providing access to its computing facilities and for the support that was provided during the course of these investigations.

1. Performance of the NCAR Benchmark Suite

1.1 Description of the Suite

The NCAR Benchmark Suite consists of the following 8 programs:

BNMK01: tests basic arithmetic capabilities. The following vector operations are timed for values of the vector length ranging from 5 to 1000 (one such kernel is sketched in FORTRAN following this list):

V <- V + V
V <- V * V
V <- V / V
V <- V * (V + V)
S <- S + V*V (dot product)
V <- S*V + V*V
V <- S*V + V*V + S

BNMK02: timing and accuracy test of the intrinsic functions cos, acos, and single- and double-precision exp and log.

BNMK03: timing and accuracy test of real forward- and inverse-FFT routines from FFTPACK.

BNMK04: timing and accuracy test of separable elliptic equation solver SEPELI from FISHPAK.

BNMK05: timing and accuracy test of linear equation solver SGEFA-SGESL from LINPACK.

BNMK06: timing and accuracy test of eigenvalue routines TRED1 and TQLRAT from EISPACK.

shaln: timing and accuracy test of a shallow water equation model on an nxn doubly-periodic grid. The values n=64 and n=256 are used.

CCM: version CCM0B of the NCAR Community Climate Model.
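As a concrete illustration of the shape of the BNMK01 kernels, here is a minimal FORTRAN sketch of the V <- S*V + V*V operation; the routine and array names are hypothetical, not taken from the benchmark source.

C     Sketch of one BNMK01-style loop kernel: V1 <- S*V2 + V2*V3.
C     N is the vector length being timed (5 to 1000).
      SUBROUTINE KERN6(N, S, V1, V2, V3)
      INTEGER N, I
      REAL S, V1(N), V2(N), V3(N)
      DO 10 I = 1, N
         V1(I) = S*V2(I) + V2(I)*V3(I)
   10 CONTINUE
      RETURN
      END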

1.2 Description of the Hardware

There are two CRAY-2s managed by the Numerical Aerodynamics Simulation (NAS) Project at NASA-Ames, named navier and stokes. Both are 4-CPU CRAY-2s, each having a 256 million word Common Memory, 16,384 words of Local Memory per processor, and a 4.1 nanosecond (nsec) cycle time. Each CRAY-2 has a single Foreground Processor that coordinates these components. The distinguishing difference between the two is that stokes is a dynamic memory CRAY-2, with a memory cycle time of 120 nsec, while navier is a static memory model with a shorter memory cycle time of 80 nsec. A bank conflict on navier causes a maximum memory delay of 45 cycles (184.5 nsec), while a bank conflict on stokes causes a maximum memory delay of 57 cycles (233.7 nsec). Each CRAY-2 is equipped with DD-49 disk drives.

Some interesting early history of NAS's experiences with a CRAY-2 can be found in the article Early Experiences with the NAS CRAY-2 by John T. Barton, which appears in the Spring 1986 issue of the Cray User Group (CUG) Proceedings.

1.3 Description of the Software

Both navier and stokes ran UNICOS version 3.0 for most of the test period. On June 20, 1988, the operating system on navier was upgraded to a pre-release version of UNICOS 4.0. On July 19, stokes was also upgraded to UNICOS 4.0. There are two FORTRAN compilers available on navier and stokes: version 4.0b of cft2 and version 3.0 of cft77. cft2 was upgraded from version 3.1b with the UNICOS upgrade; cft77 was upgraded from version 2.0 in early May. These software changes had little impact on the performance of the benchmark suite, except for the double precision functions: the measured times for dexp and dlog under UNICOS 4.0 were about a third of the measured times for these functions under UNICOS 3.0, while the computed error roughly doubled. All measurements on the CRAY-2s that appear in this report were made under UNICOS 4.0.

1.4 Timing Results

For each program in the benchmark suite, timing runs were made under three different conditions. Data labeled 'Fully Loaded' were obtained when running under a typical daytime load. Data labeled 'Late Night' were obtained from runs made between 1 a.m. and 2 a.m., to determine whether performance was sensitive to the interactive load. Data labeled 'Dedicated' were obtained when no other program was executing. Except where noted, all of the execution times were obtained by calls to the SECOND function, which reports the CPU time that has elapsed since the start of the job.
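As an illustration of the timing method (not the actual benchmark source), a measurement with SECOND looks roughly like this. KERN6 is the hypothetical kernel sketched in Section 1.1, and NREP is a repetition count chosen to make the interval measurable; both are assumptions for this sketch.

C     Minimal timing harness using the Cray SECOND function, which
C     returns the CPU seconds consumed since the start of the job.
      PROGRAM TIMER
      INTEGER N, NREP, I, J
      PARAMETER (N = 1000, NREP = 100)
      REAL S, V1(N), V2(N), V3(N), T0, T1, RATE, SECOND
      S = 0.5
      DO 10 I = 1, N
         V2(I) = 1.0
         V3(I) = 2.0
   10 CONTINUE
      T0 = SECOND()
      DO 20 J = 1, NREP
         CALL KERN6(N, S, V1, V2, V3)
   20 CONTINUE
      T1 = SECOND()
C     KERN6 performs 3 floating point operations per element.
      RATE = (3.0*REAL(N)*REAL(NREP)) / ((T1 - T0)*1.0E6)
      PRINT *, 'Rate (Mflops): ', RATE
      END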

1.4.1 Results for Loop Kernels

Figures 1-7 compare the performance of BNMK01 on navier and on the CRAY X-MP at NCAR, which has a cycle time of 8.5 nsec. The X-MP timings are represented by the shaded bars, and were obtained using CFT 1.16 on a dedicated system running COS 1.16. These figures show that an X-MP easily outperforms a CRAY-2 on these kernels, despite having a clock period more than twice as long. This is consistent with other reports on the computational speed of a CRAY-2.

[Figures 1-7: bar charts of Mflops versus vector length (5, 10, 20, 100, 500, 1000), comparing navier with the X-MP (shaded bars).]

Figure 1: Performance of V+V, navier vs. X-MP
Figure 2: Performance of V*V, navier vs. X-MP
Figure 3: Performance of V/V, navier vs. X-MP
Figure 4: Performance of V*(V+V), navier vs. X-MP
Figure 5: Performance of dot product, navier vs. X-MP
Figure 6: Performance of S*V + V*V, navier vs. X-MP
Figure 7: Performance of S*V + V*V + S, navier vs. X-MP

Figures 8-14 compare the performance of BNMK01 on stokes and on the CRAY X-MP at NCAR and appear in Appendix 1.

The data represented by Figures 1-7 and Figures 8-14 can each be reduced to a single number that can be interpreted as a performance ratio between the CRAY-2 and the CRAY X-MP on the set of loop kernels in BNMK01. Several approaches are possible. The method used here takes the weighted harmonic mean of all of the rates reported for a single operation, resulting in a mean execution rate for each loop kernel in BNMK01; the weights used are the fractions of the total floating-point operations performed in each loop kernel at each vector length tested. A mean execution rate for all the kernels in BNMK01 is then derived by taking an equiweighted harmonic mean of the mean execution rates of the individual loop kernels. The values obtained by doing this are presented in Table 1.

Table 1: Mean Performance of Operations in BNMK01

Mainframe    Mean Rate (Mflops)
navier       55.2
stokes       52.5
X-MP         63.9

Comparison of these rates indicates that a static memory CRAY-2 runs about 14% slower than an 8.5 nsec X-MP on operations represented by those measured in BNMK01. Likewise, a dynamic memory CRAY-2 runs about 18% slower than an X-MP. It must be kept in mind that these figures are based only on the operations in BNMK01; their validity for operations outside the scope of those tested in BNMK01 is questionable.
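The per-kernel reduction just described can be sketched as follows; this is an illustrative reconstruction under stated assumptions, not the actual analysis code, and all names are hypothetical. For one kernel measured at M vector lengths, the weighted harmonic mean rate is simply total floating-point operations divided by total time.

C     Weighted harmonic mean of execution rates RATE(1..M), weighted
C     by floating point operation counts FLOP(1..M).  Algebraically
C     this equals total flops divided by total time.
      REAL FUNCTION HMEAN(M, RATE, FLOP)
      INTEGER M, I
      REAL RATE(M), FLOP(M), TFLOP, TTIME
      TFLOP = 0.0
      TTIME = 0.0
      DO 10 I = 1, M
         TFLOP = TFLOP + FLOP(I)
C        Time spent at this vector length = flops / rate.
         TTIME = TTIME + FLOP(I)/RATE(I)
   10 CONTINUE
      HMEAN = TFLOP/TTIME
      RETURN
      END

The overall figure for BNMK01 is then the equiweighted harmonic mean of the per-kernel means H(1), ..., H(K), that is, K/(1/H(1) + ... + 1/H(K)).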

Figures 15-21 in Appendix 2 compare the performance of navier on BNMK01 under fully loaded and late night conditions. These figures show that performance is not significantly impacted by the interactive load. Analogous figures for stokes are quite similar and are not included.

1.4.2 Results for Subroutine Benchmarks

Table 2 shows the execution times of BNMK02-BNMK06 and the shallow water model for n=64 and n=256. All data are reported in seconds except where noted otherwise, and the X-MP data were obtained using CFT 1.16 on a dedicated system running COS 1.16. Note that the intrinsic functions run an average of 26% faster on both navier and stokes than on the X-MP. However, the CRAY-2s do not perform as well as the X-MP on the subroutine-level tests, except for the shallow water model: navier (respectively, stokes) is 19% (15%) faster than the X-MP for n=64, and 14% (13%) faster for n=256. As with the figures in Appendix 2, performance on both CRAY-2s does not seem to be very sensitive to the time of day. Finally, the errors in the computed values, evaluated internally by each of these benchmark programs, were comparable to the errors evaluated on the X-MP.

Table 2: Performance of BNMK02-BNMK06 and the Shallow Water Equations

             X-MP           navier
                            Fully Loaded   Late Night     Dedicated

BNMK02 cos   2.66 x 10^-7   1.78 x 10^-7   1.77 x 10^-7   1.73 x 10^-7
       acos  2.45 x 10^-6   1.86 x 10^-7   1.84 x 10^-7   1.80 x 10^-7
       exp   1.51 x 10^-7   1.13 x 10^-7   1.12 x 10^-7   1.08 x 10^-7
       alog  2.01 x 10^-7   1.62 x 10^-7   1.62 x 10^-7   1.57 x 10^-7
       dexp  1.40 x 10^-5   1.27 x 10^-5   1.28 x 10^-5   1.17 x 10^-5
       dlog  1.51 x 10^-5   1.22 x 10^-5   1.17 x 10^-5   1.10 x 10^-5
BNMK03       4.31 x 10^-4   5.52 x 10^-4   5.50 x 10^-4   5.27 x 10^-4
BNMK04       2.76 x 10^-1   2.89 x 10^-1   2.94 x 10^-1   2.77 x 10^-1
BNMK05       1.50 x 10^-2   2.16 x 10^-2   2.24 x 10^-2   2.04 x 10^-2
BNMK06       2.08 x 10^-2   3.09 x 10^-2   3.23 x 10^-2   2.84 x 10^-2
shal64       133†           133†           126†           158†
shal256      146†           130†           129†           167†

†These figures are in Mflops.

Table 2 (continued)

             X-MP           stokes
                            Fully Loaded   Late Night     Dedicated

BNMK02 cos   2.66 x 10^-7   1.81 x 10^-7   1.78 x 10^-7   1.74 x 10^-7
       acos  2.45 x 10^-6   1.86 x 10^-7   1.85 x 10^-7   1.81 x 10^-7
       exp   1.51 x 10^-7   1.14 x 10^-7   1.14 x 10^-7   1.08 x 10^-7
       alog  2.01 x 10^-7   1.64 x 10^-7   1.63 x 10^-7   1.57 x 10^-7
       dexp  1.40 x 10^-5   1.32 x 10^-5   1.33 x 10^-5   1.18 x 10^-5
       dlog  1.51 x 10^-5   1.32 x 10^-5   1.31 x 10^-5   1.10 x 10^-5
BNMK03       4.31 x 10^-4   6.03 x 10^-4   5.74 x 10^-4   5.52 x 10^-4
BNMK04       2.76 x 10^-1   3.18 x 10^-1   3.12 x 10^-1   2.89 x 10^-1
BNMK05       1.50 x 10^-2   2.51 x 10^-2   2.40 x 10^-2   2.17 x 10^-2
BNMK06       2.08 x 10^-2   3.50 x 10^-2   3.17 x 10^-2   2.95 x 10^-2
shal64       133†           110†           112†           153†
shal256      146†           107†           118†           165†

†These figures are in Mflops.

1.4.3 Results for a Climate Model

Table 3 shows the execution time of the CCM in CPU seconds. Timings for a 5 day and a 1 day integration are shown. The X-MP times were obtained using CFT 1.15 on a dedicated system running COS 1.15. Two different I/O strategies are used: BUFFER IN/OUT performs asynchronous unformatted I/O, so that computations in the CCM are overlapped with I/O operations, while the strategy using binary READ/WRITEs is synchronous and uses implied DO-loops. Both strategies are sketched below.
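In outline, the two strategies look like this. The unit number and names are hypothetical; BUFFER IN/OUT and the UNIT status function are Cray FORTRAN extensions, so this sketch is CFT-dialect rather than standard FORTRAN 77.

      SUBROUTINE OUTBUF(N, A)
      INTEGER N
      REAL A(N), STAT, UNIT
C     Strategy 1: asynchronous unformatted output.  Control returns
C     as soon as the transfer is scheduled, so computation can
C     proceed while the transfer completes.
      BUFFER OUT (10, 0) (A(1), A(N))
C     ... computation overlapped with the transfer goes here ...
      STAT = UNIT(10)
C     UNIT waits for the operation on unit 10 to finish and returns
C     its completion status.
      RETURN
      END

      SUBROUTINE OUTBIN(N, A)
      INTEGER N, I
      REAL A(N)
C     Strategy 2: synchronous binary WRITE with an implied DO-loop.
      WRITE (10) (A(I), I = 1, N)
      RETURN
      END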

Table 3: Performance of CCM (CPU seconds)

                              X-MP    navier
                                      Fully Loaded   Late Night   Dedicated

1 day, BUFFER IN/OUT           71     116            115          110
1 day, binary READ/WRITE       78.5   133            133          125
5 days, BUFFER IN/OUT         249     361            362          340
5 days, binary READ/WRITE     288     442            442          414

                              X-MP    stokes
                                      Fully Loaded   Late Night   Dedicated

1 day, BUFFER IN/OUT           71     129            129          116
1 day, binary READ/WRITE       78.5   149            148          132
5 days, BUFFER IN/OUT         249     410            403          356
5 days, binary READ/WRITE     288     503            498          436

The CCM required 37-68% more CPU time to execute on a dedicated CRAY-2 than on a dedicated X-MP. The best relative performance occurs for the 5 day integration, BUFFER IN/OUT case on navier; the worst relative behavior occurs for the 1 day integration, binary READ/WRITE case on stokes.

Table 4 shows the wall-clock time of execution in seconds for the CCM on a dedicated system. Wall-clock time is more representative of the performance of applications that involve substantial I/O operations, since the CPU clock is suspended while the application is waiting for I/O to complete. The X-MP times were also obtained using CFT 1.15 on a dedicated system running COS 1.15, and the CRAY-2 times were obtained with the UNICOS time command.

Table 4: Performance of CCM (wall-clock seconds)

                              X-MP    navier   stokes

1 day, BUFFER IN/OUT           83     114      120
1 day, binary READ/WRITE      100     126      134
5 days, BUFFER IN/OUT         289     357      374
5 days, binary READ/WRITE     387     419      440

The CCM required 8-45% more wall-clock time to execute on a dedicated CRAY-2 than on a dedicated X-MP. The best relative performance here occurs for the 5 day integration, binary READ/WRITE case on navier; the worst relative behavior occurs for the 1 day integration, BUFFER IN/OUT case on stokes. When using wall-clock time as a measure, the performance of the CCM on either a static or dynamic memory CRAY-2 compares more favorably with its performance on the X-MP at NCAR.

The difference in CPU times of the CCM on the CRAY-2s and the X-MP needs to be addressed. Conventional wisdom seems to be that we should expect code to run only 30-40% slower on the CRAY-2, primarily because of the higher memory latency, the lack of chaining, and a lower degree of automatic vectorization provided by cft2 coupled with the fact that scalar instructions on a CRAY-2 require 2 clock periods. In fact, cft2 failed to vectorize 36 loops that are vectorized by the CFT 1.15 compiler. These loops either were conditionally vectorized, contained references to in-line functions, or involved branching within the loop. This behavior reflects the level of optimization provided by the CFT compiler at the time that cft2 was derived from it. On the other hand, cft2 reported that it vectorized 42 implied DO-loops that were contained in I/O statements which CFT 1.15 did not vectorize. These differences in vectorization could account for some of the discrepancy.
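As a hypothetical illustration (not a loop taken from the CCM), a branch inside the loop body is the sort of construct involved here:

      SUBROUTINE COND(N, A, B)
      INTEGER N, I
      REAL A(N), B(N)
C     Branching within the loop body: conditionally vectorized by
C     CFT 1.15 but left scalar by cft2 (illustrative example only).
      DO 30 I = 1, N
         IF (A(I) .GT. 0.0) THEN
            B(I) = SQRT(A(I))
         ELSE
            B(I) = 0.0
         ENDIF
   30 CONTINUE
      RETURN
      END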

I used FLOWTRACE under daytime workloads to further investigate the reason for this slowdown. It revealed four subroutines that ran 2 to 3 times slower than they did on the X-MP. Three of these routines perform most of the CCM's I/O. This relative behavior was essentially independent of the I/O strategy used. On navier, the time spent in these three subroutines for the 1 day integration accounted for 89% of the run-time difference when binary READ/WRITEs were used, and 88% of the run-time difference when BUFFER IN/OUTs were used. On stokes, the respective percentages are 85% and 86%. These percentages can vary by as much as 5 percentage points, depending on which set of measurements is being compared, because of the variation in execution time from run to run. Consequently, these figures should be considered estimates. Nonetheless, the data suggest that I/O operations on a CRAY-2 running UNICOS are more costly (in CPU time) than on an X-MP running COS. Furthermore, since the relative performance of the 1 day integrations is worse than the relative performance of the 5 day integrations on both machines using either I/O strategy, the performance difference is probably dominated by the I/O needed to start up the model, and a longer-term integration might show even better relative performance.

1.5 Experiences with the Compilers

The smaller benchmarks compiled with cft77 and ran without any problems. Some problems were encountered compiling the CCM: cft77 produced several warnings and errors. Most of the warnings were due to mixed-mode divisions, while the errors were the result of incorrect use of a vector merge function. These were corrected by substituting another vector merge function.
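For reference, the vector merge functions are Cray intrinsics such as CVMGT(X, Y, L), which returns X where the logical expression L is true and Y otherwise. A sketch of typical correct usage follows, with hypothetical names; the incorrect calls that appeared in the CCM are not reproduced here.

      SUBROUTINE MERGE(N, A, B, C)
      INTEGER N, I
      REAL A(N), B(N), C(N)
C     The vector merge replaces a branch in the loop body, so the
C     loop vectorizes (illustrative only).
      DO 40 I = 1, N
         C(I) = CVMGT(A(I), B(I), A(I) .GT. 0.0)
   40 CONTINUE
      RETURN
      END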

Under UNICOS the user is required to explicitly manage files from the FORTRAN source using OPEN and CLOSE statements, and these calls had to be added to the CCM (an example appears below). One early problem resulted from falsely assuming that all I/O was formatted. The problem manifested itself with the run-time error "_uwrcw: libf error: _cnt negative". (Most of the other error messages produced by UNICOS were more informative.) The source of this problem was eventually uncovered using the post-mortem dump analyzer debug (which is similar to the COS utility DEBUG) and the interactive debugger drd (which provides functionality similar to SID, available at NCAR through ICJOB and VMSTAI). I was unable to set breakpoints at line or statement numbers in the CCM with drd, apparently because of a problem with the symbol tables generated by the compiler, but I was able to set breakpoints at the P addresses of the subroutine calls provided in the traceback.
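Each unit the CCM touches needed statements along these lines; the unit number and file name here are hypothetical, not taken from the CCM source.

C     Explicit file management required under UNICOS; under COS the
C     connection of units to datasets is handled outside the program.
      OPEN (UNIT=10, FILE='ccmhist', FORM='UNFORMATTED',
     &      STATUS='UNKNOWN')
C     ... reads and writes on unit 10 ...
      CLOSE (UNIT=10)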

The next run-time error I encountered was an arithmetic library error: the CCM attempted to evaluate a**b with an invalid combination of arguments.

The source of this error was the subroutine CHKBF, which performs a UNIT check on an asynchronous I/O operation. In the version of the CCM that uses binary READ/WRITEs, CHKBF only executes a RETURN statement. At first the only work-around I could find was to comment out the call to this subroutine.

I later learned the reason for the 'failed' UNIT check with some help from a NAS consultant. The actual argument to CHKBF is neither declared nor initialized in the calling subroutine. Under stack-based allocation (the default at NAS), the returned value from CHKBF is obtained through a pointer to the location in memory which is allocated for the dummy argument on entry to CHKBF. Since the portable version of CHKBF never specifies a value for the dummy argument, the result of this is unpredictable. Apparently it was merely fortuitous for the UNIT check to 'succeed' with optimization enabled. This was an interesting example of a program failing in a stack-based environment, especially because of the misleading symptoms. I was able to solve this problem simply by initializing the troublesome actual argument prior to the first call to CHKBF.
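A minimal sketch of this failure mode, with hypothetical names: the point is that the dummy argument is never assigned, so under stack allocation the caller reads whatever word happens to occupy the argument's stack location.

C     Caller (sketch): ISTAT is neither declared nor initialized, and
C     the portable CHKBF below never assigns its dummy argument, so
C     the value of ISTAT after the call is unpredictable.  The fix
C     was simply to assign ISTAT a value before the first call.
      CALL CHKBF(ISTAT)
      IF (ISTAT .NE. 0) WRITE (*,*) 'UNIT check failed'

      SUBROUTINE CHKBF(IST)
      INTEGER IST
C     Portable version: executes only a RETURN; IST is never assigned.
      RETURN
      END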

With this modification, I was able to run a cft77-generated executable with all optimization disabled. The previously mentioned arithmetic library error did not occur in this configuration, so I was unable to determine why it occurred when optimization was enabled. I decided to use cft2 to generate an executable for the tests.

The cft2 compiler presented some minor problems. It produced an OPERAND RANGE ERROR when I first used it to compile the CCM. The NAS consultants quickly found the reason for this: cft2 was unable to deal with several character constants that appeared in FORMAT statements and were continued over more than one record. This was easily fixed, but cft2 was also unable to deal with character arrays that appeared in other parts of the code. Some of these deficiencies in the compiler were well documented. Many of the character array references were easily fixed, but two subroutines in the CCM would have required substantial recoding before cft2 could be used. Consequently, the executable I used to generate the data in Tables 3 and 4 was a combination of cft2- and cft77-generated code. This executable ran to completion, but it produced results that agreed (at first) to only 2 digits with results generated on the X-MP. The above-mentioned 'failed' UNIT check also occurred with cft2-generated code when optimization was disabled.

In order to determine whether the change in vector merge functions was responsible for the difference in computed results, I compiled the code with the original calls. This is possible because cft2 does not check arguments to these functions. However, the computed results were even worse in this case, and caused the model to print warning messages about 'UNSTABLE MEAN TEMPERATURES'. Finally, a version of the executable compiled with all optimization disabled, together with the above-mentioned work-around for the 'failed' UNIT check, produced computed results identical to those obtained with all optimization enabled.

Discovering why the computed results on the CRAY-2s differed from X-MP-generated results was a serendipitous event. After the upgrade to UNICOS 4.0 on navier, I tested the batch submission file in preparation for obtaining dedicated timings. The smaller benchmarks presented no problems, but the CCM ran only a short time before quitting with a floating point error. A NAS consultant found that the model would run if a SAVE statement was inserted in the subroutine where the error occurred. Apparently the CCM relied on at least one local variable in this subroutine retaining its value across subroutine calls. I compiled and ran the code with a SAVE statement in every subroutine (which can be enabled with a compiler option), and the CCM ran to completion and produced results that were identical to those produced on the X-MP, within printed accuracy. This also happened on stokes under UNICOS 3.0. From this I concluded that there were multiple instances of the CCM relying on local variables being retained across subroutine calls, and that this was the cause of the apparently incorrect results originally obtained on the CRAY-2s.
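A sketch of the coding style involved, with hypothetical names: under static allocation on COS, FIRST silently keeps its value between calls, but under stack allocation it is undefined on re-entry unless a SAVE statement appears (or unless the compiler option that implies SAVE in every subroutine is used).

      SUBROUTINE STEP
      LOGICAL FIRST
C     Without the SAVE statement, FIRST is undefined on the second
C     and later calls in a stack-based environment.
      SAVE FIRST
      DATA FIRST /.TRUE./
      IF (FIRST) THEN
C        One-time initialization on the first call; SETUP is a
C        hypothetical routine.
         CALL SETUP
         FIRST = .FALSE.
      ENDIF
      RETURN
      END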

Both compilers provide a 'fast integer' mode: cft77 allows 46- and 64-bit integers, while cft2 allows 32- and 64-bit integers. I used 64-bit integer arithmetic throughout, partly because the CCM uses Hollerith strings for some of its printed output, and partly because I was concerned about consistency between the integer types in the mixed binary I was using.

1.6 Summary and Conclusions

The small benchmarks presented no problems in either compiling or running. The timing results indicate that the arithmetic operations tested in BNMK01 run 14-18% slower on either a static or dynamic memory CRAY-2 than on the CRAY X-MP at NCAR; the intrinsic functions tested are 16-35% faster; the subroutine-level tests are 2-60% slower; and the shallow water model is 13-19% faster, with the greater improvements on navier and stokes occurring for the coarser resolution.

The CCM was substantially more difficult to work with and a lot of effort was required to devise an executable that would run correctly under UNICOS. The version that did run initially produced results that were inconsistent with results that were generated on an X-MP. It also required substantially more CPU time. Some of this slowdown is attributable to differences in automatic vectorization, but most of it appears to be the result of differences in I/O processing. However, the wall-clock times of execution of the CCM on navier and stokes are much closer to the times observed on an X-MP.

The problems that I encountered with the CCM were code-dependent, not compiler-dependent. The CCM consists of over 10,000 statements and is not standard-conforming FORTRAN. There are instances of 'pseudo-aliasing' and places where arrays of length 1 are EQUIVALENCEd to COMMON blocks. The coding style that caused the misleading UNIT check can lead to unpredictable results and may be present in other portions of the code. There were programming errors in the use of the vector merge functions. There are instances where local variables are assumed to retain their values across subroutine calls. These are subtle errors whose symptoms are very misleading and whose expression is highly dependent on the software environment (recall that the assumed SAVEs had different manifestations under different versions of the operating system). By contrast, all of the smaller benchmark programs ran without any modification and produced computed errors consistent with earlier results from an X-MP. To the extent that the CCM is representative of the coding style in applications run at NCAR, some of our users are likely to encounter similar difficulties making the transition to UNICOS.

By comparing the timings obtained on a fully loaded and a late night system, it can be seen that the interactive load has little effect on the performance of this benchmark suite. This is likely due to the separation in hardware between the foreground and background processors. Such performance is a highly desirable aspect of interactive supercomputing.

The principal feature of the CRAY-2 is its large Common Memory, which eliminates the need for performing large amounts of I/O. Most of the differences in execution time of the CCM noted above are attributable to I/O operations, and it is likely that the performance of a memory-contained version of the CCM on a CRAY-2 would compare more favorably with the X-MP version. However, management felt that devising a memory-contained version of the CCM would require too much time, and so this was not attempted.

2. The NAS Computing Environment

NAS has attempted to provide its users with a distributed computing environment having a uniform user interface. This section describes this environment and relates some of my impressions using it.

2.1 Documentation

The NAS User Guide provided an excellent overview of the NAS computing environment; I was able to become productive after only a couple of hours of browsing it. Beyond that, I relied on the UNICOS User Command and Reference Manual, the CRAY-2 FORTRAN Reference Manual, and the Symbolic Debugging Reference Manual, all published by Cray Research, Inc. On-line documentation was available via the man command, but I used it sparingly, preferring to browse the printed documents.

2.2 Front-ends

Two front-ends were available for use: prandtl, an Amdahl 5880 running UTS 580 1.1.3 (based on AT&T System V UNIX), and amelia, a VAX 11/780 running 4.3 BSD UNIX. I used prandtl exclusively for my work.

2.3 Mass Storage System

The Mass Storage System (MSS) at NASA-Ames consists of an Amdahl 5880 controller, an IBM 3480 tape cartridge system, a Memorex 3228 tape drive and Amdahl 6380 disk drives. It is organized in a hierarchical fashion similar to the MSS at NCAR. Files on the MSS migrate from disk to cartridges after a certain amount of time (about 30 days), and must be explicitly staged back to the disk drives by the user before they can be accessed. This can sometimes be a time-consuming task.

2.4 Mass Storage System Interface

MSS files are accessed using the mssget and mssput commands. Files are restored to disk from the cartridges using the mssstage command. Directories on the MSS are created using the mssmkdir command. Protections are changed using the msschmod command. Information concerning files and directories on the MSS is obtained using the msslst command, which has many options similar to those of the UNIX ls command, and whose output is quite similar to the output of ls. These commands can be issued from any mainframe that you are logged into, and msslst will return output via electronic mail if it is run in the background. Wild card characters can also be used with any of these commands, although you must remember to use an escape sequence when you want the wild card character to be expanded on the MSS rather than on the local host. The similarity of the directory structures on the UNIX machines and the MSS, and the similarity of the MSS commands to UNIX commands that provide similar functionality, were both nice features.

2.5 Network Interfaces

Access to NASA-Ames was gained using telnet; the machines were all networked together so that you could rlogin from any machine to any other. (The sole exception is that you can't rlogin from one CRAY-2 to another.) I later learned that I could simply rlogin to prandtl from either scdpyr or bierstadt. One continuing aggravation was the presence of unprinted control-S and control-Q characters in the input stream, which frequently garbled commands typed in at the terminal. For the most part terminal response was good and the connection was robust, although there were times when I was involuntarily disconnected by the remote host.

Files were transferred between NCAR and NASA-Ames using ftp, which was also available for moving files among the machines at NASA-Ames. Typical transfer rates ranged from 2-8 Kbytes/sec, and rates of 60-70 Kbytes/sec were occasionally reported by ftp. The version of ftp at NASA-Ames had mput and mget commands, which allowed several files to be moved in one invocation by using wild-card characters. rcp is also available for moving files among different machines at NASA-Ames and was much more convenient to use, since it does not require a login each time it is invoked.

2.6 UNICOS on the CRAY-2s

A CRAY-2 running UNICOS appears identical to an Amdahl running UTS (or a Pyramid running 4.2 BSD UNIX, with small differences in commands). However, I used the CRAY-2s primarily to compile and run programs and never did any editing on them. While vi was available on the CRAY-2s, it never appeared to work properly when I was remotely logged in via prandtl. (There are reports that vi would work fine when directly logged in to navier from scdpyr using telnet, but I never checked this out.)

Typical terminal sessions consisted of logging into prandtl from scdpyr, editing some files there, copying the modified files to a CRAY-2 via rcp or ftp, remote login to a CRAY-2, compilation, linking, execution, transferring output back to scdpyr (via ftp) for printing, and iterating on this procedure. A convenient method of rapid file transfer among the various machines is essential for this strategy to be viable. This cycle was typical of the process of developing the command file that I used for batch submission. It would have been a lot more convenient to have a command like less available in UNICOS, rather than the more and pg commands currently available. I was able to partially compensate by using the terminal emulator on my Macintosh to scroll backwards, and when I needed only a page or so of output I could do a screen dump. In a sense this was my first taste of a fully interactive distributed computing environment. It was made much easier by the fact that the user interface looked the same no matter where you were logged in.

A modest amount of directory space on the CRAY-2s is assigned to each project for storing source and binary files, and each system has two scratch disks for temporary storage of larger files. Files on the scratch disks are scrubbed 72 hours from the time of last access. These disks occasionally lost data; I had to restore the files to them a couple times, thereby underscoring the importance of keeping backup copies on the MSS.

UNICOS does not distinguish between 'permanent' and 'local' files; either a file exists or it doesn't. This uniformity eliminates misconceptions caused by the distinction made in COS, but under UNICOS the user is required to manage files in the FORTRAN source code. In contrast, file management is done implicitly under COS. While a UNICOS version of acquire is available, I worked exclusively from the scratch disks, staging data to them when needed from the MSS using mssget. The acquire command does not appear to be callable from FORTRAN; in fact, I am not aware of any facility for executing a UNICOS command from FORTRAN. File management may be the area of greatest difficulty in making the transition from COS to UNICOS.

UNICOS includes a batch submission facility called the Network Queue System (NQS). A batch job under UNICOS consists of a UNIX command file (or shell script) that is a list of UNICOS commands. This command file can be submitted as a UNICOS batch job request from a front-end or from the CRAY-2 with the qsub command. Once a procedure for running a program is established, it is easy to develop an NQS command file for batch submission by simply putting the necessary sequence of commands into a file. A copy of the NQS command file that I used to obtain the dedicated machine timings is included in this report as Appendix 3. NQS has some interesting features that are worth describing at some length.

One feature of NQS is the sequence of information that is contained in the command file. Commands are processed in the order in which they appear. Since multiple-file datasets do not exist in UNICOS, any source or data files that might have appeared in subsequent files of $IN must either be in other files (which must already exist and must be explicitly read by using redirection symbols or some other means) or must be embedded in the command file itself. This latter strategy seems strange at first, but you can easily get used to it, as the input to the command appears directly after it and not in some later part of the file. This relieves the programmer who is writing a batch submission file from having to keep track of where $IN is positioned. Several examples of NQS script files show source code or data embedded directly in the command file, but this is unnecessary so long as adequate reference to the required files is provided, using either changes in directory or full pathnames. The strategy of using source and data files external to the command file is probably more flexible as well. The example command file in Appendix 3 illustrates both strategies.

When running in interactive mode, the devices referred to in UNIX as 'standard output' (stdout) and 'standard error' (stderr) both default to the user's terminal. Under NQS, these devices are distinct files. Consequently, messages produced by the compilers, such as warnings, errors, or timing information, are directed to a different file than program output. This can be confusing, especially when something goes wrong, as you find yourself looking back and forth trying to establish a correspondence between the messages in stderr and the program output in stdout. Fortunately, NQS provides an option that allows you to make stderr and stdout the same file. NQS also provides options that echo executed commands to stdout and timestamp them, so that the output file contains the commands executed, the time they were executed, and the results of the commands in the order that they actually occurred. Unfortunately, the timestamping feature did not seem to work after the upgrade to UNICOS 4.0. These options are set in the user's .cshrc file, and analogous options exist for users of the Bourne shell.

NQS also contains some traps that might be difficult for a novice UNIX user to debug. When the batch request is executed, it spawns another instance of the user's shell, and problems may arise depending on the contents of the user's shell startup files. The ones that I ran into resulted from a 'set noclobber' in my .cshrc file, which protects existing files from being overwritten. One batch job I ran obtained FLOWTRACE output for two different I/O strategies, but the FLOWTRACE outputs were identical: the flow.data file that resulted from the first run did not get overwritten by the file generated in the second run. I tried to solve this by using rm to remove unnecessary files after each job step, but again was stymied by an alias for rm that was set up in my .cshrc file. This was solved by redefining the alias.

2.7 Conclusions

The uniform command interface on all the systems I used is very nice and relieves the programmer of the need to remember context-dependent command features, syntax, and options. To a large extent, NAS has succeeded in achieving its stated goal of providing a single, consistent user interface. Such a uniform interface may be difficult to achieve in our heterogeneous environment even if a UNIX supercomputer is installed, especially if the divisions that now support non-UNIX front-ends continue to do so. If we go to a central mainframe with a UNIX-based operating system, users will have to learn UNIX anyway, so it may be possible to encourage these divisions to move to UNIX. This is not to advocate a uniform UNIX interface in particular; the point is that, whether the global operating system is UNIX, VMS, VM/SP, or something else entirely, the impact of a uniform interface on productivity should not be underestimated. This uniformity also simplifies the nature of the support services that must be provided. At the user level, UNICOS seems to capture all of the functionality of UNIX while still retaining some of the familiarity of COS.

Overall, I would say that someone who knew only UNIX, or only COS, would have some trouble getting used to UNICOS. However, knowing something about both operating systems should enable most users to quickly become productive.

Appendix 1: Performance of BNMK01: stokes vs. X-MP

The figures in this appendix compare the performance of BNMK01 on stokes and on NCAR's X-MP. As with Figures 1-7, the X-MP timings are represented by the shaded bars.

[Figures 8-14: bar charts of Mflops versus vector length (5, 10, 20, 100, 500, 1000), comparing stokes with the X-MP (shaded bars).]

Figure 8: Performance of V+V, stokes vs. X-MP
Figure 9: Performance of V*V, stokes vs. X-MP
Figure 10: Performance of V/V, stokes vs. X-MP
Figure 11: Performance of V*(V+V), stokes vs. X-MP
Figure 12: Performance of dot product, stokes vs. X-MP
Figure 13: Performance of S*V + V*V, stokes vs. X-MP
Figure 14: Performance of S*V + V*V + S, stokes vs. X-MP

Appendix 2: BNMK01 Performance on navier: daytime vs. nighttime workloads

Figures 15-21 compare the performance of BNMK01 on navier during normal daytime and nighttime workloads. The shaded bars represent performance under nighttime workloads.

[Figures 15-21: bar charts of Mflops versus vector length (5, 10, 20, 100, 500, 1000) on navier; shaded bars represent nighttime workloads.]

Figure 15: Performance of V+V on navier, Daytime vs. Nighttime
Figure 16: Performance of V*V on navier, Daytime vs. Nighttime
Figure 17: Performance of V/V on navier, Daytime vs. Nighttime
Figure 18: Performance of V*(V+V) on navier, Daytime vs. Nighttime
Figure 19: Performance of dot product on navier, Daytime vs. Nighttime
Figure 20: Performance of S*V + V*V on navier, Daytime vs. Nighttime
Figure 21: Performance of S*V + V*V + S on navier, Daytime vs. Nighttime

Appendix 3: Example NQS Command File

This appendix contains a copy of the NQS command file that was used to obtain the timings in Tables 3 and 4. (The embedded model input that followed each redirection in the original file is not reproduced here.)

#
# NQS script file for batch submission of ccm for timing purposes.
#
# Author: M. Pernice
# Date:   5/31/88
#
# Set embedded NQS options.
#
#$ -mb                # send me mail when the request starts
#$ -me                # send me mail when the request is completed
#$ -eo                # make stderr and stdout the same file
#$ -lT 1800           # set a total time limit of 1800 seconds
#
# Change directory to where the source files are located.
cd ccm/source
#
# Create executables.
# (64-bit integers are needed to correctly handle Hollerith strings.)
cat ccm1.f ccm2.f ccm3.f ccmfft.f ccmrad2.f ccmutil.f > ccmbase.f
time cft -eS -dL -i64 ccmbase.f   # Result: ccmbase.o, basic collection of
                                  # ccm routines that won't be varied.
time cft -eS -dL -i64 ccm4bio.f   # Result: ccm4bio.o, ccm routines including
                                  # READR, WRITER with buffered I/O.
time cft -eS -dL -i64 ccm4brw.f   # Result: ccm4brw.o, ccm routines including
                                  # READR, WRITER with binary I/O involving
                                  # implied DO-loops.
time cft77 -i64 ccmrad1.f         # Result: ccmrad1.o, ccm routines that involve
                                  # handling of character arrays, which cft
                                  # can't handle.
#
# Link object files to create an executable that will run the
# ccm using buffered I/O.
time segldr -o ccm ccmbase.o ccm4bio.o ccmrad1.o
#
# Case 1: 5 day integration, buffered I/O.
time ccm < ...
#
# Clean up.
rm fort* flow*
#
# Case 2: 1 day integration, buffered I/O.
time ccm < ...
#
# Relink to create an executable that uses binary READ/WRITEs.
time segldr -o ccm ccmbase.o ccm4brw.o ccmrad1.o
#
# Case 3: 5 day integration, binary I/O.
time ccm < ...
#
# Clean up.
rm fort* flow*
#
# Case 4: 1 day integration, binary I/O.
time ccm < ...
#
# Clean up.
rm fort* flow* ccm ccmbase.o ccm4brw.o ccm4bio.o ccmrad1.o ccmbase.f