Your Speedup Is a Megaflop!

KOLUMNE 22 CHI MIA 47 (1993) Nr. II~ (lnnuor/Fcbru",) COMPUTATIONAL CHEMISTRY Column Editors: Prof. Dr. J. Weber, University of Geneva COLUMN Prof. Dr. H. Huber, University of Basel Dr. H. P. Weber, Sandoz AG, Basel Clzilllia 47 (/993) 22-23 ond instruction on the first number at the © Neue Sclz\i'eizerisclze Clzelllische Gesellschaft same time as the first worker performs the /SSN 0009-4293 first instruction on the second number, etc .. For the first 20 steps no product is leaving your assembly line (no result is Your Speedup is a Megaflop! leaving your vector-processor) but after that you get a product (result) in each step, 'I just bought a new personal compu- Another common measure is the mil- i.e. after a total of 120 steps your loop is ter, the LAST, with a speed of 50 MHz, lions of instructions performed per sec- processed! You gain a factor of about 17, whereas your old MADOS has only a 25 ond (Mips). Again this is not a very useful i.e. if your scalar processor performs I MHz frequency'. 'You are joking, the measure for number crunching as differ- Mflops, your vector-processor yields 17 LAST makes only 2 Mflops, whereas mine ent processors need a different number of Mflops, which would roughly con'espond makes 25 Mips.' 'Never mind, your old instructions to perform a useful calcula- to the performance advertized by the com- scalar processors will not do it for long, I tion, e.g. a multiplication. For example puter manufacturer. However, any scien- just read about a supercomputer of the workstations usually have risc-processors tific program has instructions other than Teraflop generation with a speedup of (reduced instruction set computer) which loops which do not vectorize and, there- over 100!' in average need slightly more instructions fore, the efficiency is in practice much to perform floating point operations than lower! Depending on your problem and on Not far from reality - perhaps virtual the 'classical' main-frame processors. the specific architecture of the processor, reality. Such a talk shows how confusing the computing power might be quite dif- the mixing of abbreviations and miscon- After all, if we are interested in number ferent. Moreover, believe it or not, by ceptions in the field of new developments crunching, why do we not use the number making the programming run slower, you in computing is. As many applications in of floating point operations per second as can rise the Mflops rate. Just add to your chemical research are real numbercrunch- a measure? Indeed, the Mjlops (Megajloat- code a large loop, which vectorizes excel- ing problems, the measures of speed of a ing point operations per second) are quite lently, but does not produce any useful computer have generally a large impor- a good measure for most comparisons of numbers for your problem and you will tance, e.g. if one has to decide whether it scalar computers. The problems start with see that your program uses more computer pays to port a program to a certain compu- some of the new risc-architectures, as well time but shows (due to the additional well ter, whether one should buy a work-sta- as with the vector- and parallel-comput- vectorized part) a higher Mflop rate. This tion or rather try to get computer-time on ers! This is the point where we should might look strange to you, but such things a so-called supercomputer. These reasons perhaps say something about vector- and happen often when comparing programs prompted us to publish in previous Col- parallel-processing for readers who are based on different algorithms and contain- umns two comparisons of computer-per- not familiar with the subject. ing such portions of code which are un- formances with two different typical ap- known to the tester. A typical example is plications in the field of chemistry, quan- Vector-processing is perhaps best com- a scalar program where some tests (if) are tum chemical calculations [1] and simula- pared with the work done at an assembly present in a loop, which might tell the tions [2]. In this Column, we discuss some line. Let us assume you have in your computer to perform the loop only if a of the measures often used for scalar, program a loop with 20 instructions to be condition is satisfied. To make the loop vector, and parallel processors and show performed and the loop is repeated 100 vectorize, one has usually to eliminate the how carefully one has to deal with them. times with different data. If you have one test (if), which rises the Mflops rate due to worker (scalar processor) you give him vectorizing, but also makes the program One of the most basic comparisons is the first instruction which he applies to the performing unnecessary operations. performed by clock-rate. Each computer first number, than the second instruction is has a clock, which gives the working/re- applied to the same number (or its result) As vector-processors have probably quency of the computer. If we compare and so on until the last of the 20 instruc- already passed their cl imax of success and two processors of the same type, e.g. tions is performed with the first number. parallel- or massively parallel-processing Motorola 68030 found in many Macin- Then the whole process starts again with (MPP) are the key words for the near toshes, one with a frequency of 25 MHz the second number etc., i.e. you wait 100* future, we would like to say also a few and one with 40 MHz, then the second one 20 or 2000 steps until your loop is fin- words about the speedup measure used in is normally 1.6 times faster. This might ished. For simplicity let us assume now parallel computing. Parallel computing not be true for the whole computer as other you have instead of one worker an assem- means that your workers are not standing components may not be scaled by the bly line with 20 workers (vector-proces- at an assembly line, but do the same oper- same ratio! The clock-rate is also a bad sor of a vector-length twenty). At the first ation at the same time on different data measure for a comparison between differ- step the first number enters the assembly (single instruction multiple data; SIMD) ent processor-types as different proces- line and is handled by the first worker as in or do even different things with different sors might need a different number of a scalar processor. However, in the second data (multiple instructions multiple data clock-cycles per instruction. step the second worker performs the sec- (MIMD); we are not going into more de- 23 CHI MIA 4~ (1992) Nr. 1/2 (hnuarIF<bruar\ tails here). The 'speedup' is the factor you tion took about 400 s (this number is that of CSCS in Manno, enforcing that gain in speed if you use n processors different from the one in [3], which was only programs yielding 275 Mflops should instead of one to solve a problem. If you wrong, due to an input error [4]) and a run on the NEC, are questionable in view have for example 20 workers (processors) performance of 153] Mflops was achieved. of the difficulty to accurately estimate the working together, they will usually loose The speedup for 8 processors was 7.65. efficiency of application programs run- some time for communication, i.e. ex- Brode [5] has carried out a very similar ning on vector processors. An alternative change oftheir working pieces (data), and, calculation on the same molecule (356 policy would be to supply a supplementa- therefore, reach a 'speedup' of less than contracted/592 primitive basis functions) ry national cluster of workstations or a 20, let us assume only ]2. Such a low with the TURBOMOLE-program on a super parallel computer for codes per- number would result if they need to com- workstation cluster of 14 machines performing badly on a vector processor mak- municate significantly and are often block- forming to a maximum rate of 660 Mflops. ing the scientists choose the most efficient ing each others way. Again you could The speedup was only] ] .6. Although the facility for their purpose. 'cheat' by giving each worker (processor) loss in parallelization was higher and the additional work which is not needed, but Mflop-rate was much smaller (the formal We thank Dr. Stefan Brode. BASF Aktienge- which forces him to sit longer at his table rate for the 8 Cray processors would even sellschaft, Ludwigshafen/Rhein and Dr. Hans and do relatively less communication, be 2660 Mflops) the time for the first Peter Liithi, [nterdisziplinares Projektzentrum fUr Supercomputing, ETH, ZUrich for the ex- making the process slower, but the speed- iteration (taking usually most time) was ample. up higher! only 524 s, i.e. slightly more than with DISCO. Similar experiences have been Finally, we would like to give an ex- made by Vogel et.a/. [6] with a version of ample from the real world of quantum DISCO on a network of workstations. [I J Th. Bally. P.-A. Carrupt. 1. Weber, Chimia chemistry, where people are not 'cheat- 1991. 45, 352. ing' their processors, but nevertheless sim- Summarizing, we can state that the [2] R. Eggenberger. H. Huber, Chilllia 1992,46, ilar effects can be found. Luthi et a/. [3] Mflop- and the speedup-measure are often 227. reported results from a calculation on a not very useful criteria for the real world of [3] H.P.

Your Speedup Is a Megaflop!

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Amdahl's Law Threading, Openmp

Parallel Generation of Image Layers Constructed by Edge Detection

CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

High-Performance Message Passing Over Generic Ethernet Hardware with Open-MX Brice Goglin

An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications

Computer Systems Architecture

Computer Architecture: Parallel Processing Basics

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Vector Processors

SIMD the Good, the Bad and the Ugly

Efficient Data Compression Using CUDA Programming In