Parallel Architectures
MICHAEL J. FLYNN AND KEVIN W. RUDD
Stanford University <[email protected]>; <[email protected]>

PARALLEL ARCHITECTURES

Parallel or concurrent operation has many different forms within a system. Using a model based on the different streams used in the computation, we represent some of the different kinds of parallelism available. A stream is a sequence of objects such as data, or of actions such as instructions. Each stream is independent of all other streams, and each element of a stream can consist of one or more objects or actions. We thus have four combinations that describe most familiar parallel architectures:

(1) SISD: single instruction, single data stream. This is the traditional uniprocessor [Figure 1(a)].
(2) SIMD: single instruction, multiple data stream. This includes vector processors as well as massively parallel processors [Figure 1(b)].
(3) MISD: multiple instruction, single data stream. These are typically systolic arrays [Figure 1(c)].
(4) MIMD: multiple instruction, multiple data stream. This includes traditional multiprocessors as well as the newer networks of workstations [Figure 1(d)].

Each of these combinations characterizes a class of architectures and a corresponding type of parallelism.

SISD

The SISD class of processor architecture is the most familiar class and has the least obvious concurrency of any of the models, yet a good deal of concurrency can be present. Pipelining is a straightforward approach that is based on concurrently performing different phases of processing an instruction. This does not achieve concurrency of execution (with multiple actions being taken on objects) but does achieve a concurrency of processing, an improvement in efficiency upon which almost all processors depend today.

Techniques that exploit concurrency of execution, often called instruction-level parallelism (ILP), are also common. Two architectures that exploit ILP are superscalar and VLIW (very long instruction word). These techniques schedule different operations to execute concurrently based on analyzing the dependencies between the operations within the instruction stream: dynamically at run time in a superscalar processor and statically at compile time in a VLIW processor. Both ILP approaches trade off adaptability against complexity: the superscalar processor is adaptable but complex, whereas the VLIW processor is not adaptable but simple. Both superscalar and VLIW use the same techniques to achieve high performance.

The current trend for SISD processors is towards superscalar designs in order to exploit available ILP as well as existing object code. In the marketplace there are few VLIW designs, due to code compatibility issues, although advances in compiler technology may cause this to change. However, research in all aspects of ILP is fundamental to the development of improved architectures in all classes because of the frequent use of SISD architectures as the processor elements in most implementations.
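
The dependency analysis that superscalar and VLIW processors rely on can be illustrated with a small sketch. The Python fragment below is only an illustration under stated assumptions, not the algorithm of any particular processor or compiler: it greedily packs operations into issue groups so that no operation shares a group with, or precedes, an operation it depends on. The names Op, depends_on, schedule, and EXAMPLE, the register-style operand names, and the issue width are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """One operation: a destination name and the names it reads (hypothetical model)."""
    dest: str
    srcs: Tuple[str, ...]


def depends_on(later: Op, earlier: Op) -> bool:
    """Return True if `later` must be issued in a later group than `earlier`."""
    raw = earlier.dest in later.srcs   # read after write
    waw = earlier.dest == later.dest   # write after write
    war = later.dest in earlier.srcs   # write after read (treated conservatively)
    return raw or waw or war


def schedule(ops: List[Op], width: int) -> List[List[int]]:
    """Greedily pack op indices into issue groups of at most `width` operations."""
    group_of: List[int] = []      # group index chosen for each op so far
    groups: List[List[int]] = []  # op indices in each issue group
    for i, op in enumerate(ops):
        # The op may not issue before the group after its latest dependency.
        earliest = 0
        for j in range(i):
            if depends_on(op, ops[j]):
                earliest = max(earliest, group_of[j] + 1)
        # Skip groups that are already full.
        g = earliest
        while g < len(groups) and len(groups[g]) >= width:
            g += 1
        while len(groups) <= g:
            groups.append([])
        groups[g].append(i)
        group_of.append(g)
    return groups


# Five operations: c needs a and b, d needs c, e is independent of the rest.
EXAMPLE = [
    Op("a", ("x", "y")),
    Op("b", ("x", "z")),
    Op("c", ("a", "b")),
    Op("d", ("c", "y")),
    Op("e", ("x", "z")),
]

if __name__ == "__main__":
    print(schedule(EXAMPLE, width=2))  # [[0, 1], [2, 4], [3]]
```

With an issue width of 2 the five operations fit in three groups instead of five sequential steps. A VLIW compiler would perform an analysis of this kind once, at compile time, whereas a superscalar processor would perform the equivalent analysis in hardware on each run.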

Copyright © 1996, CRC Press.

ACM Computing Surveys, Vol. 28, No. 1, March 1996 68 • Michael J. Flynn and Kevin W. Rudd

Figure 1. The stream model.
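
As a rough illustration of the stream model, the four classes can be expressed as a lookup on whether a machine has one or several instruction streams and one or several data streams. This is a minimal sketch; the names FlynnClass and classify are hypothetical and chosen only for this example.

```python
from enum import Enum


class FlynnClass(Enum):
    """The four stream combinations named in the text."""
    SISD = "single instruction, single data stream"
    SIMD = "single instruction, multiple data stream"
    MISD = "multiple instruction, single data stream"
    MIMD = "multiple instruction, multiple data stream"


def classify(instruction_streams: int, data_streams: int) -> FlynnClass:
    """Map stream counts to a class; only 'one' versus 'more than one' matters."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    table = {
        (False, False): FlynnClass.SISD,
        (False, True): FlynnClass.SIMD,
        (True, False): FlynnClass.MISD,
        (True, True): FlynnClass.MIMD,
    }
    return table[(instruction_streams > 1, data_streams > 1)]


if __name__ == "__main__":
    # One instruction stream applied to 64 data streams, as in an array processor.
    print(classify(1, 64).name)  # SIMD
```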

SIMD

The SIMD class of processor architecture includes both array and vector processors. This class of processor is a natural response to the use of certain regular data structures such as vectors and matrices. Two different architectures, array processors and vector processors, have been developed to address these structures.

An array processor has many processor elements operating in parallel on many data elements. A vector processor has a single processor element that operates in sequence on many data elements. Both types of processors use a single operation to perform many actions. An array processor depends on the massive size of the data sets to achieve its efficiency (and thus is often referred to as a massively parallel processor), with a typical array processor consisting of hundreds to tens of thousands of relatively simple processors operating together. A vector processor depends on the same regularity of action as an array processor but on smaller data sets, and relies on extreme pipelining and high clock rates to reduce the overall latency of the operation.

Few array architectures have been developed, owing to a limited application base and market requirement. There has been a trend towards more and more complex processor elements due to increases in chip density, and recent array architectures blur the distinction between SIMD and MIMD configurations. On the other hand, many different kinds of vector processors have developed dramatically through the years. Starting with simple memory-based vector processors, modern vector processors have developed into high-performance multiprocessors capable of addressing both SIMD and MIMD parallelism.

MISD

Although it is easy to both envision and design MISD processors, there has been little interest in this type of parallel architecture. The reason, so far anyway, is that no ready programming constructs easily map programs into the MISD organization.

Abstractly, the MISD is a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next. On the microarchitecture level, this is exactly what the vector processor does. However, in the vector pipeline the operations are simply fragments of an assembly-level operation, as distinct from being complete operations in themselves. Surprisingly, some of the earliest attempts at computers in the 1940s could be seen as the MISD concept. They used plug boards for programs, where data in the form of a punched card was introduced into the first stage of a multistage processor. A sequential series of actions was taken in which the intermediate results were forwarded from stage to stage until at the final stage a result was punched into a new card.

MIMD

The MIMD class of parallel architecture is the most familiar and possibly most basic form of parallel processor: it consists of multiple interconnected processor elements. Unlike the SIMD processor, each processor element executes completely independently (although typically running the same program). Although there is no requirement that all processor elements be identical, most MIMD configurations are homogeneous with all processor elements identical.

When communications between processor elements are performed through a shared-memory address space (either global or distributed between processor elements, called distributed shared memory to distinguish it from distributed memory), two significant problems arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references both within a processor element and between different processor elements. The second is maintaining cache coherency: the programmer-invisible mechanism to ensure that all processor elements see the same value for a given memory location. The memory consistency problem is usually solved through a combination of hardware and software techniques. The cache coherency problem is usually solved exclusively through hardware techniques.

There have been many configurations of MIMD processors, ranging from the traditional processor described in this section to loosely coupled processors based on networking commodity workstations through a local area network. These configurations differ primarily in the interconnection network between processor elements, which ranges from on-chip arbitration between multiple processor elements on one chip to wide-area networks between continents, the tradeoffs being between the latency of communications and the size limitations on the system.

LOOKING FORWARD

We are celebrating the first 50 years of electronic digital computers. The past, as it were, is history, and it is instructive to change our perspective and to look forward and consider not what has been done but what must be done. Just as in the past, there will be larger, faster, more complex computers with more memory, more storage, and more complications. We cannot expect that processors will be limited to the "simple" uniprocessors, multiprocessors, array processors, and vector processors we have today. We cannot expect that the programming environments will be limited to the simple imperative programming languages and tools that we have today.

As before, we can expect that memory cost (on a per-bit basis) will continue its decline so that systems will contain larger and larger memory spaces. We are already seeing this effect in the latest processor designs that have doubled the "standard" address size, an increase from 4,294,967,296 addresses (with 32 bits) to 18,446,744,073,709,551,616 addresses (with 64 bits). We can expect that interconnection networks will continue to grow in both scale and performance. The growth in the Internet in the last few years is phenomenal, and the increase in the use of optics in the interconnection network has made this increase at least feasible.

However, we cannot expect that the ease of programming these improved configurations will advance: as the available parallelism of computer systems increases, exploiting this parallelism becomes the limiting factor. There are two aspects of this problem: finding large degrees of parallelism (typically an algorithmic or partitioning problem) and efficiently managing the available parallelism to achieve high performance (typically a scheduling or placement problem). Of course, all the solutions to these problems must ensure that correctness is satisfied. It does not matter how fast the program runs if it does not produce the correct result. Solving these problems will require many developments and changes, few of which are foreseeable.

Although not satisfying, we can certainly say that programming paradigms, compiler techniques, algorithm designs, and operating systems are all fair game, but these are likely only pieces of the puzzle. Indeed, broad new approaches to the representation of physical problems may be required. The good news from all this is that there is no dearth of work to be done in this area. Although improvements can certainly be made to a single processor element, the performance benefits of these improvements are complementary and at this point are nowhere near the scale of performance available through exploiting parallelism. Clearly, providing parallelism of order n is much easier than increasing the execution rate (for example, the clock speed) by a factor of n.

The continued drive for higher- and higher-performance systems thus leads us to one simple conclusion: the future is parallel. The first electronic computers provided a speedup of 10,000 compared to the mechanical computers of 50 years ago. The challenge for the future is to realize parallel processors that provide a similar speedup over a broad range of applications. There is much work to be done here... let us be on with it!

REFERENCES

There are thousands of references dealing with the many aspects of parallel architectures. These references comprise only a very small subset of accessible publications, but provide the interested reader with jumping-off points for further exploration.

FLYNN, M. J. 1995. Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett, Boston.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1988. Parallel Computers 2: Architecture, Programming and Algorithms, 2nd ed. Adam Hilger, Bristol.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1981. Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger, Bristol.
HWANG, K. 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989a. Architecture of High Performance Computers, Vol. I. Uniprocessors and Vector Processors. Springer-Verlag, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989b. Architecture of High Performance Computers, Vol. II. Array Processors and Multiprocessor Systems. Springer-Verlag, New York.
KUHN, R. H. AND PADUA, D. A., EDS. 1981. Tutorial on Parallel Processing. IEEE Computer Society Press, Los Alamitos, CA.
WOLFE, M. J. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley, Reading, MA.
