Heterogeneous Computing: Challenges and Opportunities

Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang, University of Southern California

COMPUTER, June 1993. 0018-9162/93/0600-0018$03.00 © 1993 IEEE

Anytime you work with oranges and apples, you'll need a number of schemes to organize total performance. This article surveys the challenges posed by heterogeneous computing and discusses some approaches to opening up its opportunities.

Homogeneous computing, which uses one or more machines of the same type, has provided adequate performance for many applications in the past. Many of these applications had more than one type of embedded parallelism, such as single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). Most of the current parallel machines are suited only for homogeneous computing. However, numerous applications that have more than one type of embedded parallelism are now being considered for parallel implementation. On the other hand, as the amount of homogeneous parallelism in applications decreases, homogeneous systems cannot offer the desired speedups. To exploit the heterogeneity in computations, researchers are investigating a suite of heterogeneous architectures.

Heterogeneous computing (HC) is the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs. An HC system includes heterogeneous machines, high-speed networks, interfaces, operating systems, communication protocols, and programming environments, all combining to produce a positive impact on ease of use and performance. Figure 1 shows an example HC environment.

Heterogeneous computing should be distinguished from network computing or high-performance distributed computing, which have generally come to mean either clusters of workstations or ad hoc connectivity among computers using little more than opportunistic load-balancing. HC is a plausible, novel technique for solving computationally intensive problems that have several types of embedded parallelism.

HC also helps to reduce design risks by incorporating proven technology and existing designs instead of developing them from scratch. However, several issues and problems arise from employing this technique, which we discuss. In the past few years, several technical meetings have addressed many of these issues. There is also a growing interest in using this paradigm to solve Grand Challenges problems. Richard Freund has organized the Heterogeneous Processing Workshops held each year at the IEEE International Parallel Processing Symposiums. Another related yearly meeting is the IEEE International Symposium on High-Performance Distributed Computing.

Glossary

Analytical benchmarking: A procedure to analyze the relative effectiveness of machines on various computational types.

Code-type profiling: A code-specific function to identify various types of parallelism present in code and to estimate the execution times of each code type.

Cross-machine debuggers: Those available within the heterogeneous computing environment to help debug the application code that executes over multiple machines.

Cross-over overhead: That incurred in transferring data from one machine to another. It also includes data-format-conversion overhead between the two machines.

Cross-parallel compiler: An intelligent compiler that can generate intermediate code executable on different parallel machines.

Heterogeneous computing (HC): A well-orchestrated, coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide fast processing for computationally demanding tasks that have diverse computing needs.

Metacomputations: Computations exhibiting coarse-grained heterogeneity in terms of embedded parallelism.

Mixed-mode computations: Computations exhibiting fine-grained heterogeneity in terms of embedded parallelism.

Multiple instruction, multiple data (MIMD): A mode in which code stored in each processor's local memory is executed independently.

Single instruction, multiple data (SIMD): A mode in which all processors execute the same instruction synchronously on data stored in their local memory.

Heterogeneous systems

The quest for higher computational power suitable for a wide range of applications at a reasonable cost has exposed several inherent limitations of homogeneous systems. Replacing such systems with yet more powerful homogeneous systems is not feasible. Moreover, this approach does not improve the versatility of the system. HC offers a novel, cost-effective approach to these problems; instead of replacing existing multiprocessor systems at high cost, HC proposes using existing systems in an integrated environment.

Limitations of homogeneous systems. Conventional homogeneous systems usually use one mode of parallelism in a given machine (like SIMD, MIMD, or vector processing) and thus cannot adequately meet the requirements of applications that require more than one type of parallelism.

Figure 1. An example heterogeneous computing environment. (The figure's labels include user workstations, a Massively Parallel Processor (MPP), and the Image-Understanding Architecture (IUA), connected by a network.)

As a result, any single type of machine often spends its time executing code for which it is poorly suited. Moreover, many applications need to process information at more than one level concurrently, with different types of parallelism at each level. Image understanding, a Grand Challenges problem, is one such application. At the lowest level of computer vision, image-processing operations are applied to the raw image. These computations have a massive SIMD-type parallelism. In contrast, the participants in the DARPA Image-Understanding Benchmark exercises observed that high-level image-understanding computations exhibit coarse-grained MIMD-type characteristics. For such applications, users of a conventional multiprocessor system must either settle for degraded performance on the existing hardware or acquire more powerful (and expensive) machines.

Each type of homogeneous system suffers from inherent limitations. For example, vector machines employ interleaved memory with a pipelined arithmetic logic unit, leading to performance in the high millions of floating-point operations per second (Mflops). If the data distribution of an application and the resulting computations cannot exploit these features, the performance degrades severely.

Consider an application code having mixed types of embedded parallelism. Assume that the code when executed on a serial machine spends 100 units of time. When this code is executed on a vector machine, the vector portion of the code is executed rapidly, while other portions of the code still have relatively higher execution times. Similarly, the same code when executed on a suite of heterogeneous machines (so that each portion of the code is executed on its matching machine type) is likely to achieve speedups. Figure 2 illustrates a possible scenario (the numbers are execution times in terms of basic units).

Figure 2. Execution of example code using various systems. (The figure's labels include a serial machine with total time = 100 units, a single parallel machine with total time = 50 units, and a suite of special-purpose, vector, MIMD, and SIMD machines with total time = 4 units plus communication overhead.)

Heterogeneous computing. Heterogeneity in computing systems is not an entirely new concept. Several types of special-purpose processors have been used to provide specific services for improving system throughput. One of the most common is I/O handling. Attaching floating-point processors to host computers is yet another heterogeneous approach to enhance system performance. In high-performance computers, the concept of heterogeneity manifests itself at the instruction level in the form of several types of functional units, such as vector arithmetic pipelines and fast scalar processors. However, current multiprocessor systems remain mostly homogeneous as far as the type of parallelism they support. Such systems have been traditionally classified according to the number of instruction and data streams.

An HC environment must contain the following components:

• a set of heterogeneous machines,
• an intelligent high-speed network connecting all machines, and
• a (user-friendly) programming environment.

HC lets a given system be adapted to a wide range of applications by augmenting it with specific functional or performance capabilities without requiring a complete redesign. Since HC comprises several autonomous computers, overall system fault tolerance and longevity are likely to improve.

Issues

We consider two approaches to using the HC paradigm. The first one analyzes an application to explore embedded heterogeneous parallelism. Researchers must devise new algorithms or modify existing ones to exploit the heterogeneity present in the application. Based on these algorithms, users develop the code to be executed by the machines. In the second approach, an existing parallel code of the application is taken as input. To run this code in an HC environment, users must profile the types of heterogeneous parallelism embedded in the code. For this purpose, code-type profilers need to be designed. Figures 3 and 4 illustrate these approaches. However, both approaches need strategies for partitioning, mapping, scheduling, and synchronization. New tools and metrics for performance evaluation are also required. Parallel programming environments are needed to orchestrate the effective use of the computing resources.

Figure 3. User-directed approach.

Algorithm design. Heterogeneous computing opens new opportunities for developing parallel algorithms. In this section, we identify the efforts needed to devise suitable algorithms. The following issues must be considered by the designer:

(1) the types of machines available and their inherent computing characteristics,
(2) alternate solutions to various subproblems of the application, and
(3) the costs of performing the communication over the network.
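These considerations can be made concrete with a toy cost model in the spirit of Figure 2 (a sketch only — the module names, machine types, and time units below are illustrative, not measurements): the estimated runtime of a candidate plan is each module's execution time on its assigned machine, plus a network charge whenever consecutive modules run on different machine types.

```python
# Toy cost model for weighing machine choices against network cost.
# All numbers and names are hypothetical.
EXEC_TIME = {                     # module -> {machine type: time units}
    "low_level_vision":  {"SIMD": 2,  "MIMD": 30, "vector": 20},
    "feature_matching":  {"SIMD": 25, "MIMD": 3,  "vector": 18},
    "fft_filtering":     {"SIMD": 15, "MIMD": 12, "vector": 2},
}
COMM_COST = 1.0                   # units per cross-machine data transfer

def plan_cost(plan):
    """Total runtime: per-module execution time plus one transfer
    whenever consecutive modules run on different machine types."""
    modules = list(plan)
    total = sum(EXEC_TIME[m][plan[m]] for m in modules)
    hops = sum(1 for a, b in zip(modules, modules[1:]) if plan[a] != plan[b])
    return total + hops * COMM_COST

# A homogeneous plan versus one that matches each module to its machine.
homogeneous = {m: "MIMD" for m in EXEC_TIME}
matched = {"low_level_vision": "SIMD",
           "feature_matching": "MIMD",
           "fft_filtering": "vector"}

print(plan_cost(homogeneous))  # -> 45.0
print(plan_cost(matched))      # -> 9.0 (2 + 3 + 2, plus two transfers)
```

As in Figure 2, the matched suite wins only while the communication charge stays small relative to the per-module savings.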

Computations in HC can be classified into two types:

• Metacomputing. Computations in this class fall into the category of coarse-grained heterogeneity. Instructions belonging to a particular class of parallelism are grouped to form a module; each module is then executed on a suitable parallel machine. Metacomputing refers to heterogeneity at the module level.

• Mixed-mode computing. In this fine-grained heterogeneity, almost every alternate parallel instruction belongs to a different class of parallel computation. Programs exhibiting this type of heterogeneity are not suitable for execution on a suite of heterogeneous machines because the communication overhead due to frequent exchange of information between machines can become a bottleneck. However, these programs can be executed efficiently on a single machine such as PASM (Partitionable SIMD/MIMD), which incorporates heterogeneous modes of computation. Mixed-mode computing refers to heterogeneity at the instruction level.

Figure 4. Compiler-directed approach. (The figure's labels include a programming environment and a suite of vector, MIMD, SIMD, and special-purpose machines.)
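The metacomputing style of grouping can be sketched in a few lines, assuming a hypothetical instruction stream that a code-type profiler has already tagged with a parallelism class: consecutive instructions of the same class are clustered into a module, and each module would then be dispatched to a machine of the matching type.

```python
from itertools import groupby

# Hypothetical instruction stream, each operation tagged with its
# parallelism class (the operations and tags are illustrative).
stream = [("conv2d", "SIMD"), ("threshold", "SIMD"),
          ("match_graph", "MIMD"), ("search", "MIMD"),
          ("fft", "vector")]

def metacompute_modules(instrs):
    """Cluster runs of same-class instructions into modules; each module
    is a candidate unit of work for one matching parallel machine."""
    return [(ptype, [op for op, _ in group])
            for ptype, group in groupby(instrs, key=lambda x: x[1])]

print(metacompute_modules(stream))
# -> [('SIMD', ['conv2d', 'threshold']),
#     ('MIMD', ['match_graph', 'search']),
#     ('vector', ['fft'])]
```

A mixed-mode stream, by contrast, would alternate classes almost per instruction, so this clustering would produce many tiny modules and the cross-machine traffic would dominate.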

Mixed-mode machines can achieve large speedups for fine-grained heterogeneity by using the mixed-mode processing available in a single machine. A mixed-mode machine, for example, can use its mode-switching capability to support SIMD/MIMD parallelism and hardware-barrier synchronization, thus improving its performance over a machine operating in SIMD or MIMD mode only.

Code-type profiling. Fast parallel execution of the code in a heterogeneous computing environment requires identifying and profiling the embedded parallelism. Traditional program profiling involves testing a program assumed to consist of several modules by executing it on suitable test data. The profiler monitors the execution of the program and gathers statistics, including the execution time of each program module. This information is then used to modify the modules to improve the overall execution time.

In HC, profiling is done not only to estimate the code's execution time on a particular machine but also to analyze the code's type. This is achieved by code-type profiling. As introduced by Freund, this code-specific function is an off-line procedure: the statistics to be gathered include the types of parallelism of various modules in the code and the estimated execution time of each module on the machines available in the environment. Code types that can be identified include vectorizable, SIMD/MIMD parallel, scalar, and special purpose (such as fast Fourier transform).

Analytical benchmarking. This test measures how well the available machines perform on a given code type. While code-type profiling identifies the type of code, analytical benchmarking ranks the available machines in terms of their efficiency in executing a given code type. Thus, analytical benchmarking techniques permit researchers to determine the relative effectiveness of a given parallel machine on various types of computation.

This benchmarking is also an off-line process and is more rigorous than previous benchmarking techniques, which simply looked at the overall result of running an entire benchmark code on a processor. Some experimental results obtained by analytical benchmarking show that SIMD machines are well suited for operations such as matrix computations and low-level image processing. MIMD machines, on the other hand, are most efficient when an application can be partitioned into a number of tasks that have limited intercommunication. Note that analytical benchmark results are used in partitioning and mapping.

Partitioning and mapping. Problems that occur in these areas of a homogeneous parallel environment have been widely studied. The partitioning problem can be divided into two subproblems. Parallelism detection determines the parallelism present in a given program. Clustering combines several operations into a program module and thus partitions the application into several modules. These two subproblems can be handled by the user, the compiler, or the machine at runtime.

In HC, parallelism detection is not the only objective; code classification based on the type of parallelism is also required. This is accomplished by code-type profiling, which also poses additional constraints on clustering.

Mapping (allocating) program modules to processors has been addressed by many researchers. Informally, in homogeneous environments, the mapping problem can be defined as assigning program modules to processors so that the total execution time (including the communication costs) is minimized. Several other costs, such as the interference cost, have also been considered. In HC, however, other objectives, such as matching the code type to the machine type, result in additional constraints. If such a mapping has to be performed at runtime for load-balancing purposes (or due to machine failure), the mapping problem becomes more complex due to the overhead associated with the code and data-format conversions. Various approaches to optimal and approximate partitioning and mapping in HC have been studied.

Mapping in HC can be performed conceptually at two levels: system (or macro) and machine (or micro). At the system level, each module is assigned to one or more machines in the system so that the parallelism embedded in the module matches the machine type. Machine-level mapping assigns portions of the module to individual processors in the machine. The most common goal of the mapping process is to accomplish these assignments such that the overall runtime of the task is minimized.

Chen et al. proposed a heuristic mapping methodology based on the Cluster-M model, which facilitates the design of portable software. Only one algorithm is required for a given application, regardless of the underlying architecture. Various types of parallelism present in the application are identified. In addition, all communication and computation requirements of the application are preserved in an intermediate specification of the code. The architecture of each machine in the environment is modeled in the system representation, which captures the interconnections of the architecture. The four components of this approach are

• an intermediate model to provide an architecture-independent algorithm specification of the application,
• languages to support the specification in the intermediate model (such languages should be machine-independent and allow a certain amount of abstraction of the computations),
• a tool that lets users specify topologies of the machines employed in the HC environment, and
• a mapping module to match the problem specification and the system representation.

Figure 5 illustrates this methodology.

Figure 5. Cluster-M-based heuristic mapping methodology. (The figure's labels include a problem-specification tool and the heterogeneous architecture.)

Machine selection. An interesting problem appears in the design of HC environments: How can one find the most appropriate suite of heterogeneous machines for a given collection of application tasks, subject to a given constraint such as cost or execution time? Freund has proposed the Optimal Selection Theory (OST) to choose an optimal configuration of machines for executing an application task on a heterogeneous suite of computers, with the assumption that the number of machines available is unlimited. It is also assumed that machines matching the given set of code types are available and that the application code is decomposed into equal-sized modules.

Wang et al.'s Augmented Optimal Selection Theory (AOST) incorporates the performance of code segments on nonoptimal machine choices, assuming that the number of available machines for each code type is limited. In this approach, the program module most suitable for one type of machine is assigned to another type of machine. In the formulation of OST and AOST, it has been assumed that the execution of all program modules of a given application code is totally ordered in time. In reality, however, different execution interdependencies can exist among program modules. Also, parallelism can be present inside a module, resulting in further decomposition of program modules. Furthermore, the effect of different mappings on different machines available for a program module has not been considered in the formulation of these selection theories.

The Heterogeneous Optimal Selection Theory (HOST)9 extends AOST in two ways. It incorporates the effect of various mapping techniques available on different machines for executing a program module. Also, the dependencies between the program modules are specified as a directed graph. Note that OST and AOST assume linear ordering of program modules. In the formulation of HOST, an application code is assumed to consist of subtasks to be executed serially. Each subtask contains a collection of program modules. Each program module is further decomposed into blocks of parallel instructions, called code blocks.

To find an optimal set of machines, we have to assign the program modules to the machines so that

ΣTi is minimal, while Σci ≤ Cmax,

where Ti is the time to execute program module i, ci is the cost of the machine on which program module i is to be executed, and Cmax is an overall constraint on the cost of the machines. The cost ci and execution time Ti corresponding to the assignment under consideration can be obtained by using code-type profiling and/or by analyzing the algorithms.

Iqbal presented a selection scheme that finds an assignment of program modules to machines in HC so that the total processing time is minimized, while the total cost of machines employed in the solution does not exceed an upper bound. The scheme can also find a solution to the dual of the above problem, that is, finding a least expensive set of machines to solve a given application subject to a maximal execution time constraint. This scheme is applicable to all of the above selection theories. The accuracy of the scheme, however, depends upon the method used to assign the program modules to the machines. Iqbal also shows that for applications in which the program modules communicate in a restrictive manner, one can find exact algorithms for selecting an optimal set of machines. If, however, the program modules communicate in an arbitrary fashion, the selection problem is NP-complete.

Scheduling. In homogeneous environments, a scheduler assigns each program module to a processor to achieve desired performance in terms of processor utilization and throughput. Designers usually employ three scheduling levels. High-level scheduling, also called job scheduling, selects a subset of all submitted jobs competing for the available resources. Intermediate-level scheduling responds to short-term fluctuations in the system load by temporarily suspending and activating processes to achieve smooth system operation. Low-level scheduling determines the next ready process to be assigned to a processor for a certain duration. Different scheduling policies, such as FIFO, round-robin, shortest-job-first, and shortest-remaining-time, can be employed at each level of scheduling.

While all three levels of scheduling can reside in each machine in an HC environment, a fourth level is needed to perform scheduling at the system level. This scheduler maintains a balanced system-wide workload by monitoring the progress of all program modules. In addition, the scheduler needs to know the different module types and available machine types in the environment, since modules may have to be reassigned when the system configuration changes or overload situations occur. Communication bottlenecks and queueing delays incurred due to the heterogeneity of the hardware add constraints on the scheduler.

Synchronization. This process provides mechanisms to control execution sequencing and to supervise interprocess cooperation. It refers to three distinct but related problems:

• synchronization between the sender and receiver of a message,
• specification and control of the shared activities of cooperating processes, and
• serialization of concurrent accesses to shared objects by multiple processes.

A variety of synchronization methods have been proposed in the past: semaphores, conditional critical regions, monitors, and path expressions, among others. In addition, some multiprocessors include hardware synchronization primitives. In general, synchronization can be implemented by using shared variables or by message-passing.

In heterogeneous computing, the synchronization problem resembles that of distributed systems. In both cases, a global clock and shared memory are absent, and (unpredictable) network delays and a variety of operating systems and programming environments complicate the process.

Several techniques used in distributed systems are again useful for solving HC synchronization problems. Two approaches are available: centralized (one machine is designated as a control node) and distributed (decision-making is distributed across the entire system). The correct choice depends on the topology, reliability, speed, and bandwidth of the network, in addition to the types and number of machines in the environment. However, reducing synchronization overhead is important to achieving large speedups in HC. Due to the possibility of several concurrently operating autonomous machines in the environment, application-code performance in HC is more sensitive to synchronization overheads. Frequent hand-shaking for synchronization may expend most of the available network bandwidth.

Interconnection requirements. Current local area networks (LANs) are not suitable for HC because higher bandwidth and lower latency networks are needed. The bandwidth of commercially available LANs is limited to about 10 megabits per second. In HC, on the other hand, assuming machines operating at 40 megahertz and 20 million instructions per second with a 32-bit word length, a bandwidth on the order of 1 gigabit per second is required to match the computation and communication speeds (20 × 10⁶ instructions per second × 32 bits is already 640 megabits per second for a single machine).

Even if higher bandwidth networks were available, three main sources of inefficiency would persist in current networks. First, application interfaces incur excessive overhead due to context switching and data copying between the user process and the machine's operating system. Second, each machine must incur the overhead of executing the high-level protocols that ensure reliable communication between program modules. Third, the network interface burdens the machine with interrupt handling and header processing for each packet. This suggests incorporating additional network-interface hardware in each machine.

Nectar is an example of a network backplane for heterogeneous multicomputers. It consists of a high-speed fiber-optic network, large crossbar switches, and powerful network-interface processors. Protocol processing is off-loaded to these interface processors. A networking standard called Hippi (ANSI X3T9.3 High-Performance Parallel Interface) is being implemented for realizing heterogeneous computing environments at various research sites. Hippi is an open standard that defines the physical and logical link layers of a 100-Mbyte/second network.

In HC, hardware modules from various vendors share physical interconnections. Differing communication protocols may make network-management problems complex. The following general approaches for dealing with network heterogeneity have been discussed in the literature:

(1) treat the heterogeneous network as a partitioned network, with each partition employing a uniform set of protocols;
(2) have a single "visible" network-management console; and
(3) integrate the heterogeneous management functions at a single management console.

The IEEE Computer Society Technical Committee on Parallel Processing, the Technical Committee on Mass Storage, and several research sites are working together to define interface standards.

Some academic sites

A number of academic sites are developing HC environments and applications (this list is not exhaustive).

Systems and architectures

Distributed High-Speed Computing (DHSC) project at Pittsburgh Supercomputing Center, University of Pittsburgh
Image-Understanding Architecture, University of Massachusetts at Amherst
Mentat, University of Virginia
Nectar-Based Heterogeneous System, Carnegie Mellon University
Northeast Parallel Architecture Center (NPAC), Syracuse University
Partitionable SIMD/MIMD (PASM), Purdue University

Institutes and departments

Beckman Institute, University of Illinois at Urbana-Champaign
Department of Biological Sciences, University of California at Los Angeles
Department of Computer Science, Kent State University
Department of Computer Science, University of California at San Diego
Department of Computer and Information Sciences, New Jersey Institute of Technology
Department of Electrical Engineering-Systems, University of Southern California
Department of Math and Computer Science, Emory University
Minnesota Supercomputer Center (MSC), University of Minnesota at Minneapolis
Supercomputer Computations Institute (SCI), Florida State University

Programming environments. A parallel programming environment includes parallel languages, intelligent compilers, parallel debuggers, syntax-directed editors, configuration-management tools, and other programming aids.

24 COMPUTER PVM system parallel languages, intelligent compil- ers, parallel debuggers, syntax-directed editors. configuration-management tools, and other programming aids. In homogeneous computing, intelli- gent compilers detect parallelism in sequential code and translate it into parallelmachinecode.Parallelprogram- ming languages have been developed to support parallel programming, such as MPL for MasPar machines, and Lisp and C for the Connection Machine. In addition, several parallel programming environments and models have been designed, such as Code, Faust, Sched- ule, and Linda. HC requires machine-independent and portable parallel programming lan- guages and tools. This requirement cre- ates the need for designing cross-paral- lel compilers for all machines in the environment, and parallel debuggers for debugging cross-machine code. Several programming models and environments r have been developed in the past for Programming environment heterogeneous computing.R~‘J-‘6 I The Parallel Virtual Machine (PVM) system.16 evolved over the past three Figure 6. An overview of the Parallel Virtual Machine system. years, consists of software that provides a virtual concurrent computing envi- ronment on general-purpose networks work, presenting a virtual concurrent in the environment. The inherent con- of heterogeneous machines. It is com- computing environment to users. currency in a distributed computing posed of a set of user-interface primi- environment, the lack of total ordering tives and supporting software that en- Performance evaluation. Performance of events on different machines, and the able concurrent computing on a loosely tools are used to summarize the run- nondeterministic nature of the commu- coupled network of high-performance time behavior of an application, includ- nication delays between the processes machines. 
It can be implemented on a ing analyzingresource use and the cause make the problem of evaluating perfor- hardware base consisting of different of any performance bottleneck. Depend- mance more complex. architectures, including single-CPU sys- ing on its design, a performance tool can The impact of the code type must be tems, vector machines, and multipro- describe program behaviors at many considered. Thus, performance metrics cessors (see Figure 6). levels of detail. The two most common such as processor utilization, speedup. Application programs view the PVM are the intraprocess and interprocess and efficiency are difficult to compute. system as a general and flexible parallel levels. Intraprocess performance tools, Indeed, these metrics must be carefully computing resource that supports such as the gproffacility on BSD Unix, defined to make a reasonable perfor- shared memory, message-passing, and the HP sampletY3000, and the Mesa Spy, mance evaluation. hybrid models of computation. A het- provide information about individual erogeneous application can be decom- processes. posed into several subtasks based on Performance tools for distributed Image understanding the embedded types of computation computing systems concentrate on the and then executed by using PVM sub- interactions between the processes. In- Intrinsic parallelism in image process- routines on different matching ma- tegrated performance models that ob- ing and the variety of heuristics avail- chines available on the network. The serve the status and the performance able for problems in image understand- PVM primitives are provided in the events at all levels can be found in the ing make computer vision an ideal form of libraries linked to application PIE (Programming and Instrumenta- vehicle for studying heterogeneous com- programs written in imperative languag- tion Environment) project.17 puting. From a computational perspec- es. 
They support process initiation and Designing performance-evaluation tive, vision processing is usually orga- management, message-passing, syn- tools for distributed computing systems nized as follows: chronization, and other housekeeping involves collecting, interpreting, and facilities. evaluating performance information l Early processing of the raw image Support software provided by the from application programs, the operat- (often called low-level processing). At PVM system executes on a set of user- ing system, the communication network, this level, the input is an image. The specified computing elements on a net- and other hardware modules employed output image is approximately the same

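One reason metrics such as speedup need careful definition in a heterogeneous setting is that the serial baseline itself is ambiguous: each machine runs the whole program in a different time. A minimal sketch of the issue, using invented timings; comparing against the fastest single machine is one possible convention, not a standard definition:

```python
# Sketch: why speedup needs careful definition on a heterogeneous suite.
# All timings below are made-up numbers for illustration.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

# Each machine in the suite runs the entire program in a different time,
# so there is no single obvious serial baseline.
serial_times = {"simd-array": 300.0, "mimd-multiproc": 120.0, "workstation": 900.0}
t_hc = 40.0  # hypothetical measured time on the whole heterogeneous suite

# One conservative convention: compare against the *fastest* single machine,
# so the reported speedup is not inflated by a weak baseline.
best_serial = min(serial_times.values())
print(f"speedup vs. fastest single machine: {speedup(best_serial, t_hc):.1f}")  # 3.0
```

Efficiency is even harder to pin down, since dividing by a raw processor count is meaningless when the processors differ in power.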
June 1993 25

Image understanding

Intrinsic parallelism in image processing and the variety of heuristics available for problems in image understanding make computer vision an ideal vehicle for studying heterogeneous computing. From a computational perspective, vision processing is usually organized as follows:

• Early processing of the raw image (often called low-level processing). At this level, the input is an image. The output image is approximately the same size. Convolutions are performed on each pixel in parallel. The data communication among the pixels is local to each pixel.
• Interfacing between low-level and image-understanding problems (often termed intermediate-level processing). The operations performed on each data item can be nonlocal. The communication is also irregular as compared with that of low-level processing.
• Image understanding. By this we mean using the acquired data from the above processing (for example, geometric features such as shape, orientation, and moments) to infer semantic attributes of an image. Processing at this level can be classified as knowledge and/or symbolic processing. Search-based techniques are widely used at this level.

As evident in the preliminary results from the 1988 DARPA Image-Understanding Benchmark,18 each level in computer vision exhibits a different type of parallelism. Therefore, at each level a suitable type of parallel machine must be employed. Corresponding to each of the above classes of problems, a suitable class of architecture was proposed:3

• SIMD machines. Machines in this class are well suited for computations in low-level and in some intermediate-level computer vision problems because of the regular dataflow and iconic operations in these two levels. For example, two-dimensional cellular arrays and mesh-connected computers have been proposed for a large class of geometric and graph-based problems in image processing. Parallel machines such as the MasPar MP-series and the Connection Machine CM-2 fall in this category. Pipelined parallel machines (like the Carnegie Mellon University Warp machine) are also well suited for low- and intermediate-level vision computations.
• Medium-grained MIMD machines. Various intermediate- and high-level vision tasks are computationally intensive with irregular dataflow. Moreover, the size of the input is smaller than the input image size. Parallel systems having a set of powerful processors are suitable for performing computations in intermediate- and high-level vision tasks. The Connection Machine CM-5, Vista/2, Alliant FX-80, and Sequent Symmetry 81 are some examples.
• Coarse-grained MIMD machines. High-level vision tasks such as image understanding/recognition and symbolic processing employ complex data structures. Many of the proposed algorithms for such problems are nondeterministic, and architectural requirements for these problems demand coarse-grained MIMD machines. Parallel machines such as the Aspex ASP and Vista/3 are well suited for this class of problems.

Another approach is to build machines having multiple computational capabilities embedded in a single system. These architectures consist of several levels. Typically, the lower levels operate in SIMD mode and the higher levels operate in MIMD mode. In the Image-Understanding Architecture,19 the lowest level has bit-serial processors, and the intermediate level consists of digital signal processors. The highest level consists of general-purpose microprocessors operating in MIMD mode.

An example vision task. We present an example vision task and identify the different types of parallelism. We have chosen the DARPA Integrated Image-Understanding Benchmark4 as an example task. The overall task performed by this benchmark is the recognition of an approximately specified two-and-a-half-dimensional "mobile" sculpture in a cluttered environment, given images from intensity and range sensors.

Steps in the benchmark can be identified by the vision-task classifications. First, low-level operations such as connected component labeling and corner extraction are performed. Then, grouping the corners (an intermediate-level vision operation) results in the extraction of candidate rectangles. Finally, partial matching of the candidate rectangles is followed by confirmed matching (a high-level vision task). The results obtained on several different parallel machines were reported at the 1988 Image-Understanding Workshop. Details of the benchmark results can be found in Weems et al.18

As they describe, directly interpreting these results would be unfair, since there were many undefined factors in the benchmark description. However, the benchmark does give pointers to how different machines can be classified with respect to their suitability for performing operations at different levels of vision. Overall, the simulation results show that the (heterogeneous) Image-Understanding Architecture performs better than any single machine considered. These results support the suitability of a heterogeneous environment for computer vision applications.

Heterogeneous computing offers new challenges and opportunities to several research communities. To support this paradigm, the following areas of research must be investigated:

• Designing tools to identify heterogeneous parallelism embedded in applications.
• Studying issues in high-speed networking, including available technologies and specialized hardware for networking.
• Designing communication protocols to reduce the cross-over overheads that occur when different machines communicate in the same environment.
• Developing standards for parallel interfaces between various machines.
• Designing efficient partitioning and mapping strategies to exploit heterogeneous parallelism embedded in applications.
• Designing user interfaces and user-friendly programming environments to program diverse machines in the same environment.
• Developing algorithms for applications with heterogeneous computing requirements.

Indeed, HC provides an opportunity to bring together research from various disciplines of computer science and engineering to develop a feasible approach for applications in the Grand Challenges problem set.

Acknowledgments

We thank Richard Freund and Ashraf Iqbal for many helpful discussions. This research was partly supported by the National Science Foundation under Grant No. IRI-9145810.

26 COMPUTER

References

1. R. Freund and D. Conwell, "Superconcurrency: A Form of Distributed Heterogeneous Supercomputing," Supercomputing Review, Oct. 1990, pp. 47-50.
2. Newsletter of the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), Vol. 1, No. 1, Oct. 1992.
3. V.K. Prasanna Kumar, Parallel Algorithms and Architectures for Image Understanding, Academic Press, Boston, 1991.
4. C. Weems et al., "An Integrated Image-Understanding Benchmark: Recognition of a 2-1/2D Mobile," Proc. DARPA Image-Understanding Workshop, Morgan Kaufmann Publishers, San Mateo, Calif., 1988, pp. 111-126.
5. T. Berg and H.J. Siegel, "Instruction Execution Trade-Offs for SIMD vs. MIMD vs. Mixed-Mode Parallelism," Proc. Int'l Parallel Processing Symp. (IPPS), IEEE CS Press, Los Alamitos, Calif., Order No. 2167, 1991, pp. 301-308.
6. A. Khokhar et al., "Heterogeneous Supercomputing: Problems and Issues," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 3-12.
7. R. Freund, "Optimal Selection Theory for Superconcurrency," Proc. Supercomputing '89, IEEE CS Press, Los Alamitos, Calif., Order No. M2021 (microfiche), 1989, pp. 13-17.
8. G. Agha and R. Panwar, "An Actor-Based Framework for Heterogeneous Computing Systems," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 35-42.
9. S. Chen et al., "A Selection Theory and Methodology for Heterogeneous Supercomputing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.
10. M. Wang et al., "Augmenting the Optimal Selection Theory for Superconcurrency," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 13-22.
11. M. Iqbal, "Partitioning Problems for Heterogeneous Computer Systems," tech. report, Dept. of Electrical Engineering-Systems, Univ. of Southern California, Los Angeles, 1993.
12. E. Arnould et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS III), IEEE CS Press, Los Alamitos, Calif., Order No. M1936 (microfiche), 1989, pp. 205-216.
13. ANSI X3T9.3, "High-Performance Parallel Interface: Hippi-PH, Hippi-SC, Hippi-FP, Hippi-LE, and Hippi-MI," Working Draft Proposed American National Standard for Information Systems, American Nat'l Standards Inst., New York, Jan.-Apr. 1991.
14. C. de Castro and S. Yalamanchili, "Partitioning Signal Flow Graphs for Execution on Heterogeneous Signal Processing Architectures," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 81-86.
15. J. Potter, "Heterogeneous Associative Computing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.
16. V. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, Vol. 2, No. 4, Dec. 1990, pp. 315-339.
17. Z. Segall and L. Rudolph, "PIE: A Programming and Instrumentation Environment for Parallel Processing," IEEE Software, Vol. 2, No. 6, Nov. 1985, pp. 22-27.
18. C. Weems et al., "Preliminary Results from the DARPA Integrated Image-Understanding Benchmark," Parallel Architectures and Algorithms for Image Understanding, V.K. Prasanna, ed., Academic Press, Boston, 1991, pp. 399-499.
19. D. Shu, J. Nash, and C. Weems, "A Multiple-Level Heterogeneous Architecture for Image Understanding," Proc. Int'l Conf. Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., Vol. 2, Order No. 2063, 1990.

Ashfaq A. Khokhar is a PhD candidate in the Department of Electrical Engineering-Systems at the University of Southern California, Los Angeles. His areas of research include parallel architectures and scalable algorithms, image understanding and parallel processing, VLSI computations, interconnection networks, and heterogeneous computing. Khokhar received the BSc degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1985 and the MS degree in computer engineering from Syracuse University in 1988. He is a student member of the Computer Society.

Viktor K. Prasanna (V.K. Prasanna Kumar) is an associate professor in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His research interests include parallel computation, computer architecture, VLSI computations, and computational aspects of image processing, vision, robotics, and neural networks. Prasanna received the BS degree in electronics engineering from Bangalore University, the MS degree from the School of Automation, Indian Institute of Science, and the PhD in computer science from Pennsylvania State University in 1983. He serves as the symposium chair of the 1994 IEEE International Parallel Processing Symposium and is a subject area editor of the Journal of Parallel and Distributed Computing, IEEE Transactions on Computers, and IEEE Transactions on Signal Processing. He is the founding chair of the IEEE Computer Society Technical Committee on Parallel Processing and is a senior member of the Computer Society.

Muhammad E. Shaaban is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California. His areas of research include parallel optical interconnection networks, parallel algorithms for image processing, and heterogeneous computing. Shaaban received the BS and MS degrees in electrical engineering from the University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1984 and 1986, respectively. He recently served as a session chair at the International Parallel Processing Symposium. He is a student member of the Computer Society.

Cho-Li Wang is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His areas of research include computer architectures and algorithms, image understanding and parallel processing, image compression, and heterogeneous computing. Wang received the BS degree in computer science and information engineering from National Taiwan University, Taiwan, in 1985 and the MS degree in computer engineering from the University of Southern California in 1990.

Readers can contact Viktor K. Prasanna at the School of Engineering, Department of Electrical Engineering-Systems, University of Southern California, University Park, Los Angeles, CA 90089-2562.
