Datapath/Memory/Array Size Tradeoffs in the Design of SIMD Arrays for a Spatially Mapped Workload∗

Martin C. Herbordt Anisha Anand Charles C. Weems Owais Kidwai Renoy Sam

Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204-4793
Department of Computer Science, University of Massachusetts, Amherst, MA 01003

Abstract: Though massively parallel SIMD arrays continue to be promising for many computer vision applications, they have undergone few systematic empirical studies. The problems include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The latter two issues have been addressed previously; here we describe how spreadsheets and tk/tcl are used to endow our simulator with the flexibility to model a large variety of designs. The utility of this approach is shown in the second half of the paper, where results are presented as to the performance of a large number of array size, datapath, register file, and application code combinations. The conclusions derived include the utility of multiplier and floating point support, the cost of virtual PE emulation, likely datapath/memory combinations, and overall designs with the most promising performance/chip area ratios.

1 Introduction

It is commonly recognized that the computational characteristics of many low-level, and even some intermediate-level, vision tasks map well to massively parallel SIMD arrays of processing elements. Primary reasons for this are that these vision tasks typically require processing of pixel data with respect to either nearest-neighbor or other two-dimensional patterns exhibiting locality or regularity. Tasks within this domain often require complex codes. As a consequence, gauging computational requirements is facilitated by evaluating performance with respect to real program executions. What we examine here are the effects on both cost and performance of varying the array size (number of PEs), the datapath, and the memory hierarchy with respect to a set of computer vision applications.

The design issues are as follows. The first is to rank the datapath and memory designs with respect to performance, eliminating those with poor cost/performance ratios. The second is to see how the performance of these designs varies with array size. This is complicated by the fact that although the effect of the datapath on performance varies predictably with array size, that of the memory hierarchy does not. The next issue is to relate the datapath to the memory designs to find which combinations result in i) well-balanced systems and ii) systems which could be enhanced with PE cache. Finally, the overall designs (including array size, datapath, and memory) with the best performance/chip area ratios are determined.

The constraint that we wish to examine processor designs by running real application programs, together with the wide variety of designs to be evaluated, clearly requires that simulation be used. Many problems remain, however, including how to i) implement a simulator with the flexibility to emulate large numbers of models without requiring substantial recoding, ii) achieve throughput given the slowdown inherent in software simulation, and iii) guarantee fairness for a wide range of designs that may not share the same optimal algorithm for any particular task.

These issues are dealt with by the ENPASSANT environment; the latter two are discussed elsewhere [6, 4], while the former has been addressed recently and is described here. The key idea is to use spreadsheets coupled with tk/tcl and C code to specify templates (classes of architectures) relating design parameters to the performance of virtual machine constructs. The user can then specify models (specific designs) within those templates from a graphical user interface.
∗This work was supported in part by an IBM Fellowship; by the NSF through CAREER award #9702483; by DARPA under contract DACA76-89-C-0016, monitored by the U.S. Army Engineer Topographic Laboratory; and under contract DAAL02-91-K-0047, monitored by the U.S. Army Harry Diamond Laboratory.

2 Architecture and Application Domains

2.1 Architecture Space

Massively Parallel SIMD Arrays (MPAs) are asymmetric, having a controller and an array consisting of from a few thousand to a few hundred thousand processing elements (PEs). The PEs synchronously execute code broadcast by the controller. One consequence is that PEs do not have individual micro-sequencers or other control circuitry; rather, their 'CPUs' consist entirely of the datapath. Within this constraint, however, there are few limitations. See Figure 1. Here we examine array sizes of from 4K to 64K PEs.

The complexity of current PEs ranges from the MGAP, which does not have even a complete one-bit ALU [9], to the MasPar MP2, which contains a 32-bit datapath and extensive floating point support [1]. The particular features that are examined here are the ALU/datapath width, the multiplier, and the floating point support. The effect of other features, such as the number of ports in the register file, has been found to be second order.
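To make the execution model concrete, the following minimal sketch (ours, not ENPASSANT code; the instruction set, register-file size, and class names are invented) shows the essential asymmetry: one controller sequences and broadcasts instructions, and every active PE applies each one to its own local registers.

```python
# Minimal sketch of the SIMD execution model: a single controller sequences
# and broadcasts instructions; PEs have no sequencer of their own and simply
# apply each broadcast instruction to local state.  Illustrative only.

class PE:
    """A processing element: just a datapath, a register file, and an activity bit."""
    def __init__(self):
        self.reg = [0] * 8
        self.active = True          # activity control: inactive PEs sit out

    def execute(self, op, dst, a, b):
        if not self.active:
            return
        if op == "add":
            self.reg[dst] = self.reg[a] + self.reg[b]
        elif op == "and":
            self.reg[dst] = self.reg[a] & self.reg[b]

class Controller:
    """Broadcasts every instruction to all PEs, which execute synchronously."""
    def __init__(self, n_pes):
        self.array = [PE() for _ in range(n_pes)]

    def broadcast(self, program):
        for instr in program:
            for pe in self.array:   # conceptually simultaneous across the array
                pe.execute(*instr)

ctrl = Controller(n_pes=64 * 64)    # a 4K-PE array, the smallest size studied
ctrl.broadcast([("add", 2, 0, 1), ("and", 3, 2, 2)])
```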


Figure 2: Shown is the MPA PE memory design space. Note that all PEs share the same controllers.

Figure 1: Shown is some of the MPA architecture space. Note that PEs do not have local control and that their number ranges from the thousands to the hundreds of thousands.

Another consequence of SIMD control is on MPA PE memory configurations: each array instruction operates on the same registers and memory locations for each PE.¹ In this case, it is possible to model the memory hierarchy, including cache, with a simple serial model. See Figure 2.

Existing MPAs and MPA designs contain on-chip memory² in the range of 32 to 1024 bits. Off-chip memory typically consists of DRAM, although SRAM has also been used. Although no MPA has yet been built with PE cache, studies have shown the clear benefits [8]. The particular features we examine here are the register file size, from 128 to 1600 bits, and the cache size. We do not examine very small register file sizes because our programmer's model requires at least three 32 bit registers. We assume enough main memory to store all the data the programs require, and test caches up to the sizes necessary to achieve a 99% hit rate.

¹An exception is for processors which support a local indexing mode where the register can be a pointer to a memory location; as use of this mode is not critical in our current application test suite, we postpone this analysis to a later study.
²We refer to on-chip memory as register file because of its explicit control.
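The serial model mentioned above is simple to state in code. The sketch below is our own illustration under assumed latencies and an assumed LRU line-replacement policy; none of these numbers or names are taken from the paper.

```python
from collections import OrderedDict

def reference_stream_cycles(addresses, regfile_words, cache_lines, line_words,
                            cache_latency=4, dram_latency=30):
    """Serial model of the per-PE memory hierarchy.

    Because every PE issues the same address on a given instruction, one
    reference stream stands in for the whole array.  Word addresses below
    regfile_words are register-file resident (no extra cycles); other
    references go through an LRU cache of cache_lines lines and, on a miss,
    to off-chip DRAM.
    """
    cache = OrderedDict()                   # line tag -> True, LRU ordered
    cycles = 0
    for addr in addresses:
        if addr < regfile_words:
            continue                        # operand held in the register file
        tag = addr // line_words
        if tag in cache:
            cache.move_to_end(tag)
            cycles += cache_latency
        else:
            cycles += dram_latency
            cache[tag] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict the least recently used line
    return cycles

# Example: a wrapped sequential sweep against a 16-word register file.
print(reference_stream_cycles([i % 64 for i in range(1000)],
                              regfile_words=16, cache_lines=4, line_words=8))
```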
2.2 Application Space

The fundamental criterion for selecting test suite programs is that they should be representative of the expected workload of the target machine. Since SIMD arrays, however, are not fully general purpose processors, general purpose benchmarks such as SPEC [12] are not appropriate. In fact, it is becoming more and more clear that, for many application domains, evaluation of parallel processors requires using application specific test suites. That this has long been the consensus of the computer vision community is shown in [10, 11, 13].

Our test suite consists of the parts of the second IU Benchmark particularly suited to MPAs, plus other important vision tasks. These include a convolution-based correspondence matcher, a motion-from-correspondence analyzer, an image preprocessor that uses complex edge modeling, and two segmentation-based line finders.

Some general comments are as follows: they are all non-trivial in that they consist of from several hundred to several thousand lines of code; they are, with the exception of the IU Benchmark, in use in a research environment; and they span a wide variety of types of computations. As an example of the last point, some are dominated by gray-scale computation (which is mostly 8 bit integer), others by floating point. They are described in detail in [3].

2.3 Mapping Applications to Architectures: Virtual PEs

The standard MPA programmer's model maps elements of parallel variables to individual PEs. It is rarely the case, however, that a processor has enough PEs to create this precise mapping; rather, some number of elements must be mapped to each PE.

In order to align the programmer's model with physical machines, the PEs assumed by the programmer's model (referred to as virtual PEs or VPEs) are emulated by the physical PEs. We refer to the number of VPEs each PE must emulate as the VPE/PE ratio, or VPR. This emulation can either be done entirely in software, through hardware support in the array control unit (ACU), or through a combination of the two. One of the fundamental goals of this study is to assess the cost of VPE emulation.

We now briefly describe how VPE emulation is carried out. VPE variables are mapped VPR elements per PE. We refer to a single physical slice of a VPE variable across the array of physical PEs as a tile. For simple non-interacting instructions such as those for logic, arithmetic, and activity control, VPR physical instructions must be executed, one for each tile. Emulating interacting VPE instructions such as feedback and communication is substantially more complex.

The best performance is achieved when as many instructions as possible are executed within each tile; this is because a change of tile is effectively a context switch. Tiles must be swapped, however, whenever feedback or communication instructions are encountered. This is because feedback instructions cause control hazards in the controller and communication instructions inherently involve tile interaction. Compilers, such as that for ICL [4], can schedule instructions to minimize tile swapping.

The overhead due to VPE emulation therefore has several components: i) a linear component due to VPR tiling, ii) an increase in working set resulting in a decrease in locality of reference, iii) a sub-linear increase in working set size due to instruction scheduling, and iv) further overhead due to emulations of communication instructions [5].
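The tiling overhead can be made concrete with a small cost model. The sketch below is our own illustration with placeholder cost constants: runs of non-interacting instructions are executed tile by tile, and an extra per-tile charge is added whenever a feedback or communication instruction breaks a run, which is exactly the behavior the ICL compiler tries to minimize.

```python
def emulation_cost(vm_trace, vpr, swap_cost=1, comm_cost=4):
    """Physical-instruction count for emulating a VM trace at a given VPR.

    vm_trace holds 'alu' (non-interacting: logic, arithmetic, activity
    control) and 'comm' (feedback or communication) VM instructions.  A run
    of non-interacting instructions is executed tile by tile, so it costs
    VPR tile activations plus VPR copies of every instruction in the run;
    interacting instructions break the run and carry their own per-tile cost.
    """
    physical = 0
    run = 0                                  # current run of 'alu' instructions
    for instr in vm_trace:
        if instr == "alu":
            run += 1
        else:                                # 'comm': flush the run, then emulate
            if run:
                physical += vpr * (run + swap_cost)
                run = 0
            physical += vpr * comm_cost
    if run:
        physical += vpr * (run + swap_cost)
    return physical

trace = ["alu"] * 10 + ["comm"] + ["alu"] * 10
print(emulation_cost(trace, vpr=1))          # no virtualization
print(emulation_cost(trace, vpr=16))         # e.g. 64x64 array, 256x256 problem
```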
3 Apparatus: ENPASSANT

There are two fundamental ideas behind our work. The first is to use only a single programmer's model, which we call the MPA virtual machine (VM), to cover the entire class of MPAs. To make this work without introducing the tremendous inefficiencies that can be caused by a mismatch between programmer's model and target machine requires two pieces of software: one provides emulations of features present on some, but not all, machines; the other contains critical application specific functions.

The second key idea is what we call trace compilation.³ All code is written in a single data parallel language that defines the MPA VM. That code is not compiled directly to a target machine instruction set; rather, it is run on an MPA VM emulator. In the process a 'coarse-grained' trace is generated. This trace is then refined through a series of transformations wherein greater resolution is obtained with respect to the details of the target architecture. This process has two benefits: 1) virtual machine emulation and trace compilation together are one to two orders of magnitude faster than simulating an MPA at the instruction level, and 2) once the trace has been generated, it can be reused to evaluate a large number of design alternatives.

³Note that trace compilation is not related to the work by Ellis and Fisher involving trace-scheduling compilers.

The simulator is perhaps best illustrated through its usage; these are the basic steps in using the system (see Figure 3):

1. Write (or select) and compile an application program in ICL, a data parallel language [2].
2. Execute the program (on any machine which supports C++) with trace generation turned on.
3. Specify or select a target architecture.
4. Run the trace through the target-architecture-dependent transformations and trace analysis tools, which culminate with reporting performance data.

The major simulator components follow. The first three comprise the input constructor.

• ICL language. ICL is a data parallel language similar to C∗ and MPL. It has been designed specifically to present the application programmer with the MPA Virtual Machine. ICL is constructed entirely with C++ class libraries, making it easily extensible should the VM be changed.

• Operator emulation library. The OEL is a library of ICL functions that emulate in software those hardware features present in some, but not all, MPAs. The OEL guarantees that ICL programs are portable within the class of MPAs.

• Application function library. The AFL is a library containing a small number of critical ICL application functions coded using different algorithms. The AFL is essential when the choice of algorithm is highly architecture dependent.

• VM trace generator. Code embedded within ICL generates the VM trace whenever an ICL program is executed with the trace generation option turned on.

• Target machine specifier/model generator. These programs provide a GUI for specifying MPA templates (i.e. families of machines) and variations within those templates. The output is the target machine model used by the trace compiler.

• Trace compiler. Inputs a VM trace and a target machine model and reconstructs what the application program trace would have looked like had it been generated by that particular target machine.

Figure 3: Shown are major components of ENPASSANT.

The ideas behind the input constructor and the trace compiler have been described in detail elsewhere [6, 4]. The target machine specifier has not, however.

As shown in Figure 4, machine specification is done in two levels. To create a specification for a family of machines, the user creates a template. This process consists of selecting elements from a menu (with the option of adding new elements), indicating whether those elements are optional or required, and providing the legal parameter ranges. Elements present in all MPAs are things like ALUs, neighbor communication networks, and registers. Optional elements in a template can include, e.g., floating point units and broadcast networks. Typical parameters are ALU width, neighbor communication latency per bit, and number of ports in the register file. The template must also specify how the elements are related to each other, to the parameters, and to the virtual machine.
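A hypothetical, much simplified rendering of this two-level scheme is sketched below: a template names the elements, marks them required or optional, and gives their legal parameter ranges, while a model is a particular selection that is checked against the template before being handed on. All field names and values here are illustrative, not ENPASSANT's.

```python
# Hypothetical two-level machine specification: template -> model.

TEMPLATE = {
    "alu_width":        {"required": True,  "legal": [1, 2, 4, 8, 16, 32]},
    "multiplier_width": {"required": False, "legal": [4, 8, 16, 32]},
    "pes_per_fpu":      {"required": False, "legal": [32, 16, 8, 4, 2, 1]},
    "regfile_bytes":    {"required": True,  "legal": range(20, 201, 20)},
    "cache":            {"required": False, "legal": ["none", "pe_cache"]},
}

def validate_model(model):
    """Check a specific machine model against the template."""
    for name, spec in TEMPLATE.items():
        if name not in model:
            if spec["required"]:
                raise ValueError(f"missing required element: {name}")
            continue
        if model[name] not in spec["legal"]:
            raise ValueError(f"{name}={model[name]} outside legal range")
    return model

# A MasPar MP1-like selection within the template (values illustrative).
mp1_like = validate_model({"alu_width": 4, "regfile_bytes": 40, "cache": "none"})
print(mp1_like)
```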

Figure 4: Shown is the Target Machine Model Generator. Templates define classes of target machines which have similar relationships among components. The parameters specify which components are present and their characteristics.

The particular target machines are specified by choosing from the menu suggested by the template. The template-based relations and machine specifications are then combined to create target machine models. Since particular machines can be specified in great detail, and because of the speed of evaluations, users have the option of inputting ranges of parameters. The following is a typical request:

• a set of application programs
• a set of array sizes (e.g. "256x256, 128x128, 64x64")
• a set of register file sizes (e.g. "10, 20, 40")
• a default prespecified datapath (e.g. "MasPar MP1-like")
• cache parameters (e.g. "no cache")
• a default prespecified communication network (e.g. "CM2-like")

A page from a similar dialogue is shown in Figure 5.

Figure 5: Shown is a page from an evaluation dialogue. The user is requesting that a particular target machine specification be used for 7 different array sizes.

Template generation is time-consuming and it requires expertise to create one that is useful. Once created, however, it can be used to evaluate a huge space of machines. We also note that templates need to be generated only very rarely: the entire space of MPAs that we have considered so far in our research was specifiable using a single template.

4 Experimental Methodology

The ideal situation is to search the space of possible designs optimizing for cost/performance, or, given a cost (or performance) goal, finding the design with the best possible performance (or cost). The critical problem for this approach is the difficulty in determining the costs of designs, since there are so many variations in technology, process, etc. What we have done instead is to rank designs by a rough measure of complexity and determine the performance of a large sample of those designs. We believe this is valuable for two reasons: i) a design whose complexity is not warranted by its performance can be eliminated, and ii) non-linearities, or discontinuities, can be found which can help determine promising areas for detailed study. We also recognize that these rankings are implementation dependent and are likely to vary somewhat with changes in technology and design.

Complexity ordering   ALU/datapath   Multiplier   FP support (inverse)
Level 1                 1              --           --
Level 2a                1              --           1/32
Level 2b                2              --           --
Level 3a                2              --           1/16
Level 3b                4              --           --
Level 4a                4               4           --
Level 4b                4              --           1/8
Level 4c                8              --           --
Level 5a                8               4           --
Level 5b                8              --           1/8
Level 6a                8               8           --
Level 6b                8              --           1/4
Level 6c               16              --           --
Level 7a               16               4           --
Level 7b               16              --           1/8
Level 8a               16               8           --
Level 8b               16              --           1/4
Level 8c               32              --           --
Level 9a               16              16           --
Level 9b               16              --           1/2
Level 9c               32               4           --
Level 9d               32              --           1/8
Level 10a              32               8           --
Level 10b              32              --           1/4
Level 11a              32              16           --
Level 11b              32              --           1/2
Level 12a              32              32           --
Level 12b              32              --           1

Table 1: Levels of datapath complexity.

The space of designs is daunting due to the number of axes along which parameters can be varied. In order to make this study tractable, we have reduced the number of axes to three: datapath complexity, memory complexity (with and without cache), and array size. In order to do this, we have built upon the following assumptions:
• With some exceptions, increases in complexity of most individual design components improve performance monotonically, though there are some surprises.

• The performance improvement is not always linear: the effect of working set size in memory hierarchy measurement is one example.

• Some axes are independent, others are not. For example, memory hierarchy and array size are related, while the memory hierarchy and the datapath are independent.

• Some axes are related by an obvious relationship (e.g. array size and datapath performance), others have a non-linear relationship (e.g. memory hierarchy and array size).

We have several tasks: create an ordering and characterization of datapath designs, create an ordering and characterization of memory hierarchy designs, combine them with the array size, and combine all three.

Looking at the datapath alone is a non-trivial task. We have reduced the parameters of the datapath to the three that dominate performance: ALU/datapath width, multiplier/divider support, and floating point support.

Multiplier/divider support is given in terms of the size of the multiplier circuit, if any. This is an approximation since it does not include the divider or other ways of implementing multiplication. However, we believe it is reasonable since i) it is relatively common for complex PEs to contain multipliers but not dividers and, more importantly, ii) there are numerous good multiplication-based division algorithms. This is especially true for less complex datapaths, since small multipliers make a great deal of sense while small dividers do not.

Floating point support is given in terms of the number of PEs sharing a floating point unit. Clearly, this is not the way every SIMD array implements floating point, but it is certainly valid as a way of characterizing relative floating point support. We make the further assumption that floating point support includes fixed point multipliers/dividers and that these units are available for the integer operations.

We are left with the following parameters:

• ALU/datapath → 1, 2, 4, 8, 16, 32
• Multiplier → 4, 8, 16, 32
• FP support → 32 bit FP per 32, 16, 8, 4, 2, 1 PEs

We combine them into classes of similar complexity as shown in Table 1. As our studies of datapath complexity continue, we expect this list to be refined.

The two primary components of the memory hierarchy are the register file size, which determines the number of loads and stores, and the cache size, which determines the latency of the loads and stores. The register file size is varied simply along a range from 20 to 200 bytes per PE. The cache size is more problematic.

Although it is easy to vary the cache size in the same way we varied the register file size, it makes much less sense. One reason is that the cache is an off-chip system component and therefore not as closely involved in chip area trade-offs as the register file. Another reason is that it is unlikely that a system will be built with PE cache if that cache has not been designed to have a hit rate of over 90%. We therefore use two parallel rankings: register file size with no cache, and register file size with cache. The cache parameters required for this performance have been presented elsewhere [8].

5 Experimental Results

We have examined seven issues. Two involve the datapath, two memory, one memory with respect to datapath, and two overall cost-performance.

1. As more resources are applied to the datapath, what features should be given priority?

Performance was measured for each application-datapath-array size combination. Execution time was plotted versus datapath complexity with one graph per application and one line for each array size. The results are shown in Figure 6. The models referred to on the X axis are shown in Table 1. The key results are as follows:

• In the two floating point applications, having floating point support makes a huge difference.

• For the non-floating point applications, very little speed-up is achieved by increasing the datapath beyond 8 bits. In the floating point applications, 16 bit ALUs appear to be sufficient.

• Multipliers are very useful: this is especially apparent in Prewitt, where a four bit datapath with a four bit multiplier gives better performance than an 8 bit datapath which must execute multiplies in software.

2. How does increasing the array size affect the datapath component of the performance?

Results from experiment 1 establish the basic relationship between datapath and performance. Here we see the cost of VPE emulation to the datapath. The VPR is inversely proportional to the array size; in these experiments, 256x256 has a VPR of 1 and 64x64 has a VPR of 16. Recall that VPR physical instructions are required for each arithmetic or logical VPE instruction, but that increased overhead is required to emulate NEWS communication.

Execution time was plotted versus array size with one graph per application and one line for each datapath design (see Figure 7). We find that the overhead beyond simple emulation is significant: for the IU Benchmark, going from a VPR of 1 to 2 causes a slowdown of a factor of nearly 4. Succeeding jumps triple the slowdown. Spiral 3, which also has significant communication, has a correspondingly large overhead.
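For reference, the VPR values used in experiments 2 through 4 follow directly from the ratio of problem size to array size; the short sketch below assumes a 256x256-element problem, which is our inference from the statement that the 256x256 array has a VPR of 1.

```python
# VPR = virtual PEs / physical PEs.  A 256x256-element problem is assumed,
# since that is consistent with the 256x256 array having a VPR of 1.
problem_elements = 256 * 256
for rows, cols in [(256, 256), (128, 256), (128, 128), (64, 128), (64, 64)]:
    vpr = problem_elements // (rows * cols)
    print(f"{rows}x{cols}: VPR = {vpr}")
# -> VPRs of 1, 2, 4, 8, and 16 respectively.
```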
Re- valid as a way of characterizing relative floating point sup- call that VPR physical instructions are required for each port. We make the further assumption that floating point arithmetic or logical VPE instruction, but that increased support includes fixed point multipliers/dividers and that overhead is required to emulate NEWS communication. these units are available for the integer operations. Execution time was plotted versus array size with one We are left with the following parameters: graph per application and one line for each datapath design (see Figure 7). We find that the overhead beyond simple em- • ALU/datapath → 1, 2, 4, 8, 16, 32 ulation is significant: for the IU Benchmark, going from a • Multiplier → 4, 8, 16, 32 VPR of 1 to 2 causes a slowdown of a factor of nearly 4. Suc- • FP Support → 32 bit FP per 32, 16, 8, 4, 2, 1 PEs ceeding jumps triple the slowdown. Spiral3, which also has We combine them into classes of similar complexity as shown significant communication, has correspondingly large over- in Table 1. As our studies of datapath complexity continue, head. we expect this list to be refined. 3. For a given array size, how big should the register file The two primary components of memory hierarchy are size be? the register file size, which determines the number of loads Recall that for VPE emulation, the working set size is in- and stores, and the cache size, which determines the latency creased, but that this increase is non-linear due to instruc- of the loads and stores. The register file size is varied simply tion scheduling. Execution time was plotted against register along a range from 20 to 200 bytes per PE. The cache size file size with one graph per array size and one line per appli- is more problematic. cation. See Figure 8. The results are surprisingly consistent: Although it is easy to vary the cache size in the same for 256x256, 128x256, and 128x128 arrays, a register file size way we varied the register file size, it makes much less sense. of about 60 bytes seems to be indicated. For 64x128 arrays, One reason is that the cache is an off-chip system component a register file size of about 100 bytes appears appropriate. and therefore not as closely involved in chip area trade-offs For 64x64 arrays, the working set may be too large to fit in as the register file. Another reason is that it is unlikely that a reasonable sized register file and caching may be required. a system will be built with PE cache if that cache has not 4. How does each application’s working set vary with array been designed to have a hit rate of over 90%. We therefore size? Figure Figure datapath and

Figure 6: Experiment 1—execution time is plotted versus datapath complexity with one graph per application and one line per array size (see Table 1).

Figure 7: Experiment 2—execution time is plotted versus array size with one graph per application and one line for each datapath design.

Figure 8: Experiment 3—execution time was plotted against register file size in bytes with one graph per array size and one line per application.


Figure 9: Experiment 4—execution time is plotted against register file size in bytes with one graph per application and one line per array size.

4. How does each application's working set vary with array size?

These are the same results as in experiment 3, but plotted by application rather than array size. See Figure 9. Again, if VPE emulation were as simple as repeating each instruction VPR times, then we would expect the apparent working set of each application to rise proportionally with the VPR. However, that is not the case: rather, VPE emulation is carried out by having each physical PE emulate a VPE for as long as possible. The result is that, e.g., the working set size for the IU Benchmark appears to go roughly from 40 to 60 to 100 to 150 instead of doubling each time.
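As a quick check of that sub-linearity, the growth factors implied by the working-set figures quoted above can be compared against the doubling that naive replication would predict (a small sketch of ours, using only the numbers given in the text):

```python
# Working-set figures quoted above for the IU Benchmark at successive VPR
# doublings, versus the factor of two that naive per-instruction replication
# would predict.
working_sets = [40, 60, 100, 150]
for smaller, larger in zip(working_sets, working_sets[1:]):
    print(f"{smaller} -> {larger}: x{larger / smaller:.2f} (naive: x2.00)")
```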

5. How should resources be partitioned between memory and datapath to achieve system balance?

The assumption made here is that, to first order, good system design presupposes the removal of bottlenecks, or equivalently, good system balance. To find systems that meet these criteria, we examined performance versus both datapath complexity and register file size for each application and array size. The result was a set of 3D graphs with execution time on the Z axis and datapath complexity and register file size on the other two axes. From these graphs, points were extracted where memory references contributed 10% of the total cycles, the assumption being that in that case memory accesses would not be considered a bottleneck. For designs where memory references contribute significantly, enhancements such as PE cache might be considered.

These data are illustrated in a plot of register file size versus datapath model shown in Figure 10. For each level of datapath complexity in each application, only the best design was selected. Some observations are as follows:

• The expected result—that as VPR increases the importance of the memory hierarchy increases as well—is indeed borne out.

• Also as expected, as the datapath complexity increases, the amount of register file required to maintain system balance usually increases as well.

• In situations where increased datapath complexity does not significantly improve performance, the graphs are flat.

• The volatility in the Weymouth (and to a lesser extent Motion) graph appears to be a sensitivity issue: the surface is nearly flat and small performance differences can result in large movements in the isobars.

• Graphs where the larger arrays 'fall off the bottom' and smaller arrays 'burst out the top' (e.g., Spiral has both) indicate situations where cache is never or always needed, respectively.
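The balance criterion used to extract those points is easy to state in code. The sketch below is illustrative only (the cycle counts are invented, not measurements from the paper): for each datapath level it picks the smallest register file whose memory references account for no more than 10% of total cycles.

```python
def balanced_designs(results, threshold=0.10):
    """Select, per datapath level, the smallest register file for which
    memory references contribute no more than `threshold` of all cycles.

    results: dict mapping (datapath_level, regfile_bytes) to
             (memory_cycles, total_cycles) for one application and array size.
    """
    best = {}
    for (level, regfile), (mem, total) in sorted(results.items()):
        if mem / total <= threshold and level not in best:
            best[level] = regfile
    return best

# Illustrative numbers only.
sample = {
    ("L4a", 20):  (400_000, 1_000_000),
    ("L4a", 60):  ( 90_000, 1_000_000),
    ("L4a", 100): ( 50_000, 1_000_000),
    ("L6a", 60):  (200_000,   900_000),
    ("L6a", 100): ( 80_000,   900_000),
}
print(balanced_designs(sample))   # -> {'L4a': 60, 'L6a': 100}
```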
6. For each array size, which designs give the best performance per chip area?

For experiments 6 and 7 we have estimated PE area by synthesizing components such as register files, ALUs, and multipliers of various sizes. Since there were problems in synthesizing directly to the gate level, the area is given in terms of FPGA cells. Work continues here, and obviously these results should only be used for the grossest comparison.

Figure 12 shows execution time versus estimated chip area for various array sizes for the IU Benchmark. Only the most promising datapath-register file combinations are plotted. Figure 11 shows a plot where all possible datapath-register file combinations are included.

Figure 11: Experiment 6—shown is a plot of chip area versus IU Benchmark execution time for all 128x256 PE designs described here.

Array size   Datapath model   Reg. file size (bytes)   Est. chip area (000s of cells)   Execution time (000s of cycles)
64x64        4a                20      1417    67457
64x64        2b               100      3629    41181
64x64        4a               100      4039    34950
64x128       2b                60      4637    16703
128x128      2b                40      6652     6256
64x128       4a               100      8077     5701
128x128      2b                60      9273     3646
128x128      4a                80     13531     1892
128x256      4a                40     16580      814
128x256      5a                40     20054      722
128x256      5a               100     35783      553
256x256      5a                40     40108      165

Table 2: The most promising array designs in terms of the performance-chip area tradeoff for the IU Benchmark application. The data are shown graphically in Figure 13.

7. Normalizing for array size, which designs give the best performance per chip area?

Here the most promising designs in terms of cost-performance have been selected from Figure 12, normalized for array size, and plotted in Figure 13. The points are described further in Table 2. Since only the IU Benchmark has been plotted so far, none of the datapath selections contained floating point support.

6 Conclusion

We have run numerous experiments and derived results relating to the datapath, to support by the datapath of VPE emulation, to working set sizes and how they relate to VPE emulation, and to points in the memory hierarchy/datapath design space at which system balance is roughly achieved; and we have, for one application, determined the most promising designs in terms of the performance-chip area ratio. Further work continues in the automatic synthesis of these designs for the more accurate determination of actual hardware cost.

Figure 10: Experiment 5—each line represents the limit in register file size (in bytes) above which adding PE cache is not likely to significantly improve performance. One line is plotted per array size with one graph per application.

Figure 12: Experiment 6—shown are plots of chip area versus IU Benchmark execution time for selected PE designs of various array sizes.

Figure 13: Experiment 7—shown are plots of chip area (normalized for array size) versus IU Benchmark execution time for selected PE designs. The data points are described in Table 2.

References

[1] Blank, T. The MasPar MP-1 architecture. In Proc. 35th IEEE Comp. Conf. (1990), pp. 20–24.

[2] Burrill, J. H. The Class Library for the IUA: Tutorial. Amerinex Artificial Intelligence, Inc., Amherst, MA 01003, 1992.

[3] Herbordt, M. C. The Evaluation of Massively Parallel Array Architectures. PhD thesis, Dept. of Comp. Sci., U. of Mass. (also TR95-07), 1994.

[4] Herbordt, M. C., Burrill, J. H., and Weems, C. C. Making a data parallel language portable for massively parallel array computers. In Proc. of Computer Architectures for Machine Perception (1997).

[5] Herbordt, M. C., Corbett, J. C., Spalding, J., and Weems, C. C. Practical algorithms for online routing on fixed and reconfigurable meshes. J. Par. Dist. Comp. 20, 3 (1994), 341–356.

[6] Herbordt, M. C., Kidwai, O., and Weems, C. C. Preprototyping coprocessors using virtual machine emulation and trace compilation. Performance Evaluation Review 25, 1 (1997), 88–99.

[7] Herbordt, M. C., and Weems, C. C. An environment for evaluating architectures for spatially mapped computation: System architecture and initial results. In Proc. of Computer Architectures for Machine Perception (1993).

[8] Herbordt, M. C., and Weems, C. C. Experimental analysis of some SIMD array memory hierarchies. In Proc. of the 1995 Int. Conf. on Parallel Processing (1995), vol. 1: Architecture, pp. 210–214.

[9] Owens, R. M., Irwin, M. J., Nagendra, C., and Bajwa, R. S. Computer vision on the MGAP. In Proc. Comp. Arch. for Machine Perception (1993), pp. 337–341.

[10] Preston, K. The Abingdon Cross benchmark survey. IEEE Computer 22, 7 (1989), 9–18.

[11] Rosenfeld, A. A report on the DARPA Image Understanding Architectures Workshop. In Proc. Image Understanding Workshop (1987), pp. 298–301.

[12] Systems Performance Evaluation Cooperative. SPEC Newsletter: Benchmark Results. Waterside Associates, Fremont, CA, 1990.

[13] Weems, C. C., Riseman, E. M., Hanson, A. R., and Rosenfeld, A. The DARPA image understanding benchmark for parallel computers. JPDC 11 (1991).