Modeling Instruction Placement on a Spatial Architecture

Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan J. Eggers
Computer Science & Engineering, University of Washington, Seattle, WA USA
{mercaldi,swanson,petersen,aputnam,schwerin,oskin,eggers}@cs.washington.edu

ABSTRACT

In response to current technology scaling trends, architects are developing a new style of processor, known as spatial computers. A spatial computer is composed of hundreds or even thousands of simple, replicated processing elements (or PEs), frequently organized into a grid. Several current spatial computers, such as TRIPS, RAW, SmartMemories, nanoFabrics and WaveScalar, explicitly place a program’s instructions onto the grid.

Designing instruction placement algorithms is an enormous challenge, as there are an exponential (in the size of the application) number of different mappings of instructions to PEs, and the choice of mapping greatly affects program performance. In this paper we develop an instruction placement performance model which can inform instruction placement. The model comprises three components, each of which captures a different aspect of spatial computing performance: inter-instruction operand latency, data cache coherence overhead, and contention for processing element resources. We evaluate the model on one spatial computer, WaveScalar, and find that predicted and actual performance correlate with a coefficient of −0.90. We demonstrate the model’s utility by using it to design a new placement algorithm, which outperforms our previous algorithms. Although developed in the context of WaveScalar, the model can serve as a foundation for tuning code, compiling software, and understanding the microarchitectural trade-offs of spatial computers in general.

Categories and Subject Descriptors

I.6.5 [Computing Methodologies]: Simulation and Modeling - Model Development; B.8.2 [Hardware]: Performance and Reliability - Performance Analysis and Design Aids

General Terms

Experimentation, Measurement, Performance

Keywords

dataflow, instruction placement, spatial computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SPAA’06, July 30–August 2, 2006, Cambridge, Massachusetts, USA.
Copyright 2006 ACM 1-59593-452-9/06/0007...$5.00.

1. INTRODUCTION

Today’s manufacturing technologies provide an enormous quantity of computational resources. Computer architects are currently exploring how to convert these resources into improvements in application performance. Despite significant differences in execution models and underlying process technology, five recently proposed architectures - nanoFabrics [18], TRIPS [34], RAW [23], SmartMemories [26], and WaveScalar [39] - share the task of mapping large portions of an application’s binary onto a collection of processing elements. Once mapped, the instructions execute “in place”, explicitly sending data between the processing elements. Researchers call this form of computation distributed ILP [34, 23, 39] or spatial computing [18].

Good instruction placement is critical to spatial computing performance. Our research on WaveScalar indicates that a poor placement can decrease performance by as much as a factor of five. Finding a good placement is hard, because there are an exponential (in the size of the application) number of possible mappings. How can developers, compiler writers, or microarchitects identify the ones that will execute quickly? Searching this enormous space requires a solid understanding of how instruction placement influences performance. In this paper we develop a model of placement performance to study this issue.

To develop the model, we focus on a particular spatial computer, WaveScalar. To accurately predict instruction placement performance, we construct a unified model that considers several factors that contribute to overall performance. Our model comprises three separate components, each of which captures a different aspect of spatial computation: inter-instruction operand latency, data cache coherence overhead, and contention for processing element resources. Our unified model combines these components in proportion to their relative contribution to overall performance.

The model estimates performance using three inputs: (1) the placement in question, i.e., a mapping of instructions in the application to processing elements, (2) a profile of application execution behavior, and (3) the spatial computer’s microarchitectural configuration and timing parameters. These inputs are common to all spatial computers, which will allow this approach to generalize beyond WaveScalar.

The paper first develops a model of each component of placement performance in isolation. Using a variety of applications and potential placements, we evaluate each of these component models, using specially configured versions of the WaveScalar microarchitectural simulator. Each configuration accurately simulates the hardware resources of the component in question but idealizes all other resources. We validate each component model by showing that it correlates with its component-isolating simulation.
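The combination and validation steps described above can be illustrated with a small sketch. It assumes, hypothetically, that the unified model is a weighted sum of three per-component costs (lower is better) and that the simulator reports IPC for each placement. The names `unified_cost` and `pearson`, the weights, and every number below are illustrative assumptions and are not taken from the paper. Under these assumptions, a useful model would yield a strongly negative correlation between predicted cost and simulated IPC.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def unified_cost(components, weights):
    """Weighted sum of per-component costs (an assumed form of the combination)."""
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical weights, one per component named in the text.
weights = {"operand_latency": 0.3, "coherence": 0.1, "pe_contention": 0.6}

# Hypothetical per-placement component costs for four candidate placements.
placements = [
    {"operand_latency": 2.0, "coherence": 1.0, "pe_contention": 1.0},
    {"operand_latency": 4.0, "coherence": 1.0, "pe_contention": 2.0},
    {"operand_latency": 3.0, "coherence": 2.0, "pe_contention": 4.0},
    {"operand_latency": 6.0, "coherence": 3.0, "pe_contention": 5.0},
]

# Hypothetical simulated IPC for the same four placements.
simulated_ipc = [2.0, 1.5, 1.1, 0.6]

predicted = [unified_cost(p, weights) for p in placements]
r = pearson(predicted, simulated_ipc)
print(f"correlation between predicted cost and simulated IPC: {r:.2f}")
```

In this toy data the predicted cost rises as IPC falls, so the coefficient is negative; with real profiles and simulator output the same comparison would indicate how well a candidate model tracks measured performance.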
We then combine these component models to produce a single unified model of placement performance on WaveScalar. The unified model predicts the effect of instruction placement when all microarchitectural resources are accurately simulated. The combined model produces performance predictions that correlate to simulation performance with a coefficient of −0.90.

To evaluate our model’s predictive power on applications that are not part of our workload, we use a standard machine learning evaluation technique in which we partition our data points into training and test sets. We derive a model from each of the training sets, and evaluate its predictive capability on its corresponding test set. Evaluated in this way, our model’s predicted layout performance correlates to actual performance with a coefficient of −0.82.

The model indicates that PE resource constraints have the greatest effect on placement performance on WaveScalar, followed by inter-instruction operand latency, and finally by cache coherence overhead. These results are useful in several ways. For example, the model provides a quickly calculable objective function that an optimizer could minimize to find an application mapping that maximizes IPC. One could also use the model to design an instruction placement algorithm which is based on the factors that are most important to performance. In Section 6 we do just this and develop an improved placement algorithm by combining two existing algorithms that optimize for the two most important components of placement performance, as dictated by the model. A third strategy is to use the model to guide microarchitectural optimizations or to make the microarchitecture less placement-sensitive.

In the following section we provide an overview of the salient features of WaveScalar. In Section 3 we present the methodology used to develop and validate our placement performance model. Section 4 explains and validates each of the individual components, and Section 5 combines them into a unified model. Section 6 describes an improved instruction placement algorithm we developed that is based on this model. Section 7 explores related work on performance modeling, layout of computation, and spatial computers. Finally in Section 8, we draw our conclusions and

[Figure 1: The WaveScalar Processor: The hierarchical organization of the WaveScalar processor. Diagram labels: PE, Pod, Domain, Cluster, Network, D$, SB, L2.]

tion which computes a value and sends it to the instructions that consume it. Instructions execute after all input operand values have arrived, according to a principle known as the dataflow firing rule [12, 11].

WaveScalar supports a memory model which commits memory accesses in program order. Equipped with architectural building blocks, called waves, which globally order pieces of the control flow graph, and an architectural mechanism, called wave-ordered memory, which orders memory operations within a wave, WaveScalar enforces the correct, global ordering of a thread’s memory operations. This enables it to execute applications written in imperative languages, such as C or C++. Other work describes the details of this mechanism [39].

Microarchitecture: Conceptually, each static instruction in a WaveScalar program executes in a separate processing element (PE). Building a PE for each static instruction is both impossible and wasteful, so, in practice, WaveScalar dynamically binds multiple instructions to a fixed number of PEs, and swaps them in and out on demand.

The WaveScalar processor is a grid of simple processing elements. Each PE has five pipeline stages and contains a functional unit, specialized memories to hold operands, and logic to control instruction execution and communication. Each PE also contains buffering and storage for several different static instructions, although only one can execute in any given cycle. PEs determine locally when their instructions can execute, contributing to the scalability of the WaveScalar processor design.

To reduce communication costs within
