COVER FEATURE

Instruction-Level Distributed Processing

Shifts in hardware and software technology will soon force designers to look at microarchitectures that process instruction streams in a highly distributed fashion.

James E. Smith, University of Wisconsin-Madison

For nearly 20 years, research has emphasized instruction-level parallelism, which improves performance by increasing the number of instructions executed in parallel. In striving for such parallelism, researchers have taken microarchitectures from pipelining to superscalar processing, pushing toward increasingly parallel processors. They have concentrated on wider instruction fetch, higher instruction issue rates, larger instruction windows, and increasing use of prediction and speculation. In short, researchers have exploited advances in chip technology to develop complex, hardware-intensive processors.

Benefiting from ever-increasing transistor budgets and taking a highly optimized, “big-compile” view of software, microarchitecture researchers made significant progress through the mid-1990s. More recently, though, researchers have seemingly reduced the problem to finding ways of consuming transistors, resulting in hardware-intensive, complex processors. The complexity is not just in critical path lengths and transistor counts: There is high intellectual complexity in the increasingly intricate schemes for squeezing performance out of second- and third-order effects.

Substantial shifts in hardware technology and software applications will lead to general-purpose microarchitectures composed of small, simple, interconnected processing elements, running at very high clock frequencies. A hidden layer of implementation-specific software—co-designed with the hardware—will help manage the distributed hardware resources to use power efficiently and to dynamically optimize executing threads based on observed instruction dependencies and communication.

In short, the current focus on instruction-level parallelism will shift to instruction-level distributed processing (ILDP), emphasizing interinstruction communication with dynamic optimization and a tight interaction between hardware and low-level software.

TECHNOLOGY SHIFTS
During the next two or three technology generations, architects will face several major challenges. On-chip wire delays are becoming critical, and power considerations will temper the availability of billions of transistors. Many important applications will be object-oriented and multithreaded and will consist of numerous separately compiled and dynamically linked parts.

Wire delays
Both short (local) and long (global) wires cause problems for designers. With local wires, the problem is congestion: For many complex, dense structures, transistor size does not determine area requirements—wiring does. Global wire delays will not scale as well as transistor delays, largely because shrinking wire sizes and limits on wire aspect ratios will cause resistance per unit length to increase at a faster rate than wiring distances shrink. Hierarchical wiring, with thicker long-distance wires, will certainly help, but it is unlikely that the number of wiring levels will increase fast enough to stay ahead of the problem.1 For example, in future 35-nanometer technology, the projected number of static RAM cells reachable in one clock cycle will be about half of that today.2

Of course, reachability is a problem with general logic, too. As logical structures become more complex and consume relatively more area, global delays will increase simply because structures are farther apart.
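A first-order model makes the scaling argument concrete. The sketch below (in Python, with normalized, illustrative values that are not taken from the article) uses the Elmore delay of a distributed RC line, 0.5·r·c·L², to compare a local wire whose length shrinks with the layout against a global wire that still spans the die.

    # First-order wire-delay scaling sketch (illustrative values, not from the article).
    # Elmore delay of a distributed RC line: t = 0.5 * r * c * L^2,
    # where r and c are resistance and capacitance per unit length.

    def wire_delay(r_per_mm, c_per_mm, length_mm):
        return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

    s = 0.7                      # one technology generation: linear dimensions scale by ~0.7
    r, c, L_local, L_global = 1.0, 1.0, 1.0, 20.0   # normalized baseline values

    # Local wire: length shrinks with the layout; resistance per mm rises as the
    # cross section shrinks (roughly 1/s^2 if both width and height scale down).
    local_now  = wire_delay(r, c, L_local)
    local_next = wire_delay(r / s**2, c, L_local * s)

    # Global wire: spans the same die size, so only the cross section shrinks.
    global_now  = wire_delay(r, c, L_global)
    global_next = wire_delay(r / s**2, c, L_global)

    print("local wire delay ratio :", local_next / local_now)    # ~1.0 (roughly unchanged)
    print("global wire delay ratio:", global_next / global_now)  # ~2.0 (worse each generation)

Under these assumptions the local wire's delay stays roughly flat while the global wire's delay roughly doubles each generation, even before faster transistors widen the gap further.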

Put another way, simple logic will likely improve performance directly by reducing critical paths, but also indirectly by reducing area and overall wire lengths. This is not a new idea: All of Seymour Cray’s designs benefited from this principle.

Power consumption
Dynamic power is proportional to the clock frequency, the transistor switching activity, and the supply voltage squared, so higher clock frequencies and transistor counts have made dynamic power an important design consideration today. Dependence on voltage level squared forces a countering trend toward lower voltages.

In the future, the power problem is likely to get much worse. To maintain high switching speeds with reduced voltage levels, developers must continue to lower transistor threshold voltages. Doing so makes transistors increasingly leaky: Current passes from source to drain even when the transistor is not switching. The resulting static power consumption will likely become dominant in the next two or three chip-technology generations.

There are few solutions to the static power problem. The design could selectively gate off the power supply to unused parts of the processor, but doing so is a relatively slow process that can create difficult-to-manage transient currents. Using fewer transistors, or at least fewer leaky ones, is about the only other option.
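The proportionality cited above is usually written P_dynamic ≈ a·C·V²·f, with a the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. A minimal sketch, with made-up normalized numbers, shows why voltage scaling must accompany frequency increases; note that it models dynamic power only, not the leakage discussed in the surrounding paragraphs.

    # Dynamic power sketch: P ~ activity * capacitance * V^2 * f (normalized units).
    def dynamic_power(activity, capacitance, vdd, freq):
        return activity * capacitance * vdd ** 2 * freq

    base = dynamic_power(activity=0.1, capacitance=1.0, vdd=1.8, freq=1.0)

    # Doubling frequency alone doubles dynamic power ...
    print(dynamic_power(0.1, 1.0, 1.8, 2.0) / base)   # 2.0

    # ... but doubling frequency while dropping Vdd from 1.8 to 1.3 roughly
    # cancels out, which is why voltage scaling accompanies each frequency step.
    print(dynamic_power(0.1, 1.0, 1.3, 2.0) / base)   # ~1.04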
Software issues
General-purpose computing has shifted emphasis to integer-oriented commercial applications, where irregular data is common and data movement is often more important than computation. Highly structured data remains important for multimedia applications that tend to use integers and low-precision floating-point data. Often, however, special processors or instruction-set extensions support these library-oriented applications.

Microarchitecture researchers typically assume a high level of compiler optimization, sometimes using profile-driven feedback, but in practice many application binaries are not highly optimized. Further, global compile-time optimization is incompatible with object-oriented programming and dynamic linking. Consequently, microarchitects face the challenge of providing high performance for irregular, difficult-to-predict applications, many of which have not been compiler optimized.

On-chip multithreading
Multithreading, which brings together microarchitecture and applications, has a lengthy tradition, primarily in very high end systems that have not enjoyed a broad software base. For example, multithreading became an integral component of large IBM mainframes and Cray supercomputers in the early 1980s. However, the widespread use of multithreading has been a chicken-and-egg problem that now appears nearly solved.

To continue the exponential complementary metal-oxide semiconductor (CMOS) performance curve, many chips now support multiple threads and others soon will. Multiple processors3 or wide superscalar hardware capable of supporting simultaneous multithreading (SMT) provide this support.4,5 Software that has many parallel, independent transactions—such as many Web servers—will take advantage of hardware-supported on-chip multithreading, as will many general-purpose applications.

INSTRUCTION-LEVEL DISTRIBUTED PROCESSING
The exploitation of technology advances and the accommodation of technology shifts have both fostered innovation. For example, the cache memory innovation helped avoid tremendous slowdowns by adapting to the widening gap in performance between processor and dynamic RAM (DRAM) technologies.

At this point, shifts in both technology and applications are driving microarchitecture innovation. Innovators will strive to maintain long-term performance improvement in the face of increasing on-chip wire delays, increasing power consumption, and irregular throughput-oriented applications that are incompatible with big, static, highly optimized compilations.

ILDP microarchitecture
The ILDP microarchitecture paradigm deals effectively with technology and application trends. A processor following this paradigm consists of several distributed functional units, each fairly simple and with a very high frequency clock, as Figure 1 shows.

Figure 1. An example of instruction-level distributed processing microarchitecture, consisting of an instruction fetch (IF) unit, integer units (IU), floating-point units (FU), and cache units (CU). The units communicate via point-to-point interconnections that will likely consume one or more clock cycles each.

Relatively long wire delays imply processor microarchitectures that explicitly account for interinstruction and intrainstruction communication. As much as possible, the microarchitecture should localize communication to small units while organizing the overall structure for communication. Communication among units will be point-to-point with delays measured in clock cycles. A significant part of the microarchitecture design effort will involve partitioning the processor to accommodate these delays. To keep the transistor counts low and the clock frequency high, the microarchitecture core will keep low-level speculation to a relative minimum. Determinism is inherently simpler than prediction and recovery.

With high intraprocessor-communication delays, the number of instructions the processor executes per cycle may level off or decrease, but developers can increase overall performance by running smaller distributed processing units at a much higher clock rate. The processor’s structure and clock speeds have implications for global clock distribution. The processor will likely contain multiple clock domains, possibly asynchronous from one another.

Clock speed
Developers show growing awareness that clock speed holds the key to increased performance. For the past several years, they have pushed instruction-level parallelism, making some gains, but now, with results diminishing, they are turning to higher clock speeds. It’s not a new approach: reduced-instruction-set computer (RISC) proponents have long debated the role of clock speed, Cray’s approach used a very fast clock, and the Intel processor evolution follows a trend toward faster clocks by using successively deeper pipelines. The original Pentium processor had a relatively short pipeline of five stages. This grew to 12 stages in the highly successful Pentium Pro/II/III series of processors, and the recent Pentium 4 uses 20 stages. In contrast to Intel processors, in which the pipelines have become extremely deep, the challenge will be to emphasize simplicity to keep pipelines shallow and efficient, even with a very fast clock.
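A rough way to see both why deeper pipelines raise clock frequency and why the returns diminish is the standard model in which each stage holds its share of the logic plus a fixed latch and clock-skew overhead. The numbers below are arbitrary and purely illustrative.

    # Clock period for an n-stage pipeline: total logic delay divided across stages,
    # plus a fixed per-stage latch/skew overhead (arbitrary normalized units).
    def clock_period(total_logic_delay, stages, latch_overhead):
        return total_logic_delay / stages + latch_overhead

    for stages in (5, 12, 20, 40):
        period = clock_period(total_logic_delay=100.0, stages=stages, latch_overhead=2.0)
        print(stages, "stages -> relative frequency", round(1.0 / period, 3))

The fixed per-stage overhead (along with misprediction penalties not modeled here) is part of the argument for shallow, simple pipelines rather than ever-deeper ones.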
Using a very fast clock in an ILDP computer tends to increase dynamic power consumption, but the processors’ very modular, distributed nature permits better microarchitecture-level power management. With replication of most units, clock gating can manage resource usage and dynamic power consumption. In particular, control logic can monitor computation resources and use or not use subsets of replicated units, depending on computation requirements.6

Regarding static power, a high-frequency clock uses fast, leaky transistors more effectively. If transistors consume power even when idle, keeping them busy with active work is preferable—which a fast clock does. In addition, the replicated distributed units make implementing selective power gating easier. Also, some units may be just as effective if built with slower transistors, especially if multiple parallel copies of the unit are available to provide throughput.

For on-chip multithreading, ILDP provides the enticing possibility of combining a chip multiprocessor with simultaneous multithreading, particularly by partitioning computation elements among threads. With simple replicated units, different subsets of units can be assigned to individual threads. As in SMT, the units share the processor as a whole, but as in a multiprocessor, any unit services only one thread at a time. The challenge for developers will be managing the threads and resources of such a fine-grained distributed system.

Dependence-based microarchitecture
An important ILDP processor class, clustered dependence-based architecture,7 organizes processing units into clusters and steers dependent instructions to the same cluster for processing, as in Compaq’s Alpha 21264 microarchitecture. This processor has two clusters and routes different instructions to each cluster at issue time, as Figure 2 shows. One cluster’s results require an additional clock cycle to route to the other. The 21264’s data dependencies tend to steer communicating instructions to the same cluster. A faster clock cycle compensates for an additional intercluster delay, leading to higher overall performance.

Figure 2. Alpha 21264 clustered microarchitecture. Instructions are steered to one of two processing clusters that have duplicated register files. Communicating the results produced in one cluster to the other cluster takes a full clock cycle.
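Dependence-based steering can be expressed compactly. The following sketch is not the 21264’s actual policy; it is a generic heuristic in the spirit of the clustered dependence-based designs of reference 7: send an instruction to the cluster that produces one of its source values, and fall back to the least-loaded cluster otherwise.

    # Dependence-based cluster steering sketch (a generic heuristic, not the exact
    # Alpha 21264 policy). Each instruction is (dest_reg, [src_regs]).
    def steer(instructions, num_clusters):
        producer = {}                       # register -> cluster that will produce it
        load = [0] * num_clusters           # instructions steered to each cluster so far
        assignment = []
        for dest, sources in instructions:
            clusters = [producer[s] for s in sources if s in producer]
            if clusters:
                target = clusters[0]        # follow the first source's producer
            else:
                target = load.index(min(load))   # no known producer: least-loaded cluster
            if dest is not None:
                producer[dest] = target
            load[target] += 1
            assignment.append(target)
        return assignment

    # Two independent dependence chains end up on different clusters.
    code = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]),     # chain 0
            ("r4", []), ("r5", ["r4"]), ("r6", ["r5"])]     # chain 1
    print(steer(code, num_clusters=2))      # [0, 0, 0, 1, 1, 1]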

Generally, developers can divide a dependence-based design into several clusters; within a cluster there can be further division of instruction processing, cache processing, integer processing, floating-point processing, and so on. The compiler can form groups of dependent instructions, or the processor can do it at various stages in the pipeline. If instruction decode logic can steer dependent instructions to the same cluster prior to issue, it simplifies instruction control logic within a cluster because instructions can be issued in first-in, first-out (FIFO) order. Meanwhile, instructions in different clusters can be issued out of order. Waiting until the issue stage, as in the 21264, slightly reduces interunit communication, but at the expense of more complex issue logic.

Heterogeneous ILDP
A heterogeneous ILDP model has outlying helper or service engines surrounding a simple core pipeline, as Figure 3 shows. Because these helper engines lie outside the critical processing path, their communication delays with respect to the main pipeline are noncritical; thus, they can even use slower transistors to reduce static power consumption.

Figure 3. A heterogeneous instruction-level distributed processing chip architecture. A simple processor core is surrounded by helper engines (preload, prebranch, preconstruction, preoptimization, and performance-collecting engines) that perform speculative tasks off the critical path and enhance overall performance.

Amir Roth and Guri Sohi have proposed a helper engine that preloads data by performing speculative pointer chasing.8 Glenn Reinman’s branch engine pre-executes instruction sequences to arrive at branch outcomes early.9 An even more advanced helper engine is Yuan Chou and John Shen’s instruction path coprocessor, which executes its own instruction stream to optimize instruction sequences.10

CO-DESIGNED VIRTUAL MACHINES
An ILDP computer clearly needs some type of higher-level management of the distributed instruction execution resources. This management includes routing instructions and data among the processor’s units and allocating distributed resources for multiple simultaneous threads and power management.

One option uses compiler-level software to analyze instruction interactions. At compile time, the software determines or predicts interinstruction dependencies and communication and encodes this information into machine-level instructions. At runtime, the hardware uses this information to steer instruction control and data information through the distributed processing elements.

Another option uses hardware to determine important instruction interactions, with hardware tables collecting dynamic history information as programs execute. Later, instructions can acquire steering control and data information by accessing this history information.
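One plausible shape for such a hardware history table, invented here purely for illustration (the article does not specify an organization), is a small structure indexed by instruction address that remembers the steering decision made on the previous execution.

    # Illustrative steering-history table (hypothetical organization; the article
    # does not specify one). Indexed by instruction address, it records the cluster
    # chosen on the previous execution so later fetches can reuse that decision.
    class SteeringHistory:
        def __init__(self, entries=1024):
            self.entries = entries
            self.table = {}                       # table index -> last cluster used

        def _index(self, pc):
            return pc % self.entries              # simple hash of the instruction address

        def lookup(self, pc):
            return self.table.get(self._index(pc))   # None means "no history yet"

        def update(self, pc, cluster):
            self.table[self._index(pc)] = cluster

    history = SteeringHistory()
    history.update(0x4000, cluster=1)
    print(history.lookup(0x4000))   # 1 -> steer this instruction to cluster 1 again
    print(history.lookup(0x4008))   # None -> fall back to a default steering rule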
Overall management of processor resources is another important consideration. For example, an ILDP will likely need resource load balancing for good performance—at both the instruction level and thread level. For power efficiency, gating off unused or unneeded resources requires usage analysis and coordination, especially if power gating is widespread across a chip.

The ILDP optimization and management functions can be done either by hardware or software. Although viable, software approaches based on conventional operating systems and compilers require recompilation and OS changes to fit each ILDP hardware platform—as well as intimate knowledge of the hardware implementation. The disadvantages of a hardware-based approach include the need for additional complex, power-consuming hardware and a limited scope for observing and collecting information related to the executing instruction stream.

A more radical solution uses currently evolving dynamic optimizing software and virtual-machine technologies. A co-designed virtual machine combines hardware and software to implement a virtual architecture. This combination provides hardware implementors with a software layer in a hidden portion of DRAM main memory, allowing relatively complex dynamic program analysis and implementation-dependent optimization. The Transmeta Crusoe processor11 and the IBM Daisy research project12 use this base technology primarily to support whole-system binary translation to very long instruction word (VLIW) sets. The IBM 390 processor uses a similar technology—millicode—to support execution of complex instructions.13

Figure 4 shows the virtual machine’s overall structure. The physical main memory space is divided into conventional architected memory and hidden memory that a virtual-machine monitor uses. The system places VMM code and data in hidden memory at boot time. The VMM manages ILDP resources and modifies instructions by adding information to guide instruction and data routing. The ILDP hardware can perform dynamic instruction scheduling, so the system does not require large-scale code optimization and software scheduling, unlike the VLIW-targeted implementations.

Figure 4. Supporting instruction-level distributed processing (ILDP) with a co-designed virtual machine. The virtual-machine monitor (VMM) resides in hidden memory and uses monitoring and configuration hardware to manage ILDP resources. Control can be transferred to the VMM code at any time. Dashed lines show the control transfers.
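The memory split and control transfers that Figure 4 depicts can be mirrored in a toy model. Everything in the sketch below (region sizes, field names, events) is invented for illustration; it only traces the sequence the following paragraphs describe: hardware saves the program counter, jumps into VMM code held in hidden memory, and later resumes the interrupted program.

    # Conceptual sketch of a co-designed VM's memory split and control transfers.
    # All names and sizes here are invented for illustration.
    PHYS_MEM_SIZE    = 1 << 30          # total physical memory
    HIDDEN_MEM_SIZE  = 64 << 20         # region reserved for the VMM at boot
    ARCHITECTED_SIZE = PHYS_MEM_SIZE - HIDDEN_MEM_SIZE   # what the OS is allowed to see

    def invoke_vmm(cpu_state, event):
        """Hardware saves the program counter, then jumps to VMM code in hidden memory."""
        saved_pc = cpu_state["pc"]
        cpu_state["pc"] = cpu_state["vmm_entry"]          # pointer into hidden memory
        # ... VMM runs here: records resource usage, adjusts steering hints, reconfigures ...
        cpu_state["pc"] = saved_pc                        # return to the interrupted program
        return event                                       # e.g. "timer", "trap", "syscall"

    cpu = {"pc": 0x1000, "vmm_entry": ARCHITECTED_SIZE}   # VMM entry sits above visible memory
    print(invoke_vmm(cpu, "timer"), hex(cpu["pc"]))       # timer 0x1000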


At certain times during normal program execution, hardware can save the current program counter value and load it with a pointer to the VMM. Developers can design the hardware to invoke the VMM in this manner for selected instruction types, program traps, or system calls and returns, and a VMM-controlled timer can initiate periodic traps to the VMM.

After it takes control, the VMM saves the architected state and uses the processor in the normal fashion, fetching instructions from hidden memory and accessing data from its space in hidden memory. It can use this information to maintain a history of resource usage and reconfigure hardware or modify instruction and data flow. When finished, the VMM returns control to the interrupted program, and hardware resumes normal program execution. Meanwhile, the system uses a conventional off-the-shelf operating system and application software.

INSTRUCTION SETS
Historically, designers have made major changes to instruction set architectures in discrete steps. After the original mainframes, industry-standard architectures (ISAs) more or less stabilized until the early 1970s, when minicomputers appeared. These machines used relatively inexpensive packaging and interconnections, giving developers an opportunity to rethink ISAs. Based on lessons learned from relatively irregular mainframe instruction sets, developers aimed toward regularity and orthogonality. To incorporate these properties, they developed minicomputer ISAs—such as the VAX-11 instruction set—that supported relatively powerful, variable-length instructions.

As microprocessors evolved toward general-purpose computing platforms, developers again rethought ISAs, this time with hardware simplicity as a goal. The resulting RISC instruction sets allowed a fully pipelined processor implementation to fit on a single chip. Because transistor densities have increased, hardware can now dynamically translate older, more complex microprocessor ISAs into RISC-like operations.

In retrospect, it seems that each change in packaging and technology—mainframes to minicomputers to microprocessors—has initiated instruction-set innovation. Perhaps we should take another serious look at ISAs—this time motivated by on-chip communication delays, high-speed clocks, and advances in translation and virtual-machine software.

We can and should optimize ISAs for ILDP, focusing on communication and dependence. To reduce delays, we should make sure the ISA easily expresses interinstruction communication in a natural way. Architects should optimize ISAs for very fast execution, with emphasis on small, fast memory structures, including caches and registers. In contrast, most recent ISAs, including RISC and VLIW sets, have emphasized computation and independence—for example, by grouping independent instructions in close proximity.

Although maintaining compatibility with legacy program binaries tends to inhibit the development of new ISAs, virtual-machine technology and binary translation enable new implementation-level ISAs. Here, the VMM can translate and cache binaries in hidden memory—basically the Transmeta Crusoe and IBM Daisy paradigm, but applied to ILDP rather than a VLIW set. In this case, translations can be kept relatively simple, with hardware and software sharing the burden of instruction scheduling and data routing.

To make ILDP instruction set concepts less abstract, consider an ISA that incorporates a register file hierarchy by using

• a small, fast register file for local communication within a cluster of processing elements, and
• a larger global register file for intercluster communication.

This ISA uses variable-length instructions to provide smaller program footprints and smaller caches. As an extreme example, consider an ISA with 64 general-purpose registers and a single accumulator for performing operations. All operations must involve the accumulator, so both dependent operations and local-value communication are explicitly apparent. Such an ISA needs only one general-purpose register field per instruction, so it can be quite compact, with instructions 1, 2, or 4 bytes in length. In the basic instruction types shown in Table 1, the first two instructions copy data to and from a register; A is the single accumulator, and R is a general-purpose register. The next two instructions depict operations on data held in the accumulator and general register file. The last three instructions indicate loads and stores.

Table 1. Basic instruction types for an accumulator-based ISA.

    Instruction          Size (bytes)
    R <- A               1
    A <- R               1
    A <- A op R          2
    A <- A op imm        4
    A <- M(R op imm)     4
    M(R op imm) <- A     4
    R <- M(A op imm)     4
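Because every operation involves the accumulator, a decoder can recognize both the instruction formats of Table 1 and the chain structure directly from the opcode. The sketch below is only illustrative; in particular, the chain-boundary rule (a new chain begins whenever the accumulator is written without being read) is one plausible choice, not something the article prescribes.

    # Sketch of accumulator-chain detection for the Table 1 ISA. The chain-boundary
    # rule here (a chain starts whenever A is written without being read) is one
    # plausible decode-time heuristic, not something the article prescribes.
    SIZES = {"R<-A": 1, "A<-R": 1, "A<-A op R": 2, "A<-A op imm": 4,
             "A<-M(R op imm)": 4, "M(R op imm)<-A": 4, "R<-M(A op imm)": 4}

    READS_A = {"R<-A", "A<-A op R", "A<-A op imm", "M(R op imm)<-A", "R<-M(A op imm)"}

    def split_chains(instructions):
        chains, current = [], []
        for op in instructions:
            if op not in READS_A and current:      # A is overwritten: a new chain begins
                chains.append(current)
                current = []
            current.append(op)
        if current:
            chains.append(current)
        return chains

    program = ["A<-M(R op imm)", "A<-A op R", "A<-A op imm", "R<-A",   # chain 0
               "A<-R", "A<-A op R", "M(R op imm)<-A"]                  # chain 1
    for i, chain in enumerate(split_chains(program)):
        print("chain", i, chain, "bytes:", sum(SIZES[op] for op in chain))

Each detected chain could then be steered to its own cluster, as the next paragraph describes.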
With an ISA of this type, dependent instructions naturally chain together via the accumulator and are contiguous in the instruction stream. With a clustered ILDP implementation, the instruction fetch/decode hardware can steer all instructions in a dependent chain to the same cluster, steering the next dependent chain to another cluster. If the hardware renames the accumulator within each cluster, the processor can exploit parallelism among dependence chains, with the general registers handling global communication. The instruction queues in each cluster simply issue instructions in FIFO order, and fast local data communication is through the accumulator.

The single-accumulator ISA implicitly specifies communication and dependence information, although whether it is better to create a new ISA or simply append hint bits containing similar information to an existing one remains unclear.

Processor design, as with most complex engineering problems, is an ongoing process of reinventing, borrowing, and adapting—with a little innovation thrown in from time to time. New applications and evolving hardware technology are constantly changing the mix of techniques that lead to optimal engineering solutions.

Performance through simplicity to achieve a fast clock rate is a reinvention of the Cray designs. But how should this be done with modern very high density CMOS technologies and nonnumeric applications? Developers can borrow distributed systems methods and apply them at the processor level to solve load balance, resource allocation, and communication problems. But how can they effectively apply these methods when the processing units are as small as individual instructions? Virtual-machine methods have been around in one form or another for at least 30 years and can now be adapted to form hidden “micro-operating systems” for managing the small distributed systems that processors are becoming. But how should developers divide functions (and complexity) between hardware and VMM software to optimize performance or power saving? Will combining new instruction sets with dynamic translation provide significant benefits? These and many other important and interesting research issues remain to be explored as the processor architecture evolves toward instruction-level distributed processing. ✸

Acknowledgments
I thank the National Science Foundation (grant CCR-9900610), IBM, Sun Microsystems, and Intel for supporting this work.

References
1. T.N. Theis, “The Future of Interconnection Technology,” IBM J. Research and Development, vol. 44, no. 3, 2000, pp. 379-390.
2. V. Agarwal et al., “Clock Rate versus IPC: The End of the Road for Conventional Microprocessors,” Proc. 27th Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 2000, pp. 248-259.
3. L. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. 27th Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 2000, pp. 282-293.
4. R. Eickemeyer et al., “Evaluation of Multithreaded Uniprocessors for Commercial Application Environments,” Proc. 23rd Ann. Int’l Symp. Computer Architecture, IEEE Press, Piscataway, N.J., 1996, pp. 203-212.
5. K. Diefendorff, “Compaq Chooses SMT for Alpha,” Microprocessor Report, 6 Dec. 1999, pp. 1, 6-11.
6. D.H. Albonesi, “The Inherent Energy Efficiency of Complexity-Adaptive Processors,” Proc. 1998 Power-Driven Microarchitecture Workshop, IEEE Press, Piscataway, N.J., 1998, pp. 107-112.
7. S. Palacharla, N. Jouppi, and J.E. Smith, “Complexity-Effective Superscalar Processors,” Proc. 24th Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 1997, pp. 206-218.
8. A. Roth and G. Sohi, “Effective Jump-Pointer Prefetching for Linked Data Structures,” Proc. 26th Ann. Int’l Symp. Computer Architecture, IEEE Press, Piscataway, N.J., 1999, pp. 111-121.
9. G. Reinman, T. Austin, and B. Calder, “A Scalable Front-End Architecture for Fast Instruction Delivery,” Proc. 26th Ann. Int’l Symp. Computer Architecture, IEEE Press, Piscataway, N.J., 1999, pp. 234-245.
10. Y. Chou and J.P. Shen, “Instruction Path Coprocessors,” Proc. 27th Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 2000, pp. 270-281.
11. A. Klaiber, “The Technology Behind Crusoe Processors,” tech. brief, Transmeta, Santa Clara, Calif., 2000; http://www.transmeta.com/crusoe/download/pdf/crusoetechwp.pdf.
12. K. Ebcioglu and E.R. Altman, “DAISY: Dynamic Compilation for 100% Architecture Compatibility,” Proc. 24th Ann. Int’l Symp. Computer Architecture, ACM Press, New York, 1997, pp. 26-37.
13. C.F. Webb and J.S. Liptay, “A High-Frequency Custom CMOS S/390 Microprocessor,” IBM J. Research and Development, July 1997, pp. 463-474.

James E. Smith is a professor in the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison. His research interests include high-performance processors and virtual-machine architectures. He is currently on sabbatical at the IBM T.J. Watson Research Center in Yorktown Heights, New York. Contact him at [email protected].
