Instruction-Level Distributed Processing
COVER FEATURE

Shifts in hardware and software technology will soon force designers to look at microarchitectures that process instruction streams in a highly distributed fashion.

James E. Smith, University of Wisconsin-Madison

For nearly 20 years, microarchitecture research has emphasized instruction-level parallelism, which improves performance by increasing the number of instructions per cycle. In striving for such parallelism, researchers have taken microarchitectures from pipelining to superscalar processing, pushing toward increasingly parallel processors. They have concentrated on wider instruction fetch, higher instruction issue rates, larger instruction windows, and increasing use of prediction and speculation. In short, researchers have exploited advances in chip technology to develop complex, hardware-intensive processors.

Benefiting from ever-increasing transistor budgets and taking a highly optimized, "big-compile" view of software, microarchitecture researchers made significant progress through the mid-1990s. More recently, though, researchers have seemingly reduced the problem to finding ways of consuming transistors, resulting in hardware-intensive, complex processors. The complexity is not just in critical path lengths and transistor counts: There is high intellectual complexity in the increasingly intricate schemes for squeezing performance out of second- and third-order effects.

Substantial shifts in hardware technology and software applications will lead to general-purpose microarchitectures composed of small, simple, interconnected processing elements, running at very high clock frequencies. A hidden layer of implementation-specific software—co-designed with the hardware—will help manage the distributed hardware resources to use power efficiently and to dynamically optimize executing threads based on observed instruction dependencies and communication.

In short, the current focus on instruction-level parallelism will shift to instruction-level distributed processing (ILDP), emphasizing interinstruction communication with dynamic optimization and a tight interaction between hardware and low-level software.

TECHNOLOGY SHIFTS

During the next two or three technology generations, processor architects will face several major challenges. On-chip wire delays are becoming critical, and power considerations will temper the availability of billions of transistors. Many important applications will be object-oriented and multithreaded and will consist of numerous separately compiled and dynamically linked parts.

Wire delays

Both short (local) and long (global) wires cause problems for microprocessor designers. With local wires, the problem is congestion: For many complex, dense structures, transistor size does not determine area requirements—wiring does. Global wire delays will not scale as well as transistor delays, largely because shrinking wire sizes and limits on wire aspect ratios will cause resistance per unit length to increase at a faster rate than wiring distances shrink. Hierarchical wiring, with thicker long-distance wires, will certainly help, but it is unlikely that the number of wiring levels will increase fast enough to stay ahead of the problem.1 For example, in future 35-nanometer technology, the projected number of static RAM cells reachable in one clock cycle will be about half of that today.2

Of course, reachability is a problem with general logic, too. As logical structures become more complex and consume relatively more area, global delays will increase simply because structures are farther apart. Put another way, simple logic will likely improve performance directly by reducing critical paths, but also indirectly by reducing area and overall wire lengths. This is not a new idea: All of Seymour Cray's designs benefited from this principle.

Power consumption

Dynamic power is proportional to the clock frequency, the transistor switching activity, and the supply voltage squared, so higher clock frequencies and transistor counts have made dynamic power an important design consideration today. Dependence on voltage level squared forces a countering trend toward lower voltages.
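The dependence just described is the standard first-order expression for CMOS dynamic power. As a sketch, using α for the switching-activity factor and C for the switched capacitance (symbols introduced here for illustration, not taken from the article):

\[ P_{\mathrm{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f \]

At a fixed frequency and activity level, dropping the supply voltage from, say, 2.0 V to 1.4 V roughly halves dynamic power, which is why the voltage-squared term forces the countering trend toward lower voltages.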
In the future, the power problem is likely to get much worse. To maintain high switching speeds with reduced voltage levels, developers must continue to lower transistor threshold voltages. Doing so makes transistors increasingly leaky: Current passes from source to drain even when the transistor is not switching. The resulting static power consumption will likely become dominant in the next two or three chip-technology generations.

There are few solutions to the static power problem. The design could selectively gate off the power supply to unused parts of the processor, but doing so is a relatively slow process that can create difficult-to-manage transient currents. Using fewer transistors, or at least fewer leaky ones, is about the only other option.

Software issues

General-purpose computing has shifted emphasis to integer-oriented commercial applications, where irregular data is common and data movement is often more important than computation. Highly structured data remains important for multimedia applications that tend to use integers and low-precision floating-point data. Often, however, special processors or instruction-set extensions support these library-oriented applications.

Microarchitecture researchers typically assume a high level of compiler optimization, sometimes using profile-driven feedback, but in practice many application binaries are not highly optimized. Further, global compile-time optimization is incompatible with object-oriented programming and dynamic linking. Consequently, microarchitects face the challenge of providing high performance for irregular, difficult-to-predict applications, many of which have not been compiler optimized.

On-chip multithreading

Multithreading, which brings together microarchitecture and applications, has a lengthy tradition, primarily in very high end systems that have not enjoyed a broad software base. For example, multiprocessing became an integral component of large IBM mainframes and Cray supercomputers in the early 1980s. However, the widespread use of multithreading has been a chicken-and-egg problem that now appears nearly solved.

To continue the exponential complementary metal-oxide semiconductor (CMOS) performance curve, many chips now support multiple threads and others soon will. Multiple processors3 or wide superscalar hardware capable of supporting simultaneous multithreading (SMT) provide this support.4,5 Software that has many parallel, independent transactions—such as many Web servers—will take advantage of hardware-supported on-chip multithreading, as will many general-purpose applications.
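To make concrete what "many parallel, independent transactions" looks like from the software side, here is a minimal Python sketch. The transaction handler and its workload are hypothetical placeholders; the sketch illustrates the structure of such software, not any particular server or SMT implementation.

from concurrent.futures import ThreadPoolExecutor

def handle_transaction(request_id: int) -> str:
    # Stand-in for one independent unit of work (for example, serving one
    # Web request); it shares no state with any other transaction.
    return f"response-{request_id}"

if __name__ == "__main__":
    # Each worker thread can be scheduled onto a hardware thread context
    # (an SMT context or a separate on-chip processor), letting the
    # hardware overlap many independent transactions.
    with ThreadPoolExecutor(max_workers=8) as pool:
        responses = list(pool.map(handle_transaction, range(100)))
    print(len(responses), "independent transactions completed")

Because the transactions share no state, the hardware is free to run as many of them concurrently as it has thread contexts, which is exactly the property on-chip multithreading exploits.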
INSTRUCTION-LEVEL DISTRIBUTED PROCESSING

The exploitation of technology advances and the accommodation of technology shifts have both fostered computer architecture innovation. For example, the cache memory innovation helped avoid tremendous slowdowns by adapting to the widening gap in performance between processor and dynamic RAM (DRAM) technologies.

At this point, shifts in both technology and applications are driving microarchitecture innovation. Innovators will strive to maintain long-term performance improvement in the face of increasing on-chip wire delays, increasing power consumption, and irregular throughput-oriented applications that are incompatible with big, static, highly optimized compilations.

ILDP microarchitecture

The ILDP microarchitecture paradigm deals effectively with technology and application trends. A processor following this paradigm consists of several distributed functional units, each fairly simple and with a very high frequency clock, as Figure 1 shows.

Figure 1. An example of an instruction-level distributed processing microarchitecture, consisting of an instruction fetch (IF) unit, integer units (IU), floating-point units (FU), and cache units (CU). The units communicate via point-to-point interconnections that will likely consume one or more clock cycles each.

Relatively long wire delays imply processor microarchitectures that explicitly account for interinstruction and intrainstruction communication. As much as possible, the microarchitecture should localize communication to small units while organizing the overall structure for communication. Communication among units will be point-to-point with delays measured in clock cycles. A significant part of the microarchitecture design effort will involve partitioning the processor to accommodate these delays. To keep the transistor counts low and the clock frequency high, the microarchitecture core will keep low-level speculation to a relative minimum. Determinism is inherently simpler than prediction and recovery.

With high intraprocessor-communication delays, the number of instructions the processor executes per cycle may level off or decrease, but developers can increase overall performance by running smaller dis-

Figure 2. Alpha 21264 clustered microarchitecture. Instructions are steered to one of two processing clusters that have duplicated register files. Communicating the results produced in one cluster to the other cluster takes a full clock cycle.
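To make the cost that Figure 2 describes concrete, the following Python sketch is a toy model, not the 21264's actual steering logic: each instruction is placed on one of two clusters, and every source value produced on the other cluster is charged one extra cycle. The steering policy (follow the first source operand, round-robin for independent instructions) and the register-name program are illustrative assumptions.

from typing import NamedTuple, List, Tuple

class Instr(NamedTuple):
    dest: str                  # register written by this instruction
    srcs: Tuple[str, ...]      # registers read by this instruction

def steer(program: List[Instr]) -> Tuple[List[Tuple[Instr, int]], int]:
    home = {}                  # register name -> cluster that produced it
    placements, transfers, rr = [], 0, 0
    for ins in program:
        src_clusters = [home[r] for r in ins.srcs if r in home]
        if src_clusters:
            cluster = src_clusters[0]      # follow the dependence
        else:
            cluster, rr = rr, 1 - rr       # balance independent work
        # each source produced on the other cluster pays one extra cycle
        transfers += sum(1 for c in src_clusters if c != cluster)
        home[ins.dest] = cluster
        placements.append((ins, cluster))
    return placements, transfers

if __name__ == "__main__":
    program = [
        Instr("r1", ()),            # independent producer -> cluster 0
        Instr("r2", ()),            # independent producer -> cluster 1
        Instr("r3", ("r1", "r2")),  # joins both chains: one remote source
    ]
    placed, transfers = steer(program)
    for ins, cluster in placed:
        print(f"{ins.dest} executes on cluster {cluster}")
    print("extra inter-cluster transfer cycles:", transfers)

Dependence-based steering of this kind tries to keep a value on the cluster that will consume it, which is the emphasis on interinstruction communication that ILDP generalizes beyond two clusters.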