Baring It All to Software: Raw Machines
Total Page:16
File Type:pdf, Size:1020Kb
Baring It All to Software: Theme Feature Raw Machines This innovative approach eliminates the traditional instruction set interface and instead exposes the details of a simple replicated architecture directly to the compiler. This allows the compiler to customize the hardware to each application. Elliot s our industry develops the technology that operations uniquely suited to a particular application. Waingold will permit a billion transistors on a chip, Together, these features will enable high switching computer architects must face three converg- speeds and dramatically simplify hardware design and Michael ing forces: the need to keep internal chip wires verification tasks. Taylor A short so that clock speed scales with feature Devabhaktuni size; the economic constraints of quickly verifying new Small replicated tiles Srikrishna designs; and changing application workloads that As Figure 1 shows, a Raw machine is made up of a emphasize stream-based multimedia computations. set of interconnected tiles. Each tile contains a simple, Vivek Sarkar One approach is to rely on a simple, highly parallel RISC-like pipeline and is interconnected with other Walter Lee VLSI architecture that fully exposes the hardware tiles over a pipelined, point-to-point network. Having architecture’s low-level details to the compiler. This many distributed registers eliminates the small-regis- Victor Lee allows the compiler—or, more generally, the soft- ter name-space problem, allowing a greater degree of Jang Kim ware—to determine and implement the best resource instruction-level parallelism (ILP). allocation for each application. We call systems based Static RAM (SRAM) distributed across the tiles Matthew on this approach Raw architectures because they eliminates the memory bandwidth bottleneck and pro- Frank implement only a minimal set of mechanisms in hard- vides significantly shorter latency to each memory Peter Finch ware. Raw machines require only short wires, are module. The distributed architecture also allows mul- much simpler to design than today’s superscalars, and tiple high-bandwidth paths to external Rambus-like Rajeev Barua support efficient pipelined parallelism for multimedia DRAM—as many as packaging technology will per- Jonathan applications. mit. A typical Raw system might include a Raw micro- Babb processor coupled with off-chip memory and RAW ARCHITECTURE stream-IO devices. The amount of memory is chosen Saman Our general philosophy is to build an architecture to roughly balance the areas devoted to processing and Amarasinghe based on replicating a simple tile, each with its own memory and to match the memory-access time to the Anant instruction stream. As Figure 1 shows, a Raw micro- processor clock. The compiler implements higher level Agarwal processor is a set of interconnected tiles, each of which abstractions like caching, global shared memory, and contains instruction and data memories, an arithmetic memory protection.1 Software handles dynamic events Massachusetts logic unit, registers, configurable logic, and a pro- like cache misses. Institute of Technology, grammable switch that supports both dynamic and Unlike current superscalars, a Raw processor does not Laboratory compiler-orchestrated static routing. bind specialized logic structures such as register-renam- for Computer The tiles are connected with programmable, tightly ing logic or dynamic instruction-issue logic into hard- Science integrated interconnects. The tightly integrated, syn- ware. Instead, the focus is on keeping each tile small to chronous network interface of a Raw architecture maximize the number of tiles that can fit on a chip, allows for intertile communication with short latencies increasing the chip’s achievable clock speed and the similar to those of register accesses. Static scheduling amount of parallelism it can exploit. For example, a sin- guarantees that operands are available when needed, gle one-billion-transistor die using a conventional logic eliminating the need for explicit synchronization. technology could carry 128 tiles. Each tile would use 5 In addition, each tile supports multigranular (bit-, million transistors for memory, splitting them among a byte- and word-level) operations and programmers 16-Kbyte instruction memory (IMEM); a 16-Kbyte can use the configurable logic in each tile to construct switch instruction memory (SMEM); and a 32-Kbyte 86 Computer 0018-9162/97/$10.00 © 1997 IEEE . IMEM DMEM PC Registers ALU CL SMEM PC Switch Figure 1. A Raw processor is constructed of multiple identical tiles. Each tile contains instruction memory (IMEM), data mem- ories (DMEM), an arithmetic logic unit (ALU), registers, configurable logic (CL), and a programmable switch with its associ- ated instruction memory (SMEM). Memory Memory Memory Memory Switch Switch Switch Registers Registers Registers Registers Memory Memory Memory Switch Switch Switch Registers Registers Registers ALU ALU ALU ALUALU ALU ALU ALU ALU (a) (b) (c) Figure 2. Raw microprocessors differ from superscalar and multiprocessor architectures. (a) Raw microprocessors distribute the register file and mem- ory ports and communicate between ALUs on a switched, point-to-point interconnect. (b) A superscalar contains a single register file and memory port and communicates between ALUs on a global bus. (c ) Multiprocessors communicate at a much coarser grain through the memory subsystem. first-level data memory. Each type of memory is imple- opposes current trends, but it makes more chip area mented with 6-transistor SRAM memory cells backed available for memory and computational logic, by 128 Kbytes of dynamic memory implemented using results in a faster clock, and reduces verification com- 2-transistor memory cells. Each tile could devote a gen- plexity. Taken together, these benefits can make the erous 2-million-transistor equivalent area to the pipelined software synthesis of complex operations competi- data path and control (say, an R2000-equivalent CPU, tive with hardware for overall application perfor- floating-point unit, and configurable logic). Interconnects mance. would consume about 30 percent of the chip area. As Figure 2 shows, a Raw architecture basically Programmable, Integrated Interconnect replaces a superscalar processor’s bus architecture As Figure 2 shows, a Raw machine uses a switched with a switched interconnect while the software sys- interconnect instead of buses. The switch is integrated tem implements operations such as register renam- directly into the processor pipeline to support single- ing, instruction scheduling, and dependency checking. cycle message injection and receive operations. The Reducing hardware support for these operations processor communicates with the switch using dis- September 1997 87 . Compilers tinct opcodes to distinguish between accesses to static ation of the processing elements. The second is dedi- can schedule and dynamic network ports. No signal in a Raw cated to sequencing routing instructions for the static single-word processor travels more than a single tile width within switch. Separate controls for processor and switch lets a clock cycle. the processor take arbitrary, data-dependent branches data transfers Because the interconnect allows intertile communi- without disturbing the routing of independent mes- and exploit cation to occur at nearly the same speed as a register sages passing through the switch. Loading a different ILP, in read, compilers can schedule single-word data trans- program into the switch instruction memory changes addition to fers and exploit ILP, in addition to coarser forms of the switch schedule. Programming network switches coarser parallelism. Integrated interconnects also allow chan- with compilation time schedules lets the compiler sta- forms of nels with hundreds of wires instead of tens—very large tically schedule the computations in each tile. Static parallelism. scale integration switches are pad-limited and rarely scheduling eliminates synchronization and its signifi- dominated by internal switch area. The switch multi- cant overhead. plexes two logically distinct networks—one static and Compiler orchestration ensures that the static one dynamic—over the same set of physical wires. The switch stalls infrequently. Dynamic events, however, dynamic wormhole router makes routing decisions may force one of the tiles to delay its static schedule. based on each message’s header, which includes addi- For correctness, the static network provides minimal tional lines for flow control. hardware-supported flow control. Although both static and dynamic networks export flow-control Control bits to the software, we are exploring strategies to Each tile includes two sets of control logic and entirely avoid the flow-control requirement on the instruction memories. The first set controls the oper- static network. Comparing Raw to ing in coarse-grained environments like FPGA systems, however, do not support Other Architectures those provided by iWarp and NuMesh is instruction sequencing and are thus inflex- The Raw approach builds on several very difficult—predicting static events ible—they require loading an entire bit- previous architectures. A Raw architecture over long periods of time is hard. So, a stream to reprogram the FPGAs for a new seeks to execute pipelined applications Raw architecture, which also handles operation. Compilation for FPGAs is also (like signal processing) efficiently, as did fine-grained parallelism, provides several slow because of their fine granularity. The earlier systolic-array