Coherent Network Interfaces for Fine-Grain Communication
Total Page:16
File Type:pdf, Size:1020Kb
To appear in the Proceedings of the 23rd International Symposium on Computer Architecture (ISCA), 1996 Coherent Network Interfaces for Fine-Grain Communication Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A. Wood Computer Sciences Department University of Wisconsin-Madison Madison, Wisconsin 53706-1685 USA {shubu,babak,markhill,david}@cs.wisc.edu Abstract 1 Introduction Historically, processor accesses to memory-mapped device Most current computer systems do not efficiently support fine- registers have been marked uncachable to insure their visibility to grain communication. Processors receive data from external the device. The ubiquity of snooping cache coherence, however, devices, such as high-speed networks, through DMA and uncach- makes it possible for processors and devices to interact with cach- able device registers. A processor becomes aware of an external able, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks event (e.g., a message arrival) via interrupts or by polling and reducing control overheads (e.g., for polling). uncached status registers. Both notification mechanisms are This paper begins an exploration of network interfaces (NIs) costly: interrupts have high latency and polling wastes processor that use coherence—coherent network interfaces (CNIs)—to cycles and other system resources. A processor sends data with an improve communication performance. We restrict this study to NI/ uncachable store, a mechanism that is rarely given first-class sup- CNIs that reside on coherent memory or I/O buses, to NI/CNIs that port. Both uncachable loads and stores incur high overhead are much simpler than processors, and to the performance of fine- because they carry small amounts of data (e.g., 4-16 bytes), which grain messaging from user process to user process. fails to use the full transfer bandwidth between a processor and a Our first contribution is to develop and optimize two mecha- device. Optimizations such as block copy [42] or special store nisms that CNIs use to communicate with processors. A cachable buffers [42, 23] can help improve the performance of uncachable device register—derived from cachable control registers [39,40]— accesses by transferring data in chunks. However, these optimiza- is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable tions are processor-specific, may require new instructions [42, 23], queues generalize cachable device registers from one cachable, and may be restricted in their use [42]. coherent memory block to a contiguous region of cachable, coher- Snooping cache coherence mechanisms, on the other hand, are ent blocks managed as a circular queue. supported by almost all current processors and memory buses. Our second contribution is a taxonomy and comparison of four These mechanisms allow a processor to quickly and efficiently CNIs with a more conventional NI. Microbenchmark results show obtain a cache block’s worth of data (e.g., 32-128 bytes) from that CNIs can improve the round-trip latency and achievable another processor or memory. bandwidth of a small 64-byte message by 37% and 125% respec- This paper explores leveraging the first-class support given to tively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show snooping cache coherence to improve communication between that CNIs can improve the performance by 17-53% on the memory processors and network interfaces (NIs). NIs need attention, bus and 30-88% on the I/O bus. because progress in high-bandwidth, low-latency networks is rap- idly making NIs a bottleneck. Rather than try to explore the entire NI design space here, we focus our efforts three ways: •First, we concentrate on NIs that reside on memory or I/O buses. In contrast, other research has examined placing NIs in proces- sor registers [5,15,21], in the level-one cache controller [1], and on the level-two cache bus [10]. Our NIs promise lower cost ThisThis w orkwork is issupported supported in inpart part by by Wright Wright Laboratory Laboratory A vionicsAvionics Directorate, Director- than the other alternatives, given the economics of current Airate, Force Air FMaterialorce Material Command, Command, USAF USAF, under, undergrant #F33615-94-1-1525grant #F33615-94-1- and microprocessors and higher integration level we expect in the ARP1525A orderand ARP no.A B550, order NSF no. B550, PYI A NSFward PYI CCR-9157366, Award CCR-9157366, NSF Grant NSF MIP- future. Nevertheless, closer integration is desirable if it can be 9225097,Grant MIP-9225097, an I.B.M. cooperati an A.T.&Tve fello. graduatewship, felloandwship, donations and donationsfrom A.T.&T. made economically viable. Bellfrom Laboratories, A.T.&T. Bell Digital Laboratories, Equipment Digital Corporation, Equipment Corporation,Sun Microsystems, Sun ThinkingMicrosystems, Machines Thinking Corporation, Machines and Corporation, Xerox Corporation. and Xerox Our Corporation. Thinking •Second, we limit ourselves to relatively simple NIs—similar in MachinesOur Thinking CM-5 Machines was purchased CM-5 wthroughas purchased NSF Institutionalthrough NSF Infrastructure Institutional complexity to the Thinking Machines CM-5 NI [29] or a DMA GrantInfrastructure No. CDA-9024618 Grant No. CDwithA-9024618 matching withfunding matching from fundingthe Uni vfromersity the of engine. In contrast, other research has examined complex, pow- WUniisconsinversity Graduate of Wisconsin School. Graduate The U.S. School. Government The U.S. is Goauthorizedvernment to is repro- erful NIs that integrate an integer processor core [28, 38] to offer duceauthorized and distrib to reproduceute reprints and for distrib Govuteernmental reprints purposesfor Governmental notwithstanding pur- higher performance at higher cost. While both simple and com- anposesy cop notwithstandingyright notation anthereon.y copyright The notationviews and thereon. conclusions The vie wscontained and plex NIs are interesting, we concentrate on simple NIs where hereinconclusions are those contained of the authors herein andare thoseshould of not the be authors interpreted and should as necessarily not be coherence has not yet been fully exploited. representinginterpreted asthe necessarily official policies representing or endorsements, the official policies either ore xpressedendorse- or •Third, we focus on program-controlled fine-grain communica- implied,ments, ofeither the eWrightxpressed Laboratory or implied, A vionicsof the Wright Directorate Laboratory or the A U.S.vionics Gov- tion between peer user processes, as required by demanding par- ernment.Directorate or the U.S. Government. allel computing applications. This includes notifying the . receiving process that data is available without requiring an interrupt. In contrast, DMA devices send larger messages to remote memory, and only optionally notify the receiving process with a relatively heavy-weight interrupt. We explore a class of coherent network interfaces (CNIs) that reside on a processor node’s memory or coherent I/O bus and par- ticipate in the cache coherence protocol. CNIs interact with a coherent bus like Stanford DASH’s RC/PCPU [30], but support small 64-byte message by 37% and 125% respectively on a mem- messaging rather than distributed shared memory. CNIs commu- ory bus and 74% and 123% respectively on a coherent I/O bus. nicate with the processor through two mechanisms: cachable Experiments with five macrobenchmarks show that a CNI can device registers (CDRs) and cachable queues (CQs). A CDR— improve the performance by 17-53% on the memory bus and 30- derived from cachable control registers [39, 40]—is a coherent, 88% on the I/O bus. cachable block of memory used to transfer status, control, or data We see our paper having two main contributions. First, we between a device and a processor. In the common case of develop cachable queues, including using lazy pointers, message unchanging information, e.g., polling, a CDR removes unneces- valid bits, and sense-reverse. Second, we do the first micro- and sary bus traffic because repeated accesses hit in the cache. When macro-benchmarks comparison of alternative CNIs—exposed by changes do occur, CDRs use the underlying coherence protocol to our taxonomy—with a conventional NI. transfer messages a full cache block at a time. Cachable queues A weakness of this paper, however, is that we do not do an in- (CQs) are a new mechanism that generalize CDRs from one cach- depth comparison of our proposals with DMA. The magnitude of able, coherent memory block to a contiguous region of cachable, this deficiency depends on how important one expects DMA to be coherent blocks managed as a circular queue to amortize control compared to fine-grain communication in future systems. Some overheads. To maximize performance we exploit several critical argue that DMA will become more important as techniques like optimizations: lazy pointers, message valid bits, and sense- User-level DMA [3] reduce DMA initiation overheads. Others reverse. Because CQs look, smell, and act like normal cachable argue DMA will become less important as processors add block memory, message send and receive overheads are extremely low: copy instructions [42] (making the breakeven size for DMA a cache miss plus several cache hits. Furthermore, if the system larger) and as the marginal cost of adding another processor supports prefetching or an update-based coherence protocol, even diminishes [48] (making it less expensive to temporarily waste a the cache miss may be eliminated. Because CNIs transfer mes- processor). sages a cache block at a time, the sustainable bandwidth is much The rest of this paper describes CDRs and CQs in detail greater than conventional program-controlled NIs—such as the (Section 2), presents CNI taxonomy and implementations, CM-5 NI [44]—that rely on slower uncachable loads and stores. (Section 3), describes evaluation methodology (Section 4), ana- For symmetric multiprocessors (SMPs), which are often limited lyzes results (Section 5), reviews related work (Section 6), and by memory bus bandwidth, the reduced bus occupancy for access- concludes (Section 7).