Coherent Network Interfaces for Fine-Grain Communication

Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A. Wood

Computer Sciences Department, University of Wisconsin-Madison
Madison, Wisconsin 53706-1685 USA
{shubu,babak,markhill,david}@cs.wisc.edu

Abstract

Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).

This paper begins an exploration of network interfaces (NIs) that use coherence—coherent network interfaces (CNIs)—to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.

Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register—derived from cachable control registers [39, 40]—is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.

Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.

1 Introduction

Most current computer systems do not efficiently support fine-grain communication. Processors receive data from external devices, such as high-speed networks, through DMA and uncachable device registers. A processor becomes aware of an external event (e.g., a message arrival) via interrupts or by polling uncached status registers. Both notification mechanisms are costly: interrupts have high latency and polling wastes processor cycles and other system resources. A processor sends data with an uncachable store, a mechanism that is rarely given first-class support. Both uncachable loads and stores incur high overhead because they carry small amounts of data (e.g., 4-16 bytes), which fails to use the full transfer bandwidth between a processor and a device. Optimizations such as block copy [42] or special store buffers [42, 23] can help improve the performance of uncachable accesses by transferring data in chunks. However, these optimizations are processor-specific, may require new instructions [42, 23], and may be restricted in their use [42].

Snooping cache coherence mechanisms, on the other hand, are supported by almost all current processors and memory buses. These mechanisms allow a processor to quickly and efficiently obtain a cache block's worth of data (e.g., 32-128 bytes) from another processor or memory.

This paper explores leveraging the first-class support given to snooping cache coherence to improve communication between processors and network interfaces (NIs). NIs need attention, because progress in high-bandwidth, low-latency networks is rapidly making NIs a bottleneck. Rather than try to explore the entire NI design space here, we focus our efforts three ways:

● First, we concentrate on NIs that reside on memory or I/O buses. In contrast, other research has examined placing NIs in processor registers [5, 15, 21], in the level-one cache controller [1], and on the level-two cache bus [10]. Our NIs promise lower cost than the other alternatives, given the economics of current microprocessors and the higher integration levels we expect in the future. Nevertheless, closer integration is desirable if it can be made economically viable.

● Second, we limit ourselves to relatively simple NIs—similar in complexity to the Thinking Machines CM-5 NI [29] or a DMA engine. In contrast, other research has examined complex, powerful NIs that integrate an integer processor core [28, 38] to offer higher performance at higher cost. While both simple and complex NIs are interesting, we concentrate on simple NIs where coherence has not yet been fully exploited.

● Third, we focus on program-controlled fine-grain communication between peer user processes, as required by demanding parallel computing applications. This includes notifying the receiving process that data is available without requiring an interrupt. In contrast, DMA devices send larger messages to remote memory, and only optionally notify the receiving process with a relatively heavy-weight interrupt.

This work is supported in part by Wright Laboratory Avionics Directorate, Air Force Materiel Command, USAF, under grant #F33615-94-1-1525 and ARPA order no. B550, NSF PYI Award CCR-9157366, NSF Grant MIP-9225097, an I.B.M. cooperative fellowship, and donations from A.T.&T. Bell Laboratories, Digital Equipment Corporation, Thinking Machines Corporation, and Xerox Corporation. Our Thinking Machines CM-5 was purchased through NSF Institutional Infrastructure Grant No. CDA-9024618 with matching funding from the University of Wisconsin Graduate School. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Wright Laboratory Avionics Directorate or the U.S. Government.

We explore a class of coherent network interfaces (CNIs) that reside on a processor node's memory or coherent I/O bus and participate in the cache coherence protocol. CNIs interact with a coherent bus like Stanford DASH's RC/PCPU [30], but support messaging rather than distributed shared memory.

CNIs communicate with the processor through two mechanisms: cachable device registers (CDRs) and cachable queues (CQs). A CDR—derived from cachable control registers [39, 40]—is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. In the common case of unchanging information, e.g., polling, a CDR removes unnecessary bus traffic because repeated accesses hit in the cache. When changes do occur, CDRs use the underlying coherence protocol to transfer messages a full cache block at a time. Cachable queues (CQs) are a new mechanism that generalizes CDRs from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue to amortize control overheads. To maximize performance we exploit several critical optimizations: lazy pointers, message valid bits, and sense-reverse. Because CQs look, smell, and act like normal cachable memory, message send and receive overheads are extremely low: a cache miss plus several cache hits. Furthermore, if the system supports prefetching or an update-based coherence protocol, even the cache miss may be eliminated. Because CNIs transfer messages a cache block at a time, the sustainable bandwidth is much greater than that of conventional program-controlled NIs—such as the CM-5 NI [44]—that rely on slower uncachable loads and stores. For symmetric multiprocessors (SMPs), which are often limited by memory bus bandwidth, the reduced bus occupancy for accessing the network interface translates into better overall system performance.

An important advantage of CNIs is that they allow main memory to be the home for CQ entries. The home of a physical address is the I/O device or memory module that services requests to that address (when the address is not cached) and accepts the data on writebacks (e.g., due to cache replacements). Using main memory as a home for CQ entries offers several potential advantages. First, it decouples the logical and physical locations of network interface buffers. Logically, these buffers reside in main memory, a relatively plentiful resource that eases problems of naming, allocation, and deadlock. Physically, they can be located in processor or device caches to allow access at maximum speed. Second, it provides the same interface abstraction for local and remote communication. The sender cannot distinguish if the receiver is local, nor can the receiver tell. Third, it can exploit future processor and system optimizations, such as prefetching, replacement hints, or update protocols, that can further reduce the overheads of accessing NI registers or data buffers.

To expose the CNI design space, we develop a taxonomy reminiscent of DiriX [2]. We denote traditional network interface devices as NIiX and coherent network interface devices as CNIiX. The subscript i specifies the portion of an NI queue visible to the processor. The default unit of i is memory/cache blocks, but it can also be specified in 4-byte words by adding the suffix 'w'. The placeholder X is either empty, Q, or Qm. X empty represents the simple case where the NI exposes only part or whole of one message to the processor. As a result there are no explicit head or tail pointers to manage the NI queue. X = Q represents the more complex case where the exposed part of the NI queue is actually managed as a memory-based queue with explicit head and tail pointers. X = Qm denotes that the home of the explicit memory-based NI queues is main memory.

We then evaluate four CNIs—CNI4, CNI16Q, CNI512Q, and CNI16Qm—and compare them with NI2w, an NI that uses uncached accesses to its data buffers and device registers, derived from the Thinking Machines CM-5 NI. We consider placing the NIs on both a coherent memory bus and a slower coherent I/O bus. Microbenchmark results show that compared to NI2w, CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on a memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that a CNI can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.

We see our paper as having two main contributions. First, we develop cachable queues, including using lazy pointers, message valid bits, and sense-reverse. Second, we do the first micro- and macro-benchmark comparison of alternative CNIs—exposed by our taxonomy—with a conventional NI.

A weakness of this paper, however, is that we do not do an in-depth comparison of our proposals with DMA. The magnitude of this deficiency depends on how important one expects DMA to be compared to fine-grain communication in future systems. Some argue that DMA will become more important as techniques like user-level DMA [3] reduce DMA initiation overheads. Others argue DMA will become less important as processors add block copy instructions [42] (making the breakeven size for DMA larger) and as the marginal cost of adding another processor diminishes [48] (making it less expensive to temporarily waste a processor).

The rest of this paper describes CDRs and CQs in detail (Section 2), presents a CNI taxonomy and implementations (Section 3), describes our evaluation methodology (Section 4), analyzes results (Section 5), reviews related work (Section 6), and concludes (Section 7).

2 Coherent Network Interface Techniques

In this section, we describe two techniques for implementing CNIs: Cachable Device Registers (CDRs) and Cachable Queues (CQs). A CDR is a single coherent cache block used by a processor to communicate information to or from a CNI device. A CQ generalizes this concept into a contiguous region of coherent cache blocks. We describe the major issues in successfully exploiting CDRs and CQs. We describe their operation assuming write-allocate caches kept consistent by a MOESI write-invalidate coherence protocol [43].

2.1 Cachable Device Registers

Cachable Device Registers (CDRs) combine the traditional notion of memory-mapped device registers with the now-ubiquitous bus-based cache-coherence protocols supported by all major microprocessors. Reinhardt, et al. [39, 40] first proposed CDRs to communicate status information from a special-purpose hardware device to a processor. We extend their work to use coherence to efficiently communicate control information and data both to and from a network interface.

A CDR is a coherent, cachable memory block shared between a processor and a coherent network interface (CNI) device. The CNI sends information to the processor—i.e., to initiate a request or update status—by writing to the block. The CNI must first obtain write permission to the block in accordance with the underlying coherence protocol. The processor receives the information by polling the block. Unlike existing polling schemes, the CDR block is cachable, so in the common case of unchanging information, the processor's unsuccessful polls normally hit in the local cache.¹ Bus traffic only occurs when a device updates the information. Figure 1 illustrates this case under a write-invalidation based MOESI coherence protocol, assuming both the CNI and processor caches start with a read-only copy of the CDR. The CNI generates an invalidation to obtain write permission (arc 1), and the processor incurs a cache miss to fetch the CDR on its next poll attempt (arcs 2-5). Because a CDR consists of a whole cache block, an entire small message can be communicated between processor and CNI in a single bus transaction, amortizing the fixed overheads across multiple words.

1. Cache conflicts can cause replacements, which affect performance but not correctness.
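To make the polling behavior concrete, the following sketch shows what a receive-side poll on a CDR might look like from the processor's point of view. It is illustrative only: the cdr_t layout, with a status word followed by data in one 64-byte block, is our assumption, not a layout the paper prescribes.

    /* Illustrative sketch: polling a cachable device register (CDR).
     * 'cdr' points to a coherent, cachable block shared with the CNI.
     * Unsuccessful polls hit in the local cache and generate no bus
     * traffic; a bus transaction occurs only when the CNI writes the
     * block, invalidating the processor's cached copy. */
    typedef struct {
        volatile unsigned status;    /* nonzero => message present (assumed layout) */
        volatile unsigned data[15];  /* rest of a 64-byte cache block */
    } cdr_t;

    void wait_for_message(cdr_t *cdr, unsigned msg[15])
    {
        while (cdr->status == 0)
            ;                        /* spins in the cache while nothing changes */
        for (int i = 0; i < 15; i++)
            msg[i] = cdr->data[i];   /* whole block arrived in one bus transaction */
    }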

FIGURE 1. Coherent Network Interface with Cachable Device Register

A CDR can also transfer information from the processor to the device, in a logically symmetric way. Processor writes to the CDR are treated just like writes to a normal coherent cache block, obtained using the standard coherence mechanisms. The CNI device receives the information by reading the block, in a manner equivalent to polling. However, because the device observes the coherence protocol directly, it knows when the processor requests write permission to the block. Thus it need not poll periodically, but can read the block back immediately after the processor requests write permission. The device can provide a system-programmable backoff interval to reduce the likelihood of "stealing" the block back before the processor completes its writes to the CDR. This technique, called virtual polling, is necessary because few processors can efficiently "push" data out of their caches. For processors (e.g., PowerPC [47]) that do support user-level cache flush instructions, the CDR can be directly flushed out of the cache.

CDRs allow a processor to efficiently transfer a full cache block (e.g., 32-128 bytes) of information to or from a CNI. For smaller amounts of data, e.g., a 4-byte word, CDRs are less efficient. For most processors, fetching a single word from an uncached device register takes roughly the same time as from a CDR; this is because the CNI responds with the requested word first, which is then bypassed to the processor. However, the CDR still has higher overhead since it will displace another block from the cache, potentially causing a later miss. CDRs do even less well for small transfers to a device. Because most modern processors have store buffers, a single uncached store is more efficient than transferring that word via a CDR. For most processors and buses the breakeven point typically occurs at two or three double words. Hence, our CNI designs still use uncached stores to transfer single words of control information from the processor to the device.

For messages larger than a cache block, we require some method to reuse the CDR. For example, after the processor has read the first block of a message, it may want to read the second block using the same CDR. Conventional device registers often solve this problem using implicit clear-on-read semantics, where the register is cleared after an uncached read. For example, the CM-5 network interface treats the read of the hardware receive queue as an implicit "pop" operation. Clear-on-read works because processors guarantee the atomicity of individual load instructions; that is, the value returned by the device is guaranteed to be written to a register.

Clear-on-read does not work well for CDRs, since most processors do not provide the same atomicity guarantees for cache blocks. The load that causes the cache miss should be atomic (to close the "window of vulnerability" [27]); however, there are no guarantees for the remaining words in the block. Before subsequent loads complete, a cache conflict (e.g., resulting from an interrupt) could replace the block. With clear-on-read semantics, the remainder of the data in the CDR would be lost forever.

Instead, CDRs require an explicit clear operation by the receiver to enable reuse of the block. Under a MOESI protocol even this clear operation requires a slow three-cycle handshake between the processor and CNI to make sure that the processor sees new data when it re-reads the CDR. In the first step of this handshake, the processor issues an explicit clear operation by performing an uncached store to a traditional device register. In the second step, the processor must ensure that the CNI has seen the clear request. Since most modern processors employ store buffers, this step may incur additional stalls while a memory barrier instruction flushes the store out to the bus. When the CNI observes the explicit clear operation, it invalidates the CDR by arbitrating for and acquiring the memory bus. The third step of the handshake is for the processor to ensure that the invalidation has completed. It does this by reading, potentially repeatedly, a traditional uncached device status register.¹ Consequently, while CDRs efficiently transfer a single block of information, they perform much less well for multiple blocks. We address this problem by introducing cachable queues.

1. A somewhat more efficient handshake is possible if the processor provides a user-accessible cache-invalidate operation: issue the clear operation, flush the store buffer, and invalidate the cache entry.

2.2 Cachable Queues

Cachable Queues (CQs) generalize the concept of CDRs from one coherent memory block to a contiguous region of coherent memory blocks managed as a queue. CQs are a general mechanism that can be used to communicate messages between two processor caches or between a processor cache and a device cache. A key advantage of CQs is that they simplify the reuse handshake and amortize its overhead over the entire queue of blocks. Liu and Culler [31] used cachable queues to communicate small messages and control information between the compute processor and message processor in the Intel Paragon. We show how CQs can be used to communicate directly between a processor and a network interface device. We first describe the basic queue operation, and then introduce three important performance optimizations.

Cachable queues follow the familiar enqueue-dequeue abstraction and employ the usual array implementation, illustrated in Figure 2. The head pointer (head) identifies the next item to be dequeued, and the tail pointer (tail) identifies the next free entry. The queue is empty if head and tail are equal, and full if tail is one item² less than head (modulo queue size). If there is a single sender and single receiver for this queue, the case we consider in this paper, then no locking is required since only the sender updates tail and only the receiver updates head.³

A processor sends a message by simply enqueuing it in the appropriate out-bound message queue. That is, it first checks for overflow, then writes the message to the next free queue location and increments tail, relying on the underlying coherence protocol to bring the block(s) local to the cache. A processor receives a message by first checking for an empty queue, then reading the queue entry pointed to by the head. The message remains in the queue until the receiver explicitly increments head. The head and tail pointers reside in separate cache blocks.

2. We assume fixed size network messages in this paper, but CQs can be generalized to variable length messages in a straight-forward manner.

3. Memory barrier operations may be necessary to preserve ordering under weaker memory models.
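A minimal sketch of this single-sender, single-receiver circular queue appears below, using the empty/full convention just described. The names (cq_t, cq_send, cq_recv) and fixed entry size are our own illustrative assumptions; the point is that both ends use ordinary cachable memory operations.

    /* Illustrative sketch: the basic (unoptimized) cachable queue.
     * One cache-block-sized entry per message; head and tail live in
     * separate cache blocks so sender and receiver do not false-share. */
    #define CQ_ENTRIES      512
    #define WORDS_PER_ENTRY 16            /* one 64-byte cache block */

    typedef struct {
        volatile unsigned head;           /* written only by the receiver */
        char pad1[60];                    /* keep head and tail in separate blocks */
        volatile unsigned tail;           /* written only by the sender */
        char pad2[60];
        unsigned entry[CQ_ENTRIES][WORDS_PER_ENTRY];
    } cq_t;

    int cq_send(cq_t *q, const unsigned msg[WORDS_PER_ENTRY])
    {
        unsigned next = (q->tail + 1) % CQ_ENTRIES;
        if (next == q->head)              /* reading head may miss: see lazy pointers */
            return 0;                     /* full */
        for (int i = 0; i < WORDS_PER_ENTRY; i++)
            q->entry[q->tail][i] = msg[i];
        /* a memory barrier may be needed here under weaker memory models */
        q->tail = next;
        return 1;
    }

    int cq_recv(cq_t *q, unsigned msg[WORDS_PER_ENTRY])
    {
        if (q->head == q->tail)           /* reading tail may miss: see valid bits */
            return 0;                     /* empty */
        for (int i = 0; i < WORDS_PER_ENTRY; i++)
            msg[i] = q->entry[q->head][i];
        q->head = (q->head + 1) % CQ_ENTRIES;
        return 1;
    }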

FIGURE 2. Local Cachable Queue

Because CQs are simply memory, they have the property that the message sender and receiver have the same interface abstraction whether the other end is local or remote. A local CQ, illustrated in Figure 2, is simply a conventional circular queue between two processors. A remote CQ consists of two local CQs, each between a processor and a CNI device, as illustrated in Figure 3. The head and tail pointers are also managed as cachable memory. A CNI that uses CQs simply acts like another processor manipulating the queue.

FIGURE 3. Remote Cachable Queue

The head and tail pointers of the CQs are a much simpler way to manage reuse than the complex handshake required by CDRs. If there is room in the CQ, then the tail entry can be reused; if the CQ is non-empty, then the head entry is valid. However, even though no locking is required to access the head and tail pointers, a straight-forward implementation induces significant communication between sender and receiver. This occurs because the sender must check (i.e., read) the head pointer, to detect a full queue, and the receiver must check the tail pointer, to detect an empty queue. Because the queue pointers are kept in coherent memory, cache blocks may ping-pong with each check.

We can greatly reduce this overhead using three techniques: lazy pointers, message valid bits, and sense reverse. Lazy pointers exploit the observation that the sender need not know exactly how much room is left in the queue, but only whether there is enough room. The sender maintains a (potentially stale) copy of the head pointer, shadow_head, which it checks before each send. Shadow_head is conservative, so if it indicates there is enough room, then there is. Only when shadow_head indicates a full queue does the sender read head and update shadow_head. If the queue is no more than half full on average, then the sender needs to check head—and incur a cache miss—only twice each time around the array.
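Building on the earlier queue sketch, the sender-side check with a lazy pointer might look like the following; shadow_head and the surrounding names are again our illustrative assumptions, not code from the paper.

    /* Illustrative sketch: lazy pointer on the send side.
     * The sender consults its private shadow_head first; it reads the
     * (possibly remote) head pointer only when the queue looks full,
     * so the head block stops ping-ponging on every send. */
    int cq_send_lazy(cq_t *q, unsigned *shadow_head,
                     const unsigned msg[WORDS_PER_ENTRY])
    {
        unsigned next = (q->tail + 1) % CQ_ENTRIES;
        if (next == *shadow_head) {       /* looks full: refresh the shadow copy */
            *shadow_head = q->head;       /* the only read that can miss */
            if (next == *shadow_head)
                return 0;                 /* really full */
        }
        for (int i = 0; i < WORDS_PER_ENTRY; i++)
            q->entry[q->tail][i] = msg[i];
        /* a memory barrier may be needed here under weaker memory models */
        q->tail = next;
        return 1;
    }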
Lazy pointers work much less well for the tail pointer. The receiver must check tail on every poll attempt, to see if the queue is empty. Whenever a message arrives, the receiver's cached copy of tail gets invalidated. Thus in the worst case, each message arrival causes a cache miss on tail. Instead, we use message valid bits—stored either as a single bit in the message header or in a separate word—to allow the receiver to detect message arrivals without ever checking the tail pointer [10, 31]. The valid bits indicate whether a cache block contains a valid message or not. On a poll attempt, the receiver simply examines the first message in the queue (i.e., the one pointed to by head); if it is invalid, the queue is empty. Thus no bus traffic normally occurs in this case. When a valid message is written to the queue, the sender will invalidate the receiver's cached copy, causing a cache miss when the receiver polls again. To complete the handshake, the receiver must clear the message valid bit when it advances head.

Clearing the message valid bit requires the receiver to write the queue entry; thus under a MOESI protocol, the receiver becomes owner of the queue entry's cache block, rather than simply having a shared copy. This normally requires an additional bus transaction. We can avoid this transaction (and avoid clearing the valid bit) using a technique known as sense reverse. The key idea is to alternate the encoding of the valid bit on each pass through the queue. Valid is encoded as 1 on odd passes, and encoded as 0 on even passes. The sender and receiver both have an additional state bit, stored in the same cache blocks as their respective pointers, indicating the sense of their current pass. Figures 4 and 5 present pseudo-code for the simple case where the valid bit is stored in a separate word in the header. The sender first checks if the CQ has space and then writes the message followed by its current sense as the message valid bit. The receiver compares its current sense to the valid bit in the message, with a match indicating a valid message. Sense reverse has been previously used for barriers [34] and asynchronous logic, but to our knowledge has never been used for messaging.

Combining all three optimizations minimizes the bus traffic required by CQs. Under a write-invalidation based MOESI protocol, each block of a message requires one invalidation, to obtain write permission for the sender, and one read miss, to fetch the block for the receiver. The head pointer requires only two invalidation-miss pairs for each pass around the circular queue, assuming the queues are no more than half full on average. The tail pointer is private to the sender and generates no coherence actions.

2.3 Home for CQ entries

In most computer systems, all legal physical addresses map to a home device or memory module. If a block is cachable, for example, then the home is where data are written on cache replacement. Should the home for CDRs or CQ entries be at the CNI, as with a regular device register, or in main memory?

Since CDRs are each a single block and most devices will employ only a few, the logical choice is to provide the home within the device itself. This can also simplify the implementation for some memory buses, because the device may not have to implement all cases in the coherence protocol [36].

CQs, on the other hand, will benefit from being large. For example, Brewer, et al. have demonstrated that remote queues can significantly improve performance by preventing contention on the network fabric [6]. If the CQ's home is main memory—a less precious resource than hardware FIFOs—then its capacity is essentially infinite. Large queues help simplify protocol deadlock avoidance, at least for moderate-scale parallel machines. Having the CQ home in memory also helps tolerate unreliable network fabrics, since messages need not be removed from the send queue until delivery is confirmed.

To place the CQ home in main memory, we must address three operating system issues. First, a CNI needs a translation scheme to translate the CQ virtual addresses to physical addresses in main memory. In this paper, we assume that the operating system allocates CQ pages contiguously, allowing CNIs to use a simple base-and-bounds virtual-to-physical address translation.
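As a rough illustration of how cheap such a base-and-bounds scheme can be, consider the following sketch; the structure and field names are our assumptions.

    /* Illustrative sketch: base-and-bounds translation for a CQ whose
     * pages the OS has allocated contiguously and pinned. */
    typedef struct {
        unsigned long vbase;   /* first virtual address of the CQ region */
        unsigned long pbase;   /* corresponding physical address */
        unsigned long bound;   /* region size in bytes */
    } cq_xlate_t;

    /* Returns the physical address, or 0 for an out-of-range access. */
    unsigned long cq_translate(const cq_xlate_t *x, unsigned long vaddr)
    {
        unsigned long offset = vaddr - x->vbase;
        if (offset >= x->bound)
            return 0;          /* would need a more general scheme, e.g., a TLB [19] */
        return x->pbase + offset;
    }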

    if (tail == head &&
        sender's sense != receiver's sense)
    {
        Queue is full, stall or return error
    }
    else
    {
        enqueue message at tail + 1;
        *tail = sender's sense;
        advance tail modulo CQ size;
        if (tail == 0)
        {   /* reverse the sense */
            sender's sense = sender's sense xor 1;
        }
    }

FIGURE 4. Pseudo code for enqueue with sense reverse

    if (*head != receiver's sense)
    {
        Queue is empty, stall or return null
    }
    else
    {
        dequeue message at head + 1;
        advance head modulo CQ size;
        if (head == 0)
        {   /* reverse the sense */
            receiver's sense = receiver's sense xor 1;
        }
    }

FIGURE 5. Pseudo code for dequeue with sense reverse

If the operating system cannot guarantee contiguous allocation, then a more complicated translation mechanism may be necessary [19]. Second, a CNI must ensure that CQ pages always reside in main memory, or be prepared to fetch them from the swap device. For our implementations we assume that CQ pages are "pinned," so that the operating system does not attempt to page them out. Alternatively, we could adopt a more flexible scheme [3, 19] at the expense of a more general address translation mechanism (e.g., a TLB). Finally, there must be some mechanism for the rare case in which even the large amount of memory allocated for a CQ fills up. The simplest option is to block the sender; however, this may lead to deadlock. Alternatively, as proposed for MIT Fugu [32], the CNI device can interrupt the processor, causing it to allocate free virtual memory frames and drain the CQ.

Making main memory the home is generally infeasible with a coherent I/O bus. Coherent I/O buses [18] allow memory residing on the I/O bus to be cached by the processor. However, they do not allow an I/O device to coherently cache data from the regular processor memory space. It is difficult to change this in the near future because the speed mismatch makes it hard for I/O devices to respond to memory bus snoop requests in a timely fashion.

2.4 Multiprogramming

The demands of multiprogramming require additional support from a network interface. In traditional networking, an NI is a single shared resource virtualized by the operating system. For example, in TCP/IP the operating system multiplexes the hardware device to send and receive network messages to and from many processes. Unfortunately, the operating system's overheads severely limit performance, especially for small messages.

Many multicomputers reduce or eliminate this overhead by mapping the NI directly into the user's address space [1, 15, 29]. Thus the operating system normally need not get involved when messages are sent and received. However, user-mapped NIs significantly complicate support for multiprogramming. Possible solutions range from disallowing multiprogramming [15], to taking special actions at context switch time (to context switch the NI and network state) [44], to optimistically assuming a message is destined for the current process (reverting to operating system buffering if it is not) [32]. Furthermore, these solutions do not easily generalize to symmetric multiprocessing (SMP) nodes, where multiple processes may concurrently access the NI.

CNIs vastly simplify the multiprogramming problem by using memory-based queues as the interface abstraction, rather than memory-mapped device registers. Each process communicates with a CNI using its own data queues and treats them like regular cachable memory. The CNI device need only maintain a small amount of state for each queue, e.g., base and bound translations, head and tail pointers, and a process identifier. The process identifier is used to ensure that messages sent from one node are placed on the appropriate queue on the receiving node. Since this state is typically much smaller than the data queues, a CNI can support more active processes than a memory-mapped NI with comparable hardware. If there are more active queues than the CNI state can support, two options have potential. First, the operating system can unmap additional queues and accept faults when they are accessed. Second, the operating system can allocate a memory-based data structure that CNI hardware can use to find the state for all active queues (like page tables for a TLB fill).

3 CNI Taxonomy and Implementations

This section proposes a taxonomy of network interfaces (NIs) and describes the implementation of the five NIs that we evaluate in this paper. We use the NI queue structure as the main component to enumerate a taxonomy of network interfaces. NI queues are the primary carriers of messages between a processor and its NI. A processor sends messages to the NI through the send queue and receives messages from the NI through the receive queue. For our taxonomy of CNIs we assume that both NI queues have the same structure.

Our taxonomy is modelled after Agarwal et al.'s classification of directory protocols [2]. We use the notation NIiX for traditional NIs and CNIiX for coherent network interfaces that cache the NI queues. The subscript i denotes the size of the NI queue exposed to the processor. The default unit of i is memory/cache blocks, but it can also be specified in 4-byte words by adding the suffix 'w'. The placeholder X can be empty, Q, or Qm. X empty represents the simple case where a network interface exposes only part or whole of one network message. For CNIs a network message is exposed via CDRs, and CDR reuse is managed by the explicit handshake described in Section 2.1. X = Q represents the more complex case where the exposed portion of the NI queue is managed as a memory queue with explicit head and tail pointers. X = Qm denotes that the home of the explicit memory-based NI queues is in main memory. The absence of an 'm' implies that the device serves as the home for the NI queues.

Several existing NIs can be classified with this taxonomy. The Thinking Machines CM-5 [44] NI is NI2w since it exposes two words of a message to the receiver. Similarly, the Alewife [1] NI is NI16w [26]. The network interface in *T-NG [10], which devotes 8 KB to an NI queue and consists of 64-byte cache blocks, is NI128Q.

We examine five different network interface devices in this paper, summarized in Table 1, to send 256-byte network messages.

NI/CNI   | Exposed Queue Size | Queue Pointers | Home
NI2w     | 2 words            | none           | device
CNI4     | 4 cache blocks     | none           | device
CNI16Q   | 16 cache blocks    | explicit       | device
CNI512Q  | 512 cache blocks   | explicit       | device
CNI16Qm  | 16 cache blocks    | explicit       | main memory
TABLE 1. Summary of Network Interface Devices

The first is a conventional network interface modelled after the Thinking Machines CM-5 NI. Messages are sent by first checking an uncachable status register, to ensure there is room to inject the message; then the message is written to an uncachable device register backed by a hardware queue. Similarly, receives check an uncached status register to see if a message is available, then read the message from an uncachable device register. Because all accesses to the NI queues are non-cachable, and two four-byte words of the message are exposed, we classify this device as NI2w.

The second NI extends this baseline device by using four CDRs to expose a 256-byte network message. This device, denoted by CNI4, exploits the memory bus's block transfer capability to move a message between the processor and the device. However, the status and control registers are uncached. After receiving a message, the processor issues an uncached store to explicitly pop the message off the queue. By always checking the uncached status register—which does not indicate message ready again until the cached copy has been invalidated—the processor and CNI4 device perform a three-cycle handshake that prevents false hits.

The three-cycle handshake limits the bandwidth achievable by CNI4. The third and fourth NIs solve this problem by employing CQs for message data and regular memory for control and status information (head and tail pointers). CNI16Q and CNI512Q cache up to 16 and 512 blocks, respectively. The memory that backs up the caches resides on the devices themselves. The larger capacity of CNI512Q reduces the number of flow control stalls, increasing performance for applications with many messages in flight.

Sending messages to a CNIiQ device involves three steps: checking for space in the CQ, writing the message, and incrementing the tail pointer. The send is further optimized by sending a message ready signal to the CNI device through an uncached store. As discussed in Section 2.1, uncached stores are more efficient than cache block operations for small control operations. Hence, for the send queue, the CNI device does not use virtual polling. Instead, the CNIiQ uses the message ready signal to keep a count of pending messages. This count is incremented on each message ready signal and decremented when the device injects a message into the network. As long as this counter is greater than zero, the CNIiQ device pulls messages out of the processor cache (unless the blocks have already been flushed to their home in the device) and increments the head pointer. On the receive side, the processor polls the head of the queue, reads the message when valid, then increments the head pointer. Both sender and receiver toggle their sense bits when they wrap around the end of the CQ.

The last device, CNI16Qm, caches up to 16 cache blocks on the network interface device, and overflows to main memory as necessary. The total size of the memory-based queue is 512 cache/memory blocks. Having main memory as home for the CQ simplifies software flow control. Specifically, for the other NIs, whenever the sender cannot inject a message it must explicitly extract any incoming messages and buffer them in memory [6]. Conversely, CNI16Qm does this buffering automatically when the CNI cache cannot contain all the messages. The CNI16Qm taxonomy allows for memory overflow to occur on both the sending and receiving CNIs. However, for simplicity, this paper only examines memory buffering at the receiver.

The CNI devices implement a variant of virtual polling to minimize the number of bus transactions on the critical path. Specifically, under the bus's write-invalidation based MOESI protocol, the processor must generate an invalidation signal to acquire ownership of a cache block before it can write to it. Since our CQs are filled in FIFO order, an invalidation signal for any block other than the first block of a multi-block message implies that the processor is done writing the previous cache block. When the CNI device detects an invalidation signal it issues a coherent read on the previous cache block of the same message. Thus part of the message is transferred to the CNI cache before the processor has completed writing all the cache blocks of the message.
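Combining the CQ enqueue with the message-ready signal, the processor-side send to a CNIiQ device might look like the sketch below; it builds on the earlier illustrative queue sketches, and the msg_ready register name is our assumption.

    /* Illustrative sketch: sending one message to a CNIiQ device.
     * The message body is written to cachable queue memory (steps 1-3:
     * space check, write, tail increment, all inside cq_send_lazy);
     * a single uncached store then tells the device a message is pending. */
    int cniq_send(cq_t *q, unsigned *shadow_head,
                  volatile unsigned *msg_ready,   /* uncached device register */
                  const unsigned msg[WORDS_PER_ENTRY])
    {
        if (!cq_send_lazy(q, shadow_head, msg))
            return 0;                             /* queue full */
        /* a memory barrier may be needed here under weaker memory models */
        *msg_ready = 1;   /* message ready signal: the device increments its
                             pending-message count and pulls the blocks from
                             the processor cache when it injects the message */
        return 1;
    }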
4 Methodology

This section describes the system assumptions and benchmarks used to evaluate the five network interface designs. Section 5 presents results from the evaluation.

4.1 System Assumptions

Our simulations model a parallel machine with 16 nodes, each with a 200 MHz dual-issue SPARC processor modelled after the ROSS HyperSPARC, a 100 MHz multiplexed, coherent memory bus, a 50 MHz multiplexed, coherent I/O bus, and a network interface (NI2w or one of the four CNIiXs). Both buses support only one outstanding transaction. The memory bus's coherence protocol is modelled after the MBUS Level-2 coherence protocol [24]. Coherence on the I/O bus resembles that of the coherent extension to PCI [18]. An I/O bridge connects the memory and I/O buses. The bridge buffers writes and coherent invalidations, but blocks on reads. When transactions are simultaneously initiated on the two buses, the I/O bridge NACKs the I/O bus transaction to prevent deadlock. Fairness is preserved by ensuring that the next I/O bus transaction succeeds.

The single-level processor cache is 256 KB direct-mapped, with duplicated tags to facilitate snooping and 64-byte address and transfer blocks. The CNI caches are also direct-mapped with 64-byte address and transfer blocks. The CNI cache sizes vary according to the subscript i in the CNIiX nomenclature. Table 2 shows the bus occupancy for our network interface and memory accesses in processor cycles. Since the I/O bus is connected to the processor via the memory bus, the bus occupancy numbers for the I/O bus include the corresponding memory bus occupancy cycles.

Transfer                                                  | Memory Bus | I/O Bus
Cache-to-cache transfer from CNI to processor (64 bytes)  | 42         | 76
Cache-to-cache transfer from processor to CNI (64 bytes)  | 42         | 62
Memory-to-cache transfer (64 bytes)                       | 42         | —
TABLE 2. Bus Occupancy for Network Interface and Memory Access in Processor Cycles

Network topology is ignored and network message size is fixed at 256 bytes. All messages take 100 processor cycles to traverse the network from injection of the last byte at the source to arrival of the first at the destination. We model hardware flow control at the end points using a hardware sliding window protocol. A processor can send up to four network messages per destination before it blocks waiting for acknowledgments.
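The end-point flow control just described is a standard sliding window; a minimal per-destination sketch (our own rendering, with a window of four as in the simulations) is:

    /* Illustrative sketch: per-destination sliding window flow control.
     * At most WINDOW unacknowledged messages may be outstanding to any
     * one destination; the sender blocks, and must keep draining its
     * receive queue, when the window is full. */
    #define WINDOW 4

    typedef struct { unsigned sent, acked; } window_t;

    int  can_send(const window_t *w) { return w->sent - w->acked < WINDOW; }
    void record_send(window_t *w)    { w->sent++; }
    void record_ack(window_t *w)     { w->acked++; }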

FIGURE 6. Process-to-process round-trip message latency (vertical axis) for different message sizes (horizontal axis). (a) shows the round-trip message latency for NI2w, CNI4, CNI16Q, CNI512Q, and CNI16Qm on the memory bus. (b) shows the same (except CNI16Qm) on the I/O bus. (c) compares CNI512Q, CNI16Qm, and NI2w on the I/O, memory, and cache buses respectively.

To avoid deadlocks, if a processor blocks on a message send, it reads messages from the NI and buffers them in user space (except for CNI16Qm, in which messages automatically overflow to main memory from the CNI cache).

All benchmarks are run on the Tempest parallel programming interface [22]. Message-passing benchmarks use only Tempest's active messages [37]. Shared-memory codes on Tempest also use active messages, but assume hardware support for fine-grain access control [39]. Codes with custom protocols use a combination of the two.

4.2 Macrobenchmarks

Table 3 depicts the five macrobenchmarks used in this study. Spsolve [12] is a very fine-grained iterative sparse-matrix solver in which active messages propagate down the edges of a directed acyclic graph (DAG). All computation happens at nodes of the DAG within active message handlers. The messaging overhead is critical because each active message carries only a 12-byte payload and the total computation per message is only one double-word addition. Several active messages can be in flight, which can create bursty traffic patterns.

Gauss is a message-passing benchmark that solves a linear system of equations using Gaussian elimination [9]. The key communication pattern is a one-to-all broadcast of a pivot row (two kilobytes for our matrix size).

Em3d models three-dimensional electromagnetic wave propagation [13]. It iterates over a bipartite graph consisting of directed edges between nodes. Each node sends two integers to its neighboring nodes through a custom update protocol [16]. Several update messages (with 12-byte payload) can be in flight, which, like spsolve, can create bursty traffic patterns.

Moldyn is a molecular dynamics application whose computational structure resembles the non-bonded force calculation in CHARMM [7]. The main communication occurs in a custom bulk reduction protocol [35], which constitutes roughly 40% of the application's total time with NI2w as the network interface. One execution of the reduction protocol iterates as many times as there are processors. In each of these iterations, a processor sends 1.5 kilobytes of data to the same neighboring processor through Tempest's virtual channels.

Appbt is a parallel three-dimensional computational fluid dynamics application [8] from the NAS benchmark suite. It consists of a cube divided into subcubes among processors. Communication occurs between neighboring processors along the boundaries of the subcubes through Tempest's default invalidation-based shared memory protocol [38].

Benchmark | Key Communication    | Input Data Set
spsolve   | Fine-Grain Messages  | 3720 elements
gauss     | One-To-All Broadcast | 512x512 matrix
em3d      | Fine-Grain Messages  | 1K nodes, degree 5, 10% remote, span 6, 10 iter
moldyn    | Bulk Reduction       | 2048 particles, 30 iter
appbt     | Near neighbor        | 24x24x24 cubes, 4 iter
TABLE 3. Summary of macrobenchmarks

5 Results

This section examines the network interfaces' performance with two microbenchmarks and five macrobenchmarks.¹ On the memory bus we simulated all four CNIs plus NI2w. On the I/O bus we simulated all but CNI16Qm, since CNI16Qm cannot be implemented with current coherent I/O buses (Section 2.3). Since coherence is usually not an option on cache buses, we did not simulate CNIs there. For each microbenchmark and macrobenchmark we compare the performance of NI2w on the cache bus with the best of the CNI alternatives—CNI16Qm on the memory bus and CNI512Q on the I/O bus. Since NI2w on the cache bus is closest to the processor, it provides a rough upper bound to the performance achievable with different coherent network interfaces.

1. Raw data at URL http://www.cs.wisc.edu/~wwt/cni96.

5.1 Microbenchmarks

This section examines process-to-process round-trip message latency (Figure 6) and bandwidth (Figure 7) for our five network interface implementations.² These numbers include the messaging layer overhead for copying a message from the network interface to a user-level buffer, and vice versa. Thus data begins in the sending processor's cache and ends in the receiving processor's cache, rather than simply moving from memory to memory.

2. Each message is broken into one or more 256-byte network message(s). Each network message has a 12-byte overhead for header information. The message sizes in both Figure 6 and Figure 7 do not include this header.

FIGURE 7. Process-to-process message bandwidth (vertical axis) for different message sizes (horizontal axis), expressed as a fraction of the maximum bandwidth two processors on the same coherent memory bus can sustain using a local memory queue (Figure 2). (a) shows the bandwidth for NI2w, CNI4, CNI16Q, CNI512Q, and CNI16Qm (with and without snarfing) on the memory bus. (b) shows the same (except CNI16Qm) on the I/O bus. (c) compares CNI512Q, CNI16Qm, and NI2w on the I/O, memory, and cache buses respectively.
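The round-trip microbenchmark is a standard ping-pong; a sketch of what such a measurement loop might look like on top of the earlier illustrative queue primitives (our own rendering, not the authors' code) is:

    /* Illustrative sketch: round-trip (ping-pong) latency measurement.
     * Process 0 sends a message and waits for the echo; process 1
     * echoes every message it receives. Dividing elapsed time by the
     * iteration count gives the average round-trip latency. */
    void pingpong(int me, cq_t *sendq, cq_t *recvq,
                  unsigned *shadow_head, volatile unsigned *msg_ready,
                  int iters)
    {
        unsigned msg[WORDS_PER_ENTRY] = { 0 };
        for (int i = 0; i < iters; i++) {
            if (me == 0) {
                while (!cniq_send(sendq, shadow_head, msg_ready, msg)) ;
                while (!cq_recv(recvq, msg)) ;   /* poll for the echo */
            } else {
                while (!cq_recv(recvq, msg)) ;
                while (!cniq_send(sendq, shadow_head, msg_ready, msg)) ;
            }
        }
    }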

5.1.1 Round-Trip Latency

Figure 6 shows the round-trip latency of a message for each of NI2w and the four CNIiXs. It shows two important results. First, CNIs reduce messaging overheads significantly. For small messages, between 8 and 256 bytes, CNI16Qm is 20-84% better than NI2w on the memory bus (Figure 6a) and CNI512Q is 29-141% better than NI2w on the I/O bus (Figure 6b). Second, CNI16Qm on the memory bus increases the latency over NI2w on the cache bus by only 43% (Figure 6c). This is significant, because the CNIs do not require modifications to the processor or processor board.

The four CNIs have similar latencies with minor variations among them. CNI4 performs worst because it polls an uncached status register and must use the expensive three-cycle handshake to invalidate the previous message from the processor cache. CNI16Q and CNI512Q consistently have the lowest latency, due to efficiently polling the cached message valid bit and using explicit head and tail pointers to amortize the reuse overhead across the entire queue of messages. CNI16Qm's latency is slightly worse (on the memory bus) because when its cache overflows, it must flush (i.e., write back) messages to main memory. A better replacement policy and/or a writeback buffer could help take these flushes off the critical path. However, as we will see later with the macrobenchmarks, CNI16Qm consistently outperforms the other three CNIs on the memory bus due to its ability to overflow messages to main memory instead of backing up the network.

5.1.2 Bandwidth

Figure 7 graphs the bandwidth provided by the five network interfaces. The vertical axes are normalized to the maximum bandwidth two processors on the same coherent memory bus can sustain while transferring data from one cache to the other. For our simulation parameters (Table 2), this bandwidth is 144 MB/s. This is the maximum bandwidth the four CNIs can hope to achieve with our simulation parameters.

Figure 7 shows two interesting results. First, CNIs improve the bandwidth over NI2w significantly, even for very small messages. On the memory bus, CNI16Qm is 59-169% better than NI2w for 8-4096 byte messages (Figure 7a). For the same message sizes, CNI512Q is 51-287% better than NI2w on the I/O bus (Figure 7b). Second, NI2w's bandwidth on the cache bus is only 50% more than CNI16Qm's on the memory bus (Figure 7c).

As in the round-trip microbenchmark, all four CNIs have similar bandwidth with minor variations among them. CNI4 performs worst of the four CNIs because of its high overhead for polling uncached registers and the three-cycle handshake in the critical path of message reception. CNI4 shows two different knees on the memory and I/O buses respectively. The knee on the memory bus appears when a message crosses the first cache block boundary and writes to the second cache block. The second cache block is partially empty, resulting in wasted work by CNI4, which must still read the entire block. Since CNI4 reuses CDRs, the processor must wait for CNI4 to complete the entire read (and the three-cycle handshake) before it can write another message. The CQ-based CNIs do not have this problem because instead of blocking, a processor simply writes to the next queue location. This same knee does not show up on the I/O bus because the higher I/O bus access latencies dominate the pipelined transfer time for the cache block. On the I/O bus a different knee appears when CNI4 saturates the I/O bus.

CNI16Q and CNI512Q perform the best due to their low poll overhead and ability to cache multiple messages (a network message fits in four cache blocks). However, when the message size reaches two kilobytes, CNI16Q's performance on the I/O bus dips slightly. This is because the small queue size forces frequent updates to the shadow head pointer on the receive queue, which in turn creates contention at the I/O bridge. CNI512Q does not exhibit this problem because its larger queue requires less frequent updates to the shadow head.

CNI16Qm achieves slightly lower bandwidth than CNI512Q. This is because the message send rate is significantly higher than the message reception rate, causing the receiving CNI16Qm's cache to overflow. The resulting writebacks to main memory induce moderate contention, which decreases the maximum sustainable communication bandwidth. Unfortunately, because the problem is bandwidth, not latency, a writeback buffer will not help with this microbenchmark as it would for the round-trip microbenchmark.

However, an alternative technique, called data snarfing [17, 14], can potentially improve both latency and bandwidth. In data snarfing, a cache controller reads data in from the bus whenever it has a tag match (i.e., space allocated) for a block in the invalid state. Thus in our microbenchmark, the processor cache on the receive side simply snarfs in the cache blocks that CNI16Qm writes back to memory.

This eliminates many of the invalidation misses on the receive cachable queue and improves the bandwidth by as much as 45% (Figure 7a). We also expect that an update-based coherence protocol would have similar behavior. However, while data snarfing significantly improves microbenchmark performance, we found it had little effect on macrobenchmark performance and do not examine it further.

The absolute bandwidth offered by CNIs can improve significantly with a more aggressive system. With our simulation parameters—200 MHz processor, 100 MHz memory bus, 64-byte cache blocks, and 230 ns cache-to-cache transfer—the maximum bandwidth achieved by CNI512Q on the memory bus is 107 MB/s. This represents over 70% of the bandwidth achievable between two processors on the same coherent memory bus. More aggressive system assumptions, such as non-blocking caches, bigger cache blocks, prefetch instructions, support for update protocols, and a pipelined or packet-switched bus, would significantly improve this absolute performance.

FIGURE 8. Comparison of our five network interface implementations on the memory, I/O, and cache buses for the five macrobenchmarks. The vertical axes of the graphs show the speedup over NI2w on the memory bus. (a) compares NI2w, CNI4, CNI16Q, CNI512Q, and CNI16Qm on the memory bus. (b) compares NI2w, CNI4, CNI16Q, and CNI512Q on the I/O bus. (c) compares NI2w on the cache bus, CNI16Qm on the memory bus, and CNI512Q on the I/O bus.

5.2 Macrobenchmarks

Figure 8 shows the performance gains from CNIs for the five macrobenchmarks described in Section 4.2.

CNI4, CNI16Q, CNI512Q, and CNI16Qm offer a progression of incremental benefits over NI2w. Unlike NI2w, which can only be accessed through uncached loads and stores, CNI4 effectively utilizes the memory bus's high-bandwidth block transfer mechanism by transferring messages in full cache block units. CNI16Q and CNI512Q further reduce overhead by amortizing the three-cycle handshake over an entire queue of messages. The larger capacity of the CQs also helps prevent bursty traffic from backing up into the network. CNI16Qm further simplifies software flow control in the messaging layer by allowing messages to smoothly overflow to main memory when the device cache fills. This avoids processor intervention for message buffering, which could otherwise significantly degrade performance [25].

Block Transfer. The increase in bandwidth obtained by transferring messages in whole cache block units has a major impact on performance. Gauss and moldyn do bulk transfers, and appbt communicates with moderately large (128-byte) shared-memory blocks. Gauss performs a one-to-all broadcast of a 2 KB row, while moldyn's reduction protocol transfers 1.5 KB of data between neighboring processors. CNI4 improves gauss's performance by 39% and 46%, moldyn's performance by 42% and 20%, and appbt's performance by 10% and 11% on the memory and I/O buses respectively. Even for spsolve and em3d, which send small messages (12-byte payload), CNI4's performance improvement over NI2w is significant (between 13-21%).

CNI4's performance improvement for moldyn on the I/O bus is not as high as on the memory bus because of contention at the I/O bridge. The NI2w device never tries to acquire the memory or I/O bus because it is always a bus slave. However, the CNI4 cache competes with the processor cache to acquire the memory and I/O buses. Simultaneous bus acquisition requests at the I/O bridge from the processor cache and CNI4 cache create contention. This effect is severe in moldyn because message sends, message receives, and polls on uncached device registers are partially overlapped in moldyn's bulk reduction phase. Thus, the memory bus occupancy for a system with CNI4 on the I/O bus compared to a system with NI2w on the I/O bus decreases by 41% in gauss, but by only 15% in moldyn.

Overall, for the five macrobenchmarks, CNI4 improves performance over NI2w between 10-42% on the memory bus and 11-46% on the I/O bus. This amounts to 28-92% of the total gain achieved by our CNIs on the memory bus and 25-52% of that on the I/O bus.

Extra Buffering. The CQ-based CNIs provide extra buffering that helps smooth out bursts in message traffic. However, CNI16Q and CNI512Q cannot always take advantage of this feature. The problem is that once a sender blocks, the flow control software aggressively buffers received messages in memory. This results in messages being pulled out of the CNI's cache, even when there was still room for additional messages. Further, because of its small queue size, CNI16Q frequently updates its shadow head by reading the processor's head pointer, which creates bus contention.

255 small queue size, CN116Q frequently updates its shadow head by lNetwork Interface I Coherence I Caching I uniform Interface I reading the processor’s head pointer, which creates bus contention. Because of these effects, CN116Q and CN14 achieve roughly the CNI Yes Yes Memory Interface same performance on the memory bus. CN15 ~2Q’s larger queue TMC CM-5 [441 No No No reduces the frequency of shadow head updates. For em3d this Typhoon [38] Possible Possible Possible improves CN15 ~2Q’s performance over CN116Q by 29~0 on the memory bus. FLASH [28] Possible Possible Possible

On the I/O bus, the higher latencies mitigate the effects of overaggressive buffering by slowing down the rate at which messages are extracted and buffered. This allows the CQ-based CNIs to exploit their buffering and smooth out the bursty traffic of all five macrobenchmarks. In spsolve and em3d, several small active messages (with 12-byte payloads) can be in flight simultaneously, causing bursts in message arrival. In gauss and moldyn, periodic bulk transfers cause the bursts. Request-response protocols normally do not cause bursts; however, appbt exhibits a hot spot in which one processor receives twice as many messages as the other processors. Thus, CNI_16Q improves the performance of spsolve, gauss, em3d, and appbt over CNI_4 on the I/O bus by 15%, 26%, 11%, and 16% respectively. For moldyn, frequent updates of the shadow head cause contention at the I/O bridge and actually slightly reduce CNI_16Q's performance. But the extra buffering and infrequent updates of the shadow head result in CNI_512Q improving performance by 13%, 31%, and 51%, respectively, over CNI_16Q for spsolve, em3d, and moldyn.

Overflow to Memory. CNI_16Qm allows messages to overflow smoothly to memory when the device cache fills up. This eliminates the overaggressive message buffering that was a problem for CNI_16Q and CNI_512Q. This automatic buffering improves spsolve's performance over CNI_512Q by 20%. For the other four macrobenchmarks, CNI_16Qm is slightly better than CNI_512Q. Thus, CNI_16Qm consistently outperforms CNI_512Q on the memory bus even with significantly less memory (i.e., cache) on the device.

On the memory bus, CNI_16Qm shows the best overall performance improvement (between 17-53%), while CNI_512Q shows the best improvement (between 30-88%) on the I/O bus. Also, CNI_16Qm on the memory bus comes within 4% of NI_2W's performance on the cache bus for spsolve, gauss, and moldyn, and within 17% for appbt (Figure 8). For em3d, CNI_16Qm on the memory bus slightly outperforms the cache bus NI_2W because NI_2W has limited buffering in the device and the processor must explicitly buffer messages in memory. These results indicate that CNI_16Qm is an attractive alternative because it is feasible with most commodity processors and requires no change to the processor core or board.
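The reason overflow needs no processor intervention can be seen from the same sketch: the queue is ordinary cachable memory, so the consumer code is unchanged whether a block is supplied cache-to-cache by the CNI or from DRAM. The fragment below, with a hypothetical handle_message handler of our invention, illustrates this.

    /* Why CNI_16Qm needs no software flow control for overflow: the CQ is
       a region of cachable memory, and the CNI's small cache acts as a
       write-back cache over it.  When that cache fills, hardware writes
       older blocks back to DRAM; a consumer miss is then satisfied either
       cache-to-cache by the CNI or from memory.  The receive loop from
       the earlier sketch therefore works unchanged. */
    extern void handle_message(const uint8_t *msg);   /* hypothetical handler */

    void cq_drain(cq_t *q)
    {
        uint8_t msg[BLOCK_BYTES - 1];
        while (cq_receive(q, msg))
            handle_message(msg);
    }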
Finally, CNIs significantly reduce the memory bus occupancy. By polling on cached registers and transferring messages in full cache block units, CQ-based CNIs on the memory bus reduce the memory bus occupancy by as much as 66% (averaged over the five macrobenchmarks) compared to NI_2W. In comparison, CNI_4 reduces the memory bus occupancy by only 23% because it still requires the processor to poll across the memory bus.

6 Related Work

Coherent Network Interfaces differ from most previous work on program-controlled network I/O in three important respects. First, unlike other NIs, CNIs interact with processor caches and main memory primarily through the node's coherence protocol. Second, CNIs separate the logical and physical locations of NI registers and queues, allowing processors to cache them like memory. Third, CNIs provide a uniform memory-based interface for both local and remote communication. Table 4 compares the network interfaces of different machines with respect to these three issues.

Network Interface     | Coherence | Caching       | Uniform Interface
----------------------|-----------|---------------|-------------------
CNI                   | Yes       | Yes           | Memory interface
TMC CM-5 [44]         | No        | No            | No
Typhoon [38]          | Possible  | Possible      | Possible
FLASH [28]            | Possible  | Possible      | Possible
Meiko CS2 [33]        | Possible  | No            | Possible
Alewife [1]           | No        | No            | No
FUGU [32]             | No        | No            | No
StarT-NG [10]         | No        | Maybe         | No
AP1000 [41]           | No        | Sender        | No
T-Zero [39]           | Partial   | Partial       | No
SHRIMP [4]            | Yes       | Write through | No
DI Multicomputer [11] | No        | No            | Network interface

TABLE 4. Comparison of CNI with other network interfaces.

The Thinking Machines CM-5 [44], the Wisconsin Typhoon [38], the Stanford FLASH [28], and the Meiko CS2 [33] multiprocessors provide high-latency uncached access to their NIs on the memory bus. Since both Typhoon and FLASH have a coherent cache in their network interfaces, they could both support CQs. The Meiko CS2 network interface supports the memory bus's coherence protocol, but does not contain a cache. The MIT Alewife [1] and FUGU [32] machines provide uncached access to their NIs under control of a custom CMMU. The StarT-NG NI [10] is not coherent because it is a slave device on the non-coherent L2 coprocessor interface. StarT-NG NI queues can be cached in the L1 cache, but the processor must explicitly self-invalidate or flush stale copies of the NI queues. Wisconsin T-Zero [39] caches device registers, but not queues, and only uses them to send information from the device to the processor. AP1000 [41] directly DMAs messages from the processor's cache to the NI, but does not receive messages directly into the cache. Princeton SHRIMP's memory bus NI [4] allows coherent caching on the processor, but requires processors to use the higher-traffic write-through mode. The DI multicomputer's on-chip NI [11] neither supports coherence nor allows its registers or queues to be cached; the processor chip interfaces with the rest of the system through the NI. Unlike other machines, the DI multicomputer supports a uniform message-based interface for both memory and the network, whereas CNI uses the same memory interface for both memory and network.

Unlike many other NIs, our implementation of CNIs does not require changes to an SMP board or other standard components. Yet CNIs enable processors and network interfaces to communicate through the cachable memory accesses for which most processors and buses are optimized. Henry and Joerg [21] and Dally et al. [15] advocate changes to a processor's registers. MIT Alewife [1] and Fugu [32] rely on a custom cache controller. MIT StarT-NG [10] requires a coprocessor interface at the same level as the L2 cache. AP1000 [41] requires integrated cache and DMA controllers. Stanford FLASH [28, 20] uses a custom memory controller with an embedded processor. Other efforts, such as the TMC CM-5 or SHRIMP, use standard components, but settle for lower performance by using loads and stores to either uncachable or write-through memory, instead of using the full functionality of write-back caches.

Three efforts that appear very similar to our work are FLASH messaging [19], UDMA/SHRIMP-II [3], and Remote Queues [6]. We differ from FLASH because we do not have a processor core in the network interface, we allow commands to use cachable loads and stores, and we can notify the receiving process without an interrupt. We differ from UDMA/SHRIMP-II because we use the same mechanisms when the destination is local and remote (whereas SHRIMP-II's UDMA does not handle local-memory-to-local-memory copies), we use only virtual addresses (whereas SHRIMP-II requires that the sender know the receiver's physical addresses), we allow device registers to use write-back caching, and we focus on fine-grain user-to-user communication in which the receiving process may be notified without an interrupt. We differ from Remote Queues by being at a lower level of abstraction. Remote Queues provide a communication model similar to Active Messages [45], except that extracting a message from the network and invoking the receive handler can be decoupled. Implementing Remote Queues with CNIs is straightforward and offers advantages over the CM-5, Intel Paragon, MIT Alewife, and T3D network interfaces: CNIs support cachable device registers for low-overhead polling (unlike the others), allow network buffers to gracefully overflow to memory (unlike the CM-5), and do not require a second processor (Paragon), a custom cache controller (Alewife), or hardware support for globally shared memory (T3D).

Finally, our results conservatively estimate the rate at which processors can move data, given trends toward block move instructions, prefetch support, and non-blocking caches. UltraSPARC-I [42], for example, has instructions that copy a cache block (64 bytes) to or from floating-point registers; SPARC V9 [46] has four prefetch instructions that indicate expected locality (e.g., that a block will be written once and not accessed again); and numerous processors do not stall on the first cache miss. These optimizations can further increase the relative benefits of using CNIs.

7 Conclusions

This paper explored using snooping cache coherence to improve communication performance between processors and network interfaces (NIs). We call NIs that use coherence coherent network interfaces (CNIs). We restricted our study to NI/CNIs that reside on memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.
network inte~aces (CNIS), We restricted our study to NI/CNIs [2] Anant Agarwal, Richmd Simoni, Mark Horowitz, and John Hennessy. An that reside on memory or 1/0 buses, to NI/CNIs that are much Evaluauon of Du’ectory Schemes for Cache Coherence. In Proceedings of the simpler than processors, and to the performance of fine-grain mes- 15th Annual International Symposium on Computer Architecture, pages 2SO- 289, 1988. saging from user process to user process. [3] Matthias A. Blmnrich, Cesary Duhniclu, Edward W, Felten, and Kai L]. We developed two mechanisms that CNIS use to communicate Protected User-level DMA for the SHRIMP Network Interface. In Second IEEE with processors. A cachable device register allows information to Symposuem on High-Perforrrrarrce Computer Archuecwre, February 1996. be exchanged in whole cache blocks and permits efficient polling [4] Matdrias A. Blumrich, Kai Li, Richard Alperr, Ceznry Duhnicki, Edward W. where cache misses (and bus transfers) occur only when status Felten, and Jonathon Sandberg, Vktual Memory Mapped Network Interface for changes. CachaMe queues reduce re-use overhead by using array the SHRJMP Multicomprrter. In Proceedings of fhe 21st Annual Inrernarional of cachable, coherent blocks managed as a circular queue and Symposium on Computer Architecture, pages 142-153, April 1994. (optionally) optimized with lazy pointers, message valid bits, and [5] Shekfua’ Borkar, Roberr Cohn, Geroge Cox, Thomas Gross, H. T, Kung, Monica sense-reverse. Lam, Margie Levine, Brimr Moore, Wme Moore, Cmtg Peterson, Jim Susman, Jim Sutton, John Urbanski, and Jon Webb. Suppordng Systolic and Memory We then compared four alternative CNIS—CN14, CN116Q, Communication in iWarp. In Proceedings of the 17th Annual International CN1512Q, and CN116Qm—with a CM-5-like NI. Microbenchmark Symposium on Computer Architecture, pages 70-81, 1990. results showed that CNIS significantly improved the round-trip [6] Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Sharnik D. Sharrna, and John latency and bandwidth of small and moderately large messages. Kubiatowicz. Remote Queues: Exposing Message Queues or Optimization and For small message sizes, between 8 and 256 bytes, CNIS Atorrricity. In Proceedings of the .Mh ACM Symposium on Parallel Algorithms improved the round-trip latency by 20-84% compared to N12W on and Architectures (SPAA), pages 42–53. 1995. a coherent memory bus and 29-141 YO on a coherent I/O bus, For [7] B, R. Brooks, R. E. Bnrccoleri, B. D, Olafson, D. J. States, S. Swarnirrtathan, moderately large messages, between 8 and 4096 bytes, CNIS and M. Karphrs. Charnrm: A program for macromolecular energy, nmmruzation, and dynamics calculation. Journal of Computational ChemMw, improved bandwidth by 59-169910 over N12W on a coherent mem- 4(187), 1983, ory bus and 51 -287Yo on a coherent 1/0 bus. Macrobenchmark [8] Doug Burger and SanJay Mehra. Parallelizing Appbt fnr a Shared-Memory results showed that CNI 16Qm performed the best on the coherent Mukiprocessnr. Technical Report 1286, Computer Sciences Department, memory bus and CN1512Q on the coherent 1/0 bus. CNI 16Qm was University of Wkconsin-Madison, September 1995.

We then compared four alternative CNIs (CNI_4, CNI_16Q, CNI_512Q, and CNI_16Qm) with a CM-5-like NI. Microbenchmark results showed that CNIs significantly improved the round-trip latency and bandwidth of small and moderately large messages. For small messages, between 8 and 256 bytes, CNIs improved the round-trip latency by 20-84% compared to NI_2W on a coherent memory bus and by 29-141% on a coherent I/O bus. For moderately large messages, between 8 and 4096 bytes, CNIs improved bandwidth by 59-169% over NI_2W on a coherent memory bus and by 51-287% on a coherent I/O bus. Macrobenchmark results showed that CNI_16Qm performed the best on the coherent memory bus and CNI_512Q on the coherent I/O bus. CNI_16Qm was 17-53% better than NI_2W on the memory bus, while CNI_512Q was better than NI_2W by 30-88% on the I/O bus. Also, CNI_16Qm on the memory bus came within 17% of NI_2W's performance on the cache bus. This indicates that CNI_16Qm is an attractive alternative because it is feasible with most current commodity microprocessors and requires no change to the processor core or board.

Our experiments use assumptions that are reasonable for commodity parts in the present and near future. In the medium term, our quantitative results will likely be obviated by better memory interconnects that pipeline requests, allow out-of-order responses, or even abandon physical buses. Nevertheless, we expect our qualitative results in favor of CNIs to continue to hold, as CNIs continue to exercise memory interconnects with the operations the interconnects are optimized for, namely, coherent block transfers.

In the longer term, caches, memory bus, NIs, and memory may move onto processor chips (or, in another view, everything moves onto memory chips). To manage complexity, however, these super chips may resemble boards of old systems, with die area devoted to a custom mix of relatively standard, optimized components (e.g., processors and DRAM) interconnected through well-defined interfaces. While integrating an NI into a processor is possible, CNIs will be interesting as a less expensive (in terms of design and verification costs) way to deliver competitive performance.

Acknowledgements

This work developed in the supportive environment provided by members of the Wisconsin Wind Tunnel Project (http://www.cs.wisc.edu/~wwt). Steve Reinhardt and Rob Pfile were instrumental in ideas leading to CDRs. Fred Chong and Shamik Sharma provided us with spsolve. Erik Hagersten, Steve Reinhardt, Jon Wade, Bob Zak, and anonymous referees provided excellent feedback on various drafts of this manuscript. Special thanks to Steve Reinhardt for his generous help with debugging the simulator.

References

[1] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, and Donald Yeung. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 2-13, June 1995.

[2] Anant Agarwal, Richard Simoni, Mark Horowitz, and John Hennessy. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 280-289, 1988.

[3] Matthias A. Blumrich, Cezary Dubnicki, Edward W. Felten, and Kai Li. Protected User-level DMA for the SHRIMP Network Interface. In Second IEEE Symposium on High-Performance Computer Architecture, February 1996.

[4] Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, Edward W. Felten, and Jonathan Sandberg. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 142-153, April 1994.

[5] Shekhar Borkar, Robert Cohn, George Cox, Thomas Gross, H. T. Kung, Monica Lam, Margie Levine, Brian Moore, Wire Moore, Craig Peterson, Jim Susman, Jim Sutton, John Urbanski, and Jon Webb. Supporting Systolic and Memory Communication in iWarp. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 70-81, 1990.

[6] Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Shamik D. Sharma, and John Kubiatowicz. Remote Queues: Exposing Message Queues for Optimization and Atomicity. In Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 42-53, 1995.

[7] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4(187), 1983.

[8] Doug Burger and Sanjay Mehra. Parallelizing Appbt for a Shared-Memory Multiprocessor. Technical Report 1286, Computer Sciences Department, University of Wisconsin-Madison, September 1995.

[9] Satish Chandra, James R. Larus, and Anne Rogers. Where is Time Spent in Message-Passing and Shared-Memory Programs? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 61-75, October 1994.

[10] Derek Chiou, Boon S. Ang, Arvind, Michael J. Beckerle, Andy Boughton, Robert Greiner, James E. Hicks, and James C. Hoe. StarT-NG: Delivering Seamless Parallel Computing. In Proceedings of EURO-PAR '95, Stockholm, Sweden, 1995.

[11] Lynn Choi and Andrew A. Chien. Integrating Networks and Memory Hierarchies in a Multicomputer Node Architecture. In Proceedings of the Eighth International Parallel Processing Symposium, 1994.

[12] Fred Chong, Shamik Sharma, Eric Brewer, and Joel Saltz. Multiprocessor Runtime Support for Irregular DAGs. In R. Kalia and P. Vashishta, editors, Toward Teraflop Computing and New Grand Challenge Applications. Nova Science Publishers, Inc., 1995.

[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings of Supercomputing '93, pages 262-273, November 1993.

[14] Fredrik Dahlgren. Boosting the Performance of Hybrid Snooping Cache Protocols. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 60-69, 1995.

[15] William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Nuth, Scott Wills, Paul Carrick, and Greg Fyler. The J-Machine: A Fine-Grain Concurrent Computer. In G. X. Ritter, editor, Proc. Information Processing 89. Elsevier North-Holland, Inc., 1989.

[16] Babak Falsafi, Alvin Lebeck, Steven Reinhardt, Ioannis Schoinas, Mark D. Hill, James Larus, Anne Rogers, and David Wood. Application-Specific Protocols for User-Level Shared Memory. In Proceedings of Supercomputing '94, pages 380-389, November 1994.

[17] James R. Goodman and Philip J. Woest. The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 422-431, 1988.

[18] PCI Special Interest Group. PCI Local Bus Specification, Revision 2.1, 1995.

[19] John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38-50, 1994.

[20] Mark Heinrich, Jeffrey Kuskin, David Ofelt, John Heinlein, Joel Baxter, Jaswinder Pal Singh, Richard Simoni, Kourosh Gharachorloo, David Nakahira, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 274-285, 1994.

[21] Dana S. Henry and Christopher F. Joerg. A Tightly-Coupled Processor-Network Interface. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 111-122, October 1992.

[22] Mark D. Hill, James R. Larus, and David A. Wood. Tempest: A Substrate for Portable Parallel Programs. In COMPCON '95, pages 327-332, San Francisco, California, March 1995. IEEE Computer Society.

[23] MIPS Technologies Inc. MIPS R10000 Microprocessor User's Manual, 1995.

[24] Sun Microsystems Inc. SPARC MBus Interface Specification, April 1991.

[25] Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298-307, 1995.

[26] John Kubiatowicz and Anant Agarwal. Anatomy of a Message in the Alewife Multiprocessor. In Proceedings of the 1993 ACM International Conference on Supercomputing, 1993.

[27] John Kubiatowicz, David Chaiken, and Anant Agarwal. Closing the Window of Vulnerability in Multiphase Memory Transactions. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 274-284, 1992.

[28] Jeffrey Kuskin et al. The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 302-313, April 1994.

[29] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The Network Architecture of the Connection Machine CM-5. In Proceedings of the Fifth ACM Symposium on Parallel Algorithms and Architectures (SPAA), July 1992.

[30] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.

[31] Lok Tin Liu and David E. Culler. Evaluation of the Intel Paragon on Active Message Communication. In Proceedings of the Intel Supercomputer Users Group Conference, June 1995.

[32] Kenneth Mackenzie, John Kubiatowicz, Anant Agarwal, and Frans Kaashoek. Fugu: Implementing Translation and Protection in a Multiuser, Multimodal Multiprocessor. Technical Memo MIT/LCS/TM-503, MIT Laboratory for Computer Science, October 1994.

[33] Meiko World Inc. Computing Surface 2: Overview Documentation Set, 1993.

[34] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[35] Shubhendu S. Mukherjee, Shamik D. Sharma, Mark D. Hill, James R. Larus, Anne Rogers, and Joel Saltz. Efficient Support for Irregular Applications on Distributed-Memory Machines. In Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 68-79, July 1995.

[36] Robert W. Pfile. Typhoon-Zero Implementation: The Vortex Module. Technical report, Computer Sciences Department, University of Wisconsin-Madison, 1995.

[37] Steven K. Reinhardt. Tempest Interface Specification (Revision 1.2.1). Technical Report 1267, Computer Sciences Department, University of Wisconsin-Madison, February 1995.

[38] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 325-337, April 1994.

[39] Steven K. Reinhardt, Robert W. Pfile, and David Wood. Typhoon-0: Hardware Support for Distributed Shared Memory on a Network of Workstations. In Workshop on Scalable Shared-Memory Multiprocessors, 1995.

[40] Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.

[41] Toshiyuki Shimizu, Takeshi Horie, and Hiroaki Ishihata. Low-Latency Message Communication Support for the AP1000. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 288-297, 1992.

[42] SPARC Technology Business. UltraSPARC-I User's Manual, Revision 1.0, September 1995.

[43] Paul Sweazey and Alan Jay Smith. A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 414-423, 1986.

[44] Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary, 1991.

[45] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: a Mechanism for Integrating Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, May 1992.

[46] David L. Weaver and Tom Germond, editors. SPARC Architecture Manual (Version 9). PTR Prentice Hall, 1994.

[47] Shlomo Weiss and James E. Smith. Power and PowerPC. Morgan Kaufmann Publishers, Inc., 1994.

[48] David A. Wood and Mark D. Hill. Cost-Effective Parallel Computing. IEEE Computer, 28(2):69-72, February 1995.