Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John B. Carter School of Computing, University of Utah {legion,naveen,karthikr,rajeev,retrac}@cs.utah.edu ∗

∗This work was supported in part by NSF grant CCF-0430063 and by Inc.

Abstract

Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption. On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect composed of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and evaluate a subset of these techniques to show their performance and power-efficiency potential. Most of the proposed techniques can be implemented with a minimum complexity overhead.

1. Introduction

One of the greatest bottlenecks to performance in future microprocessors is the high cost of on-chip communication through global wires [19]. Power consumption has also emerged as a first order design metric and wires contribute up to 50% of total chip power in some processors [32]. Most major chip manufacturers have already announced plans [20, 23] for large-scale chip multi-processors (CMPs). Multi-threaded workloads that execute on such processors will experience high on-chip communication latencies and will dissipate significant power in interconnects. In the past, only VLSI and circuit designers were concerned with the layout of interconnects for a given architecture. However, with communication emerging as a larger power and performance constraint than computation, it may become necessary to understand and leverage the properties of the interconnect at a higher level. Exposing wire properties to architects enables them to find creative ways to exploit these properties. This paper presents a number of techniques by which coherence traffic within a CMP can be mapped intelligently to different wire implementations with minor increases in complexity. Such an approach can not only improve performance, but also reduce power dissipation.

In a typical CMP, the L2 cache and lower levels of the memory hierarchy are shared by multiple cores [24, 41]. Sharing the L2 cache allows high cache utilization and avoids duplicating cache hardware resources. L1 caches are typically not shared as such an organization entails high communication latencies for every load and store. There are two major mechanisms used to ensure coherence among L1s in a chip multiprocessor. The first option employs a bus connecting all of the L1s and a snoopy bus-based coherence protocol. In this design, every L1 cache miss results in a coherence message being broadcast on the global coherence bus and other L1 caches are responsible for maintaining valid state for their blocks and responding to misses when necessary. The second approach employs a centralized directory in the L2 cache that tracks sharing information for all cache lines in the L2. In such a directory-based protocol, every L1 cache miss is sent to the L2 cache, where further actions are taken based on that block's directory state.

Many studies [2, 10, 21, 26, 30] have characterized the high frequency of cache misses in parallel workloads and the high impact these misses have on total execution time. On a cache miss, a variety of protocol actions are initiated, such as request messages, invalidation messages, intervention messages, data block writebacks, data block transfers, etc. Each of these messages involves on-chip communication with latencies that are projected to grow to tens of cycles in future billion transistor architectures [3].

Simple wire design strategies can greatly influence a wire's properties. For example, by tuning wire width and spacing, we can design wires with varying latency and bandwidth properties. Similarly, by tuning repeater size and spacing, we can design wires with varying latency and energy properties. To take advantage of VLSI techniques and better match the interconnect design to communication requirements, a heterogeneous interconnect can be employed, where every link consists of wires that are optimized for either latency, energy, or bandwidth. In this study, we explore optimizations that are enabled when such a heterogeneous interconnect is employed for coherence traffic. For example, when employing a directory-based protocol, on a cache write miss, the requesting processor may have to wait for data from the home node (a two hop transaction) and for acknowledgments from other sharers of the block (a three hop transaction). Since the acknowledgments are on the critical path and have low bandwidth needs, they can be mapped to wires optimized for delay, while the data block transfer is not on the critical path and can be mapped to wires that are optimized for low power.

The paper is organized as follows. We discuss related work in Section 2. Section 3 reviews techniques that enable different wire implementations and the design of a heterogeneous interconnect. Section 4 describes the proposed innovations that map coherence messages to different on-chip wires. Section 5 quantitatively evaluates these ideas and we conclude in Section 6.

2. Related Work

To the best of our knowledge, only three other bodies of work have attempted to exploit different types of interconnects at the microarchitecture level. Beckmann and Wood [8, 9] propose speeding up access to large L2 caches by introducing transmission lines between the cache controller and individual banks. Nelson et al. [39] propose using optical interconnects to reduce inter-cluster latencies in a clustered architecture where clusters are widely-spaced in an effort to alleviate power density.

A recent paper by Balasubramonian et al. [5] introduces the concept of a heterogeneous interconnect and applies it for register communication within a clustered architecture. A subset of load/store addresses are sent on low-latency wires to prefetch data out of the L1D cache, while non-critical register values are transmitted on low-power wires. A heterogeneous interconnect similar to the one in [5] has been applied to a different problem domain in this paper. The nature of cache coherence traffic and the optimizations it enables are very different from those of register traffic within a clustered microarchitecture. We have also improved upon the wire modeling methodology in [5] by modeling the latency and power for all the network components, including routers and latches. Our power modeling also takes into account the additional overhead incurred due to the heterogeneous network, such as additional buffers within routers.

3. Wire Implementations

We begin with a quick review of factors that influence wire properties. It is well-known that the delay of a wire is a function of its RC time constant (R is resistance and C is capacitance). Resistance per unit length is (approximately) inversely proportional to the width of the wire [19]. Likewise, a fraction of the capacitance per unit length is inversely proportional to the spacing between wires, and a fraction is directly proportional to wire width. These wire properties provide an opportunity to design wires that trade off bandwidth and latency. By allocating more metal area per wire and increasing wire width and spacing, the net effect is a reduction in the RC time constant. This leads to a wire design that has favorable latency properties, but poor bandwidth properties (as fewer wires can be accommodated in a fixed metal area). In certain cases, nearly a three-fold reduction in wire latency can be achieved, at the expense of a four-fold reduction in bandwidth. Further, researchers are actively pursuing transmission line implementations that enable extremely low communication latencies [12, 16]. However, transmission lines also entail significant metal area overheads in addition to logic overheads for sending and receiving [8, 12]. If transmission line implementations become cost-effective at future technologies, they represent another attractive wire design point that can trade off bandwidth for low latency.

Similar trade-offs can be made between latency and power consumed by wires. Global wires are usually composed of multiple smaller segments that are connected with repeaters [4]. The size and spacing of repeaters influences wire delay and power consumed by the wire. When smaller and fewer repeaters are employed, wire delay increases, but power consumption is reduced. The repeater configuration that minimizes delay is typically very different from the repeater configuration that minimizes power consumption. Banerjee et al. [6] show that at 50nm technology, a five-fold reduction in power can be achieved at the expense of a two-fold increase in latency.

Thus, by varying properties such as wire width/spacing and repeater size/spacing, we can implement wires with different latency, bandwidth, and power properties. Consider a CMOS process where global inter-core wires are routed on the 8X and 4X metal planes. Note that the primary differences between minimum-width wires in the 8X and 4X planes are their width, height, and spacing. We will refer to these minimum-width wires as baseline B-Wires (either 8X-B-Wires or 4X-B-Wires). In addition to these wires, we will design two more wire types that may be potentially beneficial (summarized in Figure 1). A low-latency L-Wire can be designed by increasing the width and spacing of the wire on the 8X plane (by a factor of four). A power-efficient PW-Wire is designed by decreasing the number and size of repeaters within minimum-width wires on the 4X plane.

Figure 1. Examples of different wire implementations: L (delay optimized, low bandwidth), B-8X (baseline, low latency), B-4X (baseline, high bandwidth), and PW (power-optimized, high bandwidth, high delay). Power optimized wires have fewer and smaller repeaters, while bandwidth optimized wires have narrow widths and spacing.
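To make these trade-offs concrete, the sketch below (ours, not part of the original study) encodes the three wire classes as simple records, using approximate relative latency and area figures that appear later in Table 3; the per-bit energy ratios are rough values derived from that table's dynamic-power column and are meant only as placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WireClass:
    name: str            # wire class label
    rel_latency: float   # per-hop latency relative to a baseline 8X B-Wire
    rel_area: float      # metal area per wire (width + spacing) relative to an 8X B-Wire
    rel_energy: float    # rough per-bit dynamic energy relative to an 8X B-Wire (assumption)

# Approximate ratios taken from Table 3 later in the paper.
B_WIRE  = WireClass("B-Wire (8X)",  rel_latency=1.0, rel_area=1.0, rel_energy=1.0)
L_WIRE  = WireClass("L-Wire (8X)",  rel_latency=0.5, rel_area=4.0, rel_energy=0.55)
PW_WIRE = WireClass("PW-Wire (4X)", rel_latency=3.2, rel_area=0.5, rel_energy=0.33)

def wires_per_link(metal_area_units: float, wire: WireClass) -> int:
    """How many wires of one class fit in a fixed per-link metal budget."""
    return int(metal_area_units // wire.rel_area)

if __name__ == "__main__":
    budget = 600.0  # baseline link: 600 minimum-width B-Wires (Section 5.1.2)
    for w in (B_WIRE, L_WIRE, PW_WIRE):
        print(f"{w.name:13s}: latency x{w.rel_latency}, "
              f"{wires_per_link(budget, w)} wires fit in the baseline metal budget")
```

The point of the exercise is simply that, for a fixed metal budget, choosing lower latency means accepting fewer (and hence narrower) logical channels, while choosing lower power means accepting longer latency.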

While a traditional architecture would employ the entire available metal area for B-Wires (either 4X or 8X), we propose the design of a heterogeneous interconnect, where part of the available metal area is employed for B-Wires, part for L-Wires, and part for PW-Wires. Thus, any data transfer has the option of using one of three sets of wires to effect the communication. A typical composition of a heterogeneous interconnect may be as follows: 256 B-Wires, 512 PW-Wires, 24 L-Wires. In the next section, we will demonstrate how these options can be exploited to improve performance and reduce power consumption. We will also examine the complexity introduced by a heterogeneous interconnect.

4. Optimizing Coherence Traffic

For each cache coherence protocol, there exist a variety of coherence operations with different bandwidth and latency needs. Because of this diversity, there are many opportunities to improve performance and power characteristics by employing a heterogeneous interconnect. The goal of this section is to present a comprehensive listing of such opportunities. We focus on protocol-specific optimizations in Section 4.1 and on protocol-independent techniques in Section 4.2. We discuss the implementation complexity of these techniques in Section 4.3.

4.1 Protocol-Dependent Techniques

We first examine the characteristics of operations in both directory-based and snooping bus-based coherence protocols and how they can map to different sets of wires. In a bus-based protocol, the ability of a cache to directly respond to another cache's request leads to low L1 cache-to-cache miss latencies. L2 cache latencies are relatively high as a processor core has to acquire the bus before sending a request to L2. It is difficult to support a large number of processor cores with a single bus due to the bandwidth and electrical limits of a centralized bus [11]. In a directory-based design [13, 28], each L1 connects to the L2 cache through a point-to-point link. This design has low L2 hit latency and scales better. However, each L1 cache-to-cache miss must be forwarded by the L2 cache, which implies high L1 cache-to-cache latencies. The performance comparison between these two design choices depends on the cache sizes, miss rates, number of outstanding memory requests, working-set sizes, sharing behavior of targeted benchmarks, etc. Since either option may be attractive to chip manufacturers, we will consider both forms of coherence protocols in our study.

Write-Invalidate Directory-Based Protocol
Write-invalidate directory-based protocols have been implemented in existing dual-core CMPs [41] and will likely be used in larger scale CMPs as well. In a directory-based protocol, every cache line has a directory where the states of the block in all L1s are stored. Whenever a request misses in an L1 cache, a coherence message is sent to the directory at the L2 to check the cache line's global state. If there is a clean copy in the L2 and the request is a READ, it is served by the L2 cache. Otherwise, another L1 must hold an exclusive copy and the READ request is forwarded to the exclusive owner, which supplies the data. For a WRITE request, if any other L1 caches hold a copy of the cache line, coherence messages are sent to each of them requesting that they invalidate their copies. The requesting L1 cache acquires the block in exclusive state only after all invalidation messages have been acknowledged.

Hop imbalance is quite common in a directory-based protocol. To exploit this imbalance, we can send critical messages on fast wires to increase performance and send non-critical messages on slow wires to save power. For the sake of this discussion, we assume that the hop latencies of different wires are in the following ratio: L-wire : B-wire : PW-wire :: 1 : 2 : 3.

Proposal I: Read exclusive request for block in shared state
In this case, the L2 cache's copy is clean, so it provides the data to the requesting L1 and invalidates all shared copies. When the requesting L1 receives the reply message from the L2, it collects invalidation acknowledgment messages from the other L1s before returning the data to the processor core¹. Figure 2 depicts all generated messages.

Figure 2. Read exclusive request for a shared block in MESI protocol: (1) Processor 1 attempts a write and sends a Rd-Exc to the directory; (2) the directory finds the block in Shared state in Cache 2 and sends a clean copy of the cache block to Cache 1; (3) the directory sends an invalidate message to Cache 2; (4) Cache 2 sends an invalidate acknowledgement back to Cache 1.

The reply message from the L2 requires only one hop, while the invalidation process requires two hops – an example of hop imbalance. Since there is no benefit to receiving the cache line early, latencies for each hop can be chosen so as to equalize communication latency for the cache line and the acknowledgment messages. Acknowledgment messages include identifiers so they can be matched against the outstanding request in the L1's MSHR. Since there are only a few outstanding requests in the system, the identifier requires few bits, allowing the acknowledgment to be transferred on a few low-latency L-Wires. Simultaneously, the data block transmission from the L2 can happen on low-power PW-Wires and still finish before the arrival of the acknowledgments. This strategy improves performance (because acknowledgments are often on the critical path) and reduces power consumption (because the data block is now transferred on power-efficient wires). While circuit designers have frequently employed different types of wires within a circuit to reduce power dissipation without extending the critical path, the proposals in this paper represent some of the first attempts to exploit wire properties at the architectural level.

¹Some coherence protocols may not impose all of these constraints, thereby deviating from a sequentially consistent memory model.

Proposal II: Read request for block in exclusive state
In this case, the value in the L2 is likely to be stale and the following protocol actions are taken. The L2 cache sends a speculative data reply to the requesting L1 and forwards the read request as an intervention message to the exclusive owner. If the cache copy in the exclusive owner is clean, an acknowledgment message is sent to the requesting L1, indicating that the speculative data reply from the L2 is valid. If the cache copy is dirty, a response message with the latest data is sent to the requesting L1 and a write-back message is sent to the L2. Since the requesting L1 cannot proceed until it receives a message from the exclusive owner, the speculative data reply from the L2 (a single hop transfer) can be sent on slower PW-Wires. The forwarded request to the exclusive owner is on the critical path, but includes the block address. It is therefore not eligible for transfer on low-bandwidth L-Wires. If the owner's copy is in the exclusive clean state, a short acknowledgment message to the requestor can be sent on L-Wires. If the owner's copy is dirty, the cache block can be sent over B-Wires, while the low priority writeback to the L2 can happen on PW-Wires. With the above mapping, we accelerate the critical path by using faster L-Wires, while also lowering power consumption by sending non-critical data on PW-Wires. The above protocol actions apply even in the case when a read-exclusive request is made for a block in the exclusive state.

Proposal III: NACK messages
When the directory state is busy, incoming requests are often NACKed by the home directory, i.e., a negative acknowledgment is sent to the requester rather than buffering the request. Typically the requesting cache controller re-issues the request and the request is serialized in the order in which it is actually accepted by the directory. A NACK message can be matched by comparing the request id (MSHR index) rather than the full address, so a NACK is eligible for transfer on low-bandwidth L-Wires. If load at the home directory is low, it will likely be able to serve the request when it arrives again, in which case sending the NACK on fast L-Wires can improve performance. In contrast, when load is high, frequent backoff-and-retry cycles are experienced, and fast NACKs only increase traffic levels without providing any performance benefit. In this case, in order to save power, NACKs can be sent on PW-Wires.

Proposal IV: Unblock and write control messages
Some protocols [36] employ unblock and write control messages to reduce implementation complexity. For every read transaction, a processor first sends a request message that changes the L2 cache state into a transient state. After receiving the data reply, it sends an unblock message to change the L2 cache state back to a stable state. Similarly, write control messages are used to implement a 3-phase writeback transaction. A processor first sends a control message to the directory to order the writeback message with other request messages. After receiving the writeback response from the directory, the processor sends the data. This avoids a race condition in which the processor sends the writeback data while a request is being forwarded to it. Sending unblock messages on L-Wires can improve performance by reducing the time cache lines are in busy states. Write control messages (writeback request and writeback grant) are not on the critical path, although they are also eligible for transfer on L-Wires. The choice of sending writeback control messages on L-Wires or PW-Wires represents a power-performance trade-off.

Write-Invalidate Bus-Based Protocol
We next examine techniques that apply to bus-based snooping protocols.

Proposal V: Signal wires
In a bus-based system, three wired-OR signals are typically employed to avoid involving the lower/slower memory hierarchy [15]. Two of these signals are responsible for reporting the state of snoop results and the third indicates that the snoop result is valid. The first signal is asserted when any L1 cache, besides the requester, has a copy of the block. The second signal is asserted if any cache has the block in exclusive state. The third signal is an inhibit signal, asserted until all caches have completed their snoop operations. When the third signal is asserted, the requesting L1 and the L2 can safely examine the other two signals. Since all of these signals are on the critical path, implementing them using low-latency L-Wires can improve performance.

Proposal VI: Voting wires
Another design choice is whether to use cache-to-cache transfers if the data is in the shared state in a cache. The Silicon Graphics Challenge [17] and the Sun Enterprise use cache-to-cache transfers only for data in the modified state, in which case there is a single supplier. On the other hand, in the full Illinois MESI protocol, a block can be preferentially retrieved from another cache rather than from memory. However, when multiple caches share a copy, a "voting" mechanism is required to decide which cache will supply the data, and this voting mechanism can benefit from the use of low latency wires.

4.2. Protocol-independent Techniques

Proposal VII: Narrow Bit-Width Operands for Synchronization Variables
Synchronization is one of the most important factors in the performance of a parallel application. Synchronization is not only often on the critical path, but it also contributes a large percentage (up to 40%) of coherence misses [30]. Locks and barriers are the two most widely used synchronization constructs. Both of them use small integers to implement mutual exclusion. Locks often toggle the synchronization variable between zero and one, while barriers often linearly increase a barrier variable from zero to the number of processors taking part in the barrier operation. Such data transfers have limited bandwidth needs and can benefit from using L-Wires.

This optimization can be further extended by examining the general problem of cache line compaction. For example, if a cache line is comprised mostly of 0 bits, trivial data compaction algorithms may reduce the bandwidth needs of the cache line, allowing it to be transferred on L-Wires instead of B-Wires. If the wire latency difference between the two wire implementations is greater than the delay of the compaction/de-compaction algorithm, performance improvements are possible.

Proposal VIII: Assigning Writeback Data to PW-Wires
Writeback data transfers result from cache replacements or external request/intervention messages. Since writeback messages are rarely on the critical path, assigning them to PW-Wires can save power without incurring significant performance penalties.

Proposal IX: Assigning Narrow Messages to L-Wires
Coherence messages that include the data block address or the data block itself are many bytes wide. However, many other messages, such as acknowledgments and NACKs, do not include the address or data block and only contain control information (source/destination, message type, MSHR id, etc.). Such narrow messages can always be assigned to low latency L-Wires to accelerate the critical path.

4.3. Implementation Complexity

4.3.1 Overhead in Heterogeneous Interconnect Implementation

In a conventional multiprocessor interconnect, a subset of wires are employed for addresses, a subset for data, and a subset for control signals. Every bit of communication is mapped to a unique wire. When employing a heterogeneous interconnect, a communication bit can map to multiple wires. For example, data returned by the L2 in response to a read-exclusive request may map to B-Wires or PW-Wires depending on whether there are other sharers for that block (Proposal I). Thus, every wire must be associated with a multiplexor and de-multiplexor.

The entire network operates at the same fixed clock frequency, which means that the number of latches within every link is a function of the link latency. Therefore, PW-Wires have to employ additional latches, relative to the baseline B-Wires. Dynamic power per latch at 5GHz and 65nm technology is calculated to be 0.1mW, while leakage power per latch equals 19.8µW [25]. The power per unit length for each wire is computed in the next section. Power overheads due to these latches for different wires are tabulated in Table 1. Latches impose a 2% overhead within B-Wires, but a 13% overhead within PW-Wires.

The proposed model also introduces additional complexity in the routing logic. The base case router employs a cross-bar switch and 8-entry message buffers at each input port. Whenever a message arrives, it is stored in the input buffer and routed to an allocator that locates the output port and transfers the message. In case of a heterogeneous model, three different buffers are required at each port to store L, B, and PW messages separately. In our simulations we employ three 4-entry message buffers for each port. The size of each buffer is proportional to the flit size of the corresponding set of wires. For example, a set of 24 L-Wires employs a 4-entry message buffer with a word size of 24 bits. For power calculations we have also included the fixed additional overhead associated with these small buffers as opposed to a single larger buffer employed in the base case.

In our proposed processor model, the dynamic characterization of messages happens only in the processors and intermediate network routers cannot re-assign a message to a different set of wires. While this may have a negative effect on performance in a highly utilized network, we chose to keep the routers simple and not implement such a feature.
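The wire-selection rules of Sections 4.1 and 4.2, together with the simple checks described in Section 4.3.2 below, amount to a small amount of combinational logic at the sender. The following sketch is our own illustration of that decision process, not the authors' implementation; the message-type names and the single congestion flag are assumptions made for the example.

```python
from enum import Enum, auto

class Wire(Enum):
    L = auto()   # low-latency, low-bandwidth
    B = auto()   # baseline
    PW = auto()  # power-efficient, high-latency

# Illustrative message kinds; a real protocol has many more.
NARROW_CONTROL = {"ACK", "NACK", "UNBLOCK", "WB_REQUEST", "WB_GRANT"}

def fits_on_l_wires(payload_bits: int, l_wire_count: int = 24) -> bool:
    """Proposals VII/IX: a message (or compacted operand) must fit in one
    L-Wire flit to be eligible for the narrow low-latency wires."""
    return payload_bits <= l_wire_count

def select_wires(msg_type: str, payload_bits: int,
                 block_has_sharers: bool = False,
                 network_congested: bool = False) -> Wire:
    # Proposal III: under high load, fast NACKs only add traffic; save power instead.
    if msg_type == "NACK":
        return Wire.PW if network_congested else Wire.L
    # Proposals IV and IX: narrow control messages ride the critical path on L-Wires.
    if msg_type in NARROW_CONTROL and fits_on_l_wires(payload_bits):
        return Wire.L
    # Proposal VIII: writeback data is rarely critical.
    if msg_type == "WB_DATA":
        return Wire.PW
    # Proposal I: a data reply that must wait for invalidation acks
    # (the block had sharers) can use slow PW-Wires; otherwise use B-Wires.
    if msg_type == "DATA_REPLY":
        return Wire.PW if block_has_sharers else Wire.B
    return Wire.B
```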

For a network employing virtual channel flow control, each set of wires in the heterogeneous network link is treated as a separate physical channel and the same number of virtual channels are maintained per physical channel. Therefore, the heterogeneous network has a larger total number of virtual channels and the routers require more state fields to keep track of these additional virtual channels. To summarize, the additional overhead introduced by the heterogeneous model comes in the form of potentially more latches and greater routing complexity.

4.3.2 Overhead in Decision Process

The decision process in selecting the right set of wires is minimal. For example, in Proposal I, an OR function on the directory state for that block is enough to select either B- or PW-Wires. In Proposal II, the decision process involves a check to determine if the block is in the exclusive state. To support Proposal III, we need a mechanism that tracks the level of congestion in the network (for example, the number of buffered outstanding messages). There is no decision process involved for Proposals IV, V, VI, and VIII. Proposals VII and IX require logic to compute the width of an operand, similar to logic used in the PowerPC 603 [18] to determine the latency for integer multiply.

4.3.3 Overhead in Cache Coherence Protocols

Most coherence protocols are already designed to be robust in the face of variable delays for different messages. For protocols relying on message order within a virtual channel, each virtual channel can be made to consist of a set of L-, B-, and PW-message buffers. A multiplexor can be used to activate only one type of message buffer at a time to ensure correctness. For other protocols that are designed to handle message re-ordering within a virtual channel, we propose to employ one dedicated virtual channel for each set of wires to fully exploit the benefits of a heterogeneous interconnect. In all proposed innovations, a data packet is not distributed across different sets of wires. Therefore, different components of an entity do not arrive at different periods of time, thereby eliminating any timing problems. It may be worth considering sending the critical word of a cache line on L-Wires and the rest of the cache line on PW-Wires. Such a proposal may entail non-trivial complexity to handle corner cases and is not discussed further in this paper.

In a snooping bus-based coherence protocol, transactions are serialized by the order in which addresses appear on the bus. None of our proposed innovations for snooping protocols affect the transmission of address bits (address bits are always transmitted on B-Wires), so the transaction serialization model is preserved.

5. Results

5.1. Methodology

5.1.1 Simulator

We simulate a 16-core CMP with the Virtutech Simics full-system functional execution-driven simulator [33] and a timing infrastructure GEMS [34]. GEMS can simulate both in-order and out-of-order processors. In most studies, we use the in-order blocking processor model provided by Simics to drive the detailed memory model (Ruby) for fast simulation. Ruby implements a one-level MOESI directory cache coherence protocol with migratory sharing optimization [14, 40]. All processor cores share a non-inclusive L2 cache, which is organized as a non-uniform cache architecture (NUCA) [22]. Ruby can also be driven by an out-of-order processor module called Opal, and we report the impact of the processor cores on the heterogeneous interconnect in Section 5.3. Opal is a timing-first simulator that implements the performance sensitive aspects of an out-of-order processor but ultimately relies on Simics to provide functional correctness. We configure Opal to model the processor described in Table 2 and use an aggressive implementation of sequential consistency.

To test our ideas, we employ a workload consisting of all programs from the SPLASH-2 [43] benchmark suite. The programs were run to completion, but all experimental results reported in this paper are for the parallel phases of these applications. We use default input sets for most programs except fft and radix. Since the default working sets of these two programs are too small, we increase the working set of fft to 1M data points and that of radix to 4M keys.

5.1.2 Interconnect Power/Delay/Area Models

This section describes details of the interconnect architecture and the methodology we employ for calculating the area, delay, and power values of the interconnect. We consider 65nm process technology and assume 10 metal layers: 4 layers in the 1X plane and 2 layers in each of the 2X, 4X, and 8X planes [25]. For most of our study we employ a crossbar based hierarchical interconnect structure to connect the cores and L2 cache (Figure 3(a)), similar to that in SGI's NUMALink-4 [1].

Parameter                 | Value
number of cores           | 16
clock frequency           | 5 GHz
pipeline width            | 4-wide fetch and issue
pipeline stages           | 11
cache block size          | 64 Bytes
split L1 I & D cache      | 128KB, 4-way
shared L2 cache           | 8 MBytes, 4-way, 16 banks, non-inclusive NUCA
memory/dir controllers    | 30 cycles
interconnect link latency | 4 cycles (one-way) for the baseline 8X-B-Wires
DRAM latency              | 400 cycles
memory bank capacity      | 1 GByte per bank
latency to mem controller | 100 cycles

Table 2. System configuration.

Figure 3. Interconnect model used for coherence transactions in a sixteen-core CMP: (a) hierarchical network topology for the 16-core CMP (processors, 75-byte links, crossbars, and the L2 cache); (b) links made up of different sets of wires (B-Wires, L-Wires, and PW-Wires).
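Combining the link organization of Figure 3(b) with the router changes described in Section 4.3.1, each router input port keeps a separate small buffer per wire set, sized to that set's flit width. The sketch below is a minimal illustration of that structure under the link composition described in the following subsection (24 L-Wires, 256 B-Wires, 512 PW-Wires); it is our own simplification, not the simulated router.

```python
from collections import deque

# Flit widths in bits for the heterogeneous link of Figure 3(b):
# 24 L-Wires, 256 B-Wires (32 bytes), 512 PW-Wires (64 bytes).
FLIT_BITS = {"L": 24, "B": 256, "PW": 512}

class InputPort:
    """One router input port: a separate 4-entry buffer per wire set (Section 4.3.1)."""
    def __init__(self, entries_per_set: int = 4):
        self.buffers = {ws: deque() for ws in FLIT_BITS}
        self.entries_per_set = entries_per_set

    def accept(self, wire_set: str, flit: int) -> bool:
        """Store an arriving flit; returns False if that set's buffer is full
        (back-pressure would be applied on the corresponding virtual channel)."""
        assert flit.bit_length() <= FLIT_BITS[wire_set], "flit exceeds wire-set width"
        buf = self.buffers[wire_set]
        if len(buf) >= self.entries_per_set:
            return False
        buf.append(flit)
        return True
```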

The effect of other interconnect topologies is discussed in our sensitivity analysis. In the base case, each link in Figure 3(a) consists of (in each direction) 64-bit address wires, 64-byte data wires, and 24-bit control wires. The control signals carry source, destination, signal type, and Miss Status Holding Register (MSHR) id. All wires are fully pipelined. Thus, each link in the interconnect is capable of transferring 75 bytes in each direction. Error Correction Codes (ECC) account for another 13% overhead in addition to the above mentioned wires [38]. All the wires of the base case are routed as B-Wires in the 8X plane.

As shown in Figure 3(b), the proposed heterogeneous model employs additional wire types within each link. In addition to B-Wires, each link includes low-latency, low-bandwidth L-Wires and high-bandwidth, high-latency, power-efficient PW-Wires. The number of L- and PW-Wires that can be employed is a function of the available metal area and the needs of the coherence protocol. In order to match the metal area with the baseline, each unidirectional link within the heterogeneous model is designed to be made up of 24 L-Wires, 512 PW-Wires, and 256 B-Wires (the base case has 600 B-Wires, not counting ECC). In a cycle, three messages may be sent, one on each of the three sets of wires. The bandwidth, delay, and power calculations for these wires are discussed subsequently.

Table 3 summarizes the different types of wires and their area, delay, and power characteristics. The area overhead of the interconnect can be mainly attributed to repeaters and wires. We use wire width and spacing (based on ITRS projections) to calculate the effective area for minimum-width wires in the 4X and 8X plane. L-Wires are designed to occupy four times the area of minimum-width 8X-B-Wires.

Delay
Our wire model is based on the RC models proposed in [6, 19, 37]. The delay per unit length of a wire with optimally placed repeaters is given by equation (1), where Rwire is resistance per unit length of the wire, Cwire is capacitance per unit length of the wire, and FO1 is the fan-out-of-one delay:

Latency = 2.13 * sqrt(Rwire * Cwire * FO1)    (1)

Rwire is inversely proportional to wire width, while Cwire depends on the following three components: (i) fringing capacitance that accounts for the capacitance between the side wall of the wire and substrate, (ii) parallel plate capacitance between the top and bottom layers of the metal that is directly proportional to the width of the metal, and (iii) parallel plate capacitance between the adjacent metal wires that is inversely proportional to the spacing between the wires. The Cwire value for the topmost metal layer at 65nm technology is given by equation (2) [37], with width W and spacing S:

Cwire = 0.065 + 0.057*W + 0.015/S (fF/µm)    (2)

We derive relative delays for different types of wires by tuning width and spacing in the above equations. A variety of width and spacing values can allow L-Wires to yield a two-fold latency improvement at a four-fold area cost, relative to 8X-B-Wires. In order to reduce power consumption, we selected a wire implementation where the L-Wire's width was twice that of the minimum width and the spacing was six times as much as the minimum spacing for the 8X metal plane.

Wire Type          | Relative Latency | Relative Area (wire width + spacing) | Dynamic Power (W/m), α = switching factor | Static Power (W/m)
B-Wire (8X plane)  | 1x               | 1x                                   | 2.65α                                     | 1.0246
B-Wire (4X plane)  | 1.6x             | 0.5x                                 | 2.9α                                      | 1.1578
L-Wire (8X plane)  | 0.5x             | 4x                                   | 1.46α                                     | 0.5670
PW-Wire (4X plane) | 3.2x             | 0.5x                                 | 0.87α                                     | 0.3074

Table 3. Area, delay, and power characteristics of different wire implementations.
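Using the relative-area column of Table 3, one can verify that the heterogeneous link roughly matches the metal budget of the 600-wire baseline described above. The small sketch below, ours, does this bookkeeping; the roughly 608 "units" it reports for the heterogeneous link is within about 1% of the 600-unit baseline.

```python
# Relative per-wire metal area (wire width + spacing) from Table 3,
# normalized to a minimum-width 8X B-Wire.
REL_AREA = {"L-Wire (8X)": 4.0, "B-Wire (8X)": 1.0, "PW-Wire (4X)": 0.5}

# Heterogeneous link composition from Section 5.1.2.
hetero_link = {"L-Wire (8X)": 24, "B-Wire (8X)": 256, "PW-Wire (4X)": 512}

baseline_area = 600 * REL_AREA["B-Wire (8X)"]            # 600 B-Wires, ECC excluded
hetero_area = sum(n * REL_AREA[w] for w, n in hetero_link.items())

print(f"baseline link area      : {baseline_area:.0f} units")
print(f"heterogeneous link area : {hetero_area:.0f} units")  # ~608, i.e. about the same budget
```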

Power
The total power consumed by a wire is the sum of three components (dynamic, leakage, and short-circuit power). Equations derived by Banerjee and Mehrotra [6] are used to derive the power consumed by L- and B-Wires. These equations take into account optimal repeater size/spacing and wire width/spacing. PW-Wires are designed to have twice the delay of 4X-B-Wires. At 65nm technology, for a delay penalty of 100%, smaller and widely-spaced repeaters enable power reduction by 70% [6].

Routers
Crossbars, buffers, and arbiters are the major contributors to router power [42]:

Erouter = Ebuffer + Ecrossbar + Earbiter    (3)

The capacitance and energy for each of these components is based on analytical models proposed by Wang et al. [42]. We model a 5x5 matrix crossbar that employs a tristate buffer connector. As described in Section 4.3, buffers are modeled for each set of wires with word size corresponding to flit size. Table 4 shows the peak energy consumed by each component of the router for a single 32-byte transaction.

Component              | Energy/transaction (J)
Arbiter                | 6.43079e-14
Crossbar               | 5.32285e-12
Buffer read operation  | 1.23757e-12
Buffer write operation | 1.73723e-12

Table 4. Energy consumed (max) by arbiters, buffers, and crossbars for a 32-byte transfer.

5.2 Results

For our simulations, we restrict ourselves to directory-based protocols. We model the effect of proposals pertaining to such a protocol: I, III, IV, VIII, IX. Proposal II optimizes speculative reply messages in MESI protocols, which are not implemented within GEMS' MOESI protocol. Evaluations involving compaction of cache blocks (Proposal VII) are left as future work.

Figure 4 shows the execution time in cycles for SPLASH2 programs. The first bar shows the performance of the baseline organization that has one interconnect layer of 75 bytes, composed entirely of 8X-B-Wires. The second shows the performance of the heterogeneous interconnect model in which each link consists of 24-bit L-wires, 32-byte B-wires, and 64-byte PW-wires. Programs such as LU-Non-continuous, Ocean-Non-continuous, and Raytracing yield significant improvements in performance. These performance numbers can be analyzed with the help of Figure 5 that shows the distribution of different transfers that happen on the interconnect. Transfers on L-Wires can have a huge impact on performance, provided they are on the program critical path. LU-Non-continuous, Ocean-Non-continuous, Ocean-Continuous, and Raytracing experience the most transfers on L-Wires. But the performance improvement of Ocean-Continuous is very low compared to other benchmarks. This can be attributed to the fact that Ocean-Continuous incurs the most L2 cache misses and is mostly memory bound. The transfers on PW-Wires have a negligible effect on performance for all benchmarks. This is because PW-Wires are employed only for writeback transfers that are always off the critical path. On average, we observe an 11.2% improvement in performance, compared to the baseline, by employing heterogeneity within the network.

Figure 4. Speedup of heterogeneous interconnect.

Proposals I, III, IV, and IX exploit L-Wires to send small messages within the protocol, and contribute 2.3, 0, 60.3, and 37.4 per cent, respectively, to total L-Wire traffic.

Figure 5. Distribution of messages on the heterogeneous network (L messages, B messages, and PW messages). B-Wire transfers are classified as Request and Data.

Figure 6. Distribution of L-message transfers across different proposals (Proposals I, III, IV, and IX).

A per-benchmark breakdown is shown in Figure 6. Proposal I optimizes the case of a read exclusive request for a block in shared state, which is not very common in the SPLASH2 benchmarks. We expect the impact of Proposal I to be much higher in commercial workloads where cache-to-cache misses dominate. Proposal III and Proposal IV impact NACK, unblocking, and writecontrol messages. These messages are widely used to reduce the implementation complexity of coherence protocols. In GEMS' MOESI protocol, NACK messages are only used to handle the race condition between two write-back messages, which are negligible in our study (causing the zero contribution of Proposal III). Instead, the protocol implementation heavily relies on unblocking and writecontrol messages to maintain the order between read and write transactions, as discussed in Section 4.1. The frequency of occurrence of NACK, unblocking, and writecontrol messages depends on the protocol implementation, but we expect the sum of these messages to be relatively constant in different protocols and to play an important role in L-wire optimizations. Proposal IX includes all other acknowledgment messages eligible for transfer on L-Wires.

We observed that the combination of proposals I, III, IV, and IX caused a performance improvement greater than the sum of improvements from each individual proposal. A parallel benchmark can be divided into a number of phases by synchronization variables (barriers), and the execution time of each phase can be defined as the longest time any thread spends from one barrier to the next. Optimizations applied to a single thread may have no effect if there are other threads on the critical path. However, a different optimization may apply to the threads on the critical path, reduce their execution time, and expose the performance of other threads and the optimizations that apply to them. Since different threads take different data paths, most parallel applications show nontrivial workload imbalance [31]. Therefore, employing one proposal might not speed up all threads on the critical path, but employing all applicable proposals can probably optimize threads on every path, thereby reducing the total barrier-to-barrier time.

Figure 7 shows the improvement in network energy due to the heterogeneous interconnect model. The first bar shows the reduction in network energy and the second bar shows the improvement in the overall processor Energy x Delay^2 (ED^2) metric. Other metrics in the E-D space can also be computed with data in Figures 7 and 4. To calculate ED^2, we assume that the total power consumption of the chip is 200W, of which the network power accounts for 60W. The energy improvement in the heterogeneous case comes from both L and PW transfers. Many control messages that are sent on B-Wires in the base case are sent on L-Wires in the heterogeneous case. As per Table 3, the energy consumed by an L-Wire is less than the energy consumed by a B-Wire. But due to the small sizes of these messages, the contribution of L-messages to the total energy savings is negligible. Overall, the heterogeneous network results in a 22% saving in network energy and a 30% improvement in ED^2.

Figure 7. Improvement in link energy and ED^2.
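The chip-level ED^2 number can be approximated from the figures quoted above: a 200W chip with 60W of network power, a 22% network energy saving, and an 11.2% average speedup. The arithmetic below is ours (treating energy as power times execution time) and lands near the reported ~30% ED^2 improvement; the paper's value is computed per benchmark, so an exact match is not expected.

```python
chip_power_w    = 200.0   # assumed total chip power
network_power_w = 60.0    # assumed network share of chip power
net_energy_saving = 0.22  # network energy reduction (Figure 7)
speedup = 1.112           # average performance improvement (Figure 4)

time_ratio  = 1.0 / speedup                                   # new / old execution time
power_ratio = (chip_power_w - net_energy_saving * network_power_w) / chip_power_w
energy_ratio = power_ratio * time_ratio                        # E = P * t
ed2_ratio = energy_ratio * time_ratio ** 2                     # ED^2 = E * D^2

print(f"ED^2 ratio ~{ed2_ratio:.2f} -> ~{100 * (1 - ed2_ratio):.0f}% improvement")
```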

5.3 Sensitivity Analysis

In this sub-section, we discuss the impact of processor cores, link bandwidth, routing algorithm, and network topology on the heterogeneous interconnect.

Out-of-order/In-order Processors
To test our ideas with an out-of-order processor, we configure Opal to model the processor described in Table 2 and only report the results of the first 100M instructions in the parallel sections². Figure 8 shows the performance speedup of the heterogeneous interconnect over the baseline. All benchmarks except Ocean-Noncontinuous demonstrate different degrees of performance improvement, which leads to an average speedup of 9.3%. The average performance improvement is less than what we observe in a system employing in-order cores (11.2%). This can be attributed to the greater tolerance that an out-of-order processor has to long instruction latencies.

²Simulating the entire program takes nearly a week and there exist no effective toolkits to find the representative phases for parallel benchmarks. LU-Noncontinuous and Radix were not compatible with the Opal timing module.

Figure 8. Speedup of heterogeneous interconnect when driven by OoO cores (Opal and Ruby).

Link Bandwidth
The heterogeneous network poses more constraints on the type of messages that can be issued by a processor in a cycle. It is therefore likely to not perform very well in a bandwidth-constrained system. To verify this, we modeled a base case where every link has only 80 8X-B-Wires and a heterogeneous case where every link is composed of 24 L-Wires, 24 8X-B-Wires, and 48 PW-Wires (almost twice the metal area of the new base case). Benchmarks with higher network utilizations suffered significant performance losses. In our experiments raytracing has the maximum messages/cycle ratio and the heterogeneous case suffered a 27% performance loss, compared to the base case (in spite of having twice the metal area). The heterogeneous interconnect performance improvement for Ocean Non-continuous and LU Non-continuous is 12% and 11%, as against 39% and 20% in the high-bandwidth simulations. Overall, the heterogeneous model performed 1.5% worse than the base case.

Routing Algorithm
Our simulations thus far have employed adaptive routing within the network. Adaptive routing alleviates the contention problem by dynamically routing messages based on the network traffic. We found that deterministic routing degraded performance by about 3% for most programs for systems with the baseline and with the heterogeneous network. Raytracing is the only benchmark that incurs a significant performance penalty of 27% for both networks.

Network Topology
Our default interconnect thus far was a two-level tree based on SGI's NUMALink-4 [1]. To test the sensitivity of our results to the network topology, we also examine a 2D-torus interconnect resembling that in the Alpha 21364 [7]. As shown in Figure 9, each router connects to 4 links that connect to 4 neighbors in the torus, and wraparound links are employed to connect routers on the boundary.

Our proposed mechanisms show much less performance benefit (1.3% on average) in the 2D torus interconnect than in the two-level tree interconnect. The main reason is that our decision process in selecting the right set of wires calculates hop imbalance at the coherence protocol level without considering the physical hops a message takes on the mapped topology. For example, in a 3-hop transaction as shown in Figure 2, the one-hop message may take 4 physical hops while the 2-hop message may also take 4 physical hops. In this case, sending the 2-hop message on the L-Wires and the one-hop message on the PW-Wires will actually lower performance. This is not a first-order effect in the two-level tree interconnect, where most hops take 4 physical hops. However, the average distance between two processors in the 2D torus interconnect is 2.13 physical hops with a standard deviation of 0.92 hops. In an interconnect with such high standard deviation, calculating hop imbalance based on protocol hops is inaccurate. For future work, we plan to develop a more accurate decision process that considers source id, destination id, and interconnect topology to dynamically compute an optimal mapping to wires.


Figure 9. Results for the 2D Torus: (a) 2D Torus topology; (b) heterogeneous interconnect speedup.

6. Conclusions and Future Work

Coherence traffic in a chip multiprocessor has diverse needs. Some messages can tolerate long latencies, while others are on the program critical path. Further, messages have varied bandwidth demands. On-chip global wires can be designed to optimize latency, bandwidth, or power. We advocate partitioning available metal area across different wire implementations and intelligently mapping data to the set of wires best suited for its communication. This paper presents numerous novel techniques that can exploit a heterogeneous interconnect to simultaneously improve performance and reduce power consumption.

Our evaluation of a subset of the proposed techniques shows that a large fraction of messages have low bandwidth needs and can be transmitted on low latency wires, thereby yielding a performance improvement of 11.2%. At the same time, a 22.5% reduction in interconnect energy is observed by transmitting non-critical data on power-efficient wires. The complexity cost is marginal as the mapping of messages to wires entails simple logic.

For future work, we plan to strengthen our decision process in calculating the hop imbalance based on the topology of the interconnect. We will also evaluate the potential of other techniques listed in this paper. There may be several other applications of heterogeneous interconnects within a CMP. For example, in the Dynamic Self Invalidation scheme proposed by Lebeck et al. [29], the self-invalidate [27, 29] messages can be effected through power-efficient PW-Wires. In a processor model implementing token coherence, the low-bandwidth token messages [35] are often on the critical path and thus can be effected on L-Wires. A recent study by Huh et al. [21] reduces the frequency of false sharing by employing incoherent data. For cache lines suffering from false sharing, only the sharing states need to be propagated, and such messages are a good match for low-bandwidth L-Wires.

References

[1] SGI 3000 Configuration. http://www.sgi.com/products/servers/altix/configs.html.
[2] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato. The Use of Prediction for Accelerating Upgrade Misses in CC-NUMA Multiprocessors. In Proceedings of PACT-11, 2002.
[3] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of ISCA-27, pages 248-259, June 2000.
[4] H. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[5] R. Balasubramonian, N. Muralimanohar, K. Ramani, and V. Venkatachalapathy. Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. In Proceedings of HPCA-11, February 2005.
[6] K. Banerjee and A. Mehrotra. A Power-optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs. IEEE Transactions on Electron Devices, 49(11):2001-2007, November 2002.
[7] P. Bannon. Alpha 21364: A Scalable Single-Chip SMP. October 1998.
[8] B. Beckmann and D. Wood. TLC: Transmission Line Caches. In Proceedings of MICRO-36, December 2003.
[9] B. Beckmann and D. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of MICRO-37, December 2004.
[10] E. E. Bilir, R. M. Dickson, Y. Hu, M. Plakal, D. J. Sorin, M. D. Hill, and D. A. Wood. Multicast Snooping: A New Coherence Method using a Multicast Address Network. SIGARCH Comput. Archit. News, pages 294-304, 1999.
[11] F. A. Briggs, M. Cekleov, K. Creta, M. Khare, S. Kulick, A. Kumar, L. P. Looi, C. Natarajan, S. Radhakrishnan, and L. Rankin. Intel 870: A Building Block for Cost-Effective, Scalable Servers. IEEE Micro, 22(2):36-47, 2002.
[12] R. Chang, N. Talwalkar, C. Yue, and S. Wong. Near Speed-of-Light Signaling Over On-Chip Electrical Interconnects. IEEE Journal of Solid-State Circuits, 38(5):834-838, May 2003.
[13] Corporate Institute of Electrical and Electronics Engineers, Inc. Staff. IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992. 1993.
[14] A. Cox and R. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. pages 98-108, May 1993.
[15] D. E. Culler and J. P. Singh. Parallel Computer Architecture: a Hardware/Software Approach. Morgan Kaufmann Publishers, Inc, 1999.
[16] W. Dally and J. Poulton. Digital System Engineering. Cambridge University Press, Cambridge, UK, 1998.
[17] M. Galles and E. Williams. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. In HICSS (1), pages 134-143, 1994.
[18] G. Gerosa et al. A 2.2 W, 80 MHz Superscalar RISC Microprocessor. IEEE Journal of Solid-State Circuits, 29(12):1440-1454, December 1994.
[19] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, Vol. 89, No. 4, April 2001.
[20] P. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In Proceedings of HPCA-11 (Industrial Session), February 2005.
[21] J. Huh, J. Chang, D. Burger, and G. S. Sohi. Coherence Decoupling: Making Use of Incoherence. In Proceedings of ASPLOS-XI, pages 97-106, 2004.
[22] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler. A NUCA Substrate for Flexible CMP Cache Sharing. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 31-40, New York, NY, USA, 2005. ACM Press.
[23] P. Kongetira. A 32-Way Multithreaded SPARC Processor. In Proceedings of Hot Chips 16, 2004. (http://www.hotchips.org/archives/).
[24] K. Krewell. UltraSPARC IV Mirrors Predecessor: Sun Builds Dual-core Chip in 130nm. Microprocessor Report, pages 1, 5-6, November 2003.
[25] R. Kumar, V. Zyuban, and D. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads, and Scaling. In Proceedings of the 32nd ISCA, June 2005.
[26] A.-C. Lai and B. Falsafi. Memory Sharing Predictor: The Key to a Speculative Coherent DSM. In Proceedings of ISCA-26, 1999.
[27] A.-C. Lai and B. Falsafi. Selective, Accurate, and Timely Self-Invalidation Using Last-Touch Prediction. In Proceedings of ISCA-27, pages 139-148, 2000.
[28] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of ISCA-24, pages 241-251, June 1997.
[29] A. R. Lebeck and D. A. Wood. Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors. In Proceedings of ISCA-22, pages 48-59, 1995.
[30] K. M. Lepak and M. H. Lipasti. Temporally Silent Stores. In Proceedings of ASPLOS-X, pages 30-41, 2002.
[31] J. Li, J. F. Martinez, and M. C. Huang. The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors. In HPCA '04: Proceedings of the 10th International Symposium on High Performance Computer Architecture, page 14, Washington, DC, USA, 2004. IEEE Computer Society.
[32] N. Magen, A. Kolodny, U. Weiser, and N. Shamir. Interconnect Power Dissipation in a Microprocessor. In Proceedings of System Level Interconnect Prediction, February 2004.
[33] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50-58, February 2002.
[34] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset. Computer Architecture News, 2005.
[35] M. M. K. Martin, M. D. Hill, and D. A. Wood. Token Coherence: Decoupling Performance and Correctness. In Proceedings of ISCA-30, 2003.
[36] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In HPCA, pages 328-339, 2005.
[37] M. L. Mui, K. Banerjee, and A. Mehrotra. A Global Interconnect Optimization Scheme for Nanometer Scale VLSI With Implications for Latency, Bandwidth, and Power Dissipation. IEEE Transactions on Electron Devices, Vol. 51, No. 2, February 2004.
[38] S. Mukherjee, J. Emer, and S. Reinhardt. The Soft Error Problem: An Architectural Perspective. In Proceedings of HPCA-11 (Industrial Session), February 2005.
[39] N. Nelson, G. Briggs, M. Haurylau, G. Chen, H. Chen, D. Albonesi, E. Friedman, and P. Fauchet. Alleviating Thermal Constraints while Maintaining Performance Via Silicon-Based On-Chip Optical Interconnects. In Proceedings of Workshop on Unique Chips and Systems, March 2005.
[40] P. Stenström, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. pages 109-118, May 1993.
[41] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. Technical report, IBM Server Group Whitepaper, October 2001.
[42] H. S. Wang, L. S. Peh, and S. Malik. A Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers. In IEEE Micro, Vol. 24, No. 1, January 2003.
[43] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of ISCA-22, pages 24-36, June 1995.