Circuit-Switched Coherence
Total Page:16
File Type:pdf, Size:1020Kb
Circuit-Switched Coherence ‡Natalie Enright Jerger, ‡Mikko Lipasti, and ?Li-Shiuan Peh ‡Electrical and Computer Engineering Department, University of Wisconsin-Madison ?Department of Electrical Engineering, Princeton University Abstract—Circuit-switched networks can significantly lower contention, overall system performance can degrade by 20% the communication latency between processor cores, when or more. This latency sensitivity coupled with low link uti- compared to packet-switched networks, since once circuits are set up, communication latency approaches pure intercon- lization motivates our exploration of circuit-switched fabrics nect delay. However, if circuits are not frequently reused, the for CMPs. long set up time and poorer interconnect utilization can hurt Our investigations show that traditional circuit-switched overall performance. To combat this problem, we propose a hybrid router design which intermingles packet-switched networks do not perform well, as circuits are not reused suf- flits with circuit-switched flits. Additionally, we co-design a ficiently to amortize circuit setup delay. This observation prediction-based coherence protocol that leverages the exis- motivates a network with a hybrid router design that sup- tence of circuits to optimize pair-wise sharing between cores. The protocol allows pair-wise sharers to communicate di- ports both circuit and packet switching with very fast circuit rectly with each other via circuits and drives up circuit reuse. reconfiguration (setup/teardown). Our preliminary results Circuit-switched coherence provides overall system perfor- show this leading to up to 8% improvement in overall system mance improvements of up to 17% with an average improve- performance over a packet-switched fabric. ment of 10% and reduces network latency by up to 30%. As systems become more tightly coupled in multi-core I. Introduction architectures, co-design of system components becomes in- S per-chip device counts continue to increase, chip mul- creasingly important. In particular, coupling the design of Atiprocessors are becoming a reality. Shared buses and the on-chip network with the design of the coherence proto- dedicated wires do not provide the scalability needed to meet col can result in a symbiotic relationship providing superior the communication demands of future multi-core architec- performance. We have found that our workloads exhibit fre- tures. To date, designers have used packet-switched on-chip quent pair-wise sharing between cores. Prior work has also networks as the communication fabric for many-core chips shown that processor sharing exhibits temporal locality and [10, 16, 17]. While packet switching provides efficient use of is often limited to a small subset of processors [2, 7]. Such link bandwidth by interleaving packets on a single link, it application behavior inspires a prediction-based coherence adds higher router latency overhead. Alternatively, circuit- protocol that further drives up the reuse of circuits in our switching trades off poorer link utilization with much lower hybrid network and improves overall performance. The pro- latency, as data need not go through routing and arbitration tocol predicts the likely sharer for each memory request so once circuits are set up. the requester can go straight to the sharer for data, via a For the suite of commercial and scientific workloads eval- fast-path circuit, rather than having first to go to the home uated, the network latency of a 4x4 multi-core design can directory node. Our simulations show this improves overall have a high impact on performance (Figure 1) while the system performance by up to 17%. bandwidth demands placed on the network are very low. Figure 1 measures the change in overall system performance as the per-hop delay1 is increased from 1 to 11 cycles. When II. Hybrid Circuit-Switched Network a new packet is placed on a link, the number of concur- The key design goal of our hybrid circuit-switched network rent packets traversing that link is measured (including the 2 is to avoid the circuit setup delay of traditional circuit- new packet) . The average is very close to 1, illustrating switching. Our design consists of two separate mesh net- very low link contention given our simulation configuration works (the switch design is shown in Figure 2), the main (See Table I). This demonstrates that wide on-chip network data network and a tiny setup network. The main data channels are significantly underutilized for these workloads, network supports two types of traffic: circuit-switched and and that overall system performance is sensitive to inter- packet-switched. In the data network, there are C sepa- connect latency. An uncontended 5-cycle per-hop router rate physical channels, one for each circuit. To allow for delay in a packet-switched network can lead to 10% degra- a fair comparison, each of these C channels have 1/C the dation in overall system performance. As the per-hop delay bandwidth of the baseline packet-switched network in our increases, either due to deeper router pipelines or network evaluations. Manuscript submitted: 19 Feb. 2007. Manuscript accepted: 19 Mar. 2007. Final manuscript received: 29 Mar. 2007 Port Busy? 1 This comprises router pipeline delay and contention. Control Network 2This measurement corresponds to a frequently-used metric for eval- Buffer N Dest id log n Write Crossbar S uating topologies, channel load [6]. Circuit id log C Arbiter E W 1.25 Scientific 1.2 Commercial Input Buffer 1.15 Input Buffer Circuit Field Input Buffer Tagging 1.1 CheckCM Input Buffer Stage CM 1.05 Incoming CM Circuit Switched Incoming Incomingpackets Crossbar 1 packets Incomingpackets Data Network packets 0.95 Normalized Runtime Normalized 0.9 135711 Fig. 2. Switch Design Per Hop Delay Fig. 1. Effect of interconnect Latency A. Setup Network throughout the network, we add an extra bit field to each flit indicating if this flit is a circuit- or packet-switched flit. Like in a traditional circuit-switched network, the setup When a flit enters the router pipeline, the circuit field is network handles the construction and teardown of circuits checked (CFC). If the field is non-zero, this is a CS flit and and stores the switch configuration for active circuits. How- will proceed through the pipeline in Figure 3b, bypassing ever, as our hybrid network does not require waiting for directly to ST, which was already set to the appropriate out- acknowledgment that a circuit has been successfully con- put port when this circuit was originally established. The structed/torn down, data can be piggy-backed immediately tagging (T) stage flips the circuit bit for incoming data so behind the circuit setup request. When a setup request finds that flits originally intended for a circuit will now go into that no unused circuits are available, it will trigger a recon- the packet buffers. The tagging stage is enabled only when a figuration signal so that the LRU circuit will be reconfigured reconfiguration is needed at that router; this way, in future as packet-switched whilst the new circuit request will take hops, since the tagging stage is not enabled, the original over the old circuit. Incoming circuit-switched (CS) flits in- CS flits will stay packet-switched until they arrive at the tended for the old (reconfigured) circuit will henceforth be destination. tagged as packet-switched (PS) flits and will remain packet- switched until they reach their destination. A control flit on C. Packet-switched pipeline on data network the setup network will signal the source of the old circuit, that it must either stop sending CS flits or must re-establish If the circuit field is zero, this is a PS flit and will be the circuit to prevent buffer overflow due to too many CS buffered, proceeding through the packet-switched pipeline flits arriving at a reconfigured node. shown in Figure 3c. The allocator of the packet-switched The routers in the setup network have three pipeline pipeline is designed to enable PS flits to steal bandwidth stages, similar to that of our baseline PS router, except that from CS flits. It receives a signal from the input ports in- there is no speculation or virtual channel allocation (VA). dicating the presence or absence of incoming flits for the This 3-stage pipeline is based on an Intel design [12], which physical circuit C that the PS flit has been assigned to and found it not possible to fit the BW into the VA/SA stage is being stolen. If there are no incoming flits for that circuit, and still maintain an aggressive clock of 16FO4s. To lower the PS flit arbitrates for the switch. Once a PS flit wins pas- delay, we assume lookahead routing, thus removing the rout- sage through the crossbar, it then traverses the output port ing computation from the critical path. Virtual channels and goes to the next hop. The circuit field remains set to are not necessary as traffic on the setup network is low and zero, so that this flit will continue to be interpreted as a PS wormhole is sufficient with lower overhead. When a setup flit and buffered appropriately at the next hop. If a CS flit flit arrives, consisting of the destination field (log N, where enters the router while the PS flit is traversing the switch, N is the number of nodes) and the circuit number (log C, the CS request will have to be latched until it can proceed where C is the number of physical circuits), it will first be along its circuit at the next cycle. To prevent the unlikely written to an input buffer (BW). Next, it will go through scenario where PS flits are starved, a timeout is used to switch arbitration (SA), with each port having a C:1 allo- trigger the reconfiguration of a LRU circuit into PS flits, so cator.