Software Cache Coherence for Large Scale Multiprocessors (HPCA)
Leonidas I. Kontothanassis and Michael L. Scott
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226
{kthanasi, scott}@cs.rochester.edu

Abstract

Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space, can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly improve performance. For the programs in our test suite and with the changes in place, software coherence is often faster and never more than 55% slower than hardware coherence.

1 Introduction

Large scale multiprocessors can provide the computational power needed for some of the larger problems of science and engineering today. Shared memory provides an appealing programming model for such machines. To perform well, however, shared memory requires the use of caches, which in turn require a coherence mechanism to ensure that copies of data are up-to-date. Coherence is easy to achieve on small, bus-based machines, where every processor can see the memory traffic of the others [4, 11], but is substantially harder to achieve on large-scale multiprocessors [1, 15, 19]. It increases both the cost of the machine and the time and intellectual effort required to bring it to market. Given the speed of advances in microprocessor technology, long development times generally lead to machines with out-of-date processors. There is thus a strong motivation to find coherence mechanisms that will produce acceptable performance with little or no special hardware.¹

Unfortunately, the current state of the art in software coherence for message-passing machines provides performance nowhere close to that of hardware cache coherence. Our contribution is to demonstrate that most of the benefits of hardware cache coherence can be obtained on large machines simply by providing a global physical address space, with per-processor caches but without hardware cache coherence. Machines in this non-cache-coherent, non-uniform memory access (NCC-NUMA) class include the Cray Research T3D and the Princeton Shrimp [6]. In comparison to hardware-coherent machines, NCC-NUMAs can more easily be built from commodity parts, and can follow improvements in microprocessors and other hardware technologies closely.

We present a software coherence protocol for NCC-NUMA machines that scales well to large numbers of processors. To achieve the best possible performance, we exploit the global address space in three specific ways. First, we maintain directory information for the coherence protocol in uncached shared locations, and access it with ordinary loads and stores, avoiding the need to interrupt remote processors in almost all circumstances. Second, while using virtual memory to maintain coherence at the granularity of pages, we never copy pages. Instead, we map them remotely and allow the hardware to fetch cache lines on demand. Third, while allowing multiple writers for concurrency, we avoid the need to keep old copies and compute diffs by using ordinary hardware write-through or write-back to the unique main-memory copy of each page.

Full-scale hardware cache coherence may suffer less from false sharing, due to smaller coherence blocks, and can execute protocol operations in parallel with "real" computation. Current trends, however, are reducing the importance of each of these advantages: relaxed consistency, better compilers, and improvements in programming methodology can also reduce false sharing, and trends in both programming methodology and hardware technology are reducing the fraction of total computation time devoted to protocol execution. At the same time, software coherence enjoys the advantage of being able to employ techniques such as multiple writers, delayed write notices, and delayed invalidations, which may be too complicated to implement reliably in hardware at acceptable cost. Software coherence also offers the possibility of adapting protocols to individual programs or data regions with a level of flexibility that is difficult to duplicate in hardware. We exploit this advantage in part in our work by employing a relatively complicated eight-state protocol, by using uncached remote references for application-level data structures that are accessed at a very fine grain, and by introducing user-level annotations that can impact the behavior of the coherence protocol.

The rest of the paper is organized as follows. Section 2 describes our software coherence protocol and provides intuition for our algorithmic and architectural choices. Section 3 describes our experimental methodology and workload. We present performance results in section 4 and compare our work to other approaches in section 5. We compare our protocol to a variety of existing alternatives, including sequentially-consistent hardware, release-consistent hardware, straightforward sequentially-consistent software, and a coherence scheme for small-scale NCC-NUMAs due to Petersen and Li [21]. We also report on the impact of several architectural alternatives on the effectiveness of software coherence. These alternatives include the choice of write policy (write-through, write-back, write-through with a write-merge buffer) and the availability of a remote reference facility, which allows a processor to choose to access data directly in a remote location, by disabling caching. Finally, to obtain the full benefit of software coherence, we observe that minor program changes can be crucial. In particular, we identify the need to employ reader-writer locks, avoid certain interactions between program synchronization and the coherence protocol, and align data structures with page boundaries whenever possible. We summarize our findings and conclude in section 6.

* This work was supported in part by NSF Institutional Infrastructure grant no. CDA-8822724 and ONR research grant no. N00014-92-J-1801 (in conjunction with the DARPA Research in Information Science and Technology: High Performance Computing, Software Science and Technology program, ARPA Order no. 8930).

¹ We are speaking here of behavior-driven coherence, that is, mechanisms that move and replicate data at run time in response to observed patterns of program behavior, as opposed to compiler-based techniques [9].

2 The Software Coherence Protocol

In this section we present a scalable protocol for software cache coherence. As in most software coherence systems, we use virtual memory protection bits to enforce consistency at the granularity of pages. As in Munin [26], Treadmarks [14], and the work of Petersen and Li [20, 21], we allow more than one processor to write a page concurrently, and we use a variant of release consistency to limit coherence operations to synchronization points. (Between these points, processors can continue to use stale data in their caches.) As in the work of Petersen and Li, we exploit the global physical address space to move data at the granularity of cache lines: instead of copying pages we map them remotely, and allow the hardware to fetch cache lines on demand.

... the home node. In addition, each processor holds a local weak list that indicates which of the pages to which it has mappings are weak. When a processor takes a page fault it locks the coherent map entry representing the page on which the fault was taken. It then changes the coherent map entry to reflect the new state of the page. If necessary (i.e., if the page has made the transition from shared or dirty to weak) the processor updates the weak lists of all processors that have mappings for that page. It then unlocks the entry in the coherent map. The process of updating a processor's weak list is referred to as posting a write notice.

Use of a distributed coherent map and per-processor weak lists enhances scalability by minimizing memory contention and by avoiding the need for processors at acquire points to scan weak list entries in which they have no interest (something that would happen with a centralized weak list [20]). However, it may make the transition to the weak state very expensive, since a potentially large number of remote memory operations may have to be performed (serially) in order to notify all sharing processors. Ideally, we would like to maintain the low acquire overhead of per-processor weak lists while requiring only a constant amount of work per shared page on a transition to the weak state.

In order to approach this goal we take advantage of the fact that page behavior tends to be relatively constant over the execution of a program, or at least a large portion of it. Pages that are weak at one acquire point are likely to be weak at another. We therefore introduce an additional pair of states, called safe and unsafe. These new states, which are orthogonal to the others (for a total of 8 distinct states), reflect the past behavior of the page. A page that has made the transition to weak several times and is about to be marked weak again is also marked as unsafe. Future transitions to the weak state will no longer require the sending of write notices. Instead, the processor that causes the transition to the weak state changes only the entry in the coherent map, and then continues. The acquire part of the protocol now requires that the acquiring processor check the coherent map entry for all its unsafe pages, and invalidate the ones that are also marked as weak.
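The fault-handling sequence described above (lock the coherent map entry, update the page state, post write notices if the page went weak, unlock) can be sketched as a small simulation. This is an illustrative model only, not the authors' implementation: the names, the simplified four-state page lifecycle, and the use of Python dictionaries in place of statically allocated, home-node-resident structures are all assumptions made for clarity.

```python
from threading import Lock

NPROCS = 4

class CoherentMapEntry:
    """One entry per page; in the paper this lives in uncached shared
    memory, with each entry stored at the page's home node."""
    def __init__(self):
        self.lock = Lock()       # per-entry lock taken during a fault
        self.state = "uncached"  # simplified: uncached / shared / dirty / weak
        self.sharers = set()     # processors holding mappings for the page

coherent_map = {}                                # page -> CoherentMapEntry
weak_list = {p: set() for p in range(NPROCS)}    # per-processor weak lists

def page_fault(pid, page, is_write):
    """Fault handler: lock the entry, update its state, post write notices."""
    entry = coherent_map.setdefault(page, CoherentMapEntry())
    with entry.lock:                             # lock the coherent map entry
        old = entry.state
        entry.sharers.add(pid)
        if old == "uncached":
            entry.state = "dirty" if is_write else "shared"
        elif (old == "shared" and is_write) or old == "dirty":
            entry.state = "weak"                 # shared/dirty -> weak
            for p in entry.sharers:              # one remote store per sharer,
                weak_list[p].add(page)           # performed serially: a write notice
    # leaving the `with` block unlocks the coherent map entry

def acquire(pid):
    """At an acquire point, invalidate every page on this processor's weak list."""
    for page in weak_list[pid]:
        coherent_map[page].sharers.discard(pid)  # drop the stale mapping
    weak_list[pid].clear()
```

For example, a read fault by processor 0 followed by a write fault by processor 1 on the same page drives it uncached, then shared, then weak, posting write notices to both sharers; each notices the page only when it next scans its own weak list at an acquire.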
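The safe/unsafe refinement can be sketched in the same spirit: while a page is safe, a weak transition posts write notices as before; once it has gone weak often enough to be marked unsafe, a later weak transition merely updates the coherent map entry, and acquiring processors instead poll the entries of their unsafe pages. The threshold constant and all names below are hypothetical; the paper says only that a page going weak "several times" is marked unsafe.

```python
UNSAFE_THRESHOLD = 2     # hypothetical value for the paper's "several times"

class PageEntry:
    """Simplified coherent map entry carrying the orthogonal safe/unsafe bit
    (combined with the base states, the paper's protocol has 8 states)."""
    def __init__(self, sharers):
        self.state = "shared"
        self.sharers = set(sharers)
        self.safe = True         # safe pages get write notices on weakening
        self.weak_count = 0      # how many times the page has gone weak

def mark_weak(entry, weak_lists):
    """Transition a page to weak, with or without posting write notices."""
    if entry.safe:
        for p in entry.sharers:          # O(sharers) remote operations
            weak_lists[p].add(entry)     # post write notices, as before
    entry.state = "weak"                 # for unsafe pages, this single store
    entry.weak_count += 1                # to the coherent map is all the work
    if entry.weak_count >= UNSAFE_THRESHOLD:
        entry.safe = False               # future transitions skip the notices

def acquire_check(pid, weak_lists, unsafe_pages):
    """Acquire: take posted notices, then poll the coherent map entry of
    each unsafe page this processor maps (constant work per unsafe page)."""
    invalidated = set(weak_lists[pid])
    weak_lists[pid].clear()
    for page in unsafe_pages:
        if page.state == "weak":
            invalidated.add(page)
    return invalidated
```

The design trade this models is the one the text describes: safe pages keep the low acquire overhead of per-processor weak lists, while unsafe pages shift the cost from the (possibly many) notify operations at the weak transition to a bounded check at each acquire.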