Software Cache Coherence for Large Scale Multiprocessors

Leonidas I. Kontothanassis and Michael L. Scott

Department of Computer Science
University of Rochester
Rochester, NY

{kthanasi,scott}@cs.rochester.edu

July

This work was supported in part by an NSF Institutional Infrastructure grant (no. CDA) and an ONR research grant, in conjunction with the DARPA Research in Information Science and Technology, High Performance Computing, Software Science and Technology program.

Abstract

Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms have not been fast enough to provide a serious alternative.

We present a new software coherence protocol that narrows the performance gap between hardware and software coherence. This protocol runs on NCC-NUMA (non-cache-coherent, non-uniform memory access) machines, in which a global physical address space allows processors to fill cache lines from remote memory. We compare the performance of the protocol to that of existing software and hardware alternatives. We also evaluate the tradeoffs among various write policies: write-through, write-back, and write-through with a write-collect buffer. Finally, we observe that certain simple program changes can greatly improve performance. For the programs in our test suite, the performance advantage of hardware cache coherence is small enough to suggest that software coherence may be more cost-effective.

Keywords: cache coherence, scalability, cost-effectiveness, lazy release consistency, NCC-NUMA machines

1 Introduction

Large-scale multiprocessors can provide the computational power needed for some of the larger problems of science and engineering today. Shared memory provides an appealing programming model for such machines. To perform well, however, shared memory requires the use of caches, which in turn require a coherence mechanism to ensure that copies of data are sufficiently up to date. Coherence is easy to achieve on small, bus-based machines, where every processor can see the memory traffic of the others. Coherence is substantially harder to achieve on large-scale multiprocessors; it increases both the cost of the machine and the time and intellectual effort required to bring it to market. Given the speed of advances in microprocessor technology, long development times generally lead to machines with out-of-date processors. There is thus a strong motivation to find coherence mechanisms that will produce acceptable performance with little or no special hardware. (We are speaking here of behavior-driven coherence: mechanisms that move and replicate data at run time in response to observed patterns of program behavior, as opposed to compiler-based techniques.)

There are at least three reasons to hope that a software coherence mechanism might be competitive with hardware coherence. First, trap-handling overhead is not very large in comparison to remote communication latencies, and will become even smaller as processor improvements continue to outstrip network improvements. Second, software may be able to embody protocols that are too complicated to implement reliably in hardware at acceptable cost. Third, programmers and developers are becoming aware of the importance of locality of reference, and are attempting to write programs that communicate as little as possible, thereby reducing the impact of coherence operations. In this paper we present a software coherence mechanism that exploits these opportunities to deliver performance approaching that of the best hardware alternatives: within a modest margin in the worst case in our experiments, and usually much closer.

As in most software coherence systems, we use address translation hardware to control access to shared pages. To minimize the impact of the false sharing that comes with such large coherence blocks, we employ a relaxed consistency protocol that combines aspects of both eager release consistency and lazy release consistency. We target our work, however, at NCC-NUMA machines, rather than message-based multicomputers or networks of workstations. Machines in the NCC-NUMA class include the Cray T3D, the BBN TC2000, and the Princeton Shrimp. None of these has hardware cache coherence, but each provides a globally accessible physical address space, with hardware support for cache fills and uncached references that access remote locations. In comparison to multicomputers, NCC-NUMA machines are only slightly harder to build, but they provide two important advantages for implementing software coherence: they permit very fast access to remote directory information, and they allow data to be moved in cache-line-size chunks.

We also build on the work of Petersen and Li, who developed an efficient software implementation of release consistency for small-scale multiprocessors. The key observation of their work was that NCC-NUMA machines allow the coherence block and the data transfer block to be of different sizes. Rather than copy an entire page in response to an access fault, a software coherence mechanism for an NCC-NUMA machine can create a mapping to remote memory, allowing the hardware to fetch individual cache lines as needed, on demand.

Our principal contribution is to extend the work of Petersen and Li to large machines. We distribute and reorganize the directory data structures, inspect those structures only with regard to pages for which the current processor has a mapping, postpone coherence operations for as long as possible, and introduce a new dimension to the protocol state space that allows us to reduce the cost of coherence maintenance on well-behaved pages.

We compare our mechanism to a variety of existing alternatives, including sequentially-consistent hardware, release-consistent hardware, sequentially-consistent software, and the software coherence scheme of Petersen and Li. We find substantial improvements with respect to the other software schemes, enough in most cases to bring software cache coherence within sight of the hardware alternatives.

We also report on the impact of several architectural alternatives on the effectiveness of software coherence. These alternatives include the choice of write policy (write-through, write-back, or write-through with a write-collect buffer) and the availability of a remote reference facility, which allows a processor to choose to access data directly in a remote location by disabling caching. Finally, to obtain the full benefit of software coherence, we observe that minor program changes can be crucial. In particular, we identify the need to employ reader-writer locks, avoid certain interactions between program synchronization and the coherence protocol, and align data structures with page boundaries whenever possible.

The rest of the paper is organized as follows. Section 2 describes our software coherence protocol and provides intuition for our algorithmic and architectural choices. Section 3 describes our experimental methodology and workload. We present performance results in Section 4 and compare our work to other approaches in Section 5. We summarize our findings and conclude in Section 6.

2 The Software Coherence Protocol

In this section we present a scalable algorithm for software cache coherence. The algorithm was inspired by Karin Petersen's thesis work with Kai Li. Petersen's algorithm was designed for small-scale multiprocessors with a single physical address space and non-coherent caches, and has been shown to work well for several applications on such machines.

Like most behavior-driven software coherence schemes, Petersen's relies on address translation hardware, and therefore uses pages as its unit of coherence. Unlike most software schemes, however, it does not migrate or replicate whole pages. Instead, it maps pages where they lie in main memory and relies on the hardware cache-fill mechanism to bring lines into the local cache on demand. To minimize the frequency of coherence operations, the algorithm adopts release consistency for its memory semantics, and performs coherence operations only at synchronization points. (Under release consistency, memory references are classified as acquires, releases, or ordinary references. A release indicates that the processor is completing an operation on which other processors may depend; all of the processor's previous writes must be made visible to any processor that performs a subsequent acquire. An acquire indicates that the processor is beginning an operation that may depend on someone else; all other processors' writes must now be made locally visible.) Between synchronization points, processes may continue to use stale data in their caches. To keep track of inconsistent copies, the algorithm keeps a count, in uncached main memory, of the number of readers and writers for each page, together with an uncached weak list that identifies all pages for which there are multiple writers, or a writer and one or more readers.

Pages that may become inconsistent under Petersen's scheme are inserted in the weak list by the processor that detects the potential for inconsistency. For example, if a processor attempts to read a variable in a currently-unmapped page, the page fault handler creates a read-only mapping, increments the reader count, and adds the page to the weak list if it has any current writers. On an acquire operation, a processor scans the uncached weak list and purges all lines of all weak pages from its cache. The processor also removes all mappings it may have for such a page. If all mappings for a page in the weak list have been removed, the page is removed from the weak list as well.

Unfortunately, while a centralized weak list works well on small machines, it poses serious obstacles to scalability: the size of the list, and consequently the amount of work that a processor needs to perform at a synchronization point, increases with the size of the machine. Moreover, the frequency of references to each element of the list also increases with the size of the machine, implying the potential for serious memory contention. Our goal has been to achieve scalability by designing an algorithm whose overhead is a function of the degree of sharing, and not of the size of the machine. Since previous studies have shown that the degree of sharing for coherence blocks remains relatively constant when the size of the machine increases, an algorithm with the above property should scale nicely to larger numbers of processors.

Figure 1: Scalable software cache coherence state diagram. (The diagram shows the UNCACHED, SHARED, DIRTY, and WEAK states, each in SAFE and UNSAFE variants; transitions are labeled with read, write, and acquire accesses and with conditions on Count, Notices, and Checks relative to their limits.)

Our solution assumes a distributed, non-replicated directory that maintains cacheability and sharing information, similar to the coherent map data structure of PLATINUM. Pages can be in one of the following four states:

Uncached: No processor has a mapping to this page. This is the initial state for all pages.

Shared: One or more processors have read-only mappings to this page.

Dirty: A single processor has both read and write mappings to the page.

Weak: Two or more processors have mappings to the page, and at least one has both read and write mappings to it.

To facilitate transitions from weak back to the other states, the coherent map includes auxiliary counts of the number of readers and writers of each page.
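To make the bookkeeping concrete, the following C sketch shows one plausible layout for a coherent map entry. The type and field names are our own illustration (the paper gives no concrete declaration); the safe flag and the notices and checks counters anticipate the refinement introduced later in this section.

#include <stdint.h>

/* Illustrative layout of one coherent map entry; names and widths are
 * assumptions, not taken from the authors' implementation. */
typedef enum {
    PAGE_UNCACHED,   /* no processor has a mapping (initial state) */
    PAGE_SHARED,     /* one or more read-only mappings             */
    PAGE_DIRTY,      /* exactly one processor has read+write       */
    PAGE_WEAK        /* two or more mappings, at least one r/w     */
} page_state_t;

typedef struct {
    page_state_t state;      /* one of the four base states               */
    int          safe;       /* safe/unsafe flag (refinement added below) */
    int          readers;    /* auxiliary count of read mappings          */
    int          writers;    /* auxiliary count of write mappings         */
    int          notices;    /* write notices posted while safe           */
    int          checks;     /* wasted acquire-time checks while unsafe   */
    uint64_t     mappers;    /* bitmask of processors holding mappings    */
    int          lock;       /* entry lock, held while the state changes  */
} coherent_map_entry_t;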

Each processor holds the portion of the coherent map that describes the pages whose physical memory is local to that processor (the pages for which the processor is the home node). In addition, each processor holds a local weak list that indicates which of the pages to which it has mappings are weak. When a processor takes a page fault, it locks the coherent map entry representing the page on which the fault was taken. It then changes the coherent map entry to reflect the new state of the page. If necessary (i.e., if the page has made the transition from shared or dirty to weak), the processor updates the weak lists of all processors that have mappings for that page. It then unlocks the entry in the coherent map. The process of updating a processor's weak list is referred to as posting a write notice.
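The sketch below, using the illustrative declarations above, traces the page-fault path just described: lock the home node's entry, update the counts and state, post write notices to the current mappers if the page has just become weak, and unlock. The helper routines are hypothetical stand-ins for the OS machinery; error handling and the safe/unsafe refinement are omitted.

/* Hypothetical OS hooks; only their intent matters for the sketch. */
extern void acquire_entry_lock(coherent_map_entry_t *e);
extern void release_entry_lock(coherent_map_entry_t *e);
extern void add_to_local_weak_list(int proc, unsigned long page); /* post a write notice */
extern void map_page_locally(unsigned long page, int writable);

#define MAX_PROCS 64

void handle_page_fault(coherent_map_entry_t *e, unsigned long page,
                       int self, int is_write)
{
    acquire_entry_lock(e);                  /* lock the coherent map entry */

    if (is_write) e->writers++; else e->readers++;
    e->mappers |= 1ull << self;

    int was_weak = (e->state == PAGE_WEAK);

    /* Recompute the page state from the counts. */
    if (e->writers > 0 && e->readers + e->writers > 1)
        e->state = PAGE_WEAK;
    else if (e->writers > 0)
        e->state = PAGE_DIRTY;
    else
        e->state = PAGE_SHARED;

    /* A transition from shared or dirty to weak requires posting a write
     * notice on every processor that currently has a mapping. */
    if (e->state == PAGE_WEAK && !was_weak) {
        for (int p = 0; p < MAX_PROCS; p++)
            if (e->mappers & (1ull << p))
                add_to_local_weak_list(p, page);
    }

    map_page_locally(page, is_write);
    release_entry_lock(e);                  /* unlock the entry */
}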

Distribution of the coherent map and weak list eliminates both the problem of centralization (i.e., memory contention) and the need for processors to do unnecessary work at acquire points (scanning weak list entries in which they have no interest). However, it makes the transition to the weak state considerably more expensive, since a potentially large number of remote memory operations might have to be performed serially in order to notify all sharing processors. Ideally, we would like to maintain the low acquire overhead of per-processor weak lists while requiring only a constant amount of work per shared page on a transition to the weak state.

In order to approach this goal, we take advantage of the fact that page behavior tends to be relatively constant over the execution of a program, or at least a large portion of it. Pages that are weak at one acquire point are likely to be weak at another. We therefore introduce an additional pair of states, called safe and unsafe. These new states, which are orthogonal to the others (for a total of eight distinct states), reflect the past behavior of the page. A page that has made the transition to weak several times, and is about to be marked weak again, is also marked as unsafe. Future transitions to the weak state will no longer require the sending of write notices. Instead, the processor that causes the transition to the weak state changes only the entry in the coherent map, and then continues. The acquire part of the protocol now requires that the acquiring processor check the coherent map entry for all its unsafe pages, and invalidate the ones that are also marked as weak. A processor knows which of its pages are unsafe because it maintains a local list of them; this list is never modified remotely. A page changes from unsafe back to safe if it has been checked at several acquire operations and found not to be weak.
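The acquire-side processing implied by this description is sketched below, using the declarations from the earlier sketches: pages on the local weak list (safe pages for which write notices were posted) are always invalidated, while unsafe pages that the processor maps are checked directly in the coherent map. The check limit, the list representation, and the helper routines are illustrative assumptions, and entry locking is omitted for brevity.

#define CHECK_LIMIT 4        /* illustrative threshold for unsafe -> safe */

extern coherent_map_entry_t *coherent_map_entry_for(unsigned long page);
extern void purge_page_lines(unsigned long page);    /* purge lines from cache */
extern void unmap_page_locally(unsigned long page);

void protocol_acquire(const unsigned long *weak_list,   int n_weak,
                      const unsigned long *unsafe_list, int n_unsafe)
{
    /* 1. Invalidate every page named on our local weak list. */
    for (int i = 0; i < n_weak; i++) {
        purge_page_lines(weak_list[i]);
        unmap_page_locally(weak_list[i]);
    }

    /* 2. For unsafe pages we map, no notices were sent; consult the
     *    coherent map entry and invalidate only if it is really weak. */
    for (int i = 0; i < n_unsafe; i++) {
        coherent_map_entry_t *e = coherent_map_entry_for(unsafe_list[i]);
        if (e->state == PAGE_WEAK) {
            purge_page_lines(unsafe_list[i]);
            unmap_page_locally(unsafe_list[i]);
            e->checks = 0;
        } else if (++e->checks > CHECK_LIMIT) {
            e->safe = 1;     /* page has behaved well; mark it safe again */
            e->checks = 0;
        }
    }
}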

The correctness of the protocol depends on the following observation: unsafe pages are known as such by the processors that create new mappings to them; they will therefore be checked for the possibility of being weak on the next acquire operation. Safe pages have write notices posted on their behalf when they make the transition to the weak state. When a page first makes the transition to unsafe, some processors (those that already have mappings to it) will receive write notices; others (those that create subsequent mappings) will know that it is unsafe from the state information saved in the coherent map entry.

Comparing our protocol to the use of a central weak list, we see that rather than iterate over all weak pages at each acquire point, a processor iterates over only those pages to which it currently has a mapping and that, on the basis of past behavior, have a high probability of really being weak. The comparatively minor downside is that for pages that become weak without a past history of doing so, a processor must pay the cost of posting appropriate write notices.

The state diagram for a page in our protocol appears in Figure 1. The state of a page is represented in the coherent map; it is a property of the system as a whole, not (as in most protocols) the viewpoint of a single processor. The transitions represent read, write, and acquire accesses on the part of any processor. Count is the number of processors having mappings to the page; notices is the number of notices that have been sent on behalf of a safe page; and checks is the number of times that a processor has checked the coherent map regarding an unsafe page and found it not to be weak. Such an access to the coherent map is wasted work, since the processor was not required to invalidate its mapping to the page. To guard against this waste, our policy switches a page back to safe after a small number of unnecessary checks of the coherent map.

We apply one additional optimization. When a processor takes a page fault on a write to a shared (non-weak) page, we could choose to make the transition to weak (and post write notices if the page was safe) immediately, or we could choose to wait until the processor's next release operation; the semantics of release consistency do not require us to make writes visible before then. (Under the same principle, a write page fault on an unmapped page will take the page to the shared state; the writes will be made visible only on the subsequent release operation.) The advantage of delayed transitions is that any processor that executes an acquire operation before the writing processor's next release will not have to invalidate the page. This serves to reduce the overall number of invalidations. On the other hand, delayed transitions have the potential to lengthen the critical path of the computation by introducing contention, especially for programs with barriers, in which many processors may want to post notices for the same page at roughly the same time, and will therefore serialize on the lock of the coherent map entry. Delayed write notices were introduced in the Munin distributed shared memory system, which runs on networks of workstations and communicates solely via messages. Though the relative values of constants are quite different, experiments indicate (see Section 4.1) that delayed transitions are generally beneficial in our environment as well.
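A corresponding sketch of release-time processing under the delayed policy follows, under the same illustrative declarations as the earlier sketches. Pages written since the last release (found, for example, via page-table dirty bits) are moved to the dirty or weak state only now, write notices are posted only for safe pages, and the writes are then made visible. All helper routines are hypothetical.

extern unsigned long *pages_written_since_last_release(int self, int *count);
extern void post_write_notices(coherent_map_entry_t *e, unsigned long page);
extern void write_back_dirty_lines(unsigned long page);

void protocol_release(int self)
{
    int n;
    unsigned long *written = pages_written_since_last_release(self, &n);

    for (int i = 0; i < n; i++) {
        coherent_map_entry_t *e = coherent_map_entry_for(written[i]);
        acquire_entry_lock(e);

        if (e->readers + e->writers > 1) {
            e->state = PAGE_WEAK;
            if (e->safe)                       /* unsafe pages need no notices */
                post_write_notices(e, written[i]);
        } else {
            e->state = PAGE_DIRTY;
        }

        release_entry_lock(e);

        /* Make this processor's writes visible before the release completes
         * (a write-back here; with write-through caches one would instead
         * wait for outstanding acknowledgments). */
        write_back_dirty_lines(written[i]);
    }
}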

One final question that has to be addressed is the mechanism whereby written data makes its way back into main memory. Petersen, in her work, found a write-through cache to be the best option, but write-through can generate a potentially unacceptable amount of memory traffic in large-scale systems. Assuming a write-back cache either requires that no two processors write to the same cache line of a weak page (an unreasonable assumption), or requires a mechanism to keep track of which individual words are dirty. We ran our experiments (see Section 4.2) under three different assumptions: write-through caches; write-back caches with per-word hardware dirty bits in the cache; and write-through caches with a write-collect buffer that hangs onto recently written lines and coalesces any writes that are directed to the same line. Depending on the write policy, the coherence protocol at a release operation must force a write-back of all dirty lines, purge the write-collect buffer, or wait for acknowledgments of write-throughs.

3 Experimental Methodology

We use execution-driven simulation to simulate a mesh-connected multiprocessor with up to 64 nodes. Our simulator consists of two parts: a front end, Mint, that simulates the execution of the processors, and a back end that simulates the memory system. The front end calls the back end on every data reference (instruction fetches are assumed to always be cache hits). The back end decides which processors block waiting for memory and which continue execution. Since the decision is made online, the back end affects the timing of the front end, so that the interleaving of instructions across processors depends on the behavior of the memory system, and control flow within a processor can change as a result of the timing of memory references. This is more accurate than trace-driven simulation, in which control flow is predetermined (recorded in the trace).

The front end is the same in all our experiments; it implements the MIPS II instruction set. Interchangeable modules in the back end allow us to explore the design space of software and hardware coherence. Our hardware-coherent modules are quite detailed, with finite-size caches, full protocol emulation, distance-dependent network delays, and memory access costs (including memory contention). Our simulator is capable of capturing contention within the network, but only at a substantial cost in execution time; the results reported here model network contention at the sending and receiving nodes of a message, but not at the nodes in between. Our software-coherent modules add a detailed simulation of TLB behavior, since the TLB is the protection mechanism used for coherence and can be crucial to performance. To avoid the complexities of instruction-level simulation of interrupt handlers, we assume a constant overhead for page faults. Table 1 summarizes the default parameters used in both our hardware and software coherence simulations, which are in agreement with published values and with several hardware manuals.

Table 1: Default values for system parameters. The parameters include the TLB size (entries); TLB fill time (cycles); interrupt cost (cycles); coherent map modification time (cycles); memory response time (cycles per cache line); page size (Kbytes); total cache per processor (Kbytes); cache line size (bytes); network path width (bits, bidirectional); link latency (cycles); wire latency (cycles); directory lookup cost (cycles); and cache purge time (cycles per line).

Some of the transactions required by our coherence protocols involve a collection of the operations shown in Table 1, and therefore incur the aggregate cost of their constituents. For example, a read page fault on an unmapped page consists of the following: (a) a TLB fault and TLB fill; (b) a processor interrupt caused by the absence of read rights; (c) a coherent map entry lock acquisition; and (d) a coherent map entry modification followed by the lock release. Lock acquisition itself requires traversing the network and accessing the memory module where the lock is located. The total cost of the example transaction is therefore well above that of any of its constituent operations.

3.1 Workload

We report results for six parallel programs. Three are best described as computational kernels: Gauss, sor, and fft. Three are complete applications: mp3d, water, and appbt. The kernels are local creations. Gauss performs Gaussian elimination without pivoting on a dense matrix. Sor computes the steady-state temperature of a metal sheet, using a banded parallelization of red-black successive over-relaxation on a rectangular grid. Fft computes a one-dimensional FFT on an array of complex numbers, using a standard parallel algorithm.

Mp3d and water are part of the SPLASH suite. Mp3d is a wind-tunnel airflow simulation; we simulated a reduced number of particles and time steps in our studies. Water is a molecular dynamics simulation, computing inter- and intra-molecule forces for a set of water molecules; here, too, we used a reduced number of molecules and time steps. Finally, appbt is from the NAS parallel benchmarks suite. It computes an approximation to the Navier-Stokes equations; it was translated to shared memory from the original message-based form by Doug Burger and Sanjay Mehta at the University of Wisconsin. Due to simulation constraints, our input data sizes for all programs are smaller than what would be run on a real machine, a fact that may cause us to see unnaturally high degrees of sharing. Since we still observe reasonable scalability for all the applications, we believe that the data set sizes do not compromise our results.

4 Results

Our principal goal is to determine whether one can approach the performance of hardware cache coherence without the special hardware. To that end, we begin in Section 4.1 by evaluating the tradeoffs between different software protocols. Then, in Sections 4.2 and 4.3, we consider the impact of different write policies and of simple program changes that improve the performance of software cache coherence. These changes include segregation of synchronization variables, data alignment and padding, use of reader-writer locks to avoid coherence overhead, and use of uncached remote references for fine-grain data sharing. Finally, in Section 4.4, we compare the best of the software results to the corresponding results on sequentially-consistent and release-consistent hardware.

4.1 Software coherence protocol alternatives

This section compares the software protocol alternatives discussed in Section 2. The architecture on which the comparison is made assumes a write-back cache, which is flushed at the time of a release. Coherence messages (if needed) can be overlapped with the flush operations, once the writes have entered the network. The five protocols we compare are:

rel.distr.del: The delayed version of our distributed protocol with safe and unsafe pages. Write notices are posted at the time of a release, and invalidations are done at the time of an acquire. At release time, the protocol scans the TLB/page-table dirty bits to determine which pages have been written. Pages can therefore be mapped read-write on the first miss, eliminating the need for a second trap if a read to an unmapped page is followed by a write. This protocol has slightly higher bookkeeping overhead than rel.distr.nodel (below), but reduces trap costs and possible coherence overhead by delaying transitions to the dirty or weak state (and the posting of associated write notices) for as long as possible. It provides the unit of comparison (normalized running time of 1) in our graphs.

rel.distr.nodel: Same as rel.distr.del, except that write notices are posted as soon as an inconsistency occurs. Invalidations are done at the time of an acquire, as before. While this protocol has slightly less bookkeeping overhead (there is no need to remember pages for an upcoming release operation), it may cause higher coherence overhead and higher trap costs. The TLB/page-table dirty bits are not sufficient here, since we want to take action the moment an inconsistency occurs; we must use the write-protect bits to generate page faults.

rel.centr.del: Same as rel.distr.del, except that write notices are propagated by inserting weak pages in a global list which is traversed on acquires. List entries are distributed among the nodes of the machine, although the list itself is conceptually centralized.

rel.centr.nodel: Same as rel.distr.nodel, except that write notices are propagated by inserting weak pages in a global list which is traversed on acquires. This is the protocol proposed by Petersen and Li. (The previous protocol, rel.centr.del, is also similar to that of Petersen and Li, with the addition of the delayed write notices.)

seq: A sequentially consistent software protocol that allows only a single writer for every coherence block at any given point in time. Interprocessor interrupts are used to enforce coherence when an access fault occurs. Interprocessor interrupts present several problems for our simulation environment; fortunately, this is the only protocol that needs them, and the level of detail at which they are simulated is significantly lower than that of other system aspects. Results for this protocol may underestimate the cost of coherence management, especially in cases of high network traffic, but since it is the worst protocol in most cases, the inaccuracy has no effect on our conclusions.

Figure 2: Comparative performance of the different software protocols on 64 processors (normalized execution time for gauss, sor, water, mp3d, appbt, and fft).

Figure 3: Overhead analysis of the different software protocols on 64 processors (IPC interrupts, lock wait time, coherence overhead, and cache miss time).

Figure 2 presents the running time of the different software protocols on our set of partially modified applications. We have used the best version of the applications that does not require protocol modifications (i.e., no identification of reader-writer locks or use of remote reference; see Section 4.3). The distributed protocols outperform the centralized implementations, often by a significant margin. The distributed protocols also show the largest improvement (almost threefold) on water and mp3d, the two applications in which software coherence lags the most behind hardware coherence (see Section 4.4). This is predictable behavior: applications in which the impact of coherence is important are expected to show the greatest variance with different coherence algorithms. However, it is important to notice the difference in scale between Figure 2 and the figures in Section 4.4: while the distributed protocols improve performance over the centralized ones by a factor of three for water and mp3d, they are only modestly worse than their hardware competitors. In programs where coherence is less important, the decentralized protocols still provide reasonable performance improvements over the centralized ones.

The one application in which the sequential protocol outperforms the relaxed alternatives is Gaussian elimination. While the actual difference in performance may be smaller than shown in the graph (due in part to the reduced detail in the implementation of the sequential protocol), there is one source of overhead that the relaxed protocols have to pay that the sequential version does not. Since the releaser of a lock does not know who the subsequent acquirer of the lock will be, it has to flush changes to shared data at the time of a release in the relaxed protocols, so that those changes will be visible. Gauss uses locks as flags to indicate that a particular pivot row is available to processors to eliminate their rows. In Section 4.3 we note that use of the flags results in many unnecessary flushes, and we present a refinement to the relaxed consistency protocols that avoids them.

Sor and water have very regular sharing patterns: sor among neighbors, and water within a well-defined subset of the processors partaking in the computation. The distributed protocol makes a processor pay a coherence penalty only for the pages it cares about, while the centralized one forces processors to examine all weak pages, which is all the shared pages in the case of water, resulting in very high overheads. It is interesting to notice that in water the centralized relaxed consistency protocols are badly beaten by the sequentially consistent software protocol. This agrees to some extent with the results reported by Petersen and Li, but the advantage of the sequentially consistent protocol was less pronounced in their work. We believe there are two reasons for our difference in results. First, we have restructured the code to greatly reduce false sharing, thus removing one of the advantages that relaxed consistency has over sequential consistency. (The sequentially consistent software protocol still outperforms the centralized relaxed consistency software protocols on the unmodified application, but to a lesser extent.) Second, we have simulated a larger number of processors, aggravating the contention caused by the centralized weak list used in the centralized relaxed consistency protocols.

Appbt and fft have limited sharing. Fft exhibits limited pairwise sharing among different processors for every phase (the distance between paired elements decreases with each phase). We were unable to establish the access pattern of appbt from the source code; it uses linear arrays to represent higher-dimensional data structures, and the computation of offsets often uses several levels of indirection.

Mp3d has very widespread sharing. We modified the program slightly, prior to the current studies, to ensure that colliding molecules belong with high probability to either the same processor or neighboring processors. As a result, the molecule data structures exhibit limited pairwise sharing. The main problem is the space cell data structures. Space cells form a three-dimensional array. Unfortunately, molecule movement is fastest in the outermost dimension, resulting in long-stride access to the space cell array. That, coupled with the large coherence block, results in having all the pages of the space cell data structure shared across all processors. Since the processors modify the data structure for every particle they process, the end behavior is a long weak list and serialization under the centralized protocols. The distributed protocols improve the coherence management of the molecule data structures, but can do little to improve on the cell data structure, since sharing is widespread.

While runtime is the most important metric for application performance, it does not capture the full impact of a coherence algorithm. Figure 3 shows the breakdown of overhead into its major components for the five software protocols on our six applications. These components are: IPC interrupt handling overhead (sequentially consistent protocol only); time spent waiting for application locks; coherence protocol overhead (including waiting for system locks, and flushing and purging cache lines); and time spent waiting for cache misses. Coherence protocol overhead has an impact on the time spent waiting for application locks; the two are not easily separable. The relative heights of the bars do not agree in Figures 2 and 3 because the former pertains to the critical path of the computation, while the latter provides totals over all processors for the duration of execution. Aggregate costs for the overhead components can be higher, but critical path length can be shorter, if some of the overhead work is done in parallel. The coherence part of the overhead is significantly reduced by the distributed delayed protocol for all applications. For mp3d the main benefit comes from the reduction of lock waiting time: the program is tightly synchronized, so a reduction in coherence overhead implies less time holding synchronization variables, and therefore a reduction in synchronization waiting time.

Figure 4: Comparative performance of different cache architectures on 64 processors (normalized execution time with write-back, write-through, and write-through-plus-collect caches).

Figure 5: Delayed cache misses under the different cache types, for each application. (Our write-through simulation for fft required too much memory, so we had to modify it slightly; the number of delayed misses we report for it is not directly comparable with that of the other two policies, although it is larger than either of them.)

We have also run simulations in order to determine the performance benefits conferred by the introduction of the safe and unsafe states. What we have discovered is that for our modified applications the performance impact of these two states is small: they help performance slightly in some cases, and hurt it slightly in others (due to unnecessary checks on the unsafe state). The reason for this behavior is the limited degree of page sharing exhibited by our modified applications. We have also run simulations on unmodified applications, and have found that the existence of these two states can improve performance considerably. Unfortunately, the performance of software coherence, even with the introduction of our optimization, is not competitive with hardware for the unmodified applications. We view our optimization as a safeguard that can help yield reasonable performance for bad sharing patterns; for well-behaved programs that scale nicely under software coherence, its impact is significantly reduced.

4.2 Write policies

In this section we consider the choice of write policy for the cache. Specifically, we compare the performance obtained with a write-through cache, a write-back cache, and a write-through cache with a buffer for merging writes. We assume that a single policy is used for all cached data, both private and shared. We have modified our simulator to allow us to vary policies independently for private and shared data, and expect to have results shortly that will simulate the above options for shared data only, while using a write-back policy for private data.

Write-back caches impose the minimum load on the memory and network, since they write blocks back only on eviction or when explicitly flushed. In a software coherent system, however, write-back caches have two undesirable qualities. The first of these is that they delay the execution of synchronization operations, since dirty lines must be flushed at the time of a release; write-through caches have the potential to overlap memory accesses with useful computation.

The second problem is more serious, because it affects program correctness in addition to performance. Because a software coherent system allows multiple writers for the same page, it is possible for different portions of a cache line to be written by different processors. When those lines are flushed back to memory, we must make sure that changes are correctly merged, so that no data modifications are lost. The obvious way to do this is to have the hardware maintain per-word dirty bits, and then to write back only those words in the cache that have actually been modified. (We assume there is no sub-word sharing: words modified by more than one processor imply that the program is not correctly synchronized.)
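In concrete terms, the merge can be expressed as the per-word copy sketched below (the word and line sizes are illustrative): only words whose dirty bits are set overwrite the memory copy, so writes by other processors to other words of the same line are preserved.

/* Merge a flushed cache line into memory using per-word dirty bits. */
#define WORDS_PER_LINE 8

void merge_line_into_memory(unsigned int *memory_line,
                            const unsigned int *cache_line,
                            unsigned int dirty_bits)
{
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if (dirty_bits & (1u << w))       /* only words this processor wrote */
            memory_line[w] = cache_line[w];
}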

Write-through caches can potentially benefit relaxed consistency protocols by reducing the amount of time spent at release points. They also eliminate the need for per-word dirty bits. Unfortunately, they may cause a large amount of traffic, delaying the service of cache misses and in general degrading performance. In fact, if the memory subsystem is not able to keep up with all the traffic, write-through caches are unlikely to actually speed up releases, because at a release point we have to make sure that all writes have been globally performed before allowing the processor to continue. (A write completes when it is acknowledged by the memory system.) With a large amount of write traffic, we may simply have replaced waiting for the write-back with waiting for missing acknowledgments.

Write-through caches with a write-collect buffer employ a small, fully associative buffer between the cache and the interconnection network. The buffer merges writes to the same cache line, and allocates a new entry for a write to a non-resident cache line. When it runs out of entries, the buffer randomly chooses a line for eviction and writes it back to memory. The write-collect buffer is an attempt to combine the desirable features of both the write-through and the write-back cache: it reduces memory and network traffic when compared to a plain write-through cache, and has a shorter latency at release points when compared to a write-back cache. Per-word dirty bits are required at the buffer to allow successful merging of cache lines into memory.
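The toy model below illustrates the write-collect behavior just described: a small, fully associative buffer that merges writes to the same line, tracks per-word dirty bits, and randomly evicts a line when full. The buffer size, line size, and interface are assumptions made for the illustration, not the simulated hardware's parameters.

#include <stdlib.h>
#include <string.h>

#define WC_ENTRIES 16            /* assumed buffer size                     */
#define LINE_WORDS 8             /* assumed 32-byte line of 4-byte words    */

typedef struct {
    int           valid;
    unsigned long line_addr;     /* line-aligned address                    */
    unsigned int  word_dirty;    /* per-word dirty bits                     */
    unsigned int  words[LINE_WORDS];
} wc_entry_t;

static wc_entry_t wc[WC_ENTRIES];

extern void write_back_line(unsigned long line_addr,
                            const unsigned int *words, unsigned int dirty);

void wc_write(unsigned long addr, unsigned int value)
{
    unsigned long line = addr / (LINE_WORDS * 4) * (LINE_WORDS * 4);
    int word = (int)((addr % (LINE_WORDS * 4)) / 4);
    int free_slot = -1;

    for (int i = 0; i < WC_ENTRIES; i++) {
        if (wc[i].valid && wc[i].line_addr == line) {   /* merge into entry */
            wc[i].words[word] = value;
            wc[i].word_dirty |= 1u << word;
            return;
        }
        if (!wc[i].valid) free_slot = i;
    }

    if (free_slot < 0) {                                /* random eviction  */
        free_slot = rand() % WC_ENTRIES;
        write_back_line(wc[free_slot].line_addr,
                        wc[free_slot].words, wc[free_slot].word_dirty);
    }

    memset(&wc[free_slot], 0, sizeof wc[free_slot]);
    wc[free_slot].valid = 1;
    wc[free_slot].line_addr = line;
    wc[free_slot].words[word] = value;
    wc[free_slot].word_dirty = 1u << word;
}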

Figure 4 presents the relative performance of the different cache architectures when using the best relaxed protocol on our best version of the applications. For all programs with the exception of mp3d, the write-back cache outperforms the others; the main reason is the reduced amount of memory traffic. Figure 5 presents the number of delayed cache misses under the different cache policies. A miss is defined as delayed when it is forced to wait in a queue at the memory while contending accesses are serviced. The difference between the cache types is most pronounced in programs that have little sharing or a lot of private data; water, appbt, and fft fall in this category. For water, which has a very large number of private writes, the write-through cache ends up degrading performance severely.

For programs whose data is mostly actively shared, the write-through policies fare better. The best example is mp3d, in which the write-collect cache outperforms the write-back cache by a noticeable margin. The reason for this is that frequent synchronization in mp3d requires frequent write-backs, so the program generates approximately the same amount of traffic as it would with a write-through cache. Furthermore, a flush operation on a page costs a fixed number of cycles (one cycle per line), regardless of the number of lines actually present in the cache. So if only a small portion of a page is touched, the write-back policy still pays a high penalty at releases.

Our results are in partial agreement with those reported by Chen and Veidenbaum. We both find that write-through caches suffer significant performance degradation due to increased network and memory traffic. However, while their results favor a write-collect buffer in most cases, we find that write-back caches are preferable under our software scheme. We believe the difference stems from the fact that we overlap cache flush costs with other coherence management (in their case, cache flushes constitute the coherence management cost), and from the fact that we use a different set of applications.

4.3 Program modifications to support software cache coherence

The performance observed under software coherence is very sensitive to the locality properties of the application. In this section we describe the modifications we had to make to our applications in order to get them to run efficiently on a software coherent system. We then present performance comparisons for the modified and unmodified applications.

We have used four different techniques to improve the performance of our applications. Two are simple program modifications, and require no additions to the coherence protocol; two take advantage of program semantics to give hints to the coherence protocol on how to reduce coherence management costs. Our four techniques are:

- Separation of synchronization variables from other writable program data.

- Data structure alignment and padding, at page or sub-page boundaries.

- Identification of reader-writer locks, and avoidance of coherence overhead at the release point.

- Identification of fine-grained shared data structures, and use of remote reference for their access, to avoid coherence management.

All our changes produced dramatic improvements in the runtime of one or more applications, with some showing very large improvements.

Separation of busy-wait synchronization variables from the data they protect is also used on hardware-coherent systems, to avoid invalidating the data protected by locks due to unsuccessful test-and-set operations on the locks themselves. Under software coherence, however, this optimization becomes significantly more important to performance. The problem caused by the colocation is aggravated by an adverse interaction between the application locks and the locks protecting coherent map entries at the OS level. A processor that attempts to access an application lock for the first time will take a page fault and will attempt to map the page containing the lock; this requires the acquisition of the OS lock protecting the coherent map entry for that page. The processor that attempts to release the application lock must also acquire the lock for the coherent map entry representing the page that contains the lock (and the data it protects), in order to update the page state to reflect the fact that the page has been modified. In cases of contention, the lock protecting the coherent map entry is unavailable: it is owned by the processors attempting to map the page for access.

We have observed this lock-interaction effect in Gaussian elimination, in the access to the lock protecting the index to the next available row. It is also present in the implementation of barriers under the Argonne P4 macros (used by the SPLASH applications), since they employ a shared counter protected by a lock. We have changed the barrier implementation to avoid the problem in

all our applications, and have separated synchronization variables and data in Gauss to eliminate the adverse interaction. Gauss enjoys the greatest improvement due to this change, though noticeable improvements occur in water, appbt, and mp3d as well.

Figure 6: Runtime of Gauss on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix, sync-fix plus reader-writer locks, and remote reference.

Figure 7: Runtime of water on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, and remote reference.

Figure 8: Runtime of mp3d on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, sorting, and remote reference.

Figure 9: Runtime of appbt on 64 processors (millions of cycles) with different levels of restructuring: plain, sync-fix plus padding, and remote reference.

Data structure alignment and padding is a well-known means of reducing false sharing. Since coherence blocks in software coherent systems are large (a full page in our case), it is unreasonable to require padding of data structures to that size. However, we can often pad data structures to sub-page boundaries, so that a collection of them will fit exactly in a page. This approach, coupled with a careful distribution of work (ensuring that processor data is contiguous in memory), can greatly improve the locality properties of the application. Water and appbt already had good contiguity, so padding was sufficient to achieve good performance. Mp3d, on the other hand, starts by assigning molecules to random coordinates in the three-dimensional space. As a result, interacting particles are seldom contiguous in memory, and generate large amounts of sharing. We fixed this problem by sorting the particles according to their slow-moving x coordinate and assigning each processor a contiguous set of particles. Interacting particles are now likely to belong to the same page and processor, reducing the amount of sharing.
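As an illustration of the sub-page padding technique, the sketch below pads a molecule-like record so that a whole number of records fits exactly in a page. The field layout and the 256-byte sub-page size are invented for the example; only the padding idea comes from the text.

#define SUBPAGE_BYTES 256   /* assumed sub-page size; 16 records per 4 KB page */

typedef struct {
    double pos[3];
    double vel[3];
    double force[3];
    int    cell_index;
    /* Pad the record to a sub-page boundary so that records owned by
     * different processors never share a coherence block. */
    char   pad[SUBPAGE_BYTES - (9 * sizeof(double) + sizeof(int))];
} padded_molecule_t;

/* Compile-time check that the padding works out exactly as intended. */
typedef char molecule_size_check[
    sizeof(padded_molecule_t) == SUBPAGE_BYTES ? 1 : -1];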

We were motivated to give special treatment to reader-writer locks after studying the Gaussian elimination program. Gauss uses locks to test for the readiness of pivot rows. In the process of eliminating a given row, a processor acquires (and immediately releases) the locks on the previous rows one by one. With regular exclusive locks, the processor is forced on each release to notify other processors of its most recent single-element change to its own row, even though no other processor will attempt to use that element until the entire row is finished. Our change is to observe that the critical section protected by the pivot row lock does not modify any data (it is, in fact, empty), so no coherence operations are needed at the time of the release. We communicate this information to the coherence protocol by identifying the critical section as being protected by a reader's lock. (An alternative fix for Gauss would be to associate with each pivot row a simple flag variable on which the processors for later rows could spin. Reads of the flag would be acquire operations without corresponding releases. This fix was not available to us because our programming model provides no means of identifying acquire and release operations except through a predefined set of synchronization operations.)

In general, changing to the use of readers' locks means changing application semantics, since concurrent entry to a readers' critical section is allowed. Alternatively, one can think of the change as a program annotation that retains exclusive entry to the critical section, but permits the coherence protocol to skip the usual coherence operations at the time of the release. In Gauss the difference does not matter, because the critical section is empty. A "skip coherence operations on release" annotation could be applied even to critical sections that modify data, if the programmer or compiler is sure that the data will not be used by any other processor until after some subsequent release. This style of annotation is reminiscent of entry consistency, but with a critical difference: entry consistency requires the programmer to identify the data protected by particular locks, in effect identifying all situations in which the protocol must not skip coherence operations. Errors of omission affect the correctness of the program. In our case, correctness is affected only by an error of commission, i.e., marking a critical section as protected by a reader's lock when this is not the case.
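The Gauss restructuring can be pictured roughly as follows. The synchronization interface shown is hypothetical (our invention for the illustration); the point is that the empty critical section around each pivot-row lock is marked as protected by a readers' lock, so its release performs no flushes or write notices, while the exclusive release that publishes a finished row still does.

/* Hypothetical lock interface: reader_lock/reader_unlock mark a critical
 * section as protected by a readers' lock, so the protocol may skip
 * coherence operations at the release. */
extern void lock(int row);            /* exclusive lock                          */
extern void unlock(int row);          /* exclusive release: flush, post notices  */
extern void reader_lock(int row);     /* still an acquire: weak pages invalidated */
extern void reader_unlock(int row);   /* no flushes, no write notices            */
extern void eliminate_with_pivot(int row, int pivot);

/* Row locks are initialized held; unlock(row) announces that row is ready. */
void process_row(int row)
{
    for (int pivot = 0; pivot < row; pivot++) {
        /* Wait for the pivot row.  The critical section is empty, so a
         * readers' lock lets the release skip all coherence work. */
        reader_lock(pivot);
        reader_unlock(pivot);

        eliminate_with_pivot(row, pivot);   /* writes only this row */
    }
    unlock(row);    /* exclusive release: make this row's values visible */
}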

Even with the changes just described, there are program data structures that are shared at a very fine grain (both spatial and temporal), and that can therefore cause performance degradation. It can be beneficial to disallow caching for such data structures, and to access the memory module in which they reside directly. We term this kind of access remote reference, although the memory module may sometimes be local to the processor making the reference. We have identified the data structures in our programs that could benefit from remote reference and have annotated them appropriately by hand; our annotations range from one line of code in water to about ten lines in mp3d. Mp3d sees the largest benefit: it improves almost twofold when told to use remote reference on the space cell data structure. Appbt also improves noticeably when told to use remote reference on a certain array of condition variables. Water and Gauss improve only minimally; they have a bit of fine-grain shared data, but they don't use it very much.
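A sketch of what such an annotation might look like for the mp3d space-cell array appears below. The allocation and annotation calls are hypothetical stand-ins, and the structure contents and dimensions are invented for the illustration; only the idea of disabling caching for a fine-grain shared structure comes from the text.

/* Hypothetical annotation interface: mark_uncached() maps a range of shared
 * memory with caching disabled, so accesses go directly to the home memory
 * module and never involve the coherence protocol. */
extern void *shared_malloc(unsigned long bytes);
extern void  mark_uncached(void *start, unsigned long bytes);

#define CELLS_X 16
#define CELLS_Y 16
#define CELLS_Z 16                        /* illustrative dimensions */

typedef struct { int first_molecule; int count; } space_cell_t;

space_cell_t *allocate_space_cells(void)
{
    unsigned long bytes = (unsigned long)sizeof(space_cell_t)
                          * CELLS_X * CELLS_Y * CELLS_Z;
    space_cell_t *cells = shared_malloc(bytes);

    /* The space-cell array is written at a fine grain by all processors;
     * caching it would keep its pages permanently weak, so it is accessed
     * with uncached remote references instead. */
    mark_uncached(cells, bytes);
    return cells;
}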

The performance improvements for our four modified applications can be seen in Figures 6 through 9. Gauss improves markedly when the lock interference problem is fixed, and also benefits from the identification of reader-writer locks; remote reference helps only a little. Water gains most of its performance improvement from padding the molecule data structures to sub-page boundaries and relocating synchronization variables. Mp3d benefits from relocating synchronization variables and padding the molecule data structure to sub-page boundaries; it benefits even more from improving the locality of particle interactions via sorting, and remote reference shaves off a bit more. Finally, appbt sees dramatic improvements after relocating one of its data structures to achieve good page alignment, and benefits nicely from the use of remote reference as well.

Our program changes were simple: identifying and fixing the problems was a mechanical process that consumed at most a few hours. The one exception was mp3d which, apart from the mechanical changes, required an understanding of program semantics for the sorting of particles. Even in that case, identifying the problem took less than a day; fixing it was even simpler (a call to a sorting routine). We believe that such modest forms of tuning represent a reasonable demand on the programmer. We are also hopeful that smarter compilers will be able to make many of the changes automatically. The results for mp3d could most likely be further improved with more major restructuring of access to the space cell data structure, but this would require effort out of keeping with the current study.

4.4 Hardware vs. software coherence

Figures 10 and 11 compare the performance of our best software protocol to that of a relaxed-consistency, DASH-like hardware protocol on 16 and 64 processors, respectively. The unit line in the graphs represents the running time of each application under a sequentially consistent hardware coherence protocol. In all cases the performance of the software protocol is within a modest margin of the better of the hardware protocols; in most cases it is much closer, and for fft the software protocol is fastest. For all programs, the best software protocol is the one described in Section 2, with a distributed coherence map and weak list, safe/unsafe states, delayed transitions to the weak state, and (except for mp3d) write-back caches augmented with per-word dirty bits. (The mp3d result uses a write-through cache with the write-collect buffer, since this is the configuration that performs best for software coherence on this program.) The applications include all the program modifications described in Section 4.3, though remote reference is used only in the context of software coherence; it does not make sense in the hardware-coherent case. Experiments not shown confirm that the program changes improve performance under both hardware and software coherence, though they help more in the software case. They also help the sequentially-consistent hardware more than the release-consistent hardware; we believe this accounts for the relatively modest observed advantage of the latter over the former.

Figure 10: Comparative software and hardware system performance on 16 processors (normalized execution time of the best software and best hardware protocols, relative to sequentially consistent hardware).

Figure 11: Comparative software and hardware system performance on 64 processors (normalized execution time of the best software and best hardware protocols, relative to sequentially consistent hardware).

5 Related Work

Our work is most closely related to that of Petersen and Li: we both use the notion of weak pages, and purge caches on acquire operations. The difference is scalability: we distribute the coherent map and weak list, distinguish between safe and unsafe pages, check the weak list only for unsafe pages mapped by the current processor, and multicast write notices for safe pages that turn out to be weak. We have also examined architectural alternatives and program-structuring issues that were not addressed by Petersen and Li. Our work resembles Munin and lazy release consistency in its use of delayed write notices, but we take advantage of the globally accessible physical address space for cache fills and for access to the coherent map and the local weak lists.

Our use of remote reference to reduce the overhead of coherence management can also be found in work on NUMA memory management. However, relaxed consistency greatly reduces the opportunities for profitable remote data reference. In fact, early experiments we have conducted with on-line NUMA policies and relaxed consistency have failed badly in their attempt to determine when to use remote reference.

On the hardware side, our work bears resemblance to the Stanford Dash project, in the use of a relaxed consistency model, and to the Georgia Tech Beehive project, in the use of relaxed consistency and per-word dirty bits for successful merging of inconsistent cache lines. Both these systems use their extra hardware to allow coherence messages to propagate in the background of computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations.

Coherence for distributed memory with per-processor caches can also be maintained entirely by a compiler. Under this approach, the compiler inserts the appropriate cache flush and invalidation instructions in the code to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs, often dictate conservative decisions that result in higher miss rates and reduced performance.

6 Conclusions

We have shown that supporting a shared memory programming model while maintaining high performance does not necessarily require expensive hardware. Similar results can be achieved by maintaining coherence in software, using the operating system and address translation hardware. We have introduced a new scalable protocol for software cache coherence and have shown that it outperforms existing approaches (both relaxed and sequentially consistent). We have also studied the tradeoffs between different cache write policies, showing that in most cases a write-back cache is preferable, but that a write-collect buffer can help make a write-through cache acceptable. Both write-back with per-word dirty bits and write-collect require special hardware, but neither approaches the complexity of full-scale hardware coherence. Finally, we have shown how some simple program modifications can significantly improve performance on a software coherent system.

We are currently studying the sensitivity of software coherence schemes to architectural parameters (e.g., network latency, and page and cache line sizes). We are also pursuing protocol optimizations that should improve performance for important classes of programs. For example, we are considering policies in which flushes of modified lines and purges of invalidated pages are allowed to take place in the background: during synchronization waits or idle time, or on a communication coprocessor. We are developing on-line policies that use past page behavior to identify situations in which remote access is likely to outperform remote cache fills, and we are considering several issues in the use of remote reference, such as whether to adopt it globally for a given page or to let each processor make its own decision and deal with the coherence issues that then arise. Finally, we believe strongly that software coherence can benefit greatly from compiler support. We are actively pursuing the design of annotations that a compiler can use to provide performance-enhancing hints for OS-based coherence.

Acknowledgements

Our thanks to Ricardo Bianchini and Jack Veenstra for the long nights of discussions, idea exchanges, and suggestions that helped make this paper possible.

References

A. Agarwal et al. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. In M. Dubois and S. S. Thakkar, editors, Scalable Shared Memory Multiprocessors. Kluwer Academic Publishers.

S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, NJ.

T. E. Anderson, H. M. Levy, B. N. Bershad, and E. D. Lazowska. The Interaction of Architecture and Operating System Design. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

J. Archibald and J. Baer. Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Transactions on Computer Systems, November.

D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS Parallel Benchmarks. Report RNR, NASA Ames Research Center, January.

B. N. Bershad and M. J. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors. Technical Report CMU-CS, Carnegie Mellon University, September.

W. J. Bolosky, R. P. Fitzgerald, and M. L. Scott. Simple But Effective Techniques for NUMA Memory Management. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, December.

W. J. Bolosky, M. L. Scott, R. P. Fitzgerald, R. J. Fowler, and A. L. Cox. NUMA Policies and Their Relation to Memory Architecture. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

W. J. Bolosky and M. L. Scott. False Sharing and its Effect on Shared Memory Performance. In Proceedings of the Fourth USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, September. Also available as MSR-TR, Microsoft Research Laboratory, September.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, CA, October.

D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

Y. Chen and A. Veidenbaum. An Effective Write Policy for Software Coherence Schemes. In Proceedings, Supercomputing, Minneapolis, MN, November.

H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, June.

A. L. Cox and R. J. Fowler. The Implementation of a Coherent Memory Abstraction on a NUMA Multiprocessor: Experiences with PLATINUM. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, December.

E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic Software Cache Coherence Through Vectorization. In ACM International Conference on Supercomputing, Washington, DC, July.

S. J. Eggers and R. H. Katz. Evaluation of the Performance of Four Snooping Cache Coherency Protocols. In Proceedings of the Sixteenth International Symposium on Computer Architecture, May.

S. J. Eggers and R. H. Katz. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, April.

M. D. Hill and J. R. Larus. Cache Considerations for Multiprocessor Programmers. Communications of the ACM, August.

P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. ACM SIGARCH Computer Architecture News, May.

P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. ParaNet: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter Technical Conference, San Francisco, CA, January.

Kendall Square Research. KSR Principles of Operation. Waltham, MA.

R. P. LaRowe Jr. and C. S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, November.

R. P. LaRowe Jr., C. S. Ellis, and L. S. Kaplan. The Robustness of NUMA Memory Management. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, CA, October.

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the Seventeenth International Symposium on Computer Architecture, Seattle, WA, May.

D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash Multiprocessor. Computer, March.

K. Petersen and K. Li. Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support. In Proceedings of the Seventh International Parallel Processing Symposium, Newport Beach, CA, April.

K. Petersen. Operating System Support for Modern Memory Hierarchies. Ph.D. dissertation, Technical Report CS-TR, Department of Computer Science, Princeton University, October.

G. Shah and U. Ramachandran. Towards Exploiting the Architectural Features of Beehive. Technical Report GIT-CC, College of Computing, Georgia Institute of Technology, November.

J. P. Singh, W. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared Memory. ACM SIGARCH Computer Architecture News, March.

J. E. Veenstra. Mint Tutorial and User Manual. Technical Report, Computer Science Department, University of Rochester, July.

J. E. Veenstra and R. J. Fowler. Mint: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Durham, NC, January-February.