Overview.

• Mul3processor(OS((Background(and(Review)( - How(does(it(work?((Background)( - Scalability((Review)( • Mul3processor(Hardware( - Contemporary(systems((Intel,(AMD,(ARM,(Oracle/Sun)( - Experimental(and(Future(systems((Intel,(MS,(Polaris)( • OS(Design(for(Mul3processors( - Guidelines( Mul1processor.OS. - Design(approaches( • Divide(and(Conquer((Disco,(Tessela3on)( COMP9242(–(Advanced(Opera3ng(Systems( • Reduce(Sharing((K42,(Corey,(Linux,(FlexSC,(scalable(commuta3vity)( Ihor(Kuz(|([email protected]( T2/2019(Week(10( • No(Sharing((Barrelfish,(fos)( www.data61.csiro.au.

2((|( COMP9242(T2/2019(W10(

Uniprocessor.OS. CPU.

App1.App2.

OS. Mul1processor.OS.

OS.data. Applica1on.data. App4. Run.. Process.control.. FS.. App1. queue. blocks. structs. App2. App3.

Memory.

3( COMP9242(T2/2019(W10( 4((|( COMP9242(T2/2019(W10(

Mul1processor.OS. Mul1processor.OS. CPU. CPU. CPU. CPU. CPU. CPU. CPU. CPU.

App1. App3. App4. App4. App1. App3. App4. App4. Key(design(challenges:( OS. OS. OS. OS. OS. OS. • Correctness(of((shared)(data(structures(OS. OS. • Scalability((performance(doesn’t(suffer)(

OS.data. Applica1on.data. OS.data. Applica1on.data. App4. App4. Run.. Process.control.. FS.. App1. Run.. Process.control.. FS.. App1. queue. blocks. structs. App2. App3. queue. blocks. structs. App2. App3.

Memory. Memory.

5((|( COMP9242(T2/2019(W10( 6((|( COMP9242(T2/2019(W10( Correctness.of.Shared.Data. Scalability.

• Concurrency(control( Speedup(as(more(processors(added( - Locks( Ideal. - Semaphores( - T Transac3ons( S(N) 1 - Lockdfree(data(structures( = TN • We(know(how(to(do(this:( - In(the(applica3on( - In(the(OS( Speedup.(S).

number.of.processors.(n).

7((|( COMP9242(T2/2019(W10( 8((|( COMP9242(T2/2019(W10(

Scalability. Scalability.and.Serialisa1on. Processor.1. Processor.2. Processor.3. Speedup(as(more(processors(added( Parallel. Reality. Parallel( Parallel( Parallel( Program. Parallel( Parallel( Parallel( T Parallel( Parallel( Parallel( S(N) = 1 Parallel( Parallel( Parallel( TN Parallel( Parallel( Parallel( Serial( Parallel( Serial( Serial( speedup. Parallel( Parallel( Serial(

Parallel( Parallel( Parallel( number.of.processors. Parallel( Parallel( Parallel(

9((|( COMP9242(T2/2019(W10( 10((|( COMP9242(T2/2019(W10(

Scalability.and.Serialisa1on. Serialisa1on. Remember(Amdahl’s(law( - Serial((nondparallel)(por3on:(when(applica3on(not(running(on(all(cores( Where(does(serialisa3on(show(up?( - Applica3on((e.g.(access(shared(app(data)( - Serialisa3on(prevents(scalability( - OS((e.g.(performing(syscall(for(app)(How(much(3me(is(spent(in(OS?( T =1= (1− P)+ P 1 Sources.of.Serialisa1on. P TN = (1− P)+ Locking((explicit(serialisa3on)( N • Wai3ng(for(a(lock(!(stalls(self( T 1 • Lock(implementa3on:(( S(N) = 1 = • Atomic(opera3ons(lock(bus(!(stalls(everyone(wai3ng(for(memory(( T P • Cache(coherence(traffic(loads(bus(!(stalls(others(wai3ng(for(memory(( N (1− P)+ N Memory(access((implicit)( 1 - Rela3vely(high(latency(to(memory((!(stalls(self(( S(∞) → (1− P) Cache((implicit)( - Processor(stalled(while(cache(line(is(fetched(or(invalidated( - Affected(by(latency(of(interconnect( - Performance(depends(on(data(size((cache(lines)(and(conten3on((number(of(cores)(

11((|( COMP9242(T2/2019(W10( From(hgp://en.wikipedia.org/wiki/File:AmdahlsLaw.svg( 12((|( COMP9242(T2/2019(W10( More.CacheMrelated.Serialisa1on.

False(sharing( - Unrelated(data(structs(share(the(same(cache(line( - Accessed(from(different(processors(( !(Cache(coherence(traffic(and(delay( Cache(line(bouncing( Mul1processor. - Shared(R/W(on(many(processors( - E.g:(bouncing(due(to(locks:(each(processor(spinning(on(a(lock(brings(it(into(its(own(cache( !(Cache(coherence(traffic(and(delay( Hardware. Cache(misses( - Poten3ally(direct(memory(access(!(stalls(self( - When(does(cache(miss(occur?( • Applica3on(accesses(data(for(the(first(3me,((Applica3on(runs(on(new(core(( • Cached(memory(has(been(evicted( • Cache(footprint(too(big,(another(app(ran,(OS(ran(

13((|( COMP9242(T2/2019(W10( 14( COMP9242(T2/2019(W10(

Mul1MWhat?. Interes1ng.Proper1es.of.Mul1processors.

• Terminology:(( • Scale(and(Structure( - core,(die((chip),(package((module,(processor,(CPU)( - How(many(cores(and(processors(are(there( • Mul3processor,(SMP( - What(kinds(of(cores(and(processors(are(there( - >1(separate(processors,(connected(by(offdprocessor(interconnect( - How(are(they(organised((access(to(IO,(etc.)( • Mul3thread,(SMT( • Interconnect( - >1(hardware(threads(in(a(single(processing(core( - How(are(the(cores(and(processors(connected( • Mul3core,(CMP( - >1(processing(cores(in(a(single(die,(connected(by(onddie(interconnect( • Memory(Locality(and(Caches( • Mul3core(+(Mul3processor( - Where(is(the(memory( - >1(mul3core(dies(in(a(package((mul3dchip(module),(ondprocessor(interconnect(( - What(is(the(cache(architecture( - >1(mul3core(processors,(offdprocessor(interconnect( • Interprocessor(Communica3on( • Manycore( - How(do(cores(and(processors(send(messages(to(each(other( - Lots((>100)(of(cores(

15((|( COMP9242(T2/2019(W10( 16((|( COMP9242(T2/2019(W10(

Contemporary.Mul1processor.Hardware. Scale.and.Structure.

• Intel:(( • ARM(Cortex(A9(MPCore( - Nehalem,(Westmere:((10(core,(QPI( - Sandy(Bridge,(Ivy(Bridge:(5(core,(ring(bus,(integrated(GPU,(L3,(IO( - Haswell((Broadwell):(18+(core,(ring(bus,(transac3onal(memory,(slices((EP)( - Skylake((SP):(mesh(architecture( • AMD:( - K10((:(Barcelona,(Magny(Cours):(12(core,(Hypertransport( - Bulldozer,(Piledriver,(Steamroller((Opteron,(FX)( • 16(core,(Clustered(Mul3thread:(module(with(2(integer(cores( - :(on(die(NUMA:(CPU(Complex((CCX)((4(core,(private(L3)( - Zen(2:(chiplets((2xCCX)(chiplets,(IO(die((incl(mem(controller)( • Oracle((Sun)(UltraSparc(T1,T2,T3,T4,T5((Niagara),(M5,M7( - T5:(16(cores,(8(threads/core((2(simultaneous),(crossbar,(8(sockets,(( - M8:(32(core,(8(threads,(on(chip(network,(8(sockets,(5GHz( • ARM(Cortex(A9,(A15(MPCore,(big.LITTLE,(DynamIQ( - 4(d8(cores,(big.LITTLE:(A7(+(A15,(dynamIQ:(A75(+(A55(

17((|( COMP9242(T2/2019(W10( 18((|( COMP9242(T2/2019(W10( From(hgp://www.arm.com/images/CortexdA9dMPdcore_Big.gif( ( Scale.and.Structure. Scale.and.Structure.

• ARM(big.LITTLE(

19((|( COMP9242(T2/2019(W10( From(hgp://www.arm.com/images/Fig_1_CortexdA15_CCI_CortexdA7_System.jpg( 20((|( COMP9242(T2/2019(W10( From(hgps://developer.arm.com/d/media/developer/Other%20Images/dynamiqdimprovementsdoverdbigdligle.png( (

Scale.and.Structure. Memory.Locality.and.Caches.

• Intel(Nehalem( • NUMA((NondUniform(Memory(Access)(

From(www.dawnorhered.net/wpdcontent/uploads/2011/02/NehalemdEXdarchitectureddetailed.jpg( From(www.dawnorhered.net/wpdcontent/uploads/2011/02/NehalemdEXdarchitectureddetailed.jpg( 21((|( COMP9242(T2/2019(W10( 22((|( COMP9242(T2/2019(W10( ( (

Interconnect. Interconnect.(Latency).

• AMD(Barcelona( (

23((|( COMP9242(T2/2019(W10( From(www.sigops.org/sosp/sosp09/slides/baumanndslidesdsosp09.pdf( 24((|( COMP9242(T2/2019(W10( From(www.systems.ethz.ch/educa3on/pastdcourses/falld2010/aos/lectures/wk10dmul3core.pdf( “Victoria Falls” 128 threads Par:al'interconnect' FB DIMMFB FB DIMM DIMM FB FBDIMM DIMM FB FB DIMM DIMM FB DIMM 16 cores Interconnect.(Bandwidth). Interconnect. 65x (two sockets) MCU MCU MCU MCU • Oracle(Sparc(T2( MCU MCU MCU MCU

UltraSPARC T2 L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ processor L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ 64 threads Node'0' Full Cross Bar eight cores Full Cross Bar 35x C0 C1 C2 C3 C4 C5 C6 C7 C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU UltraSPARC® T1 processor 32 threads Node'6' Node'7' eight cores NIU 14x Sys I/F PCIe (E-NET+) Buffer Switch Core UltraSPARC® IIIi NIU Sys I/F PCIe processor (Ethernet+) Buffer Switch Core 1x 2x 10 Power <100 W x8 @2. GHz Gigabit Ethernet 3GB/s( 6GB/s( 4GB/sd3GB/s( Unidirec3onal( 2x 10 Power <95 W x8 @ 2.0 GHz No'direct'link'between'node'0'and'7,'0'will'do'“2_hops”'to'access'7' Gigabit Ethernet 2004 2005 2006 2007 2008 10'

25((|( COMP9242(T2/2019(W10( From(hgps://www.usenix.org/conference/atc15/technicaldsession/presenta3on/lepers( 26((|( COMP9242(T2/2019(W10( From(Sun/Oracle(

Interconnect. Interconnect/Structure/Memory.

27((|( COMP9242(T2/2019(W10( From(hgp://www.anandtech.com/show/8423/inteldxeonde5dversiond3dupdtod18dhaswelldepdcoresd/4( 28((|( COMP9242(T2/2019(W10( From(hgp://www.anandtech.com/show/8423/inteldxeonde5dversiond3dupdtod18dhaswelldepdcoresd/4(

Experimental/Future/NonMmainstream.Mul1processor. Scale.and.Structure. Hardware. • Tilera.Tile64.(newest:.Mellanox.TILEMGx),(Intel(Polaris( • Microsor(Beehive(

- Ring(bus,(no(cache(coherence( DDR2 Controller 0 DDR2 Controller 1 • Tilera((now(Mellanox)(Tile64,(TiledGx( SerDes SerDes PROCESSOR CACHE - 100(cores,(mesh(network( PCIe 0 Reg File L2 CACHE MAC/ L-1I L-1D P P P I-TLB D-TLB PHY • Intel(Polaris( 2 1 0 2D DMA - UART, MDN TDN 80(cores,(mesh(network( HPI, I2C, GbE 0 UDN IDN JTAG,SPI STN SWITCH • Intel(SCC( Flexible GbE 1 I/O

- 48(cores,(mesh(network,(no(cache(coherency( Flexible I/O PCIe 1 • Intel(MIC((Mul3(Integrated(Core)(( MAC/ XAUI 1 PHY MAC/ - Knight’s(Corner/Landing(d(Xeon(Phi( PHY SerDes - 60+(cores,(ring(bus/mesh( SerDes DDR2 Controller 3 DDR2 Controller 2

29((|( COMP9242(T2/2019(W10( 30((|( COMP9242(T2/2019(W10( From(www.3lera.com/products/processors/TILE64(( Cache.and.Memory.and.IPC. Interprocessor.Communica1on. • Intel(SCC( • Beehive(

31((|( COMP9242(T2/2019(W10( From(techresearch.intel.com/spaw2/uploads/files/SCC_Plasorm_Overview.pdf( 32((|( COMP9242(T2/2019(W10( From(projects.csail.mit.edu/beehive/BeehiveV5.pdf(

Interconnect. Skylake.SP. • Intel(MIC((Mul3(Integrated(Core)((Knight’s(Corner/Landing(d(Xeon(Phi)(

33((|( COMP9242(T2/2019(W10( 34((|( COMP9242(T2/2019(W10( From(hgps://itpeernetwork.intel.com/inteldmeshdarchitectureddatadcenter/( From(hgp://semiaccurate.com/2012/08/28/intelddetailsdknightsdcornerdarchitecturedatdlongdlast/(

Summary. • Scalability( - 100+(cores( - Amdahl’s(law(really(kicks(in( • Heterogeneity( - Heterogeneous(cores,(memory,(etc.( - Proper3es(of(similar(systems(may(vary(wildly((e.g.(interconnect(topology(and(latencies(between( different(AMD(plasorms)( OS.DESIGN.for. • NUMA( - Also(variable(latencies(due(to(topology(and(cache(coherence( Mul1processors. • Cache(coherence(may(not(be(possible( - Can’t(use(it(for(locking( - Shared(data(structures(require(explicit(work( • Computer(is(a(distributed(system( - Message(passing( - Consistency(and(Synchronisa3on( - Fault(tolerance(

35((|( COMP9242(T2/2019(W10( 36( COMP9242(T2/2019(W10( Op1misa1on.for.Scalability. Op1misa1on.for.Scalability.

• Reduce(amount(of(code(in(cri3cal(sec3ons( • Reduce(false(sharing( - Increases(concurrency( - Pad(data(structures(to(cache(lines( - Fine(grained(locking( • Reduce(cache(line(bouncing( • Lock(data(not(code( - Reduce(sharing( • Tradeoff:(more(concurrency(but(more(locking((and(locking(causes(serialisa3on)( - E.g:(MCS(locks(use(local(data( - Lock(free(data(structures( • Reduce(cache(misses( • Avoid(expensive(memory(access( - Affinity(scheduling:(run(process(on(the(core(where(it(last(ran.( - Avoid(uncached(memory( - Avoid(cache(pollu3on( - Access(cheap((close)(memory( (

37((|( COMP9242(T2/2019(W10( 38((|( COMP9242(T2/2019(W10(

OS.Design.Guidelines.for.Modern.(and.future). Mul1processors. Design.approaches. • Avoid(shared(data( • Divide(and(conquer( - Performance(issues(arise(less(from(lock(conten3on(than(from(data(locality( - Divide(mul3processor(into(smaller(bits,(use(them(as(normal( • Explicit(communica3on( - Using(virtualisa3on( - Regain(control(over(communica3on(costs((and(predictability)( - Using(exokernel( - Some3mes(it’s(the(only(op3on( • • Tradeoff:(parallelism(vs(synchronisa3on( Reduced(sharing( - Synchronisa3on(introduces(serialisa3on( - Brute(force(&(Heroic(Effort( - Make(concurrent(threads(independent:(reduce(crit(sec3ons(&(cache(misses(( • Find(problems(in(exis3ng(OS(and(fix(them( • • Allocate(for(locality( E.g(Linux(rearchitec3ng:(BKL(d>(fine(grained(locking( - E.g.(provide(memory(local(to(a(core( - By(design( • Avoid(shared(data(as(much(as(possible( • Schedule(for(locality( - With(cached(data( • No(sharing( - With(local(memory( - Computer(is(a(distributed(system( • Tradeoff:(uniprocessor(performance(vs(scalability( • Do(extra(work(to(share!(

39((|( COMP9242(T2/2019(W10( 40((|( COMP9242(T2/2019(W10(

Divide.and.Conquer. Disco.Architecture.

Disco. - Scalability(is(too(hard!( • Context:(( - ca.(1995,(large(ccNUMA(mul3processors(appearing( - Scaling(OSes(requires(extensive(modifica3ons( • Idea:( - Implement(a(scalable(VMM( - Run(mul3ple(OS(instances( • VMM(has(most(of(the(features(of(a(scalable(OS:( - NUMA(aware(allocator( - Page(replica3on,(remapping,(etc.(( • VMM(substan3ally(simpler/cheaper(to(implement( • Modern(incarna3ons(of(this( - Virtual(servers((Amazon,(etc.)( - Research((Cerberus)(

Running(commodity(OSes(on(scalable(mul3processors([Bugnion(et(al.,(1997]( 41((|( COMP9242(T2/2019(W10( hgp://wwwdflash.stanford.edu/Disco/(( 42((|( COMP9242(T2/2019(W10( ( Disco.Performance. SpaceMTime.Par11oning.

Tessella1on. - SpacedTime(par33oning( - 2dlevel(scheduling( • Context:(( - 2009d…(highly(( ((parallel(mul3core(( ((systems( - Berkeley(Par(Lab(

Tessella3on:(SpacedTime(Par33oning(in(a(Manycore(Client(OS([Liu(et(al.,(2010]( 43((|( COMP9242(T2/2019(W10( 44((|( COMP9242(T2/2019(W10( hgp://tessella3on.cs.berkeley.edu/(

Tessella1on. Reduce.Sharing.

K42. • Context:( - 1997d2006:(OS(for(ccNUMA(systems( - IBM,(U(Toronto((Tornado,(Hurricane)( • Goals:(( - High(locality( - Scalability( • Object(Oriented( - Fine(grained(objects( • Clustered((Distributed)(Objects( - Data(locality( • Deferred(dele3on((RCU)( - Avoid(locking(( • NUMA(aware(memory(allocator( - Memory(locality(

Clustered(Objects,(Ph.D.(thesis([Appavoo,(2005]( 45((|( COMP9242(T2/2019(W10( 46((|( COMP9242(T2/2019(W10( hgp://www.research.ibm.com/K42/(

K42:.FineMgrained.objects. K42:.Clustered.objects.

• Globally(valid(object(reference( • Resolves(to(( - Processor(local(representa3ve( • Sharing,(locking((strategy(local(to(each(object( • Transparency( - Eases(complexity( - Controlled(introduc3on(of(locality( • Shared(counter:( - inc,(dec:(local(access( - val:(communica3on( • Fast(path:( - Access(mostly(local(structures(

47((|( COMP9242(T2/2019(W10( 48((|( COMP9242(T2/2019(W10( K42.Performance. Corey.

2.4.19( • Context( - 2008,(highdend(mul3core(servers,(MIT( • Goals:( - Applica3on(control(of(OS(sharing( • OS( - Exokerneldlike,(higherdlevel(services(as(libraries( - By(default(only(single(core(access(to(OS(data(structures( - Calls(to(control(how(data(structures(are(shared( • Address(Ranges( - Control(private(per(core(and(shared(address(spaces( • Kernel(Cores( - Dedicate(cores(to(run(specific(kernel(func3ons( • Shares( - Lookup(tables(for(kernel(objects(allow(control(over(which(object(iden3fiers(are(visible(to(other(cores.(

( Corey:(An(Opera3ng(System(for(Many(Cores([BoyddWickizer(et(al.,(2008]( 49((|( COMP9242(T2/2019(W10( 50((|( COMP9242(T2/2019(W10( hgp://pdos.csail.mit.edu/corey(

Linux.Brute.Force.Scalability. Linux.Brute.Force.Scalability.

• • Apply(lessons(from(parallel(compu3ng(and(past(research( Context( - sloppy(counters,(( - 2010,(highdend(mul3core(servers,(MIT( - perdcore(data(structs,(( - finedgrained(lock,(lock(free,(( • Goals:( - cache(lines(( - Scaling(commodity(OS( - 3002(lines(of(code(changed( • Linux(scalability(( - (2010(–(scale(Linux(to(48(cores)(

• Conclusion:(( - no(scalability(reason(to(give(up(on(tradi3onal(opera3ng(system(organiza3ons(just(yet.(

51((|( COMP9242(T2/2019(W10( An(Analysis(of(Linux(Scalability(to(Many(Cores([BoyddWickizer(et(al.,(2010]( 52((|( COMP9242(T2/2019(W10(

Scalability.of.the.API. Scalable.Commuta1vity.Rule.

• Context( • The(Rule( - Whenever,interface,opera1ons,commute,,they,can,be,implemented,in,a,way,that,scales.( - 2013,(previous(mul3core(projects(at(MIT( • Commuta3ve(opera3ons:(( • Goals( - Cannot(dis3nguish(order(of(opera3ons(from(results( - How(to(know(if(a(system(is(really(scalable?( - Example:( • Creat:( • Workloaddbased(evalua3on( • Requires(that(lowest(available(FD(be(returned( - Run(workload,(plot(scalability,(fix(problems( • Not(commuta3ve:(can(tell(which(one(was(run(first( - Did(we(miss(any(nondscalable(workload?( • Why(are(commuta3ve(opera3ons(scalable?( - Did(we(find(all(boglenecks?( - results(independent(of(order((communica3on(is(unnecessary( - without(communica3on,(no(conflicts( • Is(there(something(fundamental(that(makes(a(system(nondscalable?( • Informs(sorware(design(process( - The(interface(might(be(a(fundamental(bogleneck( - Design:(design(guideline(for(scalable(interfaces( - Implementa3on:(clear(target( - Test:(workloaddindependent(tes3ng(

The(Scalable(Commuta3vity(Rule:(Designing(Scalable(Sorware(for(Mul3core(Processors( 53((|( COMP9242(T2/2019(W10( [Clements(et(al.,(2013]( 54((|( COMP9242(T2/2019(W10( Commuter:.An.Automated.Scalability.Tes1ng.Tool. FlexSC.

• Context:( - 2010,(commodity(mul3cores( - U(Toronto( • Goal:( - Reduce(context(switch(overhead(of(system(calls( • Syscall(context(switch:( - Usual(mode(switch(overhead( (sv6)( - But:(cache(and(TLB(pollu3on!(

FlexSC:(Flexible(System(Call(Scheduling(with(Excep3ondLess(System(Calls(( 55((|( COMP9242(T2/2019(W10( 56((|( COMP9242(T2/2019(W10( [Soares(and(Stumm.,(2010](

FlexSC. FlexSC.Results.

• Asynchronous(system(calls( - Batch(system(calls( - Run(them(on(dedicated(cores( • FlexSCdThreads (( - M(on(N( - M(>>(N(

Apache( FlexSC:(batching,(( sys(call(core(redirect(

57((|( COMP9242(T2/2019(W10( 58((|( COMP9242(T2/2019(W10(

No.sharing. Barrelfish.

• Mul3kernel( • Context:(( - 2007(large(mul3core(machines(appearing( - Barrelfish( - 100s(of(cores(on(the(horizon( - fos:(factored(opera3ng(system( - NUMA((cc(and(nondcc)( - ETH(Zurich(and(Microsor( • Goals:( - Scale(to(many(cores( - Support(and(manage(heterogeneous(hardware( • Approach:( - Structure(OS(as(distributed,system, • Design(principles:( - Interprocessor(communica3on(is(explicit( - OS(structure(hardware(neutral( - State(is(replicated( • Microkernel( - Similar(to(seL4:(capabili3es(

The(Mul3kernel:(A(new(OS(architecture(for(scalable(mul3core(systems([Baumann(et(al.,(2009]( The(Mul3kernel:(A(new(OS(architecture(for(scalable(mul3core(systems(( 59((|( COMP9242(T2/2019(W10( hgp://www.barrelfish.org/( 60((|( COMP9242(T2/2019(W10( [Baumann(et(al.,(2009](((hgp://www.barrelfish.org/( Barrelfish. Barrelfish:.Replica1on.

• Kernel(+(Monitor:( - Only(memory(shared(for(message(channels( • Monitor:( - Collec3vely(coordinate(systemdwide(state( • Systemdwide(state:( - Memory(alloca3on(tables( - Address(space(mappings( - Capability(lists( • What(state(is(replicated(in(Barrelfish( - Capability(lists( • Consistency(and(Coordina3on( - Retype:(twodphase(commit(to(globally(execute(opera3on(in(order( - Page((re/un)mapping:(onedphase(commit(to(synchronise(TLBs(

61((|( COMP9242(T2/2019(W10( 62((|( COMP9242(T2/2019(W10(

Barrelfish:.Communica1on. Barrelfish:.Results. • Message(passing(vs(caching( • Different(mechanisms:( 12 - Intradcore( SHM8 • Kernel(endpoints( SHM4 - Interdcore( 10 SHM2 SHM1 • URPC( MSG8 • 8 MSG1 URPC( 1000) Server - Uses(cache(coherence(+(polling( × - Shared(bufffer( 6 • Sender(writes(a(cache(line( • Receiver(polls(on(cache(line( • (last(word(so(no(part(message)( 4

- Polling?( Latency (cycles • Cache(only(changes(when(sender(writes,(so(poll(is(cheap( 2 • Switch(to(block(and(IPI(if(wait(is(too(long.(

0 2 4 6 8 10 12 14 16

63((|( COMP9242(T2/2019(W10( 64((|( COMP9242(T2/2019(W10( Cores

Barrelfish:.Results. Barrelfish:.Results. • Broadcast(vs(Mul3cast( • TLB(shootdown( 14 60 Broadcast Windows Unicast Linux 12 Multicast Barrelfish NUMA-Aware Multicast 50 10 40 1000) 1000) × 8 × 30 6 20 4 Latency (cycles Latency (cycles 10 2

0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 Cores 65((|( COMP9242(T2/2019(W10( Cores 66((|( COMP9242(T2/2019(W10( Summary.

• Trends(in(mul3core( - Scale((100+(cores)( - NUMA( - No(cache(coherence(( - Distributed(system( - Heterogeneity( Summary. • OS(design(guidelines( - Avoid(shared(data( - Explicit(communica3on( - Locality( • Approaches(to(mul3core(OS( - Par33on(the(machine((Disco,(Tessella3on)( - Reduce(sharing((K42,(Corey,(Linux,(FlexSC,(scalable(commuta3vity)( - No(sharing((Barrelfish,(fos)(

67( COMP9242(T2/2019(W10( 68((|( COMP9242(T2/2019(W10(