Re:platforming// the/Datacenter// with/Apache/Mesos/

Christos(Kozyrakis( Why$your$ASF$project$should$run$on$Mesos$ O(10K)/commodity/servers/ HighAspeed/networking/ Distributed/storage/(HDD,/Flash)/ x10/MWatt/ x100/M$/ ( developers$ ops$

automation( automation( performance( efficiency( ( ( ① Datacenter/past/ Static/Partitioning/ Static/Partitioning/

Hadoop/ Cassandra/ Rails/ / memcached/ Static/Partitioning/

Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/

Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/

Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/

developers$ ops$

" automation( " automation(

! performance( " efficiency( ( ( ② Datacenter/present/ Apache/Mesos/

The(datacenter(OS(kernel( Aggregates(all(resources(into(a(single(shared(pool( Dynamically(allocates(resources(to(distributed(apps( Container(management(at(scale((,(,(…)( Mesos/Architecture/

Frameworks/ Marathon( Jenkins( …

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/

Scales(to(10s(of(thousands(of(servers( Mesos/Fault/Tolerance/

Frameworks/ Marathon( Jenkins( …

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/

Tasks(survive(failures(of(the(master((( Mesos/Fault/Tolerance/

Frameworks/ Marathon( Jenkins( …

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/

Tasks(survive(failures(of(the(framework( Mesos/Fault/Tolerance/

Frameworks/ Marathon( Jenkins( …

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/

Tasks(survive(failures(of(the(slave(process( Scheduling/in/Mesos/

No(single(scheduler(fits(all(needs(

LongLrunning(services(need( (scale(up/down,(fault(tolerance(

( Analytics(services(need(

(fast(task(launching( 2Alevel/Scheduling/

Mesos(master((single(API)( Resource(allocation((offers)(( Task(health(checks( Task(isolation(

Mesos(frameworks((domainLspecific(APIs)( Scale(up/down( Fault(tolerance( Task(grouping( Task(dependencies( Queuing(&(priorities(( 2Alevel/Scheduling/Benefits/

Multiple(APIs(to(Mesos(through(frameworks( Marathon,(Aurora,(( Storm,(Spark,(Hadoop,(Chronos( Cassandra,(Elasticsearch(

Multiple(task(scheduling(approaches( Spark(fineLgrain(Vs(Spark(coarseLgrain( ( Simple,(stable,(and(scalable(Mesos(master( Dynamic/Resource/Allocation//

Resource(allocation(based(on(framework(roles( ( Dominant(resource(fairness((DRF)( Weighted(fair(share(calculated(based(on(dominant(resource( Frameworks(do(no(worse(than(having(a(weightLsized(cluster( ( Resource(reservations( Resources(can(be(allocated(to(specific(frameworks(if(needed( ( Dynamic/Resource/Allocation//

Rails/

Hadoop/

buy$less$machines$ memcached/ or$ run$more$applications!$ Service/Discovery/

① Watch(ZK(for( ((master(changes(

Mesos( Mesos( Master( ② Pull(task(state( DNS( (Generate(DNS(records( ③ DNS(&(HTTP( (based(discovery(

Slave( Slave( Slave( Slave( … Slave(

(nginx.marathon.mesos(#(10.13.17.95( _nginx._tcp.marathon.mesos(#10.13.17.95:8181( ( ③ Datacenter/future/ Utilization/Reality/

Twitter (Mesos) (Borg) [Delimitrou’14] [Barroso’09] The/Curse/of/Overprovisioning/

[Delimitrou’14]

Bloated(reservations(to(deal(with( diurnal(load(patterns,(load(spikes,(software(&(platform(changes( Oversubscription/

Frameworks/ Marathon( Jenkins( …

② offer(

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

① offer(

Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/

Frameworks/ Marathon( Jenkins( …

③ launch(

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

④ launch(

Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/

Frameworks/ Marathon( Jenkins( …

② offer(

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

① offer(

Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/

Frameworks/ Marathon( Jenkins( …

③ launch(

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

④ launch(

Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/

Frameworks/ Marathon( Jenkins( …

② Status(

Allocator/ Allocator/ Allocator/

Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/

① Status(

Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Interference/#/Performance/Loss/

Impact of interference on websearch’s latency 300% L3 Cache >300%( >300%( >300%( >300%( >300%( >300%( >300%( 264%( 123%(

DRAM >300%( >300%( >300%( >300%( >300%( >300%( >300%( 270%( 122%( B A D HyperThread 110%( 107%( 114%( 115%( 105%( 117%( 120%( 136%( >300%(

CPU power 124%( 107%( 116%( 109%( 115%( 105%( 101%( 100%( 100%( 100% O K Network 36%( 36%( 37%( 37%( 39%( 42%( 48%( 55%( 64%( 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% [Lo’15] Load Isolators/

Mesos(slave(invokes(isolators( Modules(that(monitor(&(isolate(resources(for(executors(

Isolator(modules( CPU((cgroups(cpushares,(cpusets)( Memory((cgroups)( Disk( Network( Cache( Power( …(( Isolators/#/Performance/QoS/

[Lo et al’15]

+/bestAeffort/task/ >90%/HW/utilization/ No/latency/SLO/problems/ Hybrid/Datacenters/

Marathon(

Mesos/Master/

Mesos/Slaves/ Hybrid/Datacenters/

Marathon(

?/

Mesos/Master/ instance(++(

Mesos/Slaves/

Scheduling(based(on(workload(type,(data(locality,(pricing,…(( Future/Directions/

Oversubscription( ((( Container(&(application(right(sizing( ((( Hybrid(datacenters( ((( Power(aware(scheduling( ((( Locality(aware(scheduling( Mesos/Datacenter/

developers$ ops$

! automation( ! automation(

! performance( ! efficiency( ( ( Want/to/Learn/More?/

( Mo(3pm(–(Cracking(the(Container(Scale(Problem(with(Apache(Mesos( ( Tue(4.20pm(–(The(Emergence(of(the(Datacenter(Developer( (( Wed(4.15pm(–(Mesos(+(Yarn(=(Myriad( Questions?/

(

(

(( References/

• http://mesos.apache.org/(

• http://www.mesosphere.com/(

• https://github.com/mesosphere/mesosLdns(

• https://www.cs.berkeley.edu/~alig/papers/mesos.pdf(

• https://www.cs.berkeley.edu/~alig/papers/drf.pdf(

• http://www.morganclaypool.com/doi/abs/10.2200/ S00516ED2V01Y201306CAC024(

• http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf(

• http://web.stanford.edu/~davidlo/resources/2014.heracles.isca.pdf(