Re:platforming// the/Datacenter// with/Apache/Mesos/
Christos(Kozyrakis( Why$your$ASF$project$should$run$on$Mesos$ O(10K)/commodity/servers/ HighAspeed/networking/ Distributed/storage/(HDD,/Flash)/ x10/MWatt/ x100/M$/ ( developers$ ops$
automation( automation( performance( efficiency( ( ( ① Datacenter/past/ Static/Partitioning/ Static/Partitioning/
Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/
Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/
Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/
Hadoop/ Cassandra/ Rails/ Jenkins/ memcached/ Static/Partitioning/
developers$ ops$
" automation( " automation(
! performance( " efficiency( ( ( ② Datacenter/present/ Apache/Mesos/
The(datacenter(OS(kernel( Aggregates(all(resources(into(a(single(shared(pool( Dynamically(allocates(resources(to(distributed(apps( Container(management(at(scale((cgroups,(docker,(…)( Mesos/Architecture/
Frameworks/ Marathon( Jenkins( …
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/
Scales(to(10s(of(thousands(of(servers( Mesos/Fault/Tolerance/
Frameworks/ Marathon( Jenkins( …
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/
Tasks(survive(failures(of(the(master((( Mesos/Fault/Tolerance/
Frameworks/ Marathon( Jenkins( …
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/
Tasks(survive(failures(of(the(framework( Mesos/Fault/Tolerance/
Frameworks/ Marathon( Jenkins( …
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
Slave/ Slave/ Executor/ Executor/ … Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/
Tasks(survive(failures(of(the(slave(process( Scheduling/in/Mesos/
No(single(scheduler(fits(all(needs(
LongLrunning(services(need( (scale(up/down,(fault(tolerance(
( Analytics(services(need(
(fast(task(launching( 2Alevel/Scheduling/
Mesos(master((single(API)( Resource(allocation((offers)(( Task(health(checks( Task(isolation(
Mesos(frameworks((domainLspecific(APIs)( Scale(up/down( Fault(tolerance( Task(grouping( Task(dependencies( Queuing(&(priorities(( 2Alevel/Scheduling/Benefits/
Multiple(APIs(to(Mesos(through(frameworks( Marathon,(Aurora,(Singularity( Storm,(Spark,(Hadoop,(Chronos( Cassandra,(Elasticsearch(
Multiple(task(scheduling(approaches( Spark(fineLgrain(Vs(Spark(coarseLgrain( ( Simple,(stable,(and(scalable(Mesos(master( Dynamic/Resource/Allocation//
Resource(allocation(based(on(framework(roles( ( Dominant(resource(fairness((DRF)( Weighted(fair(share(calculated(based(on(dominant(resource( Frameworks(do(no(worse(than(having(a(weightLsized(cluster( ( Resource(reservations( Resources(can(be(allocated(to(specific(frameworks(if(needed( ( Dynamic/Resource/Allocation//
Rails/
Hadoop/
buy$less$machines$ memcached/ or$ run$more$applications!$ Service/Discovery/
① Watch(ZK(for( ((master(changes(
Mesos( Mesos( Master( ② Pull(task(state( DNS( (Generate(DNS(records( ③ DNS(&(HTTP( (based(discovery(
Slave( Slave( Slave( Slave( … Slave(
(nginx.marathon.mesos(#(10.13.17.95( _nginx._tcp.marathon.mesos(#10.13.17.95:8181( ( ③ Datacenter/future/ Utilization/Reality/
Twitter (Mesos) Google (Borg) [Delimitrou’14] [Barroso’09] The/Curse/of/Overprovisioning/
[Delimitrou’14]
Bloated(reservations(to(deal(with( diurnal(load(patterns,(load(spikes,(software(&(platform(changes( Oversubscription/
Frameworks/ Marathon( Jenkins( …
② offer
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
① offer
Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/
Frameworks/ Marathon( Jenkins( …
③ launch
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
④ launch
Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/
Frameworks/ Marathon( Jenkins( …
② offer
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
① offer
Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/
Frameworks/ Marathon( Jenkins( …
③ launch
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
④ launch
Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Oversubscription/
Frameworks/ Marathon( Jenkins( …
② Status
Allocator/ Allocator/ Allocator/
Masters/ AuthN/ AuthZ/ AuthN/ AuthZ/ AuthN/ AuthZ/
① Status
Slave/ Slave/ Executor/ Executor/ . . . Executor/ Executor/ Servers/ Task/ Task/ Task/ Task/ Interference/#/Performance/Loss/
Impact of interference on websearch’s latency 300% L3 Cache >300%( >300%( >300%( >300%( >300%( >300%( >300%( 264%( 123%(
DRAM >300%( >300%( >300%( >300%( >300%( >300%( >300%( 270%( 122%( B A D HyperThread 110%( 107%( 114%( 115%( 105%( 117%( 120%( 136%( >300%(
CPU power 124%( 107%( 116%( 109%( 115%( 105%( 101%( 100%( 100%( 100% O K Network 36%( 36%( 37%( 37%( 39%( 42%( 48%( 55%( 64%( 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% [Lo’15] Load Isolators/
Mesos(slave(invokes(isolators( Modules(that(monitor(&(isolate(resources(for(executors(
Isolator(modules( CPU((cgroups(cpushares,(cpusets)( Memory((cgroups)( Disk( Network( Cache( Power( …(( Isolators/#/Performance/QoS/
[Lo et al’15]
+/bestAeffort/task/ >90%/HW/utilization/ No/latency/SLO/problems/ Hybrid/Datacenters/
Marathon(
Mesos/Master/
Mesos/Slaves/ Hybrid/Datacenters/
Marathon(
?/
Mesos/Master/ instance(++(
Mesos/Slaves/
Scheduling(based(on(workload(type,(data(locality,(pricing,…(( Future/Directions/
Oversubscription( ((( Container(&(application(right(sizing( ((( Hybrid(datacenters( ((( Power(aware(scheduling( ((( Locality(aware(scheduling( Mesos/Datacenter/
developers$ ops$
! automation( ! automation(
! performance( ! efficiency( ( ( Want/to/Learn/More?/
( Mo(3pm(–(Cracking(the(Container(Scale(Problem(with(Apache(Mesos( ( Tue(4.20pm(–(The(Emergence(of(the(Datacenter(Developer( (( Wed(4.15pm(–(Mesos(+(Yarn(=(Myriad( Questions?/
(
(
(( References/
• http://mesos.apache.org/(
• http://www.mesosphere.com/(
• https://github.com/mesosphere/mesosLdns(
• https://www.cs.berkeley.edu/~alig/papers/mesos.pdf(
• https://www.cs.berkeley.edu/~alig/papers/drf.pdf(
• http://www.morganclaypool.com/doi/abs/10.2200/ S00516ED2V01Y201306CAC024(
• http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf(
• http://web.stanford.edu/~davidlo/resources/2014.heracles.isca.pdf(