5/31/14

Autonomic Clouds Part II

Omer Rana (Cardiff University, UK) Manish Parashar (Rutgers University, USA)

Defining the landscape
• Analysis of existing published work – give potential research directions
• Various uses of autonomics – will focus on:
– Auto-scaling and elasticity
– Streaming analytics
– Integrating autonomics into applications
• Integrating autonomics with:
– Existing Cloud middleware
– Distributed Cloud landscape (e.g. GENI Cloud)


Cross-layer Intelligence

• Applications can exhibit dynamic and heterogeneous workloads
– Hard to pre-determine resource requirements
• QoS requirements can differ across multi-tenancy applications
– Batch vs. real-time, throughput vs. response time
• Integrating local resources with Cloud-provisioned resources
– Cloud "bursting" (when and for how long)
– Sharing data dynamically between the two
– Cost implications for long-term use


From Chenyang Lu (Washington Univ. in St Louis)

Links with Control Theory
• Applying input to cause system variables to conform to desired values – often a "set" or "reference" point
– E-commerce server: resource allocation? → T_response = 5 sec
– Embedded networks: flow rate? → delay = 1 sec
– Power usage: energy? → consumption < 250 Watts

• Provide QoS and related guarantees in open, unpredictable environments

• Various modelling approaches:
– Queuing theory (very popular) – generally no feedback in queuing models; hard to characterise transient behaviour and overloads
– Other approaches: Petri nets and process algebras
– Often a design/tune/test cycle – repeated multiple times


Open-loop control

• Compute the control input without continuous variable measurement
– Simple
– Need to know EVERYTHING ACCURATELY to work right
• E-commerce server: workload (request arrival rate? resource consumption?); system (service time? failures?)
• Open-loop control fails when:
– We don't know everything
– We make errors in estimation/modelling
– Things change



Feedback (closed-loop) Control

[Block diagram: the reference value is compared with the sampled controlled variable (via a Monitor) to produce an error; the Controller computes a control input from this error, and an Actuator applies the resulting manipulated variable to the Controlled System.]


Feedback (closed-loop) Control
• Measure variables and use them to compute the control input
– More complicated (so we need control theory)
– Continuously measure & correct
• E-commerce server: measure response time & apply admission control
• Embedded network: measure collisions & change the backoff window
• Feedback control theory makes it possible to control well even if:
– We don't know everything
– We make errors in estimation/modelling
– Things change
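The closed-loop pattern above can be sketched as a one-step proportional controller for the e-commerce example: measure response time, compare it with a set point, and adjust the number of VMs from the error. The gain and VM bounds below are illustrative assumptions, not a tuned design.

```python
# Minimal sketch of one iteration of a closed feedback loop: a proportional
# controller nudging VM count toward a response-time set point.
# target_rt, gain and the bounds are assumed values for illustration.

def feedback_step(n_vms, measured_rt, target_rt=5.0, gain=0.5, n_min=1, n_max=64):
    """error = measured - target; a positive error (too slow) adds capacity,
    a negative one (over-provisioned) removes it."""
    error = measured_rt - target_rt
    adjustment = round(gain * error)
    return max(n_min, min(n_max, n_vms + adjustment))
```

Run periodically, this continuously corrects the allocation even when the workload model is imperfect, which is exactly what the open-loop scheme on the previous slide cannot do.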



Control design methodology

[Diagram: Modelling (analytical models or system identification) produces a dynamic model; Controller Design derives a control algorithm; Performance Analysis checks the result against the requirement specifications, iterating until they are satisfied.]


System Models

• Linear vs. non-linear (differential equations)
• Deterministic vs. stochastic
• Time-invariant vs. time-varying
– Are coefficients functions of time?
• Continuous-time vs. discrete-time
• System identification vs. first principles

• System goals:
– Regulation (e.g. target service levels)
– Tracking (measuring deviation from a target, e.g. change #VMs)
– Optimisation (e.g. minimise response time)


VM Consolidation
• A commonly referenced problem in Cloud computing
– Server cost is the largest contributor to overall operational cost
• Data centres operate at very low utilization
– Microsoft: over 34% of servers at less than 5% utilization (daily average). US average: 4%.
• VM consolidation increases utilization and decreases idling costs
• However, VM consolidation can cause interference in the memory hierarchy (e.g. due to sharing of cache between cores, or memory bandwidth)

VM Consolidation: PACMan
• How much will each VM degrade when placed with other VMs?
• Which, and how many, VMs can be placed on a server whilst still maintaining performance?
• PACMan (Performance Aware Consolidation Manager)
– Minimise resource cost (energy usage or #servers)
– Use of an approximate (computationally efficient) algorithm
• Differentiate between:
– Performance vs. resource efficiency (e.g. batch mode)
– Eco mode: fill up server cores (minimise worst-case degradation, e.g. MapReduce – minimise the time of the worst-case Map task)


Parameters considered: n VMs, on m machines with k cores. Consider a "degradation" parameter as new VMs are added to a machine – identify an optimisation goal, e.g. minimising the sum of maximum degradations.

Alan Roytman, Aman Kansal, Jie Liu and Suman Nath, "PACMan: Performance Aware Virtual Machine Consolidation", Proceedings of ICAC 2013, San Jose, USA (USENIX/ACM)

PACMan – Approach
• Start from an arbitrary initial schedule
• For all ways of swapping VMs, go to the schedule with the smallest sum of maximum degradations
• Limit the total number of swaps to achieve convergence
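The swap-based search described above can be sketched as a local search over VM placements. The `degrade` callback (a VM plus its co-located set → slowdown factor) and the bound on improvement passes are assumptions for illustration, not PACMan's exact algorithm.

```python
import itertools

def max_degradation(server, degrade):
    """Worst-case degradation on one server: for each VM, the slowdown it
    suffers given the set of VMs co-located with it."""
    if not server:
        return 0.0
    return max(degrade(vm, [v for v in server if v != vm]) for vm in server)

def pacman_local_search(schedule, degrade, max_passes=100):
    """Sketch of the swap heuristic: try pairwise VM swaps between servers and
    move to the schedule with the smallest sum of per-server maximum
    degradations, stopping after a bounded number of improvement passes."""
    def cost(s):
        return sum(max_degradation(srv, degrade) for srv in s)
    best = cost(schedule)
    for _ in range(max_passes):
        improved = False
        for i, j in itertools.combinations(range(len(schedule)), 2):
            for a in range(len(schedule[i])):
                for b in range(len(schedule[j])):
                    trial = [list(srv) for srv in schedule]
                    trial[i][a], trial[j][b] = trial[j][b], trial[i][a]
                    if cost(trial) < best:
                        schedule, best, improved = trial, cost(trial), True
        if not improved:
            break
    return schedule, best
```

Bounding the passes trades optimality for convergence time, mirroring the "limit total number of swaps" point on the slide.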


Elasticity
• One of the key "selling points" of Cloud systems
• Various approaches possible:
– Historical information is often useful (response times, queue lengths, arrival rates and request demands)
– Long-term, medium-term and short-term planning
– VM allocation and placement
• Reactive vs. proactive approaches

Dynamic VM allocation
• Understanding "Elasticity": the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible.
• Can elastic provisioning capability be measured?

"Elasticity in Cloud Computing: What It Is, and What It Is Not", Nikolas Herbst, Samuel Kounev, Ralf Reussner, ICAC 2013 (USENIX)


Dynamic VM allocation

• Scale-up speed: switching from an underprovisioned state to an optimal or overprovisioned state
– Can we consider "temporal" aspects of how scaling up takes place?
• Deviation of actual from required resource demand
– Measure the deviation to influence the overall process
• Role of predictive allocation for "known" events
– i.e. know in advance how many VMs to allocate

Scaling Influences & Strategies

• Reactive
– Observed degradation in performance over a particular time window
• Trace-driven
– Based on a short-term prediction
– Could make use of a "cyclic" workload pattern
• Model-driven
– Use of a queuing/Petri net/dynamic systems model
– Parameters often "observed" and tuned off-line
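The trace-driven strategy can be sketched as a predictor that exploits a cyclic workload pattern, forecasting the next interval from the same phase of earlier cycles. The cycle length and the simple averaging rule are assumptions for illustration.

```python
# Hedged sketch of trace-driven prediction over a cyclic workload: predict the
# next value as the average of observations at the same phase of past cycles
# (e.g. same hour yesterday, same hour the day before).

def predict_from_trace(trace, cycle_len):
    """trace: list of past demand samples; cycle_len: samples per cycle."""
    phase = len(trace) % cycle_len          # phase of the value being predicted
    same_phase = trace[phase::cycle_len]    # all past samples at that phase
    if not same_phase:
        return trace[-1] if trace else 0.0  # fall back to a reactive guess
    return sum(same_phase) / len(same_phase)
```

A reactive rule would only react after degradation is observed; this proactive variant anticipates the recurring game-day or daily peak before it arrives.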


An Example


Comparing allocation



Elasticity Metrics


Elasticity Metrics … 2



H. Nguyen, Z. Shen, X. Gu, S. Subbiah, J. Wilkes, "AGILE: Elastic distributed resource scaling for Infrastructure-as-a-Service", Proceedings of ICAC 2013, San Jose, USA (USENIX/ACM)

AGILE

• Medium-term predictions using wavelets
• Use of an "adaptive" copy rate
– Pre-copy the live VM based on the prediction
– Avoids a performance penalty
– Does not require storing and maintaining VM snapshots
– Can be undertaken incrementally – therefore avoids "bursts" in traffic when submitting an entire VM (e.g. compared to "cold cloning")
• Supports post-cloning auto-configuration


Supporting Elastic Behaviour
• Variety of approaches possible:
– Modelling decisions as a Markov Decision Process (TIRAMOLA successfully resizes a NoSQL cluster in a fully automated manner)
– Use of classifier ensembles
– Machine learning strategies (e.g. use of neural networks)
– Rule-based (trigger-driven) approaches

"Automated, Elastic Resource Provisioning for NoSQL Clusters Using TIRAMOLA", Dimitrios Tsoumakos, Ioannis Konstantinou, Christina Boumpouka, Spyros Sioutas, Nectarios Koziris, CCGrid 2013, Delft, The Netherlands

Hybrid Approaches

• Use of different techniques for scaling up vs. scaling down
– Reactive rules for scaling up, regression-based techniques for scaling down
– Reactive rule: queue length of waiting requests (but could be other criteria)
– Predictive assessment (use of queuing models) to dynamically trigger new VMs


TIRAMOLA
• Decision making – cluster resize action according to the applied load, cluster and user-perceived performance, and the optimization policy
– Modelled as a Markov Decision Process (look for the best action w.r.t. the current system state)
– User goals defined through a reward function (mapping of optimization goals)
• Monitoring via Ganglia
– Server + user metrics (via gmetric spoofing)
• Cloud management
– Via euca2ools (Amazon EC2-compliant REST library)
• Cluster coordination
– Via remote execution of shell scripts


TIRAMOLA
• Formulates resize decisions as an MDP
– State defined by #VMs, CPU usage, memory
– Actions: add, remove or do-nothing (no-op)
– Actions limited by a quantifier, e.g. add_2, add_4 (restrictions on these quantifiers)
– Transition probabilities – based on whether a state is permissible (e.g. can the exact number of VMs be added?) – can be generalised to partial additions
– Reward function – r(s): "goodness" of being in state s; r(s) = f(gains, costs)
• The MDP formulation enables:
– No knowledge of the dynamics of the environment to be assumed
– Learning in real time (from experience) and continuously during the lifetime of the system

TIRAMOLA • Use of Q-learning (a type of reinforcement learning)

• Base the calculation of r(s) on a particular arrival rate (of requests) and a certain number of VMs
• Collect results into a table – and use historical data to identify the action and s' (given s)
• r(s) = f(latency, VMs)
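A minimal tabular Q-learning sketch in the spirit of TIRAMOLA's approach follows. The state encoding (#VMs plus a load bucket), the action set, and the learning parameters are illustrative assumptions, not the paper's exact formulation.

```python
import random
from collections import defaultdict

# Hedged sketch of Q-learning for cluster resizing: the table maps
# (state, action) pairs to expected long-run reward, learned from experience.
ACTIONS = ["add", "remove", "no-op"]

def q_learning_step(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def choose_action(Q, s, epsilon=0.1):
    """Epsilon-greedy policy: mostly exploit the table, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

Q = defaultdict(float)  # state here: (#VMs, load bucket), e.g. (4, "high")
q_learning_step(Q, (4, "high"), "add", reward=1.0, s_next=(5, "medium"))
```

No model of the environment is assumed: the table is refined continuously from observed (state, action, reward, next-state) transitions, matching the bullet above about learning in real time.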


AutoFlex
• Use of monitoring to collect:
– CPU, memory, network bandwidth, operating system queues, etc.
• Controller (feedback mechanism)
– Compares target with actual
– Launches or terminates VMs
• Controller is both reactive and proactive
– Layered controllers that run periodically (short-term planning)
– Reactive behaviour through actions for different resource types
– Predictors attempt to estimate future utilization
– Multiple predictors – with a selector to choose among them

"Autoflex: Service Agnostic Auto-scaling Framework for IaaS Deployment Models", Fabio Morais, Francisco Brasileiro, Raquel Lopes, Ricardo Araujo, Wade Satterfield, Leandro Rosa, IEEE/ACM CCGrid 2013, Delft, The Netherlands


AutoFlex … predictors
• Keep CPU utilization < 70%
• Predictors used:
– auto-correlation (AC),
– linear regression (LR),
– auto-regression (AR),
– auto-regressive integrated moving average (ARIMA), and
– the previous utilization measured (dubbed Last Window, or simply LW)
– Ensemble using all of the above
• Metrics:
– Hard violations: capacity not enough to handle demand
– Cost: auto-scaling vs. over-provisioning (which knows the highest demand and statically allocates resources)
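The multiple-predictors-plus-selector design can be sketched as follows. Only two of the listed predictors (LW and LR) are shown, and both their implementations and the error-based selection rule are assumptions for illustration.

```python
# Hedged sketch of a predictor ensemble with a selector: each predictor
# forecasts the next utilization sample from a history window, and the
# selector picks whichever predictor has accumulated the lowest error so far.

def predict_last_window(history):
    """LW: simply repeat the last measurement."""
    return history[-1]

def predict_linear_regression(history):
    """LR: least-squares line over the window, extrapolated one step ahead."""
    n = len(history)
    xs = range(n)
    mx, my = sum(xs) / n, sum(history) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, history)) / denom
    return my + slope * (n - mx)

PREDICTORS = {"LW": predict_last_window, "LR": predict_linear_regression}

def select_and_predict(history, errors):
    """Choose the predictor with the lowest accumulated error, then predict."""
    best = min(PREDICTORS, key=lambda name: errors.get(name, 0.0))
    return best, PREDICTORS[best](history)
```

Tracking each predictor's error and switching between them lets the controller favour LW on flat workloads and LR on trending ones, without committing to either in advance.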

AutoFlex … predictors

• Based on 265 traces from HP users


YinzCam (CMU)

YinzCam is a cloud-hosted service that provides sports fans with:
• real-time scores, news, photos, statistics, live radio, streaming video, etc., on their mobile devices
• replays from different camera angles inside sporting venues
YinzCam's infrastructure is hosted on Amazon Web Services (AWS) and supports over 7 million downloads of the official mobile apps of 40+ professional sports teams and venues within the United States.

https://www.cmu.edu/homepage/beyond/2008/spring/yinz-cam.shtml

YinzCam – demand profile

Week-long workload for a hockey team's mobile app, illustrating modality and spikiness. The workload exhibits spikes due to game-day traffic during the three games in the week of April 15, 2012.


Auto-scaling strategies
• YinzCam provides an example of various streaming application requirements
• Some events are predictable:
– Potential workload during a game (historical data)
– "in-game" vs. "non-game" mode
• Some events are not (e.g. likely demand during a particular gaming event)
• Other scenarios:
– Unpredictable scale-up (e.g. an observed phenomenon triggers in a sensor network)
• Generally: over-provision during a game event

Scale up/down policies

• CPU usage threshold → trigger a new VM
– YinzCam (30% CPU usage over 1 minute)
• Aggressive scale-up, cautious scale-down
– Overcomes VM allocation overheads
– Potential for oscillation in the system (at the next CPU check)
• Example policies:
– Multiplicative Increase, Linear Decrease
– Linear Increase, Multiplicative Decrease
• Inspiration from TCP and other congestion-control mechanisms
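The two example policies can be sketched as simple functions over the current pool size and CPU load; the thresholds and scaling factors below are illustrative assumptions.

```python
# Hedged sketch of the TCP-inspired scaling policies named above.

def scale_mild(n_vms, cpu, high=0.70, low=0.30, mult=2.0, step=1):
    """Multiplicative Increase, Linear Decrease: double the pool under
    pressure (aggressive scale-up), shrink one VM at a time when load is low
    (cautious scale-down)."""
    if cpu > high:
        return max(1, int(n_vms * mult))
    if cpu < low:
        return max(1, n_vms - step)
    return n_vms

def scale_limd(n_vms, cpu, high=0.70, low=0.30, step=1, div=2.0):
    """Linear Increase, Multiplicative Decrease: add one VM under pressure,
    halve the pool when load is low."""
    if cpu > high:
        return n_vms + step
    if cpu < low:
        return max(1, int(n_vms / div))
    return n_vms
```

The dead band between `low` and `high` is one common way to damp the oscillation mentioned above: the pool only moves when load leaves the band.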


Baseline is “manual” scaling

"To Auto Scale or not to Auto Scale", N. D. Mickulicz, Priya Narasimhan, Rajeev Gandhi, ICAC 2013 (USENIX), San Jose, CA

Dynamic SLAs
• Applications on multi-tenancy infrastructure
– With changing application demands (e.g. must respond to unpredictable events)
• Prevent "over-specification" of service-level demands
– A user might make an initial assessment of likely demand (a "first stab" at likely application behaviour)
• Provide SLAs that are "machine generated"
– Based on predictive usage between application classes
– Offers made to users based on a "likely" demand profile
– May utilise resource-throttling strategies (cgroups in Linux – control groups that limit resource consumption)


Dynamic SLAs in OpenStack

"A Research and Development Plan for Dynamic Service Level Agreements in OpenStack", Craig Lee, ITAAC workshop alongside the IEEE/ACM Utility & Cloud Computing Conf., Dresden, December 2013

Admission Control
• Meeting the QoS of applications is often strongly driven by admission control strategies
• Admission control in large-scale cloud data centres is influenced by:
– Heterogeneity → performance/efficiency
– Interference → performance loss from high interference
– High arrival rates → the system can become oversubscribed
• Paragon and ARQ are two possible approaches
– Paragon: heterogeneity- and interference-aware scheduler
– ARQ: admission control strategy
• Use of Paragon to classify applications into multiple request queues
• Improve utilisation across multiple QoS profiles

"ARQ: A Multi-Class Admission Control Protocol for Heterogeneous Datacenters", Christina Delimitrou, Nick Bambos and Christos Kozyrakis, https://www.stanford.edu/group/mast/cgi-bin/drupal/system/files/2013.extended.arq_.pdf


Paragon (Stanford)

Sources of Interference (SoI) benchmarking
• Targeted microbenchmarks of tunable intensity that create contention in specific shared resources
• Introduce contention in: processor, cache hierarchy (L1/L2/L3 & TLBs), memory (bandwidth and capacity), storage
• Run the application concurrently with a microbenchmark
– Progressively tune up intensity until QoS violation
– Associate a "sensitivity score" with the application (i.e. sensitivity to interference)
• Similarly, sensitivity to the running application
– Impact of the running application on the micro-benchmark
– Tune up application intensity until 5% degradation on the benchmark (compared to execution in isolation)


ARQ: Application-aware admission control

• Divide the application workload into queues, using:
– Interference tolerance information
– Heterogeneity requirements
• Trade-off between: (i) waiting time; (ii) quality of a resource
• Prevent highly demanding applications from blocking easy-to-satisfy applications
• Understand when a QoS violation is "likely" – re-divert to a different queue
• Interference function (used to derive a resource quality):
– Interference the server can tolerate from the new application (c)
– Interference the new workload can tolerate from existing applications (t)
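Class-based admission in the spirit of ARQ can be sketched as follows: arriving applications are routed to per-class queues by interference tolerance, and a server only serves classes it can satisfy. The class boundaries and quality scores are illustrative assumptions, not the protocol's actual interference functions.

```python
from collections import deque

# Hedged sketch of multi-queue admission control: fragile (low-tolerance)
# workloads wait for high-quality servers, so they never block the
# easy-to-satisfy workloads queued behind them.

QUEUES = {"demanding": deque(), "moderate": deque(), "easy": deque()}

def classify(tolerance):
    """Map an interference-tolerance score in [0, 1] to a queue class."""
    if tolerance < 0.3:
        return "demanding"
    if tolerance < 0.7:
        return "moderate"
    return "easy"

def admit(app_id, tolerance):
    cls = classify(tolerance)
    QUEUES[cls].append(app_id)
    return cls

def dispatch(server_quality):
    """Serve the most demanding non-empty queue this server can satisfy."""
    if server_quality >= 0.7:
        order = ["demanding", "moderate", "easy"]
    elif server_quality >= 0.3:
        order = ["moderate", "easy"]
    else:
        order = ["easy"]
    for cls in order:
        if QUEUES[cls]:
            return QUEUES[cls].popleft()
    return None
```

Splitting by class trades a little waiting time for fragile workloads against much better utilisation of low-quality servers, which is the trade-off named on the slide.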



Disturbance Benchmarking

• Tolerance of an application to failure
• The benchmark injects:
– Workload & disturbance into the System Under Test
– Measures the response
• Disturbance:
– Events, faults, etc.
– Changes the QoS profile of the application
• Aim: to measure "resilience", not availability
– Approach similar to DBench-OLTP
• Ability to adapt in the context of a disturbance in the system

From Aaron Brown and Peter Shum (IBM)



It is useful to compare this with the performance benchmarks we are much more familiar with.

Compare with automated testing mechanisms










Configuration Management
• Dynamically deploy pre-configured virtual machine instances
– Replicate across multiple servers
– Deploy a "reference" configuration across clients
• Chef – widely used configuration management tool (SaaS platform, Ruby-based)
– Deploy load balancers and monitoring tools (Nagios), along with others (sharing "cookbooks" and "recipes")
– Apache Licence (with Apache Solr (search engine), CouchDB)
• CFEngine
– Open source (GPL Licence)
– Enables much more complex configurations (a drawback)
– Uses a remote agent (also supports a monitoring daemon)

http://www.slideshare.net/jeyg/configuration-manager-presentation


Configuration Management

• Amazon CloudFormation is another option
– Create & manage AWS instances – http://aws.amazon.com/cloudformation/
– Provides a pre-defined set of templates (WordPress, Joomla, Windows Server, Ruby on Rails, etc.) – http://aws.amazon.com/cloudformation/aws-cloudformation-templates/
• Cloudsoft's Brooklyn – http://www.cloudsoftcorp.com/communities/
– Open source + support for policies
– Application-level rather than instance-level support
– Enables autonomic adaptation of a deployed configuration (e.g. auto-scaling policy, replacer/restarter (high-availability) policy)

Stream processing architectures
• Systems that must react to streams of data produced by the external world
• The stream data source can vary in complexity and type
• Availability of streamed data can also be managed through an access-control mechanism
• Usually operate in real time over streams and in turn generate other streams of data, enabling: (i) passive monitoring: what is happening, or (ii) active control: suggesting actions to perform, such as buy stock X, raise alarm Y, or report a detected spatial violation
• Stream processing can also lead to semantic annotation of events


Difference from "standard" databases
• Queries over streams are generally "continuous":
– executing for long periods of time
– returning incremental results
– permanently installed – i.e. a query does not terminate after its first execution
• Performance metrics should be based on response time rather than completion time
• Data is not static – new data is constantly arriving into the system
– The same query at different times leads to different results (as long as new data enters the system)
• Typical operations in StreamSQL
– SELECT (execute a function on a stream) and WHERE (execute a filter on a stream) operators
– Stream merge and join
– Windowing and aggregation
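The windowing-and-aggregation idea can be sketched as a continuous query that maintains a sliding window over an unbounded stream and emits an incremental result after every arrival. The window size and the choice of average as the aggregate are illustrative assumptions.

```python
from collections import deque

# Hedged sketch of a continuous windowed aggregation: unlike a one-shot
# database query, this "query" never terminates -- it yields an updated
# result (here, a windowed average) as each new element arrives.

def windowed_average(stream, window_size=3):
    """Yield the mean of the last `window_size` elements after each arrival."""
    window = deque(maxlen=window_size)  # old elements fall out automatically
    for item in stream:
        window.append(item)
        yield sum(window) / len(window)
```

Running the same query later, after more data has entered the system, naturally yields different results, which is the time-dependence noted above.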

Analyses of performance
• Response time
– Average or maximum time between input arrival into the system and the subsequent generation of a response
• Supported (query) load
– What size of input (number of data elements) can a stream system process while still meeting the specified response-time target and correctness constraints?
• Correctness is time-dependent
– Same query at different times → different outcomes
– Potentially multiple correct answers, depending on response time


Adaptive Streams
• Three key issues:
– what to remember or forget,
– when to do the model update, and
– how to do the model update
• For streaming, these can be mapped onto:
– the size of the window used to remember recent examples
– methods for detecting distribution change in the input
– methods for keeping updated estimations of input statistics

All x are real-valued; the estimator tracks the current value of x plus its variance (each x independently drawn).

Estimator: linear, moving average, Kalman filter

Adaptive Sliding Windows (ADWIN)
• Window size – reflects the time scale of change
– Small: reflects the current distribution accurately
– Large: many examples are available to work on, increasing accuracy in periods of stability
• Window content is used:
– for detecting change (e.g. by using some statistical test on different sub-windows),
– to obtain updated statistics from recent examples,
– to have data to rebuild or revise the model(s) after the data has changed
• Adaptive windowing:
– Whenever two "large enough" sub-windows of W exhibit "distinct enough" averages → the corresponding expected values are different → drop the older portion of the window
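The adaptive-windowing rule can be sketched directly from the last bullet. A full ADWIN uses a Hoeffding-style statistical bound to decide when two sub-window averages are "distinct enough"; the fixed mean-gap threshold and minimum sub-window size here are simplifying assumptions.

```python
# Hedged sketch of ADWIN-style window shrinking: append the new value, then
# drop the older portion whenever some split into two large-enough
# sub-windows shows a large-enough difference in averages.

def adaptive_window(window, value, min_sub=5, threshold=0.3):
    """Return the new window after observing `value`, shrinking from the left
    while any qualifying split signals a distribution change."""
    window = window + [value]
    changed = True
    while changed and len(window) >= 2 * min_sub:
        changed = False
        for cut in range(min_sub, len(window) - min_sub + 1):
            old, recent = window[:cut], window[cut:]
            gap = abs(sum(old) / len(old) - sum(recent) / len(recent))
            if gap > threshold:
                window = window[cut:]  # drop the older portion
                changed = True
                break
    return window
```

In periods of stability the window grows (more examples, better accuracy); after a change it collapses to the recent distribution, matching the small-vs-large trade-off above.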


Cloud-based stream processing
• Use of Cloud resources to:
– Execute stream-processing operators (may be in-network)
– Provide a VM per operator (dynamically allocated to overcome peak workloads)
• Operator chaining within/across Cloud systems
– Scale-out
– Fault tolerance
• Operator chaining → processing pipelines
– Similarity with workflow systems

GENI (OpenFlow and MiddleBox)
• L2/L3 technology to permit software-defined control of network forwarding and routing
• Integration of specialist network "appliances" to support specific functions
– These could be user-defined
– Linux hosting
• MiddleBox: in-network general-purpose processors fronted by OpenFlow switches
• Integrate services from multiple Clouds
– Allocation of networks and "slices" across different resources


In-transit Analysis

[Figure: sources S1 and S2 stream data through in-transit processing nodes T1–T5 to destinations D1 and D2; delay is the QoS parameter.]

• Data processing while data is in movement from source to destination
• Question: what to process, where, and when
• Use of "slack" in the network to support partial processing
• Application types:
– Streaming & data-fusion requirements

In-transit Analysis … 2

[Figure: the same topology, with the in-transit nodes T1–T5 hosted on a shared cluster.]



Workflow-level Representation

[Figure: a workflow of processing units PU1 and PU2 connected by an Autonomic Data Streaming Service (ADSS) with in-transit processing. Processing-unit tasks t11, t12 map to resource r1 and t21, t22, t23 to resource r2, with t21, t22 also runnable on in-transit nodes. ADSS model & simulator parameters: B – bandwidth, λ – input rate, δ – consumer's data rate, μ – controlled output rate, ω – disk transfer rate; a controller manages the buffer, local storage and data transfer.]

Rafael Tolosana-Calasanz et al., "Revenue-based Resource Management on Shared Clouds for Heterogeneous Bursty Data Streams", GECON 2012, Springer

Approach & focus

Adaptive infrastructure for sensor data analysis

• Multiple concurrent data streams with SLAs
• Variable properties: rate and data types; various processing models
• Support for in-transit analysis, enforcing QoS
• Support for admission control & flow isolation at each node
• In case of QoS violation, penalisation

A Scalable, Distributed Stream Processing System for Electric Vehicle Demand Forecasting. Key focus:
• Architectural components
• Business rules for SLA management: actions to guarantee QoS & maximize revenue

From Jose Banares (University of Zaragoza)


System Architecture

[Figure: the data injection rate and SLA determine the token-bucket parameters b and R at each node.]

• 3 key components per node: token bucket, processing unit & output streaming

Token Bucket (shaping traffic)

[Figure: token-bucket traffic shaping. Tokens arrive at R tokens/s into a bucket of depth b; the arrival curve A(t) is bounded by a line of slope R plus a burst allowance, while the link drains at C bps.]

A(t): amount of data arriving up to time t. Two key parameters of interest:
• R: also called the committed information rate (CIR), it specifies how much data can be sent or forwarded per unit time on average
• B: it specifies, for each burst, how much data can be sent within a given time without creating scheduling concerns
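The shaping behaviour defined by R and B can be sketched as follows; the refill-on-demand structure and starting with a full bucket are common implementation choices, not requirements of the model.

```python
# Hedged sketch of a token bucket: tokens accrue at rate R (the committed
# information rate) up to bucket depth B; a packet passes only when enough
# tokens have accumulated, capping the average rate at R while still
# allowing bursts of up to B.

class TokenBucket:
    def __init__(self, rate, depth):
        self.rate = rate      # R: tokens added per second
        self.depth = depth    # B: maximum burst size
        self.tokens = depth   # start with a full bucket
        self.last = 0.0       # timestamp of the previous call

    def allow(self, size, now):
        """Refill for elapsed time, then admit the packet if tokens suffice."""
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False          # packet must wait or be shaped/dropped
```

This is exactly the per-node shaping element in the architecture above: tuning R and b per stream controls how much burstiness each flow may inject downstream.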


Token Bucket (shaping traffic)



[Figure: per-node pipeline of token bucket, input buffer and processing unit (PU).]

Each token bucket provides tunable parameters: R and b

Controller: monitors & modifies behaviour


Resource addition based on buffer occupancy

Autonomic Computational Science
• Enable automated tuning of application behaviour
– Execution units, communication, coordination, execution environment
– Relation to "reflection" and reflective middleware + use of intercession on a meta-model + domain model
– Developing a meta-model is often difficult
• Tuning may be:
– Centralized
– Composed of multiple control units
– Performed by a tuner external to the application
• Comparison with control systems & MDA
– Multiple, often "hidden" control loops
– Inclusion of run-time configuration parameters at design time
– Model-centric view that takes deployment and execution into account

Shantenu Jha, Manish Parashar and Omer Rana, "Self-adaptive architectures for autonomic computational science", Proceedings of the First International Conference on Self-Organizing Architectures, pp. 177–197, LNCS 6090, Springer Verlag, 2010


Autonomic Computational Science Conceptual Framework

Tuning of application & resource manager parameters

[Figure: the Application passes inputs to a Resource Manager, which dispatches work to resources R and returns results; an Autonomic Tuning component adjusts tuning parameters of both the application and the resource manager.]


Tuning by application

[Figure: as before, but the tuning strategy now sits inside the Application; the Autonomic Tuning component draws on historical (monitoring) data when setting tuning parameters.]


Spatial, Temporal and Computational Heterogeneity and Dynamics in SAMR

[Figure: spatial heterogeneity (temperature field) and temporal heterogeneity in a simulation of the OH profile in combustion based on SAMR (H2–air mixture; ignition via 3 hot-spots).]

Courtesy: Sandia National Lab

Autonomics in SAMR
• Tuning by the application
– Application level: when and where to refine
– Runtime/middleware level: when, where and how to partition and load-balance
– Resource level: allocate/de-allocate resources

• Tuning of the application and runtime
– When/where to refine
– Latency-aware ghost synchronization
– Heterogeneity/load-aware partitioning and load-balancing
– Checkpoint frequency
– Asynchronous formulations


Montage

Montage workflow

I/O intensive workflow

Significant data parallelism


Montage: Tuning Mechanisms

Montage: Tuning Strategy





Concluding comments
• Autonomic strategies:
– Often rooted in control systems (generally closed-loop feedback control)
– Can use a variety of control strategies – including machine learning
• Formulating the problem is often difficult
– Multi-criteria optimisation
– Often multiple, difficult-to-separate control loops
• The choice of monitoring infrastructure is key
