Bistro: Scheduling Data-Parallel Batch Jobs against Live Production Systems
http://bistro.io
Andrey Goder, Alexey Spiridonov, Yin Wang (Facebook)

Big Data and Hadoop

Facebook Data Store
[Diagram: Facebook data stores include Haystack/F4, MySQL, HBase, and RocksDB]

Online vs. Offline
[Diagram: online systems such as RocksDB and Haystack/F4 serve live traffic alongside offline processing]

Bistro: Scheduling against a Variety of Online Systems
[Diagram: Bistro schedules batch jobs directly against online systems such as Haystack/F4]

Outline
• Background
• The scheduling problem
• Bistro implementation
• Case studies

Data-Parallel Jobs
• Task-Parallel vs. Data-Parallel
[Diagram: a task-parallel job is a DAG of dependent tasks a–f; a data-parallel job runs one independent task per data shard a–g]
• Map-only jobs

The Scheduling Problem
Goal: maximize resource utilization (high throughput)
Challenge: minimize scheduling overhead at scale
[Diagram: resource tree with Rack I and Rack II (40 Gbps links), Hosts 1–3 (200 IOPS each), and Volumes A–G (1 lock each), with current usage shown per node]
Job 1 (./photo_process): each task needs 1 lock, 100 IOPS, 1 Gbps
Job 2 (./volume_compact): each task needs 1 lock, 200 IOPS, 0 Gbps

Existing Schedulers
• Online schedulers
  – E.g., load balancing, low latency, quality of service
  – Not designed for data-parallel batch jobs
• Offline schedulers
  – Can easily overload a data host: they focus on computation resources on worker hosts, ignoring resource constraints on data hosts
  – Little support for mutable data
  – Tightly integrated with a specific offline system
Fundamental problem: queue-based scheduling!

Head-of-Queue Blocking
• Global queue? A FIFO task queue (T1–T4) feeds the worker pool, and T3 blocks the rest of the queue!
• One queue per shard? A host blocked by T1 also blocks tasks on its other volumes (Vol B, Vol C), which can run only once T1 finishes.
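The difference can be demonstrated with a toy simulation (illustrative only, not Bistro code): strict FIFO dispatch stalls behind a blocked head task, while scanning all runnable tasks does not.

```python
# Toy comparison: head-of-queue blocking in a FIFO vs. a full scan.
from collections import deque

def fifo_schedule(queue, can_run):
    """Strict FIFO dispatch: if the head task cannot run, nothing behind it runs."""
    started = []
    while queue and can_run(queue[0]):
        started.append(queue.popleft())
    return started

def scan_schedule(tasks, can_run):
    """Tree-style scan: try every task, skipping only the blocked ones."""
    return [t for t in tasks if can_run(t)]

blocked = {"T3"}  # T3's shard is busy
print(fifo_schedule(deque(["T1", "T2", "T3", "T4"]), lambda t: t not in blocked))
# ['T1', 'T2']: T3 blocks T4 even though T4 could run
print(scan_schedule(["T1", "T2", "T3", "T4"], lambda t: t not in blocked))
# ['T1', 'T2', 'T4']: T4 is not held hostage by T3
```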
How to Track Changes
[Diagram: a new volume appears while tasks T1–T4 are already queued for Vols A–C and the worker pool]

Scheduling Algorithm – Brute-force
while true:
  for j in jobs:
    for n in leaf nodes:
      t = the task of job j on node n
      if t has not finished and there are enough resources for t along the path to root:
        run(t)
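The brute-force loop can be sketched in runnable form (a toy model, assuming per-level resource costs and a tree of capacity-bearing nodes; all names are illustrative, not Bistro's actual API):

```python
# Minimal sketch of tree-based brute-force scheduling (illustrative only).

class Node:
    def __init__(self, name, level, capacity, parent=None):
        self.name = name
        self.level = level            # e.g. "rack", "host", "volume"
        self.capacity = capacity      # remaining resource units at this node
        self.parent = parent

    def path_to_root(self):
        n = self
        while n is not None:
            yield n
            n = n.parent

def try_run(job, leaf, finished, running):
    """Start job's task on `leaf` if every node on the leaf-to-root path
    has enough remaining capacity for the job's per-level cost."""
    task = (job["name"], leaf.name)
    if task in finished or task in running:
        return False
    path = list(leaf.path_to_root())
    if any(n.capacity < job["cost"].get(n.level, 0) for n in path):
        return False
    for n in path:                    # acquire resources along the path
        n.capacity -= job["cost"].get(n.level, 0)
    running.add(task)
    return True

# Example mirroring the slides: ./photo_process needs 1 volume lock and
# 100 host IOPS (rack bandwidth modeled as 1 unit here).
rack = Node("rack1", "rack", 20)
host = Node("host1", "host", 200, rack)
vols = [Node(v, "volume", 1, host) for v in ("volA", "volB", "volC")]
job = {"name": "photo_process", "cost": {"volume": 1, "host": 100, "rack": 1}}
finished, running = set(), set()
started = [try_run(job, v, finished, running) for v in vols]
print(started)  # [True, True, False]: the third task exceeds the host's IOPS
```

The third task is skipped rather than queued, so there is no head-of-queue blocking; a later pass starts it once host capacity is released.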
Scheduling Algorithm – Subtree
when a task t finishes at leaf node n:
  p = the highest ancestor node of n where t consumes resources
  run the brute-force algorithm on the subtree rooted at p
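The subtree shortcut can be sketched in the same toy model (illustrative names; `highest_consuming_ancestor` is not a real Bistro function):

```python
# Illustrative sketch: confine rescheduling to the subtree affected by a
# finished task, instead of rescanning the whole resource tree.

class Node:
    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level
        self.parent = parent

def highest_consuming_ancestor(leaf, consuming_levels):
    """Walk from `leaf` toward the root and return the highest node whose
    level is one where the finished task consumed resources. Only that
    node's subtree needs rescheduling when the task's resources free up."""
    top = leaf
    n = leaf
    while n is not None:
        if n.level in consuming_levels:
            top = n
        n = n.parent
    return top

# Example: volume -> host -> rack -> root; the job holds volume locks and
# host IOPS but no rack or root resources, so rescheduling stops at the host.
root = Node("root", "root")
rack = Node("rack1", "rack", root)
host = Node("host1", "host", rack)
vol = Node("volA", "volume", host)
print(highest_consuming_ancestor(vol, {"volume", "host"}).name)  # host1
```

Because disjoint subtrees can be rescheduled independently, this is what enables parallel scheduling.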
Enables parallel scheduling!

Micro-benchmark

Outline
• Background
• The scheduling problem
• Bistro implementation
• Case studies

Architecture
[Diagram: the Config Loader reads from Zookeeper, a file, or a URL; the Scheduler drives the Task Runner, which starts processes locally or dispatches to remote workers via Thrift servers; the Status Store persists to MySQL or SQLite and records finished tasks; the Node Fetcher discovers shards via listdir, a script, or MySQL; the Monitor exposes state through a Phabricator browser and Scuba]

Scheduling Modes

Scheduling Modes (cont'd)

Configuration
Resource config:

"nodes": {
  "levels": ["rack", "host", "volume"],
  "node_fetchers": [
    {"source": "haystack_racks"},
    {"source": "haystack_hosts"},
    {"source": "haystack_volumes",
     "prefs": {"volume_type": "PHOTO"}}
  ]
},
"resources": {
  "rack": {"capacity": 20, "default": 1},
  "host": {"capacity": 200, "default": 100}
}

Job config:

"job->photo_process": {
  "cmd": "./photo_process",
  "args": {"model": "deep_face"},
  "enabled": true,
  "filters": {
    "host": {"blacklist_regex": "^haystack(\d+)\.prn"}
  }
},
"job->volume_compact": { … }

Scheduling Policies
• Round robin
• Randomized priority
• Ranked priority
• Long tail

while true:
  for j in jobs:
    for n in leaf nodes:
      if t has enough resources along the path to root:
        start(t)

Handle Model Updates
• Resource and job updates are common
• Periodically reload configuration and update nodes
• Propagate changes to descendants
• Asynchronous model updates, scheduling, monitoring, etc.
[Diagram: a change at Host 2 propagates down to its volumes in the resource tree of Racks I–II, Hosts 1–3, and Vols A–G]

Worker Resource Management
run(t):
  for w in workers:
    if w has enough resources for t:
      run(t, w)
      return
[Diagram: the worker pool forms an extra layer beneath the data resource tree of racks, hosts, and volumes]
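A hedged sketch of the worker-selection step, including the randomized-sampling idea used for scalability (illustrative names, not Bistro's actual worker protocol):

```python
# Illustrative worker selection: instead of scanning the whole pool on
# every task, examine a small random sample and pick any worker that fits.
import random

def pick_worker(task_cost, workers, sample_size=3):
    """Return the name of a sampled worker with enough free capacity,
    reserving the capacity; return None to retry on a later pass."""
    for w in random.sample(workers, min(sample_size, len(workers))):
        if w["free"] >= task_cost:
            w["free"] -= task_cost
            return w["name"]
    return None

workers = [{"name": f"w{i}", "free": 4} for i in range(10)]
print(pick_worker(2, workers) is not None)  # True: every worker has capacity
```

Sampling keeps per-task scheduling cost constant as the pool grows, at the price of occasionally missing a free worker and retrying.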
• Just another layer of nodes in the resource model
• Allows both resource constraints and placement constraints
• Can use randomized algorithms for scalability

Model Extensions
• Multiple resources per node
• Execute jobs on different levels
• Time-based jobs
  – Time nodes
• Different model partitioning schemes
• DAG model
  – Model replicas
[Diagram: a model partition covering Hosts 2–3 and Vols D, E, G]

"Reduce" Support
[Diagram: a Bistro map job X runs one task per data shard a–g, writing results to a high-performance intermediate data store, e.g., RocksDB; a Bistro reduce job Y then runs one task per intermediate shard]
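This map-to-reduce composition can be sketched with a plain dict standing in for the intermediate store (illustrative only; Bistro itself stays map-only and the store does the grouping):

```python
# Sketch: "reduce" support via two map-only jobs and a shared key-value
# store. A plain dict stands in for the intermediate store (e.g., RocksDB).

store = {}

def map_task(shard_name, records):
    """Job X: one task per data shard; emits (key, value) pairs to the store."""
    for key, value in records:
        store.setdefault(key, []).append(value)

def reduce_task(key):
    """Job Y: one task per intermediate key; folds all values for that key."""
    return sum(store[key])

# Map phase: shards "a" and "b" both contribute to key 1.
map_task("a", [(1, 10), (2, 20)])
map_task("b", [(1, 5)])
# Reduce phase runs once the map job's tasks have finished.
print(reduce_task(1), reduce_task(2))  # 15 20
```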
Adaptation to Live Traffic?
• Resource monitoring + Bistro model update
• What about spiky live workload?
  – Aggressive task preemption
  – Rapid cleanup
  – Making progress under frequent preemption?
• A conservative or simple scheduling policy wins
  – Overprovisioning
  – Day/night resource capacity adjustment

Model Limitations
• More complex resource constraints
  – Source-destination constraints
    • Break into two jobs
  – Network constraints

Outline
• Background
• The scheduling problem
• Bistro implementation
• Use cases

Use Cases
Online system   Example job              Resource hierarchy   Replacing
MySQL           DB Iterator              root->host->db       Proprietary scheduler
PostgreSQL      Migration                host->shard          Proprietary scheduler
BLOB storage    Photo/video processing   host->volume->file   Hadoop
HBase           Data compression         host->region         Hadoop

• Our only general-purpose scheduler for non-HDFS systems
• Replacing Hadoop for many online systems

Summary

SQL Database
• Database Iterator
  – Motivating use case
  – Largest user base
• Database scraping
  – Shortest duration
[Chart: brute-force vs. tree scheduling for the scraping workload]

HBase: Problems with Hadoop
• Designed for offline batch jobs, no protection for live traffic
• Very little control over job execution
  – No canary, pause/resume, debug/re-execute, checkpoint, etc.
• Rigid computation model
  – All or nothing
  – Barrier at each phase

HBase Compression (cont'd)

Conclusions
• Running jobs directly against production systems is the trend
• Tree-based scheduling allows
  – Hierarchical resource constraints
  – Efficient updates
  – Easy partitioning for flexible setup
  – Parallel scheduling
• Open source at http://bistro.io