Bistro: Scheduling Data-Parallel Batch Jobs against Live Production Systems

http://bistro.io

Andrey Goder, Alexey Spiridonov, Yin Wang

Big Data and Hadoop

Facebook Data Store

Haystack/F4, MySQL, HBase, RocksDB

Facebook Data Store (cont'd)

Online vs. Offline

Bistro: Scheduling against a Variety of Online Systems

Bistro

Outline

• Background
• The scheduling problem
• Bistro implementation
• Case studies

Data-Parallel Jobs

• Task-Parallel vs. Data-Parallel

[Figure: data shards a–g, with one task per shard per job; each job is a set of independent per-shard tasks, i.e., map-only jobs]

The Scheduling Problem

[Figure: resource tree — Rack I and Rack II (40 Gbps each), Hosts 1–3 (200 IOPS each), Volumes A–G (1 lock each)]

Goal: maximize resource utilization (high throughput)

Challenge: minimize scheduling overhead at scale

Job 1 (./photo_process): each task needs 1 lock, 100 IOPS, 1 Gbps

Job 2 (./volume_compact): each task needs 1 lock, 200 IOPS, 0 Gbps

Existing Schedulers

• Online schedulers
  – E.g., load balancing, low latency, quality of service
  – Not designed for data-parallel batch jobs
• Offline schedulers
  – Can easily overload a data host
    • Focus on computation resources on worker hosts, ignoring resource constraints on data hosts
  – Little support for mutable data
  – Tightly integrated with a specific offline system

Fundamental problem: queue-based scheduling!

Head-of-Queue Blocking

[Figure: a global FIFO task queue (T1–T4, …) feeding a worker pool over Volumes A–C; T3 blocks the rest of the queue]

[Figure: one queue per shard — T2 is blocked by its host, and can run once T1 finishes]

How to Track Changes

[Figure: a new volume appears next to Volumes A–C, but a static FIFO task queue feeding the worker pool does not pick it up]

Scheduling Algorithm – Brute-force

while true:
  for j in jobs:
    for n in leaf nodes:
      t = task of job j on node n
      if t has not finished and there are enough resources for t along the path from n to the root:
        run(t)
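The brute-force pass can be sketched concretely. This is an illustrative toy model, not Bistro's implementation: `Node`, `try_run`, and the per-node `demand` map are invented names, and each node holds a single scalar resource.

```python
# Toy sketch of the brute-force check: a task may run on a leaf only if
# every node on the path leaf -> root has enough remaining capacity.

class Node:
    def __init__(self, name, capacity, parent=None):
        self.name, self.capacity, self.parent = name, capacity, parent

def path_to_root(leaf):
    node = leaf
    while node is not None:
        yield node
        node = node.parent

def try_run(job, leaf, demand):
    """demand maps node name -> resources the job consumes there."""
    path = list(path_to_root(leaf))
    if any(node.capacity < demand.get(node.name, 0) for node in path):
        return False                       # some ancestor is saturated
    for node in path:                      # claim resources along the path
        node.capacity -= demand.get(node.name, 0)
    return True

# Tiny model: one host with 200 IOPS, two volumes with 1 lock each;
# a photo_process-style task needs 100 IOPS and the volume lock.
host = Node("host1", 200)
vols = [Node(v, 1, parent=host) for v in ("volA", "volB")]
demand = {"host1": 100, "volA": 1, "volB": 1}

started = [try_run("photo_process", v, demand) for v in vols]
# Both tasks fit: 2 * 100 IOPS <= 200 on the host.
```

A third task on this host would be rejected without touching any queue, which is the point: admission is a path check on the tree, not a queue position.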

Scheduling Algorithm – Subtree

when a task t finishes at leaf node n, run:
  p = the highest ancestor node of n where t consumes resources
  run the brute-force algorithm on the subtree rooted at p
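A minimal sketch of the subtree trigger, under the same toy assumptions (the `parent` map and `highest_consuming_ancestor` helper are hypothetical names, not Bistro's API):

```python
# When a task finishes, only the subtree under its highest
# resource-consuming ancestor needs a fresh brute-force pass, so
# disjoint subtrees can be rescheduled in parallel.

def highest_consuming_ancestor(leaf, demand, parent):
    """Walk leaf -> root; return the topmost node where the finished
    task held resources."""
    node, top = leaf, leaf
    while node is not None:
        if demand.get(node, 0) > 0:
            top = node
        node = parent.get(node)
    return top

parent = {"volA": "host1", "volB": "host1", "host1": "rack1", "rack1": None}
demand = {"volA": 1, "host1": 100}   # the task held a volume lock and 100 IOPS

subtree_root = highest_consuming_ancestor("volA", demand, parent)
# The task consumed nothing at the rack level, so only the subtree
# under "host1" is rescheduled, not the whole tree.
```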

Enables parallel scheduling!

Micro-benchmark

Outline

• Background
• The scheduling problem
• Bistro implementation
• Case studies

Architecture

[Figure: architecture — a Config Loader reads configuration (Zookeeper, file, URL); the Scheduler hands tasks to a Task Runner, which starts them locally or dispatches them to remote worker thrift servers; a Node Fetcher discovers data shards (listdir, script, MySQL); a Status Store (SQLite, MySQL) records finished tasks; a Monitor receives updates and serves a browser UI and Scuba]

Scheduling Modes

Scheduling Modes (cont'd)

Configuration

Resource model config:

  "nodes": {
    "levels": ["rack", "host", "volume"],
    "node_fetchers": [
      {"source": "haystack_racks"},
      {"source": "haystack_hosts"},
      {"source": "haystack_volumes",
       "prefs": {"volume_type": "PHOTO"}}
    ]
  },
  "resources": {
    "rack": {"capacity": 20, "default": 1},
    "host": {"capacity": 200, "default": 100}
  }

Job config:

  "job->photo_process": {
    "cmd": "./photo_process",
    "args": {"model": "deep_face"},
    "enabled": true,
    "filters": {
      "host": {"blacklist_regex": "^haystack(\d+)\.prn"}
    }
  },
  "job->volume_compact": { … },
  …

Scheduling Policies

• Round robin
• Randomized priority
• Ranked priority
• Long tail

while true:
  for j in jobs:
    for n in leaf nodes:
      t = task of job j on node n
      if t has enough resources along the path to root:
        start(t)
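The policies above differ only in how jobs are ordered before the same path-to-root resource check runs. A hedged sketch (job names, the `priority` field, and the helper functions are illustrative, not Bistro's API):

```python
# Each policy produces an ordering of jobs; the scheduler then tries
# them in that order against the resource tree.
import random

jobs = [{"name": "scrape", "priority": 3},
        {"name": "compact", "priority": 1}]

def round_robin(jobs, step):
    """Rotate the job list so each job gets the first pick in turn."""
    k = step % len(jobs)
    return jobs[k:] + jobs[:k]

def ranked_priority(jobs):
    """Strict ordering: higher-priority jobs always go first."""
    return sorted(jobs, key=lambda j: j["priority"], reverse=True)

def randomized_priority(jobs, rng=random):
    """Sample jobs without replacement, weighted by priority, so
    low-priority jobs still get occasional first picks."""
    pool, order = list(jobs), []
    while pool:
        pick = rng.choices(pool, weights=[j["priority"] for j in pool])[0]
        pool.remove(pick)
        order.append(pick)
    return order
```

Round robin maximizes fairness, ranked priority maximizes urgency, and randomized priority sits in between; all reuse the identical admission check.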

Handle Model Updates

[Figure: resource tree — Racks I–II, Hosts 1–3, Volumes A–G — with updated nodes]

• Resource and job updates are common
• Periodically reload configuration, and propagate changes
• Propagate changes to descendants
• Asynchronous model updates, scheduling, monitoring, etc.

Worker Resource Management

[Figure: resource tree — Racks I–II, Hosts 1–3, Volumes A–G — with the worker pool attached as an extra layer]

run(t):
  for w in workers:
    if w has enough resources for t:
      run(t, w)
      return

• Just another layer of nodes in the resource model
• Allows both resource constraints and placement constraints
• Can use randomized algorithms for scalability
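A toy sketch of workers as one more resource layer, combining a resource constraint (free slots) with a placement constraint (a required tag); all names here are illustrative, not Bistro's:

```python
# Each worker node carries its own capacity ("slots") plus attributes
# that placement constraints can filter on, just like any other level
# of the resource tree.

workers = [{"name": "w1", "slots": 2, "tags": {"ssd"}},
           {"name": "w2", "slots": 1, "tags": set()}]

def place(task_slots, required_tag=None):
    """Return the first worker with enough free slots that satisfies
    the placement constraint, claiming the slots; None if no fit."""
    for w in workers:
        if w["slots"] >= task_slots and \
           (required_tag is None or required_tag in w["tags"]):
            w["slots"] -= task_slots
            return w["name"]
    return None

first = place(1, required_tag="ssd")   # placement constraint: needs "ssd"
second = place(1)                      # plain resource constraint
```

For scale, the linear scan can be replaced by sampling a few random workers, which is the randomized variant the slide mentions.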

Model Extensions

• Multiple resources per node
• Execute jobs on different levels
• Time-based jobs
  – Time nodes
• Different model partitioning schemes
• DAG model
  – Model replicas

"Reduce" Support

[Figure: data shards a–g; a Bistro "map" job X runs one task per shard, writing to a high-performance intermediate data store (e.g., RocksDB); a Bistro "reduce" job Y then runs tasks over the intermediate data]

Adaptation to Live Traffic?

• Resource monitoring + Bistro model update
• What about spiky live workload?
  – Aggressive task preemption
  – Rapid cleanup
  – Making progress under frequent preemption?
• Conservative or simple scheduling policy wins
  – Overprovisioning
  – Day/night resource capacity adjustment

Model Limitations

• More complex resource constraints
  – Source and destination
    • Break into two jobs
  – Network

Outline

• Background
• The scheduling problem
• Bistro implementation
• Use cases

Use Cases

Online system  Example job             Resource hierarchy   Replacing
MySQL          DB Iterator             root->host->db       Proprietary scheduler
PostgreSQL     Migration               host->shard          Proprietary scheduler
BLOB storage   Photo/video processing  host->volume->file
HBase          Data compression        host->region         Hadoop

• Our only general-purpose scheduler for non-HDFS systems
• Replacing Hadoop for many online HDFS systems

Summary

MySQL

• Database Iterator
  – Motivating use case
  – Largest user base
• Database scraping
  – Shortest duration

[Figure: brute-force vs. tree scheduling for the scraping workload]

HBase: Problems with Hadoop

• Designed for offline batch jobs, no protection for live traffic
• Very little control over job execution
  – No canary, pause/resume, debug/re-execute, checkpoint, etc.
• Rigid computation model
  – All or nothing
  – Barrier at each phase

HBase Compression (cont'd)

Conclusions

• Running jobs directly against production systems is the trend
• Tree-based scheduling allows
  – Hierarchical resource constraints
  – Efficient updates
  – Easy partitioning for flexible setup
  – Parallel scheduling
• Open source at http://bistro.io