Bistro: Scheduling Data-Parallel Batch Jobs against Live Production Systems

http://bistro.io

Andrey Goder, Alexey Spiridonov, Yin Wang

Big Data and Hadoop

Facebook Data Store

Haystack/F4, MySQL, HBase, RocksDB

Facebook Data Store (cont'd)

Online vs. Offline

Bistro: Scheduling against a Variety of Online Systems

Bistro

Outline

• Background
• The scheduling problem
• Bistro implementation
• Case studies

Data-Parallel Jobs

• Task-Parallel vs. Data-Parallel

[Figure: data shards a–g, with one task per shard per job; each job is a set of independent per-shard tasks, i.e., map-only jobs]

The Scheduling Problem

[Figure: resource tree — Rack I and Rack II (40 Gbps each), Hosts 1–3 (200 IOPS each), Volumes A–G (1 lock each)]

Goal: maximize resource utilization (high throughput)

Challenge: minimize scheduling overhead at scale

Job 1 (./photo_process): each task needs 1 lock, 100 IOPS, 1 Gbps

Job 2 (./volume_compact): each task needs 1 lock, 200 IOPS, 0 Gbps

Existing Schedulers

• Online schedulers
  – E.g., load balancing, low latency, quality of service
  – Not designed for data-parallel batch jobs
• Offline schedulers
  – Can easily overload a data host
    • Focus on computation resources on worker hosts, ignoring resource constraints on data hosts
  – Little support for mutable data
  – Tightly integrated with a specific offline system

Fundamental problem: queue-based scheduling!

Head-of-Queue Blocking

[Figure: a global FIFO task queue (T1–T4, …) feeding a worker pool over Volumes A–C; T3 blocks the rest of the queue]

[Figure: one queue per shard — T2 is blocked by its host, and can run once T1 finishes]

How to Track Changes

[Figure: a new volume appears next to Volumes A–C, but a static FIFO task queue feeding the worker pool does not pick it up]

Scheduling Algorithm – Brute-force

while true:
  for j in jobs:
    for n in leaf nodes:
      t = task of job j on node n
      if t has not finished and there are enough resources for t along the path from n to the root:
        run(t)
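The brute-force pass can be sketched concretely. This is an illustrative toy model, not Bistro's implementation: `Node`, `try_run`, and the per-node `demand` map are invented names, and each node holds a single scalar resource.

```python
# Toy sketch of the brute-force check: a task may run on a leaf only if
# every node on the path leaf -> root has enough remaining capacity.

class Node:
    def __init__(self, name, capacity, parent=None):
        self.name, self.capacity, self.parent = name, capacity, parent

def path_to_root(leaf):
    node = leaf
    while node is not None:
        yield node
        node = node.parent

def try_run(job, leaf, demand):
    """demand maps node name -> resources the job consumes there."""
    path = list(path_to_root(leaf))
    if any(node.capacity < demand.get(node.name, 0) for node in path):
        return False                       # some ancestor is saturated
    for node in path:                      # claim resources along the path
        node.capacity -= demand.get(node.name, 0)
    return True

# Tiny model: one host with 200 IOPS, two volumes with 1 lock each;
# a photo_process-style task needs 100 IOPS and the volume lock.
host = Node("host1", 200)
vols = [Node(v, 1, parent=host) for v in ("volA", "volB")]
demand = {"host1": 100, "volA": 1, "volB": 1}

started = [try_run("photo_process", v, demand) for v in vols]
# Both tasks fit: 2 * 100 IOPS <= 200 on the host.
```

A third task on this host would be rejected without touching any queue, which is the point: admission is a path check on the tree, not a queue position.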

Scheduling Algorithm – Subtree

when a task t finishes at leaf node n, run:
  p = the highest ancestor node of n where t consumes resources
  run the brute-force algorithm on the subtree rooted at p
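A minimal sketch of the subtree trigger, under the same toy assumptions (the `parent` map and `highest_consuming_ancestor` helper are hypothetical names, not Bistro's API):

```python
# When a task finishes, only the subtree under its highest
# resource-consuming ancestor needs a fresh brute-force pass, so
# disjoint subtrees can be rescheduled in parallel.

def highest_consuming_ancestor(leaf, demand, parent):
    """Walk leaf -> root; return the topmost node where the finished
    task held resources."""
    node, top = leaf, leaf
    while node is not None:
        if demand.get(node, 0) > 0:
            top = node
        node = parent.get(node)
    return top

parent = {"volA": "host1", "volB": "host1", "host1": "rack1", "rack1": None}
demand = {"volA": 1, "host1": 100}   # the task held a volume lock and 100 IOPS

subtree_root = highest_consuming_ancestor("volA", demand, parent)
# The task consumed nothing at the rack level, so only the subtree
# under "host1" is rescheduled, not the whole tree.
```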

Enables parallel scheduling!

Micro-benchmark

Outline

• Background
• The scheduling problem
• Bistro implementation
• Case studies

Architecture

[Figure: architecture — a Config Loader reads configuration (Zookeeper, file, URL); the Scheduler hands tasks to a Task Runner, which starts them locally or dispatches them to remote worker thrift servers; a Node Fetcher discovers data shards (listdir, script, MySQL); a Status Store (SQLite, MySQL) records finished tasks; a Monitor receives updates and serves a browser UI and Scuba]

Scheduling Modes

Scheduling Modes (cont'd)

Configuration

Resource model config:

  "nodes": {
    "levels": ["rack", "host", "volume"],
    "node_fetchers": [
      {"source": "haystack_racks"},
      {"source": "haystack_hosts"},
      {"source": "haystack_volumes",
       "prefs": {"volume_type": "PHOTO"}}
    ]
  },
  "resources": {
    "rack": {"capacity": 20, "default": 1},
    "host": {"capacity": 200, "default": 100}
  }

Job config:

  "job->photo_process": {
    "cmd": "./photo_process",
    "args": {"model": "deep_face"},
    "enabled": true,
    "filters": {
      "host": {"blacklist_regex": "^haystack(\d+)\.prn"}
    }
  },
  "job->volume_compact": { … },
  …

Scheduling Policies

• Round robin
• Randomized priority
• Ranked priority
• Long tail

while true:
  for j in jobs:
    for n in leaf nodes:
      t = task of job j on node n
      if t has enough resources along the path to root:
        start(t)
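The policies above differ only in how jobs are ordered before the same path-to-root resource check runs. A hedged sketch (job names, the `priority` field, and the helper functions are illustrative, not Bistro's API):

```python
# Each policy produces an ordering of jobs; the scheduler then tries
# them in that order against the resource tree.
import random

jobs = [{"name": "scrape", "priority": 3},
        {"name": "compact", "priority": 1}]

def round_robin(jobs, step):
    """Rotate the job list so each job gets the first pick in turn."""
    k = step % len(jobs)
    return jobs[k:] + jobs[:k]

def ranked_priority(jobs):
    """Strict ordering: higher-priority jobs always go first."""
    return sorted(jobs, key=lambda j: j["priority"], reverse=True)

def randomized_priority(jobs, rng=random):
    """Sample jobs without replacement, weighted by priority, so
    low-priority jobs still get occasional first picks."""
    pool, order = list(jobs), []
    while pool:
        pick = rng.choices(pool, weights=[j["priority"] for j in pool])[0]
        pool.remove(pick)
        order.append(pick)
    return order
```

Round robin maximizes fairness, ranked priority maximizes urgency, and randomized priority sits in between; all reuse the identical admission check.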

Handle Model Updates

[Figure: resource tree — Racks I–II, Hosts 1–3, Volumes A–G — with updated nodes]

• Resource and job updates are common
• Periodically reload configuration, and propagate changes
• Propagate changes to descendants
• Asynchronous model updates, scheduling, monitoring, etc.

Worker Resource Management

[Figure: resource tree — Racks I–II, Hosts 1–3, Volumes A–G — with the worker pool attached as an extra layer]

run(t):
  for w in workers:
    if w has enough resources for t:
      run(t, w)
      return

• Just another layer of nodes in the resource model
• Allows both resource constraints and placement constraints
• Can use randomized algorithms for scalability
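A toy sketch of workers as one more resource layer, combining a resource constraint (free slots) with a placement constraint (a required tag); all names here are illustrative, not Bistro's:

```python
# Each worker node carries its own capacity ("slots") plus attributes
# that placement constraints can filter on, just like any other level
# of the resource tree.

workers = [{"name": "w1", "slots": 2, "tags": {"ssd"}},
           {"name": "w2", "slots": 1, "tags": set()}]

def place(task_slots, required_tag=None):
    """Return the first worker with enough free slots that satisfies
    the placement constraint, claiming the slots; None if no fit."""
    for w in workers:
        if w["slots"] >= task_slots and \
           (required_tag is None or required_tag in w["tags"]):
            w["slots"] -= task_slots
            return w["name"]
    return None

first = place(1, required_tag="ssd")   # placement constraint: needs "ssd"
second = place(1)                      # plain resource constraint
```

For scale, the linear scan can be replaced by sampling a few random workers, which is the randomized variant the slide mentions.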

Model Extensions

• Multiple resources per node
• Execute jobs on different levels
• Time-based jobs
  – Time nodes
• Different model partitioning schemes
• DAG model
  – Model replicas

"Reduce" Support

[Figure: data shards a–g; a Bistro "map" job X runs one task per shard, writing to a high-performance intermediate data store (e.g., RocksDB); a Bistro "reduce" job Y then runs tasks over the intermediate data]

Adaptation to Live Traffic?

• Resource monitoring + Bistro model update
• What about spiky live workload?
  – Aggressive task preemption
  – Rapid cleanup
  – Making progress under frequent preemption?
• Conservative or simple scheduling policy wins
  – Overprovisioning
  – Day/night resource capacity adjustment

Model Limitations

• More complex resource constraints
  – Source and destination
    • Break into two jobs
  – Network

Outline

• Background
• The scheduling problem
• Bistro implementation
• Use cases

Use Cases

Online system  Example job             Resource hierarchy   Replacing
MySQL          DB Iterator             root->host->db       Proprietary scheduler
PostgreSQL     Migration               host->shard          Proprietary scheduler
BLOB storage   Photo/video processing  host->volume->file
HBase          Data compression        host->region         Hadoop

• Our only general-purpose scheduler for non-HDFS systems
• Replacing Hadoop for many online HDFS systems

Summary

MySQL

• Database Iterator
  – Motivating use case
  – Largest user base
• Database scraping
  – Shortest duration

[Figure: brute-force vs. tree scheduling for the scraping workload]

HBase: Problems with Hadoop

• Designed for offline batch jobs, no protection for live traffic
• Very little control over job execution
  – No canary, pause/resume, debug/re-execute, checkpoint, etc.
• Rigid computation model
  – All or nothing
  – Barrier at each phase

HBase Compression (cont'd)

Conclusions

• Running jobs directly against production systems is the trend
• Tree-based scheduling allows
  – Hierarchical resource constraints
  – Efficient updates
  – Easy partitioning for flexible setup
  – Parallel scheduling
• Open source at http://bistro.io