PaSh: Light-Touch Data-Parallel Shell Processing
Nikos Vasilakis* (MIT), Konstantinos Kallas* (University of Pennsylvania), Konstantinos Mamouras (Rice University), Achilles Benetopoulos (Unaffiliated), Lazar Cvetković (University of Belgrade)
* equal contribution
[email protected] github.com/andromeda/pash
Shell Scripts are Everywhere
● Default, scriptable system interface, even in the lightest containers (Kubernetes, Docker)
● Universal composition environment: commands (programs) can be written in C, C++, Rust, JS, Python, Ruby, Haskell...
● Succinct data processing: download, extraction, preprocessing, querying
A Classic Shell Script
● Bentley: a word-counting challenge
● Knuth: 100s of lines of literate WEB
● McIlroy: a Unix one-liner
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
[Figure: the opening sentence of "A Tale of Two Cities" goes in; the output is 10 was, 10 the, 10 of, 10 it, 2 times]
A classic: Compute top-N words+counts
Input: It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage by stage:
● tr -cs A-Za-z '\n': one word per line (It, was, the, best, of, times, it, was, the, …)
● tr A-Z a-z: lowercase everything (it, was, the, best, of, times, it, was, the, …)
● sort: bring equal words together (age, age, belief, best, darkness, despair, epoch, epoch, foolishness, …)
● uniq -c: count each run (2 age, 1 belief, 1 best, 1 darkness, 1 despair, 2 epoch, 1 foolishness, 1 hope, 1 incredulity, 10 it, …)
● sort -rn: highest counts first (10 was, 10 the, 10 of, 10 it, 2 times, 2 season, 2 epoch, 2 age, 1 worst, 1 wisdom, …)
● sed ${1}q: keep the first N lines (10 was, 10 the, 10 of, 10 it, 2 times, …)
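As a runnable sketch, the one-liner can be wrapped in a shell function. The function name top_words and the sample sentence are made up for illustration, and LC_ALL=C is an added assumption so that the A-Z ranges and the sort order behave the same in every locale.

```shell
# Assumption: force the C locale so tr ranges and sort order are predictable.
export LC_ALL=C

# top_words N: read text on stdin, print the N most frequent words with counts.
# (Hypothetical helper name; the pipeline itself is the one from the slide.)
top_words() {
    tr -cs A-Za-z '\n' |   # squeeze every run of non-letters into one newline
    tr A-Z a-z |           # lowercase, so "It" and "it" count as one word
    sort |                 # bring identical words next to each other
    uniq -c |              # collapse each run into "count word"
    sort -rn |             # highest counts first
    sed "${1}q"            # print the first N lines, then quit
}

# Made-up sample input.
printf 'the cat and the dog and the bird\n' | top_words 2
```

The exact padding in front of each count depends on the uniq implementation, so scripts that consume the output should split on whitespace rather than rely on column positions.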
[Figure: the same input text replicated many times over, i.e., a much larger input]
How to parallelize?
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Shell scripts are mostly sequential
Their parallelization requires considerable effort:
● Command-specific flags (e.g., sort --parallel=N, make -jN)
● Mostly-manual, restricted parallelization tools (e.g., GNU parallel)
● Full rewrites in parallel frameworks (e.g., MapReduce)
[Code excerpt on slide: the import list and class header of a Hadoop MapReduce job in Java, "public class top_10_Movies_Mapper extends Mapper"]
Challenges of Automating Shell-Script Parallelization
  for directory in /project/gutenberg/*/; do
    ls $directory | grep 'txt' | wc -l > index.txt
  done
  cat f1 f2 | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
  echo 'Done';
[Figure: the pipeline above annotated with "split" and "aggregate" stages]
(1) Numerous and opaque Unix commands
(2) Dependencies enforced by the shell language
(3) Runtime support for Unix parallelization
PaSh Overview
[Figure: seq.sh → Parse → AST → Compile → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh, supported by (1) Annotations, (2) the DFG model, and (3) the Runtime Library]
seq.sh:
  cat f1 f2 | sort
par.sh:
  mkfifo a b
  sort f1 > a &
  sort f2 > b &
  sort -m a b &
  wait; rm -f a b
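The seq.sh and par.sh fragments from the overview can be tried side by side. The following is a hand-written sketch of the idea, not output produced by PaSh; the scratch directory, sample file contents, and variable names are assumptions made for the example.

```shell
# Assumption: C locale, so sort and sort -m agree on ordering.
export LC_ALL=C

# Scratch directory with two made-up input files.
dir=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$dir/f1"
printf 'kiwi\nbanana\ncherry\n' > "$dir/f2"

# Sequential version (seq.sh on the slide).
seq_out=$(cat "$dir/f1" "$dir/f2" | sort)

# Parallel version (par.sh on the slide): sort each input on its own,
# connect the sorters to the merger through named pipes, merge with sort -m.
mkfifo "$dir/a" "$dir/b"
sort "$dir/f1" > "$dir/a" &
sort "$dir/f2" > "$dir/b" &
par_out=$(sort -m "$dir/a" "$dir/b")
wait
rm -rf "$dir"

if [ "$seq_out" = "$par_out" ]; then
    echo "outputs match"
fi
```

The merge step only works because both inputs arrive already sorted; sort -m never re-sorts, it interleaves.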
1. Unix Parallelizability Study & Annotations
[Figure labels for the study: POSIX, GNU, Ubuntu PATH, Scripts]
Parallelizability properties:
● 4 broad classes
● Flags and options
● Input consumption
Parallelizability DSL: (cmd, flg, [in]) → DFG node
Command parallelizability: 4 classes
● Stateless (12.7%), e.g., tr: run copies on parts of input.txt and concatenate the outputs with cat
● Parallelizable pure (8.7%), e.g., wc: each copy keeps state (+state), and an aggregator (agg) combines the partial results
● Non-parallelizable pure (8.2%), e.g., sha1sum: keeps state (+state), and its partial results cannot be combined
● Side-effectful (70.4%), e.g., mv
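The first two classes can be checked by hand on a small input. This is a sketch under stated assumptions: the file names part1 and part2, the sample text, and the two-way split by lines are invented for the example, and the per-part commands run one after another here for clarity rather than in parallel.

```shell
export LC_ALL=C

# Made-up input, split into two halves by line.
dir=$(mktemp -d)
printf 'It was THE best\nof Times\nIt was the WORST\nof times\n' > "$dir/input.txt"
head -n 2 "$dir/input.txt" > "$dir/part1"
tail -n 2 "$dir/input.txt" > "$dir/part2"

# Stateless (tr): processing the parts and concatenating the outputs
# gives the same result as processing the whole input.
whole_tr=$(tr A-Z a-z < "$dir/input.txt")
split_tr=$(tr A-Z a-z < "$dir/part1"; tr A-Z a-z < "$dir/part2")

# Parallelizable pure (wc -l): concatenation is not enough; the partial
# counts need an aggregator, which for line counts is a sum.
whole_wc=$(( $(wc -l < "$dir/input.txt") ))
n1=$(( $(wc -l < "$dir/part1") ))
n2=$(( $(wc -l < "$dir/part2") ))
split_wc=$(( n1 + n2 ))

rm -rf "$dir"
[ "$whole_tr" = "$split_tr" ] && echo "tr: parts concatenate to the whole"
[ "$whole_wc" -eq "$split_wc" ] && echo "wc -l: partial counts sum to the whole"
```

No comparable recipe exists for a checksum such as sha1sum, since the hash of the whole input cannot be derived from the hashes of its parts; that is what places it in the non-parallelizable pure class.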
2. Dataflow Model & Transformations
Scheduling constraint:
  cat f1 f2 > out.txt; cat out.txt
The two commands become two graphs, DFG1 (f1, f2 → cat → out) and DFG2 (out → cat), and DFG2 is scheduled after DFG1.
Running example:
  cat f1 f2 | tr A-Z a-z | sort > out.txt; cat …
DFG1, step by step:
● Initial graph: f1, f2 → cat → tr → sort → out
● Parallelize tr (transformation condition: tr is stateless): f1, f2 → cat → split → tr ×2 → cat → sort → out
● Remove the cat/split pair (transformation condition: cat followed by split): f1 → tr, f2 → tr, then cat → sort → out
● Parallelize sort (transformation condition: sort is parallelizable pure): f1 → tr, f2 → tr, then cat → split → sort ×2 → merge → out
● Remove the cat/split pair (transformation condition: cat followed by split): f1 → tr → sort, f2 → tr → sort, then merge → out
1 + 3 Transformations
[Figure: the transformations drawn as DFG rewrites (τ): parallelizing a command that follows cat, shown for grep and for a generic cmd, plus further rewrites involving relay, split, and cat nodes]
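The final graph of the running example (each input through its own tr and sort, then a merge) can be written out by hand and compared against the sequential pipeline. This is an illustrative sketch, not code emitted by PaSh; the sample inputs and the FIFO names a and b are assumptions, and sort -m stands in for the merge node.

```shell
export LC_ALL=C

# Made-up inputs with mixed case.
dir=$(mktemp -d)
printf 'Pear\nAPPLE\nfig\n' > "$dir/f1"
printf 'Kiwi\nbanana\nCherry\n' > "$dir/f2"

# Original DFG1: f1, f2 -> cat -> tr -> sort.
seq_out=$(cat "$dir/f1" "$dir/f2" | tr A-Z a-z | sort)

# Transformed DFG1: one tr -> sort branch per input, joined by a merge.
mkfifo "$dir/a" "$dir/b"
tr A-Z a-z < "$dir/f1" | sort > "$dir/a" &
tr A-Z a-z < "$dir/f2" | sort > "$dir/b" &
par_out=$(sort -m "$dir/a" "$dir/b")
wait
rm -rf "$dir"

[ "$seq_out" = "$par_out" ] && echo "transformed pipeline agrees with the original"
```

Each branch is itself a pipeline, so tr and sort within a branch overlap in time in addition to the two branches running side by side.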
3. Runtime Support
Runtime Support: Performance & Correctness
● Unix pipes are lazy, i.e., they buffer inadequately (and for a good reason)
● Dataflow graph termination is tricky
● Parallelizable-pure commands require careful aggregation
Runtime Challenge: Unix's Lazy Semantics
  mkfifo f1 f2
  grep "foo" in1 > f1 &
  grep "foo" in2 > f2 &
  cat f1 f2
[Figure: two grep nodes feeding a single cat node]
Execution proceeds in steps! cat reads f1 to the end first (step 1) and only then opens f2 (step 2). Until cat opens f2, the second grep is blocked on its FIFO, so the two greps do not actually run in parallel.
A non-solution: Use intermediary files...
  touch f1 f2
  grep "foo" in1 > f1 &
  grep "foo" in2 > f2 &
  wait
  cat f1 f2
[Figure: each grep writes its own file (f1, f2), and cat reads both files afterwards]
Among other problems, this "solution" prevents pipeline parallelism (more on that later)
The PaSh Solution: Eager Buffers
  mkfifo f1 f2 f3 f4
  grep "foo" in1 > f1 &
  grep "foo" in2 > f2 &
  eager < f1 > f3 &
  eager < f2 > f4 &
  cat f3 f4
[Figure: each grep feeds an eager node, and both eager nodes feed cat]
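To see the effect of an eager node without PaSh installed, a crude stand-in can be written in a few lines of shell. The function eager_sketch below is hypothetical and is not PaSh's eager: it takes the input and output paths as arguments instead of using redirections, always buffers in a temporary file, and only starts emitting after the producer has finished, whereas a real eager buffer would relay data as it arrives. The sample inputs are also made up.

```shell
export LC_ALL=C

# eager_sketch IN OUT: hypothetical, simplified stand-in for an eager buffer.
eager_sketch() {
    buf=$(mktemp) || return 1
    cat "$1" > "$buf"    # drain the producer fully; never waits on the consumer
    cat "$buf" > "$2"    # open the output only now and replay the buffered data
    rm -f "$buf"
}

# Made-up inputs.
dir=$(mktemp -d)
printf 'foo 1\nbar\nfoo 2\n' > "$dir/in1"
printf 'baz\nfoo 3\n' > "$dir/in2"

mkfifo "$dir/f1" "$dir/f2" "$dir/f3" "$dir/f4"
grep "foo" "$dir/in1" > "$dir/f1" &
grep "foo" "$dir/in2" > "$dir/f2" &
eager_sketch "$dir/f1" "$dir/f3" &
eager_sketch "$dir/f2" "$dir/f4" &
result=$(cat "$dir/f3" "$dir/f4")
wait
rm -rf "$dir"

printf '%s\n' "$result"
```

Because each eager_sketch opens its input right away, both greps can run to completion while cat is still busy with the first branch, which is the behavior the lazy FIFO version lacks.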
/pash/runtime/eager
● Unix command, usable outside PaSh too
● Buffers input eagerly, can spill to disk
● Keeps fragment in DFG model
Demo Time!
Evaluation
1. Expert / Classic Scripts
Word-counting script shown before
[Figure: speedups against the bash baseline for pash --width=16, by configuration, including a no-runtime-support configuration; the values called out are 5.93× and 8.83×]
2. Pipelines in the wild
[Figure legend: Parallelizable, Non parallelizable]
+ PaSh awareness goes a long way! E.g., pipeline #26:
  cat $IN6 | awk '{print $2, $0}' | sort -nr | cut -d ' ' -f 2    (1.01×)
  cat $IN6 | sort -nr -k2 | cut -d ' ' -f 1                       (8.1×)
Configuration: Full PaSh --width=16
3. Case Study no.1: NOAA Weather Analysis
               fetch, preprocess, cleanup, filter    calculate
  bash         33m58s                                10m4s
  pash -w 16   16m39s                                49s
  speedup      2.04×                                 12.31×
Combined speedup for the full program: 2.52×
Hadoop only focuses on the calculate part. The preprocessing part is not the focus of traditional parallelization frameworks, but parallelizing it has the biggest impact.
Configuration: Full PaSh --width=16, 82GB (5y data)
Conclusion
● Parallelize Unix shell scripts (POSIX -> POSIX)
● Annotations address extensibility issues
● Open source, 12+ contributors
● Lots of recent excitement: let's rehabilitate the shell!
[email protected] github.com/andromeda/pash