Light-Touch Data-Parallel Shell Processing
PaSh: Light-Touch Data-Parallel Shell Processing

Nikos Vasilakis* (MIT), Konstantinos Kallas* (University of Pennsylvania),
Konstantinos Mamouras (Rice University), Achilles Benetopoulos (Unaffiliated),
Lazar Cvetković (University of Belgrade)
* equal contribution
[email protected] | github.com/andromeda/pash

Shell Scripts are Everywhere
● Default/scriptable system interface, even in the lightest containers (Kubernetes, Docker)
● Universal composition environment: commands (programs) can be written in C, C++, Rust, JS, Python, Ruby, Haskell...
● Succinct data processing: download, extraction, preprocessing, querying

A Classic Shell Script
Bentley's challenge: compute the top-N words and their counts.
● Knuth's answer: hundreds of lines of literate WEB
● McIlroy's answer: a Unix one-liner

  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

Example input:

  It was the best of times, it was the
  worst of times, it was the age of
  wisdom, it was the age of
  foolishness, it was the epoch of
  belief, it was the epoch of
  incredulity, it was the season of
  Light, it was the season of
  Darkness, it was the spring of
  hope, it was the winter of despair.

Output (top 5):

  10 was
  10 the
  10 of
  10 it
   2 times
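McIlroy's one-liner takes N from the script's first argument (${1}). Wrapped as a function, the whole pipeline runs as-is; the sample input below is invented here for illustration:

```shell
#!/bin/sh
# McIlroy's one-liner as a function: print the top-N words by frequency.
topwords() {
  tr -cs A-Za-z '\n' |  # one word per line (non-letters become newlines)
  tr A-Z a-z |          # normalize case
  sort |                # group identical words together...
  uniq -c |             # ...so uniq -c can count each group
  sort -rn |            # order by count, descending
  sed "${1}q"           # quit after the first N lines
}
printf 'It was the best of times, it was the worst of times,' | topwords 2
```

Each stage is a separate process, so the pipeline already overlaps the stages; what it does not do is split any single stage's work across cores.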
Stage by Stage

  tr -cs A-Za-z '\n'   one word per line:
      It, was, the, best, of, times, it, was, the, ...
  tr A-Z a-z           lowercased:
      it, was, the, best, of, times, it, was, the, ...
  sort                 identical words brought together:
      age, age, belief, best, darkness, despair, epoch, epoch, foolishness, ...
  uniq -c              each run of identical words counted:
      2 age, 1 belief, 1 best, 1 darkness, 1 despair, 2 epoch, 1 foolishness, 1 hope, 10 it, ...
  sort -rn             ordered by count, descending:
      10 was, 10 the, 10 of, 10 it, 2 times, 2 season, 2 epoch, 2 age, 1 worst, 1 wisdom, ...
  sed ${1}q            first N lines only (here N = 5):
      10 was, 10 the, 10 of, 10 it, 2 times

How to Parallelize?
Shell scripts are mostly sequential. Their parallelization requires considerable effort:
● Command-specific flags (e.g., sort -p, make -jN)
● Mostly-manual, restricted parallelization tools (e.g., GNU parallel)
● Full rewrites in parallel frameworks (e.g., MapReduce)
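Even parallelizing just the first half of McIlroy's pipeline by hand illustrates the effort involved. A minimal POSIX-shell sketch (the chunk files f1/f2 and their contents are assumptions for illustration, not from the slides): run the per-chunk stages in the background, then merge the sorted chunks with sort -m and finish sequentially.

```shell
#!/bin/sh
# Hand-parallelized word counting over two pre-split input chunks.
printf 'it was the best\n' > f1
printf 'it was the worst\n' > f2
# Per-chunk stages run concurrently...
( tr -cs A-Za-z '\n' < f1 | tr A-Z a-z | sort > f1.sorted ) &
( tr -cs A-Za-z '\n' < f2 | tr A-Z a-z | sort > f2.sorted ) &
wait   # ...until both chunks are sorted.
# Merge preserves sortedness, so uniq -c still sees grouped words.
sort -m f1.sorted f2.sorted | uniq -c | sort -rn | sed 3q
```

Note how much the writer must know: which stages are safe to run per-chunk, that sort output must be combined with sort -m rather than cat, and that uniq -c only works on grouped input. PaSh automates exactly this kind of reasoning.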
Big-Data Version of McIlroy's Pipeline: a 150-line Hadoop Program
(The slide's top-10-by-count MapReduce job, reconstructed from the interleaved extraction.)

  import java.io.IOException;
  import java.util.Map;
  import java.util.TreeMap;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.util.GenericOptionsParser;

  public class top_10_Movies_Mapper
      extends Mapper<Object, Text, Text, LongWritable> {
    private TreeMap<Long, String> tmap;

    @Override
    public void setup(Context context)
        throws IOException, InterruptedException {
      tmap = new TreeMap<Long, String>();
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // movie_name and no_of_views (tab separated)
      String[] tokens = value.toString().split("\t");
      String movie_name = tokens[0];
      long no_of_views = Long.parseLong(tokens[1]);
      tmap.put(no_of_views, movie_name);
      if (tmap.size() > 10) {
        tmap.remove(tmap.firstKey());
      }
    }

    @Override
    public void cleanup(Context context)
        throws IOException, InterruptedException {
      for (Map.Entry<Long, String> entry : tmap.entrySet()) {
        long count = entry.getKey();
        String name = entry.getValue();
        context.write(new Text(name), new LongWritable(count));
      }
    }
  }

  public class top_10_Movies_Reducer
      extends Reducer<Text, LongWritable, LongWritable, Text> {
    private TreeMap<Long, String> tmap2;

    @Override
    public void setup(Context context)
        throws IOException, InterruptedException {
      tmap2 = new TreeMap<Long, String>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      String name = key.toString();
      long count = 0;
      for (LongWritable val : values) {
        count = val.get();
      }
      tmap2.put(count, name);
      if (tmap2.size() > 10) {
        tmap2.remove(tmap2.firstKey());
      }
    }

    @Override
    public void cleanup(Context context)
        throws IOException, InterruptedException {
      for (Map.Entry<Long, String> entry : tmap2.entrySet()) {
        long count = entry.getKey();
        String name = entry.getValue();
        context.write(new LongWritable(count), new Text(name));
      }
    }
  }

  public class Driver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs =
          new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length < 2) {
        System.err.println("Error: please provide two paths");
        System.exit(2);
      }
      Job job = Job.getInstance(conf, "top 10");
      job.setJarByClass(Driver.class);
      job.setMapperClass(top_10_Movies_Mapper.class);
      job.setReducerClass(top_10_Movies_Reducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(LongWritable.class);
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Challenges of Automating Shell-Script Parallelization
Example scripts:

  for directory in /project/gutenberg/*/; do
    ls $directory | grep 'txt' | wc -l > index.txt
  done

  cat f1 f2 | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

  echo 'Done';

(1) Numerous and opaque Unix commands: how each command's input can be split and its partial outputs aggregated is command-specific.
(2) Dependencies enforced by the shell language itself (loops, sequencing with ;).
(3) Runtime support needed for Unix parallelization.

PaSh Overview
Given a sequential script seq.sh, PaSh:
1. Parses it into an AST and compiles dataflow regions (e.g., cat f1 f2 | sort) into a dataflow graph (DFG).
2. Optimizes the DFG using per-command annotations: cat f1 f2 | sort becomes two parallel sorts (sort f1, sort f2) whose sorted streams are merged by sort -m.
3. Unparses the optimized DFG and emits a parallel script par.sh, supported by a runtime library:

  mkfifo a b
  sort f1 > a &
  sort f2 > b &
  sort -m a b &
  wait; rm -f a b
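The emitted par.sh can be fleshed out into a self-contained sketch. The file names f1/f2 and FIFO names a/b are the slide's own placeholders; the sample file contents are invented here so the script runs end to end:

```shell
#!/bin/sh
# Runnable sketch of the par.sh pattern PaSh emits for `sort` over two files.
printf 'c\na\n' > f1          # sample inputs standing in for f1/f2
printf 'd\nb\n' > f2
mkfifo a b                    # named pipes carry the intermediate streams
sort f1 > a &                 # sort each input in parallel
sort f2 > b &
sort -m a b > out.txt         # merge the two sorted streams (aggregate step)
wait
rm -f a b
cat out.txt
```

Because the intermediate results flow through FIFOs, the merge starts consuming as soon as the parallel sorts begin producing; nothing is materialized on disk between stages.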