
Documentation Release 0.2

Alex Lemann

Aug 23, 2017

Contents:

1 Introduction
    1.1 Goals
    1.2 Status
    1.3 References

2 Getting Started
    2.1 Mini Example
    2.2 Larger Examples

3 Library Reference
    3.1 Stages
    3.2 Queue Tools
    3.3 Exception Handling

Python Module Index

pipeline Documentation, Release 0.2

Pipeline is a Python library that helps you write clean code using a reusable pipeline pattern. It encourages readability, testability, cohesion, concurrency, and strict limits on communication. It also provides tools for developers building applications with pipelines. Continue reading to learn more, and get involved on GitHub.

CHAPTER 1

Introduction

Many projects implicitly build pipelines while implementing larger goals, e.g. scraping the web, transforming data, rendering final outputs, or responding to an HTTP request. This project aims to provide a reusable set of tools for building these pipelines explicitly.

1.1 Goals

1. Inter-project common code reuse.
2. Application code readability. A high-level understanding of an application or module should be clear from reviewing its pipeline and how data moves through the system.
3. Development tooling:
   • Testing small, manageable stages
   • Exception handling
   • Stage timeout, retries, and failures
   • Concurrency, whether async io, multi-core, multi-, or multi-cluster
   • Pausing (checkpointing) the pipeline and restarting
   • Measurement collection and visualization to diagnose bottlenecks
4. Simple internal structure. If something is difficult to implement, you're probably doing it wrong or doing the wrong thing.

1.2 Status

The project is currently in an experimental phase. Work alternates between focusing on new features in the API and building out new example applications to push the API to its limits.


1.3 References

Some background history and theory.
• 1930s: Alonzo Church, Lambda Calculus
• 1950s: John McCarthy, Lisp, and
• 1973: Douglas McIlroy at Bell Labs and the pipe
• 1994: Chain of responsibility pattern from the : Elements of Reusable Object-Oriented Software
• 2010: Julien Palard's Pipe Python library. An interesting approach that combines overloading the pipe operator | with method chaining to create an infix notation. The library also provides a number of pipe operation primitives.
• 2012: Miner & Shook's MapReduce Design Pattern
• 2014: C++ RaftLib library for distributed pipeline programming using iostream operators
• 2015: Martin Fowler's Collection Pipeline article
• Other workflow and streaming projects include Airflow, SparkStreaming, and Oozie
• Python standard libraries like functools and itertools
• The model from functional programming is an example of an alternative design pattern
• Visualization of flow via an Orr Diagram
• Flow based programming
• As a solution to the producer / consumer problem

CHAPTER 2

Getting Started

2.1 Mini Example

Our first example is a Pipeline with two Stages.
1. The first Stage is the fizz buzz algorithm. It takes a number and returns a pair containing the original number and the resulting string: 'fizz', 'buzz', 'fizzbuzz', or an empty string, ''.
2. The second Stage takes these (integer, fizzbuzz string) tuples and applies an uppercase operation to the string portion, returning a new tuple of the integer and an all-uppercase fizzbuzz string.

pipeline: 1. fizzbuzz → 2. upper → 3. result

>>> from pipeline import pipeline, Stage
>>> from pipeline.examples.mini import fizzbuzz
>>> # Define a pipeline from two stage functions
>>> fizzbuzz_upper_pipe = pipeline(
...     stages=[
...         Stage(fizzbuzz, n_workers=1),
...         Stage(lambda x: (x[0], str.upper(x[1])), n_workers=1),
...     ],
...     initial_data=range(1, 16)
... )
>>> # Join workers to wait for final results
>>> fizzbuzz_upper_pipe.join()
>>> sorted(fizzbuzz_upper_pipe.values)
[(1, ''), (2, ''), (3, 'FIZZ'), (4, ''), (5, 'BUZZ'), (6, 'FIZZ'), (7, ''), (8, ''), (9, 'FIZZ'), (10, 'BUZZ'), (11, ''), (12, 'FIZZ'), (13, ''), (14, ''), (15, 'FIZZBUZZ')]


pipeline.examples.mini.fizzbuzz(i)
    Parameters: i – Any integer to be evaluated by the fizzbuzz algorithm
    Returns: An (integer, fizzbuzz string) pair that indicates what fizzbuzz evaluates to given the integer.
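
The two stage functions of the mini example can be approximated in plain Python. The following is a minimal sketch, not the library's source: the body of fizzbuzz is assumed from the parameter and return description above, and upper_pair is a hypothetical name for the lambda used in the second stage. The stages are applied one after the other, without workers or queues.

```python
def fizzbuzz(i):
    """Return (i, s), where s is 'fizz', 'buzz', 'fizzbuzz', or ''.

    Assumed behaviour, reconstructed from the documented return value.
    """
    s = ''
    if i % 3 == 0:
        s += 'fizz'
    if i % 5 == 0:
        s += 'buzz'
    return (i, s)


def upper_pair(pair):
    """Hypothetical named equivalent of the second stage's lambda."""
    number, text = pair
    return (number, text.upper())


# Apply both stages sequentially to the same initial data as the example.
results = [upper_pair(fizzbuzz(i)) for i in range(1, 16)]
```

Writing each stage as a small function of one item keeps it easy to test in isolation, which is one of the stated goals of the project.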

2.2 Larger Examples

For more examples, see the examples/ directory in the source code.

web_framework is an example of building a WebOb and WSGI based web framework and a basic application that uses a pipeline to manage the request / response cycle. Try it out:

# start the server in one shell
python pipeline/examples/web_framework.py
# and in another load up the index page anonymously:
curl http://127.0.0.1:8000/
# Now login and load the index
curl -b cookie-jar -c cookie-jar http://user1:[email protected]:8000/
# Using the same cookie, your session and login should be stored
curl -b cookie-jar -c cookie-jar http://127.0.0.1:8000/

Then try extending it to improve the session handling, add support for your favorite template language, or create a new page.

stdin_stream is a realtime example of using a generator as initial data. This is interesting because it demonstrates how the pipeline can operate not only on lists of predefined objects, but also on generators of new data that is being created in realtime.

scrape_wordcounts downloads some of the top works on Project Gutenberg and parses their text into word count dictionaries. These are then combined into a single corpus count using a Reduce stage. As an example of concurrency, multiple workers are used for the stages, especially those that may be I/O bound while downloading.

>>> from pipeline import pipeline, Stage, Reduce
>>> start_url = ['https://www.gutenberg.org/browse/scores/top']
>>> p = pipeline([Stage(top_books, returns_many=True),
...               Stage(drop_random, n_workers=5),
...               Stage(to_book_url, n_workers=10),
...               Stage(sleep_random, n_workers=10),
...               Stage(get_text, n_workers=10),
...               Stage(count_words, n_workers=10),
...               Stage(remove_full_text, n_workers=10),
...               Reduce(corpus_count, initial_value={}),
...               ], start_url)
>>> p.join()
>>> p.values
[]
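
The counting and combining steps of this example can be illustrated on their own. The helpers count_words and corpus_count are not shown in this document, so the following is only a sketch of what such functions might look like, simplified to operate on plain strings. It assumes that a Reduce stage calls its function with (accumulated value, next item), in the style of functools.reduce, which is used here as a stand-in.

```python
from functools import reduce


def count_words(text):
    """Map a text to a {word: count} dictionary (hypothetical helper)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def corpus_count(total, counts):
    """Merge one book's counts into the running total (hypothetical helper)."""
    merged = dict(total)  # copy so the accumulator is not mutated in place
    for word, n in counts.items():
        merged[word] = merged.get(word, 0) + n
    return merged


books = ["the cat sat", "the dog sat down"]
# functools.reduce stands in for a Reduce stage with initial_value={}
corpus = reduce(corpus_count, map(count_words, books), {})
```

Because corpus_count returns a new dictionary rather than mutating its first argument, the empty initial value can be reused safely across runs.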

CHAPTER 3

Library Reference

pipeline.pipeline(stages, initial_data)

class pipeline.pipeline.PipelineResult(monitors, out_q)

class pipeline.DROP
    If a Stage function returns DROP, there will be no item added to the input queue of the subsequent stage.
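
The effect of DROP can be illustrated without the library. This is a sketch of the general sentinel technique, not the library's implementation: the names DROP_SENTINEL, run_stage, and drop_negatives are hypothetical, and standard queue.Queue objects stand in for the stage queues.

```python
import queue

DROP_SENTINEL = object()  # hypothetical stand-in for pipeline.DROP


def run_stage(func, in_q, out_q):
    """Apply func to every item in in_q; forward results unless dropped."""
    while True:
        try:
            item = in_q.get_nowait()
        except queue.Empty:
            break
        result = func(item)
        if result is not DROP_SENTINEL:
            out_q.put(result)


def drop_negatives(x):
    """Example stage function: discard negative numbers."""
    return DROP_SENTINEL if x < 0 else x


in_q, out_q = queue.Queue(), queue.Queue()
for value in [3, -1, 4, -5]:
    in_q.put(value)

run_stage(drop_negatives, in_q, out_q)
forwarded = [out_q.get_nowait() for _ in range(out_q.qsize())]
```

Using a dedicated sentinel object rather than None means a stage can still pass None along as a legitimate value.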

3.1 Stages

Stage

class pipeline.Stage(func, n_workers=1, returns_many=False)

class pipeline.Filter(function, n_workers=1)
    Creates a stage that follows the Python builtin filter function interface by dropping any values for which filter_function does not return True. This is a helper that creates a wrapper around filter_function that returns a DROP object when the result is not True.

>>> from pipeline import Filter, Stage, pipeline
>>> def remove_evens(x):
...     return x % 2 == 1
>>> pr = pipeline(stages=[Filter(remove_evens, n_workers=1),
...                       Stage(lambda x: x * 3, n_workers=1)],
...               initial_data=[1, 2, 3, 4, 5, 6])
>>> pr.join()
<....PipelineResult object at 0x...>
>>> print(pr.values)
[3, 9, 15]
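
The wrapper that Filter is described as creating can be sketched in a few lines. This is an illustration of the idea only: DROP_SENTINEL and as_filter are hypothetical names, and the truthiness test (as used by the builtin filter) is an assumption about how "does not return True" is evaluated.

```python
DROP_SENTINEL = object()  # hypothetical stand-in for pipeline.DROP


def as_filter(filter_function):
    """Wrap a predicate so that rejected items become the drop sentinel."""
    def wrapper(item):
        return item if filter_function(item) else DROP_SENTINEL
    return wrapper


def remove_evens(x):
    return x % 2 == 1


keep_odds = as_filter(remove_evens)
# Emulate a downstream stage that ignores dropped items.
kept = [r for r in map(keep_odds, [1, 2, 3, 4, 5, 6]) if r is not DROP_SENTINEL]
```

The wrapped function passes accepted items through unchanged, so it can be used anywhere an ordinary stage function is expected.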

Reduce


class pipeline.Reduce(func, initial_value)

3.2 Queue Tools

There are a number of helpers for using Stage input and output queues.

QueueTee

class pipeline.queue_tools.QueueTee(queue, n)

3.3 Exception Handling

This needs work

Python Module Index

p
    pipeline.examples.mini
    pipeline.examples.scrape_wordcounts
    pipeline.examples.stdin_stream
    pipeline.examples.web_framework

Index

D
    DROP (class in pipeline)
F
    Filter (class in pipeline)
    fizzbuzz() (in module pipeline.examples.mini)
P
    pipeline() (in module pipeline)
    pipeline.examples.mini (module)
    pipeline.examples.scrape_wordcounts (module)
    pipeline.examples.stdin_stream (module)
    pipeline.examples.web_framework (module)
    PipelineResult (class in pipeline.pipeline)
Q
    QueueTee (class in pipeline.queue_tools)
R
    Reduce (class in pipeline)
S
    Stage (class in pipeline)
