streamutils Documentation, Release 0.1.1-dev

Max Grender-Jones

Nov 15, 2018

Contents

1 streamutils - pipelines for python
   1.1 Motivation
   1.2 Features
   1.3 Non-features
   1.4 Functions
   1.5 API Philosophy & Conventions
   1.6 Installation and Dependencies
   1.7 Status
   1.8 How does it work?
   1.9 Contribute
   1.10 Alternatives and Prior art
   1.11 Acknowledgements and References
   1.12 License

2 API

3 Tutorial and cookbook
   3.1 Parsing an Apache logfile
   3.2 Nesting streams to filter for files based on content
   3.3 Getting the correct function signatures in sphinx for decorated methods

4 Testing streamutils
   4.1 Writing tests
   4.2 Running tests

5 Indices and tables

Python Module Index


Contents:



streamutils - pipelines for python

Bringing one-liners to python since 2014

1.1 Motivation

Have you ever been jealous of friends who know more commandline magic than you? Perhaps you're a python user who feels guilty that you never learnt sed, awk or perl, and wonder quite how many keystrokes you could be saving yourself? (On the plus side, you haven't worn the keycaps off your punctuation keys yet.) Or maybe you're stuck using (or supporting) Windows? Or perhaps you are one of those friends, and your heart sinks at the thought of all the for loops you'd need to replicate a simple grep "$username" /etc/passwd | cut -f 1,3 -d : --output-delimiter=" " in python?

Well, hopefully streamutils is for you. Put simply, streamutils is a pythonic implementation of the pipelines offered by unix shells and the coreutils toolset. streamutils is not (at least not primarily) a python wrapper around tools that you call from the commandline, or a wrapper around subprocess (for that, you want sh or its previous incarnation pbs). However, it can interface with external programmes through its run command.

Enough already! What does it do? Perhaps it's best explained with an example. Suppose you want to reimplement the bash pipeline outlined above:

>>> from __future__ import print_function
>>> from streamutils import *
>>> name_and_userid = read('examples/passwd') | matches('johndoe') | split([1,3], ':', ' ') | first()
>>> print(name_and_userid)
johndoe 1000
>>> gzread('examples/passwd.gz') | matches('johndoe') | split([1,3], ':', ' ') | write()  #Can read from gzipped (and bzipped) files
johndoe 1000
>>> gzread('examples/passwd.gz', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()  #You really ought to specify the unicode encoding
johndoe 1000
>>> read('examples/passwd.bz2', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()  #streamutils will attempt to transparently decompress compressed files (.gz, .bz2, .xz)
johndoe 1000
>>> read('examples/passwd.xz', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()
johndoe 1000

streamutils also mimics the > and >> operators of bash-like shells, so to write to files you can write something like:

>>> import tempfile, shutil, os
>>> try:
...     #Some setup follows to allow this docstring to be included in automated tests
...     tempdir = tempfile.mkdtemp()  # Create a temporary directory to play with
...     cwd = os.getcwd()             # Save the current directory so we can change back to it afterwards
...     os.chdir(tempdir)             # Change to our temporary directory
...     passwd = os.path.join(cwd, 'examples', 'passwd.gz')
...     #Right - setup's done
...     with open('test.txt', mode='w') as tmp:  # mode determines append / truncate behaviour
...         gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ') > tmp  # can write to open things
...     # >> appends, but because python evaluates rshifts (>>) before bitwise or (|), the preceding stream must be in brackets
...     (gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ')) >> 'test.txt'
...     line = read('test.txt') | first()
...     assert line.strip() == 'johndoe 1000'
...     length = read('test.txt') | count()
...     assert length == 2
...     gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ') > 'test.txt'  # (> writes to a new file)
...     length = read('test.txt') | count()
...     assert length == 1
... finally:
...     os.chdir(cwd)           # Go back to the original directory
...     shutil.rmtree(tempdir)  # Delete the temporary one
...
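For comparison - and as a reminder of what streamutils saves you - here is a plain-Python sketch of the grep/cut one-liner from the Motivation section. It uses an in-memory list of lines rather than the real /etc/passwd, and the helper name is illustrative:

```python
# Plain-Python equivalent of:
#   grep johndoe /etc/passwd | cut -f 1,3 -d : --output-delimiter=" "
# Sketch only: reads from an in-memory list rather than a real file.
passwd_lines = [
    'root:x:0:0:root:/root:/bin/bash',
    'johndoe:x:1000:1000:John Doe:/home/johndoe:/bin/bash',
]

def grep_cut(lines, pattern, fields, sep=':', out_sep=' '):
    for line in lines:
        if pattern in line:                       # grep
            parts = line.split(sep)               # cut -d :
            # -f 1,3 uses 1-based field numbers, like cut does
            yield out_sep.join(parts[i - 1] for i in fields)

for row in grep_cut(passwd_lines, 'johndoe', [1, 3]):
    print(row)   # johndoe 1000
```

Workable, but the loop body grows with every extra stage - which is exactly the boilerplate the pipeline syntax removes.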

Or perhaps you need to start off with output from a real command:

>>> from streamutils import *
>>> import platform
>>> cat = 'python -c "import sys; print(open(sys.argv[1]).read())"' if platform.system() == 'Windows' else 'cat'
>>> run('%s setup.py' % cat) | search("keywords='(.*)'", group=1) | write()
UNIX pipelines for python

You don't have to take your input from a file or some other streamutils function - it's easy to pass in an Iterable that you've created elsewhere and have some functional programming fun:

>>> from streamutils import *
>>> 1 | smap(float) | aslist()  # Non-iterables are auto-wrapped
[1.0]
>>> ['d','c','b','a'] | smap(lambda x: (x.upper(), x)) | ssorted(key=lambda x: x[0]) | smap(lambda x: x[1]) | aslist()  # Streamutils' Schwartzian transform (sorting against an expensive-to-compute key)
['a', 'b', 'c', 'd']
>>> range(0, 1000) | sfilterfalse(lambda x: (x%5) * (x%3)) | ssum()  # Euler 1: sum of the numbers below 1000 divisible by 3 or 5
233168
>>> import itertools
>>> def fib():
...     fibs = {0: 1, 1: 1}
...     def fibn(n):
...         return fibs[n] if n in fibs else fibs.setdefault(n, fibn(n-1) + fibn(n-2))
...     for f in itertools.count(0) | smap(fibn):
...         yield f
...
>>> fib() | takewhile(lambda x: x < 4000000) | sfilterfalse(lambda x: x % 2) | ssum()  # Euler 2: sum of even fibonacci numbers under four million
4613732
>>> (range(0, 101) | ssum())**2 - (range(0, 101) | smap(lambda x: x*x) | ssum())  # Euler 6: difference between the square of the sum and the sum of the squares of the first one hundred natural numbers
25164150
>>> top = 110000
>>> primes = range(2, top)
>>> for p in range(2, int(top**0.5)):  # Euler 7: Sieve of Eratosthenes
...     primes |= sfilter(lambda x: (x == p) or (x % p), end=True)
...
>>> primes | nth(10001)
104743

1.2 Features

• Lazy evaluation and therefore memory efficient - nothing happens until you start reading from the output of your pipeline, when each of the functions runs for just long enough to yield the next token in the stream (so you can use a pipeline on a big file without needing enough space to store the whole thing in memory)
• Extensible - to use your own functions in a pipeline, just decorate them, or use the built-in functions that do the groundwork for the most obvious things you might want to do (i.e. custom filtering with sfilter, whole-line transformations with smap or partial transformations with convert)
• Unicode-aware: all functions that read from files or file-like things take an encoding parameter
• Not why I wrote the library at all, but as shown above many of streamutils' functions are 'pure' in the functional sense, so if you squint your eyes, you might be able to think of this as a way into functional programming, with a much nicer syntax (imho, as function composition reads left to right not right to left, which makes it more readable if less pythonic) than say toolz
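The laziness described in the first bullet is ordinary Python generator behaviour, which a small stdlib-only sketch (no streamutils required) illustrates:

```python
# Generators do no work until something iterates over them - the same
# property streamutils relies on to process big files token by token.
log = []

def numbers():
    for i in range(3):
        log.append('producing %d' % i)   # record when work actually happens
        yield i

pipeline = (i * 2 for i in numbers())    # wiring up: nothing has run yet
assert log == []                         # no tokens produced so far

first = next(pipeline)                   # pulling one value runs just enough
assert first == 0
assert log == ['producing 0']            # only one token was produced
```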

1.3 Non-features

An unspoken element of the zen of python (import this) is "'Fast to develop' is better than 'Fast to run'", and if there's a downside to streamutils, that's it. The actual bash versions of grep etc. are no doubt much faster than search/matches from streamutils. But then you can't call python functions from them, or call them from python code on your Windows machine. As they say, 'you pays your money and you take your choice'. Since streamutils uses so many features unsupported by numba (generators, default args, context managers), using numba to get speed-ups for free would sadly appear not to be an option for now (at least not without the help of a numba expert), and though cython (as per cytoolz) would certainly work, it would make streamutils much harder to install and would require a lot more effort.

1.4 Functions

A quick bit of terminology:

• pipeline: a series of streamutils functions joined together with pipes (i.e. |)
• tokens: the things being passed through the pipeline
• stream: the underlying data which is being broken into the tokens that are passed through the pipeline

Implemented so far (equivalent coreutils function in brackets if the name is different). Note that the following descriptions say 'lines', but there's nothing stopping the functions operating on a stream of tokens that aren't newline-terminated strings:

1.4.1 Connectors

These are functions designed to start a stream or process a stream (the underlying functions are wrapped via @connector and either return an Iterator or yield a series of values). The result is something that can be iterated over.

Functions that act on one token at a time:

• read, gzread, bzread, head, tail, follow to: read a file (cat); read a file from a gzip archive (zcat); read a file from a bzip archive (bzcat); extract the first few tokens of a stream; extract the last few tokens of a stream; read new lines of a file as they are appended to it (waits forever, like tail -f)
• csvread to read a csv file
• matches, nomatch, search, replace to: match tokens (grep); find lines that don't match (grep -v); look for patterns in a string (via re.search or re.match) and return the groups of lines that match (possibly with substitution); replace elements of a string (i.e. implemented via str.replace rather than a regexp)
• find, fnmatches to: look for filenames matching a pattern; screen names to see if they match
• split, join, words to: split a line (with str.split) and return a subset of the line (cut); join a line back together (with str.join); find all non-overlapping matches that correspond to a 'word' pattern and return a subset of them
• sformat to: take a dict or list of strings (e.g. the output of words) and format it using the str.format syntax (format is a builtin, so it would be bad manners not to rename this function)
• sfilter, sfilterfalse to: take a user-defined function and return the items where it returns True, or False, respectively. If no function is given, they return the items that are True (or False) in a conditional context
• unique to: only return lines that haven't been seen already (uniq)
• update to: update a stream of dicts with another dict, or take a dict of key, func mappings and call the func against each dict in the stream to get a value to assign to each key
• smap, convert to: take a user-defined function and use it to map each line; take a list or dict (e.g. the output of search) and call a user-defined function on each element (e.g. to call int on fields that should be integers)
• takewhile, dropwhile to: yield elements while a predicate is True; drop elements until a predicate is False
• unwrap, traverse to: remove one level of nested lists; do a depth-first search through supplied iterables

Stream modifiers:

• separate, combine to: split the tokens in the stream so that the remainder of the stream receives sub-tokens; combine sub-tokens back into tokens

1.4.2 Terminators

These are functions that end a stream (the underlying functions are wrapped in @terminator and return their values). The result may be a single value or a list (or something else - the point is, not a generator). As soon as you apply a Terminator to a stream it computes the result.

• first, last, nth to: return the first item of the stream; the last item of the stream; the nth item of the stream
• count, bag, ssorted, ssum to: return the number of tokens in the stream (wc); a collections.Counter (i.e. dict subclass) with unique tokens as keys and a count of their occurrences as values; a sorted list of the tokens; the sum of the tokens. (Note that ssorted is a terminator as it needs to exhaust the stream before it can start working)
• write to: write the output to a named file, or print it if no filename is supplied, or write to a writeable thing (e.g. an already open file) otherwise
• csvwrite to: write to a csv file
• sumby, meanby, firstby, lastby, countby to: aggregate by a key or keys, and then sum / take the mean / take the first / take the last / count
• sreduce to: do a pythonic reduce on the stream
• action to: call a user-defined function for every token
• smax, smin to: return the maximum or minimum element in the stream
• nsmallest, nlargest to: find the n smallest or n largest elements in the stream

Note that if you have an Iterable object (or one that behaves like an iterable), you can pass it into the first function of the pipeline as its tokens argument.
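As a rough mental model of the *by terminators, here is a stdlib-only sketch of a sumby-style aggregation over a stream of dicts. The helper name and exact semantics are illustrative, not streamutils' actual implementation:

```python
from collections import OrderedDict

# Illustrative sketch of a sumby-style terminator: consume a stream of
# dicts, grouping on `key` and summing `value`.
def sum_by(tokens, key, value):
    totals = OrderedDict()
    for token in tokens:
        k = token[key]
        totals[k] = totals.get(k, 0) + token[value]
    return totals

sales = [{'Region': 'North', 'Revenue': 10},
         {'Region': 'West',  'Revenue': 15},
         {'Region': 'North', 'Revenue': 5}]
totals = sum_by(sales, key='Region', value='Revenue')
assert totals == {'North': 15, 'West': 15}
```

Being a terminator, it exhausts the stream and hands back a concrete value rather than a generator.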

1.4.3 Other

To facilitate stream creation, the merge function can be used to join two streams together SQL-style (left/inner/right).
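The SQL-style join is the familiar key-matching idea. Below is a stdlib-only sketch of an inner join between two streams of dicts; the function name and parameters are illustrative, not merge's real signature:

```python
# Illustrative inner join of two token streams on a shared key - the idea
# behind an SQL-style merge, not streamutils' actual implementation.
def inner_join(left, right, on):
    lookup = {}
    for row in right:                     # index the right-hand stream by key
        lookup.setdefault(row[on], []).append(row)
    for row in left:                      # emit merged dicts for matching keys
        for match in lookup.get(row[on], []):
            merged = dict(row)
            merged.update(match)
            yield merged

users = [{'uid': 1000, 'name': 'johndoe'}]
shells = [{'uid': 1000, 'shell': '/bin/bash'}, {'uid': 0, 'shell': '/bin/sh'}]
joined = list(inner_join(users, shells, on='uid'))
assert joined == [{'uid': 1000, 'name': 'johndoe', 'shell': '/bin/bash'}]
```

Left and right joins differ only in also emitting unmatched rows (with missing fields defaulted) from one side or the other.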

1.5 API Philosophy & Conventions

There are a number of tenets to the API philosophy, which is intended to maximise backward and forward compatibility and minimise surprises. While the API is in flux, if functions don't fit the tenets (or tenets turn out to be flawed - feedback welcome!) then the API or the tenets will be changed. If you remember these, you should be able to guess (or at least remember) what a function will be called, and how to call it. These tenets are:

• Functions should have sensible names (none of this cat / wc nonsense - apologies to those of you who are so trained as to think that cat is the sensible name. . . )


• These names should be as close as possible to the name of the related function from the python library. It's ok if the function names clash with their vanilla counterparts from a module (e.g. there's a function called search in re too), but not if they clash with builtin functions - in that case they get an s prepended (hence sfilter, sfilterfalse, sformat). (For discussion: is this the right idea? Would it be easier if all functions had s prefixes?)
• If you need to avoid clashes, import streamutils as su (which has the double benefit of being nice and terse to keep your pipelines short, and will help make you all powerful)
• Positional arguments that are central to what a function does come first (e.g. n, the number of lines to return, is the first argument of head) and their order should be stable over time. For brevity, they should be given sensible defaults. If additional keyword arguments are added, they will be added after existing ones. After the positional arguments comes fname, which allows you to avoid using read. To be safe, apart from for read, head, tail and follow, fname should therefore be called as a keyword argument, as it marks the first argument whose position is not guaranteed to be stable.
• tokens is the last keyword argument of each function
• If it's sensible for the argument to a function to be e.g. a string or a list of strings then both will be supported (so if you pass a list of filenames to read (via fname), it will read each one in turn)
• for line in open(file): iterates through a set of \n-terminated strings, irrespective of os.linesep, so other functions yielding lines should follow a similar convention (for example run replaces \r\n in its output with \n)
• This being the 21st century, streamutils opens files in unicode mode (it uses io.open in text mode). The benefits outweigh the costs of slower processing. I am not opposed to adding readbytes if there is demand (which would return str or bytes depending on your python version)
• head(5) returns the first 5 items, and similarly tail(5) the last 5 items. search(pattern, 2), word(3) and nth(4) return the second group, third 'word' and fourth item (not the third, fourth and fifth items). This therefore allows word(0) to return all words. Using 0-based indexing in this case feels wrong to me - is that too confusing/surprising? (Note that this matches how the coreutils behave, and besides, python is inconsistent here - group(1) is the first not the second group, as group(0) is reserved for the whole pattern.)

I would be open to creating a coreutils (or similarly named) subpackage, which aims to roughly replicate the names, syntax and flags of the coreutils toolset (i.e. grep, cut, wc and friends), but only if they are implemented as thin wrappers around streamutils functions. After all, the functionality they provide is tried and tested, even if their names were designed primarily to be short to type (rather than logical, memorable or discoverable).

1.6 Installation and Dependencies

streamutils supports python >=2.6 (on 2.6 it needs the OrderedDict and Counter backports; on <3.3 it can use the lzma backport) and python >=3 by using the six library (note that >=1.4.1 is required). Ideally it would support pypy too, but support for partial functions in the released versions of pypy is broken at the time of writing. For now, the easiest way to install it is to pull the latest version direct from github by running:

pip install git+https://github.com/maxgrenderjones/streamutils.git#egg=streamutils

Once it’s been submitted to pypi, if you’ve already got the dependencies installed, you’ll be able to install streamutils from pypi by running:

pip install streamutils

If you want pip to install the mandatory dependencies for you, then run:


pip install streamutils[deps]

Alternatively, you can install from the source by running:

python setup.py install

If you don't have pip, which is now the official way to install python packages (assuming your package manager isn't doing it for you), then use your package manager to install it, or if you don't have one (hello Windows users), download and run https://raw.github.com/pypa/pip/master/contrib/get-pip.py

1.7 Status

streamutils is currently beta status. By which I mean:

• I think it works fine, but there may be edge cases I haven't yet thought of (found one? submit a bug report, or better, a pull request)
• The API is unstable, i.e. the names of functions are still in flux, the order of the positional arguments may change, and the order of keyword arguments is almost guaranteed to change

So why release?

• Because as soon as I managed to get streamutils working, I couldn't stop thinking of all the places I'd want to use it
• Because I value feedback on the API - if you think the names of functions or their arguments would be more easily understood if they were changed, then open an issue and let's have the debate
• Because it's a great demonstration of the crazy stuff you can do in python by overloading operators
• Why not?

1.8 How does it work?

You don't need to know this to use the library, but you may be curious nonetheless - if you want, you can skip this section. (Warning: this may make your head hurt - it did mine.) In fact, the core of the library is only ~100 lines, but it took me a lot of time to find those magic 100 lines. The answer is a mixture of generators, partials and overloaded operators. (So wrong it's right? You decide. . . )

Let's explain it with the example of a naive pipeline designed to find module-level function names within ez_setup.py:

>>> from streamutils import *
>>> s = read('ez_setup.py') | search(r'^def (\w+)[(]', 1)  #Nothing really happens yet
>>> first_function = s | first()  #Only now is read actually called
>>> print(first_function)
_python_cmd

So what happened? In order:

• Functions used in pipelines are expected to (optionally) take as input an Iterable thing (as a keyword argument called tokens - in future, it should be possible to use any name), and use it to return an Iterable thing, or yield a series of values
• Before using a function in a pipeline, it must be wrapped (via either the @connector or @terminator decorators). This wraps the function in a special Callable which defers execution. Taking read (the equivalent of unix cat) as an example, if you write s=read('ez_setup.py') then you haven't actually called the underlying read function but the __call__ method of the Connector it's wrapped in. This __call__ method wraps the original read function in a partial, which you can think of as a pre-primed function object - i.e. when you call it, it calls the underlying function with the arguments you supplied when creating the partial. The __call__ method itself therefore returns a Connector (which implements the basic generator functions) which waits for something to iterate over s or to compose (i.e. |) s with another Connector.


When something starts iterating over a Connector, it passes through the values yield-ed by the underlying function (i.e. read). So far, so unremarkable.

• But, and here's where the magic happens, when you | a call to read with another wrapped function e.g. search, then the output of the read function is passed to the tokens keyword argument of search. But assuming read is a generator function, nothing has really happened - the functions have simply been wired together.

Two options for what you do next:

• You iterate over s, in which case the functions are finally called and the results are passed down the chain. (Your for loop would iterate over the function names in ez_setup.py)
• You compose s with a function (in this case first) that has been decorated with @terminator to give a Terminator. A Terminator completes the pipeline and will return a value, not yield values like a generator. (Strictly speaking, when you call a Terminator nothing happens. It's only when the __or__ function (i.e. the | bitwise OR operator) is called between a Connector and a Terminator that the function wrapped in the Terminator is called and the chain of generators yield their values.)
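The moving parts described above can be sketched in a few lines of plain Python. This is an illustrative toy, not streamutils' actual source - but it shows the same trick: calling a wrapped function only builds a partial, | wires generators together, and a terminator finally pulls values through:

```python
from functools import partial

# Toy version of the Connector/Terminator machinery described above.
class Connector(object):
    def __init__(self, func, tokens=None):
        self.func, self.tokens = func, tokens
    def __call__(self, *args, **kwargs):
        # Defer execution: just pre-prime the underlying function
        return Connector(partial(self.func, *args, **kwargs))
    def __iter__(self):
        # Only now does the underlying generator function get called
        return self.func(tokens=self.tokens) if self.tokens is not None else self.func()
    def __or__(self, other):
        if isinstance(other, Terminator):
            return other.func(tokens=iter(self))   # terminator: compute the result
        return Connector(other.func, tokens=iter(self))  # connector: keep wiring

class Terminator(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, *args, **kwargs):
        return Terminator(partial(self.func, *args, **kwargs))

def connector(func):
    return Connector(func)

def terminator(func):
    return Terminator(func)

@connector
def produce(items, tokens=None):     # stands in for read
    for item in items:
        yield item

@connector
def double(tokens=None):             # stands in for search etc.
    for token in tokens:
        yield token * 2

@terminator
def first(tokens=None):              # pulls just one value through the chain
    for token in tokens:
        return token

result = produce([1, 2, 3]) | double() | first()
assert result == 2
```

Nothing in the pipeline runs until first is composed on via |, at which point the chain of generators is consumed just far enough to produce one value.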

1.9 Contribute

• Issue Tracker: http://github.com/maxgrenderjones/streamutils/issues • Source Code: http://github.com/maxgrenderjones/streamutils • API documentation: http://streamutils.readthedocs.org/


1.10 Alternatives and Prior art

Various other projects either abuse the | operator or try to make it easy to compose functions with iterators, none of which seem as natural to me (but some have syntax much closer to functional programming), so ymmv:

• Pipe - probably the closest to streamutils, but less focussed on file/text processing, and has fewer batteries included
• toolz
• Rich Iterator Wrapper
• fn.py

1.11 Acknowledgements and References

A shout-out goes to David Beazley, who has written the most comprehensible (and comprehensive) documentation that I've seen on how to use generators.

The Apache log file example was provided by NASA.


1.12 License

The project is licensed under the Eclipse Public License - v 1.0



API

A few things to note as you read the documentation and source code for streamutils:

• the docstrings shown here are the main means of testing that the library works as promised, which is why they're more verbose than you might otherwise expect
• the code is designed to run and test unmodified on python 2 & 3. That means that all prints are done via the print function, and strings (which are mostly unicode) can't be included in documentation output as they get 'u' prefixes on python 2 but not on python 3
• although the examples pass in lists as the tokens argument to functions, in normal use it is unusual to use tokens. Usually the input will come from a call to read or head or similar
• when a Terminator is used to pick out items (as opposed to iterating over the results of the stream), .close is called automatically on each of the generators in the stream. This gives each function a chance to clear up and e.g. close open files immediately rather than when garbage collected. If you want the same result when iterating over a stream, either iterate all the way to the end or call .close on the stream
• for now, #pragma: no cover is used to skip testing that Exceptions are thrown - these will be removed as soon as the normal code paths are fully tested. It is also used to skip one codepath where different code is run depending on which python is in use, to give a correct overall coverage report
• once wrapped, ConnectedFunctions return a generator that can be iterated over or (if called with end=True) return a list. Terminators return things, e.g. the first item in the list (see first), or a list of the items in the stream (see aslist)

streamutils.action(func, tokens=None)
Calls a function for every element that passes through the stream. Similar to smap, only action is a Terminator so will end the stream

>>> ['Hello','World'] | smap(str.upper) | action(print)
HELLO
WORLD

Parameters
• func – function to call
• tokens – a list of things

streamutils.asdict(key=None, names=None, tokens=None)
Creates a dict or dict of dicts from the result of a stream

>>> from streamutils import *
>>> lines = []
>>> lines.append('From: [email protected]')
>>> lines.append('To: [email protected]')
>>> lines.append('Date: Once upon a time')
>>> lines.append('Subject: The most beautiful?')
>>> d = search('(\w+):\s*(\w.*)', tokens=lines, group=None) | asdict()
>>> d['To'] == '[email protected]'
True
>>> passwd = []  #fake output for read('/etc/passwd')
>>> passwd.append('root:x:0:0:root:/root:/bin/bash')
>>> passwd.append('bin:x:1:1:bin:/bin:/bin/false')
>>> passwd.append('daemon:x:2:2:daemon:/sbin:/bin/false')
>>> d = split(sep=':', n=1, names=['username'], tokens=passwd) | aslist()
>>> for u in d:
...     print(u['username'])
root
bin
daemon
>>> d = split(sep=':', n=1, names={1:'username'}, tokens=passwd) | aslist()  #equivalent, using a dict for names
>>> for u in d:
...     print(u['username'])
root
bin
daemon
>>> d = search('^(\w+)', names=['username'], tokens=passwd) | aslist()  #equivalent, using search not split
>>> for u in d:
...     print(u['username'])
root
bin
daemon
>>> d = search('^(\w+)', names={1:'username'}, tokens=passwd) | aslist()  #using search with a dict for names
>>> for u in d:
...     print(u['username'])
root
bin
daemon
>>> d = split(sep=':', n=(1,6), names=['username','home'], tokens=passwd) | asdict(key='username')
>>> print(d['daemon']['home'])
/sbin
>>> d = split(sep=':', tokens=passwd) | asdict(key='username', names=['username','password','uid','gid','info','home','shell'])
>>> print(d['root']['shell'])
/bin/bash

Parameters
• key – If set, the key to use for the dictionary. If None (default), input must be a list of two-item tuples
• names – If set, list of keys that will be zipped up with the line values to create a dictionary
• tokens – list of key-value tuples or list of lists or dicts

Returns OrderedDict

streamutils.aslist(tokens=None)
Returns the output of the stream as a list. Used as a more readable alternative to calling with end=True

>>> from streamutils import *
>>> lines = ['Nimmo','Fish','Seagull','Nemo','Shark']
>>> if matches('Nemo', tokens=['Nothing but ocean here']):  #streamutils functions return generators, which are always True
...     print('Found Nemo!')
Found Nemo!
>>> if matches('Nemo', tokens=lines) | aslist():  #aslist will pull out the values in the generator
...     print('Found Nemo!')
Found Nemo!
>>> if head(n=10, tokens=lines) | matches('Nemo', tokens=lines, end=True):  #Note that end only works after a |
...     print('Found Nemo!')
Found Nemo!

Parameters tokens – Iterable object providing tokens (set by the pipeline)
Returns a list containing all the tokens in the pipeline

streamutils.bag(tokens=None)
Counts the number of occurrences of each of the elements of the stream

>>> from streamutils import *
>>> lines = ['hi','ho','hi','ho',"it's",'off','to','work','we','go']
>>> count = matches('h.', tokens=lines) | bag()
>>> count['hi']
2

Parameters tokens – list of items to count
Returns a collections.Counter
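Since bag hands back a plain collections.Counter, the doctest above can be reproduced with the stdlib alone (re.match standing in here for matches):

```python
import re
from collections import Counter

# Stdlib-only equivalent of: matches('h.', tokens=lines) | bag()
lines = ['hi', 'ho', 'hi', 'ho', "it's", 'off', 'to', 'work', 'we', 'go']
count = Counter(line for line in lines if re.match('h.', line))
assert count['hi'] == 2       # Counter behaves like a dict of counts
assert count['work'] == 0     # missing keys count as zero
```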

streamutils.bzread(fname=None, encoding=None, tokens=None)
Read a file or files from bzip2-ed archives and output the lines within the files.

>>> find('examples/NASA*.bz2') | bzread() | head(1) | write()
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

Parameters
• fname – filename or list of filenames
• encoding – unicode encoding to use to open the file (if None, use platform default)
• tokens – list of filenames

streamutils.combine(func=None, tokens=None)
Given a stream, combines the tokens together into a list. If func is not None, the tokens are combined into a series of ``list``s, chopping the ``list`` every time func returns True

>>> ["1 2 3","4 5 6"] | words() | separate() | smap(lambda x: int(x)+1) | combine() | write()
[2, 3, 4, 5, 6, 7]
>>> ["first","line\n","second","line\n","third line\n"] | combine(lambda x: x.endswith('\n')) | join(' ') | write()
first line
second line
third line

Note that separate followed by combine is not a no-op.

>>> [["hello","small"], ["world"]] | separate() | combine() | join() | write()
hello small world

Parameters
• func – If not None (the default), combine until func returns True
• tokens – a stream of things

streamutils.connector(func)
Decorator used to wrap a function in a Connector

Parameters
• func – The function to be wrapped - should either yield items into the pipeline or return an iterable
• tokenskw – The keyword argument that func expects to receive tokens on

streamutils.convert(converters, defaults={}, tokens=None)
Takes a dict or list of tokens and calls the supplied converter functions. If a ValueError is thrown, sets the field to the default for that field if supplied, otherwise reraises.

>>> from streamutils import *
>>> lines = ['Alice in Wonderland 1951','Dumbo 1941']
>>> search('(.*) (\d+)', group=None, tokens=lines) | sformat('{0} was filmed in {1}') | write()
Alice in Wonderland was filmed in 1951
Dumbo was filmed in 1941
>>> search('(.*) (\d+)', group=None, tokens=lines) | convert({2: int}) | sformat('{0} was filmed in {1:d}') | write()  #Note it's the second field
Alice in Wonderland was filmed in 1951
Dumbo was filmed in 1941
>>> search('(.*) (\d+)', group=None, names=['Title','Year'], tokens=lines) | convert({'Year': int}) | sformat('{0} was filmed in {1:d}') | write()
Alice in Wonderland was filmed in 1951
Dumbo was filmed in 1941
>>> convert({'Number': int}, defaults={'Number': 42}, tokens=[{'Number':'0'}, {'Number':'x'}]) | sformat('{Number:d}') | write()
0
42
>>> convert(int, defaults=42, tokens=['0','x']) | write()
0
42


Parameters
• converters – dict of functions or list of functions or a function that converts a field from one form to another
• defaults – defaults to use if the converter function raises a ValueError (should be the same type as converters)
• tokens – a series of dicts or lists of things to be converted, or a series of things

Raise ValueError if the conversion fails and no default is supplied

streamutils.count(tokens=None)
Counts the number of items that pass through the stream (cf wc -l)

>>> from streamutils import *
>>> lines = ['hi','ho','hi','ho',"it's",'off','to','work','we','go']
>>> matches('h.', tokens=lines) | count()
4

Parameters tokens – Things to count Returns number of items in the stream as an int streamutils.countby(keys, tokens=None) Given a series of keys, return a dict of how many times each corresponding set of values appear in the stream

>>> counts = [{'A': 6}, {'A': 5}, {'A': 4}] | countby(keys='A')
>>> dict(counts) == {6: 1, 5: 1, 4: 1}
True

streamutils.csvread(fname=None, encoding=None, dialect='excel', n=0, names=None, skip=0, restkey=None, restval=None, tokens=None, **fmtparams)
    Reads a file or stream and parses it as a csv file using a csv.reader(). If names is set, uses a csv.DictReader()

>>> from streamutils import *
>>> data = []
>>> data.append('Region;Revenue;Cost')
>>> data.append('North;10;5')
>>> data.append('West;15;7')
>>> csvread(delimiter=';', skip=1, tokens=data) | smap(lambda x: int(x[1])) | ssum()
25
>>> csvread(delimiter=';', skip=1, names=['Region', 'Revenue', 'Cost'], tokens=data) | smap(lambda x: int(x['Cost'])) | ssum()
12
>>> csvread(delimiter=';', skip=1, n=1, tokens=data) | unique() | write()
North
West

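Since csvread delegates to the standard library's csv module, the first pipeline above is roughly equivalent to this plain-Python sketch:

```python
import csv

data = ['Region;Revenue;Cost', 'North;10;5', 'West;15;7']
reader = csv.reader(data, delimiter=';')  # csvread passes **fmtparams through to csv.reader
rows = list(reader)[1:]                   # skip=1 drops the header row
total = sum(int(row[1]) for row in rows)  # smap(lambda x: int(x[1])) | ssum()
# total == 25
```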
Parameters
    • fname – filename to read from - if None, reads from the stream
    • encoding – encoding to use to read the file (warning: the csv module in python 2 does not support unicode encoding - if you run into trouble, read the file with read, then pass the output through the unidecode library using smap before csvread)
    • dialect – the csv dialect (see csv.reader())
    • n – the columns to return (starting at 1). If set, names defines the names for these columns, not the names for all columns
    • names – the keys to use in the DictReader (see the fieldnames keyword arg of csv.DictReader())
    • skip – rows to skip (e.g. header rows) before reading data
    • restkey – see the restkey keyword arg of csv.DictReader()
    • restval – see the restval keyword arg of csv.DictReader()
    • fmtparams – see csv.reader()

streamutils.csvwrite(fname=None, mode='wb', encoding=None, dialect='excel', names=None, restval='', extrasaction='raise', tokens=None, **fmtparams)
    Writes the stream to a file (or stdout) in csv format using csv.writer(). If names is set, uses a csv.DictWriter()

>>> [{'Region': 'North', 'Revenue': 5, 'Cost': 3}, {'Region': 'West', 'Revenue': 15, 'Cost': 7}] | csvwrite(delimiter=';', names=['Region', 'Revenue', 'Cost'])
Region;Revenue;Cost
North;5;3
West;15;7
>>> [['Region', 'Revenue', 'Cost'], ['North', 5, 3], ['West', 15, 7]] | csvwrite()
Region,Revenue,Cost
North,5,3
West,15,7

Parameters
    • fname – filename or file-like object to write to - if None, uses stdout
    • encoding – encoding to use to write the file
    • names – the keys to use in the DictWriter

streamutils.dropwhile(func=None, tokens=None)
    Drops items while the supplied function returns True, then passes everything else through (equivalent of itertools.dropwhile())

>>> [1, 2, 3, 2, 1] | dropwhile(lambda x: x < 3) | aslist()
[3, 2, 1]

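As the description says, this is itertools.dropwhile() in pipeline form; the example above corresponds directly to:

```python
import itertools

# Items are dropped while the predicate holds; once it fails, everything
# after that point passes through unconditionally (including the trailing 2 and 1).
result = list(itertools.dropwhile(lambda x: x < 3, [1, 2, 3, 2, 1]))
# result == [3, 2, 1]
```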
Parameters
    • func – The function to use as a predicate
    • tokens – List of things to filter

streamutils.find(pathpattern=None, tokens=None)
    Searches for files that match a given pattern. For example:

>>> import os
>>> from streamutils import find, replace, write
>>> find('src/version.py') | replace(os.sep, '/') | write()  #Only searches the src directory
>>> find('src/*/version.py') | replace(os.sep, '/') | write()  #Searches the full directory tree
src/streamutils/version.py


Parameters
    • pathpattern (str) – glob.glob()-style pattern
    • tokens – A list of glob-style patterns to search for

Returns An iterator across the filenames found by the function

streamutils.first(default=None, tokens=None)
    Returns the first item in the stream

    Parameters
        • default – returned if the stream is empty
        • tokens – a list of things

    Returns The first item in the stream

streamutils.firstby(keys=None, values=None, tokens=None)
    Given a series of key, value items, returns a dict of the first value assigned to each key

>>> from streamutils import *
>>> firsts = [('A', 2), ('B', 6), ('A', 3), ('C', 20), ('C', 10), ('C', 30)] | firstby()
>>> firsts == {'A': 2, 'B': 6, 'C': 20}
True
>>> firsts = [{'key': 'A', 'value': 2}, {'key': 'B', 'value': 6}, {'key': 'A', 'value': 3}, {'key': 'C', 'value': 20}, {'key': 'C', 'value': 10}] | firstby(keys='key', values='value')
>>> firsts == {'A': {'value': 2}, 'B': {'value': 6}, 'C': {'value': 20}}
True

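For the (key, value) form, the grouping logic can be sketched with dict.setdefault, which keeps only the first value seen per key (`firstby_plain` is a hypothetical name, not part of the API):

```python
def firstby_plain(pairs):
    """Keep the first value seen for each key, as firstby() does for (key, value) tokens."""
    firsts = {}
    for key, value in pairs:
        firsts.setdefault(key, value)  # stores value only if key has not been seen
    return firsts

firsts = firstby_plain([('A', 2), ('B', 6), ('A', 3), ('C', 20), ('C', 10), ('C', 30)])
# firsts == {'A': 2, 'B': 6, 'C': 20}
```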
Parameters
    • keys – dict keys for the values to aggregate on
    • values – dict keys for the values to be aggregated

Returns dict mapping each key to the first value corresponding to that key

streamutils.fnmatches(pathpattern, matchcase=False, tokens=None)
    Filters tokens for strings that match the pathpattern using fnmatch.fnmatch() or fnmatch.fnmatchcase(). Note that os.sep (i.e. \ on windows) will be replaced with / to allow / to be used in the pattern

>>> from streamutils import *
>>> lines = ['setup.py', 'README.md', 'streamutils/__init__.py']
>>> fnmatches('*.py', False, lines) | write()
setup.py
streamutils/__init__.py
>>> fnmatches('*/*.py', False, lines) | write()
streamutils/__init__.py
>>> fnmatches('readme.*', True, lines) | write()
>>> fnmatches('README.*', True, lines) | write()
README.md

Parameters
    • pathpattern (str) – Pattern to match (caution - / or os.sep is not special)
    • matchcase (bool) – Whether to match case-sensitively on case-insensitive file systems
    • tokens – list of filename strings to match

streamutils.follow(fname, encoding=None)
    Monitor a file, reading new lines as they are added (equivalent of tail -f on UNIX). (Note: never returns)

    Parameters
        • fname – File to read
        • encoding – encoding to use to read the file

streamutils.gzread(fname=None, encoding=None, tokens=None)
    Read a file or files from gzip-ed archives and output the lines within the files.

    Parameters
        • fname – filename or list of filenames
        • encoding – unicode encoding to use to open the file (if None, use platform default)
        • tokens – list of filenames

streamutils.head(n=10, fname=None, skip=0, encoding=None, tokens=None)
    (Optionally) opens a file and passes through the first n items
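follow, described above, behaves like tail -f: it seeks to the end of the file and then polls forever for newly appended lines. A minimal plain-Python sketch of that loop (`follow_lines` and the poll interval are illustrative assumptions, not the streamutils implementation):

```python
import io
import os
import time

def follow_lines(fname, encoding=None, poll=0.05):
    """Yield lines appended to fname after we start watching. Never returns."""
    with io.open(fname, encoding=encoding) as f:
        f.seek(0, os.SEEK_END)       # start at the end of the file, like tail -f
        while True:
            line = f.readline()
            if line:
                yield line.rstrip('\n')
            else:
                time.sleep(poll)     # nothing new yet: wait and retry
```

Because the loop never terminates, consumers typically bound it themselves, e.g. with itertools.islice.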

>>> from streamutils import *
>>> lines = ['Film,Character,Animal', 'Finding Nemo,Nemo,Fish', 'Shrek,Shrek,Ogre', 'The Jungle Book,Baloo,Bear']
>>> head(3, tokens=lines) | write()
Film,Character,Animal
Finding Nemo,Nemo,Fish
Shrek,Shrek,Ogre
>>> head(2, skip=1, tokens=lines) | write()
Finding Nemo,Nemo,Fish
Shrek,Shrek,Ogre
>>> head(n=0, skip=1, tokens=lines) | split(sep=',', names=['film', 'name', 'animal']) | sformat('The film {film} stars a {animal} called {name}') | write()
The film Finding Nemo stars a Fish called Nemo
The film Shrek stars a Ogre called Shrek
The film The Jungle Book stars a Bear called Baloo
>>> head(n=[1, 3], skip=1, tokens=lines) | split(sep=',', names=['film', 'name', 'animal']) | sformat('The film {film} stars a {animal} called {name}') | write()
The film Finding Nemo stars a Fish called Nemo
The film The Jungle Book stars a Bear called Baloo

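When n is a plain integer, head's skip-then-take behaviour matches itertools.islice; the second example above corresponds to:

```python
import itertools

lines = ['Film,Character,Animal', 'Finding Nemo,Nemo,Fish', 'Shrek,Shrek,Ogre', 'The Jungle Book,Baloo,Bear']
# head(2, skip=1, tokens=lines): skip one line, then take the next two
result = list(itertools.islice(lines, 1, 1 + 2))
# result == ['Finding Nemo,Nemo,Fish', 'Shrek,Shrek,Ogre']
```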
Parameters
    • n – Number of lines to return (0 = all lines), or a list of line numbers to return
    • fname – Filename (or filenames) to open
    • skip – Number of lines to skip before returning lines
    • encoding – Encoding of the file to open. If None, will try to guess the encoding based on coding= strings
    • tokens – Stream of tokens to take the first few members of (i.e. not a list of filenames to take the first few lines of)

streamutils.join(sep=' ', tokens=None)
    Joins a list-like thing together using the supplied sep (think str.join()). Defaults to joining with a space


>>> split(sep=',', n=[1, 4], tokens=['flopsy,mopsy,cottontail,peter']) | join(',') | write()
flopsy,peter

Parameters
    • sep – string separator used to join each token in the stream (default ' ')

streamutils.last(default=None, tokens=None)
    Returns the final item in the stream

    Parameters
        • default – returned if the stream is empty
        • tokens – a list of things

    Returns The last item in the stream

streamutils.lastby(keys=None, values=None, tokens=None)
    Given a series of key, value items, returns a dict of the last value assigned to each key

>>> from streamutils import *
>>> lasts = head(tokens=[('A', 2), ('B', 6), ('A', 3), ('C', 20), ('C', 10), ('C', 30)]) | lastby()
>>> lasts == {'A': 3, 'B': 6, 'C': 30}
True

Returns dict mapping each key to the last value corresponding to that key

streamutils.matches(pattern, match=False, flags=0, v=False, tokens=None)
    Filters the input for strings that match the pattern (think UNIX grep)

>>> months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
>>> matches('A', tokens=months) | write()
April
August

Parameters
    • pattern – regexp pattern to test against
    • match – if True, use re.match(), else use re.search() (default False)
    • flags – regexp flags
    • v – if True, return strings that don't match (think UNIX grep -v) (default False)
    • tokens – strings to match

streamutils.meanby(keys=None, values=None, tokens=None)
    If keys is not set, given a series of key, value items, returns a dict of means, grouped by key. If keys is set, given a series of dicts, returns the mean of the values grouped by a tuple of the values corresponding to the keys

>>> from streamutils import *
>>> means = head(tokens=[('A', 2), ('B', 6), ('A', 3), ('C', 20), ('C', 10), ('C', 30)]) | meanby()
>>> means == {'A': 2.5, 'B': 6, 'C': 20}
True


>>> from streamutils import *
>>> means = head(tokens=[{'key': 1, 'value': 2}, {'key': 1, 'value': 4}, {'key': 2, 'value': 5}]) | meanby('key', 'value')
>>> means == {1: {'value': 3.0}, 2: {'value': 5.0}}
True

Parameters
    • keys – dict keys for the values to aggregate on
    • values – dict keys for the values to be aggregated

Returns dict mapping each key to the mean of all the values corresponding to that key

streamutils.merge(left, right, on, how='inner', join=tuple)
    Merges two sequences together (think JOIN in SQL). For a left join, the right sequence is read into memory then joined to the left sequence (so the left sequence determines the order), and vice versa. For an inner join, the right sequence is read into memory (so should be the shorter of the two).

    Parameters
        • left – Sequence of items that should be placed on the left
        • right – Sequence of items that should be placed on the right
        • on – dict key or attribute to join on
        • how – One of inner, left, right (outer is not yet implemented)
        • join – Either tuple, in which case results are yield-ed as tuples of (leftval, rightval), or a function, in which case values are yield-ed as join(leftval, rightval)

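The inner-join strategy described above (read the right side into memory, then stream the left side past it) can be sketched as follows. `inner_merge` is a hypothetical name for illustration and only handles dict tokens:

```python
def inner_merge(left, right, on):
    """Index right by the join key, then yield (leftval, rightval) pairs for matches."""
    index = {}
    for r in right:                    # the right sequence is read into memory...
        index.setdefault(r[on], []).append(r)
    for l in left:                     # ...so the left sequence can be streamed lazily
        for r in index.get(l[on], []):
            yield (l, r)

dogs = [{'Name': 'Fido', 'Owner': 'Bob'}, {'Name': 'Rover', 'Owner': 'John'}]
cats = [{'Name': 'Tiddles', 'Owner': 'John'}, {'Name': 'Fluffy', 'Owner': 'Steve'}]
result = list(inner_merge(dogs, cats, on='Owner'))
# result == [({'Name': 'Rover', 'Owner': 'John'}, {'Name': 'Tiddles', 'Owner': 'John'})]
```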
>>> dogs = [{'Name': 'Fido', 'Owner': 'Bob'}, {'Name': 'Rover', 'Owner': 'John'}]
>>> cats = [{'Name': 'Tiddles', 'Owner': 'John'}, {'Name': 'Fluffy', 'Owner': 'Steve'}]
>>> result = merge(dogs, cats, on='Owner', how='inner') | aslist()
>>> result == [({'Name': 'Rover', 'Owner': 'John'}, {'Name': 'Tiddles', 'Owner': 'John'})]
True
>>> result = merge(dogs, cats, on='Owner', how='left') | aslist()
>>> result == [({'Name': 'Fido', 'Owner': 'Bob'}, None), ({'Name': 'Rover', 'Owner': 'John'}, {'Name': 'Tiddles', 'Owner': 'John'})]
True
>>> result = merge(dogs, cats, on='Owner', how='right') | aslist()
>>> result == [({'Name': 'Rover', 'Owner': 'John'}, {'Name': 'Tiddles', 'Owner': 'John'}), (None, {'Name': 'Fluffy', 'Owner': 'Steve'})]
True

streamutils.nlargest(n, key=None, tokens=None)
    Returns the n largest elements of the stream (see documentation for heapq.nlargest())

>>> from streamutils import *
>>> head(10, tokens=range(1, 10)) | nlargest(4)
[9, 8, 7, 6]

streamutils.nomatch(pattern, match=False, flags=0, tokens=None)
    Filters the input for strings that don't match the pattern (think UNIX grep -v)

>>> import re
>>> months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
>>> nomatch('r|a', flags=re.IGNORECASE, tokens=months) | write()
June
July

Parameters
    • pattern – regexp pattern to test against
    • match – if True, use re.match(), else use re.search() (default False)
    • flags – regexp flags
    • tokens – strings to match

streamutils.nsmallest(n, key=None, tokens=None)
    Returns the n smallest elements of the stream (see documentation for heapq.nsmallest())

>>> from streamutils import *
>>> head(10, tokens=range(1, 10)) | nsmallest(4)
[1, 2, 3, 4]

streamutils.nth(n, default=None, tokens=None)
    Returns the nth item in the stream, or a default if the stream has fewer than n items

>>> from streamutils import *
>>> rabbits = ['Flopsy', 'Mopsy', 'Cottontail', 'Peter']
>>> rabbit = rabbits | matches('.opsy') | nth(2)
>>> print(rabbit)
Mopsy
>>> rabbit = rabbits | matches('.opsy') | nth(3, default='No such rabbit')
>>> print(rabbit)
No such rabbit

Parameters
    • n – The item to return (first is 1)
    • default – The default to use if the stream has fewer than n items
    • tokens – The items in the pipeline

Returns the nth item

streamutils.read(fname=None, encoding=None, skip=0, tokens=None)
    Read a file or files and output the lines they contain. Files are opened with io.open()

>>> from streamutils import *
>>> read('https://raw.github.com/maxgrenderjones/streamutils/master/README.md') | search('^[-] Source Code: (.*)', 1) | write()
http://github.com/maxgrenderjones/streamutils

Parameters
    • fname – filename or list of filenames. Can either be paths to local files or URLs (e.g. http:// or ftp:// - supports the same protocols as urllib2.urlopen())
    • encoding – encoding to use to open the file (if None, use platform default)
    • skip – number of lines to skip at the beginning of each file
    • tokens – list of filenames

streamutils.replace(old, new, tokens=None)
    Replaces old with new in each token via a call to .replace on each token (e.g. str.replace())

    Parameters
        • old – text to replace
        • new – what to replace it with
        • tokens – typically a series of strings

streamutils.run(command, err=False, cwd=None, env=None, tokens=None)
    Runs a command. If command is a string, it will be split with shlex.split() so that it works as expected on windows. The current implementation runs in the same process, so it gathers the full output of the command before passing output to subsequent functions.

>>> from streamutils import *  #Suggestions for better commands to use as examples welcome!
>>> rev = run('git log --reverse') | search('commit (\w+)', group=1) | first()
>>> rev == run('git log') | search('commit (\w+)', group=1) | last()
True

Parameters
    • command – Command to run as a string or list
    • err – Redirect standard error to standard out (default False)
    • cwd – Current working directory for the command
    • env – Environment to pass into the command
    • encoding – Encoding to use to parse the output. Defaults to the default locale, or utf-8 if there isn't one
    • tokens – Lines to pass into the command as standard in

streamutils.separate(tokens=None)
    Takes a stream of Iterables and yields the items from each iterable

>>> [["hello", "there"], ["how", "are"], ["you"]] | separate() | write()
hello
there
how
are
you

Parameters
    • tokens – a stream of Iterables

streamutils.sfilter(func=None, tokens=None)
    Takes a user-defined function and passes through the tokens for which the function returns something that is True in a conditional context. If no function is supplied, passes through the items that are themselves True (equivalent of the filter() function)

>>> sfilter(lambda x: x % 3 == 0, tokens=[1, 3, 4, 5, 6, 9]) | write()
3
6
9
>>> sfilter(lambda x: x.endswith('ball'), tokens=['football', 'rugby', 'tennis', 'volleyball']) | write()
football
volleyball

Parameters
    • func – function to use in the filter
    • tokens – list of tokens to iterate through (usually supplied by the previous function in the pipeline)

streamutils.sfilterfalse(func=None, tokens=None)
    Passes through items for which the output of the filter function is False in a boolean context

>>> sfilterfalse(lambda x: x.endswith('ball'), tokens=['football', 'rugby', 'tennis', 'volleyball']) | write()
rugby
tennis

Parameters
    • func – Function to use for filtering
    • tokens – List of things to filter

streamutils.sformat(pattern, tokens=None)
    Takes in a list or dict of strings and formats them with the supplied pattern

>>> from streamutils import *
>>> lines = [['Rapunzel', 'tower'], ['Shrek', 'swamp']]
>>> sformat('{0} lives in a {1}', lines) | write()
Rapunzel lives in a tower
Shrek lives in a swamp
>>> lines = [{'name': 'Rapunzel', 'home': 'tower'}, {'name': 'Shrek', 'home': 'swamp'}]
>>> sformat('{name} lives in a {home}', lines) | write()
Rapunzel lives in a tower
Shrek lives in a swamp

Parameters
    • pattern – New-style python formatting pattern (see str.format())
    • tokens – list of lists of formatting arguments, or list of mappings

streamutils.smap(*funcs, **kwargs)
    Applies a transformation function (or series of functions) to each element of the stream. Note that smap(f, g, tokens) yields f(g(token))

>>> from streamutils import *
>>> smap(str.upper, tokens=['aeiou']) | write()
AEIOU
>>> smap(str.upper, str.strip, str.lower, tokens=[' hello', ' world']) | write()
HELLO
WORLD

Parameters
    • *funcs – functions to apply
    • tokens – list/iterable of objects

streamutils.smax(key=None, tokens=None)
    Returns the largest item in the stream

>>> from streamutils import *
>>> dates = ['2014-01-01', '2014-02-01', '2014-03-01']
>>> head(tokens=dates) | smax()
'2014-03-01'

Parameters
    • key – See documentation for max()
    • tokens – a list of things

Returns The largest item in the stream (as defined by python max())

streamutils.smin(key=None, tokens=None)
    Returns the smallest item in the stream

>>> from streamutils import *
>>> dates = ['2014-01-01', '2014-02-01', '2014-03-01']
>>> head(tokens=dates) | smin()
'2014-01-01'

Parameters
    • key – See documentation for min()
    • tokens – a list of things

Returns The smallest item in the stream (as defined by python min())

streamutils.split(n=0, sep=None, outsep=None, names=None, inject={}, tokens=None)
    split separates the input using .split(sep), by default splitting on whitespace (think str.split())

>>> split(tokens=[str("What's up?")]) | write()  #Note how the output is different from words
["What's", 'up?']
>>> split(1, tokens=[str("What's up?")]) | write()  #if n is an int, then a string is returned
What's

Parameters
    • n – int or list of ints determining which word to pick (first word is 1); 0 returns the whole list
    • sep – string separator to split on - by default sep=None, which splits on whitespace


    • outsep – if not None, output will be joined using this separator
    • names – (Optional) a name or list of names for the n extracted words, used to construct a dict to be passed down the pipeline
    • inject – For use with names - extra key/value pairs to include in the output dict
    • tokens – strings to split

streamutils.sreduce(func, initial=None, tokens=None)
    Uses a function to reduce() the output to a single value

    Parameters
        • func – Function to use in the reduction
        • initial – An initial value

    Returns Output of the reduction

streamutils.sslice(start=1, stop=None, step=1, fname=None, encoding=None, tokens=None)
    Provides access to a slice of the stream between start and stop at intervals of step

>>> lines = "hi ho hi ho it's off to work we go".split()
>>> sslice(start=2, stop=10, step=2, tokens=lines) | write()  #start and stop are both relative to the first item
ho
ho
off
work
>>> sslice(start=1, stop=7, step=3, fname='ez_setup.py') | write()
#!/usr/bin/env python
To use setuptools in your package's setup.py, include this

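Apart from the 1-based indexing, sslice is itertools.islice over the stream. Assuming stop is exclusive (which is what the example output above suggests), the first example corresponds to:

```python
import itertools

lines = "hi ho hi ho it's off to work we go".split()
# sslice(start=2, stop=10, step=2): islice uses 0-based, half-open indices,
# so shift both bounds down by one
result = list(itertools.islice(lines, 2 - 1, 10 - 1, 2))
# result == ['ho', 'ho', 'off', 'work']
```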
Parameters
    • start – First token to return (first is 1)
    • stop – Maximum token to return (default None implies read to the end)
    • step – Interval between tokens
    • fname – Filename to use as input
    • encoding – Unicode encoding to use to open files
    • tokens – list of filenames to open

streamutils.ssorted(cmp=None, key=None, reverse=False, tokens=None)
    Sorts the output of the stream (see documentation for sorted()). Warning: cmp was removed from sorted() in python 3

>>> from streamutils import *
>>> for line in (find('*.py') | replace(os.sep, '/') | ssorted()):
...     print(line)
ez_setup.py
setup.py

Returns a sorted list

streamutils.ssum(start=0, tokens=None)
    Adds the items that pass through the stream via a call to sum()


>>> from streamutils import *
>>> head(tokens=[1, 2, 3]) | ssum()
6

Parameters
    • start – Initial value to start the sum, returned if the stream is empty

Returns sum of all the values in the stream

streamutils.strip(chars=None, tokens=None)
    Runs .strip() against each line of the stream

>>> from streamutils import *
>>> line = strip(tokens=[' line\n']) | first()
>>> line == 'line'
True

Parameters
    • tokens – A series of lines to remove whitespace from

streamutils.sumby(keys=None, values=None, tokens=None)
    If keys and values are not set, given a series of key, value items, returns a dict of summed values, grouped by key

>>> from streamutils import *
>>> sums = head(tokens=[('A', 2), ('B', 6), ('A', 3), ('C', 20), ('C', 10), ('C', 30)]) | sumby()
>>> sums == {'A': 5, 'B': 6, 'C': 60}
True

If keys and values are set, given a series of dicts, returns a dict of dicts of summed values, grouped by a tuple of the indicated keys.

>>> from streamutils import *
>>> data = []
>>> data.append({'Region': 'North', 'Revenue': 4, 'Cost': 8})
>>> data.append({'Region': 'North', 'Revenue': 3, 'Cost': 2})
>>> data.append({'Region': 'West', 'Revenue': 6, 'Cost': 3})
>>> sums = head(tokens=data) | sumby(keys='Region', values=['Revenue', 'Cost'])
>>> sums == {'North': {'Revenue': 7, 'Cost': 10}, 'West': {'Revenue': 6, 'Cost': 3}}
True

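The keyed aggregation in the dict-token form above can be sketched with a plain dict; `sumby_plain` is a hypothetical name used for illustration:

```python
def sumby_plain(tokens, key, values):
    """Group dict tokens by tokens[key] and sum each of the values fields."""
    sums = {}
    for token in tokens:
        bucket = sums.setdefault(token[key], dict.fromkeys(values, 0))
        for v in values:
            bucket[v] += token[v]
    return sums

data = [{'Region': 'North', 'Revenue': 4, 'Cost': 8},
        {'Region': 'North', 'Revenue': 3, 'Cost': 2},
        {'Region': 'West', 'Revenue': 6, 'Cost': 3}]
sums = sumby_plain(data, key='Region', values=['Revenue', 'Cost'])
# sums == {'North': {'Revenue': 7, 'Cost': 10}, 'West': {'Revenue': 6, 'Cost': 3}}
```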
Returns dict mapping each key to the sum of all the values corresponding to that key

streamutils.tail(n=10, fname=None, encoding=None, tokens=None)
    Returns a list of the last n items in the stream

>>> tokens = "hi ho hi ho it's off to work we go".split()
>>> tail(5, tokens=tokens) | write()  #Note tail() returns a deque, not a generator, but it still works as part of a stream
off
to
work
we
go
>>> tail(2, fname='ez_setup.py') | write()
if __name__ == '__main__':
    sys.exit(main())

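As the doctest comment notes, tail() hands back a deque; a bounded deque is exactly how the last n items of an arbitrarily long stream can be kept in constant memory:

```python
from collections import deque

tokens = "hi ho hi ho it's off to work we go".split()
last5 = deque(tokens, maxlen=5)  # older items are evicted as new ones arrive
# list(last5) == ['off', 'to', 'work', 'we', 'go']
```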
Parameters
    • n – How many items to return, e.g. n=5 will return 5 items
    • fname – A filename from which to read the last n items (10 by default)
    • encoding – The encoding of the file
    • tokens – Stream of tokens to take the last few members of (i.e. not a list of filenames to take the last few lines of)

Returns A list of the last n items

streamutils.takewhile(func=None, tokens=None)
    Passes through items until the supplied function returns False (equivalent of itertools.takewhile())

>>> [1, 2, 3, 2, 1] | takewhile(lambda x: x < 3) | aslist()
[1, 2]

Parameters
    • func – The function to use as a predicate
    • tokens – List of things to filter

streamutils.terminator(func)
    Decorator used to wrap a function in a Terminator that ends a pipeline

    Parameters
        • func – The function to be wrapped - should return the desired output of the pipeline
        • tokenskw – The keyword argument that func expects to receive tokens on

    Returns A Terminator function

streamutils.traverse(tokens=None)
    Performs a full depth-first unwrapping of the supplied tokens. Strings are not unwrapped

>>> ["hello", ["hello", [["world"]]]] | traverse() | join() | write()
hello hello world

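Depth-first flattening that leaves strings alone can be written as a short recursive generator; `traverse_plain` is an illustrative name, not the streamutils implementation:

```python
def traverse_plain(tokens):
    """Recursively yield the leaves of nested iterables, without unwrapping strings."""
    for token in tokens:
        if hasattr(token, '__iter__') and not isinstance(token, str):
            for item in traverse_plain(token):  # descend depth-first
                yield item
        else:
            yield token

flat = list(traverse_plain(["hello", ["hello", [["world"]]]]))
# flat == ['hello', 'hello', 'world']
```

The isinstance check matters because strings are themselves iterable: without it, the recursion would unwrap every string into single characters.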
Parameters
    • tokens – a stream of Iterable things to be unwrapped

streamutils.unique(tokens=None)
    Passes through values the first time they are seen

>>> from streamutils import *
>>> lines = ['one', 'two', 'two', 'three', 'three', 'three', 'one']
>>> unique(lines) | write()
one
two
three

Parameters
    • tokens – Either set by the pipeline or provided as an initial list of items to pass through the pipeline


streamutils.unwrap(tokens=None)
    Yields a stream of lists, with one level of nesting in the stream's tokens unwrapped (if present)

>>> [[[1], [2]], [[2, 3, 4], [5]], [[[6]]]] | unwrap() | write()
[1, 2]
[2, 3, 4, 5]
[[6]]

Parameters
    • tokens – a stream of Iterables

streamutils.update(values=None, funcs=None, tokens=None)
    For each dict token in the stream, updates it with the values dict, then updates it with funcs, a dict mapping each key to a function used to set the value of that key to func(token). A bit like convert, only it is designed to let you add keys, not just modify existing ones. Currently modifies each dict in the stream (i.e. is not pure), but this should not be relied on - in the future each dict may be yield-ed as a (shallow) copy in order to be pure (at the cost of more allocations)

>>> from streamutils import *
>>> lines = [{'first': 'Jack', 'last': 'Bauer'}, {'first': 'Michelle', 'last': 'Dessler'}]
>>> for actor in update(funcs={'initials': lambda x: x['first'][0] + x['last'][0]}, tokens=lines):
...     print(actor['initials'])
JB
MD
>>> for actor in update(values={'Show': '24'}, tokens=lines):
...     print(actor['Show'])
24
24

Parameters
    • values – dict
    • funcs – dict of key: function pairs
    • tokens – a stream of dicts

streamutils.words(n=0, word=r'\S+', outsep=None, names=None, inject=None, flags=0, tokens=None)
    Words looks for non-overlapping strings that match the word pattern and passes the words it finds down the stream. If outsep is None, it will pass on a list; otherwise it will join together the selected words with outsep

>>> from streamutils import *
>>> tokens = [str('first second third'), str(' fourth fifth sixth')]
>>> words(1, tokens=tokens) | write()
first
fourth
>>> words([1], tokens=tokens) | write()
['first']
['fourth']
>>> words((1, 3), tokens=tokens) | write()
['first', 'third']
['fourth', 'sixth']
>>> words((1, 3), outsep=' ', tokens=tokens) | write()
first third
fourth sixth
>>> words((1,), names=(1,), tokens=tokens) | write()
OrderedDict([(1, 'first')])
OrderedDict([(1, 'fourth')])
>>> words(word="[\w']+", tokens=[str("What's up?")]) | write()  #Note how the output is different from split()
["What's", 'up']

Parameters
    • n (int or list) – an integer indicating which word to return (first word is 1), a list of integers to select multiple words, or 0 to return all words. If n is an integer, the result is a string; if n is a list, the result is a list of strings
    • word (str) – a pattern that will be used to select words using re.findall() (default \S+)
    • outsep (str) – a string separator to join together the words that are found into a new string (or None to output a list of words)
    • names (str or list) – (Optional) a name or list of names for the n extracted words, used to construct a dict to be passed down the pipeline
    • inject (dict) – For use with names - extra key/value pairs to include in the output dict
    • flags – flags to pass to the re engine to compile the pattern
    • tokens – list of tokens to iterate through (usually supplied by the previous function in the pipeline)

Raise ValueError if there are fewer than n (or max(n)) words in the string

streamutils.write(fname=None, mode='wt', encoding=None, tokens=None)
    Writes the output of the stream to a file, or via print if no file is supplied. Calls to print include a call to str.rstrip() to remove trailing newlines. mode is only used if fname is a string

>>> from streamutils import *
>>> from six import StringIO
>>> lines = ['%s\n' % line for line in ['Three', 'Blind', 'Mice']]
>>> lines | head() | write()  # By default prints to the console
Three
Blind
Mice
>>> buffer = StringIO()  # Alternatively write to an open file-like object
>>> lines | head() | write(fname=buffer)
>>> writtenlines = buffer.getvalue().splitlines()
>>> writtenlines[0] == 'Three'
True

Parameters
    • fname – If a str, the filename to write to; otherwise an open file-like object to write to. The default of None implies writing to standard output
    • mode – The mode to use to open fname (default of 'wt', as per io.open())
    • encoding – Encoding to use to write to the file
    • tokens – Lines to write to the file


CHAPTER 3

Tutorial and cookbook

3.1 Parsing an Apache logfile

Suppose we have an apache log file we want to extract info from. (Note that the logfile used here is pretty old, so it won't necessarily be in the same format as any log files you may have on your own servers.) Let's have a look at the file to see what the contents look like:

>>> from __future__ import print_function, unicode_literals
>>> from streamutils import *
>>> import os
>>> logfile = find('examples/*log.bz2') | first()
>>> print(logfile.replace(os.sep, '/'))
examples/NASA_access_log_July95.log.bz2
>>> bzread(fname=logfile) | head(5) | write()
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

So, suppose we want to see who's accessing us most: we can pick out the relevant hostnames with search and then use bag to count them

>>> logpattern = r'''^([\w.-]+)'''
>>> clients = bzread(fname=logfile) | search(logpattern) | bag()
>>> fan = clients.most_common()[0]
>>> print('%s accessed us %d times' % (fan[0], fan[1]))
kristina.az.com accessed us 118 times

Or suppose we want to know how much data a known user has used


>>> logpattern2 = r'''^([\w.-]+).*(\d+)'''
>>> usage = find('examples/*log.bz2') | read() | search(logpattern2, names={1: 'User', 2: 'Data'}) | convert({'Data': int}) | sumby(keys='User', values='Data')
>>> usage['kristina.az.com']['Data']
517

3.2 Nesting streams to filter for files based on content

Suppose we want to find python source files that don’t use /usr/bin/env to call python. We can’t do this in a normal pipeline, as we want the names of the files, not their content. To do this, we need to make a nested pipeline like so:

>>> from streamutils import *
>>> import shutil, tempfile, os.path
>>> try:
...     d = tempfile.mkdtemp()
...     #First do some setup
...     with open(os.path.join(d, 'envpython.py'), 'w') as f:
...         w = f.write('#!/usr/bin/env python')
...     with open(os.path.join(d, 'python2.7.py'), 'w') as f:
...         w = f.write('#!/usr/bin/python2.7')
...     #Now look for the files
...     find('%s/*.py' % d) | sfilter(lambda x: read(x) | nomatch('/usr/bin/env') | first()) | smap(os.path.basename) | write()
... finally:
...     shutil.rmtree(d)
python2.7.py

3.3 Getting the correct function signatures in sphinx for decorated methods

3.3.1 Context: the problem with python decorators

Decorators in python are a good thing™, and writing something like streamutils would be impossible without them. However, unless you use the decorator module, even if you use functools.update_wrapper() or functools.wraps(), the wrapped function signatures are lost, and so appear as dosomething(...) in documentation, not dosomething(arg1, arg2='default'). Fortunately, it is possible to tell the autodoc plugin for sphinx to insert the correct signature, but you need to supply it in your documentation. So, supposing you want autodoc to pull in all the docstrings of the Noodle module for you, you might use:

.. autoclass:: Noodle
   :members:

In order to generate the documentation for a Noodle class and supply the correct function signatures, you now need to write:

.. autoclass:: Noodle(type)

   .. automethod:: eat(persona)

Unfortunately, this method doesn’t work if you use readthedocs as your documentation host, as they don’t support the autodoc module (at least not if you’re not on their whitelist).


3.3.2 Solution: Autogenerating documentation output from a source file

Ideally, you want one place to maintain your method signatures and documentation (your source code). One potential solution (that streamutils itself uses) is to autogenerate this output like so:

>>> import streamutils as su
>>> from streamutils import *
>>> funcs = (read(fname='src/streamutils/__init__.py')
...          | search(r'\s?def ((\w+)[(].*[)]):(?:\s?[#].*)?', group=None, names=['sig', 'name'])
...          | sfilter(lambda x: x['name'] in (set(su.__all__) - set(['wrap', 'wrapTerminator'])))
...          | ssorted(key=lambda x: x['name']))
>>> with open('docs/api.rst', 'w') as apirst:
...     lines = []
...     lines.append('API')
...     lines.append('---')
...     lines.append('.. module:: streamutils\n')
...     lines.append('%s\n' % su.__doc__.strip())
...     for f in funcs:
...         lines.append('.. py:function:: %s\n' % f['sig'])
...         lines.append('   %s\n' % locals()[f['name']].__doc__.strip())
...     apirst.writelines('\n'.join(lines))


CHAPTER 4

Testing streamutils

4.1 Writing tests

streamutils contains tests at three levels:

inline doctests
    doctest-style tests are used within the source files to give a basic demonstration of how each function can be used

documentation doctests
    To allow for more involved options, doctest-style tests are also used within the documentation to improve test coverage and ensure documentation does not become out of date

boring and bugfix tests
    Not all tests are informative (particularly those checking that an exception is raised when expected, or regression tests). These are implemented using py.test and can be found in the test directory of the source tree.

The aim is to achieve 100% test coverage across the three different test types. New features should be accompanied by tests to show what they do, and bug fixes should be accompanied by tests to ensure things stay fixed!
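An inline doctest of the kind described above is just an interpreter transcript embedded in a docstring. Here is a minimal sketch, where head_lines is an invented helper (not a streamutils function), run through the standard-library doctest machinery:

```python
import doctest

def head_lines(lines, n=3):
    """Return the first n items of an iterable as a list.

    >>> head_lines(['a', 'b', 'c', 'd'], 2)
    ['a', 'b']
    """
    result = []
    for i, line in enumerate(lines):
        if i >= n:
            break
        result.append(line)
    return result

# Collect and run the doctest examples found in the docstring;
# module=False skips the usual "does this object belong to the
# module under test" check, which suits a standalone sketch
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(head_lines, 'head_lines', module=False,
                        globs={'head_lines': head_lines}):
    runner.run(test)
print(runner.tries, runner.failures)  # 1 0
```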

4.2 Running tests

4.2.1 Testing using the current python version

In order to run the tests for streamutils and pick up all the relevant types of tests, a rather involved invocation of py.test is required. This has been integrated into setup.py, so all you need to do to run the tests on your current version of python is run python setup.py test

4.2.2 Testing against supported python versions

streamutils supports pypy and python versions >=2.6. To test that any changes don’t break anything on any of these versions, you can use tox, which will test a clean install of streamutils in a virtualenv using each of the supported pythons (so long as you have them set up on your system). All you need to do is run tox. (Internally, for each supported python this configures the environment appropriately, installs dependencies and calls python setup.py test.)
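A minimal tox configuration along these lines might look like the following sketch. The envlist shown is an assumption for illustration; the tox.ini shipped in the repository is authoritative:

```ini
[tox]
envlist = py26, py27, py33, pypy

[testenv]
# tox creates a clean virtualenv per environment, installs the
# package and its dependencies there, then runs this command
commands = python setup.py test
```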

4.2.3 Continuous integration

Because it’s easy to forget to run tests and to keep the test coverage reports up to date, streamutils uses travis to run integration testing after every push to the github repository (it just calls tox and then uploads the test coverage status to coveralls).

CHAPTER 5

Indices and tables

• genindex
• modindex
• search
