Streamutils Documentation Release 0.1.1-Dev
Max Grender-Jones
Nov 15, 2018

Contents

1 streamutils - pipelines for python
    1.1 Motivation
    1.2 Features
    1.3 Non-features
    1.4 Functions
    1.5 API Philosophy & Conventions
    1.6 Installation and Dependencies
    1.7 Status
    1.8 How does it work?
    1.9 Contribute
    1.10 Alternatives and Prior art
    1.11 Acknowledgements and References
    1.12 License
2 API
3 Tutorial and cookbook
    3.1 Parsing an Apache logfile
    3.2 Nesting streams to filter for files based on content
    3.3 Getting the correct function signatures in sphinx for decorated methods
4 Testing streamutils
    4.1 Writing tests
    4.2 Running tests
5 Indices and tables
Python Module Index

CHAPTER 1
streamutils - pipelines for python

Bringing one-liners to python since 2014

1.1 Motivation

Have you ever been jealous of friends who know more commandline magic than you? Perhaps you’re a python user who feels guilty that you never learnt sed, awk or perl, and wonder quite how many keystrokes you could be saving yourself? (On the plus side, you haven’t worn the keycaps off your punctuation keys yet.)
Or maybe you’re stuck using (or supporting) windows? Or perhaps you are one of those friends, and your heart sinks at the thought of all the for loops you’d need to replicate a simple

    grep "$username" /etc/passwd | cut -f 1,3 -d : --output-delimiter=" "

in python? Well, hopefully streamutils is for you.

Put simply, streamutils is a pythonic implementation of the pipelines offered by unix shells and the coreutils toolset. Streamutils is not (at least not primarily) a python wrapper around tools that you call from the commandline or a wrapper around subprocess (for that, you want sh or its previous incarnation pbs). However, it can interface with external programmes through its run command.

Enough already! What does it do? Perhaps it’s best explained with an example. Suppose you want to reimplement our bash pipeline outlined above:

    >>> from __future__ import print_function
    >>> from streamutils import *
    >>> name_and_userid = read('examples/passwd') | matches('johndoe') | split([1,3], ':', ' ') | first()
    >>> print(name_and_userid)
    johndoe 1000
    >>> gzread('examples/passwd.gz') | matches('johndoe') | split([1,3], ':', ' ') | write()  # Can read from gzipped (and bzipped) files
    johndoe 1000
    >>> gzread('examples/passwd.gz', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()  # You really ought to specify the unicode encoding
    johndoe 1000
    >>> read('examples/passwd.bz2', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()  # streamutils will attempt to transparently decompress compressed files (.gz, .bz2, .xz)
    johndoe 1000
    >>> read('examples/passwd.xz', encoding='utf8') | matches('johndoe') | split([1,3], ':', ' ') | write()
    johndoe 1000

streamutils also mimics the > and >> operators of bash-like shells, so to write to files you can write something like:

    >>> import tempfile, shutil, os
    >>> try:
    ...     # Some setup follows to allow this docstring to be included in automated tests
    ...     tempdir = tempfile.mkdtemp()  # Create a temporary directory to play with
    ...     cwd = os.getcwd()  # Save the current directory so we can change back to it afterwards
    ...     os.chdir(tempdir)  # Change to our temporary directory
    ...     passwd = os.path.join(cwd, 'examples', 'passwd.gz')
    ...     # Right - setup's done
    ...     with open('test.txt', mode='w') as tmp:  # mode determines append / truncate behaviour
    ...         gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ') > tmp  # can write to open things
    ...     # >> appends, but because python evaluates rshifts (>>) before bitwise or (|), the preceding stream must be in brackets
    ...     (gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ')) >> 'test.txt'
    ...     line = read('test.txt') | first()
    ...     assert line.strip() == 'johndoe 1000'
    ...     length = read('test.txt') | count()
    ...     assert length == 2
    ...     gzread(passwd) | matches('johndoe') | split([1,3], ':', ' ') > 'test.txt'  # (> writes to a new file)
    ...     length = read('test.txt') | count()
    ...     assert length == 1
    ... finally:
    ...     os.chdir(cwd)  # Go back to the original directory
    ...     shutil.rmtree(tempdir)  # Delete the temporary one
    ...

Or perhaps you need to start off with output from a real command:

    >>> from streamutils import *
    >>> import platform
    >>> cat = 'python -c "import sys; print(open(sys.argv[1]).read())"' if platform.system() == 'Windows' else 'cat'
    >>> run('%s setup.py' % cat) | search("keywords='(.*)'", group=1) | write()
    UNIX pipelines for python

You don’t have to take your input from a file or some other streamutils source, as it’s easy to pass in an Iterable that you’ve created elsewhere to have some functional programming fun:

    >>> from streamutils import *
    >>> 1 | smap(float) | aslist()  # Non-iterables are auto-wrapped
    [1.0]
    >>> ['d','c','b','a'] | smap(lambda x: (x.upper(), x)) | ssorted(key=lambda x: x[0]) | smap(lambda x: x[1]) | aslist()  # Streamutils' Schwartzian transform (sorting against an expensive-to-compute key)
    ['a', 'b', 'c', 'd']
    >>> range(0,1000) | sfilterfalse(lambda x: (x%5) * (x%3)) | ssum()  # Euler 1: sum of the numbers below 1000 divisible by 3 or 5
    233168
    >>> import itertools
    >>> def fib():
    ...     fibs = {0: 1, 1: 1}
    ...     def fibn(n):
    ...         return fibs[n] if n in fibs else fibs.setdefault(n, fibn(n-1) + fibn(n-2))
    ...     for f in itertools.count(0) | smap(fibn):
    ...         yield f
    ...
    >>> fib() | takewhile(lambda x: x < 4000000) | sfilterfalse(lambda x: x % 2) | ssum()  # Euler 2: sum of even fibonacci numbers under four million
    4613732
    >>> (range(0, 101) | ssum())**2 - (range(0, 101) | smap(lambda x: x*x) | ssum())  # Euler 6: difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.
    25164150
    >>> top = 110000
    >>> primes = range(2, top)
    >>> for p in range(2, int(top**0.5)):  # Euler 7: Sieve of Eratosthenes
    ...     primes |= sfilter(lambda x: (x==p) or (x%p), end=True)
    ...
    >>> primes | nth(10001)
    104743

1.2 Features

• Lazy evaluation and therefore memory efficient - nothing happens until you start reading from the output of your pipeline, when each of the functions runs for just long enough to yield the next token in the stream (so you can use a pipeline on a big file without needing to have enough space to store the whole thing in memory)
• Extensible - to use your own functions in a pipeline, just decorate them, or use the built in functions that do the groundwork for the most obvious things you might want to do (i.e. custom filtering with sfilter, whole-line transformations with smap or partial transformations with convert)
• Unicode-aware: all functions that read from files or file-like things take an encoding parameter
• Not why I wrote the library at all, but as shown above, many of streamutils’ functions are ‘pure’ in the functional sense, so if you squint your eyes, you might be able to think of this as a way into functional programming, with a much nicer syntax (imho, as function composition reads left to right not right to left, which makes it more readable if less pythonic) than, say, toolz

1.3 Non-features

An unspoken element of the zen of python (import this) is that ‘Fast to develop’ is better than ‘Fast to run’, and if there’s a downside to streamutils that’s it. The native command-line versions of grep etc. are no doubt much faster than search/matches from streamutils. But then you can’t call python functions from them, or call them from python code on your windows machine. As they say, ‘you pays your money and you takes your choice’. Since streamutils uses so many features that numba does not support (generators, default args, context managers), using numba to get speed-ups for free would sadly appear to not be an option for now (at least not without the help of a numba expert), and though cython (as per cytoolz) would certainly work, it would make streamutils much harder to install and would require a lot more effort.

1.4 Functions

A quick bit of terminology:

• pipeline: A series of streamutils functions joined together with pipes (i.e. |)
• tokens: things being passed through the pipeline
• stream: the underlying data which is being broken into the tokens that are passed through the pipeline

Implemented so far (equivalent coreutils function in brackets if the name is different).
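
As a rough illustration of the terminology above and of the lazy evaluation listed under Features, here is a minimal sketch in plain python. It does not use streamutils and is not a description of how the library is implemented: the Stage class and the numbers, evens and take_first names are hypothetical, chosen only for this example. It shows one way that | could be overloaded (via __ror__) so that tokens are only pulled from the stream when the end of the pipeline asks for them.

```python
import itertools


class Stage:
    """Hypothetical pipeline stage wrapping a function that consumes an iterator."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, upstream):
        # Python calls this for `upstream | stage` when the left-hand object
        # does not itself know how to handle `|` with a Stage.
        return self.func(iter(upstream))


pulled = []  # records every token that actually leaves the source


def numbers():
    """An infinite stream: 1, 2, 3, ..."""
    for n in itertools.count(1):
        pulled.append(n)
        yield n


@Stage
def evens(tokens):
    # A generator, so it only runs when something asks it for a token.
    for token in tokens:
        if token % 2 == 0:
            yield token


@Stage
def take_first(tokens):
    # A terminating stage: returns a single value rather than a stream.
    return next(tokens)


pipeline = numbers() | evens        # builds the pipeline; nothing has run yet
assert pulled == []

result = pipeline | take_first      # pulls just enough tokens to get an answer
print(result, pulled)
```

Because each stage is a generator, the infinite numbers() source is safe to use here: the pipeline stops pulling tokens as soon as take_first has the one it needs.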
Note that the following descriptions say ‘lines’, but there’s nothing stopping the functions operating on a stream of tokens that aren’t newline-terminated strings:

1.4.1 Connectors

These are functions designed to start a stream or process a stream (the