peach Documentation
Release
Alyssa Kwan

Nov 11, 2017

Contents

1 Table of Contents
    1.1 Solutions to Common Problems

Welcome to peach. peach is a functional ETL framework: it empowers you to easily perform ETL in a way inherent to the functional programming paradigm.

Why peach? Please see the Solutions to Common Problems section of the documentation. peach is the culmination of lessons from over a decade of mistakes made by the principal author. As such, it represents best practices for dealing with a wide array of ETL and data lake / data warehouse problems. It also represents a sound theoretical framework for approaching this family of problems, namely "how to deal with side effects amidst concurrency and failure in a tractable way".

CHAPTER 1

Table of Contents

1.1 Solutions to Common Problems

1.1.1 Clean Retry on ETL Job Failure

Problem

Jobs are side-effecting; that is their point: input data is ingested, and changes are made to the data lake or warehouse state in response. If a job fails partway through execution, it leaves all sorts of garbage behind. This garbage gets in the way of retries - or, worse yet, of any other tries at all.

If all of the job artifacts - both intermediate and final - are hosted on a transactional data store, and jobs use only features that are covered by transactions (for instance, some RDBMSs don't include schema changes in transaction scope, so you can't create tables and have them automatically cleaned up), then congratulations! There is no problem; your data warehouse is at a scale where this is possible.

For the rest of us, the key principle is idempotency: write jobs so that they first clean up the artifacts created by previous runs. The big problem with this is naming. Either intermediate artifacts use immutable, well-known names, which limits the degree of parallelism to 1 (one), i.e. no parallelism; or you generate new, non-conflicting names for your shards on each job execution and accept uncollected garbage whenever a process crashes. Collecting that garbage later requires having kept careful and correct records of which shards are actually orphans. In either case, idempotency requires careful thought and discipline, which raises the barrier to entry for engineers joining the effort. A sketch of the second approach follows.
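To make the naming trade-off concrete, here is a minimal sketch of the second approach - run-scoped shard names backed by a registry of pending shards - in Python against a sqlite3-style DB-API connection. Every name in it (run_etl_job, shard_registry, sweep_orphans, the shard schema) is an illustrative assumption, not part of peach's API.

    import uuid

    def run_etl_job(conn, load_batch):
        # Every run writes to a fresh, non-conflicting shard name, so
        # parallel runs never collide. Assumes a shard_registry table
        # (shard_name TEXT, status TEXT) already exists.
        shard = "staging_orders_" + uuid.uuid4().hex

        # Record the shard before doing any work, so a later sweep can
        # identify it as an orphan if this process crashes mid-run.
        conn.execute(
            "INSERT INTO shard_registry (shard_name, status) VALUES (?, 'pending')",
            (shard,),
        )
        conn.execute("CREATE TABLE %s (order_id INTEGER, amount REAL)" % shard)

        load_batch(conn, shard)  # the side-effecting load into the shard

        # Publish: mark the shard live so the sweeper leaves it alone.
        conn.execute(
            "UPDATE shard_registry SET status = 'live' WHERE shard_name = ?",
            (shard,),
        )
        conn.commit()

    def sweep_orphans(conn):
        # The "careful and correct records" the text calls for: anything
        # still pending belongs to a crashed run. (A production sweeper
        # would also check shard age or hold a lock so it never collects
        # a shard belonging to an in-flight run.)
        rows = conn.execute(
            "SELECT shard_name FROM shard_registry WHERE status = 'pending'"
        ).fetchall()
        for (shard,) in rows:
            conn.execute("DROP TABLE IF EXISTS %s" % shard)
            conn.execute(
                "DELETE FROM shard_registry WHERE shard_name = ?", (shard,)
            )
        conn.commit()

The registry is the piece engineers usually skip: without it, a crash leaves shards whose orphan status can only be guessed at, which is exactly the discipline problem described above.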
Solution

peach is functional. This means it is founded on the principles of functional programming, one of which is referential transparency. Referential transparency simply means that calling the same function with the same inputs more than once always results in the same output. Obviously, the only way that can work is if the function is free of side effects.

But ETL is fundamentally side-effecting - we're changing the data lake / data warehouse in response to input. How can we have referential transparency? The key is making time itself an explicit input.

Let us take the following:

• function inputs_between_times returns the set of inputs received from time 1 (one), exclusive, to time 2 (two), inclusive
• function data_lake returns the state of the data lake based on some past state of the data lake with some new set of inputs integrated into it

Based on this, the current state of the data lake (now) is expressed as:

    data_lake_at_time_current = data_lake(
        data_lake_at_time_previous,
        inputs_between_times(
            time_previous,
            time_current,
        ),
    )

What does this approach imply?

1. ETL outputs are just cache. The fundamental data is the time-ordered set of inputs into the system. peach is really good at maintaining cache.
2. Any prior state of the lake / warehouse can be regenerated at any time (see the sketch below). Use it for rollback. Use it for debugging. Load prior states into other warehouses to give some department an unchanging snapshot, perhaps to do end-of-fiscal-period work. More on this <TBD>.
   (a) Write functions that compare the state of the lake / warehouse at two different points in time. Automate your change analysis and debugging. More on this <TBD>.
   (b) Write functions that compute the set of change operations given the state of the lake / warehouse and the set of new inputs. Know exactly what is being inserted, updated, and deleted - in what order, and why. Deepen your change analysis and debugging. More on this <TBD>.
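To ground the recurrence and implications 1, 2, and 2(a), here is a minimal self-contained sketch in Python. Nothing in it is peach's actual API: the toy INPUT_LOG, the dict-based state, and the helper names (data_lake_at, compare_states) are all illustrative assumptions.

    # The fundamental data: a time-ordered log of inputs, here as
    # (timestamp, record) pairs. Everything downstream is derived cache.
    INPUT_LOG = [
        (1, {"op": "insert", "key": "a", "value": 10}),
        (2, {"op": "insert", "key": "b", "value": 20}),
        (3, {"op": "update", "key": "a", "value": 11}),
        (4, {"op": "delete", "key": "b"}),
    ]

    def inputs_between_times(t_previous, t_current):
        # Inputs received from t_previous (exclusive) to t_current (inclusive).
        return [rec for ts, rec in INPUT_LOG if t_previous < ts <= t_current]

    def data_lake(previous_state, new_inputs):
        # Pure step: integrate a batch of inputs into a past state.
        # Returns a new dict; the previous state is never mutated.
        state = dict(previous_state)
        for rec in new_inputs:
            if rec["op"] == "delete":
                state.pop(rec["key"], None)
            else:
                state[rec["key"]] = rec["value"]
        return state

    def data_lake_at(t):
        # Implication 2: regenerate the lake as of any time t, for
        # rollback, debugging, or an unchanging snapshot.
        return data_lake({}, inputs_between_times(0, t))

    def compare_states(t1, t2):
        # Implication 2(a): what changed between two points in time?
        s1, s2 = data_lake_at(t1), data_lake_at(t2)
        return {
            "inserted": {k: s2[k] for k in s2.keys() - s1.keys()},
            "deleted": {k: s1[k] for k in s1.keys() - s2.keys()},
            "updated": {k: (s1[k], s2[k])
                        for k in s1.keys() & s2.keys() if s1[k] != s2[k]},
        }

    # The recurrence from above: a previous state plus the inputs since
    # then gives exactly the same answer as recomputing from scratch.
    assert data_lake(data_lake_at(2), inputs_between_times(2, 4)) == data_lake_at(4)
    assert compare_states(2, 4) == {
        "inserted": {}, "deleted": {"b": 20}, "updated": {"a": (10, 11)}
    }

Because the state is a plain value and the step function is pure, any materialized state is disposable cache: it can be rebuilt, compared, or shipped elsewhere without touching the log of inputs, which remains the single source of truth.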