Transforming Options Market Data with the Dataflow SDK
Salvatore Sferrazza & Sebastian Just, FIS ([email protected] and [email protected])
Preface

When FIS (formerly SunGard) released the whitepaper Scaling to Build the Consolidated Audit Trail: A Financial Services Application of Cloud Bigtable in Q2 2015, a key module of the experiments used MapReduce to persist data to Google Cloud Bigtable via the HBase API. At that time, our team had been looking forward to working with Google Cloud Dataflow as a MapReduce alternative, but Google Cloud Dataflow was not yet generally available (GA). In the months since the Bigtable whitepaper was released, the Google Cloud Dataflow managed service has been promoted to GA, and this whitepaper details our experiences applying the Google Cloud Dataflow SDK to the problem of options market data transformation.

Introduction

Much of the software development in the capital markets (and enterprise systems in general) revolves around the transformation and enrichment of data arriving from another system. Traditionally, developers have considered the activities around Extracting, Transforming and Loading (ETL) data a particularly unglamorous dimension of building software products. However, these activities encapsulate functions that are core to every tier of computing. Oftentimes, the activity involves legacy systems integration, and perhaps this is the genesis of ETL's unpopular reputation. Nonetheless, when data-driven enterprises are tasked with harvesting insights from massive data sets, it is quite likely that ETL, in one form or another, is lurking nearby.

The Google Cloud Dataflow service and SDK represent powerful tools for myriad ETL duties. In this paper, we introduce some of the main concepts behind building and running applications that use Dataflow, then get "hands on" with a job to transform and ingest options market symbol data before storing the transformations within a Google BigQuery data set.

The specific transformations we'll perform in this use case include the following:

• Ingest options market ticks and extract each contract's expiration date, strike price and reference asset from the symbol associated with each tick (a parsing sketch follows this list)
• Load the transformed data into a Google BigQuery schema that supports queries on these individual contract properties. For example, finding all the in-the-money (ITM) contracts traded between 2:00 and 3:00 PM requires comparing the strike price of a contract with the spot price of the underlying asset. Hence, having the strike price encoded within the options symbol string is an obstacle to any analytics that require arithmetic operations involving the strike price.
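To make the first transformation concrete, the minimal sketch below decomposes a single option symbol into the contract properties named above. It assumes an OCC/OSI-style 21-character symbol (a space-padded root, a yyMMdd expiration, a call/put flag and a strike expressed in thousandths of a dollar); the class name and the sample symbol are illustrative and may not match the exact format of the feed used in the project.

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch only: decomposes an OSI-style option symbol into an underlying
// (reference asset), an expiration date and a strike price.
public class OptionSymbolSketch {

    private static final DateTimeFormatter EXPIRY = DateTimeFormatter.ofPattern("yyMMdd");

    public static void main(String[] args) {
        // Hypothetical symbol: 6-char padded root + yyMMdd + C/P flag + strike * 1000
        String symbol = "GOOG  151120C00620000";

        String underlying = symbol.substring(0, 6).trim();
        LocalDate expiration = LocalDate.parse(symbol.substring(6, 12), EXPIRY);
        boolean isCall = symbol.charAt(12) == 'C';
        BigDecimal strike = new BigDecimal(symbol.substring(13, 21)).movePointLeft(3);

        System.out.printf("%s %s %s %s%n", underlying, expiration, isCall ? "CALL" : "PUT", strike);
    }
}

With the strike held as a numeric column rather than a substring of the symbol, the ITM comparison described above becomes a simple arithmetic predicate in BigQuery.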
Objectives

• Introduce Google Cloud Dataflow's programming and execution model
• Apply Dataflow to transform market data (specifically options prices) and persist the transformations into a BigQuery table for analysis

Prerequisites

While Dataflow's SDK support is currently limited to Java, Google has a Dataflow SDK for Python in Early Access Program (EAP). As such, all examples herein are in Java, using Maven for builds and dependency management. We hope to publish a supplementary implementation of this same project using the Python SDK during its public beta.

For the hands-on portion of the paper, the following conditions were required:

• Enabling the Dataflow API for a Google Developers Console project
• Enabling the BigQuery API for a Google Developers Console project
• Enabling billing on your Google Cloud Platform account
• Installing the Google Cloud SDK on your client operating system
• Installing Maven for managing the Java project
• For the Dataflow libraries specifically, adding the following dependency to your project's pom.xml:

  <dependency>
    <groupId>com.google.cloud.dataflow</groupId>
    <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
    <version>RELEASE</version>
  </dependency>

Additionally, the source code may be cloned from the Git repository located at https://github.com/SunGard-Labs/dataflow-whitepaper. Although the source code repository is self-contained, parameters may need to be adjusted to reflect the specific Google Cloud project parameters for your environment.

Programming Dataflow

State maintenance and scalability

Because a full description of the role that application state plays in determining scalability is beyond the scope of this paper, please refer to this explanation, which effectively breaks down the trade-offs inherent to various techniques around application state maintenance.

Suffice it to say, state management is a critical factor contributing to a system's scalability. A common solution for applications requiring near-linear scalability amid large workloads is parallelization. However, development for parallel processing can be tricky compared to more traditional serial workloads. As you'll see, Google's Cloud Dataflow SDK removes much of this friction with clear and concise API semantics.

Pipelines, ParDos and DoFns

The bedrock of any Dataflow job is the pipeline, which is responsible for the coordination of an individual Dataflow job. The actual logic of the job stages is performed by ParDo (for Parallel Do) subclasses that implement their logic within the template of a DoFn (for Do Function) implementation callback. A ParDo's principal responsibility is the transformation of one (input) PCollection to another (output) PCollection.

At runtime, the pipeline and its component ParDo functions are analyzed by the Dataflow optimizer and factored into its execution plan, possibly merging steps along the way and sequencing them according to the semantics implied by the pipeline class and its component ParDo operations. There is no explicit relationship defining map, reduce or shuffle operations. Rather, the optimizer sequences operations based on the semantics of the input and output PCollections, similar to how a SQL query optimizer builds a query plan based on the selected columns, available indices, query predicates and table schema.
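To illustrate how these pieces fit together, here is a minimal, self-contained sketch of a pipeline applying a single ParDo built around a DoFn. The class names, the in-memory input and the whitespace-trimming logic are placeholders chosen for brevity, not the whitepaper's actual job.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MinimalParDoPipeline {

    // The DoFn carries the per-element logic; the enclosing ParDo applies it
    // in parallel across the elements of the input PCollection.
    static class TrimFn extends DoFn<String, String> {
        @Override
        public void processElement(ProcessContext c) {
            c.output(c.element().trim());
        }
    }

    public static void main(String[] args) {
        // Pipeline options (runner, project, staging location, etc.) are
        // populated from the command-line arguments.
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Bootstrap an input PCollection from a small in-memory collection.
        PCollection<String> raw = p.apply(Create.of(" tick one ", " tick two "));

        // The ParDo transforms the input PCollection into a new output PCollection,
        // which could feed further stages or a sink such as BigQueryIO.
        PCollection<String> trimmed = raw.apply(ParDo.of(new TrimFn()));

        p.run();
    }
}

Run without arguments, the SDK's local DirectPipelineRunner executes the graph in-process; supplying the --runner flag along with the project and staging options documented in the SDK submits the same pipeline to the managed Dataflow service.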
PCollections

Another central concept in the Dataflow model is the PCollection (for Parallel Collection), which represents the primary group of elements being processed by the Dataflow job at any stage in the pipeline. In the typical MapReduce programming model (e.g., Hadoop), key-value pairs are emitted from collections of elements. In contrast, a PCollection simply represents the collection of active elements within the pipeline and is available as input to ParDos and DoFns. In our experience, we have found that this both flattens the learning curve and simplifies the programming model overall compared to the currently available alternatives.

This design offers great flexibility, as the programmer is freed from having to implement a particular Map or Reduce interface imposed by the SDK. In particular, we find this enables greater focus on the business relevance of the associated inputs and outputs, as well as the specific nature of the elements that are processed. For example, the PCollection of one job stage may have a direct lineage from the PCollection of the previous stage, but it is not required to per se. This affords a more reactive/functional approach to programming: the output of one processing stage is simply the input of the subsequent stage, not dissimilar to an idealized callback-driven control flow.

It is also important to remember that each PCollection can be treated as a first-class citizen. A PCollection can be used multiple times in a single pipeline, effectively creating a new logical branch of pipeline execution each time.

Job inputs

Although Dataflow allows for flexibility with regard to data sources, there is one thing that all jobs share: an input operation that bootstraps the initial PCollection for the pipeline, whether from a file on Google Cloud Storage, the contents of a BigQuery table or query, a message arriving on a Cloud Pub/Sub topic, or even your own customized I/O source. During the lifecycle of a pipeline, many different (side) inputs may be employed as needed to transform, filter and/or process the currently active PCollection within the pipeline.

As of this writing, the I/O APIs supported by the Dataflow SDK include the following:

• TextIO (text file access via URI, including Google Cloud Storage)
• BigQueryIO (Google Cloud Platform BigQuery)
• AvroIO (Apache Avro)
• PubsubIO (Google Cloud Platform Pub/Sub)
• BigTableIO (Google Cloud Bigtable, currently in beta)

Native support for input sources

The stock Dataflow SDK ships with several classes that support input and output operations with resources external to Dataflow's execution context. For example, for a pipeline to initialize a PCollection from the contents of a file living within Google Cloud Storage, the job imports com.google.cloud.dataflow.sdk.io.TextIO and, via the PipelineOptions interface, configures it with the URI of your storage bucket, such as gs://<storage-bucket-on-gcs>/DATA1.txt.

By inheriting functionality from more foundational classes, such as TextIO, and overriding the relevant API implementations, the amount of code required to implement a custom CsvIO or XmlIO is reduced. In some cases, you may ultimately inherit from classes that these I/O classes themselves are derived from, or you may inherit directly from an existing I/O class.

The native TextIO Reader method currently returns a PCollection corresponding to each line of text in the specified file(s). But what if the