Market Reconstruction 2.0: Visualization at Scale
Neil Palmer, Justine Chen, Sam Sinha, Filipe Araujo, Michael Grinthal, Fawad Rafi - FIS
{neil.palmer; justine.chen; sam.sinha; filipe.araujo; michael.grinthal; fawad.rafi}@fisglobal.com

Abstract

This paper describes the front-end architecture for a large-scale securities transaction surveillance and forensics platform capable of supporting the ingestion, linkage and analysis of granular U.S. equity and equity options market events spanning multi-year periods. One component of this platform is a regulatory user interface (UI) that facilitates the navigation and visualization of the entire universe of U.S. market events, expected to grow to a size of 35 petabytes (PB) in seven years. Various aspects of the front end's construction, architectural design and UI and user experience (UX) approaches are detailed, including the key benefits and drawbacks of the chosen architecture and technology stack.

1 Introduction

FIS™ developed the Market Reconstruction Platform (MRP)1 using Google Cloud Dataflow and Google Cloud Bigtable as tools for processing and indexing highly granular market data events, ultimately publishing the results into a Google BigQuery dataset for visualization and analytics.2 This last step afforded the UI team an opportunity to design an interface with analytical capabilities that will allow the audience of the platform to visualize the correlation of nearly 35 PB of U.S. equity and equity options data. Visualization of this dataset is effectively the last mile: the culmination of vigorous number crunching by 3,500 nodes in Google's Cloud, all working together to link disparate events on the order of billions.3 Ultimately, this is where the responsibility lies for providing regulators access to an ocean of trade data in a manner that is simple, intuitive and responsive.
To design such an interface, it was crucial to understand the inherent challenges of presenting data at these scales, as well as the prevailing limitations to applying advanced visualization techniques within modern web browsers. In the following sections, the two key goals of the web application are discussed, along with our initial approaches to achieving them. We will dissect the associated challenges, the synthesized data used to simulate the requisite scale demanded of the platform, and the design decisions that were made, from both usability and architecture perspectives.

2 A window into 35 petabytes of time-series data

2.1 The challenge

The principal challenge of the interface can be summarized as providing regulators with a way to discover events of interest in a timely and intuitive manner from a dataset of market events expected to grow to 35 PB. The primary responsibilities of the web application, therefore, are to provide regulators with the ability to navigate the dataset, to visualize the results in a meaningful way, and to ensure that all security measures for access control are seamless.

To provide a sense of the scale of information that regulators must absorb, interpret and act upon via the platform: the primary U.S. options market data feed, SIAC OPRA, observed a peak rate of over 10,000,000 messages per second in April 2016,4 with an average message size of approximately 66 bytes.5 This adds up to approximately 5.2 Gbps of information. The productive navigation of arbitrarily event-dense windows of time, spanning multiple execution venues and instruments, and delivered exclusively through a modern browser interface, is the crux of the challenge we face.

2.2 A scatterplot of events

Palmer (2016) extensively details the ingestion and indexing process for the universe of synthetic market events visualized herein.
One particularly noteworthy characteristic of the MRP's overall system architecture is the absence of a traditional middleware API tier. As the dataset must physically reside remotely from the edge browser performing the visualizations, the volume and frequency of communication between the browser and the ultimate repository is a function of the volume of data narrowed by the user's search criteria. While a standard, REST-style API design for such a system need not be complex or intricate, the ability to eschew a traditional middleware tier brings a welcome reduction in the net operational burden of the solution. In lieu of a dedicated middleware tier, the front-end application communicates with BigQuery directly, sending SQL constructed by the front end over HTTPS to BigQuery's Google Cloud Platform API endpoint.

1 See "Solving the Biggest Big Data Challenge in Capital Markets" at http://www.fisglobal.com/Solutions/Institutional-and-Wholesale/Broker-Dealer/Market-Reconstructions-and-Visualization
2 N. Palmer, S. Sferrazza, S. Just, A. Najman (2016) "Market Visualization 2.0: A Financial Services Application of Google Cloud Bigtable and Google Cloud Dataflow"
3 Ibid.
4 Financial Information Forum (2016) "FIF April 2016 Market Data Capacity Statistics"
5 See Nanex (2012) "Nanex Compression on an OPRA Direct Feed" at http://www.nanex.net/opracompression.html

We found the lack of a dedicated middleware tier to be a key enabler for the rapid development of the front-end application. However, this placed a stronger emphasis on the format of the schema that the front end would be querying directly. In essence, the published dataset's schema was the primary back-end API against which the visualizations were developed. For both performance and usability reasons, the schema was denormalized in order to align the query patterns of the front end with the storage of the underlying events.
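A minimal sketch of this direct browser-to-BigQuery path is shown below, using BigQuery's public v2 jobs.query REST method. The project ID, token acquisition and SQL are hypothetical placeholders; the paper does not disclose the production endpoint configuration:

```javascript
// Pure helper: assemble the HTTPS request for a BigQuery jobs.query call.
// Separated from the network call so the request shape can be tested in isolation.
function buildQueryRequest(projectId, accessToken, sql) {
  return {
    url: `https://bigquery.googleapis.com/bigquery/v2/projects/${encodeURIComponent(projectId)}/queries`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${accessToken}`,
        "Content-Type": "application/json",
      },
      // Standard SQL, as constructed by the front end.
      body: JSON.stringify({ query: sql, useLegacySql: false }),
    },
  };
}

// Execute the query from the browser and return the parsed JSON response,
// which carries the result schema and rows.
async function queryEvents(projectId, accessToken, sql) {
  const { url, init } = buildQueryRequest(projectId, accessToken, sql);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`BigQuery responded ${response.status}`);
  return response.json();
}
```

In practice the OAuth access token would come from a Google sign-in flow in the browser, which is also where the platform's access-control story plugs in.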
Figure 1 describes the attributes of the market event data available to the front end for analysis and visualization.6

Figure 1: Market event data attributes.

+-----------------+--------------------------------+
| Last modified   | Schema                         |
+-----------------+--------------------------------+
| 24 Mar 11:52:33 | |- id: string                  |
|                 | |- Parent: string              |
|                 | |- GrandParent: string         |
|                 | |- timestamp: string           |
|                 | |- EventId: string             |
|                 | |- ReporterId: string          |
|                 | |- Child: string               |
|                 | |- CustomerId: string          |
|                 | |- CustomerIdValid: string     |
|                 | |- side: string                |
|                 | |- CHILD_side: string          |
|                 | |- symbol: string              |
|                 | |- CHILD_symbol: string        |
|                 | |- CHILD_ordType: string       |
|                 | |- eventType: string           |
|                 | |- brokenTag: string           |
|                 | |- ReporterIdValid: boolean    |
|                 | |- SideMatch: boolean          |
|                 | |- SymbolMatch: boolean        |
|                 | |- SymbolValid: boolean        |
|                 | |- OrderTypeMatch: boolean     |
|                 | |- splitCount: integer         |
|                 | |- timeInForce: integer        |
|                 | |- size: float                 |
|                 | |- avgPrice: float             |
+-----------------+--------------------------------+

From here, it was a logical step to identify a few key pieces of information for display, and then offer a way to drill down for further details about each event and its associated relationships. This being a historical time-series dataset, plotting events on a timeline quite naturally affords an intuitive perspective for their visualization. Another dimension we chose to plot is the order size (denoting the volume of shares) of the event. Finally, a third attribute, the event type, is represented using color. With order size plotted against the time dimension, these three attributes provide end users with fundamental, event-specific information at a single glance.

The next step adds interactivity to the flat X-Y event chart, in the form of a tooltip that appears as the cursor hovers over an event. The tooltip contains additional information about the targeted event, such as its side (BUY or SELL) and reporter ID (i.e.
the originating market participant). In addition, panning and zooming capabilities provide a further layer of interactivity, allowing users to pinpoint events of interest in a sea of trades that may have taken place for a security within only a matter of hours.

6 Palmer, et al., supra

Figure 2: MRP provides a multi-dimensional visualization of events on a timeline.

2.2.1 Querying for events

The scatter plot is generated by selecting query parameters from the Search page, with inputs such as:

● Stock Symbol
● Side (Buy or Sell)
● Reporter ID
● Event Type (New, Routed, Filled, etc.)
● Start and End Dates
● Start and End Times
● Min and Max Prices
● Min and Max Order Sizes

Figure 3: The user-specified parameters from the query page are subsequently transformed into a query by the Events API.

The query is then executed on a BigQuery dataset containing event data. The JSON response is subsequently converted into a JavaScript array and then fed into the Polymer scatter plot web component via data binding. A Polymer observer function on the array object kicks off the Data-Driven Documents (D3) scatter plot generation process as soon as the array changes, which may be triggered by the user searching for a different symbol.

2.2.2 Implementing with D3

The scatter plot is built exclusively with functions available in the D3.js library.7 Apart from some initial challenges in placing groups of elements on the SVG so that they maintain their interactive properties, we found D3 to be an excellent library for displaying the type and volume of event data with which the system will work. Some experimentation was warranted to ensure that the panning and zooming capabilities were not in conflict with each other and, more importantly, did not shield the dots from mouseover, mouseout and click events.
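The translation from the Search page's inputs into SQL can be sketched as a small query builder. The column names follow the published schema in Figure 1; the `marketdata.events` table path, the parameter names and the exact filtering performed by the real Events API are assumptions for illustration:

```javascript
// Build a BigQuery SQL statement from the Search page's parameters.
// `marketdata.events` is a hypothetical dataset/table name; the columns
// (symbol, side, ReporterId, eventType, timestamp, avgPrice, size)
// come from the schema in Figure 1.
function buildEventQuery(params) {
  const where = [];
  if (params.symbol)          where.push(`symbol = '${params.symbol}'`);
  if (params.side)            where.push(`side = '${params.side}'`);
  if (params.reporterId)      where.push(`ReporterId = '${params.reporterId}'`);
  if (params.eventType)       where.push(`eventType = '${params.eventType}'`);
  if (params.startTime)       where.push(`timestamp >= '${params.startTime}'`);
  if (params.endTime)         where.push(`timestamp <= '${params.endTime}'`);
  if (params.minPrice != null) where.push(`avgPrice >= ${params.minPrice}`);
  if (params.maxPrice != null) where.push(`avgPrice <= ${params.maxPrice}`);
  if (params.minSize != null)  where.push(`size >= ${params.minSize}`);
  if (params.maxSize != null)  where.push(`size <= ${params.maxSize}`);
  return (
    "SELECT id, timestamp, eventType, side, symbol, size, avgPrice, ReporterId " +
    "FROM `marketdata.events`" +
    (where.length ? " WHERE " + where.join(" AND ") : "") +
    " ORDER BY timestamp"
  );
}
```

A production front end would use BigQuery's parameterized queries (the `queryParameters` field of a jobs.query request) rather than string interpolation, both for safety and for query-cache friendliness; the interpolated form above is only to keep the mapping from search fields to predicates visible.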