Optimizing IBM Algorithmics' Mark-to-Future Aggregation Engine for Real-time Counterparty Credit Risk Scoring
Amy Wang (IBM Toronto Software Lab), Jan Treibig (RRZE, University Erlangen), Bob Blainey (IBM Toronto Software Lab), Peng Wu (IBM T.J. Watson Research Center), Yaoqing Gao (IBM Toronto Software Lab), Barnaby Dalton (IBM Toronto Software Lab), Danny Gupta (IBM Toronto Software Lab), Fahham Khan (IBM Toronto Software Lab), Neil Bartlett (IBM Algorithmics), Lior Velichover (IBM Algorithmics), James Sedgwick (IBM Algorithmics), Louis Ly (IBM Algorithmics)

ABSTRACT
The concept of default and its painful repercussions has been a particular area of focus for financial institutions, especially after the 2007/2008 global financial crisis. Counterparty credit risk (CCR), i.e., the risk that a counterparty defaults prior to the expiration of a contract, has gained a tremendous amount of attention, which has resulted in new CCR measures and regulations being introduced. In particular, users would like to measure the potential impact of each real-time trade, or potential real-time trade, against exposure limits for the counterparty using Monte Carlo simulations of the trade value, and also to calculate the Credit Value Adjustment (i.e., how much it will cost to cover the risk of default with this particular counterparty if/when the trade is made). These rapid limit checks and CVA calculations demand more compute power from the hardware. Furthermore, with the emergence of electronic trading, the extremely low latency and high throughput required of real-time computation push both software and hardware capabilities to the limit. Our work focuses on optimizing the computation of risk measures and trade processing in the existing Mark-to-Future Aggregation (MAG) engine in the IBM Algorithmics product offering. We propose a new software approach, based on pre-compilation, to speed up end-to-end trade processing. The net result is an impressive speedup of 3-5x over the existing MAG engine on a real client workload, for processing trades that perform limit checks and CVA reporting on exposures while taking full collateral modelling into account.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel programming

General Terms
Algorithms, Economics, Performance

Keywords
Risk Analytics, Multicore, Collateral

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WHPCF'13, November 18 2013, Denver, CO, USA
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2507-3/13/11 ...$15.00. http://dx.doi.org/10.1145/2535557.2535567

1. INTRODUCTION
Counterparty Credit Risk (CCR) is a metric used by financial institutions to evaluate the likelihood that the counterparty of a financial contract (referred to as the counterparty for short) defaults prior to the expiration of the contract. It is critical for a financial institution to predict the CCR of a counterparty when making a trading decision and when pricing a trade. Traditionally, trades are made by human beings, and the response time for CCR typically falls into the range of hundreds of milliseconds. The emergence of electronic and e-commerce trading, however, demands much faster response times and higher throughput than the current generation of CCR software, which is designed mainly for human traders. Furthermore, it is also highly desirable to improve the precision of the risk computation: a CCR estimate is more precise if its computation takes more market scenarios and/or more timesteps into consideration. All of these requirements demand highly efficient software implementations and effective utilization of hardware resources.

The Mark-to-Future Aggregation (MAG) engine is a key component of the risk computation software from Algorithmics that performs the statistical measurements of the CCR computation. The current generation of MAG was designed for human traders and sustains a throughput of 3-5 trades per second with a latency of up to 300 ms per trade. In this paper, we describe our approach to improving the end-to-end throughput and latency of the MAG engine. The targeted risk precision is defined in terms of 5,000 market scenarios by 250 timesteps.

There has been much recent work on performance optimization for financial codes: many efforts exploit accelerator technologies such as GPGPUs, the Cell BE, and FPGAs [2, 1, 8], while others focus on algorithm-level improvements such as [6]. One notable work in this area is from Maxeler Technologies, on using FPGAs to accelerate credit derivatives computation for JPMC [11]. That work focuses primarily on the pricing aspect of a trade, as pricing algorithms are highly parallel. A similar effort employing FPGAs to speed up the pricing engine was undertaken by Algorithmics in the past [4].

In this work, we focus on another critical component of financial software, the aggregation engine. In contrast to prior published work in risk analysis, our work targets real production code. We found that optimizing a complex piece of production software requires one to take a holistic approach and to tackle performance bottlenecks at all levels, such as algorithm and data structure design, redundant computation elimination, memory subsystem optimization, and the exploitation of parallelism. In this paper, we demonstrate the steps taken to identify performance bottlenecks in the MAG engine and the techniques used to address some of the overhead. We are able to demonstrate a speedup of 3-5x over the existing MAG engine for the limit-checking and CVA-reporting-on-exposures scenario, using a real client workload on an off-the-shelf multicore server. We believe this work is a good starting point toward closing the gap between the performance of the existing MAG engine and the ultimate latency and throughput requirements of an online trading system.

The paper is organized as follows. Section 2 gives an overview of the current MAG engine implementation. Section 3 describes our approach to optimizing performance, and Section 4 explains our optimizations for three important kernels. Section 5 presents our approach to platform-specific optimization of the MAG engine using adaptive optimization. Performance results are discussed in Section 6, and we conclude and outline future work in Section 7.

2. THE MAG ENGINE
The Mark-to-Future Aggregation (MAG) engine is where statistical measurements for CCR, such as Credit Value Adjustment (CVA) and collateral modeling, are computed. This section gives an overview of its data structures and computation.

2.1 Computation Graph
Each counterparty is monitored through a computation graph, which describes the dependences between computations. A node consists of a computation kernel and its internal data, called states. States are typically vectors or dense matrices called sheets. A sheet is a two-dimensional data structure organized by scenarios and time points. In the current implementation, sheets are laid out in memory sequentially along the scenario dimension. There are two types of nodes in a computation graph: consolidation nodes and transformation nodes. Both types of nodes produce a new result, but only consolidation nodes may modify their own states. When applying the computation of a consolidation node, states are often first read and then modified, for example by element-wise summation of an incoming sheet into the sheet associated with the consolidation node. A transformation node, on the other hand, does not modify any states.

To give a sense of the scale of the data structures, a typical production run of the MAG engine today may monitor 10,000+ counterparties (i.e., 10,000+ computation graphs). On average, each computation graph contains 10 nodes, and the states associated with a computation graph node can be several megabytes.

2.2 Trade Risk Scoring Computation
A trade consists of two pieces of information: a trade value sheet and trade parameters. The trade value sheet usually comes from the pricing engine and contains simulated floating-point values over a set of market scenarios and timesteps. Trade parameters include which counterparty is trading and other information, such as the maturity date of the trade. The counterparty information in the trade parameters determines which computation graph is used for trade evaluation. Evaluating a trade on a computation graph typically refers to the process of absorbing the trade value sheet into some consolidation nodes of the graph and/or computing statistical measures on computation graph nodes.

A trade can be either read-only or commit. Read-only trades (e.g., what-if or lookup trades) do not modify any state of the computation graph, whereas commit trades do. When evaluating a trade, the computation kernels associated with the computation graph are executed in post-fix order, similar to evaluating an expression tree. A computation kernel on a consolidation node takes as input its own state as well as the outputs, or states, of its children. A particular leaf node, selected by the trade parameters, takes as input its own state and the trade sheet.
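As a concrete illustration of the read-modify-write behaviour at a consolidation node, the element-wise absorb step might look as follows. This is an illustrative Python/NumPy sketch, not the product's API; the `absorb` name, the `commit` flag, and the sheet shape are our own assumptions:

```python
import numpy as np

# A sheet: simulated values indexed by (scenario, timestep), stored
# contiguously along the scenario dimension.
N_SCENARIOS, N_TIMESTEPS = 5000, 250  # the paper's targeted precision

def absorb(state_sheet, incoming_sheet, commit=True):
    """Element-wise summation of an incoming sheet into a consolidation
    node's state sheet.  For a read-only (what-if) trade the node's own
    state must be left untouched, so the sum goes into a fresh sheet."""
    if commit:
        state_sheet += incoming_sheet      # modifies the node's state in place
        return state_sheet
    return state_sheet + incoming_sheet    # state left unchanged

state = np.zeros((N_SCENARIOS, N_TIMESTEPS))
trade_values = np.ones((N_SCENARIOS, N_TIMESTEPS))

result = absorb(state, trade_values, commit=False)  # what-if: state stays zero
absorb(state, trade_values, commit=True)            # commit: state is updated
```

Note the footprint implied by the target precision: 5,000 scenarios by 250 timesteps of double-precision values is 5,000 x 250 x 8 bytes, roughly 10 MB per sheet, consistent with the several-megabyte node states quoted above.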
This propagation of the trade value from the leaf node upward is termed the contribution absorption process.
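Putting the pieces together, the post-fix evaluation and contribution absorption described above can be sketched on a toy two-level graph. This is an illustrative Python sketch under our own naming (`ConsolidationNode`, `TransformationNode`, `evaluate`); real MAG kernels and graph shapes are far richer:

```python
import numpy as np

SHAPE = (8, 4)  # (scenarios, timesteps); tiny, for illustration only

class ConsolidationNode:
    """Stateful node: absorbs its children's outputs into its own sheet."""
    def __init__(self, children=()):
        self.children = list(children)
        self.state = np.zeros(SHAPE)

    def evaluate(self, trade_sheet, leaf, commit):
        # Post-fix order: evaluate children first, then this node.
        total = sum(c.evaluate(trade_sheet, leaf, commit) for c in self.children)
        if self is leaf:            # leaf selected by the trade parameters
            total = total + trade_sheet
        result = self.state + total
        if commit:                  # commit trades update node state;
            self.state = result     # read-only (what-if) trades do not
        return result

class TransformationNode:
    """Stateless node: produces a new result without modifying any state."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn

    def evaluate(self, trade_sheet, leaf, commit):
        return self.fn(self.child.evaluate(trade_sheet, leaf, commit))

# Toy counterparty graph: a leaf feeding a consolidation node, topped by
# a transformation that floors values at zero (an exposure-like measure).
leaf = ConsolidationNode()
netting = ConsolidationNode(children=[leaf])
root = TransformationNode(netting, fn=lambda sheet: np.maximum(sheet, 0.0))

trade = np.full(SHAPE, -2.0)
what_if = root.evaluate(trade, leaf, commit=False)   # read-only: states untouched
committed = root.evaluate(trade, leaf, commit=True)  # commit: leaf and parent absorb
```

With roughly 10 nodes per graph and 10,000+ graphs resident at once, it is this per-node absorb work and the sheet memory layout that dominate trade-evaluation cost, which is why later sections focus on those kernels.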