STAC-A2™ Benchmark on POWER8
Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, 11501 Burnet Road, Austin, Texas 78758
{bcbrock,frankliu,karthick}@us.ibm.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WHPCF2015, November 15–20, 2015, Austin, TX, USA
© 2015 ACM. ISBN 978-1-4503-4015-1/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2830556.2830557

ABSTRACT

The STAC-A2™ benchmark is an emerging standard designed to evaluate the speed, scalability and quality of computational platforms for performing financial risk analytics in the capital markets industry. The problem posed by the benchmark is the computation of several types of Greeks for an exotic option under an American exercise model. We recently reported record-setting performance for a STAC-A2 benchmark solution developed for an IBM® POWER8® S824 server. We explain the high performance of our solution in terms of the architecture, scalability and high memory bandwidth provided by POWER8 based systems. Developing the benchmark application also led us to investigate and perfect several techniques that are generally applicable to the simulation of complex options and their sensitivities. We describe several of these techniques in detail, along with the performance impacts we observed when compared with other approaches. We focus on two areas in particular, namely cache-efficient data management for Monte Carlo simulation of American-exercise options, and a parallel implementation of the Longstaff-Schwartz algorithm.

Keywords

POWER8, STAC-A2, Heston model, Longstaff-Schwartz, Matrix transpose, Parallel SVD, OpenPOWER

1. INTRODUCTION

Pricing and sensitivity analysis of options are important problems in computational finance. The Securities Technology Analysis Center (STAC®), through the STAC Benchmark Council™, has chosen the computation of exotic option sensitivities as the basis of a benchmark to evaluate the performance, scalability and quality of platforms for computational finance. The STAC-A2™ benchmark has rapidly become an industry standard in this area, with leading financial firms looking to these results as they evaluate new hardware platforms. The STAC-A2 benchmark specification does not restrict the technologies used in the solution, and benchmark results have been published for traditional servers, GPU-based solutions and other heterogeneous computing platforms [1, 2, 3].

We recently reported the first-ever benchmark results for a STAC-A2 solution implemented on an IBM® POWER8® S824 server [4], and our implementation holds several of the public benchmark performance and scalability records across all platforms as of September, 2015. The 2-socket S824 system outperforms all 2-socket Intel® Xeon® configurations published to date, including servers with Intel Xeon Phi™ acceleration [1, 2], and even maintains several records over the latest 4-socket Intel Xeon system [5]. Furthermore, the performance of the S824 is within 10% of a recent NVIDIA® Tesla® K80 GPU result on a key benchmark [3], while the memory space advantage of a traditional platform allows the S824 to solve large problems much more efficiently than current GPU-based solutions.

A review of the historical benchmark results [12] shows that performance has been advancing at a much greater pace than the raw capabilities of the underlying hardware, highlighting the importance of both high-performance hardware and efficient software to computational finance. The technical contributions of this paper are the more general algorithms and system-level optimization we implemented in our STAC-A2 solution.

We begin the paper by providing a brief overview of the benchmark problem. We then discuss the core, cache and memory subsystems of the POWER8 architecture as they relate to high-performance computing (HPC) workloads. We also describe the software operating environment and structure of our parallel application. The benchmark option valuation method requires all Monte Carlo path data to be stored for later analysis, effectively requiring the transposition of numerous large matrices. We discuss several solutions including a non-obvious transposition algorithm, along with heuristics that improve the performance of the transpose. Our benchmark application also includes a fast parallel implementation of the Longstaff-Schwartz algorithm, providing this inherently serial problem an almost “embarrassingly parallel” solution under certain conditions, even on a traditional platform. We conclude the paper by relating our performance results to the inherent scalability and high memory bandwidth of POWER8 based systems.

2. BENCHMARK OVERVIEW

Workload        Type   Assets   Timesteps   Paths          Scenarios   Arrays*   Memory*
Baseline        Warm   5        252         25,000         93          52        2.5 GB
Large Problem   Warm   10       1260        100,000        308         152       143 GB
Asset Scaling   Cold   78*      252         25,000         15,642*     6,476     309 GB
Path Scaling    Cold   5        252         28,000,000*    93          52        2.8 TB

Table 1: Key STAC-A2 benchmark workloads. Items marked * are specific to our solution.

The STAC-A2 benchmark, as well as other implementations, have been introduced in many references [17, 20, 12, 11]. Therefore we only provide a brief operational overview
focusing on the performance benchmarks. STAC-A2 benchmark public reports (e.g., [4]) also include detailed descriptions of the benchmarks along with the results.

The key performance benchmarks involve the computation of several Greeks for an American-exercise call option written against a basket of one or more correlated assets. Option sensitivities are computed for changes in the risk-free rate (Rho), time to expiration (Theta), four model parameters (Model Vega), single and paired initial asset prices (Delta, Gamma, Cross Gamma) and paired correlations (Correlation Vega). The option is evaluated by a Monte Carlo simulation of Andersen’s QE formulation of the Heston stochastic volatility model [13, 7, 16], followed by American-exercise pricing using the method of Longstaff and Schwartz [18]. The time to option expiration is discretized into a fixed number of timesteps, and varying numbers of Monte Carlo paths are simulated based on the particular benchmark.

As specified, the Greeks are computed using finite differences of scenario values, where each scenario corresponds to either an unmodified option simulation, or a simulation where a model parameter, or one or more of the initial asset values or correlations have been slightly modified. For example, Theta is computed as (y_Δt − y)/Δt, where y_Δt is the scenario value obtained by changing the time to expiration by Δt, and y is the base scenario value of the unmodified option. Each scenario is conceptually a paths × timesteps array of simulated asset-basket values. The option being evaluated is a lookback, best-of option where the basket value is defined as the maximum realized value of any asset in the basket from the initiation of the deal to the current time. Each scenario array is eventually reduced to a single number, the value of the option under that scenario.

The key benchmarks measure performance while computing the full set of Greeks using various numbers of correlated assets, timesteps and paths, as shown in Table 1. The results included are for runs that either do (cold) or do not

core has a 512KB private L2 cache and an 8MB L3 cache which may be shared by other cores if needed. The 96MB total L3 cache per processor module is more than twice the amount provided by competing Intel Xeon processors, e.g., the Intel Xeon E7-8890 v3 and E5-2699 v3 only provide 45MB. Later in the paper we point out places where the large L3 cache is important to the STAC-A2 workload.

Each POWER8 core provides four fully functional double-precision floating-point pipelines supporting atomic multiply-add, square root and divide. Our STAC-A2 application makes extensive use of the Power ISA™ Vector-Scalar Extension (VSX) instructions which operate on 128-bit Vector-Scalar Registers. VSX instructions include scalar functions, 2-way double-precision SIMD and 4-way single-precision SIMD operations. For 2-way SIMD, pairs of double-precision pipelines operate independently, allowing two hardware threads to execute SIMD operations simultaneously.

The POWER8 core [23] supports up to 8-way simultaneous multithreading, or SMT. The core dynamically changes SMT modes based on the number of non-idle threads, entering single-threaded (ST, or SMT1), SMT2, SMT4 or SMT8 modes whenever 1, 2, 3–4 or 5–8 threads are active respectively. Each SMT mode defines a different partitioning of the core resources between the threads, e.g., dispatch logic, execution pipelines and register files. The optimal SMT mode is highly workload dependent and best determined by experiment. Our Greeks application is most efficient in SMT4 mode, where the system-level performance is 2.1 times the single-threaded performance for the Baseline workload.

It is important to distinguish between the number of hardware threads configured per core, which sets an upper bound on SMT mode, and the number of hardware threads active per core, which determines the actual SMT mode. In this work we always configure eight hardware threads per core, then bind application threads to unique hardware threads using the pthread_attr_setaffinity_np() API.